The Rise of Women Through the Data Lens of The New York Times

Produced by Big Data Digest
Source: Medium
Translated by: Hippo, lin, Xia Yawei
How women are portrayed through the media's lens reflects the rise of feminism in a society.
The project introduced today uses sentiment analysis, frequent-term visualization, and topic modeling to investigate how women have been represented in New York Times coverage over the past 70 years.
Let's take a look.
To carry out this investigation, the author pulled the relevant data from The New York Times through the Archive API on the NYT Developer Portal.
First, obtain an API key (https://developer.nytimes.com/). Don't worry, it's free. Once you have the key, it is as if the floodgates of the New York Times data dam have opened: the data will pour in. Because this type of API lends itself to bulk collection, the data is not filtered in advance. If you would like to reproduce the experiment, follow the instructions in the Jupyter notebooks published on GitHub. If you prefer a video version of this article, it is available here: https://www.youtube.com/watch?v=rK-9t1IS0A4&feature=youtu.be.
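Before collecting anything in bulk, it can help to run a quick sanity check against the Archive API. This is a minimal sketch (not from the original article); YOUR_API_KEY is a placeholder for your own key:

import requests

# Fetch one month (January 2019) from the NYT Archive API as a smoke test.
url = 'https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=' + 'YOUR_API_KEY'
response = requests.get(url).json()
print(len(response['response']['docs']))  # number of articles returned for that month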

All of the instructions, code notebooks, and results are available in the author's project repository on GitHub (https://github.com/sasha-talks-tech/New-York-Times), which should make it easier to replicate the work.
Data Collection via the API, and Topic Modeling with SpaCy and Gensim
Before going any further, the author first runs topic modeling over most New York Times articles published from January 2019 through September 2020, analyzing headlines, keywords, and lead paragraphs. The goal is to separate the most prevalent issues from the enduring themes, so that the study stays true to the Times's mission and does not misrepresent its journalistic style.
The data collection approach was inspired by Brienna Herold's very helpful tutorial: https://towardsdatascience.com/collecting-data-from-the-new-york-times-over-any-period-of-time-3e365504004.
Let's import the necessary tools and libraries:
import os
import pandas as pd
import requests
import json
import time
import dateutil
import datetime
from dateutil.relativedelta import relativedelta
import glob

Determine the timeframe of the analysis:

end = datetime.date.today()
start = datetime.date(2019, 1, 1)
The helper functions below (see the tutorial: https://towardsdatascience.com/collecting-data-from-the-new-york-times-over-any-period-of-time-3e365504004) pull the relevant New York Times data through the API and save it to dedicated csv files:
def send_request(date):
    '''Sends a request to the NYT Archive API for given date.'''
    base_url = 'https://api.nytimes.com/svc/archive/v1/'
    url = base_url + '/' + date[0] + '/' + date[1] + '.json?api-key=' + 'F9FPP1mJjiX8pAEFAxBYBg08vZECa39n'
    try:
        response = requests.get(url, verify=False).json()
    except Exception:
        return None
    time.sleep(6)
    return response

def is_valid(article, date):
    '''An article is only worth checking if it is in range, and has a headline.'''
    is_in_range = date > start and date < end
    has_headline = type(article['headline']) == dict and 'main' in article['headline'].keys()
    return is_in_range and has_headline

def parse_response(response):
    '''Parses and returns response as pandas data frame.'''
    data = {'headline': [],
            'date': [],
            'doc_type': [],
            'material_type': [],
            'section': [],
            'keywords': [],
            'lead_paragraph': []}
    articles = response['response']['docs']
    for article in articles:
        # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        if is_valid(article, date):
            data['date'].append(date)
            data['headline'].append(article['headline']['main'])
            if 'section' in article:
                data['section'].append(article['section_name'])
            else:
                data['section'].append(None)
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article:
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
            if 'lead_paragraph' in article:
                data['lead_paragraph'].append(article['lead_paragraph'])
            else:
                data['lead_paragraph'].append(None)
    return pd.DataFrame(data)

def get_data(dates):
    '''Sends and parses request/response to/from NYT Archive API for given dates.'''
    total = 0
    print('Date range: ' + str(dates[0]) + ' to ' + str(dates[-1]))
    if not os.path.exists('headlines'):
        os.mkdir('headlines')
    for date in dates:
        print('Working on ' + str(date) + '...')
        csv_path = 'headlines/' + date[0] + '-' + date[1] + '.csv'
        if not os.path.exists(csv_path):  # If we don't already have this month
            response = send_request(date)
            if response is not None:
                df = parse_response(response)
                total += len(df)
                df.to_csv(csv_path, index=False)
                print('Saving ' + csv_path + '...')
    print('Number of articles collected: ' + str(total))
Let's take a closer look at the helper functions:
send_request(date) sends a request to the archive for the given date, converts the result to json, and returns the response. is_valid(article, date) checks whether an article falls within the required time range and has a headline, and returns the is_in_range and has_headline verdicts. parse_response(response) converts the response into a pandas DataFrame: data is a dictionary holding the DataFrame's columns, which start out empty and are appended to by this function; the function returns the final DataFrame.
get_data(dates) takes the dates corresponding to the user-specified range and uses send_request() and parse_response() to save headlines and other information to .csv files, one file per month of each year. A sketch of a loop that drives these helpers is shown below.
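The article does not show how get_data() is actually invoked. A minimal sketch of one possible driving loop, assuming the start and end dates defined earlier, looks like this:

# Build a list of [year, month] string pairs covering the analysis window,
# then hand it to get_data(), which writes one csv per month.
date_list = []
current = start
while current < end:
    date_list.append([str(current.year), str(current.month)])
    current += relativedelta(months=1)

get_data(date_list)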
# get data file names
path = "headlines/"
filenames = glob.glob(path + "*.csv")  # note: the path prefix is needed to find the saved monthly files
dfs = []
print(filenames)
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

Importing tools and libraries:

from collections import defaultdict
import re, string  # regular expressions
from gensim import corpora  # this is the topic modeling library
from gensim.models import LdaModel
defaultdict is handy for counting the occurrences of unique words (https://www.geeksforgeeks.org/defaultdict-in-python/). re and string are useful when we look for exact or fuzzy matches in text; if you are interested in text analysis, regular expressions (https://www.w3schools.com/python/python_regex.asp) will become a constant companion, and https://regex101.com/ is a handy tool for practicing them. gensim (https://radimrehurek.com/gensim/) is the library used for topic modeling; once you sort out the necessary dependencies (https://www.tutorialspoint.com/gensim/gensim_getting_started.htm), you will find it very easy to work with. A tiny defaultdict example is shown below.
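As a quick illustration of why defaultdict is convenient for counting (a toy example, not from the article):

from collections import defaultdict

counts = defaultdict(int)  # missing keys default to 0, so no KeyError on first access
for word in "the cat sat on the mat".split():
    counts[word] += 1
print(counts['the'])  # -> 2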
big_frame_corpus_headline = big_frame['headline']
big_frame_corpus_keywords = big_frame['keywords']
big_frame_corpus_lead = big_frame['lead_paragraph']

Text pre-processing steps. Image by the author. Icons made by Freepik.
from nltk.corpus import stopwords

headlines = [re.sub(r'[^\w\s]', '', str(item)) for item in big_frame_corpus_headline]
keywords = [re.sub(r'[^\w\s]', '', str(item)) for item in big_frame_corpus_keywords]
lead = [re.sub(r'[^\w\s]', '', str(item)) for item in big_frame_corpus_lead]

stopwords = set(stopwords.words('english'))
# please note: you can append to this list of pre-defined stopwords if needed

More pre-processing:

headline_texts = [[word for word in document.lower().split() if word not in stopwords] for document in headlines]
keywords_texts = [[word for word in document.lower().split() if word not in stopwords] for document in keywords]
lead_texts = [[word for word in document.lower().split() if word not in stopwords] for document in lead]

Removing less frequent words:

frequency = defaultdict(int)
for headline_text in headline_texts:
    for token in headline_text:
        frequency[token] += 1
for keywords_text in keywords_texts:
    for token in keywords_text:
        frequency[token] += 1
for lead_text in lead_texts:
    for token in lead_text:
        frequency[token] += 1

headline_texts = [[token for token in headline_text if frequency[token] > 1] for headline_text in headline_texts]
keywords_texts = [[token for token in keywords_text if frequency[token] > 1] for keywords_text in keywords_texts]
lead_texts = [[token for token in lead_text if frequency[token] > 1] for lead_text in lead_texts]

dictionary_headline = corpora.Dictionary(headline_texts)
dictionary_keywords = corpora.Dictionary(keywords_texts)
dictionary_lead = corpora.Dictionary(lead_texts)

# Use the matching dictionary for each corpus
headline_corpus = [dictionary_headline.doc2bow(headline_text) for headline_text in headline_texts]
keywords_corpus = [dictionary_keywords.doc2bow(keywords_text) for keywords_text in keywords_texts]
lead_corpus = [dictionary_lead.doc2bow(lead_text) for lead_text in lead_texts]

Let's decide on the optimal number of topics for our case:

NUM_TOPICS = 5
ldamodel_headlines = LdaModel(headline_corpus, num_topics=NUM_TOPICS, id2word=dictionary_headline, passes=12)
ldamodel_keywords = LdaModel(keywords_corpus, num_topics=NUM_TOPICS, id2word=dictionary_keywords, passes=12)
ldamodel_lead = LdaModel(lead_corpus, num_topics=NUM_TOPICS, id2word=dictionary_lead, passes=12)

Here's the result:

topics_headlines = ldamodel_headlines.show_topics()
for topic_headlines in topics_headlines:
    print(topic_headlines)

topics_keywords = ldamodel_keywords.show_topics()
for topic_keywords in topics_keywords:
    print(topic_keywords)

topics_lead = ldamodel_lead.show_topics()
for topic_lead in topics_lead:
    print(topic_lead)

Let's organize those into dataframes:

word_dict_headlines = {}
for i in range(NUM_TOPICS):
    words_headlines = ldamodel_headlines.show_topic(i, topn=20)
    word_dict_headlines['Topic # ' + '{:02d}'.format(i + 1)] = [w[0] for w in words_headlines]
pd.DataFrame(word_dict_headlines)

word_dict_keywords = {}
for i in range(NUM_TOPICS):
    words_keywords = ldamodel_keywords.show_topic(i, topn=20)
    word_dict_keywords['Topic # ' + '{:02d}'.format(i + 1)] = [w[0] for w in words_keywords]
pd.DataFrame(word_dict_keywords)

word_dict_lead = {}
for i in range(NUM_TOPICS):
    words_lead = ldamodel_lead.show_topic(i, topn=20)
    word_dict_lead['Topic # ' + '{:02d}'.format(i + 1)] = [w[0] for w in words_lead]
pd.DataFrame(word_dict_lead)

Topic modeling results. Image by the author. Icons made by Freepik.
1950 – Present: Data Collection and Keyword Analysis
import pickle

# 'frame' is presumably the concatenated DataFrame of all articles collected for 1950–present.
with open('frame_all.pickle', 'wb') as to_write:
    pickle.dump(frame, to_write)
with open('frame_all.pickle', 'rb') as read_file:
    df = pickle.load(read_file)

df['date'] = pd.to_datetime(df['date'])
df = df[df['headline'].notna()].drop_duplicates().sort_values(by='date')
df.dropna(axis=0, subset=['keywords'], inplace=True)
import ast

df.keywords = df.keywords.astype(str).str.lower().transform(ast.literal_eval)
keyword_counts = pd.Series(x for l in df['keywords'] for x in l).value_counts(ascending=False)
len(keyword_counts)

58,298 unique keywords.
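To get a feel for what those keywords look like, one can peek at the most frequent ones (an illustrative step, not reproduced from the article):

# Top 10 most frequent subject keywords across the collected articles.
print(keyword_counts.head(10))

# Keywords that mention women at all, as a starting point for the project filter below.
women_keywords = [k for k in keyword_counts.keys() if 'women' in k]
print(len(women_keywords))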
project_keywords1 = [x for x in keyword_counts.keys() if 'women in politics' in x
                     or 'businesswoman' in x
                     or 'female executive' in x
                     or 'female leader' in x
                     or 'female leadership' in x
                     or 'successful woman' in x
                     or 'female entrepreneur' in x
                     or 'woman entrepreneur' in x
                     or 'women in tech' in x
                     or 'female technology' in x
                     or 'female startup' in x
                     or 'female founder' in x]
df['headline'] = df['headline'].astype(str).str.lower()

Examine the headlines that contain words like woman, politics and power:

wip_headlines = df[df['headline'].str.contains(('women' or 'woman' or 'female')) & df['headline'].str.contains(('politics' or 'power' or 'election'))]

'wip' stands for 'women in politics'.

Our search returned only 185 headlines. Let's look at the keywords to supplement that.

df['keywords'].dropna()
df['keywords_joined'] = df.keywords.apply(', '.join)
df['keywords_joined'] = df['keywords_joined'].astype(str)

import re
wip_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*politics)', regex=True)]
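One caveat worth flagging (not noted in the original article): in Python, ('women' or 'woman' or 'female') evaluates to just 'women', so the headline filter above effectively matches only that single word. If the intent is to match any of the three terms, a regex alternation would do it; a sketch:

# Match headlines containing any of the three woman-related terms AND any of the
# three politics-related terms, using regex alternation instead of Python's `or`.
wip_headlines_alt = df[
    df['headline'].str.contains('women|woman|female', regex=True)
    & df['headline'].str.contains('politics|power|election', regex=True)
]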

wip_df = pd.concat([wip_headlines, wip_keywords], axis=0, sort = True)
feminist_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*feminist)',regex=True)]

# metoo movement:
metoo_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*metoo)(?=.*movement)', regex=True)]
import matplotlib.pyplot as plt

ax = df.groupby(df.date.dt.year)['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Total Published Every Year', fontdict={'fontsize': 24, 'fontweight': 'medium'})
plt.show()

# project_df is the subset of articles matching the project keywords, built earlier in the author's notebook
ax = project_df.groupby('year')['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Articles About Strong Women (based on relevant keywords) Published Every Year',
             fontdict={'fontsize': 20, 'fontweight': 'medium'})
plt.show()




These clippings were obtained through TimesMachine, The New York Times's publication archive. The image was created by the author from those clippings.
N-grams, Word Cloud, and Sentiment Analysis
from sklearn.feature_extraction.text import CountVectorizer

# 'corpus' here refers to the collection of project-related headlines prepared earlier
word_vectorizer = CountVectorizer(ngram_range=(1, 3), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(corpus)
frequencies = sum(sparse_matrix).toarray()[0]
ngram_df_project = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])

from wordcloud import WordCloud, STOPWORDS

all_headlines = ' '.join(project_df['headline'].str.lower())
stopwords = STOPWORDS
stopwords.add('will')
# Note: you can append your own stopwords to the existing ones.
wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=1000,
                      width=480, height=480).generate(all_headlines)
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud)
plt.axis("off");
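The n-gram frequency table built above is never displayed in the snippet; to see which n-grams dominate the project corpus, one can sort it (an assumed follow-up step, not shown in the article):

# The 15 most frequent unigrams/bigrams/trigrams in the project corpus.
print(ngram_df_project.sort_values(by='frequency', ascending=False).head(15))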

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
results = []
for line in project_df.headline:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)
print(results[:3])

Output:

[{'neg': 0.0, 'neu': 0.845, 'pos': 0.155, 'compound': 0.296, 'headline': 'women doctors join navy; seventeen end their training and are ordered to duty'}, {'neg': 0.18, 'neu': 0.691, 'pos': 0.129, 'compound': -0.2732, 'headline': 'n.y.u. to graduate 21 women doctors; war gave them, as others, an opportunity to enter a medical school'}, {'neg': 0.159, 'neu': 0.725, 'pos': 0.116, 'compound': -0.1531, 'headline': 'greets women doctors; dean says new york medical college has no curbs'}]

Sentiment as a dataframe:

sentiment_df = pd.DataFrame.from_records(results)

dates = project_df['year']
sentiment_df = pd.merge(sentiment_df, dates, left_index=True, right_index=True)

The code above allows us to have a timeline for our sentiment. To simplify the sentiment analysis, we are going to create some new categories for positive, negative and neutral.

sentiment_df['label'] = 0
sentiment_df.loc[sentiment_df['compound'] > 0.2, 'label'] = 1
sentiment_df.loc[sentiment_df['compound'] < -0.2, 'label'] = -1
sentiment_df.head()

To visualize overall sentiment distribution:

sentiment_df.label.value_counts(normalize=True) * 100

Image by the author, using a Slidesgo template.
import seaborn as sns

sns.lineplot(x="year", y="label", data=sentiment_df)
plt.show()


Conclusion




