Medium Stories - Data Mart

Resource Description


Building a data-set of 1.4 million stories

Over the past few weeks I have been building a massive data-set of Medium stories. Initially, my goal was to better understand Medium’s clap-metric, but I quickly realized that this data could accomplish MUCH more than just that.

Imagine if writers could choose a title with statistical models. Or if readers could automatically subscribe to the best authors in their field. Even answering basic questions like “How many claps should I give?” would be extremely valuable to both the writers and readers of Medium. These are all things that can be created with the right data and enough man-power. We have the data, now we just need your help.

----------

Intro to the Data

The data-set consists of 1.4 million stories from 95 of Medium’s most popular story-tags. Every story was published between August 1st, 2017 and August 1st, 2018.

I chose to collect the contents of story cards rather than the contents of entire stories for a few reasons. First, I didn’t want to run into any issues with Medium’s ownership rules. Second, it is around 90x faster to scrape story-cards than it is to scrape entire articles (which means more data in less time and with less memory).

Here is the full list of the information I was able to collect for each story: Title, Sub-Title, Author, Publication, Date, Tags, Read-Time, Claps-Received, Story-URL, and Author-URL.

----------

If you want a more in-depth introduction to the data-set, look at my GitHub. In the following repository I published a data analysis notebook to answer some of the most interesting questions about Medium’s readers, authors, and publications. Here’s the link.

https://github.com/harrisonjansma/Analyzing_Medium/blob/master/Medium_EDA_expanded.ipynb
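To make the per-story fields listed above a little more concrete, here is a minimal Python sketch of one story card as a record, plus a small helper that computes the median clap count per tag. The field names, the types, and the `median_claps_by_tag` helper are assumptions chosen for illustration. They are not taken from the data-set files or from the linked notebook, and the three sample rows are invented placeholders, not real stories.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class StoryCard:
    """One story card. Field names and types are hypothetical."""
    title: str
    subtitle: str
    author: str
    publication: str
    published: date
    tags: Tuple[str, ...]
    read_time_min: int
    claps: int
    story_url: str
    author_url: str


def median_claps_by_tag(cards: Iterable[StoryCard]) -> Dict[str, float]:
    """Group clap counts by tag and return the median for each tag."""
    claps = defaultdict(list)
    for card in cards:
        for tag in card.tags:
            claps[tag].append(card.claps)
    return {tag: median(values) for tag, values in claps.items()}


# Invented placeholder rows, for illustration only (not real Medium data).
sample = [
    StoryCard("A", "", "alice", "", date(2017, 9, 1), ("ai",), 5, 100,
              "https://example.com/a", "https://example.com/@alice"),
    StoryCard("B", "", "bob", "", date(2018, 1, 15), ("ai", "writing"), 3, 50,
              "https://example.com/b", "https://example.com/@bob"),
    StoryCard("C", "", "carol", "", date(2018, 7, 30), ("writing",), 8, 10,
              "https://example.com/c", "https://example.com/@carol"),
]

if __name__ == "__main__":
    for tag, value in sorted(median_claps_by_tag(sample).items()):
        print(tag, value)
```

A median is used here rather than a mean on the assumption that clap counts are heavily skewed by a few very popular stories; a question like “How many claps should I give?” could start from a per-tag summary of this kind.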
