Medium Stories (Free)

jsaifc 19 2021-09-02 Text corpus

Resource Description


Building a data-set of 1.4 million stories

Over the past few weeks I have been building a massive data-set of Medium stories. Initially, my goal was to better understand Medium’s clap-metric, but I quickly realized that this data could accomplish MUCH more than just that. Imagine if writers could choose a title with statistical models. Or if readers could automatically subscribe to the best authors in their field. Even answering basic questions like “How many claps should I give?” would be extremely valuable to both the writers and readers of Medium. These are all things that can be created with the right data and enough manpower. We have the data, now we just need your help.

----------

Intro to the Data

The data-set consists of 1.4 million stories from 95 of Medium’s most popular story-tags. Every story was published between August 1st, 2017 and August 1st, 2018.

I chose to collect the contents of story cards rather than the contents of entire stories for a few reasons. First, I didn’t want to run into any issues with Medium’s ownership rules. Second, it is around 90x faster to scrape story-cards than it is to scrape entire articles (which means more data in less time and with less memory).

Here is the full list of the information I was able to collect for each story: Title, Sub-Title, Author, Publication, Date, Tags, Read-Time, Claps-Received, Story-URL, and Author-URL.

----------

If you want a more in-depth introduction to the data-set, look at my GitHub. In the following repository I published a data analysis notebook to answer some of the most interesting questions about Medium’s readers, authors, and publications. Here’s the link: https://github.com/harrisonjansma/Analyzing_Medium/blob/master/Medium_EDA_expanded.ipynb
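To make the ten per-story fields listed in the introduction more concrete, here is a minimal Python sketch of how one story card might be represented and summarized. Everything about the file layout is an assumption for illustration: the column names (`title`, `read_time`, `claps`, and so on), the `|` separator inside the tags column, and the three toy rows are hypothetical and are not taken from the actual data-set files.

```python
import csv
import io
from dataclasses import dataclass
from statistics import median


@dataclass
class StoryCard:
    """One story card; field names are hypothetical stand-ins for the ten listed fields."""
    title: str
    subtitle: str
    author: str
    publication: str
    date: str
    tags: list[str]
    read_time: int  # assumed to be whole minutes
    claps: int
    story_url: str
    author_url: str


def load_story_cards(csv_text: str) -> list[StoryCard]:
    """Parse CSV text with the assumed column names into StoryCard records."""
    cards = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cards.append(StoryCard(
            title=row["title"],
            subtitle=row["subtitle"],
            author=row["author"],
            publication=row["publication"],
            date=row["date"],
            # Assumption: several tags are packed into one cell, separated by "|".
            tags=[t for t in row["tags"].split("|") if t],
            read_time=int(row["read_time"]),
            claps=int(row["claps"]),
            story_url=row["story_url"],
            author_url=row["author_url"],
        ))
    return cards


def median_claps_by_tag(cards: list[StoryCard]) -> dict[str, float]:
    """Group clap counts by tag and return the median for each tag."""
    by_tag: dict[str, list[int]] = {}
    for card in cards:
        for tag in card.tags:
            by_tag.setdefault(tag, []).append(card.claps)
    return {tag: median(values) for tag, values in by_tag.items()}


# Toy placeholder rows, NOT real data from the data-set.
SAMPLE = """\
title,subtitle,author,publication,date,tags,read_time,claps,story_url,author_url
Example A,Sub A,Author One,Pub X,2018-01-15,ai|data-science,5,100,https://example.com/a,https://example.com/@one
Example B,,Author Two,,2018-03-02,ai,3,20,https://example.com/b,https://example.com/@two
Example C,Sub C,Author One,Pub X,2018-07-30,data-science,8,300,https://example.com/c,https://example.com/@one
"""

if __name__ == "__main__":
    cards = load_story_cards(SAMPLE)
    print(median_claps_by_tag(cards))
```

A per-tag median like this is one simple way to approach the “How many claps should I give?” question; the median is used rather than the mean on the assumption that a few very popular stories would otherwise dominate the average.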
