English Wikipedia Articles 2017-08-20 SQLite (Free)

jsaifc 22 2021-09-06 Text corpus

Resource Description


Context

This dataset was originally intended for the Data Science Nashville November 2018 meetup: [Introduction to Gensim][1]. I wanted to provide a large text corpus in a format often seen in industry, so I pulled the English Wikipedia dump from 2017-08-20, extracted the text using Gensim's excellent [segment_wiki][2] script, and finally wrote some custom code to populate a SQLite database.

The dataset encompasses nearly 5 million articles, with more than 23 million individual sections. Only article text is included; all links have been stripped, and no metadata (e.g., behind-the-scenes discussion or version history) is included. Even then, I just barely met the file size limit, coming in at just below 20 GB.

Content

I wanted to keep things simple, so everything is in a single table: **articles**. There is an index on article_id.

- **article_id**: Int, identifier for each unique title
- **article_title**: Str, article titles
- **section_title**: Str, subsection title from each article
- **section_text**: Str, text from each subsection

I've also pre-trained some simple topic models and word embeddings based on this dataset. At time of upload, the file size limit is 20 GB, so I created another dataset that contains the pre-trained gensim models: [English Wikipedia Articles 2017-08-20 Models][3].

Acknowledgements

As per The Wikimedia Foundation's [requirements][4], this dataset is provided under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Permission is granted to copy, distribute, and/or modify Wikipedia's text under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License and, unless otherwise noted, the GNU Free Documentation License, unversioned, with no invariant sections, front-cover texts, or back-cover texts.

The banner image is provided by [Lysander Yuen][5] on [Unsplash][6].

[1]: https://www.meetup.com/Data-Science-Nashville/events/256605771/
[2]: https://radimrehurek.com/gensim/scripts/segment_wiki.html
[3]: https://www.kaggle.com/jkkphys/english-wikipedia-articles-20170820-models
[4]: https://en.wikipedia.org/wiki/Wikipedia:Database_download
[5]: https://unsplash.com/@_lysander_yuen
[6]: https://unsplash.com/
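
As a rough illustration of how the single **articles** table described under Content might be read, here is a minimal sketch using Python's standard `sqlite3` module. The database file name `DB_PATH` and the helper names `fetch_sections` and `iter_article_texts` are assumptions made for this example; they are not part of the dataset. Only the table and column names come from the description above.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

# Assumption: hypothetical file name; substitute the path of the downloaded database.
DB_PATH = "enwiki-20170820.db"


def fetch_sections(conn, article_title):
    """Return a list of (section_title, section_text) rows for one article title."""
    cursor = conn.execute(
        "SELECT section_title, section_text "
        "FROM articles WHERE article_title = ?",
        (article_title,),
    )
    return cursor.fetchall()


def iter_article_texts(conn):
    """Yield (article_id, text), with each article's sections joined by newlines.

    Rows are streamed in article_id order, so the whole table never has to
    be held in memory. The order of sections within an article is whatever
    SQLite returns; it is not guaranteed to match the original article.
    """
    cursor = conn.execute(
        "SELECT article_id, section_text FROM articles ORDER BY article_id"
    )
    for article_id, rows in groupby(cursor, key=itemgetter(0)):
        yield article_id, "\n".join(text for _, text in rows)


# Example usage (assumes the file at DB_PATH exists):
#     conn = sqlite3.connect(DB_PATH)
#     for article_id, text in iter_article_texts(conn):
#         ...  # e.g. tokenize the text and feed it to a Gensim model
#     conn.close()
```

If the index on article_id is the only index, the title lookup in `fetch_sections` will likely scan the full table, so it may be slow across 23 million sections; streaming by article_id, as in `iter_article_texts`, is probably the better fit for bulk processing.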
