KDC-4007 Dataset Collection Data Set (Free)

jsaiyyp 12 2021-08-31 Machine Learning

Resource Description

Arazo M. Mustafa (arazo.2007 '@' yahoo.com),
School of Computer Science, University of Sulaimania, Kurdistan, Iraq

Data Set Information:

The main strengths of this dataset are that it is simple to use and well documented, which makes it suitable for a wide range of text-analysis studies on Kurdish Sorani news and articles.
The documents fall into eight categories: Sport, Religion, Art, Economic, Education, Social, Style, and Health. Each category contains about 500 text documents, and the corpus totals 4,007 text files.
The dataset and documents are freely accessible so that experimental results can be reproduced.
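
As a quick sanity check on the category sizes described above, the following minimal sketch counts the text files per category. It assumes a hypothetical layout with one folder per category containing `.txt` files; the actual archive may be organized differently, and the function name `count_documents` is illustrative only.

```python
from pathlib import Path

# Category names as listed in the dataset description.
CATEGORIES = ["Sport", "Religion", "Art", "Economic",
              "Education", "Social", "Style", "Health"]

def count_documents(root):
    """Count .txt files per category under root.

    Assumes (hypothetically) one sub-folder per category; a missing
    folder is reported as 0 rather than raising an error.
    """
    root = Path(root)
    counts = {}
    for name in CATEGORIES:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(1 for _ in folder.glob("*.txt"))
        else:
            counts[name] = 0
    return counts
```

Calling `count_documents("path/to/Orig-Ds")` (a placeholder path) and summing the values should give the corpus total if the layout assumption holds.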

Attribute Information:

There are four collections:

- ST-Ds dataset: only stop-word elimination is performed, using the Kurdish preprocessing-step approach.
- Pre-Ds dataset: the full Kurdish preprocessing-step approach is applied.
- Pre+TW-Ds dataset: TF×IDF term weighting is applied to the Pre-Ds dataset.
- Orig-Ds dataset: the original dataset, with no processing applied.
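
To illustrate the two processing steps named above, here is a minimal sketch of stop-word elimination followed by TF×IDF term weighting. It is not the authors' Kurdish preprocessing approach: the stop-word list and the sample tokens are English placeholders, and the weighting uses one common variant (term frequency normalized by document length, `idf = log(N / df)`), which may differ from the formula used to build Pre+TW-Ds.

```python
import math
from collections import Counter

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Placeholder data, not taken from KDC-4007.
STOP_WORDS = {"the", "of"}
raw_docs = [
    "the match of the season".split(),
    "the price of oil".split(),
    "the season of rain".split(),
]
cleaned = [remove_stop_words(d, STOP_WORDS) for d in raw_docs]
weighted = tf_idf(cleaned)
```

In this toy example a term that occurs in only one document ("match") receives a higher weight than one shared by two documents ("season"), which is the behavior term weighting is meant to capture.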

Relevant Papers:

[1] Arazo M. Mustafa and Tarik A. Rashid, "Kurdish Stemmer Pre-processing Steps for Improving Information Retrieval", Journal of Information Science, first published January 1, 2017, DOI: 10.1177/0165551516683617.
[2] Tarik A. Rashid, Arazo M. Mustafa and Ari M. Saeed, "A Robust Categorization System for Kurdish Sorani Text Documents", Information Technology Journal, 16: 27-34, 2017.
[3] Tarik A. Rashid, Arazo M. Mustafa and Ari M. Saeed, "Automatic Kurdish Text Classification Using KDC 4007 Dataset", accepted in a Springer book, series: Lecture Notes on Data Engineering and Communications Technologies, book title: Advances in Internetworking, Data & Web Technologies, 2017.

 
