Spam SMS Classifier Dataset
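Spam detection is one of the earliest machine learning tasks to be put into practice on the internet. It is also an NLP task, and more specifically a text classification task. If you want to practice solving this kind of problem, the Spam SMS dataset is a good choice: it is widely used and well suited to beginners.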
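The best thing about this dataset is that it was assembled from several sources on the internet. For example, it includes 425 spam messages extracted from the Grumbletext website, 3,375 messages randomly selected from the NUS SMS Corpus (NSC) of the National University of Singapore, and 450 messages taken from Caroline Tag's PhD thesis, among others. The dataset itself has two columns: the label (ham or spam) and the raw text.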
1 Dataset sample
Let's load the data and see what it looks like:
ham What you doing?how are you?
ham Ok lar... Joking wif u oni...
ham dun say so early hor... U c already then say...
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham Siva is in hostel aha:-.
ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
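The article does not show its loading code, so here is a minimal sketch of how the two-column layout could be parsed. It assumes each record is one line with the label and the message separated by a tab character; the function name `parse_sms_lines` is hypothetical and chosen only for this illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_sms_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each line into (label, text).

    Assumption: label and message are separated by a single tab.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the message survive.
        label, sep, text = line.partition("\t")
        if not sep or label not in ("ham", "spam"):
            raise ValueError(f"unexpected record: {line!r}")
        rows.append((label, text))
    return rows


# Two records from the sample above, written in the assumed tab-separated form.
sample = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname "
    "the capital of Australia? Text MQUIZ to 82277. B\n",
]

rows = parse_sms_lines(sample)
label_counts = Counter(label for label, _ in rows)
print(label_counts)
```

Because the function accepts any iterable of strings, an open file object can be passed in place of `sample`; the path to your local copy of the dataset would be whatever you saved it as.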
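2 What problems is this public dataset suited for?
As the name suggests, this dataset is best suited for spam detection and text classification. It also comes up often in job interviews, so it is worth practicing on.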
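To make the text classification task concrete, here is a rough baseline sketch. The article does not prescribe a model; multinomial Naive Bayes with add-one smoothing is simply a common starting point chosen for this illustration. The helper names (`tokenize`, `train_naive_bayes`, `predict`) are hypothetical, and the five training messages are taken from the sample above, far too few for a real evaluation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(rows):
    """Count messages per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in rows:
        tokens = tokenize(text)
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training messages with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train_rows = [
    ("ham", "What you doing?how are you?"),
    ("ham", "Ok lar... Joking wif u oni..."),
    ("ham", "Siva is in hostel aha:-."),
    ("spam", "Sunshine Quiz! Win a super Sony DVD recorder if you canname "
             "the capital of Australia? Text MQUIZ to 82277. B"),
    ("spam", "URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus "
             "Caller Prize on 02/09/03! This is our 2nd attempt to contact "
             "YOU! Call 0871-872-9758 BOX95QU"),
]

model = train_naive_bayes(train_rows)
print(predict(model, "Win a Bonus Prize! Call now"))
print(predict(model, "how are you doing"))
```

On the full dataset you would hold out part of the data for testing and, since ham messages greatly outnumber spam, look at precision and recall rather than accuracy alone.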
3 Useful links
More information about this dataset can be found at the following links: