Text mining, or text analytics, is the process of deriving high-quality information from text.
- discovering patterns and trends
- roughly 80% of enterprise information originates in, and stays locked in, unstructured form rather than numerical data
- deriving information from unstructured sources, for example:
- detecting terrorist activity
- finding a protein in the biomedical literature that may be linked to a cancer
High quality is the first concern with text data:
- choose the right source
- The fact that a source is available doesn’t mean it’s right for the job.
- Source selection criteria include topicality, focus (a high signal-to-noise ratio), currency, authority, and your own processing capabilities and analytics needs.
Three types of approaches:
- co-occurrence based
- machine learning based
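The co-occurrence-based idea can be sketched in a few lines: count how often two terms appear together in the same sentence. The toy sentences below are made up for illustration; a real system would use a proper tokenizer and a large corpus.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: three hypothetical sentences (illustration only).
sentences = [
    "the protein binds the receptor",
    "the receptor signals the cell",
    "the protein may inhibit the tumor",
]

# Count pairs of words that co-occur within a sentence.
cooccur = Counter()
for sent in sentences:
    words = set(sent.split())
    for a_w, b_w in combinations(sorted(words), 2):
        cooccur[(a_w, b_w)] += 1

print(cooccur[("protein", "receptor")])  # 1: they share one sentence
```

High pair counts (relative to chance) are then read as evidence of a relationship between the two terms.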
Natural Language ToolKit (NLTK)
The toolkit is powerful. However, after a few hours I think this is the wrong direction to look in, because disassembling text into words loses the context. Without context, the understanding is quite superficial: the machine is only good at the “literal meaning”. Statistical learning is meant for numerical data, not text data.
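A two-line illustration of the point: once sentences are reduced to bags of words, word order (and with it the meaning) is gone.

```python
from collections import Counter

# Two sentences with opposite meanings but identical bags of words.
a = "the dog bit the man"
b = "the man bit the dog"

print(Counter(a.split()) == Counter(b.split()))  # True: the bags are indistinguishable
```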
To me, the nice thing about NLTK is that it ships some great datasets drawn from the masterpieces of English literature.
```
conda install nltk
python -m nltk.downloader all
```

```python
import nltk
import IPython
from nltk.corpus import treebank

t = treebank.parsed_sents('wsj_0001.mrg')
IPython.core.display.display(t)

from nltk.book import *
text1.concordance("monstrous")
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

nltk.corpus.gutenberg.fileids()

from nltk.corpus import brown
brown.categories()
```
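The statistical results those corpora yield mostly boil down to frequency distributions. Here is a self-contained sketch using a made-up ten-word sentence in place of a downloaded Gutenberg text:

```python
from collections import Counter

# Hypothetical mini-corpus standing in for a real book;
# a word-frequency distribution is the typical first result.
text = "the whale and the sea and the ship and the whale"
freq = Counter(text.split())
print(freq.most_common(2))  # [('the', 4), ('and', 3)]
```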
Statistical results may be interesting, but not as much as reading a real book from cover to cover.