Wednesday, January 25, 2017

Text Analytics

Text mining, or text analytics, is the process of deriving high-quality information from text:
  • discovering patterns and trends
  • an estimated 80% of enterprise information originates, and stays locked, in unstructured form rather than as numerical data
  • deriving information from unstructured sources
Typical examples:
  • detecting terrorist activity
  • finding a protein in the biomedical literature that may be linked to cancer
High quality is the first concern with text data:
  • choose the right source
  • That a source is available doesn’t mean it’s right for the job
  • source selection criteria include topicality, focus (high signal-to-noise ratio), currency, authority, your processing capabilities, and your analytics needs
Three types of approaches:
  • co-occurrence based (see the sketch after this list)
  • rule-based
  • machine learning based
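
To make the co-occurrence idea concrete, here is a minimal sketch of my own (the toy corpus and word pairs are made up for illustration): words that repeatedly appear in the same sentence are counted together, and high pair counts hint at a relationship.

from collections import Counter
from itertools import combinations

# Toy corpus: each string stands in for one sentence of source text.
sentences = [
    "aspirin inhibits platelet aggregation",
    "aspirin reduces platelet activity",
]

pair_counts = Counter()
for s in sentences:
    words = sorted(set(s.lower().split()))
    # Count every unordered pair of distinct words in the sentence.
    pair_counts.update(combinations(words, 2))

# Pairs that recur across sentences, e.g. (aspirin, platelet), rise to the top.
print(pair_counts.most_common(3))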

Natural Language ToolKit (NLTK)

The toolkit is powerful. However, after a few hours I think this is the wrong direction to look in: disassembling the text into words loses the context information, and without context the understanding is quite superficial (a toy example follows below). The machine is only good at the “literal meaning”. Statistical learning is meant for numerical data, not text data.
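A toy example of what gets lost (my own, not from NLTK): once sentences are reduced to word counts, word order, and with it the meaning, disappears.

from collections import Counter

# Two sentences with opposite meanings...
bag_a = Counter("man bites dog".split())
bag_b = Counter("dog bites man".split())

# ...become indistinguishable as bags of words.
print(bag_a == bag_b)  # True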
To me, the nice thing about NLTK is that it has some great datasets drawn from the masterpieces of English literature.
# Install NLTK and download its data (shell commands):
conda install nltk
python -m nltk.downloader all

# Display a parsed sentence from the Penn Treebank sample:
import nltk
import IPython
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
IPython.core.display.display(t)

# Explore the bundled books:
from nltk.book import *
text1.concordance("monstrous")  # show every occurrence of a word, in context
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

# List the available Gutenberg texts and Brown corpus categories:
nltk.corpus.gutenberg.fileids()
from nltk.corpus import brown
brown.categories()
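
The kind of statistical result I mean is, for example, a word-frequency count over Moby Dick (text1 from nltk.book):

from nltk import FreqDist
from nltk.book import text1

# The most frequent tokens are mostly punctuation and stop words.
fdist = FreqDist(text1)
print(fdist.most_common(10))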
Statistical results may be interesting, but not as interesting as reading a real book from cover to cover.