NLP: Natural Language Processing

One of my specialties (or at least something that I am interested in) is NLP, or natural language processing.

NLP is basically programming computers to process and analyze large quantities of “natural language” data – or more simply, words. NLP is used in speech recognition, machine translation, data mining, chatbots, etc. One of my more adventurous uses of NLP can be experienced over on my AI BL0G, where I attempt natural-language generation.

NLP’s history goes all the way back to 1950, when Alan Turing published “Computing Machinery and Intelligence,” which outlined the concept of what we now call the Turing Test. The idea, loosely, is this: can a computer create something, à la a blog post on AIBL0G, that is indistinguishable from one created by a human? While my attempts are not quite there (but getting closer!), others like OpenAI are very close indeed.

At its simplest, NLP is concerned with a corpus, which consists of documents, which consist of words, which in turn can be described in terms of parts of speech, morphemes, tokens, stems, lexical semantics, etc.
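
To make that hierarchy a bit more concrete, here is a minimal sketch in plain Python with no NLP library. The sample sentences and the tokenize helper are made up for illustration, and the regular expression is only a crude stand-in for a real tokenizer.

```python
import re
from collections import Counter

# A toy corpus: a list of documents, each one a plain string (made-up text).
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

def tokenize(document):
    """Lowercase a document and split it into word tokens (crude regex approach)."""
    return re.findall(r"[a-z']+", document.lower())

# corpus -> documents -> tokens
tokenized = [tokenize(doc) for doc in corpus]

# Count how often each token appears across the whole corpus.
counts = Counter(token for doc in tokenized for token in doc)

print(tokenized[0])
print(counts.most_common(2))
```

A real tokenizer also has to deal with punctuation, contractions, and languages that don't separate words with spaces, which is a big part of why the libraries mentioned further down exist.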

In the last century, natural language processing was most often accomplished with hand-written rules and, later, statistical analysis. The early approach basically meant coming up with a set of rules and feeding the data through them. While this yielded some significant successes, it wasn’t until machine learning algorithms were applied to NLP that dramatic improvements were realized.

Some will argue that with new deep learning techniques, elaborate feature engineering is no longer needed. I have found this to be untrue. While I never spent time doing traditional NLP, I think that extracting features from raw data is still of primary importance. I will admit that tuning a data mining technique is better than hand-writing feature rules…

Beyond my AIBL0G, I use NLP machine learning to analyse the data that HEADLIN3S collects. The HEADLIN3S spiders collect a lot of data – at this point, a quarter million links and their metadata from news sites. NLP plays a large part in determining which news articles are the most representative on any given day. NLP also aids in the creation of the new headline and excerpt the system generates for each article.

If you are interested in getting started with NLP, Python is the way to go, imho. While I have been writing much more JS than Python lately, I still prefer Python, and when it comes to data science and machine learning, Python is king.

One of the oldest tools, and a simple one to get started with, is NLTK. If you were to learn how to use NLTK, you would be well on your way to understanding NLP well enough to create your first project. After you are comfortable with NLTK, you could move on to Pattern, TextBlob, and my favorite NLP tool, spaCy.

If you are interested in studying some of the concepts that these high-level tools hide from you, I would suggest starting with tf-idf, or term frequency-inverse document frequency. It scores a word higher the more often it appears in a document and lower the more documents in the corpus contain it. Once I really understood tf-idf, my NLP efforts took the next step; the concept is crucial to understand even if you never use the equation.
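
For the curious, here is a rough sketch of tf-idf in plain Python. It uses one common textbook form of the equation (term frequency times the log of total documents over documents containing the term); libraries often use smoothed or normalized variants, so their numbers will differ. The documents and function names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made-up example data).
docs = [
    ["python", "is", "great", "for", "nlp"],
    ["nlp", "is", "fun"],
    ["python", "is", "king"],
]

def tf(term, doc):
    """Term frequency: the share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(term, doc, docs):
    """Score how characteristic `term` is of `doc` relative to the whole corpus."""
    return tf(term, doc) * idf(term, docs)

# Score every word of the last document against the whole corpus.
for term in docs[2]:
    print(term, round(tf_idf(term, docs[2], docs), 3))
```

The word "is" shows up in every document, so its score comes out to zero, while "king" appears only in the last document and scores highest there. That is the whole idea: words that are common everywhere tell you little about any one document.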

Next week I’ll share some of the code that I used to get started down the NLP path. Until then, feel free to share your NLP projects in the comments.

