In this edition of the blog series on Clojure/Python interop with libpython-clj, we’ll be taking a look at two popular Python NLP libraries: NLTK and SpaCy.
NLTK - Natural Language Toolkit
I was taking requests for examples of Python-Clojure interop libraries on Twitter the other day, and by far NLTK was the most requested library. After looking into it, I can see why. It’s the most popular natural language processing library in Python, and you will see it pretty much everywhere someone is working with text.
Installation
To use the NLTK toolkit, you will need to install it. I used sudo pip3 install nltk, but libpython-clj now supports virtual environments with this PR, so feel free to use whatever works best for you.
Features
We’ll take a quick tour of the features of NLTK, following along initially with the official NLTK book and then moving on to this more data-task-centered tutorial.
First, we need to require all of our things as usual:
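The exact require block isn’t reproduced here, but a minimal sketch of the usual libpython-clj setup (the namespace name is hypothetical; require-python and the py helpers are libpython-clj’s) would look something like:

;; hypothetical namespace; the requires are the standard libpython-clj entry points
(ns example.nltk
  (:require [libpython-clj.require :refer [require-python]]
            [libpython-clj.python :as py :refer [py. py.- py..]]))

(require-python '([nltk :as nltk]))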
There are all sorts of packages available to download from NLTK. To start out and tour the library, I would go with a small one that has basic data for the nltk book tutorial.
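For instance, grabbing the book data and then requiring the module that goes with it might look like this (a sketch following the same nltk/download pattern used below):

(nltk/download "book")
;; once the data is downloaded, the book module can be required
(require-python '([nltk.book :as book]))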
There are all sorts of other downloads as well, such as (nltk/download "popular") for the most used ones. You can also download "all", but beware that it is big.
You can check out some of the texts it downloaded with:
(book/texts)
;;; prints out in repl
;; text1: Moby Dick by Herman Melville 1851
;; text2: Sense and Sensibility by Jane Austen 1811
;; text3: The Book of Genesis
;; text4: Inaugural Address Corpus
;; text5: Chat Corpus
;; text6: Monty Python and the Holy Grail
;; text7: Wall Street Journal
;; text8: Personals Corpus
;; text9: The Man Who Was Thursday by G . K . Chesterton 1908

book/text1 ;=> <Text: Moby Dick by Herman Melville 1851>
book/text2 ;=> <Text: Sense and Sensibility by Jane Austen 1811>
You can do fun things like see how many tokens are in a text:
(count (py.- book/text3 tokens)) ;=> 44764
Or even see the lexical diversity, a measure of the richness of the text obtained by comparing the number of unique word tokens against the total number of tokens.
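The code for that isn’t shown in this excerpt, but a small sketch of the calculation (lexical-diversity is a helper name introduced here for illustration, reusing the py.- attribute access from above) could be:

;; hypothetical helper: ratio of unique tokens to total tokens
(defn lexical-diversity [text]
  (let [tokens (py.- text tokens)]
    (/ (count (set tokens))
       (double (count tokens)))))

(lexical-diversity book/text3)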
This is of course all very interesting, but I prefer to look at more practical tasks, so we are going to look at some sentence tokenization.
Sentence Tokenization
Text can be broken up into individual word tokens or sentence tokens. Let’s start off with the tokenize package:
(require-python '([nltk.tokenize :as tokenize]))

(def text "Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard")
To tokenize sentences, you take the text and use tokenize/sent_tokenize.
(def text"Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard")(def tokenized-sent(tokenize/sent_tokenizetext))tokenized-sent;;=> ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
Likewise, to tokenize words, you use tokenize/word_tokenize:
(def text"Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard")(def tokenized-sent(tokenize/sent_tokenizetext))tokenized-sent;;=> ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"](def tokenized-word(tokenize/word_tokenizetext))tokenized-word;;=> ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
Frequency Distribution
You can also look at the frequency distribution of the words using the probability package.
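The code isn’t shown in this excerpt, but a minimal sketch using nltk.probability/FreqDist (the class name is NLTK’s own; the alias is mine) might be:

(require-python '([nltk.probability :as probability]))

;; build a frequency distribution over the word tokens from above
(def fdist (probability/FreqDist tokenized-word))

;; the five most frequent tokens with their counts
(py. fdist most_common 5)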
Stemming and lemmatization provide ways to reduce text to base words and to normalize it.
For example, the word flying has a stemmed word of fli and a lemma of fly.
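A short sketch of both, using NLTK’s PorterStemmer and WordNetLemmatizer (the lemmatizer assumes the wordnet data has been downloaded):

(require-python '([nltk.stem :as stem]
                  [nltk.stem.wordnet :as wordnet]))

;; stemming chops words down to a crude root
(def ps (stem/PorterStemmer))
(py. ps stem "flying")
;;=> "fli"

;; lemmatization maps words to a real dictionary form
;; (requires (nltk/download "wordnet") first)
(def lem (wordnet/WordNetLemmatizer))
(py. lem lemmatize "flying" "v")
;;=> "fly"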
It also has support for Part-of-Speech (POS) Tagging. A quick example of that is:
(let [sent"Albert Einstein was born in Ulm, Germany in 1879."tokens(nltk/word_tokenizesent)]{:tokenstokens:pos-tag(nltk/pos_tagtokens)});; {:tokens;; ['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.'],;; :pos-tag;; [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'), ('in', 'IN'), ('1879', 'CD'), ('.', '.')]}
Phew! That’s a brief overview of what NLTK can do. Now what about the other library, SpaCy?
SpaCy
SpaCy is the main competitor to NLTK. It is a more opinionated, object-oriented library, whereas NLTK mainly works on strings of text. It has better performance for tokenization and POS tagging, and it supports word vectors, which NLTK does not.
Let’s dive in and take a look at it.
Installation
To install spaCy, you will need to do:
pip3 install spacy

# and then fetch the small English language model
python3 -m spacy download en_core_web_sm
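A quick sketch of using it from Clojure, running the same kind of tokenization and POS tagging we did with NLTK (the model name comes from the download above; calling the bridged nlp object directly as a function is a libpython-clj convenience):

(require-python '([spacy :as spacy]))

(def nlp (spacy/load "en_core_web_sm"))

;; processing a string returns a spaCy Doc, which is seqable token by token
(let [doc (nlp "Albert Einstein was born in Ulm, Germany in 1879.")]
  (map (fn [token]
         [(py.- token text) (py.- token pos_)])
       doc))
;;=> (["Albert" "PROPN"] ["Einstein" "PROPN"] ...)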
As you can see, it can handle pretty much the same things as NLTK. But let’s take a look at what it can do that NLTK can’t, and that is word vectors.
Word Vectors
In order to use word vectors, you will have to load up a medium or large data model, because the small ones don’t ship with word vectors. You can do that at the command line with:
python3 -m spacy download en_core_web_md
You will need to restart your repl and then load it with:
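A sketch of that, re-requiring spacy after the restart and comparing word-vector similarity between a few tokens (the similarity method is spaCy’s; the destructuring is just Clojure convenience):

(require-python '([spacy :as spacy]))

(def nlp (spacy/load "en_core_web_md"))

;; tokens with vectors can be compared pairwise
(let [doc (nlp "dog cat banana")
      [dog cat banana] (vec doc)]
  {:dog-cat    (py. dog similarity cat)
   :dog-banana (py. dog similarity banana)})
;; dog/cat should score much more similar than dog/banana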