Word Frequency with Python
One of the key steps in NLP, or Natural Language Processing, is counting how often each term appears in a text document or table. To do this, we must first tokenize the text so that each word becomes an individual object that can be counted. There are many libraries you can use to tokenize words, but the most popular Python library is NLTK, the Natural Language Toolkit.
This tutorial will show you how to leverage NLTK to create word frequency counts and use these to create a word cloud. Let's review the code below or watch the video presentation.
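As a rough illustration of what tokenizing and counting mean, here is a minimal sketch that uses only the Python standard library. The regular expression is a crude stand-in for a real tokenizer such as NLTK's, and the helper name `simple_tokenize` and the sample text are made up for this example.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Split text into lowercase alphabetic tokens (a rough stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


text = "The cat sat on the mat. The mat was warm."

# Each word becomes its own object in a list...
tokens = simple_tokenize(text)

# ...which Counter can then tally.
counts = Counter(tokens)
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```

A real tokenizer handles contractions, punctuation, and other edge cases that this regular expression ignores.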
# load the essentials for data manipulation
import pandas as pd
import numpy as np
# load the NLTK stop words to remove articles, prepositions and other words that are not actionable
from nltk.corpus import stopwords
# word_tokenize splits a bag of words into individual tokens
from nltk.tokenize import word_tokenize
# the lemmatizer reduces words to their base form
from nltk.stem import WordNetLemmatizer
# ngrams groups words into common pairs, trigrams, etc.
from nltk import ngrams
# Counter counts the objects
from collections import Counter
# these are our visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
# note: on first use, the NLTK data (stop words, tokenizer models, WordNet) must be fetched with nltk.download()
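Now that you have the basic libraries, review the function below. It joins the text, tokenizes it, lowercases it, removes stop words and non-alphabetic tokens such as numbers, lemmatizes what is left, and creates data frames of word, pair, and trigram counts.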
def word_frequency(sentence):
    # join all the sentences into one string
    sentence = " ".join(sentence)
    # tokenize, lowercase, drop stop words and non-alphabetic tokens, then lemmatize
    new_tokens = word_tokenize(sentence)
    new_tokens = [t.lower() for t in new_tokens]
    # build the stop word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    new_tokens = [t for t in new_tokens if t not in stop_words]
    new_tokens = [t for t in new_tokens if t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    new_tokens = [lemmatizer.lemmatize(t) for t in new_tokens]
    # count the words, pairs and trigrams
    counted = Counter(new_tokens)
    counted_2 = Counter(ngrams(new_tokens, 2))
    counted_3 = Counter(ngrams(new_tokens, 3))
    # create 3 data frames, sorted by frequency, and return them
    word_freq = pd.DataFrame(counted.items(), columns=['word', 'frequency']).sort_values(by='frequency', ascending=False)
    word_pairs = pd.DataFrame(counted_2.items(), columns=['pairs', 'frequency']).sort_values(by='frequency', ascending=False)
    trigrams = pd.DataFrame(counted_3.items(), columns=['trigrams', 'frequency']).sort_values(by='frequency', ascending=False)
    return word_freq, word_pairs, trigrams
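To make the pair and trigram counting less of a black box, here is a small standard-library sketch of the sliding-window idea behind n-grams. The helper `make_ngrams` and the sample tokens are hypothetical and only for illustration; it is not NLTK's implementation, just the same concept written by hand.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return a list of tuples, each holding n consecutive tokens."""
    # zip over the token list shifted by 0, 1, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["buy", "the", "dip", "buy", "the", "stock"]

pair_counts = Counter(make_ngrams(tokens, 2))
trigram_counts = Counter(make_ngrams(tokens, 3))

print(pair_counts[("buy", "the")])  # 2
print(len(trigram_counts))          # 4
```

Each n-gram is a tuple, which is why the `pairs` and `trigrams` columns of the data frames hold tuples rather than plain strings.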
The next step is to visualize these words so that you can see how they stack up in terms of frequency.
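# create subplots of the different data frames
# data2, data3 and data4 are assumed to hold the word, pair and trigram data frames returned by word_frequency
fig, axes = plt.subplots(3, 1, figsize=(8, 20))
# plt.subplots(3, 1) returns an array of three axes, so each plot gets its own index
sns.barplot(ax=axes[0], x='frequency', y='word', data=data2.head(30))
sns.barplot(ax=axes[1], x='frequency', y='pairs', data=data3.head(30))
sns.barplot(ax=axes[2], x='frequency', y='trigrams', data=data4.head(30))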
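Because the pair and trigram entries are tuples, they may not read well as axis labels. If that happens, one option is to join each tuple into a single string before plotting. The sketch below shows the idea with plain Python and made-up counts; applying it to the data frame columns is left as an assumption about your setup.

```python
from collections import Counter

# Made-up pair counts, standing in for the real output of the counting step.
pair_counts = Counter({("buy", "the"): 2, ("the", "dip"): 1})

# Build (label, frequency) rows with each tuple joined into one string.
rows = [(" ".join(pair), freq) for pair, freq in pair_counts.most_common(30)]
print(rows)  # [('buy the', 2), ('the dip', 1)]
```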
Watch the Full Video Presentation of NLTK Word Frequency with WallstreetBets Data
This tutorial will take you through all of the steps above. The tutorial below will show you how to shape this data into cool word clouds.