Table of Contents
This post covers the following topics:
- Introduction to NLP
- Text operations and Regular expressions
- Tokenization
- Stemming and Lemmatization
- Stop words
- POS tagging and Named Entity Recognition (NER)
Introduction to NLP
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
In this article, we will study natural language processing with the spaCy and NLTK libraries.
Pre-requisites:
Install Anaconda from this link.
Install spaCy from the conda channels.
Install NLTK from the conda channels.
The English language model for spaCy can be installed with:
#run the command prompt in admin mode
python -m spacy download en_core_web_sm #the small English model loaded later with spacy.load('en_core_web_sm')
#https://spacy.io/usage
Run this command to download the spaCy model that supports the English language
#download all NLTK data
#run the Anaconda command line in admin mode
import nltk
nltk.download()
Text operations and Regular expressions
Now, let’s look at a few basic text operations.
print("Natural Language Processing")
#output:
#Natural Language Processing
Printing a basic string in Python
We use Indexing and Slicing to select a specific range of characters from a string.
#first assign the string to a variable
string = "Language"
#get the first character by index
print(string[0])
#output: L
#print multiple characters
print(string[2], string[5])
#output: n a
#get a character with negative indexing (counting from the end)
print(string[-4])
#output: u
Basic Indexing operations
print(string[0:2])
#output: La
print(string[1:4])
#output: ang
Basic Slicing Operations
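Slices also accept omitted bounds, negative indices, and a step. The short sketch below reuses the word "Language" from the examples above; the variable names are only illustrative.

```python
# Reusing the example word from above
word = "Language"

first_four = word[:4]       # omitted start means "from the beginning"
rest = word[4:]             # omitted end means "to the end"
last_three = word[-3:]      # a negative start counts from the end
every_second = word[::2]    # a step of 2 takes every other character
reversed_word = word[::-1]  # a step of -1 walks the string backwards

print(first_four, rest, last_three, every_second, reversed_word)
# Lang uage age Lnug egaugnaL
```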
We have seen examples of indexing and slicing. Sometimes we also need to clean up strings, and strip() is useful for this.
#Remove * from the below statement
statement = "****Hello World! I am Chitti*, The Robot****"
#using strip function to remove asterisk(*)
print(statement.strip('*'))
#output:
#Hello World! I am Chitti*, The Robot
Stripping characters from the statement
Notice that the asterisk (*) is removed from the start and end of the statement, but not from the middle. If we don’t pass any characters to strip(), it removes whitespace from the beginning and end of the string.
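As a small sketch of the related string methods, the example below strips only one side with lstrip() and rstrip(), calls strip() with no argument, and uses replace() to remove the asterisk in the middle as well. The variable names and the padded sample string are made up for illustration.

```python
statement = "****Hello World! I am Chitti*, The Robot****"

# lstrip() and rstrip() strip only the left or right side
left_only = statement.lstrip('*')
right_only = statement.rstrip('*')

# strip() with no argument removes leading and trailing whitespace
padded = "   Hello World!   \n"
trimmed = padded.strip()

# replace() removes every occurrence, including the one in the middle
no_asterisks = statement.replace('*', '')

print(left_only)      # Hello World! I am Chitti*, The Robot****
print(right_only)     # ****Hello World! I am Chitti*, The Robot
print(repr(trimmed))  # 'Hello World!'
print(no_asterisks)   # Hello World! I am Chitti, The Robot
```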
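Now, let’s look at the join() function, which concatenates the items of a list and places the string it is called on between them.
str1 = 'Good'
str2 = 'Journey'
print("Luck. Happy".join([str1, str2]))
#output:
#GoodLuck. HappyJourney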
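In everyday use, the separator passed to join() is usually something short such as a space or a hyphen. The following sketch, with an illustrative word list, shows that usage and how split() reverses it.

```python
words = ["Natural", "Language", "Processing"]

# The string that join() is called on is placed between the items
spaced = " ".join(words)
hyphenated = "-".join(words)

# split() is the inverse operation: it breaks a string on a separator
round_trip = spaced.split(" ")

print(spaced)      # Natural Language Processing
print(hyphenated)  # Natural-Language-Processing
print(round_trip)  # ['Natural', 'Language', 'Processing']
```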
Regular Expressions
A regular expression is a sequence of characters that defines a search pattern, which is used to find substrings in a given string.
#import the 're' library to use regular expressions in code
import re
#some of the most common methods in the 're' library
#match() Determine if the RE matches at the beginning of the string
#search() Scan through a string, looking for any location where this RE matches
#findall() Find all the substrings where the RE matches, and return them as a list
#finditer() Find all substrings where the RE matches and return them as an iterator
#sub() Find all substrings where the RE matches and substitute them with the given string
Some of the most common methods from the ‘re’ library
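To make the methods listed above concrete, here is a minimal sketch that applies each of them to the same pattern. The sample sentence and the phone-number-like pattern are invented for illustration.

```python
import re

text = "Call 555-1234 or 555-9876 before 5 pm."
pattern = r"\d{3}-\d{4}"  # three digits, a hyphen, four digits

# match() only succeeds at the very start of the string
at_start = re.match(pattern, text)
print(at_start)  # None, because the text starts with "Call"

# search() finds the first match anywhere in the string
first = re.search(pattern, text)
print(first.group(), first.span())  # 555-1234 (5, 13)

# findall() returns every match as a list of strings
all_numbers = re.findall(pattern, text)
print(all_numbers)  # ['555-1234', '555-9876']

# finditer() yields match objects, which also carry positions
positions = [(m.start(), m.group()) for m in re.finditer(pattern, text)]
print(positions)  # [(5, '555-1234'), (17, '555-9876')]

# sub() replaces every match with the given string
masked = re.sub(pattern, "XXX-XXXX", text)
print(masked)  # Call XXX-XXXX or XXX-XXXX before 5 pm.
```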
Tokenization
Tokenization is the process of breaking raw text into smaller units called tokens, typically words or sentences. These tokens are the input to later NLP steps: analyzing the sequence of tokens helps in interpreting the meaning of the text and in building models on top of it.
#import the spaCy library
import spacy
#load the spaCy English model
load_en = spacy.load('en_core_web_sm')
#an example string; the double quotes are part of the text
example_string = '"I\'m going to meet\\ M.S. Dhoni."'
#run the string through the pipeline
words = load_en(example_string)
#print each token with a for loop
for token in words:
    print(token.text)
#output:
#"
#I
#'m
#going
#to
#meet\
#M.S.
#Dhoni
#.
#"
Tokenizing English statement
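For comparison, the sketch below is a deliberately naive tokenizer built only on regular expressions. It is not how spaCy works internally; the function names simple_tokenize and simple_sentences are hypothetical and exist only for this illustration. It shows why a library tokenizer is useful: the naive version breaks "I'm" and "M.S." into pieces that a rule-aware tokenizer handles more sensibly.

```python
import re

def simple_tokenize(text):
    """Naive word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sentences = simple_sentences("Tokenization is fun. It splits text into tokens!")
print(sentences)
# ['Tokenization is fun.', 'It splits text into tokens!']

tokens = simple_tokenize("I'm going to meet M.S. Dhoni.")
print(tokens)
# ['I', "'", 'm', 'going', 'to', 'meet', 'M', '.', 'S', '.', 'Dhoni', '.']
```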
Stemming
Stemming is the process of reducing a word to its stem or root form. Let us take an example. Consider three words: “branched”, “branching” and “branches”. They can all be reduced to the same word, “branch”. After all, all three convey the same idea of something separating into multiple paths or branches. This helps reduce complexity while retaining the essence of the meaning carried by these three words.
We use the “Porter stemmer” or the “Snowball stemmer” for stemming; both are available in NLTK.
#Porter stemmer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print(words)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
stemmer = PorterStemmer()
words_stem = [stemmer.stem(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_stem = []
for word in words:
    words_stem.append(stemmer.stem(word))
'''
print(words_stem)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
'''
Using Porter stemmer from NLTK library
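The “branched”, “branching”, “branches” example from the start of this section can be checked the same way. This is a small sketch that assumes NLTK is installed; the Porter stemmer itself needs no extra data downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The three variants discussed at the start of the Stemming section
variants = ["branched", "branching", "branches"]
stems = [stemmer.stem(word) for word in variants]

print(stems)
# ['branch', 'branch', 'branch']
```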
from nltk.stem import SnowballStemmer
# Languages supported by SnowballStemmer
print (SnowballStemmer.languages)
'''
Output: ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']
'''
Language Support for Snowball stemmer
from nltk.stem import SnowballStemmer
stemmer_spanish = SnowballStemmer('spanish')
print(stemmer_spanish.stem('trabajando'))
# output: trabaj
print(stemmer_spanish.stem('trabajos'))
# output: trabaj
print(stemmer_spanish.stem('trabajó'))
# output: trabaj
# In Python 3, strings are Unicode by default, so no decode() call is needed.
# (Python 2 required 'trabajó'.decode('utf-8') to avoid the following error:
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3)
Lemmatization
Lemmatization is closely related to stemming. It returns the lemma of a word, that is, its base (dictionary) form.
Difference between Stemming and Lemmatization
– A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
– While converting a word to its root/base form, stemming can create non-existent words, whereas lemmatization produces actual dictionary words.
– Stemmers are typically easier to implement than Lemmatizers.
– Stemmers run faster than Lemmatizers.
– The accuracy of stemming is less than that of lemmatization.
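The first two differences can be illustrated with a toy sketch. Neither function below is a real stemmer or lemmatizer: toy_stem, toy_lemmatize, and the tiny LEMMA_TABLE are hypothetical stand-ins, written only to show that suffix stripping ignores context and can produce non-words, while a lookup keyed on part of speech returns dictionary forms.

```python
# Toy lookup table: (word, part of speech) -> lemma. Hypothetical data for illustration.
LEMMA_TABLE = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
}

def toy_stem(word):
    """Strip a few common suffixes without looking at context."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look up the lemma of a word given its part of speech."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_stem("meeting"))               # meet
print(toy_lemmatize("meeting", "NOUN"))  # meeting
print(toy_lemmatize("meeting", "VERB"))  # meet
print(toy_stem("studies"))               # studi
print(toy_lemmatize("studies", "NOUN"))  # study
```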
Lemmatization is generally more informative than stemming: it looks beyond simple suffix stripping and takes the part of speech of a word into account. That is why spaCy provides lemmatization but not stemming, so we will perform lemmatization using spaCy.
#import library
import spacy
#Loading spacy english library
load_en = spacy.load('en_core_web_sm')
#an example string
example_string = load_en('Manchester United is looking to sign a forward for $90 million')
#print the text, part of speech, lemma hash, and lemma string of each token
for lem_word in example_string:
    print(lem_word.text, '\t', lem_word.pos_, '\t', lem_word.lemma, '\t', lem_word.lemma_)
'''
Output:
Manchester PROPN 7202088082456649612 Manchester
United PROPN 13226800834791099135 United
is VERB 10382539506755952630 be
looking VERB 16096726548953279178 look
to PART 3791531372978436496 to
sign VERB 13677717534749417047 sign
a DET 11901859001352538922 a
forward NOUN 17319973514577326793 forward
for ADP 16037325823156266367 for
$ SYM 11283501755624150392 $
90 NUM 4144966524706581620 90
million NUM 17365054503653917826 million
'''
Stop words
“Stop words” usually refers to the most common words in a language, such as “the”, “is”, and “and”. There is no single universal list of stop words shared by all NLP tools.
#import library
import spacy
#Loading spacy english library
load_en = spacy.load('en_core_web_sm')
print(load_en.Defaults.stop_words)
'''
{'keep', 'and', 'anywhere', 'there', 'alone', 'because', 'thru', 'cannot', 'beforehand', 'therein', 'another', 'seems', 'somewhere', 'becomes', 'side', 'until', 'rather', 'within', 'would', 'your', 'seeming', 'when', '’s', 'himself', 'so', 'nothing', 'as', 'call', 'were', 'wherein', 'anyway', '’d', 'done', 'eleven', 'third', 'may', 'through', 'be', 'across', 'had', 'not', 'often', 'over', 'since', 'down', 'make', 'namely', 'though', 'few', 'whether', 'unless', 'yours', 'a', 'into', 'afterwards', 'beside', 'us', 'together', 'fifty', 'was', 'she', 'his', 'many', 'nobody', 'due', 'under', 'take', 'upon', 'you', 'mostly', 'less', 'n’t', 'than', 'we', 'first', 'thereupon', 'then', 'hereafter', 'thereafter', 'hers', 'at', 'never', 'elsewhere', 'three', 'well', 'on', 'seem', 'whose', 'otherwise', 'yourselves', 'her', 'same', '‘re', 'whole', '’m', 'say', 'such', 'almost', 'indeed', 'about', 'me', 'themselves', '’re', 'go', 'forty', 'that', 'sometimes', 'up', 'n‘t', '‘d', 'around', 'therefore', 'onto', 'else', "'ve", 'twelve', 'move', 'how', "'re", 'being', "'m", 'some', 'he', 'him', 'might', '‘ve', 'an', 'has', 'once', 'anyhow', 'ours', 'although', 'amongst', 'per', 'been', "'ll", 'thence', 'please', 'with', 'in', 'both', 'still', 'each', 'herself', 'thus', 'everyone', 'least', 'among', "n't", 'too', 'next', 'put', 'four', 'using', 'six', 'something', 'is', 'towards', 'via', 'whoever', 'yourself', 'above', 'very', 'before', 'neither', 'always', 'one', 'should', 'ten', 'every', 'eight', 'here', 'hundred', 'latterly', 'for', 'my', 'nevertheless', 'give', 'became', 'out', 'these', 'just', 'either', 'will', 'already', 'only', '‘ll', 'between', 're', 'nowhere', 'whenever', 'front', 'own', '‘s', 'various', 'do', 'everywhere', 'itself', 'latter', 'again', 'could', 'see', 'empty', 'but', 'anyone', 'no', 'really', 'whereafter', 'off', 'must', 'any', 'back', 'meanwhile', 'former', 'are', '’ll', 'herein', 'hereby', 'sometime', 'further', 'throughout', 'which', 'enough', 'of', 'or', 
'someone', 'those', 'become', 'along', '’ve', 'becoming', 'during', 'this', 'after', 'now', 'serious', '‘m', 'ourselves', 'everything', 'myself', 'regarding', 'made', 'twenty', 'what', 'used', 'name', 'whereupon', 'much', 'they', 'can', 'who', "'d", 'mine', 'hereupon', 'behind', 'seemed', 'perhaps', 'them', 'bottom', 'fifteen', 'i', 'nine', 'all', 'ever', 'its', 'below', 'the', 'whatever', 'several', 'wherever', 'to', 'have', 'hence', 'beyond', 'whereas', 'if', "'s", 'why', 'ca', 'besides', 'also', 'even', 'last', 'whereby', 'our', 'except', 'noone', 'part', 'whither', 'anything', 'their', 'whom', 'formerly', 'am', 'it', 'more', 'others', 'nor', 'sixty', 'top', 'toward', 'quite', 'whence', 'did', 'show', 'none', 'get', 'where', 'five', 'amount', 'other', 'however', 'does', 'moreover', 'yet', 'two', 'without', 'somehow', 'while', 'against', 'most', 'doing', 'thereby', 'by', 'full', 'from'}
'''
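Stop words are typically filtered out of a token list before further processing. The sketch below uses plain Python and a small hand-picked subset of the list above, so it runs without any model. STOP_WORDS and remove_stop_words are illustrative names, not part of spaCy.

```python
# A small, hand-picked subset of the stop words printed above (illustrative only)
STOP_WORDS = {"a", "the", "is", "to", "for", "over", "and", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "A quick brown fox jumps over the lazy dog".split()
filtered = remove_stop_words(tokens)

print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

With spaCy itself, each token carries an is_stop attribute, so the same filter can be written against a processed document instead of a hand-made set.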
POS Tagging
POS (part-of-speech) tagging is the process of marking each word in a corpus with its corresponding part-of-speech tag, based on its context and definition. This task is not straightforward, as a particular word may have a different part of speech depending on the context in which it is used.
Part-of-speech tags are useful for building parse trees, which are used in building NER systems (most named entities are nouns) and in extracting relations between words. POS tagging is also essential for building lemmatizers, which reduce a word to its root form.
For example, in the sentence “Give me your answer”, “answer” is a noun, but in the sentence “Answer the question”, “answer” is a verb.
import nltk
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
#output:
#[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
#('completely', 'RB'), ('different', 'JJ')]
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(text))
#output:
#[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
#('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Named Entity Recognition (NER)
Named entity recognition (NER), also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments named entities and classifies them under predefined classes.
#import library
import spacy
#Loading spacy english library
load_en = spacy.load('en_core_web_sm')
#label the entities in the text
doc = load_en(" I am living in India, Studying in IIIT")
if doc.ents:
    for ner in doc.ents:
        print(ner.text + ' - ' + ner.label_ + ' - ' +
              str(spacy.explain(ner.label_)))
else:
    print('No Entity Found')
#output:
#India - GPE - Countries, cities, states
Here the model identified “India” as a geopolitical entity (GPE), a label that covers countries, cities, and states.
Conclusion:
You can treat the concepts above as the basic building blocks of natural language processing. Understanding and implementing them gives you a solid start in NLP.
Happy Modeling.