Table of Contents

In this blog, we will cover the following topics:

  • Introduction to NLP
  • Text operations and Regular expressions
  • Tokenization
  • Lemmatization and Stemming
  • Stop words
  • POS tagging and Named Entity Recognition (NER)

Introduction to NLP

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

In this article, we will study natural language processing with the spaCy and NLTK libraries.

Pre-requisites:
Install Anaconda from this link.
Install spaCy from the conda channels.
Install the NLTK library from the conda channels.


The English language model for spaCy can be installed as follows:

#run cmd in admin mode
#en_core_web_sm is the small English model loaded in the examples below
python -m spacy download en_core_web_sm
#https://spacy.io/usage

Run this command to download the spaCy model that supports the English language

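#download the NLTK data (corpora, tokenizer models, etc.)
#calling nltk.download() without arguments opens the interactive downloader
#run the Anaconda command line in admin mode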

import nltk
nltk.download()

Text operations and Regular expressions

Now, let’s look at a few basic text operations.

print("Natural Language Processing")

#output:
#Natural Language Processing

Printing a basic string in Python

We use Indexing and Slicing to select a specific range of characters from a string.

#first assign the string to a variable
string = "Language"

#get the first character by index
print(string[0])

#output: L


#print multiple characters
print(string[2], string[5])

#output: n a


#get a character with negative indexing (counting from the end)
print(string[-4])
#output: u

Basic Indexing operations

print(string[0:2])

#output: La


print(string[1:4])

#output: ang

Basic Slicing Operations
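Slicing also accepts omitted bounds and an optional step. The short sketch below reuses the same word; the variable names are only illustrative.

```python
word = "Language"

first_three = word[:3]      # start omitted: begins at index 0
from_third = word[3:]       # end omitted: runs to the end of the string
every_second = word[::2]    # step of 2: indices 0, 2, 4, 6
reversed_word = word[::-1]  # negative step walks the string backwards

print(first_three, from_third, every_second, reversed_word)
```

A slice never raises an IndexError, even when the bounds exceed the length of the string, whereas indexing a single position outside the string does.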

We have seen examples of indexing and slicing. In some cases, we might also need to clean strings. For this, we can use strip().

#Remove * from the below statement
statement = "****Hello World! I am Chitti*, The Robot****"


#using strip function to remove asterisk(*)
print(statement.strip('*'))

#output:
#Hello World! I am Chitti*, The Robot

Stripping characters from the statement 

We notice that the asterisk (*) is removed from the start and end of the statement, but not from the middle. If we don’t pass any character to the strip function, it removes whitespace from the beginning and end of the statement.
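If only one side should be cleaned, or the character in the middle has to go as well, the related string methods can help. A small sketch on the same statement (the variable names are just for illustration):

```python
statement = "****Hello World! I am Chitti*, The Robot****"

left_only = statement.lstrip('*')        # strips leading asterisks only
right_only = statement.rstrip('*')       # strips trailing asterisks only
no_stars = statement.replace('*', '')    # removes every asterisk, including the middle one
trimmed = "  \tHello World!\n".strip()   # no argument: strips surrounding whitespace

print(left_only)
print(right_only)
print(no_stars)
print(repr(trimmed))
```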

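Now, let’s look at the join function. The string on which join is called acts as the separator that is placed between the items of the list.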

str1 = 'Good'
str2 = 'Journey'

"Luck. Happy".join([str1,str2])

#output:
# 'GoodLuck. HappyJourney'
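In practice, join is mostly used with a short separator such as a space or a comma. The following sketch, with a made-up word list, also shows that split reverses the operation:

```python
words = ["Good", "Luck", "Happy", "Journey"]

sentence = " ".join(words)        # a single space between the items
csv_row = ",".join(words)         # a comma between the items
round_trip = sentence.split(" ")  # split undoes the join

print(sentence)
print(csv_row)
print(round_trip == words)
```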

Regular Expressions

A regular expression is a sequence of characters that defines a search pattern, which is used to find substrings in a given string.

#import 're' library to use regular expression in code.
import re

#a few of the most common functions in the 're' module
#match() Determine if the RE matches at the beginning of the string
#search() Scan through a string, looking for any location where this RE matches
#findall() Find all the substrings where the RE matches, and return them as a list
#finditer() Find all substrings where the RE matches, and return them as an iterator
#sub() Find all substrings where the RE matches, and substitute them with the given string

A few of the most common functions from the ‘re’ library
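To see these functions in action, here is a small sketch. The sample text and the date pattern are invented for illustration.

```python
import re

text = "Order 66 shipped on 2021-03-14; order 67 ships 2021-04-01"
date_pattern = r"\d{4}-\d{2}-\d{2}"

at_start = re.match(r"Order", text)           # matches only at the beginning of the string
not_at_start = re.match(date_pattern, text)   # None: the text does not start with a date
first_number = re.search(r"\d+", text)        # first match anywhere in the string
all_dates = re.findall(date_pattern, text)    # list of all matching substrings
positions = [(m.start(), m.group()) for m in re.finditer(date_pattern, text)]
masked = re.sub(date_pattern, "<DATE>", text) # replace every match

print(at_start is not None, not_at_start)
print(first_number.group())
print(all_dates)
print(positions)
print(masked)
```

match() and search() return a match object (or None), so the matched text is read with group(), while findall() returns plain strings directly.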

Character sets for Regular Expression
Meta sequences in Regular Expressions
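A few commonly used character sets and special sequences can be tried out as follows. The sample strings are made up, and the selection is only a small part of what the ‘re’ module supports.

```python
import re

vowels = re.findall(r"[aeiou]", "Language")              # any one character from the set
non_letters = re.findall(r"[^a-z ]", "abc 123!")         # ^ inside [] negates the set
numbers = re.findall(r"\d+", "Room 101, floor 3")        # \d matches a digit
words = re.findall(r"\w+", "Hello, World!")              # \w matches letters, digits and underscore
pieces = re.split(r"\s+", "a  b\tc")                     # \s matches whitespace
whole_words = re.findall(r"\bcat\b", "cat concat cat.")  # \b marks a word boundary

print(vowels, non_letters, numbers)
print(words, pieces, whole_words)
```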

Tokenization

Tokenization is the process of breaking raw text into small chunks, such as words or sentences, called tokens. These tokens help in understanding the context and in developing NLP models. Analyzing the sequence of tokens helps in interpreting the meaning of the text.

#import Spacy library
import spacy

#Loading spacy english library
load_en = spacy.load('en_core_web_sm')

#take an example string (the backslash is escaped so that Python keeps it literally)
example_string = "I'm going to meet\\ M.S. Dhoni."

#pass the string through the spaCy pipeline
words = load_en(example_string)

#print each token with a for loop
for token in words:
    print(token.text)



#output:
#I
#'m
#going
#to
#meet\
#M.S.
#Dhoni
#.

Tokenizing an English sentence
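To see why a dedicated tokenizer is useful, we can compare it with two naive approaches from the standard library. This is only a sketch for comparison and not how spaCy works internally; the regular expression is an assumption chosen for simplicity.

```python
import re

text = "I'm going to meet M.S. Dhoni."

# Whitespace splitting leaves punctuation glued to the words
whitespace_tokens = text.split()

# Naive rule: a token is a run of word characters or a single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Neither result matches the spaCy output above, which splits "I'm" into "I" and "'m" while keeping "M.S." as one token. Language-specific rules of this kind are what a real tokenizer adds.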

Stemming

Stemming is the process of reducing a word to its stem or root form. Let us take an example. Consider three words: “branched”, “branching” and “branches”. They can all be reduced to the same word, “branch”. After all, all three convey the same idea of something separating into multiple paths or branches. This helps reduce complexity while retaining the essence of the meaning carried by these three words.

We can use the Porter stemmer or the Snowball stemmer from NLTK for stemming.

#Porter stemmer

from nltk.tokenize import word_tokenize

from nltk.stem import PorterStemmer 

text = "A quick brown fox jumps over the lazy dog." 

# Normalize text
# Tokens are case-sensitive strings, so capital and small letters are treated differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.

text = text.lower()
# tokenize text
words = word_tokenize(text)
print(words)

'''
Output: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
stemmer = PorterStemmer()

words_stem = [stemmer.stem(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_stem = []
for word in words:
    words_stem.append(stemmer.stem(word))
'''

print(words_stem)
'''
Output: ['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
'''

Using Porter stemmer from NLTK library

from nltk.stem import SnowballStemmer 

# Languages supported by SnowballStemmer
print(SnowballStemmer.languages)


'''
Output (the exact contents depend on the installed NLTK version):
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']
'''

Language Support for Snowball stemmer

from nltk.stem import SnowballStemmer

stemmer_spanish = SnowballStemmer('spanish')

print(stemmer_spanish.stem('trabajando'))
# output: trabaj

print(stemmer_spanish.stem('trabajos'))
# output: trabaj

print(stemmer_spanish.stem('trabajó'))
# output: trabaj
# In Python 3, strings are Unicode by default, so no .decode('utf-8') call is needed
# (str has no decode method). It was only required in Python 2 to avoid:
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3
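To make the idea of suffix stripping concrete, here is a deliberately naive stemmer written from scratch. It is not the Porter or Snowball algorithm, which apply many more rules; the suffix list and the `min_stem_length` parameter are assumptions made for this sketch.

```python
# Hypothetical, deliberately small suffix list; longer suffixes are tried first
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["branched", "branching", "branches"]]
print(stems)
print(naive_stem("jumps"), naive_stem("sing"))
```

The length check keeps short words such as "sing" from being cut down to a single letter, which hints at why real stemmers need more careful conditions than plain suffix matching.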

Lemmatization

Lemmatization is closely related to stemming. It returns the lemma of a word, that is, its base (dictionary) form.

Difference between Stemming and Lemmatization
– A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

– When reducing a word to its root/base form, stemming can produce non-existent words, whereas lemmatization produces actual dictionary words.

– Stemmers are typically easier to implement than Lemmatizers.

– Stemmers run faster than Lemmatizers.

– The accuracy of stemming is less than that of lemmatization.
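The first two differences can be illustrated with a toy comparison. Both functions below are simplified stand-ins written for this sketch, not the NLTK or spaCy implementations, and the lookup table is a hypothetical mini-dictionary.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma
LEMMA_TABLE = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for a word in a given part of speech; fall back to the word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


def toy_stem(word):
    """Context-free suffix stripping, standing in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# The stemmer gives one answer regardless of context
print(toy_stem("meeting"), toy_stem("studies"))
# The lemmatizer depends on the part of speech and returns dictionary words
print(toy_lemmatize("meeting", "NOUN"), toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("studies", "VERB"))
```

Here the stemmer turns "studies" into the non-word "studi" and cannot tell the noun "meeting" from the verb form, while the lookup keeps the two readings apart.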

Lemmatization is generally more informative than stemming: it looks beyond stripping a word down to its stem and also takes the part of speech of the word into account. That’s why spaCy provides lemmatization, not stemming. So we will perform lemmatization using spaCy.

#import library
import spacy

#Loading spacy english library
load_en = spacy.load('en_core_web_sm')

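#process an example string
example_string = load_en(u'Manchester United is looking to sign a forward for $90 million')

#print the text, part of speech, lemma hash ID and lemma string of each token
for lem_word in example_string:
    print(lem_word.text, '\t', lem_word.pos_, '\t', lem_word.lemma, '\t', lem_word.lemma_)


'''
Output (tags can differ slightly between spaCy model versions):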
Manchester 	 PROPN 	 7202088082456649612 	 Manchester 

United 	 PROPN 	 13226800834791099135 	 United 

is 	 VERB 	 10382539506755952630 	 be 

looking 	 VERB 	 16096726548953279178 	 look 

to 	 PART 	 3791531372978436496 	 to 

sign 	 VERB 	 13677717534749417047 	 sign 

a 	 DET 	 11901859001352538922 	 a 

forward 	 NOUN 	 17319973514577326793 	 forward 

for 	 ADP 	 16037325823156266367 	 for 

$ 	 SYM 	 11283501755624150392 	 $ 

90 	 NUM 	 4144966524706581620 	 90 

million 	 NUM 	 17365054503653917826 	 million 
'''   

Stop words

“Stop words” usually refers to the most common words in a language, which carry little meaning on their own and are often removed before further processing. There is no universal list of stop words shared by all NLP tools.

#import library
import spacy

#Loading spacy english library
load_en = spacy.load('en_core_web_sm')

print(load_en.Defaults.stop_words)

'''
Output (a set, so the order varies between runs):
{'keep', 'and', 'anywhere', 'there', 'alone', 'because', 'thru', 'cannot', 'beforehand', 'therein', 'another', 'seems', 'somewhere', 'becomes', 'side', 'until', 'rather', 'within', 'would', 'your', 'seeming', 'when', '’s', 'himself', 'so', 'nothing', 'as', 'call', 'were', 'wherein', 'anyway', '’d', 'done', 'eleven', 'third', 'may', 'through', 'be', 'across', 'had', 'not', 'often', 'over', 'since', 'down', 'make', 'namely', 'though', 'few', 'whether', 'unless', 'yours', 'a', 'into', 'afterwards', 'beside', 'us', 'together', 'fifty', 'was', 'she', 'his', 'many', 'nobody', 'due', 'under', 'take', 'upon', 'you', 'mostly', 'less', 'n’t', 'than', 'we', 'first', 'thereupon', 'then', 'hereafter', 'thereafter', 'hers', 'at', 'never', 'elsewhere', 'three', 'well', 'on', 'seem', 'whose', 'otherwise', 'yourselves', 'her', 'same', '‘re', 'whole', '’m', 'say', 'such', 'almost', 'indeed', 'about', 'me', 'themselves', '’re', 'go', 'forty', 'that', 'sometimes', 'up', 'n‘t', '‘d', 'around', 'therefore', 'onto', 'else', "'ve", 'twelve', 'move', 'how', "'re", 'being', "'m", 'some', 'he', 'him', 'might', '‘ve', 'an', 'has', 'once', 'anyhow', 'ours', 'although', 'amongst', 'per', 'been', "'ll", 'thence', 'please', 'with', 'in', 'both', 'still', 'each', 'herself', 'thus', 'everyone', 'least', 'among', "n't", 'too', 'next', 'put', 'four', 'using', 'six', 'something', 'is', 'towards', 'via', 'whoever', 'yourself', 'above', 'very', 'before', 'neither', 'always', 'one', 'should', 'ten', 'every', 'eight', 'here', 'hundred', 'latterly', 'for', 'my', 'nevertheless', 'give', 'became', 'out', 'these', 'just', 'either', 'will', 'already', 'only', '‘ll', 'between', 're', 'nowhere', 'whenever', 'front', 'own', '‘s', 'various', 'do', 'everywhere', 'itself', 'latter', 'again', 'could', 'see', 'empty', 'but', 'anyone', 'no', 'really', 'whereafter', 'off', 'must', 'any', 'back', 'meanwhile', 'former', 'are', '’ll', 'herein', 'hereby', 'sometime', 'further', 'throughout', 'which', 'enough', 'of', 'or', 
'someone', 'those', 'become', 'along', '’ve', 'becoming', 'during', 'this', 'after', 'now', 'serious', '‘m', 'ourselves', 'everything', 'myself', 'regarding', 'made', 'twenty', 'what', 'used', 'name', 'whereupon', 'much', 'they', 'can', 'who', "'d", 'mine', 'hereupon', 'behind', 'seemed', 'perhaps', 'them', 'bottom', 'fifteen', 'i', 'nine', 'all', 'ever', 'its', 'below', 'the', 'whatever', 'several', 'wherever', 'to', 'have', 'hence', 'beyond', 'whereas', 'if', "'s", 'why', 'ca', 'besides', 'also', 'even', 'last', 'whereby', 'our', 'except', 'noone', 'part', 'whither', 'anything', 'their', 'whom', 'formerly', 'am', 'it', 'more', 'others', 'nor', 'sixty', 'top', 'toward', 'quite', 'whence', 'did', 'show', 'none', 'get', 'where', 'five', 'amount', 'other', 'however', 'does', 'moreover', 'yet', 'two', 'without', 'somehow', 'while', 'against', 'most', 'doing', 'thereby', 'by', 'full', 'from'}
'''
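Once a stop word list is available, removing stop words is a simple filter. The sketch below uses a hand-picked subset of the list printed above, so that it runs without loading a model; the token list is the one from the stemming example.

```python
# Hand-picked subset of the spaCy list above, for illustration only
STOP_WORDS = {"a", "the", "over", "is", "to", "and"}

tokens = ["a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

# Keep only the tokens that are not stop words
content_tokens = [t for t in tokens if t.lower() not in STOP_WORDS]

print(content_tokens)
```

With spaCy itself, each token carries an `is_stop` attribute, so the same filter can be written against the tokens of a processed document.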

POS Tagging

POS (part-of-speech) tagging is the process of assigning each word in a corpus its corresponding part-of-speech tag, based on its context and definition. This task is not straightforward, as a particular word may have a different part of speech depending on the context in which it is used.

Part-of-speech tags are useful for building parse trees, which are used in building NER systems (most named entities are nouns) and in extracting relations between words. POS tagging is also essential for building lemmatizers, which reduce a word to its root form.

For example, in the sentence “Give me your answer”, answer is a noun, but in the sentence “Answer the question”, answer is a verb.

import nltk
from nltk.tokenize import word_tokenize

text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
#output:
#[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
# ('completely', 'RB'), ('different', 'JJ')]

text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(text))
#output:
#[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
# ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
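The second sentence shows the context dependence nicely. As a small sketch, we can take the tagged output above (copied by hand, so no tagger is run here) and list the words that received more than one tag:

```python
from collections import defaultdict

# Tagged output copied from the second example above
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
          ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

# Collect every tag seen for each word
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Keep only the words that were tagged in more than one way
ambiguous = {w: sorted(tags) for w, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)
```

Both “refuse” and “permit” appear once as a verb and once as a noun, which is exactly the kind of ambiguity a POS tagger has to resolve from context.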

Named Entity Recognition (NER)

Named entity recognition (NER), also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments named entities in text and classifies them into predefined classes.

#import library
import spacy

#Loading spacy english library
load_en = spacy.load('en_core_web_sm')

#let's label the entities in the text
doc = load_en(u" I am living in India, Studying in IIIT")

if doc.ents:
    for ner in doc.ents:
        print(ner.text + ' - ' + ner.label_ + ' - ' +
              str(spacy.explain(ner.label_)))
else:
    print('No Entity Found')
#output:
#India - GPE - Countries, cities, states

Here the model identified “India” as a geopolitical entity (GPE), in this case a country.

Conclusion:

You may treat the above concepts as the basic building blocks of Natural Language Processing; understanding and implementing them gives you a good start in NLP.

Happy Modeling.