
If you’re looking for how to use Python to extract keywords from a DataFrame, you’ve come to the right place. The Natural Language Toolkit (NLTK) is a Python library used for Natural Language Processing (NLP). NLP allows machines to break down human language to make it easier to interpret. For this task, NLP was used in its most basic form to extract keywords from cells within spreadsheets, cutting down the time spent doing it manually. Here’s how…

I was eavesdropping on a chat between a couple of my co-workers about having to manually update the titles of eCommerce shopping items to improve the products’ SEO, which would take a long time. So I wanted to find out more: they had to check through spreadsheets by hand, extract the best keywords from the description fields, and keep the titles within a maximum limit of 85 characters.

“Basically, we want to go from this very long sentence to using Python to extract the keywords”

and remove all the redundant words so that all that remains is this:

“Long Sentence Python Extract Keywords”

Python can be used for automated keyword extraction from strings using NLP. Python is super quick and great at cutting down repetitive tasks, so I gave it a try on the product listings. The process used here is very basic, but it proved to be an efficient way to iterate through hundreds if not thousands of products, cut down the time spent on the task and, best of all, it is easy enough for absolute beginners.
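To make that idea concrete before we dive in, here’s a minimal sketch: strip NLTK’s standard English stopwords out of a sentence and keep whatever is left. The sentence is just the example above, and the exact output depends on which words are in the stopword list.

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # uncomment and run once if you haven't downloaded the list yet

sentence = "basically we want to go from this very long sentence to using python to extract the keywords"

# keep only the words that are not in the stopword list
stop_words = set(stopwords.words("english"))
keywords = [word for word in sentence.split() if word not in stop_words]

print(' '.join(keywords))

The rest of this post walks through doing exactly that across a whole DataFrame column.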

Contents

  1. Loading the dataset
  2. Checking the data
  3. Ranking the keywords: popular/least popular
  4. Cleaning the data
  5. Creating a stopword list in Python
  6. Removing useless words from a DataFrame column
  7. Adding title character count and keyword count
  8. Re-ordering DataFrame columns
  9. Merging two DataFrames
  10. Exporting the DataFrame as .csv

Planning

First of all, the descriptions included a lot of useless words. As we wanted to extract the main keywords, the best solution I could think of was to remove the useless words: the conjunctions, prepositions and so on, meaning words like and, to and for. Then, after a good hour of research, I discovered these redundant words are called stopwords and that there is an NLP library for Python called NLTK that can find and remove them. I could use this library to load a stopword list, add extra words to it, and then go through each description column value and strip them out. Okay, so let’s try this!

Tools

I used Jupyter notebooks for this, but you can use your code editor or even Google Colab. I will leave out the results because it is client data, but you’ll get the rough idea from reading the code! If your dataset is in a spreadsheet or any other tabular program then try, if you can, to export it to .csv format (see the sketch below). This will make it easier to use and load in Jupyter.
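If your data is stuck in an Excel file, pandas can do the conversion for you. A quick sketch, assuming a file called products.xlsx (the filename is just an example, and you’ll need the openpyxl package installed for .xlsx files):

import pandas as pd

# read the spreadsheet (pandas uses openpyxl under the hood for .xlsx files)
spreadsheet = pd.read_excel("products.xlsx")

# save it as a .csv ready to load into Jupyter
spreadsheet.to_csv("products.csv", index=False)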

1. Load dataset

First, you need to import the pandas and NumPy libraries. Then you load the .csv into a DataFrame and remove the unrelated columns, keeping only the main ones so the tables are clearer. Where it says products.csv is where you load your own data file.

import pandas as pd
import numpy as np
 

# selected column headers we want to use 
new_headers = ['id', 'title', 'description']

# import the product data into a DataFrame
df = pd.read_csv("products.csv", header=0)

# remove unwanted columns keep selected headers
dataset = pd.DataFrame(df, columns=new_headers)

dataset

This resulted in the csv being loaded into a Pandas DataFrame showing only the essential attributes [‘id’, ‘title’, ‘description’].

2. Checking the data

I wanted to quickly check what data type each attribute was, and perhaps identify the most common and uncommon words within the descriptions to help determine what to do with the data. Perhaps you will do more digging than me, but this is all that was needed for now.

# check data type
dataset.dtypes

They were all objects.
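If you fancy doing a little more digging at this stage, a couple of quick pandas checks can be handy before any cleaning; this is just an optional sketch:

# how many rows and columns we're working with
print(dataset.shape)

# count any empty cells in each column
print(dataset.isnull().sum())

# peek at the first few rows
dataset.head()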

3. Ranking Popular/Least Popular Keywords

Then I wanted to view the most common words within the descriptions.

# word counts range from a minimum of 1 to a maximum of 175

# Identify most common words
common = pd.Series(' '.join(dataset['description']).split()).value_counts()[:40]
common

I found the most common words were the most useless ones, like: the, and, of, a, in, is, to, this, with, from and for. These are called syncategorematic words; they do not make sense on their own, so I checked them against the stopword list. Any that weren’t already in the list would need to be added. Then I checked the uncommon words.


# identify uncommon words
uncommon = pd.Series(' '.join(dataset['description']).split()).value_counts()[-40:]
uncommon
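As a side note, one way to double-check whether those common useless words are already covered by NLTK’s stopword list is to compare them directly; anything missing is a candidate for the custom list we build in step 5. A rough sketch (the word list is just the one found above):

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

# the frequent-but-useless words spotted in the common word check
common_words = ['the', 'and', 'of', 'a', 'in', 'is', 'to', 'this', 'with', 'from', 'for']

# any word not already in NLTK's stopword list should be added to the custom list later
missing = [word for word in common_words if word not in stop_words]
print(missing)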

4. Cleaning the data

Okay, so I did much more cleaning than this and you will do a lot of cleaning too, but at this point I only converted everything to lowercase, as I realised the stopword list was all lowercase and string comparisons are case sensitive. When I inspected the types, I also noticed some of the description values had varied data types, so they were all converted to strings for consistency.

There are probably better ways to do this – if you do see much more efficient methods then please do comment to let me know. I love teaching and learning from you guys 🙂

# convert to lowercase
dataset['description'] = dataset['description'].str.lower()
dataset['title'] = dataset['title'].str.lower()

# convert desc to string
dataset['description'] = dataset['description'].astype(str)

dataset.dtypes

5. Creating the stop word list and adding to it

In order to remove the stop words, we need to create the list and then add extra words you may want to remove. The words I wanted to remove were the ones found earlier when we identified the most common words.

First, we load the Regular Expression (Regex) and NLP libraries and then download the packages.

You’ll notice #nltk.download('stopwords') and #nltk.download('wordnet') are commented out. The first time round, remove the # so the packages are downloaded when you run the cell. Once that’s done, you can add the # back in; you only need to do this once.

# library for text preprocessing
import re
import nltk

#nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
#nltk.download('wordnet') 


from nltk.stem.wordnet import WordNetLemmatizer

Then you create your list:

# create a set of stop words
stop_words = set(stopwords.words("english"))

# create a list of custom stopwords and add them to the set
new_words = ['the', 'a', 'us', 'sold', 'of', 'in', 'created', 'that', 'made', 'we\'ve', 'after', 'struggling', 'their', 'his', 'only', 'previously', 'leaving']
stop_words = stop_words.union(new_words)

stop_words

6. Removing useless words from the DataFrame column

This is where a lambda function comes in handy to apply the same expression to every value within the description column. Each description is split into words, every word is checked against the stopword list, and the words that are not stopwords are joined back together into one string, stored in a new cleaned_description column. The result is that all the stop words are removed and the words left over are your main keywords.

dataset['cleaned_description'] = dataset['description'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

dataset.head(50)

7. Adding title character count and keyword count

So the rules were that the title could not be longer than 85 characters, and we also wanted to know the keyword count. Therefore we needed to calculate these from specific columns: the title column and the cleaned_description column (which I may later rename to ‘keywords’). These calculations were then added as new columns at the end of the DataFrame.



# Fetch character count for each title
dataset['title_char_count'] = dataset['title'].apply(lambda x: len(str(x)))
dataset[['title','title_char_count']].head()


# Fetch wordcount for each description
dataset['word_count'] = dataset['cleaned_description'].apply(lambda x: len(str(x).split(" ")))
dataset[['cleaned_description','word_count']].head()

dataset
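Since the brief was a maximum of 85 characters per title, you could also flag any titles that break that rule straight away. A small optional sketch (the over_char_limit column name is just made up for illustration):

# flag titles that exceed the 85 character limit
dataset['over_char_limit'] = dataset['title_char_count'] > 85

# show only the titles that are too long
dataset[dataset['over_char_limit']][['title', 'title_char_count']]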

Using this calculation we’re able to quickly check the mean, standard deviation, min, max and percentiles of the word count.

# Descriptive statistics of word count
dataset.word_count.describe()

8. Re-order columns

I re-ordered the columns so the data was easier to read.

product_headers = ['id', 'title', 'title_char_count', 'cleaned_description', 'word_count']

removed_desc = pd.DataFrame(dataset, columns=product_headers)
 
removed_desc

9. Merging the previous description column with new keyword columns

I decided to take the original description column from the untouched DataFrame and merge it with the new DataFrame containing the keyword column, so that coming up with new titles would be a lot easier, as viewing the keywords alone may not make much sense.

# merge description with cleaned description
merged_dataset = pd.concat([df['description'], removed_desc], axis=1)

# re-order the columns
export_dataset = pd.DataFrame(merged_dataset, columns=['id', 'title', 'title_char_count', 'description', 'cleaned_description', 'word_count'])

export_dataset

10. Export DataFrame to .csv

Then once the preparation was done, I exported the completed DataFrame to .csv, ready to use in Google Sheets! In your case, you can change cleaned_product_names.csv to whatever you want to call your csv file. It will save to the same folder location as your script.

# export to csv 
export_dataset.to_csv('cleaned_product_names.csv', encoding='utf-8', index=False)

The end!

I hope you enjoyed this fun step-by-step guide to extracting keywords from spreadsheets. If you want to see more like this, then subscribe to be notified when new, simple-to-follow Python tutorials come out.

Lots of love,

Becky x


Resource: https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34


If you found this post useful, please support House Ninety Two: Buy Me a Coffee at ko-fi.com