How I used Python to extract keywords from LinkedIn job descriptions

photo by Ima

Abstract

This is a two-part tutorial. In the first part we build a web scraper that extracts job descriptions from LinkedIn job postings matching specific query parameters. In the second part we use this data to create a TF-IDF (Term Frequency-Inverse Document Frequency) model.

Requirements

  • Python 3.9+
  • Modules used: requests, re, sklearn, sqlite3, pandas, csv, time, langdetect

Web scraper (part 1)

In this tutorial we extract job descriptions from LinkedIn that match the following query parameters: keywords=Developer and location=Belgium. Feel free to change these to your own search requirements. The following URL is the starting point for collecting the data needed in part 2:

https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Developer&location=Belgium&start=0

LinkedIn Jobs

Opening this URL in a browser shows a simplified page with some basic information about each job in the search results. The source code of this page contains a link to the detail page of every listed job. These job detail URLs are what we will need in part 2.

Now comes the fun part. We loop over the search URL, say 1000 times, changing the start parameter on each iteration. On every iteration we use the requests library to make a GET request and then use re (regular expressions) to extract all the job detail URLs from the page source.
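The loop described above can be sketched roughly as follows. This is a minimal sketch, not the original script: the function names, the one-second delay and, above all, the regular expression are assumptions. The pattern assumes that detail links look like https://<subdomain>.linkedin.com/jobs/view/<slug>, so check it against the actual page source before relying on it.

```python
import re
import time

SEARCH_URL = (
    "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
    "?keywords=Developer&location=Belgium&start={start}"
)

# Assumed shape of a job detail link; the tracking query string after "?" is dropped.
JOB_URL_PATTERN = re.compile(r'href="(https://[\w.-]*linkedin\.com/jobs/view/[^"?]+)')


def extract_job_urls(html):
    """Return the job detail URLs found in one page of search results."""
    return JOB_URL_PATTERN.findall(html)


def fetch_page(url):
    """Download one page with requests (third-party, imported only when needed)."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def collect_job_urls(starts, fetch=fetch_page, delay=1.0):
    """Walk the search results and return the unique detail URLs in first-seen order."""
    seen = {}
    for start in starts:
        html = fetch(SEARCH_URL.format(start=start))
        for url in extract_job_urls(html):
            seen.setdefault(url, None)  # a dict keeps insertion order and drops duplicates
        time.sleep(delay)  # pause between requests to go easy on the server
    return list(seen)
```

A call such as collect_job_urls(range(0, 1000, 25)) would walk the results in steps of 25. The step size is a guess on my part, so adjust it to the number of jobs one page actually returns. A failed request raises an exception and stops the loop; wrap the call in a try/except if you prefer to skip bad pages.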

Depending on various factors your results might differ; in my case I was able to extract 571 unique URLs. Save these URLs to a CSV file called urls.csv, which we will use in the next part (make sure to create a urls.csv file in your environment).
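Saving and reloading the URLs could look something like this. It is a small sketch using only the csv module from the standard library; the function names are mine, and the one-URL-per-row layout is an assumption about how the file is organised.

```python
import csv


def save_urls(urls, path="urls.csv"):
    """Write the unique URLs to a CSV file, one per row, and return how many were written."""
    unique = list(dict.fromkeys(urls))  # drop duplicates but keep the original order
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([url] for url in unique)
    return len(unique)


def load_urls(path="urls.csv"):
    """Read the URLs back, skipping empty rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[0] for row in csv.reader(handle) if row]
```

Opening the file in "w" mode creates it when it does not exist yet and overwrites it when it does, so rerunning the scraper replaces the previous list.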

Web scraper (part 2)

In this part we use the same technique as before, but this time we extract the job content from the URLs in the urls.csv file. We also set up a sqlite3 database called linkedin.db (make sure you create one in your environment). Before persisting the data we apply some basic pre-processing (lowercasing and removing digits and non-alphanumeric characters) and HTML sanitizing. The langdetect module is used to classify the descriptions by language.
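A rough sketch of the cleaning and storage step is shown below. The table layout (jobs with url, description and language columns) and all function names are assumptions, and the code expects that you have already pulled the description fragment out of the job detail page, since how to locate it depends on the page markup. The language detector is passed in as a function so the helpers do not depend on a particular library.

```python
import html
import re
import sqlite3

TAG_PATTERN = re.compile(r"<[^>]+>")
NON_LETTER_PATTERN = re.compile(r"[\W\d_]+")  # anything that is not a letter


def clean_description(raw_html):
    """Strip tags, decode HTML entities, lowercase, and keep letters only."""
    text = TAG_PATTERN.sub(" ", raw_html)
    text = html.unescape(text)
    text = text.lower()
    return NON_LETTER_PATTERN.sub(" ", text).strip()


def open_database(path="linkedin.db"):
    """Open the database and create the (hypothetical) jobs table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "url TEXT PRIMARY KEY, description TEXT NOT NULL, language TEXT)"
    )
    return conn


def save_job(conn, url, raw_html, detect_language):
    """Clean one description, detect its language, and store the result."""
    description = clean_description(raw_html)
    # Skip detection on empty text, where a detector has nothing to work with.
    language = detect_language(description) if description else None
    conn.execute(
        "INSERT OR REPLACE INTO jobs (url, description, language) VALUES (?, ?, ?)",
        (url, description, language),
    )
    conn.commit()
    return description, language
```

With langdetect installed, you can pass its detect function directly: from langdetect import detect, then save_job(conn, url, fragment, detect). It returns a language code such as "en" or "nl". The letter-only pattern keeps accented characters, which matters for the Dutch and French postings you will meet when searching in Belgium.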

TF-IDF model

Term Frequency-Inverse Document Frequency is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. A word scores high when it occurs often in one document but rarely in the rest of the collection.
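To make the definition concrete, here is a minimal sketch in plain Python that uses the textbook form tf × log(N / df). The function name and the toy corpus are purely illustrative. scikit-learn uses a smoothed and normalised variant, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score one term for one document against a corpus of whitespace-separated strings."""
    words = document.split()
    if not words:
        return 0.0
    tf = Counter(words)[term] / len(words)  # share of this document taken up by the term
    df = sum(1 for doc in corpus if term in doc.split())  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


corpus = ["java developer", "python developer", "java spring developer"]
for term in ("developer", "java"):
    print(term, tf_idf(term, corpus[0], corpus))
```

In this toy corpus "developer" appears in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing, while "java" appears in only two of the three documents and gets a positive score.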

The collection of documents in our case is the data we extracted earlier. First we create a count vector (bag-of-words) that keeps track of how often each word appears. It is common to use a stop word list to remove a language's irrelevant words. We set max_df to 0.70, which ignores words that appear in more than 70% of the documents, and max_features to 1000, which keeps only the 1000 most frequent remaining words.
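With scikit-learn, the count vector step might look like the sketch below. The function name is mine, and using the built-in "english" stop word list is an assumption that only holds if you work with the English descriptions; max_df and max_features are the values mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer


def build_count_matrix(documents, max_df=0.70, max_features=1000):
    """Turn the documents into a sparse matrix of word counts (bag-of-words)."""
    vectorizer = CountVectorizer(
        stop_words="english",       # scikit-learn's built-in English stop word list
        max_df=max_df,              # drop words found in more than 70% of the documents
        max_features=max_features,  # keep at most the 1000 most frequent words
    )
    counts = vectorizer.fit_transform(documents)
    return vectorizer, counts
```

The returned matrix has one row per document and one column per word in the vocabulary; vectorizer.get_feature_names_out() lists the words in column order.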

The pandas step is mainly included to filter the job descriptions down to the English ones, and in case you'd like to use this data for other machine learning purposes. Once we have created our count vector we can fit our TF-IDF model to it.
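The pandas and fitting steps could be sketched as follows. The jobs table with description and language columns is a hypothetical layout, the "en" language code is what langdetect returns for English, and the function names are mine. TfidfTransformer is the scikit-learn class that turns a count matrix into TF-IDF weights.

```python
import sqlite3

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def load_english_descriptions(conn):
    """Load the stored jobs into a DataFrame and keep only the English rows."""
    frame = pd.read_sql_query("SELECT description, language FROM jobs", conn)
    return frame[frame["language"] == "en"].reset_index(drop=True)


def fit_tfidf(descriptions):
    """Build the count vector and fit the TF-IDF weights on top of it."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.70, max_features=1000)
    counts = vectorizer.fit_transform(descriptions)
    transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    transformer.fit(counts)
    return vectorizer, transformer
```

A typical call would be frame = load_english_descriptions(sqlite3.connect("linkedin.db")) followed by fit_tfidf(frame["description"]). Keeping the DataFrame around makes it easy to reuse the same data for other experiments.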

First we will define some TF-IDF functions.

Testing this with several new job descriptions, I was able to extract relevant domain-specific keywords, which the TF-IDF model ranks by importance.

Java job

Conclusion

I originally wanted to create a second TF-IDF model that did the same thing with LinkedIn developer user profiles. That way we could match keywords between user profiles and jobs and create a matching algorithm that works similarly to what LinkedIn does internally. However, I decided not to do this, as it might violate EU privacy laws concerning the use of personal data.

Some things I could have done better:

  • Use a bigger data set (20000+)
  • Use stemming and lemmatization techniques
  • Use dictionaries to build up lists of common domain-specific keywords
  • Use Named Entity Recognition (NER)

Nonetheless, I am happy with the results I got on the test data. The extracted keywords are very relevant to the domain we are searching in, and this opens the door to more interesting AI/ML layers we could add.

Credit

https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/
