How I Created a Word Cloud From a COVID-19 Research Dataset
In this article, I describe the pipeline I built to create my COVID-19 research word clouds. I will detail the resources, libraries, and APIs I used to accomplish this task, so that others can apply a similar method to their own projects.
The first step in creating my data visualization was finding a problem to solve. I talked about this more in my previous article, but with some guidance, I decided to visualize research being done on COVID-19, since relevant data was easy to find. The CORD-19 dataset is a subset of PubMed, which allowed me to use all of the built-in tools PubMed provides.
I downloaded the CORD-19 metadata from the project's website, which provided a wealth of information I could use to create my word cloud. Most importantly, it contained the title, abstract, and PubMed ID of each article. The metadata came as a CSV file, which I loaded into a table in a Postgres database so that I could manipulate the data programmatically.
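The CSV-to-table load can be sketched like this. For brevity the example uses Python's `csv` module and an in-memory SQLite database as a stand-in for Postgres (with Postgres you would use a client library such as psycopg2, or the `COPY` command); the column names mirror those in `metadata.csv`, but only a few are shown.

```python
import csv
import io
import sqlite3

def load_metadata(conn, csv_file):
    """Load CORD-19 metadata rows into an `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(pubmed_id TEXT, title TEXT, abstract TEXT)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["pubmed_id"], r["title"], r["abstract"]) for r in reader]
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# Tiny inline sample standing in for the real metadata.csv
sample = io.StringIO(
    "pubmed_id,title,abstract\n"
    "123,COVID-19 vaccine trial,An abstract.\n"
    "456,Influenza study,Another abstract.\n"
)
conn = sqlite3.connect(":memory:")
n = load_metadata(conn, sample)
```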
I put almost all of my code in a single Python script, which performed the following steps:
- I used a Postgres client library to query for the PubMed ID of each article. In this query, I also narrowed the results to articles whose titles related to the subset I wanted: I created a list of keywords and matched them against each article's title before adding it to my list of IDs.
- I then called the PubMed API to find the MeSH terms of each article. I did this by iterating through the list of PubMed IDs, fetching the MeSH terms present for each one, and filtering out irrelevant words.
- Once I had the MeSH terms for each article, I used MetaMap to check whether each word was a medical term. I did this through MetaMap's built-in Java API, for which I wrote a Java program and connected it to my main script through Py4J.
- In this Java program, I started the respective MetaMap servers and retrieved each word's semantic categories. If a word fell into certain categories I had predetermined, the Java program sent a confirmation back to the Python script, and the word was added to the word cloud's count list.
- Once the complete word list was built and the frequency of each term counted, I used the Python wordcloud library to generate a cloud and save it to my computer.
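The keyword filter in the first step might be built like this. The table and column names are illustrative, and `ILIKE` is Postgres's case-insensitive `LIKE`; the function just assembles a parameterized query to hand to a psycopg2 cursor.

```python
def build_title_query(keywords):
    """Build a parameterized Postgres query matching any keyword in the title."""
    clauses = " OR ".join("title ILIKE %s" for _ in keywords)
    params = [f"%{kw}%" for kw in keywords]
    return f"SELECT pubmed_id FROM articles WHERE {clauses}", params

query, params = build_title_query(["coronavirus", "SARS-CoV-2"])
# query and params would then be passed to cursor.execute(query, params)
```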
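Fetching MeSH terms in the second step can be done with NCBI's E-utilities `efetch` endpoint, which returns XML containing `DescriptorName` elements for each MeSH heading. Below, the network call is kept separate from the parsing so the parser can be exercised on a small handcrafted response; the sample XML is a trimmed illustration, not a real record.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_pubmed_xml(pmid):
    """Fetch one article's record from PubMed E-utilities (network call)."""
    url = EFETCH + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "retmode": "xml"}
    )
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()

def parse_mesh_terms(xml_text):
    """Extract MeSH descriptor names from an efetch XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("DescriptorName")]

# Minimal handcrafted response for illustration (no network needed)
sample_xml = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>COVID-19</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Pandemics</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation></PubmedArticle></PubmedArticleSet>"""
terms = parse_mesh_terms(sample_xml)
```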
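The Py4J handoff in the third and fourth steps might look like the sketch below. The gateway method name (`checkTerm`) and the particular set of MetaMap semantic-type abbreviations are assumptions for illustration; the filtering logic is kept as pure functions, with a stub classifier standing in for the running Java gateway.

```python
# Example MetaMap semantic-type abbreviations treated as "medical"
# (e.g. dsyn = Disease or Syndrome); this set is illustrative.
MEDICAL_TYPES = {"dsyn", "sosy", "virs", "phsu", "topp"}

def is_medical(semantic_types, allowed=frozenset(MEDICAL_TYPES)):
    """Return True if any of a term's semantic types is in the allowed set."""
    return any(t in allowed for t in semantic_types)

def count_medical_terms(terms, classify):
    """Tally only the terms that `classify` maps to medical semantic types."""
    counts = {}
    for term in terms:
        if is_medical(classify(term) or []):
            counts[term] = counts.get(term, 0) + 1
    return counts

def classify_via_gateway(term):
    """Ask the Java side (MetaMap) for a term's semantic types via Py4J.

    Requires the Java gateway program to be running; `checkTerm` is a
    hypothetical entry-point method name.
    """
    from py4j.java_gateway import JavaGateway  # deferred: needs py4j installed
    gateway = JavaGateway()
    return list(gateway.entry_point.checkTerm(term))

# Offline demo with a stub classifier in place of the MetaMap gateway
stub = {"COVID-19": ["dsyn"], "Pandemics": ["geoa"]}
counts = count_medical_terms(["COVID-19", "Pandemics", "COVID-19"], stub.get)
```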
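The final step roughly follows the wordcloud library's `generate_from_frequencies` API. The frequency counting with `collections.Counter` runs without the library installed; the render function defers its import, and its sizing parameters are just example values.

```python
from collections import Counter

def term_frequencies(terms):
    """Count how often each MeSH term appears across all articles."""
    return Counter(terms)

def render_cloud(frequencies, path="cord19_cloud.png"):
    """Render and save the word cloud; requires the wordcloud package."""
    from wordcloud import WordCloud  # deferred import
    cloud = WordCloud(width=1200, height=800, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)

freqs = term_frequencies(["COVID-19", "Pandemics", "COVID-19", "Vaccines"])
# render_cloud(freqs)  # uncomment with wordcloud installed
```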
All of these steps were optimized over successive iterations of the program. In a later iteration, I created a new table in my Postgres database that stored the frequency of each MeSH term, so that I could manipulate the counts directly without pinging the PubMed API every time.
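That caching table could be maintained with an upsert along these lines. The example again uses an in-memory SQLite database as a stand-in (the `ON CONFLICT ... DO UPDATE` syntax shown is shared with Postgres, though it needs SQLite 3.24+), and the table name is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mesh_freq (term TEXT PRIMARY KEY, freq INTEGER)")

def save_frequencies(conn, frequencies):
    """Upsert term counts so reruns add to the cache instead of re-fetching."""
    conn.executemany(
        "INSERT INTO mesh_freq (term, freq) VALUES (?, ?) "
        "ON CONFLICT(term) DO UPDATE SET freq = freq + excluded.freq",
        frequencies.items(),
    )
    conn.commit()

save_frequencies(conn, {"COVID-19": 3, "Pandemics": 1})
save_frequencies(conn, {"COVID-19": 2})  # second run adds to the cached count
cached = dict(conn.execute("SELECT term, freq FROM mesh_freq"))
```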
Here are resources that can help you start on your own similar data science project:
- Ubuntu: https://ubuntu.com/, https://ubuntu.com/tutorials
- PostgreSQL: https://www.postgresql.org/, https://www.postgresqltutorial.com/
- SQL in general: https://www.youtube.com/watch?v=HXV3zeQKqGY
- Python: https://docs.python.org/3/tutorial/, https://www.jetbrains.com/pycharm/
- Pubmed: https://pubmed.ncbi.nlm.nih.gov/
- CORD-19 Dataset Metadata: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv
- Java: https://openjdk.java.net/, https://openjdk.java.net/guide/, https://www.jetbrains.com/idea/
- Py4J: https://www.py4j.org/
- Metamap: https://metamap.nlm.nih.gov/
- WordCloud for Python: https://amueller.github.io/word_cloud/
- Git: https://git-scm.com/, https://git-scm.com/book/en/v2