What I Learned About COVID From My First Data Science Experiment
How I used a COVID word cloud to start my journey of data analysis.
COVID-19 created a massive global impact that disrupted every facet of society. As such, it is an area where a lot of scientific research is being conducted in a very short time. All this research from the past two years is readily available in public datasets online. As a new student of data science, I wanted to tackle a problem that would allow me to learn the basics hands-on and, at the same time, answer this question: What COVID-19 topics are being researched the most?
In order to find the answer to this question, I decided to create a data visualization.
Data visualizations can reveal interesting facts and offer insights about a topic. They can take a large amount of information and put it in a form that makes it easy to make claims, form hypotheses, and draw meaningful inferences about a phenomenon. I chose the quintessential word cloud because it gives a simple picture of the prevalent terms used in scientific research.
Here is what I did.
First, I needed to find a dataset of scientific research papers on COVID-19. Then, I had to find out what the papers were about and use that information as the source for the visualization. I found what I was looking for in the COVID-19 Open Research Dataset (CORD-19), which I used to download the metadata for about 200,000 scientific research papers. To find out what each paper was about, I looked up its Medical Subject Headings (MeSH terms). For this step, I discovered and used PubMed’s API. PubMed is a fantastic search tool for accessing references in medical databases (NIH, 2021). Using a Python script, I retrieved the PubMed IDs from the CORD-19 metadata and used the API to get the MeSH terms for each ID.
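A minimal sketch of how such a lookup could work, not the exact script from this project. It assumes the NCBI E-utilities `efetch` endpoint with XML output; the function names `parse_mesh_terms` and `fetch_mesh_terms` are hypothetical and chosen only for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# NCBI E-utilities endpoint for fetching full PubMed records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_mesh_terms(xml_text):
    """Map each PubMed ID in an efetch XML response to its MeSH descriptor names."""
    root = ET.fromstring(xml_text)
    terms_by_pmid = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        descriptors = article.findall(
            "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"
        )
        # Papers that have not been indexed yet simply get an empty list.
        terms_by_pmid[pmid] = [d.text for d in descriptors]
    return terms_by_pmid


def fetch_mesh_terms(pmids):
    """Request a small batch of PubMed IDs (strings) and return their MeSH terms."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}", timeout=30) as response:
        return parse_mesh_terms(response.read())
```

Keeping the parsing separate from the network call makes it possible to check the parser against a saved response before sending thousands of requests.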
To get this working, here is what I used.
- PostgreSQL to store the metadata for the CORD-19 dataset
- The PubMed API to get MeSH terms
- MetaMap to check each term’s relevance to the topic
- Good ol’ Python to string this together
The under-the-hood architecture details are a topic for their own article, which I plan to write for fellow tech enthusiasts.
By compiling all of these MeSH terms and sorting them by frequency, the most common topics of research became apparent. Some of these terms are disease names, practices, or medicines, while others are broader concepts. However, the most powerful feature of this program is the ability to filter the scientific research papers by criteria such as age, gender, or the ethnicity of the affected population.
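The counting step can be sketched in a few lines. This is an illustration under assumptions rather than the project's own code: the input shape (a dictionary from PubMed ID to a list of MeSH terms) and the name `count_mesh_terms` are hypothetical.

```python
from collections import Counter


def count_mesh_terms(terms_by_paper):
    """Count how many papers carry each MeSH term."""
    counts = Counter()
    for terms in terms_by_paper.values():
        # Use a set so a term repeated within one paper is counted once.
        counts.update(set(terms))
    return counts


# Made-up sample data, only to show the shape of the input.
sample = {
    "1": ["COVID-19", "Humans"],
    "2": ["COVID-19", "Pregnancy"],
    "3": ["COVID-19", "Humans"],
}
top_terms = count_mesh_terms(sample).most_common(2)
# top_terms == [("COVID-19", 3), ("Humans", 2)]
```

A frequency table like this is the kind of input a word cloud library expects, with each term drawn larger the more papers it appears in.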
For example, filtering for the MeSH terms of papers that talk specifically about COVID in children can reveal the most common subjects being researched in that group. Similarly, by filtering for COVID specifically in females or males, we can see differences between research related to the sexes. Although there were many possible ways to perform this filtering, I chose a simple method: checking each title for keywords related to a topic. This gives a large enough sample to see differences.
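One way to sketch title-based filtering is shown here. The keyword lists are assumptions for illustration (only “child,” “kid,” and “pediatric” are named in this article), and `title_mentions` and `filter_papers` are hypothetical names. The sketch matches whole words, because a plain substring check would let "male" match "female" and "men" match "women" or "mental".

```python
import re

# Hypothetical keyword lists; plural forms are listed explicitly because
# matching is done on whole words.
GROUP_KEYWORDS = {
    "children": {"child", "children", "kid", "kids", "pediatric"},
    "females": {"female", "females", "woman", "women"},
    "males": {"male", "males", "man", "men"},
}


def title_mentions(title, keywords):
    """Return True if any whole word in the title is one of the keywords."""
    words = re.findall(r"[a-z]+", title.lower())
    return any(word in keywords for word in words)


def filter_papers(papers, group):
    """Keep (pubmed_id, title) pairs whose title mentions the given group."""
    keywords = GROUP_KEYWORDS[group]
    return [(pmid, title) for pmid, title in papers if title_mentions(title, keywords)]
```

Whole-word matching misses some variants (for example "paediatric" or "boys"), so the keyword sets would need tuning against real titles.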
Some of the results were eye-opening.
Narrowing the search with keywords like “child,” “kid,” “pediatric,” etc. produced this word cloud:
When looking at COVID in women, you see a very different picture:
This is the word cloud made by narrowing the search fields for men:
Here is what I learned.
Visualizing this data shows what is being talked about in the medical community relating to the virus, especially when the search fields are narrowed down. It also shows how much research is being conducted for each group.
The data itself reveals some things about COVID research that may not have been apparent before. For example, here is the approximate number of articles for each group in the dataset I downloaded:
- 8,000 papers referenced males
- 1,600 papers referenced females
- 7,500 papers referenced children
You can see that about five times as many papers referenced males as females. The reasons for this may need further study.
Some other revelations from these visualizations:
- For COVID research in women, pregnancy and mental health are a big focus.
- In children, I was expecting MIS-C to be prominent, but that is not the case. I haven’t figured out why, meaning there might really not be enough research done (or I just wrote some bad code).
Although the data can help us make many hypotheses and shows a lot about the nature of COVID research right now, more specific visualizations or better parameters can also help narrow down the MeSH terms for relevancy and reduce error. More can be done with this dataset, and different visualizations can be created.
This is where I need help.
- What inferences do you draw from these word clouds? What surprised you the most?
- What other groupings would you like to see on COVID research?
- What other questions can I find answers to?
— — —
Much gratitude to my mentor, Prasad Chodavarapu, for guiding me throughout this journey.