Natural Language Processing – “That’s what who said?”
We ran five experiments using two pre-trained transformer-based models (DistilBERT and RoBERTa) to predict whether the speaker of a line of dialogue from the television show The Office was “Dwight” or “Not Dwight”. Our goal was a tool that accurately identifies speakers, which could improve closed captioning for the viewers who rely on it.
The linked GitHub page has more information on the project, including our preprocessing steps, model implementation, and analysis. The project is also summarized in a technical paper that I wrote.
Data Acquisition and Pre-processing – Predictors of Global Mental Illness Rates
We used machine learning models to forecast country-level prevalence rates of depressive disorders from past socioeconomic and mental health data drawn from online research databases. The purpose of the project was to practice acquiring and cleaning data, then applying different machine learning models and evaluating their performance. You can read the final project report here.
I am in the process of uploading the notebook files to my GitHub account…
This project focused on data acquisition. We used web scraping with the BeautifulSoup package to collect discography information for K-pop artists, and we pulled show information for anime series from Kaggle files that we downloaded using Kaggle’s API. I was responsible for all the web scraping that produced the K-pop artist dataset. My code can be accessed here.
This was my first Python project. The greatest challenge was scraping a fan wiki site where the pages for different artists used different formatting styles, which meant writing code for each artist’s page to handle its particular HTML structure. The code may no longer work if the pages we scraped have since changed. This was also before I had extensive experience with pandas. If I redid this project, I would store the artist dictionary as a table instead of a JSON file, and I would pick websites with more consistent formatting to reduce the scraping workload.
This was truly a shambolic project, since we started with the file downloader and were then told to expand on it with web scraping.
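To illustrate the "table instead of a JSON file" idea mentioned above, here is a minimal sketch using only the Python standard library. The artist names, the `album` and `year` fields, and the `flatten_discography` helper are all hypothetical placeholders, not the structure of the original dataset. The sketch flattens a nested artist dictionary into one row per release and writes the rows as CSV.

```python
import csv
import io

# Hypothetical nested dictionary, shaped like something that might be dumped to JSON.
# The names and fields are placeholders, not the project's real data.
artists = {
    "Artist A": [
        {"album": "First Album", "year": 2019},
        {"album": "Second Album", "year": 2021},
    ],
    "Artist B": [
        {"album": "Debut", "year": 2020},
    ],
}


def flatten_discography(artist_dict):
    """Turn {artist: [release, ...]} into one flat row per release."""
    rows = []
    for artist, releases in artist_dict.items():
        for release in releases:
            # Copy the release fields and add the artist name as its own column.
            rows.append({"artist": artist, **release})
    return rows


rows = flatten_discography(artists)

# Write the rows as CSV text; an in-memory buffer stands in for a real file here.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["artist", "album", "year"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

With one row per release, the data can be filtered, sorted, or joined directly. The same list of rows could also be passed to `pandas.DataFrame(rows)` if pandas is available.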
The projects below are in the process of being added to this site and to my GitHub account.
Retail Sales Data Project
Medicare Hospital Data Project
NLP Book Project
