Slang Identification

Student Data Science Project

With the use of social media becoming a part of everyone’s lives and the use of slang on those platforms becoming ubiquitous, a group of UBC Master of Data Science (MDS) Computational Linguistics students (Hao Chen He, Claudia Ong, Jarrett MacFarlane and Douglas Chan) created a corpus of slang vocabulary as part of their Advanced Corpus Linguistics course in order to help people better navigate the world of slang.

The students developed a web app to help people determine whether a word or phrase is slang, how to use slang appropriately, which slang to use on which social media site and which slang to use with which demographic.

Data on the topic of celebrities were scraped from TikTok, Instagram, Reddit and YouTube to build a corpus of over one million words. From the text collected, the team manually annotated about 12,000 examples, marking whether each one contained slang. When slang was present, more information (overall sentiment of the example, generation in which the slang appeared, etc.) was also annotated to support further analysis.
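The annotation scheme described above can be pictured as one record per example. The sketch below is only an illustration: the class name, the field names and the three toy records are assumptions made for this example, not the students' actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedExample:
    """One manually annotated example (all field names are hypothetical)."""
    text: str
    platform: str                      # e.g. "Reddit", "YouTube"
    contains_slang: bool
    sentiment: Optional[str] = None    # only filled in when slang is present
    generation: Optional[str] = None   # e.g. "Gen-Z", "Millennial"


def slang_rate_by_platform(examples):
    """Return the share of examples flagged as slang for each platform."""
    totals, slangy = Counter(), Counter()
    for ex in examples:
        totals[ex.platform] += 1
        if ex.contains_slang:
            slangy[ex.platform] += 1
    return {platform: slangy[platform] / totals[platform] for platform in totals}


# Invented toy records, not taken from the students' corpus.
examples = [
    AnnotatedExample("omg she ate that", "TikTok", True, "positive", "Gen-Z"),
    AnnotatedExample("The album comes out on Friday.", "TikTok", False),
    AnnotatedExample("lol this interview is gold", "Reddit", True, "positive", "Millennial"),
]

rates = slang_rate_by_platform(examples)
print(rates)
```

Keeping the extra fields optional mirrors the description above, where sentiment and generation were only annotated once slang had been found.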

[Image: word cloud from the slang web app]

Word clouds were developed to show the most commonly used slang phrases on each social media platform the students analyzed. From those, they discovered that most of these platforms share certain slang, like “omg”, “lol” and “bro”. The different slang terms originating from each site could elucidate other important background information about the platforms, such as the age range of users, what people do on the platform and the topics they talk about. The students noted that such insights could be used to identify different “personalities” within each of these social media platforms, or potentially to trace the history of a certain slang word or phrase. They could also be used to identify whether the same slang word or phrase is used differently on different social media platforms.
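A word cloud is essentially a picture of term frequencies, so the comparison above can be sketched with simple counting. This is a minimal illustration under stated assumptions: the slang lexicon (beyond “omg”, “lol” and “bro”, which the article names) and the toy comments are invented, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical slang lexicon; only "omg", "lol" and "bro" come from the article.
SLANG_TERMS = {"omg", "lol", "bro", "slay", "banger"}

# Invented toy comments, not taken from the students' corpus.
comments = {
    "TikTok": ["omg slay queen", "lol bro what"],
    "Reddit": ["lol no way bro", "omg this thread"],
    "YouTube": ["this song is a banger bro", "omg lol"],
}


def slang_counts(texts, lexicon):
    """Count how often each lexicon term appears in a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token in lexicon)
    return counts


per_platform = {name: slang_counts(texts, SLANG_TERMS) for name, texts in comments.items()}

# Slang that appears on every platform.
shared = set.intersection(*(set(counts) for counts in per_platform.values()))

# Slang that appears on one platform only.
platform_specific = {}
for name, counts in per_platform.items():
    elsewhere = set()
    for other, other_counts in per_platform.items():
        if other != name:
            elsewhere |= set(other_counts)
    platform_specific[name] = set(counts) - elsewhere

print(sorted(shared))
print(platform_specific)
```

Per-platform counts like these are the usual input to a word cloud renderer, and the platform-specific sets are one simple way to look for each site's “personality”.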

From the dataset, the students noticed that Reddit users used a wider variety of slang, while Instagram and YouTube users used a narrower range and tended to stick to a few favourite slang terms.
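One way to put a number on "variety" is to count the distinct slang terms per platform and divide by the total number of slang tokens (a type-token ratio). The article does not say how the students measured variety, so the measure, the function name and the toy tokens below are all assumptions made for illustration.

```python
def slang_variety(slang_tokens):
    """Return (number of distinct slang terms, type-token ratio)."""
    if not slang_tokens:
        return 0, 0.0
    distinct = len(set(slang_tokens))
    return distinct, distinct / len(slang_tokens)


# Invented toy slang tokens, not taken from the students' corpus.
tokens_by_platform = {
    "Reddit": ["lol", "bro", "omg", "sus", "yeet", "lol"],
    "Instagram": ["omg", "omg", "lol", "omg", "lol", "omg"],
}

variety = {name: slang_variety(tokens) for name, tokens in tokens_by_platform.items()}
for name, (distinct, ratio) in variety.items():
    print(f"{name}: {distinct} distinct terms, type-token ratio {ratio:.2f}")
```

A type-token ratio tends to fall as a sample grows, so a fair comparison would use samples of similar size from each platform.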

The students also observed from the YouTube word cloud that the slang used on that platform was the most platform-specific and was mostly music related. However, this could be due to a bias in the data collection process, since they gathered only content on the topic of celebrities, and musicians are the celebrities with a more prominent presence on YouTube.

While the students found slang interesting, they also found it complicated. It was hard to determine whether a word used in their dataset counted as slang, and even then, there were difficulties determining whether it belonged to Gen-Z or Millennial slang. Some slang words were also so commonplace that the students were not sure if they could still be considered slang.

The students believe that advertisers and marketers could benefit from their slang web app to enhance the user experience with their products, improve fluency in user communication, and conduct more extensive sentiment analysis.

During this whole process, the students enjoyed annotating and using their dataset, and learnt lots of new slang words along the way.
