Revelo Magicen – The Harry Potter Spell Hunt
Student Data Science Project
As fans of the Harry Potter book series, four MDS Computational Linguistics students (Xinyi Li, Boyang Liu, Zimo Wang, and Yiting Zhou) from the Class of 2022 knew they wanted to annotate spells from the books for their Advanced Corpus Linguistics project.
The project tasked students with collecting a large corpus from the web, carrying out reliable corpus annotation using external annotators, and making the corpus searchable via an HTML frontend and a Python backend.
In addition to annotating all the spells from the books, the group decided to track when each spell is used, in which situations it is used, and who uses it, and then to analyze the results.
To start, the students found a website that hosted e-book versions of all the Harry Potter books. Their data set comprised 1,084,170 words of raw text, from which they extracted around 730 annotations.
The students wrote a Python script that found the paragraphs containing mentions of spells.
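The article does not show the students' script, so the following is only a minimal sketch of how such a paragraph search might look. The spell list, the function name `find_spell_paragraphs`, and the sample text are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical sample of spell names; the students' actual list is not given.
SPELLS = ["Expelliarmus", "Lumos", "Accio", "Expecto Patronum"]

# Map lower-cased names back to their canonical spelling.
CANONICAL = {s.lower(): s for s in SPELLS}

# One case-insensitive pattern that matches any spell as a whole word or phrase.
SPELL_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in SPELLS) + r")\b",
    re.IGNORECASE,
)


def find_spell_paragraphs(text):
    """Return (paragraph_index, paragraph, spells) for paragraphs mentioning a spell.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    hits = []
    for index, paragraph in enumerate(paragraphs):
        found = {CANONICAL[m.group(0).lower()] for m in SPELL_PATTERN.finditer(paragraph)}
        if found:
            hits.append((index, paragraph, sorted(found)))
    return hits


# Invented sample text (not quoted from the books).
sample_text = (
    'Harry raised his wand. "Lumos!" he whispered.\n\n'
    "The corridor was empty.\n\n"
    '"ACCIO broom!" she shouted, and then, "Expelliarmus!"'
)

for index, paragraph, spells in find_spell_paragraphs(sample_text):
    print(index, spells)
```

A fixed word list like this only finds spells it already knows about; paragraphs flagged this way would still need the manual annotation described in the article.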
While annotating their corpus, the students found that some annotation decisions could be ambiguous, confusing, or potentially biased. To address this, they adopted an inter-annotator agreement protocol: the annotation task was split among the members, with part of the work overlapping between two annotators. The overlap allowed them to evaluate an accuracy score for the annotation and to eliminate some misleading information.
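The article does not say which agreement measure the group used. As an illustration only, the sketch below computes two common ones on the overlapping items: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The label set ("attack", "everyday", "defence") and the two annotation lists are invented for the example.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six overlapping items.
annotator_1 = ["attack", "everyday", "everyday", "defence", "attack", "everyday"]
annotator_2 = ["attack", "everyday", "defence", "defence", "attack", "everyday"]

print(round(percent_agreement(annotator_1, annotator_2), 3))
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Items on which the two annotators disagree are natural candidates for discussion and for tightening the annotation guidelines.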
The group found that Harry is the character who casts the most spells and is also the target of the most spells. The books also contain a lot of day-to-day magic, rather than just killing or attacking spells.
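Counts of this kind can be derived from annotation records with a simple tally. The sketch below assumes each annotation is a dictionary with hypothetical `spell`, `caster`, and `target` fields; the five records are toy data invented for the example and are not the students' results.

```python
from collections import Counter

# Toy annotation records (invented; field names are assumptions).
annotations = [
    {"spell": "Lumos", "caster": "Harry", "target": None},
    {"spell": "Expelliarmus", "caster": "Harry", "target": "Draco"},
    {"spell": "Stupefy", "caster": "Hermione", "target": "Harry"},
    {"spell": "Accio", "caster": "Harry", "target": None},
    {"spell": "Expelliarmus", "caster": "Draco", "target": "Harry"},
]


def tally(records, field):
    """Count how often each value appears in `field`, skipping missing values."""
    return Counter(r[field] for r in records if r.get(field))


casters = tally(annotations, "caster")
targets = tally(annotations, "target")

print(casters.most_common(1))
print(targets.most_common(1))
```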
The end result was a web app, implemented with simple Python and HTML scripts and wrapped in Docker so that it displays properly. To run the app, a user downloads the required Docker file, opens it locally, and follows the instructions provided.
The students hope that this type of annotation will inspire fans of other books to do similar work, and that by combining programming with manual annotation, fans can dig much deeper into the characters and scenarios of any book.