Creating a Search Engine for Domain-Specific Text

Alumni Data Science Project

Ilana Zimmerman, MDS Computational Linguistics Alumna (Class of 2020), was tasked with creating a search engine based on two transformer models in her role as a Natural Language Processing Engineer with ALEX - Alternative Experts. ALEX is an ISO 9001:2015-certified solutions provider to Government, Defense, and Commercial contracts, based in Washington, DC.

The search engine was to be used on domain-specific text, and it was recommended that Zimmerman train a model based on the Semantic Textual Similarity (STS) task. The search engine was a key deliverable for a contract her team was working on.

Zimmerman was given code to train a model based on the STS task and a means to deploy the model.

The first task was to develop a training set of sentence pairs from the team's domain, which in-house annotators would label on a scale of 1 to 5 according to how similar the two sentences were.

Extracting sentence pairs whose similarity was evenly distributed across the 1 to 5 scale turned out to be harder than Zimmerman had anticipated. It also turned out that the domain-specific STS model was not appropriate for the team's use case.
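
To illustrate why an even spread of similarity levels is hard to obtain, here is a minimal sketch of one way to pick candidate pairs for annotation. The source does not describe how the pairs were actually extracted, so everything here is an assumption: the word-overlap (Jaccard) score is only a cheap stand-in for similarity, and the names jaccard, bucket, and balanced_pairs are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Word-overlap score in [0, 1]; a crude proxy for similarity (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def bucket(score, n_buckets=5):
    """Map a score in [0, 1] to an integer bucket from 1 to n_buckets."""
    return min(int(score * n_buckets), n_buckets - 1) + 1


def balanced_pairs(sentences, per_bucket, n_buckets=5, seed=0):
    """Sample up to per_bucket sentence pairs from each similarity bucket."""
    by_bucket = defaultdict(list)
    for a, b in combinations(sentences, 2):
        by_bucket[bucket(jaccard(a, b), n_buckets)].append((a, b))

    rng = random.Random(seed)
    sample = {}
    for k in range(1, n_buckets + 1):
        pool = by_bucket.get(k, [])
        # A bucket may hold fewer pairs than requested, or none at all.
        sample[k] = rng.sample(pool, min(per_bucket, len(pool)))
    return sample


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat sat", "a dog ran", "the cat ran"]
    for level, pairs in balanced_pairs(corpus, per_bucket=2).items():
        print(level, len(pairs))
```

Even in this toy corpus some buckets come back empty. Randomly paired sentences from real text are mostly unrelated, so the high-similarity levels tend to be the hardest to fill.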

After reading a research paper on large-scale information retrieval, Zimmerman developed code to train a model based on the approach it described. She also found a fast similarity search tool, developed by Facebook AI and designed to support searches over as many as billions of images, and restructured the search engine's back-end code around it.
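
The core operation behind this kind of search engine can be sketched as a nearest-neighbor lookup over embedding vectors. The sketch below is a brute-force illustration only, not the team's implementation: the source names neither the paper nor the tool (the description resembles libraries such as FAISS, but that is an assumption), the class BruteForceIndex is hypothetical, and the two-dimensional vectors are hand-written stand-ins for embeddings that a trained transformer model would produce.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


class BruteForceIndex:
    """Hypothetical in-memory index that compares the query to every vector."""

    def __init__(self):
        self._docs = []
        self._vectors = []

    def add(self, doc, vector):
        self._docs.append(doc)
        self._vectors.append(normalize(vector))

    def search(self, query_vector, k=3):
        """Return the k most similar documents as (doc, cosine_score) pairs."""
        q = normalize(query_vector)
        scored = (
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self._vectors)
        )
        top = heapq.nlargest(k, scored)
        return [(self._docs[i], score) for score, i in top]


if __name__ == "__main__":
    index = BruteForceIndex()
    index.add("doc A", [1.0, 0.0])
    index.add("doc B", [0.0, 1.0])
    index.add("doc C", [1.0, 1.0])
    for doc, score in index.search([0.9, 0.1], k=2):
        print(doc, round(score, 3))
```

Comparing the query against every stored vector scales linearly with the collection size, which is the reason dedicated similarity search libraries use specialized index structures for very large collections.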

The search results were significantly better than those from both the earlier model design and the term-frequency-based approach Zimmerman’s team had used before.

Since developing the tool, Zimmerman has added other capabilities, such as keyword search and a question-answering model.

The most relevant skill from the MDS-CL program that Zimmerman applied to this project was training BERT-based models, which she learned in the Trends in Computational Linguistics and Supervised Learning classes.

By the end of the project, Zimmerman was able to reflect on how much she had learned in such a short amount of time and on the impact she had been able to make on her team.
