Institute and Center Predictions for Automated Recommendations of Grant Applications
Alumni Data Science Project
As a data scientist for the Bethesda, Maryland-based National Institutes of Health and its National Institute of General Medical Sciences’ (NIGMS) Division of Data Integration, Modeling, and Analytics (DIMA), Nicholas Sanders works on developing NLP and machine learning models for various tasks.
NIGMS funds research across a wide variety of domains, with a focus on basic biological research that supports advances in disease diagnosis, treatment, and prevention.
One of the tasks that Sanders worked on involved converting NIH’s existing PAC (Program Area Code)/PO (Program Officer) recommender from R to Python. The model was deployed on an internal NIH website, where staff can input a grant application and receive PAC/PO recommendations. The next task was to develop a new model that also predicts one of the 24 ICs (Institutes and Centers) that provide funding support for research.
After optimizing the PAC/PO Python code and preparing it for deployment, Sanders analyzed in depth which data would be most useful for predicting IC. He then investigated which sections of grant applications were most correlated with IC and began researching and developing different deep learning and machine learning models for long-document classification.
Looking at the data, Sanders and his team noted that not all historical awarded grant applications were useful for training models. The team found that certain metadata was often correlated with IC assignment, which gave insight into the performance of text-based NLP models. Additionally, Sanders tested many newer deep learning architectures for tackling input sequence length constraints, which proved useful for other tasks.
The deep learning models tested for dealing with long input sequence constraints revolved around modifying transformers' attention mechanisms and reducing model size through methods like quantization and distillation. During the MDS-CL program, learning about attention through the Machine Translation course was fundamental to Sanders's understanding of this task.
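The article does not name the specific architectures that were tested. As one illustration of what modifying attention for long inputs can look like, the sketch below builds a sliding-window (local) attention mask, in which each token may attend only to its neighbors within a fixed window instead of to every other token. The function names and the window size are hypothetical, chosen for this example only, and are not taken from the NIGMS project.

```python
def local_attention_mask(seq_len, window):
    """Build a seq_len x seq_len boolean mask (illustrative helper).

    mask[i][j] is True when token i may attend to token j,
    i.e. when the two positions are at most `window` apart.
    """
    return [
        [abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def count_allowed(mask):
    """Count the (query, key) pairs that the mask permits."""
    return sum(sum(row) for row in mask)


# Toy sizes for illustration; real documents have thousands of tokens.
seq_len, window = 8, 1
mask = local_attention_mask(seq_len, window)

full_pairs = seq_len * seq_len      # pairs scored by full attention
local_pairs = count_allowed(mask)   # pairs scored by local attention

print(local_pairs, full_pairs)  # prints "22 64"
```

With a fixed window, the number of scored pairs grows roughly linearly with sequence length rather than quadratically, which is the general motivation behind this family of long-document approaches.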
Before deploying the models, Sanders and his team often ran predictions on large numbers of incoming grant applications to obtain ballpark estimates of which grants were applicable to each IC.
Some of the skills that Sanders applied to this project included programming, statistics, data wrangling, SQL, deep learning, and machine learning, as well as corpus linguistics and data set development.
Currently, both the PAC/PO and IC recommenders are deployed and available to NIGMS staff on an internal website; however, the most recent versions are still being prepared for deployment.