Knowledge Augmentation in Language Models to Overcome Domain Adaptation and Scarce Data Challenges in Clinical Domain
KUMAR, VIVEK
2023-04-28
Abstract
The co-existence of two scenarios in the healthcare domain, “the massive amount of unstructured text data that humanity produces” and “the scarcity of sufficient training data to train language models,” has greatly increased the need for intelligent tools and techniques to process, interpret, and extract different types of knowledge from the data. My research goal in this thesis is to develop intelligent methods and models that better interpret human language and sentiment automatically, particularly their structure and semantics, to solve multiple higher-level Natural Language Processing (NLP) downstream tasks and beyond. The thesis spans six chapters and is divided into two parts based on its contributions. The first part centers on best practices for modeling data and injecting domain knowledge to enrich data semantics, applied to several classification tasks in the healthcare domain and beyond. The contribution is to reduce training time, improve the performance of classification models, and use world knowledge as a source of domain knowledge when working with limited training data. The second part introduces AnnoMI, a one-of-a-kind high-quality dataset of Motivational Interviewing (MI), followed by an experimental benchmarking analysis of it. The contribution is a publicly accessible MI dataset together with methods to overcome data-scarcity challenges in complex domains such as mental health. The overall organization of the thesis is as follows.

The first chapter provides a high-level introduction to the tools and techniques applied within the scope of the thesis. The second chapter presents optimal methods for (i) feature selection, (ii) eliminating irrelevant and superfluous attributes from the dataset, (iii) data preprocessing, and (iv) advanced data representation methods (word embeddings and bag-of-words) to model data.
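To make the bag-of-words representation mentioned above concrete, the sketch below builds a shared vocabulary over a small corpus and turns each document into a vector of raw term counts. It is a minimal illustration only (the helper `bag_of_words` and the toy sentences are not from the thesis), and it omits the preprocessing and feature-selection steps the second chapter discusses:

```python
from collections import Counter

def bag_of_words(docs):
    """Represent each document as a vector of term counts
    over a shared, alphabetically ordered vocabulary."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

docs = ["the therapist listens", "the client talks to the therapist"]
vocab, vecs = bag_of_words(docs)
print(vocab)    # shared vocabulary
print(vecs[1])  # term counts for the second document
```

Unlike the dense word embeddings also covered in that chapter, these sparse count vectors carry no notion of semantic similarity between words, which is precisely the gap embeddings fill.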
The third chapter introduces K-LM, a Language Model (LM) combining the Generative Pretrained Transformer (GPT)-2 and Bidirectional Encoder Representations from Transformers (BERT) that uses knowledge graphs to inject domain knowledge for domain adaptation tasks. The goal of this chapter is to reduce training time and improve the performance of classification models when working with limited training data. The fourth chapter introduces AnnoMI, a high-quality dataset of expert-annotated MI comprising 133 therapy session transcripts distributed over 44 topics (including smoking cessation, anxiety management, and weight loss), and provides an in-depth analysis of the dataset.

The fifth chapter presents the experimental analysis of AnnoMI, which includes (i) augmentation techniques to generate data and (ii) fairness and bias assessments of the employed Classical Machine Learning (CML) and Deep Learning (DL) approaches to develop reliable classification models. Finally, the sixth chapter presents the conclusions and outcomes of all the work in this thesis. The scientific contributions of this thesis include solutions to the challenges of scarce training data in complex domains and of domain adaptation in LMs. The practical contributions are data resources and the language model for a range of quantitative and qualitative NLP applications.

Keywords: Natural Language Processing, Domain Adaptation, Motivational Interviewing, AI Fairness and Bias, Data Augmentation, GPT, BERT, Healthcare.
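As a rough intuition for the text-augmentation idea in the fifth chapter, the toy sketch below generates variants of an utterance by substituting tokens from a small synonym lexicon. It is a deliberately simple stand-in, not the thesis's method (which uses generative models); the helper `augment`, the lexicon, and the example sentence are all hypothetical:

```python
import random

def augment(text, synonyms, n_aug=2, seed=0):
    """Produce n_aug variants of a sentence by replacing each token
    that appears in the synonym lexicon with a random alternative."""
    rng = random.Random(seed)  # seeded for reproducibility
    tokens = text.split()
    variants = []
    for _ in range(n_aug):
        new = [rng.choice(synonyms[t]) if t in synonyms else t
               for t in tokens]
        variants.append(" ".join(new))
    return variants

lexicon = {"quit": ["stop", "give up"], "smoking": ["tobacco use"]}
print(augment("i want to quit smoking", lexicon))
```

Even this naive substitution scheme illustrates the core trade-off the chapter examines: augmented examples enlarge scarce training sets, but their quality (and any bias they introduce) must be assessed before the resulting classifiers can be considered reliable.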