
Using NLP Techniques to Analyze Mental Health

Abstract

This study employs Natural Language Processing (NLP) techniques to analyze and classify mental health statements, aiming to identify linguistic patterns associated with different mental health conditions such as anxiety and depression. We utilized a dataset containing statements and their corresponding mental health status labels. The results demonstrate the potential of NLP in understanding and categorizing mental health-related text data, with implications for early detection and intervention in mental health care. This research contributes to the growing field of computational psychiatry and highlights the importance of language analysis in mental health assessment.

Keywords

#Mental-health

#Natural Language Processing

#Text classification

#Machine learning

#Visualization

Introduction

Mental health disorders are a significant global health concern, affecting millions of individuals worldwide. The ability to accurately identify and categorize mental health conditions is crucial for effective treatment and intervention. In recent years, the intersection of technology and mental health has opened new avenues for research and diagnosis, with Natural Language Processing (NLP) emerging as a powerful tool in this domain (1).


This study focuses on the application of NLP techniques to analyze and classify mental health statements. The primary objective is to develop a robust methodology for identifying linguistic patterns and features that are indicative of various mental health conditions, such as anxiety and depression (Fig. 1). By leveraging machine learning and text analysis, we aim to contribute to the growing field of computational psychiatry and enhance our understanding of how language use reflects mental health status.

Fig. 1. Distribution of mental health status across patient statements
Fig. 2. Word clouds for each mental health status

Literature review

The application of NLP to mental health analysis represents a convergence of computational linguistics and psychiatric research, offering new ways of understanding and detecting mental health conditions through language patterns. This approach is grounded in the hypothesis that linguistic features and patterns in an individual's speech or writing can provide insights into their mental state and potential mental health conditions (2).

Fig. 3. Top 20 most frequent words in the corpus

According to the hypothesis, certain language indicators, such as word choice, syntactic structures, and semantic content, are associated with various mental health states, including anxiety and depression (3). 

Methodology

Data Collection and Preparation

We acquired a mental health dataset from Kaggle and performed initial data cleaning and preprocessing with the pandas library, retaining only the first 10,000 records for faster computation and analysis. The data is structured into two columns: the patient statement and the corresponding mental health status.
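The loading step might look like the following minimal sketch, assuming the Kaggle export is a CSV file named mental_health.csv with columns named statement and status (the file name and column names are placeholders, not the study's actual identifiers):

```python
import pandas as pd

# Load the Kaggle export; the file name and column names are placeholders
# for the actual dataset used in this study.
df = pd.read_csv("mental_health.csv")

# Basic cleaning: drop rows with a missing statement or label.
df = df.dropna(subset=["statement", "status"])

# Keep only the first 10,000 records for faster computation.
df = df.head(10000)

# Inspect the class distribution (cf. Fig. 1).
print(df["status"].value_counts())
```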

Text Preprocessing

Leveraging the nltk library in Python, we converted all text to lowercase and tokenized the statements into individual words. We then removed stopwords and non-alphabetic tokens and applied lemmatization to reduce words to their base forms. Fig. 3 shows the 20 most frequent words in the entire corpus.
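Continuing from the loading sketch above, the preprocessing could be implemented roughly as follows; the exact tokenizer settings are assumptions rather than the study's verbatim code, and the required nltk resource names can vary slightly by version:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the nltk resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase and split each statement into individual word tokens.
    tokens = word_tokenize(str(text).lower())
    # Drop stopwords and non-alphabetic tokens, then lemmatize to base forms.
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

df["clean_text"] = df["statement"].apply(lambda s: " ".join(preprocess(s)))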

Machine Learning

For feature extraction, we implemented TF-IDF (Term Frequency-Inverse Document Frequency) vectorization using scikit-learn to convert the text data into numerical features. After splitting the dataset into training (80%) and testing (20%) sets, we trained three different ML models: Logistic Regression, Random Forest, and Support Vector Machine (SVM). We evaluated each model using accuracy, precision, recall, and F1-score, and generated a classification report for each.
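A condensed sketch of this pipeline is shown below; the specific hyperparameters (e.g. max_features=5000, n_estimators=200) and the use of LinearSVC as the SVM variant are illustrative assumptions, not the study's exact settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# TF-IDF features over the preprocessed statements.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["clean_text"])
y = df["status"]

# 80/20 train/test split, stratified to preserve class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": LinearSVC(),
}

# Train each model and report accuracy plus a per-class classification report.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}")
    print(classification_report(y_test, preds))
```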

Visualization

Lastly, we created visualizations to support the analysis using Python libraries such as matplotlib and wordcloud. Fig. 2 shows the most frequent words in each of the four categories as word clouds. We also created a confusion matrix heatmap to visualize the performance of the best-performing model across all categories, and generated a learning curve to analyze model performance with varying training set sizes.
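The word clouds (Fig. 2) and confusion matrix heatmap (Fig. 4) could be produced along these lines, continuing from the sketches above; the use of seaborn for the heatmap and the choice of Logistic Regression as the best-performing model are assumptions for illustration:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.metrics import confusion_matrix

# One word cloud per mental health status (Fig. 2).
for status in df["status"].unique():
    text = " ".join(df.loc[df["status"] == status, "clean_text"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(8, 4))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(status)
    plt.show()

# Confusion matrix heatmap for the best-performing model (Fig. 4).
best_model = models["Logistic Regression"]
labels = sorted(y.unique())
cm = confusion_matrix(y_test, best_model.predict(X_test), labels=labels)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted status")
plt.ylabel("Actual status")
plt.show()
```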

Table 1. Performance comparison of the trained models

Analysis & Results

Model Performance Comparison

Table 1 shows that Logistic Regression achieved the highest overall accuracy at 87.1%, while Random Forest and SVM showed comparable performance with accuracies of 85.9% and 86.85%, respectively.


Category-wise Analysis

  • Normal: All models showed excellent performance in identifying normal states (F1-scores > 0.95).

  • Anxiety: High precision across all models (0.93-0.97), indicating reliable identification of anxiety-related language.

  • Depression: Moderate performance (F1-scores 0.65-0.68), suggesting more challenging identification.

  • Suicidal: Lowest performance among categories (F1-scores 0.60-0.66), indicating difficulty in distinguishing suicidal language patterns.

 

Model-Specific Observations

  • Logistic Regression: Excelled in identifying Anxiety (precision: 0.97) and Normal states (recall: 0.99).

  • Random Forest: Showed more balanced performance across categories, particularly for Anxiety.

  • SVM: Demonstrated performance similar to Logistic Regression, with strong results for the Anxiety and Normal categories.

Confusion Matrix Heatmap Analysis


Fig. 4 visually confirms the strong performance in identifying Normal states, with the highest number of correct predictions. It also reveals a notable number of misclassifications between the Depression and Suicidal states, indicating potential linguistic similarities between these categories, while Anxiety shows fewer misclassifications, supporting its distinct linguistic characteristics. The confusion matrix heatmap offers a clear view of where the model excels and where it struggles, highlighting the need for focused improvements in distinguishing between the Depression and Suicidal states.

Fig. 4. Confusion matrix heatmap

Learning Curve Insights


Fig. 5 demonstrates that model performance improved with increasing training data size. The narrowing gap between training and validation accuracy as the training size increased indicates that the model learned effectively without severe overfitting. The curve also suggests that additional training data could further improve performance, particularly for the more challenging categories (Depression and Suicidal).

Fig. 5. The learning curve for the dataset
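For reference, a learning curve like the one in Fig. 5 can be generated with scikit-learn's learning_curve utility. The sketch below continues from the code above and assumes Logistic Regression with 5-fold cross-validation, which may differ from the study's exact setup:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Cross-validated accuracy as a function of training set size (Fig. 5).
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 10),
)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="Training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), "o-", label="Validation accuracy")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```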

Conclusion

This project has demonstrated the potential of Natural Language Processing (NLP) techniques in analyzing and classifying mental health statements, offering valuable insights into the relationship between language use and mental health status. By leveraging machine learning algorithms and text analysis, we have developed a system capable of distinguishing between different mental health conditions with considerable accuracy. The study's findings reveal a nuanced landscape of linguistic markers associated with various mental health states. The high overall accuracy achieved by our models, particularly the Logistic Regression model at 87.1%, underscores the effectiveness of NLP in capturing subtle language patterns indicative of mental health conditions. This success suggests promising applications in mental health screening, early detection, and monitoring. 

In conclusion, this study represents a significant step forward in the application of NLP to mental health analysis. It demonstrates the potential of computational approaches in understanding and assessing mental health through language, while also highlighting areas for improvement and further investigation. As we continue to refine these techniques, we move closer to developing more accurate, accessible, and ethically sound tools for mental health assessment and support. This work contributes to the growing field of computational psychiatry and holds promise for enhancing our ability to understand, detect, and address mental health concerns in diverse populations.

References

  1. Calvo, R. A., Milne, D. N., Hussain, M. S. & Christensen, H. Natural language processing in mental health applications using non-clinical texts. Natural Language Engineering 23, 649–685 (2017). https://doi.org/10.1017/S1351324916000383

  2. Malgaroli, M., Hull, T.D., Zech, J.M. et al. Natural language processing for mental health interventions: a systematic review and research framework. Transl Psychiatry 13, 309 (2023). https://doi.org/10.1038/s41398-023-02592-2

  3. Taylor, V.A., Roy, A. & Brewer, J.A. Cluster-based psychological phenotyping and differences in anxiety treatment outcomes. Sci Rep 13, 3055 (2023). https://doi.org/10.1038/s41598-023-28660-7


Inspiration: Using NLP: Do Parents Talk to Boys and Girls Differently During Play? (jessierayebauer.wixsite.com)
