Using NLP to Evaluate the Efficacy of Mental Health Apps

Extracting Insights from App Store Reviews

Jul 23, 2021

The Pandemic’s Impact on Mental Health

Since COVID-19 hit at the beginning of 2020, government-imposed lockdowns and restrictions have forced people around the world to get used to staying at home. This has inevitably had a massive impact on people’s social lives, and it is no surprise that the number of people seeking help with anxiety and depression has skyrocketed. According to Mental Health America, the number of anxiety screens taken from January to September of 2020 was 93% higher than the total recorded in all of 2019, and the number of depression screens was 62% higher.

To make matters worse, it has become increasingly difficult to arrange in-person appointments with health professionals. For this reason, people seeking help with anxiety or depression have been turning to mental health mobile apps as something of an alternative, and according to Market Watch, there were 4 million first-time downloads in April 2020. Overall, the mental health app market has seen tremendous growth, and it is predicted to keep growing at a compound annual growth rate of 20.5% through 2027 to reach $3.3 billion. However, as more and more competitors enter the market, it becomes increasingly important to evaluate the efficacy of these apps by determining whether users are satisfied with the services provided.

Natural Language Processing with App Reviews

To evaluate the positive and negative aspects and better understand the efficacy of these apps, I conducted sentiment analysis of reviews written by users of 31 popular mental health apps. You can find the full analysis and data here.

About the Data

The data collected for this analysis includes just under 45,000 text reviews from the US iOS App Store, each paired with a rating out of 5. These numerical ratings can be used to label the overall sentiment of a review: ratings of 1 to 3 indicate a negative review, and ratings of 4 or 5 indicate a positive review.
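As a minimal sketch of that labeling rule, the snippet below maps a star rating to a sentiment label. The function name and the sample reviews are made up for illustration and are not taken from the actual dataset.

```python
def label_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a sentiment label (1-3 negative, 4-5 positive)."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    return "positive" if rating >= 4 else "negative"


# Hypothetical sample rows: (review text, star rating).
reviews = [
    ("Charged me twice for the subscription", 1),
    ("It is okay but crashes sometimes", 3),
    ("Helps me fall asleep every night", 5),
]

# Pair each review text with its derived sentiment label.
labeled = [(text, label_sentiment(rating)) for text, rating in reviews]
```

Note that a 3-star review counts as negative under this rule, so the "negative" class also contains lukewarm reviews.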

Now, if we examine the distribution of ratings for the top five most-reviewed apps below, we see that they all follow a similar trend: users were much less likely to leave negative ratings of 3 or less than positive ratings of 4 or 5. This presents an issue of class imbalance which we need to keep in mind during the modeling process. Even so, we still have a substantial number of negative reviews to work with, so we can extract valuable insights from both negative and positive text reviews.
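One quick way to quantify the imbalance is to count the labels and derive per-class weights. The sketch below uses made-up labels with an 80/20 split, which is not the real ratio in the data. The weight formula, n_samples / (n_classes * class_count), is the common "balanced" heuristic, and using it here is an assumption rather than a description of what the original analysis did.

```python
from collections import Counter

# Hypothetical labels; the 80/20 split is invented for illustration.
labels = ["positive"] * 8 + ["negative"] * 2

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Share of each class in the data.
shares = {label: count / n_samples for label, count in counts.items()}

# "Balanced" weights: rarer classes get proportionally larger weights,
# so each class contributes equally to the training loss overall.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}
```

With these toy labels the negative class receives a weight four times larger than the positive class, which mirrors the fact that it is four times rarer.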

Now that we have an idea of what the data looks like, let’s move on to deciphering it.

How the Model Works

Several machine learning models were trained on the data, and out of Logistic Regression, Random Forest and Support Vector Classification, Logistic Regression had the best prediction rate on the test data.
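A comparison like this could be set up with scikit-learn roughly as follows. This is only a sketch under several assumptions that the article does not confirm: TF-IDF features, balanced class weights, accuracy as the "prediction rate", and a tiny set of invented reviews in place of the real data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy reviews standing in for the real dataset.
negative_texts = [
    "charged twice for the subscription",
    "paid for premium and it crashes",
    "subscription renewed without warning",
    "support never answered my refund request",
    "the app crashes after I paid",
    "cannot cancel the subscription",
]
positive_texts = [
    "helps me fall asleep every night",
    "easy to use every day",
    "great for building daily habits",
    "I fall asleep faster with the sleep stories",
    "calming meditation and easy to follow",
    "love the meditation sessions every night",
]
texts = negative_texts + positive_texts
labels = ["negative"] * len(negative_texts) + ["positive"] * len(positive_texts)

# Stratify so both classes appear in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
    "Support Vector Classification": SVC(class_weight="balanced"),
}

# Fit each model on the same TF-IDF features and record test accuracy.
scores = {}
for name, classifier in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)
```

On a dataset this small the scores mean very little. With imbalanced real data, a metric such as balanced accuracy or per-class recall would likely be more informative than plain accuracy.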

We will take a closer look at the model’s feature coefficients later, but first let’s see the model in action. By using the LIME text explainer, we can see which words in a review the model weighs most heavily when determining the overall sentiment.

In the example below, we can see that the user is unhappy about having been overcharged for the app’s subscription service. Because the model has learned to treat payment-related words such as “paid” and “subscription” as indicators of negative sentiment, it easily determines that this review is a negative one.

In the next example, we can see that there are words such as “pain”, “restlessness” and “struggle” that might seem negative. However, the model was able to learn that there are other words that have higher importance, including “mind”, “sleeping”, “forward” and “asleep”, which carry more weight in classifying this review as a positive one.

These examples show how the trained Logistic Regression model reads the test data to predict the overall sentiment. Now that we have seen how the model works, we can proceed to examine its feature coefficients and ultimately evaluate the strengths and weaknesses of mental health apps.
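To build some intuition for word-level explanations like the ones above, here is a simplified leave-one-word-out sketch. It is not the LIME library and not the trained model: LIME samples many perturbed versions of a text and fits a local linear model to them, whereas this sketch just removes one word at a time and measures how the predicted probability shifts. The word weights are invented for illustration.

```python
import math

# Hypothetical word weights standing in for a trained model's coefficients.
WEIGHTS = {
    "paid": -1.5,
    "subscription": -1.2,
    "pain": -0.3,
    "asleep": 1.4,
    "sleeping": 1.1,
}


def predict_positive(words: list[str]) -> float:
    """Probability of the positive class under a toy linear model."""
    score = sum(WEIGHTS.get(word, 0.0) for word in words)
    return 1.0 / (1.0 + math.exp(-score))


def word_contributions(text: str) -> list[tuple[str, float]]:
    """Drop each word in turn and record the change in positive probability."""
    words = text.lower().split()
    baseline = predict_positive(words)
    contributions = []
    for i, word in enumerate(words):
        without = words[:i] + words[i + 1:]
        # Negative value: the word pulls the prediction toward "negative".
        contributions.append((word, baseline - predict_positive(without)))
    # Words with the largest absolute effect come first.
    return sorted(contributions, key=lambda item: abs(item[1]), reverse=True)


explanation = word_contributions("I paid for the subscription twice")
```

In this toy setup, "paid" and "subscription" come out as the words that pull the prediction toward the negative class, which is the same kind of reading the LIME plots give for the real model.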

Findings

Let’s start by taking a look at some word clouds to identify what words appeared most frequently in negative vs positive reviews:

It’s clear from the negative word cloud that most user dissatisfaction is related to either payment or customer support. The positive word cloud, on the other hand, covers a wider variety of topics. This is likely due to the imbalance in the data that we saw earlier: the positive class has a much higher number of reviews, which could have resulted in higher variance.
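Word clouds are essentially a picture of word frequencies, so the same comparison can be made with plain counts. The following is a small sketch on invented reviews. The stopword list and the tokenization rule are simplifying assumptions and not the preprocessing used in the original analysis.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real analyses use a much longer one.
STOPWORDS = {"the", "a", "i", "me", "my", "for", "and", "to", "is", "it", "this"}

# Hypothetical labeled reviews.
labeled_reviews = [
    ("They charged me twice for the subscription", "negative"),
    ("Support never answered and the subscription renewed", "negative"),
    ("This helps me fall asleep every night", "positive"),
    ("Easy to use and I fall asleep faster", "positive"),
]


def top_words(reviews, label, n=3):
    """Return the n most frequent non-stopword tokens among reviews with this label."""
    counts = Counter()
    for text, review_label in reviews:
        if review_label == label:
            tokens = re.findall(r"[a-z']+", text.lower())
            counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


top_negative = top_words(labeled_reviews, "negative")
top_positive = top_words(labeled_reviews, "positive")
```

Raw counts favor words that are common everywhere, which is one reason the model coefficients can give sharper insights than the word clouds.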

However, if we examine the feature coefficients of the trained Logistic Regression model, we can extract more well-defined insights.

The Logistic Regression model confirms that payment is a major issue, and the words “service” and “crashes” further reinforce that customer support is also a determining factor in negative reviews.

For the positive class, we can see from the words “asleep” and “night” that sleep functionality is considered a positive element of these apps. We also see from the words “easy”, “day”, “everyday” and “habits” that users appreciate an easy-to-use interface that enables and encourages people to use the app regularly.

In Conclusion…

Through this analysis, we have identified what constitutes a negative vs a positive review for mental health apps. We can conclude that negative reviews typically included complaints regarding payment, costs and support, and that positive reviews praised the apps’ ability to help users get better sleep as well as the general ease of use which enabled them to use the app on a regular basis.

Although the negative reviews mostly focused on payment and costs, the good news is that this is not indicative of any flaws in the functionality of the apps. Meanwhile, the apps were highly praised for topics including “sleep”, “meditation” and “habits”, which shows that there are several specific app features that users consistently credit with providing a positive experience. In addition to these insights, we also found that the distribution of ratings indicates a more positive than negative sentiment towards the apps overall.

Even with the findings outlined above, it’s hard to say definitively whether mental health apps are an effective alternative to therapists. However, this analysis does suggest that users find mental health apps fairly effective when it comes to improving general mental wellness.

To anyone hoping to maintain or improve their mental health as we continue to overcome the challenges presented to us by the pandemic, I would highly recommend giving one of these apps a try.