Patient communities are a rich source of information about reported treatment effectiveness and side effects. Patients use Reddit, Twitter, and online forums to learn what treatments other patients are trying, and to read about reported effectiveness, side effects, and personal experiences with new treatments. But this process is ad hoc and time-consuming, and it doesn’t provide a balanced picture: what surfaces is biased towards recency and upvotes.
We investigated whether large language models (LLMs) could turn this unstructured text into a structured dataset, using data from Reddit as a proof of concept. We were primarily interested in providing a representative overview of the most popular and most effective treatments, both across the community and for specific symptoms. To achieve this, we used an LLM to analyze a large sample of posts and comments on the r/covidlonghaulers and r/longcovid subreddits and determine the self-reported impacts of treatments.
For a given Reddit post, we observed an overall accuracy of 92% for a) identifying that someone reported trying a treatment, b) correctly classifying the treatment(s) they tried, and c) assessing the magnitude of benefit and side effects, with an average error of ±0.34 out of 5. When we add the additional task of assessing the presence and magnitude of benefit or worsening for each specific symptom, accuracy drops to 89%, with an average error of ±0.35 out of 5.
We think that these results are sufficient to provide a useful overview of the most popular and most effective treatments reported within a given online patient community.
1. Ingest data from Reddit
We used Reddit’s API to request the text of publicly available posts and comments from relevant subreddits. For Long Covid, there are two major subreddits: r/LongCovid (15k members) and r/covidlonghaulers (47k members). We were able to import a total of about 300k posts and comments to comprise the full dataset for this analysis. Here is a screenshot of a post that illustrates the type of useful information that’s contained in these comments - compelling and detailed descriptions of personal experiences trying supplements, drugs, and other treatments:
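As a sketch of this ingestion step: the actual fetching requires Reddit’s authenticated API, so here we show only the pure post-processing, with an assumed, simplified JSON-like shape for a post and its comment tree (not Reddit’s exact schema).

```python
# Sketch of the ingestion step. Actual fetching requires Reddit's
# authenticated API (e.g. via an OAuth client); the shape of `post` below
# is an assumed, simplified structure, not Reddit's exact schema.
def flatten_post(post: dict) -> list[dict]:
    """Flatten one post and its nested comment tree into {id, author, text} rows."""
    rows = [{
        "id": post["id"],
        "author": post["author"],
        "text": post["title"] + "\n" + post.get("body", ""),
    }]
    stack = list(post.get("comments", []))
    while stack:
        c = stack.pop()
        rows.append({"id": c["id"], "author": c["author"], "text": c["body"]})
        stack.extend(c.get("replies", []))  # descend into nested replies
    return rows
```

Each resulting row is one unit of text for the LLM to analyze, which is how posts and comments end up counted together in the ~300k total.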
2. Classify whether the post author tried any treatment
Many posts are not about treatments at all, and some posts mention treatments in other contexts, for example in reference to research studies. Our first step was to keep only those reports in which the author mentioned actually trying a treatment themselves. This filter removed 92% of posts, leaving a set of 23k candidate “treatment reports”.
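A minimal sketch of this filtering step, with a hypothetical prompt (not the exact wording we used) and a conservative parser for the model’s reply:

```python
# Hypothetical filtering prompt (illustrative; not the exact wording used).
FILTER_PROMPT = """You are reviewing a post from a Long Covid community.
Answer YES only if the author reports personally trying a treatment
(a drug, supplement, diet, activity, etc.). Mentions of studies, questions
about treatments, or other people's experiences are NO.

Post:
{post}

Answer (YES or NO):"""

def parse_filter_answer(raw: str) -> bool:
    """Parse the model's reply conservatively: anything but a clear YES is False."""
    return raw.strip().upper().startswith("YES")
```

Treating any ambiguous reply as a “no” is one simple way to bias the filter towards precision over recall.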
3. Classify which treatment(s) the post author tried
With the correct prompt, the LLM was able to robustly understand the concept of a “treatment”, which can include drugs, supplements, diets, activities, and more, and to produce standardized names for each, which allowed downstream aggregation. This handles common variations of treatments, including generic vs. brand names for drugs and common abbreviations like “natto” for “nattokinase”. Each report details one or more treatments that the author tried.
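Downstream aggregation depends on these standardized names. In our pipeline the LLM itself emits the standardized name; a lookup table like the one below (with assumed example entries beyond “natto”) is just a simple way to picture what that standardization does:

```python
# Illustrative alias table; the "natto" entry comes from the text above,
# the others are assumed examples of generic-vs-brand and abbreviation cases.
ALIASES = {
    "natto": "Nattokinase",
    "nattokinase": "Nattokinase",
    "ldn": "Low Dose Naltrexone",
    "flonase": "Fluticasone",
}

def canonical_treatment(name: str) -> str:
    """Map a raw mention to a standardized treatment name."""
    key = name.strip().lower()
    return ALIASES.get(key, name.strip().title())
```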
4. Assess the magnitude of overall improvement and side effects
The next task is to understand the benefits and side effects. Long Covid is a complex and poorly characterized disease with a wide variety of possible symptoms. We were forced to create a pragmatic measure of impact that matches the way that real patients convey the progression of their condition. Here is the rubric we used for the scale for “Degree of Overall Benefit”:
- None - Effects are absent (no effect). Look for formulations like: no effect, did nothing, didn't notice anything, no difference, didn't do anything, didn't help
- Mild - Positively describes some effect of the treatment, but the benefits are relatively minor. Look for formulations like: a bit, a little, somewhat, pretty good, I think
- Moderate - Describes a benefit without any further qualifiers
- Significant - Describes a benefit with some qualifiers indicating very positive changes, praise or major improvements. Look for formulations like: (helped) the most, very, 80%, really helped
- Very Significant - Describes a benefit so astounding as to be transformational, a cure, complete recovery, radical. Look for formulations like: gave my life back, back to normal, old self, magical, miraculous, perfect
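Encoded as the 1-5 scale used in the results (1 = None through 5 = Very Significant, matching the “5 (Very Significant)” and “2 (Mild)” examples below), the rubric and its cue phrases might look like:

```python
# The rubric above as a 1-5 scale, matching the "out of 5" scores reported.
BENEFIT_SCALE = {
    1: "None",
    2: "Mild",
    3: "Moderate",
    4: "Significant",
    5: "Very Significant",
}

# Cue phrases from the rubric, usable as few-shot hints in the prompt.
# "Moderate" has no cues: it is a benefit stated without qualifiers.
BENEFIT_CUES = {
    1: ["no effect", "did nothing", "no difference", "didn't help"],
    2: ["a bit", "a little", "somewhat", "pretty good", "I think"],
    4: ["the most", "very", "80%", "really helped"],
    5: ["gave my life back", "back to normal", "old self", "miraculous"],
}
```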
This scale benefits from high coverage (essentially all posts can be mapped to it), while still capturing the difference between small, medium, and large effects. We used a similar scale for the severity of side effects. When prompted with this rubric, the LLM had an average agreement with the human experts of 0.92, with an average error of ±0.3 (out of 5).
Here is an example of a “5 (Very Significant)” improvement:
Here is an example of a “2 (Mild)” improvement:
5. Extract specific symptoms and side effects, and the degree of benefit or worsening for each
It is also valuable to understand which specific symptoms were improved and which side effects were experienced. Again, with the correct prompting, the LLM was able to identify symptoms and side effects and to output standardized classes for each, alongside the magnitudes of benefit or worsening described above.
Here is an example report, including the inferred structured data. Here, we were able to correctly identify that Metoprolol had “Very Significant” benefit overall and for anxiety, and had no reported side effects.
6. Extract details about how the treatment was taken, including dosage, brand, or timing
Posts often include timing, side effects, dosage, and other detailed information. When available, we also extract this information from posts to provide further potential for aggregation, for example, to help determine the most effective dose or brand for Nattokinase.
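Putting steps 3-6 together, one possible schema for a single structured treatment report looks like this; the field names are illustrative assumptions, since the pipeline’s exact output format is not shown here:

```python
from __future__ import annotations
from dataclasses import dataclass, field

# One possible schema for a structured treatment report (field names are
# illustrative assumptions, not the pipeline's exact output format).
@dataclass
class TreatmentReport:
    user: str
    treatment: str                       # standardized name, e.g. "Nattokinase"
    benefit: int                         # overall benefit, 1-5
    side_effect_severity: int            # overall side effect severity, 1-5
    symptoms: dict[str, int] = field(default_factory=dict)   # symptom -> 1-5
    side_effects: list[str] = field(default_factory=list)
    dosage: str | None = None            # e.g. "100 mg twice daily", if stated
    brand: str | None = None
    timing: str | None = None            # e.g. "before bed", if stated
```

The optional fields stay empty when the post doesn’t mention them, so only reports that actually state a dosage or brand contribute to those aggregations.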
Once we had a structured dataset of treatment reports, we also needed to de-duplicate reports submitted by the same user for the same treatment, since some people share their experiences many times. In these cases we select the last available treatment report for a given user + treatment combination, as this likely represents the most up-to-date information.
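This de-duplication rule can be sketched as follows, assuming the report list is ordered from oldest to newest:

```python
# De-duplication sketch: keep only the last report per (user, treatment)
# pair, assuming the input list is ordered chronologically (oldest first).
def dedupe_last(reports: list[dict]) -> list[dict]:
    latest = {}
    for r in reports:  # later reports overwrite earlier ones
        latest[(r["user"], r["treatment"])] = r
    return list(latest.values())
```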
Validation was complex for this task, as it has many sub-components and a very large set of potential classes. We created a reference dataset by hand, with 1000 samples to evaluate the simpler task (correct treatment, correct benefit and side effect magnitudes) and 114 samples to evaluate the more complex task (extracting all symptoms, side effects, dosages, types, etc.). This dataset was iteratively improved: sometimes, when the human expert and the LLM disagreed, we would determine that the LLM was actually correct. The dataset was also intentionally “spiked” with difficult cases, to ensure the accuracy numbers provide a conservative estimate of true performance.
One tradeoff we faced was between recall and precision. It’s much worse to be presented with a report that’s ostensibly about the supplement “Nattokinase” when it’s not about a treatment at all than it is to miss some fraction of Nattokinase posts. For user experience reasons, and for the accuracy of the results, we biased towards higher precision and lower recall whenever possible. For treatment classification, our recall is 0.709 and our precision is 0.975. When we refer to “accuracy” in this post, we actually mean “precision” (true positives / (true positives + false positives)).
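For reference, the two metrics, with illustrative counts chosen to reproduce the reported rates (the actual true/false positive counts are not given here):

```python
# Precision and recall, with illustrative counts that reproduce the
# reported rates (0.975 and 0.709); actual counts are not given.
def precision(tp: int, fp: int) -> float:
    """Of everything labeled a treatment report, what fraction really is one?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all real treatment reports, what fraction did we catch?"""
    return tp / (tp + fn)
```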
Model and Prompt Selection
We tested a number of popular, cutting-edge LLMs with the above dataset as a benchmark. We ultimately selected the latest version of GPT-4, with no additional fine-tuning. We iterated on the prompt to include a detailed description of the task, some examples, and instructions about the desired structured output format.
Our primary task, accurately describing the benefits and side effects of a given treatment, comprises steps 2-4 above. For this overall task, we observed 92% accuracy. For step 4 specifically, the average error in estimating the magnitude of benefits and side effects was ±0.34 out of 5.
Adding the additional difficulty of step 5, we accurately classify specific symptoms and side effects 96% of the time, and assess their magnitudes with an average error of ±0.35 out of 5.
Given this structured, post-level data, we can aggregate the results to give a summary of the information provided across the community.
Top 5 most popular treatments:
- Antihistamines (n=1900)
- Magnesium (n=900)
- NSAIDs, including Aspirin (n=750)
- Corticosteroids, including Flonase and Prednisone (n=710)
- Vitamin D (n=710)
Treatments with the highest estimated effectiveness:
- Carnivore Diet (3.28 avg. benefit, n=47)
- Dairy Cessation (3.21, n=38)
- Rest (3.16, n=122)
- Benfotiamine (3.12, n=33)
- Psilocybin (3.12, n=66)
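The aggregation behind tables like these can be sketched as follows; the minimum-report cutoff is an assumed guard against tiny samples, not a documented parameter:

```python
from collections import defaultdict

# Aggregation sketch: count reports per treatment and average the 1-5
# benefit score. The min_n cutoff is an assumption, not a documented value.
def summarize(reports: list[dict], min_n: int = 30) -> list[tuple[str, float, int]]:
    sums, counts = defaultdict(float), defaultdict(int)
    for r in reports:
        sums[r["treatment"]] += r["benefit"]
        counts[r["treatment"]] += 1
    rows = [(t, sums[t] / counts[t], counts[t])
            for t in counts if counts[t] >= min_n]
    return sorted(rows, key=lambda row: row[1], reverse=True)  # best first
```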
Web application for data exploration
All users of the Eureka Health online community can access and explore this data, to view aggregate results, and more detailed information and posts about each treatment. Here’s a screenshot of the treatment page for Stellate Ganglion Block:
You can learn more and sign up to use the application at https://eurekahealth.com/.
Effective treatments or cures for Long Covid are still years away from approval, and with tens of millions suffering today, there’s a dire need for better, more up-to-date data on promising treatment candidates for consideration by patients, clinicians, and researchers. Patients deeply understand their conditions and their responses to treatments, and they are often at the forefront of knowledge and experience with new treatments. By aggregating their experiences, we can provide a more representative view into this important source of information.
While the accuracy of this approach isn’t perfect, we feel that it’s sufficient to provide a high-level overview of what’s out there and what’s working for other patients. It would be false precision to infer based on this data that a Stellate Ganglion Block works better than Glycine, but perhaps both are worth considering and discussing with a clinician as potential options.
For transparency, we provide all raw data alongside the inferred information at eurekahealth.com, so users can read through the detailed narratives for more context, and get a feel of the limitations of the model themselves.
Since we were optimizing for precision, the false negative rate is high, at 14%. One unknown is whether the treatments that the model correctly classified have the same distribution of benefits and side effects as those it “missed”. If the model is systematically more likely to recognize a treatment mention in more positive reports, that would push the average scores higher than reality. However, if this happened consistently across all treatments, the relative rankings of treatments would still be valid.
Dataset bias and limitations
The underlying dataset has some important characteristics and caveats. One is that every treatment has many more positive than negative treatment reports. Perhaps it’s true that all of these treatments work more often than not, but it’s also likely that people are more inclined to share their successes than their failures.
Another caveat is that users often report the impact of many treatments at once. In the worst case, users will even say that they substantially improved, and list out everything they’ve taken or tried over the previous year. This makes attributing the effect to a specific treatment difficult. Today, we include the effect of all treatments reported together in the total for each of the individual treatments; in the future, it would be prudent to weigh reports by the degree of specificity, and consider excluding certain reports entirely.
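One simple version of the specificity weighting suggested above (an assumption, not something the current system does) is to down-weight each treatment in a multi-treatment report by the number of treatments listed:

```python
# Assumed weighting scheme (not what the system currently does): a report
# listing k treatments contributes weight 1/k to each, so reports that
# attribute improvement to a single treatment count the most.
def report_weight(num_treatments: int) -> float:
    return 1.0 / max(num_treatments, 1)
```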
A further, clear limitation is that it’s difficult to accurately self-report the effects of a treatment after trying it. If there was a benefit, it could be due to the placebo effect, especially after reading other users’ positive reports for that treatment. There could have been a shift in the user’s baseline condition independent of the treatment. The user might not have taken the treatment “properly”, starting and stopping sporadically, or starting at too high a dose. In the future, we can try to detect and filter for some of these conditions, but this data will always be noisier than data obtained in a clean clinical study. However, reports of large effects are meaningful - a significant improvement after years of being severely ill is easy to notice and report - and the distribution of these “noisy” behaviors should be consistent across treatments, meaning that relative comparisons are still generally meaningful.
This is a proof of concept, and there are many ways to build on and expand this work in the future.
We will continue to improve model accuracy, potentially by applying newer, more performant LLMs, improving the prompt, and/or applying other techniques like reinforcement learning from human feedback.
We will also find better ways to validate our results against trustworthy, externally available reference data. There aren’t many studies showing the average degree of benefit of common Long Covid treatments yet, unfortunately, but this will be an important validation whenever possible. One compelling option here is to apply the method to another disease with more quantitative outcome measurements (e.g. weight loss or A1C levels) for easier comparison to the literature.
It’s also exciting to use this as a starting point for a continuously improving system based on user-provided feedback. As the Eureka community grows, we can provide new ways for people to contribute feedback and labels, which, at scale, can help create a much more accurate and performant system.