Classifying Positive Results in Clinical Psychology Using Natural Language Processing

Our paper in the Zeitschrift für Psychologie evaluates SciBERT and random forest for classifying whether clinical psychology abstracts report exclusively positive results. Trained on 1,900+ annotated abstracts, SciBERT reaches 86% accuracy and generalizes to out-of-domain data. Applied to 20,000+ psychotherapy RCT abstracts (1990–2022), the model reveals an inverted-U trend: positive results rose until the early 2010s and then declined.

📄 Paper (ZfP) | 💻 Code & Data (GitHub) | 🤗 Model (HuggingFace) | 📝 Preregistration (OSF)

Motivation: Why classify positive results?

High rates of positive results are observed throughout the sciences. In psychology, rates of 84–97% have been reported, with clinical psychology and psychiatry reaching up to 100%. Given the typically small effect sizes and sample sizes in psychological research, these rates cannot be fully explained by high statistical power, suggesting the influence of publication bias or questionable research practices.

Understanding trends in positive results matters for clinical psychology in particular: biased evidence on treatment efficacy can misinform clinical decisions, and mental health treatment costs represent a substantial share of health-economic expenditures.

Previous attempts to track positive results over time relied on either manual classification, which is accurate but resource-intensive, or rule-based algorithms that search for predefined n-grams like “significant difference” or p < .05. However, rule-based methods capture only a narrow set of expressions and ignore linguistic context entirely. As Ioannidis put it: “No fancy informatics script can sort out that mess. One still needs to read the papers.”

We asked: Can modern NLP models learn to classify positive results from annotated abstracts, and what do they reveal about trends in clinical psychology?

Method: From annotations to transformers

Annotation strategy

We annotated 1,978 English-language abstracts from clinical psychology researchers affiliated with German universities (2013–2022). Each abstract was classified into one of two categories: reporting exclusively positive results, or reporting at least one non-positive (mixed or negative) result.

Interrater reliability was solid ($\kappa = .768$, 88% agreement on a subset of 198 independently double-coded abstracts).
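Chance-corrected agreement of this kind is straightforward to compute; here is a minimal sketch using scikit-learn's `cohen_kappa_score` on invented ratings (the labels below are illustrative, not our annotation data):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for a double-coded subset:
# 1 = exclusively positive results, 0 = otherwise.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 1, 0, 1, 1, 1, 0, 0]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(rater_a, rater_b)
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"kappa = {kappa:.3f}, raw agreement = {agreement:.0%}")
```

Because kappa discounts chance agreement, it is always at or below the raw agreement rate, which is why we report both.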

Supervised learning pipelines

We evaluated two supervised models against three benchmarks:

SciBERT is a BERT variant pretrained on 1.14M scientific papers (3.1B tokens). It leverages the Transformer self-attention mechanism to interpret words in their full linguistic context. We fine-tuned SciBERT on our annotated abstracts using a grid search over learning rates and batch sizes.
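The grid search over learning rates and batch sizes can be sketched as follows. The grid values are assumptions for illustration (the paper's exact grid is not listed here), and the fine-tuning step is stubbed out because it requires downloading the `allenai/scibert_scivocab_uncased` checkpoint:

```python
from itertools import product

# Hypothetical hyperparameter grid (illustrative values only).
learning_rates = [1e-5, 2e-5, 3e-5]
batch_sizes = [16, 32]

def fine_tune_and_score(lr: float, bs: int) -> float:
    """Placeholder: fine-tune SciBERT ('allenai/scibert_scivocab_uncased'
    via the `transformers` library) with these hyperparameters and return
    validation accuracy. Stubbed out in this sketch."""
    raise NotImplementedError

# Enumerate every (learning rate, batch size) combination.
grid = list(product(learning_rates, batch_sizes))
# In a real run, keep the configuration with the best validation score:
# best_cfg = max(grid, key=lambda cfg: fine_tune_and_score(*cfg))
```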

Random Forest operates on bag-of-words features: text is lowercased, lemmatized, tokenized via CountVectorizer, and classified with a RandomForestClassifier.
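A minimal sketch of such a bag-of-words pipeline in scikit-learn, using invented toy abstracts (lemmatization, which our pipeline performs as a separate preprocessing step, is omitted here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Toy abstracts (invented for illustration); 1 = exclusively positive results.
abstracts = [
    "the intervention significantly reduced symptoms",
    "treatment outperformed the control condition on all outcomes",
    "no significant difference was found between groups",
    "the effect did not reach statistical significance",
]
labels = [1, 1, 0, 0]

# CountVectorizer lowercases and tokenizes; the random forest then
# classifies the resulting word-count vectors.
clf = make_pipeline(
    CountVectorizer(lowercase=True),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(abstracts, labels)
print(clf.predict(["symptoms improved significantly after treatment"]))
```

Unlike SciBERT, this representation discards word order, which is one reason the transformer retains an edge on context-dependent phrasing.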

Flowchart of SciBERT and Random Forest pipelines
Flowchart of the SciBERT and Random Forest classification pipelines. SciBERT uses a pretrained transformer fine-tuned on annotated abstracts; Random Forest uses bag-of-words features with hyperparameter optimization via cross-validation.

Benchmarks

We compared against three rule-based approaches:

  1. p-value algorithm: Classifies based on extracted p-values (p < .05 vs. p > .05) following De Winter & Dodou.
  2. Natural language indicator (NLI) algorithm: Classifies based on predefined n-grams like “significant difference” or “no significant difference”.
  3. Naive abstract length: A logistic regression using only word count as predictor.
Rule-based classification flowcharts
Algorithms for rule-based classification based on p-values and natural language indicators. When no relevant n-grams are detected, the algorithm falls back to the base rate in the training data.
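The two rule-based benchmarks reduce to a few pattern matches plus a base-rate fallback. A simplified sketch (the regexes below are illustrative stand-ins, not the paper's full rule sets):

```python
import re

# Illustrative patterns; the actual benchmarks use larger rule sets.
SIG_P = re.compile(r"p\s*<\s*0?\.05", re.IGNORECASE)
NONSIG_P = re.compile(r"p\s*>\s*0?\.05", re.IGNORECASE)
NEG_NLI = re.compile(r"no significant difference", re.IGNORECASE)
POS_NLI = re.compile(r"significant difference", re.IGNORECASE)

def rule_based_classify(abstract: str, base_rate: float = 0.5) -> float:
    """Return an estimated probability of exclusively positive results."""
    # Negative indicators are checked first, since "no significant
    # difference" also contains the positive n-gram.
    if NONSIG_P.search(abstract) or NEG_NLI.search(abstract):
        return 0.0
    if SIG_P.search(abstract) or POS_NLI.search(abstract):
        return 1.0
    # No p-values or indicator n-grams found: fall back to the
    # base rate observed in the training data.
    return base_rate
```

The fallback branch is where these benchmarks break down: as reported below, most abstracts contain neither p-values nor the predefined n-grams.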

Results

SciBERT outperforms all benchmarks

SciBERT achieved the highest accuracy across all evaluation sets: 86% on in-domain data and 85–88% on two out-of-domain validation sets (psychotherapy RCTs from non-German authors and from 1990–2012). Random Forest showed solid but lower performance (80–83%). The rule-based benchmarks performed near chance (47–57%).

Model performance comparison
Accuracy scores across in-domain (MAIN test) and out-of-domain (VAL1, VAL2) data. SciBERT (blue) consistently outperforms Random Forest and all rule-based benchmarks.

Why do rule-based approaches fail? Only 9% of abstracts in our data mentioned p-values and only 14% contained predefined NLIs, leaving 79% of abstracts where rule-based classifiers must resort to random guessing. SciBERT, by contrast, learns from the full vocabulary and linguistic context of the abstract.

The fine-tuned SciBERT model was deployed publicly as the NegativeResultDetector on HuggingFace.

We applied SciBERT to predict result types for 20,212 unannotated psychotherapy RCT abstracts spanning 1990–2022:

Longitudinal comparison of SciBERT predictions and rule-based approaches
Predicted proportions of positive results in psychotherapy RCTs (1990–2022). The SciBERT model (M3b, inverted-U) is compared with rule-based trend lines for p < .05, p > .05, and natural language indicators. Dots represent observed yearly proportions (n = 20,212).

The absence of an increase in the 1990s diverges from Fanelli’s cross-disciplinary finding of rising positive results. This may reflect early awareness of publication bias in clinical trials or the relatively strong funding for psychotherapy RCTs during that period. The decline after the early 2010s coincides with increased adoption of open science practices, following publications by Ioannidis and the Open Science Collaboration’s replication study, though the causal link remains speculative.

A breakpoint analysis placed the inflection point around 2011 rather than the hypothesized 2005, suggesting a time-lag effect: shifts in research culture take years to manifest in the published literature.
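One simple way to implement such a breakpoint search is segmented regression: fit two straight lines on either side of each candidate year and keep the split with the smallest squared error. The sketch below runs on synthetic inverted-U data with a kink at 2011; it illustrates the idea only, and the paper's analysis details may differ:

```python
import numpy as np

def fit_breakpoint(years, props, candidates):
    """Grid-search a single breakpoint: fit OLS lines before and after
    each candidate year and return the split with the lowest total SSE."""
    best_year, best_sse = None, np.inf
    for bp in candidates:
        left = years <= bp
        sse = 0.0
        for mask in (left, ~left):
            if mask.sum() < 2:
                break  # segment too short to fit a line
            coef = np.polyfit(years[mask], props[mask], 1)
            resid = props[mask] - np.polyval(coef, years[mask])
            sse += resid @ resid
        else:
            if sse < best_sse:
                best_year, best_sse = bp, sse
    return best_year

# Synthetic yearly proportions: rise until 2011, then decline (illustrative).
years = np.arange(1990, 2023)
props = np.where(years <= 2011,
                 0.70 + 0.01 * (years - 1990),
                 0.90 - 0.02 * (years - 2012))
print(fit_breakpoint(years, props, range(1995, 2018)))
```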

Discussion and implications

For metascience: Machine learning—especially transformer-based models like SciBERT—can substantially advance the automation of research synthesis tasks. Where rule-based methods fail because of the heterogeneity of result reporting, supervised NLP models learn from the full linguistic context. This opens the door to large-scale, systematic monitoring of positive results across disciplines.

For clinical psychology: The decline in exclusively positive results since the early 2010s may coincide with the adoption of open science practices such as registered reports and preregistration, though it could also stem from changes in statistical power, effect sizes, or reporting norms. Whether the trend continues remains to be seen.

Limitations: We classified abstracts (not full texts) into a binary scheme—a simplification that may obscure nuances. Larger language models with longer context windows could enable more fine-grained annotation strategies in the future. Furthermore, high rates of positive results do not necessarily imply publication bias; they could also reflect high statistical power or true effects.

Deployment: The fine-tuned SciBERT model is publicly available as the NegativeResultDetector for researchers to classify their own data. Code for training, evaluation, and inference is on GitHub.

If you find this work useful for your research, please consider citing our paper:

@article{schiekiera2024classifying,
  title={Classifying Positive Results in Clinical Psychology
         Using Natural Language Processing},
  author={Schiekiera, Louis and Diederichs, Jonathan
          and Niemeyer, Helen},
  journal={Zeitschrift f{\"u}r Psychologie},
  year={2024},
  doi={10.1027/2151-2604/a000563}
}

📄 Read the paper here

💻 View the code and data on GitHub

🤗 View the model on HuggingFace

📝 View the preregistration on OSF