Yesterday’s News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models
Ivo Verhoeven†, Pushkar Mishra‡ and Ekaterina Shutova†
† ILLC, University of Amsterdam, ‡ MetaAI, London
This GitHub repository contains documentation for misinfo-general, and the code used for our accompanying paper. With it, we hope to introduce new data and evaluation methods for testing and training out-of-distribution generalisation in misinformation detection models.
Please direct your questions to: [email protected]
This paper introduces misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalisation. Misinformation changes rapidly, much more quickly than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation models need to be able to perform out-of-distribution generalisation, an understudied problem in existing datasets. We identify six axes of generalisation (time, event, topic, publisher, political bias and misinformation type) and design evaluation procedures for each. We also analyse some baseline models, highlighting how these fail important desiderata.
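To make the idea of an axis of generalisation concrete, the sketch below contrasts a uniform (random) split with a split that holds out all articles sharing a value along one axis. It is an illustration only, not the evaluation procedure from the paper: the function names, the record fields `publisher` and `year`, and the toy records are all hypothetical.

```python
import random


def uniform_split(articles, test_fraction=0.2, seed=0):
    """In-distribution baseline: assign articles to train/test at random."""
    rng = random.Random(seed)
    shuffled = articles[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def held_out_split(articles, axis, held_out_values):
    """Out-of-distribution split: articles with a held-out `axis` value go to test."""
    held_out = set(held_out_values)
    train = [a for a in articles if a[axis] not in held_out]
    test = [a for a in articles if a[axis] in held_out]
    return train, test


# Toy records; the field names are hypothetical, not the dataset's schema.
articles = [
    {"publisher": "A", "year": 2017},
    {"publisher": "A", "year": 2018},
    {"publisher": "B", "year": 2017},
    {"publisher": "C", "year": 2018},
]

# No publisher seen during training appears in the test set.
train, test = held_out_split(articles, axis="publisher", held_out_values={"C"})
```

The same helper can hold out a year instead by passing `axis="year"`, which gives a simple time-based variant.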
/config/
various configuration YAML files
/data/
├── README_dataverse.md
│ the dataset card used for storing data on Harvard Dataverse
└── README_dataverse.md
the dataset card used for storing data on Hugging Face Hub
/scripts/
scripts for running experiments on a SLURM cluster
/src/
├── /misinfo_general/
│ utility code
└── *.py
top level scripts for training and evaluating misinformation models on misinfo-general
/env.yaml
conda environment used for local development
/env_snellius.yaml
conda environment used for training and evaluation on a SLURM cluster
We have released our data on two separate platforms: Hugging Face Hub and Harvard Dataverse. Both of these repositories require access requests before downloading is possible. We provide additional detail on their respective dataset cards.
The dataset is licensed under CC BY-NC-SA 4.0. This allows for sharing and redistribution, but requires attribution and sharing derivatives under similar terms. It does not permit commercial use.
On either repository, we provide the data as a set of .arrow files, which can be read using a variety of packages (we used datasets), and provide the publisher-level metadata in a duckdb database. Upon request, we can change the formatting of either the dataset or the metadata database.
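As a rough sketch of how the downloaded files might be read, the snippet below discovers the .arrow files in a local directory, opens each with the datasets package, and lists the tables in the duckdb metadata database. The function names and the directory layout are assumptions made for illustration; consult the dataset cards for the actual file names and schema.

```python
from pathlib import Path


def find_arrow_files(data_dir):
    """Return every .arrow file under data_dir, sorted for reproducibility."""
    return sorted(Path(data_dir).rglob("*.arrow"))


def load_arrow_files(data_dir):
    """Open each .arrow file as a Dataset, keyed by its file path."""
    # Imported lazily so file discovery works without `datasets` installed.
    from datasets import Dataset

    return {str(path): Dataset.from_file(str(path)) for path in find_arrow_files(data_dir)}


def list_metadata_tables(db_path):
    """List the tables stored in the publisher-level metadata database."""
    import duckdb

    # Open read-only so the downloaded database is never modified.
    con = duckdb.connect(str(db_path), read_only=True)
    try:
        return con.execute("SHOW TABLES").fetchall()
    finally:
        con.close()
```

Calling `list_metadata_tables` first is a cheap way to inspect the metadata before writing any queries against it.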
Because of the nature of the language it includes, misinfo-general contains texts that are toxic, hateful, or otherwise harmful to society if disseminated. Neither the dataset itself nor any derivative of it, such as an LLM, should be released for non-research purposes. The texts themselves might also be copyrighted by their original publishers.
We have deliberately removed all social media content, and all hyperlinks to such content. We consider such content personally identifiable information (PII), with limited use in misinformation classification beyond author profiling. Such applications are fraught with ethical problems, and likely only induce overfitting in text-based classification.
The development environment is stored as a conda-readable YAML file in ./env.yaml. The training environment, used on the Snellius supercomputer, can be found in ./env_snellius.yaml.
For configuration, we used Hydra. The configuration files may be found in ./config. All scripts in /src/ can be run from the command line, using the Hydra syntax. For example,
python src/train_uniform.py \
fold=0 \
year=2017 \
seed=942 \
model_name='microsoft/deberta-v3-base' \
data.max_length=512 \
batch_size.tokenization=1024 \
batch_size.train=64 \
batch_size.eval=128 \
++trainer.kwargs.fp16=true \
++trainer.kwargs.use_cpu=false \
++trainer.memory_metrics=false \
++trainer.torch_compile=false \
++optim.patience=5 \
data_dir=$LOCAL_DATA_DIR \
disable_progress_bar=true
uses the data stored at $LOCAL_DATA_DIR to train a uniform split model on the 2017 data iteration.
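The overrides in the command above follow Hydra's `key=value` syntax, where a `++` prefix adds a key or overrides it if it already exists. When the same script has to be launched for several folds, the command strings can be generated rather than written by hand. The helper below is a minimal sketch of that idea: `build_command` and `sweep_folds` are hypothetical names, the number of folds is an assumption, and only a subset of the overrides shown above is included.

```python
import shlex


def build_command(script, overrides, forced=None):
    """Format a Hydra command line.

    `overrides` are emitted as key=value; `forced` are emitted as ++key=value.
    """
    args = ["python", script]
    args += [f"{key}={value}" for key, value in overrides.items()]
    args += [f"++{key}={value}" for key, value in (forced or {}).items()]
    # shlex.join quotes any argument that is not safe to pass to a shell as-is.
    return shlex.join(args)


def sweep_folds(folds, data_dir):
    """Build one training command per fold, reusing the settings shown above."""
    base = {
        "year": 2017,
        "seed": 942,
        "model_name": "microsoft/deberta-v3-base",
        "data_dir": data_dir,
    }
    forced = {"trainer.kwargs.fp16": "true", "optim.patience": 5}
    return [
        build_command("src/train_uniform.py", {"fold": fold, **base}, forced)
        for fold in folds
    ]


# The fold count and data path are placeholders for illustration.
for command in sweep_folds(range(3), "/path/to/data"):
    print(command)
```

Because the arguments are shell-quoted, an environment variable such as `$LOCAL_DATA_DIR` would be passed literally rather than expanded, so resolve it to a concrete path before calling the helper.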
@misc{verhoeven2024yesterdaysnewsbenchmarkingmultidimensional,
title={Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models},
author={Ivo Verhoeven and Pushkar Mishra and Ekaterina Shutova},
year={2024},
eprint={2410.18122},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2410.18122},
}