Yesterday’s News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models

Ivo Verhoeven^†, Pushkar Mishra^‡ and Ekaterina Shutova^†
† ILLC, University of Amsterdam, ‡ MetaAI, London

This GitHub repository contains documentation for misinfo-general and the code used for our accompanying paper. With it, we hope to introduce new data and evaluation methods for testing and training out-of-distribution generalisation in misinformation detection models.

Please direct your questions to: [email protected]

Abstract

This paper introduces misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalisation. Misinformation changes rapidly, much quicker than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation models need to be able to perform out-of-distribution generalisation, an understudied problem in existing datasets. We identify 6 axes of generalisation—time, event, topic, publisher, political bias, misinformation type—and design evaluation procedures for each. We also analyse some baseline models, highlighting how these fail important desiderata.

Structure

/config/
    various configuration YAML files
/data/
    ├── README_dataverse.md
    │       the dataset card used for storing data on Harvard Dataverse
    └── README_dataverse.md
            the dataset card used for storing data on Hugging Face Hub
/scripts/
    scripts for running the experiments on a SLURM cluster
/src/
    ├── /misinfo_general/
    │       utility code
    └── *.py
            top level scripts for training and evaluating misinformation models on misinfo-general
/env.yaml
    conda environment used for local development
/env_snellius.yaml
    conda environment used for training and evaluation on a SLURM cluster

Data

We have released our data on two separate platforms: Hugging Face Hub and Harvard Dataverse. Both of these repositories require access requests before downloading is possible. We provide additional detail on their respective dataset cards.

The dataset is licensed under CC BY-NC-SA 4.0. This allows for sharing and redistribution, but requires attribution and sharing derivatives under similar terms. It does not permit commercial use-cases.

On either repository, we provide the data as a set of .arrow files. These can be read with a variety of packages, although we used datasets. The publisher-level metadata is provided as a DuckDB database. Upon request, we can change the format of either the dataset or the metadata database.
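
As a rough illustration of reading the released files, the sketch below locates the .arrow files in a local download, memory-maps one of them with datasets, and lists the tables in the metadata database with duckdb. The helper names (find_arrow_files, load_articles, list_metadata_tables) are hypothetical and not part of this repository. The directory layout, the database filename, and the assumption that the .arrow files were written by datasets are also guesses; the dataset cards are the authoritative description.

```python
from pathlib import Path


def find_arrow_files(data_dir):
    """Return every .arrow file below data_dir, sorted for reproducibility."""
    return sorted(Path(data_dir).rglob("*.arrow"))


def load_articles(arrow_path):
    """Memory-map one .arrow file as a datasets.Dataset.

    Assumes the file was written by the `datasets` library.
    """
    from datasets import Dataset  # lazy import; needs `pip install datasets`

    return Dataset.from_file(str(arrow_path))


def list_metadata_tables(db_path):
    """Open the metadata database read-only and return its table names."""
    import duckdb  # lazy import; needs `pip install duckdb`

    con = duckdb.connect(str(db_path), read_only=True)
    try:
        # SHOW TABLES yields one row per table, with the name in column 0
        return [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    finally:
        con.close()
```

With a local copy of the data, find_arrow_files("path/to/download") gives the list of files to pass to load_articles one at a time. The table and column names in the metadata database are not assumed here, which is why the sketch only lists the tables rather than querying them.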

Content

Because of the nature of the language it includes, misinfo-general contains texts that are toxic, hateful, or otherwise harmful to society if disseminated. The dataset itself or any derivative formats of it, like LLMs, should not be released for non-research purposes. The texts themselves might also be copyrighted by their original publishers.

We have deliberately removed all social media content, and all hyperlinks to such content. We consider such content personally identifiable information (PII), with limited use in misinformation classification beyond author profiling. Such applications are fraught with ethical problems, and likely only induce overfitting in text-based classification.

Code & Environment

The development environment is stored as a conda-readable YAML file in ./env.yaml. The training environment, used on the Snellius supercomputer, can be found in ./env_snellius.yaml.

For configuration, we used Hydra. The configuration files may be found in ./config. All scripts in /src/ can be run from the command line, using the Hydra override syntax. For example,

python src/train_uniform.py \
    fold=0 \
    year=2017 \
    seed=942 \
    model_name='microsoft/deberta-v3-base' \
    data.max_length=512 \
    batch_size.tokenization=1024 \
    batch_size.train=64 \
    batch_size.eval=128 \
    ++trainer.kwargs.fp16=true \
    ++trainer.kwargs.use_cpu=false \
    ++trainer.memory_metrics=false \
    ++trainer.torch_compile=false \
    ++optim.patience=5 \
    data_dir=$LOCAL_DATA_DIR \
    disable_progress_bar=true

uses the data stored at $LOCAL_DATA_DIR to train a uniform split model on the 2017 data iteration.
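
In Hydra's override syntax, a plain key=value changes a value that already exists in the configuration, while the ++ prefix overrides the key if it exists and adds it otherwise. To repeat the run above over several folds, the command lines can be generated programmatically. The following is a minimal sketch: build_train_commands is a hypothetical helper, not part of this repository, and the existence of a fold other than 0 is an assumption made purely for illustration.

```python
import itertools
import shlex


def build_train_commands(years, folds, seed=942, data_dir="$LOCAL_DATA_DIR",
                         script="src/train_uniform.py"):
    """Build one argument list per (year, fold) pair using Hydra overrides."""
    commands = []
    for year, fold in itertools.product(years, folds):
        overrides = [
            f"fold={fold}",        # plain override of an existing config key
            f"year={year}",
            f"seed={seed}",
            f"data_dir={data_dir}",
            "++optim.patience=5",  # '++' overrides the key, or adds it if absent
        ]
        commands.append(["python", script, *overrides])
    return commands


if __name__ == "__main__":
    # Fold 1 is hypothetical; only fold=0 and year=2017 appear in the README.
    for cmd in build_train_commands(years=[2017], folds=[0, 1]):
        # $LOCAL_DATA_DIR is left unquoted so that the shell can expand it
        print(" ".join(cmd))
        # for direct execution, pass a real path as data_dir and use
        # subprocess.run(cmd, check=True), or shlex.join(cmd) for a safe string
```

Printing the commands first makes it easy to inspect them, or to paste them into a SLURM job script, before launching anything.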

Citation

@misc{verhoeven2024yesterdaysnewsbenchmarkingmultidimensional,
      title={Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models},
      author={Ivo Verhoeven and Pushkar Mishra and Ekaterina Shutova},
      year={2024},
      eprint={2410.18122},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2410.18122},
}
