Yesterday’s News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models

Ivo Verhoeven†, Pushkar Mishra‡ and Ekaterina Shutova†
† ILLC, University of Amsterdam, ‡ Meta AI, London

arXiv: https://arxiv.org/abs/2410.18122
Dataset on Hugging Face Hub | Dataset on Harvard Dataverse

This GitHub repository contains documentation for misinfo-general and the code used for our accompanying paper. With it, we hope to introduce new data and evaluation methods for testing and training misinformation detection models for out-of-distribution generalisation.

Please direct your questions to: [email protected]

Abstract

This paper introduces misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalisation. Misinformation changes rapidly, much quicker than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation models need to be able to perform out-of-distribution generalisation, an understudied problem in existing datasets. We identify 6 axes of generalisation—time, event, topic, publisher, political bias, misinformation type—and design evaluation procedures for each. We also analyse some baseline models, highlighting how these fail important desiderata.

Structure

/config/
    various configuration YAML files
/data/
    ├── README_dataverse.md
    │       the dataset card used for storing data on Harvard Dataverse
    └── README_huggingface.md
            the dataset card used for storing data on Hugging Face Hub
/scripts/
    scripts for running the various experiments on a SLURM cluster
/src/
    ├── /misinfo_general/
    │       utility code
    └── *.py
            top level scripts for training and evaluating misinformation models on misinfo-general
env.yaml
    conda environment used for local development
env_snellius.yaml
    conda environment used for training and evaluation on a SLURM cluster

Data

License: CC BY-NC-ND 4.0

We have released our data on two separate platforms: Hugging Face Hub and Harvard Dataverse. Both of these repositories require access requests before downloading is possible. We provide additional detail on their respective dataset cards.

The dataset is licensed under CC BY-NC-ND 4.0. This allows for sharing and redistribution, but requires attribution and does not permit commercial use or the distribution of derivative works.

On either repo, we provide the data as a set of .arrow files, which can be read using a variety of packages (we used Hugging Face datasets), and provide the publisher-level metadata in a DuckDB database. Upon request, we can change the formatting of either the dataset or the metadata database.
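For example, a minimal sketch of reading a locally downloaded split and the publisher metadata could look like the following. The file paths below are placeholders, not the actual file names shipped in the release:

# Sketch: reading a released .arrow split with Hugging Face datasets and
# inspecting the DuckDB metadata database. Paths are hypothetical placeholders.
from datasets import Dataset
import duckdb

# Each split is stored as an Arrow file and can be memory-mapped directly.
train = Dataset.from_file("data/2017/train.arrow")  # placeholder path
print(train.features)

# Publisher-level metadata is stored in a DuckDB database.
con = duckdb.connect("data/publisher_metadata.duckdb", read_only=True)  # placeholder path
print(con.sql("SHOW TABLES"))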

Content

Because of the nature of the language it includes, misinfo-general contains texts that are toxic, hateful, or otherwise harmful to society if disseminated. Neither the dataset itself nor any derivatives of it (such as LLMs trained on it) should be released for non-research purposes. The texts themselves might also be copyrighted by their original publishers.

We have deliberately removed all social media content, and all hyperlinks to such content. We consider such content personally identifiable information (PII), with limited use in misinformation classification beyond author profiling. Such applications are fraught with ethical problems, and likely only induce overfitting in text-based classification.

Code & Environment

The development environment is stored as a conda-readable YAML file in ./env.yaml. The training environment, used on the Snellius supercomputer, can be found in ./env_snellius.yaml.

For configuration, we used Hydra. The configuration files can be found in ./config. All top-level scripts in ./src/ can be run from the command line using the Hydra override syntax. For example,

python src/train_uniform.py \
    fold=0 \
    year=2017 \
    seed=942 \
    model_name='microsoft/deberta-v3-base' \
    data.max_length=512 \
    batch_size.tokenization=1024 \
    batch_size.train=64 \
    batch_size.eval=128 \
    ++trainer.kwargs.fp16=true \
    ++trainer.kwargs.use_cpu=false \
    ++trainer.memory_metrics=false \
    ++trainer.torch_compile=false \
    ++optim.patience=5 \
    data_dir=$LOCAL_DATA_DIR \
    disable_progress_bar=true

uses the data stored at $LOCAL_DATA_DIR to train a uniform split model on the 2017 data iteration.
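
Hydra maps these dotted command-line overrides onto a nested configuration object inside the script. The sketch below illustrates the general pattern, assuming a Hydra entry point; the config name and the specific fields accessed are illustrative assumptions rather than the exact layout used in this repository.

# Sketch: how a Hydra-configured script typically consumes overrides.
# The config_name and the fields accessed below are illustrative assumptions.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../config", config_name="train_uniform", version_base=None)
def main(cfg: DictConfig) -> None:
    # Dotted overrides such as `batch_size.train=64` appear as nested attributes,
    # and `++` overrides (e.g. `++trainer.kwargs.fp16=true`) are added if absent.
    print(OmegaConf.to_yaml(cfg))
    print(cfg.year, cfg.batch_size.train)

if __name__ == "__main__":
    main()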

Citation

@misc{verhoeven2024yesterdaysnewsbenchmarkingmultidimensional,
      title={Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models},
      author={Ivo Verhoeven and Pushkar Mishra and Ekaterina Shutova},
      year={2024},
      eprint={2410.18122},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2410.18122},
}
