Dataset Restricted Access
Johannes Kiesel;
Martin Potthast;
Maria Mestre;
Rishabh Shukla;
Benno Stein;
David Corney;
Emmanuel Vincent;
Payam Adineh
Trial dataset for the SemEval 2019 Task 4: Hyperpartisan News Detection.
The dataset contains ~200,000 articles: ~100,000 hyperpartisan and ~100,000 least biased. Each article is labeled with the overall bias of its publisher, as provided by BuzzFeed journalists or MediaBiasFactCheck.com.
The trial data is not fully cleaned. Due to an encoding error, some characters have been replaced by question marks. Some text is duplicated. Some tags have not been cleaned yet. Also, some articles may appear several times if they were published by several publishers. These errors will be fixed step by step for the final data.
The final article data (within the <article> element) will contain only the following tags: <p>, <q>, <a>. The <a>-tags will have an attribute "type" that is either "external" (in which case the "href"-attribute contains the URI as usual) or "internal" (in which case the "href"-attribute is empty or left out). We removed all URLs that link to the same domain as the article ("internal") to avoid biasing classifiers towards utilizing the domain of the article. You will find that this tag-cleaning has already succeeded for the vast majority of the articles in the current dataset, but unfortunately not for all of them.
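Since the tag-cleaning is not yet complete, it can be useful to flag articles that still deviate from the rules above. The following is a minimal sketch, not part of the official task tooling: it assumes each article is available as a well-formed XML string with <article> as its root element, and the function name `find_tag_violations` and the sample article are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tags that the final article data is announced to contain.
ALLOWED_TAGS = {"p", "q", "a"}


def find_tag_violations(article_xml):
    """Return human-readable violations of the tag rules described above."""
    root = ET.fromstring(article_xml)
    violations = []
    for element in root.iter():
        if element is root:
            continue  # the enclosing <article> element itself is allowed
        if element.tag not in ALLOWED_TAGS:
            violations.append(f"unexpected tag <{element.tag}>")
            continue
        if element.tag == "a":
            link_type = element.get("type")
            href = element.get("href", "")
            if link_type == "external":
                if not href:
                    violations.append("external link without href")
            elif link_type == "internal":
                if href:
                    violations.append("internal link with non-empty href")
            else:
                violations.append(f"<a> with unexpected type {link_type!r}")
    return violations


# Hypothetical article, for illustration only (not taken from the dataset).
sample = (
    "<article>"
    '<p>Text with <a type="external" href="https://example.org/">a link</a>, '
    '<a type="internal">a stripped link</a> and <q>a quote</q>.</p>'
    "<div>leftover markup</div>"
    "</article>"
)
print(find_tag_violations(sample))  # ['unexpected tag <div>']
```

An empty list means the article already follows the announced format. Articles affected by the encoding or markup errors mentioned above may not parse as XML at all, in which case `ET.fromstring` raises `ET.ParseError` and the article needs separate handling.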
You may request access to the files in this upload, provided that you fulfil the conditions below. The decision whether to grant/deny access is solely under the responsibility of the record owner.
Access is restricted to participants and organizers of the challenge for now. The data will be publicly available after the evaluation period.
Metric | All versions | This version |
---|---|---|
Views | 21,135 | 688 |
Downloads | 24,031 | 137 |
Data volume | 10.6 TB | 23.8 GB |
Unique views | 17,168 | 510 |
Unique downloads | 5,745 | 48 |