This is the online appendix for the paper “Political Text Scaling Meets Computational Semantics” [working draft], which I wrote in collaboration with Goran Glavaš, Simone Paolo Ponzetto and Heiner Stuckenschmidt.
The paper introduces SemScale, a text scaling tool that takes the semantics of the documents under study into account. If you want to test it, you can try our demo; for research purposes, we recommend using our command-line tool, implemented in Python.
In our work, we obtained a dataset from the European Parliament website. It comprises all speeches given in English, German, French, Italian or Spanish by members of the European Parliament (MEPs) during the 5th and 6th legislative terms, each translated into the other four languages. For each legislative term and language, we concatenated all speeches by members of the same national party into a single text document, with the aim of capturing the overall political position of the party during that term. The party-document ids match those used by the Chapel Hill expert survey. In this folder, you will find, for each legislative term, the concatenated party speeches in each language, together with the Chapel Hill positions of the parties during that term (scaled between 0 and 1) on the European integration dimension and on overall left-right ideology.
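The construction of the party documents can be illustrated with a minimal sketch. It assumes the speeches are available as (party id, text) pairs, which is a hypothetical input shape, and it uses min-max rescaling to bring positions into the range 0 to 1; the rescaling scheme actually applied to the Chapel Hill scores is not specified here, so treat that part as an assumption.

```python
from collections import defaultdict


def concatenate_by_party(speeches):
    """Join all speeches of the same party into one text document.

    `speeches` is an iterable of (party_id, text) pairs (hypothetical input shape).
    """
    by_party = defaultdict(list)
    for party_id, text in speeches:
        by_party[party_id].append(text)
    return {party: " ".join(texts) for party, texts in by_party.items()}


def rescale_unit_interval(positions):
    """Min-max rescale party positions to [0, 1] (assumed scheme, see lead-in)."""
    low, high = min(positions.values()), max(positions.values())
    if high == low:
        raise ValueError("all positions are identical; cannot rescale")
    return {party: (value - low) / (high - low) for party, value in positions.items()}
```

Speeches are joined in the order in which they are read, so the resulting document depends on the input order only in the position of the texts, not in their content.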
Pre-processing tools adopted
For part-of-speech tagging, lemmatization and named entity recognition we used spaCy. The tool is easy to use and well documented; you only need to download the model for the language under study.
For entity linking we adopted DBpedia Spotlight, as it is currently the only tool that openly supports all five languages under study. Unlike spaCy, Spotlight is not entirely straightforward to use: it is not extensively documented, and to run it on your machine you need to 1) install Docker, 2) load the models, 3) send the texts to be annotated via curl. Nevertheless, if you work on English, German or Italian you could use another entity linker, TagMe, which is simpler to integrate into your pipeline.
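If you prefer to send the texts from Python rather than from curl, the request could look like the sketch below. It assumes a Spotlight container listening on `http://localhost:2222/rest/annotate` (the port depends entirely on how you start Docker) and a confidence threshold of 0.5; both values are illustrative and are not the settings used in our experiments. The `@URI` and `@surfaceForm` keys are, to the best of our knowledge, the ones Spotlight uses in its JSON output.

```python
import json
import urllib.parse
import urllib.request


def build_spotlight_request(text, endpoint="http://localhost:2222/rest/annotate", confidence=0.5):
    """Prepare a POST request for a DBpedia Spotlight annotate endpoint.

    The endpoint URL and the confidence value are assumptions; adapt them to your setup.
    """
    data = urllib.parse.urlencode({"text": text, "confidence": confidence}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/json"},  # ask for JSON instead of the default HTML
        method="POST",
    )


def annotate(text, **kwargs):
    """Send the text to Spotlight and return the parsed JSON response."""
    with urllib.request.urlopen(build_spotlight_request(text, **kwargs)) as response:
        return json.load(response)


def extract_entities(response):
    """Return (DBpedia URI, mention) pairs; the list is empty when nothing was linked."""
    return [(r["@URI"], r["@surfaceForm"]) for r in response.get("Resources", [])]
```

With the container running, `extract_entities(annotate("Berlin is the capital of Germany."))` would return the linked entities for that sentence.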
Because of these issues, we offer a .zip folder (around 700 MB) containing the entire dataset already processed, as a collection of enriched JSON files. Each file, in each language, contains the output of spaCy plus isolated named entities (PER, ORG, LOC) and linked entities (“ents”). For each linked entity, you will find the DBpedia entry and its mention in the document. This way, our evaluation setting can be reproduced without running spaCy or DBpedia Spotlight. To parse the dataset, we offer simple Python and R scripts.
SemScale is implemented in Python. When using the tool, you can import the word embeddings of your choice (as a language-prefixed file; see the documentation on GitHub). In our work, we used a) (prefixed) in-domain embeddings trained on the entire collection of speeches for each language and legislative term, which we share with the community [in-domain-embs; in-domain-lemma-embs], and b) pre-trained general-purpose FastText embeddings (300d) for each language under study. For testing purposes, we provide a single file containing the 100k most frequently used (prefixed) words on Wikipedia for the following languages: English, French, German, Italian and Spanish. Be aware that by using this file you limit the size of the vocabulary under study, but you will drastically speed up the analysis.
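To illustrate what a language-prefixed, vocabulary-limited embedding file amounts to, here is a minimal sketch. It assumes the text format used by the pre-trained FastText `.vec` files (an optional header line with vocabulary size and dimension, then one word and its values per line, most frequent words first). The `en__` style prefix and the function name are hypothetical; check the SemScale documentation for the exact format the tool expects.

```python
import io


def load_prefixed_vectors(lines, lang, limit=None, sep="__"):
    """Read word vectors from text lines and prefix each word with a language code.

    `sep` is a hypothetical separator; `limit` keeps only the first N words,
    which are the most frequent ones if the file is sorted by frequency.
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if i == 0 and len(parts) == 2:
            continue  # header line: "<vocabulary size> <dimension>"
        if limit is not None and len(vectors) >= limit:
            break
        word, values = parts[0], parts[1:]
        vectors[f"{lang}{sep}{word}"] = [float(v) for v in values]
    return vectors


# Tiny in-memory example standing in for a real .vec file
sample = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\nand 0.5 0.6\n")
vectors = load_prefixed_vectors(sample, "en", limit=2)
```

Truncating with `limit` is the trade-off described above: fewer words are covered, but loading and scaling are much faster.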
Wordfish. We used the R implementation of Wordfish available in Quanteda, keeping all characters (i.e., disabling the removal of numbers and punctuation and keeping all words when building the word-frequency matrix). Note that this process is computationally quite expensive, and its cost grows with the size of the documents under study.
In this folder, you will find all the scaling results obtained in our comparison between Wordfish and SemScale. The file names encode the pre-processing steps adopted, for instance:
5 = 5th legislative term
EN = English
all_original = the entire text in its original form
0 = the percentage of text removed as a pre-processing step. Note that in this study we did not use this filtering, so all results show “0”.
indomain_emb = the embeddings used in the analysis. “NA” for Wordfish results.
semscale = the scaling method
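The naming scheme above can be decoded programmatically. The following is a small sketch that assumes every result file follows the six-field, hyphen-separated pattern of the example; the `ScalingRun` type and `parse_result_name` function are hypothetical helpers, not part of our scripts.

```python
from pathlib import Path
from typing import NamedTuple


class ScalingRun(NamedTuple):
    term: int          # legislative term, e.g. 5
    language: str      # e.g. "EN"
    text_variant: str  # e.g. "all_original"
    removed_pct: int   # percentage of text removed ("0" in this study)
    embeddings: str    # e.g. "indomain_emb", or "NA" for Wordfish
    method: str        # e.g. "semscale"


def parse_result_name(path):
    """Split a result file name into its six hyphen-separated fields."""
    fields = Path(path).stem.split("-")
    if len(fields) != 6:
        raise ValueError(f"unexpected file name: {path}")
    term, language, variant, removed, embeddings, method = fields
    return ScalingRun(int(term), language, variant, int(removed), embeddings, method)


run = parse_result_name("scaling-results/5-EN-all_original-0-indomain_emb-semscale.txt")
```

Fields use underscores internally (as in `all_original`), so splitting on hyphens is safe for names that follow this pattern.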
To compute pairwise accuracy and the Pearson and Spearman correlations, we provide a simple Python script that you can use as a command-line tool. You just need to provide the path of the scaling file and of the Chapel Hill party positions, concerning either European integration or left-right ideology. For instance:
python eval-script.py EuroParl-Dataset/5/integration.txt scaling-results/5-EN-all_original-0-indomain_emb-semscale.txt
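For readers who want to see what these three measures compute, here is a minimal, dependency-free sketch. It is not the code of the evaluation script: it assumes the gold and predicted positions are already loaded as two lists aligned by party, and it defines pairwise accuracy as the share of party pairs (with distinct gold positions) that the scaling orders the same way as the gold standard, which is our reading of the metric and should be treated as an assumption.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def pairwise_accuracy(gold, predicted):
    """Share of pairs ordered the same way in `gold` and `predicted`."""
    pairs = [(a, b) for a, b in combinations(range(len(gold)), 2) if gold[a] != gold[b]]
    correct = sum((gold[a] - gold[b]) * (predicted[a] - predicted[b]) > 0 for a, b in pairs)
    return correct / len(pairs)


# Toy positions for four parties (made-up numbers, for illustration only)
gold = [0.1, 0.4, 0.6, 0.9]
predicted = [0.2, 0.1, 0.7, 0.8]
```

In this toy example only the first two parties are swapped, so five of the six pairs are ordered correctly.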