Here are the links to the datasets used at the hackathon.
Parliamentary Text Collections [link]. The collected materials cover four legislative periods (5th to 8th) and almost 20 years of European politics (1999-2017). Each .zip file contains one document per representative, aggregating all speeches that the representative delivered over the course of the legislative period. Besides the speeches (content and date for each speech), we also provide each representative's national party and European party group. IMPORTANT: data for the 8th legislative period were collected at the end of 2017; they will be updated at the end of the period.
Benchmark Datasets [link]. For the hackathon, we used only data from completed legislative periods, which is why we discarded the ongoing eighth period. For each period, we considered only the speeches delivered in English or manually translated into English, and we concatenated all speeches of all representatives of the same party into a single party-level document. We selected the parties from the five largest European countries: Germany, France, United Kingdom, Italy, and Spain. Finally, we discarded the parties whose aggregate text over the whole legislative period was shorter than 10,000 tokens. The speeches are available as .txt files in a single folder, where each party is identified by its code from the Chapel Hill Expert Survey.
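The aggregation and length filter described above can be sketched roughly as follows. This is an illustration only, not the script used to build the benchmark: the input shape (pairs of party code and speech text), the function name `build_party_documents`, and whitespace tokenisation are all assumptions, and the language and country selection is taken to have happened upstream.

```python
from collections import defaultdict

MIN_TOKENS = 10_000  # threshold stated in the dataset description


def build_party_documents(speeches, min_tokens=MIN_TOKENS):
    """Concatenate speeches per party and drop parties with too little text.

    `speeches` is an iterable of (party_code, text) pairs. This input
    shape is hypothetical; the real files may be organised differently.
    """
    by_party = defaultdict(list)
    for party_code, text in speeches:
        by_party[party_code].append(text)

    documents = {}
    for party_code, texts in by_party.items():
        document = "\n".join(texts)
        # Whitespace splitting is an assumed stand-in for the tokeniser
        # actually used to count tokens.
        if len(document.split()) >= min_tokens:
            documents[party_code] = document
    return documents
```

Each value in the returned dictionary corresponds to one party-level document, which could then be written to a .txt file named after the party code.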
Gold Standard [link]. As gold-standard party positions we use the European integration dimension from the Chapel Hill Expert Survey (years 2002, 2006, 2010). We normalised the scores to the range from 0 (strongly Eurosceptic) to 1 (strongly in favour of European integration).
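One plausible way to obtain such scores is a linear rescaling of the raw expert ratings, sketched below. Both the linear form and the default endpoints (a 1 to 7 rating scale) are assumptions rather than details given here, and `normalise_position` is a hypothetical helper name; check the survey codebook for the actual scale before relying on it.

```python
def normalise_position(score, scale_min=1.0, scale_max=7.0):
    """Linearly rescale a raw position score to the interval [0, 1].

    The default endpoints are an assumption about the raw rating scale;
    pass the real bounds if they differ.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= score <= scale_max:
        raise ValueError("score lies outside the assumed scale")
    return (score - scale_min) / (scale_max - scale_min)
```

Under these assumptions the lowest rating maps to 0 (strongly Eurosceptic) and the highest to 1 (strongly in favour of European integration).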