From the “references.tsv” file available on the repository, I extracted all article / entity-section pairs whose section title is not “n/a”. This produced 52000 entities. Let’s call this set “E”.
Then from the different “_sections” files (which contain the textual content of each section of each entity) I’ve extracted all entities that appear in “E”. The number drops to 4937 entities. Let’s call this “F”.
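To make the two steps above concrete, here is a minimal sketch of how “E” and “F” could be built. It is not the actual script: the column names (`entity`, `section`), the tab-separated layout with a header row, and the helper names are all assumptions for illustration, and the sample data is made up.

```python
import csv
import io


def entities_with_section_title(tsv_file, entity_col="entity", section_col="section"):
    """Build E: entities with at least one row whose section title is not 'n/a'.

    The column names are hypothetical; adjust them to the real header of references.tsv.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {
        row[entity_col]
        for row in reader
        if row[section_col].strip().lower() != "n/a"
    }


def restrict_to_available(entities, section_entities):
    """Build F: entities of E for which section content is actually available."""
    return entities & set(section_entities)


# Made-up sample standing in for references.tsv and the "_sections" files.
sample = (
    "article\tentity\tsection\n"
    "a1\tParis\tHistory\n"
    "a2\tRome\tn/a\n"
    "a3\tOslo\tGeography\n"
)
E = entities_with_section_title(io.StringIO(sample))
F = restrict_to_available(E, ["Paris", "Berlin"])
print(sorted(E), sorted(F))
```

On the real data, the second argument of `restrict_to_available` would be the entity names collected from the “_sections” files.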
Finally, I re-looped over “references.tsv” to run the experiment. I considered only article / entity-section pairs where:
- the entity appears in “F” (that is, the dataset contains the content of its sections) [from 352000 articles, the number drops to 1589]
- there are at least 2 sections [1576 articles]
- the section title in references.tsv matches one of the possible section headings from “F” [276 articles]
Among these 276 articles, the majority are associated with only two candidate sections.
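As an illustration of the three filters and of the candidate-section count, here is a minimal sketch. It is not the actual script: the function names, the in-memory data structures, and the use of exact string matching between section titles are assumptions, and it counts pairs rather than distinct articles. The sample data is made up.

```python
from collections import Counter


def filter_pairs(pairs, sections_by_entity):
    """Apply the three filters in order and return the survivors of each stage.

    pairs: iterable of (article, entity, section_title) tuples.
    sections_by_entity: dict mapping each entity in F to its list of section headings.
    """
    # 1. The entity appears in F.
    in_f = [p for p in pairs if p[1] in sections_by_entity]
    # 2. The entity has at least 2 sections.
    multi = [p for p in in_f if len(sections_by_entity[p[1]]) >= 2]
    # 3. The referenced section title is one of the entity's headings (exact match assumed).
    matched = [p for p in multi if p[2] in sections_by_entity[p[1]]]
    return in_f, multi, matched


def candidate_count_distribution(matched, sections_by_entity):
    """Count how many surviving pairs have 2, 3, ... candidate sections."""
    return Counter(len(sections_by_entity[entity]) for _, entity, _ in matched)


# Made-up sample data.
sections = {
    "Paris": ["History", "Geography"],
    "Rome": ["History"],
    "Oslo": ["Climate", "Economy", "History"],
}
pairs = [
    ("a1", "Paris", "History"),
    ("a2", "Rome", "History"),
    ("a3", "Oslo", "Culture"),
    ("a4", "Berlin", "History"),
    ("a5", "Oslo", "Climate"),
]
in_f, multi, matched = filter_pairs(pairs, sections)
print(len(in_f), len(multi), len(matched))
print(candidate_count_distribution(matched, sections))
```

Printing the length of each stage gives the same kind of funnel as the bracketed counts above, and the `Counter` shows how many surviving pairs have exactly two candidate sections.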
Here’s a script to reproduce the analysis.