Data Set: Cora

Proposed by: Andrew McCallum

Added on: 20 November 2016.

Tags: bibliographic, local, public, external


The Cora data set contains bibliographic records of machine learning papers. The records have been manually clustered so that each group holds the records that refer to the same publication.

Originally, Cora was prepared by Andrew McCallum, and his versions of this data set are available on his Data web page. The data is also hosted here in the DLRep.

Note that various versions of the Cora data set have been used by many publications in record linkage and entity resolution over the years.



The Cora version held locally in the DLRep is a comma-separated values (CSV) file, as downloaded from SecondString, an open source package for approximate string matching.

Note that the second column (field/attribute) contains the entity identifiers (publication identifiers), so records that share a value in this column refer to the same publication.
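As a minimal sketch of how the entity identifiers might be used, the Python snippet below groups CSV records by the value in the second column to recover the ground-truth clusters. The function name group_by_entity and the sample rows are made up for illustration and are not taken from the actual file. Whether the file has a header row is not stated here, so check that before relying on the result.

```python
import csv
import io
from collections import defaultdict


def group_by_entity(lines, id_column=1):
    """Group CSV records by the entity identifier in the given column.

    id_column=1 is the second column (0-based index), which holds the
    publication identifier according to the description above.
    """
    clusters = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) <= id_column:
            continue  # skip blank or short rows
        clusters[row[id_column]].append(row)
    return dict(clusters)


# Made-up rows for illustration only; not taken from the real Cora file.
sample = io.StringIO(
    "1,pub_a,Some Author,A title\n"
    "2,pub_a,S. Author,A title.\n"
    "3,pub_b,Other Author,Another title\n"
)
clusters = group_by_entity(sample)
print({entity: len(rows) for entity, rows in clusters.items()})
```

To apply this to the downloaded file, pass an open file object instead of the sample, for example open("cora.csv", newline=""), where the file name is an assumption.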

License: CC BY 4.0