Today, nearly all human activities leave digital traces. Terabytes of person-level data exist, and researchers who can make sense of them are positioned to gain insight into many of the most challenging problems facing our society: healthcare, education, employment, welfare, economics, and the environment. A key challenge in reaping the benefits of such big data is integrating heterogeneous data sources that lack a common identifier. This creates the need for record linkage, the process of identifying pairs of records that refer to the same real-world entity.
Access to real data that pose a variety of challenges is critical to the effective development of any data analytic algorithm. Benchmark data repositories such as the UCI machine learning repository and CSPLib, a benchmark library for constraint programming, have been critical to the development of their respective research communities. Establishing a common benchmarking repository for linkage methodologies will propel the field to the next level of rigor by facilitating comparison of different algorithms, clarifying which types of algorithms work best under particular conditions and in particular problem domains, promoting transparency and replicability of research, and encouraging proper citation of methodological contributions and their resulting datasets. It will also bring together the diverse scholarly communities (e.g., computer scientists, statisticians, and social, behavioural, economic, and health (SBEH) scientists) that currently address these challenges in disparate ways and do not build on one another’s work.