A record linkage algorithm tries to identify records that belong to the same individual. We analyze the matching behavior of an approach used in the E-PIX matching tool on the very limited attribute set of first name, last name, date of birth and sex. Our benchmark set contains almost 37,000 records from the Popgen biobank. We develop a model that allows us to predict the clerical review workload for data sets that grow by a factor of 10 or more, without requiring a data set of that size. Based on this model, we present two parameter sets with comparable detection rates for true duplicates, of which only one scales well to growing data sets. Our model provides realistic example records for each predicted match in an upscaled data set. It thus makes it possible to identify the parameters that need to be adjusted to improve the quality of the matching candidates. We also show that, with the limited attribute set above, merging records without review is prone to homonym errors on data sets of 200,000 records, even though the merged record pairs are obviously different when examined in clerical review.