Link: https://pan.baidu.com/s/1cXYoEi-p7nONCIapU1CfnA Extraction code: tcst
Title: Record Linkage Comparison Patterns
— Underlying records: Epidemiologisches Krebsregister NRW
— Creation of comparison patterns and gold standard classification:
Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI),
University Medical Center of Johannes Gutenberg University, Mainz, Germany
— Donor: Murat Sariyar, Andreas Borg (IMBEI)
— Date: September 2008
Irene Schmidtmann, Gael Hammer, Murat Sariyar, Aslihan Gerhold-Ay:
Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage. Technical
Report, IMBEI 2009.
— Describes the external evaluation of the registry’s record linkage.
— The comparison patterns in this data set were created in the course of
this evaluation.
Murat Sariyar, Andreas Borg, Klaus Pommerening:
Controlling false match rates in record linkage using extreme value theory.
Journal of Biomedical Informatics, 2011 (in press).
— Predicted attribute: matching status (boolean).
— A new approach for estimating the false match rate in record
linkage by methods of Extreme Value Theory (EVT).
— The model eliminates the need for labelled training data while
achieving only slightly lower accuracy compared to a procedure
that has knowledge about the matching status.
The records represent individual data including first and
family name, sex, date of birth and postal code, which were collected through
iterative insertions in the course of several years. The comparison
patterns in this data set are based on a sample of 100.000 records dating
from 2005 to 2008. Data pairs were classified as “match” or “non-match” during
an extensive manual review where several documentarists were involved.
The resulting classification formed the basis for assessing the quality of the
registry’s own record linkage procedure.
To limit the number of patterns, a blocking procedure was applied,
which selects only record pairs that meet specific agreement conditions. The
results of the following six blocking iterations were merged together:
- Phonetic equality of first name and family name, equality of date of birth.
- Phonetic equality of first name, equality of day of birth.
- Phonetic equality of first name, equality of month of birth.
- Phonetic equality of first name, equality of year of birth.
- Equality of complete date of birth.
- Phonetic equality of family name, equality of sex.
This procedure resulted in 5.749.132 record pairs, of which 20.931 are matches.
The data set is split into 10 blocks of (approximately) equal size and ratio
of matches to non-matches.
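
The six blocking iterations listed above can be sketched as follows. This is a
minimal illustration, not the registry's implementation: the toy records, the
field names and the phonetic() helper are all hypothetical, and the source does
not say which phonetic algorithm was used. Each iteration groups records by a
blocking key, and the candidate pairs from all iterations are merged into one set.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; field names are illustrative only.
records = [
    {"id": 1, "fname": "Anna", "lname": "Meyer", "sex": "f", "bd": 3, "bm": 5, "by": 1970},
    {"id": 2, "fname": "Ana", "lname": "Meier", "sex": "f", "bd": 3, "bm": 5, "by": 1970},
    {"id": 3, "fname": "Karl", "lname": "Schmidt", "sex": "m", "bd": 3, "bm": 5, "by": 1970},
    {"id": 4, "fname": "Carl", "lname": "Huber", "sex": "m", "bd": 3, "bm": 1, "by": 1982},
]

def phonetic(name):
    """Crude stand-in for a phonetic code (assumption: algorithm not given in the source)."""
    s = name.lower().replace("c", "k").replace("y", "i")
    # Keep the first letter, drop later vowels, collapse repeated letters.
    code = s[0]
    for ch in s[1:]:
        if ch not in "aeiou" and ch != code[-1]:
            code += ch
    return code

# One key function per blocking iteration, in the order given in the text.
BLOCKING_KEYS = [
    lambda r: (phonetic(r["fname"]), phonetic(r["lname"]), r["bd"], r["bm"], r["by"]),
    lambda r: (phonetic(r["fname"]), r["bd"]),
    lambda r: (phonetic(r["fname"]), r["bm"]),
    lambda r: (phonetic(r["fname"]), r["by"]),
    lambda r: (r["bd"], r["bm"], r["by"]),
    lambda r: (phonetic(r["lname"]), r["sex"]),
]

def candidate_pairs(recs, key_funcs):
    """Return the union of record-id pairs that share a key in any iteration."""
    pairs = set()
    for key in key_funcs:
        buckets = defaultdict(list)
        for rec in recs:
            buckets[key(rec)].append(rec["id"])
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs

pairs = candidate_pairs(records, BLOCKING_KEYS)
print(sorted(pairs))
```

Using a set for the merged result means a pair that satisfies several
conditions is kept only once.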
The separate file frequencies.csv contains for every predictive attribute
the average number of values in the underlying records. These values can, for example,
be used as u-probabilities in weight-based record linkage following the
framework of Fellegi and Sunter.
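
As a rough sketch of how u-probabilities enter weight-based linkage: in the
Fellegi-Sunter framework an attribute contributes log2(m/u) when it agrees and
log2((1-m)/(1-u)) when it disagrees, where m and u are the agreement
probabilities among matches and non-matches. The probabilities below are made
up for illustration and are not taken from frequencies.csv. The sketch also
assumes binary agreement values for simplicity.

```python
import math

# Hypothetical probabilities for illustration only; not from frequencies.csv.
# m: probability that an attribute agrees for a true match
# u: probability that an attribute agrees for a non-match
M_PROB = {"cmp_sex": 0.95, "cmp_by": 0.90, "cmp_plz": 0.80}
U_PROB = {"cmp_sex": 0.50, "cmp_by": 0.02, "cmp_plz": 0.01}

def match_weight(pattern, m_prob, u_prob):
    """Sum the log2 agreement or disagreement weight of every attribute."""
    total = 0.0
    for attr, agrees in pattern.items():
        m, u = m_prob[attr], u_prob[attr]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

all_agree = {"cmp_sex": True, "cmp_by": True, "cmp_plz": True}
all_differ = {"cmp_sex": False, "cmp_by": False, "cmp_plz": False}
print(match_weight(all_agree, M_PROB, U_PROB))   # positive: evidence for a match
print(match_weight(all_differ, M_PROB, U_PROB))  # negative: evidence against
```

A pair would then be classified by comparing its total weight against one or
two thresholds, which are not specified here.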
Number of Instances: 5.749.132
Number of Attributes: 12 (9 predictive attributes, 2 non-predictive,
1 goal field)
- id_1: Internal identifier of first record.
- id_2: Internal identifier of second record.
- cmp_fname_c1: agreement of first name, first component
- cmp_fname_c2: agreement of first name, second component
- cmp_lname_c1: agreement of family name, first component
- cmp_lname_c2: agreement of family name, second component
- cmp_sex: agreement of sex
- cmp_bd: agreement of date of birth, day component
- cmp_bm: agreement of date of birth, month component
- cmp_by: agreement of date of birth, year component
- cmp_plz: agreement of postal code
- is_match: matching status (TRUE for matches, FALSE for non-matches)
Class Distribution: 20.931 matches, 5.728.201 non-matches
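
The counts above imply a strong class imbalance, with matches making up roughly
0.36% of all pairs. The short calculation below uses only the figures stated in
this description. The variable names are illustrative.

```python
# Figures taken from the description above.
total_pairs = 5_749_132
matches = 20_931

non_matches = total_pairs - matches
match_fraction = matches / total_pairs
non_matches_per_match = non_matches / matches

print(f"non-matches: {non_matches}")
print(f"match fraction: {match_fraction:.4%}")
print(f"non-matches per match: {non_matches_per_match:.1f}")
```

With this imbalance, plain accuracy is a weak quality measure, since labelling
every pair a non-match would already be correct for more than 99% of the pairs.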