Methodological Developments in Data Linkage
Harron, Katie
Goldstein, Harvey
Dibben, Chris
A comprehensive compilation of new developments in data linkage methodology

The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those which describe the seminal work from the 1950s and 1960s, with some updates. Linkage and analysis of data across sources remains problematic due to a lack of discriminatory and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases, and privacy-preserving linkage.

Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting-edge methodology in this field. It presents opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed.

Key Features:
- Presents cutting-edge methods for a topic of increasing importance to a wide range of research areas, with applications to data linkage systems internationally
- Covers the essential issues associated with data linkage today
- Includes examples based on real data linkage systems, highlighting the opportunities, successes and challenges that the increasing availability of linkage data provides
- Novel approach incorporates technical aspects of linkage, management and analysis of linked data

This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data.
It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.

Contents:

1.1 Introduction: Data linkage as it exists
1.2 Background and issues
1.3 Data linkage methods
1.3.1 Deterministic linkage
1.3.2 Probabilistic linkage
1.3.3 Data preparation
1.4 Linkage error
1.5 Impact of linkage error on analysis of linked data
1.6 Data linkage: the future

2.1 Introduction
2.2 Overview of methods
2.2.1 The Fellegi–Sunter model of record linkage
2.2.2 Learning parameters
2.2.3 Additional methods for matching
2.2.4 An empirical example
2.3 Data preparation
2.3.1 Description of a matching project
2.3.2 Initial file preparation
2.3.3 Name standardisation and parsing
2.3.4 Address standardisation and parsing
2.3.5 Summarising comments on pre-processing
2.4 Advanced methods
2.4.1 Estimating false-match rates without training data
2.4.2 Adjusting analyses for linkage error
2.5 Concluding comments

3.1 Introduction
3.2 The data linkage context
3.2.1 Administrative or routine data
3.2.2 The law and the use of administrative (personal) data for research
3.2.3 The identifiability problem in data linkage
3.3 The tools used in the production of functional anonymity through a data linkage environment
3.3.1 Governance, rules and the researcher
3.3.2 Application process, ethics scrutiny and peer review
3.3.3 Shaping safe behaviour – Training, Sanctions, Contracts and licenses
3.3.4 Safe data analysis environments
3.3.5 Fragmentation – separation of linkage process and temporary linked data
3.4 Models for data access and data linkage
3.4.1 Single centre
3.4.2 Separation of functions, firewalls within single centre
3.4.3 Separation of functions, Trusted Third Party linkage
3.4.4 Secure multi-party computation
3.5 Four case study data linkage centres
3.5.1 Population Data BC
3.5.2 The Secure Anonymised Information Linkage Bank (SAIL) – UK
3.5.3 Centre for Data Linkage (Population Health Research Network) – Australia
3.5.4 The Centre for Health Record Linkage (CHeReL) – Australia
3.6 Conclusion

4.1 Background
4.2 Description of types of linkage error
4.2.1 Missed matches from missing linkage variables
4.2.2 Missed matches from inconsistent case ascertainment
4.2.3 False matches – description of cases incorrectly matched
4.3 How linkage error impacts research findings
4.3.1 Results
4.3.2 Assessment of linkage bias
4.4 Discussion
4.4.1 Potential biases in the review process
4.4.2 Recommendations & implications for practice
4.5 References to studies included in the review

5.1 Introduction
5.2 Measurement error issues arising from linkage
5.2.1 Correct links, incorrect links and non-links
5.2.2 Characterising linkage errors
5.2.3 Characterising errors from non-linkage
5.3 Models for different types of linking errors
5.3.1 Linkage errors under binary linking
5.3.2 Linkage errors under multi-linking
5.3.3 Incomplete linking
5.3.4 Modelling the linkage error
5.4 Regression analysis using complete binary-linked data
5.4.1 Linear regression
5.4.2 Logistic regression
5.5 Regression analysis using incomplete binary-linked data
5.5.1 Linear regression using incomplete sample to register linked data
5.6 Regression analysis with multi-linked data
5.6.1 Uncorrelated multi-linking – complete linkage
5.6.2 Uncorrelated multi-linking – sample to register linkage
5.6.3 Correlated multi-linkage
5.6.4 Incorporating auxiliary population information
5.7 Conclusion and discussion

6.1 Introduction
6.2 Probabilistic record linkage
6.3 Multiple Imputation (MI)
6.4 Prior informed imputation
6.4.1 Estimating matching probabilities
6.5 Example 1: Linking electronic healthcare data to estimate trends in blood-stream infection
6.5.1 Methods
6.5.2 Results
6.5.3 Conclusions
6.6 Example 2: Simulated data including non-random linkage error
6.6.1 Methods
6.6.2 Results
6.7 Discussion
6.7.1 Non-random linkage error
6.7.2 Strengths and limitations: handling linkage error
6.7.3 Implications for data linkers and data users
Appendix A

7.1 Summary
7.2 Introduction
7.2.1 Flat approach
7.2.2 Oops, your legacy is showing
7.2.3 Shortcomings
7.3 Graph approach
7.3.1 Overview of graph concepts
7.3.2 Graph queries versus relational queries
7.3.3 Comparison of data in flat database versus graph database
7.3.4 Relaxing the notion of truth
7.3.5 Not a linkage approach per se but a management approach which enables novel linkage approaches
7.3.6 Linkage engine independent
7.3.7 Separates out linkage from cluster identification phase (and clerical review)
7.4 Methodologies
7.4.1 Overview of storage and extraction approach
7.4.2 Overall management of data as collections
7.4.3 Data loading
7.4.4 Identification of equivalence sets and deterministic linkage
7.4.5 Probabilistic linkage
7.4.6 Clerical review
7.4.7 Determining cut-off thresholds
7.4.8 Final cluster extraction
7.4.9 Graph partitioning
7.4.10 Data management/curation
7.4.11 User interface challenges
7.4.12 Final cluster extraction
7.4.13 A typical end-to-end workflow
7.5 Algorithm Implementation
7.5.1 Graph traversal
7.5.2 Cluster identification
7.5.3 Partitioning visitor
7.5.4 Encapsulating edge following policies
7.5.5 Graph partitioning
7.5.6 Insertion of review links
7.5.7 How to migrate while preserving current clusters
7.6 New approaches facilitated by graph storage approach
7.6.1 Multiple threshold extraction
7.6.2 Possibility of returning graph to end-users
7.6.3 Optimized cluster analysis
7.6.4 Other link types
7.7 Conclusion
Acknowledgements

8.1 Introduction
8.2 Current practice in record linkage for population censuses
8.2.1 Introduction
8.2.2 Case study: The 2011 England and Wales Census assessment of coverage
8.3 Population level linkage in countries that operate a population register: Register-based censuses
8.3.1 Introduction
8.3.2 Case study 1 – Finland
8.3.3 Case study 2 – The Netherlands Virtual Census
8.3.4 Case study 3 – Poland
8.3.5 Case study 4 – Germany
8.3.6 Summary
8.4 New challenges in record linkage: the Beyond 2011 Programme
8.4.1 Introduction
8.4.2 Beyond 2011 linking methodology
8.4.3 The anonymisation process in Beyond 2011
8.4.4 Beyond 2011 linkage strategy using pseudonymised data
8.4.5 Linkage quality
8.4.6 Next steps
8.4.7 Conclusion
8.5 Summary

9.1 Introduction
9.2 Chapter outline
9.3 Linking with and without Personal identification Numbers
9.3.1 Linking using a trusted third party
9.3.2 Linking with encrypted PIDs
9.3.3 Linking with encrypted quasi-identifiers
9.3.4 PPRL in decentralised organisations
9.4 PPRL approaches
9.4.1 Phonetic codes
9.4.2 High-dimensional embeddings
9.4.3 Reference tables
9.4.4 Secure Multiparty Computations for PPRL
9.4.5 Bloom filter based PPRL
9.5 PPRL for very large databases: blocking
9.5.1 Blocking for PPRL with Bloom filters
9.5.2 Blocking Bloom filters with Multibit trees
9.5.3 Empirical comparison of blocking techniques for Bloom filters
9.5.4 Current recommendations for linking very large datasets with Bloom filters
9.6 Privacy considerations
9.6.1 Probability of attacks
9.6.2 Kind of attacks
9.6.3 Attacks on Bloom filters
9.7 Hardening Bloom filters
9.7.1 Randomly selected hash values
9.7.2 Random bits
9.7.3 Avoiding padding
9.7.4 Standardising the length of identifiers
9.7.5 Sampling bits for composite Bloom filters
9.7.6 Re-hashing
9.7.7 Salting keys with record specific data
9.7.8 Fake injections
9.7.9 Evaluation of Bloom filter hardening procedures
9.8 Future research
9.9 PPRL research and implementation with national databases

10.1 Introduction
10.2 Part 1: Data linkage as it exists today
10.3 Part 2: Analysis of linked data
10.3.1 Quality of identifiers
10.3.2 Quality of linkage methods
10.3.3 Quality of evaluation
10.4 Part 3: Data linkage in practice: new developments
10.5 Concluding remarks

- ISBN: 978-1-118-74587-8
- Publisher: Wiley-Blackwell
- Binding: Hardcover
- Pages: 296
- Publication date: 27/11/2015
- No. of volumes: 1
- Language: English