Citation: YU Huiting, CAI Renzhi, LIN Weixiao, et al. Application of Bayesian probabilistic linkage model in birth and death data linking[J]. Shanghai Journal of Preventive Medicine, 2024, 36(1): 98-103. doi: 10.19428/j.cnki.sjpm.2024.23137

Application of Bayesian probabilistic linkage model in birth and death data linking

  • Abstract:
    Objective To describe the principles and methods of the Bayesian probabilistic linkage model and to demonstrate its performance by linking birth and death data.
    Methods Records of 199 025 infants born in 2017 and 1 512 infants who died in 2017 or 2018 were collected from the Shanghai birth and death registration systems. After cleaning, the data were partitioned by month and then fully linked. The Jaro-Winkler algorithm and Euclidean distance were used to measure the similarity of the matching fields between the two datasets. These similarities were used to build a Bayesian probabilistic linkage model, and linkage performance was evaluated with a confusion matrix.
    Results The Bayesian probabilistic linkage model effectively linked the infant birth and death records. Of the infants who died in Shanghai, 36.71% were born outside the city, and the estimated probability of infant death was 2.60‰. On the test set, the confusion matrix gave a recall of 0.86, a precision of 0.76, and an F-score of 0.81.
    Conclusion In this application the Bayesian probabilistic linkage model performed well. Used to build a birth-death cohort, it reflects the true level of infant mortality more accurately. Using this technique to integrate data from different departments can effectively improve research efficiency in public health.
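As a rough illustration of two quantities named in the abstract, the sketch below computes a Jaro-Winkler similarity between two strings and the F-score from a precision and a recall. It is a minimal textbook version, not the authors' implementation: the function names, the prefix scaling factor of 0.1, and the prefix cap of 4 characters are assumptions (they are the conventional defaults), and the abstract does not say which fields or parameters the study used.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults p=0.1, cap 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 form of the F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    # Precision and recall as reported in the abstract.
    print(round(f_score(0.76, 0.86), 2))
```

With the reported precision of 0.76 and recall of 0.86, the harmonic mean rounds to 0.81, which agrees with the F-score given in the abstract.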

     
