A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT) - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur Accéder directement au contenu
Article Dans Une Revue Language Resources and Evaluation Année : 2017

A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT)

Résumé

Quality annotated resources are essential for Natural Language Processing. The objective of this work is to present a corpus of clinical narratives in French annotated for linguistic, semantic and structural information, aimed at clinical information extraction. Six annotators contributed to the corpus annotation, using a comprehensive annotation scheme covering 21 entities, 11 attributes and 37 relations. All annotators trained on a small, common portion of the corpus before proceeding independently. An automatic tool was used to produce entity and attribute pre-annotations. About a tenth of the corpus was doubly annotated and annotation differences were resolved in consensus meetings. To ensure annotation consistency throughout the corpus, we devised harmonization tools to automatically identify annotation differences to be addressed to improve the overall corpus quality. The annotation project spanned over 24 months and resulted in a corpus comprising 500 documents (148,476 tokens) annotated with 44,740 entities and 26,478 relations. The average inter-annotator agreement is 0.793 F-measure for entities and 0.789 for relations. The performance of the pre-annotation tool for entities reached 0.814 F-measure when sufficient training data was available. The performance of our entity pre-annotation tool shows the value of the corpus to build and evaluate information extraction methods. In addition, we introduced harmonization methods that further improved the quality of annotations in the corpus.
Fichier principal
Vignette du fichier
lre.pdf (1.78 Mo) Télécharger le fichier
Origine : Fichiers éditeurs autorisés sur une archive ouverte
Loading...

Dates et versions

hal-01631743 , version 1 (13-11-2017)

Licence

Paternité

Identifiants

Citer

Leonardo Campillos, Louise Deléger, Cyril Grouin, Thierry Hamon, Anne-Laure Ligozat, et al.. A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT). Language Resources and Evaluation, 2017, 52 (2), pp.571-601. ⟨10.1007/s10579-017-9382-y⟩. ⟨hal-01631743⟩
469 Consultations
392 Téléchargements

Altmetric

Partager

Gmail Facebook X LinkedIn More