German Medical Text Corpus (GeMTeX)

Project goals of GeMTeX: An overview

The main objective of the GeMTeX project is to create a comprehensive, annotated corpus of German medical texts from routine patient care. The plan is to extract documents from the electronic health records (German: elektronische Patientenakte, ePA) of six university hospitals, using only records of patients who have previously given their consent. In a concerted effort, these documents will be compiled into text corpora and annotated in depth along several dimensions. Once anonymized, the documents will be shared as new resources for research and development. Progress in clinical Natural Language Processing (NLP) depends on specially trained language models, which in turn rely on authentic clinical documents. GeMTeX addresses two significant bottlenecks that have so far prevented the development of German clinical language models: access to data and the annotation of this data.

The Medical Informatics Initiative (MII) offers a unique opportunity to make clinical documents accessible on a large scale and enrich them with annotations. A German medical text corpus will foster the development of NLP resources that support the analysis of German clinical texts. GeMTeX will create a technical and organizational structure to prospectively collect anonymized texts and annotate them according to defined annotation guidelines. A wide range of annotation tasks will be covered, tested, validated, and applied on a large scale to create a unique resource. AI models trained on this resource will be evaluated for their value in application scenarios from specific medical disciplines. The annotated text documents and the models will be made publicly available via the Central Library of Medicine (ZBMED) and the DFG-funded project NFDI4Health, with which GeMTeX works closely.

The work is organized into four areas: central methods and tools, document processing and annotation, central annotation services, and corpus-related services and tools.

Principal Investigator: Martin Boeker

Central methods and tools

This area provides central methods and structures for GeMTeX to support and monitor the annotation process and make the results publicly accessible. The focus is on the annotation platform INCEpTION, which is adapted to the needs of GeMTeX. Site-specific data is retrieved via INCEpTION and displayed centrally. Results and developments from the supporting projects are collected, documented, and made publicly accessible. Industry partners integrate their tools for scientific text analysis into the project. Munich TUM leads this area and provides a central repository in which GeMTeX methods, models, and tools are made available.

Document processing and annotation

The documents are annotated in this core work area. This requires the provision of documents and the management and training of the annotation teams at each of the participating locations, based on annotation guidelines and materials. The Munich TUM site manages this area and is responsible for access to documents, their pre-processing, the management and training of the annotation teams, and the actual annotation.

Central annotation services

This area bridges the gap between the principles and resources of formal ontology and medical semantic standards. An annotation guide will be created, updated, and published in successive phases. A modular annotator training course based on blended learning will be set up. The tasks include:

  • Deriving a prioritized list of relevant entity and relation types from clinical texts.
  • Formally describing these types.
  • Ensuring the universal applicability of the trained language models.

Corpus-related services and tools

This area provides methods, models, and tools for developing, maintaining, and using the GeMTeX corpus. It includes creating a fully distributed synthetic clinical reference corpus for quality control, as well as providing standard quality metrics that support both quality control and corpus maintenance.
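The project description does not name the quality metrics it will use. As an illustration only, the sketch below computes Cohen's kappa, a widely used measure of agreement between two annotators, in Python. The function name `cohens_kappa`, the label names, and the sample annotations are hypothetical and not taken from GeMTeX.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative sketch only; not a metric confirmed by the GeMTeX project.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same number of items")
    if not labels_a:
        raise ValueError("At least one labelled item is required")

    n = len(labels_a)

    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: derived from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; agreement is perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical entity-type labels assigned to six text spans by two annotators.
annotator_a = ["Diagnosis", "Medication", "Diagnosis", "Finding", "Medication", "Diagnosis"]
annotator_b = ["Diagnosis", "Medication", "Finding", "Finding", "Medication", "Diagnosis"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # prints "Cohen's kappa: 0.75"
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is often preferred over plain percentage agreement when label frequencies are uneven. A real annotation project would typically also need span-level and multi-annotator measures, which this sketch does not cover.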

Through this comprehensive approach, GeMTeX aims to create a unique resource that will significantly advance and improve the use of AI in the medical field.
