Laura Menotti

DOREMI: Optimizing Long Tail Predictions in Document-Level Relation Extraction

Laura Menotti, Stefano Marchesin, and Gianmaria Silvello

Journal Paper Knowledge-Based Systems, 337, 115359 (2026).
DOI: 10.1016/j.knosys.2026.115359

Abstract

Document-Level Relation Extraction (DocRE) presents significant challenges due to its reliance on cross-sentence context and the long-tail distribution of relation types, where many relations have scarce training examples. In this work, we introduce DOcument-level Relation Extraction optiMizing the long taIl (DOREMI), an iterative framework that enhances underrepresented relations through minimal yet targeted manual annotations. Unlike previous approaches that rely on large-scale noisy data or heuristic denoising, DOREMI actively selects the most informative examples to improve training efficiency and robustness. DOREMI can be applied to any existing DocRE model and is effective at mitigating long-tail biases, offering a scalable solution to improve generalization on rare relations.

A Domain-Specific Curated Benchmark for Entity and Document-Level Relation Extraction

Marco Martinelli, Stefano Marchesin, Vanessa Bonato, Giorgio Maria Di Nunzio, Nicola Ferro, Ornella Irrera, Laura Menotti, Federica Vezzani, and Gianmaria Silvello

Conference Paper Findings of the 19th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (EACL 2026).
To appear.

Abstract

Information Extraction (IE), encompassing Named Entity Recognition (NER), Named Entity Linking (NEL), and Relation Extraction (RE), is critical for transforming the rapidly growing volume of scientific publications into structured, actionable knowledge. This need is especially evident in fast-evolving biomedical fields such as the gut-brain axis, where research investigates complex interactions between the gut microbiota and brain-related disorders. Existing biomedical IE benchmarks, however, are often narrow in scope and rely heavily on distantly supervised or automatically generated annotations, limiting their utility for advancing robust IE methods. We introduce GUT-BRAINIE, a benchmark based on more than 1,600 PubMed abstracts, manually annotated by biomedical and terminological experts with fine-grained entities, concept-level links, and relations. While grounded in the gut-brain axis, the benchmark’s rich schema, multiple tasks, and combination of highly curated and weakly supervised data make it broadly applicable to the development and evaluation of biomedical IE systems across domains.

The BRAINTEASER Datasets: Clinical, Wearable and Environmental Data for ALS & MS Progression Modeling

Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Isotta Trescato, Lara Ahmad, Helena Aidos, Anca Loredana Alungulese, Riccardo Bellazzi, Roberto Bergamaschi, Giovanni Birolo, Pietro Bosoni, Maria Fernanda Cabrera-Umpierrez, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Piero Fariselli, Jose Manuel García Domínguez, Sergio Gonzalez Martinez, Marta Gromicho, Alessandro Guazzo, Aleksandar Jovanović, Borko Kostić, Enrico Longato, Sara C. Madeira, Umberto Manera, Jose Luis Muñoz Blanco, Eleonora Tavazzi, Erica Tavazzi, Elena Trasobares Iglesias, Vladimir Urošević, Martina Vettoretti, Giorgio Maria Di Nunzio, Gianmaria Silvello, Barbara Di Camillo, and Nicola Ferro

Journal Paper Scientific Data 12, 1854 (2025).
DOI: 10.1038/s41597-025-06095-1

Abstract

Amyotrophic lateral sclerosis (ALS) and multiple sclerosis (MS) are debilitating diseases with unpredictable progression. Artificial Intelligence-based tools for modelling disease progression could significantly improve the quality of life for patients and caregivers while supporting clinicians in delivering more personalized and timely care. However, the limited availability of data hinders the development, testing, and reproducibility of such predictive tools. To address this challenge, we curated, in the context of the H2020 BRAINTEASER project, four datasets containing clinical data from a total of 2,290 ALS patients and 723 MS patients. These datasets also include environmental data and information collected through wearable devices. Unlike most existing resources, the BRAINTEASER datasets are gathered from clinical practice, offering a more accurate representation of the data that an AI progression prediction tool would encounter in real-world scenarios. In addition to manual and automated data quality checks, the research community has validated the datasets through three editions of the intelligent Disease Progression Prediction challenges held within the Conference and Labs of the Evaluation Forum (CLEF).

Provenance-Driven Nanopublications: Representing Source Lineage and Trust Networks for Multi-Source Assertions

Laura Menotti, Stefano Marchesin, Fabio Giachelle and Gianmaria Silvello

Journal Paper International Journal of Digital Libraries 26, 24 (2025).
DOI: 10.1007/s00799-025-00431-x

Abstract

Nanopublishing is a paradigm enabling the representation of scientific claims in a distinctive, identifiable, citable, and reusable format, i.e., as a named graph. This approach can be applied to sentences extracted from scientific publications or triples within a Knowledge Base (KB). This way, one can track the provenance of assertions derived from a specific publication or database. However, nanopublications do not natively support multi-source scientific claims generated by aggregating different bodies of knowledge.

Methods: This work extends the nanopublication model with knowledge provenance, capturing provenance information for assertions derived by an aggregation algorithm or a truth discovery process, e.g., an information extraction system aggregating several sources of knowledge to populate a Knowledge Base (KB). In these cases, provenance information cannot be attributed to a single source; rather, it results from an ensemble of evidence that can comprise supporting and conflicting pieces of evidence and truth values. Knowledge provenance is represented as a named graph following the PROV-K ontology, developed for this purpose. To show how knowledge provenance applies to a real-world scenario, we serialized gene expression-cancer associations generated by the Collaborative Oriented Relation Extraction (CORE) System. To demonstrate the value of trust relationships, we present a use case leveraging an existing scientific KB to construct a trust network employing three Large Language Model (LLM) agents. We analyzed the ability of LLMs to evaluate trustworthiness, exploiting techniques from KB accuracy estimation.

Results: We published 197,511 assertions generated by the CORE system in the form of extended nanopublications with knowledge provenance. PROV-K also defines trust relationships between agents or between an agent and a proposition. Starting from these assertions, we leveraged external agents (namely, multiple LLMs) to assess their trusted truth value. Based on these values, we defined trust relationships between the agents and the facts, yielding an exemplar trust network comprising over 45,000 facts and four agents.

Conclusion: The knowledge provenance graph allows the tracking of provenance for each piece of evidence contributing to the support or refutation of an assertion. To capture the semantics of the newly presented graph, we define the PROV-K ontology, designed to represent provenance information for multi-source assertions. The two use cases serve as a template to show how to serialize extended nanopublications and showcase the trust relationships’ capabilities.

Overview of GutBrainIE@CLEF 2025: Gut-Brain Interplay Information Extraction

Marco Martinelli, Gianmaria Silvello, Vanessa Bonato, Giorgio Maria Di Nunzio, Nicola Ferro, Ornella Irrera, Stefano Marchesin, Laura Menotti and Federica Vezzani

Workshop Paper CLEF 2025 Working Notes: Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9–12, 2025 (pp. 65–98). CEUR Workshop Proceedings, Vol. 4038
CEUR-WS: Vol-4038/paper_5

Abstract

Recent studies link the gut microbiota to mental health conditions and to neurodegenerative diseases such as Parkinson’s and Alzheimer’s. However, the rapid speed at which this research field is evolving presents a significant challenge for clinicians and researchers who have to keep pace with an ever-expanding volume of biomedical literature. In this context, automatic tools for extracting and structuring information from scientific texts are becoming essential to support the understanding of the gut-brain axis. GutBrainIE promotes the development of Natural Language Processing (NLP) systems capable of extracting structured specialized knowledge from biomedical texts related to the gut-brain axis, aiming to accelerate biomedical discoveries through automated Information Extraction (IE). GutBrainIE is part of the BioASQ Lab at CLEF 2025 and is organized within the context of the research project HEREDITARY, funded by the European Commission. The task includes four subtasks of increasing complexity, one dealing with Named Entity Recognition (NER) and the other three with Relation Extraction (RE), and comprises a dataset manually annotated for entities and relations structured into four quality tiers. This extended overview describes the subtasks, dataset, evaluation methodology, results, and participant approaches for the GutBrainIE-2025 task.

BioASQ at CLEF2025: The Thirteenth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge

Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodriguez Ortega, Natalia Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina, Grigorios Tsoumakas, George Giannakoulas, Alexandra Bekiaridou, Athanasios Samaras, Giorgio Maria Di Nunzio, Nicola Ferro, Stefano Marchesin, Laura Menotti, Gianmaria Silvello and Georgios Paliouras

Conference Paper In Proc. of the 47th European Conference on Information Retrieval (ECIR 2025). Lecture Notes in Computer Science, vol 15576. Springer, Cham.
DOI: 10.1007/978-3-031-88720-8_61

Abstract

During the last twelve years, the large-scale biomedical semantic indexing and question-answering challenge (BioASQ) has been pushing towards the continuous advancement of methods and tools to accelerate access to the ever-increasing scientific resources of the biomedical domain. In this direction, each year, BioASQ organizes shared tasks representing the real information needs of biomedical experts and provides respective benchmark datasets. This way, it provides a unique common testbed where research teams around the world can test and compare new approaches for accessing biomedical knowledge. The thirteenth version of BioASQ will be held as an evaluation Lab in the context of CLEF2025 providing six tasks: (i) Task b on biomedical semantic question answering. (ii) Task Synergy on question answering developing biomedical topics. (iii) Task MultiClinSum on multilingual clinical summarization. (iv) Task BioNNE-L on nested named entity linking in Russian and English. (v) Task ELCardioCC on clinical coding in cardiology. (vi) Task GutBrainIE on gut-brain interplay information extraction. As BioASQ rewards the methods that outperform the state of the art in these shared tasks, it keeps pushing the research frontier towards approaches that will meet the need for efficient and precise access to biomedical knowledge.

HERO-Genomics: An Ontology for Integration and Access of Multicenter Genomic Data

Laura Menotti, Mirco Cazzaro, Manuel Rueda, Ivo G. Gut and Gianmaria Silvello

Conference Paper In Proc. of the 16th International SWAT4HCLS Conference - Semantic Web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS 2025).
To appear.

Abstract

The Hereditary Ontology for Genomic Data (HERO-Genomics) supports the representation of genomic data specific to Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS), enabling the documentation of gene mutations linked to these diseases. The current version includes a detailed framework for storing specific gene sequencing variations, such as Single Nucleotide Variants (SNVs). HERO-Genomics is part of the Hereditary Ontology (HERO), which is the backbone of the Semantic Data Integration platform of the HEREDITARY project, exploiting the Ontology-Based Data Access (OBDA) technology to query, aggregate, and join large heterogeneous data in a distributed manner using a unique query language, i.e., SPARQL.

Extending Nanopublications with Knowledge Provenance for Multi-Source Scientific Assertions

Fabio Giachelle, Stefano Marchesin, Laura Menotti and Gianmaria Silvello.

Conference Paper Best Paper Award In Proc. of the 21st conference on Information and Research science Connecting to Digital and Library science (IRCDL 2025). CEUR-WS Proceedings vol. 3937.
CEUR-WS: Vol-3937/paper10

Abstract

Nanopublications are RDF graphs that enable sharing machine-readable assertions on the Web while tracking their provenance and publication information. However, the current nanopublication model focuses on the provenance of single-source assertions derived from a specific publication or database. This work proposes extending the nanopublication model to include a fourth component called knowledge provenance. Knowledge provenance captures the context where an assertion is not derived from a single publication but from a body of knowledge that can comprise supporting and conflicting pieces of evidence that we need to track and refer to. We apply the defined model to the facts generated by the Collaborative Oriented Relation Extraction (CORE) system and publish 197,511 assertions in the form of extended nanopublications, allowing the identification, representation, access, and citation of individual gene expression-cancer associations.

Content-Based Dataset Retrieval Methods: Reproducibility of the ACORDAR Test Collection (Findings)

Laura Menotti, Manuel Barusco, Riccardo Forzan and Gianmaria Silvello.

Conference Paper In Linking Theory and Practice of Digital Libraries. Proc. of The 28th International Conference on Theory and Practice of Digital Libraries (TPDL 2024), Lecture Notes in Computer Science, vol 15177. Springer, Cham.
DOI: 10.1007/978-3-031-72437-4_18

Abstract

The FAIR principles constitute a cornerstone of contemporary scientific methodology, with the Digital Library (DL) community actively participating and providing significant advancements within this framework. By taking a reproducibility approach, this paper centers on findability, a pivotal aspect of scientific data management and stewardship. Specifically, we delve into the critical role of Data Search in enabling efficient retrieval across various contexts, including scholarly publications and scientific data management. Consequently, the convergence of Digital Library and Information Retrieval (IR) domains underscores the necessity to adapt document-level IR techniques to optimize dataset retrieval processes.

Dataset retrieval relies on dataset descriptions, hampered by incomplete and inconsistent metadata issues. Lately, there has been a growing emphasis on Content-Based Dataset Retrieval (CBDR), where metadata and dataset content are equally considered during indexing and retrieval. ACORDAR is the first open test collection to evaluate CBDR methods. It offered early insights into the benefits of integrating dataset content in retrieval.

Our study thoroughly assesses ACORDAR's quality and reusability while investigating the reproducibility of retrieval results. Concerns arise about accessibility to the collection's content due to broken links for 17.6% of datasets. Despite some errors and the need for non-trivial pre-processing steps, we replicated most but not all CBDR methods, thus raising some concerns about the suitability of ACORDAR as a reference test collection to further advance CBDR research and to employ these methods in the context of DL.

An Extensible and Unifying Approach to Retrospective Clinical Data Modeling: The BrainTeaser Ontology

Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Marta Gromicho, Umberto Manera, Eleonora Tavazzi, Giorgio Maria Di Nunzio, Gianmaria Silvello, and Nicola Ferro.

Journal Paper Journal of Biomedical Semantics 15, 16 (2024).
DOI: 10.1186/s13326-024-00317-y

Abstract

This paper presents the Brainteaser Ontology (BTO), which models the clinical history and disease progression of patients affected by two debilitating neurological diseases: Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS). The BTO is openly available on the Web, adopting the FAIR principles for data sharing. Currently, BTO has been used as the schema to retrieve the data for the iDPP@CLEF open challenge.

Furthermore, it has already been used to devise explainable AI algorithms to predict the progression of ALS and MS. The present paper is centred around the subjects of the journal; in particular, it focuses on the development and content of an ontology relevant to the biomedical community and how to use this ontology.

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2024

Giovanni Birolo, Pietro Bosoni, Guglielmo Faggioli, Helena Aidos, Roberto Bergamaschi, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Alessandro Guazzo, Enrico Longato, Sara C. Madeira, Umberto Manera, Stefano Marchesin, Laura Menotti, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Isotta Trescato, Martina Vettoretti, Barbara Di Camillo, and Nicola Ferro.

Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024). Lecture Notes in Computer Science, vol 14959. Springer, Cham.
DOI: 10.1007/978-3-031-71908-0_6

Abstract

Multiple Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS) are two neurodegenerative diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Patients affected by these diseases undergo the physical, psychological, and economic burdens of hospital stays and home care while facing uncertainty. A possible aid to patients and clinicians might come from AI tools that can preemptively identify the need for intervention and suggest personalized therapies during the progression of these diseases. The objective of iDPP@CLEF is to develop automatic approaches based on AI that can be used to describe the progression of these two neurodegenerative diseases, with the final goal of allowing patient stratification as well as the prediction of the disease progression, to help clinicians in assisting patients in the most timely manner.

Overview of iDPP@CLEF 2024: The Intelligent Disease Progression Prediction Challenge

Giovanni Birolo, Pietro Bosoni, Guglielmo Faggioli, Helena Aidos, Roberto Bergamaschi, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Alessandro Guazzo, Enrico Longato, Sara C. Madeira, Umberto Manera, Stefano Marchesin, Laura Menotti, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Isotta Trescato, Martina Vettoretti, Barbara Di Camillo, and Nicola Ferro.

Workshop Paper CLEF (Working Notes) 2024, CEUR Workshop Proceedings vol. 3740, pp. 1312-1331.
CEUR-WS: Vol-3740/paper-122

Abstract

Multiple Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS) are neurodegenerative diseases characterized by progressive or fluctuating impairments in motor, sensory, visual, and cognitive functions. Patients with these diseases endure significant physical, psychological, and economic burdens due to hospitalizations and home care while grappling with uncertainty about their conditions. AI tools hold promise for aiding patients and clinicians by identifying the need for intervention and suggesting personalized therapies throughout disease progression. The objective of iDPP@CLEF is to develop AI-based approaches to describe the progression of these diseases. The ultimate goal is to enable patient stratification and predict disease progression, thereby assisting clinicians in providing timely care. iDPP@CLEF 2024 continues the work of the previous editions, iDPP@CLEF 2022 and 2023. The 2022 edition focused on predicting ALS progression and utilizing explainable AI. The 2023 edition expanded on this by including environmental data and introduced a new task for predicting MS progression. This edition extends the MS dataset with environmental data and introduces two new ALS tasks aimed at predicting disease progression using data from wearable devices. This marks the first iDPP edition to utilize prospective data directly collected from patients involved in the BRAINTEASER project.

Bootstrapping Gene Expression-Cancer Knowledge Bases with Limited Human Annotations (Extended Abstract)

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello and Omar Alonso

Conference Paper In Proc. of The 32nd Italian Symposium on Advanced Database Systems (SEBD 2024), CEUR Workshop Proceedings vol. 3741, pp. 163-173.
CEUR-WS: Vol-3741/paper10

Abstract

We introduce the Collaborative Oriented Relation Extraction (CORE) system for Knowledge Base Construction, based on the combination of Relation Extraction (RE) methods and domain experts' feedback. CORE features a seamless, transparent, and modular architecture that suits large-scale processing. Via active learning, the CORE system bootstraps Knowledge Bases (KBs) and then employs RE methods to scale to large text corpora. We employ CORE to build one of the largest KBs focusing on fine-grained gene expression-cancer associations, fundamental to complement and validate experimental data for precision medicine and cancer research. We conducted comprehensive experiments showing the robustness of the approach and highlighting the scalability of CORE to large text corpora with limited manual annotations.

Exploring the Role of Generative AI in Constructing Knowledge Graphs for Drug Indications with Medical Context

Reham Alharbi, Umair Ahmed, Daniil Dobriy, Weronika Łajewska, Laura Menotti, Mohammad Javad Saeedizade, and Michel Dumontier.

Conference Paper In Proc. of The 15th International Semantic Web Applications and Tools for Health Care and Life Science conference (SWAT4HCLS 2024), CEUR-WS Proceedings vol. 3890, pp. 1-10.
CEUR-WS: Vol-3890/paper-1

Abstract

The medical context for a drug indication provides crucial information on how the drug can be used in practice. However, the extraction of medical context from drug indications remains poorly explored, as most research concentrates on the recognition of medications and associated diseases. Indeed, most databases cataloging drug indications do not contain their medical context in a machine-readable format. This paper proposes the use of a large language model for constructing DIAMOND-KG, a knowledge graph of drug indications and their medical context. The study 1) examines the change in accuracy and precision in providing additional instruction to the language model, 2) estimates the prevalence of medical context in drug indications, and 3) assesses the quality of DIAMOND-KG against NeuroDKG, a small manually curated knowledge graph. The results reveal that more elaborated prompts improve the quality of extraction of medical context; 71% of indications had at least one medical context; 63.52% of extracted medical contexts correspond to those identified in NeuroDKG. This paper demonstrates the utility of using large language models for specialized knowledge extraction, with a particular focus on extracting drug indications and their medical context. We provide DIAMOND-KG as a FAIR RDF graph supported with an ontology. Openly accessible, DIAMOND-KG may be useful for downstream tasks such as semantic query answering, recommendation engines, and drug repositioning research.

Publishing CoreKB Facts as Nanopublications

Fabio Giachelle, Stefano Marchesin, Laura Menotti and Gianmaria Silvello.

Conference Paper In Proc. of the 20th conference on Information and Research science Connecting to Digital and Library science (IRCDL 2024). CEUR-WS Proceedings vol. 3643, pp. 16-24.
CEUR-WS: Vol-3643/paper2

Abstract

The Collaborative Oriented Relation Extraction (CORE) system generates gene expression-cancer associations by combining scientific evidence from the literature. Such facts are then ingested into the CoreKB platform, where one can browse and search for associations. In this work, we publish 197,511 assertions from CoreKB as nanopublications, allowing the sharing of machine-readable gene-cancer associations while tracking their provenance and publication information.

Reproducibility and Generalization of a Relation Extraction System for Gene-Disease Associations (Invited Extended Abstract)

Laura Menotti

Conference Paper ITADATA 2023 Best Master Thesis Award on Big Data & Data Science In Proc. of the 2nd Italian Conference on Big Data and Data Science (ITADATA 2023), CEUR-WS Proceedings vol. 3606.
CEUR-WS: Vol-3606/invited78

Abstract

Understanding the interactions between genes and diseases is a great resource for improving patient care as it could provide the foundation for curative therapies, beneficial treatments, and preventative measures. This type of data is available in databases, e.g., DisGeNET and BioXpress, in the form of Gene-Disease Associations (GDAs), which contain relationships between gene expressions and specific diseases such as cancer. Biomedical literature is a rich source of information about GDAs, which are usually extracted manually from text. Human annotations are expensive and cannot scale to the huge amount of data available in scientific literature (e.g., biomedical abstracts). Therefore, developing automated tools to identify GDAs is gaining traction in the community. Such systems employ Relation Extraction (RE) techniques to extract information on gene/microRNA expression in diseases from text. Once an automated text-mining tool has been developed, it can be tested on human-annotated data or it can be compared to state-of-the-art systems. Indeed, it is crucial for researchers to compare newly developed systems with the state of the art to assess whether they made a breakthrough. The objective of this work is to reproduce DEXTER to provide a benchmark for RE, enabling researchers to test and compare their results to a state-of-the-art baseline. DEXTER is based on several modules, each dealing with a different part of the computation independently. While we preserved the original block structure, we decided to develop the system as an end-to-end application to foster reusability. In this way, our implementation of DEXTER can be easily run on different datasets, without extensive knowledge of the system’s internal architecture.

Overview of iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge

Guglielmo Faggioli, Alessandro Guazzo, Stefano Marchesin, Laura Menotti, Isotta Trescato, Helena Aidos, Roberto Bergamaschi, Giovanni Birolo, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Enrico Longato, Sara C. Madeira, Umberto Manera, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Martina Vettoretti, Barbara Di Camillo and Nicola Ferro

Workshop Paper CLEF 2023 Working Notes: 1123-1164.
CEUR-WS: Vol-3497/paper-095

Abstract

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions. iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner. iDPP@CLEF 2023 was organised into three tasks, two of which (Tasks 1 and 2) pertained to Multiple Sclerosis (MS), and one (Task 3) concerned the evaluation of the impact of environmental factors in the progression of Amyotrophic Lateral Sclerosis (ALS), and how to use environmental data at prediction time. 10 teams took part in the iDPP@CLEF 2023 Lab, submitting a total of 163 runs with multiple approaches to the disease progression prediction task, including Survival Random Forests and Coxnets.

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023

Guglielmo Faggioli, Alessandro Guazzo, Stefano Marchesin, Laura Menotti, Isotta Trescato, Helena Aidos, Roberto Bergamaschi, Giovanni Birolo, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Enrico Longato, Sara C. Madeira, Umberto Manera, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Martina Vettoretti, Barbara Di Camillo and Nicola Ferro

Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2023). Lecture Notes in Computer Science (LNCS) 14163, Springer, Heidelberg, Germany.
DOI: 10.1007/978-3-031-42448-9_24

Abstract

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions.
iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner.
iDPP@CLEF 2022 ran as a pilot lab in CLEF 2022, with tasks related to predicting ALS progression and explainable AI algorithms for prediction. iDPP@CLEF 2023 will continue in CLEF 2023, with a focus on predicting MS progression and exploring whether pollution and environmental data can improve the prediction of ALS progression.

Building a Large Gene Expression-Cancer Knowledge Base with Limited Human Annotations

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello, and Omar Alonso

Journal Paper Database: The Journal of Biological Databases and Curation, Volume 2023 (2023).
DOI: 10.1093/database/baad061

Abstract

Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assist medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to have machine-readable facts well organized in a Knowledge Base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE is based on the combination of distant supervision and active learning paradigms, and offers a seamless, transparent, modular architecture equipped for large-scale processing.
We focus on precision medicine and build the largest KB on fine-grained gene expression-cancer associations – a key to complement and validate experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB.

Modelling Digital Health Data: The ExaMode Ontology for Computational Pathology

Laura Menotti, Gianmaria Silvello, Manfredo Atzori, Svetla Boytcheva, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, Niccolò Marini, Henning Müller, and Todor Primov

Journal Paper Journal of Pathology Informatics, Volume 14 (2023), 100332.
DOI: 10.1016/j.jpi.2023.100332

Abstract

Computational pathology can significantly benefit from ontologies to standardize the employed nomenclature and help with knowledge extraction processes for high-quality annotated image datasets. The end goal is to reach a shared model for digital pathology to overcome data variability and integration problems. Indeed, data annotation in such a specific domain is still an unsolved challenge and datasets cannot be steadily reused in diverse contexts due to heterogeneity issues of the adopted labels, multilingualism, and different clinical practices.
Material and Methods. This paper presents the ExaMode ontology, modeling the histopathology process by considering three key cancer diseases (colon, cervical, and lung tumors) and celiac disease. The ExaMode ontology has been designed bottom-up in an iterative fashion with continuous feedback and validation from pathologists and clinicians. The ontology is organized into five semantic areas that define an ontological template to model any disease of interest in histopathology.
Results. The ExaMode ontology is currently being used as a common semantic layer in (i) an entity linking tool for the automatic annotation of medical records; (ii) a Web-based collaborative annotation tool for histopathology text reports; and (iii) a software platform for building holistic solutions integrating multimodal histopathology data.
Discussion. The ExaMode ontology is a key means to store data in a graph database according to the RDF data model. The creation of an RDF dataset can help develop more accurate algorithms for image analysis, especially in the field of digital pathology. This approach allows for seamless data integration and a unified query access point, from which we can extract relevant clinical insights about the considered diseases using SPARQL queries.

An Ontology-Driven Knowledge Extraction Tool for Pathology Record Classification

Laura Menotti, Stefano Marchesin and Gianmaria Silvello

Conference Paper In Proc. of the 31st Italian Symposium on Advanced Database Systems (SEBD 2023), CEUR-WS Proceedings vol. 3478, pp. 228-238.
CEUR-WS: Vol-3478/paper14

Abstract

The information in pathology diagnostic reports is often encoded in natural language. Extracting such knowledge can be instrumental in developing clinical decision support systems. However, the digital pathology domain lacks knowledge extraction systems suited to the task. One of the few examples is the Semantic Knowledge Extractor Tool (SKET), a hybrid knowledge extraction system combining a rule-based expert system with pre-trained ML models. SKET has been designed to extract knowledge from colon, cervix, and lung cancer diagnostic reports. To do so, the system employs an ontology-driven approach, where the extracted entities are linked with concepts modeled through a reference ontology, namely, the ExaMode ontology. In this work, we adapt SKET to a newer version of the ExaMode ontology and extend the method to account for an additional use case: celiac disease. Our experimental results show that: 1) the new version of SKET outperforms the previous one on colon, cervix, and lung cancer use cases; and 2) SKET is effective on celiac disease, confirming the ability of the system architecture to adapt to new, unseen scenarios.

Building a Relation Extraction Baseline for Gene-Disease Associations: A Reproducibility Study

Laura Menotti

Symposium Paper 10th edition of the PhD Symposium on Future Directions in Information Access (FDIA 2022), Lisbon, Portugal, July 20, 2022. arXiv preprint arXiv:2207.06226

Abstract

Reproducibility is an important task in scientific research. It is crucial for researchers to compare newly developed systems with the state-of-the-art to assess whether they made a breakthrough. However, previous works may not be immediately reproducible, for example due to the lack of source code. In this work, we reproduce DEXTER, a system to automatically extract Gene-Disease Associations (GDAs) from biomedical abstracts. The goal is to provide a benchmark for future works regarding Relation Extraction (RE), enabling researchers to test and compare their results.

Reproducibility and Generalization of a Relation Extraction System for Gene-Disease Associations

Laura Menotti

Master Thesis ITADATA 2023 Best Master Thesis Award on Big Data & Data Science
Master Degree in Computer Engineering, Department of Information Engineering, University of Padua, October 2022.

Abstract

Biomedical literature is a rich source of information on Gene-Disease Associations (GDAs) that could help physicians in assessing clinical decisions and improve patient care. GDAs are publicly available in databases containing relationships between gene/miRNA expression and related diseases such as specific types of cancer. Most of these resources, such as DisGeNET, miR2Disease and BioXpress, also include manually curated data from publications. Human annotations are expensive and cannot scale to the huge amount of data available in scientific literature (e.g., biomedical abstracts). Therefore, developing automated tools to identify GDAs is gaining traction in the community. Such systems employ Relation Extraction (RE) techniques to extract information on gene/microRNA expression in diseases from text. Once an automated text-mining tool has been developed, it can be tested on human annotated data or it can be compared to state-of-the-art systems. In this work, we reproduce DEXTER, a system to automatically extract Gene-Disease Associations (GDAs) from biomedical abstracts. The goal is to provide a benchmark for future works regarding Relation Extraction (RE), enabling researchers to test and compare their results. The implemented version of DEXTER is available in the following git repository: https://github.com/mntlra/DEXTER.

Information Management Systems

Department of Information Engineering

University of Padua
