Content-Based Dataset Retrieval Methods: Reproducibility of the ACORDAR Test Collection (Findings)

Laura Menotti, Manuel Barusco, Riccardo Forzan and Gianmaria Silvello.
Conference Paper In Linking Theory and Practice of Digital Libraries. Proc. of The 28th International Conference on Theory and Practice of Digital Libraries (TPDL 2024), Lecture Notes in Computer Science, vol 15177. Springer, Cham.
DOI: 10.1007/978-3-031-72437-4_18

Abstract

The FAIR principles constitute a cornerstone of contemporary scientific methodology, with the Digital Library (DL) community actively participating and providing significant advancements within this framework. By taking a reproducibility approach, this paper centers on findability, a pivotal aspect of scientific data management and stewardship. Specifically, we delve into the critical role of Data Search in enabling efficient retrieval across various contexts, including scholarly publications and scientific data management. Consequently, the convergence of Digital Library and Information Retrieval (IR) domains underscores the necessity to adapt document-level IR techniques to optimize dataset retrieval processes.

Dataset retrieval relies on dataset descriptions, hampered by incomplete and inconsistent metadata issues. Lately, there has been a growing emphasis on Content-Based Dataset Retrieval (CBDR), where metadata and dataset content are equally considered during indexing and retrieval. ACORDAR is the first open test collection to evaluate CBDR methods. It offered early insights into the benefits of integrating dataset content in retrieval.

Our study thoroughly assesses ACORDAR's quality and reusability while investigating the reproducibility of retrieval results. Concerns arise about accessibility to the collection's content due to broken links for 17.6% of datasets. Despite some errors and requiring non-trivial pre-processing steps, we replicated most but not all CBDR methods, thus raising some concerns about the suitability of ACORDAR as a reference test collection to further advance CBDR research and to employ these methods in the context of DL.

An Extensible and Unifying Approach to Retrospective Clinical Data Modeling: The BrainTeaser Ontology

Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Marta Gromicho, Umberto Manera, Eleonora Tavazzi, Giorgio Maria Di Nunzio, Gianmaria Silvello, and Nicola Ferro.
Journal Paper Journal of Biomedical Semantics 15, 16 (2024).
DOI: 10.1186/s13326-024-00317-y

Abstract

This paper presents the Brainteaser Ontology (BTO), which models patients’ clinical history and disease progression affected by two debilitating neurological diseases: Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS). The BTO is openly available on the Web, adopting the FAIR principles for data sharing. Currently, BTO has been used as the schema to retrieve the data for the iDPP@CLEF open challenge.

Furthermore, it has already been used to devise explainable AI algorithms to predict the progression of ALS and MS. The present paper is centred around the subjects of the journal; in particular, it focuses on the development and content of an ontology relevant to the biomedical community and how to use this ontology.

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2024

Giovanni Birolo, Pietro Bosoni, Guglielmo Faggioli, Helena Aidos, Roberto Bergamaschi, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Alessandro Guazzo, Enrico Longato, Sara C. Madeira, Umberto Manera, Stefano Marchesin, Laura Menotti, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Isotta Trescato, Martina Vettoretti, Barbara Di Camillo, and Nicola Ferro.
Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024). Lecture Notes in Computer Science, vol 14959. Springer, Cham.
DOI: 10.1007/978-3-031-42448-9_24

Abstract

Multiple Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS) are two neurodegenerative diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Patients affected by these diseases undergo the physical, psychological, and economic burdens of hospital stays and home care while facing uncertainty. A possible aid to patients and clinicians might come from AI tools that can preemptively identify the need for intervention and suggest personalized therapies during the progression of these diseases. The objective of iDPP@CLEF is to develop automatic approaches based on AI that can be used to describe the progression of these two neurodegenerative diseases, with the final goal of allowing patient stratification as well as the prediction of the disease progression, to help clinicians in assisting patients in the most timely manner.

Overview of iDPP@CLEF 2024: The Intelligent Disease Progression Prediction Challenge

Giovanni Birolo, Pietro Bosoni, Guglielmo Faggioli, Helena Aidos, Roberto Bergamaschi, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Alessandro Guazzo, Enrico Longato, Sara C. Madeira, Umberto Manera, Stefano Marchesin, Laura Menotti, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Isotta Trescato, Martina Vettoretti, Barbara Di Camillo, and Nicola Ferro.
Workshop Paper CLEF (Working Notes) 2024, CEUR Workshop Proceedings vol. 3740, pp. 1312-1331.
CEUR-WS: Vol-3740/paper-122

Abstract

Multiple Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS) are neurodegenerative diseases characterized by progressive or fluctuating impairments in motor, sensory, visual, and cognitive functions. Patients with these diseases endure significant physical, psychological, and economic burdens due to hospitalizations and home care while grappling with uncertainty about their conditions. AI tools hold promise for aiding patients and clinicians by identifying the need for intervention and suggesting personalized therapies throughout disease progression. The objective of iDPP@CLEF is to develop AI-based approaches to describe the progression of these diseases. The ultimate goal is to enable patient stratification and predict disease progression, thereby assisting clinicians in providing timely care. iDPP@CLEF 2024 continues the work of the previous editions, iDPP@CLEF 2022 and 2023. The 2022 edition focused on predicting ALS progression and utilizing explainable AI. The 2023 edition expanded on this by including environmental data and introduced a new task for predicting MS progression. This edition extends the MS dataset with environmental data and introduces two new ALS tasks aimed at predicting disease progression using data from wearable devices. This marks the first iDPP edition to utilize prospective data directly collected from patients involved in the BRAINTEASER project.

Bootstrapping Gene Expression-Cancer Knowledge Bases with Limited Human Annotations (Extended Abstract)

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello and Omar Alonso
Conference Paper In Proc. of The 32nd Italian Symposium on Advanced Database Systems (SEBD 2024), CEUR Workshop Proceedings vol. 3741, pp. 163-173.
CEUR-WS: Vol-3741/paper10

Abstract

We introduce the Collaborative Oriented Relation Extraction (CORE) system for Knowledge Base Construction, based on the combination of Relation Extraction (RE) methods and domain experts' feedback. CORE features a seamless, transparent, and modular architecture that suits large-scale processing. Via active learning, the CORE system bootstraps Knowledge Bases (KBs) and then employs RE methods to scale to large text corpora. We employ CORE to build one of the largest KBs focusing on fine-grained gene expression-cancer associations, fundamental to complement and validate experimental data for precision medicine and cancer research. We conducted comprehensive experiments showing the robustness of the approach and highlighting the scalability of CORE to large text corpora with limited manual annotations.

Exploring the Role of Generative AI in Constructing Knowledge Graphs for Drug Indications with Medical Context

Reham Alharbi, Umair Ahmed, Daniil Dobriy, Weronika Łajewska, Laura Menotti, Mohammad Javad Saeedizade, and Michel Dumontier.
Conference Paper In Proc. of The 15th International Semantic Web Applications and Tools for Health Care and Life Science conference (SWAT4HCLS 2024), CEUR-WS Proceedings vol. 3890, pp. 1-10.
CEUR-WS: Vol-3890/paper-1

Abstract

The medical context for a drug indication provides crucial information on how the drug can be used in practice. However, the extraction of medical context from drug indications remains poorly explored, as most research concentrates on the recognition of medications and associated diseases. Indeed, most databases cataloging drug indications do not contain their medical context in a machine-readable format. This paper proposes the use of a large language model for constructing DIAMOND-KG, a knowledge graph of drug indications and their medical context. The study 1) examines the change in accuracy and precision in providing additional instruction to the language model, 2) estimates the prevalence of medical context in drug indications, and 3) assesses the quality of DIAMOND-KG against NeuroDKG, a small manually curated knowledge graph. The results reveal that more elaborated prompts improve the quality of extraction of medical context; 71% of indications had at least one medical context; 63.52% of extracted medical contexts correspond to those identified in NeuroDKG. This paper demonstrates the utility of using large language models for specialized knowledge extraction, with a particular focus on extracting drug indications and their medical context. We provide DIAMOND-KG as a FAIR RDF graph supported with an ontology. Openly accessible, DIAMOND-KG may be useful for downstream tasks such as semantic query answering, recommendation engines, and drug repositioning research.

Publishing CoreKB Facts as Nanopublications

Fabio Giachelle, Stefano Marchesin, Laura Menotti and Gianmaria Silvello.
Conference Paper In Proc. of the 20th conference on Information and Research science Connecting to Digital and Library science (IRCDL 2024). CEUR-WS Proceedings vol. 3643, pp. 16-24.
CEUR-WS: Vol-3643/paper2

Abstract

The Collaborative Oriented Relation Extraction (CORE) system generates gene expression-cancer associations by combining scientific evidence from the literature. Such facts are then ingested into the CoreKB platform, where one can browse and search for associations. In this work, we publish 197,511 assertions from CoreKB as nanopublications, allowing the sharing of machine-readable gene-cancer associations while tracking their provenance and publication information.

Reproducibility and Generalization of a Relation Extraction System for Gene-Disease Associations (Invited Extended Abstract)

Laura Menotti
Conference Paper ITADATA 2023 Best Master Thesis Award on Big Data & Data Science In Proc. of the 2nd Italian Conference on Big Data and Data Science (ITADATA 2023), CEUR-WS Proceedings vol. 3606.
CEUR-WS: Vol-3606/invited78

Abstract

Understanding the interactions between genes and diseases is a great resource for improving patient care as it could provide the foundation for curative therapies, beneficial treatments, and preventative measures. This type of data is available in databases, e.g. DisGeNET and BioXpress, in the form of Gene-Disease Associations (GDAs), which contain relationships between gene expressions and specific diseases such as cancer. Biomedical literature is a rich source of information about GDAs, which are usually extracted manually from text. Human annotations are expensive and cannot scale to the huge amount of data available in scientific literature (e.g., biomedical abstracts). Therefore, developing automated tools to identify GDAs is gaining traction in the community. Such systems employ Relation Extraction (RE) techniques to extract information on gene/microRNA expression in diseases from text. Once an automated text-mining tool has been developed, it can be tested on human-annotated data or it can be compared to state-of-the-art systems. Indeed, it is crucial for researchers to compare newly developed systems with the state-of-the-art to assess whether they made a breakthrough. The objective of this work is to reproduce DEXTER to provide a benchmark for RE, enabling researchers to test and compare their results to a state-of-the-art baseline. DEXTER is based on several modules, each dealing with a different part of the computation independently. While we preserved the original block structure, we decided to develop the system as an end-to-end application to foster reusability. In this way, our implementation of DEXTER can be easily run on different datasets, without extensive knowledge of the system’s internal architecture.

Overview of iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge

Guglielmo Faggioli, Alessandro Guazzo, Stefano Marchesin, Laura Menotti, Isotta Trescato, Helena Aidos, Roberto Bergamaschi, Giovanni Birolo, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Enrico Longato, Sara C. Madeira, Umberto Manera, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Martina Vettoretti, Barbara Di Camillo and Nicola Ferro
Workshop Paper CLEF 2023 Working Notes: 1123-1164.
CEUR-WS: Vol-3497/paper-095

Abstract

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions. iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner. iDPP@CLEF 2023 was organised into three tasks, two of which (Tasks 1 and 2) pertained to Multiple Sclerosis (MS), and one (Task 3) concerned the evaluation of the impact of environmental factors in the progression of Amyotrophic Lateral Sclerosis (ALS), and how to use environmental data at prediction time. 10 teams took part in the iDPP@CLEF 2023 Lab, submitting a total of 163 runs with multiple approaches to the disease progression prediction task, including Survival Random Forests and Coxnets.

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023

Guglielmo Faggioli, Alessandro Guazzo, Stefano Marchesin, Laura Menotti, Isotta Trescato, Helena Aidos, Roberto Bergamaschi, Giovanni Birolo, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Enrico Longato, Sara C. Madeira, Umberto Manera, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Martina Vettoretti, Barbara Di Camillo and Nicola Ferro
Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2023). Lecture Notes in Computer Science (LNCS) 14163, Springer, Heidelberg, Germany.
DOI: 10.1007/978-3-031-42448-9_24

Abstract

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions.
iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner.
iDPP@CLEF 2022 ran as a pilot lab in CLEF 2022, with tasks related to predicting ALS progression and explainable AI algorithms for prediction. iDPP@CLEF 2023 will continue in CLEF 2023, with a focus on predicting MS progression and exploring whether pollution and environmental data can improve the prediction of ALS progression.

Building a Large Gene Expression-Cancer Knowledge Base with Limited Human Annotations

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello, and Omar Alonso
Journal Paper Database: The Journal of Biological Databases and Curation, Volume 2023 (2023).
DOI: 10.1093/database/baad061

Abstract

Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assist medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to have well-organized machine-readable facts into a Knowledge Base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE is based on the combination of distant supervision and active learning paradigms, and offers a seamless, transparent, modular architecture equipped for large-scale processing.
We focus on precision medicine and build the largest KB on fine-grained gene expression-cancer associations – a key to complement and validate experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB.

Modelling Digital Health Data: The ExaMode Ontology for Computational Pathology

Laura Menotti, Gianmaria Silvello, Manfredo Atzori, Svetla Boytcheva, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, Niccolò Marini, Henning Müller, and Todor Primov
Journal Paper Journal of Pathology Informatics, Volume 14 (2023), 100332.
DOI: 10.1016/j.jpi.2023.100332

Abstract

Computational pathology can significantly benefit from ontologies to standardize the employed nomenclature and help with knowledge extraction processes for high-quality annotated image datasets. The end goal is to reach a shared model for digital pathology to overcome data variability and integration problems. Indeed, data annotation in such a specific domain is still an unsolved challenge and datasets cannot be steadily reused in diverse contexts due to heterogeneity issues of the adopted labels, multilingualism, and different clinical practices.
Material and Methods. This paper presents the ExaMode ontology, modeling the histopathology process by considering three key cancer diseases (colon, cervical, and lung tumors) and celiac disease. The ExaMode ontology has been designed bottom-up in an iterative fashion with continuous feedback and validation from pathologists and clinicians. The ontology is organized into five semantic areas that define an ontological template to model any disease of interest in histopathology.
Results. The ExaMode ontology is currently being used as a common semantic layer in (i) an entity linking tool for the automatic annotation of medical records; (ii) a Web-based collaborative annotation tool for histopathology text reports; and (iii) a software platform for building holistic solutions integrating multimodal histopathology data.
Discussion. The ExaMode ontology is a key means to store data in a graph database according to the RDF data model. The creation of an RDF dataset can help develop more accurate algorithms for image analysis, especially in the field of digital pathology. This approach allows for seamless data integration and a unified query access point, from which we can extract relevant clinical insights about the considered diseases using SPARQL queries.

An Ontology-Driven Knowledge Extraction Tool for Pathology Record Classification

Laura Menotti, Stefano Marchesin and Gianmaria Silvello
Conference Paper In Proc. of the 31st Italian Symposium on Advanced Database Systems (SEBD 2023), CEUR-WS Proceedings vol. 3478, pp. 228-238.
CEUR-WS: Vol-3478/paper14

Abstract

The information in pathology diagnostic reports is often encoded in natural language. Extracting such knowledge can be instrumental in developing clinical decision support systems. However, the digital pathology domain lacks knowledge extraction systems suited to the task. One of the few examples is the Semantic Knowledge Extractor Tool (SKET), a hybrid knowledge extraction system combining a rule-based expert system with pre-trained ML models. SKET has been designed to extract knowledge from colon, cervix, and lung cancer diagnostic reports. To do so, the system employs an ontology-driven approach, where the extracted entities are linked with concepts modeled through a reference ontology, namely, the ExaMode ontology. In this work, we adapt SKET to a newer version of the ExaMode ontology and extend the method to account for an additional use case: Celiac disease. Our experimental results show that: 1) the new version of SKET outperforms the previous one on colon, cervix, and lung cancer use cases; and 2) SKET is effective on Celiac disease, confirming the ability of the system architecture to adapt to new, unseen scenarios.

Building a Relation Extraction Baseline for Gene-Disease Associations: A Reproducibility Study

Laura Menotti
Symposium Paper 10th edition of the PhD Symposium on Future Directions in Information Access (FDIA 2022), Lisbon, Portugal, July 20, 2022. arXiv preprint arXiv:2207.06226

Abstract

Reproducibility is an important task in scientific research. It is crucial for researchers to compare newly developed systems with the state-of-the-art to assess whether they made a breakthrough. However, previous works may not be immediately reproducible, for example, due to the lack of source code. In this work, we reproduce DEXTER, a system to automatically extract Gene-Disease Associations (GDAs) from biomedical abstracts. The goal is to provide a benchmark for future works regarding Relation Extraction (RE), enabling researchers to test and compare their results.

Reproducibility and Generalization of a Relation Extraction System for Gene-Disease Associations

Laura Menotti
Master Thesis ITADATA 2023 Best Master Thesis Award on Big Data & Data Science
Master Degree in Computer Engineering, Department of Information Engineering, University of Padua, October 2022.

Abstract

Biomedical literature is a rich source of information on Gene-Disease Associations (GDAs) that could help physicians in assessing clinical decisions and improve patient care. GDAs are publicly available in databases containing relationships between gene/miRNA expression and related diseases such as specific types of cancer. Most of these resources, such as DisGeNET, miR2Disease and BioXpress, also include manually curated data from publications. Human annotations are expensive and cannot scale to the huge amount of data available in scientific literature (e.g., biomedical abstracts). Therefore, developing automated tools to identify GDAs is gaining traction in the community. Such systems employ Relation Extraction (RE) techniques to extract information on gene/microRNA expression in diseases from text. Once an automated text-mining tool has been developed, it can be tested on human-annotated data or it can be compared to state-of-the-art systems. In this work, we reproduce DEXTER, a system to automatically extract Gene-Disease Associations (GDAs) from biomedical abstracts. The goal is to provide a benchmark for future works regarding Relation Extraction (RE), enabling researchers to test and compare their results. The implemented version of DEXTER is available in the following git repository: https://github.com/mntlra/DEXTER .