The increasing availability of biomedical data creates valuable resources for developing new deep learning algorithms to support experts, especially in domains where collecting large volumes of annotated data is not trivial. Biomedical data include several modalities containing complementary information, such as medical images and reports: images are often large and encode low-level information, while reports include a summarized, high-level description of the findings, often concerning only a small part of the image. However, only a few methods can effectively link the visual content of images with the textual content of reports, preventing medical specialists from fully benefiting from the recent opportunities offered by deep learning models. This paper introduces a multimodal architecture that creates a robust biomedical data representation by encoding fine-grained text representations within image embeddings. The architecture aims to tackle data scarcity (combining supervised and self-supervised learning) and to create multimodal biomedical ontologies. The architecture is trained on over 6,000 colon Whole Slide Images (WSIs), paired with the corresponding reports, collected from two digital pathology workflows. The evaluation of the multimodal architecture involves three tasks: WSI classification (on data from the pathology workflows and from public repositories), multimodal data retrieval, and linking between textual and visual concepts. Notably, the latter two tasks are available by architectural design without further training, showing that the multimodal architecture can be adopted as a backbone to solve specific tasks. The multimodal data representation outperforms the unimodal one on the classification of colon WSIs and halves the data needed to reach accurate performance, reducing the computational power required and thus the carbon footprint.
The combination of images and reports through self-supervised algorithms makes it possible to mine databases and extract new information without needing new annotations from experts. In particular, the multimodal visual ontology, linking semantic concepts to images, may pave the way to advancements in medicine and biomedical analysis domains, not limited to histopathology.
In this paper, we discuss the potential costs that emerge from using a Knowledge Graph (KG) in entity-oriented search without considering its data veracity. We argue for the need for KG veracity analysis to gain insights and propose a scalable assessment framework. Previous assessments focused on relevance, assuming correct KGs and overlooking the potential risks of misinformation. Our approach strategically allocates annotation resources, optimizing utility and revealing the significant impact of veracity on entity search and card generation. Contributions include a fresh perspective on entity-oriented search extending beyond the conventional focus on relevance, a scalable assessment framework, exploratory experiments highlighting the impact of veracity on ranking and user experience, and an outline of associated challenges and opportunities.
Knowledge Graphs (KGs) are essential for applications like search, recommendation, and virtual assistants, where their accuracy directly impacts effectiveness. However, due to their large scale and ever-evolving nature, it is impractical to manually evaluate all KG contents. We propose a framework that employs sampling, estimation, and active learning to audit KG accuracy in a cost-effective manner. The framework prioritizes KG facts based on their utility to downstream tasks. We applied the framework to DBpedia and gathered annotations from both expert and layman annotators. We also explored the potential of Large Language Models (LLMs) as KG evaluators, showing that while they can perform comparably to low-quality human annotators, they tend to overestimate KG accuracy. As such, LLMs are currently insufficient to replace human crowdworkers in the evaluation process. The results also provide insights into the scalability of methods for auditing KGs.
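As a rough illustration of the sampling-and-estimation idea, the sketch below estimates KG accuracy from a simple random sample of annotated facts. This is a minimal sketch under the assumption of uniform sampling; the framework's actual designs (utility-based prioritization, active learning) are more sophisticated, and all names here are illustrative.

```python
import math
import random

def estimate_kg_accuracy(facts, annotate, sample_size, seed=0):
    """Estimate KG accuracy from a simple random sample of facts.

    `annotate` stands in for a costly human (or LLM) oracle that
    returns True if a fact is correct. Returns the point estimate
    and a 95% normal-approximation margin of error.
    """
    rng = random.Random(seed)
    sample = rng.sample(facts, sample_size)
    correct = sum(1 for fact in sample if annotate(fact))
    p_hat = correct / sample_size
    moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, moe

# Toy KG of 10,000 triples where ~90% of facts are correct;
# the last tuple element plays the role of ground truth.
facts = [("s%d" % i, "p", "o", i % 10 != 0) for i in range(10_000)]
p_hat, moe = estimate_kg_accuracy(facts, lambda f: f[3], sample_size=400)
```

Annotating 400 facts instead of 10,000 already yields an estimate within a few percentage points, which is the cost-saving lever the framework builds on.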
Multiple Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS) are neurodegenerative diseases characterized by progressive or fluctuating impairments in motor, sensory, visual, and cognitive functions. Patients with these diseases endure significant physical, psychological, and economic burdens due to hospitalizations and home care while grappling with uncertainty about their conditions. AI tools hold promise for aiding patients and clinicians by identifying the need for intervention and suggesting personalized therapies throughout disease progression. The objective of iDPP@CLEF is to develop AI-based approaches to describe the progression of these diseases. The ultimate goal is to enable patient stratification and predict disease progression, thereby assisting clinicians in providing timely care. iDPP@CLEF 2024 continues the work of the previous editions, iDPP@CLEF 2022 and 2023. The 2022 edition focused on predicting ALS progression and utilizing explainable AI. The 2023 edition expanded on this by including environmental data and introduced a new task for predicting MS progression. This edition extends the MS dataset with environmental data and introduces two new ALS tasks aimed at predicting disease progression using data from wearable devices. This marks the first iDPP edition to utilize prospective data directly collected from patients involved in the BRAINTEASER project.
Multiple Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS) are two neurodegenerative diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Patients affected by these diseases undergo the physical, psychological, and economic burdens of hospital stays and home care while facing uncertainty. A possible aid to patients and clinicians might come from AI tools that can preemptively identify the need for intervention and suggest personalized therapies during the progression of these diseases. The objective of iDPP@CLEF is to develop automatic AI-based approaches to describe the progression of these two neurodegenerative diseases, with the final goal of enabling patient stratification and the prediction of disease progression, to help clinicians assist patients in the most timely manner. iDPP@CLEF 2024 follows the two prior editions, iDPP@CLEF 2022 and 2023. iDPP@CLEF 2022 focused on ALS progression prediction and explainable AI approaches for the task. iDPP@CLEF 2023 built upon iDPP@CLEF 2022 by extending the datasets provided during the previous edition with environmental data. Additionally, the 2023 edition introduced a new task focused on the progression prediction of MS. In this edition, we extended the MS dataset of iDPP@CLEF 2023 with environmental data. Furthermore, we introduced two new ALS tasks, focused on predicting the progression of the disease using data obtained from wearable devices, making it the first iDPP edition that uses prospective data collected directly from the patients involved in the BRAINTEASER project.
Automatic disease progression prediction models require large amounts of training data, which are seldom available, especially when it comes to rare diseases. A possible solution is to integrate data from different medical centres. Nevertheless, various centres often follow diverse data collection procedures and assign different semantics to the collected data. Ontologies, used as schemas for interoperable knowledge bases, represent a state-of-the-art solution to harmonize the semantics and foster data integration from various sources. This work presents the BrainTeaser Ontology (BTO), an ontology that models the clinical data associated with two brain-related rare diseases (ALS and MS) in a comprehensive and modular manner. BTO assists in organizing and standardizing the data collected during patient follow-up. It was created by harmonizing schemas currently used by multiple medical centres into a common ontology, following a bottom-up approach. As a result, BTO effectively addresses the practical data collection needs of various real-world situations and promotes data portability and interoperability. BTO captures various clinical occurrences, such as disease onset, symptoms, diagnostic and therapeutic procedures, and relapses, using an event-based approach. Developed in collaboration with medical partners and domain experts, BTO offers a holistic view of ALS and MS, supporting the representation of retrospective and prospective data. Furthermore, BTO adheres to Open Science and FAIR (Findable, Accessible, Interoperable, and Reusable) principles, making it a reliable framework for developing predictive tools to aid in medical decision-making and patient care. Although BTO is designed for ALS and MS, its modular structure makes it easily extendable to other brain-related diseases, showcasing its potential for broader applicability.
We introduce the Collaborative Oriented Relation Extraction (CORE) system for Knowledge Base Construction, based on the combination of Relation Extraction (RE) methods and domain experts' feedback. CORE features a seamless, transparent, and modular architecture that suits large-scale processing. Via active learning, the CORE system bootstraps Knowledge Bases (KBs) and then employs RE methods to scale to large text corpora. We employ CORE to build one of the largest KBs focusing on fine-grained gene expression-cancer associations, fundamental to complement and validate experimental data for precision medicine and cancer research. We conducted comprehensive experiments showing the robustness of the approach and highlighting the scalability of CORE to large text corpora with limited manual annotations.
Data accuracy is a central dimension of data quality, especially when dealing with Knowledge Graphs (KGs). Auditing the accuracy of KGs is essential to make informed decisions in entity-oriented services or applications. However, manually evaluating the accuracy of large-scale KGs is prohibitively expensive, and research is focused on developing efficient sampling techniques for estimating KG accuracy. This work addresses the limitations of current KG accuracy estimation methods, which rely on the Wald method to build confidence intervals and suffer from reliability issues such as zero-width and overshooting intervals. Our solution, rooted in the Wilson method and tailored for complex sampling designs, overcomes these limitations and ensures applicability across various evaluation scenarios. We show that the presented methods increase the reliability of accuracy estimates by up to two times compared to the state of the art, while preserving or enhancing efficiency. Additionally, this consistency holds regardless of the KG size or topology.
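The zero-width problem of Wald intervals, and how Wilson intervals avoid it, can be illustrated with a small numeric sketch. These are the standard textbook formulas for the two intervals; the paper's estimators for complex sampling designs are more involved.

```python
import math

Z = 1.96  # critical value for 95% confidence

def wald_interval(successes, n):
    """Wald interval: symmetric around p-hat, collapses when p-hat is 0 or 1."""
    p = successes / n
    half = Z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def wilson_interval(successes, n):
    """Wilson score interval: recentred estimate, never collapses to zero width."""
    p = successes / n
    denom = 1 + Z**2 / n
    centre = (p + Z**2 / (2 * n)) / denom
    half = (Z / denom) * math.sqrt(p * (1 - p) / n + Z**2 / (4 * n**2))
    return centre - half, centre + half

# With 50 sampled facts, all annotated as correct, Wald degenerates to
# the zero-width interval [1, 1], while Wilson still reports uncertainty.
w_lo, w_hi = wald_interval(50, 50)
ws_lo, ws_hi = wilson_interval(50, 50)
```

A zero-width interval wrongly suggests the KG accuracy is known exactly after only 50 annotations, which is precisely the reliability failure the abstract refers to.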
Background The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large annotated corpora, often manually or semi-manually, vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming, especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster the creation of annotated corpora, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations, also integrating automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, a functionality often overlooked by off-the-shelf annotation tools. Results We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools, including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron's performance in terms of time and number of clicks needed to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performance. Conclusions MetaTron stands out as one of the few annotation tools targeting the biomedical domain that support the annotation of relations, and it is fully customizable with documents in several formats, PDF included, as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.
The Collaborative Oriented Relation Extraction (CORE) system generates gene expression-cancer associations by combining scientific evidence from the literature. Such facts are then ingested into the CoreKB platform, where one can browse and search for associations. In this work, we publish 197,511 assertions from CoreKB as nanopublications, allowing the sharing of machine-readable gene-cancer associations while tracking their provenance and publication information.
Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions. iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner. iDPP@CLEF 2023 was organised into three tasks: two (Tasks 1 and 2) pertained to MS, while one (Task 3) concerned the evaluation of the impact of environmental factors on the progression of ALS and how to use environmental data at prediction time. Ten teams took part in the iDPP@CLEF 2023 Lab, submitting a total of 163 runs with multiple approaches to the disease progression prediction task, including Survival Random Forests and Coxnets.
Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions. iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner. iDPP@CLEF 2022 ran as a pilot lab in CLEF 2022, with tasks related to predicting ALS progression and explainable AI algorithms for prediction. iDPP@CLEF 2023 will continue in CLEF 2023, with a focus on predicting MS progression and exploring whether pollution and environmental data can improve the prediction of ALS progression.
Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assist medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to have well-organized machine-readable facts into a Knowledge Base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE is based on the combination of distant supervision and active learning paradigms, and offers a seamless, transparent, modular architecture equipped for large-scale processing.
We focus on precision medicine and build the largest KB on fine-grained gene expression-cancer associations, a key to complementing and validating experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB.
Background
Computational pathology can significantly benefit from ontologies to standardize the employed nomenclature and help with knowledge extraction processes for high-quality annotated image datasets. The end goal is to reach a shared model for digital pathology to overcome data variability and integration problems. Indeed, data annotation in such a specific domain is still an unsolved challenge, and datasets cannot be reliably reused in diverse contexts due to the heterogeneity of the adopted labels, multilingualism, and different clinical practices.
Material and methods
This paper presents the ExaMode ontology, modeling the histopathology process by considering three key cancer diseases (colon, cervical, and lung tumors) and celiac disease. The ExaMode ontology has been designed bottom-up in an iterative fashion, with continuous feedback and validation from pathologists and clinicians. The ontology is organized into five semantic areas that define an ontological template to model any disease of interest in histopathology.
Results
The ExaMode ontology is currently being used as a common semantic layer in: (i) an entity linking tool for the automatic annotation of medical records; (ii) a web-based collaborative annotation tool for histopathology text reports; and (iii) a software platform for building holistic solutions integrating multimodal histopathology data.
Discussion
The ExaMode ontology is a key means to store data in a graph database according to the RDF data model. The creation of an RDF dataset can help develop more accurate algorithms for image analysis, especially in the field of digital pathology. This approach allows for seamless data integration and a unified query access point, from which relevant clinical insights about the considered diseases can be extracted using SPARQL queries.
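The kind of insight-extraction query this enables might look like the following sketch, which counts reports per diagnosis for colon cases. The prefix, class, and property names are hypothetical placeholders, not the actual ExaMode vocabulary.

```sparql
# Hypothetical sketch: prefix, classes, and properties are illustrative,
# not the actual ExaMode vocabulary.
PREFIX exa: <https://example.org/examode#>

SELECT ?diagnosis (COUNT(?report) AS ?nReports)
WHERE {
  ?report a exa:PathologyReport ;
          exa:hasTopography exa:Colon ;
          exa:hasDiagnosis ?diagnosis .
}
GROUP BY ?diagnosis
ORDER BY DESC(?nReports)
```

Aggregations like this, run over the unified RDF access point, are what turns heterogeneous clinical records into queryable clinical insights.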
Automatic Term Extraction (ATE) systems have been studied for many decades as, among other things, one of the most important tools for tasks such as information retrieval, sentiment analysis, named entity recognition, and others. Interest in this topic has further increased in recent years thanks to the support and improvements brought by new neural approaches. In this article, we present a follow-up on the discussions about the pipeline for extracting key terms from medical reports, presented at MDTT 2022, and analyze the most recent papers about ATE in a systematic review fashion. We analyzed the journal and conference papers published in 2022 (and partially in 2023) about ATE and clustered them into subtopics according to the focus of the papers for a better presentation.
In this work, we propose a novel framework to devise features that can be used by Query Performance Prediction (QPP) models for Neural Information Retrieval (NIR). Using the proposed framework as a periodic table of QPP components, practitioners can devise new predictors better suited for NIR. Through the framework, we detail the challenges and opportunities that arise for QPPs at different stages of the NIR pipeline. We show the potential of the proposed framework by using it to devise two types of novel predictors. The first one, named MEMory-based QPP (MEM-QPP), exploits the similarity between test and train queries to measure how much a NIR system can memorize. The second adapts traditional QPPs into NIR-oriented ones by computing the query-corpus semantic similarity. By exploiting the inherent nature of NIR systems, the proposed predictors outperform the current state of the art under various setups, highlighting at the same time the versatility of the framework in describing different types of QPPs.
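A minimal sketch of the second type of predictor, assuming precomputed dense embeddings: scoring the query against the corpus centroid is one simple instantiation of query-corpus semantic similarity, not necessarily the exact predictor devised in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_qpp_score(query_vec, doc_vecs):
    """Pre-retrieval predictor: similarity between the query embedding
    and the corpus centroid. A higher score suggests the query topic is
    well covered by the corpus, hinting at easier retrieval."""
    dim = len(query_vec)
    centroid = [sum(d[i] for d in doc_vecs) / len(doc_vecs) for i in range(dim)]
    return cosine(query_vec, centroid)

# Toy 3-dimensional embeddings of a small corpus
corpus = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.7, 0.3, 0.1]]
on_topic = semantic_qpp_score([1.0, 0.1, 0.0], corpus)
off_topic = semantic_qpp_score([0.0, 0.1, 1.0], corpus)
```

An on-topic query scores much higher than an off-topic one, which is exactly the signal a NIR-oriented QPP can exploit where lexical statistics fall short.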
CoreKB is a web-based platform enabling users to search for reliable scientific facts concerning gene expression-cancer associations over a medical Knowledge Base (KB). CoreKB provides a streamlined interface for searching either with natural language queries or through structured facets offering autocomplete facilities. It is designed to simplify information access and the search for scientific facts for healthcare stakeholders (i.e., clinicians, physicians, and researchers). CoreKB aims at presenting the user with a comprehensive overview of the scientific evidence supporting a medical fact, fully connected with ontology-based entities and well-defined literature resources. In addition, CoreKB provides the user with a quantitative comparison of the possible gene-cancer associations related to a specific fact, enabling users to assess the degree of agreement among the supporting evidence.
The information in pathology diagnostic reports is often encoded in natural language. Extracting such knowledge can be instrumental in developing clinical decision support systems. However, the digital pathology domain lacks knowledge extraction systems suited to the task. One of the few examples is the Semantic Knowledge Extractor Tool (SKET), a hybrid knowledge extraction system combining a rule-based expert system with pre-trained ML models. SKET has been designed to extract knowledge from colon, cervix, and lung cancer diagnostic reports. To do so, the system employs an ontology-driven approach, where the extracted entities are linked with concepts modeled through a reference ontology, namely, the ExaMode ontology. In this work, we adapt SKET to a newer version of the ExaMode ontology and extend the method to account for an additional use case: Celiac disease. Our experimental results show that: 1) the new version of SKET outperforms the previous one on colon, cervix, and lung cancer use cases; and 2) SKET is effective on Celiac disease, confirming the ability of the system architecture to adapt to new, unseen scenarios.
The evaluation of Information Retrieval (IR) relies on human-made relevance assessments whose collection is time-consuming and expensive. To alleviate this limitation, Query Performance Prediction (QPP) models have been developed to estimate system performance without relying on human-made relevance judgements. QPP models have been applied to traditional IR methods with varying success. The shift towards semantic signals thanks to Neural IR (NIR) models has changed the retrieval paradigm. In this study, we investigate the ability of current QPP models to predict the performance of NIR systems. We evaluate seven traditional IR systems and seven NIR (BERT-based) approaches, as well as nineteen QPPs, on two collections: Deep Learning ’19 and Robust ’04. Our results highlight that QPPs perform significantly worse on NIR systems. When semantic signals are prevalent, such as in passage retrieval, their performance on neural models decreases by up to 10% compared to bag-of-words approaches.
In this paper, we present our previous and current work on the methodology and the experimental analysis of a query reformulation, pseudo-relevance feedback, and document filtering approach. In particular, we present a summary of two studies carried out in the context of the TREC Precision Medicine track. The two original papers are [1] and [2].
In recent years, knowledge extraction approaches have been adopted to distill the medical knowledge included in clinical reports. In this regard, the Semantic Knowledge Extractor Tool (SKET) has been introduced for extracting knowledge from pathology reports, leveraging a hybrid approach that combines unsupervised rule-based techniques with pre-trained Machine Learning (ML) models. Since ML models are usually based on probabilistic/statistical approaches, their predictions cannot be easily understood, especially regarding their underlying decision mechanisms. To explain SKET's decision-making process, we propose SKET eXplained (SKET X), a web-based system providing visual explanations in terms of the models, rules, and parameters involved in each prediction. SKET X is designed for pathologists and experts to ease the comprehension of SKET predictions, increase awareness, and improve the effectiveness of the overall knowledge extraction process according to the pathologists' feedback. To assess the learnability and usability of SKET X, we conducted a user study designed to collect useful suggestions from pathologists and domain experts to further improve the system.
The recent advent of foundation models and large language models has enabled scientists to leverage the large-scale knowledge of pretrained (vision) transformers and efficiently tailor it to downstream tasks. This technology can potentially automate multiple aspects of cancer diagnosis in digital pathology, from whole-slide image classification to generating pathology reports, while training on pairs of images and text from the diagnostic conclusion. In this work, we orchestrate a set of weakly-supervised transformer-based models with the first aim of addressing both whole-slide image classification and captioning, i.e., the automatic generation of the conclusion of pathology reports in the form of image captions. We report our first results on a multicentric multilingual dataset of colon polyps and biopsies. We achieve high diagnostic accuracy with no supervision and cheap computational adaptation.
This work presents CoreKB, a Web platform for searching reliable facts over a gene expression-cancer associations Knowledge Base (KB). It provides search capabilities over an RDF graph using natural language queries, structured facets, and autocomplete. CoreKB is designed to be intuitive and easy to use for healthcare professionals, medical researchers, and clinicians. The system offers the user a comprehensive overview of the scientific evidence supporting a medical fact, and it provides a quantitative comparison between the possible gene-cancer associations a particular fact can reflect.
Large volumes of medical data have been produced for decades. These data include diagnoses, which are often reported as free text, thus encoding medical knowledge that is still largely unexploited. To decode the medical knowledge present within reports, we propose the Semantic Knowledge Extractor Tool (SKET), an unsupervised knowledge extraction system combining a rule-based expert system with pre-trained Machine Learning (ML) models. This work demonstrates the viability of unsupervised Natural Language Processing (NLP) techniques to extract critical information from cancer reports, opening opportunities such as data mining for knowledge extraction purposes, precision medicine applications, structured report creation, and multimodal learning.
Evaluation in Information Retrieval (IR) relies on post-hoc empirical procedures, which are time-consuming and expensive operations. To alleviate this, Query Performance Prediction (QPP) models have been developed to estimate the performance of a system without the need for human-made relevance judgements. Such models, traditionally relying on lexical features from queries and corpora, have been applied to lexical IR systems with varying degrees of success. With the advent of neural IR based on large Pre-trained Language Models, the retrieval paradigm has significantly shifted towards the use of semantic signals. In this work, we study and analyze to what extent current QPP models can predict the performance of neural IR systems. Our experiments consider nineteen state-of-the-art QPPs, seven traditional and seven neural (BERT-based) IR approaches, and two well-known collections: Deep Learning ’19 and Robust ’04. Our findings show that traditional QPPs perform statistically significantly worse on neural IR systems. In settings where semantic signals are prominent (e.g., passage retrieval), the QPP performance on neural systems drops by as much as 10% compared to lexical ones. On top of that, in lexical-oriented scenarios, QPPs fail to predict performance for neural IR systems on precisely those queries where they differ most from lexical ones.
Exa-scale volumes of medical data have been produced for decades. In most cases, the diagnosis is reported in free text, encoding medical knowledge that is still largely unexploited. To decode the medical knowledge included in reports, we propose an unsupervised knowledge extraction system combining a rule-based expert system with pre-trained Machine Learning (ML) models, namely the Semantic Knowledge Extractor Tool (SKET). Combining rule-based techniques and pre-trained ML models yields highly accurate knowledge extraction. This work demonstrates the viability of unsupervised Natural Language Processing (NLP) techniques to extract critical information from cancer reports, opening opportunities such as data mining for knowledge extraction purposes, precision medicine applications, structured report creation, and multimodal learning. SKET is a practical, unsupervised approach to extracting knowledge from pathology reports, which opens up unprecedented opportunities to exploit textual and multimodal medical information in clinical practice. We also propose SKET eXplained (SKET X), a web-based system providing visual explanations about the algorithmic decisions taken by SKET. SKET X is designed to support pathologists and domain experts in understanding SKET predictions, possibly driving further improvements to the system.
The digitalization of clinical workflows and the increasing performance of deep learning algorithms are paving the way towards new methods for tackling cancer diagnosis. However, the availability of medical specialists to annotate digitized images and free-text diagnostic reports does not scale with the need for large datasets required to train robust computer-aided diagnosis methods that can target the high variability of clinical cases and data produced. This work proposes and evaluates an approach that eliminates the need for manual annotations to train computer-aided diagnosis tools in digital pathology. The approach includes two components: one to automatically extract semantically meaningful concepts from diagnostic reports, and one to use them as weak labels to train convolutional neural networks (CNNs) for histopathology diagnosis. The approach is trained (through 10-fold cross-validation) on 3,769 clinical images and reports, provided by two hospitals, and tested on over 11,000 images from private and publicly available datasets. The CNN, trained with automatically generated labels, is compared with the same architecture trained with manual labels. Results show that combining text analysis and end-to-end deep neural networks allows building computer-aided diagnosis tools that reach solid performance (micro-accuracy = 0.908 at image level) based only on existing clinical data, without the need for manual annotations.
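The first component can be pictured as a weak labeller mapping report text to concept labels that then serve as CNN training targets. The sketch below uses a toy keyword terminology; the mappings are illustrative assumptions, not the concept set or extraction method of the actual system.

```python
# Illustrative term-to-concept mappings; NOT the actual concept set
# or extraction method used by the system described in the paper.
TERMINOLOGY = {
    "adenocarcinoma": "cancer",
    "adenoma": "dysplasia",
    "hyperplastic": "benign",
}

def weak_labels(report_text):
    """Map a free-text diagnostic report to a sorted list of concept
    labels by matching terminology entries; these labels can then be
    used as weak supervision for training an image classifier."""
    text = report_text.lower()
    return sorted({label for term, label in TERMINOLOGY.items() if term in text})

labels = weak_labels("Colon biopsy: tubular adenoma with low-grade dysplasia.")
```

Because the labels come for free from reports already produced in clinical routine, no pathologist time is spent on annotation, which is the core saving the approach demonstrates.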
Traditional Information Retrieval (IR) models, also known as lexical models, are hindered by the semantic gap, which refers to the mismatch between different representations of the same underlying concept. To address this gap, semantic models have been developed. Semantic and lexical models exploit complementary signals that are best suited for different types of queries. For this reason, these model categories should not be used interchangeably, but should rather be properly alternated depending on the query. Therefore, it is important to identify queries where the semantic gap is prominent and thus semantic models prove effective. In this work, we quantify the impact of using semantic or lexical models on different queries, and we show that there is a strong interaction between queries and model categories. We then propose a labeling strategy to classify queries into semantically hard or easy, and we deploy a prototype classifier to discriminate between them.
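One plausible way to operationalize such a labelling strategy is to compare per-query effectiveness of the two model categories. The margin, the metric, and the mapping from the score delta to labels below are illustrative assumptions, not necessarily the strategy adopted in the paper.

```python
def label_queries(lexical_ap, semantic_ap, margin=0.05):
    """Label each query by comparing per-query effectiveness (e.g. AP)
    of a lexical and a semantic system; a margin avoids labelling ties."""
    labels = {}
    for qid in lexical_ap:
        delta = semantic_ap[qid] - lexical_ap[qid]
        if delta > margin:
            labels[qid] = "semantic"   # semantic gap prominent: semantic model helps
        elif delta < -margin:
            labels[qid] = "lexical"    # lexical signals suffice or semantic model misleads
        else:
            labels[qid] = "neutral"
    return labels

# Toy per-query Average Precision scores for the two model categories
lex = {"q1": 0.20, "q2": 0.55, "q3": 0.40}
sem = {"q1": 0.45, "q2": 0.30, "q3": 0.42}
labels = label_queries(lex, sem)
```

Labels derived this way can serve as training targets for a classifier that routes each incoming query to the better-suited model category.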
Databases are pivotal to advancing biomedical science. Nevertheless, most of them are populated and updated by human experts with a great deal of effort. Biomedical Relation Extraction (BioRE) aims to shift these expensive and time-consuming processes to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges. Despite this, few resources have been devoted to training -- and evaluating -- models for GDA extraction. Besides, such resources are limited in size, preventing models from scaling effectively to large amounts of data. To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction: TBGA. TBGA is generated from more than 700K publications and consists of over 200K instances and 100K gene-disease pairs. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging dataset for the task. The dataset and models are publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.
Medical free-text records store a lot of useful information that can be exploited in developing computer-supported medicine. Nevertheless, extracting terminological knowledge from unstructured text is difficult, because the volume of medical texts created every year keeps growing at a very fast pace and the task is highly dependent on the language under examination. In this work, we present an initial study of a Natural Language Processing pipeline to extract terminological information from pathology reports and link this information to medical ontologies.
Background: Databases are fundamental to advance biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE's most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Besides, these resources are all limited in size, preventing models from scaling effectively to large amounts of data. Results: To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the gene-disease association was extracted, the corresponding gene-disease association, and the information about the gene-disease pair. Conclusions: TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.
We present the methodology and the experimental setting of the participation of the IMS Unipd team in TREC Clinical Trials 2021. The objective of this work is to continue the longitudinal study of the evaluation of the query expansion, ranking fusion, and document filtering approach optimized in our previous TREC participations. In particular, we added to the procedure proposed in 2020 a comparison with a pipeline that uses large transformer models. The results obtained provide interesting insights in terms of per-topic effectiveness and will be used for further failure analyses.
In Information Retrieval (IR), the semantic gap represents the mismatch between users' queries and how retrieval models answer these queries. In this paper, we explore how to use external knowledge resources to enhance bag-of-words representations and reduce the effect of the semantic gap between queries and documents. In this regard, we propose several knowledge-based query expansion and reduction techniques and we evaluate them for the medical domain – where the semantic gap is prominent and the presence of manually curated knowledge resources allows for the development of knowledge-enhanced methods to address it. The proposed query reformulations are used to increase the probability of retrieving relevant documents through the addition to or the removal from the original query of highly specific terms. The experimental analyses on different test collections for Precision Medicine – a particular use case of Clinical Decision Support – show the effectiveness of the developed query reformulations. In particular, a specific subset of query reformulations allows IR models to achieve top performing results in all the considered collections.
This is a report on the second edition of the International Conference on Design of Experimental Search & Information REtrieval Systems (DESIRES 2021), held at the Department of Information Engineering of the University of Padua (Padua, Italy) from September 15 to September 18, 2021. Date: 15--18 September, 2021. Website: http://desires.dei.unipd.it
Whole slide images (WSIs) are high-resolution digitized images of tissue samples, stored at multiple magnification levels. WSI datasets often include only global annotations, available thanks to pathology reports. Global annotations refer to global findings in the high-resolution image and do not include information about the location of the regions of interest or the magnification levels used to identify a finding. This fact can limit the training of machine learning models, as WSIs are usually very large and each magnification level includes different information about the tissue. This paper presents a Multi-Scale Task Multiple Instance Learning (MuSTMIL) method that better exploits data paired with global labels by combining contextual and detailed information identified at several magnification levels. The method is based on a multiple instance learning framework and on a multi-task network that combines features from several magnification levels and produces multiple predictions (a global one and one for each magnification level involved). MuSTMIL is evaluated on colon cancer images, on binary and multilabel classification. MuSTMIL shows an improvement in performance in comparison to both single-scale and a previous multi-scale multiple instance learning algorithm, demonstrating that MuSTMIL can help to better deal with global labels targeting full and multi-scale images.
The semantic mismatch between query and document terms – i.e., the semantic gap – is a long-standing problem in Information Retrieval (IR). Two main linguistic features related to the semantic gap that can be exploited to improve retrieval are synonymy and polysemy. Recent works integrate knowledge from curated external resources into the learning process of neural language models to reduce the effect of the semantic gap. However, these knowledge-enhanced language models have been used in IR mostly for re-ranking. We propose the Semantic-Aware Neural Framework for IR (SAFIR), an unsupervised knowledge-enhanced neural framework explicitly tailored for IR. SAFIR jointly learns word, concept, and document representations from scratch. The learned representations encode both polysemy and synonymy to address the semantic gap. We investigate the application of SAFIR in the medical domain, where the semantic gap is prominent and there are many specialized and manually curated knowledge resources. The evaluation on shared test collections for medical retrieval shows the effectiveness of SAFIR to address the semantic gap.
In this thesis we tackle the semantic gap, a long-standing problem in Information Retrieval (IR). The semantic gap can be described as the mismatch between users' queries and the way retrieval models answer such queries. Two main lines of work have emerged over the years to bridge the semantic gap: (i) the use of external knowledge resources to enhance the bag-of-words representations used by lexical models, and (ii) the use of semantic models to perform matching between the latent representations of queries and documents. To deal with this issue, we first perform an in-depth evaluation of lexical and semantic models through different analyses [Marchesin et al., 2019]. The objective of this evaluation is to understand what features lexical and semantic models share, whether their signals are complementary, and how they can be combined to effectively address the semantic gap. In particular, the evaluation focuses on (semantic) neural models and their critical aspects. Each analysis brings a different perspective in the understanding of semantic models and their relation with lexical models. The outcomes of this evaluation highlight the differences between lexical and semantic signals, and the need to combine them at the early stages of the IR pipeline to effectively address the semantic gap. Then, we build on the insights of this evaluation to develop lexical and semantic models addressing the semantic gap. Specifically, we develop unsupervised models that integrate knowledge from external resources, and we evaluate them for the medical domain - a domain with a high social value, where the semantic gap is prominent, and the large presence of authoritative knowledge resources allows us to explore effective ways to address it. For lexical models, we investigate how - and to what extent - concepts and relations stored within knowledge resources can be integrated in query representations to improve the effectiveness of lexical models.
Thus, we propose and evaluate several knowledge-based query expansion and reduction techniques [Agosti et al., 2018, 2019; Di Nunzio et al., 2019]. These query reformulations are used to increase the probability of retrieving relevant documents by adding to or removing from the original query highly specific terms. The experimental analyses on different test collections for Precision Medicine - a particular use case of Clinical Decision Support (CDS) - show the effectiveness of the proposed query reformulations. In particular, a specific subset of query reformulations allows lexical models to achieve top performing results in all the considered collections. Regarding semantic models, we first analyze the limitations of the knowledge-enhanced neural models presented in the literature. Then, to overcome these limitations, we propose SAFIR [Agosti et al., 2020], an unsupervised knowledge-enhanced neural framework for IR. SAFIR integrates external knowledge in the learning process of neural IR models and does not require labeled data for training. Thus, the representations learned within this framework are optimized for IR and encode linguistic features that are relevant to address the semantic gap. The evaluation on different test collections for CDS demonstrates the effectiveness of SAFIR when used to perform retrieval over the entire document collection or to retrieve documents for Pseudo Relevance Feedback (PRF) methods - that is, when it is used at the early stages of the IR pipeline. In particular, the quantitative and qualitative analyses highlight the ability of SAFIR to retrieve relevant documents affected by the semantic gap, as well as the effectiveness of combining lexical and semantic models at the early stages of the IR pipeline - where the complementary signals they provide can be used to obtain better answers to semantically hard queries.
In this thesis we tackle the semantic gap, a long-standing problem in Information Retrieval (IR). The semantic gap can be described as the mismatch between users' queries and the way retrieval models answer such queries. Two main lines of work have emerged over the years to bridge the semantic gap: (i) the use of external knowledge resources to enhance the bag-of-words representations used by lexical models, and (ii) the use of semantic models to perform matching between the latent representations of queries and documents. To deal with this issue, we first perform an in-depth evaluation of lexical and semantic models through different analyses. The objective of this evaluation is to understand what features lexical and semantic models share, whether their signals are complementary, and how they can be combined to effectively address the semantic gap. In particular, the evaluation focuses on (semantic) neural models and their critical aspects. Then, we build on the insights of this evaluation to develop lexical and semantic models addressing the semantic gap. Specifically, we develop unsupervised models that integrate knowledge from external resources, and we evaluate them for the medical domain – a domain with a high social value, where the semantic gap is prominent, and the large presence of authoritative knowledge resources allows us to explore effective ways to leverage external knowledge to address the semantic gap. For lexical models, we propose and evaluate several knowledge-based query expansion and reduction techniques. These query reformulations are used to increase the probability of retrieving relevant documents by adding to or removing from the original query highly specific terms. Regarding semantic models, we first analyze the limitations of the knowledge-enhanced neural models presented in the literature. Then, to overcome these limitations, we propose SAFIR, an unsupervised knowledge-enhanced neural framework for IR.
The representations learned within this framework are optimized for IR and encode linguistic features that are relevant to address the semantic gap.
The semantic mismatch between query and document terms – i.e., the semantic gap – is a long-standing problem in Information Retrieval (IR). Two main linguistic features related to the semantic gap that can be exploited to improve retrieval are synonymy and polysemy. Recent works integrate knowledge from curated external resources into the learning process of neural language models to reduce the effect of the semantic gap. However, these knowledge-enhanced language models have been used in IR mostly for re-ranking and not directly for document retrieval. We propose the Semantic-Aware Neural Framework for IR (SAFIR), an unsupervised knowledge-enhanced neural framework explicitly tailored for IR. SAFIR jointly learns word, concept, and document representations from scratch. The learned representations encode both polysemy and synonymy to address the semantic gap. SAFIR can be employed in any domain where external knowledge resources are available. We investigate its application in the medical domain where the semantic gap is prominent and there are many specialized and manually curated knowledge resources. The evaluation on shared test collections for medical literature retrieval shows the effectiveness of SAFIR in terms of retrieving and ranking relevant documents most affected by the semantic gap.
This paper analyzes two state-of-the-art Neural Information Retrieval (NeuIR) models: the Deep Relevance Matching Model (DRMM) and the Neural Vector Space Model (NVSM). Our contributions include: (i) a reproducibility study of two state-of-the-art supervised and unsupervised NeuIR models, where we present the issues we encountered during their reproduction; (ii) a performance comparison with other lexical, semantic, and state-of-the-art models, showing that traditional lexical models are still highly competitive with DRMM and NVSM; (iii) an application of DRMM and NVSM to collections from heterogeneous search domains and in different languages, which helped us analyze the cases where DRMM and NVSM can be recommended; (iv) an evaluation of the impact of varying word embedding models on DRMM, showing how relevance-based representations generally outperform semantic-based ones; (v) a topic-by-topic evaluation of the selected NeuIR approaches, comparing their performance to the well-known BM25 lexical model, where we perform an in-depth analysis of the different cases where DRMM and NVSM outperform the BM25 model or fail to do so. We run an extensive experimental evaluation to check whether the improvements, if any, of NeuIR models over the selected baselines are statistically significant.
In this report, we describe the methodology and the experimental setting of our participation as the IMS Unipd team in TREC PM 2020. The objective of this work is to evaluate a query expansion and ranking fusion approach optimized on the previous years of TREC PM. In particular, we designed a procedure to (1) perform query expansion using a pseudo relevance feedback model on the first k retrieved documents, and (2) apply rank fusion techniques to the rankings produced by the different experimental settings. The results obtained provide interesting insights in terms of the different per-topic effectiveness and will be used for further failure analyses.
In this paper, we describe the results of the participation of the Information Management Systems (IMS) group at CLEF eHealth 2020 Task 2, Consumer Health Search Task. In particular, we participated in both subtasks: Ad-hoc IR and Spoken queries retrieval. The goal of our work was to evaluate the reciprocal rank fusion approach over (1) different query variants; (2) different retrieval functions; (3) with and without pseudo-relevance feedback. The results show that, on average, the best performances are obtained by a rank fusion approach together with pseudo-relevance feedback.
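Reciprocal rank fusion, the combination strategy evaluated above, can be sketched in a few lines. This follows the standard RRF formula (score(d) = Σ 1/(k + rank(d)) across input rankings); the constant k=60 and the toy rankings are illustrative, not the paper's exact configuration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each document accumulates 1/(k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from three query variants / retrieval functions.
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
])
print(fused)  # "d2" ranks first: it tops two of the three input lists
```

Because RRF only uses ranks, not scores, it fuses rankings from heterogeneous retrieval functions without score normalization.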
The Precision Medicine (PM) track of the Text REtrieval Conference (TREC) focuses on providing useful precision medicine information to clinicians treating cancer patients. The PM track gives the unique opportunity to evaluate medical IR systems on two different collections: scientific literature and clinical trials. In this paper, we evaluate several state-of-the-art query expansion and reduction methods to see whether a particular approach can be helpful in clinical trials retrieval. We present those approaches that are consistently effective in all three TREC PM editions and we compare them to the results obtained by the research groups who participated in all three editions.
In this work we describe how Docker images can be used to enhance the reproducibility of Neural IR models. We report our results reproducing the Neural Vector Space Model (NVSM) and we release a CPU-based and a GPU-based Docker image. Finally, we present some insights about reproducing Neural IR models.
We report on our participation as the IMS Unipd team in both TREC PM 2019 tasks. The objective of the work is twofold: (i) we want to evaluate how different query reformulations affect the results and whether the findings obtained in previous years remain valid; (ii) we want to verify whether combining different query reformulations based on expansion and reduction techniques proves effective in such a highly specific scenario. In particular, we designed a procedure to (1) filter out clinical trials based on demographic data, (2) perform query reformulations – both expansion and reduction techniques – based on knowledge bases to increase the probability of finding relevant documents, (3) apply rank fusion techniques to the rankings produced by the different query reformulations. We consider those query reformulations that have been validated on previous TREC PM experimental collections. These queries represent the most effective reformulations for our system on those topics/collections. The results obtained – especially in the clinical trials task – validate our assumptions and provide interesting insights in terms of the different per-topic effectiveness of the query reformulations.
The study presents a methodology that contributes to reduce the semantic gap in clinical decision support systems. The methodology integrates semantic information – provided by external knowledge resources – into unsupervised neural Information Retrieval (IR) models. The objective is to design and develop innovative methods that can be effective in real-case medical scenarios.
In this work, we propose a Docker image architecture for the replicability of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) [22], an unsupervised NeuIR model. The first image we share (nvsm_cpu) can run on most machines and relies only on the CPU to perform the required computations. The second image we share (nvsm_gpu) relies instead on the Graphics Processing Unit (GPU) of the host machine, when available, to perform computationally intensive tasks, such as the training of the NVSM model. Furthermore, we discuss some insights on the engineering challenges we encountered to obtain deterministic and consistent results from NeuIR models, relying on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the usage within Docker of TensorFlow and CUDA libraries, whose inherent randomness alters, under certain circumstances, the relative order of documents in rankings.
The Precision Medicine (PM) track at the Text REtrieval Conference (TREC) focuses on providing useful precision medicine-related information to clinicians treating cancer patients. The PM track gives the unique opportunity to evaluate medical IR systems using the same set of topics on two different collections: scientific literature and clinical trials. In this paper, we take advantage of this opportunity and we propose and evaluate state-of-the-art query expansion and reduction techniques to identify whether a particular approach can be helpful in both scientific literature and clinical trial retrieval. We present those approaches that are consistently effective in both TREC editions and we compare the results obtained with the best performing runs submitted to TREC PM 2017 and 2018.
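The knowledge-based query expansion and reduction techniques recurring in the abstracts above can be illustrated with a minimal sketch. The synonym table and stop-term set below are hypothetical stand-ins for the curated knowledge resources (e.g. neoplasm and gene term variants) that the actual systems consult.

```python
# Hypothetical knowledge resource mapping terms to known variants,
# standing in for the curated vocabularies used in the actual systems.
SYNONYMS = {
    "melanoma": ["malignant melanoma", "skin cancer"],
    "braf": ["b-raf", "proto-oncogene b-raf"],
}

# Hypothetical set of overly generic terms that add little retrieval signal.
GENERIC_TERMS = {"patient", "treatment"}

def expand_query(query_terms):
    """Query expansion: append known variants of each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(SYNONYMS.get(term.lower(), []))
    return expanded

def reduce_query(query_terms):
    """Query reduction: drop generic terms, keeping highly specific ones."""
    return [t for t in query_terms if t.lower() not in GENERIC_TERMS]

print(expand_query(["melanoma", "BRAF"]))
print(reduce_query(["melanoma", "patient", "BRAF"]))
```

Expansion raises recall by matching documents that use term variants; reduction raises precision by removing terms that match too broadly.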
The semantic gap between queries and documents is a longstanding problem in Information Retrieval (IR), and it poses a critical challenge for medical IR due to the large presence in the medical language of synonymous and polysemous words, along with context-specific expressions. Two main lines of work have emerged in the past years to tackle this issue: (i) the use of external knowledge resources to enhance query and document bag-of-words representations; and (ii) the use of semantic models, based on the distributional hypothesis, which perform matching on latent representations of documents and queries. The presented research investigates the use of external knowledge resources in both lines – with a focus on knowledge-enhanced unsupervised neural latent representations and their analysis in terms of effectiveness and semantic representativeness.
We investigate how semantic relations between concepts extracted from medical documents, and linked to a reference knowledge base, can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of the retrieval. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.
We report on the participation of the Information Management System (IMS) Research Group of the University of Padua in the second task of the Precision Medicine Track at TREC 2018: the Clinical Trials task. We designed a procedure to: (i) expand query terms iteratively, based on knowledge bases, to increase the probability of finding relevant trials by adding neoplasm, gene, and protein term variants to the initial query; (ii) filter out trials based on demographic data. We submitted three runs: a plain BM25 using the provided textual fields
In this paper, we present our participation in one of the tasks of the CENTRE@TREC 2018 Track: the Clinical Decision Support task. We describe the steps of the original paper we wanted to reproduce, identifying the elements of ambiguity that may affect the reproducibility of the results. The experimental results we obtained follow a similar trend to those presented in the original paper: using clinical trials' "note" field decreases the retrieval performances significantly, while the pseudo-relevance feedback approach together with query expansion achieves the best results across different measures. In the experimental results we find that the choice of the stoplist is fundamental to achieve a reasonable level of reproducibility. However, stoplist creation is not described sufficiently well in the original paper.
In this paper, we investigate how semantic relations between concepts extracted from medical documents can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of retrieval functionalities of clinical decision support systems. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.
We propose an IR framework to combine the implicit representations – identified using distributional representation techniques – and the explicit representations – derived from external knowledge sources – of documents to improve medical case-based retrieval. Combining implicit-explicit representations of documents aims at enriching the semantic understanding of documents and reducing the semantic gap between documents and queries.
We propose a research that aims at improving the effectiveness of case-based retrieval systems through the use of automatically created document-level semantic networks. The proposed research leverages the recent advancements in information extraction and relational learning to revisit and advance the core ideas of concept-centered hypertext models. The automatic extraction of semantic relations from documents---and their centrality in the creation and exploitation of the documents' semantic networks---represents our attempt to go one step further than previous approaches.
The goal of case-based retrieval is to assist physicians in the clinical decision making process, by finding relevant medical literature in large archives. We propose a research that aims at improving the effectiveness of case-based retrieval systems through the use of automatically created document-level semantic networks. The proposed research tackles different aspects of information systems and leverages the recent advancements in information extraction and relational learning to revisit and advance the core ideas of concept-centered hypertext models. We propose a two-step methodology that in the first step addresses the automatic creation of document-level semantic networks, and in the second step designs methods that exploit such document representations to retrieve relevant cases from medical literature. For the automatic creation of documents' semantic networks, we design a combination of information extraction techniques and relational learning models. Mining concepts and relations from text, information extraction techniques represent the core of the document-level semantic networks' building process. On the other hand, relational learning models have the task of enriching the graph with additional connections that have not been detected by information extraction algorithms and strengthening the confidence score of extracted relations. For the retrieval of relevant medical literature, we investigate methods that are capable of comparing the documents' semantic networks in terms of structure and semantics. The automatic extraction of semantic relations from documents, and their centrality in the creation of the documents' semantic networks, represent our attempt to go one step further than previous graph-based approaches.
For the 30th anniversary of the Information Management Systems (IMS) research group of the University of Padua, we report the main and more recent contributions of the group that focus on users in the field of Digital Libraries (DL). In particular, we describe a dynamic and adaptive environment for user engagement with cultural heritage collections, the role of log analysis for studying the interaction between users and DL, and how to model user behaviour.
We investigate the problem of the reproducibility of keyword-based access systems to relational data. These systems address a challenging and important issue, i.e. letting users access, in natural language, databases whose schema and instance are possibly unknown. However, there are neither shared implementations of state-of-the-art algorithms nor easily replicable experimental results. We explore the difficulties in reproducing such systems and experimental results by implementing from scratch several state-of-the-art algorithms and testing them on shared datasets.
Keyword-based access systems to relational data address a challenging and important issue, i.e. letting users exploit natural language to access databases whose schema and instance are possibly unknown. Unfortunately, there are almost no shared implementations of such systems and this hampers the reproducibility of experimental results. We explore the difficulties in reproducing such systems and share implementations of several state-of-the-art algorithms.
This paper discusses an adaptive cross-site user modelling platform for cultural heritage websites. The objective is to present the overall design of this platform, which allows for information exchange techniques that can subsequently be used by websites to provide tailored personalisation to users that request it. The information exchange is obtained by implementing a third-party user model provider that, through the use of an API, interfaces with custom-built module extensions of websites based on the Web-based Content Management System (WCMS) Drupal. The approach is non-intrusive, not hindering the browsing experience of the user, and has a limited impact on the core aspects of the websites that integrate it. The design of the API ensures users' privacy by not disclosing personal browsing information to non-authenticated users. The user can enable/disable the cross-site service at any time.