Research interests - Stefano Marchesin

Research Summary

My research interests concern Information Extraction, Information Retrieval, and Data Quality.

My research focuses on bridging the semantic gap in Information Retrieval (IR), particularly in the medical domain, by combining lexical and semantic signals for more effective query-document matching. Key areas include self-supervised, multi-task learning to combine implicit and explicit representations for IR, as well as knowledge-based, weighted query reformulations.
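One simple way to picture "combining lexical and semantic signals" is a linear interpolation of normalized scores from two rankers. The sketch below is a generic illustration only, not the self-supervised or knowledge-based models mentioned above; the function names, the `alpha` weight, and the example scores are all hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(lexical, semantic, alpha=0.5):
    """Interpolate normalized lexical and semantic scores, best document first."""
    lex, sem = min_max(lexical), min_max(semantic)
    docs = set(lex) | set(sem)
    fused = {d: alpha * lex.get(d, 0.0) + (1 - alpha) * sem.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores for three documents (e.g. BM25-like vs. embedding similarity).
lexical = {"d1": 12.0, "d2": 8.0, "d3": 2.0}
semantic = {"d1": 0.2, "d2": 0.9, "d3": 0.4}
ranking = fuse(lexical, semantic, alpha=0.5)
print(ranking)
```

With these made-up scores, a document that is strong on only one signal can be overtaken by one that is reasonably strong on both.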

In Information Extraction, my research focuses on weakly/distantly supervised Entity Linking (EL) and Relation Extraction (RE) methods that construct Knowledge Graphs (KGs) from limited manual annotations and that also support image classification and manual annotation tools.

As for Data Quality, my research focuses on developing efficient approaches to evaluate the quality of large-scale KGs, leveraging sampling and estimation techniques that provide strong statistical guarantees. Want to know more? Check out my seminar "Efficient Evaluation of Knowledge Graph Quality: Challenges and Opportunities", part of the Tales on Data Science and Big Data series organized by the CINI Lab on Data Science.
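As a rough illustration of the sampling-and-estimation idea, the sketch below estimates KG accuracy from a simple random sample of manually judged triples and reports a Wilson score interval. This is a generic textbook technique shown under stated assumptions, not the specific methods developed in the work referenced on this page; the function name `wilson_interval` and the example counts are hypothetical.

```python
import math

def wilson_interval(num_correct, sample_size, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p_hat = num_correct / sample_size
    denom = 1 + z**2 / sample_size
    center = (p_hat + z**2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / sample_size + z**2 / (4 * sample_size**2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical example: 180 of 200 randomly sampled triples were judged correct.
low, high = wilson_interval(180, 200)
print(f"estimated accuracy: {180 / 200:.3f}, interval: [{low:.3f}, {high:.3f}]")
```

The point of the sketch is that annotating a small random sample, rather than the whole KG, already yields an accuracy estimate with a quantified margin of error.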

My research has been supported by the HEREDITARY Horizon Europe project and by the EXAMODE and BRAINTEASER H2020 EU projects. More information below.

Interests

Active Learning
Data Quality
Knowledge Base Construction
Information Extraction
Information Retrieval

Active Projects

HEREDITARY

2024 - 2027

Hereditary: HetERogeneous sEmantic Data integratIon for the guT-bRain interplaY

HEREDITARY aims to significantly transform the way we approach disease detection, prepare treatment response, and explore medical knowledge by building a robust, interoperable, trustworthy and secure framework that integrates multimodal health data (including genetic data) while ensuring compliance with cross-national privacy-preserving policies. The HEREDITARY framework comprises five interconnected layers, from federated data processing and semantic data integration to visual interaction.
By utilizing advanced federated analytics and learning workflows, we aim to identify new risk factors and treatment responses focusing, as exploratory use cases, on neurodegenerative and gut microbiome related disorders. HEREDITARY is harmonizing and linking various sources of clinical, genomic, and environmental data on a large scale. This enables clinicians, researchers, and policymakers to understand these diseases better and develop more effective treatment strategies. HEREDITARY adheres to the citizen science paradigm to ensure that patients and the public have a primary role in guiding scientific and medical research while maintaining full control of their data. Our goal is to change the way we approach healthcare by unlocking insights that were previously impossible to obtain.

Role: UNIPD responsible for the “Communication and Dissemination” committee; contact person for the task 4.6 “Evidence-based knowledge graph creation and exploration”.

Project No: 101137074
Call: HORIZON-HLTH-2023-TOOL-05
Topic: Tools and technologies for a healthy society
Funding (UNIPD): 1.138.046€

Past Projects

BRAINTEASER

2021 - 2024

Brainteaser: BRinging Artificial INTelligencE home for a better cAre of amyotrophic lateral sclerosis and multiple SclERosis

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases characterized by progressive or alternate impairment of neurological functions (motor, sensory, visual, cognitive). Artificial Intelligence is key to addressing the resulting needs to: i) better describe disease mechanisms; ii) stratify patients according to their phenotype, assessed throughout the evolution of the disease; iii) predict disease progression in a probabilistic, time-dependent fashion; iv) investigate the role of the environment; v) suggest interventions that can delay the progression of the disease. BRAINTEASER will integrate large clinical datasets with novel personal and environmental data collected using low-cost sensors and apps.

We are leader of the "Open Science and FAIR Data" WP. The main goals of the WP are:
- Design of open ontologies to represent the data of the project and create knowledge bases to enrich and augment the value of the data.
- Design and implementation of methods for evaluating the FAIRification of the data and metadata produced, by applying and reviewing the FAIR principles of the European Open Science Cloud (EOSC). Integration and sharing of research data with EOSC services.
- Design and implementation of the methods to expose the data as Linked Open Data and the services to favour their exploration and re-use.
- Organisation of three annual open evaluation challenges and sharing of the produced experimental data as open data.

Role: Participant.

Project No: 101017598
Call: H2020-SC1-DTH-2020-1
Topic: Personalised early risk prediction, prevention and intervention based on Artificial Intelligence and Big Data technologies
Funding (UNIPD): 732.250€

EXAMODE

2019 - 2022

ExaMode: Extreme-scale Analytics via Multimodal Ontology Discovery & Enhancement

Exascale volumes of diverse data from distributed sources are continuously produced. Healthcare data stand out in the size produced (production is expected to be over 2000 exabytes in 2020), heterogeneity (many media, acquisition methods), included knowledge (e.g. diagnosis) and commercial value. The supervised nature of deep learning models requires large amounts of labeled, annotated data, which prevents models from extracting knowledge and value at this scale. ExaMode addresses this by enabling easy and fast, weakly supervised knowledge discovery from exascale heterogeneous data, limiting human interaction.

We are leader of the "Semantic knowledge discovery and visualisation" WP. The main goals of the WP are:
- Develop relation extraction methods to automatically extract semantic relationships between authoritative concepts within un/semi-structured text.
- Leverage entity linking methods in conjunction with developed relation extraction techniques to create report-level semantic networks out of extracted concepts and relationships.
- Model report-level semantic networks through conceptual descriptive frameworks to empower data management and exploitation.
- Develop information retrieval methods to semantically connect and discover semantic networks associated with relevant medical reports.
- Develop information visualization and visual analytics methods for interacting with deep learning algorithms and improving their understandability.

Role: Task leader for the task 2.1 “Semantic knowledge extractor prototype”; task leader for the task 2.3 “Automatic knowledge discovery system prototype and user study outcome”.

Project No: 825292
Call: H2020-ICT-2018-2
Topic: Big Data technologies and extreme-scale analytics
Funding (UNIPD): 516.000€

Data and Software

Data

The manual and automatic annotations used to estimate the accuracy of DBpedia can be found here. [paper]

The SPARQL endpoint to access the CORE KB is available here. [paper]

The gene expression-cancer KB generated by the Collaborative Oriented Relation Extraction (CORE) system can be found here. [paper]

The TBGA dataset for gene-disease association extraction can be found here. [paper]

The runs, pools, plots, and analyses to reproduce the Semantic-Aware neural Framework for IR (SAFIR) results are available here. [paper]

The runs used to perform experiments on Precision Medicine (PM) query reformulations can be found here. [paper]

Software

The methods using Bayesian credible intervals to estimate KG accuracy can be found here. [preprint]

The methods used to estimate KG veracity for entity-oriented search can be found here. [paper]

The source code used to aggregate labels and estimate the accuracy of DBpedia is available here. [paper]

The methods to estimate KG accuracy in an efficient and reliable manner are available here. [paper]

The CoreKB platform for searching reliable facts over gene expression-cancer associations is available here. [paper]

The source code and info about the Collaborative Oriented Relation Extraction (CORE) system are available here. [paper]

The source code and info about the Semantic Knowledge Extractor Tool (SKET) are available here. [paper]

The source code and info about Biomedical Relation Extraction (BioRE) methods are available here. [paper]

The source code and info about the Semantic-Aware Neural Framework for IR (SAFIR) can be found here. [paper]

Stefano Marchesin

Intelligent Interactive Information Access Hub

Department of Information Engineering

University of Padua