Privacy is critical when dealing with user-generated text, as is common in Natural Language Processing (NLP) and Information Retrieval (IR) tasks. Documents, queries, posts, and reviews risk inadvertently disclosing sensitive information. Such exposure of private data is a significant threat to user privacy, as it may reveal information that users prefer to keep confidential. The leading framework for protecting user privacy when handling textual information is ε-Differential Privacy (DP). However, the research community lacks a unified framework for comparing different DP mechanisms. This study introduces pyPANTERA, an open-source Python package developed for text obfuscation. The package incorporates state-of-the-art DP mechanisms within a unified framework for obfuscating data. pyPANTERA is designed not only as a modular and extensible library for enriching DP techniques, thereby enabling the integration of new DP mechanisms in future research, but also as a means of reproducibly comparing the current state-of-the-art mechanisms. Through extensive evaluation, we demonstrate the effectiveness of pyPANTERA, making it an essential resource for privacy researchers and practitioners. The source code of the library and of the experiments is available at: https://github.com/Kekkodf/pypantera
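To make the kind of mechanism the package unifies concrete, the sketch below shows a word-level metric-DP obfuscation step in the style of embedding-perturbation mechanisms from the DP text-obfuscation literature: noise calibrated to the privacy budget ε is added to a word embedding, and the word is replaced by the vocabulary entry closest to the noisy vector. This is only an illustrative sketch, not pyPANTERA's actual API; the toy vocabulary and two-dimensional embeddings are assumptions made for brevity.

```python
# Illustrative word-level metric-DP obfuscation (NOT pyPANTERA's API):
# perturb a word embedding with epsilon-calibrated noise, then map back
# to the nearest vocabulary word.
import numpy as np

# Toy embedding table; a real mechanism would use pretrained embeddings.
vocab = {
    "doctor":  np.array([0.9, 0.1]),
    "nurse":   np.array([0.8, 0.2]),
    "teacher": np.array([0.1, 0.9]),
    "lawyer":  np.array([0.2, 0.8]),
}

def sample_noise(dim: int, epsilon: float, rng: np.random.Generator) -> np.ndarray:
    """Sample noise with density proportional to exp(-epsilon * ||z||):
    uniform direction, Gamma(dim, 1/epsilon)-distributed magnitude."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return magnitude * direction

def obfuscate_word(word: str, epsilon: float, rng: np.random.Generator) -> str:
    """Replace `word` with the vocabulary word nearest to its noisy embedding."""
    noisy = vocab[word] + sample_noise(len(vocab[word]), epsilon, rng)
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - noisy))

rng = np.random.default_rng(42)
print([obfuscate_word("doctor", epsilon=1.0, rng=rng) for _ in range(5)])
```

Lower values of ε produce larger perturbations and hence more frequent replacements with semantically distant words, which is the privacy-utility trade-off the package is built to compare across mechanisms.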
Evaluating the privacy provided by obfuscation mechanisms remains an open problem in the research community. Especially for textual data, in Natural Language Processing (NLP) and Information Retrieval (IR) tasks, privacy guarantees are measured by analysing the hyper-parameters of a mechanism, e.g., the privacy budget ε in Differential Privacy (DP), and their impact on performance. However, considering only the privacy parameters is not enough to understand the actual level of privacy achieved by a mechanism from a real user's perspective. We analyse the requirements and features needed to evaluate the privacy of obfuscated texts beyond the formal guarantees provided by the analysis of the mechanisms' parameters, and we suggest some research directions for devising new evaluation measures for this purpose.
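For reference, the formal guarantee that the privacy budget quantifies is the standard ε-DP inequality: a randomised mechanism M satisfies ε-DP if, for every pair of neighbouring inputs x and x' and every set of outputs S,

\[
\Pr[\mathcal{M}(x) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(x') \in S].
\]

Smaller values of ε correspond to stronger formal protection, yet, as argued above, this inequality alone does not describe what an obfuscated text actually reveals to a reader.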
The deterioration of the performance of Information Retrieval Systems (IRSs) over time remains an open issue in the Information Retrieval (IR) community. With this study for Task 1 of the Longitudinal Evaluation of Model Performance Lab (LongEval) at the Conference and Labs of the Evaluation Forum (CLEF) 2024, we propose and analyze the performance of an IRS that is able to handle changes in the data over time. In addition, the model uses different Large Language Models (LLMs) to enhance the effectiveness of the retrieval process by rephrasing the queries used for the search and by reranking the retrieved documents. Through an in-depth analysis of the performance of the MOUSE group Retrieval System on the datasets provided by the CLEF organisers, the proposed model reached a Mean Average Precision (MAP) of 0.22 and a Normalized Discounted Cumulated Gain (nDCG) of 0.40 on the English collection, with performance on the original French collection rising to 0.31 and 0.50 for MAP and nDCG, respectively.
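As a pointer to how one of the reported measures is defined, the sketch below computes nDCG for a single ranked list using the common logarithmic discount. It is purely illustrative; the scores reported above come from the official evaluation setup over the full collections, not from this snippet.

```python
# Minimal sketch of nDCG for one query's ranking (log2 discount, graded relevance).
import math

def dcg(relevances):
    """Discounted cumulated gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, all_judged_relevances):
    """Normalise DCG by the DCG of the ideal ordering of all judged grades."""
    ideal = dcg(sorted(all_judged_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Grades of the returned documents in ranked order vs. all judged grades for the query.
print(round(ndcg([1, 0, 2, 0], [2, 1, 1, 0, 0]), 3))
```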
Survival Analyses (SAs), key statistical tools used to predict event occurrence over time, often involve sensitive information, necessitating robust privacy safeguards. This work demonstrates how the Revised Randomized Response (RRR) can be adapted to ensure Differential Privacy (DP) while performing SAs. The methodology seeks to safeguard the privacy of individuals' data without significantly changing the utility, represented by the statistical properties of the computed survival rates. Our findings show that integrating DP through RRR into SAs is both practical and effective, providing a significant step forward in the privacy-preserving analysis of sensitive time-to-event data. This study contributes to the field by offering a new method that can be compared with the current state of the art used for SAs in medical research.
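The sketch below illustrates the classic randomized-response idea that RRR builds on; it is not the paper's Revised Randomized Response, and applying it to 0/1 event indicators (event observed vs. censored) is an assumption made for illustration. Reporting each indicator truthfully with probability p = e^ε / (1 + e^ε) satisfies ε-DP for a single binary attribute, and the true event rate can be debiased from the perturbed reports.

```python
# Classic randomized response on binary event indicators (illustrative only,
# not the paper's Revised Randomized Response).
import math
import random

def randomized_response(event: int, epsilon: float) -> int:
    """Report a 0/1 indicator truthfully with probability e^eps / (1 + e^eps)."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return event if random.random() < p_truth else 1 - event

def debias_event_rate(perturbed, epsilon) -> float:
    """Unbiased estimate of the true event rate from perturbed indicators."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(perturbed) / len(perturbed)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
true_events = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
noisy = [randomized_response(e, epsilon=1.0) for e in true_events]
print(noisy, round(debias_event_rate(noisy, 1.0), 3))
```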
The necessity of storing and manipulating electronic health data (eHealth data) in the Oncology research field has introduced two main types of challenges: research centers need, firstly, to manage eHealth data with an appropriate and secure Data Management Infrastructure and, secondly, to preserve the privacy of patients' data. This work presents a feasibility and privacy study of a Data Management Infrastructure in the Oncology research domain. The project studies potential strengths and weaknesses in developing a Digital Clinical Data Repository (DCDR) in a practical case study at the “Centro di Riferimento Oncologico” in Aviano, Italy. The study considers HL7 FHIR, an international standard for healthcare data retrieval and exchange, within two possible scenarios: a monolithic application and a fragmented one. Potential privacy-related aspects are examined within the General Data Protection Regulation (GDPR) framework and studied through a utility evaluation after applying Differentially Private mechanisms.
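As a minimal, assumed example of what such a utility evaluation can compare against exact results (the specific mechanisms and queries used in the study are not restated here), the sketch below applies the Laplace mechanism to a counting query; counts have sensitivity 1, so Laplace noise of scale 1/ε suffices for ε-DP.

```python
# Differentially private count via the Laplace mechanism (illustrative example,
# not the exact mechanisms or queries evaluated in the study).
import numpy as np

def dp_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    """Counting queries have sensitivity 1, so Laplace(1/epsilon) noise gives eps-DP."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(7)
exact = 128  # hypothetical number of patients matching a cohort query
print([round(dp_count(exact, epsilon=0.5, rng=rng), 1) for _ in range(3)])
```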
In February 2022, Russia launched a full-scale invasion of Ukraine. This event had global repercussions, especially on the political decisions of European countries. As expected, the role of Italy in the conflict became a major campaign issue for the Italian General Election held on 25 September 2022. Politicians frequently use Twitter to communicate during political campaigns, but bots often interfere and attempt to manipulate elections. Hence, it is essential to understand whether bots influenced public opinion regarding the conflict and, therefore, the elections. In this work, we investigate how Italian politics responded to the Russo-Ukrainian conflict on Twitter and whether bots manipulated public opinion before the 2022 general election. We first analyze 39,611 tweets from six major Italian political parties to understand how they discussed the war during the period February-December 2022. Then, we focus on the 360,823 comments under the posts from the last month before the elections, discovering that around 12% of the commenters are bots. By examining their activities, it becomes clear that they both distorted how war topics were treated and influenced real users during the last month before the elections.