This book is devoted to the notion of digital terminology, understood as an approach to the discipline that involves the digital representation of the conceptual and linguistic information of a specific domain. Its aim is to illustrate the steps involved in designing and implementing multilingual terminological databases that respect best practices in the management of digital terminological data. To this end, the book highlights the new skills required of the terminologist in the digital age. These skills find their true essence in the interdisciplinary and collaborative spirit of research.
This paper aims to highlight the potential of a terminological analysis approach to literary texts. We present a work of retro-digitization and terminological study of a corpus of letters exchanged between Proust and his mother. The identified medical terminology constitutes the object of investigation to illustrate how specialized lexical units emigrate from intimate writings to join the experimental laboratory of the novel. The illustrated case study will focus on the term “trional” and its evolution from the Correspondance to the Recherche.
This article presents a multi-level contrastive analysis of the technologies underlying TBX and Ontolex-lemon for modelling multilingual terminological data within terminological resources.
Medical translation, like all specialized translation processes, requires a systematic study of the terminology used to convey technical and scientific messages. This article describes a new model of terminological record specifically designed and formulated for the implementation of the multilingual terminological resource TriMED for the medical domain. The proposed record aims at exhaustiveness in order to provide a complete picture of the morphosyntactic, semantic and phraseological behaviour of the source term and of its translation equivalent.
The optimal organization of terminological (meta) data is an indispensable practice in the design and implementation of language resources. In this paper, we describe a methodology for the structural standardization of terminological resources based on the application of de jure standards developed by the ISO TC 37/SC 3 in order to ensure the FAIRness of terminological data. In this regard, we describe a project, recently launched by the University of Padua, which adopts the proposed paradigm in order to create the CAMEO multilingual terminological database for the commercial domain. This resource aims to be a valid standardized linguistic support for two categories of text professionals (technical communicators and specialized translators) dealing with monolingual and multilingual commercial product documentation.
In this paper, we propose the description of a very recent interdisciplinary project aiming at analysing both the conceptual and linguistic dimensions of human rights terminology. This analysis will result in the form of a new knowledge-based multilingual terminological resource which is designed in order to meet the FAIR principles for Open Science and will serve, in the future, as a prototype for the development of a new software for the simplified rewriting of international legal texts relating to human rights, in order to facilitate their comprehension for non-expert people. Given the early stage of the project, we will focus on the description of its rationale, the planned workflow, and the theoretical approach which will be adopted to achieve the main goal of this ambitious research project.
Semic analysis is a linguistic technique aimed at capturing the essential specificities of a term's meaning through the identification of minimum semantic units. This procedure supports an in-depth comprehension of technical terminology and the acquisition of specialised conceptual knowledge. In this paper, we focus on semic analysis applied to medical terminology. In particular, we discuss some preliminary considerations in order to establish the starting points for a systematic approach to semic analysis. Firstly, we propose a preliminary experiment to 1) study users’ perception of semic analysis and 2) validate the absence of systematicity in its performance. Secondly, based on the resulting data, we propose a methodology aiming at increasing the systematic factorisation of semic analysis. Finally, we propose an experimental study to investigate the potential interrelation, in terms of applicability and productivity, of Word Embeddings with respect to semic analysis in the framework of the proposed methodological criteria.
This study aims at describing a new linguistic product developed in order to support pedagogical practice and professionalization in specialized translation. The resource, named FAIRterm, is configured as a collection of multilingual terminological records to assist the process of decoding and transcoding the terminology of a given domain. The tool is designed in compliance with the ISO standards in force in terms of terminology management. For its validation, we propose the description of a pedagogical experiment conducted for the Italian-French working languages in the perspective of professionalization of translation learners in the oenological domain.
In this paper, we want to speculate about the possibility of modelling all the currently known or proposed approaches to terminology in a single schema. We will use the Entity-Relationship (ER) diagram as our tool for the conceptual data model of the problem and to express the associations between the objects of the study. We will analyse the onomasiological and semasiological approaches, the ontoterminology paradigm, and the frame-based model, and we will draw the consequences in terms of the conceptual data model. The result of this discussion will be used as the basis of the next step of the data organization in terms of standardized terminological records and Linked Data.
Reusability of data is one of the most important practices in science, and investments in this (underestimated) operation may have positive long-term consequences in research (Pasquetto et al. 2017). In this paper, we discuss the benefits of this approach in terminology management by presenting a methodology for the preservation of multilingual terminological records and the practice of standardization as a fundamental step towards reusability. We present a case study to show the effectiveness of this methodology on an obsolete Website containing a multilingual medical glossary, and we share the source code as well as the standardized dataset.
In this paper, we describe the results of the participation of the Information Management Systems (IMS) group at CLEF eHealth 2021 Task 2, Consumer Health Search Task. We participated in the three subtasks: Ad-hoc IR, Weakly Supervised IR, Document credibility. The goal of our work was to evaluate the reciprocal ranking fusion approach over 1) manual query variants; 2) different retrieval functions; 3) with and without pseudo-relevance feedback; 4) reciprocal ranking fusion.
In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.
Terminology standardization reflects two different aspects involving the meaning of terms and the structure of terminological resources. In this paper, we focus on the structural aspect of standardization and we present the work of re-modeling TriMED, a multilingual terminological database conceived to support multi-register medical communication. In particular, we provide a general methodology to make the termbase compliant to three of the most recent ISO/TC 37 standards. We focus on the definition of (i) the structural meta-model of the resource, (ii) the provided data categories and its Data Category Repository, and (iii) the TBX format for its implementation.
The process of standardization plays an important role in the management of terminological resources. In this context, we present the work of re-modeling an existing multilingual terminological database for the medical domain, named TriMED. This resource was conceived in order to tackle some problems related to the complexity of medical terminology and to respond to different users’ needs. We provide a methodology that should be followed in order to make a termbase compliant to the three most recent ISO/TC 37 standards. In particular, we focus on the definition of i) the structural meta-model of the resource, ii) the data categories provided, and iii) the TBX format for its implementation. In addition to the formal standardization of the resource, we describe the realization of a new data category repository for the management of the TriMED terminological data and a Web application that can be used to access the multilingual terminological records.
Machine translation of scientific abstracts and terminologies has the potential to support health professionals and biomedical researchers in some of their activities. In the fifth edition of the WMT Biomedical Task, we addressed a total of eight language pairs. Five language pairs were previously addressed in past editions of the shared task, namely, English/German, English/French, English/Spanish, English/Portuguese, and English/Chinese. Three additional language pairs were also introduced this year: English/Russian, English/Italian, and English/Basque. The task addressed the evaluation of both scientific abstracts (all language pairs) and terminologies (English/Basque only). We received submissions from a total of 20 teams. For recurring language pairs, we observed an improvement in the translations in terms of automatic scores and qualitative evaluations, compared to previous years.
In this paper, we describe the results of the participation of the Information Management Systems (IMS) group at CLEF eHealth 2020 Task 2, Consumer Health Search Task. In particular, we participated in both subtasks: Ad-hoc IR and Spoken queries retrieval. The goal of our work was to evaluate the reciprocal ranking fusion approach over 1) different query variants; 2) different retrieval functions; 3) with and without pseudo-relevance feedback. The results show that, on average, the best performance is obtained by a ranking fusion approach together with pseudo-relevance feedback.
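The fusion step named in this abstract can be illustrated with a minimal sketch of reciprocal rank fusion. This is not the group's implementation: the function name, the document identifiers, and the constant `k=60` (a commonly used default, not a value stated in the abstract) are assumptions chosen for illustration. Each input ranking could come from a different query variant or retrieval function.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are then sorted by their summed score.
    k=60 is an assumed default, not a value taken from the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical runs, e.g. one per query variant or retrieval function.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d1"],
    ["d2", "d1", "d4"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)
```

A document that is ranked consistently high across runs accumulates the largest score, which is why fusion tends to reward agreement between query variants and retrieval functions.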
Sir Arthur Conan Doyle was an esteemed and highly experienced physician, and much of his medical knowledge spreads into his literary works. In this paper, we propose to study the medical terminology in the stories of Sherlock Holmes through a mixed method combining quantitative and qualitative analysis. Our approach is based on 1) the automatic extraction of medical terminology through the tidytext R package for text analyses, 2) a terminological analysis by means of the model of terminological record designed for the TriMED database, and 3) the study of collocations through the linguistic tool Sketch Engine. Thanks to this approach, we perform a linguistic analysis in order to evaluate different terminological aspects such as: the semantic variation due to temporal and historical factors, the difference of the context of use, the change of meaning based on the reference corpus, the variation of use depending on the speakers' or writers' register and, finally, the relationship between terms and their collocations from the syntactic viewpoint.
In this paper, we focus on the teaching of specialised translation and, in particular, on the preliminary phase of the translation process which is based on a broad and systematic work on the terminology of the micro-language considered. We present a new model of bilingual terminological record, as a digital tool supporting the process of translation of medical documents. Finally, we describe the results of a set of experiments which we have run since 2017 with two groups of students of the master’s degree of the University of Padua.
In this paper, we present a methodology for the development of a new eHealth resource in the context of Computational Terminology. This resource, named TriMED, is a digital library of terminological records designed to satisfy the information needs of different categories of users within the healthcare field: patients, language professionals and physicians. TriMED offers a wide range of information for the purpose of simplification of medical language in terms of understandability and readability. Finally, we present two applications of our resource in order to conduct different types of studies in particular in Information Retrieval and Literature Analysis.
In this study, we propose a new perspective on the concept of the weight of technical terms by focusing on the notion of “technicality” as a semantic property of the linguistic unit itself. The basic idea is that the technicality value of a term is inversely proportional to its polysemous nature. We formalize the v-tech formula and carry out an experimental evaluation in order to 1) compare the v-tech value with other measures of termhood (termicité or termitude) that are generally computed from the frequency of occurrence of terms in collections, and 2) integrate the v-tech formula into the score of a model for retrieving relevant documents for a systematic review task in the medical domain.
This study addresses the normalization criteria for certain standardized phraseological forms in medical terminology. We present a methodology of analysis based on the creation of terminological records from the multilingual resource TriMED in order to identify technical terms that have crystallized through frequent use in medical language but do not necessarily meet the criterion of linguistic correctness proposed by the ISO 704:2009 standard.
Three categories of people are confronted with the complexity of medical language, each with its own needs: physicians, technical-scientific translators and patients. This work proposes to develop a tool that helps to remedy the opacity that characterizes communication in the medical field among its various actors, by supporting peer-to-peer communication, providing technical-scientific translators with a regularly updated resource, and facilitating the comprehension of information by the general public: a multilingual terminological-phraseological resource. The database consists of terminological records designed to create a bridge between the registers identified (specialist, semi-specialist, non-specialist) in the languages considered. Restricted to the oncological field of breast cancer treatments, the terms to be processed are extracted from an English-language corpus, accompanied by all linguistically relevant information and properties, and linked to their pragmatic equivalent in Italian and French.
Supervised machine learning algorithms require a set of labelled examples to be trained; however, the labelling process is a costly and time-consuming task which is carried out by experts of the domain who label the dataset by means of an iterative process to filter out non-relevant objects of the dataset. In this paper, we describe a set of experiments that use gamification techniques to transform this labelling task into an interactive learning process where users can cooperate in order to achieve a common goal. To this end, we first use a geometrical interpretation of Naïve Bayes (NB) classifiers in order to create an intuitive visualization of the current state of the system and let the user change some of the parameters directly as part of a game. We apply this visualization technique to the classification of newswire and we report the results of the experiments conducted with different groups of people: PhD students, Master's degree students and the general public. Then, we present a preliminary experiment of query rewriting for systematic reviews in a medical scenario, which makes use of gamification techniques to collect different formulations of the same query. Both experiments show how the exploitation of gamification approaches helps to engage users in abstract tasks that might be hard to understand and/or boring to perform.
Three precise categories of people are confronted with the complexity of medical language: physicians, patients and scientific translators. The purpose of this work is to develop a methodology for the implementation of a terminological tool that contributes to solving problems related to the opacity that characterizes communication in the medical field among its various actors. The main goals are: i) to satisfy peer-to-peer communication, ii) to facilitate the comprehension of medical information by patients, and iii) to provide a regularly updated resource for scientific translators. We illustrate our methodology and its application through the description of a multilingual terminological-phraseological resource named TriMED. This terminological database will consist of records designed to create a terminological bridge between the various registers (specialist, semi-specialist, non-specialist) as well as across the languages considered. In this initial analysis, we restrict ourselves to the field of breast cancer; the terms to be analyzed will be extracted from a corpus in English, accompanied by all relevant linguistic information and properties, and linked to their pragmatic equivalent in Italian and French.
Technology-Assisted Review (TAR) approaches are essential to minimize the effort of the user during the search and collect all relevant documents. In this paper, we present a failure analysis based on terminological and linguistic aspects of a TAR system for systematic medical reviews. In particular, we analyze the results of the worst performing topics of the best experiments of the CLEF 2017 eHealth task on Technologically Assisted Reviews in Empirical Medicine. This is an extended abstract of the work presented in [2, 4].
This contribution provides an overview of the syntactic and semantic behavior of medical terms in the literary works of Conan Doyle. The object of study is the analysis of the scientific terms in the stories of Sherlock Holmes through the model of terminological record set out in a multilingual terminological database (TriMED) implemented for the linguistic analysis of technical medical terms. After the semi-automatic extraction of English technical terms and the realization of the terminological records for each of them, we have analyzed different aspects such as: the semantic variation due to temporal and historical factors, the difference of the context of use, the change of meaning based on the reference corpus, the variation of use depending on the speakers' or writers' register and, finally, the relationship between terms and their collocations from the syntactic viewpoint. After presenting our methodology and discussing the results of this analysis, we will provide some preliminary insights related to a comparative study between the linguistic aspects of the English medical term and its equivalent in the Italian version.
This is the second participation of the Information Management Systems (IMS) group at the CLEF eHealth Task of Technologically Assisted Reviews in Empirical Medicine. This task focuses on the problem of medical systematic reviews, a problem which requires a recall close (if not equal) to 100%. Semi-automated approaches are essential to support this type of search when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present a variation of the two-dimensional approach which 1) sets the maximum number of documents that the physician is willing to read, and 2) takes into account a sampling strategy to estimate the 95% confidence interval of the number of relevant documents present in the collection.
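The abstract does not say how the confidence interval is computed, so the following is only a minimal sketch of one standard way to do it: draw a random sample of documents, have them judged, and scale a normal-approximation (Wald) interval on the sampled proportion up to the collection size. The function name, the sample, and the collection size are hypothetical.

```python
import math


def estimate_relevant_count(sample_labels, collection_size, z=1.96):
    """Estimate how many relevant documents a collection contains.

    sample_labels: 0/1 relevance judgements for a random sample of documents.
    collection_size: total number of documents in the collection.
    z: normal quantile; 1.96 corresponds to a 95% confidence level.

    Returns (point_estimate, lower_bound, upper_bound) as document counts,
    using a Wald interval on the sample proportion (an assumed choice).
    """
    n = len(sample_labels)
    if n == 0:
        raise ValueError("the sample must contain at least one judged document")
    p = sum(sample_labels) / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    lower = max(0.0, p - half_width)  # a proportion cannot fall below 0
    upper = min(1.0, p + half_width)  # or rise above 1
    return p * collection_size, lower * collection_size, upper * collection_size


# Hypothetical sample: 10 relevant documents among 100 judged, 5000 in total.
labels = [1] * 10 + [0] * 90
estimate, low, high = estimate_relevant_count(labels, collection_size=5000)
print(f"estimated relevant documents: {estimate:.0f} (95% CI {low:.0f} to {high:.0f})")
```

The Wald interval is the simplest option and becomes unreliable when very few sampled documents are relevant, which is common in systematic reviews; a Wilson or exact interval would be a reasonable alternative in that setting.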
In this paper, we propose a methodology based on the R Markdown framework for replicating an experiment of query rewriting in the context of medical eHealth. We present a study on how to re-propose the same task of systematic medical reviews with the same conditions and methodologies to a larger group of participants. The task is the CLEF eHealth Task Technologically Assisted Reviews in Empirical Medicine, which consists in finding all the most relevant medical documents, given an information need, with the least effort. We study how lay people, in this case students of a master's degree in languages, can help the retrieval system find more relevant documents by means of a query rewriting approach.
Technology-Assisted Review (TAR) systems are essential to minimize the effort of the user during the search and retrieval of relevant documents for a specific information need. In this paper, we present a failure analysis based on terminological and linguistic aspects of a TAR system for systematic medical reviews. In particular, we analyze the results of the worst performing topics in terms of recall using the dataset of the CLEF 2017 eHealth task on TAR in Empirical Medicine.
In this paper, we describe the participation of the Information Management Systems (IMS) group at CLEF eHealth 2017 Task 2. This task focuses on the problem of systematic reviews, that is, articles that summarise all evidence that is published regarding a certain medical topic. This task, known in Information Retrieval as the total recall problem, requires long and tedious search sessions by experts in the field of medicine. Automatic (or semi-automatic) approaches are essential to support this type of search when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present the two-dimensional probabilistic version of BM25 with explicit relevance feedback together with a query aspect rewriting approach for both the simple evaluation and the cost-effective evaluation.
In this paper, we describe the participation of the Information Management Systems (IMS) group at CLEF eHealth 2017 Task 1. In this task, participants are required to extract causes of death from death reports (in French and in English) and label them with the correct International Classification of Diseases (ICD10) code. We tackled this task by focusing on the replicability and reproducibility of the experiments and, in particular, on building a basic compact system that produces a clean dataset that can be used to implement more sophisticated approaches.
In this paper, we report the ongoing developments of our first participation in the Cross-Language Evaluation Forum (CLEF) eHealth Task 1: “Multilingual Information Extraction - ICD10 coding” (Névéol et al., 2017). The task consists in labelling death certificates in French with international standard codes. In particular, we wanted to accomplish the goal of the ‘Replication track’ of this Task, which promotes the sharing of tools and the dissemination of solid, reproducible results.