Prototyping AI Component for Fuzzy Matching in Controlled Vocabulary Terms
This roadmap outlines the development of an AI component that utilizes fuzzy matching to help users find controlled vocabulary terms. The process involves user input, AI searching, and user validation, aiming to streamline term selection and minimize manual steps. The system integrates thematic dictionaries, ontologies, and databases to provide accurate results based on the context of the search. Additionally, a Question/Answers Data Base stores user queries and selections for improved efficiency in subsequent searches.
eHS AI component roadmap. Step I: prototype with fuzzy matching
Maria BIRYUKOV, University of Luxembourg
Premises
The fuzzy matching step is intended to help the user find controlled vocabulary (CV) terms of the eTRIKS-selected terminology.
Input: a user-provided term (one word or a multiword expression).
Output: the |N| best corresponding CV terms, along with their unique identifiers. |N| can be specified by the user; the default N could be 5 or 10.
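A sketch of the interface implied by this slide; the function name fuzzy_match is illustrative, not the prototype's API:

```python
def fuzzy_match(term: str, n: int = 5) -> list[tuple[str, str, float]]:
    """Return the n best CV terms for a user-provided term.

    Each result is (cv_term, unique_id, similarity_score); n defaults
    to 5, one of the defaults suggested on this slide.
    """
    raise NotImplementedError  # produced by the matching methods on later slides
```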
Resources
- Thematic dictionaries, ontologies, databases. Currently, the Entrez Gene database is used as the source.
- Importantly, the AI knows that the CV term it searches for is the content of an already identified CV variable, so that it can provide a CV term from the right terminology. If the search is for a CV variable, the AI knows which template to search (targeted search). In both cases, the AI knows the context (i.e. therapeutic area, user, study, study class, etc.) in which it searches for a CV term, in order to give a more accurate and faster result (targeted search) than a broad search.
- A locally created and regularly updated Question/Answers Data Base (QADB), which stores user queries along with the answers selected as appropriate by users. This is what we called the study annotation profile in the previous eHS document. It needs to be recorded in eMDR and Biospeak with the context defined above and in the previous eHS document. The AI could interrogate the study profiles before the public resources in order to gain speed and accuracy (e.g. if it is the same user, reuse the same annotation).
Procedure
The user is prompted to:
1. Introduce his/her query
2. Specify the method he/she would like to use to find corresponding standard terms
3. Specify the maximum number of candidates to return (matches to show)
4. Specify the similarity threshold
Step (1) is obligatory. Steps (2-4) are optional; if not provided, default parameters are used.
The goal of the process is to avoid or reduce manual steps. Thus, it seems to me that steps are missing (see the sketch after this list):
a) The AI takes the user's terms one by one automatically and searches for their corresponding CV terms. When there is a perfect match, the CV term is selected.
b) If there are only partial matches, the AI asks the user to choose the right CV term.
c) If the user does not select an AI-proposed CV term, the user has the possibility to broaden the query (step 2 above).
In case of a perfect match and automatic selection of the CV term by the AI, the user has the possibility to validate the AI selection, unless the selection comes from an annotation made by that user for that study (stored in the Biospeak DB).
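A minimal sketch of the flow in (a)-(c); the Candidate record and the ask_user/validate callbacks are illustrative assumptions, not the prototype's API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Candidate:
    cv_term: str
    term_id: str
    score: float                    # 1.0 means a perfect match
    from_own_profile: bool = False  # True if it comes from this user's annotation of this study

def select_cv_term(candidates: list[Candidate],
                   ask_user: Callable[[list], Optional[Candidate]],
                   validate: Callable[[Candidate], bool]) -> Optional[Candidate]:
    """Decision flow (a)-(c): auto-select perfect matches, ask on partial ones."""
    perfect = [c for c in candidates if c.score == 1.0]
    if perfect:
        choice = perfect[0]
        # (a) perfect match: selected automatically; the user validates it,
        # unless it comes from his/her own annotation of that study
        if choice.from_own_profile or validate(choice):
            return choice
        return None
    if candidates:
        # (b) partial matches only: the user chooses the right CV term
        return ask_user(candidates)
    # (c) nothing selected: the caller may broaden the query (step 2 above)
    return None
```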
Procedure
Once the user has typed in the query (Q), a QADB lookup is performed.
If Q is in the QADB:
- The answers to Q are displayed, from most to least popular. What does popular mean? The number of identical answers to a given query? If so, this is fine only if the context is the same: you cannot always expect a cardiologist to annotate like an oncologist. The context should be taken into account to avoid irrelevant propositions/answers.
- The user is prompted to select the answer which corresponds to his/her intention, if there is one.
- If Q is answered, the user may either introduce a new query or quit.
- If the displayed answers do not satisfy the user, he/she may either proceed to the regular DB search (go to step 2, see previous slide) or quit. Again, the goal is to reduce manual intervention. Thus, before going to step 2, the search would be broadened automatically: as explained in the previous document, the search flow/cycle for a value goes from the narrowest to the broadest range, as described below (see the sketch after this list):
1. Search in the user's study annotation profile. If no matches, go to 2.
2. Search in all the study annotation profiles of the same context. If no matches, go to 3.
3. Search in the terminology selected by eTRIKS for the given CV variable and the same context. If no matches, go to 4.
4. Search in the terminology selected by eTRIKS for the given CV variable, without context. If no matches, go to 5.
5. Search in all the eTRIKS-selected terminologies. If no matches, go to step 2 of the previous slide.
If Q is not in the QADB, the procedure continues from step 2 (see previous slide).
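A minimal sketch of the narrow-to-broad cascade in steps 1-5 above, with each scope passed in as a search callable; the signatures are assumptions for illustration:

```python
from typing import Callable, Optional

def cascade_search(query: str,
                   scopes: list[Callable[[str], list]]) -> Optional[list]:
    """Try each scope from narrowest to broadest and stop at the first hit.

    Scope order per the list above:
      1. the user's own study annotation profile
      2. all study annotation profiles of the same context
      3. the eTRIKS terminology for the CV variable, same context
      4. the same terminology, without context
      5. all eTRIKS-selected terminologies
    """
    for search_scope in scopes:
        matches = search_scope(query)
        if matches:
            return matches
    return None  # fall through to the regular DB search (step 2, previous slide)
```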
Local Resource Maintenance
All the queries are stored, along with information about how frequent they are, in the All-Queries Data Base (AQDB). I do not understand the usefulness of storing the queries and their frequencies if their answers are not stored with them. Moreover, the context of the query should be stored too: you cannot expect a cardiologist to make the same queries as an oncologist (a sketch of such a record follows).
Queries for which no answer was found in the resources can be worked off-line and serve for resource enrichment. The user must have the possibility to annotate his/her term with a new CV term, or to propose his/her term as a CV term (then validated or not by curators through the post-loading curation tool).
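A sketch of the record this comment argues for: each query stored with its frequency, its context, and the answers users selected. The field names are illustrative, not from the slides:

```python
from dataclasses import dataclass, field

@dataclass
class QueryRecord:
    """One AQDB entry, extended as the comment above suggests."""
    query: str                                   # the user-typed term
    frequency: int = 0                           # how often the query was issued
    context: dict = field(default_factory=dict)  # e.g. therapeutic area, user, study
    answers: dict = field(default_factory=dict)  # term_id -> votes, as selected by users
```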
Fuzzy String Matching Methods
Three methods for fuzzy string matching are implemented (see the sketch below):
1. Gestalt pattern matching
2. Ngram-based cosine similarity
3. Word-based cosine similarity
All three are appropriate for fuzzy string matching and often produce similar results. However, (1) and (2) handle spelling mistakes better, while (3) is more robust to word order changes or word omission. We will test the methods with real data and, depending on the results, keep, remove or add methods.
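A minimal sketch of the three methods, assuming lower-cased input and character trigrams for method (2), neither of which the slides specify:

```python
import difflib
import math
from collections import Counter

def gestalt(a: str, b: str) -> float:
    """(1) Gestalt (Ratcliff/Obershelp) pattern matching, as in difflib."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def _cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """(2) Cosine similarity over character n-grams (trigrams assumed here)."""
    grams = lambda s: Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 1)))
    return _cosine(grams(a.lower()), grams(b.lower()))

def word_cosine(a: str, b: str) -> float:
    """(3) Cosine similarity over word counts; robust to reordering/omission."""
    return _cosine(Counter(a.lower().split()), Counter(b.lower().split()))
```

This also illustrates the trade-off stated above: (1) and (2) reward shared character sequences, so a typo such as "sterod hormone receptor" still scores high, while (3) gives "receptor steroid hormone" the same score as the correctly ordered query.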
Example
For illustration purposes, let us assume that the user's queries are protein names which he/she would like to map to standard Entrez Gene names and identifiers.
Query 1: "steroid hormone receptor".
Search and Results
The DB was already searched for that term earlier. In your example, taking the context into account would provide a more accurate result: if the user has mentioned that the subject is human, the AI will show only answer one (see above).
The candidate answers are proposed in order of their popularity: the highest number of votes first. Again, popularity could be misleading without taking the context into account.
Other aliases/designations are alternative spellings of the standard name. In a previous document, we describe the search as being done first on the CV terms, then on the synonyms related to the CV terms, when searching in eTRIKS-selected terminologies (see slide 5).
Organism illustrates the ambiguity. Resolved, see above.
QA Database Update
The user's choice is accounted for immediately, as suggested by the new order of the candidate answers (compare with the previous slide); a sketch of this update follows below.
If the user does not like any of the proposed answers, his/her query can be processed against the whole system database, i.e. a regular search. As described in slide 5, point 2? If not, please define "regular search".
Note that when the user's query is not found in the QADB, the procedure follows the regular search path directly. Please define "regular search" for us and for the user (see next slide).
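A minimal sketch of the immediate update, with the QADB modeled as a plain dict from query to {term_id: votes}; this representation is an assumption:

```python
def record_choice(qadb: dict, query: str, term_id: str) -> list[tuple[str, int]]:
    """Add one vote for the chosen term and return the re-ranked answer list."""
    votes = qadb.setdefault(query, {})
    votes[term_id] = votes.get(term_id, 0) + 1
    # the new ranking takes effect immediately, most popular answers first
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
```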
Searching Process
Regular search elements: query, string comparison method, maximum number of suggestions to display, similarity threshold.
The user may select one, two or three methods.
A string similarity threshold of 0.00 means the threshold value will be applied internally, depending on the method. Similarity strength: define 0 and 1. Is 1 the strongest? This should be indicated to the user. (A sketch of the defaulting behaviour follows.)
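A sketch of the defaulting rule described above; the per-method default values are assumptions, since the slides do not state them, and 1.0 is taken to be the strongest similarity (an exact match) with 0.0 the weakest:

```python
# Hypothetical per-method defaults; the real internal values are not given.
DEFAULT_THRESHOLDS = {"gestalt": 0.6, "ngram_cosine": 0.5, "word_cosine": 0.5}

def effective_threshold(method: str, user_value: float = 0.0) -> float:
    """A user value of 0.00 means: apply the method's internal default."""
    return user_value if user_value > 0.0 else DEFAULT_THRESHOLDS[method]
```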
Results
- The methods are applied one after another.
- Answers are displayed from the highest to the lowest string similarity score.
- The N top-ranked items are shown, where N = the maximum number of matches to display (see previous slide).
- If the "show more" option is chosen, the answers are displayed in batches. I do not understand why the answers are displayed in batches; on which parameters are they batched? The best answer can be selected from any/all batch(es).
End of the session
Continue with the same query, using the 2nd method from the user's method selection.
Local resources after user session(s)
Fragment of the All-Queries Data Base (AQDB): query, query frequency.
Queries are systematically stored and an overall counter is maintained per query. This allows for:
- Local ontology enrichment. I do not understand how the storage of queries alone can help ontology enrichment; the users' answers to the queries, OK.
- Grouping of queries by projects.
Local resources after user session(s)
Fragment of the Question/Answers Data Base (QADB): query, term ID, votes. As written previously, the context should be saved along with the answers and queries, in the eMDR (anonymous, all users) and the Biospeak DB (user specific).
A query is a one- or multi-word expression; an answer is the standard term name plus its unique identifier. Because of (potential) ambiguity, many standard IDs may correspond to the same query. Ambiguity should be limited if the context is saved too, since eTRIKS has already selected a terminology for each CV variable.
Votes = how many times users have selected a specific ID for the term. Votes are accumulated throughout all sessions. In the example above, "steroid hormone receptor" was mostly selected as ESRRA of human (5 times), and equally often as Esrra of Norway rat and esrra of zebrafish (see the sketch below).
The QADB stores queries and the answers which users have selected as most suitable.
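The slide's vote example, modeled as a query -> {answer: votes} mapping with a ranking helper; this representation, and the keys combining name and organism, are illustrative:

```python
qadb = {
    "steroid hormone receptor": {
        "ESRRA (human)": 5,        # selected five times
        "Esrra (Norway rat)": 1,   # selected once
        "esrra (zebrafish)": 1,    # selected once
    },
}

def ranked_answers(qadb: dict, query: str) -> list[tuple[str, int]]:
    """Answers for a repeated query, most voted first (cf. the next slide)."""
    return sorted(qadb.get(query, {}).items(), key=lambda kv: kv[1], reverse=True)
```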
Next session: already seen query
If this term is searched again, the QADB suggestions will be displayed in order of popularity, most voted first. The context must be taken into account in order to avoid irrelevant propositions (e.g. proposing rat Esrra because it is frequent, while the subject is human). See slide 9.
To Do
- Implement answering based on the users' votes. Done.
- Test the system on real data (not provided yet).
- Adjust according to the results and with respect to the data.
- The command line demo will be provided later as an API, once the form of the API is agreed with the other WPs.
- Include the storage of the context for each query/answer pair, in order to provide users with accurate, relevant and fast propositions.