Terminology Finding in the Sketch Engine

Miloš Jakubíček, Adam Kilgarriff,
Vojtěch Kovář, Pavel Rychlý, Vit Suchomel
Lexical Computing Ltd., Brighton, UK &
Masaryk University, Brno, Czech Republic
1
Terminology
Problem #1
Finding it
Existing lists
Ask experts
Corpora
3
To find terms in a corpus
Unithood
For multi-word terms
Do the words form a unit?
Termhood
Does it belong to the domain?
4
Unithood
Grammar
Terms are noun phrases
(in canonical form, without the article)
Requirements
Noun phrase grammar
Prerequisites: tokeniser, lemmatiser, POS-tagger
Parsing machinery
5
Termhood
Frequency in domain corpus vs reference corpus
Same as keywords
Requirements
Formula for keyness
Domain corpus
Reference corpus
6
In the Sketch Engine
 
7
Unithood
Grammar
Terms are noun phrases
(in canonical form, without the article)
Requirements
Noun phrase grammar 
To date: Chinese English French Japanese Korean Spanish
In progress: German Portuguese Russian
Collaboration with experts 
Prerequisites: tokeniser, lemmatiser, POS-tagger
Available/installed for languages above and several others
Parsing machinery
In place: variant on word sketches infrastructure
8
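The unithood step can be illustrated with a minimal sketch. It assumes a tiny hypothetical tagset (ADJ, NOUN and others) and an English-style pattern of adjectives and nouns ending in a noun; it is not the Sketch Engine term grammar, which is written per language on the word sketch infrastructure. The function name and the input format are made up for the example.

```python
def noun_phrase_candidates(tagged):
    """Return candidate terms from a list of (word, lemma, pos) triples.

    A candidate is a maximal run of ADJ/NOUN tokens that ends in a NOUN,
    rendered as a sequence of lemmas (canonical form, no article).
    """
    candidates = []
    run = []
    # A sentinel token flushes the final run at the end of the input.
    for word, lemma, pos in tagged + [("", "", "END")]:
        if pos in ("ADJ", "NOUN"):
            run.append((lemma, pos))
            continue
        # The run has ended: drop trailing adjectives so it ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            candidates.append(" ".join(lem for lem, _ in run))
        run = []
    return candidates


sentence = [
    ("The", "the", "DET"),
    ("pyroclastic", "pyroclastic", "ADJ"),
    ("surges", "surge", "NOUN"),
    ("destroyed", "destroy", "VERB"),
    ("the", "the", "DET"),
    ("small", "small", "ADJ"),
    ("village", "village", "NOUN"),
]
print(noun_phrase_candidates(sentence))
```

Both noun phrases pass the unithood test here; deciding that only one of them belongs to the domain is the job of the termhood step.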
Termhood
Frequency in domain corpus vs reference corpus
Same as keywords
Requirements
Formula for keyness
Kilgarriff 2009: Simple maths for keywords
Ratio of normalised frequencies (with simplemaths parameter)
Domain corpus
Existing machinery for
Instant corpora from the web: WebBootCaT
Uploading/installing your own corpus
Reference corpus
Large web corpora: sixty languages
9
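The keyness formula can be sketched as follows, assuming the usual reading of "Simple maths for keywords": frequencies are normalised to per-million, the simplemaths parameter n is added to both, and the ratio is taken. The function names, the example counts and the default n=1.0 are assumptions for illustration only.

```python
def simple_maths_keyness(freq_domain, size_domain, freq_ref, size_ref, n=1.0):
    """Ratio of normalised (per-million) frequencies, smoothed by n."""
    fpm_domain = freq_domain * 1_000_000 / size_domain
    fpm_ref = freq_ref * 1_000_000 / size_ref
    return (fpm_domain + n) / (fpm_ref + n)


def rank_terms(domain_counts, size_domain, ref_counts, size_ref, n=1.0):
    """Score every domain candidate and sort by keyness, highest first."""
    scored = {
        term: simple_maths_keyness(
            freq, size_domain, ref_counts.get(term, 0), size_ref, n
        )
        for term, freq in domain_counts.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Invented counts: a 1M-token domain corpus and a 10M-token reference corpus.
domain_counts = {"pyroclastic surge": 50, "last year": 40}
ref_counts = {"pyroclastic surge": 10, "last year": 4000}
ranked = rank_terms(domain_counts, 1_000_000, ref_counts, 10_000_000)
for term, score in ranked:
    print(f"{term}\t{score:.3f}")
```

Adding n avoids division by zero for candidates missing from the reference corpus, and a larger n shifts the ranking towards more frequent items.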
<Examples ... En, Fr, Korean>
10
Processing chains
Tokeniser-lemmatiser-POS-tagger
Must be identical for
Reference corpus (batch mode)
Domain corpus (runtime)
Recent work
Processing chains reviewed
Separated out for independent application
11
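Why the chains must be identical can be shown with a toy example. The two "chains" below are hypothetical stand-ins (whitespace tokenisation with and without lowercasing), not the real tokeniser-lemmatiser-POS-tagger tools; the point is only that items produced by one chain cannot be looked up reliably in counts produced by another.

```python
from collections import Counter


def chain_a(text):
    """Toy chain: whitespace tokenisation plus lowercasing."""
    return [tok.lower() for tok in text.split()]


def chain_b(text):
    """Toy chain that differs from chain_a: no lowercasing."""
    return text.split()


reference_text = "the patent claims a device and the patent describes a method"
domain_text = "Patent claims cover the device"

# Reference corpus processed once, in batch mode, with chain_a.
ref_counts = Counter(chain_a(reference_text))

# Domain corpus processed at runtime, once with the same chain, once with another.
same_chain = Counter(chain_a(domain_text))
other_chain = Counter(chain_b(domain_text))

# Same chain: "patent" is matched against its reference frequency.
print("patent", same_chain["patent"], ref_counts["patent"])
# Different chain: "Patent" has no counterpart, so its keyness would be inflated.
print("Patent", other_chain["Patent"], ref_counts["Patent"])
```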
 
Current status
Lead customer
WIPO (World Intellectual Property Organization)
terminology group of their translation dept
Five languages: delivered
Added functionality, blacklists etc
All customers
First version in beta
13
Current challenge
Lemmas and word forms
When to use singular, when plural
Adjective-noun agreement
nuée ardente (volcanology: Fr for pyroclastic surge)
Feminine, often plural
Lemmas: nuée ardent (wrong)
Word forms: nuées ardentes (a little bit wrong)
16
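The lemma versus word-form problem can be reproduced in a few lines. The token pairs below are written by hand, assuming a lemmatiser that reduces French adjectives to the masculine singular; the sketch only shows the two renderings named on the slide and does not propose a fix.

```python
# One corpus occurrence of the term as (word form, lemma) pairs.
occurrence = [("nuées", "nuée"), ("ardentes", "ardent")]


def render(tokens, use_lemmas):
    """Join either the lemmas or the word forms of a term occurrence."""
    index = 1 if use_lemmas else 0
    return " ".join(token[index] for token in tokens)


print(render(occurrence, use_lemmas=True))   # lemma rendering loses agreement
print(render(occurrence, use_lemmas=False))  # word-form rendering keeps the plural
```

Neither rendering yields the canonical "nuée ardente", which needs a singular noun together with an adjective that still agrees with it in gender.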
Summary
Terminology-finding needs
Term grammar
Reference corpus + domain corpus
All available in Sketch Engine
Already, for
English French Chinese Japanese Korean Russian Spanish
Shortly for
German Portuguese
Others to follow as requested
All set for you to use: feedback please!
17
Thank you
http://www.sketchengine.co.uk
http://beta.sketchengine.co.uk
18

Terminology finding in the Sketch Engine involves identifying terms in a corpus, determining their relevance through unithood and termhood, and utilizing grammar for analysis. The process includes assessing frequency in domain versus reference corpora, collaborating with experts, and applying keyness formulas for keyword extraction.


Uploaded on Sep 09, 2024 | 2 Views


