Introduction to Corpora and Statistical Methods in Natural Language Processing


This course, CSA5011, delves into statistical natural language processing, covering language formalization, Java as an artificial language, natural language complexity, and levels of analysis in phonetics, morphology, syntax, and semantics.


Uploaded on Aug 22, 2024


Corpora and Statistical Methods
Albert Gatt

Course goals
Introduce the field of statistical natural language processing (statistical NLP).
Describe the main directions, problems, and algorithms in the field.
Discuss the theoretical foundations.
Involve students in hands-on experiments with real problems.
CSA5011 -- Corpora and Statistical Methods

A general introduction

Language
We can define a language formally as:
  a set of symbols (an alphabet)
  a set of rules for combining those symbols
This mathematical definition covers many classes of languages, not just human language.
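
To make the definition concrete, here is a minimal sketch of a toy formal language. The alphabet {a, b} and the rule S -> a S b | empty are illustrative assumptions, not part of the course material; the function simply checks whether a string can be built from those symbols with that rule.

```python
# Hypothetical toy language: alphabet {a, b}, one rule S -> a S b | empty.
ALPHABET = {"a", "b"}


def in_language(s: str) -> bool:
    """Return True if s can be derived from S using S -> a S b | empty."""
    if any(ch not in ALPHABET for ch in s):
        return False  # contains a symbol outside the alphabet
    if s == "":
        return True  # the empty expansion of S
    if s[0] == "a" and s[-1] == "b":
        return in_language(s[1:-1])  # strip one a...b layer and recurse
    return False


if __name__ == "__main__":
    for candidate in ["", "ab", "aabb", "abab", "abc"]:
        print(repr(candidate), in_language(candidate))
```

The same two ingredients (symbols and combination rules) are what the following slides look for in Java and in natural language.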

Java: an artificial (formal) language
A fixed set of basic symbols: public, static, for, while, {, }
A fixed syntax for combining symbols:
  public static void main(String[] args) {
      for (int i = 0; i < args.length; i++) {
      }
  }

Natural language
Often much more complicated than an artificial language.
NB: Some theorists view natural language (NL) as a special kind of formal language as well (e.g. Montague).
It does conform to the formal definition: there are symbols, and there are modes of combination.
However, these symbols and rules are defined at many levels.

Levels of analysis in natural language (I)
Acoustic properties (phonetics):
  defines a basic set of sounds in terms of their features
  studies the combination of these phonemes
Higher-order acoustic features (phonology):
  how phonemes combine into larger units, with suprasegmental features such as intonation

Levels of analysis in natural language (II)
Word formation (morphology):
  combines morphemes into words
Combination into longer units in a structure-dependent way (syntax):
  legal word combinations in a language
  recursive phrasal combination
Interpretation (semantics):
  of words (lexical semantics)
  of longer units (sentential/propositional semantics)
Interpretation in context (pragmatics)

Natural Language Processing
Studies language at all its levels: phonology, morphology, syntax, semantics.
Focuses on process (Sparck-Jones 2007): computational methods to understand and generate human language.
The distinction between NLP and computational linguistics is often fuzzy.

Kindred disciplines: Linguistics
Theoretical linguistics tends to be less process-oriented than NLP.
  Q: How can we characterise the knowledge that native speakers have of their language?
  This leads to declarative models of speakers' knowledge of language.
  It tends to say less about how speakers process language in real time.
  NB: This depends on the theoretical orientation!
NLP has strong ties to theoretical linguistics.
  It has also been an important contributor: process models can serve as tests for declarative models.

Kindred disciplines: Psycholinguistics
Like NLP, psycholinguistics tends to be strongly process-oriented.
  It studies the online processes of language understanding and language production.
NLP has benefited from such models.
NLP has also been a contributor: it is increasingly common to test psycholinguistic theories by building computational models.

Paradigms in NLP (I)
Knowledge-based: the system is based on a priori rules and constraints.
  E.g. a syntactic parser might have hand-crafted rules such as:
    NP → Det AdjP N
    AdjP → A+
Problem: it is extremely difficult to hand-code all the relevant knowledge.
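
A minimal sketch of what such hand-crafted knowledge looks like in practice, reading the slide's rules as NP → Det AdjP N and AdjP → A+. The lexicon, the `is_np` helper and the regular-expression encoding are assumptions made for illustration; a real knowledge-based parser would be far larger.

```python
import re

# Hypothetical toy lexicon mapping words to categories (Det, A, N).
LEXICON = {
    "the": "Det", "a": "Det",
    "tall": "A", "strange": "A", "old": "A",
    "woman": "N", "boy": "N",
}

# NP -> Det AdjP N, with AdjP -> A+ (one or more adjectives).
NP_PATTERN = re.compile(r"Det( A)+ N")


def is_np(phrase: str) -> bool:
    """Check a phrase against the two hand-written rules."""
    tags = [LEXICON.get(word.lower()) for word in phrase.split()]
    if None in tags:
        return False  # unknown word: the hand-coded knowledge has no coverage
    return NP_PATTERN.fullmatch(" ".join(tags)) is not None


if __name__ == "__main__":
    for phrase in ["the tall woman", "the tall strange boy", "the woman", "the tall lady"]:
        print(phrase, "->", is_np(phrase))
```

Note how the sketch already shows the problem named on the slide: "the woman" is rejected because the rule as written requires an adjective, and "the tall lady" is rejected because "lady" was never entered in the lexicon.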

Paradigms in NLP (II)
Statistical: the starting point is a large repository of text or speech (a corpus).
  The corpus is often annotated with relevant information, e.g.:
    parsed corpora (syntax)
    tagged corpora (part-of-speech)
    word-sense annotated corpora (semantics)
  The system tries to learn a model from the data.
  It tries to generalise this model to new data.
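
As a rough illustration of "learn a model from the data, then generalise", the sketch below trains a most-frequent-tag baseline on a tiny tagged corpus. The three training sentences are invented for this example (only the Brown-style tag names follow the tagging slide later in the deck), and the baseline is a deliberately simple stand-in, not the method any particular tagger uses.

```python
from collections import Counter, defaultdict

# Invented toy corpus of (word, tag) pairs; real tagged corpora are far larger.
TAGGED_CORPUS = [
    [("the", "at"), ("tall", "jj"), ("woman", "nn"), ("thought", "vbd")],
    [("the", "at"), ("thought", "nn"), ("was", "bedz"), ("strange", "jj")],
    [("she", "pps"), ("thought", "vbd")],
]


def train(tagged_sentences):
    """Learn, for each word, the tag it carries most often in the data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="nn"):
    """Tag new text; unseen words fall back to an assumed default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


if __name__ == "__main__":
    model = train(TAGGED_CORPUS)
    print(tag(model, ["The", "strange", "boy", "thought"]))
```

Nothing about English is hand-coded here: the preference for tagging "thought" as a verb comes entirely from the counts, and a different corpus would yield a different model.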

The paradigms: a bird's-eye view
We find similar divisions within mainstream linguistics:
  generative linguistics tends to formulate generalisations about internalised speaker knowledge of language (competence, I-Language)
  corpus linguistics tends to formulate generalisations based on patterns observed in corpora
The two paradigms are viewed as having roots in different traditions:
  the rationalist tradition (e.g. Plato, Descartes)
  the empiricist tradition (e.g. Locke)

The idea of linguistic knowledge
Traditional linguistic theory (since the 1950s) introduced a dichotomy:
  competence: a person's knowledge of language, formalised as a set of rules
  performance: actual production and perception of language in concrete situations
Much of linguistic theory has focused on characterising competence.

The idea of linguistic knowledge
The use of data (corpora) involves an increased focus on performance.
The idea is that exposure to the regularities found in actual language use is a crucial part of human language learning.

An initial example
Suppose you're a linguist interested in the syntax of verb phrases.
Some verbs are transitive, some intransitive:
  I ate the meat pie (transitive)
  I swam (intransitive)
What about quiver and quake?
  Most traditional grammars characterise these as intransitive.
  Corpus data suggest they have transitive uses:
    the insect quivered its wings
    it quaked his bowels (with fear)

Example II: lexical semantics
Quasi-synonymous lexical items, such as strong and powerful, exhibit subtle differences in context.
A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued
Some differences between strong and powerful (source: British National Corpus):
  strong: wind, feeling, accent, flavour
  powerful: tool, weapon, punch, engine
The differences are subtle, but examining the collocates of each word helps.
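
Collocate lists like these come from counting what occurs next to a word. The sketch below shows the idea on a short invented token sequence; it is not BNC data, and `right_collocates` is a hypothetical helper that only looks at the immediately following word.

```python
from collections import Counter

# Invented toy text standing in for a corpus; NOT drawn from the BNC.
TOKENS = (
    "a strong wind and a powerful engine met a strong accent "
    "and a powerful tool near a strong wind"
).split()


def right_collocates(tokens, target):
    """Count the words that immediately follow `target`."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)


if __name__ == "__main__":
    print("strong:", right_collocates(TOKENS, "strong").most_common())
    print("powerful:", right_collocates(TOKENS, "powerful").most_common())
```

On a real corpus one would normally also apply an association measure rather than raw counts, so that very frequent words do not dominate the lists.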

Statistical approaches to language
Do not rely on categorical judgements of grammaticality etc.
Examples:
  1. Degrees of grammaticality: people often do not have categorical judgements of acceptability.
  2. Category blending: We live nearer town than you thought. Is near an adjective or a preposition?
  3. Syntactic ambiguity: She killed the man with the gun. What is the most likely parse?

Statistical NLP vs. Corpus Linguistics (I)
Corpus linguistics became popular with the arrival of large, machine-readable corpora.
  It is generally viewed as a methodology.
  It tests hypotheses empirically on data.
  The aim is to refine a theory of language, or to discover novel generalisations.
Statistical NLP shares these aims; however:
  it is often corpus-driven rather than corpus-based
  the theory or model learned is often not given a priori

Statistical NLP vs. Corpus Linguistics (II)
The term corpus may mean different things to different people:
To a corpus linguist, a corpus is a balanced, representative sample of a particular language variety (e.g. the British National Corpus).
  Representativeness allows generalisations to be made more rigorously.
In statistical NLP, there has traditionally been less emphasis on these properties.
  The emphasis is on algorithms for learning language models.
  We frequently find the tacit assumption that the algorithm can be applied to any set of data, given the right annotations.

Some applications of Statistical NLP

Language Technology
[Diagram relating Speech, Text, Structure and Meaning. Labelled processes: speech recognition, speech synthesis, natural language analysis and understanding, natural language generation, and machine translation and summarisation (text to text).]

A (very) rough division of NLP tasks
Understanding: systems typically take free text or speech as input and conduct some structural or semantic analysis.
  Examples: POS tagging, parsing, semantic role labelling, sentiment/opinion mining, named entity recognition
Generation: systems typically take textual or non-linguistic input and output some text or speech.
  Examples: automatic weather reporting, summarisation, machine translation
How effective are statistical NLP tools at carrying out these and other tasks?
Are statistical techniques actually useful for learning things about language?

Example 1: Semantics
Example of an automatically acquired thesaurus of similar words. Data: 1.5 bn words obtained from the web (www.sketchengine.co.uk).
Words most similar to goat:
  Score   Word
  0.359   sheep
  0.345   cow
  0.331   pig
  0.305   rabbit
  0.304   cattle
  0.289   deer
  0.286   lamb
  0.276   donkey
  0.262   poultry
  0.261   boar
  0.259   camel
  0.258   elephant
  0.258   calf
  0.255   pony
How does this work?

Example 1: Semantics (cont'd)
Corpus-based lexical semantic acquisition typically uses vector-space models.
  These represent a word as a vector containing information about the contexts in which it is likely to occur.
  Some models also include grammatical relations (subject-of, object-of, etc.).
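
A minimal sketch of the vector-space idea, under simplifying assumptions: each word is represented by counts of the words seen within a small window around it, and similarity is measured with the cosine. The three sentences, the window size and the choice of cosine are illustrative; the slide does not say which weighting or similarity measure the thesaurus uses.

```python
import math
from collections import Counter

# Invented toy sentences; a real model would use a very large corpus.
SENTENCES = [
    ["the", "farmer", "milked", "the", "goat"],
    ["the", "farmer", "milked", "the", "sheep"],
    ["the", "mechanic", "repaired", "the", "engine"],
]


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words occurring within `window` positions."""
    vectors = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            context = sentence[lo:i] + sentence[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vecs = context_vectors(SENTENCES)
    print("goat ~ sheep :", round(cosine(vecs["goat"], vecs["sheep"]), 3))
    print("goat ~ engine:", round(cosine(vecs["goat"], vecs["engine"]), 3))
```

Words that occur in similar contexts end up with similar vectors, which is why a list of near neighbours can be read off as a thesaurus entry.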

Example 2: POS Tagging
Input: The tall woman and the strange boy thought statistical NLP was pointless.
Output from a statistical POS tagger, trained on the Brown Corpus (LingPipe demo library):
  <tok pos="at">The</tok>
  <tok pos="jj">tall</tok>
  <tok pos="nn">woman</tok>
  <tok pos="cc">and</tok>
  <tok pos="at">the</tok>
  <tok pos="jj">strange</tok>
  <tok pos="nn">boy</tok>
  <tok pos="vbd">thought</tok>
  <tok pos="jj">statistical</tok>
  <tok pos="nn">NLP</tok>
  <tok pos="bedz">was</tok>
  <tok pos="jj">pointless</tok>
  <tok pos=".">.</tok>
Uses of POS tagging:
  pre-parsing
  corpus analysis for linguistics
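
For corpus analysis, tagger output of this kind usually has to be read back into (word, tag) pairs. The sketch below does that for the slide's output using Python's standard XML parser; wrapping the fragment in an `<s>` root element is an assumption made here so that it parses as a single document, and `read_tokens` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The tagger output from the slide, copied as one fragment.
TAGGER_OUTPUT = (
    '<tok pos="at">The</tok> <tok pos="jj">tall</tok> <tok pos="nn">woman</tok> '
    '<tok pos="cc">and</tok> <tok pos="at">the</tok> <tok pos="jj">strange</tok> '
    '<tok pos="nn">boy</tok> <tok pos="vbd">thought</tok> '
    '<tok pos="jj">statistical</tok> <tok pos="nn">NLP</tok> '
    '<tok pos="bedz">was</tok> <tok pos="jj">pointless</tok> <tok pos=".">.</tok>'
)


def read_tokens(xml_fragment):
    """Return (word, tag) pairs from a sequence of <tok pos="..."> elements."""
    root = ET.fromstring(f"<s>{xml_fragment}</s>")  # add a root so the fragment is well-formed
    return [(tok.text, tok.get("pos")) for tok in root.iter("tok")]


tokens = read_tokens(TAGGER_OUTPUT)
tag_counts = Counter(tag for _, tag in tokens)

if __name__ == "__main__":
    print(tokens)
    print(tag_counts.most_common())
```

Once the pairs are available, simple questions such as how often each tag occurs reduce to counting.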

Example 3: parsing
Parsed using the Stanford Parser.
Based on a probabilistic context-free grammar (PCFG) of English: CFG rules with probabilities, trained on a treebank.
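
To show how rule probabilities select a parse, the sketch below scores two analyses of the earlier ambiguous sentence, She killed the man with the gun. In a PCFG the probability of a tree is the product of the probabilities of the rules used in it. The grammar and all of its numbers are invented for illustration (they are not the Stanford Parser's), and lexical rules are assumed to have probability 1 to keep the example short.

```python
# Invented rule probabilities; for each left-hand side they sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V", "NP", "PP")): 0.4,
    ("NP", ("Pro",)): 0.2,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}


def tree_prob(tree):
    """Probability of a tree given as nested tuples: (label, child, ...)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # preterminal -> word; assumed probability 1 in this sketch
    rhs = tuple(child[0] for child in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p


SHE = ("NP", ("Pro", "She"))
THE_MAN = ("NP", ("Det", "the"), ("N", "man"))
WITH_THE_GUN = ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "gun")))

# Reading 1: the PP attaches to the verb phrase (the gun is the instrument).
VP_ATTACH = ("S", SHE, ("VP", ("V", "killed"), THE_MAN, WITH_THE_GUN))
# Reading 2: the PP attaches to the noun phrase (the man has the gun).
NP_ATTACH = ("S", SHE, ("VP", ("V", "killed"), ("NP", THE_MAN, WITH_THE_GUN)))

if __name__ == "__main__":
    print("VP attachment:", tree_prob(VP_ATTACH))
    print("NP attachment:", tree_prob(NP_ATTACH))
```

With these made-up numbers the verb-phrase attachment scores higher; with probabilities estimated from a treebank the preference could go either way, which is exactly what training decides.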

Example 4: Machine translation
Input: (Maltese translation of the example sentence)
Output: The wife and son long strange nonetheless feels that the statistical NLP is without purpose.
Translated using Maltese-English Google Translate.
Based on automatic alignment between parallel text corpora.
Obvious shortcomings, but robust, i.e. some output is returned, even if garbled.

Example 5: Generation/Summarisation
[...] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified [...]
Automatically generated article about 3-M syndrome (Sauper and Barzilay 2009). Now on Wikipedia! (http://en.wikipedia.org/wiki/3-M_syndrome)
Summarised from multiple documents drawn from the web.
Uses templates automatically acquired from human-authored texts to ensure coherence.

Features of Statistical NLP systems
Robustness: typically, they don't break down with new or unknown input (although they may output garbage).
Portability: statistical learning algorithms can in principle be ported to new domains (given data).
Sensitivity to training data: if (say) a POS tagger is trained on medical text, its performance will decline on a new genre (e.g. news).

Some important concepts
All the systems surveyed rely on regularities in large repositories of training data, expressed as probabilities.
In practice, we distinguish between:
  training/development data: for learning a model and fine-tuning it
  test data: for evaluation on unseen but compatible data
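
The distinction can be sketched as a simple corpus split. The 80/10/10 proportions, the fixed random seed and the `split_corpus` helper are assumptions chosen for illustration; the slide does not prescribe particular sizes.

```python
import random


def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a corpus and split it into training, development and test portions."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n_test = round(len(items) * test_fraction)
    n_dev = round(len(items) * dev_fraction)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(100)]  # placeholder data
    train, dev, test = split_corpus(corpus)
    print(len(train), len(dev), len(test))
```

The test portion is set aside before any model is trained or tuned, so that the final evaluation really is on unseen data.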

References
Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33(3): 437-441.
McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches.)
