Evolution of Indo-European Languages through Phylogeny Estimation

Phylogeny estimation under a model of linguistic character evolution
Tandy Warnow
Department of Computer Science
University of Illinois at Urbana-Champaign
The Computational Historical Linguistics Project
http://web.engr.illinois.edu/~warnow/histling.html
Collaboration with Don Ringe began in 1994; 17 papers since then, and two NSF grants.
Dataset generation by Ringe and Ann Taylor (then a postdoc with Ringe, now Senior Lecturer at York University).
Method development with Luay Nakhleh (then my student, now Chair and Professor at Rice University) and Steve Evans (Prof. of Statistics, Berkeley).
Simulation study with Francois Barbançon (then my postdoc).
Indo-European languages
From linguistica.tribe.net
Homoplasy-free evolution
When a character changes state, it changes to a new state not in the tree; i.e., there is no homoplasy (character reversal or parallel evolution).
First inferred for weird innovations in phonological characters and morphological characters in the 19th century, and used to establish all the major subgroups within IE.
(But… other characters also evolve without homoplasy.)
 
[Figure: tree whose nodes are labeled with the states 0 and 1 of a binary character]
 Possible Indo-European tree
(Ringe, Warnow and Taylor 2000)
Another possible Indo-European tree
(Gray & Atkinson, 2004)
 
Italic  Gmc.  Celtic  Baltic  Slavic   Alb.  Indic  Iranian   Armenian Greek Toch.   Anatolian
    
Gray and Atkinson Method
Used only lexical data
Binary-encoding of all multi-state character data to perform the analysis
Assumes the CFN (Cavender-Farris-Neyman) model of binary sequence evolution, in which all characters evolve identically and independently down the model tree
Uses MCMC to sample from a distribution (specifically used MrBayes software)
This talk
Introduction to molecular systematics (very fast)
Linguistic data and the Ringe-Warnow analyses of the Indo-European language family
Perfect Phylogenetic Networks (Nakhleh et al., Language 2005)
Comparison of different phylogenetic methods on Indo-European datasets (Nakhleh et al., Transactions of the Philological Society 2005)
Stochastic model of linguistic character evolution (Warnow et al., MacDonald Institute for Archaeological Research, 2006)
Simulation study evaluating different phylogenetic methods (Barbançon et al., Diachronica 2013)
Discussion and Future work
DNA Sequence Evolution (Idealized)
Markov Models of DNA Sequence Evolution
The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).
Simplest site evolution model (Jukes-Cantor, 1969):
The model tree T is binary and has substitution probabilities p(e) on each edge e.
The state at the root is randomly drawn from {A,C,T,G} (nucleotides).
If a site (position) changes on an edge, it changes with equal probability to each of the remaining states.
The evolutionary process is Markovian.
More complex models (such as the General Markov model) are also considered, often with little change to the theory.
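The Jukes-Cantor substitution step on a single edge can be sketched as follows. This is only an illustration: the function names, the sequence length, and the edge probability 0.2 are assumptions made for the example, and rate variation across sites is left out.

```python
import random

NUCLEOTIDES = "ACGT"

def evolve_site(state, p, rng):
    """With probability p the site changes, uniformly to one of the other three states."""
    if rng.random() < p:
        return rng.choice([s for s in NUCLEOTIDES if s != state])
    return state

def evolve_sequence(seq, p, rng):
    """Evolve every site independently along one edge with substitution probability p."""
    return "".join(evolve_site(s, p, rng) for s in seq)

rng = random.Random(0)
# Root sequence drawn uniformly from {A, C, G, T} (length 10 is an arbitrary choice).
root = "".join(rng.choice(NUCLEOTIDES) for _ in range(10))
child = evolve_sequence(root, 0.2, rng)
print(root, child)
```

Applying `evolve_sequence` recursively from the root to the leaves of a tree would give one simulated dataset under this simplified model.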
CFN Model of Binary Sequence Evolution
The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).
CFN model (Cavender, Farris, Neyman):
The model tree T is binary and has substitution probabilities p(e) on each edge e.
The state at the root is randomly drawn from {0,1} (presence/absence).
If a site (position) changes on an edge, it flips to the other state (the only remaining one).
The evolutionary process is Markovian.
Used for modelling trait evolution (presence/absence)
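A minimal sketch of simulating binary characters down a small tree under CFN-style rules. The tree shape, leaf names, and edge probabilities below are hypothetical, and gamma-distributed rates are omitted.

```python
import random

# Hypothetical model tree: an internal node is a list of (child, p_edge) pairs,
# a leaf is a string. The edge probabilities are made up for illustration.
TREE = [("A", 0.1), ([("B", 0.05), ("C", 0.05)], 0.2)]

def simulate_site(node, state, rng, out):
    """Push one binary state down the tree, flipping it with probability p(e) on each edge."""
    if isinstance(node, str):
        out[node] = state
        return
    for child, p in node:
        child_state = 1 - state if rng.random() < p else state
        simulate_site(child, child_state, rng, out)

def simulate(tree, n_sites, rng):
    """Simulate n_sites i.i.d. sites; the root state is drawn uniformly from {0, 1}."""
    seqs = {}
    for _ in range(n_sites):
        out = {}
        simulate_site(tree, rng.randint(0, 1), rng, out)
        for leaf, s in out.items():
            seqs.setdefault(leaf, []).append(s)
    return seqs

print(simulate(TREE, 8, random.Random(1)))
```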
Markov Models in Biology
All standard models have a finite number of states (2 for traits, 4 for DNA/RNA, 20 for amino acids), and hence lots of homoplasy.
All characters (sites in an alignment) evolve identically and independently (iid) down the tree.
Statistical Consistency and Identifiability
[Figure: error plotted against amount of data]
“State of the Art” for Molecular Systematics
Many distance methods are statistically consistent (but not UPGMA)
Parsimony is not statistically consistent
Maximum likelihood and Bayesian MCMC are statistically consistent and highly favored
Model of Linguistic Character Evolution
(Warnow, Evans, Ringe, and Nakhleh, 2006)
Three types of characters: lexical, phonological, and
morphological
Infinite number of possible states
Character evolution is not iid
Borrowing can occur
Limited homoplasy enables identifiability
Statistically consistent methods presented
Our main points
Biomolecular data evolve differently from linguistic data, and linguistic models and methods should not be based upon biological models.
Better (more accurate) phylogenies can be obtained by formulating models and methods based upon linguistic scholarship, and using good data.
All methods, whether explicitly based upon statistical models or not, need to be carefully tested.
Historical Linguistic Data
A character is a function that maps a set of languages, L, to a set of states.
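A character in this sense can be sketched as a plain mapping from languages to states. The word forms below come from the "hand" slide of this deck; the integer state labels and the grouping into cognate classes are illustrative assumptions, not the coding used by Ringe and Taylor.

```python
from collections import defaultdict

# Word forms for 'hand' (a subset of the languages on the slide).
hand_forms = {
    "Old English": "hand", "Gothic": "handus", "OHG": "hant",
    "Latin": "manus", "Oscan": "manim", "Welsh": "llaw",
}

# The character itself: language -> state (hypothetical cognate-class labels).
hand_character = {
    "Old English": 0, "Gothic": 0, "OHG": 0,
    "Latin": 1, "Oscan": 1,
    "Welsh": 2,
}

def cognate_classes(character):
    """Invert the character: state -> sorted list of languages exhibiting it."""
    classes = defaultdict(list)
    for language, state in character.items():
        classes[state].append(language)
    return {state: sorted(langs) for state, langs in classes.items()}

print(cognate_classes(hand_character))
```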
Three kinds of characters:
Phonological (sound changes)
Lexical (meanings based on a wordlist)
Morphological (especially inflectional)
Sound changes
Many sound changes are natural, and should not be used for
phylogenetic reconstruction.
Others are bizarre, or are composed of a sequence of simple
sound changes.  These are useful for subgrouping purposes.
Example: Grimm's Law.
1. Proto-Indo-European voiceless stops change into voiceless fricatives.
2. Proto-Indo-European voiced stops become voiceless stops.
3. Proto-Indo-European voiced aspirated stops become voiced fricatives.
 
 
  Semantic slot for hand – coded
(Partitioned into cognate classes)
 
Lexical characters can also evolve without homoplasy
For every cognate class, the nodes of the tree in that class should form a connected subset, as long as there is no undetected borrowing or parallel semantic shift.
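One way to test the connected-subset condition on a given tree is sketched below. It relies on a standard fact: a character with r states observed at the leaves evolves without homoplasy on a tree exactly when it can be explained with r - 1 state changes. The sketch computes the minimum number of changes with Fitch's algorithm on a rooted binary tree; the example tree and state assignments are hypothetical.

```python
def fitch(node, states):
    """Return (candidate state set, minimum number of changes) for a rooted binary tree.

    A leaf is a string; an internal node is a pair (left, right).
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def is_compatible(tree, states):
    """True when the character needs only (number of states - 1) changes on the tree."""
    _, changes = fitch(tree, states)
    return changes == len(set(states.values())) - 1

tree = ((("A", "B"), "C"), ("D", "E"))
print(is_compatible(tree, {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2}))
print(is_compatible(tree, {"A": 0, "B": 1, "C": 0, "D": 1, "E": 1}))
```

The first character fits the tree without homoplasy; the second does not, because state 0 (or state 1) would have to arise twice.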
 
[Figure: tree whose nodes are labeled with cognate-class states 0, 1, and 2]
Our (RWT) Data
Ringe & Taylor (2002)
259 lexical
13 morphological
22 phonological
These data have cognate judgments estimated by
Ringe and Taylor, and vetted by other Indo-
Europeanists. (Alternate encodings were tested, and
mostly did not change the reconstruction.)
Polymorphic characters, and characters known to
evolve in parallel, were removed.
Differences between different characters
Lexical: most easily borrowed (most borrowings detectable), and homoplasy relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary).
Phonological: can still be borrowed, but much less likely than lexical. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters are very often homoplastic.
Morphological: least easily borrowed, least likely to be homoplastic.
Our methods/models
Ringe & Warnow, Almost Perfect Phylogeny (APP): most characters evolve without homoplasy under a no-common-mechanism assumption (various publications since 1995)
Ringe, Warnow, & Nakhleh, Perfect Phylogenetic Network (PPN): extends the APP model to allow for borrowing, but assumes homoplasy-free evolution for all characters (Language, 2005)
Warnow, Evans, Ringe & Nakhleh, Extended Markov model: parameterizes PPN and allows for homoplasy provided that homoplastic states can be identified from the data (MacDonald Institute for Archaeological Research, 2006)
First analysis: Almost Perfect Phylogeny
The original dataset contained 375 characters (336 lexical, 17 morphological, and 22 phonological).
We screened the dataset to eliminate characters likely to evolve homoplastically or by borrowing.
On this reduced dataset (259 lexical, 13 morphological, 22 phonological), we attempted to maximize the number of compatible characters while requiring that certain of the morphological and phonological characters be compatible. (Computational problem NP-hard.)
Indo-European Tree
(95% of the characters compatible)
Second attempt: PPN
We explain the remaining incompatible characters by inferring previously undetected borrowing.
We attempted to find a PPN (perfect phylogenetic network) with the smallest number of contact edges, borrowing events, and with maximal feasibility with respect to the historical record. (Computational problems NP-hard.)
Our analysis produced one solution with only three contact edges that optimized each of the criteria. Two of the contact edges are well-supported.
Modelling borrowing: Networks
and Trees within Networks
Perfect Phylogenetic Network
(all characters compatible)
L. Nakhleh, D. Ringe, and T. Warnow, 
LANGUAGE, 2005
Comments
This network is very tree-like (only three contact edges needed to explain the data).
Two of the three contact edges are strongly supported by the data (many characters are borrowed).
If the third contact edge is removed, then the evolution of the remaining (two) incompatible characters needs to be explained. Probably this is parallel semantic shift.
Other IE analyses
Note: many reconstructions of IE have been done, but they produce histories that differ in significant ways.
Possible issues:
Dataset (modern vs. ancient data, errors in the cognacy judgments, lexical vs. all types of characters, screened vs. unscreened)
Translation of multi-state data to binary data
Reconstruction method
Another possible Indo-European tree
(Gray & Atkinson, 2004)
 
Italic  Gmc.  Celtic  Baltic  Slavic   Alb.  Indic  Iranian   Armenian Greek Toch.   Anatolian
    
Only lexical data, not well curated
Binary encoding!
Binary Encoding of Multi-state Characters
Imagine you have a lexical character (semantic slot) with four cognate classes.
This is replaced by four characters, one for each cognate class.
The four characters are now assumed to evolve i.i.d. under the CFN model of binary character evolution.
0 represents absence, 1 represents presence.
Problems with binary encoding:
(0,0,0,0) can occur, indicating no word for this semantic slot.
(1,1,1,1) can occur, indicating all four words for the semantic slot (i.e., the model allows unbounded polymorphism).
Problem with the CFN model applied to binary encoding:
The model assumes appearance and loss of cognates are independent of other cognates (unrealistic).
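The binary encoding and the mismatch it creates can be sketched as follows. The language names L1 to L4 and the helper names are hypothetical. Each language receives a one-hot vector, yet a model that treats the four columns as independent binary characters can generate all 16 patterns, including (0,0,0,0) and (1,1,1,1).

```python
from itertools import product

def binary_encode(character):
    """Replace one multi-state character by one presence/absence column per cognate class."""
    classes = sorted(set(character.values()))
    return {
        lang: tuple(int(state == c) for c in classes)
        for lang, state in character.items()
    }

def patterns_never_observed(k):
    """Binary patterns of length k that a monomorphic k-class character cannot produce."""
    return [p for p in product((0, 1), repeat=k) if sum(p) != 1]

# Hypothetical semantic slot with four cognate classes, one language per class.
character = {"L1": 0, "L2": 1, "L3": 2, "L4": 3}
encoded = binary_encode(character)
print(encoded)
print(len(patterns_never_observed(4)), "of 16 patterns never arise from the real data")
```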
 Possible Indo-European tree
(Ringe, Warnow and Taylor 2000)
Three types of data, highly curated
Multi-state!
Different methods/data give different answers.
We don't know which answer is correct.
Which method(s)/data should we use?
The performance of methods on an IE data set
(L. Nakhleh, T. Warnow, D. Ringe, and S.N. Evans, Transactions of the Philological Society, 2005)
Observation: Different datasets (not just different methods) can give different reconstructed phylogenies.
Objective: Explore the differences in reconstructions as a function of data (lexical alone versus lexical, morphological, and phonological), screening (to remove obviously homoplastic characters), and methods.
However, we use a better basic dataset (where cognacy judgments are more reliable).
Phylogeny reconstruction methods
Perfect Phylogenetic Networks (Ringe, Warnow, and Nakhleh)
Other network methods
Neighbor joining (distance-based method)
UPGMA (distance-based method, same as glottochronology)
Maximum parsimony (minimize number of changes)
Maximum compatibility (weighted and unweighted)
Gray and Atkinson (Bayesian estimation based upon
presence/absence of cognates, as described in Nature 2003)
IE Languages used in the study
Four IE datasets
Ringe & Taylor
The screened full dataset of 294 characters (259 lexical, 13 morphological, 22 phonological)
The unscreened full dataset of 336 characters (297 lexical, 17 morphological, 22 phonological)
The screened lexical dataset of 259 characters
The unscreened lexical dataset of 297 characters
Results: Likely Subgroups
Other than UPGMA, all methods reconstruct
the ten major subgroups
Anatolian + Tocharian (that, under the assumption that Anatolian is the first daughter, Tocharian is the second daughter)
Greco-Armenian (that Greek and Armenian are sisters)
Nothing else is consistently reconstructed.
In particular, the choice of data (lexical only, or also morphology and phonological)
has an impact on the final tree.
The choice of method also has an impact!
... differ significantly on the datasets, and from each other.
Other observations
UPGMA (i.e., the tree-building technique for glottochronology) does the worst (e.g., splits Italic and Iranian groups).
The Satem Core (Indo-Iranian plus Balto-Slavic) is not always reconstructed.
Almost all analyses put Italic, Celtic, and Germanic together:
The only exception is Weighted Maximum Compatibility on datasets that include highly weighted morphological characters.
Stochastic model of language evolution
(Warnow et al., 2006)
Model based on linguistic scholarship
Borrowing between languages
Each character evolves with its own parameters (not identically to the other characters)
When a character changes state, it usually attains a new state (so that the number of states is unbounded)
Some homoplasy is allowed (but just one known homoplastic state per character)
This is a statistically identifiable model (if not too much borrowing)
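The state-change rule described above can be sketched as a toy simulation. This is not the parameterization of Warnow et al.: the probabilities `p_change` and `p_homoplasy`, the label `"h"` for the single homoplastic state, and the tree are all assumptions made for illustration, and borrowing is left out.

```python
import itertools
import random

HOMOPLASTIC = "h"  # the one designated homoplastic state (hypothetical label)

def simulate_character(tree, p_change, p_homoplasy, rng):
    """Evolve one character down a tree given as nested tuples with string leaves.

    On each edge the state changes with probability p_change. A change yields
    the homoplastic state with probability p_homoplasy, otherwise a state that
    has never been used before (so the state space is unbounded).
    """
    fresh = itertools.count(1)  # 0 is the root state; 1, 2, 3, ... are new states
    leaves = {}

    def walk(node, state):
        if isinstance(node, str):
            leaves[node] = state
            return
        for child in node:
            child_state = state
            if rng.random() < p_change:
                if rng.random() < p_homoplasy:
                    child_state = HOMOPLASTIC
                else:
                    child_state = next(fresh)
            walk(child, child_state)

    walk(tree, 0)
    return leaves

print(simulate_character((("A", "B"), "C"), 0.3, 0.1, random.Random(0)))
```

Giving each character its own `p_change` and `p_homoplasy` would mimic the non-identical evolution across characters that the model allows.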
Barbançon et al., Diachronica 2013
Simulation study
Data (all generated under the Warnow et al. model) on phylogenetic networks (model trees with borrowing edges)
Lexical and morphological characters
Phylogenetic networks with 30 leaves and 0-3 contact edges
Moderate homoplasy:
morphology: 24% homoplastic, no borrowing
lexical: 13% homoplastic, 7% borrowing
Low homoplasy:
morphology: no borrowing, no homoplasy
lexical: 1% homoplastic, 6% borrowing
Barbançon et al., Diachronica 2013
Methods tested with respect to error in the estimated tree (compared to the “true tree”, which underlies the phylogenetic network)
Methods:
NJ and UPGMA (distance-based)
Gray and Atkinson (G&A)
Maximum Parsimony (MP), Weighted Maximum Parsimony (WMP), and Weighted Maximum Compatibility (WMC)
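The slides do not say which error measure was used. One common choice, assumed here purely for illustration, is the fraction of groupings in the true tree that are missing from the estimated tree. The sketch compares the clades of two rooted trees written as nested tuples.

```python
def clades(tree):
    """Collect the leaf sets below every internal node, excluding the full leaf set."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)
    return found

def missing_clade_rate(true_tree, estimated_tree):
    """Fraction of true clades that the estimated tree fails to recover."""
    true_clades = clades(true_tree)
    return len(true_clades - clades(estimated_tree)) / len(true_clades)

true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
print(missing_clade_rate(true_tree, estimated))
```

Here the estimate recovers {A,B,C} and {D,E} but not {A,B}, so one of the three true clades is missing.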
Simulation study – sample of results
Varying deviation from i.i.d. character evolution
Varying number of contact edges
Observations
1. Choice of data does matter (good idea to add morphological characters, and to screen well).
2. Accuracy only slightly lessened with small increases in homoplasy, borrowing, or deviation from the lexical clock. Some amount of heterotachy (deviation from i.i.d.) improves accuracy.
3. Relative performance between methods consistently shows:
Distance-based methods least accurate
Gray and Atkinson’s method middle accuracy
Parsimony and Compatibility methods most accurate
 
Markov Models in Molecular Evolution
All standard models have a finite number of states (and hence lots of homoplasy)
All sites evolve iid down the tree
No borrowing (horizontal gene transfer)
Markov Model in Linguistic Evolution
Warnow et al. model:
Infinite number of states (and restricted homoplasy)
Heterogeneity of character evolution (not iid)
Borrowing between languages
“State of the Art” for Molecular Systematics
The most favored methods are Maximum likelihood and Bayesian MCMC, under parametric models of sequence evolution that assume
iid evolution across sites
Finite number of states (4 for DNA, 20 for amino acids, 2 for traits), and so lots of homoplasy
No borrowing
Future research
We need more investigation of statistical methods based on good stochastic models, as these are now the methods of choice in biology.
This requires realistic parametric models of linguistic evolution and method development under these parametric models!
Acknowledgements
Financial Support: The David and Lucile Packard
Foundation, The National Science Foundation, The
Program for Evolutionary Dynamics at Harvard, and
The Radcliffe Institute for Advanced Studies
Collaborators: Don Ringe, Steve Evans, Luay
Nakhleh, and Francois Barbançon
Please see http://tandy.cs.illinois.edu/histling.html
Modelling issues
What are the units?
Polymorphism
Homoplasy
Non-treelike evolution
Non-i.i.d. evolution and violations of the rates-across-sites assumption (heterotachy)
Deviation from the lexical clock (is dating even really possible?)
Note:
The statistical model of Warnow, Evans, Ringe, and Nakhleh has homoplasy and reticulation, but no polymorphism; evolution is independent, but not identical, across characters.
The bag-of-words model of Nicholls and Gray (J. R. Statist. Soc. B 2008) allows for polymorphism but no reticulation, and has homoplasy in the form of parallel back-mutation; characters evolve iid.
The model of Gray and Atkinson (binary encoding) has unlimited polymorphism and homoplasy, but no reticulation; characters evolve iid.
Language Evolution
Warnow, Evans, Ringe, and Nakhleh model of language evolution (Phylogenetic Methods and the Prehistory of Languages, MacDonald Institute Press, 2006)
Three types of characters: lexical, phonological, and morphological
Limited homoplasy for morphological and phonological characters enables identifiability in the presence of unlimited character evolution heterogeneity down a tree.
Character evolution does not need to be iid (specifically, identical evolution not required) to ensure identifiability!
Controversies for IE history
Subgrouping: Other than the 10 major subgroups, what is likely
to be true? In particular, what about
Italo-Celtic
Greco-Armenian
Anatolian + Tocharian
Satem Core (Indo-Iranian and Balto-Slavic)
Location of Germanic
Other questions about IE
Where is the IE homeland?
When did Proto-IE end?
What was life like for the speakers of proto-Indo-European (PIE)?
The Anatolian hypothesis
(from wikipedia.org)
 
Date for PIE ~7000 BCE
The Kurgan Expansion
Date of PIE ~4000 BCE.
Map of Indo-European migrations from ca. 4000 to 1000 BC
according to the Kurgan model
From http://indo-european.eu/wiki
Estimating the date and homeland of the
proto-Indo-Europeans
Step 1: Estimate the phylogeny
Step 2: Reconstruct words for proto-Indo-
European (and for intermediate proto-
languages)
Step 3: Use archaeological evidence to
constrain dates and geographic locations
of the proto-languages
Questions
Is the model tree 
identifiable
?
Which estimation methods are 
statistically
consistent 
under this model?
How much data 
does the method need to
estimate the model tree correctly (with high
probability)?
Implications regarding PIE
homeland and date
Linguists have “reconstructed” words for ‘wool’, ‘horse’, ‘thill’ (harness pole), and ‘yoke’ for Proto-Indo-European, for ‘wheel’ for the ancestor of IE minus Anatolian, and for ‘axle’ for the ancestor of IE minus Anatolian and Tocharian.
Archaeological evidence (positive and negative) for these objects is used to constrain the date and location for proto-IE to be after the secondary products revolution, and somewhere with horses (wild or domesticated).
Combination of evidence supports a date for PIE within 3000-5500 BCE (some would say 3500-4500 BCE), and a location that is not Anatolia, thus ruling out the Anatolian hypothesis.
CFN Model of Binary Sequence Evolution
The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).
CFN model (Cavender, Farris, Neyman):
The model tree T is binary and has substitution probabilities p(e) on each edge e.
The state at the root is randomly drawn from {0,1} (presence/absence)
If a site (position) changes on an edge, it changes with equal probability to each
of the remaining states.
The evolutionary process is Markovian.
Note all characters evolve identically and independently.
Binary encoding of multi-state characters: one “character” for each
cognate class!
Perfect Phylogenetic Networks
Problem formulation
Input: set of languages described by
characters
Output: Network on which all characters
evolve without homoplasy, but can be
borrowed
Nakhleh, Ringe, and Warnow, 2005. Language.
DNA Sequence Evolution (Idealized)
Critique of the Gray and Atkinson model
Gray and Atkinson’s model is for binary characters (presence/absence), not for multi-state characters.
To use their method on multi-state data, they do a “binary encoding”, and so treat a single cognate class as a separate character, and all cognate classes for a single semantic slot are assumed to evolve identically and independently.
This assumption is clearly violated by how languages evolve.
Note: no rigorous biologist would perform the equivalent treatment on biological data. So this is not about linguistics vs. biology.
GA = Gray+Atkinson Bayesian MCMC method
WMC = weighted maximum compatibility
MC = maximum compatibility (identical to maximum parsimony on this dataset)
NJ = neighbor joining (distance-based method, based upon corrected distance)
UPGMA = agglomerative clustering technique used in glottochronology.

Explore the evolution of Indo-European languages through phylogeny estimation under a model of linguistic character evolution. Follow the Computational Historical Linguistics Project's collaboration that began in 1994, leading to the development of methods and studies on homoplasy-free evolution and possible Indo-European tree structures. Discover the Gray and Atkinson method utilizing lexical data to analyze linguistic evolution.

  • Phylogeny Estimation
  • Linguistic Evolution
  • Indo-European Languages
  • Computational Historical Linguistics


Presentation Transcript


  1. Phylogeny estimation under a model of linguistic character evolution Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign

  2. The Computational Historical Linguistics Project http://web.engr.illinois.edu/~warnow/histling.html Collaboration with Don Ringe began in 1994; 17 papers since then, and two NSF grants. Dataset generation by Ringe and Ann Taylor (then a postdoc with Ringe, now Senior Lecturer at York University). Method development with Luay Nakhleh (then my student, now Chair and Professor at Rice University), Steve Evans (Prof. Statistics, Berkeley). Simulation study with Francois Barbançon (then my postdoc).

  3. Indo-European languages From linguistica.tribe.net

  4. Homoplasy-free evolution When a character changes state, it changes to a new state not in the tree; i.e., there is no homoplasy (character reversal or parallel evolution) First inferred for weird innovations in phonological characters and morphological characters in the 19th century, and used to establish all the major subgroups within IE (But other characters also evolve without homoplasy) 0 0 1 0 0 0 0 1 1

  5. Possible Indo-European tree (Ringe, Warnow and Taylor 2000) Anatolian Vedic Greek Italic Iranian Celtic Tocharian Germanic Armenian Baltic Slavic Albanian

  6. Another possible Indo-European tree (Gray & Atkinson, 2004) Italic Gmc. Celtic Baltic Slavic Alb. Indic Iranian Armenian Greek Toch. Anatolian

  7. Gray and Atkinson Method Used only lexical data Binary-encoding of all multi-state character data to perform the analysis Assumes the CFN (Cavender-Farris-Neyman) model of binary sequence evolution, in which all characters evolve identically and independently down the model tree Uses MCMC to sample from a distribution (specifically used MrBayes software)

  8. This talk Introduction to molecular systematics (very fast) Linguistic data and the Ringe-Warnow analyses of the Indo-European language family Perfect Phylogenetic Networks (Nakhleh et al., Language 2005) Comparison of different phylogenetic methods on Indo-European datasets (Nakhleh et al., Transactions of the Philological Society 2005) Stochastic model of linguistic character evolution (Warnow et al., MacDonald Institute for Archaeological Research, 2006) Simulation study evaluating different phylogenetic methods (Barbançon et al., Diachronica 2013) Discussion and Future work

  9. DNA Sequence Evolution (Idealized) -3 mil yrs AAGACTT AAGACTT -2 mil yrs AAGGCCT AAGGCCT AAGGCCT AAGGCCT TGGACTT TGGACTT TGGACTT TGGACTT -1 mil yrs AGGGCAT AGGGCAT AGGGCAT TAGCCCT TAGCCCT TAGCCCT AGCACTT AGCACTT AGCACTT today AGGGCAT AGGGCAT TAGCCCA TAGCCCA TAGACTT TAGACTT AGCACAA AGCACAA AGCGCTT AGCGCTT

  10. Markov Models of DNA Sequence Evolution The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).

  11. Markov Models of DNA Sequence Evolution The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution). Simplest site evolution model (Jukes-Cantor, 1969): The model tree T is binary and has substitution probabilities p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleotides) If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.

  12. CFN Model of Binary Sequence Evolution The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution). CFN model: (CFN: Cavender, Farris, Neyman): The model tree T is binary and has substitution probabilities p(e) on each edge e. The state at the root is randomly drawn from {0,1} (presence/absence) If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. The evolutionary process is Markovian. Used for modelling trait evolution (presence/absence)

  13. Markov Models in Biology All standard models have a finite number of states (2 for traits, 4 for DNA/RNA, 20 for aminoacids), and hence lots of homoplasy All characters (sites in an alignment) evolve identically and independently (iid) down the tree

  14. Statistical Consistency and Identifiability error Data

  15. State of the Art for Molecular Systematics Many distance-methods are statistically consistent (but not UPGMA) Parsimony is not statistically consistent Maximum likelihood and Bayesian MCMC are statistically consistent and highly favored

  16. Model of Linguistic Character Evolution (Warnow, Evans, Ringe, and Nakhleh, 2006) Three types of characters: lexical, phonological, and morphological Infinite number of possible states Character evolution is not iid Borrowing can occur Limited homoplasy enables identifiability Statistically consistent methods presented

  17. Our main points Biomolecular data evolve differently from linguistic data, and linguistic models and methods should not be based upon biological models. Better (more accurate) phylogenies can be obtained by formulating models and methods based upon linguistic scholarship, and using good data. All methods, whether explicitly based upon statistical models or not, need to be carefully tested.

  18. Indo-European languages From linguistica.tribe.net

  19. Homoplasy-free evolution When a character changes state, it changes to a new state not in the tree; i.e., there is no homoplasy (character reversal or parallel evolution). First inferred for weird innovations in phonological characters and morphological characters in the 19th century, and used to establish all the major subgroups within IE. [Figure: tree with leaves labeled by character states 0 and 1]

  20. Historical Linguistic Data A character is a function that maps a set of languages, L, to a set of states. Three kinds of characters: Phonological (sound changes); Lexical (meanings based on a wordlist); Morphological (especially inflectional).

  21. Sound changes Many sound changes are natural, and should not be used for phylogenetic reconstruction. Others are bizarre, or are composed of a sequence of simple sound changes. These are useful for subgrouping purposes. Example: Grimm's Law. 1. Proto-Indo-European voiceless stops change into voiceless fricatives. 2. Proto-Indo-European voiced stops become voiceless stops. 3. Proto-Indo-European voiced aspirated stops become voiced fricatives.
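
The three parts of Grimm's Law form a chain shift, so they have to be applied simultaneously: the output of part 2 must not be fed into part 1. A minimal sketch of that idea follows. The segment inventory, the IPA symbols chosen for the fricatives, and the input strings are simplifying assumptions for illustration only; the sketch ignores labiovelars, palatals, and later developments such as Verner's Law, and the test strings are toy forms rather than attested reconstructions.

```python
import re

# One simultaneous mapping, keyed by Proto-Indo-European segment.
GRIMM = {
    # 3. voiced aspirated stops -> voiced fricatives
    "bʰ": "β", "dʰ": "ð", "gʰ": "ɣ",
    # 1. voiceless stops -> voiceless fricatives
    "p": "f", "t": "θ", "k": "x",
    # 2. voiced stops -> voiceless stops
    "b": "p", "d": "t", "g": "k",
}

# Longest keys first, so that "bʰ" is matched before plain "b".
PATTERN = re.compile(
    "|".join(sorted(map(re.escape, GRIMM), key=len, reverse=True))
)


def apply_grimm(form):
    """Rewrite every affected segment in a single pass over the string."""
    return PATTERN.sub(lambda m: GRIMM[m.group(0)], form)


for toy in ["pod", "dekm", "bʰer"]:
    print(toy, "->", apply_grimm(toy))
```

Because each segment is rewritten exactly once, a voiced stop ends up as a voiceless stop and is not shifted again to a fricative, which is what makes the combined change distinctive enough to be useful for subgrouping.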

  22. An Indo-European lexical character: 'hand'. Data: Hittite kissar; Lithuanian rankà; Old Prussian rānkan (acc.); Armenian jeṙn; Old English hand; Latvian rùoka; Greek /kʰé:r/; Old Irish lám; Gothic handus; Albanian dorë; Latin manus; Old Norse hǫnd; Tocharian B ṣar; Luvian īssaris; OHG hant; Vedic hástas; Lycian izredi (instr.); Welsh llaw; Avestan zastō; Tocharian A tsar; Oscan manim (acc.); OCS rǫka; Old Persian dasta; Umbrian manf (acc. pl.).

  23. Semantic slot for hand coded (Partitioned into cognate classes)

  24. Proto-Indo-European *pl̥h₂meh₂ 'flat hand' (cf. Homeric Greek palámē) > Proto-Celtic *lāmā 'hand' > Old Irish lám > Welsh llaw. Proto-Germanic *handuz 'hand' > Gothic handus > Runic Norse *handu (ending influenced by a different noun class) > Old Norse hǫnd > Proto-West Germanic *handu > Old English hand > Old High German hant. Proto-Italic *man- 'hand' > Latin manus (transferred into the u-stems) > Proto-Sabellian *man- > Oscan *manis > *mans, accusative manim (transferred into the i-stems) > Umbrian *man-, accusative plural manf.

  25. Lexical characters can also evolve without homoplasy For every cognate class, the nodes of the tree in that class should form a connected subset, as long as there is no undetected borrowing or parallel semantic shift. [Figure: tree with leaves labeled by cognate classes 0, 1, and 2]
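
The connectedness condition can be checked mechanically on a given tree. One standard way, sketched here under stated assumptions, uses Fitch's parsimony algorithm: a character with r states at the leaves fits a tree without homoplasy exactly when the minimum number of state changes on that tree is r - 1. The nested-tuple tree format, the function names, and the five-language example are hypothetical and chosen only for illustration; the sketch assumes a rooted binary tree and one state per language.

```python
def fitch(tree, state):
    """Fitch's algorithm on a rooted binary tree given as nested 2-tuples.

    Returns (candidate root states, minimum number of state changes).
    """
    if isinstance(tree, str):            # a leaf, named by its language
        return {state[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, state)
    right_set, right_cost = fitch(right, state)
    shared = left_set & right_set
    if shared:                           # children can agree: no new change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def is_homoplasy_free(tree, state):
    """True if every state can occupy a connected part of the tree."""
    _, changes = fitch(tree, state)
    return changes == len(set(state.values())) - 1


# Hypothetical tree (((A,B),C),(D,E)) and two made-up characters.
tree = ((("A", "B"), "C"), ("D", "E"))
connected = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 2}
scattered = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}

print(is_homoplasy_free(tree, connected))
print(is_homoplasy_free(tree, scattered))
```

The first character needs two changes for three states and is compatible with the tree; the second needs two changes for only two states, so one of its states must have arisen twice or been borrowed.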

  26. Our (RWT) Data Ringe & Taylor (2002): 259 lexical, 13 morphological, 22 phonological. These data have cognate judgments estimated by Ringe and Taylor, and vetted by other Indo-Europeanists. (Alternate encodings were tested, and mostly did not change the reconstruction.) Polymorphic characters, and characters known to evolve in parallel, were removed.

  27. Differences between different characters Lexical: most easily borrowed (most borrowings detectable), and homoplasy relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary). Phonological: can still be borrowed, but much less likely than lexical. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters are very often homoplastic. Morphological: least easily borrowed, least likely to be homoplastic.

  28. Our methods/models Ringe & Warnow, "Almost Perfect Phylogeny": most characters evolve without homoplasy under a no-common-mechanism assumption (various publications since 1995). Ringe, Warnow, & Nakhleh, "Perfect Phylogenetic Network": extends the APP model to allow for borrowing, but assumes homoplasy-free evolution for all characters (Language, 2005). Warnow, Evans, Ringe & Nakhleh, "Extended Markov model": parameterizes PPN and allows for homoplasy provided that homoplastic states can be identified from the data (McDonald Institute for Archaeological Research, 2006).

  29. First analysis: Almost Perfect Phylogeny The original dataset contained 375 characters (336 lexical, 17 morphological, and 22 phonological). We screened the dataset to eliminate characters likely to evolve homoplastically or by borrowing. On this reduced dataset (259 lexical, 13 morphological, 22 phonological), we attempted to maximize the number of compatible characters while requiring that certain of the morphological and phonological characters be compatible. (Computational problem NP-hard.)

  30. Indo-European Tree (95% of the characters compatible) Anatolian Vedic Greek Italic Iranian Celtic Tocharian Germanic Armenian Baltic Slavic Albanian

  31. Second attempt: PPN We explain the remaining incompatible characters by inferring previously undetected borrowing. We attempted to find a PPN (perfect phylogenetic network) with the smallest number of contact edges and borrowing events, and with maximal feasibility with respect to the historical record. (Computational problems NP-hard.) Our analysis produced one solution with only three contact edges that optimized each of the criteria. Two of the contact edges are well-supported.

  32. Modelling borrowing: Networks and Trees within Networks

  33. Perfect Phylogenetic Network (all characters compatible) Anatolian Vedic Greek Italic Iranian Celtic Tocharian Germanic Armenian Baltic Slavic Albanian L. Nakhleh, D. Ringe, and T. Warnow, LANGUAGE, 2005

  34. Comments This network is very tree-like (only three contact edges are needed to explain the data). Two of the three contact edges are strongly supported by the data (many characters are borrowed). If the third contact edge is removed, then the evolution of the remaining (two) incompatible characters needs to be explained. Probably this is parallel semantic shift.

  35. Other IE analyses Note: many reconstructions of IE have been done, but they produce histories that differ in significant ways. Possible issues: Dataset (modern vs. ancient data, errors in the cognacy judgments, lexical vs. all types of characters, screened vs. unscreened); Translation of multi-state data to binary data; Reconstruction method.

  36. Another possible Indo-European tree (Gray & Atkinson, 2004) Only lexical data, not well curated Binary encoding! Italic Gmc. Celtic Baltic Slavic Alb. Indic Iranian Armenian Greek Toch. Anatolian

  37. Gray and Atkinson Method Uses MCMC to sample from a distribution (specifically used MrBayes software) Assumes the CFN (Cavender-Farris-Neyman) model of binary sequence evolution, in which all characters evolve identically and independently down the model tree Used only lexical data Binary-encoding of all multi-state character data to perform the analysis

  38. Semantic slot for hand coded (Partitioned into cognate classes)

  39. Binary Encoding of Multi-state Characters Imagine you have a lexical character (semantic slot) with four cognate classes. This is replaced by four characters, one for each cognate class. The four characters are now assumed to evolve i.i.d. under the CFN model of binary character evolution. 0 represents absence, 1 represents presence. Problems with binary encoding: (0,0,0,0) can occur, indicating no word for this semantic slot; (1,1,1,1) can occur, indicating all four words for the semantic slot (i.e., the model allows unbounded polymorphism). Problem with the CFN model applied to binary encoding: the model assumes that appearance and loss of cognates are independent of other cognates (unrealistic).
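
The encoding step itself is simple to write down, which makes the dependence problem easy to see. The sketch below is illustrative only: the function name `binary_encode` is an assumption, the class labels "lam", "hand", and "man" are shorthand for the Celtic, Germanic, and Italic cognate classes of slide 24, and only six languages and three classes are used rather than the four classes the slide imagines.

```python
def binary_encode(character):
    """Replace one multi-state character by one 0/1 character per cognate class."""
    classes = sorted(set(character.values()))
    return {
        cls: {lang: int(state == cls) for lang, state in character.items()}
        for cls in classes
    }


# Cognate classes for 'hand' in six languages (labels are shorthand).
hand = {
    "Old Irish": "lam", "Welsh": "lam",
    "Gothic": "hand", "Old English": "hand",
    "Latin": "man", "Oscan": "man",
}

encoded = binary_encode(hand)
for lang in hand:
    pattern = tuple(encoded[cls][lang] for cls in sorted(encoded))
    print(f"{lang:12s} {pattern}")
```

In the encoded data every language has exactly one 1 across the new binary characters, so the columns are tightly linked. A model that evolves each column independently under CFN does not know about this constraint and assigns positive probability to the all-zero and all-one patterns.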

  40. Another possible Indo-European tree (Gray & Atkinson, 2004) Only lexical data, not well curated Binary encoding! Italic Gmc. Celtic Baltic Slavic Alb. Indic Iranian Armenian Greek Toch. Anatolian

  41. Possible Indo-European tree (Ringe, Warnow and Taylor 2000) Three types of data, highly curated Multi-state! Anatolian Vedic Greek Italic Iranian Celtic Tocharian Germanic Armenian Baltic Slavic Albanian

  42. Different methods/data give different answers. We don't know which answer is correct. Which method(s)/data should we use?

  43. The performance of methods on an IE data set (L. Nakhleh, T. Warnow, D. Ringe, and S.N. Evans, Transactions of the Philological Society, 2005) Observation: Different datasets (not just different methods) can give different reconstructed phylogenies. Objective: Explore the differences in reconstructions as a function of data (lexical alone versus lexical, morphological, and phonological), screening (to remove obviously homoplastic characters), and methods. However, we use a better basic dataset (where cognacy judgments are more reliable).

  44. Phylogeny reconstruction methods Perfect Phylogenetic Networks (Ringe, Warnow, and Nakhleh) Other network methods Neighbor joining (distance-based method) UPGMA (distance-based method, same as glottochronology) Maximum parsimony (minimize number of changes) Maximum compatibility (weighted and unweighted) Gray and Atkinson (Bayesian estimation based upon presence/absence of cognates, as described in Nature 2003)

  45. Phylogeny reconstruction methods Perfect Phylogenetic Networks (Ringe, Warnow, and Nakhleh) Other network methods Neighbor joining (distance-based method) UPGMA (distance-based method, same as glottochronology) Maximum parsimony (minimize number of changes) Maximum compatibility (weighted and unweighted) Gray and Atkinson (Bayesian estimation based upon presence/absence of cognates, as described in Nature 2003)

  46. Four IE datasets Ringe & Taylor The screened full dataset of 294 characters (259 lexical, 13 morphological, 22 phonological) The unscreened full dataset of 336 characters (297 lexical, 17 morphological, 22 phonological) The screened lexical dataset of 259 characters. The unscreened lexical dataset of 297 characters.

  47. Results: Likely Subgroups Other than UPGMA, all methods reconstruct: the ten major subgroups; Anatolian + Tocharian (that is, under the assumption that Anatolian is the first daughter, Tocharian is the second daughter); Greco-Armenian (that is, Greek and Armenian are sisters). Nothing else is consistently reconstructed. In particular, the choice of data (lexical only, or also morphological and phonological) has an impact on the final tree. The choice of method also has an impact!

  48. Other observations UPGMA (i.e., the tree-building technique for glottochronology) does the worst (e.g., it splits the Italic and Iranian groups). The Satem Core (Indo-Iranian plus Balto-Slavic) is not always reconstructed. Almost all analyses put Italic, Celtic, and Germanic together: the only exception is Weighted Maximum Compatibility on datasets that include highly weighted morphological characters.

  49. Different methods/data give different answers. We don't know which answer is correct. Which method(s)/data should we use?

  50. Stochastic model of language evolution (Warnow et al., 2006) Model based on linguistic scholarship Borrowing between languages Each character evolves with its own parameters (not identically to the other characters) When a character changes state, it usually attains a new state (so that the number of states is unbounded) Some homoplasy is allowed (but just one known homoplastic state per character) This is a statistically identifiable model (if not too much borrowing)
