Integrative Inference of Tumor Evolution from Single-Cell and Bulk Sequencing Data

B-SCITE: Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data
Presenter: Chengze Shen
Motivations
Cancer cells accumulate mutations rapidly, producing complex tumor-cell populations made up of distinct clones.
This complexity can lead to different responses to cancer therapies and may be a cause of treatment failure.
Thus, it is critical to understand the underlying evolutionary history of the cancer for effective and targeted cancer treatment.
Current Approach - Bulk Sequencing
Figure credit: Sheila-10x genomics
[Figure labels: Sample & bulk sequencing; variant allele frequencies (VAFs); clonal tree; true clonal tree]
Pros: Very accessible; provides an indirect measurement of subclonal mutation composition.
Cons: Statistically underdetermined; may lead to incorrect phylogenies.
Alternative Approach - Single-Cell Sequencing (SCS)
Figure credit: Sheila-10x genomics
[Figure labels: Sequence & profile; mutation tree]
Pros: Direct inference of phylogeny given sufficient data.
Cons: High noise -> false positive (FP) signals; early allelic dropout -> false negative (FN) signals.
What does B-SCITE want to solve?
B-SCITE aims to combine bulk sequencing and SCS data to improve the inference of tumor phylogenies, using a probabilistic approach.
Inputs: 1) heterozygous somatic mutations 2) bulk sequencing data 3) SCS data
Outputs: Mutation tree that maximizes some scoring scheme (defined later).
Procedure: search through possible trees using MCMC (discussed later).
Assumptions: Infinite sites assumption.
Tree scoring
A candidate mutation tree T is scored based on both bulk sequencing data and SCS data.
> The bulk sequencing data gives one score for T.
> The SCS data gives a second score for T, which also depends on the sequencing error profile.
> The two are combined into a joint score.
Then, we want to find the tree T and the error profile that together maximize the joint score.
Some definitions
(1) An ancestry matrix represents the ancestry state between each pair of nodes in T.
(2) There are s nodes in the tree T, and we assume we are solving for a mutation tree, where s = n (the number of mutations/SNVs).
(3) A mutation matrix D represents the presence of each mutation in each single cell.
e.g. single cell C1 has mutations a, b and d, so the entries of D for C1 mark a, b and d as present.
[Figure: example tree over mutations a, b, c and d; annotations mark where mutation a first occurs, where mutation a is still present, and where mutation b first occurs]
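The definitions above can be made concrete with a small sketch. It assumes a hypothetical example tree with a at the root, b and c below a, and d below b; only the path a -> b -> d is implied by the C1 example, and the position of c is a guess. The function names are illustrative, not taken from the paper.

```python
# Hypothetical example tree: child -> parent (None marks the root).
PARENT = {"a": None, "b": "a", "c": "a", "d": "b"}


def ancestors_inclusive(node, parent=PARENT):
    """Return the node itself plus all of its ancestors up to the root."""
    found = set()
    while node is not None:
        found.add(node)
        node = parent[node]
    return found


def ancestry_matrix(parent=PARENT):
    """A[i][j] = 1 if mutation i is an ancestor of (or equal to) mutation j."""
    nodes = sorted(parent)
    return {
        i: {j: int(i in ancestors_inclusive(j, parent)) for j in nodes}
        for i in nodes
    }


def expected_profile(attach_node, parent=PARENT):
    """Noise-free mutation profile of a cell attached at attach_node."""
    present = ancestors_inclusive(attach_node, parent)
    return {m: int(m in present) for m in sorted(parent)}


print(expected_profile("d"))  # {'a': 1, 'b': 1, 'c': 0, 'd': 1}
```

A cell attached at d inherits every mutation on the path from the root to d, which matches the statement that C1 has mutations a, b and d.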
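Tree scoring - Bulk sequencing
Assume we are given a mutation tree T with s = n (as many nodes in the tree as mutations).
Each node of T corresponds to a cell type. Each cell type i makes up some fraction of the cell population in the bulk data, and these fractions sum to 1.
We can then infer the fraction of cells carrying a given mutation in a bulk sample as the sum of the fractions of all cell types that contain that mutation, i.e. the node where the mutation first occurs and all of its descendants.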
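A minimal sketch of that computation, assuming a hypothetical four-node tree and made-up cell-type fractions (neither comes from the paper). Under the infinite sites assumption a mutation is carried by the node where it first occurs and by every descendant of that node.

```python
# Hypothetical tree (child -> parent) and cell-type fractions in one bulk sample.
PARENT = {"a": None, "b": "a", "c": "a", "d": "b"}
FRACTIONS = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # sums to 1


def mutation_fraction(mutation, parent, cell_type_fraction):
    """Sum the fractions of every cell type whose lineage includes `mutation`."""
    total = 0.0
    for node, fraction in cell_type_fraction.items():
        current = node
        while current is not None:
            if current == mutation:
                total += fraction
                break
            current = parent[current]
    return total


print(mutation_fraction("b", PARENT, FRACTIONS))  # 0.375 (cell types b and d)
```

The root mutation a is carried by every cell type, so its fraction is 1.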
Tree scoring - Bulk sequencing cont.
Assume we sequence a bulk sample and get t reads covering a mutation's locus, of which r reads support the mutation.
If the true fraction of cells in the sample that have the mutation is y, then the probability that a read supports the mutation is y/2 (the mutation is heterozygous, so only one of the two alleles carries it).
Using this and the high coverage of the bulk sample (a large number of reads t), we can model the variant read count r with a binomial distribution and approximate its log-likelihood with that of a Gaussian distribution, written in terms of z = 2r/t.
We obtain the log-likelihood of the whole bulk data for all mutations by summing the per-mutation terms; after a further approximation and several skipped steps, this gives the final bulk score.
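The Gaussian approximation can be sketched as follows. This is a derivation from the stated model, not the paper's exact formula: if r ~ Binomial(t, y/2), then z = 2r/t has mean y and variance y(2 - y)/t. The function names are illustrative, and y must lie strictly between 0 and 2 for the variance to be positive.

```python
import math


def gaussian_loglik(r, t, y):
    """Gaussian approximation to the log-density of z = 2r/t given fraction y."""
    z = 2.0 * r / t
    var = y * (2.0 - y) / t  # variance of z when r ~ Binomial(t, y/2)
    return -0.5 * math.log(2.0 * math.pi * var) - (z - y) ** 2 / (2.0 * var)


def binomial_loglik(r, t, y):
    """Exact log-probability of r variant reads out of t when p = y/2."""
    p = y / 2.0
    log_choose = math.lgamma(t + 1) - math.lgamma(r + 1) - math.lgamma(t - r + 1)
    return log_choose + r * math.log(p) + (t - r) * math.log(1.0 - p)


# The density of z and the probability of r differ by the constant log(2/t).
t, r, y = 1000, 200, 0.4
approx = gaussian_loglik(r, t, y) + math.log(2.0 / t)
exact = binomial_loglik(r, t, y)
print(abs(approx - exact) < 0.01)
```

With high coverage the two values agree closely, which is why the cheaper Gaussian form is a reasonable stand-in for the binomial.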
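Tree scoring - SCS
Let an attachment vector record, for each single cell, the node of the tree T to which that cell attaches.
For example, C1 attaches to node d (C1 contains mutations a, b and d).
Since SCS has errors, both FP and FN, we model the error profile with two parameters: the FP rate and the FN rate.
We can then express the probability of each observed entry of D given the state expected from the tree: a mutation that the tree says is absent is observed as present with probability equal to the FP rate, and a mutation that the tree says is present is observed as absent with probability equal to the FN rate.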
Tree scoring - SCS cont.
Thus, we can obtain the likelihood of the SCS data given the tree T, the error profile and the attachments as a product of the per-entry probabilities over all cells and mutations.
If we only care about the tree and the error profile, the attachments are marginalized out.
Accounting for doublets (two cells accidentally sequenced together) gives a likelihood P' that is a weighted mixture of the singlet and doublet models.
Our final score from the SCS data is the log-likelihood of P'.
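A simplified sketch of the SCS score, under several assumptions that are not from the slides: the tree is the hypothetical four-node example, `alpha` and `beta` are illustrative names for the FP and FN rates, cells attach only to mutation nodes with a uniform prior, and doublets are ignored (singlet model only).

```python
import math

PARENT = {"a": None, "b": "a", "c": "a", "d": "b"}  # hypothetical tree
MUTATIONS = sorted(PARENT)


def expected_profile(node):
    """Mutations a cell attached at `node` should carry, without noise."""
    present = set()
    while node is not None:
        present.add(node)
        node = PARENT[node]
    return {m: int(m in present) for m in MUTATIONS}


def cell_loglik(observed, node, alpha, beta):
    """log P(observed profile | cell attached at node), entries independent."""
    expected = expected_profile(node)
    total = 0.0
    for m in MUTATIONS:
        if expected[m] == 1:
            p = 1.0 - beta if observed[m] == 1 else beta    # FN with prob beta
        else:
            p = alpha if observed[m] == 1 else 1.0 - alpha  # FP with prob alpha
        total += math.log(p)
    return total


def scs_score(cells, alpha, beta):
    """Log-likelihood with attachments averaged out (uniform prior, singlets only)."""
    score = 0.0
    for observed in cells:
        logs = [cell_loglik(observed, node, alpha, beta) for node in MUTATIONS]
        top = max(logs)  # log-sum-exp for numerical stability
        score += top + math.log(sum(math.exp(x - top) for x in logs))
        score -= math.log(len(logs))
    return score


c1 = {"a": 1, "b": 1, "c": 0, "d": 1}
print(max(MUTATIONS, key=lambda node: cell_loglik(c1, node, 0.01, 0.2)))  # d
```

Averaging over attachments means the score depends only on the tree and the error rates, which is what the search over trees needs.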
Find best tree T
The authors use a Markov chain Monte Carlo (MCMC) approach to search the solution space for an optimal tree T.
At each step, we either change the tree T or change the error profile.
For a tree-move, a new tree T' is obtained by one of:
i. Pruning and reattaching a subtree
ii. Swapping two subtrees
iii. Exchanging the labels of two nodes
For an error-move, a new error profile is obtained.
We move to the new state with probability 1 if its acceptance ratio is at least 1, and otherwise with probability equal to that ratio, which compares the new state to the current one and is corrected by the proposal probabilities q.
* The authors did not really explain the proposal probabilities q or how to calculate them.
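The accept/reject step can be sketched on a toy problem. This is a generic Metropolis sampler, not the authors' implementation: it assumes a symmetric proposal so that the proposal probabilities q cancel, and it searches over integers instead of trees. All names here are illustrative.

```python
import math
import random


def metropolis_search(log_score, propose, start, steps, rng):
    """Metropolis sampler with a symmetric proposal; returns the best state seen."""
    state, state_score = start, log_score(start)
    best, best_score = state, state_score
    for _ in range(steps):
        candidate = propose(state, rng)
        candidate_score = log_score(candidate)
        # Accept with probability min(1, exp(candidate_score - state_score)).
        if math.log(1.0 - rng.random()) < candidate_score - state_score:
            state, state_score = candidate, candidate_score
            if state_score > best_score:
                best, best_score = state, state_score
    return best, best_score


def toy_log_score(x):
    """Toy stand-in for the joint tree score, maximal at x = 7."""
    return -float((x - 7) ** 2)


def toy_propose(x, rng):
    """Symmetric move: step one unit left or right."""
    return x + rng.choice((-1, 1))


best, best_score = metropolis_search(
    toy_log_score, toy_propose, 0, 2000, random.Random(0)
)
print(best)
```

In a tree search, `propose` would apply one of the tree-moves or an error-move, and `log_score` would be the joint score of the tree and error profile.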
Evaluation
Simulated dataset and real dataset (will only focus on the simulated one)
Measurements
a. Clustering accuracy (V-measure)
i. Compared to ddClone, OncoNEM
b. Phylogenetic inference accuracy (co-clustering)
i. Compared to OncoNEM, SCITE
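For reference, V-measure is the harmonic mean of homogeneity and completeness, both defined through conditional entropies of the true and predicted cluster labels. The sketch below implements that standard definition in plain Python; how the paper assigns mutations to cluster labels is not covered here, and the function names are illustrative.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(labels, given):
    """H(labels | given)."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(truth, predicted):
    """Harmonic mean of homogeneity and completeness."""
    h_truth, h_pred = _entropy(truth), _entropy(predicted)
    homogeneity = (
        1.0 if h_truth == 0 else 1.0 - _conditional_entropy(truth, predicted) / h_truth
    )
    completeness = (
        1.0 if h_pred == 0 else 1.0 - _conditional_entropy(predicted, truth) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2.0 * homogeneity * completeness / (homogeneity + completeness)


# Same grouping with swapped label names scores 1; an uninformative split scores about 0.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
```

The score does not depend on which names the clusters are given, only on how the items are grouped.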
Results – simulated data - clustering accuracy
Results – simulated data - phylogeny inference
Results – real data - triple-negative breast cancer
The branching events in the B-SCITE tree (c) are more similar to the tree from the original study (a) than those in the SCITE tree (b).
Takeaway
1. B-SCITE is a new method for tumor phylogenetic inference that utilizes both bulk sequencing and single-cell sequencing data.
2. It is shown to be more accurate and precise in both subclone inference and phylogenetic inference than the methods it was compared to.
3. Tree inference on real data shows high concordance with expert-generated trees.
4. It is robust to the presence of copy number aberrations and to violations of the infinite sites assumption (Supplementary data).
Weakness
1. No runtime/memory usage comparison.
2. Few methods compared against.
3. No convergence analysis of the MCMC.
4. Missing definitions for some variables.
5. Lack of analysis on single-cell data:
a. When very few SCS profiles are available (e.g. 5 single-cell profiles vs. 25 single-cell profiles).
b. When the error profile is different (only a fixed profile is shown in the main paper).
References
1. Malikic, S., Jahn, K., Kuipers, J. et al. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat Commun 10, 2750 (2019). https://doi.org/10.1038/s41467-019-10737-5
2. 10X genomics blog. https://www.10xgenomics.com/blog/single-cell-rna-seq-an-introductory-overview-and-tools-for-getting-started

Cancer's complex evolution introduces challenges in treatment response. B-SCITE aims to enhance tumor phylogeny inference by integrating bulk sequencing and single-cell data using a probabilistic approach. It addresses the complexity of tumor cell populations and potential treatment failure causes. The method involves scoring mutation trees based on both data types to maximize a scoring scheme. Assumptions include the infinite site assumption and the use of MCMC for tree search.

  • Tumor Evolution
  • Single-Cell Sequencing
  • Bulk Sequencing
  • Cancer Treatment
  • Phylogeny Inference

Uploaded on Oct 04, 2024 | 1 Views

