Weighted Log-rank Tests in Survival Trials

 
 
Inference in survival trials:
Weighted log-rank tests and
some alternatives
 
Carl-Fredrik Burman, PhD, Assoc.Prof.
Statistical Innovation, Data Science & AI
AstraZeneca R&D, Gothenburg
 
29 April, 2021
 
 
 
2
 
 
Thanks to Dominic Magirr
for great collaboration
undefined
Back to Basics
1
 
 
 
What’s special with Survival Analysis?
 
STANDARD STATISTICAL PROBLEMS
Example 1: Two samples, normal distributions
Exampe 2: Zero-one responses
 
SURVIVAL ANALYSIS
1.
Responses are positive
May use transformation (e.g. log)
2.
We don’t have time to follow all patients (censoring)
Missing data, with known structure
3.
Competing risks, Treatment switching …
 
 
4
PhD advise:
Don’t complicate
things before
understanding
the basics
 
 
5
Statistical
inference
 
 Clinical trial, comparing
New treatment, X, 
versus
Standard of Care, C
 What is (really) the main efficacy objective?
To ”compare” X and C
To show a (=any) difference
To show X has an efficacy benefit
 
Hypothesis testing
Null hypothesis
Test statistic (pre-specified)
Null distribution
 
6
 
 
Distributional assumptions
 
Does the
 validity 
of the statistical analysis require that assumptions are
correct?
Do we need uncertain assumptions to guarantee Type 1 Error control?
OR
Will assumptions only influence the power (sponsor’s risk)?
 
 
7
 
 
Two different bases for inference
showing treatment benefit
 
Distribution-based
E.g. ANCOVA, t-test has exactly correct Type 1 Error (only) if normal distributions +
same residual variance
Randomisation-based
Relies of nothing else than randomisation of patients
Any reasonable pre-specified test statistic; can handle covariates
 
In principle, randomisation test much more robust
In practise, distribution-based tests are often (but not always) good enough
to control Type 1 Error even if assumptions are not fully correct. Preferable
with approximate formula than simulated p-value?
 
 
8
 
 
Example dataset*
 
n=37 per arm
Group A: New treatment
Group B: Control
Survival time in days
Mice data
No censoring
 
 
 
9
 
* Loosely based on Xie et al (2017). Nature Communications 8:155.
 
Empiric CDFs and survival functions
 
 
10
 
 
Example dataset
 
 
 
11
 
 
Life times vs rank
 
 
12
 
 
Randomisation test (permutation test)
 
 
13
 
H
0
: Time does not depend on arm
That is, the highest observed time
could equally well have been on A
as B.
Same for next highest etc.
We can generate a null distribution
by keeping all observation fixed
but re-randomize … and repeat
e.g. one million times.
 
Randomisation test
 
 
14
 
We can generate a null distribution
by keeping all observation fixed
but re-randomize … and repeat
e.g. one million times.
 
 
p-value = 0.6%
 
All p-values 1-sided, as we want to show B better than A
 
Distribution-based inference: Z-test
 
Assumptions
I.i.d. Normal in each arm
Same known variance (taken equal to estimate)
p-value = 1.0%
P-value happens to be 
higher than p
rand
=0.6%
Asymptotic Relative Efficiency of randomisation test vs Z-test is exactly 100% if Z-test
assumptions are OK
Z-test is approximately correct in many situations. Problems if few observations have
lrge impact on variance: outliers, heavy tails, …
(NB! All p-values are one-sided as we want to show B is better than A.)
 
 
15
 
N-distribution has
poor fit to data
 
T-test
 
Assumptions
I.i.d. Normal in each arm (Questionable assumption)
Same variance, estimated from data
p
T
=1.1%
Cf. p
Z
=1.0%, p
rand
=0.6%
T-test reflects change in null distribution due to random estimate of SD
Asymptotically t-test = Z-test
 
 
16
 
 
Lognormal test
 
 
17
 
Original
Log of data
 
With positive data, lognormal is sometimes better than normal
Not in this example
Recall: Test should be pre-specified
Log scale is dangerous if some life span is close to zero
NB! Tests are scale-and location-invariant
 
Transformations
 
Log transform, x
i
=log(t
i
)
Any monotone function f gives valid test based on x
i
=f(t
i
)
Randomisation test or – if distributional assumptions are appropriate – t-test, ANCOVA
etc based on transformed observations x
i
 
 
18
 
Estimands
 
 
The estimand framework can help us think,
structure discussions
But there are several contradicting dimensions
The best test to show treatment benefit may not
directly correspond to an easily understandable,
clinically attractive estimand
Average Hazard Ratio may not be the most natural
estimand (especially if HR varies greatly over time).
Still, that in itself may not disqualify using a
(perhaps weighted) logrank test
Tradeoff: Transformations can be useful but …
Don’t make it unnecessarily complicated
 
 
19
 
 
Simple and meaningful
 
What is clinically (most) important?
What is easy?
 
EXAMPLES
Mean survival
Proportion cured. Proxy: 5-year survival?
 
But easy and important may not mean statistically sensitive (/efficient)
 
 
20
 
undefined
Rank tests
2
 
 
 
Dichotomisation (landmark analysis)
 
 
22
 
Original
Score=1 if surviving
2.5 years
 
General rule: Dichotomisation loose power
Often: Very large power loss
Test statistic: Proportion surviving > y years
Fisher’s exact test, or e.g. normal approximation to Binomial dist
 
Rank test: Wilcoxon-Mann-Whitney
 
 
23
 
Original
Rank scores
 
Robust test
Asymptotic Relative Efficiency is 3/
=95% vs Z-test if Normal assuption is true
p=1.0%
 
Normal scores
 
 
24
 
Original
Normal scores
 
Robust
The worst and best responses get higher importance than in Wilcoxon
Asymptotic Relative Efficiency is 100% vs Z-test if Normal assumption is true
undefined
Log-rank test
3
 
 
 
Logrank test
 
 
26
 
Logrank test: Relation between # at risk & earlier deaths
 
 
27
 
Logrank test: Finding the scores
 
 
28
 
Logrank scores 
(translated)
 
 
29
 
Original
Logrank scores
 
De-emphasising early deaths
Very high emphasis on the longest survivors
One reason that logrank is ”never” used when there is no censoring
Less extreme when long life times are censored
Hazard ratios
 
Logrank is the best rank-preserving test IF the hazard
ratio is constant over time
It is valid even if HR(t) is varying
That is, Type 1 Error control is guaranteed
But the power is not guaranteed
Should we (ever) believe in proportional hazards?
Clinical models
As with the t-test etc., the logrank test will serve well in
many cases … if reality isn’t too far from assumptions
 
 
30
 
undefined
Weighted
log-rank tests
4
 
 
 
Weighted logrank tests
 
32
 
Can we use
any
 scores?
 
 
33
 
Are any weights OK?
 
34
 
Weighted logrank test & Scores
 
 
35
 
Weighted logrank scores if first 20 weights are zero
 
 
36
 
Original
WLRT scores
 
BAD!
 
Fleming-Harrington(0,1) scores
 
 
37
 
Original
FH(0,1) scores
 
BAD!
 
 
38
 
FH(0,1) and Max Combo
do not control 
under H0: Survival is never better for New than Control
(but 
 is controlled if survival is identical for all t)
 
In fact, Type 1 Error 
100% if N 
 infinity
How to defend a
test that thinks
early deaths
are better?
 
 
39
 
Lines of defense (1)
 
 ”We only test the null hypothesis of exactly equal survival
curves, for all times”
OK, so you prove that IO is not the same as chemo!??
Don’t you think regulators would like to see proof of any
positive effect?
”By looking at the Kaplan-Meier curves, you’ll obviously see
that the new drug is better after some time”
Hypothesis tests are there to avoid misunderstanding
noisy data
Take a look at the following example:
 
 
40
 
 
1-sided vs 2-sided test
 
It doesn’t matter … in the school book example
It makes all the difference in reality
 
A new drug should be 
demonstrated to be better,
 
not just ”different” 
from standard-of-care
 
41
 
 
Lines of defense (2)
 
”By looking at the Kaplan-Meier curves, you’ll obviously see
that the new drug is better after some time”
Hypothesis tests are there to avoid misunderstanding
noisy data
Take a look at the following example:
 
 
42
 
 
Weighted logrank may be severely misleading
 
 
 
T
R
U
E
 
O
B
S
E
R
V
E
D
 
E
x
p
e
r
i
m
e
n
t
a
l
 
i
s
a
l
w
a
y
s
 
w
o
r
s
e
 
Is it always easy to ”see” which drug is best? NO!
 
 
82.9% has more than
6 months improvement
in median
 
WLRT corresponds to an estimate of HR.
When Z
w
>1.96, always HR
est
<1
 
When Z
w
>1.96, 
observed
median is often improved
(while 
true
 median is
worse)
 
An adjuvant trial example
 
 
Adjuvant 
setting. We try to prolong the lives
of the non-cured (but we don’t know who is
cured or not).
 
2 arms with 200 patients each
70% cure rate on both treatments
Given non-cure, life span is exponentially
distributed with
Expected life time being 1 year on new
treatment, 2 years on placebo. (That is,
the 
treatment is killing off patients much
faster
.)
Type I error ~30% (nominal 2.5%)
 
With 1000 per arm, type I error ~90%
 
Lines of defense (3)
 
”Do you have a better proposal than MaxCombo?”
 
 
46
 
 
Longer life is always better
 
 
47
 
Original
 
Modestly weighted logrank test scores
 
 
48
 
Original
Modestly scores
 
Constant scores when no efficacy
Cf. Landmark analysis
Logrank scores thereafter
as logrank is optimal if HR are constant
Even better with cutoff somewhat earlier
 
The best we can do:
Constant scores when we
don’t think we have a benefit
 
Summary Z-scores for OS
 
 
Modestly test is better than logrank test in all IO examples
Not very sensitive in cutoff time for equal scores. Here 12 months.
Censoring
 
 
50
 
undefined
Conclusions
5
 
 
 
Recommendations when expecting efficacy delay
 
52
 
Consider using the ”Modestly” test
Clearly better power than standard logrank
S
trong a control (unlike Max Combo, many weighted logrank tests)
 
A
dd a variety of efficacy estimates
Kaplan-Meier curves (always!), directly corresponding to …
… Landmark rates, Quantile estimates (median not always useful)
(Truncated) mean survival makes clinical sense (but test is inefficient)
Modestly test corresponds to an HR estimate …
… but all average estimates of HR can be misleading
Estimated HR(t) should be interpreted with care
 
References
 
Magirr & Burman (2019). ”Modestly weighted logrank tests.” 
Statistics in Medicine
38:3782-3790.
Gregson, Sharples, Stone, Burman, Öhrn & Pocock (2019). "Nonproportional hazards for
time-to-event outcomes in clinical trials." 
J Am Coll Cardiol 
74:2102.
Bartlett, Morris, Stensrud, Daniel, Vansteelandt & Burman (2020). “The Hazards of
Period Specific and Weighted Hazard Ratios.” 
Statistics in Biopharmaceutical Research
12:518-519.
Magirr (2020). “Non‐proportional hazards in immuno‐oncology: Is an old perspective
needed?” Pharmaceutical Statistics (early view).
Magirr & Jimenez (2021). “Designing group sequential clinical trials when a delayed
effect is anticipated: A practical guidance.” 
https://arxiv.org/abs/2102.05535
Leton, Zuluaga (2001). Equivalence between score and weighted tests for survival
curves. 
Commun Stat Theory Methods
 30:591-608.
 
 
54
Slide Note
Embed
Share

Weighted log-rank tests and alternative methods in survival trials are discussed by Carl-Fredrik Burman, PhD, focusing on the basics, special considerations in survival analysis, clinical trial objectives, distributional assumptions, and different bases for inference.

  • Survival Trials
  • Log-rank Tests
  • Clinical Trials
  • Statistical Analysis

Uploaded on Feb 25, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Inference in survival trials: Weighted log-rank tests and some alternatives Carl-Fredrik Burman, PhD, Assoc.Prof. Statistical Innovation, Data Science & AI AstraZeneca R&D, Gothenburg 29 April, 2021

  2. Thanks to Dominic Magirr for great collaboration 2

  3. 1 Back to Basics

  4. Whats special with Survival Analysis? STANDARD STATISTICAL PROBLEMS Example 1: Two samples, normal distributions Exampe 2: Zero-one responses SURVIVAL ANALYSIS 1. Responses are positive May use transformation (e.g. log) 2. We don t have time to follow all patients (censoring) Missing data, with known structure 3. Competing risks, Treatment switching 4

  5. PhD advise: Don t complicate things before understanding the basics 5

  6. Clinical trial, comparing New treatment, X, versus Standard of Care, C What is (really) the main efficacy objective? To compare X and C To show a (=any) difference To show X has an efficacy benefit Hypothesis testing Null hypothesis Test statistic (pre-specified) Null distribution Statistical inference 6

  7. Distributional assumptions Does the validity of the statistical analysis require that assumptions are correct? Do we need uncertain assumptions to guarantee Type 1 Error control? OR Will assumptions only influence the power (sponsor s risk)? 7

  8. Two different bases for inference showing treatment benefit Distribution-based E.g. ANCOVA, t-test has exactly correct Type 1 Error (only) if normal distributions + same residual variance Randomisation-based Relies of nothing else than randomisation of patients Any reasonable pre-specified test statistic; can handle covariates In principle, randomisation test much more robust In practise, distribution-based tests are often (but not always) good enough to control Type 1 Error even if assumptions are not fully correct. Preferable with approximate formula than simulated p-value? 8

  9. Example dataset* n=37 per arm Group A: New treatment Group B: Control Survival time in days Mice data No censoring 9* Loosely based on Xie et al (2017). Nature Communications 8:155.

  10. Empiric CDFs and survival functions 10

  11. Example dataset 11

  12. Life times vs rank 12

  13. Randomisation test (permutation test) H0: Time does not depend on arm That is, the highest observed time could equally well have been on A as B. 1 ? ? ??? Test statistic ? = That is, average observed life time in group A Same for next highest etc. We can generate a null distribution by keeping all observation fixed but re-randomize and repeat e.g. one million times. Observed ????= 814.4 (whereas average in arm B was 919.0) 13

  14. Randomisation test We can generate a null distribution by keeping all observation fixed but re-randomize and repeat e.g. one million times. Observed ????= 814.4 (whereas average in arm B was 919.0) p-value = 0.6% ????= 814.4 All p-values 1-sided, as we want to show B better than A 14

  15. Distribution-based inference: Z-test Assumptions N-distribution has poor fit to data I.i.d. Normal in each arm Same known variance (taken equal to estimate) p-value = 1.0% P-value happens to be higher than prand=0.6% Asymptotic Relative Efficiency of randomisation test vs Z-test is exactly 100% if Z-test assumptions are OK Z-test is approximately correct in many situations. Problems if few observations have lrge impact on variance: outliers, heavy tails, (NB! All p-values are one-sided as we want to show B is better than A.) 15

  16. T-test Assumptions I.i.d. Normal in each arm (Questionable assumption) Same variance, estimated from data pT=1.1% Cf. pZ=1.0%, prand=0.6% T-test reflects change in null distribution due to random estimate of SD Asymptotically t-test = Z-test 16

  17. Lognormal test Original Log of data With positive data, lognormal is sometimes better than normal Not in this example Recall: Test should be pre-specified Log scale is dangerous if some life span is close to zero NB! Tests are scale-and location-invariant 17

  18. Transformations Log transform, xi=log(ti) Any monotone function f gives valid test based on xi=f(ti) Randomisation test or if distributional assumptions are appropriate t-test, ANCOVA etc based on transformed observations xi 18

  19. The estimand framework can help us think, structure discussions But there are several contradicting dimensions The best test to show treatment benefit may not directly correspond to an easily understandable, clinically attractive estimand Average Hazard Ratio may not be the most natural estimand (especially if HR varies greatly over time). Still, that in itself may not disqualify using a (perhaps weighted) logrank test Tradeoff: Transformations can be useful but Don t make it unnecessarily complicated Estimands 19

  20. Simple and meaningful What is clinically (most) important? What is easy? EXAMPLES Mean survival Proportion cured. Proxy: 5-year survival? But easy and important may not mean statistically sensitive (/efficient) 20

  21. 2 Rank tests

  22. Dichotomisation (landmark analysis) Original Score=1 if surviving 2.5 years General rule: Dichotomisation loose power Often: Very large power loss Test statistic: Proportion surviving > y years Fisher s exact test, or e.g. normal approximation to Binomial dist 22

  23. Rank test: Wilcoxon-Mann-Whitney Original Rank scores Robust test Asymptotic Relative Efficiency is 3/ =95% vs Z-test if Normal assuption is true p=1.0% 23

  24. Normal scores Original Normal scores Robust The worst and best responses get higher importance than in Wilcoxon Asymptotic Relative Efficiency is 100% vs Z-test if Normal assumption is true 24

  25. 3 Log-rank test

  26. Logrank test The logrank test is a non-parametric score test (like Wilcoxon, Normal Scores, etc.) Can be formulated in different ways. The test statistic may be defined as ?? ? = ?? ??+ ?? ? Where ??= 1 if the k:th death is on Arm A, and ??= 0if it s on arm B; ?? and ??are the numbers at risk in arm A and B, respectively, just before the k: death. We start with ?1and ?1as the sample sizes per arm. Note that ?2= ?1 ?1, that is the number at risk in arm A is unchanged if 1st death on B; it is reduced by one if 1st death in arm A With this idea, the logrank test statistics can be rewritten as a sum of scores 26

  27. Logrank test: Relation between # at risk & earlier deaths ?? ? = ?? ??+ ?? ? Where ??= 1 if the k:th death is on Arm A ?? and ??are the numbers at risk in arm A and B. Total # at risk is decreased by 1 for every death: ??+ ??= ? + ? (? 1) # at risk in Arm A is decreased by 1 for every time the death indicator is 1: ??= ? ?1 ?? 1 Inserting these two expressions, we can re-write the logrank test statistic: ? ? ?1 ?+? 1+ ?3 ? ?1 ?2 ?+? 2+ + ??+? ? ?1 ?2 ??+? 1 ? = ?1 ?+?+ ?2 ?+? (?+? 1) 27

  28. Logrank test: Finding the scores ? ? ?1 ? + ? 1 + ?3 ? ?1 ?2 ? + ? 2 + + ??+? ? ?1 ?2 ??+? 1 ? + ? (? + ? 1) ? = ?1 + ?2 ? + ? Note that e.g. ?1 occurs in several places: Partly as a direct contribution to the 1st term, corresponding to the 1st death But also modifying the # at risk in Arm A for all future events We can collect all terms involving ?1, and all terms for ?2, ?3, etc. We get ?+? 1 ? = ???? ? + ? (? 1) ? ?=1 where the scores are ?+? 1 1 ??= 1 + ? + ? ? ?=? 28

  29. Logrank scores (translated) Original Logrank scores De-emphasising early deaths Very high emphasis on the longest survivors One reason that logrank is never used when there is no censoring Less extreme when long life times are censored 29

  30. Logrank is the best rank-preserving test IF the hazard ratio is constant over time It is valid even if HR(t) is varying That is, Type 1 Error control is guaranteed But the power is not guaranteed Should we (ever) believe in proportional hazards? Clinical models As with the t-test etc., the logrank test will serve well in many cases if reality isn t too far from assumptions Hazard ratios 30

  31. 4 Weighted log-rank tests

  32. Weighted logrank tests ?? Logrank test ? = 1? ??+?? ?? Weighted logrank test ??= ?? 1? ??+?? Idea: Prespecify lower weights when drug efficacy is low (or negative) 32

  33. Can we use any scores? 33

  34. Are any weights OK? Could we take ??= 0 for k=1, ,K? No! We would then ignore all the first K deaths. A drug killing frail patients early could then look good! ?? ??= ?? 1? ??+ ?? 34

  35. Weighted logrank test & Scores ?? ? = ?? ?? ??+ ?? ? ? ? + ?1 ? + ? 1 + +??+? ??+? ? + ?1+ ?2+ + ??+? 1 ? + ? (? + ? 1) ? = ?1 ?1 +?2 ?2 ? + ? We can collect all terms involving ?1, and all terms for ?2, ?3, etc. We get ?+? ?? ? = ???? ? + ? (? 1) ? ?=1 where the scores are ?+? ?? ??= ??+ ? + ? (? 1) ?=?+1 35

  36. Weighted logrank scores if first 20 weights are zero Original WLRT scores BAD! Weighted logrank test with ??= 0 ??= 1 You get bonus points if the patient dies early Do not accept decreasing weights! if ? 20 otherwise (Alternatively, if ? ?) 36

  37. Fleming-Harrington(0,1) scores Original FH(0,1) scores BAD! Weighted logrank test: Fleming-Harrington(0,1) ?1= 0 ?? linearly increasing 37

  38. FH(0,1) and Max Combo do not control under H0: Survival is never better for New than Control (but is controlled if survival is identical for all t) In fact, Type 1 Error 100% if N infinity 38

  39. How to defend a test that thinks early deaths are better? 39

  40. Lines of defense (1) We only test the null hypothesis of exactly equal survival curves, for all times OK, so you prove that IO is not the same as chemo!?? Don t you think regulators would like to see proof of any positive effect? By looking at the Kaplan-Meier curves, you ll obviously see that the new drug is better after some time Hypothesis tests are there to avoid misunderstanding noisy data Take a look at the following example: 40

  41. 1-sided vs 2-sided test It doesn t matter in the school book example It makes all the difference in reality A new drug should be demonstrated to be better, not just different from standard-of-care 41

  42. Lines of defense (2) By looking at the Kaplan-Meier curves, you ll obviously see that the new drug is better after some time Hypothesis tests are there to avoid misunderstanding noisy data Take a look at the following example: 42

  43. Weighted logrank may be severely misleading TRUE OBSERVED Experimental is always worse Test Logrank FH(0,1) Z-score 1.45 2.70 p-value 14.8% 0.7%

  44. Is it always easy to see which drug is best? NO! WLRT corresponds to an estimate of HR. When Zw>1.96, always HRest<1 When Zw>1.96, observed median is often improved (while true median is worse) 82.9% has more than 6 months improvement in median

  45. An adjuvant trial example Adjuvant setting. We try to prolong the lives of the non-cured (but we don t know who is cured or not). 2 arms with 200 patients each 70% cure rate on both treatments Given non-cure, life span is exponentially distributed with Expected life time being 1 year on new treatment, 2 years on placebo. (That is, the treatment is killing off patients much faster.) Type I error ~30% (nominal 2.5%) With 1000 per arm, type I error ~90%

  46. Lines of defense (3) Do you have a better proposal than MaxCombo? 46

  47. Longer life is always better Original May transform life times Based on ??or its rank but the transformation has to be monotonely increasing 47

  48. Modestly weighted logrank test scores Modestly scores Original The best we can do: Constant scores when we don t think we have a benefit Constant scores when no efficacy Cf. Landmark analysis Logrank scores thereafter as logrank is optimal if HR are constant Even better with cutoff somewhat earlier 48

  49. Summary Z-scores for OS Logrank 5.11 2.35 2.64 2.82 2.81 2.94 3.72 2.89 2.49 3.77 2.20 1.14 3.07 4.35 2.13 Modestly 5.48 2.88 3.01 3.22 3.41 3.39 3.84 3.71 2.50 3.89 2.83 2.02 3.24 4.70 2.43 Modestly test is better than logrank test in all IO examples Not very sensitive in cutoff time for equal scores. Here 12 months.

  50. Dedicated survival analysis tests include how to handle censoring when data are analysed before some patients have had events The Modestly Weighted Logrank Test handles censoring just like the unweighted logrank test Other tests may be adapted to survival settings Parametric tests may e.g. use a likelihood function of the form ? = i has event?(??) i censored?(??) Censoring 50

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#