Weighted Log-rank Tests in Survival Trials

Inference in survival trials:

Weighted log-rank tests and

some alternatives

Carl-Fredrik Burman, PhD, Assoc.Prof.

Statistical Innovation, Data Science & AI

AstraZeneca R&D, Gothenburg

29 April, 2021

Thanks to Dominic Magirr

for great collaboration

undefined

Back to Basics

What’s special with Survival Analysis?

STANDARD STATISTICAL PROBLEMS

•

Example 1: Two samples, normal distributions

•

Exampe 2: Zero-one responses

SURVIVAL ANALYSIS

1.

Responses are positive

•

May use transformation (e.g. log)

2.

We don’t have time to follow all patients (censoring)

•

Missing data, with known structure

3.

Competing risks, Treatment switching …

PhD advise:

Don’t complicate

things before

understanding

the basics

Statistical

inference



 Clinical trial, comparing



New treatment, X,

versus



Standard of Care, C



 What is (really) the main efficacy objective?



To ”compare” X and C



To show a (=any) difference



To show X has an efficacy benefit



Hypothesis testing



Null hypothesis



Test statistic (pre-specified)



Null distribution

Distributional assumptions

•

Does the

 validity

of the statistical analysis require that assumptions are

correct?

•

Do we need uncertain assumptions to guarantee Type 1 Error control?

OR

•

Will assumptions only influence the power (sponsor’s risk)?

Two different bases for inference

showing treatment benefit

•

Distribution-based

•

E.g. ANCOVA, t-test has exactly correct Type 1 Error (only) if normal distributions +

same residual variance

•

Randomisation-based

•

Relies of nothing else than randomisation of patients

•

Any reasonable pre-specified test statistic; can handle covariates

•

In principle, randomisation test much more robust

•

In practise, distribution-based tests are often (but not always) good enough

to control Type 1 Error even if assumptions are not fully correct. Preferable

with approximate formula than simulated p-value?

Example dataset*

•

n=37 per arm

•

Group A: New treatment

•

Group B: Control

•

Survival time in days

•

Mice data

•

No censoring

* Loosely based on Xie et al (2017). Nature Communications 8:155.

Empiric CDFs and survival functions

Example dataset

Life times vs rank

Randomisation test (permutation test)

•

: Time does not depend on arm

•

That is, the highest observed time

could equally well have been on A

as B.

•

Same for next highest etc.

•

We can generate a null distribution

by keeping all observation fixed

but re-randomize … and repeat

e.g. one million times.

Randomisation test

•

We can generate a null distribution

by keeping all observation fixed

but re-randomize … and repeat

e.g. one million times.

p-value = 0.6%

All p-values 1-sided, as we want to show B better than A

Distribution-based inference: Z-test

•

Assumptions

•

I.i.d. Normal in each arm

•

Same known variance (taken equal to estimate)

•

p-value = 1.0%

•

P-value happens to be

higher than p

rand

=0.6%

•

Asymptotic Relative Efficiency of randomisation test vs Z-test is exactly 100% if Z-test

assumptions are OK

•

Z-test is approximately correct in many situations. Problems if few observations have

lrge impact on variance: outliers, heavy tails, …

•

(NB! All p-values are one-sided as we want to show B is better than A.)

N-distribution has

poor fit to data

T-test

•

Assumptions

•

I.i.d. Normal in each arm (Questionable assumption)

•

Same variance, estimated from data

•

=1.1%

•

Cf. p

=1.0%, p

rand

=0.6%

•

T-test reflects change in null distribution due to random estimate of SD

•

Asymptotically t-test = Z-test

Lognormal test

Original

Log of data

•

With positive data, lognormal is sometimes better than normal

•

Not in this example

•

Recall: Test should be pre-specified

•

Log scale is dangerous if some life span is close to zero

•

NB! Tests are scale-and location-invariant

Transformations

•

Log transform, x

=log(t

•

Any monotone function f gives valid test based on x

=f(t

•

Randomisation test or – if distributional assumptions are appropriate – t-test, ANCOVA

etc based on transformed observations x

Estimands

•

The estimand framework can help us think,

structure discussions

•

But there are several contradicting dimensions

•

The best test to show treatment benefit may not

directly correspond to an easily understandable,

clinically attractive estimand

•

Average Hazard Ratio may not be the most natural

estimand (especially if HR varies greatly over time).

Still, that in itself may not disqualify using a

(perhaps weighted) logrank test

•

Tradeoff: Transformations can be useful but …

Don’t make it unnecessarily complicated

Simple and meaningful

•

What is clinically (most) important?

•

What is easy?

EXAMPLES

•

Mean survival

•

Proportion cured. Proxy: 5-year survival?

•

But easy and important may not mean statistically sensitive (/efficient)

undefined

Rank tests

Dichotomisation (landmark analysis)

Original

Score=1 if surviving

2.5 years

•

General rule: Dichotomisation loose power

•

Often: Very large power loss

•

Test statistic: Proportion surviving > y years

•

Fisher’s exact test, or e.g. normal approximation to Binomial dist

Rank test: Wilcoxon-Mann-Whitney

Original

Rank scores

•

Robust test

•

Asymptotic Relative Efficiency is 3/



=95% vs Z-test if Normal assuption is true

•

p=1.0%

Normal scores

Original

Normal scores

•

Robust

•

The worst and best responses get higher importance than in Wilcoxon

•

Asymptotic Relative Efficiency is 100% vs Z-test if Normal assumption is true

undefined

Log-rank test

Logrank test

Logrank test: Relation between # at risk & earlier deaths

Logrank test: Finding the scores

Logrank scores

(translated)

Original

Logrank scores

•

De-emphasising early deaths

•

Very high emphasis on the longest survivors

•

One reason that logrank is ”never” used when there is no censoring

•

Less extreme when long life times are censored

Hazard ratios



Logrank is the best rank-preserving test IF the hazard

ratio is constant over time



It is valid even if HR(t) is varying

•

That is, Type 1 Error control is guaranteed



But the power is not guaranteed



Should we (ever) believe in proportional hazards?

•

Clinical models

•

As with the t-test etc., the logrank test will serve well in

many cases … if reality isn’t too far from assumptions

undefined

Weighted

log-rank tests

Weighted logrank tests

Can we use

any

 scores?

Are any weights OK?

Weighted logrank test & Scores

Weighted logrank scores if first 20 weights are zero

Original

WLRT scores

BAD!

Fleming-Harrington(0,1) scores

Original

FH(0,1) scores

BAD!

FH(0,1) and Max Combo

do not control



under H0: Survival is never better for New than Control

(but



 is controlled if survival is identical for all t)

In fact, Type 1 Error



100% if N

 infinity

How to defend a

test that thinks

early deaths

are better?

Lines of defense (1)



 ”We only test the null hypothesis of exactly equal survival

curves, for all times”

•

OK, so you prove that IO is not the same as chemo!??

•

Don’t you think regulators would like to see proof of any

positive effect?



”By looking at the Kaplan-Meier curves, you’ll obviously see

that the new drug is better after some time”

•

Hypothesis tests are there to avoid misunderstanding

noisy data

•

Take a look at the following example:

1-sided vs 2-sided test

•

It doesn’t matter … in the school book example

•

It makes all the difference in reality

•

A new drug should be

demonstrated to be better,

not just ”different”

from standard-of-care

Lines of defense (2)



”By looking at the Kaplan-Meier curves, you’ll obviously see

that the new drug is better after some time”

•

Hypothesis tests are there to avoid misunderstanding

noisy data

•

Take a look at the following example:

Weighted logrank may be severely misleading

Is it always easy to ”see” which drug is best? NO!

82.9% has more than

6 months improvement

in median

•

WLRT corresponds to an estimate of HR.



When Z

>1.96, always HR

est

<1

•

When Z

>1.96,

observed

median is often improved

(while

true

 median is

worse)

An adjuvant trial example

Adjuvant

setting. We try to prolong the lives

of the non-cured (but we don’t know who is

cured or not).

•

2 arms with 200 patients each

•

70% cure rate on both treatments

•

Given non-cure, life span is exponentially

distributed with

•

Expected life time being 1 year on new

treatment, 2 years on placebo. (That is,

the

treatment is killing off patients much

faster

.)

Type I error ~30% (nominal 2.5%)

With 1000 per arm, type I error ~90%

Lines of defense (3)



”Do you have a better proposal than MaxCombo?”

Longer life is always better

Original

Modestly weighted logrank test scores

Original

Modestly scores

•

Constant scores when no efficacy

•

Cf. Landmark analysis

•

Logrank scores thereafter

•

as logrank is optimal if HR are constant

•

Even better with cutoff somewhat earlier

The best we can do:

Constant scores when we

don’t think we have a benefit

Summary Z-scores for OS

Modestly test is better than logrank test in all IO examples

Not very sensitive in cutoff time for equal scores. Here 12 months.

Censoring

undefined

Conclusions

Recommendations when expecting efficacy delay

•

Consider using the ”Modestly” test

•

Clearly better power than standard logrank

•

trong a control (unlike Max Combo, many weighted logrank tests)

•

dd a variety of efficacy estimates

•

Kaplan-Meier curves (always!), directly corresponding to …

•

… Landmark rates, Quantile estimates (median not always useful)

•

(Truncated) mean survival makes clinical sense (but test is inefficient)

•

Modestly test corresponds to an HR estimate …

•

… but all average estimates of HR can be misleading

•

Estimated HR(t) should be interpreted with care

References

•

Magirr & Burman (2019). ”Modestly weighted logrank tests.”

Statistics in Medicine

38:3782-3790.

•

Gregson, Sharples, Stone, Burman, Öhrn & Pocock (2019). "Nonproportional hazards for

time-to-event outcomes in clinical trials."

J Am Coll Cardiol

74:2102.

•

Bartlett, Morris, Stensrud, Daniel, Vansteelandt & Burman (2020). “The Hazards of

Period Specific and Weighted Hazard Ratios.”

Statistics in Biopharmaceutical Research

12:518-519.

•

Magirr (2020). “Non‐proportional hazards in immuno‐oncology: Is an old perspective

needed?” Pharmaceutical Statistics (early view).

•

Magirr & Jimenez (2021). “Designing group sequential clinical trials when a delayed

effect is anticipated: A practical guidance.”

https://arxiv.org/abs/2102.05535

•

Leton, Zuluaga (2001). Equivalence between score and weighted tests for survival

curves.

Commun Stat Theory Methods

 30:591-608.

Slide Note

Embed Share

Download

Weighted log-rank tests and alternative methods in survival trials are discussed by Carl-Fredrik Burman, PhD, focusing on the basics, special considerations in survival analysis, clinical trial objectives, distributional assumptions, and different bases for inference.

koo_so Follow

Uploaded on Feb 25, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Inference in survival trials: Weighted log-rank tests and some alternatives Carl-Fredrik Burman, PhD, Assoc.Prof. Statistical Innovation, Data Science & AI AstraZeneca R&D, Gothenburg 29 April, 2021

Thanks to Dominic Magirr for great collaboration 2

1 Back to Basics

Whats special with Survival Analysis? STANDARD STATISTICAL PROBLEMS Example 1: Two samples, normal distributions Exampe 2: Zero-one responses SURVIVAL ANALYSIS 1. Responses are positive May use transformation (e.g. log) 2. We don t have time to follow all patients (censoring) Missing data, with known structure 3. Competing risks, Treatment switching 4

PhD advise: Don t complicate things before understanding the basics 5

Clinical trial, comparing New treatment, X, versus Standard of Care, C What is (really) the main efficacy objective? To compare X and C To show a (=any) difference To show X has an efficacy benefit Hypothesis testing Null hypothesis Test statistic (pre-specified) Null distribution Statistical inference 6

Distributional assumptions Does the validity of the statistical analysis require that assumptions are correct? Do we need uncertain assumptions to guarantee Type 1 Error control? OR Will assumptions only influence the power (sponsor s risk)? 7

Two different bases for inference showing treatment benefit Distribution-based E.g. ANCOVA, t-test has exactly correct Type 1 Error (only) if normal distributions + same residual variance Randomisation-based Relies of nothing else than randomisation of patients Any reasonable pre-specified test statistic; can handle covariates In principle, randomisation test much more robust In practise, distribution-based tests are often (but not always) good enough to control Type 1 Error even if assumptions are not fully correct. Preferable with approximate formula than simulated p-value? 8

Example dataset* n=37 per arm Group A: New treatment Group B: Control Survival time in days Mice data No censoring 9* Loosely based on Xie et al (2017). Nature Communications 8:155.

Empiric CDFs and survival functions 10

Example dataset 11

Life times vs rank 12

Randomisation test (permutation test) H0: Time does not depend on arm That is, the highest observed time could equally well have been on A as B. 1 ? ? ??? Test statistic ? = That is, average observed life time in group A Same for next highest etc. We can generate a null distribution by keeping all observation fixed but re-randomize and repeat e.g. one million times. Observed ????= 814.4 (whereas average in arm B was 919.0) 13

Randomisation test We can generate a null distribution by keeping all observation fixed but re-randomize and repeat e.g. one million times. Observed ????= 814.4 (whereas average in arm B was 919.0) p-value = 0.6% ????= 814.4 All p-values 1-sided, as we want to show B better than A 14

Distribution-based inference: Z-test Assumptions N-distribution has poor fit to data I.i.d. Normal in each arm Same known variance (taken equal to estimate) p-value = 1.0% P-value happens to be higher than prand=0.6% Asymptotic Relative Efficiency of randomisation test vs Z-test is exactly 100% if Z-test assumptions are OK Z-test is approximately correct in many situations. Problems if few observations have lrge impact on variance: outliers, heavy tails, (NB! All p-values are one-sided as we want to show B is better than A.) 15

T-test Assumptions I.i.d. Normal in each arm (Questionable assumption) Same variance, estimated from data pT=1.1% Cf. pZ=1.0%, prand=0.6% T-test reflects change in null distribution due to random estimate of SD Asymptotically t-test = Z-test 16

Lognormal test Original Log of data With positive data, lognormal is sometimes better than normal Not in this example Recall: Test should be pre-specified Log scale is dangerous if some life span is close to zero NB! Tests are scale-and location-invariant 17

Transformations Log transform, xi=log(ti) Any monotone function f gives valid test based on xi=f(ti) Randomisation test or if distributional assumptions are appropriate t-test, ANCOVA etc based on transformed observations xi 18

The estimand framework can help us think, structure discussions But there are several contradicting dimensions The best test to show treatment benefit may not directly correspond to an easily understandable, clinically attractive estimand Average Hazard Ratio may not be the most natural estimand (especially if HR varies greatly over time). Still, that in itself may not disqualify using a (perhaps weighted) logrank test Tradeoff: Transformations can be useful but Don t make it unnecessarily complicated Estimands 19

Simple and meaningful What is clinically (most) important? What is easy? EXAMPLES Mean survival Proportion cured. Proxy: 5-year survival? But easy and important may not mean statistically sensitive (/efficient) 20

2 Rank tests

Dichotomisation (landmark analysis) Original Score=1 if surviving 2.5 years General rule: Dichotomisation loose power Often: Very large power loss Test statistic: Proportion surviving > y years Fisher s exact test, or e.g. normal approximation to Binomial dist 22

Rank test: Wilcoxon-Mann-Whitney Original Rank scores Robust test Asymptotic Relative Efficiency is 3/ =95% vs Z-test if Normal assuption is true p=1.0% 23

Normal scores Original Normal scores Robust The worst and best responses get higher importance than in Wilcoxon Asymptotic Relative Efficiency is 100% vs Z-test if Normal assumption is true 24

3 Log-rank test

Logrank test The logrank test is a non-parametric score test (like Wilcoxon, Normal Scores, etc.) Can be formulated in different ways. The test statistic may be defined as ?? ? = ?? ??+ ?? ? Where ??= 1 if the k:th death is on Arm A, and ??= 0if it s on arm B; ?? and ??are the numbers at risk in arm A and B, respectively, just before the k: death. We start with ?1and ?1as the sample sizes per arm. Note that ?2= ?1 ?1, that is the number at risk in arm A is unchanged if 1st death on B; it is reduced by one if 1st death in arm A With this idea, the logrank test statistics can be rewritten as a sum of scores 26

Logrank test: Relation between # at risk & earlier deaths ?? ? = ?? ??+ ?? ? Where ??= 1 if the k:th death is on Arm A ?? and ??are the numbers at risk in arm A and B. Total # at risk is decreased by 1 for every death: ??+ ??= ? + ? (? 1) # at risk in Arm A is decreased by 1 for every time the death indicator is 1: ??= ? ?1 ?? 1 Inserting these two expressions, we can re-write the logrank test statistic: ? ? ?1 ?+? 1+ ?3 ? ?1 ?2 ?+? 2+ + ??+? ? ?1 ?2 ??+? 1 ? = ?1 ?+?+ ?2 ?+? (?+? 1) 27

Logrank test: Finding the scores ? ? ?1 ? + ? 1 + ?3 ? ?1 ?2 ? + ? 2 + + ??+? ? ?1 ?2 ??+? 1 ? + ? (? + ? 1) ? = ?1 + ?2 ? + ? Note that e.g. ?1 occurs in several places: Partly as a direct contribution to the 1st term, corresponding to the 1st death But also modifying the # at risk in Arm A for all future events We can collect all terms involving ?1, and all terms for ?2, ?3, etc. We get ?+? 1 ? = ???? ? + ? (? 1) ? ?=1 where the scores are ?+? 1 1 ??= 1 + ? + ? ? ?=? 28

Logrank scores (translated) Original Logrank scores De-emphasising early deaths Very high emphasis on the longest survivors One reason that logrank is never used when there is no censoring Less extreme when long life times are censored 29

Logrank is the best rank-preserving test IF the hazard ratio is constant over time It is valid even if HR(t) is varying That is, Type 1 Error control is guaranteed But the power is not guaranteed Should we (ever) believe in proportional hazards? Clinical models As with the t-test etc., the logrank test will serve well in many cases if reality isn t too far from assumptions Hazard ratios 30

4 Weighted log-rank tests

Weighted logrank tests ?? Logrank test ? = 1? ??+?? ?? Weighted logrank test ??= ?? 1? ??+?? Idea: Prespecify lower weights when drug efficacy is low (or negative) 32

Can we use any scores? 33

Are any weights OK? Could we take ??= 0 for k=1, ,K? No! We would then ignore all the first K deaths. A drug killing frail patients early could then look good! ?? ??= ?? 1? ??+ ?? 34

Weighted logrank test & Scores ?? ? = ?? ?? ??+ ?? ? ? ? + ?1 ? + ? 1 + +??+? ??+? ? + ?1+ ?2+ + ??+? 1 ? + ? (? + ? 1) ? = ?1 ?1 +?2 ?2 ? + ? We can collect all terms involving ?1, and all terms for ?2, ?3, etc. We get ?+? ?? ? = ???? ? + ? (? 1) ? ?=1 where the scores are ?+? ?? ??= ??+ ? + ? (? 1) ?=?+1 35

Weighted logrank scores if first 20 weights are zero Original WLRT scores BAD! Weighted logrank test with ??= 0 ??= 1 You get bonus points if the patient dies early Do not accept decreasing weights! if ? 20 otherwise (Alternatively, if ? ?) 36

Fleming-Harrington(0,1) scores Original FH(0,1) scores BAD! Weighted logrank test: Fleming-Harrington(0,1) ?1= 0 ?? linearly increasing 37

FH(0,1) and Max Combo do not control under H0: Survival is never better for New than Control (but is controlled if survival is identical for all t) In fact, Type 1 Error 100% if N infinity 38

How to defend a test that thinks early deaths are better? 39

Lines of defense (1) We only test the null hypothesis of exactly equal survival curves, for all times OK, so you prove that IO is not the same as chemo!?? Don t you think regulators would like to see proof of any positive effect? By looking at the Kaplan-Meier curves, you ll obviously see that the new drug is better after some time Hypothesis tests are there to avoid misunderstanding noisy data Take a look at the following example: 40

1-sided vs 2-sided test It doesn t matter in the school book example It makes all the difference in reality A new drug should be demonstrated to be better, not just different from standard-of-care 41

Lines of defense (2) By looking at the Kaplan-Meier curves, you ll obviously see that the new drug is better after some time Hypothesis tests are there to avoid misunderstanding noisy data Take a look at the following example: 42

Weighted logrank may be severely misleading TRUE OBSERVED Experimental is always worse Test Logrank FH(0,1) Z-score 1.45 2.70 p-value 14.8% 0.7%

Is it always easy to see which drug is best? NO! WLRT corresponds to an estimate of HR. When Zw>1.96, always HRest<1 When Zw>1.96, observed median is often improved (while true median is worse) 82.9% has more than 6 months improvement in median

An adjuvant trial example Adjuvant setting. We try to prolong the lives of the non-cured (but we don t know who is cured or not). 2 arms with 200 patients each 70% cure rate on both treatments Given non-cure, life span is exponentially distributed with Expected life time being 1 year on new treatment, 2 years on placebo. (That is, the treatment is killing off patients much faster.) Type I error ~30% (nominal 2.5%) With 1000 per arm, type I error ~90%

Lines of defense (3) Do you have a better proposal than MaxCombo? 46

Longer life is always better Original May transform life times Based on ??or its rank but the transformation has to be monotonely increasing 47

Modestly weighted logrank test scores Modestly scores Original The best we can do: Constant scores when we don t think we have a benefit Constant scores when no efficacy Cf. Landmark analysis Logrank scores thereafter as logrank is optimal if HR are constant Even better with cutoff somewhat earlier 48

Summary Z-scores for OS Logrank 5.11 2.35 2.64 2.82 2.81 2.94 3.72 2.89 2.49 3.77 2.20 1.14 3.07 4.35 2.13 Modestly 5.48 2.88 3.01 3.22 3.41 3.39 3.84 3.71 2.50 3.89 2.83 2.02 3.24 4.70 2.43 Modestly test is better than logrank test in all IO examples Not very sensitive in cutoff time for equal scores. Here 12 months.

Dedicated survival analysis tests include how to handle censoring when data are analysed before some patients have had events The Modestly Weighted Logrank Test handles censoring just like the unweighted logrank test Other tests may be adapted to survival settings Parametric tests may e.g. use a likelihood function of the form ? = i has event?(??) i censored?(??) Censoring 50

Weighted Log-rank Tests in Survival Trials

Download Presentation

Presentation Transcript

Related

More Related Content