Understanding Nonparametric Statistics in R Short Course

This presentation (R Short Course Part 2) explores the application of nonparametric statistics in R, covering inference for a binomial proportion, inference for a median, and various tests for independent and paired data. It walks through hypothesis testing, confidence intervals, and real-world examples such as studying drug response rates, using tools like the binom package in R for calculation and interpretation.


Presentation Transcript


  1. R Short Course Part 2 Topic 4: Nonparametric statistics Kai Ding, PhD Department of Biostatistics and Epidemiology Hudson College of Public Health, OUHSC March 10, 2021

  2. Outline
     Inference for a binomial proportion
     Inference for a median
     Wilcoxon signed-rank test for paired data
     Wilcoxon rank-sum test for two independent samples
     Kruskal-Wallis test for k (k > 2) independent samples
     Dwass, Steel, Critchlow, and Fligner multiple comparison procedure
     Spearman correlation coefficient
     Kaplan-Meier curve and log-rank test for right-censored data

  3. Inference for a binomial proportion (p)
     Suppose we observe the outcomes of n independent Bernoulli trials (success/failure).
     Probability of success in each trial is p
     X: number of successes out of the n trials
     Hypothesis testing: H0: p = p0 vs. Ha: p ≠ p0, where p0 is a pre-specified value
     Confidence intervals: Wilson CI, Laplace CI, Wald CI (labeled "asymptotic" in R), Agresti-Coull CI, Clopper-Pearson CI (labeled "exact" in R), etc.
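To make the interval labels concrete, two of the intervals reported later by binom.confint() can be reproduced by hand from their standard formulas. This is a minimal sketch, not part of the original slides, using the x = 6, n = 7 example that follows:

     # Sketch: Clopper-Pearson ("exact") and Wilson intervals computed directly
     x <- 6; n <- 7; alpha <- 0.05; z <- qnorm(1 - alpha/2)

     # Clopper-Pearson interval from beta quantiles
     c(qbeta(alpha/2, x, n - x + 1), qbeta(1 - alpha/2, x + 1, n - x))
     # approx. 0.421 to 0.996, matching the "exact" interval on slide 5

     # Wilson (score) interval
     phat <- x/n
     centre <- (phat + z^2/(2*n)) / (1 + z^2/n)
     half   <- z * sqrt(phat*(1 - phat)/n + z^2/(4*n^2)) / (1 + z^2/n)
     c(centre - half, centre + half)
     # approx. 0.487 to 0.974, matching the "wilson" row on slide 5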

  4. Example Drug developers wish to study the response rate p of a new drug. This drug was tested on 7 patients and 6 patients responded. Suppose there is a standard drug currently being used with a response rate of 15%. Is the new drug different in terms of response rate? 4

  5. Inference for a binomial proportion (p)
> # Two-sided p-value
> binom.test(6, 7, 0.15, alternative="two.sided")

        Exact binomial test

data:  6 and 7
number of successes = 6, number of trials = 7, p-value = 6.948e-05
alternative hypothesis: true probability of success is not equal to 0.15
95 percent confidence interval:
 0.4212768 0.9963897
sample estimates:
probability of success 
             0.8571429 

> # Install and load package "binom"
> library(binom)
> # Compute point and interval estimates
> binom.confint(x=6, n=7, conf.level=0.95, methods='all')
          method x n      mean     lower     upper
1  agresti-coull 6 7 0.8571429 0.4665107 0.9946819
2     asymptotic 6 7 0.8571429 0.5979180 1.1163677
3          bayes 6 7 0.8125000 0.5590884 0.9992584
4        cloglog 6 7 0.8571429 0.3340539 0.9785611
5          exact 6 7 0.8571429 0.4212768 0.9963897
6          logit 6 7 0.8571429 0.4193984 0.9803294
7         probit 6 7 0.8571429 0.4676230 0.9866675
8        profile 6 7 0.8571429 0.5060753 0.9864386
9            lrt 6 7 0.8571429 0.5061163 0.9912466
10     prop.test 6 7 0.8571429 0.4200783 0.9924972
11        wilson 6 7 0.8571429 0.4868722 0.9743204

  6. Inference for a median
     Hypothesis testing for the population median θ0.5
     H0: θ0.5 = θ0 vs. Ha: θ0.5 ≠ θ0, where θ0 is a specified value
     Analogous to one-sample t-test
     Translate to the problem of testing for a binomial proportion
     Let p be the probability that each Xi > θ0
     Equivalently, the hypotheses can be written as: H0: p = 0.5 vs. Ha: p ≠ 0.5
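As a small illustration of this translation (with hypothetical numbers, not the sodium data used on the following slides): count how many observations exceed the hypothesized median θ0 and hand that count to binom.test() with p = 0.5.

     # Minimal sketch with a hypothetical sample; theta0 is the hypothesized median
     x <- c(68, 72, 77, 79, 81, 83)
     theta0 <- 75
     binom.test(sum(x > theta0), length(x), p = 0.5, alternative = "two.sided")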

  7. Example
     Suppose a certain food product is advertised to contain 75 mg of sodium per serving, but preliminary studies indicate that servings may contain more than that amount.
     Suppose one wishes to test hypotheses about the median as follows: H0: θ0.5 = 75 vs. Ha: θ0.5 > 75
     Equivalently, H0: p = 0.5 vs. Ha: p > 0.5

  8. Inference for a median
> # Inference for a median
> sodium=c(72.1, 72.8, 72.9, 73.3, 73.3, 73.3, 73.9, 74.0, 74.2, 74.2,
+          74.3, 74.6, 74.7, 75.0, 75.1, 75.1, 75.2, 75.3, 75.3, 75.3,
+          75.4, 76.1, 76.5, 76.5, 76.6, 76.9, 77.1, 77.2, 77.4, 77.4,
+          77.7, 78.0, 78.3, 78.6, 78.8, 78.9, 79.7, 80.3, 80.5, 81.0)
> sum(sodium>75)   # total number of obs. that are > 75
[1] 26
> # Test for Ha: median > 75 is equivalent to test for Ha: p > 0.5
> binom.test(sum(sodium>75), 40, 0.5, alternative="greater")

        Exact binomial test

data:  sum(sodium > 75) and 40
number of successes = 26, number of trials = 40, p-value = 0.04035
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.5080545 1.0000000
sample estimates:
probability of success 
                  0.65 

  9. Wilcoxon signed-rank test for paired data
     Paired data arise from pre- and post-treatment measures, cross-over designs with 2 treatments, twin studies, matched-pair designs, etc.
     Inference is based on the differences (Di) between the observations within each pair: rank |Di|, i = 1, 2, ..., n, then attach the signs of the differences to these ranks to obtain signed ranks (see the sketch below)
     Analogous to paired t-test
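The construction can be checked by hand. A minimal sketch, using the twin data that appears on the next slides, showing that the sum of the positive signed ranks reproduces the V statistic printed by wilcox.test():

     # Signed ranks computed directly (assumes no zero differences; wilcox.test drops them)
     drug1 <- c(74,55,61,41,53,74,52,31,50,58,54,53,69,60,61,54,57)
     drug2 <- c(63,58,49,47,50,69,67,40,44,38,56,38,47,41,46,47,44)
     D  <- drug1 - drug2
     sr <- rank(abs(D)) * sign(D)   # signed ranks
     sum(sr[sr > 0])                # 123, the V statistic reported on the next slides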

  10. Example Twin study: One of the twins received drug 1 and the other received drug 2. Treatment assignment was random. Outcome is the reduction in cholesterol from baseline. Is there a difference in the mean outcome between drug 1 and drug 2?

  11. Wilcoxon signed-rank test for paired data
> # Input raw data
> drug1=c(74,55,61,41,53,74,52,31,50,58,54,53,69,60,61,54,57)
> drug2=c(63,58,49,47,50,69,67,40,44,38,56,38,47,41,46,47,44)
> # Wilcoxon signed-rank test (two-sided)
> wilcox.test(drug1, drug2, paired=TRUE, alternative="two.sided")

        Wilcoxon signed rank test with continuity correction

data:  drug1 and drug2
V = 123, p-value = 0.0293
alternative hypothesis: true location shift is not equal to 0

  12. Wilcoxon rank-sum test for two independent samples
     Assumption: the X-sample (X1, ..., Xm) and the Y-sample (Y1, ..., Yn) are independent
     Hypothesis testing: H0: the X variable and the Y variable have the same distribution vs. Ha: Y tends to be larger (or smaller) than X
     Wilcoxon rank-sum statistic: combine the m + n observations into one group, rank them from smallest to largest; W is the sum of the ranks assigned to the Y-values (see the sketch below)
     Equivalent to Mann-Whitney U test
     Analogous to two-sample t-test
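A minimal sketch of this construction, using the sleep data from the next slides. Note that the W printed by R's wilcox.test() is the Mann-Whitney U for its first argument (that sample's rank sum minus m(m+1)/2), not the raw rank sum defined above:

     x <- c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2.0)   # drug 1 (X), m = 10
     y <- c(1.9,0.8,1.1,0.1,-0.1,4.4,5.5,1.6,4.6,5.4)      # drug 2 (Y), n = 10
     r <- rank(c(x, y))                                    # joint ranks of the combined sample
     sum(r[seq_along(y) + length(x)])                      # rank sum of the Y-values: 131
     sum(r[seq_along(x)]) - length(x)*(length(x) + 1)/2    # 24, the "W" reported by wilcox.test(drug1, drug2)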

  13. Example
     Compare the effects of two soporific drugs
     Each subject receives a placebo and then is randomly assigned to receive Drug 1 (X) or Drug 2 (Y)
     Outcome: number of hours of increased sleep over placebo
     Study question: which drug is more effective at increasing sleep?

  14. Wilcoxon rank-sum test for two independent samples
> # Input raw data
> drug1=c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2.0)
> drug2=c(1.9,0.8,1.1,0.1,-0.1,4.4,5.5,1.6,4.6,5.4)
> # Wilcoxon rank-sum test (two-sided)
> wilcox.test(drug1, drug2, paired=FALSE, alternative="two.sided")

        Wilcoxon rank sum test with continuity correction

data:  drug1 and drug2
W = 24, p-value = 0.05372
alternative hypothesis: true location shift is not equal to 0

  15. Kruskal-Wallis test for k (k > 2) independent samples
     Assumption: samples from treatment 1, ..., treatment k are mutually independent
     Hypotheses: H0: the k population medians are equal vs. Ha: the k population medians are not all equal
     Like the Wilcoxon rank-sum test, the Kruskal-Wallis test is based on ranks: each observation (subject j in treatment i) is ranked within the COMBINED sample (see the sketch below)
     Analogous to ANOVA
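To see what "based on ranks" means here, the Kruskal-Wallis statistic can be assembled by hand from the joint ranks. This is a minimal sketch, using the gizzard shad data from the next slides and including the tie correction that kruskal.test() applies:

     y <- c(46,28,46,37,32,41,42,45,38,44,    # site I
            42,60,32,42,45,58,27,51,42,52,    # site II
            38,33,26,25,28,28,26,27,27,27,    # site III
            31,30,27,29,30,25,25,24,27,30)    # site IV
     g <- factor(rep(1:4, each = 10))
     N <- length(y)
     r <- rank(y)                             # joint ranks of the combined sample
     H <- 12/(N*(N + 1)) * sum(tapply(r, g, sum)^2 / tapply(r, g, length)) - 3*(N + 1)
     ties <- table(y)
     H / (1 - sum(ties^3 - ties)/(N^3 - N))   # should match the chi-squared from kruskal.test()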

  16. Example Data on length (mm) of YOY Gizzard Shad were collected at four different sites in Kokosing lake (Ohio) Question: Is there any difference between the median YOY lengths at the four sites? If so, which particular sites differ in median YOY length? 16

  17. Kruskal-Wallis test for k (k > 2) independent samples
> # Input data as a list
> gizzards<-list(site.I  =c(46,28,46,37,32,41,42,45,38,44),
+                site.II =c(42,60,32,42,45,58,27,51,42,52),
+                site.III=c(38,33,26,25,28,28,26,27,27,27),
+                site.IV =c(31,30,27,29,30,25,25,24,27,30))
> kruskal.test(gizzards)

        Kruskal-Wallis rank sum test

data:  gizzards
Kruskal-Wallis chi-squared = 22.852, df = 3, p-value = 4.335e-05

  18. Dwass, Steel, Critchlow, and Fligner multiple comparison procedure
     Applied after rejection of H0 by the Kruskal-Wallis test
     Performs all pairwise comparisons while controlling the experiment-wise error rate
     Based on the Wilcoxon rank-sum statistic for the pair of treatments i and j under comparison

  19. Dwass, Steel, Critchlow, and Fligner multiple comparison procedure
> # Install and load the PMCMRplus package
> library(PMCMRplus)
> ## All pairwise comparisons, Dwass, Steel, Critchlow-Fligner procedure, large-sample approximation
> dscfAllPairsTest(gizzards)

        Pairwise comparisons using Dwass-Steele-Critchlow-Fligner all-pairs test

data: gizzards

  1      2      3     
2 0.6411 -      -     
3 0.0053 0.0067 -     
4 0.0036 0.0046 1.0000

P value adjustment method: single-step

  20. Spearman correlation coefficient
     Data: n bivariate observations (X1, Y1), ..., (Xn, Yn), a random sample from a continuous/ordinal bivariate population
     Let Ri be the rank of Xi in the joint ranking of X1, ..., Xn; let Si be the rank of Yi in the joint ranking of Y1, ..., Yn
     The Spearman rank correlation coefficient is simply the Pearson correlation coefficient of the Ri's and Si's (see the sketch below)
     Analogous to the Pearson correlation coefficient
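A minimal sketch of this definition, using the liver data from the next slides: correlating the ranks with Pearson's formula gives the same value that cor.test() reports for rho.

     collagen <- c(7.1,7.1,7.2,8.3,9.4,10.5,11.4)
     proline  <- c(2.8,2.9,2.8,2.6,3.5,4.6,5.0)
     cor(rank(collagen), rank(proline))             # Pearson correlation of the ranks: 0.7
     cor(collagen, proline, method = "spearman")    # same value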

  21. Example Researchers studied the relation between the free pool of proline and collagen content in human liver cirrhosis. Data were collected from seven patients with a histological diagnosis of portal cirrhosis. Conduct a test of whether total collagen and free proline levels are positively related, based on Spearman correlation coefficient. 21

  22. Spearman correlation coefficient
> # Input data
> collagen=c(7.1,7.1,7.2,8.3,9.4,10.5,11.4)
> proline=c(2.8,2.9,2.8,2.6,3.5,4.6,5.0)
> x=data.frame(collagen,proline)
> # Compute and test Spearman's correlation coefficient
> cor.test(x$collagen, x$proline, method = "spearman")

        Spearman's rank correlation rho

data:  x$collagen and x$proline
S = 16.8, p-value = 0.07992
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho 
0.7 

  23. Kaplan-Meier curve and log-rank test for right-censored data
     Time to the occurrence of a given event (e.g. death), measured from a well-defined starting point (time origin), e.g. time from study enrollment to death in a cancer clinical trial
     The full time to death is not observed for some subjects: withdrawal, loss to follow-up, alive at the end of the study
     For such subjects, all we know is that the survival time is larger than the censoring time (e.g. time from study enrollment to dropout)
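Before the analysis slides, it may help to see how right-censored observations are encoded in R. A minimal sketch with hypothetical times, using the survival package's Surv() object:

     library(survival)
     time  <- c(6, 6, 9, 10, 13)    # hypothetical follow-up times (weeks)
     delta <- c(1, 0, 0, 1, 1)      # 1 = event observed, 0 = right-censored
     Surv(time, delta)              # prints 6  6+  9+ 10  13 ; "+" marks censored times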

  24. Example
     Question: do leukemia patients treated with 6-mercaptopurine (6-MP) have a longer remission time than those in the control group?
     Sample: 42 leukemia patients, randomized 1:1
     Survival outcome: remission time (weeks)
     Event: relapse

  25. Dataset
> # Import data
> x=read.table("R://Teaching/Training materials/R course/non-parametric/6-mp.dat", header=TRUE)
> # Show first 10 observations
> x[1:10,]
   group time delta
1      1    6     0
2      1    6     1
3      1    6     1
4      1    6     1
5      1    7     1
6      1    9     0
7      1   10     0
8      1   10     1
9      1   11     0
10     1   13     1

Group: 0=control, 1=drug
Time: remission time (weeks)
Delta: 0=censored, 1=relapse (event observed)

  26. Kaplan-Meier estimate
> # Install and load the survival package
> library(survival)
> # Compute Kaplan-Meier estimate and confidence intervals
> surv=survfit(Surv(time, delta) ~ group, conf.type = "log-log")
> summary(surv)
Call: survfit(formula = Surv(time, delta) ~ group, conf.type = "log-log")

                group=0 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     21       2   0.9048  0.0641      0.67005        0.975
    2     19       2   0.8095  0.0857      0.56891        0.924
    3     17       1   0.7619  0.0929      0.51939        0.893
    4     16       2   0.6667  0.1029      0.42535        0.825
    5     14       2   0.5714  0.1080      0.33798        0.749
    8     12       4   0.3810  0.1060      0.18307        0.578
   11      8       2   0.2857  0.0986      0.11656        0.482
   12      6       2   0.1905  0.0857      0.05948        0.377
   15      4       1   0.1429  0.0764      0.03566        0.321
   17      3       1   0.0952  0.0641      0.01626        0.261
   22      2       1   0.0476  0.0465      0.00332        0.197
   23      1       1   0.0000     NaN           NA           NA

                group=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    6     21       3    0.857  0.0764        0.620        0.952
    7     17       1    0.807  0.0869        0.563        0.923
   10     15       1    0.753  0.0963        0.503        0.889
   13     12       1    0.690  0.1068        0.432        0.849
   16     11       1    0.627  0.1141        0.368        0.805
   22      7       1    0.538  0.1282        0.268        0.747
   23      6       1    0.448  0.1346        0.188        0.680

  27. Kaplan-Meier curves
> # Kaplan-Meier curves
> plot(surv, lty = 1:2, xlab = "Time of remission (weeks)", ylab = "Survival probability")
> # plot(surv, conf.int = T, lty = 1:2, xlab = "Time of remission (weeks)", ylab = "Survival probability")
> legend('topright', c("control", "6-MP"), lty = 1:2, lwd = 2)

  28. Median survival time estimate and confidence interval
> # Median survival time and confidence interval
> survfit(Surv(time, delta) ~ group, conf.type = "log-log")
Call: survfit(formula = Surv(time, delta) ~ group, conf.type = "log-log")

         n events median 0.95LCL 0.95UCL
group=0 21     21      8       4      11
group=1 21      9     23      13      NA

  29. Log-rank test
> # Log-rank test for differences between the two groups
> survdiff(Surv(time, delta) ~ group)
Call:
survdiff(formula = Surv(time, delta) ~ group)

         N Observed Expected (O-E)^2/E (O-E)^2/V
group=0 21       21     10.7      9.77      16.8
group=1 21        9     19.3      5.46      16.8

 Chisq= 16.8  on 1 degrees of freedom, p= 4e-05 

  30. Other nonparametric methods
     Contingency tables (e.g. chi-square test, Fisher's exact test (sketched below), Cochran-Mantel-Haenszel test, McNemar's test)
     Bootstrap
     Smoothing (e.g. kernel, spline)
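As a pointer for the first of these, a minimal sketch of Fisher's exact test on a small hypothetical 2x2 table (the counts are made up for illustration and do not come from the slides):

     tab <- matrix(c(8, 2, 3, 7), nrow = 2,
                   dimnames = list(drug = c("new", "standard"),
                                   response = c("yes", "no")))
     fisher.test(tab)   # exact test of independence for the 2x2 table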

  31. References
     Hollander M, Wolfe DA, Chicken E (2014). Nonparametric Statistical Methods, 3rd edition. Wiley. ISBN-13: 9780470387375.
     Higgins JJ (2003). Introduction to Modern Nonparametric Statistics, 1st edition. Duxbury Press. ISBN-13: 9780534387754.
     Collett D (2015). Modelling Survival Data in Medical Research, 3rd edition. ISBN-13: 9781439856789.
