Introduction to Biostatistical Analysis Using R — Statistics course for first-year PhD students. Lecturer: Lorenzo Marini, DAFNAE, University of Padova.



Presentation transcript:

1 Introduction to Biostatistical Analysis Using R. Statistics course for first-year PhD students. Lecturer: Lorenzo Marini, DAFNAE, University of Padova, Viale dell'Università 16, Legnaro, Padova. Session 2 Lecture: Introduction to statistical hypothesis testing. Null and alternative hypothesis. Types of error. Two-sample hypotheses. Correlation. Analysis of frequency data. Introduction to statistical modeling.

2 Inference. A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing just answers the question: "how well do the findings fit the possibility that chance factors alone might be responsible?" [Diagram: Population → sampling → Sample → Estimation (uncertainty!) → Statistical Model → testing]

3 Statistical testing in five steps: 1. Construct a null hypothesis (H0) and an alternative hypothesis. 2. Choose a statistical analysis (check its assumptions!). 3. Collect the data (sampling). 4. Calculate the test statistic and the P-value. 5. Reject H0 if P is small, retain it if P is large. Key concepts from Session 1: replication vs. pseudoreplication — 1. spatial dependence (e.g. spatial autocorrelation); 2. temporal dependence (e.g. repeated measures); 3. biological dependence (e.g. siblings). [Figure: key quantities — mean and residual yi for a sample of n = 6 points in the x–y plane; remember the order!]

4 1. Constructing and testing a hypothesis. Hypothesis: a statement about events in the real world that lends itself to being confirmed or refuted by experimentally observed data. Example: male and female students get the same grades.

5 1. Constructing and testing a hypothesis. Null hypothesis (H0): a statement about the population that is assumed to be true until there is clear evidence to the contrary (status quo, absence of an effect, etc.). Alternative hypothesis (Ha): a statement about the population that contradicts the null hypothesis and is accepted only if there is clear evidence in its favour.

6 1. Constructing and testing a hypothesis. A hypothesis test consists of a decision between H0 and Ha: 1. Reject H0 (and therefore accept Ha). 2. Accept H0 (and therefore reject Ha).

7 1. Constructing and testing a hypothesis. 1. Reject H0. 2. Accept H0. Inferential statistics lets us quantify probabilities in order to decide whether to accept or reject the null hypothesis: how credible is H0?

8 Significance level (alpha). A probability (alpha) for rejecting the null hypothesis must be defined a priori. The significance level of a test is the probability of rejecting H0 when it is actually true (how confident are we in our conclusions?). The smaller alpha is, the greater the certainty with which we reject the null hypothesis. Usual values are 10%, 5%, 1% and 0.1%.

9 Hypothesis testing. 1 – Hypothesis formulation (null hypothesis H0 vs. alternative hypothesis H1). 2 – Compute the probability P: the P-value is the probability of seeing results at least as extreme as those actually observed if the null hypothesis were true. 3 – If this probability is lower than a defined threshold (level of significance: 0.01, 0.05), we can reject the null hypothesis.

10 Hypothesis testing: types of error.

                      STATISTICAL DECISION
REALITY               Reject H0                           Retain H0
Effect exists         Correct: effect detected (POWER)    Type 2 error (beta): effect not detected
No effect             Type 1 error (alpha): effect        Correct: no effect detected,
                      "detected", none exists (P-value)   none exists

As power increases, the chance of a Type II error decreases. Statistical power depends on: the statistical significance criterion used in the test; the size of the difference or the strength of the similarity (effect size); the variability of the population; the sample size; the type of test.

11 Statistical analyses. Mean comparisons for 2 populations: test the difference between the means of two samples. Correlation: in probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables; in general statistical usage, correlation refers to the departure of two variables from independence. Analysis of count or proportion data: whole (integer) numbers (not continuous, with different distributional properties) or proportions.

12 Mean comparisons for 2 samples: the t test. H0: the means do not differ. H1: the means differ. Assumptions: independence of cases (work with true replications!) — this is a requirement of the design; normality — the distributions in each of the groups are normal; homogeneity of variances — the variance of the data in the groups should be the same (use Fisher's test or Fligner's test for homogeneity of variances). Together these form the common assumption that the errors are independently, identically, and normally distributed.

13 Normality. Before we can carry out a test that assumes normality we need to test our distribution (though not always beforehand: in many cases, e.g. regression or multifactorial ANOVA, we must check this assumption after having fitted the model — THE RESIDUALS MUST BE NORMAL). Tests for normality: graphical analysis — hist(y); lines(density(y)); library(car); qq.plot(y) (qqPlot() in current versions of car) or qqnorm(y). Shapiro–Wilk normality test — shapiro.test(). Skewness and kurtosis (t test).
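The checks above can be sketched on simulated data (a minimal sketch; the sample y and its parameters are illustrative, not from the course):

```r
# Sketch: normality checks on a toy sample
set.seed(42)
y <- rnorm(50, mean = 10, sd = 2)   # simulated, genuinely normal data

hist(y, freq = FALSE)               # histogram on the density scale
lines(density(y))                   # kernel density overlay

qqnorm(y)                           # quantile-quantile plot vs. the normal
qqline(y)                           # reference line through the quartiles

shapiro.test(y)                     # Shapiro-Wilk: H0 = the data are normal
# a large p-value gives no evidence against normality
```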

14 Normality: Histogram

15 Normal distribution must be symmetrical around the mean. A quincunx (Galton board) animation from the animation package illustrates this (the nmax value was lost in the transcript):

library(animation)
ani.options(nmax = 2000, interval = 0.003)  # original nmax value lost in transcription
freq <- quincunx(balls = 2000, col.balls = rainbow(1))
barplot(freq, space = 0)  # frequency table

16 Normality: Q-Q Plot

17 Normality: Quantile-Quantile Plot. Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. The quantiles are the data values marking the boundaries between consecutive subsets.

18 Normality. In case of non-normality there are 2 possible approaches. 1. Change the distribution (use GLMs): e.g. Poisson (count data), binomial (proportion data) — advanced statistics. 2. Data transformation: logarithmic (skewed data), square-root, arcsine (percentages), probit (proportions), Box–Cox transformation.

19 Homogeneity of variance: two samples. Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different. The test could not be simpler: it is called Fisher's F. To compare two variances, all you do is divide the larger variance by the smaller variance. The test can be carried out with var.test(); by hand: F.calculated <- var(A)/var(B); F.critical <- qf(0.975, nA - 1, nB - 1). If the calculated F is larger than the critical value, we reject the null hypothesis. E.g. students from TESAF vs. students from DAFNAE.
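A minimal sketch of this F test, with two simulated samples standing in for the hypothetical TESAF and DAFNAE groups:

```r
# Sketch: variance-ratio (Fisher's F) test on simulated samples
set.seed(1)
A <- rnorm(20, mean = 25, sd = 3)
B <- rnorm(15, mean = 24, sd = 3)

F.calc <- var(A) / var(B)                          # variance ratio, by hand
F.crit <- qf(0.975, length(A) - 1, length(B) - 1)  # critical value, alpha = 0.05
F.calc > F.crit                                    # TRUE would mean: reject H0

var.test(A, B)                                     # the same test in one call
```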

20 Homogeneity of variance: more than two samples. It is important to know whether variance differs significantly from sample to sample. Constancy of variance (homoscedasticity) is the most important assumption underlying regression and analysis of variance. For multiple samples you can choose between the Bartlett test — bartlett.test(response, factor) — and the Fligner–Killeen test — fligner.test(response, factor). There are differences between the tests: Fisher's and Bartlett's are very sensitive to outliers, whereas Fligner–Killeen is not.
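A sketch for more than two groups (the response and grouping factor are simulated, with deliberately unequal spreads):

```r
# Sketch: homogeneity of variance across three simulated groups
set.seed(2)
response <- c(rnorm(20, 10, 1), rnorm(20, 10, 2), rnorm(20, 10, 3))
group    <- factor(rep(c("a", "b", "c"), each = 20))

bartlett.test(response, group)   # sensitive to outliers / non-normality
fligner.test(response, group)    # rank-based, robust to outliers
```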

21 Mean comparison. H0: the two means are the same. H1: the two means differ. In many cases, a researcher is interested in gathering information about two populations in order to compare them. As in statistical inference for one population parameter, confidence intervals and tests of significance are useful statistical tools for the difference between two population parameters. All assumptions met? Parametric: t.test() — t test with independent or paired samples. Some assumptions not met? Non-parametric: wilcox.test() — the Wilcoxon test (rank-sum for independent samples, signed-rank for paired samples) is a non-parametric alternative to Student's t test for the case of two samples.
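The parametric and non-parametric options can be sketched side by side (samples x and y are simulated; which call is appropriate depends on the assumptions above):

```r
# Sketch: the main two-sample comparisons on simulated data
set.seed(3)
x <- rnorm(15, mean = 20, sd = 2)
y <- rnorm(15, mean = 22, sd = 2)

t.test(x, y, var.equal = TRUE)   # classical t test (homoscedastic samples)
t.test(x, y)                     # Welch t test (R's default: unequal variances)
t.test(x, y, paired = TRUE)      # paired t test (only for matched observations)
wilcox.test(x, y)                # non-parametric alternative
```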

22 The t test. A measure tied to the difference between the means, relative to a measure of variability within the groups: t calculated = difference between the means / variability of the groups.

23 The t test. [Figure: four cases (1–4) comparing samples A and B, combining small or large differences between the means with small or large within-group variability (variability of A, variability of B).]

24 The t test. t calculated = difference between the means / standard error of the difference. t grows with the difference between the means and shrinks with the variability within the groups. The more extreme the calculated t, the smaller P will be and the greater the probability of rejecting H0.

25 The t test. t calculated = difference between the means / standard error of the difference. The more extreme the calculated t, the greater the probability of rejecting H0. [Figure: t distribution with the critical values −t and +t marking the rejection regions and P as the tail area.]

26 How to choose the right t test from the assumptions. Independence? NO: paired t test. YES: unpaired t tests — t test for homoscedastic populations, or t test for heteroscedastic populations (Welch t test; complex formula, a computer is required).

27 Independent homoscedastic samples: the t test. The pooled variance combines the variances of the two samples. The degrees of freedom are n1 + n2 − 2 for the critical t.

28 Independent homoscedastic samples: the t test. H0: the two means are equal. Ha: the two means differ. Hypothesis test: 1. Compute the pooled variance of the two samples. 2. Determine the calculated t value. 3. Choose the significance level (alpha; 1 or 2 tails?). 4. Determine the critical t value (degrees of freedom: n1 + n2 − 2). 5. If |t calculated| > |t critical|, reject H0. 6. Conclusion: the means are DIFFERENT!
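The steps above can be sketched in R on simulated samples and checked against t.test() (sample sizes and parameters are illustrative):

```r
# Sketch: pooled-variance t test computed by hand
set.seed(4)
x1 <- rnorm(12, mean = 20, sd = 2)
x2 <- rnorm(10, mean = 23, sd = 2)
n1 <- length(x1); n2 <- length(x2)

s2p    <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)  # pooled variance
t.calc <- (mean(x1) - mean(x2)) / sqrt(s2p * (1 / n1 + 1 / n2))
t.crit <- qt(0.975, df = n1 + n2 - 2)     # alpha = 0.05, two-tailed
abs(t.calc) > t.crit                      # TRUE -> reject H0

t.test(x1, x2, var.equal = TRUE)          # same statistic and P-value
```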

29 Paired samples: 2 cases. 1. Repeated measures (e.g. students measured before and after: [table of before/after scores for students A–H; some values lost in the transcript]). 2. Correlation in space (e.g. ammonia concentration in water measured upstream and downstream of a textile factory on rivers A, B and C).

30 Paired samples: the t test. t is computed from the mean of the differences, the standard deviation of the differences, and the number of pairs. The degrees of freedom are n − 1 for the critical t. [Table of before/after scores and their differences Di for students A–H; some values lost in the transcript.]

31 Paired samples: the t test. H0: the two means are equal. Ha: the two means differ. Hypothesis test: 1. Determine the calculated t value. 2. Choose the significance level (alpha; 1 or 2 tails?). 3. Determine the critical t value (degrees of freedom: n − 1). 4. If |t calculated| > |t critical|, reject H0. 5. Conclusion: the means are DIFFERENT!
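A sketch of the paired t test; the before/after scores below are hypothetical stand-ins for the partially lost table on slide 30:

```r
# Sketch: paired t test computed by hand and with t.test()
before <- c(22, 23, 24, 25, 20, 18, 19, 19)   # hypothetical scores
after  <- c(23, 23, 24, 25, 21, 18, 19, 20)
d      <- after - before                      # per-student differences

t.calc <- mean(d) / (sd(d) / sqrt(length(d)))  # d.f. = n - 1
t.test(after, before, paired = TRUE)           # same statistic
```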

32 Non-parametric: Wilcoxon. The test is based on the ranks of the pooled observations from samples A and B, where n1 and n2 are the numbers of observations and R1 is the sum of the ranks in sample 1. The test can be carried out with the wilcox.test() function.

33 Correlation. Correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. Three alternative approaches: 1. Parametric — cor(). 2. Non-parametric — cor(). 3. Bootstrapping — replicate(), boot(). [Example data: plant species richness (x1 … x458) and bird species richness (l1 … l458) across 458 sampling units.]

34 Correlation: causal relationship? Which is the response variable in a correlation analysis? NONE. [Same example data: plant and bird species richness across 458 sampling units.]

35 Correlation. A correlation of +1 means that there is a perfect positive LINEAR relationship between the variables. A correlation of −1 means that there is a perfect negative LINEAR relationship between the variables. A correlation of 0 means that there is no LINEAR relationship between the two variables. Plot the two variables in a Cartesian space.

36 Correlation Same correlation coefficient! r= 0.816

37 Parametric correlation: when is it significant? Assumptions: the two random variables come from random populations; cor() detects ONLY linear relationships. The Pearson product-moment correlation coefficient is tested with the t distribution: H0: cor = 0; H1: cor ≠ 0; the critical t value has d.f. = n − 2.
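The test can be sketched by hand and verified with cor.test() (x and y are simulated with a built-in linear relationship):

```r
# Sketch: Pearson correlation and its t test, by hand and with cor.test()
set.seed(5)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)
n <- length(x)

r      <- cor(x, y)                      # Pearson product-moment r
t.calc <- r * sqrt((n - 2) / (1 - r^2))  # t statistic with n - 2 d.f.
t.crit <- qt(0.975, df = n - 2)          # critical value, alpha = 0.05

cor.test(x, y)                           # the same test in one call
```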

38 Non-parametric correlation: rank procedures. Spearman correlation index (Pearson's correlation applied to the ranks). Kendall tau rank correlation coefficient, where P is the number of concordant pairs and n is the total number of pairs. Distribution-free, but less powerful.
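Both rank-based coefficients are available through the method argument of cor() and cor.test() (data simulated as before):

```r
# Sketch: rank-based correlation on simulated data
set.seed(5)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)

cor(x, y, method = "spearman")       # Pearson's r computed on the ranks
cor(x, y, method = "kendall")        # based on concordant vs. discordant pairs
cor.test(x, y, method = "spearman")  # significance test for Spearman's rho
```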

39 Issues related to correlation: dependence of the data. 1. Temporal autocorrelation: values in close years are more similar. 2. Spatial autocorrelation: values in close sites are more similar; Moran's I or Geary's C are measures of global spatial autocorrelation. [Figure: spatial patterns with Moran's I = 0 vs. Moran's I = 1.]

40 Three issues related to correlation. 2. Temporal autocorrelation: values in close years are more similar (dependence of the data). When working with time series you are likely to have a temporal pattern in the data, e.g. tree-ring width series. Autoregressive models (not covered!).

41 Three issues related to correlation. 3. Spatial autocorrelation: values in close sites are more similar (dependence of the data). Moran's I or Geary's C measure global spatial autocorrelation (univariate response). ISSUE: can we explain the spatial autocorrelation with our models? Compare the raw response with the residuals after model fitting. Hint: if you find spatial autocorrelation in your residuals, you should start worrying.

42 Estimate correlation with bootstrap. BOOTSTRAP: bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of the SEs and CIs of a population parameter. Sampling with replacement — 1 original sample, 10 bootstrapped samples:

> a <- c(1:5)
> a
[1] 1 2 3 4 5
> replicate(10, sample(a, replace = TRUE))
(a 5 × 10 matrix of resampled values; the entries are random and were lost in the transcript)

43 Estimate correlation with bootstrap. Why bootstrap? It doesn't depend on the normal-distribution assumption, and it allows the computation of unbiased SEs and CIs. [Diagram: sample → N bootstrap samples with replacement → statistic distribution → quantiles.]
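The whole procedure fits in a few lines with replicate() (a minimal percentile-bootstrap sketch on simulated data; for bias-corrected intervals the boot package mentioned on slide 33 is the usual tool):

```r
# Sketch: percentile bootstrap CI for a correlation coefficient
set.seed(6)
x <- rnorm(40)
y <- 0.6 * x + rnorm(40)
n <- length(x)

boot.r <- replicate(2000, {
  i <- sample(n, replace = TRUE)   # resample whole cases, with replacement
  cor(x[i], y[i])                  # recompute the statistic on each resample
})
quantile(boot.r, c(0.025, 0.975))  # percentile 95% CI (often asymmetric)
```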

44 Estimate correlation with bootstrap. The CIs are asymmetric because our distribution reflects the structure of the data rather than a predefined probability distribution. If we repeated the sampling n times, we would find 0.95 · n values included in the CIs.

45 Frequency data. Properties of frequency data: count data and proportion data. Proportion data: where we know the number doing a particular thing, but also the number not doing that thing (e.g. mortality of the students who attend the first lesson but not the second). Count data: where we count how many times something happened, but we have no way of knowing how often it did not happen (e.g. number of students coming to the first lesson).

46 Count data. Straightforward linear methods (assuming constant variance and normal errors) are not appropriate for count data for four main reasons: the linear model might lead to the prediction of negative counts; the variance of the response variable is likely to increase with the mean; the errors will not be normally distributed; many zeros are difficult to handle in transformations. Use instead: classical tests with contingency tables, or generalized linear models with a Poisson distribution and log link function (extremely powerful and flexible!).
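The GLM route can be sketched in a few lines (counts and the predictor x are simulated; the log link guarantees non-negative predictions, addressing the first objection above):

```r
# Sketch: Poisson GLM with log link on simulated counts
set.seed(7)
x      <- runif(50, 0, 2)
counts <- rpois(50, lambda = exp(0.3 + 0.8 * x))   # simulated count response

m <- glm(counts ~ x, family = poisson(link = "log"))
summary(m)    # coefficients are on the log scale
fitted(m)     # predictions are always positive: no negative counts
```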

47 Count data: contingency tables. H0: the frequencies found in the rows are independent of the frequencies in the columns.

              Group 1   Group 2   Row total
Trait 1       a         b         a+b
Trait 2       c         d         c+d
Column total  a+c       b+d       a+b+c+d

We can assess the significance of the differences between observed and expected frequencies in a variety of ways: Pearson's chi-squared (χ2), the G test, or Fisher's exact test.

48 Count data: contingency tables — Pearson's chi-squared (χ2). We need a model to define the expected frequencies (E) (many possibilities), e.g. perfect independence: E = row total (Ri) × column total (Ci) / grand total (G). The calculated χ2 is then compared with the critical value. [Worked table: ants present/absent on oak vs. beech; the counts were lost in the transcript.]

49 Count data: contingency tables — the G test. 1. We need a model to define the expected frequencies (E) (many possibilities), e.g. perfect independence. 2. The G statistic is compared against the χ2 distribution. If expected values are less than 4 or 5, use Fisher's exact test instead: fisher.test().
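The chi-squared and exact tests can be sketched on a hypothetical 2 × 2 table (the counts below are invented stand-ins for the lost oak/beech numbers):

```r
# Sketch: tests on a hypothetical 2 x 2 contingency table
tab <- matrix(c(22, 31, 30, 18), nrow = 2,
              dimnames = list(trait = c("With ants", "Without ants"),
                              tree  = c("Oak", "Beech")))

chisq.test(tab)    # Pearson's chi-squared; expected = Ri * Ci / G
fisher.test(tab)   # exact test; preferred when expected counts are small
```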

50 Proportion data. Proportion data have three important properties that affect the way the data should be analyzed: the data are strictly bounded (0–1); the variance is non-constant (it depends on the mean); the errors are non-normal. Use: classical tests with probit or arcsine transformation, or generalized linear models with a binomial distribution and logit link function (extremely powerful and flexible!).

51 Proportion data: traditional approach — transform the data! Arcsine transformation: takes care of the error distribution; p are percentages (0–100%). Probit transformation: takes care of the non-linearity; p are proportions (0–1).
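Both transformations are one-liners in R (a sketch written for proportions on the 0–1 scale; the vector p is illustrative):

```r
# Sketch: the two classical transformations for proportion data
p <- c(0.05, 0.20, 0.50, 0.80, 0.95)

asin(sqrt(p))   # arcsine(square-root); for percentages use asin(sqrt(p / 100))
qnorm(p)        # probit: inverse CDF of the standard normal
```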

52 Proportion data: modern analysis. An important class of problems involves data on proportions, such as: studies on percentage mortality (LD50), infection rates of diseases, proportion responding to a clinical treatment (bioassay), sex ratios, or in general data on proportional response to an experimental treatment. 2 approaches: 1. transform both the response and the explanatory variables; or 2. use generalized linear models (GLMs) with a different error distribution.
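The GLM approach for a bioassay-style problem can be sketched as follows (dose and mortality data are simulated; the cbind(successes, failures) response is R's standard way to fit a binomial GLM):

```r
# Sketch: binomial GLM with logit link on simulated dose-mortality data
set.seed(8)
dose  <- rep(1:5, each = 10)
dead  <- rbinom(50, size = 20, prob = plogis(-3 + dose))  # deaths out of 20
alive <- 20 - dead

m <- glm(cbind(dead, alive) ~ dose, family = binomial(link = "logit"))
summary(m)    # slope is on the logit (log-odds) scale
fitted(m)     # fitted proportions stay strictly between 0 and 1
```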

53 Statistical modelling. Generally speaking, a statistical MODEL is a function of your explanatory variables that explains the variation in your response variable (y). E.g. Y = a + b·x1 + c·x2 + d·x3, where Y is the response variable (performance of the students) and the xi are explanatory variables (ability of the teacher, background, age). The object is to determine the values of the parameters (a, b, c and d) in a specific model that lead to the best fit of the model to the data. The best model is the one that produces the least unexplained variation (the minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant (there are many ways to reach this!).

54 Statistical modelling. Getting started with complex statistical modelling: it is essential that you can answer the following questions. Which of your variables is the response variable? Which are the explanatory variables? Are the explanatory variables continuous or categorical, or a mixture of both? What kind of response variable do you have: a continuous measurement, a count, a proportion, a time at death, or a category?

55 Statistical modelling Getting started with complex statistical modeling The explanatory variables (a) All explanatory variables continuous - Regression (b) All explanatory variables categorical - Analysis of variance (ANOVA) (c) Explanatory variables both continuous and categorical - Analysis of covariance (ANCOVA) The response variable (a) Continuous - Normal regression, ANOVA or ANCOVA (b) Proportion - Logistic regression, GLM logit-linear models (c) Count - GLM Log-linear models (d) Binary - GLM binary logistic analysis (e) Time at death - Survival analysis

56 Statistical modelling: multicollinearity. Multicollinearity: correlation between predictors in a non-orthogonal multiple linear model. The variables are not independent, and their confounded effects are difficult to separate. This makes an important difference to our statistical modelling because, in orthogonal designs, the variation attributed to a given factor is constant and does not depend upon the order in which factors are removed from the model. In contrast, with non-orthogonal data the variation attributable to a given factor does depend upon the order in which factors are removed from the model. The order of variable selection makes a huge difference (please wait for Session 4!).

57 Statistical modelling. Each analysis estimates a MODEL, and it is very important to understand that there is not just one model. Model building means estimating the parameters (slopes and levels of factors): given the data, and given our choice of model, what values of the parameters of that model make the observed data most likely? Following Occam's razor, you want the model to be minimal (parsimony) and adequate (it must describe a significant fraction of the variation in the data).

58 Statistical modelling: Occam's razor and MODEL SIMPLIFICATION. Models should have as few parameters as possible; linear models should be preferred to non-linear models; experiments relying on few assumptions should be preferred to those relying on many; models should be pared down until they are minimal adequate; simple explanations should be preferred to complex explanations. The process of model simplification is an integral part of hypothesis testing in R. In general, a variable is retained in the model only if it causes a significant increase in deviance when it is removed from the current model.

