Session 2 Lecture: Introduction to statistical hypothesis testing


Introduction to Biostatistical Analysis Using R. Statistics course for first-year PhD students. Session 2 Lecture: Introduction to statistical hypothesis testing. Null and alternative hypotheses. Types of error. Two-sample hypotheses. Correlation. Analysis of frequency data. Introduction to statistical modeling. Lecturer: Lorenzo Marini, DAFNAE, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova. E-mail: lorenzo.marini@unipd.it Tel.: +39 0498272807 http://www.biodiversity-lorenzomarini.eu/

Inference. A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing just answers the question: "How well do the findings fit the possibility that chance factors alone might be responsible?" [Diagram: a sample is drawn from the population by sampling; estimation (with uncertainty!) and testing, through a statistical model, lead from the sample back to the population.]

Key concepts: Session 1
Statistical testing in five steps:
1. Construct a null hypothesis (H0) and an alternative hypothesis
2. Choose a statistical analysis (check its assumptions!)
3. Collect the data (sampling)
4. Calculate the test statistic and the P-value
5. Reject H0 if P is small; retain H0 if P is large
Remember the order!
Concept of replication vs. pseudoreplication:
1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)
[Figure: scatter plot of y against x with n = 6 points, marking key quantities such as the observed value yi, the mean and the residual.]

1. Constructing and testing a hypothesis. Hypothesis: a statement about events in the real world that can be confirmed or refuted by experimentally observed data. Example: male and female students have the same marks.

1. Constructing and testing a hypothesis. Null hypothesis (H0): a statement about the population that is assumed to be true until there is clear evidence to the contrary (status quo, absence of an effect, etc.). Alternative hypothesis (Ha): a statement about the population that contradicts the null hypothesis and is accepted only when there is clear evidence in its favour.

1. Constructing and testing a hypothesis. A hypothesis test consists of a decision between H0 and Ha:
1. Reject H0 (and therefore accept Ha)
2. Accept H0 (and therefore reject Ha)

1. Constructing and testing a hypothesis. Inferential statistics lets us quantify probabilities in order to decide whether to accept or reject the null hypothesis: how credible is H0?
1. Reject H0
2. Accept H0

Significance level (alpha). A probability (alpha) for rejecting the null hypothesis must be defined a priori. The significance level of a test is the probability of rejecting H0 when it is actually true (how confident are we in our conclusions?). The smaller alpha is, the greater the certainty when the null hypothesis is rejected. Usual values are 10%, 5%, 1% and 0.1% (the slide marks the most common ones).

Hypothesis testing
1. Formulate the hypotheses (null hypothesis H0 vs. alternative hypothesis H1).
2. Compute the probability P. The P-value is often described as the probability of seeing results as extreme as, or more extreme than, those actually observed if the null hypothesis were true.
3. If this probability is lower than a defined threshold (level of significance: 0.01, 0.05), we can reject the null hypothesis.

Hypothesis testing: Types of error
Statistical decision (reject H0 or retain H0) against reality (effect or no effect):
- Effect exists, H0 rejected: correct, effect detected (POWER, 1 - β)
- Effect exists, H0 retained: Type II error (β), effect not detected
- No effect, H0 rejected: Type I error (α), effect detected although none exists (P-value)
- No effect, H0 retained: correct, no effect detected and none exists
As power increases, the chance of a Type II error decreases.
Statistical power depends on:
- the statistical significance criterion used in the test
- the size of the difference or the strength of the similarity (effect size)
- the variability of the population
- the sample size
- the type of test
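How power responds to sample size can be sketched with power.t.test() from base R. This is only an illustration: the true difference (delta = 1) and standard deviation (sd = 2) are assumed values, not numbers from the lecture.

```r
# Power of a two-sample t test for increasing sample sizes per group.
# delta and sd are hypothetical values chosen only for illustration.
n_per_group <- c(10, 20, 40, 80)
power <- sapply(n_per_group, function(n) {
  power.t.test(n = n, delta = 1, sd = 2, sig.level = 0.05)$power
})

# Type II error rate (beta) is the complement of power.
result <- data.frame(n_per_group,
                     power = round(power, 2),
                     beta  = round(1 - power, 2))
print(result)
```

With everything else held fixed, power rises and the Type II error rate falls as the groups get larger.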

Statistical analyses
- Mean comparisons for 2 populations: test the difference between the means of two samples.
- Correlation: in probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation refers to the departure of two variables from independence.
- Analysis of count or proportion data: whole numbers (integers, not continuous, with different distributional properties) or proportions.

Mean comparisons for 2 samples: the t test
H0: means do not differ. H1: means differ.
Assumptions:
- Independence of cases (work with true replicates!): this is a requirement of the design.
- Normality: the distributions in each of the groups are normal.
- Homogeneity of variances: the variance of the data in the groups should be the same (use Fisher's F test or the Fligner test for homogeneity of variances).
Together these form the common assumption that the errors are independently, identically and normally distributed.

Normality. Before we can carry out a test that assumes normality of the data, we need to check our distribution (but not always beforehand!). In many cases we must check this assumption after having fitted the model (e.g. regression or multifactorial ANOVA): RESIDUALS MUST BE NORMAL.
Graphical analysis: hist(y, freq = FALSE); lines(density(y)); qqnorm(y); or library(car); qqPlot(y)
Test for normality: Shapiro-Wilk normality test, shapiro.test(); skew + kurtosis (t test)
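Checking residuals rather than raw data can be sketched as follows. The data are simulated and the model (a simple linear regression) is an assumption made for the example.

```r
# Simulated data: a linear relationship with normal errors (hypothetical).
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# Fit the model first, then test the residuals, not y itself.
fit <- lm(y ~ x)
res <- residuals(fit)

sw <- shapiro.test(res)   # H0: the residuals come from a normal distribution
print(sw$p.value)

# Graphical checks (uncomment in an interactive session):
# hist(res, freq = FALSE); lines(density(res))
# qqnorm(res); qqline(res)
```

A small P-value from shapiro.test() is evidence against normality; a large one only means that no departure was detected.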

Normality: Histogram

Normality: Histogram. A normal distribution must be symmetrical around the mean.
library(animation)
ani.options(nmax = 2000 + 15 - 2, interval = 0.003)
freq = quincunx(balls = 2000, col.balls = rainbow(1))
# frequency table
barplot(freq, space = 0)

Normality: Q-Q Plot

Normality: Quantile-Quantile Plot. Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. The quantiles are the data values marking the boundaries between consecutive subsets.

Normality. In case of non-normality there are 2 possible approaches:
1. Change the distribution (use GLMs; advanced statistics), e.g. Poisson (count data) or binomial (proportion data)
2. Transform the data:
- Logarithmic (skewed data)
- Square root
- Arcsine (percentages)
- Probit (proportions)
- Box-Cox transformation

Homogeneity of variance: two samples. Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different. The test could not be simpler. It is called Fisher’s F. To compare two variances, all you do is divide the larger variance by the smaller variance. E.g. students from TESAF vs. students from DAFNAE:
F <- var(A)/var(B) (calculated F, with the larger variance in the numerator)
qf(0.975, nA-1, nB-1) (critical F, with the numerator degrees of freedom first)
If the calculated F is larger than the critical value, we reject the null hypothesis of equal variances. The test can be carried out with the var.test() function.
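The hand calculation and var.test() can be compared in a short sketch. The two samples below are simulated stand-ins for the TESAF and DAFNAE students, not real data.

```r
# Hypothetical marks for two groups of students.
set.seed(2)
A <- rnorm(12, mean = 25, sd = 2)
B <- rnorm(15, mean = 25, sd = 3)

# Put the larger variance in the numerator and order the df to match.
vA <- var(A); vB <- var(B)
if (vA >= vB) {
  F_calc <- vA / vB; df_num <- length(A) - 1; df_den <- length(B) - 1
} else {
  F_calc <- vB / vA; df_num <- length(B) - 1; df_den <- length(A) - 1
}
F_crit <- qf(0.975, df_num, df_den)   # two-sided test at alpha = 0.05
reject_H0 <- F_calc > F_crit

# The built-in test reaches the same decision.
vt <- var.test(A, B)
print(c(F_calc = F_calc, F_crit = F_crit, p_value = vt$p.value))
```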

Homogeneity of variance: more than two samples. It is important to know whether variance differs significantly from sample to sample. Constancy of variance (homoscedasticity) is the most important assumption underlying regression and analysis of variance. For multiple samples you can choose between the Bartlett test and the Fligner–Killeen test:
bartlett.test(response, factor)
fligner.test(response, factor)
There are differences between the tests: Fisher and Bartlett are very sensitive to outliers, whereas Fligner–Killeen is not.
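A minimal sketch of both tests on three groups, using the lower-case function names bartlett.test() and fligner.test() from base R. The data are simulated, with the third group deliberately given a larger spread.

```r
# Three hypothetical groups; group "c" has a larger standard deviation.
set.seed(3)
group    <- factor(rep(c("a", "b", "c"), each = 10))
response <- rnorm(30, mean = 10, sd = rep(c(1, 1, 3), each = 10))

bt <- bartlett.test(response, group)   # sensitive to outliers and non-normality
ft <- fligner.test(response, group)    # rank-based, more robust

print(c(bartlett_p = bt$p.value, fligner_p = ft$p.value))
```

Both tests have k - 1 degrees of freedom for k groups; H0 is that all group variances are equal.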

Mean comparison. In many cases, a researcher is interested in gathering information about two populations in order to compare them. As in statistical inference for one population parameter, confidence intervals and tests of significance are useful statistical tools for the difference between two population parameters.
H0: the two means are the same. H1: the two means differ.
- All assumptions met? Parametric: t.test(), the t test with independent or paired samples.
- Some assumptions not met? Non-parametric: wilcox.test(), the Wilcoxon test, a non-parametric alternative to Student's t test for two samples (rank-sum test for independent samples, signed-rank test for paired samples).

The t test. $t_{calculated} = \frac{\text{measure related to the difference between the means}}{\text{measure of variability within the groups}}$ (difference between means over variability of the groups).

The t test. [Figure: four cases (Case 1 to Case 4) showing the distributions of a variable in groups A and B, combining different differences between the means with different within-group variability.]

The t test. $t_{calculated} = \frac{\text{difference between the means}}{\text{standard error of the difference}}$. The numerator measures the difference between the means, the denominator the variability within the groups. The more extreme the calculated t, the smaller P and the greater the probability of rejecting H0.

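The t test. $t_{calculated} = \frac{\text{difference between the means}}{\text{standard error of the difference}}$. The more extreme the calculated t, the greater the probability of rejecting H0. [Figure: t distribution with the rejection regions (P) beyond -t critical and +t critical.]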

How to choose the right t test, starting from the assumptions. Independence?
- NO: paired t test
- YES: unpaired t tests, either the t test for homoscedastic populations or the t test for heteroscedastic populations (Welch t test; the formula is complex and a computer is required)

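Independent homoscedastic samples: t test. The test uses the pooled variance, $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, so that $t_{calculated} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$. The degrees of freedom for the critical t are n1 + n2 - 2.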

Independent homoscedastic samples: t test. H0: the two means are equal. Ha: the two means are different.
Hypothesis test:
1. Calculate the pooled variance of the two samples
2. Determine the value of the calculated t
3. Decide the significance level (alpha, 1 or 2 tails?)
4. Determine the value of the critical t
5. If |t calculated| > |t critical|, reject H0. Conclusion: the means are DIFFERENT!
The degrees of freedom for the critical t are n1 + n2 - 2.
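The steps for the pooled-variance t test can be followed by hand and checked against t.test(). This is a sketch with simulated samples and an assumed two-tailed alpha of 0.05, not the lecture's data.

```r
# Two hypothetical independent samples.
set.seed(5)
x1 <- rnorm(10, mean = 24, sd = 2)
x2 <- rnorm(12, mean = 26, sd = 2)
n1 <- length(x1); n2 <- length(x2)

# Pooled variance of the two samples.
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)

# Calculated t: difference between means over its standard error.
t_calc <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Critical t for a two-tailed test at alpha = 0.05, df = n1 + n2 - 2.
df     <- n1 + n2 - 2
t_crit <- qt(1 - 0.05 / 2, df)
reject_H0 <- abs(t_calc) > abs(t_crit)

# Built-in equivalent; var.equal = TRUE requests the pooled variance.
check <- t.test(x1, x2, var.equal = TRUE)
print(c(t_calc = t_calc, t_crit = t_crit, p_value = check$p.value))
```

Leaving out var.equal = TRUE makes t.test() run the Welch test for heteroscedastic populations instead.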

Paired samples: 2 cases
1. Repeated measures: the same students measured before and after. [Table: students A to H with Before and After values; most cell values were lost in extraction.]
2. Correlation in space: [ammonia] in water measured upstream and downstream of a textile factory on River A, River B and River C.

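Paired samples: t test. $t_{calculated} = \frac{\bar{D}}{s_D / \sqrt{n}}$, where $\bar{D}$ is the mean of the differences, $s_D$ the standard deviation of the differences and n the number of pairs. [Table: students A to H with Before, After and difference Di, e.g. student A: 22, 23, Di = 1; the remaining values were lost in extraction.] The degrees of freedom for the critical t are n - 1.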

Paired samples: t test. H0: the two means are equal. Ha: the two means are different.
Hypothesis test:
1. Determine the value of the calculated t
2. Decide the significance level (alpha, 1 or 2 tails?)
3. Determine the value of the critical t
4. If |t calculated| > |t critical|, reject H0. Conclusion: the means are DIFFERENT!
The degrees of freedom for the critical t are n - 1.
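A paired test works on the per-pair differences, which a short sketch makes explicit. The before/after marks below are invented for the example, because the slide's table did not survive extraction.

```r
# Hypothetical marks for 8 students measured twice.
before <- c(22, 24, 21, 25, 20, 18, 23, 19)
after  <- c(23, 24, 23, 26, 21, 20, 23, 21)

d <- after - before            # one difference per pair
n <- length(d)                 # number of pairs

# Calculated t: mean difference over its standard error.
t_calc <- mean(d) / (sd(d) / sqrt(n))
t_crit <- qt(1 - 0.05 / 2, df = n - 1)   # two-tailed, alpha = 0.05
reject_H0 <- abs(t_calc) > abs(t_crit)

# Built-in equivalent.
check <- t.test(after, before, paired = TRUE)
print(c(t_calc = t_calc, t_crit = t_crit, p_value = check$p.value))
```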

Non-parametric: Wilcoxon. [Table: the observations of the two samples and their ranks; the layout was lost in extraction.] n1 and n2 are the numbers of observations; R1 is the sum of the ranks in sample 1. The test can be carried out with the wilcox.test() function.
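The role of the rank sum R1 can be sketched on two small made-up samples. The W statistic reported by wilcox.test() for independent samples is R1 minus its minimum possible value, n1(n1 + 1)/2; the values below are hypothetical and contain no ties.

```r
# Two hypothetical independent samples (all values distinct).
s1 <- c(3.1, 4.5, 2.8, 5.0, 3.9)
s2 <- c(5.2, 6.1, 4.8, 7.3, 5.9, 6.4)
n1 <- length(s1)

# Rank all observations together, then sum the ranks of sample 1.
ranks <- rank(c(s1, s2))
R1 <- sum(ranks[seq_len(n1)])

# W as reported by R for the rank-sum (Mann-Whitney) test.
W <- R1 - n1 * (n1 + 1) / 2

wt <- wilcox.test(s1, s2)
print(c(R1 = R1, W = W, p_value = wt$p.value))
```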

Correlation. Correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. [Table: bird species richness (x1, x2, x3, x4, … x458) and plant species richness (l1, l2, l3, l4, … l458) for sampling units 1, 2, 3, 4, … 458.]
Three alternative approaches:
1. Parametric: cor()
2. Nonparametric: cor() with method = "spearman" or "kendall"
3. Bootstrapping: replicate(), boot()

Correlation: causal relationship? Which is the response variable in a correlation analysis? NONE. [Table: bird species richness (x1, x2, x3, x4, … x458) and plant species richness (l1, l2, l3, l4, … l458) for sampling units 1, 2, 3, 4, … 458.]

Correlation. Plot the two variables in a Cartesian space. A correlation of +1 means that there is a perfect positive LINEAR relationship between the variables. A correlation of -1 means that there is a perfect negative LINEAR relationship between the variables. A correlation of 0 means that there is no LINEAR relationship between the two variables.

Correlation. Same correlation coefficient! r = 0.816 [Figure: scatter plots of different data sets sharing the same r.]
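The slide does not name its data. The value r = 0.816 matches Anscombe's quartet, which ships with base R as the anscombe data frame, so the sketch below assumes that this is the data set shown.

```r
# Anscombe's quartet: four x/y pairs with very different shapes.
# Assumption: this is the data behind the slide's figure.
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(r, 3))

# Plot the four panels to see why r alone is not enough
# (uncomment in an interactive session):
# op <- par(mfrow = c(2, 2))
# for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
#                     xlab = paste0("x", i), ylab = paste0("y", i))
# par(op)
```

Always plot the data before interpreting a correlation coefficient.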

Parametric correlation: when is it significant? Pearson product-moment correlation coefficient. Hypothesis testing using the t distribution: H0: cor = 0; H1: cor ≠ 0; critical t value for d.f. = n - 2.
Assumptions: two random variables from a random population. cor() detects ONLY linear relationships.
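The t test for a Pearson correlation uses $t = r \sqrt{n - 2} / \sqrt{1 - r^2}$ with n - 2 degrees of freedom. The sketch below computes it by hand and compares it with cor.test(); the bird and plant richness values are simulated, not the 458 sampling units from the lecture.

```r
# Hypothetical species richness data for 30 sampling units.
set.seed(6)
plants <- rpois(30, lambda = 20)
birds  <- round(5 + 0.4 * plants + rnorm(30, sd = 2))

r <- cor(plants, birds)                     # Pearson by default
n <- length(plants)

# Test of H0: cor = 0 using the t distribution with n - 2 df.
t_calc <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_calc), df = n - 2)

ct <- cor.test(plants, birds)               # built-in equivalent
rho <- cor(plants, birds, method = "spearman")  # rank-based alternative
print(c(r = r, t = t_calc, p = p_val, spearman = rho))
```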

Nonparametric correlation. Rank procedures: distribution-free but less powerful.
- Spearman correlation index
- The Kendall tau rank correlation coefficient: P is the number of concordant pairs, n is the total number of pairs

Issues related to correlation
1. Temporal autocorrelation: values in close years are more similar (dependence of the data)
2. Spatial autocorrelation: values in close sites are more similar (dependence of the data). Measures of global spatial autocorrelation: Moran's I or Geary’s C. [Figure: example patterns with Moran's I = 0 and Moran's I = 1.]

Three issues related to correlation. 2. Temporal autocorrelation: values in close years are more similar (dependence of the data). Time series are likely to show a temporal pattern in the data, e.g. ring-width series. Autoregressive models (not covered!)

Three issues related to correlation. 3. Spatial autocorrelation: values in close sites are more similar (dependence of the data). ISSUE: can we explain the spatial autocorrelation with our models? Measures of global spatial autocorrelation: Moran's I or Geary’s C (univariate response), applied to the raw response and to the residuals after model fitting. Hint: if you find spatial autocorrelation in your residuals, you should start worrying.

Estimate correlation with bootstrap. BOOTSTRAP: bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of SEs and CIs of a population parameter.
Sampling with replacement:
> a <- c(1:5)
> a
[1] 1 2 3 4 5
> replicate(10, sample(a, replace=TRUE))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    2    3    2    1    4    2    1    2    1     3
[2,]    1    5    2    3    5    3    1    1    3     2
[3,]    4    4    4    5    4    4    5    1    1     5
[4,]    4    1    1    3    3    2    3    1    5     2
[5,]    5    5    1    3    5    2    4    1    5     4
1 original sample (a), 10 bootstrapped samples (the columns).

Estimate correlation with bootstrap. Why bootstrap?
- It doesn’t depend on the normal distribution assumption
- It allows the computation of unbiased SEs and CIs
[Diagram: sample → N samples with replacement → statistic computed on each → bootstrap distribution → quantiles.]

Estimate correlation with bootstrap. CIs are asymmetric because our distribution reflects the structure of the data and not a defined probability distribution. If we repeat the sampling n times, we expect about 0.95*n of the values to fall within the 95% CIs.
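A percentile bootstrap for a correlation coefficient can be sketched with replicate(), as the lecture suggests. The data are simulated and 2000 resamples is an arbitrary choice; note that whole (x, y) pairs are resampled by index so the pairing is preserved.

```r
# Hypothetical paired observations.
set.seed(7)
x <- rnorm(40)
y <- 0.6 * x + rnorm(40, sd = 0.8)
n <- length(x)

# Resample row indices with replacement and recompute r each time.
boot_r <- replicate(2000, {
  i <- sample(n, replace = TRUE)
  cor(x[i], y[i])
})

se <- sd(boot_r)                              # bootstrap standard error
ci <- quantile(boot_r, c(0.025, 0.975))       # 95% percentile interval
print(c(r = cor(x, y), se = se, ci))
```

The interval is read directly from the quantiles of the bootstrap distribution, which is why it need not be symmetric around r.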

Properties of frequency data: count data and proportion data.
Count data: where we count how many times something happened, but we have no way of knowing how often it did not happen (e.g. the number of students coming to the first lesson).
Proportion data: where we know the number doing a particular thing, but also the number not doing that thing (e.g. the ‘mortality’ of the students who attend the first lesson, but not the second).

Count data. Straightforward linear methods (assuming constant variance and normal errors) are not appropriate for count data for four main reasons:
• The linear model might lead to the prediction of negative counts.
• The variance of the response variable is likely to increase with the mean.
• The errors will not be normally distributed.
• Many zeros are difficult to handle in transformations.
Alternatives:
- Classical tests with contingency tables
- Generalized linear models with Poisson distribution and log-link function (extremely powerful and flexible!!!)
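A Poisson GLM with a log link can be sketched as follows. The counts are simulated from an assumed relationship; the point is that predictions on the response scale can never be negative.

```r
# Hypothetical counts whose mean increases with x.
set.seed(8)
x      <- runif(60, min = 0, max = 3)
counts <- rpois(60, lambda = exp(0.3 + 0.7 * x))

# Poisson errors, log link: log(mean count) = a + b * x
fit <- glm(counts ~ x, family = poisson(link = "log"))
print(coef(fit))

# Predicted mean counts are back-transformed with exp(), so always > 0.
pred <- predict(fit, newdata = data.frame(x = c(0, 3)), type = "response")
print(pred)
```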

Count data: contingency tables. We can assess the significance of the differences between observed and expected frequencies in a variety of ways:
- Pearson’s chi-squared (χ2)
- G test
- Fisher’s exact test
              Group 1   Group 2   Row total
Trait 1       a         b         a+b
Trait 2       c         d         c+d
Column total  a+c       b+d       a+b+c+d
H0: frequencies found in rows are independent from frequencies in columns.

Count data: contingency tables. Pearson’s chi-squared (χ2). We need a model to define the expected frequencies (E); there are many possibilities, e.g. perfect independence, $E_{ij} = R_i C_j / G$. The statistic $\chi^2 = \sum \frac{(O - E)^2}{E}$ is compared with the critical value.
                   Oak   Beech   Row total (Ri)
With ants          22    30      52
Without ants       31    18      49
Column total (Ci)  53    48      101 (G)

Count data: contingency tables.
- G test: we need a model to define the expected frequencies (E); there are many possibilities, e.g. perfect independence. The statistic $G = 2 \sum O \ln(O / E)$ is compared with the χ2 distribution.
- Fisher’s exact test, fisher.test(): use it if expected values are less than 4 or 5.
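The three approaches can be tried on the oak and beech table from the Pearson's chi-squared slide. This sketch assumes perfect independence as the model for the expected frequencies and a significance level of 0.05; the G statistic is computed by hand because base R has no built-in G test.

```r
# Observed counts of trees with and without ants (from the lecture table).
obs <- matrix(c(22, 31, 30, 18), nrow = 2,
              dimnames = list(c("With ants", "Without ants"),
                              c("Oak", "Beech")))

# Expected frequencies under independence: row total * column total / grand total.
E <- outer(rowSums(obs), colSums(obs)) / sum(obs)

chi2 <- sum((obs - E)^2 / E)          # Pearson's chi-squared
G    <- 2 * sum(obs * log(obs / E))   # G statistic
crit <- qchisq(0.95, df = 1)          # critical value, (2-1)*(2-1) = 1 df

# Built-in tests; correct = FALSE switches off Yates' continuity correction
# so that the result matches the hand calculation.
ct <- chisq.test(obs, correct = FALSE)
ft <- fisher.test(obs)
print(c(chi2 = chi2, G = G, critical = crit,
        p_chisq = ct$p.value, p_fisher = ft$p.value))
```

By default chisq.test() applies the continuity correction to 2 x 2 tables, which gives a smaller statistic than the uncorrected formula.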

Proportion data. Proportion data have three important properties that affect the way the data should be analyzed:
• the data are strictly bounded (0-1);
• the variance is non-constant (it depends on the mean);
• errors are non-normal.
Alternatives:
- Classical tests with probit or arcsine transformation
- Generalized linear models with binomial distribution and logit-link function (extremely powerful and flexible!!!)

Proportion data: traditional approach. Transform the data!
- Arcsine transformation: takes care of the error distribution; p are percentages (0-100%).
- Probit transformation: takes care of the non-linearity; p are proportions (0-1).
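The slide's formulas did not survive extraction. A common form of the two transformations is sketched below, on made-up percentages; the exact variants used in the lecture are an assumption (the classical probit, for example, adds 5 to the normal quantile).

```r
# Hypothetical percentages.
pct <- c(5, 20, 50, 80, 95)

# Arcsine (angular) transformation of percentages, in radians (0 to pi/2).
arcsine <- asin(sqrt(pct / 100))

# Probit transformation of proportions: the standard normal quantile.
probit <- qnorm(pct / 100)

print(data.frame(pct, arcsine = round(arcsine, 3), probit = round(probit, 3)))
```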

Proportion data: modern analysis. An important class of problems involves data on proportions such as:
• studies on percentage mortality (LD50),
• infection rates of diseases,
• proportion responding to clinical treatment (bioassay),
• sex ratios, or in general
• data on proportional response to an experimental treatment.
2 approaches:
1. Transform both the response and the explanatory variables (often needed), or
2. Use Generalized Linear Models (GLMs) with different error distributions.
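The GLM approach can be sketched on a small dose-mortality data set. All numbers are invented; the response is given as a two-column matrix of successes and failures so that the sample size behind each proportion is taken into account.

```r
# Hypothetical bioassay: 40 individuals per dose, number dead recorded.
dose  <- c(1, 2, 4, 8, 16)
total <- rep(40, 5)
dead  <- c(3, 9, 19, 30, 37)

# Binomial errors, logit link: logit(p) = a + b * log(dose)
fit <- glm(cbind(dead, total - dead) ~ log(dose),
           family = binomial(link = "logit"))
print(coef(fit))

# LD50: the dose at which predicted mortality is 50%, i.e. logit(p) = 0.
ld50 <- unname(exp(-coef(fit)[1] / coef(fit)[2]))
print(ld50)

# Fitted proportions stay strictly inside (0, 1).
print(round(fitted(fit), 2))
```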

Statistical modelling: the MODEL. Generally speaking, a statistical model is a function of your explanatory variables that explains the variation in your response variable (y). E.g. Y = a + b x1 + c x2 + d x3, where Y is the response variable (performance of the students) and xi are the explanatory variables (ability of the teacher, background, age). The object is to determine the values of the parameters (a, b, c and d) in a specific model that lead to the best fit of the model to the data. The best model is the model that produces the least unexplained variation (the minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant (there are many ways to reach this!).

Statistical modelling: getting started with complex statistical modeling. It is essential that you can answer the following questions:
• Which of your variables is the response variable?
• Which are the explanatory variables?
• Are the explanatory variables continuous or categorical, or a mixture of both?
• What kind of response variable do you have: is it a continuous measurement, a count, a proportion, a time at death, or a category?

Statistical modelling: getting started with complex statistical modeling.
The explanatory variables:
(a) All explanatory variables continuous: regression
(b) All explanatory variables categorical: analysis of variance (ANOVA)
(c) Explanatory variables both continuous and categorical: analysis of covariance (ANCOVA)
The response variable:
(a) Continuous: normal regression, ANOVA or ANCOVA
(b) Proportion: logistic regression, GLM logit-linear models
(c) Count: GLM log-linear models
(d) Binary: GLM binary logistic analysis
(e) Time at death: survival analysis

Statistical modelling: multicollinearity. Correlation between predictors in a non-orthogonal multiple linear model: confounding effects are difficult to separate, because the variables are not independent. This makes an important difference to our statistical modelling because, in orthogonal designs, the variation that is attributed to a given factor is constant and does not depend upon the order in which factors are removed from the model. In contrast, with non-orthogonal data, the variation attributable to a given factor does depend upon the order in which factors are removed from the model. The order of variable selection makes a huge difference (please wait for session 4!!!).

Statistical modelling. Each analysis estimates a MODEL. You want the model to be minimal (parsimony) and adequate (it must describe a significant fraction of the variation in the data). It is very important to understand that there is not just one model. Model building, i.e. the estimate of the parameters (slopes and levels of factors), asks:
• given the data,
• and given our choice of model,
• what values of the parameters of that model make the observed data most likely?
Occam’s Razor

Statistical modelling: Occam’s Razor
• Models should have as few parameters as possible;
• linear models should be preferred to non-linear models;
• experiments relying on few assumptions should be preferred to those relying on many;
• models should be pared down until they are minimal adequate;
• simple explanations should be preferred to complex explanations.
MODEL SIMPLIFICATION: the process of model simplification is an integral part of hypothesis testing in R. In general, a variable is retained in the model only if it causes a significant increase in deviance when it is removed from the current model.
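One way to carry out a deletion test in R is to remove a term with update() and compare the two nested models with anova(). This is a sketch on simulated data in which x3 is built to have no effect; it is not necessarily the procedure used later in the course.

```r
# Hypothetical data: y depends on x1 and x2 but not on x3.
set.seed(9)
n  <- 80
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3)

# Deletion test for x3: does removing it significantly worsen the fit?
reduced <- update(full, . ~ . - x3)
p_x3 <- anova(reduced, full)[["Pr(>F)"]][2]

# Same test for x1, which has a real effect in this simulation.
p_x1 <- anova(update(full, . ~ . - x1), full)[["Pr(>F)"]][2]

print(c(p_remove_x3 = p_x3, p_remove_x1 = p_x1))
```

A term whose removal gives a large P-value is a candidate for deletion; a term whose removal gives a small P-value stays in the minimal adequate model.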