GOD
On probability, z-scores, and distributions
Lecture 1 – Introduction to Statistics – Thijs Bol

TODAY
1. Overview of the course
2. Recap of statistics lecture in research methods
– Centrality, dispersion
– Z-scores
3. Probabilities in normal distributions
4. God

I AM
• Thijs (Th as in thread, ijs as in ice).
• Quantitative sociological research on:
– Inequality in education
– Inequality in the labor market
– Inequality in science
• Your lecturer in statistics in January.

PRACTICAL STUFF

STATISTICS
• Everyone can learn statistics.
• Not everyone has the same learning pace.
• During the course you can learn from:
– The book
– The lecture
– The tutorial groups
– The extra tutoring group
– Repetition in the lecture
– Self study using MyMathLab

COURSE INFORMATION
• 6 lectures and 6 tutorials in 3 weeks.
– Suzanne de Leeuw, Bram Hogendoorn, Sander Kunst, Christoph Janietz & Thijs Bol.
• De Pijp data that you have gathered will be the basis of the course.
• What you learn:
– The first steps into statistics
– Using SPSS (aka IBM Statistics)
• Grading:
– Exam (60%), minimum grade 5.5
– One individual assignment (20%)
– One duo assignment (20%)
– Final grade at least 5.5

COURSE INFORMATION (2)
• No fraud, no plagiarizing.
• In this course there are a few things set in stone:
– You cannot redo the assignments.
– You can only participate in the resit of the exam if you’ve participated in the original exam.
– Grades are only valid in this academic year (also for assignments etc.).
• Laptops during lectures.

COURSE INFORMATION (3)
• Literature
– Agresti
– (Treiman)
• MyMathLab
– Login code: bol84716
• Extra tutorial group
– Bram Hogendoorn and Suzanne de Leeuw
– Tuesday and Thursday from 17.00-19.00.
• More info? Syllabus is on Canvas!

STRUCTURE OF THE COURSE
• Univariate / Bivariate
• Descriptive statistics (L0-1)
• Inferential statistics (L1-6)
– Z-scores & probability (L0-1)
– Sample/population (L2-3)
– Tables and correlation measures (L4-5)
– Significance and t-tests of difference (L3)
– Regression (L5-6)

LAST TIME (GUEST LECTURE)

TYPES OF VARIABLES
• Different types of variables:
Measurement level | Description | Example
Nominal | No rank order | Religion, political party voted for
Ordinal | Rank order, but unequal distances | Disagree completely – Agree completely
Interval | Rank order with equal distances | Celsius, hourly wage
Ratio | Rank order with equal distances and a natural 0 | Age, weight, height
Right now we don’t care about the difference between ratio and interval.

WHY DO WE CARE ABOUT THIS?
• Different types of variables require different types of description.
• We want to describe data.
• We can’t do that by showing all answers to a survey.
– 200 filled-in surveys from the De Pijp project would be a bit confusing.
• A core function of statistics is to describe (survey) data.
1. Centrality
2. Dispersion

CENTRALITY
• The type of variable defines the centrality measure that we can use.
Type of variable | Centrality measure
Nominal | Mode
Ordinal | Median
Interval/ratio | Mean
Dichotomous | Mean

MEAN FOR DICHOTOMOUS VARIABLES
• For dichotomous variables the mean equals the proportion ($\hat{\pi}$).
• We have 1’s (for example, female) and 0’s (for example, male).
• The mean then equals the proportion of 1’s.
– In this case the proportion of females in the data.
• The proportion is basically the same as the percentage.
– Proportion = percentage/100.
– 0,58 is the same as 58% (for example, 58% is female).

DE PIJP DATA
Proportion is 0,1916, so 19,2% say they believe in God.

DISPERSION (1)
• Knowing the center of the data tells us very little about how the data are distributed.
• Data has a certain level of dispersion.
[Figure: interval/ratio data has a mean; interval/ratio data has a dispersion]

STANDARD DEVIATION
$s_y = \sqrt{\frac{\sum_i (y_i - \bar{y})^2}{n - 1}}$
Numerator: the sum of all squared distances from an observation to the mean.
Denominator: the number of observations (minus 1).
• The sum of all squared distances to the mean.
– If all observations are clustered around the mean, the sum of distances will be small.
– If observations are widely dispersed around the mean, the sum of distances will be larger.
• It’s a summary measure of the average distance to the mean.
• If there is more dispersion, the standard deviation $s_y$ will be higher.

CENTRALITY AND DISPERSION
• Equal means (0), different dispersion.
– Standard deviation is higher for the red dashed distribution.
• WHY DO WE CARE ABOUT THIS?

COMPARING DISTRIBUTIONS
• Imagine we want to compare positions in different distributions.
• Age
– A 29-year-old goes to a music festival (Lowlands, Glastonbury).
– A 31-year-old goes to the Stopera to watch an opera.
– Who is relatively the oldest?
• This all depends on the distribution of age at Lowlands and the Stopera.
• How do we solve this?

Z-SCORE
$z = \frac{y_i - \bar{y}}{s}$
Numerator: distance from observation $y_i$ to the mean ӯ.
Denominator: standard deviation s.
• Number of standard deviations from the mean.
• Independent of the original scale of the variable (age).
– We can compare age (29-year-old at Lowlands, 31-year-old at the Stopera).
• How many standard deviations are both observations from the mean?

THE IMPORTANCE OF THE Z-SCORE
• Z-scores take into account differences in both centrality and dispersion.
– Sometimes convenient when comparing different distributions.
• More important though: z-scores are a key concept in inferential statistics.
• Z-scores help us to describe bell-shaped distributions.
– Normal distributions.
– Key concept in the first three lectures.

DISTRIBUTION OF DATA
• Data can be distributed in different ways.
[Figure: skewed distribution]

BELL-SHAPED DISTRIBUTIONS
[Figure: (more) bell-shaped distribution]

A PERFECT BELL-SHAPED DISTRIBUTION
The distribution is perfectly symmetrical around the mean ӯ.

EMPIRICAL RULE
[Figure: bell curve with the horizontal axis marked at ӯ-3s, ӯ-2s, ӯ-s, ӯ, ӯ+s, ӯ+2s and ӯ+3s, showing the 68%, 95,4% and 99,7% ranges]
• We can summarize all observations in bell-shaped distributions:
– 68% of all observations are between ӯ-s and ӯ+s
– 95,4% of all observations are between ӯ-2s and ӯ+2s
– 99,7% of all observations are between ӯ-3s and ӯ+3s

EXAMPLE PIJP DATA
So 68% of the observations are between ӯ-s = 26,4 (42,8-16,4) and ӯ+s = 59,2 (42,8+16,4).

PROBABILITIES AND PROBABILITY DISTRIBUTIONS

PROBABILITIES
• We just looked at frequency distributions.
– How many people live in the household?
– How many square meters is their house?
• But we can think of them as probability distributions as well.
• I pick one random inhabitant of De Pijp:
– What is the probability that he/she is older than 35?
– What is the probability that he/she has two or more children?
• We can determine this on the basis of the distribution!
• Probability = p
– The probability is the area under the curve.

PROBABILITY DISTRIBUTION
[Figure: bell curve with a 68% probability to be between ӯ-s and ӯ+s, 95,4% to be between ӯ-2s and ӯ+2s, and 99,7% to be between ӯ-3s and ӯ+3s]

NORMAL DISTRIBUTIONS
• We can apply this to all normal distributions.
[Figure: three normal curves with ӯ=4, s=1; ӯ=2, s=0,5; ӯ=4, s=2]

STANDARD NORMAL DISTRIBUTION
• We can also apply this to the standard normal distribution.
• Theoretical distribution used in inferential statistics.
– Empirical distributions are hardly ever normally distributed.
– We use the standard normal distribution for calculations.
• Characteristics of the standard normal distribution:
– Bell-shaped
– Perfectly symmetrical
– ӯ=0, s=1 (mean=0, standard deviation=1)

PROBABILITY AND THE EMPIRICAL RULE
• Let’s apply the Empirical Rule to the standard normal distribution.
[Figure: standard normal curve; 68% (0,68) of the area lies between z=-1 and z=1, and 95,4% (0,954) between z=-2 and z=2]

Z-SCORES AND PROBABILITIES
• Probabilities can be linked to z-scores.
– In the standard normal distribution (ӯ=0, s=1), z equals $y_i$:
$z = \frac{y_i - \bar{y}}{s} = \frac{y_i - 0}{1} = y_i$
• Every position in a normal distribution has a z-score with a corresponding probability.
– Can be found in Table A.
• For normally distributed variables we can convert z-scores to probabilities (and the other way around).

INFORMATION IN THE Z-TABLE
[Figure: the p-value in Table A is the area under the curve in the tail beyond the z-score]

EXAMPLE USING TABLE A
[Figure: standard normal curve with an unknown area (?) between z=-1 and z=1]
• 68% is between ӯ-s and ӯ+s.
• Let’s find evidence for this in Table A.
• What z-score do we use?
– 1
• What p-value corresponds to a z-score of 1?
• The Z-table shows that the p-value that corresponds to a z-score of 1 is 0,1587.

• The total area under every curve is 1.
• There’s a probability of 100% that you’re somewhere under the curve.
• The area we are looking for is: 1-(0,1587*2) = 1-0,32 = 0,68.
• 68% (0,68) is within the range of ӯ-s and ӯ+s.
[Figure: standard normal curve with area 0,68 between z=-1 and z=1 and area 0,1587 in each tail]
• You can check the evidence for 95,4% (a z-score of 2) and 99,7% (a z-score of 3) yourself.

APPLICATION OF ALL THIS
• Why are we doing this?
– This is just a theoretical story on probabilities and curves.
• We can apply the information from a theoretical probability distribution to any empirical data that follow a normal distribution.
• Using the z-scores of the empirical data, we can calculate different probabilities (using the Z-table).
– What’s the probability that someone scores an 8 or higher for this course?
• At the same time we can convert probabilities to real scores on a variable.
– What grade is above the 95th percentile, and what grade marks the 25th percentile?

GOD
• In Western countries, a decreasing number of people are religious.
• One explanation is that God is more relevant for older generations.
• An alternative explanation is that God or religion has taken a different meaning for younger people.
– “Ietsism”

YOUTH AND RELIGION

GOD (2)
• So maybe spiritualism for younger cohorts does not have so much to do with “God”.
• Both variables are in De Pijp data.
– Do you believe in God?
– Do you believe in a higher power other than God?
• Is the probability that a younger individual believes in a higher power other than God greater than the probability that a younger individual believes in God?
• We define “younger” as being 30 or under.

DE PIJP DATA

AGE OF GOD-BELIEVERS
ӯ=44,3 and s=16,8
Let’s assume for now that this variable has a perfect normal distribution.

YOUNGSTERS AND GOD
• What is the probability that a God-believer is 30 years or younger?
[Figure: normal curve with mean 44,3; p is the area to the left of age 30, which lies at z-score Z]

FROM Z-SCORES TO PROBABILITY
• Let’s calculate the z-score
– Required information
1. Value of an observation
2. Mean
3. Standard deviation
• $y_i$ = 30, ӯ = 44,3, s = 16,8
$z = \frac{y_i - \bar{y}}{s} = \frac{30 - 44{,}3}{16{,}8} = -0{,}85$
• What probability (p-value) corresponds to a z-score of (-)0,85?
– Why is it irrelevant if the z-score is negative or positive?
– Normal distributions are perfectly symmetrical, so the probability is the same irrespective of the tail!

P=0,1977 (so 19,77%).
The probability that a God-believer is 30 or younger is 0,1977 or 19,77%.
[Figure: normal curve with mean 44,3; area 0,1977 in the tail beyond z=0,85]

IETSISM

AGE OF IETSISTS
ӯ=47,0 and s=16,3

YOUNGSTERS & HIGHER POWER OTHER THAN GOD
• What is the probability that a believer in a higher power other than God is 30 years or younger?
[Figure: normal curve with mean 47,0; p is the area to the left of age 30, which lies at z-score Z]

FROM Z-SCORES TO PROBABILITY
• Let’s calculate the z-score
– Required information
1. Value of an observation
2. Mean
3. Standard deviation
• $y_i$ = 30, ӯ = 47,0, s = 16,3
$z = \frac{y_i - \bar{y}}{s} = \frac{30 - 47{,}0}{16{,}3} = -1{,}04$
• What probability (p-value) corresponds to a z-score of (-)1,04?

P=0,1492 (so 14,92%).
The probability that a believer in a higher power other than God is 30 or younger is 0,1492 or 14,92%.
[Figure: normal curve with mean 47,0; area 0,1492 in the tail beyond z=1,04]

GOD AND IETSISM
• The probability that a God-believer is young is slightly higher than the probability that a believer in a higher power other than God is young.
– 19,8% of God-believers versus 14,9% of believers in a higher power other than God are 30 or younger.
• But
– Data are not normally distributed.
– This is an example to practice the relation between z-scores and probabilities.

FROM PROBABILITY TO REAL SCORES
• You can also do this the other way around.
• Imagine that we want to know the maximum age among the youngest 30% of believers in God.
– Here we don’t know $y_i$
– But we do know the p-value.
• What’s the p-value?
– 0,30

WHAT DO AND DON’T WE KNOW?
$z = \frac{y_i - \bar{y}}{s}$
• $y_i$ is unknown.
• ӯ is known (44,3)
• s is known (16,8)
• z can be looked up via the p-value

FROM P-VALUE TO Z-SCORE
A probability of 30% corresponds to a z-score of (-)0,52.

FROM Z-SCORES TO REAL SCORES
• ӯ = 44,3; s = 16,8; normally distributed population.
• Value of the observation
– Required information
1. Z-score
2. Mean
3. Standard deviation
$z = \frac{y_i - \bar{y}}{s}$, so $y_i = \bar{y} + s \cdot z$
• $y_i$ = ?, ӯ = 44,3, s = 16,8, z = -0,52
$y_i = 44{,}3 + 16{,}8 \cdot (-0{,}52) = 35{,}6$
• What does this 35,6 mean?
– The 30% youngest God-believers are at most 35,6 years of age.

VISUAL REPRESENTATION
[Figure: normal curve with mean 44,3; the area 0,30 lies to the left of age 35,6, which corresponds to z=-0,52]

OK, BUT WHY DO WE NEED THIS?
• Good question.
• TBH: We don’t really need it right now.
• But thinking about a normal distribution as a probability distribution is crucial for inferential statistics.
• More specifically, we need this if we want to say something about a population (inhabitants of De Pijp) on the basis of a sample (De Pijp Data).

RECAP
1. We can compare z-scores between distributions.
2. The Empirical Rule.
– 68% of all observations can be found between ӯ-s and ӯ+s
– The probability to be between ӯ-s and ӯ+s is p=0,68 = 68%
3. A frequency distribution can be read as a probability distribution.
– Area under the curve depicts the probability.
4. On the basis of the standard normal distribution we are able to convert z-scores to probabilities. We can find these probabilities in the Z-table, and we can apply them to the De Pijp Data.
– Of course you can also go from probabilities to z-scores.
5. The probability that a God-believer is young is slightly larger than the probability that a believer in a higher power other than God is young.

TUTORIAL GROUPS
• Z-scores and probabilities.
• Calculate it yourself!
• Introduction to SPSS.

SEE YOU WEDNESDAY! (13.00)
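
The calculations from this lecture can be sketched in a few lines of Python. This is a minimal sketch, not part of the course material (the course itself uses Table A and SPSS): it assumes the ages are normally distributed, which the De Pijp data are not, and it uses the standard library's statistics.NormalDist in place of the Z-table. The function names z_score, lower_tail_probability and value_at_probability are made up for illustration. Because the code does not round z to two decimals before looking up the probability, its results differ slightly from the table-based values on the slides.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_score(y, mean, sd):
    """Number of standard deviations between observation y and the mean."""
    return (y - mean) / sd


def lower_tail_probability(z):
    """Area under the standard normal curve to the left of z."""
    return STANDARD_NORMAL.cdf(z)


def value_at_probability(p, mean, sd):
    """Score below which a proportion p of a normal distribution lies."""
    z = STANDARD_NORMAL.inv_cdf(p)  # from probability to z-score
    return mean + sd * z            # y_i = mean + s * z


# Empirical Rule: total area (1) minus the two tails beyond -z and +z.
within_1s = 1 - 2 * lower_tail_probability(-1)
within_2s = 1 - 2 * lower_tail_probability(-2)
within_3s = 1 - 2 * lower_tail_probability(-3)

# Probability that a God-believer is 30 or younger (mean 44.3, sd 16.8).
z_god = z_score(30, mean=44.3, sd=16.8)
p_god = lower_tail_probability(z_god)

# Same question for believers in a higher power other than God
# (mean 47.0, sd 16.3).
z_iets = z_score(30, mean=47.0, sd=16.3)
p_iets = lower_tail_probability(z_iets)

# Maximum age of the youngest 30% of God-believers.
age_youngest_30pct = value_at_probability(0.30, mean=44.3, sd=16.8)

print(f"Empirical Rule: {within_1s:.3f}, {within_2s:.3f}, {within_3s:.3f}")
print(f"God-believers:  z = {z_god:.2f}, p = {p_god:.4f}")
print(f"Ietsists:       z = {z_iets:.2f}, p = {p_iets:.4f}")
print(f"Youngest 30% of God-believers are at most {age_youngest_30pct:.1f}")
```

The three Empirical Rule values come out at roughly 0,68, 0,95 and 0,997, and the two z-scores round to -0,85 and -1,04 as on the slides. Rounding z to two decimals first, as you do when using Table A, reproduces the table values 0,1977 and 0,1492.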