# college1 Statistiek ```GOD
On probability, z-scores, and distributions
Lecture 1 – Introduction to Statistics – Thijs Bol
1
TODAY
1. Overview of the course
2. Recap of statistics lecture in research methods
– Centrality, dispersion
– Z-scores
3. Probabilities in normal distributions
4. God
2
I AM
• Thijs (Th as in thread, ijs as in ice).
• Quantitative sociological research on:
– Inequality in education
– Inequality in the labor market
– Inequality in science
• Your lecturer in statistics in January.
3
PRACTICAL STUFF
4
STATISTICS
• Everyone can learn statistics.
• Not everyone has the same learning pace.
• During the course you can learn from:
– The book
– The lecture
– The tutorial groups
– The extra tutoring group
– Repetition in the lecture
– Self study using MyMathLab
5
COURSE INFORMATION
• 6 lectures and 6 tutorials in 3 weeks.
– Suzanne de Leeuw, Bram Hogendoorn, Sander Kunst, Christoph Janietz & Thijs Bol.
• De Pijp data that you have gathered will be the basis of the course.
• What you learn:
– The first steps into statistics
– Using SPSS (aka IBM SPSS Statistics)
– Exam (60%), minimum grade 5.5
– One individual assignment (20%)
– One duo assignment (20%)
6
7
COURSE INFORMATION (2)
• No fraud, no plagiarizing.
• In this course there are a few things set in stone:
– You cannot redo the assignments.
– You can only participate in the resit of the exam if you’ve participated in the original exam.
– Grades are only valid in this academic year (also for assignments etc).
• Laptops during lectures.
8
COURSE INFORMATION (3)
• Literature
– Agresti
– (Treiman)
• MyMathLab
– Login code: bol84716
• Extra tutorial group
– Bram Hogendoorn and Suzanne de Leeuw
– Tuesday and Thursday from 17.00-19.00.
9
STRUCTURE OF THE COURSE
[Diagram: course topics, split into univariate and bivariate statistics]
• Descriptive statistics (L0-1)
• Inferential statistics (L1-6)
• Z-scores & probability (L0-1)
• Sample/population (L2-3)
• Tables and correlation measures (L4-5)
• Significance and t-tests of difference (L3)
• Regression (L5-6)
10
LAST TIME
(GUEST LECTURE)
11
TYPES OF VARIABLES
• Different types of variables:

| Measurement level | Description | Example |
|---|---|---|
| Nominal | No rank order | Religion, political party voted for |
| Ordinal | Rank order, but unequal distances | Disagree completely – Agree completely |
| Interval | Rank order with equal distances | Celsius, hourly wage |
| Ratio | Rank order with equal distances and a natural 0 | Age, weight, height |

Right now we don’t care about the difference between ratio and interval.
12
• Different types of variables require different types of description.
• We want to describe data.
• We can’t do that by showing all answers to a survey.
– 200 completed surveys from the De Pijp project are too much to take in at once.
• A core function of statistics is to describe (survey) data.
1. Centrality
2. Dispersion
13
CENTRALITY
• The type of variable defines the centrality measure that we can use.

| Type of variable | Centrality measure |
|---|---|
| Nominal | Mode |
| Ordinal | Median |
| Interval/ratio | Mean |
| Dichotomous | Mean |
14
MEAN FOR DICHOTOMOUS VARIABLES
• For dichotomous variables the mean equals the proportion (π̂).
• We have 1’s (for example, female) and 0’s (for example, male).
• The mean then equals the proportion of 1’s.
– In this case the proportion of females in the data.
• The proportion is basically the same as the percentage.
– Proportion=percentage/100.
– 0,58 is the same as 58% (for example, 58% is female).
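The claim that the mean of a 0/1 variable equals the proportion of 1’s can be checked with a few lines of Python. This is only a sketch: the `answers` list is made up for illustration and is not taken from the De Pijp data.

```python
from statistics import mean

# Hypothetical 0/1 answers (1 = female, 0 = male); NOT the De Pijp data.
answers = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

proportion = sum(answers) / len(answers)  # share of 1's in the data
average = mean(answers)                   # ordinary mean of the same values

print(proportion, average)       # both values are the same number
print(f"{proportion:.0%}")       # the proportion written as a percentage
```

With six 1’s out of ten answers, both the proportion and the mean come out at 0.6, which is 60%.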
15
DE PIJP DATA
Proportion is 0,1916, so 19,2% say they believe in God.
16
DISPERSION (1)
• If we know the center of data, we know very little about the
distribution of data.
• Data has a certain level of dispersion.
[Figure: interval/ratio data has a mean, and interval/ratio data has a dispersion]
17
STANDARD DEVIATION
$$ s_y = \sqrt{\frac{\sum_i (y_i - \bar{y})^2}{n - 1}} $$
Numerator: the sum of all distances from an observation to the mean, squared.
Denominator: the number of observations (minus 1).
• The sum of all squared distances to the mean.
– If all observations are clustered around the mean, sum of distances will be
small.
– If observations are widely dispersed around the mean, the sum of distances will
be larger.
• It’s a summary measure of the average distance to the mean.
• If there is more dispersion, the standard deviation sy will be higher.
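A small Python sketch can make the formula concrete. The `ages` list is hypothetical (it is not the De Pijp data); the manual calculation follows the formula step by step and is then compared with `statistics.stdev`, which also divides by n − 1.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ages, only for illustration (not the De Pijp data).
ages = [23, 31, 35, 42, 58, 64]

y_bar = mean(ages)                                    # the mean
squared_distances = [(y - y_bar) ** 2 for y in ages]  # (y_i - mean)^2
s_y = sqrt(sum(squared_distances) / (len(ages) - 1))  # divide by n - 1, take the root

# The standard library gives the same sample standard deviation.
print(round(s_y, 2), round(stdev(ages), 2))
```

Spreading the ages further apart makes the squared distances, and therefore `s_y`, larger.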
18
CENTRALITY AND DISPERSION
• Equal means (0), different dispersion.
– Standard deviation is higher for the red dashed distribution.
19
COMPARING DISTRIBUTIONS
• Imagine we want to compare different positions in distributions.
• Age
– A 29-year-old goes to a music festival (Lowlands, Glastonbury).
– A 31-year-old goes to the Stopera to watch an opera.
– Who is relatively older?
• This all depends on the distribution of age at Lowlands and the
Stopera.
• How do we solve this?
20
Z-SCORE
$$ z = \frac{y_i - \bar{y}}{s} $$
Numerator: distance from observation yi to the mean ӯ.
Denominator: standard deviation s.
• Number of standard deviations between an observation and the mean.
• Independent of the original scale of the variable (age).
– We can compare age (29 year old at Lowlands, 31 year old at the
Stopera)
• How many standard deviations are both observations from
the mean?
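The festival and opera comparison can be sketched in Python. The slides do not give the age distributions at Lowlands or the Stopera, so the means and standard deviations below are invented purely for illustration.

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd

# ASSUMED distributions (hypothetical numbers, not from the lecture):
festival_mean, festival_sd = 24.0, 4.0   # age of festival visitors
opera_mean, opera_sd = 55.0, 12.0        # age of opera visitors

z_festival = z_score(29, festival_mean, festival_sd)
z_opera = z_score(31, opera_mean, opera_sd)

print(z_festival, z_opera)
```

Under these assumed numbers the 29-year-old is 1.25 standard deviations above the festival mean, while the 31-year-old is 2 standard deviations below the opera mean, so the festival visitor is relatively older even though the opera visitor is older in years.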
21
THE IMPORTANCE OF THE Z-SCORE
• Z-scores take into account differences in both centrality and
dispersion.
– Sometimes convenient when comparing different distributions.
• More important though: z-scores are a key concept in
inferential statistics.
• Z-scores help us to describe bell-shaped distributions.
– Normal distributions.
– Key concept in the first three lectures.
22
DISTRIBUTION OF DATA
23
DISTRIBUTION OF DATA
• Data can be distributed in different ways.
Skewed distribution
24
BELL-SHAPED DISTRIBUTIONS
(More) bell-shaped distribution
25
A PERFECT BELL-SHAPED DISTRIBUTION
The distribution is perfectly symmetrical around the mean ӯ.
EMPIRICAL RULE
[Figure: bell curve with the horizontal axis marked from ӯ – 3s to ӯ + 3s; 68% of the area lies within ӯ ± s, 95,4% within ӯ ± 2s and 99,7% within ӯ ± 3s]
• We can summarize all observations in bell-shaped distributions:
– 68% of all observations lie between ӯ-s and ӯ+s
– 95,4% of all observations lie between ӯ-2s and ӯ+2s
– 99,7% of all observations lie between ӯ-3s and ӯ+3s
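The three percentages of the Empirical Rule can be reproduced with Python’s standard library. This is a sketch that assumes a perfectly normal distribution; `NormalDist` comes from the `statistics` module (Python 3.8 or later), and the choice of mean 0 and standard deviation 1 is arbitrary because the shares are the same for every normal distribution.

```python
from statistics import NormalDist

# Any normal distribution gives the same shares; mean 0 and sd 1 are arbitrary.
dist = NormalDist(mu=0, sigma=1)

shares = {}
for k in (1, 2, 3):
    # Area under the curve between (mean - k*sd) and (mean + k*sd).
    shares[k] = dist.cdf(k) - dist.cdf(-k)
    print(f"within {k} standard deviation(s): {shares[k]:.3f}")
```

The three shares come out at roughly 0.683, 0.954 and 0.997, which matches the 68%, 95,4% and 99,7% in the rule.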
27
EXAMPLE PIJP DATA
So 68% of the observations are between 26,4 (42,8-16,4) and 59,2 (42,8+16,4).
[Figure: distribution from the De Pijp data with ӯ–s, ӯ and ӯ+s marked]
28
PROBABILITIES AND
PROBABILITY DISTRIBUTIONS
29
PROBABILITIES
• We just looked at frequency distributions.
– How many people live in the household?
– How many square meters is their house?
• But we can think of them as probability distributions as well.
• I pick one random inhabitant of De Pijp:
– What is the probability that he/she is older than 35?
– What is the probability that he/she has two or more children?
• We can determine this on the basis of the distribution!
• Probability = p
– The probability is the area under the curve.
30
PROBABILITY DISTRIBUTION
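[Figure: normal curve with the axis marked from ӯ – 3s to ӯ + 3s; 68% probability to be between ӯ – s and ӯ + s, 95,4% probability to be between ӯ – 2s and ӯ + 2s, 99,7% probability to be between ӯ – 3s and ӯ + 3s]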
31
NORMAL DISTRIBUTIONS
• We can apply this to all normal distributions.
[Figure: three normal curves, with ӯ=4 and s=1, ӯ=2 and s=0,5, and ӯ=4 and s=2]
32
STANDARD NORMAL DISTRIBUTION
• We can also apply this to the standard normal distribution.
• Theoretical distribution used in inferential statistics.
– Empirical distributions are hardly ever normally distributed.
– We use the standard normal distribution for calculations.
• Characteristics of the standard normal distribution:
– Bell-shaped
– Perfectly symmetrical
– ӯ=0, s=1 (mean=0, standard deviation=1)
33
PROBABILITY AND THE EMPIRICAL RULE
• Let’s apply the Empirical Rule to the standard normal distribution.
[Figure: standard normal curve; 68% (or 0,68) of the area lies between -1 and 1, and 95,4% (or 0,954) between -2 and 2]
34
Z-SCORES AND PROBABILITIES
• Z-scores can be translated into probabilities.
– In the standard normal distribution ӯ=0 and s=1, so z=yi:
$$ z = \frac{y_i - \bar{y}}{s} = \frac{y_i - 0}{1} = y_i $$
• Every position in a normal distribution has a z-score with a
corresponding probability.
– Can be found in Table A.
• For normally distributed variables we can convert z-scores to
probabilities (and the other way around).
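Looking up a z-score in Table A can be imitated in Python. The sketch below assumes that the table lists the area in one tail beyond the z-score, which fits the value of 0,1587 for z = 1 used later in the lecture; `tail_probability` is a helper name made up for this example.

```python
from statistics import NormalDist

def tail_probability(z: float) -> float:
    """Area in one tail beyond |z| under the standard normal curve.

    Assumed to correspond to the p-value listed in a one-tail z-table.
    """
    return 1 - NormalDist().cdf(abs(z))

# The sign of z does not matter because the curve is symmetrical.
print(round(tail_probability(1), 4))   # 0.1587
print(round(tail_probability(-1), 4))  # 0.1587
```

Going the other way, from a probability back to a z-score, is possible with `NormalDist().inv_cdf`.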
35
36
INFORMATION IN THE Z-TABLE
[Figure: normal curve centred on 0; Table A gives the p-value, the area in the tail beyond a given z-score]
37
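EXAMPLE USING TABLE A
[Figure: standard normal curve centred on 0; how large is the area between a z-score of -1 and a z-score of 1?]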
38
• 68% is between ӯ-s and ӯ+s.
• Let’s find evidence for this in Table A.
• What z-score do we use?
– 1
• What p-value corresponds to a z-score of 1?
– The Z-table shows that the p-value that corresponds to a z-score of 1 is 0,1587.
39
• The total area under every curve is 1.
– There’s a probability of 100% that you’re somewhere under the curve.
• The area we are looking for is: 1-(0,1587*2) = 1-0,3174 ≈ 0,68.
• 68% (0,68) is within the range of ӯ-s and ӯ+s, that is, between z=-1 and z=1.
• You can check the evidence for 95,4% (z-score of 2) and 99,7% (a z-score of 3) yourself.
[Figure: standard normal curve with an area of 0,1587 in each tail, below z=-1 and above z=1, and 0,68 in between]
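For readers who want to check their own answer, the same calculation can be written out for z = 2 and z = 3. The tail areas 0,0228 and 0,00135 are the usual standard normal table values for these z-scores; they are not quoted in the slides, so compare them with your own copy of Table A.

```latex
\begin{align*}
z = 1:&\quad 1 - 2 \times 0{,}1587 = 0{,}6826 \approx 68\% \\
z = 2:&\quad 1 - 2 \times 0{,}0228 = 0{,}9544 \approx 95{,}4\% \\
z = 3:&\quad 1 - 2 \times 0{,}00135 = 0{,}9973 \approx 99{,}7\%
\end{align*}
```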
40
APPLICATION OF ALL THIS
• Why are we doing this?
– This is just a theoretical story on probabilities and curves.
• We can apply the information from a theoretical probability
distribution to any empirical data that follow a normal
distribution.
• Using the z-scores of the empirical data, we can calculate
different probabilities (using the Z-table).
– What’s the probability that someone scores an 8 or higher for this
course?
• At the same time we can convert probabilities to real scores on
a variable.
– What grade is above the 95th percentile, and what grade marks the 25th percentile?
41
GOD
42
GOD
• In Western countries, a decreasing number of people are
religious.
• One explanation is that God is more relevant for older
generations.
• An alternative explanation is that God or religion has taken a
different meaning for younger people.
– “Ietsism”
43
YOUTH AND RELIGION
44
GOD (2)
• So maybe spirituality for younger cohorts does not have so much to do with “God”.
• Both variables are in De Pijp data.
– Do you believe in God?
– Do you believe in a higher power other than God?
• Is the probability that a younger individual believes in a
higher power other than God greater than the probability
that a younger individual believes in God?
• We define “younger” as being 30 or under.
45
DE PIJP DATA
46
AGE OF GOD-BELIEVERS
ӯ=44,3 and s=16,8
Let’s assume for now that this variable has a perfect normal distribution.
47
YOUNGSTERS AND GOD
• What is the probability that a God-believer is 30 years or younger?
[Figure: normal curve with mean 44,3; the probability p is the area below age 30]
48
FROM Z-SCORES TO PROBABILITY
• Let’s calculate the z-score
– Required information
1. Value of an observation: yi = 30
2. Mean: ӯ = 44,3
3. Standard deviation: s = 16,8
$$ z = \frac{y_i - \bar{y}}{s} = \frac{30 - 44{,}3}{16{,}8} = -0{,}85 $$
• What probability (p-value) corresponds to a z-score of (-)0,85?
– Why is it irrelevant if the z-score is negative or positive?
– Normal distributions are perfectly symmetrical, so the probability is the same irrespective of the tail!
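The lookup for the God-believers can be reproduced in Python instead of with the Z-table. This is a sketch under the same assumption the lecture makes, namely that age is perfectly normally distributed; the mean (44.3) and standard deviation (16.8) come from the slides, and the variable names are chosen for this example.

```python
from statistics import NormalDist

mean_age, sd_age = 44.3, 16.8   # God-believers in the De Pijp data (from the slides)
age = 30

z = (age - mean_age) / sd_age   # distance to the mean in standard deviations
z_rounded = round(z, 2)         # the Z-table works with two decimals

# Area under the standard normal curve to the left of the z-score.
p = NormalDist().cdf(z_rounded)

print(z_rounded, round(p, 4))   # -0.85 0.1977
```

Skipping the rounding of z gives a slightly different probability (about 0.197), because the table can only be entered with two decimals.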
49
P=0,1977 (so 19,77%).
The probability that a God-believer is 30 years or younger is 0,1977, or 19,77%.
[Figure: normal curve with mean 44,3; the tail beyond a z-score of 0,85 has an area of 0,1977]
50
IETSISM
51
AGE OF IETSISTS
ӯ=47,0 and s=16,3
52
YOUNGSTERS & HIGHER POWER OTHER THAN GOD
• What is the probability that a believer in a higher power other than God is 30 years or younger?
[Figure: normal curve with mean 47,0; the probability p is the area below age 30]
53
FROM Z-SCORES TO PROBABILITY
• Let’s calculate the z-score
– Required information
1. Value of an observation: yi = 30
2. Mean: ӯ = 47,0
3. Standard deviation: s = 16,3
$$ z = \frac{y_i - \bar{y}}{s} = \frac{30 - 47{,}0}{16{,}3} = -1{,}04 $$
• What probability (p-value) corresponds to a z-score of (-)1,04?
54
P=0,1492 (so 14,92%).
The probability that a believer in a higher power other than God is 30 years or younger is 0,1492, or 14,92%.
[Figure: normal curve with mean 47,0; the tail beyond a z-score of 1,04 has an area of 0,1492]
55
GOD AND IETSISM
• The probability that a God-believer is young is slightly higher than the probability that a believer in a higher power other than God is young.
– 19,8% for believers in God, 14,9% for believers in “something”.
• But
– The data are not normally distributed.
– This is an example to practice the relation between z-scores and probabilities.
56
FROM PROBABILITY TO REAL SCORES
• You can also do this the other way around.
• Imagine that we want to know the maximum age of the youngest 30% of believers in God.
– Here we don’t know yi
– But we do know the p-value.
• What’s the p-value?
– 0,30
57
WHAT DO AND DON’T WE KNOW?
$$ z = \frac{y_i - \bar{y}}{s} $$
• yi is unknown.
• ӯ is known (44,3)
• s is known (16,8)
• z can be looked up via the p-value
58
FROM P-VALUE TO Z-SCORE
A probability of 30% corresponds
to a z-score of (-)0,52
59
FROM Z-SCORES TO PROBABILITIES
• ӯ = 44,3; s = 16,8; normally distributed population.
• Value of an observation
1. Z-score: z = -0,52
2. Mean: ӯ = 44,3
3. Standard deviation: s = 16,8
$$ z = \frac{y_i - \bar{y}}{s} \quad\Rightarrow\quad y_i = \bar{y} + s \cdot z $$
$$ y_i = 44{,}3 + 16{,}8 \cdot (-0{,}52) = 35{,}6 $$
• What does this 35,6 mean?
– The 30% youngest God-believers are at most 35,6 years of age.
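Going from a probability back to a real score can also be sketched in Python. As before, this assumes a perfectly normal age distribution with the mean and standard deviation from the slides; `inv_cdf` is the inverse lookup (probability to score) in the `statistics` module.

```python
from statistics import NormalDist

mean_age, sd_age = 44.3, 16.8   # God-believers (from the slides)
p = 0.30                        # the youngest 30%

# Step 1: probability -> z-score on the standard normal distribution.
z = NormalDist().inv_cdf(p)

# Step 2: z-score -> real score, using y_i = mean + s * z.
cutoff_age = mean_age + sd_age * z

# The same answer in one step, on the distribution of age itself.
same_cutoff = NormalDist(mean_age, sd_age).inv_cdf(p)

print(round(z, 2), round(cutoff_age, 1))
```

The unrounded z-score is about -0.524, which puts the cutoff at roughly 35.5 years. The slides arrive at 35,6 because the z-score is first rounded to -0,52 for the Z-table.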
60
VISUAL REPRESENTATION
[Figure: normal curve with mean 44,3; the area of 0,30 lies below age 35,6, which corresponds to a z-score of (-)0,52]
61
OK, BUT WHY DO WE NEED THIS?
• Good question.
• TBH: We don’t really need it right now.
• But thinking about a normal distribution as a probability
distribution is crucial for inferential statistics.
• More specifically, we need this if we want to say something
about a population (inhabitants of De Pijp) on the basis of a
sample (De Pijp Data).
62
RECAP
1. We can compare z-scores between distributions.
2. The Empirical Rule.
– 68% of all observations can be found between ӯ-s and ӯ+s
– The probability to be between ӯ-s and ӯ+s is p=0,68 = 68%
3. A frequency distribution can be read as a probability distribution.
– Area under the curve depicts the probability.
4. On the basis of the standard normal distribution we are able to
convert z-scores to probabilities. We can find these probabilities
in the Z-table, and we can apply them to the De Pijp Data.
– Of course you can also go from probabilities to z-scores.
5. The probability that a God-believer is young is slightly larger
than the probability that a believer in a higher power other than God is young.
63
TUTORIAL GROUPS
• Z-scores and probabilities.
• Calculate it yourself!
• Introduction to SPSS.
64
SEE YOU WEDNESDAY!
(13.00)
65
```