Data Analytics Learning Note
Python Topics:
Numpy arrays:
- NumPy, which stands for Numerical Python, is the key module for scientific computing in Python: it provides a convenient and efficient way to handle multi-dimensional arrays. A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
We can initialize numpy arrays from nested Python lists, and access elements using square brackets.
import numpy as np
a = np.array([1, 2, 3])
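A minimal sketch of the shape, rank, and indexing described above (the array values here are arbitrary examples):
import numpy as np

a = np.array([1, 2, 3])                # rank-1 array
print(a.shape)                         # (3,)
print(a[0], a[1])                      # 1 2

b = np.array([[1, 2, 3], [4, 5, 6]])   # rank-2 array built from nested lists
print(b.shape)                         # (2, 3)
print(b[1, 2])                         # 6 -- indexed by a tuple of nonnegative integers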
Pandas Dataframe
A DataFrame is a tabular data structure: each column is a Series, and the rows hold the records. Pandas is a library for reading data from files and building structured data frames.
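A minimal sketch of building a DataFrame (the column names and the file name are made-up examples):
import pandas as pd

# Build a DataFrame from a dict: each key becomes a column (a Series),
# and each row holds one record.
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [85, 92]})
print(df.head())

# Reading structured data from a file (hypothetical path):
# df = pd.read_csv('grades.csv')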
Matplotlib
Matplotlib is a Python library for plotting.
matplotlib.pyplot is a collection of command style functions.
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
plot() is a versatile command, and will take an arbitrary number of arguments. For example, to plot x versus y, you can issue the command:
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
- scipy.stats
- SciPy provides solutions to common scientific computing problems such as linear algebra, optimization, statistics, and sparse matrices. scipy.stats is the module that contains a large number of probability distributions as well as a growing library of statistical functions.
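A small sketch of scipy.stats, using the standard normal distribution and a made-up sample as the example:
from scipy import stats

# Probabilities from the standard normal distribution N(0, 1)
print(stats.norm.cdf(1.0))    # P(Z < 1) ≈ 0.8413
print(stats.norm.sf(1.0))     # P(Z > 1) ≈ 0.1587 (survival function)

# Summary statistics of a sample
print(stats.describe([2.1, 2.5, 1.9, 2.4]))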
Probability
Bayes' Theorem
For two events A and B, Bayes' Theorem states that
P(A | B) = P(B | A)P(A) / P(B)
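A worked sketch of the formula with purely hypothetical numbers (a test with 99% sensitivity, a 5% false-positive rate, and 1% prevalence):
# Hypothetical values, for illustration only
p_A = 0.01                 # P(A): prior probability of the condition
p_B_given_A = 0.99         # P(B | A): probability of a positive test given A
p_B_given_notA = 0.05      # P(B | not A): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))   # ≈ 0.167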
Discrete Random Variables
A random variable X is a numerical variable (i.e. something that varies) whose value in any one repetition of the experiment is not predictable with certainty.
If X is a discrete random variable, the distribution of X (also called the probability distribution of X, the probability mass function (pmf) of X, or the density function of X) is the set of possible values x of X together with the probability
f(x) = P(X = x).
Let X be a discrete random variable with distribution P(X = x). The mean value of X, or expected value of X, or population mean (sometimes loosely called the average value of X), is the centre of gravity of the distribution of X. It is denoted by the letter µ or by the symbol E(X) and is defined as
µ := E(X) = Σ x P(X = x).
The variance of X, also called the population variance, is denoted by either σ^2 or Var(X), and is defined by σ^2 := Var(X) = Σ (x − µ)^2 P(X = x). Finally, the standard deviation of X is +σ.
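A quick sketch of these formulas applied to a simple made-up discrete distribution (a fair die):
import numpy as np

x = np.arange(1, 7)              # possible values of X (a fair die)
p = np.full(6, 1/6)              # P(X = x) for each value

mu = np.sum(x * p)               # E(X) = Σ x P(X = x)
var = np.sum((x - mu)**2 * p)    # Var(X) = Σ (x − µ)^2 P(X = x)
sd = np.sqrt(var)                # standard deviation

print(mu, var, sd)               # 3.5, ≈2.917, ≈1.708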
If X is a continuous random variable, the definitions of the population mean µ = E(X) and the population variance σ^2 = Var(X) require calculus, so we will leave the full definitions aside. We shall interpret the mean, variance and standard deviation of a continuous X as the limit (as n → ∞) of the sample mean x̄, the sample variance s^2 and the sample standard deviation s, respectively.
Normal Distributions
Continuous random variables that have a normal distribution are the most frequently encountered.
The density function of a normal random variable that has mean µ and variance σ^2 is given by
f(x) = (1/(σ*sqrt(2π))) * exp(−(x − µ)^2/(2σ^2)), −∞ < x < ∞
(Should this be memorised?) We write this for short as X ∼ N(µ, σ^2), that is, X is distributed as normal with mean µ and variance σ^2. Probabilities for X can be found by integrating the formula above, or read from tables.
To use the table: if the variable X has a normal distribution, then Z will have an N(0, 1) distribution, where
Z = (X − µ)/σ.
Since the normal distribution is symmetric,
P(−1 < Z < 1) = 1 − P(Z > 1) − P(Z < −1) = 1 − 2P(Z > 1);
P(Z > −1) = P(Z < 1) = 1 − P(Z > 1)
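A sketch of these standard normal calculations using scipy.stats (the values of µ, σ and x are made up; printed values are approximate):
from scipy import stats

# Standardize: if X ~ N(mu, sigma^2), then Z = (X - mu)/sigma ~ N(0, 1)
mu, sigma = 100, 15
x = 115
z = (x - mu) / sigma                              # z = 1.0

print(stats.norm.sf(z))                           # P(Z > 1)  ≈ 0.1587
print(stats.norm.cdf(1) - stats.norm.cdf(-1))     # P(-1 < Z < 1) ≈ 0.6827
print(1 - 2 * stats.norm.sf(1))                   # same value, using symmetry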
Sampling Distributions
Suppose we take a random sample X1, X2, . . . , Xn of size n from a population that has mean µ and variance σ^2. Let X̄ be the mean of the random sample. Then
- if the population has a normal distribution (that is, if X ∼ N(µ, σ^2), where X denotes one random member of the population), then the distribution of X̄ is normal with mean µ and variance σ^2/n. That is, X̄ ∼ N(µ, σ^2/n).
- The Central Limit Theorem: no matter what distribution the population has, provided n is large, the distribution of X̄ is approximately normal with mean µ and variance σ^2/n. That is, X̄ ≈ N(µ, σ^2/n).
Z = (X̄ − µ)/(σ/sqrt(n))
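A simulation sketch of the Central Limit Theorem (the population, sample size, and number of replications are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)
n, reps = 40, 10000

# A decidedly non-normal population: exponential with mean 2 (variance 4)
samples = rng.exponential(scale=2, size=(reps, n))
xbars = samples.mean(axis=1)     # one sample mean per replication

print(xbars.mean())              # ≈ µ = 2
print(xbars.var())               # ≈ σ^2 / n = 4 / 40 = 0.1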
Useful links:
standard normal distribution table
Statistical Inference
Here and elsewhere, we use the notation zλ to denote the value of a standard normal variable Z that satisfies P(Z > zλ) = λ. For example, z0.1587 = 1 (i.e. P(Z > 1) = 0.1587).
Confidence interval
For an approximately normal distribution with unknown mean µ and known standard deviation σ, the appropriate formula for a 100(1 − α)% confidence interval is
x̄ ± (zα/2)(σ/sqrt(n))
For an approximately normal distribution with unknown mean µ and unknown standard deviation σ, we must replace σ by an estimate s (the sample standard deviation) and replace zα/2 by tn−1,α/2 (using the t-tables with n − 1 degrees of freedom (df)). The appropriate formula for a 100(1 − α)% confidence interval is
x̄ ± (tn−1,α/2)(s/sqrt(n))
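A sketch of the t-interval using scipy.stats (the data values are made up):
import numpy as np
from scipy import stats

data = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2])
n = len(data)
xbar, s = data.mean(), data.std(ddof=1)        # sample mean and sample standard deviation

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha/2, df=n - 1)    # t_{n-1, alpha/2}
half_width = t_crit * s / np.sqrt(n)

print(xbar - half_width, xbar + half_width)    # 95% confidence interval for µ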
Confidence Interval for a single Population Proportion
Let p be the unknown population proportion. Let p̂ be the sample proportion. The appropriate formula for a 100(1 − α)% confidence interval for p is
p̂ ± (zα/2) sqrt((p̂(1 − p̂))/n)
Confidence Interval for Differences in Means
Take two independent samples (one from each population) and calculate x̄1 and x̄2. If n1 and n2 are large (≥ 30) then x̄1 − x̄2 ∼ N(µ1 − µ2, σ1^2/n1 + σ2^2/n2).
A confidence interval for µ1 − µ2 is
x̄1 − x̄2 ± (zα/2) * sqrt(σ1^2/n1 + σ2^2/n2)
Hypothesis Test
- The null hypothesis H0 (the 'status quo')
- The alternative hypothesis H1 [or HA] (the research hypothesis)
- Collect data and ask "how likely is it to get such extreme data" if H0 is true.
- The alternative hypothesis can be one-sided or two-sided. A two-sided alternative tests whether the parameter differs from the null value in either direction (too big or too small).
Hypothesis Test Errors:
- Type I error: rejecting H0 when H0 is in fact true (its probability is the significance level α).
- Type II error: failing to reject H0 when H0 is in fact false.
Terms
- A test statistic is a function of the sample(s), itself random, whose observed value is used to help us decide whether or not to reject H0.
- A decision rule tells us for which values of the test statistic we should reject H0.
- The critical region is the set of values of the test statistic for which we will reject H0.
Hypothesis Testing Procedure
State the null hypothesis H0 and the alternative hypothesis H1.
Choose the significance level (often 5%).
Collect the random sample. State the decision rule.
Compute the test statistic (this measures the amount of disagreement between the hypothesis and the sample).
Identify the rejection region (the set of values such that only 5% (or whatever the significance level is) of the test statistic's distribution under H0 lies in this region), also known as the critical region. Be careful to distinguish a specific (one-tailed) H1 from a vague (two-tailed) H1.
If the test statistic is in the rejection region then reject H0 at that level of significance. Otherwise, accept H0.
Give a clear conclusion to interpret your result.
When the population variance is known and either the population is normal or the sample size is large:
zOBS = (x̄ − µ0)/(σ/sqrt(n))
T-Test
Population variance is unknown and the population is normal:
tOBS = (x̄ − µ0)/(s/sqrt(n)) with n − 1 degrees of freedom (df), where s = sqrt(Σ(xi − x̄)^2/(n − 1))
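A sketch of the one-sample t-test with scipy.stats (the data and µ0 are made up):
import numpy as np
from scipy import stats

data = np.array([5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2])
mu0 = 5.0                                    # null hypothesis value

# By hand, using the formula above
n = len(data)
t_obs = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))

# With scipy (two-sided p-value, n - 1 degrees of freedom)
t_stat, p_value = stats.ttest_1samp(data, popmean=mu0)
print(t_obs, t_stat, p_value)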
Difference Between Two Population Means
Assumptions: both populations normal, population variances σ1^2 and σ2^2 unknown but assumed equal, independent random samples; n1 is the size of sample 1 and n2 the size of sample 2.
tOBS = (x̄1 − x̄2)/(sp * sqrt(1/n1 + 1/n2)) with
n1 + n2 − 2 degrees of freedom (df),
sp = sqrt(((n1 − 1)*s1^2 + (n2 − 1)*s2^2)/(n1 + n2 − 2))
Assumptions: population of differences normal; paired random samples.
tOBS = (d̄ − D0)/(sd/sqrt(n)) with n − 1 df,
where d̄ is the mean of the differences, sd is the standard deviation of the differences, and D0 is the null hypothesis value for the difference in the population means.
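A sketch of the pooled two-sample and paired t-tests with scipy.stats (all data values are made up):
import numpy as np
from scipy import stats

x1 = np.array([12.1, 11.8, 12.5, 12.0, 11.9])
x2 = np.array([11.2, 11.5, 11.0, 11.6, 11.3])

# Pooled two-sample t-test (assumes equal population variances)
t2, p2 = stats.ttest_ind(x1, x2, equal_var=True)

# Paired t-test (before/after measurements on the same subjects)
before = np.array([80, 75, 90, 85, 70])
after = np.array([78, 74, 88, 84, 71])
tp, pp = stats.ttest_rel(before, after)

print(t2, p2, tp, pp)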
Binomial distributions
X ∼ Bin(n, p) [i.e. X has the binomial distribution with parameters n and p]. Thus the density function of X is
f(x) = P(X = x) = C(n, x) p^x (1 − p)^(n − x), x = 0, 1, …, n, where C(n, x) = n!/(x!(n − x)!).
If n is large (np and np(1 − p) are both at least 5), it follows from the Central Limit Theorem that the distribution of X is approximately
N(µ, σ^2), where µ = np and σ^2 = np(1 − p).
Letting p̂ := X/n be the sample proportion of successes, we can write the above result as: the distribution of
p̂ is approximately N(p, p(1 − p)/n).
Upon standardizing as before, the above result is the same as writing that the distribution of
Z = (p̂ − p0)/sqrt(p0(1 − p0)/n) is approximately N(0, 1).
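A sketch of the one-sample proportion z-test via the normal approximation (the counts and p0 are invented):
import numpy as np
from scipy import stats

x, n = 58, 100                  # 58 successes out of 100 trials
p0 = 0.5                        # null hypothesis proportion
p_hat = x / n

# Normal approximation is reasonable when np0 and np0(1 - p0) are large enough
z_obs = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 2 * stats.norm.sf(abs(z_obs))   # two-sided p-value

print(z_obs, p_value)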
Chi-Squared Test
χ^2 OBS = Σ ((O − E)^2 / E) with k − c − 1 df,
where the O are the observed frequencies, the E are the (estimated) expected frequencies under H0, k is the number of categories, and c is the number of parameters estimated in order to compute the E.
Independence of Two Classifications
Here we want to test the independence of two categorical variables (or quantitative variables whose values have been broken into categories)
For an r × c contingency table, the test statistic is
χ^2 OBS = Σ ((O − E)^2 / E) with (r − 1)(c − 1) df
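A sketch of the r × c independence test with scipy.stats.chi2_contingency (the table counts are made up):
import numpy as np
from scipy import stats

# 2 x 3 contingency table of observed frequencies
observed = np.array([[20, 30, 25],
                     [30, 20, 25]])

chi2, p_value, df, expected = stats.chi2_contingency(observed)
print(chi2, p_value, df)        # df = (r - 1)(c - 1) = 2
print(expected)                 # expected frequencies under H0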
Correlation and Regression
Correlation concerns the strength of the relationship between the values of two variables. Regression analysis determines the nature of that relationship and enables us to make predictions from it.
Simple Linear Regression
Y = α + βX
Regression Equation: Expected value of y at a given level of x
E(yi |xi) = α + βxi
The predicted value for an individual
yi = α + βxi + random error εi
Regression Equation
ŷ = a0 + a1x
Where
- a0 = the y-intercept (i.e. where the line crosses the y-axis)
- a1 = slope of regression line
To calculate these from the data, use the least-squares estimates: a1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)^2 and a0 = ȳ − a1 x̄.
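A sketch of fitting the regression line with scipy.stats.linregress (the x and y values are invented):
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

result = stats.linregress(x, y)
a1, a0 = result.slope, result.intercept      # slope and y-intercept
print(a0, a1)

y_hat = a0 + a1 * 7                          # predicted value at x = 7
print(y_hat)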
Correlation
Every correlation has two qualities: strength and direction. The direction of a correlation is either positive or negative. In a negative correlation, the variables move in inverse, or opposite, directions. In other words, as one variable increases, the other variable decreases.
Strength of Correlation
The strength of the correlation is measured by the correlation coefficient, also known as the Pearson product-moment correlation or Pearson's correlation. When measured in a population it is designated by the Greek letter ρ; when applied to a sample it is commonly represented by r.
Rankings of Correlation Strength:
Correlation Coefficient    Strength of Correlation
0.0 ~ 0.2                  Very weak, negligible
0.2 ~ 0.4                  Weak, low
0.4 ~ 0.7                  Moderate
0.7 ~ 0.9                  Strong
0.9 ~ 1.0                  Very strong
Hypothesis Test
The sample correlation coefficient r is the estimator of the population correlation coefficient ρ. A relationship observed in a sample does not necessarily hold in the population, so we use a hypothesis test to check whether the sample correlation is statistically significant.
Correlation t-test
t = r / sqrt((1 − r^2)/(n − 2)) with n − 2 df
State the null and alternative hypotheses
Decide on the significance level, α
State the decision rule
Compute the value of the correlation test statistic
Determine the critical value(s)
If the value of the test statistic falls in the rejection region, reject H0
Interpret results
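A sketch of this procedure; scipy.stats.pearsonr returns r together with the two-sided p-value (the data are made up):
import numpy as np
from scipy import stats

x = np.array([2, 4, 5, 7, 9, 10, 12])
y = np.array([1, 3, 4, 6, 8, 10, 11])
n = len(x)

r, p_value = stats.pearsonr(x, y)
t_obs = r / np.sqrt((1 - r**2) / (n - 2))    # same test statistic as above
print(r, t_obs, p_value)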
Multiple regression analysis
y = a0 + β1x1 + β2x2 + … + βkxk + ε
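A sketch of estimating the multiple-regression coefficients by least squares with NumPy (the design matrix and responses are made up):
import numpy as np

# Two predictors x1, x2 plus an intercept column of ones
X = np.column_stack([np.ones(5),
                     [1, 2, 3, 4, 5],        # x1
                     [2, 1, 4, 3, 5]])       # x2
y = np.array([3.1, 3.9, 6.8, 7.2, 9.5])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, b1, b2 = coef
print(a0, b1, b2)                            # intercept and slopes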
Big Data
3Vs: Volume, Velocity, and Variety
Data Volume:
- 44x increase from 2009 to 2020
- From 0.8 zettabytes to 35 ZB
- Data volume is increasing exponentially
Variety (Complexity)
- Data Types
- Relational Data (Tables/Transaction/Legacy Data)
- Text Data (Web)
- Semi-structured Data (XML)
- Graph Data
- Social Network, Semantic Web (RDF), …
- Streaming Data
- A single application can be generating/collecting many types of data
- Big Public Data (online, weather, finance, etc)
- To extract knowledge, all these types of data need to be linked together
Velocity (Speed)
- Data is being generated fast and needs to be processed fast
- Online Data Analytics would be required
- Late decisions might mean missed opportunities
Map Reduce Algorithm:
The MapReduce algorithm is mainly used to process huge amounts of data in parallel in cluster environments.
Map: (input shard) → intermediate(key/value pairs)
- Map calls are distributed across machines by automatically partitioning the input data into M "shards".
- Groups together all intermediate values associated with the same intermediate key & pass them to the Reduce function
Reduce: intermediate(key/value pairs) → result files
- Accepts an intermediate key & a set of values for the key
- It merges these values together to form a smaller set of values
- Reducer is applied to all values associated with the same key
- Reduce calls are distributed by partitioning the intermediate key space into R pieces using a partitioning function
- (e.g., hash(key) mod R).
- The user specifies the # of partitions (R) and the partitioning function.
- The Reduce Workers can only start working when every single map worker has completed its task
- All <key, value> pairs must be generated before any reducer can run
Complete Picture
- Split input files into chunks: Break up the input data into M pieces
- Fork processes:
- Start up many copies of the program on a cluster of machines
- 1 master: scheduler & coordinator
- Lots of workers
- Idle workers are assigned either:
- map tasks (each works on a shard) – there are M map tasks
- reduce tasks (each works on intermediate files) – there are R
- R = # partitions, defined by the user
- Map Task
- Reads contents of the input shard assigned to it
- Parses key/value pairs out of the input data
- Passes each pair to a user-defined map function
- Produces intermediate key/value pairs
- Buffered in memory
- Create intermediate files
- Intermediate key/value pairs produced by the user’s map function
- buffered in memory and are periodically written to the local disk
- Partitioned into R regions by a partitioning function
- Notifies master when complete
- Passes locations of intermediate data to the master
- Master forwards these locations to the reduce worker
- Partitioning
- Map data will be processed by Reduce workers
- The user’s Reduce function will be called once per unique key generated by Map.
- All data need to be sorted by key
- Partition function: decides which of R reduce workers will work on which key
- Default function: hash(key) mod R
- Map worker partitions the data by keys
- Each Reduce worker will read their partition from every Map worker
- Reduce Task: Sorting
- Reduce worker gets notified by the master about the location of intermediate files for its partition
- Uses RPCs to read the data from the local disks of the map workers
- When the reduce worker reads intermediate data for its partition
- It sorts the data by the intermediate keys
- Reduce Task: Reduce
- All occurrences of the same key are grouped together after being sorted by intermediate keys
- The sorting phase grouped data with a unique intermediate key
- User’s Reduce function is given the key and the set of intermediate values for that key
- < key, (value1, value2, value3, value4, …) >
- Return to user
- When all map and reduce tasks have completed, the master wakes up the user program
- The MapReduce call in the user program returns and the program can resume execution
- The output of MapReduce is available in R output files
Word Count Example
Count # occurrences of each word in a collection of documents
Map:
Parse data; output each word and a count (1)
Reduce:
Sort: sort by keys (words)
Reduce: Sum together the counts for each key (word)
map(String doc_id, String value):
    // key: document name, value: document contents
    for each word w in value:
        Emit(w, 1);

reduce(String key, Iterator<int> values):
    // key: a word; values: a list of counts
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(key, sum);
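A runnable Python sketch of the same word-count logic, with the shuffle/sort step simulated in memory (the input documents are made up):
from collections import defaultdict

docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog the end"}

# Map: emit (word, 1) for every word in every document
intermediate = []
for doc_id, text in docs.items():
    for word in text.split():
        intermediate.append((word, 1))

# Shuffle/sort: group all values for the same key together
groups = defaultdict(list)
for key, value in sorted(intermediate):
    groups[key].append(value)

# Reduce: sum the counts for each key (word)
for word, counts in groups.items():
    print(word, sum(counts))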
The Combiner
A combiner is a local aggregation function for repeated keys produced by the same map task, for associative operations like sum, count, and max. It decreases the size of the intermediate data.
def combiner(key, values):
    output(key, sum(values))
Computing Mean
Without Combiner:
map(string t, integer r):
    Emit(string t, integer r);
reducer(string key, Iterator<int> values):
    int sum = 0, cnt = 0;
    for each v in values:
        sum += v;
        cnt += 1;
    avg = sum/cnt;
    Emit(key, avg);

With Combiner:
map(string key, integer r):
    Emit(string key, pair(r,1));
combiner(string key, pairs [(s1,c1),(s2,c2),...]):
    int sum = 0, cnt = 0;
    for each pair (s,c) in pairs:
        sum += s;
        cnt += c;
    Emit(key, pair(sum,cnt));
reducer(string key, pairs [(s1,c1),(s2,c2),...]):
    int sum = 0, cnt = 0;
    for each pair (s,c) in pairs:
        sum += s;
        cnt += c;
    avg = sum/cnt;
    Emit(key, avg);
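Note why the combiner emits (sum, count) pairs rather than a plain sum: an average of averages is not the overall average, but partial sums and counts can be merged safely. A runnable Python sketch of the with-combiner logic (the per-mapper values below are made up):
# Each inner list is one mapper's local values for the same key
mapper_outputs = [[3, 5, 7], [4, 6]]

# Combiner: runs once per mapper, aggregating locally to a (sum, count) pair
combined = [(sum(vals), len(vals)) for vals in mapper_outputs]   # [(15, 3), (10, 2)]

# Reducer: merges the partial pairs, then computes the average once
total = sum(s for s, c in combined)
count = sum(c for s, c in combined)
print(total / count)             # 25 / 5 = 5.0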
Sort and Shuffle in Hadoop
Shuffling is the process of transferring the mappers' intermediate output to the reducers. The MapReduce library automatically sorts the keys generated by the mappers, so before a reducer starts, all intermediate key-value pairs are sorted by key (not by value). This helps the reducer easily recognize when a new reduce task should start, which saves time in the reduce phase.