Mathematics for Data Science
What Is Data Science, and Why Is Math Required?
Data science is a way of solving business problems with mathematics to reach solutions that are faster and simpler.
Tough definition! Let’s simplify it.
Remember: Data Science = Mathematics + Computer Science
That means: computing on data using mathematical methods.
To become a data scientist, you need a good understanding of programming languages and machine learning algorithms, and you need to follow a data-driven approach.
Let’s Dig Into the Mathematical Concepts
1. Statistics:
Statistics is the practice of collecting, analysing, presenting, and interpreting data to support more effective decisions. It extracts information from data.
Statistics shows up in almost every domain, for example:
- 20% Chance of rain
- Batting Averages
- Chance of drawing the king of spades
- Share fall or rise in Stock Market
Variables: any characteristic of an entity.
Qualitative: values on which arithmetic operations are not meaningful. Also called categorical variables, they are subdivided into ordinal (ordered categories) and nominal (unordered categories).
Quantitative: values on which arithmetic operations are meaningful. Also called numeric variables.
- Population: the complete set of sources from which data could be collected.
- Sample: Subset of Population.
Statistics is subdivided into:
- Descriptive statistics: methods for organising, summarising, and presenting data.
- Inferential Statistics: methods for estimating what the population characteristics might be, given what is known about the sample's characteristics.
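The difference between describing a sample and inferring something about a population can be sketched in a few lines. This is only an illustration under stated assumptions: the population (the integers 1 to 1000), the sample size, and the seed are all made up for the example.

```python
import random
import statistics

# Hypothetical population: the integers 1..1000 (illustration only).
population = list(range(1, 1001))

# Fixed seed so the sketch is repeatable.
rng = random.Random(42)

# A sample is a subset of the population, drawn here without replacement.
sample = rng.sample(population, 100)

# Descriptive statistics: summarise the data we actually observed.
sample_mean = statistics.mean(sample)

# Inferential statistics: treat the sample mean as an estimate of the
# population mean, which in practice is usually unknown.
population_mean = statistics.mean(population)

print("sample mean (estimate):", sample_mean)
print("population mean (true value):", population_mean)
```

In real work the population mean cannot be computed directly; it is shown here only so the estimate can be compared with it.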
2. Probability Distribution: a statistical function that describes the possible values a random variable can take within a given range, and how likely each value is.
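As a small sketch, a discrete probability distribution can be stored as a mapping from each possible value to its probability. The fair six-sided die used here is an assumed example, not something taken from the text above.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, each face equally likely.
die_distribution = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible values must sum to 1.
total_probability = sum(die_distribution.values())

# Probability that the roll falls within a range, here 2 to 4 inclusive.
p_two_to_four = sum(
    p for face, p in die_distribution.items() if 2 <= face <= 4
)

print(total_probability)  # 1
print(p_two_to_four)      # 1/2
```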
Measures of Central Tendency:
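1. Mean: the arithmetic average, that is, the sum of all values divided by the number of values.
2. Median: the middle value when the data are sorted (the average of the two middle values when the count is even).
3. Mode: the most frequently occurring value.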
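The three measures can be tried out with Python's built-in statistics module. The data values below are invented for illustration.

```python
import statistics

# Hypothetical data set (already sorted for readability).
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # 30 / 6 = 5
median_value = statistics.median(values)  # (3 + 5) / 2 = 4.0
mode_value = statistics.mode(values)      # 3 occurs twice

print(mean_value, median_value, mode_value)
```

Note how the single large value (10) pulls the mean above the median; the median is less sensitive to extreme values.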
Measures of the Spread
Just as there are measures of centre, there are measures of spread:
1. Range: the difference between the largest and smallest values in a data set.
2. Interquartile Range (IQR): a measure of variability based on dividing a data set into quartiles; it is the difference between the third and first quartiles (Q3 − Q1).
3. Variance: describes how much a random variable differs from its expected value. It is computed from squared deviations.
   - Deviation is the difference between an element and the mean.
   - Population variance is the average of the squared deviations, that is, their sum divided by N.
   - Sample variance is the sum of squared deviations from the sample mean divided by n − 1 rather than n.
4. Standard Deviation: the square root of the variance; it measures the dispersion of a data set around its mean, in the same units as the data.
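A minimal sketch of these measures using the standard library follows. The data set is made up, and the quartiles rely on the default method of statistics.quantiles; other tools use slightly different quartile conventions, so IQR values can differ between libraries.

```python
import statistics

# Hypothetical data set with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest value minus smallest value.
value_range = max(values) - min(values)  # 9 - 2 = 7

# IQR: third quartile minus first quartile.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Population variance divides the squared deviations by N.
population_variance = statistics.pvariance(values)  # 32 / 8 = 4

# Sample variance divides by n - 1 instead.
sample_variance = statistics.variance(values)       # 32 / 7

# Standard deviation is the square root of the variance.
population_std = statistics.pstdev(values)          # 2.0

print(value_range, iqr, population_variance, sample_variance, population_std)
```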
3. Bernoulli Distribution:
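It is a discrete probability distribution for a single trial with only two possible outcomes, “success” or “failure”.
The Bernoulli distribution is the special case of the binomial distribution with a single trial (n = 1). The binomial distribution gives the probability of r successes in n independent trials:
P(X = r) = C(n, r) · p^r · (1 − p)^(n − r)
Where p : probability of success in a single trial
1 − p : probability of failure in a single trial
n : total number of trials
r : number of successes desired
C(n, r) : number of ways to choose r successes out of n trials, n! / (r! (n − r)!)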
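To see how p, n, and r fit together, here is a short sketch of the standard binomial probability. The function name binomial_pmf and the coin-toss numbers are chosen for this example only.

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials,
    each with success probability p: C(n, r) * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Hypothetical example: exactly 3 heads in 5 tosses of a fair coin.
p_three_heads = binomial_pmf(3, 5, 0.5)  # 10 * 0.125 * 0.25 = 0.3125

# With n = 1 the binomial reduces to the Bernoulli distribution:
# the probability of one success in one trial is simply p.
p_single_success = binomial_pmf(1, 1, 0.3)

print(p_three_heads, p_single_success)
```

Summing binomial_pmf(r, n, p) over every r from 0 to n gives 1, as expected for a probability distribution.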
4. Normal Distribution:
It is a bell-shaped curve that is symmetric about the mean. The area under the curve over a given range is the probability of a value falling in that range, so the total area under the curve equals 1, because all probabilities sum to 1.
A special case of the normal distribution has mean 0 and standard deviation 1. This is called the standard normal distribution.
Standard Normal Distribution
It simplifies the integral computations needed to calculate probabilities for a normal distribution. To convert a value x from any normal distribution with mean μ and standard deviation σ to standard normal form, compute the z-score: z = (x − μ) / σ
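The conversion to standard normal form can be checked numerically with NormalDist from Python's statistics module. The mean, standard deviation, and x value below are assumptions picked for the example.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
mu, sigma = 100, 15
x = 130

# z-score: how many standard deviations x lies from the mean.
z = (x - mu) / sigma  # (130 - 100) / 15 = 2.0

# Probability of a value at or below x, computed two ways:
# via the standard normal distribution and via the original one.
p_via_z = NormalDist(mu=0, sigma=1).cdf(z)
p_direct = NormalDist(mu=mu, sigma=sigma).cdf(x)

print(z, p_via_z, p_direct)
```

Both probabilities agree (roughly 0.977), which is why a single standard normal table is enough for any normal distribution.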
Hypothesis Testing:
A hypothesis is an assumption about a population parameter.
Hypothesis testing involving one population focuses on checking claims such as “the population average is equal to a specific value”.
Through hypothesis testing, you can determine whether the sample provides enough evidence to reject the hypothesis about the population parameter.
Hypothesis testing starts with the formulation of these two hypotheses:
Null hypothesis (H₀): represents the status quo and states that the population mean is greater than or equal to (≥), equal to (=), or less than or equal to (≤) a specific value.
Alternate hypothesis (H₁): represents the opposite of the null hypothesis and is accepted when the data give enough evidence to reject the null hypothesis. It is also called the research hypothesis, because it is usually the claim the researcher hopes to support.
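One possible way to test a claim about a population mean is a two-sided one-sample z-test, sketched below. This is not the only test available, and everything specific here is an assumption for illustration: the sample values, the claimed mean mu_0, the known population standard deviation sigma, and the 5% significance level alpha.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical sample of 10 measurements.
sample = [52, 49, 55, 51, 53, 50, 54, 52, 56, 48]

mu_0 = 50     # H0: the population mean equals 50
sigma = 3     # assumed known population standard deviation
alpha = 0.05  # assumed significance level

sample_mean = mean(sample)  # 520 / 10 = 52

# Test statistic: distance of the sample mean from mu_0,
# measured in standard errors.
z_stat = (sample_mean - mu_0) / (sigma / sqrt(len(sample)))

# Two-sided p-value: probability, if H0 were true, of a statistic
# at least this far from zero in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Reject H0 in favour of H1 (mean != 50) when p_value < alpha.
reject_null = p_value < alpha

print(round(z_stat, 3), round(p_value, 4), reject_null)
```

When the population standard deviation is not known, a t-test is the usual alternative; that variant is not shown here.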
5. Linear Algebra and Calculus:
Linear algebra uses vector and matrix operations to determine the properties of linear systems. It covers vectors, vector spaces, and matrix theory, including the linear independence of vectors and the vector spaces underlying sets of vectors and matrices.
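A few of these ideas can be sketched in plain Python without any library. The helper names (dot, mat_vec, det_2x2) and the matrices are hypothetical and limited to the 2×2 case; real projects would normally use a numerical library instead.

```python
def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, vector):
    """Matrix-vector product: one dot product per matrix row."""
    return [dot(row, vector) for row in matrix]


def det_2x2(matrix):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = matrix
    return a * d - b * c


A = [[2, 1],
     [1, 3]]
x = [1, 2]

Ax = mat_vec(A, x)  # [2*1 + 1*2, 1*1 + 3*2] = [4, 7]

# The rows of a 2x2 matrix are linearly independent
# exactly when its determinant is non-zero.
a_independent = det_2x2(A) != 0  # det = 2*3 - 1*1 = 5

# Second row is twice the first, so the rows are dependent.
B = [[1, 2],
     [2, 4]]
b_independent = det_2x2(B) != 0  # det = 1*4 - 2*2 = 0

print(Ax, a_independent, b_independent)
```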
We'll continue the topic in detail at a later point✌.