Basics of Statistics for Data Science

Aditya Ranjan Behera
5 min read · Jan 2, 2021


“A data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” - Josh Wills

To become a data scientist, you need a strong knowledge of mathematics and statistics. Both are essential for data science because they form the foundation of machine learning algorithms. Mathematics also shows up in everything around us, from shapes and patterns to colours, and it underlies many aspects of our lives.

Here you will learn the basics and importance of math and statistics for data science and how they help in building ML algorithms.

History of Statistics

The word statistics is derived from the Latin word “Status” or the Italian word “Statista”, both of which mean “political state” or “government”. Historically, statistics was used by rulers. Its application was very limited, but rulers and kings needed information about the land, agriculture, commerce and population of their states to assess their military potential, wealth, taxation and other aspects of government.

Introduction to Statistics

So, what is statistics?

Statistics is a branch of science that deals with collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

It helps data scientists look for meaningful trends and changes when working on complex real-world problems. It applies mathematical computations to data to derive meaningful insights.

Terminologies in Statistics

There are some key terms you should know before working with statistics for data science:

1) Population is the entire group from which data is to be collected.

2) Sample is a subset of the population.

3) Variable is a characteristic, quantity or number that can be measured or counted. It is also called a data item.
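
As a minimal sketch of these three terms, assume a made-up population of employee ages. The numbers and the names `population` and `sample` are illustrative only, not taken from a real dataset. The variable here is age, and the sample is drawn at random using Python's standard library.

```python
import random

# Hypothetical population: the ages of all 12 employees in a small organization.
population = [23, 25, 29, 31, 34, 36, 41, 45, 48, 52, 57, 60]

# Fix the seed so the random draw is repeatable.
rng = random.Random(42)

# Sample: a subset of the population, drawn without replacement.
sample = rng.sample(population, k=4)

# Variable: the characteristic recorded for each member, here "age".
print("Population size:", len(population))
print("Sample size:", len(sample))
print("Sampled ages:", sample)
```

Every value in `sample` also appears in `population`, which is what makes it a subset.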

There are two types of analysis in statistics:

1) Quantitative analysis deals with numerical data such as age, height and income, and uses graphs to identify patterns and trends. It is also known as statistical analysis.

2) Qualitative analysis deals with classifying data into patterns in order to arrange it and draw conclusions. It can be applied to many forms of data, such as text, images and sound. It is also known as non-statistical analysis.

The main purpose of both types of analysis is to find insights and provide results.

Categories in statistics

There are two categories in statistics: descriptive statistics and inferential statistics.

Understanding Descriptive Statistics

Descriptive statistics deals with describing the data through numerical calculations, graphs or tables. It helps to organize the data and focuses on the characteristics that the data shows. A descriptive statistic is a summary statistic that quantifies or summarizes features of a collection of data.

Topics usually studied alongside descriptive statistics include random variables, the binomial distribution, the normal distribution (bell curve), the uniform distribution, the central limit theorem, central tendency, variability and the z-score.
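
Three of these ideas (central tendency, variability and the z-score) can be sketched in a few lines. The ages below are hypothetical, and the z-score is assumed to be computed as (value - mean) / standard deviation, using the population standard deviation.

```python
import statistics

# Hypothetical ages, chosen so the arithmetic is easy to follow.
ages = [20, 30, 40, 50, 60]

mean_age = statistics.mean(ages)    # central tendency
std_age = statistics.pstdev(ages)   # variability (population standard deviation)

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(age - mean_age) / std_age for age in ages]

for age, z in zip(ages, z_scores):
    print(f"age {age}: z-score {z:+.2f}")
```

A z-score of 0 marks a value equal to the mean, and the sign shows whether a value lies above or below it.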

Descriptive statistics differs from inferential statistics in that it aims to summarize a sample rather than use the data to learn about the population. It simply describes what the data shows. It is used to present quantitative descriptions of data in a manageable form, and it helps us simplify a large amount of data.

Suppose you want to study the ages of the employees of an organization. With descriptive statistics, you record the age of every employee and then find the maximum, minimum, average and distribution of those ages.
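
The employee-age example can be sketched as follows. The list of ages is invented for illustration, and grouping the ages into 10-year bands is just one assumed way of showing the distribution.

```python
import statistics
from collections import Counter

# Hypothetical ages of every employee in the organization.
ages = [23, 25, 29, 31, 34, 36, 41, 45, 48, 52, 57, 60]

oldest = max(ages)
youngest = min(ages)
average = statistics.mean(ages)

# Distribution: count employees per 10-year age band (20s, 30s, ...).
age_bands = Counter((age // 10) * 10 for age in ages)

print("Maximum age:", oldest)
print("Minimum age:", youngest)
print("Average age:", round(average, 2))
for band in sorted(age_bands):
    print(f"{band}-{band + 9}: {age_bands[band]} employees")
```

Nothing is estimated here: every employee is measured, so the numbers describe the whole group directly.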

Understanding Inferential Statistics

Inferential statistics deals with making inferences and predictions about a population based on a sample of data taken from that population. It generalizes from the sample to the larger dataset and applies probability methods to draw conclusions. It estimates the parameters of the population from the sample and builds models on it.

For example, to find the average age of the employees of an organization using inferential statistics, you take a sample, which is a few people from the entire organization. You may first group the employees into older, middle-aged and younger so that every group is represented. You then build a statistical model on the sample and extend its conclusions to the entire population of the organization.
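
A minimal sketch of this idea, under stated assumptions: the population is simulated, the sample is a simple random sample (no grouping by age), and the 95% interval uses the normal approximation with the critical value 1.96. None of these choices come from a real study.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical population: ages of 1,000 employees (unknown in practice).
population = [rng.randint(21, 60) for _ in range(1000)]

# We only get to observe a random sample of 50 employees.
sample = rng.sample(population, k=50)

sample_mean = statistics.mean(sample)
# Standard error of the mean, based on the sample standard deviation.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval for the population average age.
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(f"Estimated average age: {sample_mean:.1f}")
print(f"Approximate 95% interval: {lower:.1f} to {upper:.1f}")
print(f"True population average: {statistics.mean(population):.1f}")
```

The last line is only possible because the population is simulated. In a real setting the true average is unknown, and the interval expresses how uncertain the estimate is.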

Applications of Statistics

A number of specialities have evolved to apply statistical theory and methods to various disciplines. Certain topics have “statistical” in their name but relate to manipulations of probability distributions rather than to statistical analysis.

There are many fields where statistics can be applied. Some of them are discussed here.

1) The stock market is an important application of statistics. Here statistics is mainly used to forecast sales, demand or market share for various types of organizations. It is also used for factor analysis, conjoint analysis and multidimensional scaling.

2) The use of statistics in life science is known as biostatistics. It covers the development and application of statistical methods to a wide range of topics in biology, including the design of biological experiments, the collection and analysis of data from those experiments and the interpretation of the results.

3) Statistics is used in weather forecasting. Statistical methods are used to assess the potential predictability of climate and weather, to develop schemes for initializing dynamical forecasting models, to post-process dynamical forecasts and to forecast future weather and climate states.

4) In retail marketing, statistics is applied to measure customer satisfaction, brand loyalty and support. To implement a market-tracking program, the marketer needs access to company as well as industry statistics.

5) In insurance, statistics is mainly used for customer acquisition, risk analytics and fraud detection, as well as for estimating what percentage of policies is likely to pay out and how much money a company can expect to pay out in claims.

6) Statistics is very important in education, as it helps in collecting, presenting, analysing and interpreting data. It also helps in drawing general conclusions.

Summary

Here I have presented some basic knowledge of statistics that is essential for data science. I hope this blog helps you understand the basic concepts of statistics. For more details, you may refer to the book Think Stats by Allen B. Downey.


Aditya Ranjan Behera

A student in PG Diploma in Data Science at IIIT, Bangalore with an interest in data analysis, ETL, Machine learning and business problem-solving.