Friday, February 20, 2015

Basic Statistics for Business Analytics



21 Feb 2015, Shailendra Kadre and Venkat Reddy



By the end of this blog post, you will have a basic idea of the following concepts, which are essential for proceeding with business analytics techniques:
1.   The difference between population and sample
2.   Different types of sampling
3.   The difference between variable and parameter
4.   The differences between descriptive, inferential, and predictive statistics
5.   The steps involved in solving a business analytics problem
1.   Population and Sample
A population is the complete set of objects or data records that are available for an analytics project or data analysis. For example, in a countrywide marketing campaign, a narrowed-down list of the country’s citizens will form the population for the analytics problem. Generally, it might not be possible to analyze the entire population because of the sheer size of the data or because of limited time, funding, or processing power. These reasons may compel you to consider only a subset of the population. This subset is usually referred to as a sample in statistical terminology. If the sample is properly chosen, analyzing it can be as good as analyzing the full population.

2.   Different Types of Sampling

A sample can be formally defined as the subset of a population that is selected for analysis. The procedure of creating or collecting this subset is called sampling. Sometimes, it might be necessary to manually collect some records from the overall population. There are several types of sampling techniques. The following are the ones that are most commonly used in business analytics projects.
Simple random sampling is the most commonly used sampling method. Randomly choosing n records from a population, where n denotes the sample size, is called simple random sampling. There are several methods for deciding on the right sample size. Sometimes the business problem that we are handling gives us an idea of the sample size. Once the sample size (n) has been decided based on one of these methods, n records are randomly selected from the population. Convenient functions are available in SAS for this purpose.
A classic example of random sampling is of a blindfolded man picking up ten apples from a basket full of apples. All the apples have an equal probability of being picked from the basket.
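As an illustration only, simple random sampling can be sketched in a few lines. The post refers to SAS functions; the sketch below uses Python's standard library instead, and the population of record IDs, the function name simple_random_sample, and the seed are all assumptions made up for this example.

```python
import random

# Hypothetical population: 1,000 customer record IDs (made up for illustration).
population = list(range(1, 1001))


def simple_random_sample(records, n, seed=None):
    """Draw n records without replacement; every record is equally likely to be picked."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(records, n)


sample = simple_random_sample(population, n=10, seed=42)
print(sample)
```

Passing a seed is optional; it is used here only so that the same ten records come back on every run.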
Consider an example population that has preexisting segments of the same or different sizes. Segments are subgroups into which the population records have already been classified. In such a case, it is best to draw a random sample from each segment, because such a sample will truly represent the nature of the population.
The number of records drawn from each segment can be based upon the proportion of that segment in the entire population. Such segments are usually referred to as strata. The process of simple random sampling from each stratum is called stratified sampling. Segments can be manually created, and stratified sampling can be performed even when there are no obvious segments in the population.
For example, if 1,000 random candidates are to be picked from across the country for a sporting event, it might be a good idea to pick them proportionately from each state.
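A minimal sketch of proportional stratified sampling follows, assuming a small made-up list of candidates tagged with hypothetical state names. The function stratified_sample and its parameters are illustrative and are not taken from the post or from any library.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_of, total_n, seed=None):
    """Simple random sampling within each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    sample = []
    for members in strata.values():
        # Proportional allocation; rounding means the total can differ slightly from total_n.
        n_stratum = round(total_n * len(members) / len(records))
        sample.extend(rng.sample(members, n_stratum))
    return sample


# Hypothetical candidates: (id, state) pairs with made-up state names.
candidates = (
    [(i, "State A") for i in range(500)]
    + [(i, "State B") for i in range(500, 800)]
    + [(i, "State C") for i in range(800, 1000)]
)

picked = stratified_sample(candidates, stratum_of=lambda c: c[1], total_n=100, seed=7)
counts = Counter(state for _, state in picked)
print(counts)
```

With strata of 500, 300, and 200 candidates, a sample of 100 is split 50, 30, and 20, mirroring each state's share of the population.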
Systematic sampling is based on a fixed rule, like picking every fifth or seventh observation from a given population. It is different from random sampling, wherein records are picked at random. This type of sampling is generally done if testing is a continuous process. Recording the room temperature every 60 minutes or measuring the blood pressure of a patient every 10 minutes are examples of systematic sampling.
Example: Consider a mass manufacturing machine that produces simple bolts to be used in a chemical plant erection project. Every 30th bolt manufactured by the machine can be collected as a sample. This may look like a random sample from the whole lot, but you are not actually waiting for the whole lot to form; instead, you are collecting your sample well before the whole lot has been produced.
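The bolt example can be sketched as picking every 30th item from a stream that is still being produced. This is only an illustration: the endless counter stands in for a hypothetical production line, and every_kth is a made-up helper name.

```python
from itertools import count, islice


def every_kth(stream, k):
    """Lazily yield every k-th item from a (possibly endless) stream, starting with the k-th."""
    return islice(stream, k - 1, None, k)


# Hypothetical production line: bolts numbered 1, 2, 3, ... with no fixed end.
bolt_serials = count(1)

# Collect the first five sampled bolts without waiting for the whole lot.
sample = list(islice(every_kth(bolt_serials, 30), 5))
print(sample)  # [30, 60, 90, 120, 150]
```

Because the stream is consumed lazily, the sample is collected while production continues, which is the point the example makes about not waiting for the whole lot.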

3.   The difference between variable and parameter

Simply put, a variable in a statistical data table is nothing but a column or a field in the table, a feature that may change its value from one record to another. It may be numeric, meaning it can be measured for each record, or non-numeric, such as city, gender, or a status field containing Yes or No entries. Other examples are age, monthly income, daily sales, and cost data. The following are the major types of variables that a population or a sample may contain.

Non-numeric variables, also called qualitative or categorical variables, represent a quality or a characteristic.
Examples are shirt sizes expressed as S, M, L, XL, and XXL, or distance expressed as near and far. A non-numeric variable can also be a Boolean value, such as a pass/fail or a yes/no field.
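To make the distinction concrete, here is a small sketch that labels each column of a made-up data table as numeric or non-numeric. The records, the column names, and the helper variable_kind are assumptions for illustration only.

```python
# Hypothetical records; each key is a variable (a column in the data table).
records = [
    {"age": 34, "monthly_income": 5200.0, "city": "Austin", "shirt_size": "M", "passed": True},
    {"age": 51, "monthly_income": 7400.5, "city": "Denver", "shirt_size": "XL", "passed": False},
    {"age": 27, "monthly_income": 3900.0, "city": "Boston", "shirt_size": "S", "passed": True},
]


def variable_kind(values):
    """Label a column numeric if every value is an int or float (Booleans excluded)."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "non-numeric"


kinds = {name: variable_kind([r[name] for r in records]) for name in records[0]}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Booleans are excluded from the numeric check on purpose, since a pass/fail field is treated here as a non-numeric variable.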

Parameter


A parameter is a measure that is calculated on the entire population. Any summary measure that gives information about the population is called a parameter.
For example, take the data on electricity utility bills of an entire state like California. It will be huge by any standard because it contains variables such as name, address, type of connection, month, units consumed, and bill amount for all households in the state. Now, for planning purposes, such as forecasting the electricity demand in the state for the next five years, suppose you calculate the averages of variables like units consumed and bill amount across all the state’s households. These averages are termed parameters. So, two example parameters, the entire state’s average units consumed per household and the average bill amount, may look like 650 units and $100, respectively. These parameters are calculated on the entire population, which might be really large at times. So, it’s not hard to see that calculating them may require a huge amount of computational effort.
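As a toy illustration, the two parameters can be computed by averaging over every record in the population. The three households below are made-up numbers chosen so that the averages match the figures quoted above; a real state-wide table would hold millions of rows.

```python
import statistics

# Hypothetical population: every household in a (tiny) made-up state.
households = [
    {"units_consumed": 600, "bill_amount": 90.0},
    {"units_consumed": 700, "bill_amount": 110.0},
    {"units_consumed": 650, "bill_amount": 100.0},
]

# Parameters are summary measures computed over the ENTIRE population.
avg_units = statistics.mean(h["units_consumed"] for h in households)
avg_bill = statistics.mean(h["bill_amount"] for h in households)

print(f"Average units consumed per household: {avg_units}")
print(f"Average bill amount: ${avg_bill:.2f}")
```

The cost of the calculation grows with the number of households, which is why parameters on very large populations can be expensive to compute.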


There are three methods of statistical analysis: descriptive, inferential, and predictive. In descriptive statistics, the data is simply summarized using measures of central tendency and variation. In inferential statistics, a sample is drawn from the population and used to draw conclusions about the full set of data, that is, the population. Predictive statistics, as the name suggests, predicts the value of a dependent variable using methodologies such as linear and logistic regression.
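The three methods can be contrasted with a short sketch. All of the data is made up, the variable names are hypothetical, and the predictive part uses a hand-written least-squares line as one simple instance of linear regression, not a full modeling workflow.

```python
import random
import statistics

# Made-up data: monthly ad spend (x) and sales (y), in arbitrary units.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

# Descriptive: summarize the data with central tendency and variation.
sales_mean = statistics.mean(sales)
sales_median = statistics.median(sales)
sales_stdev = statistics.stdev(sales)

# Inferential: use a sample to estimate a quantity of the whole population.
rng = random.Random(0)
population = list(range(1, 10001))       # hypothetical population values
sample = rng.sample(population, 500)     # simple random sample
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)    # estimate of population_mean

# Predictive: fit a least-squares line y = intercept + slope * x, then predict.
x_bar = statistics.mean(ad_spend)
y_bar = statistics.mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales)) / sum(
    (x - x_bar) ** 2 for x in ad_spend
)
intercept = y_bar - slope * x_bar
predicted_sales = intercept + slope * 6  # prediction for an unseen ad spend of 6

print(sales_mean, sales_median, round(sales_stdev, 3))
print(population_mean, sample_mean)
print(slope, intercept, predicted_sales)
```

Here the dependent variable is sales and the independent variable is ad spend; because the made-up data lies exactly on a line, the fitted slope and intercept are 2 and 1.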


The typical steps in solving a business analytics problem are as follows:

a.   Data Preparation
b.   Descriptive Analysis and Visualization



Many thanks to you for spending time reading this article. Much more on this and many other topics is available in the book, Practical Business Analytics Using SAS: A Hands-on Guide by Venkat Reddy and Shailendra Kadre. You can buy it right now at Amazon. The authors are reachable at shailendrakadre@gmail.com and 21.venkat@gmail.com.

 




