Business Analytics Techniques Used in the Industry
3rd April, 2015 Shailendra Kadre and Venkat Reddy
The previous few blog posts introduced the uses of data
mining and business analytics. This post examines the terminology in more detail,
covering only the terms most frequently used in the industry, and introduces
many of these analytics techniques and applications with examples. Some of the
most common techniques will be covered in greater depth in later posts.
Regression Modeling and Analysis
To understand regression and predictive modeling,
consider the same example of a bank trying to aggressively increase its
customer base for some of its credit card offerings. The credit card manager
wants to attract new customers who will not default on credit card loans. The
bank manager might want to build a model from a similar set of past customer
data that resembles the set of target customers closely. This model will be
used to assign a credit score to the new customers, which in turn will be used
to decide whether to issue a credit card to a potential customer. There might
be several other considerations aside from the calculated credit score before a
final decision is made to allocate the card.
The bank manager might want to view and analyze several
variables related to each of the potential clients in order to calculate their
credit score, which is dependent on variables such as the customer’s age,
income group, profession, number of existing loans, and so on. The credit score
here is a dependent variable, and other customer variables are independent
variables. With the help of past customer data and a set of suitable
statistical techniques, the manager will attempt to build a model that will
establish a relationship between the dependent variable (the credit score in
this case) and a lot of independent variables about the customers, such as
monthly income, house and car ownership status, education, current loans
already taken, information on existing credit cards, credit score and the past
loan default history from the federal data bureaus, and so on. There may be up
to 500 such independent variables that are collected from a variety of sources,
such as credit card application, federal data, and customers’ data and credit
history available with the bank. Not all of these variables will be useful in
building the model; the number of independent variables can be reduced to a
more manageable number, say 50 or fewer, by applying empirical and
statistical techniques. Once the relationship between independent and dependent
variables is established using available data, the model needs validation on a
different but similar set of customer data. Then it can be used to predict the
credit scores of the potential customers. A prediction accuracy of 90 percent
to 95 percent may be considered good in banking and financial applications; an
accuracy of at least 75 percent is a must. This kind of predictive model needs
periodic validation and may have to be rebuilt; some financial institutions mandate
revalidating the model at least once a year with fresh conditions and data.
In recent times, revenues for new movies depend largely
on the buzz created by that movie on social media in its first weekend of release.
In an experiment, data for 37 movies was collected. The data was the number of
tweets on a movie and the corresponding tickets sold. The graph in Figure 1-2
shows the number of tweets on the x-axis and number of tickets sold on the
y-axis for a particular movie. The question to be answered was: if a new movie
gets 50,000 tweets (for instance), how many tickets can it expect to sell in
its first week? A regression model was used to predict the number of tickets
(y) from the number of tweets (x) (Figure 1-3).
Figure 1-2. Number of Tickets Sold vs. Number of Tweets - a Data Collection for Sample Movies
Figure 1-3. The Regression Model for Number of Tickets
Sold vs. Number of Tweets - Prediction using Regression Model
Using this regression model equation, the number of tickets was estimated at
5,271 for a movie that had 50,000 tweets in its first week of release.
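As a sketch of what such a fit looks like in practice, here is a minimal least-squares linear regression in plain Python. The (tweets, tickets) pairs below are invented for illustration; they are not the 37-movie dataset from the study, so the prediction differs from the 5,271 figure above.

```python
# A minimal least-squares linear fit, sketched in plain Python.
# The (tweets, tickets) pairs are invented for illustration;
# they are NOT the 37-movie dataset from the study.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

tweets  = [10_000, 20_000, 30_000, 40_000, 60_000]
tickets = [1_200, 2_100, 3_300, 4_000, 6_200]

slope, intercept = fit_line(tweets, tickets)
predicted = slope * 50_000 + intercept  # expected ticket sales at 50,000 tweets
print(round(predicted))
```

A real credit-scoring model would of course regress on dozens of independent variables rather than one, but the fitting principle is the same.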
Time Series Forecasting
Time series forecasting is a relatively simple forecasting technique,
wherein some data points are available over regular time intervals of days,
weeks, or months. If some patterns can be identified in the historical data, it
is possible to project those patterns
into the future as a forecast. Sales forecasting is a popular usage of time
series forecasting. In Figure 1-3, a straight line shows the trend from the
past data. This straight line can easily be extended into a few more time
periods to have fairly accurate forecasts. In addition to trends, time series
forecasts can also show seasonality, which is simply a repeat pattern that is
observed within a year or less (such as more sales of gift items on occasions such
as Christmas or Valentine’s Day). Figure 1-4 shows an actual sales forecast,
the trend, and the seasonality of demand.
Figure
1-4. A Time Series Forecast Showing the Seasonality of Demand
Figure 1-4 shows the average monthly sales of an apparel
showroom for three years. There is a stock clearance sale every four months,
with huge discounts on all brands. The peak in every fourth month is apparent from
the figure.
Time series analysis can also be applied to the bank credit card example to
forecast future losses or profits, given the corresponding data for a historical
period of, say, 24 months. Time series forecasts are also used in weather
forecasting and stock market analysis.
Other examples of time series data include
representations of yearly sales volumes, mortgage interest rate variations over
time, and data representations in statistical quality control measurements such
as accuracy of an industrial lathe machine for a period of one month. In these
representations, the time component is taken on the x-axis, and the variable,
like sales volume, is on the y-axis. Some of these trends may follow a steady
straight-line increase or a decline over a period of time. Others may be cyclic
or random in nature. While applying time series forecasting techniques, it is
usually assumed that the past trend will continue for a reasonable time in the
future. This forecasting of the trend may be useful in many business
situations, such as stock procurement, cash-flow planning, and so on.
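The trend-plus-seasonality idea described above can be sketched in a few lines of Python. The monthly sales figures below are invented, with a peak every fourth month echoing the clearance-sale example: a linear trend is fitted to the time index, a seasonal offset is averaged per cycle position, and both are projected forward.

```python
# A minimal trend-plus-seasonality forecast, assuming synthetic
# monthly sales with a clearance-sale peak every fourth month
# (mirroring the apparel example; the numbers are invented).

def forecast(sales, period, steps):
    n = len(sales)
    # Fit a linear trend on the time index 0..n-1.
    mean_t = (n - 1) / 2
    mean_y = sum(sales) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(sales))
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # Seasonal index: average detrended value at each cycle position.
    seasonal = [0.0] * period
    counts = [0] * period
    for t, y in enumerate(sales):
        seasonal[t % period] += y - (intercept + slope * t)
        counts[t % period] += 1
    seasonal = [s / c for s, c in zip(seasonal, counts)]
    # Project the trend forward and re-add the seasonal offsets.
    return [intercept + slope * t + seasonal[t % period]
            for t in range(n, n + steps)]

sales = [100, 105, 110, 180, 112, 118, 124, 195, 126, 131, 137, 210]
print([round(x) for x in forecast(sales, period=4, steps=4)])
```

The fourth forecast lands on the clearance-sale position of the cycle and so comes out well above the other three, just as the peaks in Figure 1-4 would suggest.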
Conjoint Analysis
Conjoint analysis is a statistical technique, mostly used in market research,
to determine what product (or service) features or pricing would be most
attractive to customers and would positively influence their buying decision.
In conjoint studies, target responders are shown a
product with different features and pricing levels. Their preferences, likes, and
dislikes are recorded for the alternative product profiles. Researchers then
apply statistical techniques to determine the contribution of each of these
product features to overall likability or a potential buying decision. Based
on these studies, a marketing model can be made that can estimate the
profitability, market share, and potential revenue that can be realized from
different product designs, pricing, or their combinations.
It is an established fact that some mobile phones sell
more because of their ease of use and other user-friendly features. While
designing the user interface of a new phone, for example, a set of target users
is shown a carefully controlled set of different phone models, each having some
different and unique feature yet very close to each other in terms of the
overall functionality. Each user interface may have a different set of
background colors; the placement of commonly used functions may also be
different for each phone. Some phones might also offer unique features such as
dual SIM. The responders are then asked to rate the models and the controlled
set of functionalities available in each variation. Based on a conjoint
analysis of this data, it may be possible to decide which features will be
received well in the marketplace. The analysis may also help determine the
price points of the new model in various markets across the globe.
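The part-worth estimation at the heart of conjoint analysis can be illustrated with a toy example. In a full-factorial design with binary features, each feature's part-worth reduces to a difference of mean ratings; the phone profiles and ratings below are invented for illustration.

```python
# A toy conjoint-style calculation: in a full-factorial design,
# the part-worth of a binary feature can be estimated as the
# difference in mean rating when the feature is on vs. off.
# Profiles and ratings below are invented.

profiles = [
    # (dual_sim, large_screen) -> average user rating (1-10)
    ((0, 0), 5.0),
    ((0, 1), 6.5),
    ((1, 0), 6.0),
    ((1, 1), 7.8),
]

def part_worth(feature_index):
    on  = [r for (f, r) in profiles if f[feature_index] == 1]
    off = [r for (f, r) in profiles if f[feature_index] == 0]
    return sum(on) / len(on) - sum(off) / len(off)

print("dual SIM part-worth:", part_worth(0))
print("large screen part-worth:", part_worth(1))
```

Real conjoint studies estimate part-worths for many multi-level attributes at once, typically via regression on dummy-coded profiles, but the interpretation is the same: the larger the part-worth, the more that feature contributes to likability.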
Cluster Analysis
The intent of any cluster analysis exercise is to split
the existing data or observations into similar and discrete groups. In
classification problems, each observation is assigned to one of a set of known
groups; in cluster analysis, the aim is to discover the number and composition
of groups that may exist in a given data or observation set.
For example, the customers could be grouped into some
distinct groups in order to target them with different pricing strategies and
customized products and services. These distinct customer groups (Figure 1-5)
may include frequent customers, occasional customers, high net worth customers,
and so on. The number of such groups is unknown when beginning the analysis but
is determined as a result of analysis.
Figure 1-5. A cluster
analysis plot
The graph in Figure 1-6 shows the income to debt ratio
versus age. Customer segments that are similar in nature can be identified
using cluster analysis.
Figure 1-6. Income to Debt Ratio vs. Age
The income-to-debt ratio in Figure 1-6 is low for the 20-to-30 age group,
while the 30-to-45 age group has a higher ratio. Depending on the business
objective, such groups need to be treated differently rather than as one
single population.
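Clusters like those in Figures 1-5 and 1-6 are often found with the k-means algorithm. The following is a bare-bones sketch in Python on invented (age, income-to-debt ratio) points; the three seeded groups loosely echo the segments described above.

```python
# A bare-bones k-means sketch on made-up (age, income-to-debt ratio)
# points; the three seeded groups loosely echo Figure 1-6.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Recompute each center as its cluster's mean.
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters

# Young/low ratio, middle-aged/high ratio, older/medium ratio.
points = ([(25 + i, 0.5 + i * 0.01) for i in range(5)] +
          [(38 + i, 2.5 + i * 0.02) for i in range(5)] +
          [(55 + i, 1.2 + i * 0.01) for i in range(5)])
centers, clusters = kmeans(points, k=3)
print(sorted(round(c[0]) for c in centers))
```

In practice the features would be standardized first (age and a ratio are on very different scales), and the number of clusters would be chosen by inspecting the results, since it is unknown at the outset.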
Segmentation
Segmentation is similar to classification in that the criteria for dividing
observations into distinct groups must be found. In segmentation, the number
of groups may be apparent even at the beginning of the analysis, whereas
cluster analysis must discover where the concentrations and boundaries between
groups lie. In short, clustering discovers the boundaries between groups,
while segmentation applies known boundaries or some distinct criterion to form
the groups.
Clustering is about dividing the population into
different groups based on all the factors available. Segmentation is also
dividing the population into different groups but based on predefined criteria
such as maximizing the profit variable, minimizing the defects, and so on.
Segmentation is widely used in marketing to create the right campaign for the
customer segment that yields maximum leads.
Principal Components and Factor Analysis
These statistical methodologies are used to reduce the
number of variables or dimensions in a model building exercise. These are
usually independent variables. Principal component analysis combines a large
number of variables into a smaller set of composite components, while factor
analysis uncovers the underlying structure by estimating the hidden factors
that drive the relationships among the variables.
Some analysis studies may start with a large number of
variables, but because of practical constraints such as data handling, data
collection time, budgets, computing resources available, and so on, it may be
necessary to drastically reduce the number of variables that will appear on the
final data model. Only those independent variables that make the most sense to
the business need to be retained.
There might also be interdependency between some variables. For example, the
income levels of individuals in a typical analysis might be closely related to
their monthly spend in dollars: the higher the income, the higher the spend.
In such a case, it is better to keep only one of the two variables, removing
monthly spend from the final analysis.
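The interdependency check described above can be sketched with a Pearson correlation: if two candidate variables correlate very strongly, one of them can be dropped. The income and spend figures below are invented, and the 0.9 cutoff is a judgment call, not a fixed rule.

```python
# Checking interdependency between two candidate variables with a
# Pearson correlation. Income and spend figures are invented, and
# the 0.9 threshold is a judgment call.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

income = [3000, 4500, 5200, 6100, 8000, 9500]
spend  = [1200, 1700, 2000, 2300, 2900, 3600]

r = pearson(income, spend)
if abs(r) > 0.9:
    print(f"r = {r:.2f}: drop 'monthly spend', keep 'income'")
```

Principal component analysis generalizes this idea, collapsing whole groups of correlated variables into a few composite components instead of dropping variables pairwise.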
The regression modeling section discussed starting with 500 variables to
determine the credit score of potential customers. Principal component
analysis is one method of reducing the number of variables to a manageable
level, say 40, that will appear in the final data model.
Correspondence Analysis
Correspondence analysis is similar to principal
component analysis but applies to nonquantitative or categorical data such as
gender, status of pass or fail, color of objects, and field of specialization.
It especially applies to cross-tabulation. Correspondence analysis provides a
way to graphically represent the structure of cross-tabulations with each row
and column represented as a point.
Note You can find a complete example of
correspondence analysis on the SAS web site. See http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_corresp_sect026.htm.
Survival Analytics
Survival analytics is typically used when variables such
as time of death, duration of a hospital stay, and time to complete a doctoral
thesis need to be predicted. It essentially deals with time-to-event data.
For a more detailed treatment of this topic, please refer to www.amstat.org/chapters/northeasternillinois/pastevents/presentations/summer05_Ibrahim_J.pdf.
Some Practical Applications of Business Analytics
The following sections discuss a few examples of the practical application of
business analytics in the real world. Predicting customer behavior toward
certain product features, or applying analytics in the supply chain to predict
constraints such as raw-material lead times, are common examples. Analytics
applications are also very popular in retail and in predicting trends on
social media.
Customer Analytics
Predicting consumer behavior is the key to all marketing
campaigns. Market segmentation, customer relationship management, pricing,
product design, promotion, and direct marketing campaigns can benefit to a
large extent if consumer behavior can be predicted with reasonable accuracy.
Companies with direct interaction with customers collect and analyze a lot of
consumer-related data to get valuable business insights that may be useful in
positively affecting sales and marketing strategies. Retailers such as Amazon
and Walmart have a vast amount of transactional data available at their
disposal, and it contains information about every product and customer on a
daily basis. These companies use business analytics techniques effectively for
marketing, pricing policies, and campaign designs, which enable them to reach
the right customers with the right products. They understand customer needs
better using analytics and can stock better-selling products in place of
less-efficient ones. Many companies are also tapping the power of social media
to get meaningful data, which can be used to analyze consumer behavior. The
results of this analytics can also be used to design more personalized direct
marketing campaigns.
Operational Analytics
Several companies use operational analytics to improve
existing operations. It is now possible to look into business processes in real
time for any given time frame, with companies having enterprise resource
planning (ERP) systems such as SAP, which give an integrated operational view
of the business. Drilling down into history to re-create the events is also
possible. With proper analytics tools, this data can be used to analyze root
causes, uncover trends, and prevent disasters. Operational analytics can be
used to predict shipment lead times and other constraints in supply chains.
Some software can present a graphical view of the supply chain, depicting
possible constraints such as shipment and production delays.
Social Media Analytics
Millions of consumers use social media at any given
time. Whenever a new mobile phone or a movie, for instance, is launched in the
market, millions of people tweet about it almost instantly, write about their
feelings on Facebook, and give their opinions in the numerous blogs on the
World Wide Web. This data, if tapped properly, can be an important source to
uncover user attitudes, sentiments, opinions, and trends. Online reputation,
future revenue predictions for brands and products, and the effectiveness of
ad campaigns can all be determined by applying proper analytical techniques to
these instant, vast, and valuable sources of data. In fact, many players in
the analytics software market, such as IBM and SAS, claim to have products to
achieve this.
Social media analytics is simply text mining or text
analytics in some sense. Unstructured text data is available on social media
web sites, which can be challenging to analyze using traditional analytics
techniques. (Describing text analytics techniques is out of scope for this
book.)
Some companies are now using consumer sentiment analysis
on key social media web sites such as Twitter and Facebook to predict revenues
from new movie launches or any new products introduced in the market.
Data Used in Analytics
The data used in analytics can be broadly divided into
two types: qualitative and quantitative. Qualitative, discrete, or categorical
data is expressed in terms of natural language; color, day of the week, street
name, city name, and so on fall under this type of data.
Measurements that are explained with the help of numbers are quantitative data,
or a continuous form of data. Distance between two cities expressed in miles,
height of a person measured in inches, and so on, are forms of continuous data.
This data can come from a variety of sources that can be
internal or external. Internal sources include customer transactions, company
databases and data warehouses, e-mails, product databases, and the like.
External data sources can be professional credit bureaus, federal databases,
and other commercially available databases. In some cases, such as engineering
analysis, a company may prefer to develop its own data to solve an uncommon
problem.
Selecting the data for a business analytics model requires a thorough
understanding of the overall business and the problem to be solved. As the
past sections discussed, an analytics model combines data with the statistical
techniques used to analyze it. Hence, the accuracy of any model depends
largely on the quality of the underlying data and of the statistical methods
applied to it.
Obtaining data in a usable format is the first step in
any model-building process. You need to first understand the format and content
of the raw data made available for building a model. Raw data may require
extraction from its base sources such as a flat file or a data warehouse. It
may be available in multiple sources and in a variety of formats. The raw
format may warrant separating out the desired field values, which otherwise
appear to be junk or carry little meaning. The data may require a cleansing
step as well before an
attempt is made to process it further. For example, a gender field may have
only two values of male and female. Any other value in this field may be
considered as junk. However, it may vary depending upon the application. In the
same way, a negative value in an amounts field may not be acceptable.
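Such cleansing rules can be applied with a simple validity filter. The field names, allowed values, and rows below are invented for illustration, following the gender and amount rules just described.

```python
# A minimal cleansing pass, assuming the rules described above:
# gender must be 'male'/'female' and amounts must be non-negative.
# Field names and rows are invented for illustration.

rows = [
    {"gender": "male",   "amount": 120.0},
    {"gender": "??",     "amount": 55.0},    # junk gender value
    {"gender": "female", "amount": -10.0},   # negative amount
]

def is_clean(row):
    return row["gender"] in {"male", "female"} and row["amount"] >= 0

clean = [r for r in rows if is_clean(r)]
print(len(clean))  # 1 row survives
```

As the text notes, the rules vary by application; in some contexts a rejected row would be corrected or flagged for review rather than dropped outright.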
In some cases, the size of available data may be so large
that it may require sampling to reduce it to a manageable form for analysis
purposes. A sample is a subset from the available data, which for all practical
purposes represents all the characteristics of the original population. The
data sourcing, extraction, transformation, and cleansing may consume up to 70
percent of the total hours available to a business analytics project.
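The sampling step can be sketched with simple random sampling from the standard library; the population below is just a stand-in for a large dataset, and the fixed seed makes the subset reproducible.

```python
# Simple random sampling, as described above: a reproducible subset
# that stands in for the full population during model building.
import random

population = list(range(1_000_000))       # stand-in for a large dataset
rng = random.Random(42)                   # fixed seed for repeatability
sample = rng.sample(population, k=10_000) # 1 percent sample, no duplicates

print(len(sample), len(set(sample)))  # prints "10000 10000"
```

Sampling without replacement, as `random.sample` does, guarantees no row appears twice; for skewed populations a stratified sample may represent the original characteristics better than a simple random one.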
These edited excerpts are taken from Shailendra Kadre and Venkat Reddy's
latest book, Practical Business Analytics Using SAS: A Hands-on Guide. It's
available on the Amazon.com and Amazon India web sites.
· Paperback: 580 pages
· Publisher: Apress; 1st edition (January 29, 2015)
· Language: English
· ISBN-10: 1484200446
· ISBN-13: 978-1484200445