This blog discusses some of the business analytics tools most commonly
used in the industry today. It is not enough for a data analyst to learn
just one tool. Analysts need to apply different tools as the situation or
the problem at hand demands. A general knowledge of the strengths and
weaknesses of each tool will add value to a data analytics career.
This blog describes the features of three industry-leading tools, SAS, R,
and SPSS, and points to resources for further information.
Business analytics aims to model data in order to
discover useful information or business insights and to support decision
making. It requires various operations on the data, such as reviewing,
cleansing, transforming, modeling, validating, and interpreting. Data sets
sometimes contain a million records or more, and handling data of this size
and complexity requires automated analysis tools. Fortunately, many good
tools are available today. A simple Google search for data analysis tools
returns a long list, and many of the tools are open source and free to use.
SAS, SPSS, and R are the most widely used software packages today, at least
for business analytics applications. R is the most popular and widely used
statistical analysis package in the open source category, and SAS and SPSS
are the two most widely used commercial data analysis packages.
SAS has been around since the 1970s. So much legacy work has been
built on it that most companies doing business analytics at any level
continue to use SAS. R was introduced in 1996. Over the years, many new
capabilities and features have been built around R, and it is now one of the
most powerful open source data analysis tools available. This makes it
popular in the research and academic community, and many companies in the
corporate world have also started using it. SPSS has been around even
longer, with a first release in 1968. It has a strong user base in the
social sciences and many other communities.
Commonly Used Data Analysis Software
In the following sections, we talk about some commonly
used data analysis software and how to choose among the options. SAS, SPSS,
and R are arguably the most commonly used software in the industry.
SAS
· Most widely used commercial advanced analytics tool
· Has many predictive modeling algorithms
SPSS
· Has good text mining and data mining algorithms
R
· Most widely used open source analytics tool
· Has several packages for data analysis
MATLAB
· Widely used for numerical analysis and computing
RapidMiner
· Good GUI-based tool for segmentation and clustering; can also be used for conventional modeling
· Open source
Weka
· Open source
· Machine learning tool
SAP
· Tool for managing business operations and customer relations
· Most widely used operations tracking tool
Minitab
· A lightweight analytics tool
Apache Mahout
· Open source
· Advanced analytics tool for big data
Other Tools
· Statistica
· KXEN Modeler
· GNU Octave
· KNIME
Choosing a Tool
The final choice of data analysis tool depends
on many considerations:
· The business application, its complexity, and the level of expertise available in the organization.
· The long-term business, information, and analytics strategy of the organization.
· Existing organizational processes.
· Budgetary constraints.
· The investments proposed or already made in processing hardware, which in turn may determine factors such as the processing power and memory available for the software to run.
· Overall organization structure and where the analytics work is being carried out.
· Project environment and governance structure.
· Comfort level of the company in using open source software, along with warranties and other legal considerations.
· The size of data to be handled.
· The sophistication of graphics and presentation required in the project.
· Which analytics techniques are to be used and how frequently they will be used.
· How the current data is organized and how comfortable the team is in handling data.
· Whether a data warehouse is in place and how adequately it covers the business and customer information that may be required for the analysis.
· Legacy considerations. Is any other similar tool in use already? If yes, how much time and how many resources are required for any planned switch-over?
· Return-on-investment expectations.
Many more considerations specific to an organization or a
project can be added to this list. Their order of importance may vary from
person to person, from project to project, and from one organization to
another. The final decision, however, is not an easy one. The later sections
of this blog list a few comparative features of SAS, SPSS, and R, which
might help you choose the tool that best suits your needs. Finally, instead
of zeroing in on a single tool, you can also decide to use multiple tools
for different business analytics needs.
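One informal way to weigh such considerations against each other is a weighted scoring matrix. The sketch below uses Python rather than SAS, R, or SPSS, purely for illustration. The criteria names, the weights, the scores, and the labels "Tool A" and "Tool B" are all hypothetical placeholders, not ratings of any real product; the real values would have to come from your own organization.

```python
def rank_tools(weights, scores):
    """Return (tool, weighted average score) pairs, best first."""
    total_weight = sum(weights.values())
    ranked = []
    for tool, tool_scores in scores.items():
        # Weighted average of this tool's scores across all criteria
        weighted = sum(weights[c] * tool_scores[c] for c in weights) / total_weight
        ranked.append((tool, round(weighted, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical importance of each consideration (1 = low, 5 = high)
weights = {"budget": 5, "in_house_skills": 3, "data_size": 2}

# Hypothetical fit of each candidate tool per consideration (1 = poor, 5 = good)
scores = {
    "Tool A": {"budget": 2, "in_house_skills": 5, "data_size": 5},
    "Tool B": {"budget": 5, "in_house_skills": 3, "data_size": 3},
}

print(rank_tools(weights, scores))
```

Changing the weights changes the ranking, which mirrors the point that the order of importance differs from one organization to another.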
In some cases, you might conclude that a simple
spreadsheet application, such as Microsoft Excel, is the most convenient
and effective tool and still gives sufficient insight to solve the business
problem at hand.
Sometimes a single analytics project might require the
use of more than one tool. A data analyst will be expected to apply different
software tools depending on the problem at hand.
SAS and SPSS have hundreds of functions and procedures,
which can be broadly divided into five parts:
· Data management and input functions, which help to read, transform, and organize the data prior to the analysis
· Data analysis procedures, which help in the actual statistical analysis
· SAS’s output delivery system (ODS) and SPSS’s output management system (OMS), which help to extract the output data for final presentation or for use as input to other procedures
· Macro languages, which can be used to issue sets of commands repeatedly and to carry out programming-related tasks
· Matrix languages (SAS IML and SPSS Matrix), which can be used to add new algorithms
R has all these five areas integrated into one. Most of
the R procedures are written using the R language, while SAS and SPSS do not
use their native languages to write their procedures. Being open source, R’s procedures
are available for the users to see and edit to their own advantage.
As per the SAS web site, the SAS suite of business
analytics software has 35+ years of experience and 60,000+ customer sites
worldwide. It has the largest market share globally with regard to advanced
analytics. It can do practically everything related to advanced analytics,
business intelligence, data management, and predictive analytics. It is
therefore not strange that our entire book is dedicated to explaining the
applications of SAS in advanced business analytics.
SAS development originally started at North Carolina
State University, where it was developed from 1966 to 1976. The SAS Institute,
founded in 1976, owns this software worldwide. Since 1976, new modules and
functionalities have been added to the core software. The social media
analytics module was added in 2010.
The SAS software suite is large and has more than 200
components. Some of the interesting components are the following:
· Base SAS: Basic procedures and data management
· SAS/STAT: Statistical analysis
· SAS/GRAPH: Graphics and presentation
· SAS/ETS: Econometrics and time series analysis
· SAS/IML: Interactive matrix language
· SAS/INSIGHT: Data mining
· Enterprise Miner: Data mining
Analysis Using SAS: The Basic Elements
This section will concentrate on Base SAS procedures.
Base SAS helps to read, extract, transform, manage, and do statistical analysis
on almost all forms of data. This data can be from a variety of sources such as
Excel, flat files, relational databases, and the Internet. SAS provides a
point-and-click graphical user interface to perform statistical analysis of
data. This option is easy to use and may be useful to nontechnical users or as a means to do a
quick analysis. SAS also provides its own programming language, called the SAS
programming language. This option offers everything the GUI does, plus
several advanced operations and analyses. Many professional SAS users prefer
to work only in the programming language because it gives them almost
unlimited control over data manipulation, analysis, and presentation.
Most SAS programs have a DATA step and a PROC step. The
DATA step is used to retrieve and manipulate data, while the PROC step
contains code for data analysis and reporting. There are approximately 300
procedures (PROCs). SAS also provides a macro language that can be used to
perform routine programming tasks, including repetitive calls to SAS
procedures. Through its ODS, SAS output can be published in many commonly
used file formats such as Excel, PDF, and HTML. Many SAS procedures have the
advantage of a long history, a wide user base, and excellent documentation.
The Main Advantage Offered by SAS
The SAS programming language is a high-level procedural
programming language that offers a plethora of built-in statistical and
mathematical functions. It also offers both linear and nonlinear graphics
capabilities with advanced reporting features. The SAS programming language
makes it possible to manipulate and conveniently handle the data before
applying statistical techniques. These data manipulation capabilities
become even more important because up to three-fourths of the time spent in
most analytics projects goes to data extraction, transformation, and cleaning.
This capability is nonexistent in some other popular data analysis packages,
which may require data to be manipulated or transformed using several other
programs before it can be submitted to the actual statistical analysis
procedures. Some statistical techniques such as analysis of variance (ANOVA)
procedures are especially strong in the SAS environment.
Listing 1-1 and Listing 1-2 are samples of SAS code. They
are just to give you a feel of how SAS code generally looks.
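Listing 1-1. Regression SAS Code
/* Regress the bill amount on four predictors. The underscored variable
   names are assumed; the original listing lost its separators. */
proc reg data=sales;
   model bill_amount = income average_spending family_members age;
run;
Listing 1-2. Cluster Analysis Code
/* Group customers into at most five clusters with k-means. */
proc fastclus data=sup_market radius=0 replace=full maxclusters=5;
   id cust_id;
   var visits income age spends;
run;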
As discussed in the earlier sections, R is an integrated
tool for data manipulation, data management, data analysis, graphics, and
reporting. It can do all of the following in an integrated
environment:
· Data management functions such as extraction, handling, manipulation, transformation, and storage
· The fully functional, object-oriented R programming language
· Statistical analysis procedures
· Graphics and advanced reporting capabilities
R is open source software maintained by the R Development
Core Team and a vast R community (www.r-project.org).
It is supported by a large number of packages, which makes it feature rich for
the analytics community. About 25 statistical packages are supplied with the core
R software as standard and recommended packages. Many more are made available
from the CRAN web site at http://CRAN.R-project.org and from other
sources. The introductory manual on CRAN at http://cran.r-project.org/doc/manuals/R-intro.html#Top
is a good resource for an introduction to R, including further
documentation.
R’s extensibility is one of its biggest advantages.
Thousands of packages are available as extensions to the core R software.
Developers can see the code behind R procedures and modify it to write their
own packages. Most popular programming languages such as C++, Java, and Python
can be connected to the R environment. SPSS has a link to R for users who are
primarily using the SPSS environment for data analysis. SAS also offers some
means to move the data and graphics between the two packages. Table 1-2 lists
the most widely used R packages (see http://piccolboni.info/2012/05/essential-r-packages.html).
Table 1-2. Most Widely Used R Packages
Rank | Package | Description
1 | stats | Distributions and other basic statistical stuff
2 | methods | Object-oriented programming
3 | graphics | Of course, graphics
4 | MASS | Supporting material for Modern Applied Statistics with S
5 | grDevices | Graphical devices
6 | utils | In a snub to modularity, a little bit of everything, but very useful
7 | lattice | Graphics
8 | grid | More graphics
9 | Matrix | Matrices
10 | mvtnorm | Multivariate normal and t distributions
11 | sp | Spatial data
12 | tcltk | GUI development
13 | splines | Needless to say, splines
14 | nlme | Mixed-effects models
15 | survival | Survival analysis
16 | cluster | Clustering
17 | R.methodsS3 | Object-oriented programming
18 | coda | MCMC
19 | igraph | Graphs (the combinatorial objects)
20 | akima | Interpolation of irregularly spaced data
21 | rgl | 3D graphics (OpenGL)
22 | rJava | Interface with Java
23 | RColorBrewer | Palette generation
24 | ape | Phylogenetics
25 | gtools | Functions that didn’t fit anywhere else, including macros
26 | nnet | Neural networks
27 | quadprog | Quadratic programming
28 | boot | Bootstrap
29 | Hmisc | Yet another miscellaneous package
30 | car | Companion to the Applied Regression book
31 | lme4 | Linear mixed-effects models
32 | foreign | Data compatibility
33 | Rcpp | R C++ integration
Here are a few R code snippets. It is not necessary to
understand them at this stage; they will become clear later.
# Read an external CSV file
input_data <- read.csv("Datasets/Insurance_data.csv")
# Drop the first column (assumed here to be an identifier) and keep the variables
input_data_final <- input_data[, -1]
# Normalize the data (each variable gets mean 0 and standard deviation 1)
input_data_final <- scale(input_data_final)
# Create 5 clusters from the given data
clusters_five <- kmeans(input_data_final, 5)
# Summarize the clusters by mean
cluster_summary_mean_five <- aggregate(input_data_final, by = list(clusters_five$cluster), FUN = mean)
# Summarize the clusters by size (this step was missing from the original snippet)
cluster_summary_count_five <- aggregate(input_data_final, by = list(clusters_five$cluster), FUN = length)
# Display the results summarized by size
View(cluster_summary_count_five)
SPSS originally stood for Statistical Package for the
Social Sciences. It is a software package used for statistical analysis,
originally developed by SPSS Inc. IBM acquired it in 2009 and renamed it
SPSS Statistics; the latest version at the time of writing is SPSS Statistics 22.
Many SPSS users think it has a stronger menu-driven interface
than R and SAS, and its learning curve is also shorter.
The web site at http://fmwww.bc.edu/GStat/docs/StataVSPSS.html
has the following opinion about SPSS:
SPSS has its roots in the social sciences and the analysis of
questionnaires and surveys is where many of its core strengths lie.
SPSS has been in existence for a long time and hence has
a strong user base. As with any other software, you always have to do a
cost-benefit analysis before making a buying decision.
Users may find SAS and SPSS similar to each other,
and switching from one to the other may be fairly easy. R may look somewhat
different for first-time users.
Features of SPSS Statistics 22
SPSS Statistics 22 is built on the philosophy of data-driven
decision making anytime, anywhere. It has many new features, such as
interaction with mobile devices. It works on Windows, Mac, and Linux desktops.
For mobile devices, it supports Apple, Windows 8, and Android devices. It has
support for Automatic Linear Modeling (ALM) and heat maps. It enhances the
Monte Carlo simulation to help in improving the accuracy of predictive models
for uncertain inputs. SPSS Statistics Server offers good scalability and
performance. Custom programming is also easier than before.
Python plug-ins can be added as a part of the main installation.
Monte Carlo simulation is a problem-solving technique
that is used for approximating certain results by doing multiple trials or
simulations that use random variables.
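As a minimal sketch of the idea, the Python snippet below (Python is used only for illustration; this is not SPSS code) approximates the value of pi by running many random trials. The function name estimate_pi and its parameters are made up for this example. Each trial draws a random point in the unit square and checks whether it falls inside the quarter circle of radius 1; the fraction of hits approximates pi/4.

```python
import random


def estimate_pi(trials: int, seed: int = 0) -> float:
    """Approximate pi by sampling random points in the unit square."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        # Count the point if it lies inside the quarter circle of radius 1
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square
    return 4.0 * inside / trials


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n))
```

The estimate generally gets closer to the true value as the number of trials grows, which is why Monte Carlo methods rely on large numbers of simulations.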
The web site at http://stanfordphd.com/Statistical_Software.html
contains a statistical feature comparison for R, Minitab, SAS, STATA, and SPSS.
R looks feature-rich given its supporting packages, which are written by the R
core development team and many other R enthusiasts. SAS has been around since
the 1970s and has a large user base. It has great data management capabilities,
which make it a one-stop shop for most of the analytics exercises. The user
does not need to go to any other program whether for reading the data files or for
the final presentation.
SPSS has great menu-driven features and does not need any
training in programming in most cases. The SAS market share and its wide appeal
make it the main topic of our book. Several companies in the corporate world find
themselves comfortable with SAS, mainly because of the large legacy built
around it over the years, its industrial quality code, its good quality of
documentation, and the skill availability in the market.
As discussed in this blog, every available application
has its own strengths and weaknesses, and no software is
fit for all occasions. To be successful, a data analyst must learn multiple
software products and use them as the situation demands.
· Paperback: 580 pages
· Publisher: Apress; 1 edition (January 29, 2015)
· Language: English
· ISBN-10: 1484200446
· ISBN-13: 978-1484200445