Wednesday, April 22, 2015

How to Choose Your Data Analysis Tool...

This blog discusses some of the business analytics tools most commonly used in the industry today. It is not enough for a data analyst to learn just one tool; analysts need to apply different tools as the situation or the problem at hand demands. A general knowledge of the strengths and weaknesses of the tools will definitely add value to a data analytics career.


It covers the features of three industry-leading tools, SAS, R, and SPSS, and points to resources for further information.

Business analytics aims to model data in order to discover useful information or business insights and to support decision making. It requires various operations on the data, such as reviewing, cleansing, transforming, modeling, validating, and interpreting. Sometimes a data set may have a million records or more. Handling and operating on such large and complex data requires automated analysis tools. Fortunately, many good tools are available today. A simple Google search for data analysis tools will return a long list of them, many of which are open source and free to use. SAS, SPSS, and R are the most widely used software packages today, at least for business analytics applications. R is the most popular statistical analysis package in the open source category, and SAS and SPSS are the two most widely used commercial data analysis packages.

SAS has been around since the 1970s. So many legacy systems have been built with it that most companies doing business analytics at any level continue to use SAS. R was introduced in 1996. Over the years, many new capabilities and features have been built around it, and it is now one of the most powerful open source data analysis tools available, which makes it popular in the research and academic community. Many companies in the corporate world have also started using R. SPSS has also been around for decades (its first version dates from 1968), and it has a strong user base in the social sciences and many other communities.

Commonly Used Data Analysis Software

In the following sections, we talk about some commonly used data analysis software and how to choose among them. SAS, SPSS, and R are arguably the most commonly used in the industry.
SAS
·         Most widely used commercial advanced analytics tool
·         Has a lot of predictive modeling algorithms
SPSS
·         Has good text mining and data mining algorithms
R
·         Most widely used open source analytics tool
·         Has several packages for data analysis
MATLAB
·         Widely used for numerical analysis and computing
RapidMiner
·         Good GUI-based tool for segmentation and clustering; can also be used for conventional modeling
·         Open source
Weka
·         Open source
·         Machine learning tool
SAP
·         Tool for managing business operations and customer relations
·         Most widely used operations tracking tool
Minitab
·         A lightweight analytics tool
Apache Mahout
·         Open source
·         Advanced analytics tool for big data
Other Tools
·         Statistica
·         KXEN Modeler
·         GNU Octave
·         Knime

Choosing a Tool

The final choice of data analysis tool depends on many considerations:
·         The business application, its complexity, and the level of expertise available in the organization.
·         The long-term business, information, and analytics strategy of the organization.
·         Existing organizational processes.
·         Budgetary constraints.
·         The investments proposed or already made in processing hardware, which in turn determine factors such as the processing power and memory available for the software to run.
·         Overall organization structure and where the analytics work is being carried out.
·         Project environment and governance structure.
·         Comfort level of the company in using open source software and warranties and other legal considerations.
·         The size of data to be handled.
·         The sophistication of graphics and presentation required in the project.
·         Which analytics techniques will be used and how frequently.
·         How the current data is organized and how comfortable the team is in handling data.
·         Whether a data warehouse is in place and how adequately it covers business and customer information that may be required for the analysis.
·         Legacy considerations. Is any other similar tool in use already? If yes, how much time and resources are required for any planned switch-over?
·         Return-on-investment expectations.
Many more considerations specific to an organization or a project can be added to this list. Their order of importance may vary from person to person, from project to project, and from one organization to another, so the final decision is not an easy one. The later sections of this blog list a few comparative features of SAS, SPSS, and R, which might help you decide which tool best suits your needs. Finally, instead of zeroing in on a single tool, it is also possible to use multiple tools for different business analytics needs.

In some cases, it might be concluded that a simple spreadsheet application, such as Microsoft Excel, is the most convenient and effective tool and still gives sufficient insight to solve the business problem at hand.

Sometimes a single analytics project might require the use of more than one tool. A data analyst will be expected to apply different software tools depending on the problem at hand.

Main Parts of SAS, SPSS, and R

SAS and SPSS each have hundreds of functions and procedures, which can be broadly divided into five parts:
·         Data management and input functions, which help to read, transform, and organize the data prior to the analysis
·         Data analysis procedures, which help in the actual statistical analysis
·         SAS’s output delivery system (ODS) and SPSS’s output management system (OMS), which help to extract the output data for final presentation or for use as input to other procedures
·         Macro languages, which can be used to give sets of commands repeatedly and to conduct programming-related tasks
·         Matrix languages (SAS IML and SPSS Matrix), which can be used to add new algorithms
R has all five of these areas integrated into one. Most R procedures are written in the R language itself, whereas SAS and SPSS procedures are not written in their own native languages. Because R is open source, its procedures are available for users to see and edit to their own advantage.
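Because most R procedures are written in R itself, their source can be printed and used as a starting point for a variant. The following is a minimal sketch in R; the function name sd_population and the sample values are made up for illustration and do not come from the book.

```r
# Typing or printing a function's name without parentheses shows its R source.
print(stats::sd)

# A user-written variant: the population standard deviation, which divides
# by n instead of n - 1. (sd_population is a hypothetical name.)
sd_population <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sqrt(mean((x - mean(x))^2))
}

values <- c(2, 4, 4, 4, 5, 5, 7, 9)
print(sd_population(values))  # population standard deviation of the sample
print(sd(values))             # built-in sample standard deviation, slightly larger
```

The same inspect-then-adapt approach is not available for SAS or SPSS procedures, whose internals are not exposed to the user.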

SAS

As per the SAS web site, the SAS suite of business analytics software has 35+ years of experience and 60,000+ customer sites worldwide. It has the largest market share globally with regard to advanced analytics. It can do practically everything related to advanced analytics, business intelligence, data management, and predictive analytics. It is therefore not strange that our entire book is dedicated to explaining the applications of SAS in advanced business analytics.

SAS development originally started at North Carolina State University, where it was developed from 1966 to 1976. The SAS Institute, founded in 1976, owns this software worldwide. Since 1976, new modules and functionality have been added to the core software. The social media analytics module was added in 2010.

The SAS software suite is huge, with more than 200 components. Some of the more interesting components are the following:
·         Base SAS: Basic procedures and data management
·         SAS/STAT: Statistical analysis
·         SAS/GRAPH: Graphics and presentation
·         SAS/ETS: Econometrics and Time Series Analysis
·         SAS/IML: Interactive matrix language
·         SAS/INSIGHT: Data mining
·         Enterprise Miner: Data mining

Analysis Using SAS: The Basic Elements

This section concentrates on Base SAS procedures. Base SAS helps to read, extract, transform, manage, and statistically analyze almost all forms of data. This data can come from a variety of sources such as Excel, flat files, relational databases, and the Internet. SAS provides a point-and-click graphical user interface for statistical analysis of data. This option is easy to use and may be useful to nontechnical users or as a means to do a quick analysis. SAS also provides its own programming language, called the SAS programming language. This option provides everything that the GUI has, in addition to several advanced operations and analyses. Many professional SAS users prefer using only the programming option because it gives the user almost unlimited control over data manipulation, analysis, and presentation.

Most SAS programs have a DATA step and a PROC step. The DATA step is used for retrieving and manipulating data, while the PROC step contains code for data analysis and reporting. There are approximately 300 procedures (PROCs). SAS also provides a macro language that can be used to perform routine programming tasks, including repetitive calls to SAS procedures. As mentioned in the earlier section, SAS provides an ODS, which lets SAS output be published in many commonly used file formats such as Excel, PDF, and HTML. Many of the SAS procedures have the advantage of a long history, a wide user base, and excellent documentation.

The Main Advantage Offered by SAS

The SAS programming language is a high-level procedural programming language that offers a plethora of built-in statistical and mathematical functions. It also offers both linear and nonlinear graphics capabilities with advanced reporting features. The SAS programming language makes it possible to manipulate and conveniently handle the data before applying statistical techniques. These data manipulation capabilities become even more important because up to three-fourths of the time spent in most analytics projects goes to data extraction, transformation, and cleaning. This capability is nonexistent in some other popular data analysis packages, which may require data to be manipulated or transformed using several other programs before it can be submitted to the actual statistical analysis procedures. Some statistical techniques, such as analysis of variance (ANOVA) procedures, are especially strong in the SAS environment.

Listing 1-1 and Listing 1-2 are samples of SAS code. They are just to give you a feel of how SAS code generally looks.
Listing 1-1. Regression SAS Code
/* Regress the bill amount on four customer attributes (variable names are illustrative) */
proc reg data=sales;
   model bill_amount = income average_spending family_members age;
run;
quit;
Listing 1-2. Cluster Analysis Code
/* Group supermarket customers into at most five clusters */
proc fastclus data=sup_market radius=0 replace=full maxclusters=5;
   id cust_id;
   var visits income age spends;
run;

The R Tool

As discussed in the earlier sections, R is an integrated tool for data manipulation, data management, data analysis, graphics, and reporting. It offers all of the following in an integrated environment:
·         Data management functions such as extraction, handling, manipulation, transformation, and storage
·         The full-featured, object-oriented R programming language
·         Statistical analysis procedures
·         Graphics and advanced reporting capabilities
R is open source software maintained by the R Development Core Team and a vast R community (www.r-project.org). It is supported by a large number of packages, which makes it feature rich for the analytics community. About 25 statistical packages are supplied with the core R software as standard and recommended packages. Many more are made available from the CRAN web site at http://CRAN.R-project.org and from other sources. The CRAN site at http://cran.r-project.org/doc/manuals/R-intro.html#Top offers a good resource for an R introduction, including documentation resources.
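To check which packages ship with a particular R installation, something like the following should work. It relies on the priority argument of installed.packages(), where "high" selects both the base and the recommended packages; the exact count depends on the R version and on how R was installed, so it may not match the figure quoted above.

```r
# List the packages with priority "base" or "recommended" in this installation.
core <- installed.packages(priority = "high")
core_names <- rownames(core)

print(length(core_names))         # number of core packages found
print(table(core[, "Priority"]))  # split between base and recommended
print(sort(core_names))           # package names, e.g. stats, methods, graphics
```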

R’s extensibility is one of its biggest advantages. Thousands of packages are available as extensions to the core R software. Developers can see the code behind R procedures and modify it to write their own packages. Most popular programming languages such as C++, Java, and Python can be connected to the R environment. SPSS has a link to R for users who are primarily using the SPSS environment for data analysis. SAS also offers some means to move the data and graphics between the two packages. Table 1-2 lists the most widely used R packages (see http://piccolboni.info/2012/05/essential-r-packages.html).
Table 1-2. Most Widely Used R Packages
Rank  Package        Description
1     stats          Distributions and other basic statistical stuff
2     methods        Object-oriented programming
3     graphics       Of course, graphics
4     MASS           Supporting material for Modern Applied Statistics with S
5     grDevices      Graphical devices
6     utils          In a snub to modularity, a little bit of everything, but very useful
7     lattice        Graphics
8     grid           More graphics
9     Matrix         Matrices
10    mvtnorm        Multivariate normal and t distributions
11    sp             Spatial data
12    tcltk          GUI development
13    splines        Needless to say, splines
14    nlme           Mixed-effects models
15    survival       Survival analysis
16    cluster        Clustering
17    R.methodsS3    Object-oriented programming
18    coda           MCMC
19    igraph         Graphs (the combinatorial objects)
20    akima          Interpolation of irregularly spaced data
21    rgl            3D graphics (openGL)
22    rJava          Interface with Java
23    RColorBrewer   Palette generations
24    ape            Phylogenetics
25    gtools         Functions that didn’t fit anywhere else, including macros
26    nnet           Neural networks
27    quadprog       Quadratic programming
28    boot           Bootstrap
29    Hmisc          Yet another miscellaneous package
30    car            Companion to the Applied Regression book
31    lme4           Linear mixed-effects models
32    foreign        Data compatibility
33    Rcpp           R C++ integration
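Here are a few R code snippets. It is not necessary to understand them at this stage; they will make sense later.
# Read an external CSV file
input_data <- read.csv("Datasets/Insurance_data.csv")

# Drop the first column (presumably an identifier) and keep the remaining variables
input_data_final <- input_data[, -1]

# Standardize each variable to mean 0 and standard deviation 1
input_data_final <- scale(input_data_final)

# Create 5 clusters from the data (k-means starts from random centers, so results vary between runs)
clusters_five <- kmeans(input_data_final, 5)

# Summarize the clusters by the mean of each variable
cluster_summary_mean_five <- aggregate(input_data_final, by = list(clusters_five$cluster), FUN = mean)

# Summarize the clusters by size (number of records in each)
cluster_summary_count_five <- aggregate(input_data_final, by = list(clusters_five$cluster), FUN = length)

# Display the size summary in a spreadsheet-style viewer
View(cluster_summary_count_five)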

IBM SPSS Analytics Tool

SPSS originally stood for Statistical Package for the Social Sciences. It is a software package used for statistical analysis, originally developed by SPSS Inc. IBM acquired it in 2009 and renamed it SPSS Statistics; the latest version at the time of writing is SPSS Statistics 22.

Many SPSS users think its menu-driven interface is stronger than those of R and SAS; its learning curve is also shorter.

The web site at http://fmwww.bc.edu/GStat/docs/StataVSPSS.html has the following opinion about SPSS:
SPSS has its roots in the social sciences and the analysis of questionnaires and surveys is where many of its core strengths lie.
SPSS has been in existence for a long time and hence has a strong user base. As with any other software, you always have to do a cost-benefit analysis when making a buying decision.
Users may find SAS and SPSS similar to each other, and switching from one to the other may be fairly easy. R may look somewhat different to first-time users.

Features of SPSS Statistics 22

SPSS Statistics 22 is built on the philosophy of data-driven decision making anytime, anywhere. It has many new features, such as interaction with mobile devices. It works on Windows, Mac, and Linux desktops. For mobile devices, it supports Apple, Windows 8, and Android devices. It has support for Automatic Linear Modeling (ALM) and heat maps. It enhances Monte Carlo simulation to help improve the accuracy of predictive models with uncertain inputs. SPSS Statistics Server is good as far as scalability and performance are concerned. Custom programming is also easier than before. Python plug-ins can be added as part of the main installation.

Monte Carlo simulation is a problem-solving technique that is used for approximating certain results by doing multiple trials or simulations that use random variables.
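As a rough illustration of the idea, here is a minimal sketch in R that approximates the value of pi by running many random trials. It shows only the general technique and is not how SPSS implements its simulation feature; the variable names, the seed, and the number of trials are arbitrary choices made for this example.

```r
# Monte Carlo estimate of pi: scatter random points over the unit square
# and count the share that falls inside the quarter circle of radius 1.
set.seed(42)                  # fix the seed so the run is repeatable
n_trials <- 100000            # number of random trials

x <- runif(n_trials)          # random x coordinates in [0, 1]
y <- runif(n_trials)          # random y coordinates in [0, 1]

inside <- x^2 + y^2 <= 1      # TRUE when a point lands inside the quarter circle
pi_estimate <- 4 * mean(inside)

print(pi_estimate)            # approaches pi as n_trials grows
```

The quarter circle covers pi/4 of the square, so four times the observed share of points inside it estimates pi. Increasing n_trials tightens the estimate, which is the usual trade-off between accuracy and computing time in this kind of simulation.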

Selection of Analytics Tools

The web site at http://stanfordphd.com/Statistical_Software.html contains a statistical feature comparison for R, Minitab, SAS, STATA, and SPSS. R looks feature-rich given its supporting packages, which are written by the R core development team and many other R enthusiasts. SAS has been around since the 1970s and has a large user base. It has great data management capabilities, which make it a one-stop shop for most of the analytics exercises. The user does not need to go to any other program whether for reading the data files or for the final presentation.

SPSS has great menu-driven features and does not need any training in programming in most cases. The SAS market share and its wide appeal make it the main topic of our book. Several companies in the corporate world find themselves comfortable with SAS, mainly because of the large legacy built around it over the years, its industrial quality code, its good quality of documentation, and the skill availability in the market.

As discussed in this blog, it must be accepted that every available application has its own strengths and weaknesses. No software is fit for all occasions. To be successful, a data analyst must learn multiple software products and use them as the situation demands.

These edited excerpts are taken from Shailendra Kadre and Venkat Reddy’s book, Practical Business Analytics Using SAS: A Hands-on Guide. It is available on the Amazon.com and Amazon India web sites.

·         Paperback: 580 pages
·         Publisher: Apress; 1 edition (January 29, 2015)
·         Language: English
·         ISBN-10: 1484200446
·         ISBN-13: 978-1484200445
