February 14, 2015 Shailendra Kadre and Venkat Reddy
Conventional analytical tools and techniques are inadequate for data that is unstructured (like text), that is too large, or that is growing rapidly (like social media data). A cluster analysis on a 200MB file with 1 million customer records is manageable, but the same cluster analysis on 1000GB of Facebook customer profile information would take a prohibitively long time with conventional tools and techniques. Entities such as Facebook, Google, and Walmart generate petabytes of data every day. Distributed computing methodologies are needed to carry out analysis at that scale.
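As a minimal sketch of what that might look like, the snippet below distributes a k-means cluster analysis with PySpark. PySpark is an assumption here (the text names no specific framework), and the file path and column names are hypothetical.

```python
# A minimal sketch of distributed cluster analysis, assuming PySpark is
# installed and the customer records sit as a Parquet file on a distributed
# file system such as HDFS. Path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("customer-clustering").getOrCreate()

# Load the profile data; Spark splits the read across the cluster.
profiles = spark.read.parquet("hdfs:///data/customer_profiles.parquet")

# Combine the numeric columns into the single feature vector KMeans expects.
assembler = VectorAssembler(
    inputCols=["age", "annual_spend", "visits_per_month"],  # hypothetical columns
    outputCol="features",
)
features = assembler.transform(profiles)

# Fit k-means with 5 clusters; each iteration runs in parallel on the workers.
model = KMeans(k=5, seed=42).fit(features)
print(model.clusterCenters())

spark.stop()
```

Because Spark partitions both the data and each k-means iteration across worker nodes, the same few lines apply whether the input is a 200MB sample or a 1000GB profile dump.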
The SAS web site defines big data as follows: “Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business—and society—as the Internet has become. Why? More data may lead to more accurate analyses.”
It further states, “More accurate analyses may lead to
more confident decision making. And better decisions can mean greater
operational efficiencies, cost reductions, and reduced risk.”
This definition refers to big data as a popular term that
is used to describe data sets of large volumes that are beyond the limits of
commonly used desktop database and analytical applications. Sometimes even
server-class databases and analytical applications fail to manage this kind of
data set.
Wikipedia describes big data sizes as a constantly moving target, ranging from a few dozen terabytes to several petabytes (as of 2012) in a single data set. The size of big data may vary from one organization to another, depending on the capabilities of the software commonly used to process such data sets in its domain. For some organizations, a few hundred gigabytes of data may force a rethink of their data processing and analysis systems, while others may feel quite at home with hundreds of terabytes.
Consider a few examples.
CERN’s Large Hadron Collider experiments deal with 500 quintillion bytes of data per day, which is 200 times more than all other sources in the world combined.
In business:
eBay.com uses a 40-petabyte Hadoop cluster to support its merchandising, search, and consumer recommendations.
Amazon.com deals with some of the world’s largest Linux databases, which measure up to 24 terabytes.
Walmart handles more than 1 million consumer transactions every hour, and its databases are estimated at 2.5 petabytes.
All of this, of course, is extremely large data, which is almost impossible for conventional database and business applications to handle.
The industry has two more definitions for big data.
1. Big data is a collection of data sets so large
and complex that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
2. Big data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
In simple terms, big data cannot be handled by
conventional data-handling tools, and big data analysis cannot be performed
using conventional analytical tools. Big data tools that use distributed
computing techniques are needed.
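To make the distributed-computing idea concrete, here is a toy sketch in plain Python of the map-shuffle-reduce pattern that tools such as Hadoop apply across many machines; in this single-process illustration all three phases run locally.

```python
# Toy illustration of the map -> shuffle -> reduce pattern that Hadoop
# applies across many machines; here all three phases run in one process.
from collections import defaultdict

documents = [
    "big data needs new tools",
    "new tools for big data",
]

# Map phase: each document independently emits (word, 1) pairs.
# On a cluster, each mapper would process its own block of the file.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group pairs by key so each reducer sees one word's counts.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word, again parallelizable per key.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'big': 2, 'data': 2, 'needs': 1, 'new': 2, 'tools': 2, 'for': 1}
```

The point of the pattern is that the map and reduce phases share no state, so a framework can farm each phase out to hundreds of nodes without changing the logic.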
Gartner defines the three v’s of big data as volume, velocity, and variety. So far, only the volume aspect of big data has been discussed. Velocity, the speed at which data is created, is equally important. Consider the familiar example of the CERN Hadron Collider experiments: they generate 150 million petabytes of data annually, which works out to roughly 400EB (1EB = 1073741824GB) per day. Walmart, likewise, processes more than 1 million transactions per hour.
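That daily figure is just arithmetic on the annual rate; here is a quick back-of-the-envelope check, assuming binary units as in the 1EB = 1073741824GB convention above.

```python
# Back-of-the-envelope check of the CERN data rate quoted above,
# using binary units (1EB = 1024PB = 1073741824GB).
annual_pb = 150_000_000            # 150 million petabytes per year
daily_eb = annual_pb / 1024 / 365  # petabytes -> exabytes, year -> days
print(f"{daily_eb:.0f}EB per day")  # prints: 401EB per day
```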
The third v is variety. This dimension refers to the formats in which data is generated: it can be structured or unstructured, numeric or non-numeric, and can take the form of text, e-mail, customer transactions, audio, and video, to name just a few.
In addition to these three v’s, some like to include veracity when defining big data. Veracity covers the biases, noise, and deviations inherent in most big data sets; these are especially common in data generated from social media web sites. The SAS web site also counts data complexity as one of the factors defining big data.
Gartner’s definition of the three v’s has almost become an industry standard when it comes to
defining big data.
Some of the big data sources have already been discussed in the earlier sections. Advanced studies in environmental sciences, genomics, microbiology, quantum physics, and so on, are sources of data sets that may be classified as big data. Scientists are often struck by the sheer volume of the data sets they need to analyze for their research, and they must continuously innovate ways to store, process, and analyze such data.
Daily customer transactions with retailers such as
Amazon, Walmart, and eBay also generate large volumes of data at amazing rates.
This kind of data mainly falls under the category of structured data.
Unstructured text data such as product descriptions, book reviews, and so on,
is also involved. Healthcare systems also add hundreds of terabytes of data to
data centers annually in the form of patient records and case documentations.
Global consumer transactions processed daily by credit card companies such as
Visa, American Express, and MasterCard may also be classified as sources of big
data.
The United States and other governments are also big sources of data generation. They need the power of some of the world’s most powerful supercomputers to meaningfully process the data in reasonable time frames. Research projects in fields such as economics and population studies, conducted by the World Bank, UN, and IMF, also consume large amounts of data.
More recently, social media sites such as Facebook, Twitter, and LinkedIn are presenting great opportunities in the field of big data analysis. These sites are now among the biggest data-generation sources in the world, and they mainly produce unstructured data: text such as customer responses, conversations, and messages, along with audio clips, videos, and images. Their databases run to hundreds of petabytes. This data, although difficult to analyze, presents immense opportunities to generate useful insights for product promotion, trend and sentiment analysis, brand management, and online reputation management for political outfits and individuals, to name a few. Social media analytics is a rapidly growing field, and several startups and established companies are devoting considerable time and energy to this practice. Table 1-1 compares big data to conventional data.
Table 1-1. Big Data vs. Conventional Data

| Big Data | Normal or Conventional Data |
| --- | --- |
| Huge data sets. | Data set size is manageable. |
| Unstructured data such as text, video, and audio. | Normally structured data such as numbers and categories, though it can take other forms as well. |
| Queries and analysis are hard to perform. | Queries and analysis are relatively easy to perform. |
| Needs new methodologies for analysis. | Analysis can be achieved using conventional methods. |
| Needs tools such as Hadoop, Hive, HBase, Pig, and Sqoop. | Tools such as SQL, SAS, R, and Excel alone may be sufficient. |
| Raw transactional data. | Aggregated, sampled, or filtered data. |
| Used for reporting, basic analysis, and text mining; advanced analytics on big data is only in its starting stage. | Used for reporting, advanced analysis, and predictive modeling. |
| Analysis needs both programming skills (such as Java) and analytical skills. | Analytical skills are sufficient; advanced analysis tools don’t require expert programming skills. |
| Petabytes/exabytes of data; millions/billions of accounts; billions/trillions of transactions. | Megabytes/gigabytes of data; thousands/millions of accounts; millions of transactions. |
| Generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and so on. | Generated by small enterprises and small banks. |
To read more on big data and business analytics, refer to our latest book, Practical Business Analytics Using SAS: A Hands-On Guide (Apress, New York), by Venkat Reddy and Shailendra Kadre.