Big Data Analytics: Asking the Right Questions

By Renee Boucher Ferguson (Part 1 of 4)

According to IDC’s 2011 Extracting Value from Chaos study – the 5th consecutive report of its kind – last year the amount of data created and replicated burst through the zettabyte barrier for the first time. That’s more than one trillion gigabytes of data. Even if you don’t know the scale between zettas and gigas, you know that’s Big Data. In her first post of a 4-part series for the SHARE President’s Corner, veteran tech journalist Renee Boucher Ferguson explores how organizations are gleaning Big Analytics from Big Data.

In 2011 the amount of global digital data created and replicated is expected to reach 1.8 zettabytes, a ninefold increase over five years. By 2012 that number is expected to reach about 2.7 zettabytes, a whopping 48% increase from 2011.
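
For readers unsure of the scale between zettas and gigas, the arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) prefixes, and the function and variable names are illustrative only. Note that the rounded figures of 1.8 and 2.7 zettabytes imply growth of about 50%; the quoted 48% presumably reflects unrounded estimates.

```python
# Decimal (SI) prefixes are assumed: giga = 10**9 bytes, zetta = 10**21 bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zb: float) -> float:
    """Convert a size in zettabytes to gigabytes."""
    return zb * ZETTABYTE / GIGABYTE


def percent_increase(old: float, new: float) -> float:
    """Return the percentage growth from old to new."""
    return (new - old) / old * 100


gb_per_zb = ZETTABYTE // GIGABYTE      # one trillion, i.e. 10**12
growth = percent_increase(1.8, 2.7)    # uses the rounded figures from the text

print(f"1 ZB = {gb_per_zb:,} GB")
print(f"1.8 ZB -> 2.7 ZB is roughly {growth:.0f}% growth")
```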

To take these numbers to the organizational level, McKinsey Global Institute estimates that by 2009, companies with more than 1,000 employees in nearly all sectors of the U.S. economy stored an average of at least 200 terabytes of data, and that in many sectors the average exceeded one petabyte (one quadrillion bytes) per company. To keep up, in the next decade the number of servers (virtual and physical) worldwide will grow 10x, and the number of files data centers process will grow 75x.

This influx of data both floating around the digital universe and stored in IT organizations is dubbed, appropriately, Big Data. The challenge for enterprises can be boiled down to two words: Speed kills.

As David Corrigan, director of strategy for IBM's InfoSphere portfolio, told ITBusinessEdge in a 2011 interview, “velocity” is one factor that defines Big Data: “By velocity, we’re talking about the pace at which the information is ingested, so streaming analytics is an example — the pace of huge volumes. You could call it batch, but these really are bursts of information.”

Another v-word defining Big Data, according to Corrigan, is “variety,” as he explained in the ITBusinessEdge article: “Big Data isn’t just about volume. It equally has something to do with the variety of data. In other words, when you're not just dealing with structured information or semi-structured information, but you get into text and content, video, audio and the need to analyze data from all of those different variety of sources to come up with an answer or to solve a particular use case.”

Data accumulates so quickly that it’s difficult for IT organizations not only to maintain enough storage capacity, but also to keep pace with the new architectures, technologies and methodologies springing forth to generate Big Data Analytics. In sum, it’s increasingly difficult to determine appropriate strategies for analyzing, and gleaning value from, Big Data sets.

But there is hope.

Storage economics have changed with the times. The cost of storage has fallen substantially as processing power has increased, and technologies such as compression and deduplication have come along to shrink capacity requirements. Combined, these developments have helped companies keep even Big Data sets manageable.
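
To make the idea of deduplication and compression concrete, here is a minimal sketch. It is a toy illustration under stated assumptions, not how any particular storage product works: the data is already split into chunks, identical chunks are detected by their SHA-256 digest, and the helper names (deduplicate, rebuild) are hypothetical.

```python
import hashlib
import zlib


def deduplicate(chunks):
    """Keep one copy of each distinct chunk, keyed by its SHA-256 digest."""
    store = {}    # digest -> chunk bytes, stored only once
    recipe = []   # ordered digests needed to rebuild the original data
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original data from the stored chunks."""
    return b"".join(store[digest] for digest in recipe)


# Made-up sample data: the first and third chunks are identical.
chunks = [b"quarterly report " * 50, b"sensor log " * 50, b"quarterly report " * 50]
store, recipe = deduplicate(chunks)

raw_size = sum(len(c) for c in chunks)
dedup_size = sum(len(c) for c in store.values())
compressed_size = sum(len(zlib.compress(c)) for c in store.values())

print(raw_size, dedup_size, compressed_size)
```

In this sample the duplicate chunk is stored only once, and compressing the remaining chunks shrinks them further because the contents are highly repetitive. Real data will compress less dramatically.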

At the same time, technology for extracting value from Big Data has evolved. According to a recent report from The Data Warehousing Institute (TDWI), the emerging category of Big Data Analytics has developed to encompass a collection of techniques and tools enterprises can use to handle Big Data. This tool set can include predictive analytics, data mining, statistics, artificial intelligence and natural language processing, among others.

The IBM Entity Analytics group, for example, develops the InfoSphere Identity Insight Solutions, which enable streaming analytics on data sets with potentially billions of rows – in real time, with sub-millisecond decisions. In part, the solutions accomplish this feat by counting “entities” and determining which of them are the same. Jeff Jonas, chief scientist of the IBM Entity Analytics group and an IBM Distinguished Engineer, explains it this way:

“Imagine a giant pile of puzzle pieces – giant – with different colors, sizes, shapes…and you don’t know if there are duplicates, if there are pieces missing or if it’s one puzzle or fifty puzzles. We call that Big Data. What we do in Entity Analytics is we take each puzzle piece and see how it relates to each other. When you do that, it ends up getting this much richer understanding and it allows you to make higher quality decisions. The advantage of Big Data is when you blend together the blue, green, yellow, magenta puzzle pieces. Then, the quality of your understanding is so much better and your decisions start to get really smart.”
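
The general idea of determining which “entities” are the same can be sketched in a few lines. This is a toy illustration only and makes no claim about how the IBM products work: it assumes that two records refer to the same entity when they share a normalized email address or phone number, and the field names and sample records are invented.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase and drop spaces and punctuation other than '@' and '.'."""
    return "".join(ch for ch in value.lower() if ch.isalnum() or ch in "@.")


def resolve_entities(records):
    """Group record indexes that share a normalized email or phone."""
    parent = list(range(len(records)))  # union-find forest over record indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, normalized value) -> first record index
    for idx, record in enumerate(records):
        for field in ("email", "phone"):
            value = record.get(field)
            if not value:
                continue
            key = (field, normalize(value))
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Invented sample records: the first three describe the same person.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
    {"name": "A. Lee", "email": "ANN@example.com "},
    {"name": "Ann L.", "phone": "555 0100"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]
entities = resolve_entities(records)
print(entities)
```

Here the second and third records share no attribute with each other, yet both link to the first, so all three end up as one entity. That is the puzzle-piece effect Jonas describes: each piece is understood by how it relates to the others.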

And there’s Apache Hadoop, an open-source programming framework that supports the processing of massive data sets in distributed computing environments. While the platform hasn’t yet achieved widespread adoption, the TDWI report found that 24 percent of the IT organizations surveyed are using Hadoop.

Why the buzz about Hadoop? Unstructured data. Studies estimate that as much as 90 percent of the data being generated in the digital universe is unstructured. It comes from diverse sources that continue to multiply: sensors, devices, Web applications, images, voice, video surveillance and social media. Hadoop can break down for query not just large volumes of data, but also a wide variety of data types. Companies such as IBM and Cloudera are developing commercial tools and services that sit on top of Hadoop.
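
Hadoop’s processing model is MapReduce, and its shape can be sketched in a single process. The example below is plain Python, not Hadoop’s API, and the function names are illustrative. It counts words in two made-up documents: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step combines each group. In a real cluster those steps are spread across many machines.

```python
from collections import defaultdict


def map_phase(document: str):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


documents = ["big data big analytics", "data in motion"]
counts = word_count(documents)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be divided among machines without changing the result.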

“It’s a new idea,” said Cloudera CEO Mike Olson in a recent YouTube interview conducted by tech blogger Robert Scoble. “You can ask a question in a reasonable time that touches every single byte and terabyte—not just touches it, but manhandles it… We’ve never been able to solve problems like that [by thinking in scale], so we don’t even think of questions like that. Thinking in that way is a new skill.”

This means the key to unlocking the value of Big Data is not just having the technology to handle the volume. Analytical thinking about Big Data must also evolve in order to derive Big Analytics from it.

“We have to put it all in perspective,” said Neil Raden, VP and Principal Analyst at Constellation Research Group, who focuses on Analytics and Business Intelligence. “From 2000 to 2011 the amount of data IT organizations are handling has shown exponential growth. What’s really happened in response is we’ve worked with a set of technologies for exponentially expanding data [capabilities] and we’ve taken them pretty far.”

“The real question is… what are you going to do with it?”

In the next installment of the Big Data Analytics series, Renee Boucher Ferguson continues her conversation with experts in the field, who discuss how organizations are coping with analyzing all that information.
