Big data – what’s it all about? When you really dig into the concept, you find that it’s both more and less than the hype surrounding it. The meaning seems obvious; we’re talking about massive amounts of digital information. In a world where knowledge is power and storage space is cheap, organizations are amassing data sets that reach into the petabyte range. New data is being created at unprecedented rates, with predictions that by 2020, the amount of stored data will reach the 40 zettabyte mark (that’s 40 trillion GB). However, the definition of big data goes beyond mere size. The real value is in the ability to sift through all that information to discern relationships and predict behaviors – big data analytics.
A recent Gartner survey indicates that businesses are buying into the big data concept in a big way, at least in theory if not in practice. Fully 64 percent of respondents said their companies plan to invest in big data technology – but fewer than 10 percent have actually deployed a big data solution. Why the discrepancy in numbers? Is it the difficulty of implementation? The cost? Confusion over which solution to adopt? Let’s take a quick look at the state of big data today and the business case for implementing a big data strategy.
The scientific community, government and the military, and e-commerce are some of the fields where big data is already being put to use. Big tech companies such as Microsoft, Google, and Amazon handle huge amounts of data every day, as do large retailers such as Walmart and eBay. According to IBM, 2.5 billion GB of data are created on a daily basis, much of it high-velocity. High-velocity data sources are those that generate large amounts of information over a relatively short period of time, and with new data constantly coming in, useful analytics have to be performed in real time.
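What does real-time analytics on a high-velocity source look like in practice? Here's a purely illustrative Python sketch (not taken from any vendor's product) of a sliding-window counter that keeps running tallies as events stream in, rather than waiting to process them in batches:

```python
from collections import Counter, deque
import time

class SlidingWindowCounter:
    """Count events per key over the last `window_seconds`,
    updating incrementally as each event arrives."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()      # (timestamp, key) pairs, oldest first
        self.counts = Counter()

    def add(self, key, timestamp=None):
        now = timestamp if timestamp is not None else time.time()
        self.events.append((now, key))
        self.counts[key] += 1
        self._evict(now)

    def _evict(self, now):
        # Drop events that have fallen out of the time window.
        while self.events and self.events[0][0] < now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n=5):
        return self.counts.most_common(n)

# Hypothetical usage: tally page views as they stream in.
counter = SlidingWindowCounter(window_seconds=60)
for page in ["/home", "/cart", "/home", "/checkout", "/home"]:
    counter.add(page)
print(counter.top(3))  # e.g. [('/home', 3), ('/cart', 1), ('/checkout', 1)]
```

Real streaming platforms distribute this kind of bookkeeping across many machines, but the principle is the same: the answer is updated as the data arrives, not after the fact.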
In addition to size and velocity, big data is also about structure – or the lack thereof. Structured data is data that fits neatly into specifically defined fields; a list of contacts with their names, addresses and phone numbers is a classic example. Traditional relational databases are designed to handle exactly this kind of data, and because the structure is known in advance, it’s easy to search using relatively simple queries.
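To make that concrete, here's a minimal Python sketch using the standard library's sqlite3 module (the table and the contacts in it are made up for illustration) showing how predefined fields make structured data trivially searchable:

```python
import sqlite3

# A contact list is a classic example of structured data: every record
# fits the same predefined fields, so simple queries can search it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        name    TEXT,
        address TEXT,
        phone   TEXT
    )
""")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ada Lovelace", "12 Analytical Way", "555-0100"),
        ("Grace Hopper", "7 Compiler Court", "555-0142"),
    ],
)

# Because the schema is known in advance, lookups are trivial.
for row in conn.execute(
        "SELECT name, phone FROM contacts WHERE name LIKE 'Grace%'"):
    print(row)  # ('Grace Hopper', '555-0142')
```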
Unstructured data is less organized and much more difficult to categorize. Think of an email message that touches on several different topics, or a stream of Facebook posts and tweets. Estimates vary within the industry, but there is general agreement that the majority of data is unstructured, with Gartner putting the figure at around 80 percent.
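Contrast the database query above with what it takes to pull meaning out of free text. The toy sketch below (the email body and stop-word list are invented for illustration) has to fall back on crude text processing, because there are no fields to query:

```python
import re
from collections import Counter

# An email body has no predefined fields; extracting anything useful
# means resorting to text processing and heuristics.
email_body = """
Hi team -- the Q3 numbers look good, but shipping delays in the
Denver warehouse are hurting customer reviews. Also, can someone
look at the new supplier contract before Friday?
"""

# Crude keyword extraction: lowercase, strip punctuation, drop stop words.
stop_words = {"hi", "team", "the", "but", "in", "are", "also", "can",
              "someone", "look", "at", "new", "before", "good"}
words = re.findall(r"[a-z0-9]+", email_body.lower())
keywords = Counter(w for w in words if w not in stop_words)
print(keywords.most_common(5))  # rough topic hints, nothing more
```

Real analytics platforms bring natural-language processing and machine learning to bear on this problem, but the underlying difficulty is the same: the structure has to be inferred rather than declared.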
Big data usually consists of both structured and unstructured data, although in some cases, such as social networking sites, most or all of the data is of the unstructured (or semi-structured) type. The data typically comes from a wide variety of sources, in many different formats: documents, spreadsheets, email messages, chat logs, social network posts, graphics files, audio and video recordings, and data collected by various sensors and monitors. Somehow, big data analytics solutions must be able to make sense of it all, ferret out patterns and trends, and present results in a way that allows human beings to make intelligent business decisions based on the data. Otherwise, what’s the point of collecting and storing all that data in the first place?
Why aren’t companies deploying big data solutions more quickly? As with any relatively new technology, a number of factors are involved. There’s no shortage of players in the big data space offering solutions, from well-known tech giants such as IBM, HP, Oracle, Amazon, Microsoft, Google and EMC to startups such as Cloudera, Hortonworks and Splunk. The problem is that there are few real standards, and it can be difficult to get a handle on exactly how individual solutions work and what the costs – both up front and ongoing – will be. You can’t talk about big data without mentioning Apache Hadoop, the open source framework on which many big data solutions are based. Unfortunately, there is a shortage of talented data scientists with the knowledge and skills to make it work, and a shortage of tools to enable those data scientists to work effectively and efficiently.
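For a flavor of why Hadoop expertise is its own skill set, here's the canonical word-count example written as a single-process Python sketch of the MapReduce model Hadoop is built on. A real Hadoop job runs the same map and reduce steps in parallel across a cluster, with the framework handling the sorting and shuffling in between:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in a line of input.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum the 1s emitted for each distinct word.
    return (word, sum(counts))

documents = ["big data is big", "data about data"]

# Map everything, then sort by key -- Hadoop performs this
# shuffle-and-sort step between the two phases automatically.
pairs = sorted((kv for line in documents for kv in mapper(line)),
               key=itemgetter(0))

for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
# ('about', 1)  ('big', 2)  ('data', 3)  ('is', 1)
```

Even in this toy form, the split into independent map and reduce steps is what lets Hadoop scale the same logic across thousands of machines.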
There’s no doubt that big data is the way of the future. But it’s a disruptive technology, and resistance – even if ultimately futile – is likely to be strong until some of the complexity and confusion can be removed. Part of the solution lies in the development of better tools, but most of it will come down to education. If knowledge is power, knowledge about how to deal with big data may be one of the most powerful weapons an IT professional can have in their arsenal as we move further into this information-driven era.