I’ve been asked this question many times by friends as well as industry contacts. The quick answer is that they are all false. The longer answer is not so quick. No single statement is true of all providers of big data services: each provider has different needs, desires, expertise, and culture. Even if a particular provider’s claim about big data happens to be true, it may be true only because that provider happened to be talking to the right audience at the right time.
It is certainly true that big data science in general has changed dramatically over the last couple of decades. In particular, the field of machine learning has received a great deal of attention. This field uses large databases, typically containing tens or even hundreds of thousands of examples, to perform some task, such as teaching a computer to recognize handwriting, by “training” on the actual examples in the database. The result is a machine that can quickly and accurately recognize which of two photographs shows a man and which shows a woman, or whether a face matches one seen in a previous photo, and so on.
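To make the “training on stored examples” idea concrete, here is a minimal sketch in plain Python using a 1-nearest-neighbor classifier; the feature vectors, labels, and class names are invented for illustration, standing in for the photo-recognition tasks described above:

```python
# Minimal 1-nearest-neighbor classifier: "training" is just storing
# labeled examples; prediction returns the label of the closest one.
from math import dist

def train(examples):
    """examples: list of (feature_vector, label) pairs."""
    return list(examples)  # the "model" is simply the stored data

def predict(model, point):
    # The label of the nearest stored example wins.
    nearest = min(model, key=lambda ex: dist(ex[0], point))
    return nearest[1]

# Invented toy data: two clusters standing in for two photo classes.
training_data = [
    ((1.0, 1.0), "class_a"), ((1.2, 0.9), "class_a"),
    ((5.0, 5.0), "class_b"), ((4.8, 5.2), "class_b"),
]
model = train(training_data)
print(predict(model, (1.1, 1.0)))  # near the first cluster -> class_a
```

Real systems use far larger databases and more sophisticated models, but the pattern is the same: the program’s behavior comes from the examples, not from hand-written rules.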
The accuracy of these programs is, in my opinion, one of the most important questions concerning big data services. What is the likelihood that the program will produce the right result? What is the likelihood that a person trained to use such a program will be able to apply what they learned and do what they were trained to do? How accurate is that predictor? Can it be trusted?
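One standard way to answer “how accurate is that predictor?” is to measure the fraction of correct predictions on held-out examples the program never trained on. A minimal sketch, where the threshold rule and test data are invented for illustration:

```python
# Estimate a predictor's accuracy as the fraction of held-out
# examples it classifies correctly.

def accuracy(predictor, test_set):
    correct = sum(1 for features, label in test_set
                  if predictor(features) == label)
    return correct / len(test_set)

# Invented toy predictor: classifies a number by a fixed threshold.
def threshold_rule(x):
    return "big" if x >= 10 else "small"

held_out = [(3, "small"), (7, "small"), (12, "big"), (15, "big"), (9, "big")]
print(accuracy(threshold_rule, held_out))  # 4 of 5 correct -> 0.8
```

The key point is that accuracy must be measured on data the predictor has not seen; scoring it on its own training examples gives an inflated, untrustworthy number.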
I think the answer to these questions is: not very accurate. However, if you give someone trained in big data services the right set of parameters to work with, they can often make a lot of money. I know someone who makes a good living using them, and I know others whose results have been far less impressive. So there is certainly room for improvement.
Vendors selling big data services typically point out that their software works only with a very specific kind of data: unstructured or semi-structured data, which they describe as “raw” or unmanageable. Their software, they say, is very effective at transforming raw data into something more useful, but not efficient at extracting useful information from the raw data itself.
However, anyone who has worked with big data services can tell you that this is simply not the case. Almost any data type can be transformed into a useful representation with specialized software: text files and web pages, as well as audio and video files, can all be transformed with proprietary data-transformation software. So vendors selling software built on the theory of transforming raw data into useful information are using that very same theory to transform the data into profitable information products.
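As a small illustration of turning raw, “unmanageable” text into a useful representation, here is a sketch that converts free-form text into a structured word-frequency table using only the standard library; the sample sentence is invented:

```python
# Transform raw text into a structured representation: a
# word-frequency table, a common first step before analysis.
import re
from collections import Counter

def to_word_counts(raw_text):
    # Lowercase and extract word tokens from the unstructured input.
    words = re.findall(r"[a-z']+", raw_text.lower())
    return Counter(words)

raw = "Raw data is not unmanageable: raw data just needs structure."
counts = to_word_counts(raw)
print(counts.most_common(2))  # the two most frequent words
```

Commercial transformation tools do far more (parsing web pages, decoding audio and video), but each follows this same shape: raw bytes in, structured records out.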
There are two main types of software vendors built on this theory: those that use a general-purpose programming language such as C++ or Java, and those that use proprietary scripting languages. Vendors in the first group claim their products are superior because they are easier to use and offer more features than the scripting-based alternatives. In reality, the scripting-based programs are just as efficient and cost no more to produce, though scripts do tend to be less accurate than many kinds of software used for scientific analysis. Vendors that use a programming language also benefit from lower development costs.
So which statements about big data are true? The answer is both “yes” and “no”. There is no single best way to process massive amounts of data; processes must be adapted to each specific situation. And it is a little-known fact that one of the primary factors determining the quality of any scientific study is the level of cooperation among the scientists involved.