# Answers to the Question: “What Is Big Data?”

Big Data and statistics go hand in hand. Statistics can provide us with a great deal of information. But how can we make sense of all that information?

Facts: Big Data is made up of facts. Data is a sequence of facts, and each of those facts can tell us something, provided it is consistent with the other facts. Take a statement like “The average American spends approximately twice as much on carpet cleaning as the average Japanese.” A statement of this kind is a claim that the data can either support or contradict. There may be some variation between the data set presented and the actual results, but if you assume that the range of American spending is similar to that of Japanese spending, this statement about ranges can tell you whether or not your theory about a trend is correct.

Interpretation: Interpretation means working out what all the facts mean. The same principle applies to many other types of statements in science, including the statement “X happens before Y.” It is easy to see how people can argue with such a statement, and they often do. The underlying assumption, however, is that observation gives us the means to refine our understanding of the data and thus to test our theories.

Options: Which of the following statements is true of big data? If the range is very large, the answer is yes. In general, however, no. If the range of a data set (the gap between its smallest and largest values) is very large, it means that there is room for error, even within that data set. It also means that the data set is relatively easy to improve upon. That’s why a range of observations is treated as a range instead of a single observation.
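To make the idea of a range concrete, here is a minimal Python sketch. The function name `value_range` and both lists of observations are invented for illustration; they are not taken from any real data set.

```python
def value_range(values):
    """Return the range (largest value minus smallest value) of a sequence."""
    if not values:
        raise ValueError("values must not be empty")
    return max(values) - min(values)


# Hypothetical observations, made up for this example only.
narrow = [9.8, 10.1, 10.0, 9.9]   # measurements that agree closely
wide = [2.0, 10.0, 25.0, 3.0]     # measurements that disagree a lot

print("narrow range:", value_range(narrow))
print("wide range:", value_range(wide))
```

The second list has a much larger range than the first, which is the situation the paragraph above describes as leaving room for error.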

Options: Which of the following statements is true of big data if it’s available for widespread use? It is possible for big data to improve upon the accuracy and precision of individual data sets. That is, it is possible for the individual pieces of information to be tuned or otherwise changed to reflect new conditions or patterns.

Options: Which of the following statements is true of big data if we apply the same methods to each of the sets? If we use the standard deviation, for instance, it is the square root of the variance, so the relationship between the standard deviation and the volatility is itself a function of the variance. If we apply a logistic regression to the data set, then the relationship is one which can be tested by measuring how far the observed outcomes deviate from the fitted curve.
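The link between variance and standard deviation can be checked in a few lines. This is a small sketch using Python's standard `statistics` module; the helper name `spread` and the sample values are assumptions chosen for illustration.

```python
import math
import statistics


def spread(values):
    """Return (population variance, population standard deviation)."""
    variance = statistics.pvariance(values)
    std_dev = statistics.pstdev(values)
    return variance, std_dev


# Hypothetical sample, chosen so the arithmetic is easy to follow by hand.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
variance, std_dev = spread(sample)

# The standard deviation is the square root of the variance.
assert math.isclose(std_dev, math.sqrt(variance))
print("variance:", variance, "standard deviation:", std_dev)
```

For this sample the mean is 5, the variance is 4, and the standard deviation is 2. The sketch uses the population formulas; `statistics.variance` and `statistics.stdev` would give the sample versions instead.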

Options: What probability statement best expresses the probability of the data distribution? Standard deviation is one such statement: it summarizes the spread of an approximately continuous distribution. There are many other probability statements, such as the binomial tree or the log-normal curve. The probability density is another; it describes how probability is spread over the possible values. There are also probability statements that are not themselves normal distributions, like the log-normal, chi-square, or lattice diagrams.

Verifiable answer: Which of the following statements is true of big data science if it can be verified? First, it is not necessary to validate each piece of data as it is generated. Second, validation is useful only for measuring consistency and precision. Third, it may not be possible to verify all pieces of the data set because their variability makes them too difficult to predict. Fourth, it may not be necessary to validate because we already know what causes variability, and the uncertainty associated with the measurements is large.

Reliability statement: Which of the following statements is true of big data science if it can be reliably calculated? First, it is not necessary to validate each piece of data as it is generated. Second, validation is useful only for measuring consistency and precision. Third, it may not be possible to verify all pieces of the data set because their variability makes them too difficult to predict. Fourth, it may not be necessary to validate because we already know what causes variability, and the uncertainty associated with the measurements is large.

Consistency statement: Which of the following statements is true of data mining? First, it is necessary to check for discrepancies and incorrect values before data mining. Second, after data mining, it is necessary to correct for errors and inconsistencies whenever they occur. Third, in some applications, it is not necessary to correct for inconsistencies and incorrect values because the sources used by the analysts are themselves consistent. Fourth, it may not be possible to correct for inconsistencies and incorrect values because a number of sources contradict each other.
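The first point, checking for discrepancies and incorrect values before data mining, could look something like the sketch below. The function `find_discrepancies`, the `age` field, the allowed bounds, and the records are all hypothetical and exist only to show the idea.

```python
def find_discrepancies(records, field, low, high):
    """Return (index, reason) pairs for records whose field looks wrong.

    A value is flagged when it is missing, not a number, or outside
    the closed interval [low, high].
    """
    problems = []
    for index, record in enumerate(records):
        value = record.get(field)
        if value is None:
            problems.append((index, "missing"))
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append((index, "not numeric"))
        elif not low <= value <= high:
            problems.append((index, "out of range"))
    return problems


# Hypothetical records, invented for this example.
records = [
    {"age": 34},
    {"age": -5},
    {"name": "no age recorded"},
    {"age": "forty"},
    {"age": 61},
]

for index, reason in find_discrepancies(records, "age", 0, 120):
    print(f"record {index}: {reason}")
```

A check like this only reports problems. Whether to correct, drop, or keep the flagged records depends on the application, which is what the third and fourth points are about.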

Accuracy statement: Which of the following statements is true of big data science if it can be sufficiently maintained? First, it is necessary to store and retrieve data frequently. Second, it is necessary to use tools that make it easy to maintain consistency and uniformity in big data sets. Third, it may not be possible to maintain consistency and uniformity in big data sets due to their large size and dynamic contents. Fourth, it may not be possible to retrieve and store data very frequently because the time needed to conduct quality assurance checks may be too long.