Data is the first step towards wisdom: data are the values of the measurements that we take as part of the analysis we want to perform. Each element taking part in our analysis is a sample, represented by one or more variables. On their own, 175, 95.5, "brown" and TRUE are data and nothing more. If we say that a person named "John" is described by his height (in centimeters), weight (in kilograms), hair color and whether he is a man or not, then the vector (175, 95.5, "brown", TRUE) is the information that describes that person. If we know that John is fat (because we are told so), we could infer that a person who is 175 cm tall and weighs 95.5 kg (or more) can be considered fat. Rules of this kind are what we call knowledge. Notice that we have used neither hair color nor gender to infer whether a person is fat. Ignoring hair color seems clearly justified, but ignoring gender may not be (men are usually taller and heavier than women).
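The step from data to information to knowledge can be sketched in a few lines of Python. This is only an illustration of the example above: the names `Person`, `john` and `is_fat` are hypothetical, and the rule is the literal one inferred from John alone, not a realistic criterion.

```python
from typing import NamedTuple


class Person(NamedTuple):
    """One sample described by four variables (field names are assumptions)."""
    height_cm: float
    weight_kg: float
    hair_color: str
    is_man: bool


# Data: bare values with no meaning attached.
raw = (175, 95.5, "brown", True)

# Information: the same values bound to the variables that describe a sample.
john = Person(*raw)


def is_fat(p: Person) -> bool:
    """Knowledge: the rule inferred from John's case, taken literally.

    Hair color and gender are deliberately ignored, as in the text.
    """
    return p.height_cm == 175 and p.weight_kg >= 95.5
```

Calling `is_fat(john)` returns `True`, while a person of the same height weighing 70 kg would not satisfy the rule.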
As the example above shows, there are several kinds of data:
- Discrete: data that takes values from a finite set. It can be:
  - Binary: when only two values are possible, TRUE and FALSE.
  - Categorical: when two or more values are possible, e.g. "brown", "black", "white" and "blonde". Notice that binary variables are a special case of categorical ones.
  - Ordinal: a categorical variable whose possible values are ordered.
- Continuous: data that takes values from the real numbers. Continuous variables usually come from real-world measurements or physical quantities, e.g. weight, height or length.
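One way to keep track of the kind of each variable is to record it explicitly next to the data. The sketch below assumes hypothetical names (`Kind`, `schema`, `SIZE_LEVELS`, `ordinal_rank`) and a made-up ordinal variable, since the John example contains none.

```python
from enum import Enum


class Kind(Enum):
    BINARY = "binary"
    CATEGORICAL = "categorical"
    ORDINAL = "ordinal"
    CONTINUOUS = "continuous"


# Kind of each variable in the John example (variable names are assumptions).
schema = {
    "height_cm": Kind.CONTINUOUS,
    "weight_kg": Kind.CONTINUOUS,
    "hair_color": Kind.CATEGORICAL,
    "is_man": Kind.BINARY,
}


def is_discrete(kind: Kind) -> bool:
    """Binary, categorical and ordinal variables are all discrete."""
    return kind is not Kind.CONTINUOUS


# Hypothetical ordinal variable: the position in the list defines the order.
SIZE_LEVELS = ["small", "medium", "large"]


def ordinal_rank(value: str, levels: list[str]) -> int:
    """Map an ordinal value to its rank so that values can be compared."""
    return levels.index(value)
```

With this encoding, `ordinal_rank("small", SIZE_LEVELS) < ordinal_rank("large", SIZE_LEVELS)` holds, which is exactly the extra structure an ordinal variable has over a plain categorical one.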
Knowing the exact kind of each variable in the data set is important because some models and algorithms can be applied only to specific kinds of data. Furthermore, most data mining techniques can be optimized in order to exploit the particular characteristics of each variable.
Data often contain missing values, meaning that no value is stored for one or more variables in one or more samples. For example, if we do not know John's hair color, John's information would be (175, 95.5, ?, TRUE), where "?" (or any other symbol, such as "NULL" or "unknown") marks the missing value. Missing values matter because not all statistical and data mining models handle them properly.
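As a minimal sketch, missing values can be represented in Python with `None` and then counted or filtered out before modeling. The second sample and the helper names `missing_counts` and `complete_cases` are assumptions made for illustration; dropping incomplete samples is only one of several possible treatments.

```python
# Each sample is (height_cm, weight_kg, hair_color, is_man); None marks "?".
names = ("height_cm", "weight_kg", "hair_color", "is_man")
samples = [
    (175, 95.5, None, True),      # John, hair color unknown
    (162, 58.0, "black", False),  # hypothetical complete sample
]


def missing_counts(rows, names):
    """Count the missing values of each variable."""
    return {
        name: sum(row[i] is None for row in rows)
        for i, name in enumerate(names)
    }


def complete_cases(rows):
    """Keep only the samples with no missing value.

    The test is `is not None` rather than truthiness, so that legitimate
    values such as False or 0 are not mistaken for missing ones.
    """
    return [row for row in rows if all(v is not None for v in row)]
```

Here `missing_counts(samples, names)` reports one missing value for `hair_color`, and `complete_cases(samples)` keeps only the second sample.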
Data usually comes in tabular form, with each sample in a row and each variable in a column.
The number of rows (the number of samples) is usually denoted by N and is called the cardinality of the data set. The number of variables is usually denoted by d and is called the dimensionality. It is desirable that N >> d, where >> means "much larger than" (ten times larger, for example). In other words, we need many more samples than variables.
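The quantities N and d can be read directly from a table stored as a list of rows. In the sketch below, the table contents (other than John's row), the helper names `shape` and `enough_samples`, and the default factor of ten are assumptions; the factor simply mirrors the "ten times" example given above and is not a fixed rule.

```python
# A tiny table: one row per sample, one column per variable.
table = [
    (175, 95.5, "brown", True),   # John
    (162, 58.0, "black", False),  # hypothetical sample
    (181, 77.3, "blonde", True),  # hypothetical sample
]


def shape(rows):
    """Return (N, d): the cardinality and dimensionality of the table."""
    n = len(rows)
    d = len(rows[0]) if rows else 0
    return n, d


def enough_samples(n: int, d: int, factor: int = 10) -> bool:
    """Check the rule of thumb N >> d, read here as N >= factor * d."""
    return n >= factor * d


N, d = shape(table)
```

For this table N is 3 and d is 4, so `enough_samples(N, d)` is `False`: with four variables, the ten-times reading would ask for at least 40 samples.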
Raudys, S. J.; Jain, A. K. (1991). "Small sample size effects in statistical pattern recognition: recommendations for practitioners". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264.