Model construction
Once the data has been preprocessed, a model has to be built to achieve the desired goal (clustering, classification, prediction, estimation). There are two basic kinds of models:
- Supervised: when there is a variable in the data set that represents the expected outcome for the other variables. In this case, such a variable is called the goal (or target) variable.
- Unsupervised: when no goal variable is available.
Unsupervised models are used for clustering purposes, that is, grouping samples according to the similarities (and differences) they show. For example, the users of a web site could be clustered according to their navigation profiles. Supervised models, on the other hand, are used to build classifiers which predict or estimate a value for each sample in the data set. For example, credit scoring systems use classifiers to determine whether a customer will be a good credit card user or not. Likewise, if a variable describing academic performance is available alongside the users' navigation profiles, a classifier can be built to predict academic performance from the navigation profile, provided some relationship between the two actually exists.
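As a rough illustration of the two kinds of models, the sketch below clusters hypothetical navigation profiles with k-means (unsupervised) and, when a goal variable such as pass/fail performance is available, trains a decision tree classifier on the same data (supervised). The data, the column meanings and the use of scikit-learn are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical navigation profiles: one row per user, with columns such as
# pages visited, average session length and number of downloads.
rng = np.random.default_rng(0)
X = rng.random((100, 3))

# Unsupervised: no goal variable, so users are simply grouped by similarity.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: a goal variable is available (e.g. pass/fail performance),
# so a classifier can be trained to predict it from the profile.
y = rng.integers(0, 2, size=100)  # hypothetical goal variable
classifier = DecisionTreeClassifier(random_state=0).fit(X, y)

print(clusters[:10], classifier.predict(X[:5]))
```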
Classifiers are built with one goal in mind, namely generalization: to be useful, a classifier trained on the available data set should perform similarly well on new samples. It is possible to build a perfect classifier for a given data set, but this usually leads to overfitting, that is, a lack of generalization. Overfitting can be avoided by using two different data sets, one for building the classifier (the training set) and another for validating it (the testing set). There are several ways to do this:
- If the original data set is large enough, split it randomly into two data sets following the rule of two thirds, that is, two thirds of the samples for training and one third for testing. This is called the hold-out method (a minimal sketch is given after this list).
- If the original data set is not large enough, there are several techniques that make better use of the available samples when building and evaluating the classifier:
  - Cross-validation: split the original data set into two sets, one of k samples (for testing) and another of N-k samples (for training). This yields C(N, k) possible pairs of training and testing sets, which can be used to build as many classifiers (a sketch is given after the list). The resulting classifiers can then be combined:
    - Bagging: combine them with a simple majority voting system.
    - Boosting: combine them with a weighted voting system, where each classifier's vote is weighted by its performance.
  - Leave-one-out: a special case of cross-validation in which k = 1.
  - Bootstrapping: a subset of M samples is randomly selected from the original set by sampling with replacement. This procedure is repeated r times and the results are averaged (sketched after the list).
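As a minimal sketch of the hold-out method, the following splits a hypothetical data set (X, y) two thirds/one third and evaluates a decision tree on the held-out part; the data, the choice of classifier and the use of scikit-learn are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data set: feature matrix X and goal variable y.
rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = rng.integers(0, 2, size=300)

# Hold-out: two thirds of the samples for training, one third for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, classifier.predict(X_test)))
```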
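The next sketch illustrates cross-validation as described above (leave-k-out, here with k = 2, enumerating all C(N, k) splits) and its k = 1 special case, leave-one-out. The text presents bagging and boosting as voting schemes over the resulting classifiers; here scikit-learn's generic BaggingClassifier and AdaBoostClassifier wrappers, which resample the data internally, stand in for that idea. The data and parameter values are again arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, LeavePOut, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# Small hypothetical data set (the number of exhaustive splits grows combinatorially).
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = rng.integers(0, 2, size=20)

tree = DecisionTreeClassifier(random_state=0)

# Cross-validation: hold out k samples for testing and train on the
# remaining N - k, for every one of the C(N, k) possible splits.
cv_scores = cross_val_score(tree, X, y, cv=LeavePOut(p=2))   # k = 2

# Leave-one-out: the special case k = 1, giving N train/test pairs.
loo_scores = cross_val_score(tree, X, y, cv=LeaveOneOut())

print("leave-2-out accuracy:", cv_scores.mean())
print("leave-one-out accuracy:", loo_scores.mean())

# Bagging (simple majority voting) and boosting (performance-weighted voting)
# combine many classifiers trained on different samples of the data.
bagging = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
boosting = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)
```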
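Finally, a sketch of bootstrapping, under the assumption that in each of the r repetitions the classifier is tested on the samples that were not drawn into the bootstrap sample; M = N, r = 30 and the decision tree are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data set of N samples.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.integers(0, 2, size=100)

N, M, r = len(X), len(X), 30          # here M = N, a common choice
scores = []
for _ in range(r):
    # Draw M samples with replacement (the bootstrap sample) ...
    boot = rng.integers(0, N, size=M)
    # ... and test on the samples that were not drawn.
    out_of_bag = np.setdiff1d(np.arange(N), boot)
    classifier = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    scores.append(accuracy_score(y[out_of_bag], classifier.predict(X[out_of_bag])))

# Repeat r times and average the results.
print("bootstrap accuracy estimate:", np.mean(scores))
```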