Sport Informatics and Analytics/Pattern Recognition/Knowledge Discovery

Introduction
This topic explores how we can extract useful information and actionable insights from sport data.

A variety of labels has been used to characterise processes that extract useful information from data. These include "data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing".

Gregory Piatetsky-Shapiro introduced the term knowledge discovery in a report of a workshop in 1989 that brought together practitioners from "expert systems, machine learning, intelligent databases, knowledge acquisition, case-based reasoning and statistics". The report of the workshop concluded "knowledge discovery in databases is an idea whose time has come".

William Frawley, Gregory Piatetsky-Shapiro, and Christopher Matheus (1992) provided one of the earliest overviews of knowledge discovery in databases (KDD). They defined it as follows: "Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Given a set of facts (data) F, a language L, and some measure of certainty C, we define a pattern as a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler (in some sense) than the enumeration of all facts in Fs. A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge. The output of a program that monitors the set of facts in a database and produces patterns in this sense is discovered knowledge."

They added "Patterns are interesting when they are novel, useful, and non-trivial to compute".
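
The formal definition above can be made concrete with a small sketch. It is only one possible reading, not the authors' implementation: the class name `Pattern`, the function `is_knowledge`, the choice to measure certainty as the share of facts in the subset for which the statement holds, and the toy match records are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Fact = Dict[str, object]  # one record in the set of facts F


@dataclass
class Pattern:
    """A statement S about a subset Fs of F (hypothetical structure)."""
    statement: str                   # S, written in a language L (plain English here)
    selects: Callable[[Fact], bool]  # picks out the subset Fs of F
    holds: Callable[[Fact], bool]    # what S claims about each fact in Fs

    def certainty(self, facts: List[Fact]) -> float:
        """c, assumed here to be the share of Fs for which S holds."""
        subset = [f for f in facts if self.selects(f)]
        if not subset:
            return 0.0
        return sum(1 for f in subset if self.holds(f)) / len(subset)


def is_knowledge(pattern: Pattern, facts: List[Fact], min_certainty: float,
                 is_interesting: Callable[[Pattern], bool]) -> bool:
    """A pattern counts as knowledge if it is interesting and certain enough,
    with both thresholds supplied by the user."""
    return is_interesting(pattern) and pattern.certainty(facts) >= min_certainty


# Invented toy records, for illustration only (not real match data).
facts = [
    {"venue": "home", "scored_first": True, "won": True},
    {"venue": "home", "scored_first": True, "won": True},
    {"venue": "home", "scored_first": True, "won": False},
    {"venue": "away", "scored_first": True, "won": False},
    {"venue": "home", "scored_first": False, "won": False},
]

pattern = Pattern(
    statement="Home teams that score first go on to win",
    selects=lambda f: f["venue"] == "home" and f["scored_first"] is True,
    holds=lambda f: f["won"] is True,
)

c = pattern.certainty(facts)  # 2 of the 3 selected facts satisfy the statement
print(f"{pattern.statement}: certainty {c:.2f}")
print(is_knowledge(pattern, facts, min_certainty=0.6, is_interesting=lambda p: True))
```

The statement is shorter than listing the three matching records one by one, which is the sense in which a pattern is "simpler" than the enumeration of the facts it describes. Whether it is interesting remains a judgement for the user, which is why `is_interesting` is passed in rather than computed.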

In 1996, Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth provided "an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases". Their paper distinguishes KDD from data mining. They note: "In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data."

They argue that KDD is a process and data mining is a step within that process. The derivation of useful knowledge from data requires:
 * data preparation
 * data selection
 * data cleaning
 * incorporation of appropriate prior knowledge
 * proper interpretation of the results of data mining

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth provide the conceptual and practical foundation for the KDD process in sport contexts. They propose: "KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported."

Twenty years after the publication of their paper there is still a tendency to regard data mining and KDD as interchangeable terms. During this unit we have used the term analytics as a shorthand for KDD.
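
The distinction between the KDD process and the data mining step can be sketched as a small pipeline in which mining is a single function among several. This is a minimal illustration under stated assumptions, not a method taken from the paper: the function names, the order of the steps, the rule used as "prior knowledge" (a set of valid pitch zones), and the shot records are all invented for this example.

```python
from collections import defaultdict

# Invented shot records, for illustration only (not real match data).
RAW_SHOTS = [
    {"team": "A", "zone": "box", "goal": 1},
    {"team": "A", "zone": "box", "goal": 0},
    {"team": "A", "zone": "outside", "goal": 0},
    {"team": "B", "zone": "box", "goal": 1},
    {"team": "A", "zone": None, "goal": 1},     # missing value
    {"team": "A", "zone": "Box ", "goal": 1},   # untidy label
    {"team": "A", "zone": "bench", "goal": 0},  # implausible zone
]

VALID_ZONES = {"box", "outside"}  # assumed prior knowledge about the pitch


def select(records, team):
    """Data selection: keep the records and fields relevant to the question."""
    return [{"zone": r["zone"], "goal": r["goal"]} for r in records if r["team"] == team]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def prepare(records):
    """Data preparation: normalise labels so equal zones compare equal."""
    return [{**r, "zone": r["zone"].strip().lower()} for r in records]


def apply_prior_knowledge(records, valid_zones):
    """Prior knowledge: discard records that contradict what we know."""
    return [r for r in records if r["zone"] in valid_zones]


def mine(records):
    """Data mining: the one step that extracts a pattern (goals and shots per zone)."""
    goals, shots = defaultdict(int), defaultdict(int)
    for r in records:
        shots[r["zone"]] += 1
        goals[r["zone"]] += r["goal"]
    return {zone: (goals[zone], shots[zone]) for zone in shots}


def interpret(mined):
    """Interpretation: turn raw counts into statements a coach can read."""
    ranked = sorted(mined.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    return [f"{zone}: {g} goals from {s} shots ({g / s:.0%})" for zone, (g, s) in ranked]


records = select(RAW_SHOTS, team="A")
records = clean(records)
records = prepare(records)
records = apply_prior_knowledge(records, VALID_ZONES)
mined = mine(records)
report = interpret(mined)
print("\n".join(report))
```

Only `mine` corresponds to data mining in the sense used by Fayyad and colleagues. The other functions are the surrounding steps, and a change to any of them (for example, a different cleaning rule) changes what the mining step reports.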

Our discussion of analytics used this definition: "The discovery, communication, and implementation of actionable insights derived from structured information in order to improve the quality of decisions and performance in an organization." As we develop our KDD skills, this activity will include unstructured data too. Whatever is included, it will be part of a process that the literature of the 1990s foresaw.

Sport examples
We present two examples here for your consideration.

Chris Anderson and David Sally discuss the potential of an analytics approach to association football in their book The Numbers Game.

In the introduction to their book, they write: "The clue to analytics is in the name. To make (those) numbers mean something, to learn something from them, they must be analysed. The key, for those at the vanguard of what some have called a data 'revolution' and what we think of as football's reformation, is to work out what they need to be counting, and to discover why, exactly, what they are counting counts."

Their book explores the analytics process and raises important empirical and methodological issues for this course.

The second example presented here is the paper written in 1997 by Inderpal Bhandari and his colleagues at the IBM TJ Watson Research Center. The paper is titled Advanced Scout: Data Mining and Knowledge Discovery in NBA Data. In the paper, they report their analysis of data gathered by a software program, Advanced Scout, that "seeks out and discovers interesting patterns in game data". We have chosen this paper to connect with the spirit of the literature of the time. The editor of the journal within which the paper was accepted was Gregory Piatetsky-Shapiro.