Using the ID3 symbolic classification algorithm to reduce data density

Effective data reduction is mandatory for modeling complex domains. The work described here demonstrates how to use a symbolic classifier algorithm from machine learning to effectively reduce large amounts of data. The algorithm, Quinlan's ID3, uses input data records and corresponding classifications to produce a decision tree. The resulting tree can be used to classify previously unseen inputs. Alternatively, the attributes found in the tree can be used as the basis to develop other system modeling techniques such as neural networks or mathematical programming algorithms. This approach has been used to effectively reduce data from a large complex domain. The example shown here comes from the F/A-18 Hornet aircraft. Results of using the algorithm to identify different phases of flight from aircraft flight data are presented.

Introduction

Successful modeling of large, complex systems often requires the analysis of vast quantities of data to determine which attributes contribute significantly to the system characteristics. Each data record may contain a large number of parameters, only a few of which are important in modeling the process. In most cases, even experts in the domain have only a vague idea of the relative importance of many of the parameters.

The symbolic classification algorithm, ID3, was developed by machine learning researcher Ross Quinlan [Quinlan 1986] to help solve classification problems. The purpose of the algorithm is to learn concepts from a set of example data. The concepts are represented in the form of a decision tree. Each branch of the tree can be thought of as a rule where interior nodes and branches correspond to conditions that must be met.

1 Work supported by the McDonnell Douglas Independent Research and Development program and Statement of Work WS-MDRL-4062 with the University of Missouri-Rolla Engineering Education Center in St. Louis.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1994 ACM 089791-647-6/941 0003 $3.50

Leaves correspond to conclusions. Thus, a path from the root to a leaf may be considered a rule. The set of all paths in a tree can represent the set of concepts embodied in the original data. This type of algorithm is referred to as a classification algorithm since the decision trees produced are used to "classify" input cases.

The attributes found in an ID3 tree are those that are essential for performing classifications. In a decision tree that performs classifications with a high degree of accuracy, the attributes that appear in the tree are the ones that contribute significantly to modeling the process. The ID3 classification tree may model the system sufficiently. If it does not, the attributes in the decision tree can then be used to assist in the development of other system models. The authors have used this technique to select attributes for inclusion in classical modeling techniques and in neural networks.

An example of the authors' use of this approach is in evaluating the data connected with military aircraft maintenance. In order to quickly identify and resolve problems, large amounts of data are collected prior to, during, and after each flight. On the F/A-18 Hornet aircraft, one data collection point is known as the data storage unit (DSU). The portion of the DSU dealt with in these experiments collects over 180 different attribute values at periodic times, starting before flight and ending after flight.
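The decision-tree ideas described above (each root-to-leaf path read as a rule, and the attributes appearing in the tree treated as the ones worth keeping) can be illustrated with a minimal Python sketch. It assumes the information-gain splitting criterion of ID3 as described in [Quinlan 1986] and categorical attributes only. The records, the attribute names (gear, throttle, nav_lights), and all function names are hypothetical and invented for illustration; they are not taken from the DSU data or from the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical toy records (invented for illustration, not DSU data).
# Each record is (attribute values, flight phase).
ATTRIBUTES = ["gear", "throttle", "nav_lights"]
RECORDS = [
    ({"gear": "down", "throttle": "high", "nav_lights": "on"},  "takeoff"),
    ({"gear": "down", "throttle": "high", "nav_lights": "off"}, "takeoff"),
    ({"gear": "down", "throttle": "low",  "nav_lights": "on"},  "landing"),
    ({"gear": "down", "throttle": "low",  "nav_lights": "off"}, "landing"),
    ({"gear": "up",   "throttle": "high", "nav_lights": "on"},  "cruise"),
    ({"gear": "up",   "throttle": "low",  "nav_lights": "off"}, "cruise"),
    ({"gear": "up",   "throttle": "high", "nav_lights": "off"}, "cruise"),
    ({"gear": "up",   "throttle": "low",  "nav_lights": "on"},  "cruise"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(records, attr):
    """Reduction in label entropy obtained by splitting on attr."""
    labels = [label for _, label in records]
    remainder = 0.0
    for value in {row[attr] for row, _ in records}:
        subset = [label for row, label in records if row[attr] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy(labels) - remainder

def id3(records, attrs):
    """Return a leaf (class label) or an (attribute, {value: subtree}) node."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:          # pure node: make it a leaf
        return labels[0]
    if not attrs:                      # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(records, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row, _ in records}):
        subset = [(row, label) for row, label in records if row[best] == value]
        branches[value] = id3(subset, remaining)
    return (best, branches)

def rules(tree, conditions=()):
    """Yield one (conditions, conclusion) pair per root-to-leaf path."""
    if not isinstance(tree, tuple):
        yield list(conditions), tree
        return
    attr, branches = tree
    for value, subtree in branches.items():
        yield from rules(subtree, conditions + ((attr, value),))

def attributes_used(tree):
    """Collect the attributes that appear at interior nodes of the tree."""
    if not isinstance(tree, tuple):
        return set()
    attr, branches = tree
    used = {attr}
    for subtree in branches.values():
        used |= attributes_used(subtree)
    return used

def classify(tree, row):
    """Follow the branches matching row until a leaf is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

tree = id3(RECORDS, ATTRIBUTES)
for conditions, phase in rules(tree):
    print(" AND ".join(f"{a}={v}" for a, v in conditions), "->", phase)
print("attributes kept:", sorted(attributes_used(tree)))
```

In this toy data the nav_lights attribute carries no information about the phase, so it never appears in the tree. That is the sense in which the tree can serve as an attribute filter for a later model. Continuous-valued attributes would need discretization or threshold splits, which this sketch does not attempt.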
In the event of an equipment malfunction, the DSU record can be reviewed to determine the specific system status near the time of the malfunction. However, the large number of attributes and the large number of sets of information provide too much data to allow for timely help. Hence, there is a need to focus on subsets of the attributes.

In order to explore this issue, a simple problem was chosen: the identification of phase of flight. In some cases, it is important to know the flight phase in which a malfunction occurred, e.g., during takeoff, during cruise, or during landing. It was felt that this information could be determined by examining a subset of the attributes available. This subset could then be examined further to evaluate the importance of each attribute in determining flight classification. Since some attributes were indirectly related to the values of other attributes, repeating the process for multiple subsets might provide additional insight into the attributes.

The following sections describe the approach in detail, beginning with ID3 and the F/A-18 problem domain. The experimental results section shows how ID3 was used to