ABSTRACT Title of Dissertation: TACKLING UNCERTAINTIES AND ERRORS IN THE SATELLITE MONITORING OF FOREST COVER CHANGE Kuan Song, Ph. D, 2010 Directed By: Professor John R. G. Townshend Department of Geography University of Maryland, College Park This study aims at improving the reliability of automatic forest change detection. Forest change detection is of vital importance for understanding global land cover as well as the carbon cycle. Remote sensing and machine learning have been widely adopted for such studies with increasing degrees of success. However, contemporary global studies still suffer from lower-than-satisfactory accuracies and robustness problems whose causes were largely unknown. Global geographical observations are complex, as a result of the hidden interweaving geographical processes. Is it possible that some geographical complexities were not expected in contemporary machine learning? Could they cause uncertainties and errors when contemporary machine learning theories are applied for remote sensing? This dissertation adopts the philosophy of error elimination. We start by explaining the mathematical origins of possible geographic uncertainties and errors in chapter two. Uncertainties are unavoidable but might be mitigated. Errors are hidden but might be found and corrected. Then in chapter three, experiments are specifically designed to assess whether or not the contemporary machine learning theories can handle these geographic uncertainties and errors. In chapter four, we identify an unreported systemic error source: the proportion distribution of classes in the training set. A subsequent Bayesian Optimal solution is designed to combine Support Vector Machine and Maximum Likelihood. Finally, in chapter five, we demonstrate how this type of error is widespread not just in classification algorithms, but also embedded in the conceptual definition of geographic classes before classification. In chapter six, the sources of errors and uncertainties and their solutions are summarized, with theoretical implications for future studies. The most important finding is, how we design a classification largely pre-determines the ?scientific conclusions? we eventually get from the classification of geographical observations. This happened to many contemporary popular classifiers including various neural nets, decision tree, and support vector machine. This is a cause of the so-called overfitting problem in contemporary machine learning. Therefore, we propose that the emphasis of classification work be shifted to the planning stage before the actual classification. Geography should not just be the analysis of collected observations, but also about the planning of observation collection. This is where geography, machine learning, and survey statistics meet. TACKLING UNCERTAINTIES AND ERRORS IN THE SATELLITE MONITORING OF FOREST COVER CHANGE By Kuan Song Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Ph. D 2010 Advisory Committee: Professor John R. G. Townshend, Chair Professor Samuel Goward Professor Shunlin Liang Dr. Chengquan Huang Professor Joseph F. Jaja ? Copyright by Kuan Song 2010 ii Acknowledgements I want to express deep thanks to my advisor, the committee, the department, and my family and friends. Dr. Townshend has been aware of the possible uncertainty problem in global remote sensing since the 1980s. He had an instinct that unsupervised classification has some kind of virtue over supervised ones, although supervised classification offers good performance and efficiency. Thus I started to work on this field with the right direction. The fascinating and thorough course on machine learning and knowledge engineering offered by Dr. Jaja further helped me to understand the mathematical origins of machine learning. The committee further offered many insightful suggestions and questions along the way. Many experiments in chapter two were gradually refined because of the questions raised by them in committee meetings. For example, section 3.5 was improved based on Dr. Goward?s advice. Dr. Huang offered me immense help on the general process of dissertation work through regular discussions between the two of us. His dissertation work back in 1999 was the first one on SVM in the field of remote sensing. For example, he discovered that RBF kernel is better than other kernels used in SVM. This triggered my thoughts on why different machine learning designs can be good, or bad, on different aspects. The dissertation writing process had been very slow, partly because I was not good at all at writing long papers, and partly because it has been difficult to interpret the key observations. Dr. Townshend went to great lengths to teach me on these iii aspects and looked through every chapter multiple times before the defense. During the time, funding has been covered with NASA grants, a Dean?s assistantship, and a teaching stipend from the department. Thus during this Ph.D study, I benefited not just from trainings in research, but also in teaching, writing, and public presentation. This has been a rich education experience that would benefit me for life. Teaching five courses, including two graduate courses, in the department improved my teaching skills and unexpectedly helped me to systematically organize materials in writing. The software package LibSVM developed by Chih-Chung Chang and Chih-Jen Lin has been heavily used in this dissertation. Its free availability, versatile usability, and rich documentation greatly facilitated my research. Also I feel extremely lucky to have been trained at Peking University and Ohio State University before I come to pursue my PhD at Maryland. At Peking University, the total tuition and boarding I paid for four year of excellent college education and two Bachelor?s degrees was USD $1250. With a fellowship from OSU, I was able to focus on coursework without worrying about living expenses. The broad knowledge spectrum at OSU basically retrained me and exposed me to geoinformatics, statistics, and electrical engineering. The education at UMD, on the other hand, emphasized more on the reasoning process, the science side of remote sensing, and a global perspective in observation. Finally I would not have been able to finish this without the support and encouragement from my family, friends, and the committee. They lent me the strength to carry on in good times and bad times. iv Table of Contents Acknowledgements .......................................................................................................................... ii List of Tables .................................................................................................................................. vii List of Figures ............................................................................................................................... viii 1. Introduction ............................................................................................................................... 1 1.1. Remote Sensing for Global Forest Monitoring ......................................................... 1 1.2. Current Problems ....................................................................................................... 3 1.2.1. Reliability of Classification Algorithms .......................................................... 3 1.2.2. Error Propagation within the Designs of Change Detection ............................ 6 1.3. A Framework of Uncertainty-Oriented Methodology ............................................... 8 2. Candidate Classifiers for Forest Change Detection ................................................................. 12 2.1. Introduction ............................................................................................................. 12 2.2. Major Families of Machine Learning Algorithms Used in Change Detection ........ 13 2.3. Maximum Likelihood Classification (MLC) ........................................................... 17 2.4. Decision Tree Classification (DT) ........................................................................... 21 2.5. Fuzzy ARTMAP Neural Network Classification (ARTMAP NN) .......................... 24 2.5.1. The ART network ........................................................................................... 24 2.5.2. The Fuzzy ARTMAP algorithm ..................................................................... 26 2.6. Support Vector Machine Classification (SVM) ....................................................... 27 2.6.1. The Max-Margin Idea .................................................................................... 27 2.6.2. From Max-Margin idea to SVM Implementation .......................................... 28 2.6.3. The Risk Minimization Ideas behind SVM ................................................... 31 2.6.4. From Hard-Margin SVM to Soft-Margin SVM ............................................. 32 2.6.5. From 2-class SVM to Multi-class SVM ........................................................ 33 2.6.6. Choice of Kernel and Kernel Parameters ....................................................... 34 2.7. Kernel Perceptron (KP): Introducing Neural Network into SVM ........................... 36 2.7.1. Adaptive boosting: Infinite Ensemble Learning ............................................ 36 2.7.2. Building the Ensemble Kernels for SVM ...................................................... 37 2.7.3. Kernel Perceptron .......................................................................................... 39 2.8. A Brief Discussion on Self-Organizing Maps Neural Network (SOM) .................. 40 2.9. Cross-comparison of Machine Learning Algorithms for Remote Sensing .............. 41 3. Assessing Machine Learning Algorithms with Real-World Uncertainties .............................. 46 3.1. Assessments and Comparison Design ..................................................................... 46 3.1.1. The Tradeoff between Generalization Power and Accuracy .......................... 48 3.1.2. The Realistic Acknowledgement of Errors in the Source .............................. 49 3.1.3. The Uncertainty in Class Definition .............................................................. 50 3.1.4. The ?Blind men and the Elephant? Problem ................................................... 51 3.1.5. Minimizing the Cost of Sample Collection .................................................... 52 3.2. Geographical Information of the Assessment Areas ................................................ 53 3.3. Assessing the Algorithms in Different Geographical Regions ................................ 56 3.4. Assessing the Algorithms over Large Areas ............................................................ 59 3.5. Assessing the Error Tolerance of Algorithms .......................................................... 61 3.6. Assessing the Algorithms with Mixed or Atypical Training Data ........................... 66 v 3.7. Assessing the Algorithms with Varying Contents of Training Data ........................ 68 3.8. Assessing the Algorithms with Scarce Training Data ............................................. 80 3.9. The Algorithm of Best Overall Performance ........................................................... 81 4. Optimizing Class Proportions in the Training Set ................................................................... 84 4.1. Class Proportions in Training Data: an Overlooked Pitfall ..................................... 84 4.2. Why are Modern Classifiers Heavily Influenced by Class Proportions in the Training Data? ......................................................................................................................... 87 4.2.1. Maximum Likelihood Classification ............................................................. 88 4.2.2. Decision Trees................................................................................................ 89 4.2.3. ARTMAP Neural net ..................................................................................... 89 4.2.4. Support Vector Machine and Kernel Perceptron ............................................ 90 4.2.5. Self-Organizing Maps coupled with Learning Vector Quantization (SOM-LVQ) .................................................................................................................... 91 4.3. Prioritized Training Proportions (PTP): Reducing the uncertainties in classification and change detection of satellite data ...................................................................................... 92 4.3.1. A Tale of Two Optimization Rules ................................................................. 93 4.3.2. Redefining Cross Validation .......................................................................... 96 4.3.3. Enumeration of Key Class Proportion in the Training Dataset ...................... 98 4.3.4. Constructing the Largest Possible Training Dataset and the Optimal Classifier Model 100 4.4. Assessment of the Joint Classifier MLC-SVM ..................................................... 101 4.4.1. Assessment Design ...................................................................................... 101 4.4.2. Approaches in Comparison .......................................................................... 102 4.4.3. Outcomes ..................................................................................................... 105 4.5. Discussion and Conclusions .................................................................................. 118 4.5.1. Redefining the Designs of Training and Cross Validation ........................... 118 4.5.2. Effectiveness of New Approaches ............................................................... 119 4.5.3. Future Improvement of Prioritized Training Proportions Approach ............ 121 4.5.4. The Relationship between Training Class Proportions and ?Class Prior? Probability ..................................................................................................................... 122 4.5.5. The Relationship between Training Class Proportions and Boosting .......... 125 5. The Dilution of the Change Signal ........................................................................................ 128 5.1. Change as the Class with the Lowest Accuracy .................................................... 128 5.2. A Possible Dilution Effect in the Change Training Data ....................................... 129 5.3. An Experiment on the Separation of the Change-Relevant and Change-Irrelevant Nonforest132 5.3.1. Experiment Settings ..................................................................................... 132 5.3.2. The Post-hoc Change Detection Algorithm ................................................. 133 5.4. Assessment Result ................................................................................................. 137 5.4.1. Accuracy Assessment................................................................................... 137 5.4.2. Error Patterns ............................................................................................... 139 5.5. Conclusions ........................................................................................................... 144 6. Conclusions and Recommendations ...................................................................................... 147 6.1. Sources of Uncertainties and Errors ...................................................................... 147 vi 6.1.1. Inevitable Errors .......................................................................................... 148 6.1.2. Variability in Class Definition ..................................................................... 149 6.1.3. Observational Sufficiency ............................................................................ 151 6.2. Integrated Solution for Uncertainties .................................................................... 152 6.3. The Overfitting Problem: From Structural to Geographical Risk Minimization ... 154 6.4. Future Explorations ............................................................................................... 158 6.4.1. Predictions on Further Uncertainties............................................................ 158 6.4.2. Budgeting Uncertainty ................................................................................. 161 6.4.3. Publishing Data Products with Training Data Sets ...................................... 162 6.5. Geographical Machine Learning ........................................................................... 163 Bibliography .................................................................................................................................. 165 vii List of Tables Table 2.1 Summary of mathematical characteristics and expected strengths and weaknesses of algorithms discussed in Chapter 2 ................................................................................................... 43 Table 3.1 Geographical Information of Test Areas (Huang et al. 2009) .......................................... 54 Table 3.2 Overall Accuracy of different classifiers in different regions .......................................... 57 Table 3.3 User Accuracy of the Forest Change Class produced by different classifiers in different geographical regions ....................................................................................................................... 57 Table 3.4 Producer Accuracy of the Forest Change Class produced by different classifiers in different geographical regions ......................................................................................................... 57 Table 3.5 Performance of algorithms over large areas .................................................................... 60 Table 3.6 Experiment on training data contents .............................................................................. 70 Table 3.7 Percentage of Forest Change pixels in training data when optimal SVM performances are achieved ..................................................................................................................................... 79 Table 3.8 The effect of training data scarcity on accuracy .............................................................. 81 Table 3.9 The theoretical strengths and suspected weaknesses revisited ........................................ 83 Table 4.1 A standard confusion matrix for a 3-class classification ................................................. 94 Table 4.2 An example for enumeration of key class proportion in training data ............................. 99 Table 4.3 An illustration of possible omission-commission dynamics due to enumeration of key class proportion in the training set ................................................................................................ 100 Table 4.4 Overall accuracy in 8 study areas of 3 approaches ........................................................ 105 Table 4.5 The amount of detected change normalized by that of real change ............................... 109 Table 5.1 Assessment Plan of the ?Dilution of Change Signal? Hypothesis .................................. 133 Table 5.2 Accuracy Assessment of Experiment One ..................................................................... 137 Table 5.3 Accuracy Assessment of Experiment Two..................................................................... 137 Table 5.4 Accuracy Assessment of Experiment Three .................................................................. 138 Table 5.5 Accuracy Assessment of Experiment Four .................................................................... 138 Table 5.6 Accuracy Assessment of Experiment Five .................................................................... 138 viii List of Figures Figure 1.1 Popular methodologies of contemporary change detection ............................................. 6 Figure 2.1 A family tree of supervised classifiers. .......................................................................... 16 Figure 2.2 The maximizing margin philosophy of SVM (same as Figure 5.2 in Vapnik 1999) ...... 28 Figure 3.1Three test areas in Paraguay ........................................................................................... 55 Figure 3.2 Change detection results from different algorithms in Eastern Paraguay ...................... 58 Figure 3.3 Change detection results from different algorithms on large-area test........................... 61 Figure 3.4 Error Tolerances of different Algorithms in Eastern Paraguay ...................................... 63 Figure 3.5 Error Tolerances of Different Algorithms in Western Paraguay .................................... 64 Figure 3.6 Error tolerances of Classifiers in Western Paraguay with 30% errors in training .......... 65 Figure 3.7 Error Tolerances of Different Algorithms in Central Paraguay ..................................... 66 Figure 3.8 Location Effects of Training Data .................................................................................. 68 Figure 3.9 The Producer accuracy plot of the eastern test area ....................................................... 71 Figure 3.10 The User accuracy plot of the eastern test area ............................................................ 71 Figure 3.11 The overall accuracy plot of the east test area ............................................................. 72 Figure 3.12 The producer accuracy of the central test area ............................................................. 72 Figure 3.13 The user accuracy of the central test area .................................................................... 73 Figure 3.14 The overall accuracy of the central test area ................................................................ 73 Figure 3.15 The producer accuracy plots of the western test area................................................... 73 Figure 3.16 The user accuracy plots of the western test area .......................................................... 74 Figure 3.17 The overall accuracy plots of the western test area ...................................................... 74 Figure 3.18 The user and producer accuracies of SVM in the eastern study area, affected by the class proportions in training ............................................................................................................ 75 Figure 3.19 The user and producer accuracies of MLC in the eastern study area, unaffected by the class proportions in training ............................................................................................................ 75 Figure 3.20 Landsat TM 7-4-2 (Left), ETM+ 7-4-2 (Center), Change reference map (Right) ....... 76 Figure 3.21MLC Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) ................................................................................................... 77 Figure 3.22 DT Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) ................................................................................................... 77 Figure 3.23 SVM Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) .................................................................................... 78 Figure 3.24 KP Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) ................................................................................................... 78 Figure 3.25 ARTMAP Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) .................................................................................... 78 Figure 3.26 SOM Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) .................................................................................... 78 Figure 4.1 The workflow of Self-Organizing Maps (Cited from the help file of the Idrisi software) ......................................................................................................................................................... 91 Figure 4.2 Class boundary illustrations in three approaches ......................................................... 104 Figure 4.3 Overall accuracy in eight study areas of three approaches .......................................... 105 Figure 4.4 The User Accuracies and Producer Accuracies after Stratified Training ..................... 107 ix Figure 4.5 The User Accuracies and Producer Accuracies after PTP Training ............................. 108 Figure 4.6 The User Accuracies and Producer Accuracies after Adaptive Training...................... 108 Figure 4.7 Comparison of class overestimation-underestimation in area one ............................... 110 Figure 4.8 Comparison of class overestimation-underestimation in area two............................... 111 Figure 4.9 Comparison of class overestimation-underestimation in area three ............................. 112 Figure 4.10 Comparison of class overestimation-underestimation in Area four ........................... 113 Figure 4.11 Comparison of class overestimation-underestimation in Area five ............................ 114 Figure 4.12 Comparison of class overestimation-underestimation in Area six ............................. 115 Figure 4.13 Comparison of class overestimation-underestimation in Area seven ......................... 116 Figure 4.14 Comparison of class overestimation-underestimation in Area eight .......................... 117 Figure 4.15 The omission-commission difference in study Area one ........................................... 120 Figure 5.1 Creating the Training Data for the Change Class from Stacking ................................. 130 Figure 5.2 Creating the Training Data for the Real Change Class from Stacking ......................... 131 Figure 5.3 Training Data Construction using the Post-hoc Change Detection .............................. 134 Figure 5.4 Experiment result at test area one ................................................................................ 140 Figure 5.5 Experiment result at test area two ................................................................................ 140 Figure 5.6 Experiment result at test area three .............................................................................. 141 Figure 5.7 Experiment result at test area four ............................................................................... 141 Figure 5.8 Experiment result at test area five ................................................................................ 142 Figure 5.9 Experiment result of test area six ................................................................................. 142 Figure 5.10 Experiment result of test area seven .......................................................................... 143 Figure 5.11 Experiment result of test area eight ............................................................................ 143 Figure 6.1 The Structural Risk Minimization Theory (SRM) by Vapnik ...................................... 154 Figure 6.2 Another interpretation of the overfitting problem ........................................................ 155 1 1. Introduction 1.1. Remote Sensing for Global Forest Monitoring There are two major dimensions of global change: land cover change and climate change. The information on forest change is vital in both topics. On the Land cover science side it is important for biodiversity conservation (Kennedy et al. 2009), sustainable forest management (Quincey et al. 2007), regional planning (Wiens et al. 2009), and international environmental agreements (Noss 2001). On the climate change science side it is an important input variable for carbon cycle models (Schimel 1995; Foody et al. 1996; Hese et al. 2005). But forest change is a very broad concept. The term ?forest? can be dense closed forest, or open-canopy woodlands. Forest can also be evergreen or deciduous. And in terms of forest change, forest can become a wide variety of land use and land cover types. Natural forest change types include burning, which happen frequently in the relatively dry climates and the northern forests. Forest use of mankind includes clear cutting, selective logging, and rotational timber management. Given the importance and diversity, then how can we get reliable estimations of Earth?s forest and its temporal changes? There have been two major sources of information: forest inventory statistics from individual governments, and the interpreted results from remotely sensed imagery (Estes et al. 1980; Nelson et al. 1987; Townshend et al. 1991; Cardille and Foley 2003). The country-based forest 2 inventory data records have been widely used to conduct regional studies. For example, the historical forest changes in China and United States were estimated respectively to identify the ?missing carbon? for carbon cycle models (Fang et al. 2001; Pacala et al. 2001). Satellite remote sensing is another way to estimate forest and its changes. Global tropical forest change along with regional rates of changes were estimated from AVHRR and Landsat respectively (DeFries et al. 2002). Forest inventory data and satellite monitoring were both used in some studies (Myneni et al. 2001). The United Nations Food and Agriculture Organization?s (FAO) Forest Resource Assessment (FRA) follows another unique path. The FRA1980 (FAO 1981), FRA1990 (FAO 1995), FRA2000 (FAO 2001), and FRA2005 (FAO 2006) reports provided global estimation of forest inventory based on governmental statistics. FAO?s forest change reports of 1996 (FAO 1996) and 2001 (FAO 2001) added a 10% stratified random sample of Landsat sensor scenes to estimate the global extent of tropical deforestation from 1980 to 1990, and 1990 to 2000. Forest inventory data generated by individual countries has various quality issues. FRA2000 and FRA2005 adopted broad expert advices to synchronize the definition of ?forest? globally. Yet the two most complained sources of error, pointed out by the users of FAO2000 estimation, are the low frequency of monitoring and the relatively less accurate estimation for open woodlands (Matthews and Grainger 2002). Some researchers refer to this problem as the ?weak definition? of forest (Sasaki and Putz 2009). Not only is the government inventory data prone to uncertainties, the forest change estimation derived from those datasets are also unavoidably affected. The 3 situation was as bad as ?Consistent data time series do not exist beyond the decade spanned by each report? (Matthews and Grainger 2002). In light of this, remote sensing had been given high hopes to produce better estimations for both forest inventory and its change over time. Satellite observation can reach conventionally inaccessible regions as well (Tucker and Townshend 2000). Thus according to the IPCC GPG (Intergovernmental Panel on Climate Change, Good Practice Guidance), remote sensing methods are especially suitable for independent verification of the national LULUCF (Land Use, Land-Use Change, and Forestry) carbon pool estimates, particularly the aboveground biomass (IPCC 2003). The importance of satellite monitoring of global forest change is also illustrated in the recent NASA initiative of ?Earth System Data Records? (ESDR), of which global forest change is an aspect. (NASA 2006; Chuvieco and Justice 2008) In some sense, the research community and the international organizations expect remote sensing to offer us reliable forest data to help us understand global change. 1.2. Current Problems 1.2.1. Reliability of Classification Algorithms As we have seen in the previous section, the science community put high hopes in remote sensing because the other approach, based on national statistics, has lots of weaknesses. But is the remote sensing approach largely error-free? The use of remote sensing in global forest change is actually far from operational. A number of controversies exist in the specification of consistent reliable methods. 4 The previously mentioned FAO report series of world?s forest in years 1980, 1990, 1995, and 2000 did not see much use of remote sensing. The forest change reports incorporated the use of satellite images with a 10% random sampling scheme. It was criticized for only sampling 10% randomly (Tucker and Townshend 2000). They argued that such a low sampling rate is insufficient given the high spatial variability of forest change. Forest change is not likely to be spatially random event. Their suggestion of a wall-to-wall mapping was countered by FAO. ?FAO did not have sufficient funding or staffing to accomplish this immense task? (Czaplewski 2002). This discussion showed us two important issues: 1. Global forest change has a high spatial heterogeneity that can only be reliably estimated with a census instead of limited sampling. 2. The very high cost and the need for big staff cited necessary to achieve that purpose only imply that automated algorithms are not fully-fledged. Apart from these two issues, there are controversies around another vital theme: the accuracy of remote sensing analysis. In the same paper by Tucker and Townshend, they gave an optimistic evaluation to this topic. They were pleased with the approximately 85% accuracy achievable by combining unsupervised classification, human interpretation, and expert inputs. However, this approach is too labor-intensive that it is not suitable for global studies. What Tucker and Townshend did not mention, is the capability of fully automated analysis. Another study, around the same time, outlined the major criteria of nearly-automated approaches (DeFries and Chan 2000). They listed four criteria namely total accuracy, computation resources, stability, and robustness to error in data. 5 Basically these four criteria is one fundamental issue: robustness of automated algorithms. They applied these criteria to different variants of decision tree (Quinlan 1986) and achieved mixed results ranging from low to high performance in each criteria. Worth noticing is that, they found no variant of Decision Tree, which has been widely applied in MODIS applications, achieved high performance in all the judging criteria for Landsat imagery. DeFries and Chan recognized two other important issues: 1. Error handling is important. 2. Fine-resolution imagery such as Landsat seems more difficult to analyze automatically than coarser resolution imagery such as MODIS. If we combine the contribution of the two papers above, we can get a clearer picture of what remote sensing can and cannot offer at the turn of the century. First, remote sensing data analyzed using unsupervised classification together with human modifications can give ~85% overall accuracy. However, it is highly time-consuming. Second, automated supervised classification of fine-resolution imagery produces lower accuracy for global studies compared to local studies. The reason of this suboptimal performance has not been identified but can be reasonably deduced. In local studies, manual editing is widely used and does not take much time. However, manual editing in global studies will be an unthinkably costly operation. Third, the high spatial heterogeneity of forest change means that reliable global forest change monitoring has to be done preferably wall-to-wall with a fine resolution. 6 One can immediately see that these three ?status quo? leads to a dilemma between quality and cost. How do we solve this? 1.2.2. Error Propagation within the Designs of Change Detection Another problem the remote sensing community faces is what the phrase ?change detection? actually means in practice. Forest change detection is largely based on classification, but it also involves more designs to model the change signal. Three major methodology approaches are prevalent in contemporary studies. The following figure shows their basic designs. There are well-known flaws in them. Figure 1.1 is a synthesis from two papers. The methodologies A and B were discussed in 1990s (Townshend et al. 1992). Methodology B was considered to have less error propagation and was thus preferred more than methodology A. Approaches A and C are the most popular methodology in contemporary studies (Kennedy et al. 2009). In contemporary studies, the majority use approach A (Yuan et al. 2005; Liu et al. 2008; Kuemmerle et al. 2009; Wang et al. 2009). Approach B Time 1 Spectral Data Time 2 Spectral Data Time 1 Classification Time 2 Classification Change Matrix Time 1 Spectral Data Time 2 Spectral Data Stacked Bi-temporal Spectral Data Stacked Classification Time 1 Spectral Data Time 2 Spectral Data Spectral Differencing or Modeling Threshold Tuning Approach A. Separate Classification Approach B. Stacked Classification Approach C. Direct Differencing Figure 1.1 Popular methodologies of contemporary change detection 7 has also been used recently (Song et al. 2005; Huang et al. 2008). All the experiments in this dissertation have also been done using Approach B. Approach C saw some usages (Zhan et al. 2002; Masek et al. 2008; Xian et al. 2009). These three approaches all showed signs of problem for different reasons. Approach A is more sensitive to error propagation than Approach B (Townshend et al. 1992). Error propagation is a fundamental concept in the science and engineering world (Taylor 1997). Basically, the more multi-stage optimization steps involved in a study, the more likely it is inferior to a one-step overall optimization. By stacking the images of multiple dates, Approach B has less error propagation because it only performs classification once. However, our experiments, which adopted Approach B, are conducted with much better training data than practically available in reality. Our training data in the change class was easily available because we had wall-to-wall change map in the first place. In reality, this is not the case. In the change detection based on the classification of stacked bi-temporal images, the training data for the change class is the most difficult to acquire. That is the main reason that researchers prefer the methodology approach A described in figure 1.1. Despite strengths, Approach B is hard to implement in reality because the researcher needs to collect training data specifically on land parcels that went through actual changes. Exhaustive search of those land parcels can be challenging. Approach C is based on differencing and thresholding, which are almost always parametric operators and very often simple linear operators. The complexity in 8 spectral signature can overwhelm the over-simplified parametric operators. In addition, there is a heavy reliance on tuning in Approach C. Thus it is unavoidably and heavily influenced by individual researchers. It should be avoided at all costs in continental or global studies, unless it can be automated without human intervention at local scales. TDA (Training Data Automation) (Huang et al. 2008) is such an effort to collect training data automatically at local scales. 1.3. A Framework of Uncertainty-Oriented Methodology Many contemporary studies of forest change have tried state-of-the-art machine learning methods side-by-side to find out which one produces the best accuracy (Collins and Woodcock 1996; Descl?e et al. 2006; Rogan et al. 2008). While that approach is productive in individual study sites, this dissertation will not follow that research paradigm. New machine learning methods are designed every year, if not every month. Comparing performances with the ever-newer algorithms in a local test site shows us the accuracies but not the causes of those accuracies. Besides, the world outside our own small test site is what really matters. To actively seek out and learn from the failures, we need another path. We will instead try to locate the error sources and then improve the available machine learning algorithms. In particular we will focus on these questions: ?What are the errors and uncertainties in the classification of remotely sensed imagery? Where do they come from? How do we eliminate them?? This kind of research paradigm is not completely new. In fact, modern survey 9 methodology is built on the analysis of error origins. For example, the origins of survey errors have been well studied and put into categories such as sampling error, interviewer error, measurement error, and nonresponsive omission (Groves 1989). Remote sensing can be seen as a special type of survey. The data is acquired through optical sensors, analyzed by machine learning algorithms, and trained by one or more arbitrary human arbitrator. Thus, error origins in remote sensing analysis are arguably more complex. Yet, this complex situation does not mean it is insolvable. It only suggests more possible sources of error than in a traditional survey. In the field of remote sensing, pioneering efforts on the origins of error were made in the 1960s and 1970s. As put by Landgrebe (Landgrebe 1980), ?The scene is the portion of the (remote sensing) system which provides us with the greatest challenge. It is the only portion not under design or operational control, and by far the most dynamic and complex portion of the system.? He cited an early work (Hughes 1968) illustrating the decreasing performance of Maximum Likelihood classifiers with increasing dimensionality. What they discovered echoes a statistician?s term ?The curse of dimensionality? (Bellman 1961), but the remote sensing world at that time did not link this to their peers on the statistics side. However, these efforts were largely left forgotten until they were picked up a decade ago (DeFries and Chan 2000). They faced up to the fact that, the training data in practical work is generally not 100% correct. Errors could be caused by bad geo-referencing, interpretation mistakes, or severely mixed classes. 10 We adopt this idea and extend it into a framework? a framework of uncertainty handling. This framework treats global automated forest change detection as an information retrieval process, during which a number of known and unknown uncertainties reduce the accuracy significantly from the theoretical expectation. The image analyst is also a possible source of errors. This notion echoes with survey methodology. Although training data error is the only widely explored type of error in the analysis of satellite imagery, there are in fact many more possible causes of errors. We understand very little about why the accuracy of forest change detection is still only around ~85% even after integrating modern machine learning methods and human interpretation. We do not have a theoretical explanation for the difference between automated algorithms and human interpretation either. We also do not understand well why accuracy varies a lot from one image to another. Neither do we understand why the forest change class, among all classes, is usually the class with the lowest accuracy. However, these observations do shed a light on the hidden uncertainties: its magnitude and variability. Landgrebe sensed some of these problems 30 years ago, but he could not give a thorough theoretical explanation. However, his intuition, that the remote sensed imagery is not ?under design or control?, is a good start. Can we add geographical designs and controls into the machine learning theories? Here is the plan for our hunt for the uncertainties. Different machine learning methods were designed with different philosophies, often in parallel, for different 11 situations in the real world. Hence they may have different capabilities to tackle different uncertainties. They may also have redundancy or even some designs that can backfire for remote sensing applications, because they were rarely designed for image classification at all. If we dissect machine learning algorithms and examine their components, we might be able to identify those that are extremely effective in handling uncertainties in satellite monitoring. If we can integrate the more useful components, we may be able to create a more successful hybrid algorithm out of parent algorithms, without reinventing the wheels again. In chapter two, we will thoroughly examine the most popular and promising machine learning algorithms. We will try to figure out in which aspect(s) of uncertainties every algorithm were designed to overcome. Then in chapter three we will conduct a test of these algorithms for different types of uncertainties. If there is an algorithm that excels in all aspects, then we do not need to construct any new algorithm. But if no algorithm can tackle all aspects of uncertainties, our further chapters will be on the combining of building blocks from different machine learning algorithms until we come to a universal solution. As we will see in the chapters, the situation is far more complicated than we anticipated. We actually identified a previously unreported error source in remote sensing. This error source will be explained and resolved in chapter four. A side effect of this error source is our conceptual definition of classes. It will be explained and dealt with in chapter five. Then we will make a summary of the findings in chapter six. 12 2. Candidate Classifiers for Forest Change Detection 2.1. Introduction Various machine learning algorithms have been applied to retrieve forest change information by the remote sensing community. These algorithms fall into two basic categories: unsupervised learning and supervised learning. It has been found that unsupervised learning such as ISODATA clustering often produces lower accuracy than combining ISODATA and maximum likelihood classification, which is a supervised method (Justice and Townshend 1982). Moreover, they found that clustering takes more time in the computing and manual labeling processes. The computing power has been dramatically improved since then, but the time needed for manual labeling of unsupervised clusters has not and possibly will not be substantially improved. Automating the labeling of unsupervised clusters had been shown to be impractical (Song et al. 2005) Several other studies also favors supervised over unsupervised learning (Rogan et al. 2002; Keuchel et al. 2003). Supervised algorithms are even reported to have higher accuracies than visual interpretation on SPOT imagery (Martin and Howarth 1989). Thus our current change detection study will focus on supervised change detection. It is the goal of this chapter to examine contemporary supervised learning algorithms, and find out whether or not their designs can tackle errors and uncertainties in the process of retrieving forest change information from Landsat imagery. We will outline the theoretical backgrounds and the unique strengths of the 13 designs. Five algorithm candidates were chosen representing different schools of machine learning philosophy. These are the Maximum Likelihood Classifier (MLC), Decision Tree (DT), Fuzzy ARTMAP Neural Network (ARTMAP), Support Vector Machine (SVM), and Kernel Perceptron (KP) algorithms. The reason for their selection will be detailed in section 2.2. Another algorithm, the Self-Organizing Maps Neural net (SOM) will be briefly used in only one experiment. 2.2. Major Families of Machine Learning Algorithms Used in Change Detection Supervised change detection algorithms used in the remote sensing community were first developed in the machine learning community since the 1950s (Chow 1957; Rosenblatt 1958), approximately the same time of Sputnik and Explorer 1. Satellite remote sensing has since consistently benefited from the development of computers and machine learning. These classifiers have different theoretical origins and make various mathematical assumptions, which may or may not fit remote sensing applications. Some algorithms were developed from probability theories such as the Bayes rule. Some were constructed from pure guesses on how the human brain functions, for example, the Perceptron neural network model. Others were based on arbitrary criteria of how an ?optimal? classification should be executed. For example, the DT algorithm was developed from the entropy minimization criterion while the SVM algorithm was developed from the class distance maximization criterion 14 It is impractical for one to assess each and every algorithm for a given remote sensing application. However, the hundreds of supervised change detection algorithms now available can be categorized into a handful of groups. The approach of this study is to limit our study to a handful of representative algorithms with good prospects. In figure 2.1 we propose a typology of modern machine learning algorithms for effective cross-comparison. Each branch of this ?tree? represents a school of thought from the machine-learning society. The Bayes classifiers, the neural networks, the Entropy-minimization classifiers, and the max-margin classifiers are four prominent schools of machine learning theories. In addition, the method of boosting is a meta-algorithm which means it can be applied onto one or several classifiers. It is also known as Ensemble Learning. With the same given set of raw data, these four prominent schools of machine learning theories each extracts information in its own unique rationale. They analyze the data set in very fundamentally different ways to determine the class label of each data point. We could see how different they really are through a simple walkthrough of the core philosophies. The Bayes? classifiers are rooted in the Bayes rule of probabilities and give a Bayes Optimal solution in which the average error is lowest. Neural networks, on the other hand, are based on the thought that there are one or more iterations of algebraic equations which stand between the raw data and the class labels. Those iterations of algebraic equations were named ?hidden layers?. The making of those algebraic equations leads to different subtypes of neural networks. The 15 entropy-minimization classifiers are formed on the assumption that heterogeneous data should be sub-divided into purer classes. The iteration of this sub-dividing process becomes the classifier itself. And for the max-margin classifiers, they are based on the philosophy that different classes are best separated when there is a big enough buffer zone between each other. Each of the above philosophies is quite convincing but their choice is inherently subjective. They are methods designed by individual researchers to understand the data and observations in scientific and engineering fields. They are not solely based on axioms of mathematics or rules of physics. They are very unique, and thus might be more or less suitable in different research fields. It is worth mentioning that many machine learning ideas were developed not by computer scientists. For example, the Bayes rule was first formulated by Pierre-Simon Laplace more than a century before the age of computers. A landmark paper (Perrone and Cooper 1993) creating the field of Ensemble learning involved a Nobel Laureate in Physics: Leon Cooper, whose major contributions lie in the distant field of superconductivity. Vapnik, who invented SVM, has been heavily influenced by the Russian tradition of nonparametric probability theory carried on by Andrei Kolmogorov. Therefore, when we unravel contemporary machine learning, it is necessary to understand not just the names and equations, but also the rationales and philosophies at their cores. Dozens of algorithms have been developed in each family of machine learning theories. From this tree typology we choose one typical algorithm from each branch. Our choices (Figure 2.1) are: the maximum likelihood from the Bayes? 16 classifier family as a classic benchmark, the fuzzy ARTMAP algorithm from the neural network family, the soft-boundary SVM and the Kernel Perceptron algorithm from the max-margin classifier family, and the decision tree classifier from the entropy minimization family. This is the first time that the powerful Kernel Perceptron algorithmic approach has been applied in remote sensing studies. In recent years, the max-margin philosophy has been used to modify more and more traditional methods, such as principal component analysis and multivariate regression. Kernel Perceptron combined the designs of neural network, kernel machine, and ensemble learning. For these reasons, in this study we used two algorithms in this machine learning family. The light blue boxes show the algorithms we will use. In this chapter, we will discuss in detail the background and theoretical strengths of these candidate algorithms. Then in the following chapter, we will figure out their Bayes classifiers Neural networks Perceptron Back-propag ation Feed forward-only ARTMAP Recurrent Maximum Likelihood A Family Typology of Machine Learning Theories Fuzzy ARTMAP Max-Margin Classifier Support Vector Soft-boundary SVM Kernel Boosting Entropy Minimization Decision Tree Figure 2.1 A family tree of supervised classifiers. 17 possible advantages and disadvantages in the face of practical uncertainties and errors, in change detection applications using remote sensing. However, it must be pointed out, that these possible advantages and disadvantages are formed with mathematical reasoning and past literature in the field of remote sensing. We will use another chapter to assess these claims. 2.3. Maximum Likelihood Classification (MLC) The Maximum Likelihood Classifier was developed gradually (Mahalanobis 1936; Chow 1957; Chow 1962; Haralick 1969; Swain and Davis 1978; Strahler 1980). The equations in this sub-section are cited from Swain and Davis (1978). MLC classifies a pattern X in n-feature imagery into class I using the Bayes Optimal criteria: )()|()()|( jiii pXppXp ???? ? For all j=1, 2, ?, n (Equation 2.1) Where i? is the i-th class and )( ip ? is the prior probability of the i-th class. The probability function )|( iXp ? has to be estimated from the data set. In remote sensing applications, two hidden assumptions were made. The first assumption is Bayes optimal, which means to minimize the average error over the entire set of classification. And the second assumption is Gaussian distribution in each class. From Bayes optimal, the total error is defined as a loss function: ? = = n j jX XpjiiL 1 )|()|()( ?? (Equation 2.2) 18 Where )|( ji? is called the loss function, defined as the loss or cost caused by mistakenly classifying a data point into class i but actually belongs to class j. The Bayes Optimal rule defines the relationship between joint probabilities and conditional probabilities: )()|()()|(),( XpXppXpXp jjjj ???? == (Equation 2.3) Combining forms 2.2 and 2.3, we have the average error formulated as: ? = = n j jjX XppXpjiiL 1 )(/)()|()|()( ??? (Equation 2.4) The remote sensing community tends to simplify the loss function into 0 and 1: jiji jiji ?= == ,1)|( ,0)|( ? ? (Equation 2.5) Assuming that the data set follows multivariate normal distribution, i.e. Gaussian distribution N ( k? , 1), )()(21log21)(log)( 1 kkTkkeieX XXpiL ??? ???+?+?= ? (Equation 2.6) Where: )(iLX is the loss function to be minimized, according to the Bayes optimal strategy. n: number of features, or bands in the imagery X: image data of n features k? : mean vector of class k 19 k? : Variance-covariance matrix of class k k? : Determinant of the k? matrix The remote sensing community also tends to simplify the prior probabilities, P(X), of all classes to be equal. Laplace, who first formulated the Bayes rule, also favors using equal prior probabilities. The pioneers of MLC also warned of prior probability. Chow?s initial form of MLC does not include prior probability. Swain and Davis warned that the use of prior probability will be discriminating against the naturally rare classes (Swain and Davis 1978). Laplace himself is very wary about using prior probability. He even coined a term ?principle of insufficient reason? and chose to use equal prior probabilities for all classes. Also it was proposed that, after the first classification, the percentage of each class can be used as prior probabilities (Strahler 1980). But this approach does not bring significant accuracy improvements. Strahler also explained a subjective use of prior probability. The researcher?s own belief can be used as prior probability. He admitted in the same paper that this does not generate very accurate results. The controversy in the use of objective and subjective prior probability in remote sensing reflects the controversy of this subject even in the field of Bayesian Statistics itself. As put by the influential statistician William Feller on page 114 of his book: ?Unfortunately, Bayes? rule has been somewhat discredited by metaphysical applications??In routine practice this kind of argument can be dangerous.? (Feller 1957) This echoes with Laplace?s concerns. But in the remote sensing world, researchers have been much less wary than these statisticians. 20 Researchers also integrated neighborhood information into prior probabilities and called them contextual classifiers (Settle 1987), which in fact is the same idea of the MLC inventor in the 1960s (Chow 1962). Recently researchers have been trying to iteratively adjust the prior probabilities towards the outcome results and found slightly better results in some cases (Hagner and Reese 2007). The Maximum Likelihood classifier had been applied in remote sensing studies since the 1970s. It enabled researchers to explore early multi-spectral satellite data, which is often noisy and with little calibration, such as AVHRR data (Parikh 1977), MSS data (Fraser et al. 1977), and even the very early APOLLO-9 mission data (Anuta and MacDonald 1971-1973). The Gaussian assumption of MLC turns out often to be quite well suited for land cover mapping and change detection within relatively small to medium areas. MLC has yielded quite some good results in single-scene studies of Landsat, SPOT, ASTER imagery and even hyperspectral imagery. It was reported to achieve even better results than back-propagating neural networks on Landsat TM and SAR data (Michelson et al. 2000). It was concluded to work well on the hyperspectral AVIRIS data within a small study site (Hoffbeck and Landgrebe 1996). MLC achieved results comparable to Decision Tree classification on Landsat ETM+ data and performed better than Decision Tree on hyperspectral data (Pal and Mather 2003). On the other hand, it is relatively less successful in multiple-scene studies and studies on large-swath imagery such as the AVHRR data (Friedl and Brodley 1997; Gopal et al. 1999). Some studies suggest that the Gaussian assumption is well suited 21 for small areas but not for large areas (Small 2004). However, such conclusions have not been strongly supported theoretically. It remains something of a mystery as to why such an ?outdated? classifier has been reported in so many studies to have comparable performances to its modern competitors. On yet another hand, it had been shown through simulated data set (Hughes 1968) and local experiments (Lillesand and Kiefer 1979) that the solving power of MLC will decrease with the amount of data dimensions. That echoes with the statistical term of ?The Curse of Dimensionality? (Bellman 1961). However the experiment he designed used simulated datasets and thus has limited persuasion power. MLC is still widely used for its simplicity and excellent results at the local scale. It also has an desirable property, which is also shared by some other families of algorithms to be described in this chapter, that pixel level probability estimates can be output and further modeled (Strahler 1980). Thus it is frequently used as the No.1 benchmark algorithm in many research fields including remote sensing. 2.4. Decision Tree Classification (DT) The Decision Tree (Quinlan 1986) is a classifier in the form of a binary tree structure where each node is either a leaf node or a decision node. The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree. For the selection of the features with the most heterogeneous class distribution the algorithm uses the concept of Entropy. The entropy of a dataset S is calculated as: 22 ? = ?= n i ii ppSEntropy 1 )ln()( (Equation 2.7) Where pi is the proportion of S belonging to class i. The decision tree splits at every decision node with the criteria of maximizing Gain with an attribute A: ? ? ?= )( )()(),( AValuesv V V SEntropy S SSEntropyASGain (Equation 2.8) where SV refers to the data with value v. When every attribute has been included in the tree or the training samples associated with every leaf node all have the same target attribute value (i.e., their entropy is zero), the tree is complete. However, a complete tree is often very complicated and unwanted because of elongated computing time. Often the full tree is ?pruned? to accelerate the classification. It has been verified that a heavily pruned decision tree does not suffer from significant loss of accuracy in forest change detection (Song et al. 2005). The decision tree, since its introduction into remote sensing, has been frequently used with the help of boosting. Boosting, as depicted in our typology of machine learning diagram, is a meta-algorithm that improves upon other algorithms. There are several major types of boosting. The first type of boosting came from the idea to combine the results of several different classifiers, including that of decision tree, through voting or consensus theory (Benediktsson and Swain 1992; Perrone and Cooper 1993). Due to the complexity of each algorithm, the result is sometimes 23 unreliable (Foody et al. 2007). Another form of ensemble classification is based on a single learning algorithm while changing the training set. Bagging (Breiman 1996) and Adaboost (Freund and Schapire 1996) are the two most popular approaches today. It has been demonstrated that decision tree enhanced with bagging gets better accuracy when applied on both AVHRR and Landsat TM data (DeFries and Chan 2000). Adaboost will be discussed in detail in section 2.7.1 The decision tree method has enjoyed popularity in the remote sensing community around year 2000 because people like a classifier without the Gaussian assumption. Researchers hoped it can be used where this assumption is violated (Friedl and Brodley 1997; Gopal et al. 1999). It is also valued by biogeographers because Decision Trees explicitly identify what are the chief discriminating features are and where the class boundaries are located (Hansen et al. 2000). It has also been widely applied in AVHRR and MODIS data analyses. In summary, researchers attributed its performance to its zero assumption on data distributions. However, the accuracy of decision tree has never significantly exceeded MLC in local scale studies. This interesting phenomenon is, however, often overlooked. It has been reported that decision tree cannot perform as well as maximum likelihood or neural network classifications on hyperspectral data (Pal and Mather 2003). This sounds like the ?Curse of Dimensionality? again. Therefore, decision tree might probably have less value in the stacked change detection involving a total of 14 bands than in single date classification with 7 bands of Landsat?s TM and ETM. 24 2.5. Fuzzy ARTMAP Neural Network Classification (ARTMAP NN) Neural network algorithms enjoyed great popularity from the late 1980s to around 2000. Many studies reported high accuracy given enough training data and fine tuning. Most of the studies, such as those described as ?a neural network model of Z layers with Z-2 hidden layers?, adopted the feed-forward back-propagation models (Lippman 1987). This family of models is known to be capable of high accuracy given enough training data and especially easy to use for remote sensing applications (Foody et al. 1995). They are also known to be prone to overfitting (Gopal and Woodcock 1996). Our study will not cover the traditional feedforward-backpropagation model, because it has been compared to decision tree and support vector machine in the past and found to be inferior (Huang 1999). We will instead look for newer implementations in the neural network family, which show some promises in overcoming these deficiencies. 2.5.1. The ART network Fuzzy ARTMAP is a type of supervised neural network models based on the Adaptive Resonance Theory (ART) (Grossberg 1976; Grossberg 1987). It was developed from the simplest ART network, which is a classifier for multi-dimensional vector datasets. Each training class consists of many ?patterns? of vectors. The input data vector is classified into a class which it most closely resembles depending on the stored training pattern. Once a training pattern is found, it is modified to 25 resemble the input data. If the input data does not match any stored pattern within a certain tolerance range, then the input data is absorbed into the training data as a new pattern. Resemblance between the training data and the input data for classification is measured through the following equation: x PxPxR i i ?=),( (Equation 2.9) In this form, R(x,Pi) is the resemblance coefficient; x is the input data vector; Pi is the ith pattern stored in the training data; and ? is a bitwise AND operator. If the resemblance coefficient is larger than a threshold value, then the training pattern Pi is updated through a linear equation: )()1( xPiPiPi ?+?= ?? (Equation 2.10) In this form,? is the updating speed coefficient between 0 and 1. Consequently, no stored pattern is ever modified unless it matches the input vector within a certain tolerance. New classes will be formed when the input data does not match any of the stored patterns. The ART network is said to be uniquely designed to have both ?plasticity? and ?stability? (Carpenter 1999). ?Plasticity? comes from the design that the training data keeps evolving according to the classification data. ?Stability? is maintained by a chosen tolerance value. The ART network distinguishes itself from most other contemporary pattern classifiers by integrating ?plasticity? into its design. However, how these theoretical designs work in reality is not very well tested. 26 2.5.2. The Fuzzy ARTMAP algorithm ARTMAP was developed by Grossberg and Carpenter (Carpenter et al. 1992; Carpenter 1999) and was introduced into the land cover mapping community rapidly (Carpenter 1999). The original ARTMAP performs binary classification while the fuzzy ARTMAP classifies on multi-valued data. The fuzzy ARTMAP algorithm, along with the decision tree algorithm, were the only two candidates competing for the MODIS land cover classification algorithm (MLCCA). Fuzzy ARTMAP was not chosen for MLCCA because the algorithm was ?in the early developing stage and could not handle missing data points? (Friedl 2002). However, this is not very convincing. Handling missing data points does not seem to be a major programming obstacle. What Friedl found at that time might be an artifact that seemed to be caused by missing data handling but in reality isn?t. Still, researchers in the land cover community had high expectations for fuzzy ARTMAP because it does not assume any statistical distribution in the dataset and might be suitable for global land cover mapping. The ARTMAP classifier is built upon modules called ART and MAP networks. ART1 is the simplest variety of ART networks, accepting only binary inputs.(Carpenter et al. 1992) ART2 extends network capabilities to support continuous inputs. ARTMAP combines two slightly modified ART-1 or ART-2 units into a supervised learning structure where the first unit takes the input data and the second unit takes the correct output data. The matching of the outputs from these 27 two ART modules is done through a MAP module. Then the vigilance parameter in the first unit will be adjusted for the minimum possible amount in order to make the correct classification. 2.6. Support Vector Machine Classification (SVM) 2.6.1. The Max-Margin Idea The Support Vector Machine has been considered as one of the most promising mathematical solver for statistical learning in general. It was introduced into the field of remote sensing a decade ago and has demonstrated its potentials (Huang 1999). Understanding of its mechanism in geographical term is not complete yet. The Support Vector Machine algorithm came from a long way. We will need several subsections to explain its origins and developments. Only when we are thoroughly clear about these, can we possibly predict how SVM might respond to geographical uncertainties and errors. A straightforward rationale was suggested for linear binary classification (Vapnik and Chervonenkis 1974; Vapnik 1982). The maximum distance between the data of two classes is determined and called the ?margin?. The plane in the center of the margin is used as the classifier. This is known as the max-margin classifier, or the optimal-margin classifier. For example, the two outer planes (H1 and H2) in the following figure are the maximum margins while the optimal hyperplane in the center separates the two classes. 28 Figure 2.2 The maximizing margin philosophy of SVM (same as Figure 5.2 in Vapnik 1999) For a 2-D linear feature space of D: (xi, yi), the hyperplane set H1 and H2 is formulated with slope w and intersection b. The equations in section 2.6 are all adopted from Cortes and Vapnik (1995) 1 1 ?=+? +=+? bwx bwx i i (Equation 2.11) The maximizing margin solution is derived by minimizing ww? while constrained by: 11 11 ?=??+? +=+?+? ii ii yforbwx yforbwx (Equation 2.12) However, Vapnik?s idea in the 1970s was not a practical classifier yet. It was more like a philosophy. 2.6.2. From Max-Margin idea to SVM Implementation The max-margin classification idea has been developed into a powerful pattern classifier with several mathematical techniques (Boser et al. 1992). 29 First, the max-margin training of N-dimensional data x with the dataset size of p is expressed as: Bxthenotherwise AxthenxD ? ?> , ,0)( , where ? = += N i ii bxwxD 1 )()( ? (Equation 2.13) D(x) is the decision function of the classifier. iw and b are the adjustable parameters for the classifier to tune. )(xi? are pre-defined functions of the data x most suitable for the dataset model. The decision function can also be written in pure vector form as: bxwxD +?= )()( ? , where w and )(x? are N-dimensional vectors. (Equation 2.14) Assuming that a full separation between class A and B exists, and then the margin M between the classes can be expressed as: pkwherewxDyM kk ,...,2,1,)( =? (Equation 2.15) Since we wish to maximize the margin size, we would want the minimization of the norm w . The 2-class max-margin classifier of N-dimensional data of size p thus becomes: 2min w w , under the condition that: pkxDy kk ,...,2,1,1)( =? (Equation 2.16) This is the optimization goal for the solution of max-margin classifier. Calculating directly with high-dimensional data is exceedingly expensive or practically impossible. Only after they incorporated two important mathematical 30 techniques was the max-margin classifier named ?Support Vector Machine? (Boser et al. 1992). The first technique is to use symmetric kernels. Instead of directly calculating the inner product in Hilbert space, the trick is to use the kernel mapping. Mercer?s condition (Vapnik 1998) states that a symmetric kernel is a valid inner product if and only if its Gram matrix is always positive semi-definite. This technique will simulate mapping the data into a very high dimensional feature space. A symmetric kernel K can be expressed as: ?= i ii xxxxK )()(),( '' ?? (Equation 2.17) With this new knowledge, ?? == +=+= p k kk N i ii bxxKbxwxD 11 ),()()( ?? (Equation 2.18) The second new technique is solving the optimization of max-margin by means of a Langrangian. The prime problem is converted to the dual problem: ? = ??= p k kkk xDywbwL 1 2 ]1)([ 2 1),,( ?? , subject to pkk ,...2,1,0 =?? (Equation 2.19) The optimization problem becomes searching for a saddle point of ),,( ?bwL that minimizes L with respect to w and maximizes L with respect to ? . This can be solved via quadratic programming. In short, the solution of 2-class N-dimensional max-margin classification using kernels was found in 1992 (Boser et al. 1992). This 31 is known as the 2-class prototype of support vector machine. SVM leads to a family of pattern recognition methods based on kernels with varying performance. 2.6.3. The Risk Minimization Ideas behind SVM The development of SVM has been centered on the minimization of expected algorithm risks, which is arguably an extension of the Bayesian school. In the 1970s, Vapnik and Chervonenkis came up with an idea called the Empirical Risk Minimization (ERM) criterion (Vapnik and Chervonenkis 1974). They mentioned the heavy influence by the idea of algorithmic complexity (Kolmogorov 1965) at the time. Basically the Russian statisticians at that time were trying to define the complexity of algorithms, and thus by proxy to define the complexity of real-world data which the algorithms tackle. The ERM idea suggests that, all statistical learning methods aim at minimizing the risk function, which is defined as the difference between empirical observation and algorithm estimation. In regression, ERM is the least squares method; in statistical inferencing it is the Kolmogorov-Smirnov test; while in classification, it is the maximum likelihood classifier as equation 2.1 (Vapnik and Chervonenkis 1974; Vapnik 1982; Vapnik 1999). In the 1970s and 1980s, Vapnik went on to define the second risk minimization criterion which he named as the Structural Risk Minimization (SRM). What it means is that the complexity of the algorithm should not be greater than the complexity of the real-world problem to be solved. One can immediately see the 32 Russian nonparametric statistics tradition from Kolmogorov. Vapnik believes that the cause of overfitting in statistical learning is that the complexity of the algorithm was uncontrolled. For example, a neural network can have arbitrary amount of hidden layers. The more complex an algorithm is, the more fit it can achieve with a given set of observation data. However, that would only make it worse when generalized to the data population. Therefore, an ideal statistical learning algorithm should be flexible to adjust its own complexity to match that of the observation data (Vapnik 1982; Vapnik 1999). The complexity of each SVM model is determined by the structure and parameters of the kernel. This is why the choice of kernels and the tuning of kernel parameter are so important. They directly determine whether or not the classification has overfitting. In the 1980s, Vapnik went on to define the third risk minimization rule which he named as the Vicinity Risk Minimization (VRM). It assumes two ?smoothness? conditions. The probability function of the data distribution and the algorithm function should both be smooth around observed data values. This VRM rule gives SVM a new design: the error margins. Vapnik presented two cases: the soft-vicinity and hard-vicinity SVMs (Vapnik 1999). They are more commonly referred to as soft-margin and hard-margin SVMs (Cortes and Vapnik 1995). 2.6.4. From Hard-Margin SVM to Soft-Margin SVM SVM was further developed to cope with real-world situations where class 33 separation can be difficult. It has been pointed out that the margin between the two classes can be arbitrarily small if the training data cannot be separated by hyperplanes in the Hilbert space (Cortes and Vapnik 1995). Therefore the classification can be useless under that situation. To counter this problem, they introduced the ?soft margin? concept. The soft margin hyperplanes allow a certain amount of training data to lie between the hyperplanes as outliers. A vector of ?slack variables? k? is introduced to enable this concept of soft margin hyperplanes. The direct form of the optimization problem now becomes: ))(21(min 1 ? = +>?< p k kw CFww ? , under the condition that pkxDy kkk ,...,2,1,1)( =?? ? (Equation 2.20) C is a sufficiently large constant, often different in different variations of SVM, used as a penalty coefficient. It acts similarly to the loss function in MLC. k? should be between the value of 0 and 1. F(n) is a monotonic convex function, chosen from a many options at the discretion algorithm developers. It has been proven that the 2-class soft-margin SVM can be solved using kernels in the same way as in the 2-class hard-margin SVM classifier (Cortes and Vapnik 1995). 2.6.5. From 2-class SVM to Multi-class SVM SVM was developed from the classic case of 2-class separation. Researchers have tried different approaches to solve the multi-class separation case. For a dataset 34 with N classes, it was proposed to execute N(N-1)/2 pair-wise SVM classifiers and use a voting mechanism to determine the final class label of each data point (Hastie 1996). This algorithm is known as the ?one-against-one? approach. It also has been proposed to execute N SVM classifiers of each class vs. the rest of the classes (Bottou 1994). This is known as the ?one-against-all? approach. Lately, the ?one-against-one? approach, the ?one-against-all? approach, and a multi-class simultaneous optimization approach were compared sided by side. Their results showed that the ?one-against-one? and ?one-against-all? approaches achieve the best accuracies, while the ?one-against-one? is also the fastest approach (Hsu 2002). In light of this, current multi-class SVM implementations usually adopt the ?one-against-one? voting algorithm. This voting mechanism leads to two important consequences. The first is that the probability generated by contemporary SVM algorithms is the summary of the votes. Thus, arguably, it cannot be viewed as statistical probability. The second consequence is that, if the SVM algorithm is implemented by the ?one-against-one? approach, the computation time will increase rapidly with the number of classes. 2.6.6. Choice of Kernel and Kernel Parameters The use of symmetric kernels is a key breakthrough in the development of SVM. The structure and parameters of the kernels are vital to avoid overfitting. Several kernels have been proposed for use with real-world datasets. The most commonly used kernels are the RBF (Radial Basis Function) kernel, the polynomial kernel and 35 the Sigmoid kernel. )tanh(),(: ),(: )(),(: 2 rxaxxxKKernelSigmoid exxKKernelRBF rxaxxxKKernelPolynomial j T iji xx ji d j T iji ji += = += ??? (Equation 2.21) The Sigmoid kernel has been proved to be less efficient than the RBF kernel (Lin and Lin 2003). The classification accuracy of polynomial kernel varied a lot with regard to the polynomial order(Huang et al. 2002). Only when high-order polynomial forms are used can the polynomial kernel achieve similar accuracy as the RBF kernel. The use of high-order polynomial kernels substantially increases the time needed for training. A similar study demonstrated that the RBF kernel has become the most favored kernel for SVM in practice (Huang et al. 2002). An interesting fact is that the RBF kernel is actually a high-dimensional Gaussian kernel. There has been reported (Small 2004) that the Gaussian assumption of the maximum likelihood classification negatively affects MLC performance when applied to large areas. It would be also necessary to assess whether the Gaussian kernel of SVM is also susceptible to this problem. Therefore in the next chapter we will take a look into this case. The RBF kernel is controlled by two variables: C and ? . The choice of their values strongly affects the accuracy of SVM outputs. In practice, a procedure called K-fold cross validation is used to identify the best set of parameters (Stone 1974; Lin and Lin 2003). In each permutation run, a random 1/K of the total training data is used to train the SVM model using a particular combination of parameters. The rest 36 of the training data are used for accuracy assessments. The parameter set of the permutation run with highest accuracy will be used for the complete training dataset. In practice, it has been showed that SVM classification accuracies do not fluctuate significantly when the size of the training dataset shrinks (Song et al. 2005). Therefore, the K-fold cross validation process can just use a fraction of the total training data and still find the optimal parameter set. This greatly shortens the time needed for cross validation. The whole cross validation process, however, is completely missing or unspecified in the current generation of ENVI software, which is the first major remote sensing toolbox to incorporate the SVM algorithm. 2.7. Kernel Perceptron (KP): Introducing Neural Network into SVM Kernel Perceptron is a recent development of SVM (Lin and Li 2005; Lin and Li 2005). It is developed from three theories: a boosting theory called infinite ensemble learning, the classical neural network model of the Perceptron, and the kernel design of the support vector machine. It has been suggested that KP is should outperform SVM (Lin and Li 2005). Therefore in our study we decided to include KP as a more recent integration of both SVM and Neural Nets. 2.7.1. Adaptive boosting: Infinite Ensemble Learning Boosting is a meta-algorithm, which means it is used on top of other learning algorithms to improve their performance. It has been described as ??one of the most important recent developments in classification methodology?? (Friedman 2000). 37 AdaBoost (Freund and Schapire 1996) refers to adaptive boosting. It is the most simple, popular, and successful boosting meta-method for machine learning. AdaBoost is adaptive in the sense that subsequent classifiers built are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. Otherwise, it is less susceptible to the overfitting problem than most learning algorithms. This has been demonstrated in a remote sensing study (Chan and Paelinckx 2008). For a given integer T and a hypothesis set H, AdaBoost iteratively selects T hypotheses Hht ? and weights 0?tw to construct an ensemble classifier. The equations in section 2.7 are all adopted from Lin and Li (2005). ))(()( 1 xhwsignxg tt T t ? = = (Equation 2.22) When T goes to infinity, AdaBoost approximates an infinite ensemble classifier: ))(()( 1??== t tt xhwsignxg (Equation 2.23) 2.7.2. Building the Ensemble Kernels for SVM It has been pointed out that AdaBoost and SVM both use the inner product (Freund and Schapire 1999). This similarity was later demonstrated in an effort to build special kernels for the SVM algorithm such that the infinite ensemble algorithm gets embedded in the kernels (Lin and Li 2005). To do this, a kernel that embodies all the hypotheses in H needs to be designed. Then, the classifier obtained from SVM with this kernel is a linear combination of those hypotheses and thus an 38 ensemble classifier. In addition, the structure of SVM makes it possible for the first time to construct ensemble classifiers with infinite hypotheses intended against overfitting (Lin and Li 2005). Lin?s ensemble kernel thus designed has the following general form: ?= c xxH dxxK ?????? )()()',( ', (Equation 2.24) In this kernel form, H is the set of hypotheses }:{ ChH ?= ?? . C is a measurement space. The function )()()( xhrx ??? = maps the data x into Hilbert space. The variable ? is the parameter of an arbitrary hypothesis )(xh? . This general form of an ensemble kernel is thus an integral of inner products. An earlier technique (Scholkopf 2002) was used to construct kernels from an integral inner product. Lin?s kernel is used in the soft-margin SVM. The SVM optimization problem was ))(21(min 1 ? = +>?< p k kw CFww ? under the condition that 0;,...,2,1;1)( >=?? kkkk pkxDy ?? . Now it becomes: ? ? = + C p k kCFdw ))()(2 1min 1 2 ??? under the constraint that 0;,...,2,1;1))()()(( >=??+? kk C ik pkbdxhrwy ????? ? (Equation 2.25) This SVM model based on the ensemble filter will be valid if and only if the hypothesis set H is negation complete. That is, Hh? if and only if Hh ?? )( . Negation completeness is usually a mild assumption for a reasonable H. 39 Lin?s ensemble SVM model will have the solution classifier g(x): ))()()(()( ? += C bdxhrwsignxg ??? ? (Equation 2.26) With Lagrange multiplers i? , the final form of Lin?s ensemble SVM classifier is )),(()( 1 ? = += N i iHii bxxKysignxg ? (Equation 2.27) 2.7.3. Kernel Perceptron The Kernel Perceptron is an ensemble kernel method built on the Perceptron idea (Rosenblatt 1958). He designed the Perceptron as a hypothesis on how the human brain perceives the information from the outside world, and hence the name ?Perceptron?. Later it was developed into a neural network learning method by assuming the neurons work as Perceptrons. The Perceptron classifier can simply be expressed as: )()(, ???? ??= xsignxp (Equation 2.28) In this equation, x is the input data of multi dimensions, ? is an array of coefficients, >?< is the inner product of vectors, and ? is the threshold value. Lin embedded infinite amount of Perceptrons into an ensemble classifier (Lin and Li 2005; Lin and Li 2005), and used a SVM to get the optimal solution, which could not be achieved before the advent of SVM. The resulting algorithm, named the Kernel Perceptron, is equivalent to a neural network with one hidden layer, infinitely many hidden neurons, and a hard-threshold activation functions. They proved that mathematically the Kernel Perceptron is just the regular form of SVM 40 with a special type of kernel. ')',( xxpxxK p ???= , where p? is a constant. (Equation 2.29) In other words, Kernel Perceptron is more like a new type of SVM with a neural network kernel. In this we see the integration of several of the families of classifiers described in figure 2.1. Using several standard machine learning test databases, the performance of the Kernel Perceptron was compared to that of SVM with the RBF kernel. The result shows that KP outperforms SVM-RBF when the source data contains 10% mislabeling error (Lin and Li 2005). This encouraging result suggests that the KP algorithm might also outperform SVM-RBF in real world datasets. Therefore the KP algorithm is also studied in our experiment. 2.8. A Brief Discussion on Self-Organizing Maps Neural Network (SOM) Kohonen?s Self-Organizing Map (SOM) neural network is a unique type of neural nets because it takes into consideration the detailed boundary of classes (Kohonen 1990). We will mention it briefly here, and use it in only one experiment (section 3.7) in chapter three. It has a special design that is of interest to us. This method will not be covered in the other experiments and therefore we will not elaborate on it. SOM consists of three steps. The first step is called coarse tuning, which is basically an unsupervised clustering based on Euclidean distance. This step establishes a fundamental regional organization (a topology) of neuron weights that 41 represent the clusters and sub-clusters in the input data. The second step is called labeling. This determines the classes to which the neurons belong. The third step is called the fine tuning, which uses the training data to carve out the detailed borders among neurons, using an algorithm called Learning Vector Quantization (LVQ). The refined neurons in the output layer are now considered fully trained, and can then be used to conduct classification. The unique design caveat of SOM is that it incorporates the underlying clusters of the input training data. This means that SOM should be very sensitive to the class proportions in the training set. 2.9. Cross-comparison of Machine Learning Algorithms for Remote Sensing The remote sensing community has adopted change detection algorithms from the machine learning community. As new algorithms appear every year, there are numerous of remote sensing studies that assess one ?new? algorithm against a couple of ?standard? algorithms such as the classic MLC. This approach effectively demonstrates the virtues of a new algorithm. Each study presents us with one algorithm superior than the MLC algorithm, which was designed more than five decades ago. One would naturally ask: with so many new algorithms at hand, is there a generally superior algorithm? Or are these fancy new algorithms good for different situations respectively? A related question is whether many users simply over-tune the classifiers and hence demonstrate no more than that for one particular 42 example better results can be obtained with no guarantee that similar improvements will be found for other areas. It is obvious that, a cross-comparison among the modern algorithms is more important. However, there are too many change detection algorithms to be tested one by one. Rather than comparing every new variant of the basic methods our approach is to carry out a cross-comparison of superior examples from each of the different families. Moreover our comparison will not simply be an empirical assessment but will attempt to explain the differences in terms of the mathematical theories underlying them. To be more specific, we seek to find which underlying designs are effective at handling uncertainties and errors in practical applications. We summarize them into table 2.1. These promising methods are chosen based on their theoretical strengths and feedbacks from contemporary literature. Some of these theoretical strengths are very desirable for remote sensing studies. All the methods are tested in the next chapter, in different scenarios chosen to resemble real-world geographical applications. Our study is not aimed at touting at performance of the supposedly best algorithm(s). We are aware that the mathematical characteristics may have side-effects as well as strengths. From the table above, we can see that the machine learning algorithms were born with hidden assumptions. 43 Table 2.1 Summary of mathematical characteristics and expected strengths and weaknesses of algorithms discussed in Chapter 2 Algorithm Family Algorithm Name Mathematical Characteristics Expected Strengths Possible Weaknesses Bayes Classifiers MLC Assumption of Gaussian Distribution; Classes defined from centers High accuracy in small-scale studies Lower accuracy in complicated non-Gaussian data; Curse of Dimensionality Entropy-mi nimization DT No assumption on data distribution Good accuracy in large-scale studies Salt-and-pepper errors; Curse of dimensionality Neural Networks ARTMAP Adaptive training data Training pattern can be improved with incoming data for classification overfitting Margin-ma ximization SVM Classes defined from boundaries; The RBF kernel assumes Multimodal Gaussian Distribution of data; SVM assumes smoothness in both the estimator and data observation High accuracy at all scales The Gaussian assumption is controversial Kernel KP Classes defined from boundaries; No Gaussian assumption on the data; Infinite Boosting High accuracy at all scales overfitting It is also interesting to see that as new algorithms are developed, some of the controversial hidden assumptions in the older algorithms were adopted again as building blocks. For example, the Gaussian assumption has been used in both MLC and RBF, which is the most successful kernel form of SVM. Another example is that the Perceptron model was used both in traditional neural networks and modern Kernel Perceptron. Arguably, the algorithms that share assumptions might exhibit 44 similar performance weakness under certain scenarios. The Gaussian assumption has long been criticized for being too simple for geographical variations. The Perceptron neural networks model, on the other hand, has long been criticized for being too prone to over-fitting. This also leads to some worries about Kernel Perceptron. Would it also tend to overfit? Also in this chapter we identified two interesting hypotheses from contradictory literature. The first is that the decision tree might be ill-suited for stacked change detection because it does not handle high dimensional data as well as some algorithms. The second is that the Gaussian assumption on geographical data over large areas might not be totally invalid. Since the Gaussian kernel of SVM is indeed a simulation of multimodal Gaussian, it might actually fit the geographical phenomena very well. These pros and cons have deep roots in the mathematical theory and have to be assessed in empirical studies. Since these mathematical features were built into the algorithm to handle uncertainties, we will test the algorithm under challenging classificatory situations. Unlike other studies that assess algorithms in an arbitrary scenario, our study simulates special scenarios for testing different aspects of the algorithms. These different aspects trace back to and are targeted at the theoretical strengths and suspected weakness we discussed here. In the next chapter, we will also define the qualities of a truly good algorithm. In DeFries et al. (2000), two general criteria were raised as key: stability and robustness. In the past several years we have accumulated knowledge on what 45 stability and robustness truly mean in the real world. We will find out in the next chapter which algorithm best meets these criteria. And if no algorithm can satisfy all the criteria, we will try to find out the most acceptable solution. 46 3. Assessing Machine Learning Algorithms with Real-World Uncertainties 3.1. Assessments and Comparison Design In chapter two, we outlined the possible strengths and weaknesses of modern machine learning methods. We hope to find out under what conditions will classifiers be successful and when not. What are the internal designs that lead to varying degrees of success? Is there a classifier successful enough for most real-world applications in remote sensing? These questions have not been systematically addressed in previous studies. This study tries to attribute the varying degree of success to two factors: the internal designs of classifiers, and the real-world complexities in the field of remote sensing. The designs of classifiers originate largely from statistical theory, e.g. the Mahalanobis Distance (Mahalanobis 1936) and applications in computer science, e.g. the MLC learning of texts (Chow 1957). They were never custom-built for geographical phenomena. It would be wishful thinking that existing machine learning methods can automatically handle geographical uncertainties perfectly. Traditionally, when the accuracies of different supervised change detection algorithms are assessed and compared, the characteristics of the selected training and validation sets are not quantified. This can introduce biases into the comparison. Although this source of bias had been brought up from time to time during the past 47 four decades, but it had largely been ignored in contemporary remote sensing. When addressing the generalization power of machine learning in general, Vapnik (Vapnik 1999) stated: ?One first has to find the appropriate structure of the learning machine which is a result of the tradeoff between overfitting and poor approximation, and second find in this machine the function that minimizes the number of errors on the training data.? Thus, if we put the secondary goal of accuracy maximization as the top priority, as seen in so many contemporary remote sensing studies claiming classification accuracies over 95% and regression R-squares over 95%, then we lose sight of the big picture: the tradeoff between overfitting and underfitting. Therefore, classification accuracy is only meaningful if the classifier structure is right for the data. To figure out that appropriate structure for geographical phenomena, we must identify possible weaknesses. After that, we can think about improving the accuracy. Most previous studies have significant weaknesses when applying training data, though several investigations have attempted to overcome individual weaknesses with varying amount of success. Our perspective here is to systematically outline these weaknesses and seek solutions accordingly. This will enable us to improve the classifier structure and then the accuracy itself. Our approach aims to isolate the effect of each bias caused by training data sets. We would estimate how well the change detection algorithm can do with or without the biases caused by the training data set. The best change detection algorithm 48 should be able to achieve high accuracy, while least influenced by adverse biases from the training data set. If no existing algorithm satisfies these high standards, then we would need to think why this happens and how to solve this. 3.1.1. The Tradeoff between Generalization Power and Accuracy First, most traditional assessments tell how successful the algorithms are when analyzing study areas of limited range of sizes, land cover variation, and atmospheric conditions. This is a problem. In a pioneering work based on a land cover study using Landsat-1 imagery, it was shown that when the atmospheric turbidity decreases 1.3, the maximum likelihood classification result can differ by a whopping 22% (Fraser et al. 1977). It was also shown that the performance of the MLC algorithm starts to drop in complex environments after the band number is more than five (Lillesand and Kiefer 1979). However this type of issue was not widely acknowledged until the last decade, partly because multispectral remote sensing is more and more applied to study continental and global changes. Researchers in the last decade started to raise the ?stability? requirement (DeFries and Chan 2000) and the generalization power criterion (Woodcock et al. 2001). In the latter paper, the benefit of generalization power to Geographers were clearly laid out: ?Methods based on generalization require less time and effort than conventional methods and as a result may allow monitoring of larger areas or more frequent monitoring at reduced cost.? (Woodcock et al. 2001) Also pointed out was that many data-driven algorithms, such as the maximum 49 likelihood algorithm, can fail at complex datasets (Hastie 2001). Also mentioned is that the principal component analysis would have drastically different results in different areas while decision tree is reasonably better (Scull et al. 2005). ARTMAP and Decision Tree have been recommended in such situations (Rogan et al. 2002). SVM was recommended above Decision Tree and MLC for variations over large areas (Song et al. 2005). In another recent study (Masek et al. 2008), the accuracy assessment for their new algorithm is done in 23 sites across the US. This is very convincing yet also very hard to achieve. To avoid this methodological weakness, we choose our study areas to be large and very complex. Three study areas were chosen. Each area has a distinctive ecosystem, and a unique landscape pattern. These three areas also show a sharp difference in annual rainfall. The impacts of geographical variation will be further discussed in sections 3.2, 3.3, and 3.4. 3.1.2. The Realistic Acknowledgement of Errors in the Source Second, the traditional assessment routine most tells how successful the algorithms are when they are fed with 100% correct training data. Only in the last decade has researchers started to address the problem that there exists mislabeled training data in remote sensing applications (Brodley 1996). Algorithms thus must possess the ?robustness? property (DeFries and Chan 2000). It was then reported that MLC might be susceptible to mislabeled training data (Simard 2000; Miller and Franklin 2002) and is inferior to ARTMAP on error tolerance (Rogan et al. 2008). 50 To avoid this methodological weakness, we carried out experiments on the impact of varying amounts of error in the training data. Errors from 5% to 50% will be used to see how well the algorithm resists training errors. A contemporary study tested error for three algorithms (Rogan et al. 2008). Our study will test 5 algorithms instead. And more importantly we need to find out what in the classifier(s) works mathematically against errors. Section 3.5 covers the results on the error tolerance. 3.1.3. The Uncertainty in Class Definition Third, the traditional assessment routine tells how successful the algorithms are when they are fed with training data from typical and pure ground cover types. There has been no known publication discussing this issue. To avoid this, we do not choose our training data only from distinctive and pure landscapes. Instead, training data will be chosen randomly from across the whole study area. We also assessed using training data in the relatively transitional zone against relatively that in the core zone. This separation and comparison of training data from the core zone and the transitional zone has not been mentioned before. The results on the transitional training data will be discussed in section 3.6. If we bring this topic a little further, we can also ask a more fundamental question. How would the conceptual definition of geographical classes affect the classification outcome? Chapter 5 will explore into this question. 51 3.1.4. The ?Blind men and the Elephant? Problem Fourth, the traditional assessment routine tells how successful the algorithms are, but the actual sampling process of the training data is often arbitrary or neglected. This situation is similar to the ancient Asian fable of the blind men and the elephant. It was put into a poem by John Godfrey Saxe (1816-1887). It was six men of Hindustan To learning much inclined, Who went to see the Elephant (Though all of them were blind), That each by observation Might satisfy his mind This ancient fable shows us that our observation is a sampling of reality, and that can induce our partial perception of reality. If we only observe the tail, we might conclude that the elephant is like a snake. Although we are not blind, we still could blindly trust a methodology developed not specifically for Geographical phenomena. Almost all contemporary change detection studies use three types of sampling strategy when they choose the training dataset. The sampling may be random, the systematic, stratified or even purposive (i.e. when chosen by the analyst). This is intended to avoid statistical bias in the inference of ?population? accuracy. The performance of change detection algorithms may be affected by the choice of sampling method. There have been no known studies on the effect of this aspect, although stratified sampling is often preferred because it gives an ?equal? representation for all classes. For example, in a study that discussed the effect of training data (Rogan et al. 2008), his training data is not selected pure randomly, but 52 with equal amounts of training data in each class. Recently, researchers have been focusing more on the topic of sampling. Stehman published a series of papers (Stehman 2000; Zhu et al. 2000; Stehman et al. 2003; Stehman 2005; Stehman 2009) introducing the ?model-based sampling? as compared to the ?design-based sampling? such as random, stratified, and systematic samplings mentioned before. His major concern was that geographical events are often not spatial random. Therefore design-based sampling is not sufficient to characterize the whole area statistically. This is similar to the concerns raised by Tucker and Townshend (Tucker and Townshend 2000) although expressed with a different language. However, Stehman?s interest was purely in the estimation of accuracy for end products of remote sensing studies, not in the process of remote sensing analysis. He did not realize that, our way of observation can foul our analysis process. To study this problem, we used variable class proportions in the training data. This kind of study has also not been done in contemporary publications. Section 3.7 will cover the results on the sampling of training classes. 3.1.5. Minimizing the Cost of Sample Collection Fifth, the traditional assessment routine tells how successful the algorithms are when the amount of the training data is often unrealistically large for practical applications. This problem has only been noticed in the past a few years. Our earlier work mentioned that the accuracies of SVM and Decision Tree do not decrease as much as MLC does when the available training set size was reduced to 1% of 53 original set (Song et al. 2005). There have also been efforts trying to prove theoretically that SVM requires far less training data because of its mathematical designs (Foody et al. 2006). Another study found that ARTMAP accuracy only lost 10% when the training set size was reduced by 25% (Rogan et al. 2008). To avoid the fifth weakness of traditional assessments, we use varying amounts of training data in our assessment. The results on the abundance of training information will be discussed in section 3.8. A contemporary study tested 3 algorithms when the training data is reduced by 50% (Rogan et al. 2008), while our study compares 5 algorithms when the training data is reduced by 80%. What is more important than just finding the efficiencies of different algorithms is to find out which internal design makes this happen. These five approaches in our assessment will tell how well the candidate algorithms handle geographical uncertainties and errors in the real world. These assessments will allow us to assess empirically whether the theoretical strengths and limitations listed in table 2.1 really exist. In this chapter we will also present the first large-scale testing of the SVM and ARTMAP algorithms in remote sensing, and the first application of the promising Kernel Perceptron algorithm in remote sensing. 3.2. Geographical Information of the Assessment Areas As the first step to avoid overfitting, our experiments from sections 3.3 ~ 3.8 look at multiple areas with different ecosystems and complex land use trajectories. We chose three areas in the country of Paraguay to test the algorithm candidates. 54 The country of Paraguay has three major ecosystems from east to west, namely the Atlantic Forest, the Humid Chaco, and the Dry Chaco. These three ecosystems have vastly different appearances and species. The Atlantic forest is a closed canopy forest in humid coastal climate from the Eastern coast of Brazil (Olson and Dinerstein 2002) to the eastern departments (provinces) in Paraguay. The dry Chaco in inland Paraguay and Bolivia has wet season and dry season in a year and is mainly covered by open-canopy woodland (Olson et al. 2000). The humid Chaco is a transitional zone between Atlantic forest and dry Chaco, with some wetlands, grasslands, and inter-annual floods (Cabrera 1976). All three areas have moderate-to-extensive agriculture developments during the time span of 1990-2000. The Dry Chaco area is dominated by woodland, while the other two areas are dominated by non-forest. Each area was chosen to include significant amount of forest change. The sizes of these three test areas are 9076, 9849, and 5878 km2 respectively from east to west. Table 3.1 Geographical Information of Test Areas (Huang et al. 2009) Landsat path/row 224/78 225/77 228/76 Ecosystem Atlantic Forest Humid Chaco Dry Chaco Forest Percentage 26.7% 23.5% 58.1% Nonforest Percentage 48.5% 68.1% 34.7% Forest Change Percentage 24.8% 8.4% 7.2% Area (sq km) 9076 9876 5878 We have an accurate forest change map of Paraguay and we used it as both for the training and accuracy assessment (Huang et al. 2007; Huang et al. 2009). Cloud-free images of Landsat TM (1990) and ETM+ (2000) were used to develop this wall-to-wall forest cover change map using an iterative clustering-supervised labeling (ICSL) method. Unsupervised clustering using the ISODATA algorithm (Tou and 55 Gonzalez 1974) and supervised labeling of clusters using training pixels were applied iteratively to resolve spectral confusions among the concerned classes. This iterative process is highly reliable and has been assessed by 136 aerial photos, as well as IKONOS and Quickbird imagery covering 64km2. The overall accuracy is higher than 95% (Huang et al. 2007; Huang et al. 2009). The resulting Paraguay forest change map is thus a good test-bed for training data and testing data as well. We select training data randomly instead of confined to a fieldtrip or an IKONOS image. We also use the whole area as our testing data for the accuracy assessment. In this map and throughout this dissertation, the color scheme will be: Green for the Forest-to-Forest class, yellow for the Nonforest-to-Nonforest class, and red for the Forest-to-Nonforest class. Figure 3.1Three test areas in Paraguay 56 Only three classes were used in these experiments: persistent forest, forest-to-nonforest change, and persistent nonforest. Our study did not use the nonforest-to-forest class. There has been no sizeable land in Paraguay that went through forest regrowth during the 1990s. 3.3. Assessing the Algorithms in Different Geographical Regions In this experiment, we start to look at the basic characteristics of our algorithms with a very simple design. 2000 random pixels were used in each test area as the training data. Each class was given the same amount of training pixels. And we evaluate the algorithms by means of total accuracy, as well as the user and producer accuracy of the forest change class. The higher accuracy, the more capable is the algorithm at adapting to various geographical contents. In sections 3.4-3.8, our experimental designs are actually further developed from the experiment in this section. Our findings are listed in table 3.2-3.4. The algorithms have achieved different accuracies in the three ecosystems. Generally speaking, the algorithms have higher accuracies in the Dry Chaco region. This might be caused by both the dry climate, the limited types of land use in that region, and the fact that this test area is smaller than the other two. The forest clearings in the Dry Chaco region become ranches and farms. These large ranches and farms are very large and stand out easily against other classes. While in the eastern regions the forest clearings become farms of soybean and other crops. The land parcels are much smaller and more varied in the 57 east. In short, the west area has a simpler set of geographical features. Table 3.2 Overall Accuracy of different classifiers in different regions Classifier Atlantic Forest Humid Chaco Dry Chaco MLC 90.47% 86.69% 93.33% ARTMAPNN 86.89% 86.96% 88.53% DT 89.94% 89.62% 91.43% SVM 91.12% 91.77% 93.68% KP 92.56% 91.97% 94.12% Table 3.3 User Accuracy of the Forest Change Class produced by different classifiers in different geographical regions Classifier Atlantic Forest Humid Chaco Dry Chaco MLC 80.29% 83.62% 94.32% ARTMAPNN 76.12% 71.61% 82.71% DT 81.52% 72.85% 88.43% SVM 85.32% 76.89% 89.20% KP 86.29% 76.91% 91.66% Table 3.4 Producer Accuracy of the Forest Change Class produced by different classifiers in different geographical regions Classifier Atlantic Forest Humid Chaco Dry Chaco MLC 90.27% 63.08% 81.52% ARTMAPNN 80.28% 68.13% 80.63% DT 86.71% 77.14% 81.10% SVM 88.39% 81.63% 90.40% KP 89.76% 82.93% 89.05% As we compare the algorithms in three geographical setting, we have several findings. The first finding is that the ARTMAP neural network is clearly not good for any geographical setting at all. It almost always achieves the worst performance. The second finding is that SVM and KP almost always achieve best performance in overall accuracy as well as the user and producer accuracies of the forest change class. More important is that they did well in all three ecosystems, showing the robustness of kernel methods. 58 Our third finding is that MLC remains a good alternative although its performance does vary from place to place. For example, MLC made the best producer accuracy among all five methods in the Atlantic forest test area, but achieved the worst producer accuracy among all five methods in the Humid Chaco test area. We also produced some images to show the change detection results from different algorithms. Throughout this dissertation, the color scheme will be: Green for the Forest-to-Forest class, yellow for the Nonforest-to-Nonforest class, and red for the Forest-to-Nonforest class. Figure 3.2 shows the classification results by different algorithms in the eastern area. We can see that, graphically speaking, SVM and KP results have a distinctive look of rounded edges around land cover patches, while ARTMAP and DT results have a lot more salt-and-pepper noises. Paraguay Map MLC ARTMAP NN DT SVM KP Figure 3.2 Change detection results from different algorithms in Eastern Paraguay 59 3.4. Assessing the Algorithms over Large Areas In our earlier work (Song et al. 2005), we found that the SVM algorithm have a unique property. It could use limited training data from multiple satellite scenes blended and still has decent performance at detecting forest change over large area. In that comparison, MLC and DT were tested against SVM. MLC showed poor performance. The DT algorithm got limited success in terms of accuracy but the resulting change map is virtually unusable due to widespread tiny errors of the salt-and-pepper type. While forest change detection does not necessarily have to be performed at multiple scenes at once, what is important is that SVM showed a potentially useful generalization property. The geographical variations over large areas did not ruin the change detection. This property can be of good value at some regions of Earth where strong local geographical variations exist. Therefore, we hope to examine all five of our candidate algorithms. It would be nice if some of them other than SVM also show this property. This assessment creates a pseudo-image mosaic of all three areas together. There is no atmospheric correction or any radiometric enhancement. On one hand, the classification of individual satellite images does not benefit significantly from atmospheric correction (Song et al. 2001), on the other hand, the classification of multiple satellite images together is a grill for the classifier. The five classifiers will have to deal with much larger spectral variation in every land cover change class. 60 We also limit the amount of the training data to a very small set of 1000 pixels. If some algorithm(s) could still achieve good change detection in this manmade extreme case, then in the real world it can as well handle strong geographical variations with very limited training information. Our test results are shown in table 3.5: Table 3.5 Performance of algorithms over large areas Total Accuracy User Accuracy of Change Class Producer Accuracy of Change Class MLC 85.73% 81.26% 62.59% ARTMAP 82.86% 66.64% 64.54% DT 88.40% 73.73% 81.31% SVM 91.72% 73.79% 91.48% KP 91.93% 76.13% 90.17% We concluded that, first of all, the ARTMAP Neural Network method should be avoided at all costs. They are quite ineffective at generalization. Second, MLC and the kernel methods have different strengths. MLC have higher user accuracy (100%-commission), while the kernel methods tend to have much higher producer accuracies (100%-omission). This seemingly odd contrast will be explained using the findings from section 3.7. Finally, the best overall performance still belongs to the kernel methods. In addition to the accuracy numbers, we also studied the change detection images closely. Figure 3.3 shows a subset image on the border of three areas. We could see from the above images that, although the accuracies numbers do not vary too much, we could only find the map outputs from SVM and Kernel Perceptron are much more clear and meaningful. ARTMAP and DT results still have the undesired salt-and-pepper noises. 3.5. Assessing the Error In this assessment we blemish the original class label of the training data with varying amount of rando errors such as those caused by image misregistration, ambiguous land cover types, and different interpretations among analysts. Therefore our approach of adding a percentage of errors into an than an ?ideal? training set. We hope to know how the algorithms would perform with regard to such errors in the training data. Algorithms without significant accuracy loss would be considered as error-tolerant and thus prized in practice. Before developing the TDA algorithm, we already found out by luck that the SVM algorithm showed some error In this experiment we will systematically assess all five algorithms in this regard. Is Paraguay Map DT Figure 3.3 Change detection results from different algorithms on large 61 Tolerance of Algorithms m errors. In the real-world application, there are inevitable ?ideal? training set is closer to the real MLC ARTMAP NN SVM KP -world application tolerance. -area test SVM the only algorithm with such share this property? In each test, a total of 1500 training pixels are whole study area. E from the eastern area are 62 error tolerance? Do other modern algorithms systematically sampled from the rror starts from 0% to 50%, by 5% increment shown in figure series 3.4. s. The results Figure 3.4 Error Tolerances of different Algorithms in Eastern Paraguay There is a very distinctive pattern of exceptional that with ~30% errors in the training set, the overall accuracy of SVM classification in this test Let us also look at the results from other test area are shown in figure series 3.5 63 error tolerance in SVM. It i area stays largely unaffected! Would this be a coincidence? areas. The results from th : s truly e western Figure 3.5 Error Tolerances of Different Algorithms in Western Paraguay We found that in the above two test tolerance of error in the training data. Usually SVM can maintain >90% accuracy when 0%~30% of the training data is actually wrong. Kernel Perceptron maintains about 85% accuracy with up to 20% error in the training data. MLC shows a lower error tolerance but fluctuates a lo all. However, the user accuracy of KP algorithm drops to 0 when 30% of training data is wrong. The change detection map shows that for some unknown reason, the KP algorithm fails to pick up a error and a very low commission error. This shows why we have to look at both the user accuracy figure and the producer accuracy figure. 64 areas, SVM consistently t. DT and ARTMAP are almost not error ny forest change. This leads to very high omission showed a unique -tolerant at 65 The test in the Central Paraguay area also shows this problem. What?s worse is that some of the SVM change detection results do not have the change class also. This brings a possibility: when a class has a small training dataset with lots of errors, kernel methods might fail to pick them up at all. In this study site, the change class is a quite minor class. Thus by systematic sampling, we are actually only giving the change class ~ 110 training points. This tells us the ?bottom line? of SVM?s error tolerance property. We can use SVM when we have a training set of small size but high reliability (will be explained in section 3.8), or a training set of large size but less reliability. But we cannot expect SVM to cope with a training set of small size and low reliability. The accuracy results in the central test area are plotted in figure series 3.7. Paraguay Map MLC ARTMAP NN DT SVM KP Figure 3.6 Error tolerances of Classifiers in Western Paraguay with 30% errors in training Figure 3.7 Error Tolerances of Different Algorithms in Central Paraguay 3.6. Assessing the Algorithms with Mixed Data Reliable training data is usually derived from field trips and image interpretation. Traditionally, when change detection algorithms are assessed and compared, the researcher tends to rely on the most reliable training data pixels, which are of no surprise often from the most prominent 66 or Atypical Training and most typical land cover parcels. 67 Researchers also tend to pick the pixels in the center of land parcels for an important reason: to avoid misregistration. Pixels there are also usually more pure than transitional or mixed land cover. Our intention in this experiment is to see how the candidate algorithms handle mixed land cover as training data in addition to the pure land cover as training data. Our hypothesis in this experiment is that, those pixels at the hearts of land parcels are more likely to be pure land cover types, and the pixels around the edge of land parcels are more likely to be transitional land cover types. Our experiment looks at the change detection accuracy variation when the training data pixels were selected from varying distances from the land parcel boundaries. The land parcel boundaries are generated using the Canny edge detector, a detector used routinely in image processing. Only the classifiers of SVM and DT were performed in this experiment. This experiment was conceived in the very early stage of this dissertation, before the inclusion of other classifiers. Our result is shown in the following graph: 68 Figure 3.8 Location Effects of Training Data Our experiment did not find significant accuracy improvement when the training data is selected around the land parcel boundaries compared to when the training data is selected in the heart of land parcels. SVM is a boundary classifier in the feature space, but it does not seem to benefit significantly from training pixels of physical boundaries. Therefore, the relative geolocation of training data for SVM seems to be not important. 3.7. Assessing the Algorithms with Varying Contents of Training Data A training data set contains training samples from multiple classes. When designing a change detection study, the amount of training pixels for each class has to be decided. 1 2 3 4 5 6 7 8 9 100.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 The distance of training data from nearest land cover patch boundary (in pixels) Ov er all A cc ur ac y f or C ha ng e D ete cti on (% ) Will the location of training pixels affect SVM accuracy? SVM Contextual Result Accuracy SVM Result Accuracy DT Result Accuracy 69 Contemporary classification studies have used a variety of different approaches which impact the relative proportions of training sets. The so-called availability-based sampling is the most popular approach in which the researcher feed all the available training data to the classifier. This is actually the most common type in many contemporary studies (Keuchel et al. 2003; Sesnie et al. 2008; Schneider et al. 2009). Several papers have used equal or roughly equal number of points in each class (Rosenfeld et al. 1982; Rogan et al. 2002; Foody et al. 2006; Kuemmerle et al. 2009). Another approach, systematic sampling, collects sample points using a grid (Yuan et al. 2005). This is rather rarely used though because the cost for collecting data systematically is quite high. In many studies the relative sizes of classes is not even discussed (Keuchel et al. 2003; Lucas et al. 2008; Potapov et al. 2008; Brenning 2009). Generally, classification modules in commercial software such as Idrisi, ENVI, or ERDAS Imagine leave it to the user to decide on the size and relative proportions of training data. However Idrisi Andes (version 15.0) developed by Clark University assigns equal amount of training data for each class in its Multi-layer Perceptron (MLP) neural net module. The reason for this was not explained in IDRISI help file. Stratified sampling had been widely used not just because it allows easier collection of training data compared to random sampling. It can also provide statistical confidence interval for the total forest change over the whole area, which random sampling can also provide. It also ensures that every major geographical unit has been represented, which random sampling cannot. 70 Different sampling methods lead to different sets of training data. Will the different amount of training data in each class affect the final performance of the change detection algorithm? People have not asked this question yet. Will any of our algorithms perform well without significant differences under stratified sampling and random sampling? We designed an experiment to answer these two questions. For each of our three test areas, we perform 19 runs of change detection. Each run has a different set of training data. This is shown in table 3.6 Table 3.6 Experiment on training data contents Change Detection Runs Forest Change pixels in training set(%) Unchanged Forest pixels in training set (%) Unchanged Nonforest pixels in training set (%) No.1 5% (1-5%)/2=47.5% (1-5%)/2=47.5% No.2 10% (1-10%)/2=45% (1-10%)/2=45% ? ? ? ? No.18 90% (1-90%)/2=5% (1-90%)/2=5% No.19 95% (1-95%)/2=2.5% (1-95%)/2=2.5% For each run we calculated the user accuracy of forest change class, the producer accuracy of forest change class, and the total accuracy of the whole study area. The results are plotted in the following figures. Figure 3.9 shows producer accuracy for the Eastern test area. We can see that, as the proportion of one class in the training set increases, the corresponding producer accuracy of that class generally increases gradually and approaches 100%. However, the MLC algorithm is different. It stays almost the same regardless of the class proportion in training. We can also see that when the proportion of a class in the 71 training set is extremely small, omission error can be high, especially for the Self-Organizing Map Neural Net and ARTMAP. Figure 3.9 The Producer accuracy plot of the eastern test area Let us move on to look at the user accuracy results. Figure 3.10 shows that, most classifiers result in lower user accuracy for a class when the proportion of that class increases in the training set. User accuracies drop to around 40% when the proportion of that class occupies 95% of the training set. This indicates substantial overestimation. MLC is again indifferent to the variation in training proportion. Figure 3.10 The User accuracy plot of the eastern test area We found that, the overall accuracy almost always ranges from 80% to 90%+, overshadowing the fact the accuracy of change detection is often very poor. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC Producer Accuracy DT Producer Accuracy SVM Producer Accuracy KP Producer Accuracy ARTMAP Producer Accuracy SOM Producer Accuracy 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC User Accuracy DT User Accuracy SVM User Accuracy KP User Accuracy ARTMAP User Accuracy SOM User Accuracy 72 Figure 3.11 The overall accuracy plot of the east test area The pattern we found from the first test area is clear. The performances of Decision Tree, SVM, KP, ARTMAP, and SOM are all significantly affected by the class proportions within the training set. If a class is over-represented in the training set, then it is overestimated in the classification output; and vice versa. This relationship has apparently eluded the remote sensing field, probably because the overall accuracy stays seemingly unaffected. We also observed that the MLC algorithm stays unaffected. The following figures show the three accuracy indicators of the other two test areas. These three figures are from the central test area (WRS-2 footprint 225/077): Figure 3.12 The producer accuracy of the central test area 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC Overall Accuracy DT Overall Accuracy SVM Overall Accuracy KP Overall Accuracy ARTMAP Overall Accuracy SOM Overall Accuracy 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC Producer Accuracy DT Producer Accuracy SVM Producer Accuracy KP Producer Accuracy ARTMAP Producer Accuracy SOM Producer Accuracy 73 Figure 3.13 The user accuracy of the central test area Figure 3.14 The overall accuracy of the central test area The following three figures are from the western test area (WRS-2 footprint 228/76): Figure 3.15 The producer accuracy plots of the western test area 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC User Accuracy DT User Accuracy SVM User Accuracy KP User Accuracy ARTMAP User Accuracy SOM User Accuracy 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC Overall Accuracy DT Overall Accuracy SVM Overall Accuracy KP Overall Accuracy ARTMAP Overall AccuracySOM Overall Accuracy 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC Producer Accuracy DT Producer Accuracy SVM Producer Accuracy KP Producer Accuracy ARTMAP Producer Accuracy SOM Producer Accuracy 74 Figure 3.16 The user accuracy plots of the western test area Figure 3.17 The overall accuracy plots of the western test area The accuracy trends in all three test areas have a marked similarity. As the percentage of a class increases in the training set, the more appearance it makes in the classification output; and vice versa. There is a difference among then, however. The western and central test areas show consistently higher producer accuracies than the eastern areas, while the eastern area shows consistently higher user accuracies than the other two areas. This is caused by the different class proportions of three test areas. The western and central areas have much lower proportion of forest change than the eastern area, as outlined in table 3.1. Therefore, classifiers are more prone to overestimate forest change in those two areas. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 Ac cu rac y Proportion of Change Class in Training MLC User Accuracy DT User Accuracy SVM User Accuracy KP User Accuracy ARTMAP User Accuracy SOM User Accuracy 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu rac y Proportion of Change Class in Training MLC Overall Accuracy DT Overall Accuracy SVM Overall Accuracy KP Overall Accuracy ARTMAP Overall AccuracySOM Overall Accuracy 75 This effect is especially important to change detection studies, because the change class is almost always a minority class in the whole satellite image. Popular practice is to use as big a training set as possible for the change class, but this will lead to the overestimation of this key class. Also, since each satellite scene has a distinctive spatial distribution of classes, we could not have a universal optimal percentage for a class in different satellite scenes. If we plot the trends of the producer accuracy and user accuracy together, we will see an interesting pattern. The user and producer accuracies of SVM meet at some midpoint (figure 3.18), while those of MLC stay approximately parallel (figure 3.19). Figure 3.18 The user and producer accuracies of SVM in the eastern study area, affected by the class proportions in training Figure 3.19 The user and producer accuracies of MLC in the eastern study area, unaffected by the class proportions in training 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 Ac cu rac y Proportion of Change Class in Training SVM Producer Accuracy SVM User Accuracy 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 Ac cu rac y Proportion of Change Class in Training MLC Producer Accuracy MLC User Accuracy This implies that, the omission machine learning algorithms are determined in the training stage, directly related to the amount of training pixels in each class. The maximum likelihood algorithm though now often considered inferior by the community, The trends in the figures of user accuracy and producer accuracy give the overall picture of this issue. We will also give a more visual examination of the spatial patterns compared against the ground reference map. This should indicate the locations of overestimation and underestimation errors. We would also like to find out whether overestimation We will show the results from different algorithms and training proportions side by side in a representative sub difficult one for change detection among the three test phenology between the two image dates. It also has the inter phenomenon on the nonforest land surface between the two image dates. The following figures show the Landsat TM 7 composite image, and the forest change reference map. Figure 3.20 Landsat TM 7 (Right) 76 and commission rates of the powerful new is largely unaffected. and underestimation are solvable by post -region of the west test area. This test areas. It has a varied forest -4-2 composite image, Landsat ETM+ 7 -4-2 (Left), ETM+ 7-4-2 (Center), Change reference , -processing area is the most -annual flooding -4-2 map We have 342 classification results in total and thus could not show all of them here. Instead, we will only show 18 classification results, in which six algorithms are fed with three types of training sets. The first training set has 5% data forest change. The second training set has 50% data labeled as forest change. The third training set has 95% data labeled as forest change. Figures 3.21 to figure 3.26 illustrates how different supervised classifiers handle training sets of same amount yet different class proportions. Figure 3.21MLC Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) Figure 3.22 DT Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) 77 labeled as Figure 3.23 SVM Classification with 5% change training (Center), with 95% change training (Right) Figure 3.24 KP Classification with 5% change training (Left), with 50% change training (Center), with 95% change traini Figure 3.25 ARTMAP Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) Figure 3.26 SOM Classification with 5% change training (Left), with 50% change training (Center), with 95% change training (Right) We have observed that, 78 training (Left), with 50% change ng (Right) when the class proportions in the training set vary , the 79 Maximum Likelihood Classifier is much more robust than the newer and more popular classifiers. SVM has shown desirable properties consistently in previous experiments, but this experiment identified that SVM shares the same problem with neural nets and decision tree in this aspect. The quality of classification results can be very bad if the proportions of training classes are left to be arbitary. This is a serious source of error. We also found that, when the producer accuracy curve meets the user accuracy, the percentage of the forest change pixels in the training data is somewhat but not strictly related to the percentage of the forest change pixels in the whole study area. The following table illustrates this vague relationship. Table 3.7 Percentage of Forest Change pixels in training data when optimal SVM performances are achieved Study Area Percentage of Forest Change pixels in study area Percentage of Forest Change pixels in training data with optimal performance Atlantic Forest 24.8% 25% Humid Chaco 8.4% 15% Dry Chaco 7.2% 10% These numbers give us some hopes. Maybe, to achieve the optimal accuracy, SVM has to have a carefully-selected training data set that has the same class proportions as the data population? However, the data population is not known before the change detection. How can we solve this ?chicken-and-egg? dilemma? Let us continue with the experiments, and return to this question in the summary section of this chapter. 80 3.8. Assessing the Algorithms with Scarce Training Data Traditional assessments of change detection algorithms are usually based on ample training data. But it is not practical to always have ample training data collected from field trips and high-resolution photo interpretation everywhere on Earth. A good algorithm needs to be able to achieve reasonably good accuracy when the available training data is scarce. Algorithms need to cope with scarce training data not just because the total training data might be scarce. If one class only has a small amount of training data while the other classes have disproportional ample training data, our experiment in section 4.7 have demonstrated the effect. Accuracy decreases sharply when the training data sampling does not comply with the Equal Sample Size (ESS) rule. Therefore, any class with scarce training data will lead to the reduction of total number of training pixels. Thus it is vital that algorithms for large-area forest change detection must perform well with less-than-perfect amount of training data. This experiment was also conceived in the very early stage of this dissertation, and only SVM and Decision Tree were tested. Our experiment in this section assesses the accuracies using different amount of training data. For the Atlantic Forest study area, which has roughly 10 million pixels, the result is listed in table 3.8. Our experiment shows that, SVM and DT do not need a lot of training pixels to achieve good accuracy. And when they use the same amount of training data, SVM 81 consistently out-performs DT. We can also interpret the finding in another direction: There is an intriguing limit of classification accuracy irrelevant to training size. Table 3.8 The effect of training data scarcity on accuracy Training pixel Count SVM Overall Accuracy DT Overall accuracy 12500 0.8823 0.8510 10000 0.8790 0.8433 7500 0.8774 0.8473 5000 0.8833 0.8465 2500 0.8785 0.8454 In the Ph.D Dissertation of Dr. Chengquan Huang (Huang 1999), he also looked at this aspect. His observation was that the SVM algorithm at that time needs a training set 6% of the total data volume. It now seems that his evaluation might be conservative. Apparently, the SVM algorithm does not lose much accuracy even when the training set is less than one thousandth of the data population. 3.9. The Algorithm of Best Overall Performance Our empirical cross-comparison of change detection algorithms aims at comparing the detection power of algorithms on a fair basis, and compare them as close to real-world situations as possible so as to challenge them with uncertainties. The influences of less-than-perfect training data are well considered, in order to find algorithms that are truly robust and accurate. Our assessment in section 3.3 show that geographical variations do have impact on the accuracy of all the algorithms, but SVM and Kernel Perceptron consistently excel. Our experiment in section 3.4 shows that SVM, Kernel Perceptron, and Decision Tree all have good capabilities in handling large-area variation. Our 82 experiment in section 3.5 shows that SVM and Kernel Perceptron have outstanding error tolerance. Our experiment in section 3.6 shows that SVM is not significantly impacted by training data located in the transitional land cover. Our experiment in section 3.7 shows that the modern algorithms are heavily affected by the sampling method of the training data while the old-school MLC is almost not affected. Our experiment in 3.8 shows that both SVM and Decision Tree can work with less-than-conventional amount of training data and still get good results. When these results are linked with the theoretical strengths and limitation outlined in chapter 2, we can see that some of the theoretical characteristics are verified, while some are rejected. SVM and KP do have the theoretical advantages of handling geographical variations and high error-tolerance. SVM does not have the theoretical disadvantage of the Gaussian assumption as MLC has, because the Gaussian kernel in SVM is the more versatile multi-modal Gaussian distribution. However, the generalization power of KP is not as good as that of SVM. We conclude that, the machine learning community has already built an excellent baseline classifier for us. The SVM family can tackle most types of known uncertainties and errors in remote sensing applications. It is much better than Decision Tree and Neural Nets. To be specific, when >85% of the training data is reliable, Kernel Perceptron is the best algorithm to perform forest change detection. When <85% of the training data is reliable, then the standard SVM with RBF kernel is the solution. However, in real-world applications, it is often difficult to know a 83 priori the percentage of errors in our observations. Therefore, it is safer to use the SVM with RBF kernel as a baseline algorithm. Table 3.9 The theoretical strengths and suspected weaknesses revisited Algorithm Family Algorithm Name Validated Strengths Validated Weaknesses Bayes Classifiers MLC N/A Lower accuracy in complicated, high-dimensional features No error tolerance Entropy-minimization DT Good accuracy in large-scale studies Salt-and-pepper errors Mediocre error tolerance Neural Networks ARTMAP Training pattern can be improved with incoming data for classification In developing and varies a lot among versions Margin-maximization SVM High accuracy at all scales High error tolerance sampling bias can hurt Kernel KP High accuracy at all scale Medium error tolerance Boosting without extra computational time sampling bias can hurt Meanwhile, we discovered an unreported source of error for most of the contemporary machine learning algorithms. The relative proportions of classes in the training set exert a powerful hidden influence on the classification results. It is unlikely that any remote sensing study can construct a perfect training set by chance. We must understand where this error source originates from, and how to bring it under control. This effort will be outline in the next two chapters. 84 4. Optimizing Class Proportions in the Training Set 4.1. Class Proportions in Training Data: an Overlooked Pitfall In chapter three we have discovered that, the performance of most supervised classifiers are significantly affected by the proportions of training data used to represent each class. Change detection studies are particularly heavily affected by this side effect, given that the change classes are quite unique. The change class is numerically a minority class in most studies. The number of change pixels is often highly variable from one satellite scene to another. The change classes are also often of highest importance. Therefore, the proportions of the change classes are small, variable, and important. This fact makes them the most susceptible classes under the newly discovered pitfall. How do we quantify the severity of this pitfall? In remote sensing studies, the producer accuracy is defined as the detection success against omission error, and the user accuracy is defined as the detection success against commission error (Congalton 1991). In section 3.7, we studied empirically the dynamic nature of producer accuracy and user accuracy as they are influenced by class proportions in training. They seem able to catch the problem. How serious is it? For the case of Decision Tree, when the proportion of change class in the training set is gradually adjusted from 5% to 95%, the producer accuracy increases from 64% 85 to 98% while the user accuracy drops from 95% to 45%. In addition to Decision Trees, other popular contemporary algorithms such as Support Vector Machine, ARTMAP neural nets, and Self-organizing Maps also fall prey to this pitfall. The only algorithm that is largely immune to this effect is the Maximum Likelihood Classifier. The user and producer accuracy produced by MLC are invariant, although not always unbiased, when the class proportions in the training data change. Therefore, we interpret our empirical findings as: most nonparametric classifiers increasingly overestimate any class when the training data proportion of that class increases in a training set of fixed size. Vice versa, they increasingly underestimate any class when the training data proportion of that class decreases in a training set of fixed size. In short, the outcome of classification is highly dependent on the class proportions in training. There seems to be a simple internal relationship between underestimation and overestimation in classifiers. This relationship can be easily pushed in any direction by increasing or decreasing training data in a class. Therefore, this issue likely does not just exist in change detection studies, but also is present everywhere in the broader field of classification of remotely sensed data. Through our empirical study in chapter three, we have found that, different geographical regions have different patterns of overestimation and underestimation. This implies that a significant challenge exists in continental-to-global classification study of remotely sensed data. If we have little or no control over the balance of underestimation and overestimation errors, then the same class might be 86 underestimated in one satellite scene yet overestimated in another. In addition, the smaller the satellite footprint is, the more likely it is affected. This effect was well hidden in a sense. In chapter three, we found that when the proportion of change class in the training set was adjusted from 5% to 80%, the overall accuracy always stays above 85%, which is a decent performance. In most real-world applications, the overall accuracies are often used as a benchmark for project success. The overall accuracy hides the variations in user and producer accuracies. In remote sensing studies, researchers are often interested in thematic information of one class, such as forest, water, and urban, instead of all the classes. Those studies will suffer the most from this pitfall. Change detection studies are also among the most-affected because a single change class such as deforestation is of highest importance, yet the problem has been hidden. The sufficiency of training is not a new topic of discussion. In the past, researchers have directed their attentions to the sufficient quantity of training. Several contemporary studies have looked at the effect of the total training set (Foody et al. 1995; Foody and Mathur 2004; Song et al. 2005; Foody et al. 2006; Rogan et al. 2008), and the effect of sufficient training data for each class (Pal and Mather 2003), but there has been no study of over- and under-estimation caused by class proportions in the training set. In this chapter, we will investigate the mathematical origin, magnitude of impact, and the solution to this newly-found pitfall that greatly challenges the reliability of data products from remote sensing. 87 4.2. Why are Modern Classifiers Heavily Influenced by Class Proportions in the Training Data? Modern supervised classification of remotely sensed data starts with a training dataset usually collected either through fieldwork, or visually-interpreted high-resolution images. The training process effectively tunes the classifier model towards the best overall accuracy for a given training set. The tuned classifier model is then applied to the whole image. The classification result is then compared to a set of and reference validation data for accuracy assessment. The accuracy assessment benchmarks the performance of the classification, and gives a confidence interval of accuracy on the whole image. Very often, the training data and the validation data come from the same fieldwork or image interpretation process. This has been a quite standard procedure for the past three decades. Past studies on the general methodology of training procedures have focused on two topics: 1. How to collect training data so that the training data covers all the features in the feature space while being minimal in numbers (Foody and Mathur 2004; Foody et al. 2006). 2. How to choose the sampling scheme for validation dataset in accuracy assessment so that we can estimate the confidence interval for accuracy on the whole classified image (Stehman et al. 2003; Stehman 2005; Stehman et al. 2009). An overlooked aspect is the arrangement of class balance inside the training set. Collecting training data in real-world applications is costly and often limited by 88 geographical accessibility. We propose that, the sampling design of the training set should not be based on data availability, or merely for the convenience in statistical accuracy estimation, but instead it should be directly targeted for the optimization of a given classifier algorithm. In chapter three we demonstrated that the supervised classification process is more complicated than simply building classification models based on an arbitrary training dataset available. In this section, we will examine, one by one, how modern supervised classifiers were designed to use the class information of the training data. 4.2.1. Maximum Likelihood Classification The training process of MLC is solely dependent on two basic statistical measurements: the mean of each class, and the covariance matrix among all the classes (Equation 2.6). These two form an ellipsoid for each class in the feature space. If we introduce more training data points only for one class, the mean and covariance matrix are not easily changed. MLC uses the covariance matrix in the determination of class boundaries. Thus the class boundaries are not easily changeable and the classification result is also not easy to be changed. However, when a class is described by only a very small amount of training data, and that small training set contains some errors due for example to misregistration or misinterpretation of ground features, then the mean center of the class might be substantially changed. This could explain the sudden drop of accuracy at extreme 89 ends in graphs in section 3.7 Another known problem regarding classes happens when the ellipsoids characterizing different classes are not separable. They can simply overlap with each other, or go through one another. In that case, MLC might fail completely. This is caused by the definition of classes, not caused by class proportions in training. 4.2.2. Decision Trees The training process of DT starts with calculating Entropy of the training dataset. ? = ?= n i ii ppSEntropy 1 )ln()( (Equation 4.1) In this equation, pi is the percentage of data points in class i out of the whole training set. It is very obvious that if we introduce more training points into one of the classes, the calculation of Entropy is now significantly affected. Thus the building of the decision tree will be altered. Therefore, Decision tree might be the classifier most sensitive to class proportion variations in the training set. 4.2.3. ARTMAP Neural net The training process of ARTMAP is the matching process of clusters identified by two ART modules. One ART module performs clustering using the training label, and the other ART module performs clustering using the spectral data. Increasing the amount of training data for an arbitrary class would increase the ?coverage? of clusters of that class in the feature space, and leads to overestimation. However, judging from this mechanism, the center of the clusters should not be changed a lot. 90 ARTMAP is expected to be less sensitive to the variation of training data proportions than Decision Trees. 4.2.4. Support Vector Machine and Kernel Perceptron The contemporary SVM and KP algorithms are based on the soft-margin SVM design. The internal optimization function is ))(21(min 1 ? = +>?< p k kw CFww ? , in which C is the penalty coefficient and k? varies between 0-1, allowing some data points to exist between the hyperplanes (class boundaries) in Hilbert Space. This design was first introduced to effectively deal with inseparable classes. In chapter three, we found that it also had an unplanned but useful side-effect of error tolerance. However, this design also leads to another unplanned and unwanted side-effect: the hyperplanes could be pushed to move substantially. When a class is given more training data, the hyperplanes around this class will be pushed outwards, eroding other classes. This might be one origin of the problem. Another hidden mechanism is the cross-validation (CV) stage (Stone 1974) in the tuning of classifiers. SVM with a specific kernel needs to tune the parameters of the kernel for the maximum possible accuracy. This CV stage can achieve best accuracy for a given training set. However, there has been no documented rule on how to construct the training set for CV. Researchers usually just take a random sub-sample of the available training data. This also might be another cause of the problem. Worth noticing is that, this dubious CV process is also present in most neural nets. 91 4.2.5. Self-Organizing Maps coupled with Learning Vector Quantization (SOM-LVQ) Kohonen?s Self-Organizing Map (SOM) neural network is a special kind of neural network. It is not a typical feed-forward network, and not a typical recurrent network. It does not have the popular design of hidden layers either. It consists of two layers: the input layer which contains neurons of the amount of input data dimension, and the output layer which contains a two-dimensional neuron array. Figure 4.1 The workflow of Self-Organizing Maps (Cited from the help file of the Idrisi software) In the first step of training stage, known as the ?coarse tuning?, the neurons in the output layer are derived in such a way that the neurons corresponds to clusters in the spectral data, and each neuron is kept at a distance from other neurons. Neurons are then labeled into each class. In the second step of training stage, known as the ?fine tuning?, Learning Vector Quantization (SOM-LVQ) creates a topology of neurons in the output layer. Neurons that are similar in the feature space will make the class boundary expanding outward. 92 The design of SOM-LVQ is somewhat controversial for the issue of class proportions in training data. The ?course tuning? part will not provide a very high accuracy in the training area, but might be effective against the pitfall overestimation-underestimation. The ?fine tuning? part will provide a high accuracy in the training area, but is susceptible to the pitfall of overestimation-underestimation. In summary, SOM might be of some value without the ?fine tuning? phase, but it is to be examined in real-world cases. SOM-LVQ was discussed only briefly in chapter two. It was not used in chapter three. It is introduced here simply because of its potential to help overcome the pitfall of over and underestimation. 4.3. Prioritized Training Proportions (PTP): Reducing the uncertainties in classification and change detection of satellite data In the previous discussion, we have outlined the uncertainties and the possible causes of a previously hidden issue for all classification-based change detections. Empirical studies in chapter three showed that all modern supervised classifiers but MLC are strongly affected by variations in training set proportions, and that past studies in the methodology of machine learning have not identified this issue yet. Remote sensing studies, especially those aiming at continental-to-global scales, need a way to minimize this uncertainty. In this section, two candidate solutions are proposed by going to the source mechanisms of supervised classifiers, and by combining the strengths of different hard classifiers to make a joint classifier. This is expected to reduce the uncertainties in the overestimation-underestimation dilemma. 93 This joint classifier will be based on a new optimization goal, and make use of SVM and MLC together. 4.3.1. A Tale of Two Optimization Rules With the exception of MLC, all modern supervised classifiers described previously have the same optimization rule: maximization of overall accuracy in the dataset used for cross validation. The dataset used for cross validation, however, is usually only a random subset of the training set. Thus Bayes Optimal was aimed for the data population but actually achieved for the sample. Therefore, the first optimization rule we propose, is to indeed achieve Bayes Optimal for the data population. Another optimization rule we propose here is the minimization of the absolute difference between the estimated omission data points and commission data points for a Key class. Let us call this the Bayes Optimal for a Key Class. Assume there are M classes in the dataset, and the proportion of each class in the training set is written as KP , i=1 ?, K,? M. The Kth class is chosen as the most important class. 1 1 =? = M i iP , 1