PRINCIPAL CURVES AND SURFACES*

Principal Curves and Surfaces

Trevor Hastie

Ph.D Dissertation

Stanford Linear Accelerator Center

Stanford University

Stanford, California 94305

November 1984

Abstract

Principal curves are smooth one dimensional curves that pass through the middle of a p dimensional data set. They minimise the distance from the points, and provide a non-linear summary of the Data. The curves are non-parametric and their shape is suggested by the data. Similarly, principal surfaces are two dimensional surfaces that pass through the middle of the data. The curves and surfaces are found using an iterative procedure which starts with a linear summary such as the usual principal component line or plane. Each successive iteration is a smooth or local average of the p-dimensional points, where local is based on the projections of the points onto the curve or surface of the previous iteration.

A number of linear techniques, such as factor analysis and errors in variables regression, end up using the principal components as their estimates (after a suitable scaling of the co-ordinates). Principal curves and surfaces can be viewed as the estimates of non-linear generalisations of these procedures. We present some real data examples that illustrate these applications.

Principal Curves (or surfaces) have a theoretical definition for distributions: they are the Self Consistent curves. A curve is self consistent if each point on the curve is the conditional mean of the points that project there. The main theorem proves that principal curves are critical values of the expected squared distance between the points and the curve. Linear principal components have this property as well; in fact, we prove that if a principal curve is straight, then it is a principal component. These results generalize the usual duality between conditional expectation and distance minimization. We also examine two sources of bias in the procedures, which have the satisfactory property of partially cancelling each other.

We compare the principal curve and surface procedures to other generalisations of principal components in the literature; the usual generalisations transform the space, whereas we transform the model. There are also strong ties with multidimensional scaling.

Contents

1 Introduction

2 Background and Motivation

2.1 Linear Principal Components

2.2 A linear model formulation

2.2.1 Outline of the linear model

2.2.2 Estimation

2.2.3 Units of measurement

2.3 A non-linear generalization of the linear model

2.4 Other generalizations ,

3 The Principal Curve and Surface models

3.1 The principal curves of a probability distribution

3.1.1 One dimensional curves

3.1.2 Definition of principal curves

3.1.3 Existence of principal curves

3.1.4 The distance property of principal curves

3.2 The principal surfaces of a probability distribution

3.2.1 Two dimensional surfaces

3.2.2 Definition of principal surfaces

3.3 An algorithm for finding principal curves and surfaces

3.4 Principal curves and surfaces for data sets

3.5 Demonstrations of the procedures

3.5.1 The circle in two-space

3.5.2 The half-sphere in three-space

3.6 Principal surfaces and principal components

3.6.1 A Variance decomposition

3.6.2 The power method

4Theory for principal curves and surfaces

4.1 The projection index is measureable

4.2 The stationarity property of principal curves

4.3 Some results on the subclass of smooth principal curves

4.4 Some results on bias

4.4.1 A simple model for investigating bias

4.4.2 Prom the circle to the helix

4.4.3 One more bias demonstration

4.5 Principal curves of elliptical distributions

5 Algorithmic details

5.1 Estimation of curves and surfaces

5.1.1 One dimensional smoothers

5.1.2 Two dimensional smoothers

5.1.3 The local planar surface smoother

5.2 The projection step

5.2.1 Projecting by exact enumeration

5.2.2 Projections using the k-d tree

5.2.3 Resealing the X’s to arc-length

5.3 Span selection

5.3.1 Global procedural spans

5.3.2 Mean squared error spans

6 Examples

6.1 Gold assay pairs

6.2 The helix in three-space

6.3 Geological data

6.4 The uniform ball

6.5 One dimensional color data

6.6 Lipoprotein data

7 Discussion and conclusions

7.1 Alternative techniques

7.1.1 Generalized linear principal components

7.1.2 Multi-dimensional scaling

7.1.3 Proximity models

7.1.4 Non-linear factor analysis

7.1.5 Axis interchangeable smoothing

7.2 Conclusions

Bibliography

Вся диссертация ЗДЕСЬ (4МВ)