#### A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, which relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
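The density-based notion of a cluster can be illustrated with a minimal sketch. This is not the authors' implementation: the names `dbscan`, `region_query`, `eps`, and `min_pts` are assumptions chosen for illustration, both the neighborhood radius and the minimum neighbor count are exposed as arguments here, and a linear scan stands in for the spatial index a large database would need.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
NOISE = -1  # label for points that belong to no cluster


def region_query(points: Sequence[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE marks outliers."""
    labels: List[Optional[int]] = [None] * len(points)  # None = unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region_query(points, j, eps)
            if len(reachable) >= min_pts:  # j is also a core point
                queue.extend(reachable)
    return [NOISE if label is None else label for label in labels]


sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(sample, eps=1.5, min_pts=3)
```

In this toy data set the two dense squares become separate clusters and the isolated point is labelled as noise, without the number of clusters being specified in advance.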

#### Communicating sequential processes

This paper suggests that input and output are basic primitives of programming and that parallel composition of communicating sequential processes is a fundamental program structuring method. When combined with a development of Dijkstra's guarded command, these concepts are surprisingly versatile. Their use is illustrated by sample solutions of a variety of familiar programming exercises.

#### Applied Multivariate Statistical Analysis

(NOTE: Each chapter begins with an Introduction, and concludes with Exercises and References.)

I. GETTING STARTED. 1. Aspects of Multivariate Analysis. Applications of Multivariate Techniques. The Organization of Data. Data Displays and Pictorial Representations. Distance. Final Comments. 2. Matrix Algebra and Random Vectors. Some Basics of Matrix and Vector Algebra. Positive Definite Matrices. A Square-Root Matrix. Random Vectors and Matrices. Mean Vectors and Covariance Matrices. Matrix Inequalities and Maximization. Supplement 2A Vectors and Matrices: Basic Concepts. 3. Sample Geometry and Random Sampling. The Geometry of the Sample. Random Samples and the Expected Values of the Sample Mean and Covariance Matrix. Generalized Variance. Sample Mean, Covariance, and Correlation as Matrix Operations. Sample Values of Linear Combinations of Variables. 4. The Multivariate Normal Distribution. The Multivariate Normal Density and Its Properties. Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation. The Sampling Distribution of X̄ and S. Large-Sample Behavior of X̄ and S. Assessing the Assumption of Normality. Detecting Outliers and Data Cleaning. Transformations to Near Normality.

II. INFERENCES ABOUT MULTIVARIATE MEANS AND LINEAR MODELS. 5. Inferences About a Mean Vector. The Plausibility of μ₀ as a Value for a Normal Population Mean. Hotelling's T² and Likelihood Ratio Tests. Confidence Regions and Simultaneous Comparisons of Component Means. Large Sample Inferences about a Population Mean Vector. Multivariate Quality Control Charts. Inferences about Mean Vectors When Some Observations Are Missing. Difficulties Due To Time Dependence in Multivariate Observations. Supplement 5A Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids. 6. Comparisons of Several Multivariate Means. Paired Comparisons and a Repeated Measures Design. Comparing Mean Vectors from Two Populations. Comparison of Several Multivariate Population Means (One-Way MANOVA). Simultaneous Confidence Intervals for Treatment Effects. Two-Way Multivariate Analysis of Variance. Profile Analysis. Repeated Measures, Designs, and Growth Curves. Perspectives and a Strategy for Analyzing Multivariate Models. 7. Multivariate Linear Regression Models. The Classical Linear Regression Model. Least Squares Estimation. Inferences About the Regression Model. Inferences from the Estimated Regression Function. Model Checking and Other Aspects of Regression. Multivariate Multiple Regression. The Concept of Linear Regression. Comparing the Two Formulations of the Regression Model. Multiple Regression Models with Time Dependent Errors. Supplement 7A The Distribution of the Likelihood Ratio for the Multivariate Regression Model.

III. ANALYSIS OF A COVARIANCE STRUCTURE. 8. Principal Components. Population Principal Components. Summarizing Sample Variation by Principal Components. Graphing the Principal Components. Large-Sample Inferences. Monitoring Quality with Principal Components. Supplement 8A The Geometry of the Sample Principal Component Approximation. 9. Factor Analysis and Inference for Structured Covariance Matrices. The Orthogonal Factor Model. Methods of Estimation. Factor Rotation. Factor Scores. Perspectives and a Strategy for Factor Analysis. Structural Equation Models. Supplement 9A Some Computational Details for Maximum Likelihood Estimation. 10. Canonical Correlation Analysis. Canonical Variates and Canonical Correlations. Interpreting the Population Canonical Variables. The Sample Canonical Variates and Sample Canonical Correlations. Additional Sample Descriptive Measures. Large Sample Inferences.

IV. CLASSIFICATION AND GROUPING TECHNIQUES. 11. Discrimination and Classification. Separation and Classification for Two Populations. Classifications with Two Multivariate Normal Populations. Evaluating Classification Functions. Fisher's Discriminant Function: Separation of Populations. Classification with Several Populations. Fisher's Method for Discriminating among Several Populations. Final Comments. 12. Clustering, Distance Methods and Ordination. Similarity Measures. Hierarchical Clustering Methods. Nonhierarchical Clustering Methods. Multidimensional Scaling. Correspondence Analysis. Biplots for Viewing Sample Units and Variables. Procrustes Analysis: A Method for Comparing Configurations.

Appendix. Standard Normal Probabilities. Student's t-Distribution Percentage Points. χ² Distribution Percentage Points. F-Distribution Percentage Points. F-Distribution Percentage Points (α = .10). F-Distribution Percentage Points (α = .05). F-Distribution Percentage Points (α = .01). Data Index. Subject Index.

#### A system of shuttle vectors and yeast host strains designed for efficient manipulation of DNA in Saccharomyces cerevisiae.

A series of yeast shuttle vectors and host strains has been created to allow more efficient manipulation of DNA in Saccharomyces cerevisiae. Transplacement vectors were constructed and used to derive yeast strains containing nonreverting his3, trp1, leu2 and ura3 mutations. A set of YCp and YIp vectors (pRS series) was then made based on the backbone of the multipurpose plasmid pBLUESCRIPT. These pRS vectors are all uniform in structure and differ only in the yeast selectable marker gene used (HIS3, TRP1, LEU2 and URA3). They possess all of the attributes of pBLUESCRIPT and several yeast-specific features as well. Using a pRS vector, one can perform most standard DNA manipulations in the same plasmid that is introduced into yeast.

#### INFERENCE AND MISSING DATA

Two results are presented concerning inference when data may be missing. First, ignoring the process that causes missing data when making sampling distribution inferences about the parameter of the data, θ, is generally appropriate if and only if the missing data are “missing at random” and the observed data are “observed at random,” and then such inferences are generally conditional on the observed pattern of missing data. Second, ignoring the process that causes missing data when making Bayesian inferences about θ is generally appropriate if and only if the missing data are missing at random and the parameter of the missing data is “independent” of θ. Examples and discussion indicating the implications of these results are included.

#### Statistical Analysis with Missing Data

Preface. PART I: OVERVIEW AND BASIC APPROACHES. Introduction. Missing Data in Experiments. Complete-Case and Available-Case Analysis, Including Weighting Methods. Single Imputation Methods. Estimation of Imputation Uncertainty. PART II: LIKELIHOOD-BASED APPROACHES TO THE ANALYSIS OF MISSING DATA. Theory of Inference Based on the Likelihood Function. Methods Based on Factoring the Likelihood, Ignoring the Missing-Data Mechanism. Maximum Likelihood for General Patterns of Missing Data: Introduction and Theory with Ignorable Nonresponse. Large-Sample Inference Based on Maximum Likelihood Estimates. Bayes and Multiple Imputation. PART III: LIKELIHOOD-BASED APPROACHES TO THE ANALYSIS OF MISSING DATA: APPLICATIONS TO SOME COMMON MODELS. Multivariate Normal Examples, Ignoring the Missing-Data Mechanism. Models for Robust Estimation. Models for Partially Classified Contingency Tables, Ignoring the Missing-Data Mechanism. Mixed Normal and Nonnormal Data with Missing Values, Ignoring the Missing-Data Mechanism. Nonignorable Missing-Data Models. References. Author Index. Subject Index.

#### Statistical Analysis With Missing Data

authors briefly review various methods and refer readers to works such as Little (1995) for details. The analyses presented are based on certain assumptions, such that the available GEE software can be applied. Chapter 4 gives a thorough discussion on model selection and testing and graphical methods for residual diagnostics. Overall, Generalized Estimating Equations is a good introductory book for analyzing continuous and discrete correlated data using GEE methods. The authors discuss the differences among the four commercial software programs and provide suggestions and cautions for users. This book is easy to read, and it assumes that the reader has some background in GLM. Many examples are drawn from biomedical studies and survey studies, and so it provides good guidance for analyzing correlated data in these and other areas.

#### An Introduction to Multivariate Statistical Analysis

Preface to the Third Edition. Preface to the Second Edition. Preface to the First Edition. 1. Introduction. 2. The Multivariate Normal Distribution. 3. Estimation of the Mean Vector and the Covariance Matrix. 4. The Distributions and Uses of Sample Correlation Coefficients. 5. The Generalized T²-Statistic. 6. Classification of Observations. 7. The Distribution of the Sample Covariance Matrix and the Sample Generalized Variance. 8. Testing the General Linear Hypothesis: Multivariate Analysis of Variance. 9. Testing Independence of Sets of Variates. 10. Testing Hypotheses of Equality of Covariance Matrices and Equality of Mean Vectors and Covariance Matrices. 11. Principal Components. 12. Canonical Correlations and Canonical Variables. 13. The Distributions of Characteristic Roots and Vectors. 14. Factor Analysis. 15. Patterns of Dependence; Graphical Models. Appendix A: Matrix Theory. Appendix B: Tables. References. Index.

#### Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints

A program denotes computations in some universe of objects. Abstract interpretation of programs consists in using that denotation to describe computations in another universe of abstract objects, so that the results of abstract execution give some information on the actual computations. An intuitive example (which we borrow from Sintzoff [72]) is the rule of signs. The text -1515 * 17 may be understood to denote computations on the abstract universe {(+), (-), (±)} where the semantics of arithmetic operators is defined by the rule of signs. The abstract execution -1515 * 17 → -(+) * (+) → (-) * (+) → (-) proves that -1515 * 17 is a negative number. Abstract interpretation is concerned with a particular underlying structure of the usual universe of computations (the sign, in our example). It gives a summary of some facets of the actual executions of a program. In general this summary is simple to obtain but inaccurate (e.g., -1515 + 17 → -(+) + (+) → (-) + (+) → (±)). Despite its fundamentally incomplete results, abstract interpretation allows the programmer or the compiler to answer questions which do not need full knowledge of program executions or which tolerate an imprecise answer (e.g., partial correctness proofs of programs ignoring the termination problems, type checking, program optimizations which are not carried out in the absence of certainty about their feasibility, …).
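The rule-of-signs example can be made concrete with a small sketch. The names `Sign`, `abstract`, `neg`, `mul`, and `add` are illustrative assumptions, not notation from the paper, and zero is deliberately excluded because the abstract universe above has only three elements.

```python
from enum import Enum


class Sign(Enum):
    POS = "(+)"
    NEG = "(-)"
    TOP = "(±)"  # sign unknown


def abstract(n: int) -> Sign:
    """Map a nonzero integer literal to its abstract sign."""
    if n == 0:
        raise ValueError("zero is outside the three-element sign universe")
    return Sign.POS if n > 0 else Sign.NEG


def neg(a: Sign) -> Sign:
    """Abstract unary minus: flips a known sign, keeps an unknown one."""
    if a is Sign.TOP:
        return Sign.TOP
    return Sign.NEG if a is Sign.POS else Sign.POS


def mul(a: Sign, b: Sign) -> Sign:
    """Abstract multiplication by the rule of signs."""
    if Sign.TOP in (a, b):
        return Sign.TOP
    return Sign.POS if a is b else Sign.NEG


def add(a: Sign, b: Sign) -> Sign:
    """Abstract addition: the sign is known only when both operands agree."""
    return a if a is b else Sign.TOP


# -1515 * 17  ->  -(+) * (+)  ->  (-) * (+)  ->  (-)
product = mul(neg(abstract(1515)), abstract(17))
# -1515 + 17  ->  -(+) + (+)  ->  (-) + (+)  ->  (±)
total = add(neg(abstract(1515)), abstract(17))
print(product.value, total.value)
```

The multiplication is decided exactly, while the addition loses precision: the abstract result (±) is sound but says nothing about the actual sign of -1498.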

#### Recommendations for cardiac chamber quantification by echocardiography in adults: an update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging.

The rapid technological developments of the past decade and the changes in echocardiographic practice brought about by these developments have resulted in the need for updated recommendations to the previously published guidelines for cardiac chamber quantification, which was the goal of the joint writing group assembled by the American Society of Echocardiography and the European Association of Cardiovascular Imaging. This document provides updated normal values for all four cardiac chambers, including three-dimensional echocardiography and myocardial deformation, when possible, on the basis of considerably larger numbers of normal subjects, compiled from multiple databases. In addition, this document attempts to eliminate several minor discrepancies that existed between previously published guidelines.

#### The effect of cardiac resynchronization on morbidity and mortality in heart failure.

BACKGROUND Cardiac resynchronization reduces symptoms and improves left ventricular function in many patients with heart failure due to left ventricular systolic dysfunction and cardiac dyssynchrony. We evaluated its effects on morbidity and mortality. METHODS Patients with New York Heart Association class III or IV heart failure due to left ventricular systolic dysfunction and cardiac dyssynchrony who were receiving standard pharmacologic therapy were randomly assigned to receive medical therapy alone or with cardiac resynchronization. The primary end point was the time to death from any cause or an unplanned hospitalization for a major cardiovascular event. The principal secondary end point was death from any cause. RESULTS A total of 813 patients were enrolled and followed for a mean of 29.4 months. The primary end point was reached by 159 patients in the cardiac-resynchronization group, as compared with 224 patients in the medical-therapy group (39 percent vs. 55 percent; hazard ratio, 0.63; 95 percent confidence interval, 0.51 to 0.77; P<0.001). There were 82 deaths in the cardiac-resynchronization group, as compared with 120 in the medical-therapy group (20 percent vs. 30 percent; hazard ratio 0.64; 95 percent confidence interval, 0.48 to 0.85; P<0.002). As compared with medical therapy, cardiac resynchronization reduced the interventricular mechanical delay, the end-systolic volume index, and the area of the mitral regurgitant jet; increased the left ventricular ejection fraction; and improved symptoms and the quality of life (P<0.01 for all comparisons). CONCLUSIONS In patients with heart failure and cardiac dyssynchrony, cardiac resynchronization improves symptoms and the quality of life and reduces complications and the risk of death. These benefits are in addition to those afforded by standard pharmacologic therapy. The implantation of a cardiac-resynchronization device should routinely be considered in such patients.

#### A Test of Missing Completely at Random for Multivariate Data with Missing Values

A common concern when faced with multivariate data with missing values is whether the missing data are missing completely at random (MCAR); that is, whether missingness depends on the variables in the data set. One way of assessing this is to compare the means of recorded values of each variable between groups defined by whether other variables in the data set are missing or not. Although informative, this procedure yields potentially many correlated statistics for testing MCAR, resulting in multiple-comparison problems. This article proposes a single global test statistic for MCAR that uses all of the available data. The asymptotic null distribution is given, and the small-sample null distribution is derived for multivariate normal data with a monotone pattern of missing data. The test reduces to a standard t test when the data are bivariate with missing data confined to a single variable. A limited simulation study of empirical sizes for the test applied to normal and nonnormal data suggests th...

#### An Efficient k-Means Clustering Algorithm: Analysis and Implementation

In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
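For orientation, here is a minimal sketch of plain Lloyd's iteration, the heuristic that the filtering algorithm accelerates. It is not the filtering algorithm itself: there is no kd-tree, and every point is compared against every center. The function names and the optional `init` argument are assumptions made for this illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points: Sequence[Point], k: int,
          init: Optional[List[Point]] = None,
          max_iters: int = 100, seed: int = 0) -> List[Point]:
    """Alternate assignment and centroid updates until the centers stop moving."""
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(list(points), k)
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the centroid of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = lloyd(data, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
```

Each iteration of this naive version costs on the order of n·k distance computations, which is the cost a kd-tree-based approach aims to reduce.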

#### Maximum Likelihood Estimation of Misspecified Models

This paper examines the consequences and detection of model misspecification when using maximum likelihood techniques for estimation and inference. The quasi-maximum likelihood estimator (QMLE) converges to a well defined limit, and may or may not be consistent for particular parameters of interest. Standard tests (Wald, Lagrange Multiplier, or Likelihood Ratio) are invalid in the presence of misspecification, but more general statistics are given which allow inferences to be drawn robustly. The properties of the QMLE and the information matrix are exploited to yield several useful tests for model misspecification.

#### Aspects Of Multivariate Statistical Theory

Tables. Commonly Used Notation. 1. The Multivariate Normal and Related Distributions. 2. Jacobians, Exterior Products, Kronecker Products, and Related Topics. 3. Samples from a Multivariate Normal Distribution, and the Wishart and Multivariate Beta Distributions. 4. Some Results Concerning Decision-Theoretic Estimation of the Parameters of a Multivariate Normal Distribution. 5. Correlation Coefficients. 6. Invariant Tests and Some Applications. 7. Zonal Polynomials and Some Functions of Matrix Argument. 8. Some Standard Tests on Covariance Matrices and Mean Vectors. 9. Principal Components and Related Topics. 10. The Multivariate Linear Model. 11. Testing Independence Between k Sets of Variables and Canonical Correlation Analysis. Appendix: Some Matrix Theory. Bibliography. Index.

#### Data clustering: 50 years beyond K-means

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

#### Missing value estimation methods for DNA microarrays

MOTIVATION Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. RESULTS We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
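The idea behind weighted K-nearest-neighbor imputation can be sketched as follows. This is an illustration under stated assumptions rather than the published KNNimpute tool: rows stand for genes, `None` marks a missing value, similarity is a root-mean-square difference over the columns two rows share, and neighbors are weighted by inverse distance. The names `knn_impute`, `distance`, `k`, and `eps` are hypothetical.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]


def distance(a: Row, b: Row, skip: int) -> float:
    """RMS difference over columns observed in both rows, ignoring column `skip`."""
    diffs = [(x - y) ** 2
             for i, (x, y) in enumerate(zip(a, b))
             if i != skip and x is not None and y is not None]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else math.inf


def knn_impute(data: List[Row], k: int = 2, eps: float = 1e-9) -> List[Row]:
    """Fill each missing entry with a distance-weighted mean of its k nearest rows."""
    filled = [row[:] for row in data]
    for r, row in enumerate(data):
        for c, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows that have column c observed.
            candidates = sorted(
                (distance(row, other, c), other[c])
                for s, other in enumerate(data)
                if s != r and other[c] is not None
            )[:k]
            candidates = [(d, x) for d, x in candidates if math.isfinite(d)]
            if not candidates:
                continue  # nothing comparable: leave the gap unfilled
            weights = [1.0 / (d + eps) for d, _ in candidates]
            filled[r][c] = (sum(w * x for w, (_, x) in zip(weights, candidates))
                            / sum(weights))
    return filled


expression = [
    [1.0, 2.0, None],  # value to estimate
    [1.0, 2.0, 3.0],   # identical on the shared columns
    [5.0, 6.0, 7.0],   # far away on the shared columns
]
completed = knn_impute(expression, k=1)
```

With `k=1` the missing entry takes the value of the single most similar row; larger `k` blends several rows, with closer rows contributing more.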
