Gina Grünhage GRK 1589 / TU Berlin
Low dimensional visualization and modelling of data using distance-based models
This thesis consists of two parts, which seek low-dimensional representations for visualization and analysis of data. Both parts use rather different types of models and inference methods. In both cases, however, the models show inherent invariances, which need to be coped with during the optimization procedures.
The first part addresses a fundamental problem in machine learning, namely, the choice of a distance function to describe the similarity of two data points. Often, implicit choices are made, e.g., a spatial scale is chosen to compare images, or features are weighted in a certain way. Unfortunately, different distance functions can yield fundamentally different results since they influence the data structure. Here, a method called continuous multi-dimensional scaling (cMDS) is proposed, which allows to visualize the changes in the data structure in a simple and efficient way as the distance function varies. The algorithm is a variant of dynamical MDS and is based on the Concave-Convex Procedure (CCCP). To address the challenges of MDS which arise due to its invariance to translation and rotation, cMDS uses a smoothing penalty. The algorithm is published as an \textbf{R} package. This work also presents interactive web applications which demonstrate the key features of cMDS. The method is applied to several data sets to show that cMDS can deal with numerous types of hyperparameters and readily reveals important insights and dynamics.
The second part of the thesis focusses on modelling and visualizing network data. Network data consists of nodes (e.g., people) and egdes which indicate a relationship (e.g., friendship) between nodes. This work is based on latent space models (LSM), which provide a generative model of network data: Nodes are placed in an auxiliary, low dimensional space, i.e., the latent space. Then, the distance between two nodes is used as a predictor for an edge, such that nodes far apart are unlikely to be linked. Latent space models are usually estimated with MCMC, which is limited with regards to the size of the network, and Variational Bayes, which yields faster but poorer approximations. This work establishes that LSMs can instead be estimated with expectation propagation (EP). Necessary changes to the algorithm and the optimization procedures are shown and the implications of pairwise likelihood sites are thorougly discussed. Comparing EP to other methods of estimation for LSMs (MCMC and Variational Bayes), using a real data set, shows that EP and MCMC yield equivalent results. Thus, it is suggested that EP is the preferable method to estimate latent space models, due to the quality of results, the competitive run time and the potential to use prior knowledge. The results achieved with EP in this work are particularly remarkable in light of the invariances that are inherent to these distance-based models.
Organized by
Manfred Opper / Robert Martin