Description
|
|
---|---|
Dimensionality reduction is often crucial for the application of machine learning and data mining. Feature selection methods can be employed for this purpose, with the advantage of preserving interpretability. There exist unsupervised feature selection methods based on matrix factorization algorithms, which can help choose the most informative features in terms of approximation error. Randomized methods have been proposed recently to provide better theoretical guarantees and better approximation errors than their deterministic counterparts, but their computational costs can be significant when dealing with big, high-dimensional data sets. Some existing randomized and deterministic approaches require the computation of the singular value decomposition in O(mn min(m, n)) time (for m samples and n features) to provide leverage scores. This compromises their applicability to domains of even moderately high dimensionality. In this paper we propose the use of Probabilistic PCA to compute the leverage scores in O(mnk) time, enabling the application of some of these randomized methods to large, high-dimensional data sets. We show that, using this approach, we can rapidly provide an approximation of the leverage scores that works well in this context. In addition, we offer a parallelized version over the emerging Resilient Distributed Datasets (RDD) paradigm on Apache Spark, making it horizontally scalable for enormous numbers of data instances. We validate the performance of our approach on different data sets composed of real-world and synthetic data. | |
International
|
Yes |
Conference name
|
International Work-Conference on Artificial Neural Networks |
Type of participation
|
960 |
Conference location
|
Cadiz, Spain |
Reviewers
|
Yes |
ISBN or ISSN
|
978-3-319-59146-9 |
DOI
|
DOI: 10.1007/978-3-319-59147-6_61 |
Conference start date
|
14/06/2017 |
Conference end date
|
16/06/2017 |
From page
|
722 |
To page
|
733 |
Proceedings title
|
Advances in Computational Intelligence. IWANN 2017. Lecture Notes in Computer Science, vol 10306. Springer, Cham |
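
The description contrasts leverage scores obtained from a full singular value decomposition with an O(mnk) approximation based on Probabilistic PCA. The record does not give the authors' algorithm, so the following is only a minimal single-machine sketch under stated assumptions: it uses the zero-noise limit of the PPCA EM iteration as a stand-in, it takes the matrix as given (center the columns beforehand if PCA semantics are wanted), and the function names `exact_leverage_scores`, `em_leverage_scores` and `sample_features` are hypothetical. The rank-k leverage score of feature j is taken here as the squared norm of row j of an orthonormal basis of the top-k right singular subspace.

```python
import numpy as np


def exact_leverage_scores(X, k):
    """Rank-k leverage scores of the n features (columns) of X via a full SVD."""
    # Rows of Vt are the right singular vectors, ordered by singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return np.sum(Vt[:k] ** 2, axis=0)


def em_leverage_scores(X, k, n_iter=50, seed=0):
    """Approximate rank-k leverage scores with an EM iteration for PCA.

    Assumption: zero-noise limit of PPCA, used as an illustration only.
    Each iteration is dominated by two products costing O(mnk).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    # Random orthonormal starting basis for the k-dimensional subspace.
    W, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(n_iter):
        # E-step: latent coordinates Z = (W^T W)^{-1} W^T X^T, shape (k, m).
        Z = np.linalg.solve(W.T @ W, (X @ W).T)
        # M-step: W = X^T Z^T (Z Z^T)^{-1}, shape (n, k).
        W = np.linalg.solve(Z @ Z.T, Z @ X).T
        # Re-orthonormalize for numerical stability; the span is unchanged.
        W, _ = np.linalg.qr(W)
    # Squared row norms of an orthonormal basis depend only on its span.
    return np.sum(W ** 2, axis=1)


def sample_features(scores, n_select, seed=0):
    """Draw distinct feature indices with probability proportional to the scores."""
    rng = np.random.default_rng(seed)
    p = scores / scores.sum()
    return rng.choice(scores.size, size=n_select, replace=False, p=p)
```

The scores always sum to k, and on data with a clear gap after the k-th singular value the EM estimate should agree closely with the SVD-based one. Sampling without replacement in `sample_features` is a simplification chosen for the example, not a claim about the sampling scheme used in the paper.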