Research Reports
Courses, Seminars and Tutorials:
Build a large-scale cross-lingual text search engine from scratch

Research Areas
  • Information technology and data processing

Searching for similar documents and exploring the major themes covered across groups of documents are common actions when browsing collections of scientific papers. This manual, knowledge-intensive task may become less tedious, and may even lead to unforeseen relevant findings, if unsupervised algorithms are applied to help researchers. Most text mining algorithms represent documents in a common feature space that abstracts away from the specific sequence of words used in them. Probabilistic topic models reduce that feature space by annotating documents with thematic information, and several algorithms have been proposed to perform document similarity search over this low-dimensional latent space. However, text search engines are based on algorithms that use term matching to measure similarity among texts (e.g., TF-IDF, BM25), so multilingual texts must first be translated before they can be related. In large-scale scenarios, this requirement is hard to meet because of its high computational and storage cost.

The aim of this tutorial is to show the foundations and modern practical applications of knowledge-based and statistical methods for exploring large document corpora. It will first focus on many of the techniques required for this purpose, including natural language processing tasks, approximate nearest neighbours methods, clustering algorithms and probabilistic topic models, and will then describe how a combination of these techniques is being used in practical applications for browsing large multilingual document corpora without the need to translate texts. Participants will be involved in the entire process of creating the necessary resources to finally build a multilingual text search engine.
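
The limitation of term matching mentioned above can be illustrated with a small sketch. It is not taken from the tutorial materials: the function names `tfidf_vectors` and `cosine`, the smoothed IDF formula, the whitespace tokenizer, and the three sample sentences are all assumptions made for illustration. It shows that two texts in different languages that share no surface terms get a TF-IDF cosine similarity of zero, even when they are about the same subject.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Illustrative only: whitespace tokenization and a smoothed IDF are assumed.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample texts: two in English, one in Spanish on the same subject.
docs = [
    "topic models for document similarity search",
    "similarity search over document collections",
    "modelos de temas para buscar documentos similares",
]
vecs = tfidf_vectors(docs)
same_language = cosine(vecs[0], vecs[1])   # shares "document", "similarity", "search"
cross_language = cosine(vecs[0], vecs[2])  # shares no terms at all

print(f"same language:  {same_language:.3f}")
print(f"cross language: {cross_language:.3f}")
```

Because the English and Spanish sentences have no tokens in common, their score is exactly zero unless one of them is translated first. Comparing documents in a shared topic space is one way to avoid that translation step.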
Entity Nationality
No nationality
Malibu, Santa Monica, USA
Start Date
End Date

Related Research Groups, Departments and Institutes
  • Creator: Research Group: Ontology Engineering Group
  • Department: Artificial Intelligence