Observatorio de I+D+i UPM (UPM R&D&I Observatory)

Research reports
Journal articles:
Spatial Features Selection for Unsupervised Speaker Segmentation and Clustering
Research areas
  • Electronic and communications technology
  • Electrical, electronic and automatic engineering
The selection of the best features to be used in expert systems is a key issue in obtaining satisfactory performance. Unsupervised speaker segmentation and clustering (also called "diarization") is the task of automatically identifying the number of participants in a meeting and determining their speaking turns. It is part of an intelligent system that replaces human intervention in several tasks related to automatic language and speech processing. The segmentation and clustering of speakers is crucial for transcribing any audio recording automatically when several people take turns speaking. The task is necessary for automatically archiving the interventions of several people in meetings, broadcast radio, lectures, parliamentary sessions, etc., since a simple transcription of what is said, without assigning it to a specific speaker, makes the information unusable. Automating this task would save the enormous amounts of resources currently spent on human transcribers. Used online, it could also point a video camera automatically at the person talking during a videoconference with multiple speakers, thus replacing a human operator. Furthermore, it could help to scan large amounts of audio automatically in search of crimes or of audio interventions by a particular person. In the case of recordings with multiple distant microphones (MDM), spatial features may and should be used. The most widely used spatial features in diarization are the Time Delay of Arrival (TDOA) features. These delays are extracted from pairs of microphones of unknown location and quality, which makes the selection of the best pairs highly advisable. This paper analyses this issue and proposes and evaluates several methods that significantly improve performance in both speaker error rate (SER) and computational time.
The methods propose a selection of TDOA features based on the quality of the cross-correlation of the signals coming from different pairs of microphones. We prove that using the wrong pairs can be highly detrimental to overall performance. The proposed methods, based on cross-correlation, are compared and combined with two other selection methods: one based on the dynamic range of the delay features, and one that selects every available pair of microphones and then reduces the dimensionality. Although all algorithms achieve some improvement, it is shown that selection methods based on cross-correlation have the fewest errors. The improvements over the baseline system for the two best proposed systems are 25.14% and 33.70% for the development set, and 55.06% and 46.09% for the test set. Furthermore, the best method for the test set also reduces the computational cost by 20%.
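The pair-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how cross-correlation quality is measured, so the sketch assumes the peak of the normalized time-domain cross-correlation as the quality score. The function names (`xcorr_peak`, `select_pairs`) and the `max_lag` and `keep` parameters are hypothetical. For each microphone pair it estimates the delay as the lag with the highest correlation, then keeps only the pairs with the strongest peaks.

```python
import math
import random
from itertools import combinations


def xcorr_peak(x, y, max_lag):
    """Return (lag, peak) for the lag in [-max_lag, max_lag] that maximizes
    the normalized cross-correlation sum_i x[i] * y[i + lag].

    A positive lag means y looks like x delayed by `lag` samples.
    Assumption: the peak value stands in for "cross-correlation quality".
    """
    n = min(len(x), len(y))
    norm = math.sqrt(sum(v * v for v in x[:n]) * sum(v * v for v in y[:n]))
    if norm == 0.0:
        return 0, 0.0  # a silent channel carries no delay information
    best_lag, best_val = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        lo = max(0, -lag)      # keep both x[i] and y[i + lag] in range
        hi = min(n, n - lag)
        val = sum(x[i] * y[i + lag] for i in range(lo, hi)) / norm
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val


def select_pairs(channels, max_lag, keep):
    """Score every microphone pair and return the `keep` best ones as
    ((index_a, index_b), estimated_delay, peak) tuples, strongest peak first."""
    scored = []
    for a, b in combinations(range(len(channels)), 2):
        lag, peak = xcorr_peak(channels[a], channels[b], max_lag)
        scored.append(((a, b), lag, peak))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:keep]


if __name__ == "__main__":
    random.seed(0)
    source = [random.gauss(0.0, 1.0) for _ in range(400)]
    delayed = [0.0] * 3 + source[:-3]                      # same speech, 3 samples later
    unrelated = [random.gauss(0.0, 1.0) for _ in range(400)]  # a poor microphone
    for pair, lag, peak in select_pairs([source, delayed, unrelated], 10, 3):
        print(pair, lag, round(peak, 3))
```

With a source signal, a copy delayed by three samples, and an unrelated noise channel, the pair formed by the first two channels should rank first with an estimated delay of 3 samples, while pairs involving the noise channel score far lower. The time-domain loop is used here only for readability; a practical system would more likely compute the correlation in the frequency domain over short analysis windows.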
Journal title
Expert Systems With Applications
JCR impact factor
Impact information
Journal issue
From page
To page
Journal Rank in Category 18/133; Quartile in category Q1
This activity belongs to research reports
  • Author: Beatriz Martinez Gonzalez (UPM)
  • Author: Jose Manuel Pardo Muñoz (UPM)
  • Author: Julián David Echeverry Correa (Facultad de Ingenierías, Programa de Ingeniería Eléctrica, Universidad Tecnológica de Pereira, Colombia)
  • Author: Ruben San Segundo Hernandez (UPM)
Related research groups, departments, centers and R&D&I institutes
  • Creator: Research group: Grupo de Tecnología del Habla
  • Department: Ingeniería Electrónica
  • R&D&I center or institute: Centro de I+d+i en Procesado de la Información y Telecomunicaciones