Memorias de investigación
Research Publications in journals:
Spatial Features Selection for Unsupervised Speaker Segmentation and Clustering

Research Areas
  • Electronic technology and of the communications,
  • Electric engineers, electronic and automatic (eil)

The selection of the best features to be used in expert systems is a key issue in obtaining a satisfactory performance. Unsupervised speaker segmentation and clustering is the task of the automatic identifi- cation of the number of participants in a meeting and the determination of their speaking turns (also called ?diarization?). This is part of an intelligent system that replaces human intervention in several tasks related to automatic language and speech processing. The segmentation and clustering of speakers is crucial if we want to transcribe any audio recording automatically when several people take their turn. It is a task necessary to archive automatically interventions of several people in meetings, broadcast ra- dio, lectures, parliamentary sessions etc. since a simple transcription of what is said without assigning it to a specific speaker makes the information unusable. The automation of this task would save enormous amounts of resources currently spent on human transcribers. When used online it could also be useful to point a video camera automatically to the person talking when a videoconference with multiple speakers is taking place thus replacing a human operator. Furthermore it could also help to scan large amounts of audio automatically in search of crimes or audio interventions of a particular person. In the case of recordings with several distant microphones (MDM), spatial features may and should be used. The most widely used spatial features in diarization are the Time Delay of Arrival (TDOA) features. These delays are extracted from pairs of microphones of unknown location and quality, which makes the selection of the best pairs highly advisable. This paper analyses this issue and proposes and evaluates several methods that significantly improve the performance both in speaker error rate (SER) and in computational time. The methods propose a selection ofTDOA features based on the quality of the cross-correlation of signals coming from different pairs of microphones. We prove that the use of the wrong pairs can be highly detrimental to the overall performance. The methods proposed, based on cross correlation, are compared and combined with other two selection methods, based on the dynamic range of the delay features and the selection of every pair of microphones available followed by a reduction in dimensionality. Although all algorithms achieve some improvements, it is proved that selection methods based on cross correlation have the fewest errors. The improvements on the baseline system for the two best proposed systems are 25.14% and 33.70% for the development set, and 55.06% and 46.09% for the test set. Furthermore the best method for the test set also reduces the computational cost by 20%.
Expert Systems With Applications
Impact factor JCR
Impact info
Journal number
From page
To page
Journal Rank in Category 18/133; Quartile in category Q1
  • Autor: Beatriz Martinez Gonzalez UPM
  • Autor: Jose Manuel Pardo Muñoz UPM
  • Autor: Julián David Echeverry Correa Facultad de Ingenierías.Programa de Ingeniería Eléctrica.Universidad Tecnológica de Pereira, Colombia
  • Autor: Ruben San Segundo Hernandez UPM

Research Group, Departaments and Institutes related
  • Creador: Grupo de Investigación: Grupo de Tecnología del Habla
  • Departamento: Ingeniería Electrónica
  • Centro o Instituto I+D+i: Centro de I+d+i en Procesado de la Información y Telecomunicaciones