Research reports
Journal articles:
Spatial Features Selection for Unsupervised Speaker Segmentation and Clustering
Year: 2017

Research areas
  • Electronic and communications technology
  • Electrical, electronic and automation engineering

Data
Description
The selection of the best features to be used in expert systems is a key issue in obtaining a satisfactory performance. Unsupervised speaker segmentation and clustering is the task of the automatic identification of the number of participants in a meeting and the determination of their speaking turns (also called "diarization"). This is part of an intelligent system that replaces human intervention in several tasks related to automatic language and speech processing. The segmentation and clustering of speakers is crucial if we want to transcribe any audio recording automatically when several people take their turn. It is a task necessary to archive automatically interventions of several people in meetings, broadcast radio, lectures, parliamentary sessions, etc., since a simple transcription of what is said without assigning it to a specific speaker makes the information unusable. The automation of this task would save enormous amounts of resources currently spent on human transcribers. When used online, it could also be useful to point a video camera automatically to the person talking when a videoconference with multiple speakers is taking place, thus replacing a human operator. Furthermore, it could also help to scan large amounts of audio automatically in search of crimes or audio interventions of a particular person. In the case of recordings with several distant microphones (MDM), spatial features may and should be used. The most widely used spatial features in diarization are the Time Delay of Arrival (TDOA) features. These delays are extracted from pairs of microphones of unknown location and quality, which makes the selection of the best pairs highly advisable. This paper analyses this issue and proposes and evaluates several methods that significantly improve the performance both in speaker error rate (SER) and in computational time.
The methods propose a selection of TDOA features based on the quality of the cross-correlation of signals coming from different pairs of microphones. We prove that the use of the wrong pairs can be highly detrimental to the overall performance. The methods proposed, based on cross-correlation, are compared and combined with two other selection methods, based on the dynamic range of the delay features and on the selection of every pair of microphones available followed by a reduction in dimensionality. Although all algorithms achieve some improvements, it is proved that selection methods based on cross-correlation have the fewest errors. The improvements on the baseline system for the two best proposed systems are 25.14% and 33.70% for the development set, and 55.06% and 46.09% for the test set. Furthermore, the best method for the test set also reduces the computational cost by 20%.
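As a rough illustration only (not the authors' implementation), the idea of ranking microphone pairs by the quality of their cross-correlation could be sketched in Python as follows. The sketch assumes GCC-PHAT as the cross-correlation measure and uses the height of its peak as the quality score; the record does not state which measure or criterion the paper uses. The names `gcc_phat` and `select_pairs` and the parameters `max_lag` and `keep` are hypothetical.

```python
import itertools
import numpy as np


def gcc_phat(x, y, max_lag):
    """Estimate the delay of x relative to y (in samples) with GCC-PHAT.

    Returns (delay, peak_value). A positive delay means x lags y.
    The peak value is used here as an assumed quality score.
    """
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # Reorder so the lags run from -max_lag to +max_lag
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    peak = int(np.argmax(cc))
    return peak - max_lag, float(cc[peak])


def select_pairs(channels, max_lag, keep):
    """Rank all microphone pairs by GCC-PHAT peak value and keep the best ones.

    Returns a list of ((i, j), delay) tuples, best pair first.
    """
    scored = []
    for i, j in itertools.combinations(range(len(channels)), 2):
        delay, quality = gcc_phat(channels[i], channels[j], max_lag)
        scored.append((quality, (i, j), delay))
    scored.sort(reverse=True)                # highest peak value first
    return [(pair, delay) for _, pair, delay in scored[:keep]]
```

In a real diarization front end the delays would be computed frame by frame and the quality averaged over the recording; this sketch scores a single segment per pair to keep the example short.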
International
Yes
ISI JCR
Yes
Journal title
Expert Systems With Applications
ISSN
0957-4174
JCR impact factor
3.928
Impact information
Volume
DOI
10.1016/j.eswa.2016.12.005
Issue number
73
From page
27
To page
42
Month
May
Ranking
Journal Rank in Category 18/133; Quartile in category Q1

This activity belongs to the research reports

Participants
  • Author: Beatriz Martinez Gonzalez UPM
  • Author: Jose Manuel Pardo Muñoz UPM
  • Author: Julián David Echeverry Correa Facultad de Ingenierías. Programa de Ingeniería Eléctrica. Universidad Tecnológica de Pereira, Colombia
  • Author: Ruben San Segundo Hernandez UPM

Related research groups, departments, and R&D&I centers and institutes
  • Creator: Research group: Grupo de Tecnología del Habla
  • Department: Ingeniería Electrónica
  • R&D&I center or institute: Centro de I+d+i en Procesado de la Información y Telecomunicaciones