Observatorio de I+D+i UPM

Research reports
Thesis:
Analysis and development of robust speaker diarization for meetings
Year: 2017
Research areas
  • Electronic and communications technology,
  • Electrical, electronic and automatic engineering
Details
Description
Following a real conversation is far harder for machines than it is for human beings. We have been trained our whole lives, without even noticing, to recognize voices, to remember them and to automatically disregard irrelevant noises. Speech technologies try to make machines work as much like us as possible. Speaker diarization is the part of these technologies that tries to determine, within a recording, who is speaking and when. This analysis can be applied to any recording with two or more speakers (such as broadcast news or telephone conversations), but this thesis focuses on its application to real meetings. Recognizing a set of unknown speakers, whom the system has never "heard" before, and following their speaking turns throughout the whole meeting requires many pre-processing steps and a very well trained classifier.
In this thesis, some features used in diarization systems for meetings are studied and improved by selecting the most representative ones. Current diarization systems all make use of MFCC features, usually combined with others. If multiple recordings are available, the delays between microphones (time delay of arrival, TDOA) are the other most extensively used features. We study these delay features in depth and propose several ways to select among them those with the highest quality or those most representative of the real speaker turns. We find that selection methods relying on the cross-correlation measure between the signals arriving at the different microphones are the most likely to reduce the diarization error. However, other methods that use Principal Component Analysis (PCA) or a K-means classification, followed by a selection based on the dynamic margin of the delay values or on the Silhouette coefficient, achieve large improvements as well. If the recording was made with only one microphone, TDOA features cannot be used.
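As an illustration of the delay features mentioned above, the following is a minimal sketch of how a delay between two microphone signals can be estimated from their normalised cross-correlation. It is not the method used in the thesis: the function name `estimate_delay`, the parameter `max_lag` and the synthetic test signal are assumptions introduced here for illustration only. The returned peak value is one possible indicator of how reliable a delay estimate is, which is the kind of quantity a cross-correlation-based selection could rely on.

```python
import math
import random


def estimate_delay(sig, ref, max_lag):
    """Return (lag, peak) for two equal-length signals.

    lag is the shift in samples (positive when sig lags ref) that maximises
    the normalised cross-correlation; peak is that maximum value.
    Hypothetical helper, not taken from the thesis.
    """
    if len(sig) != len(ref):
        raise ValueError("signals must have the same length")
    norm = math.sqrt(sum(x * x for x in sig) * sum(y * y for y in ref))
    if norm == 0.0:
        return 0, 0.0
    n = len(sig)
    best_lag, best_value = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Correlate sig[i] with ref[i - lag] over the overlapping samples.
        start, stop = max(0, lag), min(n, n + lag)
        value = sum(sig[i] * ref[i - lag] for i in range(start, stop)) / norm
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag, best_value


# Synthetic example (assumption): white noise reaching a second
# microphone 12 samples later than the first.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(2000)]
true_delay = 12
ref = source
sig = [0.0] * true_delay + source[:-true_delay]

lag, peak = estimate_delay(sig, ref, max_lag=40)
print(lag, round(peak, 3))
```

On real meeting recordings the delay would normally be estimated over short analysis windows and for each microphone pair, and low peak values would flag windows whose delay estimate is less trustworthy.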
The second part of this thesis presents a study of some glottal features as substitutes for TDOA when only one recording channel of the meeting is available. The analysis focuses on meetings from the media, whose speaking styles can be heavily affected by the type of program (for example, reports versus talk shows or weather forecasts). We found that including a music detection stage is very beneficial for this kind of audio recording and that, among the studied features, the harmonics merged with lif0 (logarithm of the interpolated f0) and the harmonics-to-noise ratio (HNR) are the most promising ones and clearly improve diarization performance.
In the third part of this thesis, all the previously analysed features are used in a preliminary study to detect overlap. Since real meetings are expected to contain people talking simultaneously, these glottal features, together with many more related to the TDOA features and their cross-correlation, are studied for overlap detection. Although this study was preliminary, our conclusion is that some cross-correlation and delay-related features did show a relation with the presence of overlap regions and should be studied in depth in the future.
Finally, a modification of the segmentation stage of the speaker diarization system is presented. During this thesis it was discovered that the decision to change from one speaker to another was governed not only by the acoustics but also by a parameter dependent on the number of active speakers, which varies throughout the diarization process. This dependency is removed and a new weight factor is added so that the speaker turns are independent of the number of speakers and depend only on the database and the numerical characteristics of the system. In our experiments, the best results were obtained when this weight factor favoured speaker changes when the acoustics between speakers were balanced.
International
No
ISBN
Thesis type
Doctoral
Grade
Sobresaliente cum laude (Outstanding cum laude)
Date
21/07/2017
This activity belongs to the research reports
Participants
  • Author: Beatriz Martinez Gonzalez (UPM)
  • Supervisor: Jose Manuel Pardo Muñoz (UPM)
Related research groups, departments, centres and R&D&I institutes
  • Creator: Research group: Grupo de Tecnología del Habla
  • R&D&I centre or institute: Centro de I+d+i en Procesado de la Información y Telecomunicaciones
  • Department: Ingeniería Electrónica