Description
|
|
---|---|
Traditional text-to-speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a much wider range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music, and noise) by filtering out low-quality audio segments and creating single-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection that improve the precision and overall quality of the segmentation. | |
International
|
Yes |
Conference name
|
15th Annual Conference of the International Speech Communication Association |
Type of participation
|
960 |
Conference location
|
Singapore |
Peer reviewed
|
Yes |
ISBN or ISSN
|
2308-457X |
DOI
|
|
Conference start date
|
14/09/2014 |
Conference end date
|
18/09/2014 |
From page
|
2370 |
To page
|
2374 |
Proceedings title
|
Proceedings 15th Annual Conference of the International Speech Communication Association |