Description
|
|
---|---|
We present a novel approach to language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, as in PPRLM, but instead of a language model we build a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the ranking of the input sentence and the ranking of each language, based on the difference in relative position of each n-gram. The aim of this ranking is to model reliably a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimate. We demonstrate that this approach outperforms PPRLM (6% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking by the absolute number of occurrences and ranking by discriminative values (11% relative improvement). | |
International
|
Yes |
Conference name
|
8th Annual Conference of the International Speech Communication Association (Interspeech 2007) |
Type of participation
|
960 |
Conference location
|
Antwerp, Belgium |
Reviewers
|
Yes |
ISBN or ISSN
|
ISSN 1990-9772 |
DOI
|
|
Conference start date
|
27/08/2007 |
Conference end date
|
31/08/2007 |
From page
|
|
To page
|
|
Proceedings title
|