Abordagem para o desenvolvimento de um etiquetador de alta acurácia para o Português do Brasil (An approach to developing a high-accuracy tagger for Brazilian Portuguese)

DOMINGUES, Miriam Lúcia Campos Serra

Thesis

Main author:	DOMINGUES, Miriam Lúcia Campos Serra
Degree:	Thesis
Language:	por (Portuguese)
Published:	Universidade Federal do Pará 2012
Subjects:	Etiquetagem morfossintática; Processamento de linguagem natural (Computação); Linguística computacional; Linguística de corpus; Part-of-speech tagging; Natural language processing; Computational linguistics; Corpus linguistics; CNPQ::ENGENHARIAS::ENGENHARIA ELETRICA::TELECOMUNICACOES::TEORIA ELETROMAGNETICA, MICROONDAS, PROPAGACAO DE ONDAS, ANTENAS
Online access:	http://repositorio.ufpa.br/jspui/handle/2011/2828

Abstract:
Part-of-speech tagging is a basic task required by many natural language processing applications, such as parsing and machine translation, and by speech processing applications such as speech synthesis. The task consists of labeling each word in a sentence with its grammatical category. Although these applications demand higher-precision taggers, state-of-the-art taggers still achieve accuracies of 96 to 97%. This thesis investigates corpus and software resources for developing a Brazilian Portuguese tagger whose accuracy exceeds the state of the art. Based on a hybrid solution that combines probabilistic and rule-based tagging, the thesis presents an exploratory study of the tagging method and of the size, quality, tag set, and textual genre of the corpora available for training and testing, and it evaluates the disambiguation of new or out-of-vocabulary words found in the texts to be tagged. Four corpora were used in the experiments: CETENFolha, Bosque CF 7.4, Mac-Morpho, and Selva Científica. The proposed tagging model is based on transformation-based learning (TBL), to which three strategies were added, combined in an architecture that integrates the outputs (tagged texts) of two free tools, TreeTagger and μ-TBL, with the modules added to the model. The tagger model trained on the Mac-Morpho corpus, of the journalistic genre, achieved tagging accuracies of 98.05% on the Mac-Morpho test set and 98.27% on Bosque CF 7.4, both journalistic. The hybrid tagger was also evaluated on texts from the Selva Científica corpus, of the scientific genre. This evaluation identified adjustments needed in the tagger and in the corpora and, as a result, accuracies of 98.07% on Selva Científica, 98.06% on the Mac-Morpho test set, and 98.30% on Bosque CF 7.4 were achieved. These results are significant because the accuracy rates exceed the state of the art, validating the proposed model as a route to a more reliable part-of-speech tagger.
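
As a rough illustration of the hybrid idea summarized in the abstract (correcting the output of a baseline tagger with transformation rules and measuring token-level accuracy), the following minimal Python sketch may help. It is not the thesis's implementation: the rule format, the function names `apply_rules` and `accuracy`, the tags, and the three-word example are all assumptions made for illustration, and no real tagger or corpus is involved.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]   # a sentence as (word, tag) pairs

# Hypothetical rule format: (from_tag, to_tag, prev_tag) means
# "change from_tag to to_tag when the previous token is tagged prev_tag".
Rule = Tuple[str, str, str]


def apply_rules(sentence: Tagged, rules: List[Rule]) -> Tagged:
    """Apply each transformation rule, in order, left to right over the sentence."""
    out = list(sentence)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == from_tag and out[i - 1][1] == prev_tag:
                out[i] = (word, to_tag)
    return out


def accuracy(predicted: Tagged, gold: Tagged) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    correct = sum(p[1] == g[1] for p, g in zip(predicted, gold))
    return correct / len(gold)


# Invented example: a baseline tagger mislabels the noun "casa" as a verb.
baseline = [("a", "ART"), ("casa", "V"), ("caiu", "V")]
gold = [("a", "ART"), ("casa", "N"), ("caiu", "V")]

# Illustrative rule: a verb tag right after an article becomes a noun tag.
rules = [("V", "N", "ART")]
corrected = apply_rules(baseline, rules)

print(accuracy(baseline, gold), accuracy(corrected, gold))
```

In transformation-based learning proper, rules like the one above are learned from an annotated training corpus by repeatedly selecting the rule that removes the most errors; here the rule is written by hand only to show how it is applied and scored.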
