In our world of international communications, machine translation has become an essential key technology. Several approaches exist, but in recent years so-called Statistical Machine Translation (SMT) has been considered the most promising. In this approach, knowledge is automatically extracted from examples of translations, called parallel texts, and from monolingual data in the target language. Statistical machine translation is a data-driven process. This is commonly put forward as a great advantage of statistical approaches, since no human intervention is required, but it can also become a problem when the necessary development data are unavailable, too small, or drawn from an inappropriate domain. The research presented in this thesis is an attempt to overcome a barrier to the massive deployment of statistical machine translation systems: the lack of parallel corpora. A parallel corpus is a collection of sentences in source and target languages that are aligned at the sentence level. Most existing parallel corpora were produced by professional translators. This is an expensive task in terms of money, human resources and time. This thesis provides methods to overcome this need by exploiting the huge, easily available collections of comparable and monolingual data. In the first part of this thesis, we worked on the use of comparable corpora to improve statistical machine translation systems. We present two effective architectures to achieve this. A comparable corpus is a collection of texts in multiple languages, collected independently, but often containing parts that are mutual translations. The size and quality of the parallel content may vary considerably from one comparable corpus to another, depending on various factors, including the method used to construct the corpus. In any case, it is not easy to identify the parallel parts automatically.
This paper presents a method for acquiring synonyms from monolingual comparable text (MCT). MCT denotes a set of monolingual texts whose contents are similar and that can be obtained automatically. Our acquisition method takes advantage of a characteristic of MCT: the words it contains and the relations among them are confined. Our method uses the contextual information of one surrounding word on each side of the target words. To improve acquisition precision, prevention of outside appearance is used. This method has the advantages that it requires only part-of-speech information and that it can acquire infrequent synonyms. We evaluated our method with two kinds of news article data: sentence-aligned parallel texts and document-aligned comparable texts. When applying the former data, our method acquires synonym pairs with 70.0% precision. Re-evaluation of incorrect word pairs with source texts indicates that the method captures the appropriate parts of source texts with 89.5% precision. When applying the latter data, acquisition precision reaches 76.0% in English and 76.3% in Japanese.
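The context-matching idea in the synonym-acquisition abstract (one surrounding word on each side of the target word) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the function name `candidate_synonyms`, the whitespace tokenization, and the exact-match rule for neighbours are all hypothetical, and the part-of-speech filtering and the "prevention of outside appearance" step are left out because the abstract does not define them.

```python
from collections import defaultdict


def candidate_synonyms(tokens_a, tokens_b):
    """Pair words from two comparable texts that share the same left and
    right neighbour. Hypothetical helper: POS filtering and the
    'outside appearance' check from the abstract are NOT implemented."""

    def contexts(tokens):
        # Map each (left neighbour, right neighbour) pair to the set of
        # words that occur between those two neighbours.
        ctx = defaultdict(set)
        for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
            ctx[(left, right)].add(word)
        return ctx

    ctx_a = contexts(tokens_a)
    ctx_b = contexts(tokens_b)

    pairs = set()
    # Only contexts that occur in both texts can yield a candidate pair.
    for key in ctx_a.keys() & ctx_b.keys():
        for word_a in ctx_a[key]:
            for word_b in ctx_b[key]:
                if word_a != word_b:
                    pairs.add((word_a, word_b))
    return pairs


# Toy example with made-up sentences (not data from the paper).
text_a = "the firm bought the company yesterday".split()
text_b = "the firm purchased the company today".split()
print(candidate_synonyms(text_a, text_b))  # {('bought', 'purchased')}
```

Because a match depends only on the immediate neighbours and not on corpus-wide frequency, a pair can in principle be proposed from a single occurrence, which is consistent with the abstract's remark that infrequent synonyms can be acquired.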