Tree-Based TM for Computer-Assisted Bible Translation

Details

Author: Andi Wu

Year: 2019

Abstract

A translation memory (TM) is a database that stores previously translated translation units. In computer-assisted Bible translation, it can provide suggestions as new text is translated, improving both efficiency and consistency. This paper presents an automatic way of building a TM incrementally, in real time, as a translation project progresses. The approach requires (1) an automatic word aligner and (2) syntactic treebanks of the original Hebrew and Greek texts. After each verse is translated, the aligner aligns the translation words to their corresponding Hebrew/Greek words, which are the leaf nodes of a syntactic tree. Since each node in the tree (a subtree) represents a word, phrase, or clause, phrase and clause alignments can also be created automatically by mapping the sequence of leaf nodes in a subtree to the translation words aligned to those nodes. The result is a TM that contains linguistically valid translation units of any textual size. As such a TM grows, Bible translators get increasingly better suggestions.

This approach differs from traditional methods of TM creation, in which translation units beyond the word level are arbitrary word sequences that are hard to identify automatically and are not always legitimate linguistic units. We are able to avoid these problems thanks to the existence of syntactic treebanks of Biblical texts. Such resources are seldom available in other domains.
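
The mapping from subtrees to translation units described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the `Node` class, the `extract_units` function, the tree shape and labels, and the hand-written alignment are all hypothetical, chosen only to show the idea. The sketch also assumes that the translation words aligned to a subtree can simply be joined in target order, which ignores discontinuous translations.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical treebank node: a leaf holds one source word, an inner node a phrase or clause."""
    label: str
    word: Optional[str] = None    # set only on leaves
    index: Optional[int] = None   # position of the leaf in the source verse
    children: list = field(default_factory=list)


def leaves(node):
    """Return the leaf nodes under `node` in source order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def extract_units(node, alignment, target_words):
    """Yield one (source, target) pair per subtree that has at least one aligned word."""
    under = leaves(node)
    # Union of all target positions aligned to the leaves of this subtree.
    target_idx = sorted({t for leaf in under for t in alignment.get(leaf.index, [])})
    if target_idx:
        source = " ".join(leaf.word for leaf in under)
        target = " ".join(target_words[t] for t in target_idx)
        yield source, target
    for child in node.children:
        yield from extract_units(child, alignment, target_words)


# Illustrative tree for the opening words of John 1:1 (structure assumed, not taken from a treebank).
tree = Node("S", children=[
    Node("PP", children=[Node("P", "ἐν", 0), Node("N", "ἀρχῇ", 1)]),
    Node("V", "ἦν", 2),
    Node("NP", children=[Node("Det", "ὁ", 3), Node("N", "λόγος", 4)]),
])
target_words = ["In", "the", "beginning", "was", "the", "Word"]
# Word alignment as an aligner might produce it: source leaf index -> target word indices.
alignment = {0: [0], 1: [1, 2], 2: [3], 3: [4], 4: [5]}

tm = dict(extract_units(tree, alignment, target_words))
print(tm["ἐν ἀρχῇ"])  # In the beginning
```

Because every subtree contributes an entry, a single aligned verse yields units at the word, phrase, and clause level at once. A real TM would also need to store several translations per source unit and their frequencies, which this sketch leaves out.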

About the Author

Andi Wu holds a PhD in computational linguistics from UCLA. After working at Microsoft Research for eight years, he joined Global Bible Initiative, where he has been developing linguistic data in the Bible domain and using that data to create tools for accelerating Bible translation.