Sunday, July 24, 2022

CATting edge technology – a PDF document set case study


[Me and my real life cat tool*]

As a professional legal translator, I am often called upon to translate PDF versions of official documents and utility notices as well as bank statements. As these documents vary significantly in formatting complexity and font clarity, I have to choose the most efficient way of approaching the translation, i.e., by hand, using a CAT tool or some combination of the two. Here I present a recent project of some ten such documents that had to be translated from Hebrew to English, explain my approach and state my personal conclusions.

Before going into detail, non-translators may need a short explanation of the methods. Hand translation involves building a text, line by line, adjusting font size and column widths to create a document that is visually identical to the original. Even with practice, this method can be quite time-consuming unless one already has a template (which, unfortunately, I did not have in this case). The more efficient way is to use an OCR application, ABBYY FineReader on my computer, to convert the PDF into a Word document. The application first asks for confirmation of any character about which it is uncertain and then creates the Word file. Such “verifications” can range from a few per page to a quarter of the page in the worst cases. The factors influencing convertibility include the complexity of the formatting, the type of font and the quality of the PDF. Translators then import the resulting Word document into a Computer Assisted Translation (CAT) tool, MemoQ here, which splits the text into sentence-level segments. These are translated one by one, with numbers and repetitions entered automatically. Upon export, translators receive the same document in the target language, but the formatting and fonts often must be tweaked to produce a final document. This method is significantly faster in many cases and much more accurate if numbers are involved.

The project in question involved 10 pages, ranging from a text with a simple format and clear font (a simple letter) to texts with complex formatting and poor PDF quality (a government notice and a utility bill), as well as texts that combined a significant percentage of numbers with a short but complexly formatted section at the top (bank statements). I priced the project in terms of time as if I were going to do all the documents by hand, with my “profit” depending on how efficient I could be.

In practice, I immediately excluded three documents from OCR processing as their formatting and font would not convert well, specifically the utility bill and two government notices. While processing the remaining documents in the OCR application, I removed two more, as “verifying” the text and then redoing the formatting would have been more time-consuming than simply translating them manually. Two of the remaining documents came out almost perfect, but the bank statements were problematic as the OCR did not produce a document that properly reflected the complex formatting of the upper part and its fonts of varying sizes. However, I chose to complete the scanning process and mark the bank account entries as a table, as doing so would ensure that there were no errors in the numbers and reduce my QA time by eliminating the need to double-check them.
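For readers who like to see a decision spelled out, the triage reasoning above can be sketched in a few lines of Python. This is only an illustration: the `Estimate` fields, the `choose_method` function and all the minute figures are hypothetical and were not measured on this project. It simply compares the estimated time for the OCR and CAT route (verification, translation and reformatting) with the estimated time for doing the document by hand.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """Rough time estimates for one document, in minutes (all hypothetical)."""
    name: str
    manual: float         # building the translated document by hand
    ocr_verify: float     # confirming uncertain characters in the OCR tool
    cat_translate: float  # translating the segments in the CAT tool
    reformat: float       # fixing layout and fonts after export


def choose_method(doc: Estimate) -> str:
    """Return 'cat' if the OCR + CAT route is estimated to be faster, else 'manual'."""
    cat_total = doc.ocr_verify + doc.cat_translate + doc.reformat
    return "cat" if cat_total < doc.manual else "manual"


# Invented figures, chosen only to show both outcomes.
docs = [
    Estimate("simple letter", manual=40, ocr_verify=2, cat_translate=20, reformat=5),
    Estimate("utility bill", manual=60, ocr_verify=45, cat_translate=25, reformat=30),
]
choices = {d.name: choose_method(d) for d in docs}
print(choices)
```

In real life the estimates are a matter of judgment, of course, which is why some documents only revealed themselves as hand jobs once the OCR process had begun.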

Ultimately, I translated five documents by hand, essentially the government notices and utility bills, which are quite complex in terms of formatting. Two of the documents, the simple letter and a simple notice, required almost no additional work after export from the CAT tool. For the bank statements, by contrast, I took a hybrid approach, creating the short upper part with all the account details by hand but pasting in the table from the CAT tool, with a few minor tweaks, to ensure that the numbers were correct.
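The value of letting the tool handle the numbers is easiest to see by considering what a manual check involves. As a minimal sketch (not a feature of any particular CAT tool), the Python below pulls the numbers out of a source line and its translation and reports any that do not match. The function name `number_mismatches`, the regular expression and the sample bank statement row are my own assumptions for illustration; a real check would also have to cope with dates or amounts that are legitimately reformatted in the target language.

```python
import re
from collections import Counter

# Digits optionally grouped with '.' or ',' (e.g. 1,250.00 or 15.07.2022).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def number_mismatches(source: str, target: str) -> tuple[list[str], list[str]]:
    """Return (numbers missing from the target, numbers found only in the target)."""
    src = Counter(NUMBER.findall(source))
    tgt = Counter(NUMBER.findall(target))
    missing = sorted((src - tgt).elements())
    extra = sorted((tgt - src).elements())
    return missing, extra


# Invented example row; two digits are transposed in the translated balance.
source_row = "15.07.2022 העברה 1,250.00 יתרה 8,431.17"
target_row = "15.07.2022 Transfer 1,250.00 Balance 8,413.17"
missing, extra = number_mismatches(source_row, target_row)
print("Missing from translation:", missing)
print("Only in translation:", extra)
```

A transposed pair of digits like this one is exactly the kind of error that is easy to make when typing a statement by hand and hard to spot afterwards.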

In terms of time, the CAT tool did improve my efficiency to a certain degree but not as much as theoretically possible due to the quality of the PDF images and the type of font in this set of documents. In the future, I will immediately remove those segments that are currently beyond the capacity of the OCR to recognize properly in terms of text and format, in order to avoid wasting time on “verifying” text that I will not use. On a positive note, I discovered that combining manual translation and CAT usage on a single document can be an effective method. Live and learn.

* Use picture captions to allow the blind to fully access the Internet.
