Translating PDF files, Part 2

In today’s highly digital workplace, you’re likely to come across as many PDF documents as MSWord documents. Although your inbox might contain both kinds of file as attachments, they are not created the same way. PDFs are often more challenging to translate, requiring more time and costing more than a standard MSWord document. That’s why we decided to write on this blog about what translating a PDF document entails. Last week, we gave a general explanation of how a PDF is created[F1], since understanding the structure of a PDF document is key to understanding the steps that come before the actual translation. This week, we’ll look at some of the ways we work around the issues that PDFs present.


How to translate a PDF

All PDFs must be prepped prior to translation. The actual process depends on whether the PDF is searchable or non-searchable. (Refer to last week’s post[F2] for a more detailed explanation of what a searchable PDF means.)


Translating a searchable PDF

A searchable PDF – such as a document created from an MSWord document – can often be converted into a more convenient file type. Adobe Acrobat can convert PDFs into MSWord documents, generally preserving formatting and images. The converted document can then be translated. This is one option when the translation is intended for publication.


The PDF can also be converted into a text-only file. The resulting document will not be formatted, nor will it include any images or graphics. However, a translator can easily handle this text with any translation tool. This may be the best option when the translation is for information purposes only. Whether a searchable PDF can be manipulated at all, however, depends on the authorizations and permissions set on the PDF document itself. If the PDF is ‘locked’, for example, the strategies described above may not work.


Translating a non-searchable PDF

A non-searchable PDF – such as a document created from a scanned hard copy – must first be processed through Optical Character Recognition (OCR) software to recreate the source document. OCR software ‘reads’ the images of words and recreates an editable document.


The quality of the resulting document depends on a number of factors, including the software, the quality of the scan, the document’s language, the complexity of the formatting, and the user’s expertise with the software. Clean, high-resolution scans of simply formatted documents written in a commonly supported language (e.g., French) can result in high-quality output that requires minimal post-OCR editing.


OCR software comes as a stand-alone product or as a feature integrated into other products. Nuance OmniPage[F3] and ABBYY FineReader[F4] are examples of stand-alone OCR software. Adobe Acrobat offers conversion of a scanned document into an editable document as a feature.
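
For readers who like to see a workflow written down, the choices described in this post can be sketched as a small decision routine. This is only an illustration in Python: the `PdfJob` class, its fields, and the `choose_prep_strategy` function are hypothetical names made up for this example, and the step suggested for a locked PDF is our assumption based on the advice to clients in this post. It does not call Acrobat or any OCR product.

```python
from dataclasses import dataclass


@dataclass
class PdfJob:
    """Hypothetical description of an incoming PDF (field names are illustrative)."""
    searchable: bool       # True if the PDF contains real text, not just page images
    locked: bool           # True if permissions block conversion of the file
    for_publication: bool  # True if formatting and images must be kept


def choose_prep_strategy(job: PdfJob) -> str:
    """Return the preparation step this post suggests for the given PDF."""
    if not job.searchable:
        # Scanned pages must go through OCR before anything else.
        return "run OCR, then edit the output"
    if job.locked:
        # Assumption: conversion may fail on a locked file, so ask the client.
        return "request an editable source or an unlocked PDF"
    if job.for_publication:
        # Formatting and images matter for publication.
        return "convert to MSWord"
    # A for-information translation only needs the text.
    return "convert to plain text"


# Example: a searchable, unlocked PDF translated for information only.
print(choose_prep_strategy(PdfJob(searchable=True, locked=False, for_publication=False)))
```

In practice, a project manager weighs more than these three yes/no questions, so treat the routine as a summary of the post rather than a tool.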


On many projects, project managers may resort to a combination of the strategies described above for handling PDF source files. The best solution to working with PDFs is to avoid them when possible. Clients should provide an editable source document (such as an MSWord document) when one is available. If this isn’t possible, clients should provide a PDF that requires no authorizations or permissions for access.


