Translating PDF Files, Part 1

PDFs may be one of the most common file formats, but they still provoke dread in translators and project managers alike. Clients sometimes don’t understand why translating a PDF is any different from translating, say, an ordinary Microsoft Word document. A number of issues make translating a PDF a lengthier process, but it isn’t something that experienced project managers or translators can’t handle. This week, we give you Part 1 of our quick guide to understanding PDFs and translation.

Nearly everyone has created, read, sent, or received a PDF document. Along with Microsoft’s Word documents, Excel spreadsheets, and PowerPoint presentations, Adobe’s Portable Document Format, or PDF, is one of the most common file formats. Every file type presents its own challenges to translation, and PDFs are no exception. We’ve helped our clients translate digital reams of PDFs, so we are very familiar with this file type. The key to understanding how to translate PDFs is to first understand how a PDF file is created. Here’s a simple breakdown of a PDF’s structure.


What is a PDF?

Portable Document Format (PDF) is a file format first introduced by Adobe Systems in 1993. The format made it possible for documents to appear the same way regardless of the software, hardware, or operating system being used. This ensures that a company’s product catalogue, for example, will appear exactly as it should, whether the company’s customers use a sleek Apple MacBook Pro or a workhorse HP Pavilion Slimline.

How is this possible? Each PDF file contains all the typographic information needed to display its text correctly. A PDF is a self-contained document; you merely need a suitable reader to view it. You might think of a PDF as an old-fashioned traveling show, where everything a showman needs to put on his show is contained in his wagon.


How is a PDF structured?


Typically, a PDF is made up of two layers. One layer is the image layer. This is what a reader sees on his or her screen. The text is correctly formatted and displayed in the fonts chosen by the document’s author. The reader can also see any images or graphics. The second layer is a text-only layer. This layer, composed solely of the document’s text, remains hidden from the reader. A document that is originally created in MS Word and then converted to a PDF would have two layers.

PDFs sometimes contain only a single layer. This occurs, for example, when the PDF is created from a scan. A scanner creates an image of the document. The scanner does not re-create the document itself. Text, images, graphics, and formatting are all fused into one non-modifiable image. A paper copy that has been scanned and saved as a PDF would only have one layer.

When a PDF contains both the image and the text layer, the document is said to be ‘searchable’. This means that it would be possible to search for a specific term in the document. In fact, the search itself would be conducted in the text-only layer, but the results would be highlighted for you to see in the image layer. But ‘searchable’ conveys more than just the possibility of searching the document. It also means that all text can be selected and – with the right software – manipulated.
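
For readers who like to see an idea in code, here is a toy Python model of the two-layer picture described above. It is only a sketch under our own assumptions: the `PdfPage` class, its fields, and the sample text are hypothetical illustrations, and nothing here parses a real PDF file. It simply shows why a page with a text layer can be searched while a plain scan cannot.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PdfPage:
    """Toy stand-in for one page of a PDF (hypothetical, not a real parser)."""
    image_layer: bytes                # what the reader sees on screen
    text_layer: Optional[str] = None  # hidden text; None for a plain scan

    def is_searchable(self) -> bool:
        # A page counts as searchable only if it carries non-empty text.
        return bool(self.text_layer and self.text_layer.strip())

    def find(self, term: str) -> List[int]:
        """Return the offsets of `term` in the text layer, ignoring case."""
        if not term or not self.is_searchable():
            return []
        haystack = self.text_layer.lower()
        needle = term.lower()
        offsets = []
        start = haystack.find(needle)
        while start != -1:
            offsets.append(start)
            start = haystack.find(needle, start + 1)
        return offsets


# A page converted from a word processor keeps its text layer.
converted = PdfPage(
    image_layer=b"<rendered page>",
    text_layer="Product catalogue. See the catalogue index.",
)

# A scanned page is an image only: there is no text to search.
scanned = PdfPage(image_layer=b"<scanned bitmap>")

print(converted.is_searchable(), converted.find("catalogue"))
print(scanned.is_searchable(), scanned.find("catalogue"))
```

In this model the search runs entirely against `text_layer`, mirroring the point above that the search is conducted in the text-only layer even though the results are shown to you in the image layer.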


Next week, we’ll look at how the structure of a PDF affects translation, and what tools and strategies we use to help our clients translate their PDF documents.