Thursday, August 03, 2006

Word 2007 (.docx) and ISO 19005 (PDF/A)

Word 2007 is the fourth generation of the most popular Windows word processor. The first generation comprises Word 1.0, 1.1, 1.2 and 2.0. The second comprises Word 6.0 and Word 95 (a 32-bit Word 6.0 with support for long file names and "red-squiggle underlined spell-checking"). The third generation (in which VBA replaced WordBasic) consists of Word 97, Word 2000, Word XP and Word 2003.

We can characterize Word versions by the version of the library riched20.dll they ship with. For example, Word 2000 came with version, and Word 2007 beta uses version 12.0.4017.1003. BTW, for Unicode support, the file usp10.dll (Uniscribe Unicode script processor) is important.

Word's default file extension used to be .doc. Let's download one such document and try to open it in doknir: "Windows Vista Hardware Start Button Specification". This is the result:

Upgrading KWord to version 1.5.2 does not help - again "The application KWord (kword) crashed and caused the signal 6 (SIGABRT)". So we will need to use Word - in our case that will be version 2007 beta.

Before starting Word 2007, stop the Windows print spooler, just like in the "Crystal reports" example. Word 2007 behaves in a friendlier way than Crystal Reports, but again only a limited number of paper sizes is available (Letter, A4, Legal, A3, B4, B5). Fortunately, it is possible to define a custom paper size.
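Stopping the spooler can also be scripted. The following is only a minimal sketch: it assumes the default service name "Spooler" and a session with administrator rights, and the helper names spooler_command and set_spooler are made up for this illustration.

```python
import subprocess
import sys

# Default service name of the Windows print spooler (an assumption of this sketch).
SPOOLER_SERVICE = "Spooler"


def spooler_command(action):
    """Build the `net` command line that starts or stops the print spooler."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["net", action, SPOOLER_SERVICE]


def set_spooler(action):
    """Run the command; this only makes sense on Windows with admin rights."""
    if sys.platform != "win32":
        raise RuntimeError("the print spooler service exists only on Windows")
    return subprocess.run(spooler_command(action), check=True)
```

Calling set_spooler("stop") before launching Word and set_spooler("start") afterwards would mirror the manual steps described above.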

There is no File menu in the new Word, so we must use the "Office Button" instead:

For example, if we click on "Print", nothing happens, because the print spooler is not running. BTW, because of the new button, it is not possible to close the window by double-clicking the little horizontal-line icon in the upper-left corner:

That's why I've added a little '×' next to the "Office Button" to close Word...

So, how do we print a Word file using doknir? Fortunately, Word 2007 has a very useful new feature: "Save As PDF". You must select a folder, enter the name of the PDF file and click the Publish button. To display the PDF in 'doknir', open that folder, select the newly created PDF, right-click on it and select 'doknir' in the pop-up menu. Here is the first page of the above document, displayed in the 'doknir' VMware virtual appliance:

What kind of joke is this?! Aren't PDF documents portable? Ehm, no! Let's take a look at the Word "Save As PDF" options:

The option "ISO 19005-1 compliant (PDF/A)" seems interesting, so let's try it:

Aha, now it is OK. So what's the difference between the two PDF documents? If we look at the "Document Properties"->"Fonts" in Adobe Reader:

we can see that the fonts in the correct PDF document are embedded, and this is the main difference between PDF and PDF/A:

The constraints include:

  • Audio and video content are forbidden
  • JavaScript and executable file launches are prohibited
  • All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering
  • Colorspaces must be specified in a device-independent manner
  • Encryption is disallowed
  • Use of standards-based metadata is mandated
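
The font-embedding constraint from the list above can be checked roughly without Adobe Reader. The sketch below is a heuristic, not a PDF/A validator: it scans the raw bytes of a PDF for font dictionaries and for the /FontFile, /FontFile2 and /FontFile3 keys that point to embedded font programs. It assumes these objects are stored uncompressed; objects packed into compressed object streams would be missed. The function name font_summary is invented for this example.

```python
import re

# A font descriptor points to an embedded font program through one of these keys.
FONT_FILE_KEY = re.compile(rb"/FontFile[23]?\b")
# Font dictionaries declare /Type /Font; the \b keeps /FontDescriptor from matching.
FONT_DICT = re.compile(rb"/Type\s*/Font\b")


def font_summary(pdf_bytes):
    """Count font dictionaries and embedded font programs in raw PDF bytes."""
    return {
        "font_dicts": len(FONT_DICT.findall(pdf_bytes)),
        "embedded_font_programs": len(FONT_FILE_KEY.findall(pdf_bytes)),
    }


def summarize_file(path):
    """Convenience wrapper that reads a PDF from disk."""
    with open(path, "rb") as handle:
        return font_summary(handle.read())
```

A file that reports font dictionaries but zero embedded font programs is a likely candidate for the broken rendering shown earlier.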

At the end of this post, let's take a look at the new Word extension, .docx. The 'x' means that this is the (compressed) Open XML format, and I've found an excellent example with a lot of mathematical formulas. Below is the second page printed via 'doknir':

The quality of formulas in PDF files is still not perfect, but I hope Microsoft will correct this in the final version of Word 2007.
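
Since a .docx file is a ZIP-compressed package of XML parts, it can be inspected with any ZIP library. Here is a small sketch using Python's standard zipfile module. It assumes the main document part is stored as word/document.xml, which is the usual location in released Word 2007 files (beta builds may differ); the function names are made up for this example.

```python
import zipfile

# Usual location of the main document part (an assumption, see above).
MAIN_PART = "word/document.xml"


def list_docx_parts(source):
    """Return the names of all parts stored inside a .docx package."""
    with zipfile.ZipFile(source) as package:
        return package.namelist()


def read_main_document(source):
    """Return the raw XML text of the main document part."""
    with zipfile.ZipFile(source) as package:
        return package.read(MAIN_PART).decode("utf-8")
```

Both functions accept a file name or an open binary file object, so list_docx_parts("example.docx") is enough to see what Word actually wrote.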

In the future, we will write a VBA macro to automate Word printing via 'doknir' ...


1 comment:

Alex said...

There are many problems with MS Word files, but I once managed to solve all of them with a tool. I downloaded it from a software blog, and I hope it will help in many similar situations too: repair .docx document.