PDF2Office 1.0.3, by Recosoft
Posted: 17-Jul-2004

3 out of 5 Mice

Vendor: Recosoft Type: COMMERCIAL

Reviewer: Alex Levinson Class: PRODUCTIVITY

Overview
PDF2Office is an application designed to convert PDF documents back into fully editable MSWord, RTF, AppleWorks, or HTML files.  This is a rather challenging task, considering that a PDF document contains no original document formatting information.  To accomplish this, PDF2Office needs to make a lot of educated guesses based on the page rendering instructions contained in the PDF source.

To bring back a bit of history... Remember the buzz about the paperless office?  Ironic, isn't it, considering that the use of computers has often led to an increase in the use of paper.  Portable Document Format, or PDF for short, started off on the dream of a paperless office, as the pet project of one of Adobe's founders, John Warnock.  Initially it was an internal project at Adobe to create a file format so documents could be spread throughout the company and displayed on any computer using any operating system.  In the grand vision of John Warnock, it was a format that would enable sending full text and graphics documents (newspapers, magazine articles, technical manuals etc.) over electronic mail distribution networks.  These documents could be viewed on any machine and any selected document could be printed locally.  This capability would truly change the way information is managed.  However, like many grand visions, the reality fell a bit short and to the left of the mark.  But not by much!

At the time, Adobe already had the two basic technologies: PostScript as a device- and platform-independent technology to describe documents, and Adobe Illustrator as an example of an application that ran on Windows and Mac and could open and visualize fairly simple PostScript files.  Technically, PostScript is known as a "page-description language."  Files that contain documents described in the PostScript (PS) language are normally called "PostScript files".  PostScript really embodies two different concepts - a file format as well as a programming language.  It has conditional and looping constructs, subprograms and other features associated with programming languages.  The PDF format evolved from this point of departure, but was optimized for rendering the document on the screen.  It lost a lot of the programming aspect of PostScript, which made the structure of PDF files much more predictable than that of PostScript, making it easier to modify or extract information from a document.
 
PDF reproduces documents almost precisely as they were originally composed, provides built-in compression, is supported by all popular operating systems and is compatible with most printers.  The format is both device and resolution independent.  The freely available Adobe Acrobat Reader is required to view, print and search PDF documents.
 
PDF files tend to be smaller than the corresponding PS files, so they can be handled more efficiently - smaller files take less time to send across a network and take up less space.  PDF can do lots of things that PostScript can't do.  PDF files can be viewed on the Web (with the proper software), whereas PS files normally cannot.  A PDF file can contain links to locations within the same PDF file, within other PDF files, or on the Web; a PostScript file normally does not contain links.  A PDF file can function as a data-entry form but a PostScript file cannot.  There are lots of other differences, too.  For these reasons and more, PDF became a replacement for PostScript in many situations.
 
Most laser printers and imagesetters understand the PostScript language.  The Adobe Acrobat Distiller (a tool designed to convert PostScript files to PDF) also understands the PostScript language.  Over the years, Adobe opened the PDF file format specification and made PDF available to anyone who wanted to develop tools to create, view, or manipulate PDF documents.
 
To illustrate, here is an example of what a PostScript file looks like (PDF files have a similar look and feel):

%!PS-Adobe-2.0
/Helvetica findfont 12 scalefont setfont
72 648 moveto
(This text is 1 inch from the left edge of the page and 9 inches from the bottom.)show
showpage
%%EOF

 
This PostScript file describes a one-page document with one line of text in twelve-point Helvetica type that is positioned one inch from the left edge of the page and nine inches from the bottom of the page. From this example, you can see the challenge PDF2Office faces.  It needs to interpret the page description metaphors of PDF to construct an editable document.
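To make that challenge more concrete, here is a minimal Python sketch of the kind of educated guess a converter has to make.  It is an illustration only, not a description of how PDF2Office works internally; the `TextRun` type, the `group_into_lines` function and the 2-point tolerance are hypothetical choices made for this example.  The one firm fact it relies on is that PostScript and PDF measure positions in points, 72 to the inch, which is why `72 648 moveto` places the text 1 inch from the left and 9 inches from the bottom.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72  # default PostScript/PDF unit: 72 points per inch


@dataclass
class TextRun:
    """A piece of text placed at an absolute position (hypothetical type)."""
    x: float  # distance from the left edge of the page, in points
    y: float  # distance from the bottom edge of the page, in points
    text: str


def points_to_inches(points: float) -> float:
    """Convert a coordinate in points to inches."""
    return points / POINTS_PER_INCH


def group_into_lines(runs, tolerance=2.0):
    """Guess which runs share a line: baselines within `tolerance` points."""
    lines = []  # each entry is [baseline_y, [runs on that line]]
    for run in sorted(runs, key=lambda r: -r.y):  # top of the page first
        if lines and abs(lines[-1][0] - run.y) <= tolerance:
            lines[-1][1].append(run)
        else:
            lines.append([run.y, [run]])
    # Within each line, read the runs from left to right.
    return [
        " ".join(r.text for r in sorted(members, key=lambda r: r.x))
        for _, members in lines
    ]


# The page only says where each fragment goes, not which ones belong together.
runs = [
    TextRun(150, 648, "world"),
    TextRun(72, 648, "Hello"),
    TextRun(72, 634, "Second line"),
]
print(points_to_inches(72), points_to_inches(648))  # 1.0 9.0
print(group_into_lines(runs))  # ['Hello world', 'Second line']
```

Even this toy version has to pick a tolerance and a reading order.  Tables, multiple columns and formulas break such simple assumptions quickly, which hints at why complex layouts are so much harder to reconstruct than plain text.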
 
Key Features
PDF2Office is intended to convert PDF documents into fully editable MSWord, RTF, AppleWorks, HTML and other files, re-creating the original construct and layout of the document - forming paragraphs, applying styles, re-grouping independent graphics elements, extracting images, creating tables, and processing headers/footers, endnotes/footnotes and columns/sections automatically, i.e. without any intervention from the end user.  It provides options for converting a range of pages in a PDF document into word processing formats and popular image types such as JPEG, Photoshop, PNG, TIFF, etc.  PDF2Office also provides a batch conversion facility for converting many files at once simply by targeting the folder they are in.  It handles English, Japanese, Chinese, Korean and Western European-language text contained in PDF documents.
 
The value proposition of PDF2Office is that it is a standalone utility: it eliminates the need to acquire and install additional PDF editing software and tools such as Adobe Acrobat, saving both time and money.
 
Installation
PDF2Office is a product of Recosoft Corporation of Osaka, Japan.  PDF2Office came to me on a CD.  Installation was straightforward - I simply dragged the application folder onto the hard disk and was ready to go.  The CD contained a comprehensive Users Guide and a PowerPoint-based tutorial.  At the time of the review, PDF2Office was at rev. 1.0, and I applied the recommended filter updates from the Recosoft web site to bring the rev. level to 1.0.3.  It is designed to run under OS X 10.2 and higher on at least a 300 MHz G3.  A single user license is listed at $129.00 and the education single user license is listed at $89.00.  No volume discount or maintenance fee structure is mentioned.
 
In Use
When you first launch PDF2Office, it presents a well-designed, clean user interface.  The interface consists of the toolbar, the conversion pane into which you drag and drop the documents to be converted, the conversion-setting bar, the preview pane and the preview controls.


PDF2Office Interface

To test PDF2Office, I chose a 15-page scientific report containing a mix of single-column text, headers, footers, mathematical formulas, graphics, and tables - probably the most difficult kind of document I can think of.  The conversion process is simple - drag and drop PDF files into the conversion pane, set the conversion type and the final format to convert to, and hit the convert button.


Dropping harrison.pdf into PDF2Office

Alternatively, you can simply select one or more files from the conversion pane and drag them onto the desktop.  After a short delay, the corresponding .doc document appears on the desktop and the conversion log is presented.


PDF2Office conversion log

I then opened the output file in MS Word to see the result.  Interesting... After trying several different input files with progressively decreasing levels of complexity, I came to the conclusion that the only type of input file worth trying to convert was straight text with nothing more elaborate than a bulleted or numbered list.  Anything more complex than that, such as tables, mathematical formulas, graphics, headers and footers, came out either completely jumbled or missing.  The converted harrison.pdf test document was a case in point.
 
Some of the problems I ran into are probably due to the fact that MS Word is not a page layout program.  If the original document was designed using a page layout tool, then it is very difficult to reconstruct it using the facilities of a word processor.
 
A few points about performance.  While PDF2Office takes only a few seconds to convert a 10 or 15 page document, a long document such as the PDF Reference guide, which contains 1172 pages, took a tad less than 2 hours to complete on an 867 MHz Quicksilver G4 with 640 MB of RAM.  However, it did complete!  While it was crunching away, no progress bar was presented, though.  Just the usual rolling Aqua barbershop "any day now" indicator.  In the end, it did produce a 951 page MS Word output file that contained the entire document through its last page.  Its content was a mix of blocks of text, unconverted PDF tokens, broken formulas, fragments of graphics and stray characters.
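As a rough back-of-the-envelope check on those figures, the short Python sketch below works out the average conversion time per page for the long document.  The 2-hour figure is an upper bound, since the run took slightly less, and the function name is only an illustrative choice.

```python
def seconds_per_page(pages: int, hours: float) -> float:
    """Average conversion time per page, in seconds."""
    return hours * 3600 / pages


# 1172-page PDF Reference, just under 2 hours on the 867 MHz G4
print(round(seconds_per_page(1172, 2.0), 1))  # 6.1
```

That is roughly 6 seconds per page on that machine.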

 
Summary
PDF2Office is a PDF document conversion and data extraction tool.  It is intended to convert PDF documents into fully editable MSWord, RTF, AppleWorks, or HTML files, re-creating the original construct and layout of the document - forming paragraphs, applying styles, re-grouping independent graphics elements, extracting images, creating tables, and processing headers, footers, footnotes and columns with minimal or no intervention.  PDF2Office provides options for converting a range of pages in a PDF document into word processing formats and popular image types such as JPEG, Photoshop, PNG, TIFF, etc.  It offers the capability of extracting images from specific pages within a PDF document.  PDF2Office also provides a batch conversion facility for converting many files at once.  To facilitate the conversion process, PDF2Office provides a layout preview and navigation of a PDF document within the application itself - allowing the user to identify the pages to extract.
 
PDF2Office produces good output for single-column blocks of text of minimal formatting complexity.  However, conversion of more complex documents falls short of expectations.  In other words, the success of the conversion depends on what you want from it.  If the purpose is to produce a complete document suitable for re-editing, then the output is not particularly useful.  It would take a significant amount of manual effort to fix and proof the document to the point where any edits could be applied.  However, if the purpose is to extract a portion of the document containing simple text with minimal formatting, then PDF2Office does a reasonably good job of it.
 
In contrast, Adobe's free Acrobat Reader contains the Text Select (V) and Graphic Select (G) tools.  These tools allow one to select portions of the PDF document and to Copy-Paste the selection into any other text-based tool.  While this capability is far more rudimentary than what PDF2Office offers, in a pinch it does an acceptable job and you cannot beat the price (free).

For doing simple text copies, I recommend using Acrobat Reader. For performing editable conversions of PDF documents, I do recommend PDF2Office for getting a decent head start, but expect to perform manual formatting and error correcting.



Pros

  • Good user interface
  • Good performance
  • Stable, even when working with very large PDF files
  • Good output quality for blocks of text of minimal formatting complexity

Cons

  • PDF files containing multiple columns, formulas, graphics, tables, and other more complex document features require a significant amount of troubleshooting, manual regeneration and proofing
  • No progress indicator
  • Does not output in the MS PowerPoint format
  • Rather expensive considering Adobe's free Acrobat Reader text and graphics extraction tools


Overall Rating

3 out of 5 Mice