PDF2Office is an application designed to convert PDF documents back into
fully editable MSWord, RTF, AppleWorks, or HTML files. This is a rather challenging
task, considering that a PDF document contains no original document formatting information.
To accomplish this, PDF2Office needs to make a lot of educated guesses based on the
page rendering instructions contained in the PDF source.
To bring back a bit of history... Remember that buzz of the paperless office?
Ironic, isn't it, considering that the use of computers has often led to an increase
in the use of paper. Portable Document Format, or PDF for short, started off on the
dream of a paperless office, as the pet project of one of Adobe's founders, John
Warnock. Initially it was an internal project at Adobe to create a file format
so documents could be spread throughout the company and displayed on any computer
using any operating system. In the grand vision of John Warnock, it was a format
that would enable sending full text and graphics documents (newspapers, magazine
articles, technical manuals etc.) over electronic mail distribution networks.
These documents could be viewed on any machine and any selected document could be
printed locally. This capability would truly change the way information is
managed. However, as with many grand visions, the reality fell a bit short
and to the left of the mark. But not by much!
At the time, Adobe already had the two basic technologies: PostScript as a device
and platform independent technology to describe documents, and Adobe Illustrator
as an example of an application that ran on Windows and Mac and could open and visualize
fairly simple PostScript files. Technically, PostScript is known as a "page-description
language." Files that contain documents described in the PostScript (PS)
language are normally called "PostScript files". PostScript really
embodies two different concepts - a file format as well as a programming language.
It has conditionals, loops, subprograms, and other features normally
associated with programming languages. The PDF format evolved
from this point of departure, but was optimized for rendering the document on the
screen. It lost a lot of the programming aspect of PostScript, which made the
structure of PDF files much more predictable than that of PostScript, making it easier
to modify or extract information from a document.
PDF reproduces the documents almost precisely as they were originally composed, provides
built-in compression, is supported by all popular operating systems and is compatible
with most printers. The freely available Adobe Acrobat Reader is required to
view, print and search PDF documents. PDF is both device and resolution independent.
PDF files tend to be smaller than the corresponding PS files, so they can be
handled more efficiently - smaller files take less time to send
across a network and they take up less space. PDF can do lots of things that PostScript
can't do. PDF files can be viewed on the Web (with the proper software), whereas
PS files normally cannot. A PDF file can contain links to locations within
the same PDF file, within other PDF files, or on the Web; a PostScript file normally
does not contain links. A PDF file can function as a data-entry form but a
PostScript file cannot. There are lots of other differences, too. For
these reasons and more, PDF became a replacement for PostScript in many situations.
Most laser printers and image setters understand the PostScript language. The
Adobe Acrobat Distiller (a tool designed to convert PostScript files to PDF) also
understands the PostScript language. Over the years, Adobe opened the PDF file format
specification and made PDF available to anyone who wants to develop tools to create,
view, or manipulate PDF documents.
To illustrate, here is an example of what a PostScript file looks like (PDF files
have a similar look and feel):
/Helvetica findfont 12 scalefont setfont
72 648 moveto
(This text is 1 inch from the left edge of the page and 9 inches from the bottom.)show
This PostScript file describes a one-page document with one line of text in twelve-point
Helvetica type that is positioned one inch from the left edge of the page and nine
inches from the bottom of the page. From this example, you can see the challenge
PDF2Office faces. It needs to interpret the page description metaphors of PDF
to construct an editable document.
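To make the challenge concrete, here is a minimal sketch in Python of the kind of educated guess a converter has to make. It is not PDF2Office's actual algorithm, which the vendor does not describe. The Fragment type, the group_into_lines function and the 2-point tolerance are all hypothetical names and values chosen for illustration. The only fact taken from PostScript itself is that the default unit is 1/72 of an inch, which is why 72 and 648 in the example above mean 1 inch and 9 inches.

```python
from dataclasses import dataclass

# PostScript and PDF default user space: 72 units ("points") per inch.
POINTS_PER_INCH = 72


@dataclass
class Fragment:
    """A piece of text placed at an absolute position (hypothetical type)."""
    x: float  # points from the left edge of the page
    y: float  # points from the bottom edge of the page
    text: str


def to_inches(points: float) -> float:
    """Convert a coordinate in points to inches."""
    return points / POINTS_PER_INCH


def group_into_lines(fragments, tolerance=2.0):
    """Guess text lines: fragments whose baselines lie within
    `tolerance` points of each other are assumed to share a line."""
    lines = []  # list of (baseline_y, [fragments on that line])
    # Larger y is higher on the page, so sort top to bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    # Within each line, read the fragments left to right.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


fragments = [
    Fragment(150, 648, "is positioned"),
    Fragment(72, 648, "This text"),
    Fragment(72, 634, "on two lines."),
]
print(to_inches(72), to_inches(648))  # 1.0 9.0
print(group_into_lines(fragments))    # ['This text is positioned', 'on two lines.']
```

Even this toy version has to pick a tolerance out of thin air. Columns, tables, formulas and headers each need further guesses of the same kind, which is where a real converter can easily go wrong.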
PDF2Office is intended to convert PDF documents into fully editable MSWord, RTF,
AppleWorks, HTML, etc., files re-creating the original construct and layout of the
document - forming paragraphs, applying styles, re-grouping independent graphics
elements, extracting images, creating tables, processing headers/footers, endnotes/footnotes
and columns/sections automatically, i.e. without any intervention from the end user.
It provides options for converting a range of pages in a PDF document into word processing
formats and popular image types such as JPEG, Photoshop, PNG, TIFF, etc. PDF2Office
also provides a batch conversion facility for converting many files at once simply
by targeting the folder they are in. It handles multi-language data (English, Japanese,
Chinese, Korean, and Western European languages) contained in PDF documents.
The value proposition of PDF2Office is that it is a standalone utility that eliminates
the need to acquire and install additional PDF editing software and tools such
as Adobe Acrobat, saving both time and money.
PDF2Office is a product of Recosoft Corporation of Osaka, Japan. PDF2Office
came to me on a CD. Installation was straightforward - simply
drag the application folder anywhere on the hard disk and I was ready to go.
The CD contained a comprehensive Users Guide and a PowerPoint-based tutorial.
At the time of the review, PDF2Office was at rev 1.0 and I applied the recommended
updates to the filters from the Recosoft web site to bring the rev. level to 1.0.3.
It is designed to run under OS X 10.2 and higher on at least a 300 MHz G3.
A single user license is listed at $129.00 and the education single user license
is listed at $89.00. No volume discount or maintenance fee structure is mentioned.
When you first launch PDF2Office, it presents a well-designed, clean user interface.
The interface consists of the toolbar, the conversion pane into which you drag and
drop the documents to be converted, the conversion-setting bar, the preview pane
and the preview controls.
To test PDF2Office, I chose a 15-page scientific report containing a mix of single-column
text, headers, footers, mathematical formulas, graphics, and tables - probably the
most difficult kind of document I can think of. The conversion process is simple
- drag and drop PDF files into the conversion pane, set the conversion type and the
final format to convert to, and hit the convert button.
Dropping harrison.pdf into PDF2Office
Alternatively, you can simply select one or more files from the conversion pane and drag them onto the desktop.
After a short delay, the corresponding .doc document appears on the desktop and the
conversion log is presented.
PDF2Office conversion log
I then opened the output file in MS Word to see the result. Interesting... After trying several different
input files with progressively decreasing levels of complexity, I came to the conclusion
that the only type of input files worth trying to convert was straight text with
nothing more elaborate than a bulleted or numbered list. Anything more complex
than that, such as tables, mathematical formulas, graphics, headers and footers came
out either completely jumbled or missing. For instance, click here to see what the corresponding
page of the harrison.pdf file looks like.
Some of the problems I ran into are probably due to the fact that MS Word is not
a page layout program. If the original document was designed using a page layout
tool, then it is very difficult to reconstruct such documents using the facilities
of a word processor.
A few points about performance. While PDF2Office takes only a few seconds
to convert a 10 or 15 page document, a long document such as the PDF Reference guide,
which contains 1172 pages, took a tad less than 2 hours to complete on an 867 MHz Quicksilver
G4 with 640 MB of RAM. However, it did complete! While it was crunching
away, no progress bar was presented, though. Just the usual rolling Aqua barbershop
"any day now" indicator. In the end, it did produce a 951 page MS
Word output file that contained the entire document through its last page.
Its content was a mix of blocks of text, unconverted PDF tokens, broken
formulas, fragments of graphics and stray characters.
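For a rough sense of scale, the figures above can be turned into a per-page rate. The sketch below is back-of-the-envelope arithmetic only. It uses the full 2 hours as an upper bound, since the run took "a tad less" than that, and the function name is just an illustrative choice.

```python
def conversion_rate(pages: int, elapsed_seconds: float):
    """Return (seconds per page, pages per minute) for a conversion run."""
    seconds_per_page = elapsed_seconds / pages
    pages_per_minute = pages / (elapsed_seconds / 60)
    return seconds_per_page, pages_per_minute


# 1172 pages in just under 2 hours; 2 hours is used as an upper bound.
sec_per_page, pages_per_min = conversion_rate(1172, 2 * 60 * 60)
print(f"{sec_per_page:.1f} s/page at most")          # 6.1 s/page at most
print(f"{pages_per_min:.1f} pages/minute at least")  # 9.8 pages/minute at least

# At that rate a 15-page document would need about a minute and a half.
print(f"{15 * sec_per_page:.0f} s for 15 pages")     # 92 s for 15 pages
```

Since the 15-page test document actually converted in a few seconds, the long run was considerably slower per page, which suggests that conversion time does not grow in simple proportion to page count.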
PDF2Office is a PDF document conversion and data extraction tool. It is intended to convert
PDF documents into fully editable MSWord, RTF, AppleWorks, or HTML files, re-creating
the original construct and layout of the document - forming paragraphs, applying
styles, re-grouping independent graphics elements, extracting images, creating tables,
processing headers, footers, footnotes and columns with minimal or no intervention.
PDF2Office provides options for converting a range of pages in a PDF document into
word processing formats and popular image types such as JPEG, Photoshop, PNG, TIFF,
etc. It offers the capability of extracting images from specific pages within
a PDF document. PDF2Office also provides a batch conversion facility for converting
many files at once. To facilitate the conversion process PDF2Office provides
a layout preview and navigation of a PDF document within the application itself -
allowing the user to identify the pages to extract.
PDF2Office produces good output for single-column blocks of text of minimal formatting
complexity. However, conversion of more complex documents falls short of expectations.
In other words, the success of the conversion depends on its purpose.
If the purpose is to produce a document suitable for re-editing, then the output
is not particularly useful. It would take a significant amount of effort to
fix and proof the document manually to the point where any edits could then be
applied. However, if the purpose is to extract a portion of the document
containing simple text with minimal formatting, then PDF2Office does a reasonably
good job of it.
In contrast, Adobe's free Acrobat Reader contains the Text Select (V) and
Graphic Select (G) tools. These tools allow one to select portions of the PDF
document and copy and paste the selection into any other text-based tool. While
this capability is far more rudimentary than what PDF2Office offers, in a pinch it
does an acceptable job and you cannot beat the price (free).
For doing simple text copies, I recommend using Acrobat Reader. For performing editable
conversions of PDF documents, I do recommend PDF2Office for getting a decent head
start, but expect to perform manual formatting and error correcting.
Pros:
- Good user interface
- Good performance
- Stable, even when working with very large PDF files
- Good output quality for blocks of text of minimal formatting complexity
Cons:
- PDF files containing multiple columns, formulas, graphics, tables, and other more complex document features require a significant amount of troubleshooting, manual regeneration and proofing
- No progress indicator
- Does not output in the MS PowerPoint format
- Rather expensive considering Adobe's free Acrobat Reader text and graphics extract tools
3 out of 5 Mice