asp_components thread: Doing the reverse -- converting PDF to an XML document or string/character data?
Message #1 by "Butner, Robert S" <butner@B...> on Wed, 22 Aug 2001 09:36:08 -0700
The recent thread regarding on-the-fly generation of PDF files has led me to
make a renewed effort to see what is available to do the opposite conversion
-- from PDF files to plain text or marked-up text (e.g., HTML or better yet
XML) that I can then process further.
In our application, we are trying to extract information features from web
documents, to help us categorize and classify them better. We use pretty
conventional spidering techniques to retrieve documents, but PDF files have
been a real problem since (to the best of my knowledge) they don't provide a
simple way to access their contents as character/string data. So we need a
cost-effective way to "break open" a PDF file to access the actual text
contents, any meta-data fields, and preferably even the mark-up and style
information (e.g., is a certain phrase a headline, a caption, or part of the
body text?). In an ideal world, I would be able to convert a PDF file into a
well-defined XML document, which I could then "take apart" using an XML
parser and an understanding of the schema.
I've had a heckuva time trying to find "off the shelf" components to do
this, or even to allow me to recover the text of the document as string
data. Any suggestions on components that might make this job easier would
be welcome.
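To make the "ideal world" workflow above concrete, here is a minimal sketch in Python of the second half of the job: wrapping already-extracted text in a small XML document and then taking it apart again with an XML parser. It assumes some other tool has already pulled the text and meta-data out of the PDF (that step is not shown). The element and attribute names (`document`, `meta`, `field`, `body`, `block`, `role`) are a hypothetical schema made up for illustration, not any PDF or Adobe standard.

```python
import xml.etree.ElementTree as ET


def build_document_xml(metadata, blocks):
    """Wrap extracted metadata and (role, text) blocks in a small XML tree.

    The element and attribute names are hypothetical, chosen for this sketch.
    """
    root = ET.Element("document")
    meta = ET.SubElement(root, "meta")
    for name, value in metadata.items():
        field = ET.SubElement(meta, "field", name=name)
        field.text = value
    body = ET.SubElement(root, "body")
    for role, text in blocks:
        block = ET.SubElement(body, "block", role=role)
        block.text = text
    return ET.tostring(root, encoding="unicode")


def blocks_with_role(xml_string, role):
    """Parse the XML and return the text of every block with the given role."""
    root = ET.fromstring(xml_string)
    return [b.text for b in root.iter("block") if b.get("role") == role]


# Example data standing in for the output of a PDF extraction step.
sample_xml = build_document_xml(
    {"title": "Sample report"},
    [
        ("headline", "Results & discussion"),
        ("body", "Some body text."),
        ("caption", "Figure 1"),
    ],
)
headlines = blocks_with_role(sample_xml, "headline")
```

Tagging each block with a role is one possible way to keep the headline/caption/body distinction asked about above; whether a given extraction tool can supply that information is a separate question.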
Scott Butner (butner@b...)
Senior Research Scientist, Environmental Technology Division
Pacific Northwest National Laboratory
MS K6-04
PO Box 999, Richland, WA 99352
(xxx)-xxx-xxxx voice/(509)-372-4995 fax
Message #2 by Robert Illing <Robert.Illing@f...> on Thu, 23 Aug 2001 08:56:10 +0100
Is PDF a proprietary file format? i.e.: Are you supposed to pay a license
fee to Adobe if you write any component that reads or writes the format?
Has anyone checked Adobe's website to see if there's a file format
specification?
Cheers,
Rob
-----Original Message-----
>Subject: Doing the reverse -- converting PDF to an XML document or
>string/character data?
>From: "Butner, Robert S" <butner@B...>
>Date: Wed, 22 Aug 2001 09:36:08 -0700
>X-Message-Number: 6
>
>The recent thread regarding on-the-fly generation of PDF files has led me
>to make a renewed effort to see what is available to do the opposite
>conversion -- from PDF files to plain text or marked-up text
>(e.g., HTML or better yet XML) that I can then process further.
Message #3 by "Tim Morford" <tmorford@n...> on Thu, 23 Aug 2001 07:10:00 -0400
Check out these links; they look pretty interesting. I have a job coming up
where I have to make PDFs on the fly, so I have been sniffing around a bit.
http://support.adobe.com/devsup/devsup.nsf/docs/51533.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51586.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51009.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51409.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51281.htm
http://www.planetpdf.com/mainpage.asp?webpageid=907
These are some of the links that I liked. I hope this helps.
Have fun and enjoy.
Message #4 by "Butner, Robert S" <butner@B...> on Thu, 23 Aug 2001 10:39:05 -0700
Rob --
Good question. Actually, my reading of the nice set of links that Tim
Morford provided in a separate post to this list tells me that while the
document format and file specs are indeed proprietary, Adobe has set up
developer's license conditions that allow access to the PDF data via their
API without royalties (but of course you need to have their products
installed on the target machine(s)).
The file format specification does exist on the Adobe site, along with an
enormous amount of API documentation (the main API doc is more than 2700
pages!). The index of all available SDK info is available at
http://partners.adobe.com/asn/developer/sdks.html
Given the complexity of the API and the ubiquity of PDF files, it's amazing
that there aren't more readily available tools (components) for extracting
text from the PDF files. After all, there are a lot of limitations on what
can be done with the text within the file, in its native format.
Hopefully someone out there will have a component that shields me from
having to tear into the Adobe API.
SB
Scott Butner (butner@b...)
Senior Research Scientist, Environmental Technology Division
Pacific Northwest National Laboratory
MS K6-04
PO Box 999, Richland, WA 99352
(xxx)-xxx-xxxx voice/(509)-372-4995 fax
http://www.chemalliance.org/
Message #5 by "Tim Morford" <tmorford@n...> on Mon, 27 Aug 2001 21:21:16 -0400
Hey all, I have done some more snooping around and this is what I have come
up with so far. There is an OLE Automation method called GetText. I
cannot copy from the PDF, but this is what it says.
GetText
CString GetText(long nTextIndex);
Description
Gets the text from the specified element of a text selection. To obtain all
text in a text selection, use PDTextSelect.GetNumText to determine the
number of elements in the text selection, then use this method in a loop to
obtain each of the elements.
Parameters
nTextIndex
The element of the text selection to get.
Return Value
The text, or an empty string if nTextIndex is greater than the number of
elements in the text selection.
that was from
http://partners.adobe.com/asn/developer/acrosdk/docs/iacref.pdf
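The loop described in that reference text (call GetNumText, then call GetText once per element) can be sketched roughly as follows in Python. This is only an illustration of the pattern, not tested against Acrobat: `FakeTextSelect` is a hypothetical in-memory stand-in for a real PDTextSelect object, the zero-based indexing is an assumption, and so is the idea that elements can simply be concatenated without adding separators.

```python
def collect_selection_text(text_select):
    """Concatenate every element of a text selection.

    `text_select` is any object exposing GetNumText() and GetText(index),
    the two methods named in the quoted documentation.
    """
    count = text_select.GetNumText()
    # Assumption: element indexes run from 0 to count - 1.
    return "".join(text_select.GetText(i) for i in range(count))


class FakeTextSelect:
    """Hypothetical stand-in so the loop can run without Acrobat installed."""

    def __init__(self, elements):
        self._elements = list(elements)

    def GetNumText(self):
        return len(self._elements)

    def GetText(self, index):
        # Mirror the documented behavior: empty string when out of range.
        if 0 <= index < len(self._elements):
            return self._elements[index]
        return ""


selection = FakeTextSelect(["Hello, ", "PDF ", "world"])
full_text = collect_selection_text(selection)
```

With a real Acrobat installation the same function would be handed the automation object instead of the fake one; how to create that object and build a text selection for a page is covered in the reference above and is not shown here.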
But this is all I have so far. I think this can be cracked with the right
minds at work. I know creating them is fairly straightforward and somewhat
easy, but extracting them would be very cool!
Tim Morford