asp_components thread: Doing the reverse -- converting PDF to an XML document or string/character data?


Message #1 by "Butner, Robert S" <butner@B...> on Wed, 22 Aug 2001 09:36:08 -0700
The recent thread regarding on-the-fly generation of PDF files has led me to
make a renewed effort to see what is available to do the opposite conversion
-- from PDF files to plain text or marked-up text (e.g., HTML or better yet
XML) that I can then process further.

In our application, we are trying to extract information features from web
documents, to help us categorize and classify them better.  We use pretty
conventional spidering techniques to retrieve documents, but PDF files have
been a real problem since (to the best of my knowledge) they don't provide a
simple way to access their contents as character/string data.  So we need a
cost effective way to "break open" a PDF file to access the actual text
contents, any meta-data fields, and preferably even the mark-up and style
information (e.g., is a certain phrase a headline, a caption, or part of the
body text?).  In an ideal world, I would be able to convert a PDF file into a
well-defined XML document, which I could then "take apart" using an XML
parser and an understanding of the schema.

I've had a heckuva time trying to find "off the shelf" components to do
this, or even to allow me to recover the text of the document as string
data.  Any suggestions on components that might make this job easier would
be welcome.
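
To make the "take apart" step above concrete, here is a minimal sketch in
Python of what processing such an XML document could look like. No converter
or schema is named in this thread, so the element names (document, meta,
headline, caption, body) and the sample content are purely hypothetical
assumptions made for illustration; only the standard-library parser is real.

```python
import xml.etree.ElementTree as ET

# Hypothetical converter output. The schema (element and attribute names)
# is an assumption for illustration, not the format of any real tool.
SAMPLE = """<document>
  <meta name="title">Sample Report</meta>
  <headline>First Heading</headline>
  <caption>Figure 1: Example caption</caption>
  <body>First paragraph of body text.</body>
  <body>Second paragraph of body text.</body>
</document>"""


def extract_features(xml_text):
    """Split a converted document into metadata, headlines, captions, body."""
    root = ET.fromstring(xml_text)
    return {
        "meta": {m.get("name"): m.text for m in root.iter("meta")},
        "headlines": [h.text for h in root.iter("headline")],
        "captions": [c.text for c in root.iter("caption")],
        # Empty elements have text None, so fall back to an empty string.
        "body": " ".join((b.text or "") for b in root.iter("body")),
    }


features = extract_features(SAMPLE)
print(features["headlines"])
print(features["body"])
```

With the style information carried by element names, a classifier could
weight headline text differently from body text, which is the kind of
feature extraction described above.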



Scott Butner (butner@b...)
Senior Research Scientist, Environmental Technology Division
Pacific Northwest National Laboratory
MS K6-04
PO Box 999, Richland, WA  99352
(xxx)-xxx-xxxx  voice/(509)-372-4995 fax



Message #2 by Robert Illing <Robert.Illing@f...> on Thu, 23 Aug 2001 08:56:10 +0100
Is PDF a proprietary file format?  i.e.: Are you supposed to pay a license
fee to Adobe if you write any component that reads or writes the format?

Has anyone checked Adobe's website to see if there's a file format
specification?

Cheers,

Rob








Message #3 by "Tim Morford" <tmorford@n...> on Thu, 23 Aug 2001 07:10:00 -0400
Check out these links. They look pretty interesting. I have a job coming up
where I have to make PDFs on the fly, so I have been sniffing around a bit.

http://support.adobe.com/devsup/devsup.nsf/docs/51533.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51586.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51009.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51409.htm
http://support.adobe.com/devsup/devsup.nsf/docs/51281.htm
http://www.planetpdf.com/mainpage.asp?webpageid=907

Those are some of the links that I liked; I hope this helps.

Have fun and enjoy.


Message #4 by "Butner, Robert S" <butner@B...> on Thu, 23 Aug 2001 10:39:05 -0700
Rob --

Good question.  Actually, my reading of the nice set of links that Tim
Morford provided in a separate post to this list tells me that while the
document format and file specs are indeed proprietary, Adobe has set up
developer's license conditions that allow access to the PDF data via their
API without royalties (but of course you need to have their products
installed on the target machine(s)).

The file format specification does exist on the Adobe site, along with an
enormous amount of API documentation (the main API doc is more than 2700
pages!).  The index of all available SDK info is available at
http://partners.adobe.com/asn/developer/sdks.html

Given the complexity of the API and the ubiquity of PDF files, it's amazing
that there aren't more readily available tools (components) for extracting
text from the PDF files.  After all, there are a lot of limitations on what
can be done with the text within the file, in its native format.

Hopefully someone out there will have a component that shields me from
having to tear into the Adobe API.





SB

Scott Butner (butner@b...)
Senior Research Scientist, Environmental Technology Division
Pacific Northwest National Laboratory
MS K6-04
PO Box 999, Richland, WA  99352
(xxx)-xxx-xxxx  voice/(509)-372-4995 fax
http://www.chemalliance.org/










Message #5 by "Tim Morford" <tmorford@n...> on Mon, 27 Aug 2001 21:21:16 -0400
Hey all, I have done some more snooping around and this is what I have come
up with so far. There is an OLE automation method called GetText. I cannot
copy from the PDF, but this is what it says:

GetText
	CString GetText(long nTextIndex);

Description
	Gets the text from the specified element of a text selection. To
	obtain all text in a text selection, use PDTextSelect.GetNumText to
	determine the number of elements in the text selection, then use
	this method in a loop to obtain each of the elements.

Parameters
	nTextIndex
		The element of the text selection to get.

Return Value
	The text, or an empty string if nTextIndex is greater than the
	number of elements in the text selection.

That was from
http://partners.adobe.com/asn/developer/acrosdk/docs/iacref.pdf

But this is all I have so far. I think this can be cracked with the right
minds at work. I know creating PDFs is fairly straightforward and somewhat
easy, but extracting text from them would be very cool!
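
The loop that the GetText description calls for might look roughly like the
Python sketch below. It relies only on the two method names quoted above
(GetNumText and GetText). Zero-based indexing is an assumption, and how to
obtain a real PDTextSelect object from Acrobat's OLE automation layer is not
shown. FakeSelection is a hypothetical stand-in so the loop can be run
without Acrobat installed.

```python
def collect_text(selection):
    """Concatenate every element of a text selection into one string.

    `selection` is any object exposing GetNumText() and GetText(index),
    the two methods named in the quoted documentation.
    """
    count = selection.GetNumText()
    # Assumes elements are indexed from 0 up to count - 1.
    return "".join(selection.GetText(i) for i in range(count))


class FakeSelection:
    """Hypothetical stand-in for a PDTextSelect object, for testing only."""

    def __init__(self, parts):
        self._parts = list(parts)

    def GetNumText(self):
        return len(self._parts)

    def GetText(self, index):
        # Mirrors the documented behaviour: empty string when out of range.
        if 0 <= index < len(self._parts):
            return self._parts[index]
        return ""


text = collect_text(FakeSelection(["Hello, ", "PDF ", "world"]))
print(text)
```

Because collect_text only depends on the two method names, the same function
should work unchanged if it is handed a real automation object instead of
the stand-in, provided the indexing assumption holds.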



Tim Morford


