Wrox Programmer Forums
XSLT General questions and answers about XSLT. For issues strictly specific to the book XSLT 1.1 Programmers Reference, please post to that forum instead.
Welcome to the p2p.wrox.com Forums.

You are currently viewing the XSLT section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com
 
Old October 19th, 2006, 12:21 PM
Authorized User
 
Join Date: Apr 2006
Posts: 51
PDF to XML conversion

I am working on a project that involves converting PDF and Quark files to XML files. The XML files should validate against the DocBook DTD or another specific DTD. Are there any tools or applications that can help me with this project?

Thanks,
Bill

 
Old October 19th, 2006, 12:51 PM
mhkay
Wrox Author
 
Join Date: Apr 2004
Posts: 4,962

I think I've said before that turning PDF into XML is like turning hamburgers into cows. You're probably best off printing the PDF, scanning it in using an OCR reader, and working from there.

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference
 
Old October 19th, 2006, 01:58 PM
Authorized User
 
Join Date: Apr 2006
Posts: 51

Michael,

The project's timeline and volume don't permit using an OCR reader. I am dealing with thousands of pages; can you please make some suggestions? There will be quality assurance to make sure no text is missing.

By the way, I do have a small project that might be doable with OCR. Can you please give more info or direct me to sources that might help me with the conversion?

Thanks,
Bill

 
Old October 19th, 2006, 02:42 PM
joefawcett
Wrox Author
 
Join Date: Jun 2003
Posts: 3,074

Well I agree with Michael, it's nigh on impossible.
For example, given a simple Hello World PDF, what would the XML look like?

--

Joe (Microsoft MVP - XML)
 
Old October 19th, 2006, 02:55 PM
joefawcett
Wrox Author
 
Join Date: Jun 2003
Posts: 3,074

Okay, I've had an idea.
There are some PDF to Word converters available, so the process would be:
PDF => Word
Word => XML, either via built-in functions or via OpenOffice
XML => DocBook XML via XSLT.

It won't be easy but it might work.
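
The last step of this pipeline would normally be an XSLT stylesheet. As a rough illustration of the kind of mapping involved, here is a minimal sketch in Python that uses the standard library's xml.etree.ElementTree instead of XSLT. The input element names (doc, heading, p) are hypothetical stand-ins for whatever the Word or OpenOffice export produces; real exports are far more verbose, and the output here is only DocBook-like, not validated against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML; real Word/OpenOffice exports look different.
SOURCE = """<doc>
  <heading>Introduction</heading>
  <p>Hello World</p>
  <heading>Details</heading>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</doc>"""


def to_docbook(source_xml: str) -> ET.Element:
    """Group a flat run of headings and paragraphs into DocBook-style sections."""
    article = ET.Element("article")
    section = None
    for node in ET.fromstring(source_xml):
        if node.tag == "heading":
            # Each heading starts a new <section> with a <title>.
            section = ET.SubElement(article, "section")
            ET.SubElement(section, "title").text = node.text
        elif node.tag == "p":
            # Paragraphs before any heading attach directly to the article.
            parent = section if section is not None else article
            ET.SubElement(parent, "para").text = node.text
    return article


docbook = to_docbook(SOURCE)
print(ET.tostring(docbook, encoding="unicode"))
```

The hard part in practice is that the intermediate XML rarely marks headings this cleanly, so most of the effort goes into deciding what counts as a heading in the first place.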


--

Joe (Microsoft MVP - XML)
 
Old October 19th, 2006, 04:37 PM
Registered User
 
Join Date: Oct 2006
Posts: 1

YOU MAY HAVE TO DO A RANDOM SEARCH OF THE WWW FOR OFF-THE-SHELF MASSIVE-VOLUME FILE CONVERSION UTILITIES THAT WOULD ALLOW YOU TO CONVERT VIRTUALLY ANY FILE FORMAT INTO ANY OTHER FILE FORMAT, AND MAINTAIN "FULL FORM-FACTOR-FORMAT", AND ACCOMPLISH THE LARGE SCALE PROJECTS, PRESTO, PRESTO!

YOU CAN BUY SPECIALIZED FILE CONVERSION UTILITIES.
YOU CAN BUY SHAREWARE DOWNLOADS.
YOU MAY DOWNLOAD FREEWARE.
LOOK INTO SOME DOWNLOAD WEBSITES:
http://www.cnet.com and others ~

BUILT-IN GUI-CONTROL-MENU-OPTIONS-FEATURES ARE VERY HELPFUL TO SEE AND VALIDATE FILES "FORM-FACTOR-FORMAT" CONVERSIONS.

HIGHLIGHT AND MOVE (OR WITH "COPY TO" YOU AT LEAST RETAIN ORIGINALS INTACT AND SAFE FROM INCIDENTAL DAMAGE) ALL CLUSTERED FILES, TO BE CONVERTED, TO A UNIQUE REPOSITORY FOLDER TO PREVENT DUPLICATIONS, ETC.

HERE NOW, ONE PRESUMES, YOU ARE NOT LOOKING FOR PDF-TO-XML FILE CONVERSION SYNTAX OR CODE. JUST SOME GOOD FILE CONVERSION UTILITIES THAT WILL HELP YOU ACCOMPLISH YOUR GOAL AT HAND.

FOR QUALITY CONTROL AND PROFICIENCY, WE ALWAYS REPEATEDLY TEST-AND-VALIDATE, SAMPLE, TEST-AND-VALIDATE, BENCHMARK REGARDING PRESERVATION OF CRITICAL DATA FORM-FACTOR-FORMAT, BEFORE DEPLOYING IN PROPORTIONAL AND WELL-MEASURED DOSAGES/INCREMENTS.

TO BRIEFLY ILLUSTRATE "INDIVIDUAL" FILE FORMAT CONVERSION CONCEPTS WITH WHICH WE HAVE EXPERIMENTED, ON AN AS-NEEDED BASIS, AND WITH WHICH YOU ARE PROBABLY FAMILIAR:

FILE-FORMAT SAVING OPTIONS:

SAVE-AS / SAVE
CONVERT AS / CONVERT TO ~~~ WEB FORMAT, ETC.
COPY / PASTE / IMPORT / EXPORT / ~~~ ETC.
(MANY STANDARDIZED / ROUTINIZED FEATURES IN MANY FILE CONVERSION APPLICATIONS)

WITH "SAVE AS" OPTION, ONE MAY BE ABLE TO

ZIP-TO-UNZIP,
UNZIP-TO-ZIP,
HTML-TO-TXT,
TXT-TO-HTML,
MHTML-TO-TXT,
PDF-TO-TXT,
TXT-TO-PDF,
XML-TO-TXT,
TXT-TO-XML,
PDF-TO-XML, (VALIDATION WITH DTD OR SPECIFIC DTD FILES)
XML-TO-PDF,
BMP-TO-JPG,
AND VICE VERSA
~ ~ ~
AND OTHER UP-AND-COMING NEW FILE FORMAT VERSIONS.

YOU GET THE PICTURE!

THE POINT HERE IS THAT YOU MUST ACQUIRE THE RIGHT AND PROPER SOFTWARE UTILITIES TOOLS TO ACCOMPLISH YOUR LARGE-SCALE FILES CONVERSIONS PROJECT!

TRY ANY OTHER ASYMMETRICAL GUERRILLA-WARFARE TACTICS THAT COULD HELP YOU ACCOMPLISH YOUR PROJECTS SUCCESSFULLY!

A LONG TIME AGO, I SAW SOME FILE CONVERSION UTILITIES/TOOLS AT
http://www.microsoft.com IN THE SDK / VIEWERS / CONVERTERS / TOOLS / INTERNET EXPLORER / HTML / XML EDITORS SECTION;
http://www.adobe.com IN THE SDK / READERS / VIEWERS / CONVERTERS / TOOLS SECTION, ETC.

FEEL FREE TO DRIVE DOWN TO YOUR LOCAL UNIVERSITY LIBRARY AND PC-SOFTWARE RETAIL STORES AND TAKE A LOOK AT WHAT OFF-THE-SHELF FILE CONVERSION UTILITIES MAY BE AVAILABLE AT REASONABLE PRICES. BROWSE, LOOK, READ BOX SPECIFICATIONS, PERUSE BOOKS, ARTICLES, EXPLAIN, EXCHANGE CONCEPTS, ETC.

I DON'T KNOW WHERE YOU ARE LOCATED, BUT IF YOU WERE TO BE IN ATLANTA (GA, USA), WE COULD HELP YOU GIFT-WRAP YOUR LARGE-SCALE PROJECT IN JUST A MATTER OF HOURS TO A FEW DAYS!

GOOD LUCK!

PS: IF ALL WORKS OUT SATISFACTORILY, FEEL FREE TO SEND ME A DECENT PORTION OF THE BONUS REWARD!

HAVE A GREAT DAY!

©2006-DiamondDaveUSA. All rights reserved.
 
Old January 6th, 2007, 11:27 PM
Registered User
 
Join Date: Jan 2007
Posts: 1

There IS a tool for conversion of PDF files to XML files.
We named it PDFConvertIt. It is Java based.
Normally you can use it as a rather sophisticated GUI application, but it is embeddable into Java applications, too.
At present it can be used for a fairly good preconversion of book-like PDF documents to DocBook-like XML documents. QA still has to be done, in any case.
Main features are:
- recognition of chapter hierarchy
- recognition of footnotes
- recognition of running headers and footers
- semantic recognition of paragraphs by layout (e.g. as "citation" or "figure caption")
- recognition of logical (not physical) page numbers (like "IV")
Planned features are:
- recognition of tables
- recognition of diagrams
At present, PDFConvertIt is NOT free or shareware. But IT WORKS.
Contact us, if you are interested: [email protected]
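
To give a feel for what recognising running headers and footers can involve, here is a minimal sketch in Python. It is not how PDFConvertIt works (the post does not say); it is one simple heuristic, assuming the text has already been extracted as a list of lines per page. The function names and the 60% threshold are illustrative choices.

```python
from collections import Counter


def find_running_lines(pages, min_share=0.6):
    """Return lines that occur on at least min_share of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_running_lines(pages, running):
    """Drop the detected running lines from every page."""
    return [[line for line in page if line.strip() not in running]
            for page in pages]


# Hypothetical extracted text: one list of lines per page.
PAGES = [
    ["My Book Title", "Chapter one begins here.", "Page text A."],
    ["My Book Title", "More text on page two."],
    ["My Book Title", "Final page text."],
]

running = find_running_lines(PAGES)
body = strip_running_lines(PAGES, running)
print(running)
print(body)
```

Real headers often embed a changing page number or chapter title, so exact matching like this is only a starting point; a practical tool would also need position on the page and some fuzzy comparison.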


 
Old September 21st, 2011, 06:27 AM
Registered User
 
Join Date: Sep 2011
Posts: 1
PDF to XML conversion

Hi,

I am Dinesh, and I work on BPO solutions in Australia. I suggest the following PDF to XML conversion process:
  • OCR using ABBYY FineReader 8, Gemini Tool
  • Entity replacement and insertion of formatting tags using an MS Word macro
  • Tag editing (e.g. image insertion and parsing) using Epsilon
  • Validation using XMLSpy
  • QC by viewing the output with a style sheet in Internet Explorer
  • Shipment to the client via FTP
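
As an illustration of the entity replacement step, here is a minimal sketch in Python rather than an MS Word macro. It assumes the goal is to escape XML markup characters and turn non-ASCII characters into numeric character references; the function name to_xml_text is made up for this example, and a DTD with named entities would need a different mapping.

```python
from xml.sax.saxutils import escape


def to_xml_text(text: str) -> str:
    """Escape markup characters, then replace non-ASCII characters
    with numeric character references."""
    # Escape &, < and > first so the references added next are not re-escaped.
    escaped = escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_xml_text("Café & Bar <©2006>"))
# prints: Caf&#233; &amp; Bar &lt;&#169;2006&gt;
```

The order matters: escaping after the character references were inserted would turn every "&#233;" into "&amp;#233;".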

This conversion process is provided by "InformaticsOutsourcing", an India-based data conversion company. They handle our outsourced XML conversion projects at affordable cost, to a top-notch standard, and with time savings, which is why I am suggesting this process.










Copyright (c) 2020 John Wiley & Sons, Inc.