How easy a PDF is to reverse-engineer varies a great deal, depending on how it was created. Sometimes it contains nothing but scanned images. There are tools for turning PDF into something more usable, but sometimes they give no better results than you could get by running OCR on the printed pages. Ask for some samples and do some experiments before you commit to your cost estimates. Preferably, get the source documents from which the PDF was produced; they will be much easier to convert.
As for the XML standard to use, I think it would be best to design your own. You could perhaps use MathML for the maths and base the rest on something like DocBook, but I suspect you'll have more flexibility if you define your own schema rather than trying to use something off the shelf.
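To give a rough idea of what mixing a home-grown vocabulary with MathML might look like, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element names (paper, title, para) and the http://example.com/ns/paper namespace are purely hypothetical placeholders for whatever schema you end up designing; only the MathML namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for a custom document vocabulary (an assumption,
# not a real standard); replace it with one you control.
DOC_NS = "http://example.com/ns/paper"
# The real MathML namespace, as defined by the W3C.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Ask ElementTree to use readable prefixes when serializing.
ET.register_namespace("doc", DOC_NS)
ET.register_namespace("m", MATHML_NS)


def build_sample() -> ET.Element:
    """Build a tiny document: custom elements with an embedded formula."""
    paper = ET.Element(f"{{{DOC_NS}}}paper")
    title = ET.SubElement(paper, f"{{{DOC_NS}}}title")
    title.text = "Sample paper"

    para = ET.SubElement(paper, f"{{{DOC_NS}}}para")
    para.text = "A trivial equation: "

    # The maths lives in the MathML namespace, inline in the paragraph.
    math = ET.SubElement(para, f"{{{MATHML_NS}}}math")
    row = ET.SubElement(math, f"{{{MATHML_NS}}}mrow")
    ET.SubElement(row, f"{{{MATHML_NS}}}mi").text = "x"
    ET.SubElement(row, f"{{{MATHML_NS}}}mo").text = "="
    ET.SubElement(row, f"{{{MATHML_NS}}}mn").text = "2"
    math.tail = "."  # text that follows the formula inside the paragraph
    return paper


def count_formulas(root: ET.Element) -> int:
    """Count MathML <math> elements anywhere below (not including) the root."""
    return len(root.findall(f".//{{{MATHML_NS}}}math"))


if __name__ == "__main__":
    print(ET.tostring(build_sample(), encoding="unicode"))
```

Keeping the maths in its own namespace means you can change your own element names freely without touching the formulas, and MathML-aware tools can still pick them out.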
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference