In the end, I implemented a libcurl interface with a SAX parser to construct an object-oriented DOM. libcurl handled HTTP/HTTPS, cookies, and the other HTTP requirements. The DOM was built from the SAX callback functions (begin, character, end). DOM construction was somewhat slow, so I needed several shortcuts: for example, using meta-references into a buffer/cache instead of keeping separate copies of each segment, and limiting tag collections. Handling HTML with syntax errors required additional techniques (stack recovery rules and precedences). The final results were acceptable, but there is still room for improvement.
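For anyone who wants to see the general shape of this, here is a rough sketch in Python. It is not the original libcurl-based code. It uses the standard library's `html.parser.HTMLParser`, whose start/data/end handlers play the role of the SAX begin/character/end callbacks. The names `Node`, `DomBuilder`, `build_dom`, `node_text`, and `VOID_TAGS` are all made up for illustration. Text is stored as (start, end) offsets into the original buffer rather than as copies, and the single recovery rule shown (pop back to the nearest matching open tag, ignore stray end tags) is only an assumed example of a stack recovery rule, not the actual rule set described above.

```python
from html.parser import HTMLParser

# Hypothetical subset of tags that never take an end tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """DOM node that references text in the shared buffer instead of copying it."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_spans = []  # list of (start, end) offsets into the buffer


class DomBuilder(HTMLParser):
    """Builds a Node tree from event callbacks, keeping a stack of open elements."""

    def __init__(self, buffer):
        # convert_charrefs=False keeps text data raw, so its length matches the buffer.
        super().__init__(convert_charrefs=False)
        # Offsets at which each line starts, used to turn (line, column) into an index.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(buffer) if ch == "\n"]
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):  # "begin" callback
        node = Node(tag, parent=self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_data(self, data):  # "character" callback
        # getpos() gives (1-based line, 0-based column) of the current token.
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        self.stack[-1].text_spans.append((start, start + len(data)))

    def handle_endtag(self, tag):  # "end" callback
        # Recovery rule: find the nearest open element with this tag and
        # implicitly close everything above it. The root is never popped.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                return
        # No matching open element: treat it as a stray end tag and ignore it.


def build_dom(html):
    """Parse the whole buffer in one pass and return the document root."""
    builder = DomBuilder(html)
    builder.feed(html)
    builder.close()
    return builder.root


def node_text(node, buffer):
    """Resolve a node's own text spans against the shared buffer on demand."""
    return "".join(buffer[start:end] for start, end in node.text_spans)


# Usage: the <b> and the second <p> are never closed in this input.
html = "<div><p>Hello <b>world</p><p>again</div>"
root = build_dom(html)
div = root.children[0]
print([child.tag for child in div.children])
print(node_text(div.children[1], html))
```

Storing spans means a text node costs two integers, with the string only materialized when something asks for it. The trade-off is that the source buffer has to stay alive for as long as the tree does.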
Philibuster