I don't think it matters whether you use Java or C for your project. This is an information extraction problem.
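The first step is tokenizing the source text, and JLex (a lexer generator for Java) may be what you need. If you don't like it, Java's built-in regular expressions can also do the job: String.matches() tests a whole string against a pattern, and java.util.regex.Matcher can pull tokens out of a longer text. The XML you mentioned can store your patterns.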
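As a rough illustration of step 1 without JLex, here is a minimal sketch using java.util.regex. The class name SimpleTokenizer and the token definition (runs of letters, runs of digits, single punctuation characters) are my own assumptions, not something your project requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SimpleTokenizer {
    // Assumed token definition: a run of letters, a run of digits,
    // or any single character that is neither whitespace nor alphanumeric.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z]+|[0-9]+|[^\\sA-Za-z0-9]");

    // Scans the text left to right and collects every match as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Call, 555, -, 1234, now, .]
        System.out.println(tokenize("Call 555-1234 now."));
    }
}
```

Whitespace is simply skipped because no alternative in the pattern matches it. A generated lexer would probably be a better fit once the token rules get more complicated.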
Step 2 is analyzing the token list (or 'word list') you got from the source text. You may need a corpus and some patterns, and these can also be stored in XML files.
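To make the "patterns in XML" idea concrete, here is a small sketch that loads named regular expressions from XML and uses them to label tokens. The XML layout (a patterns element with pattern children carrying name and regex attributes), the class name PatternStore, and the OTHER fallback label are all assumptions I made up for the example. The XML is passed as a string to keep it self-contained; in practice you would read it from a file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, for illustration only.
class PatternStore {
    // Assumed layout: <patterns><pattern name="..." regex="..."/></patterns>
    static Map<String, Pattern> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("pattern");
            // LinkedHashMap keeps the patterns in file order, so earlier entries win.
            Map<String, Pattern> patterns = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                patterns.put(element.getAttribute("name"),
                        Pattern.compile(element.getAttribute("regex")));
            }
            return patterns;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read pattern XML", e);
        }
    }

    // Returns the name of the first pattern that matches the whole token.
    static String classify(String token, Map<String, Pattern> patterns) {
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            if (entry.getValue().matcher(token).matches()) {
                return entry.getKey();
            }
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        String xml = "<patterns>"
                + "<pattern name=\"NUMBER\" regex=\"[0-9]+\"/>"
                + "<pattern name=\"WORD\" regex=\"[A-Za-z]+\"/>"
                + "</patterns>";
        Map<String, Pattern> patterns = load(xml);
        // Prints: Call -> WORD, 555 -> NUMBER, - -> OTHER (one per line)
        for (String token : new String[] {"Call", "555", "-"}) {
            System.out.println(token + " -> " + classify(token, patterns));
        }
    }
}
```

Keeping the patterns outside the code means you can tune them without recompiling, which I think is the main reason to bother with XML here.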
In step 3 you extract the information you need by matching the tokens against certain patterns (some merging of tokens is needed). For example, something matching [0-9]+-[0-9]+ may be a phone number. Again, you can store the patterns in an XML file.
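The merge work could look roughly like the sketch below. It assumes the token list has digits and hyphens as separate tokens, and it treats any digits-hyphen-digits sequence as a phone number candidate. The class name PhoneExtractor and that very loose rule are my own simplifications; real phone formats need stricter patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class PhoneExtractor {
    private static boolean isDigits(String token) {
        return token.matches("[0-9]+");
    }

    // Merges runs such as "555", "-", "1234" into the single candidate "555-1234".
    static List<String> extract(List<String> tokens) {
        List<String> candidates = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!isDigits(tokens.get(i))) {
                i++;
                continue;
            }
            StringBuilder merged = new StringBuilder(tokens.get(i));
            int j = i + 1;
            // Keep absorbing "-" followed by digits for as long as the run continues.
            while (j + 1 < tokens.size()
                    && tokens.get(j).equals("-")
                    && isDigits(tokens.get(j + 1))) {
                merged.append('-').append(tokens.get(j + 1));
                j += 2;
            }
            // A lone number with no hyphenated part is not reported.
            if (j > i + 1) {
                candidates.add(merged.toString());
            }
            i = j;
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Call", "555", "-", "1234", "now", ".");
        // Prints [555-1234]
        System.out.println(extract(tokens));
    }
}
```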
If you only want to extract phone numbers, step 1 alone may be enough.
Personal opinion.