Processing large UTF-8 non-text files
Hi, I have to filter UTF-8 strings out of a large (100 MB) non-text file, i.e. the file cannot be read with StreamReader.ReadLine or ReadToEnd. Since it's UTF-8, it has 2 bytes/character as far as I know. I've never worked with byte arrays that encode UTF-8, and I would prefer not to.
Is there a way to read fixed chunks of UTF-8 (say 1000 chars / 500 chars)?
If not, can anyone give me a working example of how to handle the byte arrays correctly?
I guess I'd have to read an even number of bytes, then insert a CR/LF after those bytes, and save this garbage in a new file. Would this work?
Thanks,
Mike
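
One point that affects the "even number of bytes" idea: UTF-8 is a variable-width encoding, using 1 to 4 bytes per character, so a fixed-size byte chunk can end in the middle of a character. Below is a minimal C# sketch of one possible way to read fixed-size byte chunks anyway, assuming the file is read through a plain Stream. It uses System.Text.Decoder, which remembers an incomplete trailing sequence and completes it with the next chunk. The names Utf8ChunkReader, ReadChunks and bufferSize are hypothetical and chosen only for illustration. With the default Encoding.UTF8, bytes that are not valid UTF-8 come out as U+FFFD replacement characters, which the filtering step would still have to deal with.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class Utf8ChunkReader
{
    // Hypothetical helper: reads the stream in chunks of up to bufferSize bytes
    // and yields the text decoded from each chunk.
    public static IEnumerable<string> ReadChunks(Stream input, int bufferSize = 4096)
    {
        // The Decoder is stateful: bytes of a character that is cut off at the
        // end of one chunk are kept and combined with the next chunk.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        byte[] bytes = new byte[bufferSize];
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];

        int read;
        while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
        {
            // flush: false, because more data may follow.
            int count = decoder.GetChars(bytes, 0, read, chars, 0, false);
            if (count > 0)
                yield return new string(chars, 0, count);
        }

        // End of stream: flush whatever incomplete sequence is still buffered.
        int tail = decoder.GetChars(bytes, 0, 0, chars, 0, true);
        if (tail > 0)
            yield return new string(chars, 0, tail);
    }
}

static class Program
{
    static void Main()
    {
        // Demo on an in-memory stream; a tiny buffer forces multi-byte
        // characters to be split across chunk boundaries.
        byte[] data = Encoding.UTF8.GetBytes("héllo wörld €");
        StringBuilder result = new StringBuilder();
        using (MemoryStream stream = new MemoryStream(data))
        {
            foreach (string chunk in Utf8ChunkReader.ReadChunks(stream, 4))
                result.Append(chunk);
        }
        Console.WriteLine(result.ToString() == "héllo wörld €");
    }
}
```

For a real file, the MemoryStream could be replaced with File.OpenRead on the file's path and a larger buffer. Each yielded string can then be filtered on its own, so the 100 MB never has to be held in memory at once and no CR/LF needs to be inserted.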