Processing large UTF-8 non-text files

mike_abc · February 15th, 2012, 01:25 PM

Hi, I have to filter out the UTF-8 strings in a large (100 MB) non-text file, i.e. the file cannot be read with StreamReader.ReadLine or ReadToEnd. Since it's UTF-8, I assume it has 2 bytes/character. I've never worked with byte arrays that encode UTF-8, and I would prefer not to.

Is there a way to read fixed chunks of UTF-8 (say 500 or 1000 chars)?

If not, can anyone give me a working example of how to handle the byte arrays correctly?

I guess I'd have to read an even number of bytes, then insert a CR/LF after those bytes, and save this garbage in a new file. Would this work?

Thanks,
Mike
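
A note on the chunking question above: UTF-8 is variable-width (1 to 4 bytes per character), so cutting at an even byte offset does not guarantee a character boundary. The usual way around this is a stateful (incremental) decoder that holds back an incomplete trailing sequence until the next chunk arrives. In .NET, StreamReader.Read(char[], int, int) should already behave this way and return a fixed number of chars per call. Below is a minimal sketch of the same idea in Python, not a definitive implementation; the function name read_utf8_chunks, the chunk size, and the choice of errors="replace" for the non-text bytes are all assumptions made for illustration.

```python
import codecs
import io


def read_utf8_chunks(stream, chunk_bytes=4096):
    """Yield decoded text from a binary stream without splitting a character.

    Hypothetical helper: name and defaults are illustrative only.
    """
    # The incremental decoder buffers an incomplete trailing byte sequence
    # and completes it with the next chunk. errors="replace" (an assumption)
    # turns invalid bytes, expected in a non-text file, into U+FFFD.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is still buffered at end of input.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Four characters taking 1, 2, 3 and 4 bytes: 10 bytes in total.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")

# A 4-byte chunk size deliberately cuts through multi-byte characters.
pieces = list(read_utf8_chunks(io.BytesIO(sample), chunk_bytes=4))
```

For a real file, open it in binary mode (open(path, "rb")) and pass the file object in place of the BytesIO. The decoded pieces can then be searched for the strings of interest, keeping in mind that a string may span two consecutive pieces.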

mike_abc · February 16th, 2012, 06:12 AM

Problem solved with Notepad++, where I managed to automatically insert CR/LF at the "right" places. The rest was simple.

Mikey