character set problem

BrendonMelville · March 12th, 2007, 10:29 AM

We are getting ¿ stored in our Oracle 10g database, which uses the WE8ISO8859P1 character set.

The problem is caused by the following:

Microsoft released software (in particular MS Word) before considering any ANSI or ISO standard (although they claimed otherwise).
At that time of pioneering graphical interfaces, they were the standard. Since then things have changed. Microsoft initially targeted the US market, but very soon wanted to expand into Europe. For that they needed a standardized character set in place of the one initially in use. Microsoft remapped the character set in newer applications to Windows-1252, which is compatible with ISO-8859-1 (the one we use in our Java web applications). That cleared the way into the European market, where extended characters are necessary (as in French, Dutch, German...).

What happened to the initial character codes from before Microsoft agreed with ISO to standardize characters? Well, nothing.

So what are the consequences of that?

If we use a Microsoft Word document in conjunction with one of the oldest character sets (universe), the ice-age character mapping is still there. So when we cut and paste the content into another application, the characters are no longer mapped correctly. French is especially sensitive to this: the character that Word encodes as decimal 146 ( ' ) is used very often in French.

Therefore, if the text is generated in MS Word using the old character mapping (universe), which is what we do, and then moved by cut and paste, the characters are misinterpreted by other, newer applications.

Initially in MS Word the apostrophe ( ' ) had code 191; later, after Windows-1252 was implemented, it was moved to code 146 in accordance with ISO. ISO, however, treats the character coded 191 as ¿. So if you are using the MS Word universe character set, ' looks like ' but in newer or ISO-compatible applications it looks like ¿.
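
For anyone who wants to check the codes mentioned above, here is a minimal Java sketch. It assumes the JVM ships the windows-1252 charset (most do), and the class name ApostropheBytes is made up for illustration. It decodes bytes 146 and 191 under each character set, then shows what happens when Word's curly apostrophe is forced into ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ApostropheBytes {

    // Decode a single byte value (0-255) using the given charset.
    static String decode(int byteValue, Charset charset) {
        return new String(new byte[] { (byte) byteValue }, charset);
    }

    // Encode text to ISO-8859-1 and return its first byte as an unsigned value.
    // String.getBytes substitutes '?' (63) for characters the charset cannot represent.
    static int firstLatin1Byte(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1)[0] & 0xFF;
    }

    public static void main(String[] args) {
        // Assumption: this charset name is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Byte 146 is the curly apostrophe (U+2019) in Windows-1252,
        // but an unprintable C1 control character (U+0092) in ISO-8859-1.
        System.out.printf("146 in windows-1252: U+%04X%n", (int) decode(146, cp1252).charAt(0));
        System.out.printf("146 in ISO-8859-1:   U+%04X%n", (int) decode(146, latin1).charAt(0));

        // Byte 191 is the inverted question mark (U+00BF) in ISO-8859-1.
        System.out.printf("191 in ISO-8859-1:   U+%04X%n", (int) decode(191, latin1).charAt(0));

        // The curly apostrophe has no ISO-8859-1 encoding, so Java falls back to '?'.
        System.out.println("U+2019 as ISO-8859-1 byte: " + firstLatin1Byte("\u2019"));
    }
}
```

Java falls back to a plain question mark for the unmappable apostrophe. Oracle typically substitutes its own replacement character when converting into WE8ISO8859P1, which would be one possible source of the stored ¿.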

So we scratched our heads and started thinking about what the right solution would be for us...

Write a patch in Java to correct the character mapping and, in addition, to eliminate the control characters that would mess up the "look and feel" of the displayed content. Does anyone know of a Java class that already does this?
Thanks

Brendon

dafl00 · March 15th, 2007, 12:16 PM

Check the String class's methods http://java.sun.com/javase/6/docs/ap...ng/String.html
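
As a rough idea of what such a patch could look like using only String and StringBuilder methods, here is a minimal sketch. The class name Cp1252Cleaner and the choice of replacements are assumptions, not an existing library class. It maps the usual Word "smart" punctuation to plain ASCII and drops control characters other than tab, newline and carriage return. It also catches the raw C1 codes (145-148, 150, 151, 133) that appear when Windows-1252 bytes were read as ISO-8859-1.

```java
// Hypothetical helper, for illustration only.
final class Cp1252Cleaner {

    // Replace Word-style punctuation with ASCII and strip control characters.
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                // Curly single quotes, plus bytes 145/146 misread as ISO-8859-1.
                case '\u2018': case '\u2019': case '\u0091': case '\u0092':
                    out.append('\'');
                    break;
                // Curly double quotes, plus bytes 147/148 misread as ISO-8859-1.
                case '\u201C': case '\u201D': case '\u0093': case '\u0094':
                    out.append('"');
                    break;
                // En and em dashes, plus bytes 150/151 misread as ISO-8859-1.
                case '\u2013': case '\u2014': case '\u0096': case '\u0097':
                    out.append('-');
                    break;
                // Ellipsis, plus byte 133 misread as ISO-8859-1.
                case '\u2026': case '\u0085':
                    out.append("...");
                    break;
                default:
                    // Keep ordinary whitespace; drop every other control character.
                    boolean keptWhitespace = c == '\n' || c == '\r' || c == '\t';
                    if (keptWhitespace || !Character.isISOControl(c)) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A French phrase with Word's curly apostrophe in it.
        System.out.println(clean("l\u2019\u00E9t\u00E9"));
    }
}
```

Running the text through something like this before the insert means only characters that ISO-8859-1 can represent reach the database. Accented letters such as é pass through untouched because they are valid in WE8ISO8859P1.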