Since writing that last post, I've taken to encoding all user-submitted data as HTML entities on the way in, and decoding everything sent back to the browser.
It looks like this (part of a class named db):
Code:
public static function escapeStringCallback(&$value)
{
    if (is_array($value))
    {
        foreach ($value as $i => $string)
        {
            if (is_array($value[$i]))
            {
                // Nested array: walk it with this same method so every level is handled.
                array_walk($value[$i], array('db', 'escapeStringCallback'));
            }
            else
            {
                // Convert UTF-8 to HTML entities first, then escape for use in a query.
                $value[$i] = mysql_real_escape_string(mb_convert_encoding($value[$i], 'HTML-ENTITIES', 'UTF-8'));
            }
        }
    }
    else
    {
        // Single value: same two steps as above.
        $value = mysql_real_escape_string(mb_convert_encoding($value, 'HTML-ENTITIES', 'UTF-8'));
    }
}
Of course, that requires the multibyte string (mbstring) extension to be installed (I can't believe that isn't part of the core library). That method is run on POST, GET and COOKIE data in place of magic_quotes_gpc.
http://www.php.net/manual/en/ref.mbstring.php
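For anyone who wants to try the encoding step on its own, here is a minimal sketch that runs without a database connection. It is not the method above: the SQL escaping is left out, and it swaps in mb_encode_numericentity(), which produces numeric references such as &#233; rather than named ones such as &eacute;. The function name encodeIncoming and the sample array are made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper, not part of the db class above: encodes every string
// in a (possibly nested) array as ASCII plus numeric character references.
function encodeIncoming(array $data): array
{
    // Convert every code point from U+0080 upward; plain ASCII is left as-is.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    // array_walk_recursive visits only the leaves, so nesting needs no special case.
    array_walk_recursive($data, function (&$value) use ($map) {
        if (is_string($value)) {
            $value = mb_encode_numericentity($value, $map, 'UTF-8');
        }
    });

    return $data;
}

// Stand-in for $_POST; the keys and values are invented for this example.
$post = ['name' => 'Zoë', 'tags' => ['café', 'plain']];
$encoded = encodeIncoming($post);

echo $encoded['name'], "\n";    // Zo&#235;
echo $encoded['tags'][0], "\n"; // caf&#233;
echo $encoded['tags'][1], "\n"; // plain
```

Whatever goes into a query still needs escaping (or bound parameters) after this step; the sketch only covers the entity conversion.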
Then, when I want to go the other way and send back out to the browser, I just do the opposite:
echo utf8_decode(mb_convert_encoding($value, 'UTF-8', 'HTML-ENTITIES'));
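As a rough sketch of the return trip, the snippet below decodes a stored value back to UTF-8 with html_entity_decode(), which is a different call from the one above and is used here only because it handles both named and numeric entities. The sample string is invented. Note that html_entity_decode() also turns &amp; and &lt; back into raw characters, so the sketch re-escapes with htmlspecialchars() before printing into HTML. Also keep in mind that utf8_decode() converts UTF-8 to ISO-8859-1, so whether you want it depends on the charset your page declares.

```php
<?php
// Invented example of a value as it might come back from the database.
$stored = 'J&uuml;rgen &amp; Zo&#235;';

// Decode named and numeric entities into real UTF-8 characters.
$utf8 = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Re-escape the HTML special characters before writing into a page;
// non-ASCII characters pass through untouched.
$safe = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

echo $utf8, "\n"; // Jürgen & Zoë
echo $safe, "\n"; // Jürgen &amp; Zoë
```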
...or something to that effect. Of course, since the data is stored as entities, you don't necessarily have to convert back to UTF-8, but I found it plays nicer with my XML documents. XML can't handle the named entities unless they are declared in a DTD, and undeclared entities result in well-formedness errors.
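To illustrate the XML point, here is a small sketch using SimpleXML (assuming that extension is available, which it is in a default PHP build). The element names and the sample name are made up. A named HTML entity with no DTD fails to parse, while a numeric reference or the literal UTF-8 character is fine.

```php
<?php
// Collect parse errors quietly instead of emitting warnings.
libxml_use_internal_errors(true);

// Three hypothetical fragments carrying the same name.
$named   = '<customer><name>J&uuml;rgen</name></customer>';  // HTML named entity
$numeric = '<customer><name>J&#252;rgen</name></customer>';  // numeric reference
$literal = '<customer><name>Jürgen</name></customer>';       // raw UTF-8

$fromNamed   = simplexml_load_string($named);
$fromNumeric = simplexml_load_string($numeric);
$fromLiteral = simplexml_load_string($literal);

var_dump($fromNamed === false);        // bool(true): &uuml; is not defined in XML
echo (string) $fromNumeric->name, "\n"; // Jürgen
echo (string) $fromLiteral->name, "\n"; // Jürgen
```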
It's a little extra overhead, since you're ideally encoding everything that goes into the database like this and then decoding it upon retrieval. Before I took this approach I had quite a bit of difficulty with my database backups: restoring from a dump would corrupt my UTF-8 special characters. It probably had something to do with the encoding used by the mysqldump utility, or with the plain text file the dump was stored in. The database is defined with the correct collation, BTW, so it was probably something I overlooked; that is to say, I'm sure it's me, not the database. But as a work-around, this method, combined with the right character set specified in the Content-Type header, resulted in a working UTF-8 capable site.
This is done with the UTF-8 character set specified for the browser, BTW, rather than windows-1252. As far as I can tell, this method let me get around the ambiguities between Microsoft's character set and ISO 8859-1; presumably that is all handled transparently by the mb_convert_encoding function. Microsoft's smart quotes and whatnot remain intact, and I get the full spectrum of UTF-8 capabilities. Names of our customers from Europe and elsewhere that contain umlauts, cedillas, et al. no longer show up in the browser with question marks in place of those characters.
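In case the windows-1252 versus ISO 8859-1 ambiguity isn't familiar, here is a small illustration. It only shows that the same byte means different things in the two character sets; it makes no claim about what the method above does internally. The byte 0x93 is a left smart quote in windows-1252 but a control character in ISO 8859-1.

```php
<?php
// One byte, as a Windows editor might produce for a left "smart" double quote.
$byte = "\x93";

// Interpret the same byte under each character set and convert to UTF-8.
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n"; // e2809c (U+201C, left double quotation mark)
echo bin2hex($asLatin1), "\n"; // c293   (U+0093, a C1 control character)
```

If the source charset is guessed wrong at this step, the smart quote is lost before any entity encoding happens, which is why declaring one charset everywhere matters.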
Anyway, it's just one more road to get there.
Regards,
Rich
--
Author,
Beginning CSS: Cascading Style Sheets For Web Design
CSS Instant Results
http://www.catb.org/~esr/faqs/smart-questions.html