utf8 support in PHP

jaya_76 · October 3rd, 2005, 06:03 AM

Subject: We are having some problems with UTF-8 encoded characters in PHP.

Using MySQL 4.1.7 with character set utf8 and collation utf8_general_ci, we are storing German, French, Spanish, Turkish, Dutch, and other characters. Those special characters are stored properly in MySQL.

But when we extract those characters into a PHP variable and try to replace the special characters with the relevant umlauts, PHP is unable to recognize the special characters.

Please let me know the reason why PHP 4.3.11 does not support UTF-8 8-bit characters.

How do I achieve the replace functionality with UTF-8 characters in PHP?

Please respond as early as possible.

Regards
Jay

richard.york · October 17th, 2005, 01:33 PM

As I mentioned in the other thread, I haven't personally experienced any problems with UTF-8 in PHP. In my experience, it's always been the browser's character set, MySQL, or some other thing. I did have a problem once with data encoded with htmlentities(), which destroyed UTF-8 characters; I ended up using htmlspecialchars() instead. And by destroy, I mean that htmlentities() was replacing UTF-8 characters with the wrong HTML entity, and sometimes even more than one entity.
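Here is a minimal sketch of how that kind of damage can happen. It assumes the cause is htmlentities() being told (or defaulting) to treat the input as ISO 8859-1 while the bytes are really UTF-8; the variable names are made up for illustration.

```php
<?php
// "Müller" written as explicit UTF-8 bytes (ü = 0xC3 0xBC), so the example
// does not depend on how this file itself is saved.
$name = "M\xC3\xBCller";

// Wrong charset: each of the two bytes of "ü" is treated as its own
// Latin-1 character, so one character turns into two entities.
$wrong = htmlentities($name, ENT_QUOTES, 'ISO-8859-1');

// Correct charset: the two bytes are read as one character.
$right = htmlentities($name, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() only touches &, <, >, and quotes, so the UTF-8
// bytes pass through unchanged.
$safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n";  // M&Atilde;&frac14;ller
echo $right, "\n";  // M&uuml;ller
```

Passing the charset explicitly avoids relying on whatever default the installed PHP version uses.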

Don't believe me? Check out the umlauts being output by my PHP mail application:
http://www.smilingsouls.net/index.ht...Mail_IMAP/live

The umlauts in that page didn't show up when I set the browser's character set to utf-8, but do when I specify windows-1252.

<meta http-equiv='content-type' content='text/html; charset=windows-1252' />

Not sure why that is, I haven't investigated it much.

Anyway, it is clear that PHP *does* support UTF-8. Why would PHP be so popular, especially in Europe, if it didn't?

My suggestion to you is to evaluate the trek your data takes, and consider that the trouble is likely to be either the client-side charset setting or a function the data is being passed through. If all else fails, get on php.net and ask on one of their mailing lists. http://www.php.net/mailing-lists.php
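One way to follow that trek is to look at the raw bytes at each step. The sketch below is only an illustration: inspect_encoding() is a hypothetical helper, the sample strings stand in for whatever comes back from the database, and it needs the mbstring extension for mb_check_encoding().

```php
<?php
// Hypothetical helper: report whether a string is valid UTF-8 and show
// its raw bytes in hex, so it can be called after every processing step.
function inspect_encoding($label, $value)
{
    $state = mb_check_encoding($value, 'UTF-8') ? 'valid UTF-8' : 'NOT valid UTF-8';
    return sprintf('%s: %s, bytes=%s', $label, $state, bin2hex($value));
}

$utf8Name   = "M\xC3\xBCller"; // "Müller" as UTF-8 (ü = C3 BC)
$latin1Name = "M\xFCller";     // the same name as ISO 8859-1 (ü = FC)

echo inspect_encoding('utf8', $utf8Name), "\n";
echo inspect_encoding('latin1', $latin1Name), "\n";

// Once the bytes are confirmed to be UTF-8, a plain byte-wise replace of
// the special character works, because str_replace() compares bytes.
$ascii = str_replace("\xC3\xBC", 'ue', $utf8Name);
echo $ascii, "\n"; // Mueller
```

If the search string in the replace is saved in a different encoding than the data, the bytes never match, which looks like PHP "not recognizing" the character.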

Regards,
Rich

--
[http://www.smilingsouls.net]
Mail_IMAP: A PHP/C-Client/PEAR solution for webmail
Author: Beginning CSS: Cascading Style Sheets For Web Design

BM · August 18th, 2006, 02:11 PM

This is an old post but it often comes up in search so I'm adding some info to help people trying to figure this out.

Most people using "Western European" languages are served by an international character coding - ISO 8859-1 - also known as "Latin-1".
ISO 8859-1 has an area in the code space that is not used by characters (in the region of 80h to 9Fh).

When Microsoft was creating its current Windows system, it created the concept of "code pages" to handle different character set requirements around the world. Microsoft created code page 1252 (cp1252) to handle the "Western European" space. However, Microsoft did not adhere to international standards and "reused" the empty space in ISO 8859-1 to support several characters. Among these are the "balanced quotes" that Microsoft products like Word automatically substitute for the single "unbalanced" versions on the standard keyboard.

When a Word-produced document using cp1252 hits a standard-based system (e.g. SQL database or PHP) then the character coding does not match.
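A small sketch of that mismatch, assuming the mbstring extension is available. It takes the two bytes cp1252 uses for the balanced double quotes and converts them to UTF-8 twice, once declaring the source as Windows-1252 and once as ISO-8859-1.

```php
<?php
// The text "Hi" wrapped in cp1252 balanced quotes: 0x93 and 0x94 both sit
// in the 80h-9Fh region that ISO 8859-1 does not use for printable characters.
$wordText = "\x93Hi\x94";

// Declared as cp1252: the bytes map to the real quote characters
// U+201C and U+201D.
$asCp1252 = mb_convert_encoding($wordText, 'UTF-8', 'Windows-1252');

// Declared as ISO 8859-1: the same bytes map to the control codes
// U+0093 and U+0094, which do not display as quotes.
$asLatin1 = mb_convert_encoding($wordText, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n"; // e2809c4869e2809d
echo bin2hex($asLatin1), "\n"; // c2934869c294
```

The input bytes are identical in both calls; only the declared source encoding differs.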

The wide use of Microsoft's software makes this a common problem. To make things worse, browser manufacturers sometimes assume cp1252 is being used so as to hide problems from less technical users. However, this can lead to other problems with character encoding.

The best way round this is to avoid configuring your software to use cp1252. Unfortunately, this is the out-of-the-box setup for most MS-based systems, so you have to proactively work out the details.

I avoid the problem by writing all web-targeted material in Open Office Writer with the character set selected as ISO 8859-1. Never get a problem. Further, if I'm receiving PR material to put on the web, I open it in OO Writer and re-save before sending to the web.

Note that in UTF, the characters Microsoft added to cp1252 are available - they are just in another part of the code space.

Finally, if you are specifying website or code development, remember that "Latin-1" is not a standard term. Latin-1 is applied to both cp1252 and ISO 8859-1. Since these are different, recognize that the term "Latin-1" means whatever the user wants it to mean. You might write Latin-1 into a spec thinking cp1252 while the coder develops to ISO 8859-1. Use the exact definition you intend your developers to follow: ISO 8859-1 or cp1252 as appropriate.

richard.york · August 18th, 2006, 03:09 PM

Since writing that last post, I've taken to encoding everything that comes in, as far as user submitted data goes, as entities and decoding everything sent back to the browser.

It looks like this (part of a class named db)

Code:

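    // Entity-encode and SQL-escape $value in place.
    // Accepts a string or a (possibly nested) array of strings.
    public static function escapeStringCallback(&$value)
    {
        if (is_array($value))
        {
            foreach ($value as $i => $string)
            {
                if (is_array($value[$i]))
                {
                    // Nested array: walk it with this same method.
                    array_walk($value[$i], array('db', 'escapeStringCallback'));
                }
                else
                {
                    // Non-ASCII UTF-8 characters become HTML entities, then the result is escaped for MySQL.
                    $value[$i] = mysql_real_escape_string(mb_convert_encoding($value[$i], 'HTML-ENTITIES', 'UTF-8'));
                }
            }
        }
        else
        {
            $value = mysql_real_escape_string(mb_convert_encoding($value, 'HTML-ENTITIES', 'UTF-8'));
        }
    }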

Of course, that requires the multibyte extension to be installed (can't believe that isn't part of the core library). That method is run on POST, GET and COOKIE data in place of magic_quotes_gpc.

http://www.php.net/manual/en/ref.mbstring.php

Then, when I want to go the other way and send back out to the browser, I just do the opposite:

echo utf8_decode(mb_convert_encoding($value, 'UTF-8', 'HTML-ENTITIES'));

...or something to that effect. Of course, since the data is stored as entities, you don't necessarily have to convert back to UTF-8, but I found it plays nicer with my XML documents. They can't handle the entities, since the entities aren't provided in a DTD, and that results in well-formedness errors.
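For anyone trying the encode-in, decode-out idea on a newer PHP, here is a rough sketch of the round trip. It is not the code from my class: it swaps in mb_encode_numericentity() and mb_decode_numericentity() instead of the 'HTML-ENTITIES' conversion (which recent PHP versions flag as deprecated), it leaves out the SQL escaping, and the function names are made up.

```php
<?php
// Hypothetical helper: turn every character outside ASCII into a numeric
// entity, leaving plain ASCII (including <, > and &) untouched.
function encode_non_ascii($utf8)
{
    // Map: code points 0x80..0x10FFFF, no offset, keep all bits.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// Hypothetical helper: reverse the encoding above on the way out.
function decode_to_utf8($text)
{
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_decode_numericentity($text, $convmap, 'UTF-8');
}

$incoming = "M\xC3\xBCller";             // "Müller" as UTF-8 bytes
$stored   = encode_non_ascii($incoming); // pure ASCII, safe in a text dump
$outgoing = decode_to_utf8($stored);     // back to the original UTF-8 bytes

echo $stored, "\n"; // M&#252;ller
```

The stored form is plain ASCII, which is why it survives a dump and restore regardless of what encoding the tools assume.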

It's a little extra overhead, since you're ideally encoding everything that goes into the database like this, and then decoding upon retrieval. Before I took this approach I had quite a bit of difficulty with my database backups: restoring would corrupt my UTF-8 special characters. It probably had something to do with the encoding used by the mysqldump utility, or with the plain text file that the dump was stored to. The database is defined with the correct collation, BTW, so it was probably just something overlooked on my part; that is to say, I'm sure it's me, not the database. But as a work-around, this method, combined with the right character set specified in the content-type header, resulted in a working UTF-8 capable site.

This is done specifying the UTF-8 character set in the browser, BTW, rather than windows-1252. As far as I can tell, this method allowed me to get around the Microsoft character set ambiguities with ISO 8859-1; presumably that is all handled transparently by the mb_convert_encoding function. Microsoft's smart quotes and whatnot remain intact, and I get the full spectrum of UTF-8 capabilities. Names of our customers from Europe and elsewhere with umlauts, cedillas, et al. no longer show question marks in place of those characters in the browser.
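As a rough illustration of the content-type part, something like the following sends the charset from PHP itself rather than relying on a meta tag. The helper name is made up, and header() only has an effect if it is called before any output.

```php
<?php
// Hypothetical helper: build the Content-Type header line for a charset.
function content_type_header($charset = 'utf-8')
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Send the header only while it is still possible to do so.
if (!headers_sent()) {
    header(content_type_header());
}
```

The charset named here has to match the bytes the script actually outputs; declaring utf-8 while echoing ISO 8859-1 bytes brings the question marks back.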

Anyway, it's just one more road to get there.

Regards,
Rich

--
Author,
Beginning CSS: Cascading Style Sheets For Web Design
CSS Instant Results

http://www.catb.org/~esr/faqs/smart-questions.html