i18n-friendly user input sanitization

chroniclemaster1 · May 29th, 2010, 02:38 AM

I've been using a pretty draconian whitelist for sanitizing user input.

Code:

[^.@a-zA-Z0-9 ]

However, this really only covers American English (maybe British too, but I'm not sure). It certainly won't work for Spanish, French, or German, much less Arabic, Hebrew, or anything further off the beaten path like the East Asian character sets. Has anyone done work on input validation for any languages other than English? I found virtually nothing on the topic beyond the arguments over blacklist vs. whitelist. There's little on implementation, and nothing I've found on implementation for an internationalized situation.

I can include most of the Western European languages, and "almost" have the characters for the Chinese Pinyin romanization, just by including part of the Latin-1 Supplement code points...

Code:

[^.@a-zA-Z0-9À-ÖØ-öø-ÿ ]
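As a rough sketch of how that pattern behaves in practice, assuming Python's re module (no language has been named here, so that choice, along with the names DISALLOWED and sanitize, is purely illustrative) and input that is already NFC-normalized:

```python
import re

# Negated whitelist: matches every character that is NOT allowed.
# \u00C0-\u00D6 is À-Ö, \u00D8-\u00F6 is Ø-ö, \u00F8-\u00FF is ø-ÿ,
# which skips the multiplication sign (U+00D7) and division sign (U+00F7).
DISALLOWED = re.compile(
    r"[^.@a-zA-Z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF ]"
)


def sanitize(text: str) -> str:
    """Drop every character that is outside the whitelist."""
    return DISALLOWED.sub("", text)


if __name__ == "__main__":
    print(sanitize("Jos\u00e9 M\u00fcller"))   # accented Latin-1 letters are kept
    print(sanitize("<b>a@b.com</b>"))          # angle brackets and slash are dropped
    print(sanitize("n\u01d0 h\u01ceo"))        # Pinyin caron vowels are dropped
```

The last call shows why Pinyin is only "almost" covered: the grave and acute tone marks fall inside Latin-1, but the macron and caron vowels (for example U+0101 and U+01D0) live in later blocks and get stripped.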

It would be easier to use the single range, À-ÿ, but that would include both × and ÷. Do those characters raise any potential security concerns? Can you get away with them? I would think (and feel free to correct me) that most letters should be safe but punctuation and mathematical operators are more open to abuse.
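One way to check which non-letters a range lets through is to walk the code points and ask the Unicode database for each character's general category. This is only a sketch, assuming Python's re and unicodedata modules; SINGLE, SPLIT, and non_letters are made-up names for illustration:

```python
import re
import unicodedata

# The single range U+00C0-U+00FF versus the three sub-ranges that skip
# the multiplication sign (U+00D7) and the division sign (U+00F7).
SINGLE = re.compile(r"[\u00C0-\u00FF]")
SPLIT = re.compile(r"[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]")


def non_letters(pattern: re.Pattern) -> list[str]:
    """Return characters in U+00C0-U+00FF that the pattern accepts
    but whose Unicode general category is not a letter (L*)."""
    found = []
    for code_point in range(0xC0, 0x100):
        char = chr(code_point)
        is_letter = unicodedata.category(char).startswith("L")
        if pattern.fullmatch(char) and not is_letter:
            found.append(char)
    return found


if __name__ == "__main__":
    print(non_letters(SINGLE))  # the two math signs
    print(non_letters(SPLIT))   # nothing but letters remains
```

Everything else in that span is categorized as a letter, so the two math signs are the only non-letters the single range would add.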