
Monday, June 4, 2012

Removing accents in unicode strings

It wouldn't be very insightful to say that the key feature of most programs and scripts is the ability to process data the way it is expected to be processed. In many cases the data comes from users, and every programmer should know that users are very creative and can crash your script in ways you didn't think were possible. All in all, programmers have to spend a lot of time on boring things like data validation and preprocessing to prevent such situations.

Some problems are caused by regional and special characters. Storage is usually not the issue, since many databases support UTF-8. Problems occur when you need to convert your data to another encoding... well, it wouldn't be so bad if you were positive about the target encoding. Some systems, however (especially those implemented a few decades ago, e.g. banking systems), accept encodings that you don't see every day, such as IBM-852 or IBM-850. So if you are not 100% sure what the target encoding should be, ASCII is the safest pick. Although it supports only a limited set of characters, it is usually enough to preserve the meaning of the data.
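
To see the difference, here is a minimal sketch (cp852 is Python's codec name for IBM-852; this snippet is just an illustration, not part of the solution below):

    data = u"aąbcćdef"

    # CP852 (IBM-852, a DOS code page used in Central Europe) can encode ą and ć
    ibm852_data = data.encode('cp852')

    # plain ASCII cannot - without an error handler this raises UnicodeEncodeError
    try:
        data.encode('ASCII')
    except UnicodeEncodeError:
        pass  # ą and ć have no ASCII representation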

A simple way of getting an ASCII string is to drop all regional/special characters that require more than 7 bits to encode. Example Python code:

    data = u"aąbcćdef"
    # keep only characters that fit in 7 bits (code points below 128)
    ascii_data = "".join(map(lambda x: x if ord(x) < 128 else "", data))

After these operations, ascii_data has the following value: "abcdef". The same result may also be obtained with:

    # drop everything that cannot be encoded as ASCII
    ascii_data = data.encode('ASCII', 'ignore')

Using this method we lose context, especially when processing personal data - first names, last names and cities often contain native characters, and without them the data is incomplete.

The best way of converting the data to ASCII is to strip the accents. This may be achieved using the Normalization Form Compatibility Decomposition, also known as NFKD (defined in Unicode Standard Annex #15). This decomposition splits accented characters into (usually) two components: the base character and the accent. For example, the letter ą is split into a and  ̨ (U+0328, known as the combining ogonek). That is only the first part of the solution - there are still non-ASCII characters (the combining marks) in the string. The second step is converting the string to ASCII while ignoring all remaining special characters (the simple method presented earlier).
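
You can verify the decomposition step on its own with a quick check (just an illustrative sketch; the comments show the expected code points):

    import unicodedata

    decomposed = unicodedata.normalize("NFKD", u"ą")
    print(len(decomposed))                    # 2 - base letter plus combining mark
    print([hex(ord(c)) for c in decomposed])  # ['0x61', '0x328']

Now have a look at the complete script: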


    import unicodedata

    data = u'aąbcćdef'
    # decompose accented characters, then drop the non-ASCII combining marks
    ascii_data = unicodedata.normalize("NFKD", data).encode('ASCII', 'ignore')

After running this script, ascii_data has been assigned the value "aabccdef" - perfect! Hope you find this solution useful.
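
If you need this in more than one place, the two steps can be wrapped in a small helper - a minimal sketch (the name remove_accents is mine, not a library API):

    import unicodedata

    def remove_accents(text):
        # decompose accented characters and drop the non-ASCII combining marks
        return unicodedata.normalize("NFKD", text).encode('ASCII', 'ignore')

    remove_accents(u"aąbcćdef")   # -> "aabccdef"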

~KR

