Friday, October 16, 2009

Dealing with extended characters in bash

This week I was writing a script that searched for missing data in SGML files, and I ran into issues with data that contained extended characters. I did a little digging and discovered that iconv could come to the rescue. The man page lets you specify the input encoding with -f, but I had better luck leaving -f out, in which case iconv assumes the encoding of the current locale. Here is what I used:

iconv -t UTF-8 --byte-subst="&#x%X;" target-file | grep "term"
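The --byte-subst option replaces each byte that is not valid in the input encoding with its hex value in the given format (something like &#xE9;) instead of stopping with an error. This may not be perfect, but it was useful enough for my purposes. If it doesn't work for you, don't forget to try specifying -f.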
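
For the case where you do need -f, here is a minimal sketch of what that might look like. The sample file, its contents, and the choice of ISO-8859-1 as the source encoding are all assumptions made for illustration; your files may well be in a different encoding.

```shell
#!/usr/bin/env bash
# Hypothetical sample: an SGML-like line containing the single byte 0xE9,
# which is "é" in ISO-8859-1 (assumed encoding, not taken from the original files).
sample=$(mktemp)
printf '<title>caf\351 menu</title>\n' > "$sample"

# Name the source encoding explicitly with -f and convert to UTF-8.
converted=$(iconv -f ISO-8859-1 -t UTF-8 "$sample")

# Search the converted text, as in the one-liner above.
printf '%s\n' "$converted" | grep "menu"

rm -f "$sample"
```

If you are not sure which encoding names your iconv accepts, iconv -l lists them.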

...
