2012-03-27

UTF-8 issue: find doesn't find all your files

Public bug announcement: Beware that GNU find in findutils 4.4.2 (as shipped with Ubuntu Lucid) may fail to find some of your files when it's run in a UTF-8 locale: even if a file is there, a -name test may silently skip it, so its name is never printed. Workaround: if any of your file names contain non-ASCII characters, run LC_CTYPE=C find instead of find. (If LC_ALL is set, it overrides LC_CTYPE, so use LC_ALL=C find in that case.)

Example:

$ echo $LC_CTYPE
en_US.UTF-8
$ ls foo*
ls: cannot access foo*: No such file or directory
$ perl -e 'die if !open F, ">", "foo\x80bar"'
$ ls foo*
foo?bar
$ find -type f
...
./foo?bar
...
$ find -name 'foo*'
$ LC_CTYPE=C find -name 'foo*'
./foo?bar

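Possible explanation: The file name matcher won't match a name that cannot be decoded in the current locale (LC_CTYPE). The byte \x80 on its own is not valid UTF-8 (it is a continuation byte with no lead byte before it), so in a UTF-8 locale the pattern foo* never matches foo\x80bar, and GNU find 4.4.2 skips the file. Without a -name test, as in find -type f above, the file is still printed.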
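
To check whether a directory tree contains names that could trigger this, one option is to look for names that are not valid UTF-8. The following is only a minimal sketch, not something that ships with findutils: the function names is_valid_utf8 and filter_non_utf8 are made up for this post, and the sketch assumes an iconv command that exits with a non-zero status on undecodable input (glibc's does). It also assumes that no file name contains a newline.

```shell
# Hypothetical helper: succeeds if its argument is a valid UTF-8 byte sequence.
# Assumes iconv exits non-zero when the input cannot be decoded.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical filter: reads one path per line from stdin and prints
# only the paths that are not valid UTF-8.
filter_non_utf8() {
    while IFS= read -r path; do
        is_valid_utf8 "$path" || printf '%s\n' "$path"
    done
}

# Usage (run find in the C locale so that it lists every name):
#   LC_ALL=C find . -print | filter_non_utf8
```

Any path this prints is a candidate for being skipped by find -name in a UTF-8 locale.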

This behavior is surprising and potentially dangerous, especially in automated shell scripts, where a silently skipped file can go unnoticed.
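
In scripts, the workaround from the top of this post can be wrapped in a small function so that every call goes through the C locale. This is a sketch under assumptions: find_c is a made-up name, and it sets LC_ALL rather than LC_CTYPE because LC_ALL takes precedence when both are set. The side effect is that find's messages and any locale-dependent behavior also switch to the C locale.

```shell
# Hypothetical wrapper: run find with the C locale forced for this one call.
# "command" bypasses any shell function or alias that is also named find.
find_c() {
    LC_ALL=C command find "$@"
}

# Usage:
#   find_c . -name 'foo*'
```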
