* A problem with CTLESC, CTLQUOTEMARK and UTF-8.
@ 2009-09-07 7:46 ZelinskiyIS
From: ZelinskiyIS @ 2009-09-07 7:46 UTC (permalink / raw)
To: dash
Good day (or night)!
I am an Ubuntu user. As of Jaunty 9.04, dash 0.5.4 is installed as the
default sh interpreter. Ubuntu uses multibyte UTF-8 to represent non-ASCII
characters, and these characters often appear in file names.
I found a bug while trying to find out why a Python script that does some
minor work converting music files would fail on songs whose Cyrillic
names contain the letters с, ш or Ё. The cause was the sh invoked by the
system(...) call: it created files with garbage in their names when a
"> $file_name" redirection was used and $file_name contained any of
these three letters.
For example, a sequence рсшЁъ (byte-by-byte)
{d1 80 d1 81 d1 88 d0 81 d1 8a}
is turned into
{d1 80 d1 d1 d0 d1 8a}. Bytes hex 81 and hex 88 disappear from the file
name.
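The byte pattern above can be reproduced with a short Python sketch. It is only a model of the observed effect (every 0x81 and 0x88 byte is dropped), not dash's actual code; the names DROPPED and strip_bytes are hypothetical.

```python
# Hypothetical model of the observed effect: drop every 0x81 and 0x88 byte.
# This is NOT dash's implementation; it only mimics the reported result.
DROPPED = {0x81, 0x88}


def strip_bytes(data: bytes) -> bytes:
    """Return data with all bytes listed in DROPPED removed."""
    return bytes(b for b in data if b not in DROPPED)


name = "рсшЁъ".encode("utf-8")
mangled = strip_bytes(name)

print(name.hex(" "))     # d1 80 d1 81 d1 88 d0 81 d1 8a
print(mangled.hex(" "))  # d1 80 d1 d1 d0 d1 8a

# The mangled name is no longer valid UTF-8, hence the "garbage".
try:
    mangled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)       # False
```

Only the first letter, р (d1 80), survives intact; every later lead byte loses its continuation byte or is followed by the wrong one.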
The cause of this behaviour is at expand.c:239-240 in dash 0.5.4. The
lines and the bug look similar in dash 0.5.5.1, where they are at
expand.c:216-217.
The piece of code:
########################################################################
	if (flag & EXP_REDIR) /*XXX - for now, just remove escapes */
		rmescapes(p);
########################################################################
cuts the bytes x81 and x88 (these appear to be the values dash uses
internally for its CTLESC and CTLQUOTEMARK control characters). This
behaviour seems to be always unwanted, because according to the UTF-8
specification, x81 and x88 cannot represent a character on their own.
Indeed, hex 81 = binary 10000001 and hex 88 = binary 10001000; the upper
two bits are 10, which means that the byte is a continuation byte and
must always follow a leading byte (from
http://en.wikipedia.org/wiki/UTF-8#Description).
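The bit test described above is easy to check in Python. The sketch below also lists which letters of the Russian alphabet contain a 0x81 or 0x88 byte in their UTF-8 encoding; the names is_continuation, RUSSIAN and affected are hypothetical and chosen for illustration only.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the UTF-8 continuation pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


# Russian alphabet: Ё (U+0401), А..я (U+0410..U+044F), ё (U+0451).
RUSSIAN = "Ё" + "".join(chr(c) for c in range(0x0410, 0x0450)) + "ё"

# Letters whose UTF-8 encoding contains a 0x81 or 0x88 byte.
affected = [ch for ch in RUSSIAN if {0x81, 0x88} & set(ch.encode("utf-8"))]

print(is_continuation(0x81), is_continuation(0x88))  # True True
print(is_continuation(0xD1))                         # False (a leading byte)
print(affected)                                      # ['Ё', 'с', 'ш']
```

Under these assumptions, the affected letters are exactly the three that broke the file names.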
The problem probably does not occur with the single-byte KOI8-R
encoding for Cyrillic, which is the default for Debian.
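The KOI8-R guess can be checked, at least for the Russian alphabet, with Python's built-in koi8_r codec. This sketch only looks at the encoded bytes and says nothing about how dash behaves under a KOI8-R locale; RUSSIAN and collisions are hypothetical names.

```python
# Russian alphabet: Ё (U+0401), А..я (U+0410..U+044F), ё (U+0451).
RUSSIAN = "Ё" + "".join(chr(c) for c in range(0x0410, 0x0450)) + "ё"

# In KOI8-R every letter is a single byte.
koi8 = RUSSIAN.encode("koi8_r")

# Bytes that coincide with the two values that get stripped.
collisions = [b for b in koi8 if b in (0x81, 0x88)]

print(len(koi8) == len(RUSSIAN))  # True: one byte per letter
print(collisions)                 # []
```

No Russian letter maps to 0x81 or 0x88 in KOI8-R, so file names written with these letters would not contain the two stripped bytes.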
I have also created a launchpad bug for Ubuntu,
https://bugs.launchpad.net/ubuntu/+source/dash/+bug/422298.
That's it, thanks for your attention.