* dvb-apps: charset support
From: Mauro Carvalho Chehab @ 2011-04-06 12:27 UTC
To: Linux Media Mailing List; +Cc: wk
Hi,
I added some patches to dvb-apps/util/scan.c in order to properly support EN 300 468 charsets.
Before the patches, scan was producing invalid UTF-8 codes here for ISO-8859-15 charsets, as
it simply filled the service/provider names with whatever non-control characters were
there. So, if your computer uses the same charset as your service provider, you're lucky.
Otherwise, invalid characters will appear in the scan tables.
After the changes, scan gets the charset from the locale environment and uses it as the
charset of the output files.
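To illustrate the kind of conversion involved, here is a minimal sketch that turns a name received in one charset into the output charset using iconv. It is not the code from scan.c: convert_name() and its parameters are hypothetical names chosen for this example, and the caller is assumed to pass charset names that the local iconv knows.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert `inlen` bytes of `in` from charset `from` to charset `to`.
 * Writes a NUL-terminated result into `out` (capacity `outcap` bytes).
 * Returns the number of bytes written (excluding the NUL), or -1 on error.
 * convert_name() is a hypothetical helper, not a function from scan.c.
 */
static long convert_name(const char *from, const char *to,
                         const char *in, size_t inlen,
                         char *out, size_t outcap)
{
    assert(in != NULL && out != NULL && outcap > 0);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;               /* conversion pair not supported */

    char *inp = (char *)in;      /* iconv() wants char **; the input is only read */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap - 1; /* keep room for the terminating NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;               /* invalid input or output buffer too small */

    *outp = '\0';
    return (long)(outp - out);
}
```

With the locale charset as `to`, a byte that is valid in ISO-8859-15 but not valid UTF-8 on its own is re-encoded instead of being copied through unchanged.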
The TS info may indicate the charset in use in the first byte of the provider name and service name,
if that byte is < 0x20. If it is not provided, the spec says that character table 00 should be
assumed (a modified version of the ISO 6937 charset). However, in my tests, local carriers here
don't fill it in, yet they use the ISO-8859-15 charset instead of ISO-6937. So, a new optional parameter
allows changing the default charset.
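The selection rule described above can be sketched as follows. This is an illustration only: dvb_text_charset() is a hypothetical name, the returned strings are plain labels taken from this message (not necessarily names that iconv accepts), and only the selector values 0x11 to 0x15 mentioned in this message are covered. All other selector bytes are reported as "not handled" rather than guessed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the character table for a DVB text field.
 * If the first byte is >= 0x20 (or the field is empty) there is no
 * selector, so `default_charset` is returned and *skip is 0.
 * If the first byte is a selector covered here, a label is returned
 * and *skip is the number of bytes to drop before the text starts.
 * Returns NULL for selectors this sketch does not cover.
 */
static const char *dvb_text_charset(const unsigned char *text, size_t len,
                                    const char *default_charset,
                                    size_t *skip)
{
    assert(default_charset != NULL && skip != NULL);

    *skip = 0;
    if (len == 0 || text[0] >= 0x20)
        return default_charset;  /* no selector byte present */

    const char *label;
    switch (text[0]) {
    case 0x11: label = "ISO-10646"; break;
    case 0x12: label = "Korean";    break;
    case 0x13: label = "GB2312";    break;
    case 0x14: label = "BIG5";      break;
    case 0x15: label = "UTF-8";     break;
    default:   return NULL;      /* selector not covered by this sketch */
    }
    *skip = 1;                   /* the selector byte is not part of the name */
    return label;
}
```

Passing the default as a parameter mirrors the new optional parameter: a caller whose carriers send ISO-8859-15 without a selector byte would pass "ISO-8859-15" instead of the ISO 6937 default.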
Also, the spec provides 2 tables of control character codes, one for 1-byte character tables
and another for 2-byte character tables. Before the patches, the 1-byte control character table
was applied to all character sets. Now, it is applied only to ISO-8859* and ISO-6937,
as it doesn't seem to make sense for the other character sets. However, the 2-byte control
character table has not been implemented yet, for a few reasons:
1) I'm not familiar with 2-byte charsets;
2) I don't have any environment here that would allow me to test it;
3) The spec is not very clear about which character tables use 2-byte control codes.
EN 300 468 Annex A says, just before the 2-byte control code table:
"For two-byte character tables, the codes in the range 0xE080 to 0xE09F
are assigned to control functions as shown in table A.2."
So, it seems that the 2-byte control character table refers to character tables 0x11 to 0x14
(iso-10646 + Korean Character Set + GB2312 + BIG5).
However, the table A.2 is described as just:
"Table A.2: DVB codes within private use area of ISO/IEC 10646"
So, one may assume that it refers only to ISO-10646 (character table 0x11), or to this one
plus BIG5 (table 0x14), as BIG5 is a subset of ISO-10646.
The spec is even less clear about what should be done with character table 0x15 (ISO-10646/UTF-8),
as UTF-8 codes have a variable length of 1 to 4 bytes.
I _suspect_ that all character tables that are not ISO-8859 or ISO-6937 should be using table
A.2 (that means, character tables 0x11 to 0x15).
The code change to implement 2-byte control codes should be trivial, though. A placeholder for such
code is there in the scan code, with a short comment.
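As a rough idea of what such a change could look like, the sketch below removes codes in the quoted range 0xE080 to 0xE09F from a buffer of 16-bit units. It rests on assumptions that the questions above leave open: the text is taken to be a sequence of big-endian 2-byte codes, which tables this applies to is not decided here, and the control codes are simply dropped, whereas a real implementation may want to translate some of them instead. strip_ctrl2() is a hypothetical name, not the placeholder in scan.c.

```c
#include <assert.h>
#include <stddef.h>

/* Range quoted from EN 300 468 Annex A in this message. */
#define DVB_CTRL2_FIRST 0xE080u
#define DVB_CTRL2_LAST  0xE09Fu

/*
 * Remove two-byte DVB control codes from `buf` in place.
 * `len` is the length in bytes; the text is read as big-endian
 * 16-bit units. A trailing odd byte, if any, is kept unchanged.
 * Returns the new length in bytes.
 */
static size_t strip_ctrl2(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t r, w = 0;
    for (r = 0; r + 1 < len; r += 2) {
        unsigned code = ((unsigned)buf[r] << 8) | buf[r + 1];
        if (code >= DVB_CTRL2_FIRST && code <= DVB_CTRL2_LAST)
            continue;            /* control function: drop it */
        buf[w++] = buf[r];
        buf[w++] = buf[r + 1];
    }
    if (r < len)                 /* odd trailing byte */
        buf[w++] = buf[r];
    return w;
}
```

Such a filter would have to run before the charset conversion, since the control codes are defined in terms of the broadcast encoding, not the output charset.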
It would be great to have some feedback about it. So, comments are welcome.
Thanks,
Mauro.
* Re: dvb-apps: charset support
From: handygewinnspiel @ 2011-04-11 17:48 UTC
To: Mauro Carvalho Chehab, linux-media
Hi Mauro,
> I added some patches to dvb-apps/util/scan.c in order to properly support
> EN 300 468 charsets.
> Before the patches, scan was producing invalid UTF-8 codes here for
> ISO-8859-15 charsets, as it simply filled the service/provider names with
> whatever non-control characters were there. So, if your computer uses the
> same charset as your service provider, you're lucky.
> Otherwise, invalid characters will appear in the scan tables.
>
> After the changes, scan gets the charset from the locale environment and
> uses it as the charset of the output files.
This implementation in scan expects the environment settings to be of the form 'language_country.encoding', but I think the more general form is 'language_country.encoding@variant'.
I get the following warnings from scan, because iconv doesn't know 'ISO-8859-15@euro'.
<snip>
WARNING: Conversion from ISO-8859-9 to ISO-8859-15@euro not supported
WARNING: Conversion from ISO-8859-9 to ISO-8859-15@euro not supported
...
WARNING: Conversion from ISO-8859-15 to ISO-8859-15@euro not supported
WARNING: Conversion from ISO-8859-15 to ISO-8859-15@euro not supported
</snap>
I suggest changing scan.c as follows:
--- dvb-apps-5e68946b0e0d_orig/util/scan/scan.c 2011-04-10 20:22:52.000000000 +0200
+++ dvb-apps-5e68946b0e0d/util/scan/scan.c 2011-04-11 19:41:21.460000060 +0200
@@ -2570,14 +2570,14 @@
if ((charset = getenv("LC_ALL")) ||
(charset = getenv("LC_CTYPE")) ||
(charset = getenv ("LANG"))) {
- while (*charset != '.' && *charset)
- charset++;
- if (*charset == '.')
- charset++;
- if (*charset)
- output_charset = charset;
- else
- output_charset = nl_langinfo(CODESET);
+ // assuming 'language_country.encoding@variant'
+ char * p;
+
+ if ((p = strchr(charset, '.')))
+ charset = p + 1;
+ if ((p = strchr(charset, '@')))
+ *p = 0;
+ output_charset = charset;
} else
output_charset = nl_langinfo(CODESET);
This cuts the '@variant' part from charset, so that iconv will find its way.
cheers,
Winfried
* Re: dvb-apps: charset support
From: Mauro Carvalho Chehab @ 2011-04-11 18:24 UTC
To: handygewinnspiel; +Cc: linux-media
On 11-04-2011 14:48, handygewinnspiel@gmx.de wrote:
> Hi Mauro,
>
>> I added some patches to dvb-apps/util/scan.c in order to properly support
>> EN 300 468 charsets.
>> Before the patches, scan was producing invalid UTF-8 codes here for
>> ISO-8859-15 charsets, as it simply filled the service/provider names with
>> whatever non-control characters were there. So, if your computer uses the
>> same charset as your service provider, you're lucky.
>> Otherwise, invalid characters will appear in the scan tables.
>>
>> After the changes, scan gets the charset from the locale environment and
>> uses it as the charset of the output files.
>
> This implementation in scan expects the environment settings to be of the form 'language_country.encoding', but I think the more general form is 'language_country.encoding@variant'.
>
> I get the following warnings from scan, because iconv doesn't know 'ISO-8859-15@euro'.
Ah, OK. I had never seen such syntax. Thanks for pointing that out!
>
> <snip>
> WARNING: Conversion from ISO-8859-9 to ISO-8859-15@euro not supported
> WARNING: Conversion from ISO-8859-9 to ISO-8859-15@euro not supported
> ...
> WARNING: Conversion from ISO-8859-15 to ISO-8859-15@euro not supported
> WARNING: Conversion from ISO-8859-15 to ISO-8859-15@euro not supported
> </snap>
>
> I suggest changing scan.c as follows:
>
> --- dvb-apps-5e68946b0e0d_orig/util/scan/scan.c 2011-04-10 20:22:52.000000000 +0200
> +++ dvb-apps-5e68946b0e0d/util/scan/scan.c 2011-04-11 19:41:21.460000060 +0200
> @@ -2570,14 +2570,14 @@
> if ((charset = getenv("LC_ALL")) ||
> (charset = getenv("LC_CTYPE")) ||
> (charset = getenv ("LANG"))) {
> - while (*charset != '.' && *charset)
> - charset++;
> - if (*charset == '.')
> - charset++;
> - if (*charset)
> - output_charset = charset;
> - else
> - output_charset = nl_langinfo(CODESET);
> + // assuming 'language_country.encoding@variant'
> + char * p;
> +
> + if ((p = strchr(charset, '.')))
> + charset = p + 1;
> + if ((p = strchr(charset, '@')))
> + *p = 0;
> + output_charset = charset;
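This will fail if LANG=C.
Basically, if charset doesn't contain '.', this block should not set output_charset from it. As written, output_charset would end up as the whole string (e.g. "C") instead of falling back to nl_langinfo(CODESET).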
> } else
> output_charset = nl_langinfo(CODESET);
>
>
> This cuts the '@variant' part from charset, so that iconv will find its way.
>
> cheers,
> Winfried
>
>
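Putting both points from this thread together (cut the '@variant' part, and do nothing when there is no '.'), the parsing could look roughly like the sketch below. It is not the code that went into scan.c: charset_from_locale() is a hypothetical helper, and it copies the encoding into a caller-supplied buffer so that the string returned by getenv() is not modified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Extract the encoding from a locale string of the form
 * 'language_country.encoding@variant' into `out` (capacity `outcap`).
 * Returns 1 on success. Returns 0 if there is no '.encoding' part
 * (e.g. "C" or "de_DE@euro"), if `locale` is NULL, or if the result
 * does not fit; the caller should then use nl_langinfo(CODESET).
 */
static int charset_from_locale(const char *locale, char *out, size_t outcap)
{
    assert(out != NULL && outcap > 0);

    const char *dot = locale ? strchr(locale, '.') : NULL;
    if (!dot)
        return 0;                 /* no encoding given in the locale name */

    const char *enc = dot + 1;
    size_t n = strcspn(enc, "@"); /* length up to '@' or end of string */
    if (n == 0 || n >= outcap)
        return 0;                 /* empty encoding or buffer too small */

    memcpy(out, enc, n);
    out[n] = '\0';
    return 1;
}
```

A caller would try LC_ALL, LC_CTYPE and LANG in that order, as the existing code does, and use nl_langinfo(CODESET) whenever the helper returns 0.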