Unicode or not?

linux-c-programming.vger.kernel.org archive mirror

* Unicode or not?
@ 2012-03-05 12:23 Krzysztof
  2012-03-05 14:04 ` Andrej Gelenberg
  2012-03-05 19:57 ` Glynn Clements
  0 siblings, 2 replies; 5+ messages in thread
From: Krzysztof @ 2012-03-05 12:23 UTC (permalink / raw)
  To: linux-c-programming

Is the command line that is passed to a program as arguments ever Unicode-encoded? In other words, when should "main" be defined as "main(int argc, wchar_t **argv)"?

-- 
Regards
Krzysztof J.


* Re: Unicode or not?
  2012-03-05 12:23 Unicode or not? Krzysztof
@ 2012-03-05 14:04 ` Andrej Gelenberg
  2012-03-05 20:19   ` Krzysztof
  2012-03-05 19:57 ` Glynn Clements
  1 sibling, 1 reply; 5+ messages in thread
From: Andrej Gelenberg @ 2012-03-05 14:04 UTC (permalink / raw)
  To: Krzysztof; +Cc: linux-c-programming

Hi,

it's always char **. Which encoding the arguments have depends on the
system locale (mostly the LANG environment variable), but it must always
be some sort of multibyte encoding (like UTF-8). A 16-bit encoding such as
UCS-2 or UTF-16 would break many existing programs, because its embedded
zero bytes would terminate C strings early.
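
As a rough illustration of how a program can ask which encoding the current
locale implies, here is a minimal sketch using setlocale() and
nl_langinfo(CODESET). The helper name locale_codeset is hypothetical and not
part of any library; the sketch assumes a POSIX system that provides
<langinfo.h>.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): select the named locale
 * and return the name of its character encoding, or NULL if the locale
 * is not available. Pass "" to use the environment (LANG, LC_*).
 * Note that setlocale() changes process-wide state. */
const char *locale_codeset(const char *locale_name)
{
    assert(locale_name != NULL);

    if (setlocale(LC_ALL, locale_name) == NULL)
        return NULL;

    /* CODESET names the encoding of the current LC_CTYPE locale. */
    return nl_langinfo(CODESET);
}
```

Calling locale_codeset("") at the start of main() would typically report
"UTF-8" on a system configured with a UTF-8 locale, but the exact string
depends on the C library and the environment.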

http://stackoverflow.com/questions/1664476/is-it-possible-to-use-a-unicode-argv

and here, how to convert between UTF-8 and wchar_t:
http://www.ibm.com/developerworks/linux/library/l-linuni/index.html

On 03/05/2012 01:23 PM, Krzysztof wrote:
> Does it happen that command line which is passed to program arguments is
> "unicoded"? In other words, when should "main" be defined as "main(int
> argc, wchar_t **argv)"?
> 

Regards,
Andrej Gelenberg


* Re: Unicode or not?
  2012-03-05 12:23 Unicode or not? Krzysztof
  2012-03-05 14:04 ` Andrej Gelenberg
@ 2012-03-05 19:57 ` Glynn Clements
  1 sibling, 0 replies; 5+ messages in thread
From: Glynn Clements @ 2012-03-05 19:57 UTC (permalink / raw)
  To: Krzysztof; +Cc: linux-c-programming

Krzysztof wrote:

> Does it happen that command line which is passed to program arguments
> is "unicoded"? In other words, when should "main" be defined as
> "main(int argc, wchar_t **argv)"?

Never. Windows supports a wmain() function which uses wchar_t** for
argv, but there is no equivalent on Unix.

Unix doesn't normally use wide strings for communication between
programs or between programs and the OS. Unicode text invariably uses
UTF-8.
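
To make the "plain bytes in a char **" point concrete, the following sketch
formats the bytes of a string as hexadecimal. The helper name hex_bytes is
hypothetical; passing it argv[1] would show, for example, that a single
non-ASCII character arrives as several bytes when the terminal uses UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
 * uppercase hex pairs. Output is truncated if out is too small.
 * Returns the number of bytes in s (excluding the terminating NUL). */
size_t hex_bytes(const char *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL && out_size > 0);

    size_t len = strlen(s);
    size_t used = 0;
    out[0] = '\0';

    /* Each step needs at most 3 characters plus the terminating NUL. */
    for (size_t i = 0; i < len && used + 4 <= out_size; i++) {
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 i ? " " : "", (unsigned char)s[i]);
    }
    return len;
}
```

For the UTF-8 encoding of U+017C (the two bytes 0xC5 0xBC), the buffer
receives "C5 BC" and the function returns 2.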

-- 
Glynn Clements <glynn@gclements.plus.com>


* Re: Unicode or not?
  2012-03-05 14:04 ` Andrej Gelenberg
@ 2012-03-05 20:19   ` Krzysztof
  2012-03-05 20:52     ` Andrej Gelenberg
  0 siblings, 1 reply; 5+ messages in thread
From: Krzysztof @ 2012-03-05 20:19 UTC (permalink / raw)
  To: linux-c-programming

So how does one effectively read UTF-8 characters from a char* passed as
an argument under Linux?
Should one simply cast argv[n] to wchar_t*?

-- 
Regards
Krzysztof J.




* Re: Unicode or not?
  2012-03-05 20:19   ` Krzysztof
@ 2012-03-05 20:52     ` Andrej Gelenberg
  0 siblings, 0 replies; 5+ messages in thread
From: Andrej Gelenberg @ 2012-03-05 20:52 UTC (permalink / raw)
  To: Krzysztof; +Cc: linux-c-programming

Hi,

no, you can't simply cast it to wchar_t *. I recommend reading this article
about Unicode under Linux:
http://www.ibm.com/developerworks/linux/library/l-linuni/index.html

There are two possible ways to deal with UTF-8.

The first is to keep it as char * and use it as a plain C string. Pro: it's
simple, you can keep using the standard str* functions, and it is often
smaller than a wchar_t string. Con: non-ASCII characters take more than
one byte, so strlen() reports the number of bytes, which is larger than
the number of characters, and that can lead to problems when displaying
or counting characters. You can still count them with mblen(), but it's
a bit of a pain.

The second option is to convert it to wchar_t with the mbstowcs() function.
Pro: characters always have a fixed width (on Linux, wchar_t is 32 bits,
so every code point fits in one element). Con: you need to convert between
UTF-8 and wchar_t, and you need an additional buffer to hold the wchar_t
string (you can't do it in place, because the wchar_t string will often be
bigger than the UTF-8 string).
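
To illustrate the difference between bytes and characters in the first
approach, here is a minimal sketch that counts code points in a UTF-8
string. Note that it does not use mblen(): instead it counts every byte
that is not a UTF-8 continuation byte (10xxxxxx), which avoids any
dependence on the locale. It assumes the input is valid UTF-8, and the
function name utf8_length is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string,
 * ASSUMING it is valid UTF-8. Continuation bytes have the bit pattern
 * 10xxxxxx, so every other byte starts a new code point. */
size_t utf8_length(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the two-byte sequence 0xC5 0xBC, strlen() gives 2 while utf8_length()
gives 1. This counts code points, not displayed glyphs, so combining
characters are still counted separately.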

For example, if you need or just want a wchar_t string, you can do
something like this:

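/* needs <locale.h>, <stdlib.h>, <string.h> and <wchar.h>; call
   setlocale(LC_ALL, "") once at startup, otherwise mbstowcs() uses
   the "C" locale and may reject non-ASCII input */
size_t l = strlen(argv[i]);
/* one wchar_t per byte is an upper bound; +1 for the terminating L'\0' */
wchar_t *nbuf = calloc(l + 1, sizeof(*nbuf));
if ( !nbuf ) return 1;
l = mbstowcs(nbuf, argv[i], l + 1); /* may return a smaller value than
                                       the byte length */
if ( l == (size_t)-1 ) {
  /* invalid multibyte sequence was encountered */
  free(nbuf);
  return 2;
}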
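
The same idea can be wrapped in a reusable function. This is only a sketch:
the name arg_to_wide is hypothetical, and it uses mbsrtowcs() with a NULL
destination to ask for the exact number of wide characters needed, rather
than using the byte length as an upper bound. It assumes the caller has
already called setlocale(LC_ALL, "") so that the conversion follows the
user's encoding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated wide-character copy of
 * a multibyte string in the current locale's encoding, or NULL on an
 * invalid sequence or allocation failure. The caller must free() it. */
wchar_t *arg_to_wide(const char *arg)
{
    assert(arg != NULL);

    mbstate_t state;
    const char *src = arg;

    /* First pass: with a NULL destination, mbsrtowcs() only counts. */
    memset(&state, 0, sizeof state);
    size_t n = mbsrtowcs(NULL, &src, 0, &state);
    if (n == (size_t)-1)
        return NULL;            /* invalid multibyte sequence */

    wchar_t *wide = calloc(n + 1, sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Second pass: convert, leaving room for the terminating L'\0'. */
    memset(&state, 0, sizeof state);
    src = arg;
    mbsrtowcs(wide, &src, n + 1, &state);
    return wide;
}
```

A caller would typically do something like
wchar_t *w = arg_to_wide(argv[i]); check the result for NULL, use it with
the wcs* functions, and free(w) when done.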

Regards,
Andrej Gelenberg

On 03/05/2012 09:19 PM, Krzysztof wrote:
> So how to read effectively UTF-8 characters from char* passed as an
> argument under Linux?
> Should one simply cast argv[n] to wchar_t*?
> 

