linux-c-programming.vger.kernel.org archive mirror
* Reading UTF16 input from STDIN using fgetws()
@ 2009-10-09 15:12 SriKrishna Erra
  2009-10-10  9:01 ` Glynn Clements
  0 siblings, 1 reply; 2+ messages in thread
From: SriKrishna Erra @ 2009-10-09 15:12 UTC (permalink / raw)
  To: linux-c-programming


Hi all,
I am trying to read UTF16 input from STDIN using fgetws() on LINUX.

The man page for fgetws() says that
fgetws() converts the input string from the encoding specified either by
"LC_CTYPE" or in fopen() to a wide character string.

My intention is to read UTF16 input from STDIN, not from a file. For this I
will not use fopen(), because the input comes directly from STDIN.

So I will pass stdin directly to fgetws(), like
fgetws(buf,count,stdin);

As the default LC_CTYPE value is "utf-8" on Linux (and on all Unix flavours),
fgetws() is treating the input as utf-8 and trying to convert it to a wide
character string.

My input contains UTF16 strings and has the "0xfeff" BOM as its first character.
As fgetws() is treating the input as utf-8, 0xfe (the first byte of the BOM) will
be an invalid sequence in utf-8, and 0xff (the second byte of the BOM) is an EOF.
So fgetws() is returning nothing.
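
A side note on the bytes involved: the BOM U+FEFF is serialized as FF FE in
UTF-16LE and as FE FF in UTF-16BE, and neither 0xFE nor 0xFF ever occurs in
well-formed UTF-8. Also, fgetc() returns each byte as an unsigned char
converted to int, so a 0xFF byte reads as 255, which is distinct from EOF.
A minimal sketch of checking which BOM a buffer starts with
(detect_utf16_bom and the enum names are made up for illustration, they are
not library functions):

```c
#include <stddef.h>

/* Hypothetical helper types and names, for illustration only. */
enum utf16_bom { BOM_NONE = 0, BOM_LE = 1, BOM_BE = 2 };

/* Inspect the first two bytes of a buffer for a UTF-16 byte order mark. */
enum utf16_bom detect_utf16_bom(const unsigned char *p, size_t n)
{
    if (p == NULL || n < 2)
        return BOM_NONE;
    if (p[0] == 0xFF && p[1] == 0xFE)
        return BOM_LE;          /* U+FEFF stored low byte first */
    if (p[0] == 0xFE && p[1] == 0xFF)
        return BOM_BE;          /* U+FEFF stored high byte first */
    return BOM_NONE;
}
```

If the input really is UTF-16LE, the first byte on the wire would be 0xFF
rather than 0xFE.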

If I use fopen() then there will be no issues, because I will pass the input
encoding as a parameter to fopen(), like fopen("filename","r,ccs=UTF16-LE");

With this fgetws() will treat the input as UTF16 and will convert the input
from UTF16 to a wide character string.

So no issues with fopen().

But my requirement is input from STDIN.

Please let me know how to set the encoding "ccs=UTF-16LE" on STDIN so that
fgetws() will treat the STDIN input as UTF16.

I have also tried fdopen(), but to no avail.
fdopen(int fd,mode);

There is a requirement that the mode parameter values in fdopen() should be
the same as the ones used in fopen().

But we are not using fopen() at all, and the default encoding of STDIN is
utf-8.
So when fdopen(fd,"ccs=utf-16le"); is used, it returns nothing, as there is a
mismatch in encodings, i.e. the default of STDIN is utf-8 and fdopen() is
passing utf-16LE.

So please let me know how to change the encoding of STDIN, i.e. how to set
the encoding "ccs=UTF-16LE" on STDIN.

On Windows we have _setmode(), but on Linux there is no such provision,
because Linux treats text and binary files the same.

Thanks in Advance.

regards,
Srikrishna Erra.



* Re: Reading UTF16 input from STDIN using fgetws()
  2009-10-09 15:12 Reading UTF16 input from STDIN using fgetws() SriKrishna Erra
@ 2009-10-10  9:01 ` Glynn Clements
  0 siblings, 0 replies; 2+ messages in thread
From: Glynn Clements @ 2009-10-10  9:01 UTC (permalink / raw)
  To: SriKrishna Erra; +Cc: linux-c-programming


SriKrishna Erra wrote:

> As the default LC_CTYPE value is "utf-8" on Linux (and on all Unix flavours),
> fgetws() is treating the input as utf-8 and trying to convert it to a wide
> character string.

The default locale for all categories is "C" (alias "POSIX"), which
uses the ASCII encoding. To use any other locale, you must use
setlocale(LC_CTYPE, ...) or setlocale(LC_ALL, ...).

If you call setlocale(LC_CTYPE, "") (i.e. an empty locale string), the
LC_CTYPE category will be initialised based upon the environment
variables LC_ALL, LC_CTYPE, or LANG. If none of those variables is
defined, it will remain in the "C" locale.
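
To see this for yourself, something like the following sketch queries the
locale before and after adopting the environment's settings. The function
names are made up for illustration; setlocale() and nl_langinfo() are the
standard calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Query (without changing) the name of the current LC_CTYPE locale.
 * At program start this is "C". */
const char *current_ctype_locale(void)
{
    return setlocale(LC_CTYPE, NULL);
}

/* Adopt LC_ALL / LC_CTYPE / LANG from the environment.
 * Returns the resulting codeset name, or NULL if the locale named by the
 * environment is not available (the locale is then left unchanged). */
const char *adopt_environment_ctype(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Printing current_ctype_locale() first and adopt_environment_ctype() second
shows whether your environment is actually selecting a UTF-8 locale.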

Your Linux distribution may configure these variables to refer to a
UTF-8 locale, but that's a different issue.

> My input contains UTF16 strings and has the "0xfeff" BOM as its first character.
> As fgetws() is treating the input as utf-8, 0xfe (the first byte of the BOM) will
> be an invalid sequence in utf-8, and 0xff (the second byte of the BOM) is an EOF.
> So fgetws() is returning nothing.
> 
> If I use fopen() then there will be no issues, because I will pass the input
> encoding as a parameter to fopen(), like fopen("filename","r,ccs=UTF16-LE");
> 
> With this fgetws() will treat the input as UTF16 and will convert the input
> from UTF16 to a wide character string.
> 
> So no issues with fopen().
> 
> But my requirement is input from STDIN.
> 
> Please let me know how to set the encoding "ccs=UTF-16LE" on STDIN so that
> fgetws() will treat the STDIN input as UTF16.
> 
> I have also tried fdopen(), but to no avail.
> fdopen(int fd,mode);

fdopen() returns a new FILE*. I don't know whether it's safe to assign
this to stdin, though.

> There is a requirement that the mode parameter values in fdopen() should be
> the same as the ones used in fopen().

No, it says that the mode must be compatible with the underlying
descriptor, i.e. that you can't use fdopen(fd, "w") if the file was
opened in O_RDONLY mode.
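
The compatibility rule can be sketched as a check against the descriptor's
access mode. This is only an illustration of the rule, not what any
particular C library does internally, and mode_fits_descriptor is a made-up
name:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if the stdio mode string is compatible with the
 * access mode of fd, 0 if it is not, -1 on error (bad fd or NULL mode). */
int mode_fits_descriptor(int fd, const char *mode)
{
    if (mode == NULL)
        return -1;

    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    int access = flags & O_ACCMODE;
    int update = strchr(mode, '+') != NULL;
    int wants_read = mode[0] == 'r' || update;
    int wants_write = mode[0] == 'w' || mode[0] == 'a' || update;

    if (wants_read && access == O_WRONLY)
        return 0;               /* cannot read a write-only descriptor */
    if (wants_write && access == O_RDONLY)
        return 0;               /* cannot write a read-only descriptor */
    return 1;
}
```

Nothing in this check involves an encoding; only the read/write direction
has to agree.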

> But we are not using fopen() at all, and the default encoding of STDIN is
> utf-8.
> So when fdopen(fd,"ccs=utf-16le"); is used, it returns nothing, as there is a
> mismatch in encodings, i.e. the default of STDIN is utf-8 and fdopen() is
> passing utf-16LE.

fdopen() returns a new FILE* which has nothing to do with stdin, even
if you use 0 for the fd. The underlying descriptor doesn't have an
encoding associated with it.

Have you tried using the FILE* returned from fdopen()?
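
That is, reads have to go through the pointer fdopen() hands back, not
through stdin. A minimal sketch, with made-up function names; whether
fdopen() honours a ",ccs=" suffix in its mode string is not something I
have checked, so plain "r" is used here:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>
#include <wchar.h>

/* Wrap an existing descriptor in a fresh read stream. The result is a new
 * FILE object with its own buffer; it is not stdin, even when fd is 0. */
FILE *wrap_descriptor(int fd)
{
    return fdopen(fd, "r");
}

/* Convenience wrapper for descriptor 0. */
FILE *wrap_stdin_descriptor(void)
{
    return wrap_descriptor(STDIN_FILENO);
}

/* Read one line of wide characters from the given stream (not from stdin). */
wchar_t *read_wide_line(FILE *fp, wchar_t *buf, int n)
{
    return fgetws(buf, n, fp);
}
```

So the call would look like read_wide_line(fp, buf, count) with fp being the
value returned by wrap_stdin_descriptor().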

> So please let me know how to change the encoding of STDIN, i.e. how to set
> the encoding "ccs=UTF-16LE" on STDIN.

Have you tried:

	freopen("/dev/stdin", "r,ccs=UTF16-LE", stdin);

?

freopen() is the "standard" way to associate a new file with an
existing FILE*.
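
A slightly fuller sketch of that suggestion follows. I have not tested it
against your input. Note that the ",ccs=" suffix is a glibc extension, that
"/dev/stdin" is not available on every system, and that the accepted
spelling of the charset name (e.g. "UTF-16LE") depends on the iconv
implementation. The function names are made up for illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Build "r,ccs=<charset>" into out.
 * Returns 0 on success, -1 if the result does not fit. */
int build_ccs_mode(char *out, size_t outsz, const char *charset)
{
    int n = snprintf(out, outsz, "r,ccs=%s", charset);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/* Re-associate stdin with path, asking the C library to decode the bytes
 * from charset. Returns 0 on success, -1 on failure; after a failure stdin
 * is closed and must not be used. */
int reopen_stdin_with_charset(const char *path, const char *charset)
{
    char mode[64];
    if (build_ccs_mode(mode, sizeof mode, charset) != 0)
        return -1;
    return freopen(path, mode, stdin) == NULL ? -1 : 0;
}

/* Read wide lines until EOF or error; returns the number of wide
 * characters read. */
long count_wide_chars(FILE *fp)
{
    wchar_t buf[256];
    long total = 0;
    while (fgetws(buf, 256, fp) != NULL)
        total += (long)wcslen(buf);
    return total;
}
```

Usage would be reopen_stdin_with_charset("/dev/stdin", "UTF-16LE") followed
by fgetws() calls on stdin as before.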

-- 
Glynn Clements <glynn@gclements.plus.com>
