* [PATCH] new Page: isalpha__3(3)
@ 2014-03-04 8:37 walter harms
0 siblings, 0 replies; 5+ messages in thread
From: walter harms @ 2014-03-04 8:37 UTC (permalink / raw)
To: LKML
Hi List,
The ctype macros like isalpha(3) have a locale specific counterpart.
This page was missing.
re,
wh
Signed-off-by: wharms@bfs.de <wharms@bfs.de>
.\" Copyright (c) 2013 by Walter Harms
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date. The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein. The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.\"
.TH ISALPHA_L 3 2013-09-20 "GNU" "Linux Programmer's Manual"
.SH NAME
isalnum_l, isalpha_l, isblank_l, iscntrl_l, isdigit_l, isgraph_l, islower_l,
isprint_l, ispunct_l, isspace_l, isupper_l, isxdigit_l \- character
classification routines
.SH SYNOPSIS
.nf
.B #include <ctype.h>
.sp
.BI "int isalnum_l(int " "c" ", locale_t " loc );
.br
.BI "int isalpha_l(int " "c" ", locale_t " loc );
.br
.BI "int isascii_l(int " "c" ", locale_t " loc );
.br
.BI "int isblank_l(int " "c" ", locale_t " loc );
.br
.BI "int iscntrl_l(int " "c" ", locale_t " loc );
.br
.BI "int isdigit_l(int " "c" ", locale_t " loc );
.br
.BI "int isgraph_l(int " "c" ", locale_t " loc );
.br
.BI "int islower_l(int " "c" ", locale_t " loc );
.br
.BI "int isprint_l(int " "c" ", locale_t " loc );
.br
.BI "int ispunct_l(int " "c" ", locale_t " loc );
.br
.BI "int isspace_l(int " "c" ", locale_t " loc );
.br
.BI "int isupper_l(int " "c" ", locale_t " loc );
.br
.BI "int isxdigit_l(int " "c" ", locale_t " loc );
.fi
.sp
.SH DESCRIPTION
These functions check whether
.IR c ,
which must have the value of an
.I unsigned char
or
.BR EOF ,
falls into a certain character class according to the current locale.
.TP
.BR isalnum_l ()
checks for an alphanumeric character; it is equivalent to
.BI "(isalpha(" c ") || isdigit(" c "))" \fR.
.TP
.BR isalpha_l ()
checks for an alphabetic character; in the standard \fB"C"\fP
locale, it is equivalent to
.BI "(isupper_l(" c ") || islower_l(" c "))" \fR.
In some locales, there may be additional characters for which
.BR isalpha ()
is true\(emletters which are neither upper case nor lower
case.
.TP
.BR isascii_l ()
checks whether \fIc\fP is a 7-bit
.I unsigned char
value that fits into
the ASCII character set.
.TP
.BR isblank_l ()
checks for a blank character; that is, a space or a tab.
.TP
.BR iscntrl_l ()
checks for a control character.
.TP
.BR isdigit_l ()
checks for a digit (0 through 9).
.TP
.BR isgraph_l ()
checks for any printable character except space.
.TP
.BR islower_l ()
checks for a lower-case character.
.TP
.BR isprint_l ()
checks for any printable character including space.
.TP
.BR ispunct_l ()
checks for any printable character which is not a space or an
alphanumeric character.
.TP
.BR isspace_l ()
checks for white-space characters.
In the
.B """C"""
and
.B """POSIX"""
locales, these are: space, form-feed
.RB ( \(aq\ef\(aq ),
newline
.RB ( \(aq\en\(aq ),
carriage return
.RB ( \(aq\er\(aq ),
horizontal tab
.RB ( \(aq\et\(aq ),
and vertical tab
.RB ( \(aq\ev\(aq ).
.TP
.BR isupper_l ()
checks for an uppercase letter.
.TP
.BR isxdigit_l ()
checks for hexadecimal digits, that is, one of
.br
.BR "0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F" .
.SH RETURN VALUE
The values returned are nonzero if the character
.I c
falls into the tested class, and a zero value
if not.
.SH CONFORMING TO
POSIX.1-2008 specifies all of these functions.
.SH NOTES
The details of what characters belong into which class depend on the current
locale.
.sp
from
.IR locale.h :
The concept of one static locale per category is not very well
thought out. Many applications will need to process its data using
information from several different locales. Another application is
the implementation of the internationalization handling in the
upcoming ISO C++ standard library. To support this another set of
the functions using locale data exist which have an additional
argument.
For example,
.BR isupper ()
will not recognize an A-umlaut (\(:A) as an uppercase letter in the default
.B "C"
locale.
.SH EXAMPLE
The following example takes a locale abbreviation like "de_DE" as argument.
If no argument is supplied it will use "C". With "de_DE" the code will
identify the O-Umlaut correctly as an alphanumeric character but not with "C".
In contrast the punctuation will perform as before.
.nf
#include <stdio.h>
#include <locale.h>
int main(int argc,char *argv[])
{
char *str="c2p\(:O.,";
int i;
locale_t loc;
if (argc > 1 )
loc = newlocale (LC_ALL_MASK, argv[1], NULL);
else
loc = newlocale (LC_ALL_MASK, "C", NULL);
for(i=0;str[i]!=0;i++) {
if (isalnum_l(str[i],loc))
printf("The character %c is alphanumeric.\\n",str[i]);
if ( ispunct_l(str[i],loc) )
printf ("The character %c is punctuation.\\n",str[i]);
}
return 0;
}
.fi
.SH SEE ALSO
.BR iswalnum (3),
.BR iswalpha (3),
.BR iswblank (3),
.BR iswcntrl (3),
.BR iswdigit (3),
.BR iswgraph (3),
.BR iswlower (3),
.BR iswprint (3),
.BR iswpunct (3),
.BR iswspace (3),
.BR iswupper (3),
.BR iswxdigit (3),
.BR setlocale (3),
.BR toascii (3),
.BR tolower (3),
.BR toupper (3),
.BR ascii (7),
.BR locale (7)
* [PATCH] new Page: isalpha__3(3)
@ 2014-03-04 15:35 walter harms
0 siblings, 1 reply; 5+ messages in thread
From: walter harms @ 2014-03-04 15:35 UTC (permalink / raw)
To: linux-man
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH] new Page: isalpha__3(3)
@ 2014-03-10 13:24 Michael Kerrisk (man-pages)
0 siblings, 2 replies; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-03-10 13:24 UTC (permalink / raw)
To: wharms-fPG8STNUNVg, linux-man
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Bruno Haible

[CC += Bruno; Bruno, if you have *two* moments to spare, maybe you
could confirm that I am not saying stupid things about UTF-8, MBS,
and WCS. If you have only *one* moment to spare, I'd appreciate if
you look at the program I provide below as an example of the use of
mbstowcs(); I plan to add it to the mbstowcs() man page that you
wrote, and I wonder if you might check it to see that everything is
in order. Thanks! To save time, just grep for the instance of your
name below...]

Hello Walter,

You've submitted a number of pages over the past months that I have
not found the energy to review. I respond to this submission, with
the goal of explaining why, since it shows many of the problems that
I see in the other submissions (and, in several cases, problems that
I have commented on in your past submissions).

On 03/04/2014 04:35 PM, walter harms wrote:
> Hi List,
> The ctype macros like isalpha(3) have a locale specific counterpart.
> This page was missing.
>
> re,
> wh
>
>
> Signed-off-by: wharms-fPG8STNUNVg@public.gmane.org <wharms-fPG8STNUNVg@public.gmane.org>
>
>
> .\" Copyright (c) 2013 by Walter Harms
> .\"
> .\" %%%LICENSE_START(VERBATIM)
> .\" Permission is granted to make and distribute verbatim copies of this
> .\" manual provided the copyright notice and this permission notice are
> .\" preserved on all copies.
> .\"
> .\" Permission is granted to copy and distribute modified versions of this
> .\" manual under the conditions for verbatim copying, provided that the
> .\" entire resulting derived work is distributed under the terms of a
> .\" permission notice identical to this one.
> .\"
> .\" Since the Linux kernel and libraries are constantly changing, this
> .\" manual page may be incorrect or out-of-date. The author(s) assume no
> .\" responsibility for errors or omissions, or for damages resulting from
> .\" the use of the information contained herein. The author(s) may not
> .\" have taken the same level of care in the production of this manual,
> .\" which is licensed free of charge, as they might when working
> .\" professionally.
> .\"
> .\" Formatted or processed versions of this manual, if unaccompanied by
> .\" the source, must acknowledge the copyright and authors of this work.
> .\" %%%LICENSE_END

Problem: much of the text below is a straight copy from isalpha.3,
written by Thomas Koenig. You can't do this without attribution.
(Simplest would have been to retain Thomas's copyright line and
license; I see that you did the latter.) To be clear, I do not believe
you've done this maliciously; rather, you've done it in ignorance
of what the requirements of copyright law are. But, I feel sure I've
mentioned this issue to you in the past...

Problem: notwithstanding the copyright issue, the duplication of text
is done without good reason. A more sensible approach in this case
would be to perform some integration/reworking in the existing
isalpha.3 page, or simply to defer to that page inside this page.

> .\"
> .\"
> .TH ISALPHA_L 3 2013-09-20 "GNU" "Linux Programmer's Manual"
> .SH NAME
> isalnum_l, isalpha_l, isblank_l, iscntrl_l, isdigit_l, isgraph_l, islower_l,
> isprint_l, ispunct_l, isspace_l, isupper_l, isxdigit_l \- character
> classification routines
> .SH SYNOPSIS
> .nf
> .B #include <ctype.h>
> .sp
> .BI "int isalnum_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isalpha_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isascii_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isblank_l(int " "c" ", locale_t " loc );
> .br
> .BI "int iscntrl_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isdigit_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isgraph_l(int " "c" ", locale_t " loc );
> .br
> .BI "int islower_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isprint_l(int " "c" ", locale_t " loc );
> .br
> .BI "int ispunct_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isspace_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isupper_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isxdigit_l(int " "c" ", locale_t " loc );
> .fi
> .sp

Problem: no description of the feature test macro requirements for
these pages. See man-pages(7).

Problem: this patch includes no "link" pages (containing just
".so man3/isalpha_l.3" for the dozen or so other functions documented
on this page.)

> .SH DESCRIPTION
> These functions check whether
> .IR c ,
> which must have the value of an
> .I unsigned char
> or
> .BR EOF ,
> falls into a certain character class according to the current locale.

The above sentence applies for the functions documented in isalpha.3,
but is meaningless for the functions listed in the SYNOPSIS of this
page, where the conversion is dependent on the locale 'loc', not the
current locale.

Problem: all of the rest of the DESCRIPTION contains nothing new that
isn't already in isalpha.3.

> .TP
> .BR isalnum_l ()
> checks for an alphanumeric character; it is equivalent to
> .BI "(isalpha(" c ") || isdigit(" c "))" \fR.
> .TP
> .BR isalpha_l ()
> checks for an alphabetic character; in the standard \fB"C"\fP
> locale, it is equivalent to
> .BI "(isupper_l(" c ") || islower_l(" c "))" \fR.
> In some locales, there may be additional characters for which
> .BR isalpha ()
> is true\(emletters which are neither upper case nor lower
> case.
> .TP
> .BR isascii_l ()
> checks whether \fIc\fP is a 7-bit
> .I unsigned char
> value that fits into
> the ASCII character set.
> .TP
> .BR isblank_l ()
> checks for a blank character; that is, a space or a tab.
> .TP
> .BR iscntrl_l ()
> checks for a control character.
> .TP
> .BR isdigit_l ()
> checks for a digit (0 through 9).
> .TP
> .BR isgraph_l ()
> checks for any printable character except space.
> .TP
> .BR islower_l ()
> checks for a lower-case character.
> .TP
> .BR isprint_l ()
> checks for any printable character including space.
> .TP
> .BR ispunct_l ()
> checks for any printable character which is not a space or an
> alphanumeric character.
> .TP
> .BR isspace_l ()
> checks for white-space characters.
> In the
> .B """C"""
> and
> .B """POSIX"""
> locales, these are: space, form-feed
> .RB ( \(aq\ef\(aq ),
> newline
> .RB ( \(aq\en\(aq ),
> carriage return
> .RB ( \(aq\er\(aq ),
> horizontal tab
> .RB ( \(aq\et\(aq ),
> and vertical tab
> .RB ( \(aq\ev\(aq ).
> .TP
> .BR isupper_l ()
> checks for an uppercase letter.
> .TP
> .BR isxdigit_l ()
> checks for hexadecimal digits, that is, one of
> .br
> .BR "0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F" .

Problem: there is no mention of the requirements on 'loc'. (It must be
a valid locale handle and must not be LC_GLOBAL_LOCALE.)

> .SH RETURN VALUE
> The values returned are nonzero if the character
> .I c
> falls into the tested class, and a zero value
> if not.

(Minor) problem: no VERSIONS section.

> .SH CONFORMING TO
> POSIX.1-2008 specifies all of these functions.
> .SH NOTES
> The details of what characters belong into which class depend on the current
> locale.
> .sp
> from
> .IR locale.h :
> The concept of one static locale per category is not very well
> thought out. Many applications will need to process its data using
> information from several different locales. Another application is
> the implementation of the internationalization handling in the
> upcoming ISO C++ standard library. To support this another set of
> the functions using locale data exist which have an additional
> argument.

Simply quoting this text from locale.h without explanation does not
really add much to the description. The point is that the *_l
pages are designed to address the limitation that the traditional
locale APIs do not mix well with multi-threaded applications
and with applications that must deal with multiple locales.
A general statement to that effect needs to appear somewhere, though
probably not on this page. (I'll add something to locale(7).)

> For example,
> .BR isupper ()
> will not recognize an A-umlaut (\(:A) as an uppercase letter in the default
> .B "C"
> locale.

Problem: The sentence above relates to a function not even in the
SYNOPSIS of this page.

> .SH EXAMPLE
> The following example takes a locale abbreviation like "de_DE" as argument.
> If no argument is supplied it will use "C". With "de_DE" the code will
> identify the O-Umlaut correctly as an alphanumeric character but not with "C".
> In contrast the punctuation will perform as before.
> .nf
>
> #include <stdio.h>
> #include <locale.h>
>
> int main(int argc,char *argv[])
> {
> char *str="c2p\(:O.,";
> int i;
> locale_t loc;
> if (argc > 1 )
> loc = newlocale (LC_ALL_MASK, argv[1], NULL);
> else
> loc = newlocale (LC_ALL_MASK, "C", NULL);
>
> for(i=0;str[i]!=0;i++) {
> if (isalnum_l(str[i],loc))
> printf("The character %c is alphanumeric.\\n",str[i]);
> if ( ispunct_l(str[i],loc) )
> printf ("The character %c is punctuation.\\n",str[i]);
> }
> return 0;
> }
>
> .fi

The example program above violates multiple guidelines from
man-pages(7), including, for example:

* 4-space indent levels
* spacing around operators and parentheses does not follow K&R norms
  ("indent -kr" fixes most of this) or the norms demonstrated in
  numerous existing pages.
* The program does not compile without warnings when using 'cc -Wall'
  (in particular, _XOPEN_SOURCE needs to be defined as 700).
* Various header files that should be included, are not.
* The program does not do error checking of function calls.

But, more to the point, the program appears to be broken, if you are
operating in a UTF-8 locale, which I assume you are. I suppose
the program does work if you are operating in an iso-8859-1 locale
(though that seems an unlikely set-up these days), but that
point would need some careful explanation in the man page, or some
clarification in a shell session log that shows a run of the program,
otherwise the program would cause much confusion for people on UTF-8
systems.

Functions such as isalnum_l() can't be applied to UTF-8 characters.
POSIX seems clear:

    The c argument is an int, the value of which the application
    shall ensure is representable as an unsigned char or equal to
    the value of the macro EOF. If the argument has any other
    value, the behavior is undefined.

(It would have been useful to see sample output from your program as
part of the man page, as a help to the reader, but also as a check of
what you believe is happening, and what locale you are working with.)

Instead, a conversion to wide characters is needed (mbstowcs(3)), and
then the use of the isw*_l() functions. See my example, further down.

A modified version of your program illustrates the problem:

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>

int
main(int argc, char *argv[])
{
    char *str = "c2pÖ.,";
    int i;
    locale_t loc;

    printf("string length = %ld\n", (long) strlen(str));

    if (argc > 1)
        loc = newlocale(LC_ALL_MASK, argv[1], NULL);
    else
        loc = newlocale(LC_ALL_MASK, "C", NULL);

    for (i = 0; str[i] != 0; i++) {
        printf("%d: %x %c: ", i, str[i] & 0xff, str[i] & 0xff);
        printf("%salphanumeric ", isalnum_l(str[i], loc) ? "" : "!");
        printf("%spunctuation ", ispunct_l(str[i], loc) ? "" : "!");
        printf("\n");
    }
    return 0;
}

See what happens when we run this:

$ ./a.out de_DE
string length = 7
0: 63 c: alphanumeric !punctuation
1: 32 2: alphanumeric !punctuation
2: 70 p: alphanumeric !punctuation
3: c3 �: alphanumeric !punctuation     <====
4: 96 �: !alphanumeric !punctuation    <====
5: 2e .: !alphanumeric punctuation
6: 2c ,: !alphanumeric punctuation

Assuming you are operating in a UTF-8 system, the fourth glyph in your
string (O-umlaut) is actually treated as two bytes, not one character
by the program, but the construction of your program hides this,
because it does not print information about all of the bytes!

Now, I am no expert (and so I may yet end up embarrassed), but it
appears to me that you have misunderstood some of the fundamentals of
how multibyte characters, UTF-8, and wide character strings work, and
you didn't gain that understanding while writing the page. (Good
reading:
http://www.joelonsoftware.com/articles/Unicode.html
http://www.cprogramming.com/tutorial/unicode.html
And I admit I learned a whole lot as I pulled your program apart.)

> .SH SEE ALSO

Repeating the SEE ALSO from isalpha.3 isn't really correct (or useful)
here.

> .BR iswalnum (3),
> .BR iswalpha (3),
> .BR iswblank (3),
> .BR iswcntrl (3),
> .BR iswdigit (3),
> .BR iswgraph (3),
> .BR iswlower (3),
> .BR iswprint (3),
> .BR iswpunct (3),
> .BR iswspace (3),
> .BR iswupper (3),
> .BR iswxdigit (3),
> .BR setlocale (3),
> .BR toascii (3),
> .BR tolower (3),
> .BR toupper (3),
> .BR ascii (7),
> .BR locale (7)

(Minor) problem: probably, other pages (at least, locale(7)) should
add SEE ALSO references to this page. That change is not in this
patch.

==

Now, to be clear: many page submissions that I receive fail on some
of the points mentioned above, but this page fails on multiple counts.

In summary... Walter, you are often good at finding things that need to
be documented, and I know your work is well intended. However, the pages
you submit often require so much review/repair effort (in some cases,
initial drafts of pages appear not to even have been run through a spell
checker, though this page seems okay), that it is often faster to write
the pages myself. And some of the same problems that I comment on in
earlier submissions turn up again in new submissions. Thus, it is hard
for me to find the enthusiasm to review these pages myself (and I have
in any case very limited bandwidth) and help them get repaired (and
it's rare that others step in, though I noticed that Stefan Puiu did
take a shot on one of your submissions). I do not know what the solution
is here, but this mail explains the problems from my side, and why
I'm often unresponsive / slow to respond to your submissions (and
am likely to remain so, unless something changes).

And here's how I think the kind of thing you intended to do in your
example program actually needs to be done. (Bruno, perhaps you can
confirm that this code is okay, as I plan to place this example in the
mbstowcs(3) man page.)

---8x------8x------8x------8x------8x------8x------8x------8x------8x---
#include <locale.h>
#include <wctype.h>
#include <wchar.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    size_t mbslen;      /* Number of multibyte characters in source */
    wchar_t *wcs;       /* Pointer to converted wide character string */
    wchar_t *wp;

    if (argc < 3) {
        fprintf(stderr, "Usage: %s <locale> <string>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Apply the specified locale */

    if (setlocale(LC_ALL, argv[1]) == NULL) {
        perror("setlocale");
        exit(EXIT_FAILURE);
    }

    /* Calculate the length required to hold argv[2] converted to
       a wide character string */

    mbslen = mbstowcs(NULL, argv[2], 0);
    if (mbslen == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    /* Describe the source string to the user */

    printf("Length of source string (excluding terminator):\n");
    printf("    %ld bytes\n", (long) strlen(argv[2]));
    printf("    %ld multibyte characters\n\n", (long) mbslen);

    /* Allocate wide character string of the desired size. Add 1
       to allow for terminating null wide character (L'\0'). */

    wcs = calloc(mbslen + 1, sizeof(wchar_t));
    if (wcs == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    /* Convert the multibyte character string in argv[2] to a
       wide character string */

    if (mbstowcs(wcs, argv[2], mbslen + 1) == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    printf("Wide character string is: %ls (%ld characters)\n",
            wcs, (long) mbslen);

    /* Now do some inspection of the classes of the characters in
       the wide character string */

    for (wp = wcs; *wp != 0; wp++) {
        printf("    %lc ", (wint_t) *wp);

        if (!iswalpha(*wp))
            printf("!");
        printf("alpha ");

        if (iswalpha(*wp)) {
            if (iswupper(*wp))
                printf("upper ");

            if (iswlower(*wp))
                printf("lower ");
        }

        putchar('\n');
    }

    exit(EXIT_SUCCESS);
}
---8x------8x------8x------8x------8x------8x------8x------8x------8x---

And here's an example of what we see when running the program:

$ ./a.out de_DE.UTF-8 "Grüße!"
Length of source string (excluding terminator):
    8 bytes
    6 multibyte characters

Wide character string is: Grüße! (6 characters)
    G alpha upper
    r alpha lower
    ü alpha lower
    ß alpha lower
    e alpha lower
    ! !alpha

With kind regards,

Michael

PS I'm working on adding the *_l functions to the isalpha.3 page, and
will send a draft to the list.

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
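
For single-byte locales, the review points above (feature test macro,
error checking, a valid 'loc' handle, and an argument representable as
unsigned char) could be combined roughly as follows. This is only a
minimal sketch, assuming a glibc system; count_alnum() is a
hypothetical helper name and is not taken from any man page.

```c
#define _XOPEN_SOURCE 700
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Count the bytes of 's' that isalnum_l() classifies as alphanumeric
   in the locale named 'name'. Returns -1 if the locale object cannot
   be created. Hypothetical helper, for illustration only. */
int
count_alnum(const char *s, const char *name)
{
    locale_t loc;
    int n = 0;

    /* newlocale() returns (locale_t) 0 on failure, for example when
       the named locale is not installed */
    loc = newlocale(LC_ALL_MASK, name, (locale_t) 0);
    if (loc == (locale_t) 0) {
        perror("newlocale");
        return -1;
    }

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: passing a negative char value to
           isalnum_l() is undefined behavior */
        if (isalnum_l((unsigned char) *s, loc))
            n++;
    }

    freelocale(loc);
    return n;
}
```

Even with the cast, the function still looks at one byte at a time, so
in a UTF-8 locale the two bytes of an O-umlaut are examined
separately; the sketch is only meaningful for single-byte encodings
such as ISO-8859-1.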
* Re: [PATCH] new Page: isalpha__3(3)
@ 2014-03-12 18:34 walter harms
0 siblings, 0 replies; 5+ messages in thread
From: walter harms @ 2014-03-12 18:34 UTC (permalink / raw)
To: Michael Kerrisk (man-pages); +Cc: linux-man

On 10.03.2014 14:24, Michael Kerrisk (man-pages) wrote:
>
> Hello Walter,
>
> You've submitted a number of pages over the past months that I have
> not found the energy to review. I respond to this submission, with
> the goal of explaining why, since it shows many of the problems that
> I see in the other submissions (and, in several cases, problems that
> I have commented on in your past submissions).

My major problem was that I got no reply at all, and I was wondering
whether the page had arrived on the mailing list at all.

I shortened the mail a bit. Since you made a new page, most of this is
now obsolete, but I would like to explain a few things I did.

>> .SH CONFORMING TO
>> POSIX.1-2008 specifies all of these functions.
>> .SH NOTES
>> The details of what characters belong into which class depend on the current
>> locale.
>> .sp
>> from
>> .IR locale.h :
>> The concept of one static locale per category is not very well
>> thought out. Many applications will need to process its data using
>> information from several different locales. Another application is
>> the implementation of the internationalization handling in the
>> upcoming ISO C++ standard library. To support this another set of
>> the functions using locale data exist which have an additional
>> argument.
>
> Simply quoting this text from locale.h without explanation does not
> really add much to the description. The point is that the *_l
> pages are designed to address the limitation that the traditional
> locale APIs do not mix well with multi-threaded applications
> and with applications that must deal with multiple locales.
> A general statement to that effect needs to appear somewhere, though
> probably not on this page.
> (I'll add something to locale(7).)

For me that was very clear at the time, since I had exactly that
problem: I needed two locales internally for parsing, with no need for
threads. For me the text illustrated the intention of the original
author, so it went into NOTES.

>
>> For example,
>> .BR isupper ()
>> will not recognize an A-umlaut (\(:A) as an uppercase letter in the default
>> .B "C"
>> locale.
>
> Problem: The sentence above relates to a function not even in the
> SYNOPSIS of this page.
>

Yes, bad example. The basic idea was to describe that some functions
depend on a certain locale.

> But, more to the point, the program appears to be broken, if you are
> operating in a UTF-8 locale, which I assume you are. I suppose
> the program does work if you are operating in an iso-8859-1 locale
> (though that seems an unlikely set-up these days), but that
> point would need some careful explanation in the man page, or some
> clarification in a shell session log that shows a run of the program,
> otherwise the program would cause much confusion for people on UTF-8
> systems.

I use ISO-8859-1, and I guess that is the reason I did not see any
problems. But I am really wondering why I missed the -Wall point.

>
> Functions such as isalnum_l() can't be applied to UTF-8 characters.
> POSIX seems clear:
>
> The c argument is an int, the value of which the application
> shall ensure is representable as an unsigned char or equal to
> the value of the macro EOF. If the argument has any other
> value, the behavior is undefined.
>
> (It would have been useful to see sample output from your program as
> part of the man page, as a help to the reader, but also as a check of
> what you believe is happening, and what locale you are working with.)
>
> Instead, a conversion to wide characters is needed (mbstowcs(3)), and
> then the use of the isw*_l() functions. See my example, further down.

I also found the isw*_l() functions while figuring out what to do with
isalpha_l and friends.

> == > > Now, to be clear: many page submissions that I receive fail on some > of the points mentioned above, but this page fails on multiple counts. > > In summary... Walter, you are often good at finding things that need to > be documented, and I know your work is well intended. However, the pages > you submit often require so much review/repair effort (in some cases, > initial drafts of pages appear not to even have been run through a spell > checker, though this page seems okay), that it is often faster to write > the pages myself. And some of the same problems that I comment on in > earlier submissions turn up again in new submissions. Thus, it is hard > for me to find the enthusiasm to review these pages myself (and I have > in any case very limited bandwidth) and help them get repaired (and > it's rare that others step in, though I noticed that Stefan Puiu did > take a shot on one of your submissions). I do not know what the solution > is here, but this mail explains the problems from my side, and why > I'm often unresponsive / slow to respond to your submissions (and > am likely to remain so, unless something changes). I admit that i underestimated the complexity of the these locale stuff. I can only say i tried to document what was not documented and add an example basically showing how i came to that conclusion and how it works. > same problems that I comment in earlier submissions Sorry about that that i always try to get better, the wired thing is i can not find anything in my mailarchive. I found i send some pages years ago what i can not find is any reply, with one exception a typo in sem_wait.3. Serious, i can not remember what happened, did i ever send a reply ? And important, i guess the ISWALPHA_L page has the same defects as it is a direct derivative of ISALPHA_L. 

Nevertheless, it seems there are a lot more _l functions. I did not
check whether there is a page for each, but you may like to add some of
them to undocumented(3); see:
http://www.freebsd.org/cgi/man.cgi?query=xlocale&sektion=3

Sorry for the trouble,
wh
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
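
The approach recommended in the quoted review (convert the multibyte
string to wide characters with mbstowcs(3), then classify with the
isw*_l() functions) might look roughly like the sketch below. This is
not the example from the thread: the helper name count_upper_mb() is
hypothetical, and the code assumes a POSIX.1-2008 system such as glibc
where uselocale(3) and iswupper_l(3) are available. Because mbstowcs()
follows the calling thread's current locale rather than taking a
locale_t argument, the sketch switches the thread locale temporarily.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the uppercase characters in the multibyte
   string 's', interpreted according to the locale object 'loc'.
   Returns -1 if the string is not valid in that locale's encoding or
   if memory cannot be allocated. */
long count_upper_mb(const char *s, locale_t loc)
{
    assert(s != NULL && loc != (locale_t) 0);

    /* mbstowcs() has no locale_t parameter, so install 'loc' as the
       thread locale for the duration of the conversion. */
    locale_t old = uselocale(loc);
    long result = -1;

    /* With a NULL destination, mbstowcs() only reports how many wide
       characters are needed (excluding the terminator). */
    size_t n = mbstowcs(NULL, s, 0);
    if (n != (size_t) -1) {
        wchar_t *w = malloc((n + 1) * sizeof(wchar_t));
        if (w != NULL) {
            mbstowcs(w, s, n + 1);
            result = 0;
            for (size_t i = 0; i < n; i++)
                if (iswupper_l((wint_t) w[i], loc))
                    result++;
            free(w);
        }
    }

    uselocale(old);     /* restore the previous thread locale */
    return result;
}
```

A caller would create the locale object with newlocale(3), for example
newlocale(LC_ALL_MASK, "C", (locale_t) 0), and release it with
freelocale(3). Whether a character such as A-umlaut is counted depends
on the locale named there and on that locale being installed.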
* Re: [PATCH] new Page: isalpha__3(3)
  [not found] ` <CACKs7VCcOGugkbs-=Rmu0XiBtiWEvHU8oyCKqKtdLMw2AfDJMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-03-14 11:05 ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-03-14 11:05 UTC (permalink / raw)
To: Stefan Puiu; +Cc: linux-man, Bruno Haible

[CC restored, since I think this is a point that others may have
comments on]

Hi Stefan,

On Thu, Mar 13, 2014 at 1:23 PM, Stefan Puiu
<stefan.puiu-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Michael,
>
> Small nit about your wcstombs example below:
>
> On Mon, Mar 10, 2014 at 3:24 PM, Michael Kerrisk (man-pages)
> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> [...]
>> ---8x------8x------8x------8x------8x------8x------8x------8x------8x---
> [...]
>> printf("Length of source string (excluding terminator):\n");
>> printf(" %ld bytes\n", (long) strlen(argv[2]));
>> printf(" %ld multibyte characters\n\n", (long) mbslen);
>
> Why not use %zu for mbslen and strlen(argv[2])? Both are size_t as far
> as I can tell. Then you wouldn't need the cast.

The 'z' specifier is a C99 invention as I recall, and it took a few
years before it became widespread. For example, it wasn't in Sun's
libc for Solaris 8, and reading some man pages suggests that it wasn't
there on Solaris 9 (released in 2002), though it was there by
Solaris 10 (2005). Likewise, FreeBSD seems to have had it since
version 5 (2003). So although glibc has had 'z' for a long time (at
least as far back as glibc 2.1 at the start of 1999, though I suspect
a little earlier as well), even a few years ago there were a lot of
installed non-Linux systems that didn't support it. So, for portable
code, I've tended to almost reflexively use %ld+(long).

Probably, the need to do that is less pressing now. There are many
fewer of those installed legacy systems these days. I'll change that
code for the mbstowcs() page to use %zu.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
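
The two ways of printing a size_t discussed in this message can be
compared side by side in a short sketch. The function names
format_len_c99() and format_len_legacy() are hypothetical and do not
come from the mbstowcs() page; snprintf() is used only to keep the
sketch safe and testable, and is itself a C99 addition, so truly
pre-C99 code would have had to use printf() or sprintf() instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* C99 style: the 'z' length modifier matches size_t directly,
   so no cast is needed. */
int format_len_c99(char *buf, size_t bufsz, const char *s)
{
    assert(buf != NULL && s != NULL);
    return snprintf(buf, bufsz, "%zu bytes", strlen(s));
}

/* Older style for C libraries without 'z': cast the size_t to long
   and print it with %ld. Values that do not fit in a long would be
   misprinted, which is not a concern for short strings. */
int format_len_legacy(char *buf, size_t bufsz, const char *s)
{
    assert(buf != NULL && s != NULL);
    return snprintf(buf, bufsz, "%ld bytes", (long) strlen(s));
}
```

For ordinary string lengths both functions produce the same text; the
difference is only in which C libraries accept the format string.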