* [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Alejandro Colomar @ 2023-05-21 10:31 UTC
To: linux-man
Cc: Reuben Thomas, Steffen Nurpmeso, Bruno Haible, Martin Sebor,
Alejandro Colomar

From: Reuben Thomas <rrt@sc3d.org>

The manual page does not fully reflect the behaviour of glibc's
iconv(3). The manual page says:

    The conversion can stop for four reasons:

    - An invalid multibyte sequence is encountered in the input. In
      this case, it sets errno to EILSEQ and returns (size_t) -1.
      *inbuf is left pointing to the beginning of the invalid multibyte
      sequence.

    [...]

The phrase "An invalid multibyte sequence is encountered in the input"
is confusing, because it suggests that it refers only to the validity of
the input per se (e.g. a non-UTF-8 sequence in input purporting to be
UTF-8).

However, according to the original author of the manual page, Bruno
Haible[1], it also refers to input that cannot be translated to the
desired output encoding; and indeed, glibc's iconv returns EILSEQ when
the input cannot be translated, even though it is valid.

This patch adds language that reflects the actual behavior, by adding an
explicit bullet that distinguishes this case.
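
The behaviour described above can be observed with a short C program.
What follows is only a minimal sketch and not part of the patch: the
helper utf8_to_ascii_strict() and struct conv_result are hypothetical
names, and the 256-byte output buffer is an arbitrary assumption.  It
opens a conversion descriptor without //TRANSLIT or //IGNORE and
records where iconv() stopped.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Outcome of one strict conversion attempt (hypothetical helper type). */
struct conv_result {
    int    ok;        /* 1 if the whole input was converted */
    int    err;       /* errno value when ok == 0 */
    size_t consumed;  /* input bytes consumed before iconv() stopped */
};

/*
 * Convert the UTF-8 string `in` to ASCII with no //TRANSLIT or //IGNORE
 * suffix, and report how far iconv() got.  Inputs that need more than
 * 256 output bytes fail with E2BIG in this sketch.
 */
static struct conv_result
utf8_to_ascii_strict(const char *in)
{
    struct conv_result r = { 0, 0, 0 };
    char    out[256];
    char   *inp = (char *) in;  /* iconv() wants char **, input is not modified */
    char   *outp = out;
    size_t  inleft = strlen(in);
    size_t  outleft = sizeof(out);
    iconv_t cd;

    cd = iconv_open("ASCII", "UTF-8");
    if (cd == (iconv_t) -1) {
        r.err = errno;
        return r;
    }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        r.err = errno;   /* EILSEQ, EINVAL, or E2BIG */
    else
        r.ok = 1;

    /* On failure, inp points at the sequence that stopped the conversion. */
    r.consumed = (size_t) (inp - in);

    iconv_close(cd);
    return r;
}
```

With glibc, calling utf8_to_ascii_strict("a\xE2\x82\xAC" "b") (the
UTF-8 bytes of "a€b") is expected to fail with EILSEQ after one
consumed byte, even though the input is valid UTF-8.  Other
implementations may substitute a fallback character and report
success instead.
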
Link: [1] <https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4>
Link: <https://bugzilla.kernel.org/show_bug.cgi?id=217059>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Cc: Steffen Nurpmeso <steffen@sdaoden.eu>
Cc: Bruno Haible <bruno@clisp.org>
Cc: Martin Sebor <msebor@redhat.com>
Signed-off-by: Alejandro Colomar <alx@kernel.org>
f
---
man3/iconv.3 | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..6bb27c802 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -80,6 +80,14 @@ .SH DESCRIPTION
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered in the input which
+cannot be translated to the character encoding of the output.
+In this case,
+it sets \fIerrno\fP to \fBEILSEQ\fP and returns
+.IR (size_t)\ \-1 .
+\fI*inbuf\fP
+is left pointing to the beginning of the invalid multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
--
2.40.1

* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Alejandro Colomar @ 2023-05-21 10:32 UTC
To: linux-man
Cc: Reuben Thomas, Steffen Nurpmeso, Bruno Haible, Martin Sebor,
	Alejandro Colomar

Sorry, ignore this patch. I forgot to remove Reuben's authorship when
I modified it. I also forgot to specify v2.

On 5/21/23 12:31, Alejandro Colomar wrote:
> From: Reuben Thomas <rrt@sc3d.org>
>
> The manual page does not fully reflect the behaviour of glibc's
> iconv(3). The manual page says:
>
>     The conversion can stop for four reasons:
>
>     - An invalid multibyte sequence is encountered in the input. In
>       this case, it sets errno to EILSEQ and returns (size_t) -1.
>       *inbuf is left pointing to the beginning of the invalid multibyte
>       sequence.
>
>     [...]
>
> The phrase "An invalid multibyte sequence is encountered in the input"
> is confusing, because it suggests that it refers only to the validity of
> the input per se (e.g. a non-UTF-8 sequence in input purporting to be
> UTF-8).
>
> However, according to the original author of the manual page, Bruno
> Haible[1], it also refers to input that cannot be translated to the
> desired output encoding; and indeed, glibc's iconv returns EILSEQ when
> the input cannot be translated, even though it is valid.
>
> This patch adds language that reflects the actual behavior, by adding an
> explicit bullet that distinguishes this case.
>
> Link: [1] <https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4>
> Link: <https://bugzilla.kernel.org/show_bug.cgi?id=217059>
> Reported-by: Reuben Thomas <rrt@sc3d.org>
> Cc: Steffen Nurpmeso <steffen@sdaoden.eu>
> Cc: Bruno Haible <bruno@clisp.org>
> Cc: Martin Sebor <msebor@redhat.com>
> Signed-off-by: Alejandro Colomar <alx@kernel.org>
>
> f
> ---
>  man3/iconv.3 | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/man3/iconv.3 b/man3/iconv.3
> index 66f59b8c3..6bb27c802 100644
> --- a/man3/iconv.3
> +++ b/man3/iconv.3
> @@ -80,6 +80,14 @@ .SH DESCRIPTION
>  \fI*inbuf\fP
>  is left pointing to the beginning of the invalid multibyte sequence.
>  .IP \[bu]
> +A multibyte sequence is encountered in the input which
> +cannot be translated to the character encoding of the output.
> +In this case,
> +it sets \fIerrno\fP to \fBEILSEQ\fP and returns
> +.IR (size_t)\ \-1 .
> +\fI*inbuf\fP
> +is left pointing to the beginning of the invalid multibyte sequence.
> +.IP \[bu]
>  The input byte sequence has been entirely converted,
>  that is, \fI*inbytesleft\fP has gone down to 0.
>  In this case,

--
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5

* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Bruno Haible @ 2023-05-21 11:11 UTC
To: linux-man, Alejandro Colomar
Cc: Reuben Thomas, Steffen Nurpmeso, Martin Sebor, Alejandro Colomar

Alejandro Colomar wrote:
> This patch adds language that reflects the actual behavior, by adding an
> explicit bullet that distinguishes this case.

That is the right approach. Thanks for taking the initiative.

But I think that more details should be added, so that programmers are
not surprised if their program behaves differently on, say, musl libc
or FreeBSD than on glibc.

Find attached my take to describe the condition appropriately.

Bruno

[-- Attachment #2: 0001-List-a-fifth-conditions-when-iconv-3-may-stop.patch --]
[-- Type: text/x-patch, Size: 2364 bytes --]

From bc3102bd88b2c481d49cdb3433d8520d1289271b Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Sun, 21 May 2023 13:05:29 +0200
Subject: [PATCH] List a fifth condition when iconv(3) may stop.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217059
Reported-by: Steffen Nurpmeso <steffen@sdaoden.eu>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Signed-off-by: Bruno Haible <bruno@clisp.org>
---
 man3/iconv.3 | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..b440da578 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -71,7 +71,7 @@ If the character encoding of the input is stateful, the function can also
 convert a sequence of input bytes to an update to the conversion state without
 producing any output bytes; such input is called a \fIshift sequence\fP.
 
-The conversion can stop for four reasons:
+The conversion can stop for five reasons:
 .IP \[bu] 3
 An invalid multibyte sequence is encountered in the input.
 In this case,
@@ -80,6 +80,34 @@ it sets \fIerrno\fP to \fBEILSEQ\fP and returns
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered that is valid but that cannot be
+translated to the character encoding of the output. This condition
+depends on the implementation and on the conversion descriptor.
+In the GNU C library and GNU libiconv, if
+.I cd
+was created without the suffix
+.B //TRANSLIT
+or
+.BR //IGNORE ,
+the conversion is strict: lossy conversions produce this condition.
+If the suffix
+.B //TRANSLIT
+was specified, transliteration can avoid this condition in some cases.
+In the musl C library, this condition cannot occur because a conversion to
+.B '*'
+is used as a fallback.
+In the FreeBSD, NetBSD, and Solaris implementations of
+.BR iconv ,
+this condition cannot occur either, because a conversion to
+.B '?'
+is used as a fallback.
+When this condition is met,
+.B iconv
+sets \fIerrno\fP to \fBEILSEQ\fP and returns
+.IR (size_t)\ \-1 .
+\fI*inbuf\fP
+is left pointing to the beginning of the invalid multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
--
2.34.1

* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Alejandro Colomar @ 2023-05-21 14:41 UTC
To: Bruno Haible, linux-man
Cc: Reuben Thomas, Steffen Nurpmeso, Martin Sebor, Alejandro Colomar

Hi Bruno

On 5/21/23 13:11, Bruno Haible wrote:
> Alejandro Colomar wrote:
>> This patch adds language that reflects the actual behavior, by adding an
>> explicit bullet that distinguishes this case.
>
> That is the right approach. Thanks for taking the initiative.
>
> But I think that more details should be added, so that programmers are
> not surprised if their program behaves differently on, say, musl libc
> or FreeBSD than on glibc.
>
> Find attached my take to describe the condition appropriately.

Thanks!

>
> Bruno
>

> @@ -80,6 +80,34 @@ .SH DESCRIPTION
>  \fI*inbuf\fP
>  is left pointing to the beginning of the invalid multibyte sequence.
>  .IP \[bu]
> +A multibyte sequence is encountered that is valid but that cannot be
> +translated to the character encoding of the output. This condition

Please use semantic newlines. See man-pages(7):

   Use semantic newlines
       In the source of a manual page, new sentences should be started
       on new lines, long sentences should be split into lines at clause
       breaks (commas, semicolons, colons, and so on), and long clauses
       should be split at phrase boundaries. This convention, sometimes
       known as "semantic newlines", makes it easier to see the effect
       of patches, which often operate at the level of individual
       sentences, clauses, or phrases.

> +depends on the implementation and on the conversion descriptor.
> +In the GNU C library and GNU libiconv, if
> +.I cd
> +was created without the suffix
> +.B //TRANSLIT
> +or
> +.BR //IGNORE ,
> +the conversion is strict: lossy conversions produce this condition.
> +If the suffix
> +.B //TRANSLIT
> +was specified, transliteration can avoid this condition in some cases.

What do you mean by "can" and "some cases"?

> +In the musl C library, this condition cannot occur because a conversion to
> +.B '*'

I recommend either using \[aq]*\[aq] for producing valid C code,
or just having an unquoted *.

> +is used as a fallback.
> +In the FreeBSD, NetBSD, and Solaris implementations of
> +.BR iconv ,

.BR iconv () ,

> +this condition cannot occur either, because a conversion to
> +.B '?'

Similar stuff here.

> +is used as a fallback.
> +When this condition is met,
> +.B iconv

And here.

> +sets \fIerrno\fP to \fBEILSEQ\fP and returns

.I errno
.B EILSEQ

I know in other places in the page we use \f, but I'll fix that
at some point. Please use macros for new code.

Cheers,
Alex

> +.IR (size_t)\ \-1 .
> +\fI*inbuf\fP
> +is left pointing to the beginning of the invalid multibyte sequence.
> +.IP \[bu]
>  The input byte sequence has been entirely converted,
>  that is, \fI*inbytesleft\fP has gone down to 0.
>  In this case,

* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Bruno Haible @ 2023-05-21 19:37 UTC
To: Alejandro Colomar
Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar

Hi Alejandro,

> Please use semantic newlines. See man-pages(7):

Thanks for explaining. I wondered whether I should use one space or two spaces
after the end of a sentence, but found no precedent for either style. This
explains it :)

> > +In the GNU C library and GNU libiconv, if
> > +.I cd
> > +was created without the suffix
> > +.B //TRANSLIT
> > +or
> > +.BR //IGNORE ,
> > +the conversion is strict: lossy conversions produce this condition.
> > +If the suffix
> > +.B //TRANSLIT
> > +was specified, transliteration can avoid this condition in some cases.
>
> What do you mean by "can" and "some cases"?

GNU libc and GNU libiconv support transliteration, for example, of "½" to "1/2",
or of "å" to "aa" in a Danish locale. Here I want to give a hint at the
transliteration facility, but without going into too much detail.
"transliteration can avoid this condition if there is a transliteration rule
for the multibyte character and it fits the character encoding of the output"
is too detailed, IMO.
Do you have a better wording than "can ... in some cases"?

> I recommend either using \[aq]*\[aq] for producing valid C code,
> or just having an unquoted *.

I made the requested style changes.

New patch is attached.

[-- Attachment #2: 0001-List-a-fifth-conditions-when-iconv-3-may-stop.patch --]
[-- Type: text/x-patch, Size: 2392 bytes --]

From caa04c49e89e64d7e8b52ab878c6dc2cd0cef5b9 Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Sun, 21 May 2023 13:05:29 +0200
Subject: [PATCH] List a fifth condition when iconv(3) may stop.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217059
Reported-by: Steffen Nurpmeso <steffen@sdaoden.eu>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Signed-off-by: Bruno Haible <bruno@clisp.org>
---
 man3/iconv.3 | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..94441f602 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -71,7 +71,7 @@ If the character encoding of the input is stateful, the function can also
 convert a sequence of input bytes to an update to the conversion state without
 producing any output bytes; such input is called a \fIshift sequence\fP.
 
-The conversion can stop for four reasons:
+The conversion can stop for five reasons:
 .IP \[bu] 3
 An invalid multibyte sequence is encountered in the input.
 In this case,
@@ -80,6 +80,39 @@ it sets \fIerrno\fP to \fBEILSEQ\fP and returns
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered that is valid but that cannot be
+translated to the character encoding of the output.
+This condition depends on the implementation and on the conversion
+descriptor.
+In the GNU C library and GNU libiconv, if
+.I cd
+was created without the suffix
+.B //TRANSLIT
+or
+.BR //IGNORE ,
+the conversion is strict: lossy conversions produce this condition.
+If the suffix
+.B //TRANSLIT
+was specified, transliteration can avoid this condition in some cases.
+In the musl C library, this condition cannot occur because a conversion to
+.B \[aq]*\[aq]
+is used as a fallback.
+In the FreeBSD, NetBSD, and Solaris implementations of
+.BR iconv (),
+this condition cannot occur either, because a conversion to
+.B \[aq]?\[aq]
+is used as a fallback.
+When this condition is met,
+.BR iconv ()
+sets
+.I errno
+to
+.B EILSEQ
+and returns
+.IR (size_t)\ \-1 .
+.I *inbuf
+is left pointing to the beginning of the unconvertible multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
--
2.34.1

* 2 spaces after the end of a sentence is the _right_ amount (was: [PATCH] iconv.3: Clarify the behavior when input is untranslatable)
From: Alejandro Colomar @ 2023-05-21 20:53 UTC
To: Bruno Haible
Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar

Hi Bruno,

On 5/21/23 21:37, Bruno Haible wrote:
> Hi Alejandro,
>
>> Please use semantic newlines. See man-pages(7):
>
> Thanks for explaining. I wondered whether I should use one space or two spaces
> after the end of a sentence,

That one's easy: one space is always wrong. :-)

<https://web.archive.org/web/20171107164742/http://www.heracliteanriver.com/?p=324>
<https://lore.kernel.org/linux-man/9c5c5744-dde0-b333-09e0-ba9d92aa96b1@gmail.com/T/#u>
<https://lists.gnu.org/archive/html/groff/2020-11/msg00076.html>

> but found no precedent for either style. This
> explains it :)

:)

Cheers,
Alex

* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Alejandro Colomar @ 2023-05-21 20:57 UTC
To: Bruno Haible
Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar

Hi Bruno,

On 5/21/23 21:37, Bruno Haible wrote:
> Hi Alejandro,
>
>> Please use semantic newlines. See man-pages(7):
>
> Thanks for explaining. I wondered whether I should use one space or two spaces
> after the end of a sentence, but found no precedent for either style. This
> explains it :)
>
>>> +In the GNU C library and GNU libiconv, if
>>> +.I cd
>>> +was created without the suffix
>>> +.B //TRANSLIT
>>> +or
>>> +.BR //IGNORE ,
>>> +the conversion is strict: lossy conversions produce this condition.
>>> +If the suffix
>>> +.B //TRANSLIT
>>> +was specified, transliteration can avoid this condition in some cases.
>>
>> What do you mean by "can" and "some cases"?
>
> GNU libc and GNU libiconv support transliteration, for example, of "½" to "1/2",
> or of "å" to "aa" in a Danish locale. Here I want to give a hint at the
> transliteration facility, but without going into too much detail.
> "transliteration can avoid this condition if there is a transliteration rule
> for the multibyte character and it fits the character encoding of the output"
> is too detailed, IMO.
> Do you have a better wording than "can ... in some cases"?

If you include the full version in the commit log, to be able to
understand it in the future, I'm fine with it.

>
>> I recommend either using \[aq]*\[aq] for producing valid C code,
>> or just having an unquoted *.
>
> I made the requested style changes.

Thanks,
Alex

>
> New patch is attached.
>

* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Bruno Haible @ 2023-05-24 22:07 UTC
To: Alejandro Colomar
Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar

Alejandro Colomar wrote:
> > Do you have a better wording than "can ... in some cases"?
>
> If you include the full version in the commit log, to be able to
> understand it in the future, I'm fine with it.

OK. Here is a patch with the details included in the commit message.

[-- Attachment #2: 0001-List-a-fifth-condition-when-iconv-3-may-stop.patch --]
[-- Type: text/x-patch, Size: 3720 bytes --]

From 4cc4ad011b3ffa30159d3a67e262a46da4600cba Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Sun, 21 May 2023 13:05:29 +0200
Subject: [PATCH] List a fifth condition when iconv(3) may stop.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The wording regarding transliteration is vague, because this man page
is not the right place for going into the details of the
transliteration. Here are the details:

GNU libc and GNU libiconv support transliteration, for example, of "½"
to "1/2", or of "å" to "aa" in a Danish locale.

The transliteration maps a multibyte character of the input encoding to
zero or more characters in the output.

There are two kinds of transliteration rules:
  - Those that are valid regardless of locale. Typically this means
    that the original and the transliterated character have similar
    glyphs, such as in the case "½" to "1/2". In GNU libc, these are
    collected in the files glibc/localedata/locales/translit_*.
  - Those that are valid in a single locale only. Often such a rule
    reflects similar pronunciation of the original and the
    transliterated characters. Some locales have script-based
    transliteration, for example from the Cyrillic script to the Latin
    script. In GNU libc, these are collected in the file
    glibc/localedata/locales/<locale>. In GNU libiconv,
    transliterations of this kind are not supported.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217059
Reported-by: Steffen Nurpmeso <steffen@sdaoden.eu>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Signed-off-by: Bruno Haible <bruno@clisp.org>
---
 man3/iconv.3 | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..94441f602 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -71,7 +71,7 @@ If the character encoding of the input is stateful, the function can also
 convert a sequence of input bytes to an update to the conversion state without
 producing any output bytes; such input is called a \fIshift sequence\fP.
 
-The conversion can stop for four reasons:
+The conversion can stop for five reasons:
 .IP \[bu] 3
 An invalid multibyte sequence is encountered in the input.
 In this case,
@@ -80,6 +80,39 @@ it sets \fIerrno\fP to \fBEILSEQ\fP and returns
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered that is valid but that cannot be
+translated to the character encoding of the output.
+This condition depends on the implementation and on the conversion
+descriptor.
+In the GNU C library and GNU libiconv, if
+.I cd
+was created without the suffix
+.B //TRANSLIT
+or
+.BR //IGNORE ,
+the conversion is strict: lossy conversions produce this condition.
+If the suffix
+.B //TRANSLIT
+was specified, transliteration can avoid this condition in some cases.
+In the musl C library, this condition cannot occur because a conversion to
+.B \[aq]*\[aq]
+is used as a fallback.
+In the FreeBSD, NetBSD, and Solaris implementations of
+.BR iconv (),
+this condition cannot occur either, because a conversion to
+.B \[aq]?\[aq]
+is used as a fallback.
+When this condition is met,
+.BR iconv ()
+sets
+.I errno
+to
+.B EILSEQ
+and returns
+.IR (size_t)\ \-1 .
+.I *inbuf
+is left pointing to the beginning of the unconvertible multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
--
2.34.1
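
As an illustration of the //TRANSLIT suffix discussed in this patch,
here is a minimal C sketch.  It is not taken from the patch or from any
library documentation: utf8_to_ascii_translit() is a hypothetical
helper, and whether iconv_open() accepts the suffix at all depends on
the implementation, so the caller has to be prepared for failure.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert the UTF-8 string `in` to ASCII, asking for transliteration.
 * Returns 0 and NUL-terminates `out` on success; returns -1 with errno
 * set if the descriptor cannot be opened or the conversion stops early.
 * (Hypothetical helper, for illustration only.)
 */
static int
utf8_to_ascii_translit(const char *in, char *out, size_t outsize)
{
    char   *inp = (char *) in;  /* iconv() wants char **, input is not modified */
    char   *outp = out;
    size_t  inleft = strlen(in);
    size_t  outleft;
    size_t  rc;
    int     saved;
    iconv_t cd;

    if (outsize == 0) {
        errno = E2BIG;
        return -1;
    }
    outleft = outsize - 1;      /* keep room for the terminating NUL */

    cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;              /* the suffix may be unsupported here */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    saved = errno;              /* iconv_close() may clobber errno */
    iconv_close(cd);

    if (rc == (size_t) -1) {
        errno = saved;          /* e.g. EILSEQ when no rule applies */
        return -1;
    }

    *outp = '\0';
    return 0;
}
```

The exact output for a character such as "½" is deliberately not
asserted here, since it depends on the transliteration tables of the
implementation.  With glibc, the locale-specific rules are, as far as
I know, only consulted after the program has called
setlocale(LC_ALL, "").
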

* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
From: Alejandro Colomar @ 2023-05-24 23:25 UTC
To: Bruno Haible
Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar

Hi Bruno,

On 5/25/23 00:07, Bruno Haible wrote:
> Alejandro Colomar wrote:
>>> Do you have a better wording than "can ... in some cases"?
>>
>> If you include the full version in the commit log, to be able to
>> understand it in the future, I'm fine with it.
>
> OK. Here is a patch with the details included in the commit message.
>

Thanks! Patch applied.

Cheers,
Alex