[PATCH] iconv.3: Clarify the behavior when input is untranslatable

public inbox for linux-man@vger.kernel.org

* [PATCH] iconv.3: Clarify the behavior when input is untranslatable
@ 2023-05-21 10:31 Alejandro Colomar
  2023-05-21 10:32 ` Alejandro Colomar
  2023-05-21 11:11 ` Bruno Haible
  0 siblings, 2 replies; 9+ messages in thread
From: Alejandro Colomar @ 2023-05-21 10:31 UTC (permalink / raw)
  To: linux-man
  Cc: Reuben Thomas, Steffen Nurpmeso, Bruno Haible, Martin Sebor,
	Alejandro Colomar

From: Reuben Thomas <rrt@sc3d.org>

The manual page does not fully reflect the behaviour of glibc's
iconv(3).  The manual page says:

    The conversion can stop for four reasons:

    -  An invalid multibyte sequence is encountered in the input.  In
       this case, it sets errno to EILSEQ and returns (size_t) -1.
       *inbuf is left pointing to the beginning of the invalid multibyte
       sequence.

    [...]

The phrase "An invalid multibyte sequence is encountered in the input"
is confusing, because it suggests that it refers only to the validity of
the input per se (e.g. a non-UTF-8 sequence in input purporting to be
UTF-8).

However, according to the original author of the manual page, Bruno
Haible[1], it also refers to input that cannot be translated to the
desired output encoding; and indeed, glibc's iconv returns EILSEQ when
the input cannot be translated, even though it is valid.

This patch adds language that reflects the actual behavior, by adding an
explicit bullet that distinguishes this case.
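
The case described above can be observed with a short probe. This is only a sketch: probe_utf8_to_ascii() is a hypothetical helper name, and the choice of "ASCII" as the target encoding is an assumption made for illustration. It reports the errno value left by iconv() and how many input bytes were consumed, which shows where *inbuf was left pointing.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any library: converts `in` from UTF-8
 * to ASCII and reports how iconv() stopped.
 * Returns 0 if everything was converted, the errno value if iconv()
 * failed, and -1 if iconv_open() failed.
 * *consumed receives the number of input bytes iconv() consumed,
 * i.e. the offset at which *inbuf was left pointing. */
int probe_utf8_to_ascii(const char *in, size_t *consumed)
{
    assert(in != NULL && consumed != NULL);

    char out[64];
    char *inp = (char *)in;      /* iconv() takes char **, not const char ** */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft = sizeof(out);
    int err = 0;

    iconv_t cd = iconv_open("ASCII", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        err = errno;             /* save before iconv_close() can touch it */

    *consumed = (size_t)(inp - in);
    iconv_close(cd);
    return err;
}
```

Called with the valid UTF-8 string "a\xe2\x82\xac" "b" (a EURO SIGN between two ASCII letters), an implementation that converts strictly is expected to return EILSEQ with *consumed equal to 1, that is, pointing at the start of the untranslatable character. An implementation that substitutes a fallback character would return 0 and consume all 5 bytes.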

Link: [1] <https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4>
Link: <https://bugzilla.kernel.org/show_bug.cgi?id=217059>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Cc: Steffen Nurpmeso <steffen@sdaoden.eu>
Cc: Bruno Haible <bruno@clisp.org>
Cc: Martin Sebor <msebor@redhat.com>
Signed-off-by: Alejandro Colomar <alx@kernel.org>

---
 man3/iconv.3 | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..6bb27c802 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -80,6 +80,14 @@ .SH DESCRIPTION
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered in the input which
+cannot be translated to the character encoding of the output.
+In this case,
+it sets \fIerrno\fP to \fBEILSEQ\fP and returns
+.IR (size_t)\ \-1 .
+\fI*inbuf\fP
+is left pointing to the beginning of the invalid multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
-- 
2.40.1



* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
  2023-05-21 10:31 [PATCH] iconv.3: Clarify the behavior when input is untranslatable Alejandro Colomar
@ 2023-05-21 10:32 ` Alejandro Colomar
  2023-05-21 11:11 ` Bruno Haible
  1 sibling, 0 replies; 9+ messages in thread
From: Alejandro Colomar @ 2023-05-21 10:32 UTC (permalink / raw)
  To: linux-man
  Cc: Reuben Thomas, Steffen Nurpmeso, Bruno Haible, Martin Sebor,
	Alejandro Colomar


[-- Attachment #1.1: Type: text/plain, Size: 2538 bytes --]

Sorry, ignore this patch.  I forgot to remove Reuben's authorship
when I modified it.  I also forgot to specify v2.

On 5/21/23 12:31, Alejandro Colomar wrote:
> From: Reuben Thomas <rrt@sc3d.org>
> 
> The manual page does not fully reflect the behaviour of glibc's
> iconv(3).  The manual page says:
> 
>     The conversion can stop for four reasons:
> 
>     -  An invalid multibyte sequence is encountered in the input.  In
>        this case, it sets errno to EILSEQ and returns (size_t) -1.
>        *inbuf is left pointing to the beginning of the invalid multibyte
>        sequence.
> 
>     [...]
> 
> The phrase "An invalid multibyte sequence is encountered in the input"
> is confusing, because it suggests that it refers only to the validity of
> the input per se (e.g. a non-UTF-8 sequence in input purporting to be
> UTF-8).
> 
> However, according to the original author of the manual page, Bruno
> Haible[1], it also refers to input that cannot be translated to the
> desired output encoding; and indeed, glibc's iconv returns EILSEQ when
> the input cannot be translated, even though it is valid.
> 
> This patch adds language that reflects the actual behavior, by adding an
> explicit bullet that distinguishes this case.
> 
> Link: [1] <https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4>
> Link: <https://bugzilla.kernel.org/show_bug.cgi?id=217059>
> Reported-by: Reuben Thomas <rrt@sc3d.org>
> Cc: Steffen Nurpmeso <steffen@sdaoden.eu>
> Cc: Bruno Haible <bruno@clisp.org>
> Cc: Martin Sebor <msebor@redhat.com>
> Signed-off-by: Alejandro Colomar <alx@kernel.org>
> 
> ---
>  man3/iconv.3 | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/man3/iconv.3 b/man3/iconv.3
> index 66f59b8c3..6bb27c802 100644
> --- a/man3/iconv.3
> +++ b/man3/iconv.3
> @@ -80,6 +80,14 @@ .SH DESCRIPTION
>  \fI*inbuf\fP
>  is left pointing to the beginning of the invalid multibyte sequence.
>  .IP \[bu]
> +A multibyte sequence is encountered in the input which
> +cannot be translated to the character encoding of the output.
> +In this case,
> +it sets \fIerrno\fP to \fBEILSEQ\fP and returns
> +.IR (size_t)\ \-1 .
> +\fI*inbuf\fP
> +is left pointing to the beginning of the invalid multibyte sequence.
> +.IP \[bu]
>  The input byte sequence has been entirely converted,
>  that is, \fI*inbytesleft\fP has gone down to 0.
>  In this case,

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5


* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
  2023-05-21 10:31 [PATCH] iconv.3: Clarify the behavior when input is untranslatable Alejandro Colomar
  2023-05-21 10:32 ` Alejandro Colomar
@ 2023-05-21 11:11 ` Bruno Haible
  2023-05-21 14:41   ` Alejandro Colomar
  1 sibling, 1 reply; 9+ messages in thread
From: Bruno Haible @ 2023-05-21 11:11 UTC (permalink / raw)
  To: linux-man, Alejandro Colomar
  Cc: Reuben Thomas, Steffen Nurpmeso, Martin Sebor, Alejandro Colomar

[-- Attachment #1: Type: text/plain, Size: 451 bytes --]

Alejandro Colomar wrote:
> This patch adds language that reflects the actual behavior, by adding an
> explicit bullet that distinguishes this case.

That is the right approach. Thanks for taking the initiative.

But I think that more details should be added, so that programmers are
not surprised if their program behaves differently on, say, musl libc
or FreeBSD than on glibc.

Find attached my take to describe the condition appropriately.

Bruno


[-- Attachment #2: 0001-List-a-fifth-conditions-when-iconv-3-may-stop.patch --]
[-- Type: text/x-patch, Size: 2364 bytes --]

From bc3102bd88b2c481d49cdb3433d8520d1289271b Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Sun, 21 May 2023 13:05:29 +0200
Subject: [PATCH] List a fifth condition when iconv(3) may stop.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217059
Reported-by: Steffen Nurpmeso <steffen@sdaoden.eu>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Signed-off-by: Bruno Haible <bruno@clisp.org>
---
 man3/iconv.3 | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..b440da578 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -71,7 +71,7 @@ If the character encoding of the input is stateful, the
 function can also convert a sequence of input bytes
 to an update to the conversion state without producing any output bytes;
 such input is called a \fIshift sequence\fP.
-The conversion can stop for four reasons:
+The conversion can stop for five reasons:
 .IP \[bu] 3
 An invalid multibyte sequence is encountered in the input.
 In this case,
@@ -80,6 +80,34 @@ it sets \fIerrno\fP to \fBEILSEQ\fP and returns
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered that is valid but that cannot be
+translated to the character encoding of the output.  This condition
+depends on the implementation and on the conversion descriptor.
+In the GNU C library and GNU libiconv, if
+.I cd
+was created without the suffix
+.B //TRANSLIT
+or
+.BR //IGNORE ,
+the conversion is strict: lossy conversions produce this condition.
+If the suffix
+.B //TRANSLIT
+was specified, transliteration can avoid this condition in some cases.
+In the musl C library, this condition cannot occur because a conversion to
+.B '*'
+is used as a fallback.
+In the FreeBSD, NetBSD, and Solaris implementations of
+.BR iconv ,
+this condition cannot occur either, because a conversion to
+.B '?'
+is used as a fallback.
+When this condition is met,
+.B iconv
+sets \fIerrno\fP to \fBEILSEQ\fP and returns
+.IR (size_t)\ \-1 .
+\fI*inbuf\fP
+is left pointing to the beginning of the invalid multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
-- 
2.34.1
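
Because the fallback differs between implementations, a caller that wants the same result everywhere may prefer to do the substitution itself. The following is a minimal sketch under stated assumptions, not a recommended implementation: utf8_to_ascii_lossy() is a hypothetical helper, the input is assumed to be UTF-8, the target is assumed to be ASCII (both stateless, so skipping input bytes is safe), and '?' is an arbitrary choice of replacement. On EILSEQ it skips the sequence that *inbuf points at and writes the replacement by hand.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert UTF-8 `in` to ASCII in `out`
 * (NUL-terminated), replacing every sequence that iconv() rejects
 * with EILSEQ by '?'.
 * Returns 0 on success, -1 on any other failure. */
int utf8_to_ascii_lossy(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL && outsize > 0);

    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft = outsize - 1;    /* reserve room for the terminator */
    int rc = 0;

    *out = '\0';
    iconv_t cd = iconv_open("ASCII", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            continue;                /* success: inleft is now 0 */
        if (errno != EILSEQ || outleft == 0) {
            rc = -1;                 /* E2BIG, EINVAL, or no room for '?' */
            break;
        }
        /* inp points at the rejected sequence: skip its first byte and
         * any UTF-8 continuation bytes (10xxxxxx) that follow it. */
        do {
            inp++;
            inleft--;
        } while (inleft > 0 && ((unsigned char)*inp & 0xC0) == 0x80);
        *outp++ = '?';
        outleft--;
    }

    *outp = '\0';
    iconv_close(cd);
    return rc;
}
```

With a strict implementation, the loop turns "a\xe2\x82\xac" "b" into "a?b". Where the library already substitutes its own fallback character, iconv() never reports EILSEQ for that input and the library's character appears instead, so this sketch only unifies behavior on the strict side.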



* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
  2023-05-21 11:11 ` Bruno Haible
@ 2023-05-21 14:41   ` Alejandro Colomar
  2023-05-21 19:37     ` Bruno Haible
  0 siblings, 1 reply; 9+ messages in thread
From: Alejandro Colomar @ 2023-05-21 14:41 UTC (permalink / raw)
  To: Bruno Haible, linux-man
  Cc: Reuben Thomas, Steffen Nurpmeso, Martin Sebor, Alejandro Colomar


[-- Attachment #1.1: Type: text/plain, Size: 2916 bytes --]

Hi Bruno

On 5/21/23 13:11, Bruno Haible wrote:
> Alejandro Colomar wrote:
>> This patch adds language that reflects the actual behavior, by adding an
>> explicit bullet that distinguishes this case.
> 
> That is the right approach. Thanks for taking the initiative.
> 
> But I think that more details should be added, so that programmers are
> not surprised if their program behaves differently on, say, musl libc
> or FreeBSD than on glibc.
> 
> Find attached my take to describe the condition appropriately.

Thanks!

> 
> Bruno
> 

> @@ -80,6 +80,34 @@ .SH DESCRIPTION
>  \fI*inbuf\fP
>  is left pointing to the beginning of the invalid multibyte sequence.
>  .IP \[bu]
> +A multibyte sequence is encountered that is valid but that cannot be
> +translated to the character encoding of the output.  This condition

Please use semantic newlines.  See man-pages(7):
   Use semantic newlines
       In the source of a manual page, new sentences should be started
       on  new  lines,  long  sentences  should be split into lines at
       clause breaks (commas, semicolons, colons, and so on), and long
       clauses should be split at phrase boundaries.  This convention,
       sometimes known as "semantic newlines", makes it easier to  see
       the  effect of patches, which often operate at the level of in‐
       dividual sentences, clauses, or phrases.


> +depends on the implementation and on the conversion descriptor.
> +In the GNU C library and GNU libiconv, if
> +.I cd
> +was created without the suffix
> +.B //TRANSLIT
> +or
> +.BR //IGNORE ,
> +the conversion is strict: lossy conversions produce this condition.
> +If the suffix
> +.B //TRANSLIT
> +was specified, transliteration can avoid this condition in some cases.

What do you mean by "can" and "some cases"?

> +In the musl C library, this condition cannot occur because a conversion to
> +.B '*'

I recommend either using \[aq]*\[aq] for producing valid C code,
or just having an unquoted *.

> +is used as a fallback.
> +In the FreeBSD, NetBSD, and Solaris implementations of
> +.BR iconv ,

.BR iconv () ,

> +this condition cannot occur either, because a conversion to
> +.B '?'

Similar stuff here.

> +is used as a fallback.
> +When this condition is met,
> +.B iconv

And here.

> +sets \fIerrno\fP to \fBEILSEQ\fP and returns

.I errno

.B EILSEQ

I know in other places in the page we use \f, but I'll fix
that at some point.  Please use macros for new code.

Cheers,
Alex

> +.IR (size_t)\ \-1 .
> +\fI*inbuf\fP
> +is left pointing to the beginning of the invalid multibyte sequence.
> +.IP \[bu]
>  The input byte sequence has been entirely converted,
>  that is, \fI*inbytesleft\fP has gone down to 0.
>  In this case,


-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5


* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
  2023-05-21 14:41   ` Alejandro Colomar
@ 2023-05-21 19:37     ` Bruno Haible
  2023-05-21 20:53       ` 2 spaces after the end of a sentence is the _right_ amount (was: [PATCH] iconv.3: Clarify the behavior when input is untranslatable) Alejandro Colomar
  2023-05-21 20:57       ` [PATCH] iconv.3: Clarify the behavior when input is untranslatable Alejandro Colomar
  0 siblings, 2 replies; 9+ messages in thread
From: Bruno Haible @ 2023-05-21 19:37 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar

[-- Attachment #1: Type: text/plain, Size: 1268 bytes --]

Hi Alejandro,

> Please use semantic newlines.  See man-pages(7):

Thanks for explaining. I wondered whether I should use one space or two spaces
after the end of a sentence, but found no precedent for either style. This
explains it :)

> > +In the GNU C library and GNU libiconv, if
> > +.I cd
> > +was created without the suffix
> > +.B //TRANSLIT
> > +or
> > +.BR //IGNORE ,
> > +the conversion is strict: lossy conversions produce this condition.
> > +If the suffix
> > +.B //TRANSLIT
> > +was specified, transliteration can avoid this condition in some cases.
> 
> What do you mean by "can" and "some cases"?

GNU libc and GNU libiconv support transliteration, for example, of "½" to "1/2",
or of "å" to "aa" in a Danish locale. Here I want to give a hint at the
transliteration facility, but without going into too much detail.
"transliteration can avoid this condition if there is a transliteration rule
for the multibyte character and it fits the character encoding of the output"
is too detailed, IMO.
Do you have a better wording than "can ... in some cases"?

> I recommend either using \[aq]*\[aq] for producing valid C code,
> or just having an unquoted *.

I made the requested style changes.

New patch is attached.


[-- Attachment #2: 0001-List-a-fifth-conditions-when-iconv-3-may-stop.patch --]
[-- Type: text/x-patch, Size: 2392 bytes --]

From caa04c49e89e64d7e8b52ab878c6dc2cd0cef5b9 Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Sun, 21 May 2023 13:05:29 +0200
Subject: [PATCH] List a fifth condition when iconv(3) may stop.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217059
Reported-by: Steffen Nurpmeso <steffen@sdaoden.eu>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Signed-off-by: Bruno Haible <bruno@clisp.org>
---
 man3/iconv.3 | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..94441f602 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -71,7 +71,7 @@ If the character encoding of the input is stateful, the
 function can also convert a sequence of input bytes
 to an update to the conversion state without producing any output bytes;
 such input is called a \fIshift sequence\fP.
-The conversion can stop for four reasons:
+The conversion can stop for five reasons:
 .IP \[bu] 3
 An invalid multibyte sequence is encountered in the input.
 In this case,
@@ -80,6 +80,39 @@ it sets \fIerrno\fP to \fBEILSEQ\fP and returns
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered that is valid but that cannot be
+translated to the character encoding of the output.
+This condition depends on the implementation and on the conversion
+descriptor.
+In the GNU C library and GNU libiconv, if
+.I cd
+was created without the suffix
+.B //TRANSLIT
+or
+.BR //IGNORE ,
+the conversion is strict: lossy conversions produce this condition.
+If the suffix
+.B //TRANSLIT
+was specified, transliteration can avoid this condition in some cases.
+In the musl C library, this condition cannot occur because a conversion to
+.B \[aq]*\[aq]
+is used as a fallback.
+In the FreeBSD, NetBSD, and Solaris implementations of
+.BR iconv (),
+this condition cannot occur either, because a conversion to
+.B \[aq]?\[aq]
+is used as a fallback.
+When this condition is met,
+.BR iconv ()
+sets
+.I errno
+to
+.B EILSEQ
+and returns
+.IR (size_t)\ \-1 .
+.I *inbuf
+is left pointing to the beginning of the unconvertible multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
-- 
2.34.1



* 2 spaces after the end of a sentence is the _right_ amount (was: [PATCH] iconv.3: Clarify the behavior when input is untranslatable)
  2023-05-21 19:37     ` Bruno Haible
@ 2023-05-21 20:53       ` Alejandro Colomar
  2023-05-21 20:57       ` [PATCH] iconv.3: Clarify the behavior when input is untranslatable Alejandro Colomar
  1 sibling, 0 replies; 9+ messages in thread
From: Alejandro Colomar @ 2023-05-21 20:53 UTC (permalink / raw)
  To: Bruno Haible
  Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar


[-- Attachment #1.1: Type: text/plain, Size: 736 bytes --]

Hi Bruno,

On 5/21/23 21:37, Bruno Haible wrote:
> Hi Alejandro,
> 
>> Please use semantic newlines.  See man-pages(7):
> 
> Thanks for explaining. I wondered whether I should use one space or two spaces
> after the end of a sentence,

That one's easy: one space is always wrong.  :-)

<https://web.archive.org/web/20171107164742/http://www.heracliteanriver.com/?p=324>
<https://lore.kernel.org/linux-man/9c5c5744-dde0-b333-09e0-ba9d92aa96b1@gmail.com/T/#u>
<https://lists.gnu.org/archive/html/groff/2020-11/msg00076.html>

> but found no precedent for either style. This
> explains it :)

:)

Cheers,
Alex

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5


* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
  2023-05-21 19:37     ` Bruno Haible
  2023-05-21 20:53       ` 2 spaces after the end of a sentence is the _right_ amount (was: [PATCH] iconv.3: Clarify the behavior when input is untranslatable) Alejandro Colomar
@ 2023-05-21 20:57       ` Alejandro Colomar
  2023-05-24 22:07         ` Bruno Haible
  1 sibling, 1 reply; 9+ messages in thread
From: Alejandro Colomar @ 2023-05-21 20:57 UTC (permalink / raw)
  To: Bruno Haible
  Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar


[-- Attachment #1.1: Type: text/plain, Size: 1611 bytes --]

Hi Bruno,

On 5/21/23 21:37, Bruno Haible wrote:
> Hi Alejandro,
> 
>> Please use semantic newlines.  See man-pages(7):
> 
> Thanks for explaining. I wondered whether I should use one space or two spaces
> after the end of a sentence, but found no precedent for either style. This
> explains it :)
> 
>>> +In the GNU C library and GNU libiconv, if
>>> +.I cd
>>> +was created without the suffix
>>> +.B //TRANSLIT
>>> +or
>>> +.BR //IGNORE ,
>>> +the conversion is strict: lossy conversions produce this condition.
>>> +If the suffix
>>> +.B //TRANSLIT
>>> +was specified, transliteration can avoid this condition in some cases.
>>
>> What do you mean by "can" and "some cases"?
> 
> GNU libc and GNU libiconv support transliteration, for example, of "½" to "1/2",
> or of "å" to "aa" in a Danish locale. Here I want to give a hint at the
> transliteration facility, but without going into too much detail.
> "transliteration can avoid this condition if there is a transliteration rule
> for the multibyte character and it fits the character encoding of the output"
> is too detailed, IMO.
> Do you have a better wording than "can ... in some cases"?

If you include the full version in the commit log, to be able to
understand it in the future, I'm fine with it.

> 
>> I recommend either using \[aq]*\[aq] for producing valid C code,
>> or just having an unquoted *.
> 
> I made the requested style changes.

Thanks,
Alex

> 
> New patch is attached.
> 

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5


* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
  2023-05-21 20:57       ` [PATCH] iconv.3: Clarify the behavior when input is untranslatable Alejandro Colomar
@ 2023-05-24 22:07         ` Bruno Haible
  2023-05-24 23:25           ` Alejandro Colomar
  0 siblings, 1 reply; 9+ messages in thread
From: Bruno Haible @ 2023-05-24 22:07 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar

[-- Attachment #1: Type: text/plain, Size: 278 bytes --]

Alejandro Colomar wrote:
> > Do you have a better wording than "can ... in some cases"?
> 
> If you include the full version in the commit log, to be able to
> understand it in the future, I'm fine with it.

OK. Here is a patch with the details included in the commit message.


[-- Attachment #2: 0001-List-a-fifth-condition-when-iconv-3-may-stop.patch --]
[-- Type: text/x-patch, Size: 3720 bytes --]

From 4cc4ad011b3ffa30159d3a67e262a46da4600cba Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Sun, 21 May 2023 13:05:29 +0200
Subject: [PATCH] List a fifth condition when iconv(3) may stop.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The wording regarding transliteration is vague, because this man page is not
the right place for going into the details of the transliteration.
Here are the details:
GNU libc and GNU libiconv support transliteration, for example, of "½" to "1/2",
or of "å" to "aa" in a Danish locale. The transliteration maps a multibyte
character of the input encoding to zero or more characters in the output.
There are two kinds of transliteration rules:
  - Those that are valid regardless of locale. Typically this means that the
    original and the transliterated character have similar glyphs, such as
    in the case "½" to "1/2".
    In GNU libc, these are collected in the files
    glibc/localedata/locales/translit_*.
  - Those that are valid in a single locale only. Often such a rule
    reflects similar pronunciation of the original and the transliterated
    characters. Some locales have script-based transliteration, for example
    from the Cyrillic script to the Latin script.
    In GNU libc, these are collected in the file
    glibc/localedata/locales/<locale>.
    In GNU libiconv, transliterations of this kind are not supported.
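
The effect of the suffix can be compared with a small helper. This is a sketch only: convert_from_utf8() is a hypothetical name, the input is assumed to be UTF-8, and whether a given implementation accepts "ASCII//TRANSLIT" at all, and what text it produces, is deliberately left open.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert UTF-8 `in` to the encoding named by
 * `tocode` (which may carry a suffix such as //TRANSLIT) and store the
 * NUL-terminated result in `out`.
 * Returns 0 on success, the errno value set by iconv() on failure,
 * or -1 if iconv_open() does not accept `tocode`. */
int convert_from_utf8(const char *tocode, const char *in,
                      char *out, size_t outsize)
{
    assert(tocode != NULL && in != NULL && out != NULL && outsize > 0);

    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft = outsize - 1;    /* reserve room for the terminator */
    int err = 0;

    *out = '\0';
    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        err = errno;                 /* save before iconv_close() */
    *outp = '\0';                    /* terminate whatever was written */
    iconv_close(cd);
    return err;
}
```

Calling it once with "ASCII" and once with "ASCII//TRANSLIT" on the input "\xc2\xbd" (the character "½") shows the difference: a strict conversion is expected to fail with EILSEQ, while a transliterating one can succeed and leave an ASCII approximation in the output buffer.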

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=29913#c4
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217059
Reported-by: Steffen Nurpmeso <steffen@sdaoden.eu>
Reported-by: Reuben Thomas <rrt@sc3d.org>
Signed-off-by: Bruno Haible <bruno@clisp.org>
---
 man3/iconv.3 | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/man3/iconv.3 b/man3/iconv.3
index 66f59b8c3..94441f602 100644
--- a/man3/iconv.3
+++ b/man3/iconv.3
@@ -71,7 +71,7 @@ If the character encoding of the input is stateful, the
 function can also convert a sequence of input bytes
 to an update to the conversion state without producing any output bytes;
 such input is called a \fIshift sequence\fP.
-The conversion can stop for four reasons:
+The conversion can stop for five reasons:
 .IP \[bu] 3
 An invalid multibyte sequence is encountered in the input.
 In this case,
@@ -80,6 +80,39 @@ it sets \fIerrno\fP to \fBEILSEQ\fP and returns
 \fI*inbuf\fP
 is left pointing to the beginning of the invalid multibyte sequence.
 .IP \[bu]
+A multibyte sequence is encountered that is valid but that cannot be
+translated to the character encoding of the output.
+This condition depends on the implementation and on the conversion
+descriptor.
+In the GNU C library and GNU libiconv, if
+.I cd
+was created without the suffix
+.B //TRANSLIT
+or
+.BR //IGNORE ,
+the conversion is strict: lossy conversions produce this condition.
+If the suffix
+.B //TRANSLIT
+was specified, transliteration can avoid this condition in some cases.
+In the musl C library, this condition cannot occur because a conversion to
+.B \[aq]*\[aq]
+is used as a fallback.
+In the FreeBSD, NetBSD, and Solaris implementations of
+.BR iconv (),
+this condition cannot occur either, because a conversion to
+.B \[aq]?\[aq]
+is used as a fallback.
+When this condition is met,
+.BR iconv ()
+sets
+.I errno
+to
+.B EILSEQ
+and returns
+.IR (size_t)\ \-1 .
+.I *inbuf
+is left pointing to the beginning of the unconvertible multibyte sequence.
+.IP \[bu]
 The input byte sequence has been entirely converted,
 that is, \fI*inbytesleft\fP has gone down to 0.
 In this case,
-- 
2.34.1
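
For callers, the five reasons collapse into four observable outcomes, since the invalid and the unconvertible case both report EILSEQ. The sketch below illustrates that; classify_stop() and enum stop_reason are hypothetical names, "ISO-8859-1" is an arbitrary target chosen for the example, and the EINVAL and E2BIG cases are taken from the rest of iconv(3) rather than from the hunks quoted in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

enum stop_reason {
    STOP_DONE,      /* input entirely converted */
    STOP_EILSEQ,    /* invalid or unconvertible multibyte sequence */
    STOP_EINVAL,    /* incomplete multibyte sequence at end of input */
    STOP_E2BIG,     /* no more room in the output buffer */
    STOP_OTHER      /* iconv_open() failed or unexpected errno */
};

/* Hypothetical helper: run one iconv() call from UTF-8 to ISO-8859-1
 * over `inlen` bytes of `in`, with `outcap` bytes of output space
 * (at most 64), and report why the conversion stopped. */
enum stop_reason classify_stop(const char *in, size_t inlen, size_t outcap)
{
    char out[64];
    assert(in != NULL && outcap <= sizeof(out));

    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    enum stop_reason why = STOP_DONE;

    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1)
        return STOP_OTHER;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        switch (errno) {
        case EILSEQ: why = STOP_EILSEQ; break;
        case EINVAL: why = STOP_EINVAL; break;
        case E2BIG:  why = STOP_E2BIG;  break;
        default:     why = STOP_OTHER;  break;
        }
    }
    iconv_close(cd);
    return why;
}
```

Telling an invalid sequence apart from a valid but unconvertible one would need a separate validity check on the bytes at *inbuf, which this sketch does not attempt.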



* Re: [PATCH] iconv.3: Clarify the behavior when input is untranslatable
  2023-05-24 22:07         ` Bruno Haible
@ 2023-05-24 23:25           ` Alejandro Colomar
  0 siblings, 0 replies; 9+ messages in thread
From: Alejandro Colomar @ 2023-05-24 23:25 UTC (permalink / raw)
  To: Bruno Haible
  Cc: linux-man, Reuben Thomas, Steffen Nurpmeso, Martin Sebor,
	Alejandro Colomar


[-- Attachment #1.1: Type: text/plain, Size: 496 bytes --]

Hi Bruno,

On 5/25/23 00:07, Bruno Haible wrote:
> Alejandro Colomar wrote:
>>> Do you have a better wording than "can ... in some cases"?
>>
>> If you include the full version in the commit log, to be able to
>> understand it in the future, I'm fine with it.
> 
> OK. Here is a patch with the details included in the commit message.
> 
Thanks!  Patch applied.

Cheers,
Alex

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5
