[PATCH] man/man7/path-format.7: Add file documenting format of pathnames

public inbox for linux-man@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] man/man7/path-format.7: Add file documenting format of pathnames
@ 2025-01-13 21:32 Jason Yundt
  2025-01-14  0:20 ` Alejandro Colomar
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Jason Yundt @ 2025-01-13 21:32 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from this Unix & Linux
Stack Exchange answer [1] and from this Libc-help mailing list post [2].

[1]: <https://unix.stackexchange.com/a/39179/316181>
[2]: <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
 man/man7/path-format.7 | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 man/man7/path-format.7

diff --git a/man/man7/path-format.7 b/man/man7/path-format.7
new file mode 100644
index 000000000..c3c01cbf5
--- /dev/null
+++ b/man/man7/path-format.7
@@ -0,0 +1,41 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH PATH-FORMAT 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+path-format \- how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with paths,
+there are kernel space requirements that you must comply with
+and userspace requirements that you should comply with.
+.P
+The kernel stores paths as null-terminated byte sequences.
+As far as the kernel is concerned, there are only three rules for paths:
+.IP \[bu]
+The last byte in the sequence needs to be a null.
+.IP \[bu]
+Any other bytes in the sequence need to not be null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/).
+.P
+This means that programs can technically do weird things
+like create paths using random character encodings
+or create paths without using any character encoding at all.
+Filesystems may impose additional restrictions on paths, though.
+For example, if you want to store a file on an ext4 filesystem,
+then its filename can’t be longer than 255 bytes.
+.P
+Userspace treats paths differently.
+Userspace applications typically expect paths to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+.SH SEE ALSO
+.BR open (2),
+.BR nl_langinfo (3),
+.BR path_resolution (7)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
@ 2025-01-14  0:20 ` Alejandro Colomar
  2025-01-14 12:54 ` [PATCH v2] " Jason Yundt
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-14  0:20 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man

[-- Attachment #1: Type: text/plain, Size: 3115 bytes --]

Hi Jason,

On Mon, Jan 13, 2025 at 04:32:46PM -0500, Jason Yundt wrote:
> The goal of this new manual page is to help people create programs that
> do the right thing even in the face of unusual paths.  The information
> that I used to create this new manual page came from this Unix & Linux
> Stack Exchange answer [1] and from this Libc-help mailing list post [2].
> 
> [1]: <https://unix.stackexchange.com/a/39179/316181>
> [2]: <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
> 
> Signed-off-by: Jason Yundt <jason@jasonyundt.email>
> ---
>  man/man7/path-format.7 | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
>  create mode 100644 man/man7/path-format.7
> 
> diff --git a/man/man7/path-format.7 b/man/man7/path-format.7
> new file mode 100644
> index 000000000..c3c01cbf5
> --- /dev/null
> +++ b/man/man7/path-format.7
> @@ -0,0 +1,41 @@
> +.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH PATH-FORMAT 7 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +path-format \- how pathnames are encoded and interpreted

I would use path_format instead of path-format or PATH-FORMAT.

> +.SH DESCRIPTION
> +Some system calls allow you to pass a pathname as a parameter.
> +When writing code that deals with paths,
> +there are kernel space requirements that you must comply with
> +and userspace requirements that you should comply with.
> +.P
> +The kernel stores paths as null-terminated byte sequences.
> +As far as the kernel is concerned, there are only three rules for paths:
> +.IP \[bu]
> +The last byte in the sequence needs to be a null.
> +.IP \[bu]
> +Any other bytes in the sequence need to not be null bytes.

... need to be non-null bytes.

seems easier to read.

> +.IP \[bu]
> +A 0x2F byte is always interpreted as a directory separator (/).
> +.P
> +This means that programs can technically do weird things
> +like create paths using random character encodings
> +or create paths without using any character encoding at all.
> +Filesystems may impose additional restrictions on paths, though.
> +For example, if you want to store a file on an ext4 filesystem,
> +then its filename can’t be longer than 255 bytes.
> +.P
> +Userspace treats paths differently.
> +Userspace applications typically expect paths to use
> +a consistent character encoding.
> +For maximum interoperability, programs should use
> +.BR nl_langinfo (3)
> +to determine the current locale’s codeset.

I would say that for maximum interoperability one should self-limit to
the POSIX Portable Filename Character Set:
<https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265>


Have a lovely night!
Alex

> +Paths should be encoded and decoded using the current locale’s codeset
> +in order to help prevent mojibake.
> +.SH SEE ALSO
> +.BR open (2),
> +.BR nl_langinfo (3),
> +.BR path_resolution (7)
> -- 
> 2.47.0
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
  2025-01-14  0:20 ` Alejandro Colomar
@ 2025-01-14 12:54 ` Jason Yundt
  2025-01-14 13:14   ` Alejandro Colomar
  2025-01-15  9:01   ` Florian Weimer
  2025-01-14 21:01 ` [PATCH v3] man/man7/path_format.7: " Jason Yundt
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Jason Yundt @ 2025-01-14 12:54 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from this Unix & Linux
Stack Exchange answer [1], this Libc-help mailing list post [2] and this
line of code from the kernel [3].

[1]: <https://unix.stackexchange.com/a/39179/316181>
[2]: <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
[3]: <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
Here’s what I changed from the previous version:

• The title of the page is now “path_format”. It’s now always written in all lowercase.
• The second kernel rule now uses the suggested phrase “…need to be non-null bytes”.
• The manual page now recommends self-limiting to the POSIX Portable Filename Character Set.
• A missing word (byte) was added to the first kernel rule.
• I added a missing source to the commit message.

 man/man7/path_format.7 | 47 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 man/man7/path_format.7

diff --git a/man/man7/path_format.7 b/man/man7/path_format.7
new file mode 100644
index 000000000..0a129eeba
--- /dev/null
+++ b/man/man7/path_format.7
@@ -0,0 +1,47 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH path_format 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+path_format \- how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with paths,
+there are kernel space requirements that you must comply with
+and userspace requirements that you should comply with.
+.P
+The kernel stores paths as null-terminated byte sequences.
+As far as the kernel is concerned, there are only three rules for paths:
+.IP \[bu]
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/).
+.P
+This means that programs can technically do weird things
+like create paths using random character encodings
+or create paths without using any character encoding at all.
+Filesystems may impose additional restrictions on paths, though.
+For example, if you want to store a file on an ext4 filesystem,
+then its filename can’t be longer than 255 bytes.
+.P
+Userspace treats paths differently.
+Userspace applications typically expect paths to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own paths to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH SEE ALSO
+.BR open (2),
+.BR nl_langinfo (3),
+.BR path_resolution (7)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-14 12:54 ` [PATCH v2] " Jason Yundt
@ 2025-01-14 13:14   ` Alejandro Colomar
  2025-01-14 21:00     ` Jason Yundt
  2025-01-15  9:01   ` Florian Weimer
  1 sibling, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-14 13:14 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 4462 bytes --]

[CC += Florian]

Hi Jason, Florian,

On Tue, Jan 14, 2025 at 07:54:45AM -0500, Jason Yundt wrote:
> The goal of this new manual page is to help people create programs that
> do the right thing even in the face of unusual paths.  The information
> that I used to create this new manual page came from this Unix & Linux
> Stack Exchange answer [1], this Libc-help mailing list post [2] and this
> line of code from the kernel [3].
> 
> [1]: <https://unix.stackexchange.com/a/39179/316181>
> [2]: <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
> [3]: <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
> 
> Signed-off-by: Jason Yundt <jason@jasonyundt.email>
> ---
> Here’s what I changed from the previous version:
> 
> • The title of the page is now “path_format”. It’s now always written in all lowercase.
> • The second kernel rule now uses the suggested phrase “…need to be non-null bytes”.
> • The manual page now recommends self-limiting to the POSIX Portable Filename Character Set.
> • A missing word (byte) was added to the first kernel rule.
> • I added a missing source to the commit message.
> 
>  man/man7/path_format.7 | 47 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
>  create mode 100644 man/man7/path_format.7
> 
> diff --git a/man/man7/path_format.7 b/man/man7/path_format.7
> new file mode 100644
> index 000000000..0a129eeba
> --- /dev/null
> +++ b/man/man7/path_format.7
> @@ -0,0 +1,47 @@
> +.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH path_format 7 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +path_format \- how pathnames are encoded and interpreted
> +.SH DESCRIPTION
> +Some system calls allow you to pass a pathname as a parameter.

Maybe we should call the page pathname(7)?

> +When writing code that deals with paths,
> +there are kernel space requirements that you must comply with
> +and userspace requirements that you should comply with.
> +.P
> +The kernel stores paths as null-terminated byte sequences.

There's a specific term for this: string.

Which means you don't need to explain so much about the null byte.
It is understood that a string cannot contain null bytes (except for the
terminator itself).

> +As far as the kernel is concerned, there are only three rules for paths:
> +.IP \[bu]
> +The last byte in the sequence needs to be a null byte.
> +.IP \[bu]
> +Any other bytes in the sequence need to be non-null bytes.
> +.IP \[bu]
> +A 0x2F byte is always interpreted as a directory separator (/).
> +.P
> +This means that programs can technically do weird things
> +like create paths using random character encodings
> +or create paths without using any character encoding at all.

I think I would skip this.  It is implicit by the fact that the only
forbidden character in a filename is '/'.

> +Filesystems may impose additional restrictions on paths, though.
> +For example, if you want to store a file on an ext4 filesystem,
> +then its filename can’t be longer than 255 bytes.

It might be good to mention that some filesystems restrict the valid
characters in a filename.

> +.P
> +Userspace treats paths differently.
> +Userspace applications typically expect paths to use
> +a consistent character encoding.
> +For maximum interoperability, programs should use
> +.BR nl_langinfo (3)
> +to determine the current locale’s codeset.

Do we want to recommend that?  IMHO, for maximum portability, programs
should assume the Portable Filename Character Set (or at most some
subset of ASCII), and fail hard outside of that, which will itself favor
that users self-restrict to portable file names.


Cheers,
Alex

> +Paths should be encoded and decoded using the current locale’s codeset
> +in order to help prevent mojibake.
> +For maximum interoperability,
> +programs and users should also limit
> +the characters that they use for their own paths to characters in
> +.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
> +the POSIX Portable Filename Character Set
> +.UE .
> +.SH SEE ALSO
> +.BR open (2),
> +.BR nl_langinfo (3),
> +.BR path_resolution (7)
> -- 
> 2.47.0
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-14 13:14   ` Alejandro Colomar
@ 2025-01-14 21:00     ` Jason Yundt
  2025-01-14 23:06       ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-14 21:00 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: linux-man, Florian Weimer

On Tue, Jan 14, 2025 at 02:14:42PM +0100, Alejandro Colomar wrote:
> Maybe we should call the page pathname(7)?

I don’t really have an opinion one way or the other.  If you want, I can
submit a new version of this patch that changes it to pathname(7).

> There's a specific term for this: string.
> 
> Which means you don't need to explain so much about the null byte.
> It is understood that a string cannot contain null bytes (except for the
> terminator itself).

I purposefully avoided using the term string because I thought that
using the term string would make the manual harder to understand.  The
term string is associated with several different concepts, and those
concepts would hinder someone’s understanding of paths:

• The term string is often used to refer to counted strings, and counted
  strings can contain null bytes.  I’m more used to counted strings than
  null-terminated strings personally because I have more experience with
  Java, Python and Rust than I do with languages that default to using
  null-terminated strings.  I know that the Linux man-pages mainly focus
  on the C programming language, but paths in particular are something
  that applies to all programming languages.

• Even in the context of the C programming language, the term string can
  still refer to counted strings.  The Windows kernel has three
  different structures: ANSI_STRING [1], OEM_STRING [2] and
  UNICODE_STRING [3].  All three of them are counted and can contain
  null bytes.  As a result, it’s possible to create valid paths on
  Windows that contain NUL characters [4].  When I wrote this manual
  page, I wanted to make it clear that this was one of the ways that the
  Linux kernel differs from the Windows kernel.

• People often think of strings as sequences of characters.  In
  programming languages like Python, this is literally true (you have to
  convert a str object into a bytes object if you want to work with
  bytes instead of characters).  To have the best possible understanding
  of how the kernel handles paths, you should think of them as sequences
  of bytes, not as sequences of characters, and the term string makes
  people think of sequences of characters.

• When I’m writing code in C or C++ and I see a char *, I assume that
  it’s supposed to contains characters that are encoded in the execution
  character set.  That is not a good assumption for paths.

When I first tried to figure out character encoding of paths on Linux, I
found stuff like this post [5].  That post (among others) really helped
me understand paths better because it specifically describes paths as
sequences of bytes rather than strings

[1]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-string>
[2]: <https://learn.microsoft.com/en-us/previous-versions/windows/hardware/kernel/ff558741(v=vs.85)>
[3]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-_unicode_string>
[4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
[5]: <https://unix.stackexchange.com/a/39179/316181>

> I think I would skip this.  It is implicit by the fact that the only
> forbidden character in a filename is '/'.

OK, I’ll submit a v3 that removes that part.

> It might be good to mention that some filesystems restrict the valid
> characters in a filename.

OK, I’ll submit a v3 that adds an example of a filesystem that puts
restrictions on the bytes that can be in filenames.

> Do we want to recommend that?  IMHO, for maximum portability, programs
> should assume the Portable Filename Character Set (or at most some
> subset of ASCII), and fail hard outside of that, which will itself favor
> that users self-restrict to portable file names.

I have a concern about programs failing hard when paths contain
non-ASCII characters.  I have a lot of songs and medleys saved on my
computer.  The paths for over 10,000 of them contain non-ASCII
characters.  Most of those non-ASCII characters come from Chinese,
Japanese or Korean characters that are in the titles of songs or
medleys.  If programs failed hard on paths that contain non-ASCII
characters, what impact would that have on my music collection?

Even if we were to only use a subset of ASCII characters, I would still
be concerned about programs failing hard when paths contain characters
outside of the POSIX Portable Filename Character Set.  I dual boot Linux
and Windows.  When I installed Windows, it automatically created
partitions named “Microsoft reserved partition” and “Basic data
partition”.  At the moment, I can access those partitions using the
paths /dev/disk/by-partlabel/Microsoft\x20reserved\x20partition and
/dev/disk/by-partlabel/Basic\x20data\x20partition.  If programs failed
hard on paths that contain characters outside of the POSIX Portable
Filename Character Set, would I have to fall back to using /dev/sda1 and
/dev/sda2?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3] man/man7/path_format.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
  2025-01-14  0:20 ` Alejandro Colomar
  2025-01-14 12:54 ` [PATCH v2] " Jason Yundt
@ 2025-01-14 21:01 ` Jason Yundt
  2025-01-15 16:20 ` [PATCH v4] man/man7/pathname.7: " Jason Yundt
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Jason Yundt @ 2025-01-14 21:01 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from this Unix & Linux
Stack Exchange answer [1], this Libc-help mailing list post [2] and this
line of code from the kernel [3].

[1]: <https://unix.stackexchange.com/a/39179/316181>
[2]: <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
[3]: <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
 man/man7/path_format.7 | 49 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 man/man7/path_format.7

diff --git a/man/man7/path_format.7 b/man/man7/path_format.7
new file mode 100644
index 000000000..c34d78f65
--- /dev/null
+++ b/man/man7/path_format.7
@@ -0,0 +1,49 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH path_format 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+path_format \- how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with paths,
+there are kernel space requirements that you must comply with
+and userspace requirements that you should comply with.
+.P
+The kernel stores paths as null-terminated byte sequences.
+As far as the kernel is concerned, there are only three rules for paths:
+.IP \[bu]
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/).
+.P
+Filesystems may impose additional restrictions on paths, though.
+ext4 is one example of a filesystem that does this.
+If you want to store a file on an ext4 filesystem,
+then its filename can’t be longer than 255 bytes.
+vfat is another example.
+If you want to store a file on a vfat filesystem,
+then its filename can’t contain a 0x3A byte (: in ASCII)
+unless the filesystem was mounted with iocharset set to something unusual.
+.P
+Userspace treats paths differently.
+Userspace applications typically expect paths to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own paths to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH SEE ALSO
+.BR open (2),
+.BR nl_langinfo (3),
+.BR path_resolution (7)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-14 21:00     ` Jason Yundt
@ 2025-01-14 23:06       ` Alejandro Colomar
  2025-01-15 16:21         ` Jason Yundt
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-14 23:06 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 5736 bytes --]

Hi Jason,

On Tue, Jan 14, 2025 at 04:00:46PM -0500, Jason Yundt wrote:
> On Tue, Jan 14, 2025 at 02:14:42PM +0100, Alejandro Colomar wrote:
> > Maybe we should call the page pathname(7)?
> 
> I don’t really have an opinion one way or the other.  If you want, I can
> submit a new version of this patch that changes it to pathname(7).

Hmmm, yep, let's make it pathname(7).

> > There's a specific term for this: string.
> > 
> > Which means you don't need to explain so much about the null byte.
> > It is understood that a string cannot contain null bytes (except for the
> > terminator itself).
> 
> I purposefully avoided using the term string because I thought that
> using the term string would make the manual harder to understand.  The
> term string is associated with several different concepts, and those
> concepts would hinder someone’s understanding of paths:
> 
> • The term string is often used to refer to counted strings, and counted
>   strings can contain null bytes.  I’m more used to counted strings than
>   null-terminated strings personally because I have more experience with
>   Java, Python and Rust than I do with languages that default to using
>   null-terminated strings.  I know that the Linux man-pages mainly focus
>   on the C programming language, but paths in particular are something
>   that applies to all programming languages.
> 
> • Even in the context of the C programming language, the term string can
>   still refer to counted strings.  The Windows kernel has three
>   different structures: ANSI_STRING [1], OEM_STRING [2] and
>   UNICODE_STRING [3].  All three of them are counted and can contain
>   null bytes.  As a result, it’s possible to create valid paths on
>   Windows that contain NUL characters [4].  When I wrote this manual
>   page, I wanted to make it clear that this was one of the ways that the
>   Linux kernel differs from the Windows kernel.

Makes sense.  How about a null-terminated string?

> • People often think of strings as sequences of characters.  In
>   programming languages like Python, this is literally true (you have to
>   convert a str object into a bytes object if you want to work with
>   bytes instead of characters).  To have the best possible understanding
>   of how the kernel handles paths, you should think of them as sequences
>   of bytes, not as sequences of characters, and the term string makes
>   people think of sequences of characters.
> 
> • When I’m writing code in C or C++ and I see a char *, I assume that
>   it’s supposed to contains characters that are encoded in the execution
>   character set.  That is not a good assumption for paths.
> 
> When I first tried to figure out character encoding of paths on Linux, I
> found stuff like this post [5].  That post (among others) really helped
> me understand paths better because it specifically describes paths as
> sequences of bytes rather than strings
> 
> [1]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-string>
> [2]: <https://learn.microsoft.com/en-us/previous-versions/windows/hardware/kernel/ff558741(v=vs.85)>
> [3]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-_unicode_string>
> [4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
> [5]: <https://unix.stackexchange.com/a/39179/316181>
> 
> > I think I would skip this.  It is implicit by the fact that the only
> > forbidden character in a filename is '/'.
> 
> OK, I’ll submit a v3 that removes that part.
> 
> > It might be good to mention that some filesystems restrict the valid
> > characters in a filename.
> 
> OK, I’ll submit a v3 that adds an example of a filesystem that puts
> restrictions on the bytes that can be in filenames.
> 
> > Do we want to recommend that?  IMHO, for maximum portability, programs
> > should assume the Portable Filename Character Set (or at most some
> > subset of ASCII), and fail hard outside of that, which will itself favor
> > that users self-restrict to portable file names.
> 
> I have a concern about programs failing hard when paths contain
> non-ASCII characters.  I have a lot of songs and medleys saved on my
> computer.  The paths for over 10,000 of them contain non-ASCII
> characters.  Most of those non-ASCII characters come from Chinese,
> Japanese or Korean characters that are in the titles of songs or
> medleys.  If programs failed hard on paths that contain non-ASCII
> characters, what impact would that have on my music collection?

The core utils (e.g., rm(1) et al.) are nice and work well for arbitrary
characters, to allow you to fix them.  But yeah, most high level
programs and (especially) scripts aren't so nice.  Think for example of
makefiles, where handling files with spaces correctly is almost
impossible.


Have a lovely night!
Alex

> Even if we were to only use a subset of ASCII characters, I would still
> be concerned about programs failing hard when paths contain characters
> outside of the POSIX Portable Filename Character Set.  I dual boot Linux
> and Windows.  When I installed Windows, it automatically created
> partitions named “Microsoft reserved partition” and “Basic data
> partition”.  At the moment, I can access those partitions using the
> paths /dev/disk/by-partlabel/Microsoft\x20reserved\x20partition and
> /dev/disk/by-partlabel/Basic\x20data\x20partition.  If programs failed
> hard on paths that contain characters outside of the POSIX Portable
> Filename Character Set, would I have to fall back to using /dev/sda1 and
> /dev/sda2?

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-14 12:54 ` [PATCH v2] " Jason Yundt
  2025-01-14 13:14   ` Alejandro Colomar
@ 2025-01-15  9:01   ` Florian Weimer
  1 sibling, 0 replies; 38+ messages in thread
From: Florian Weimer @ 2025-01-15  9:01 UTC (permalink / raw)
  To: Jason Yundt; +Cc: Alejandro Colomar, linux-man

* Jason Yundt:

> +The kernel stores paths as null-terminated byte sequences.
> +As far as the kernel is concerned, there are only three rules for paths:
> +.IP \[bu]
> +The last byte in the sequence needs to be a null byte.
> +.IP \[bu]
> +Any other bytes in the sequence need to be non-null bytes.
> +.IP \[bu]
> +A 0x2F byte is always interpreted as a directory separator (/).

There are also rules about overall length.  Some pathnames cannot be
resolved by the kernel directly, even though they exist and can be
resolved piecewise, say using openat.

There are also places with more stringent pathname limits, like the
sun_path in AF_LOCAL socket addresses.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v4] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
                   ` (2 preceding siblings ...)
  2025-01-14 21:01 ` [PATCH v3] man/man7/path_format.7: " Jason Yundt
@ 2025-01-15 16:20 ` Jason Yundt
  2025-01-15 17:12   ` Florian Weimer
  2025-01-15 17:20   ` Alejandro Colomar
  2025-01-17 13:02 ` [PATCH v5] " Jason Yundt
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Jason Yundt @ 2025-01-15 16:20 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man, Florian Weimer

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from these sources:

• <https://unix.stackexchange.com/a/39179/316181>
• <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
• <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/limits.h?h=v6.12.9#n12>
• <https://docs.kernel.org/filesystems/affs.html#mount-options-for-the-affs>
• <man:unix(7)>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
Here’s what I changed from the previous version:

• The title of the page is now “pathname(7)”.
• The list of kernel rules now mentions that paths can’t be longer than
  4,096 bytes (Thanks for mentioning this, Florian).
• The list of kernel rules now mentions that filenames can’t be longer
  than 255 bytes.
• I replaced the ext4 filename limitation example with a Amiga filename
  limitation example.  It no longer made sense to say that ext4 limited
  filenames to 255 bytes now we’re saying that all filenames are limited
  to 255 bytes.
• I added UNIX domain sockets’s sun_path as an example of a situation
  where the kernel puts additional limitations on paths (Thanks for
  mentioning this, Florian).
• I added additional sources to the commit message in order to account
  for the new information added by this version.

 man/man7/pathname.7 | 61 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 man/man7/pathname.7

diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
new file mode 100644
index 000000000..15ff98e15
--- /dev/null
+++ b/man/man7/pathname.7
@@ -0,0 +1,61 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH pathname 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+pathname \- how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with paths,
+there are kernel space requirements that you must comply with
+and userspace requirements that you should comply with.
+.P
+The kernel stores paths as null-terminated byte sequences.
+The kernel has a few general rules that apply to all paths:
+.IP \[bu]
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/).
+.IP \[bu]
+A path can be at most 4,096 bytes long.
+A path that’s longer than 4,096 bytes can be split into multiple smaller paths
+and opened piecewise using
+.BR openat (2).
+.IP \[bu]
+Filenames can be at most 255 bytes long.
+.P
+The kernel also has some rules that only apply in certain situations.
+Here are some examples:
+.IP \[bu]
+If you want to store a file on an Amiga filesystem,
+then its filename can’t be longer than 30 bytes.
+.IP \[bu]
+If you want to store a file on a vfat filesystem,
+then its filename can’t contain a 0x3A byte (: in ASCII)
+unless the filesystem was mounted with iocharset set to something unusual.
+.IP \[bu]
+A UNIX domain socket’s sun_path can be at most 108 bytes long (see
+.BR unix (7)
+for details).
+.P
+Userspace treats paths differently.
+Userspace applications typically expect paths to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own paths to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH SEE ALSO
+.BR open (2),
+.BR nl_langinfo (3),
+.BR path_resolution (7)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-14 23:06       ` Alejandro Colomar
@ 2025-01-15 16:21         ` Jason Yundt
  2025-01-15 16:47           ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-15 16:21 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: linux-man, Florian Weimer

On Wed, Jan 15, 2025 at 12:06:10AM +0100, Alejandro Colomar wrote:
> Hmmm, yep, let's make it pathname(7).

OK, I’ll submit a new version that uses pathname(7) as the title.

> Makes sense.  How about a null-terminated string?

The term null-terminated string still has some of the problems that I
mentioned earlier.  Specifically, people think of null-terminated
strings as sequences of characters.  It’s easier to understand how the
kernel handles paths if you think of paths as sequences of bytes, not as
sequences of characters.

Also, people typically make assumptions about the encoding of
null-terminated strings in the C programming language.  It’s reasonable
to assume that a char * is encoded in the execution character set, that
a wchar_t * is encoded in the wide execution character set, that a
char8_t * is encoded in UTF-8, that a char16_t * is encoded in UTF-16
and that a char32_t * is encoded in UTF-32.  Paths don’t necessarily
have one character encoding, and their character encoding may not be any
of those.

> > I have a concern about programs failing hard when paths contain
> > non-ASCII characters.  I have a lot of songs and medleys saved on my
> > computer.  The paths for over 10,000 of them contain non-ASCII
> > characters.  Most of those non-ASCII characters come from Chinese,
> > Japanese or Korean characters that are in the titles of songs or
> > medleys.  If programs failed hard on paths that contain non-ASCII
> > characters, what impact would that have on my music collection?
> 
> The core utils (e.g., rm(1) et al.) are nice and work well for arbitrary
> characters, to allow you to fix them.  But yeah, most high level
> programs and (especially) scripts aren't so nice.  Think for example of
> makefiles, where handling files with spaces correctly is almost
> impossible.

I agree that the core utils work well with arbitrary paths.  I’m not so
sure that most high level programs and scripts don’t work well with
spaces and non-ASCII characters.  Most of the high level programs and
scripts that I personally use work fine with paths that contain spaces
and non-ASCII characters, but I don’t know if most programs and scripts
in general work that well.  I also agree that handling spaces correctly
in makefiles is almost impossible which is why I don’t use makefiles for
my own personal projects.

That being said, I think that you misunderstood my two questions.  You
told me the current state of things.  I’m not asking about the current
state of things, I’m asking about a hypothetical future where programs
started to “assume the Portable Filename Character Set (or at most some
subset of ASCII), and fail hard outside of that”.  If we start making
that recommendation and programs start following that recommendation,
then it sounds like I wouldn’t be able to do anything with a large part
of my music collection, and it sounds like I wouldn’t be able to use the
symbolic links that are in my /dev/disks/by-partlabel directory.  Am I
understanding your recommendation correctly?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-15 16:21         ` Jason Yundt
@ 2025-01-15 16:47           ` Alejandro Colomar
  2025-01-15 17:44             ` G. Branden Robinson
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-15 16:47 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 1835 bytes --]

Hi Jason,

On Wed, Jan 15, 2025 at 11:21:02AM -0500, Jason Yundt wrote:
> > Makes sense.  How about a null-terminated string?
> 
> The term null-terminated string still has some of the problems that I
> mentioned earlier.  Specifically, people think of null-terminated
> strings as sequences of characters.  It’s easier to understand how the
> kernel handles paths if you think of paths as sequences of bytes, not as
> sequences of characters.

Hmmm, okay.  Maybe I'm too biased as a C programmer, and this being a
generic page for users it makes sense to use other terms.
 
> That being said, I think that you misunderstood my two questions.  You
> told me the current state of things.  I’m not asking about the current
> state of things, I’m asking about a hypothetical future where programs
> started to “assume the Portable Filename Character Set (or at most some
> subset of ASCII), and fail hard outside of that”.  If we start making
> that recommendation and programs start following that recommendation,
> then it sounds like I wouldn’t be able to do anything with a large part
> of my music collection,

You could rename that music into something usable, and then use it.  :)

> and it sounds like I wouldn’t be able to use the
> symbolic links that are in my /dev/disks/by-partlabel directory.  Am I
> understanding your recommendation correctly?

I would be happy in a world where all tools are restricted to the
portable filename character set.  I once toyed with a patch for
enforcing such filenames in the kernel, just for fun.

On the other hand, I see the usefulness for others in programs trying to
work with other stuff.  So the manual page makes sense, and I'll swallow
my disagreement.  :-)


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-15 16:20 ` [PATCH v4] man/man7/pathname.7: " Jason Yundt
@ 2025-01-15 17:12   ` Florian Weimer
  2025-01-15 17:20   ` Alejandro Colomar
  1 sibling, 0 replies; 38+ messages in thread
From: Florian Weimer @ 2025-01-15 17:12 UTC (permalink / raw)
  To: Jason Yundt; +Cc: Alejandro Colomar, linux-man

* Jason Yundt:

> +.IP \[bu]
> +Filenames can be at most 255 bytes long.

I don't think this is accurate, particularly not for network file
systems and file systems that use UCS-2 or UTF-16 internally.  The
latter typically have their own 255 character limit, but a character can
take up to 3 bytes in UTF-8 (as used by Linux).

This is why we deprecated readdir_r.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-15 16:20 ` [PATCH v4] man/man7/pathname.7: " Jason Yundt
  2025-01-15 17:12   ` Florian Weimer
@ 2025-01-15 17:20   ` Alejandro Colomar
  2025-01-15 18:37     ` A modest proposal regarding pathnames (was: " G. Branden Robinson
  1 sibling, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-15 17:20 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 6342 bytes --]

Hi Jason,

On Wed, Jan 15, 2025 at 11:20:51AM -0500, Jason Yundt wrote:
> The goal of this new manual page is to help people create programs that
> do the right thing even in the face of unusual paths.  The information
> that I used to create this new manual page came from these sources:
> 
> • <https://unix.stackexchange.com/a/39179/316181>
> • <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
> • <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/limits.h?h=v6.12.9#n12>
> • <https://docs.kernel.org/filesystems/affs.html#mount-options-for-the-affs>
> • <man:unix(7)>
> 
> Signed-off-by: Jason Yundt <jason@jasonyundt.email>
> ---
> Here’s what I changed from the previous version:

Thanks!  The page starts looking good.  I'll make some minor comments
below.

> • The title of the page is now “pathname(7)”.
> • The list of kernel rules now mentions that paths can’t be longer than
>   4,096 bytes (Thanks for mentioning this, Florian).
> • The list of kernel rules now mentions that filenames can’t be longer
>   than 255 bytes.
> • I replaced the ext4 filename limitation example with a Amiga filename
>   limitation example.  It no longer made sense to say that ext4 limited
>   filenames to 255 bytes now we’re saying that all filenames are limited
>   to 255 bytes.
> • I added UNIX domain sockets’s sun_path as an example of a situation
>   where the kernel puts additional limitations on paths (Thanks for
>   mentioning this, Florian).
> • I added additional sources to the commit message in order to account
>   for the new information added by this version.
> 
>  man/man7/pathname.7 | 61 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
>  create mode 100644 man/man7/pathname.7
> 
> diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
> new file mode 100644
> index 000000000..15ff98e15
> --- /dev/null
> +++ b/man/man7/pathname.7
> @@ -0,0 +1,61 @@
> +.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH pathname 7 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +pathname \- how pathnames are encoded and interpreted

Maybe, since this also discusses filenames, we should use both names:

	.SH NAME
	filename,
	pathname
	\-
	...

> +.SH DESCRIPTION
> +Some system calls allow you to pass a pathname as a parameter.
> +When writing code that deals with paths,
> +there are kernel space requirements that you must comply with

s/kernel space/kernel-space/

since it works as an adjective.

also, I'd put a comma after that: s/$/,/

> +and userspace requirements that you should comply with.

s/userspace/user-space/

for similar reasons.

> +.P
> +The kernel stores paths as null-terminated byte sequences.
> +The kernel has a few general rules that apply to all paths:
> +.IP \[bu]

See man-pages(7):

   Lists
     There are different kinds of lists:

     [...]

     Bullet lists
            Elements  are preceded by bullet symbols (\[bu]).  Anything
            that doesn’t fit elsewhere is usually covered by this  type
            of list.

     [...]

     There should always be exactly 2 spaces between  the  list  symbol
     and  the  elements.   This  doesn’t  apply to "tagged paragraphs",
     which use the default indentation rules.

So, you'll need to use

	.IP \[bu] 3

in the first item (and only there; the following ones inherit the
value).

> +The last byte in the sequence needs to be a null byte.
> +.IP \[bu]
> +Any other bytes in the sequence need to be non-null bytes.
> +.IP \[bu]
> +A 0x2F byte is always interpreted as a directory separator (/).

How about adding this?:

	and cannot be part of a filename.

> +.IP \[bu]
> +A path can be at most 4,096 bytes long.

For self-consistency, let's use the same term all of the time: either
path or pathname.  Otherwise, a reader might think they are different
things.

For consistency with POSIX, let's say pathname, since that's what POSIX
uses:
<https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_254>

> +A path that’s longer than 4,096 bytes can be split into multiple smaller paths
> +and opened piecewise using
> +.BR openat (2).
> +.IP \[bu]
> +Filenames can be at most 255 bytes long.

For consistency with bullet one:

s/Filenames/A filename/

> +.P
> +The kernel also has some rules that only apply in certain situations.
> +Here are some examples:
> +.IP \[bu]
> +If you want to store a file on an Amiga filesystem,
> +then its filename can’t be longer than 30 bytes.

I would simplify and make it more consistent with the bullets above:

	-  Filenames on the Amiga filesystem can be at most 30 bytes long.

> +.IP \[bu]
> +If you want to store a file on a vfat filesystem,
> +then its filename can’t contain a 0x3A byte (: in ASCII)

Is that the only one?  I expect there are several characters that are
not allowed in vfat.

> +unless the filesystem was mounted with iocharset set to something unusual.
> +.IP \[bu]
> +A UNIX domain socket’s sun_path can be at most 108 bytes long (see
> +.BR unix (7)
> +for details).
> +.P
> +Userspace treats paths differently.

s/Userspace/User space/

> +Userspace applications typically expect paths to use

.

> +a consistent character encoding.
> +For maximum interoperability, programs should use
> +.BR nl_langinfo (3)
> +to determine the current locale’s codeset.
> +Paths should be encoded and decoded using the current locale’s codeset
> +in order to help prevent mojibake.

It might be interesting to add an example program.

> +For maximum interoperability,
> +programs and users should also limit
> +the characters that they use for their own paths to characters in
> +.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
> +the POSIX Portable Filename Character Set
> +.UE .
> +.SH SEE ALSO
> +.BR open (2),
> +.BR nl_langinfo (3),
> +.BR path_resolution (7)

Also interesting:

	.BR mount (8)

(It talks about iocharset.)


Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames
  2025-01-15 16:47           ` Alejandro Colomar
@ 2025-01-15 17:44             ` G. Branden Robinson
  0 siblings, 0 replies; 38+ messages in thread
From: G. Branden Robinson @ 2025-01-15 17:44 UTC (permalink / raw)
  To: linux-man; +Cc: Alejandro Colomar, Jason Yundt, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 6364 bytes --]

At 2025-01-15T17:47:58+0100, Alejandro Colomar wrote:
> On Wed, Jan 15, 2025 at 11:21:02AM -0500, Jason Yundt wrote:
> > > Makes sense.  How about a null-terminated string?
> > 
> > The term null-terminated string still has some of the problems that
> > I mentioned earlier.  Specifically, people think of null-terminated
> > strings as sequences of characters.  It’s easier to understand how
> > the kernel handles paths if you think of paths as sequences of
> > bytes, not as sequences of characters.
> 
> Hmmm, okay.  Maybe I'm too biased as a C programmer, and this being a
> generic page for users it makes sense to use other terms.

There are many ways to represent strings.  C is not the whole world.  :)

I think Jason has a good point.  When considering byte sequences as
simple small integers (some values of which are perhaps invalid), I
think it's clearer to articulate them as such.

Here, for instance, if I'm understanding Jason correctly, I might say
"byte sequence terminated by a zero value".

I think assembly programmers used to call that "ASCIZ".  And they got up
to all sorts of mischief in the eighth bit...

> > That being said, I think that you misunderstood my two questions.
> > You told me the current state of things.  I’m not asking about the
> > current state of things, I’m asking about a hypothetical future
> > where programs started to “assume the Portable Filename Character
> > Set (or at most some subset of ASCII), and fail hard outside of
> > that”.  If we start making that recommendation and programs start
> > following that recommendation, then it sounds like I wouldn’t be
> > able to do anything with a large part of my music collection,
> 
> You could rename that music into something usable, and then use it.  :)

If you tell Japanese users they can't name a music file
"いぬのおまわりさん.flac", they might run over you with a truck.  ;-)

(This reference may be intelligible only to members of Gen X.)

> I would be happy in a world where all tools are restricted to the
> portable filename character set.  I once toyed with a patch for
> enforcing such filenames in the kernel, just for fun.

I've been pleased to start moving GNU troff in the _opposite_ direction.

NEWS from the forthcoming 1.24.0 release:

*  GNU troff now strips a leading neutral double quote from the argument
   to the `cf`, `hpf`, `hpfa`, `mso`, `msoquiet`, `nx`, `pi`, `pso`,
   `so`, `soquiet`, `sy`, and `trf` requests, and the second argument to
   the `open` and `opena` requests, allowing it to contain embedded
   leading spaces.

*  GNU troff now accepts space characters in the argument to the `cf`,
   `hpf`, `hpfa`, `mso`, `msoquiet`, `nx`, `so`, `soquiet`, and `trf`
   requests, and the second argument to the `open` and `opena` requests.
   See "soelim" below.

*  soelim no longer requires embedded space characters in `so` arguments
   to be backslash-escaped.  (It continues to support that syntax, even
   though neither AT&T nor GNU troff ever has.)  If the argument to a
   `so` request must contain leading spaces, any such sequence of spaces
   must now be prefixed with a double quote character ("), which the
   program then discards.  These changes are to better align this
   program's parsing rules with the language of the formatter; consider
   the `ds` and `as` requests.

In 1.25 I want to support the use of groff-style Unicode special
character escape sequences to encode byte sequences in file names.
Notice that I do say _bytes_, so the range will be limited: \[u0000] to
\[u00FF].  But that will be enough to encode UTF-8, or sickness like
UTF-16LE.

https://savannah.gnu.org/bugs/index.php?65108

> On the other hand, I see the usefulness for others in programs trying
> to work with other stuff.  So the manual page makes sense, and I'll
> swallow my disagreement.  :-)

[digression into software development philosophy follows]

You're joining the side of the angels.

Authors of literature (fiction, academic, legal, technical, etc.) tend
to be unimpressed by some of the limitations on representation that
systems programmers find obvious and sensible.

More generally, the whole reason the operating system exists is to
facilitate the efficient execution of _applications_ (or "jobs", as
their card decks were known in the days when a "monitor program" to
occupy a machine's idle cycles was a novel concept).

Systems programming (be it in the kernel per se or at the layer of
general services in user space) can definitely be a great place to spend
one's career, but we do best when we remember that it's not an end in
itself...lest we come to resemble those JavaScript fanatics who seem to
spend all their time fighting wars with each other over "frameworks".

Thus, if a groff user wants to name their document on-disk "Обладала
фактической самостоятельностью.ms", I feel pretty lame if I tell them
they can't.  When dealing with users, a principle I try to follow is to
actively look for ways to say "yes" to their requests.  Reasons for
saying "no" generally don't need to be sought out--they present
themselves with depressing frequency.  Sometimes the user wants an
impossible or infeasible thing; beyond the obvious limitations of finite
storage, CPU cycles, and I/O bandwidth.  Some problems have high lower
bounds on complexity.  Occasionally someone asks for something that
blunders directly into an unsolved problem in computer science.

I omit from the foregoing consideration the phenomenon of users who know
something--but often not enough--about implementation details, and
therefore have a tendency to design "solutions" in their heads and
request those instead of presenting their problem scenario.  This type
of user shows up everywhere, but the bug-bash mailing list is especially
rich with them.  With these people, before you can get to "yes" you have
to ask, "what is it you're trying to do?".  Sometimes they just won't
tell you.  To some of these people, the only way to stay sane is to
start with "no".

"The three most dangerous things in the world are a programmer with a
soldering iron, a hardware type with a program patch, and a user with an
idea." -- Rick Cook

Regards,
Branden

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* A modest proposal regarding pathnames (was: [PATCH v4] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-15 17:20   ` Alejandro Colomar
@ 2025-01-15 18:37     ` G. Branden Robinson
  2025-01-15 19:25       ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: G. Branden Robinson @ 2025-01-15 18:37 UTC (permalink / raw)
  To: linux-man; +Cc: Alejandro Colomar, Jason Yundt, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 5131 bytes --]

Hi Alex,

At 2025-01-15T18:20:45+0100, Alejandro Colomar wrote:
> Maybe, since this also discusses filenames, we should use both names:
> 
> 	.SH NAME
> 	filename,
> 	pathname
> 	\-
> 	...
...
> > +.IP \[bu]
> > +A path can be at most 4,096 bytes long.
> 
> For self-consistency, let's use the same term all of the time: either
> path or pathname.  Otherwise, a reader might think they are different
> things.
> 
> For consistency with POSIX, let's say pathname, since that's what POSIX
> uses:
> <https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_254>

One way we've stepped on a rake in Unix terminology, and for no good
reason I've been able to discover, is that we cling to the practice of
referring to two different things as "paths".[1]

* file names, possibly qualified with location information that may be
  absolute or relative to the current working directory ("pathname",
  "absolute path(name)", "relative path(name)"

* a list of the foregoing used to search for command file names or other
  loadable resources that an application thinks likely to exist ("PATH",
  "LD_LIBRARY_PATH", "MANPATH", "PYTHONPATH", "CLASSPATH", etc.)

To state it differently, we are passionately committed to using the term
"path" to refer to objects of significantly distinguishable types, such
as:

  char *
and
  char **.

And since this application doesn't admit general recursion--we don't
ever refer to a single character as a "path", nor to a list of lists of
file names as "path", the usage is corrosive to coherent thought.

I don't have any real hope of reforming this abhorrent practice--
I fear the cement had set good and hard before even POSIX Issue _1_
came out.  (Can I blame "/usr/group"?)

But...in the event the donkey I'm riding has borrowed some of its
genetic material from a vigorous young warhorse (let's call him
"JeanHeyd"), I would:

1.  Reserve the term "path" solely for discussion search paths, such as
    those implemented by "PATH".

2.  Adopt the term "filespec", or "file specification", or even just
    "file name", to refer to a character sequence that locates a file.
    POSIX interfaces and utilities tend strongly to be general in this
    respect, in the sense that anywhere a "basename" (the "final
    component" of a "pathname") is accepted, one that is qualified is
    also accepted, as in an "absolute pathname" or "relative pathname".

    The occasions upon which you want to refuse to traverse outside of a
    directory is rare enough, and specialized enough, that it merits
    case-specific discussion.  These are replete with complication.  Is
    traversing only into a subdirectory of the current working directory
    acceptable?  Should symlinks be followed?  If so, should they be
    permitted to escape the part of the tree descended from the current
    working directory?  Back in the day, about a thousand security
    advisories were issued against FTP servers arising from confused or
    unstated policy here, and the terminology of "pathname" did
    _nothing_ to help resolve them.  (Did that term help create the
    problems by fogging the minds of the application developers?  Who
    knows?)

and

3.  Throw away the term "pathname" entirely.  Banish it.

And yes, I know, POSIXly correct people can claim to "eliminate" this
confusion by interrupting conversations with a raised finger:

"No, no--you don't mean 'path', but 'path_name_'."

In my life I have found that I have sufficient talent for being
simultaneously right and annoying.  I don't need that kind of help.

So--will you ride with me, Sancho?  I mean, Alex?  ;-)

> > +.IP \[bu]
> > +If you want to store a file on a vfat filesystem,
> > +then its filename can’t contain a 0x3A byte (: in ASCII)
> 
> Is that the only one?  I expect there are several characters that are
> not allowed in vfat.

You also can't _end_ a file name with "." (0x2E).  I think there are
other restrictions.  Putting my own music collection on a file system
that I needed to be able to share with Windows boxes, many years ago,
was a tedious exercise in discovering VFAT's irritating limitations.

Regards,
Branden

[1] George Lakoff would probably have something to say about the
    unreasonable persistence of metaphors.  When a technical person
    finds that they can employ a notion familiar as a childhood fairy
    tale--as with Hansel and Gretel ambling through the forest--to win
    claims of comprehension from the audience for their design, they
    cling to it passionately.  In Unix, both kernel- and user-space
    developers did so, and neither yielded, snarling like a pair of
    dogs, one at each end of a femur still slick with gore from a bovine
    carcass.  I admit that I'm impressed that Thompson[2] was fought to
    a draw in this instance.  Unfortunately that outcome was the least
    helpful one for the Unix community.  Either side winning would have
    been better.

[2] Or whoever involved with the Unix kernel refused to yield here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: A modest proposal regarding pathnames (was: [PATCH v4] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-15 18:37     ` A modest proposal regarding pathnames (was: " G. Branden Robinson
@ 2025-01-15 19:25       ` Alejandro Colomar
  2025-01-15 19:47         ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-15 19:25 UTC (permalink / raw)
  To: G. Branden Robinson; +Cc: linux-man, Jason Yundt, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 6111 bytes --]

Hi Branden,

On Wed, Jan 15, 2025 at 12:37:24PM -0600, G. Branden Robinson wrote:
> > <https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_254>
> 
> One way we've stepped on a rake in Unix terminology, and for no good
> reason I've been able to discover, is that we cling to the practice of
> referring to two different things as "paths".[1]
> 
> * file names, possibly qualified with location information that may be
>   absolute or relative to the current working directory ("pathname",
>   "absolute path(name)", "relative path(name)"

POSIX consistently calls these "pathname", I think.

> 
> * a list of the foregoing used to search for command file names or other
>   loadable resources that an application thinks likely to exist ("PATH",
>   "LD_LIBRARY_PATH", "MANPATH", "PYTHONPATH", "CLASSPATH", etc.)

You could say the *PATH variables contain one or more path(name)s.
Maybe I would have put an *S in those variable names, because a plural
amount may be one, but a singular amount may not be more than 1, but
history.  :/

> To state it differently, we are passionately committed to using the term
> "path" to refer to objects of significantly distinguishable types, such
> as:
> 
>   char *
> and
>   char **.

I would actually use path and paths for such variable names.

> And since this application doesn't admit general recursion--we don't
> ever refer to a single character as a "path", nor to a list of lists of
> file names as "path", the usage is corrosive to coherent thought.

In my programming, I tend to use plural for lists.  (*checks to make
sure can't be called a liar*)

> I don't have any real hope of reforming this abhorrent practice--
> I fear the cement had set good and hard before even POSIX Issue _1_
> came out.  (Can I blame "/usr/group"?)
> 
> But...in the event the donkey I'm riding has borrowed some of its
> genetic material from a vigorous young warhorse (let's call him
> "JeanHeyd"), I would:
> 
> 1.  Reserve the term "path" solely for discussion search paths, such as
>     those implemented by "PATH".

The issue I have is: I hate long function parameter names.  I think I
prefer having path and pathname be synonyms.  I still would be
consistent in manual pages to use only one of them, but would make them
synonyms.

I think I would use pathname when speaking, but path for variable names
(which are usually shorter; e.g., string and s).

> 2.  Adopt the term "filespec", or "file specification", or even just
>     "file name", to refer to a character sequence that locates a file.
>     POSIX interfaces and utilities tend strongly to be general in this
>     respect, in the sense that anywhere a "basename" (the "final
>     component" of a "pathname") is accepted, one that is qualified is
>     also accepted, as in an "absolute pathname" or "relative pathname".

A good example of what you're talking about is exec(3):

     int execl(const char *pathname, const char *arg, ...
                     /*, (char *) NULL */);
     int execlp(const char *file, const char *arg, ...
                     /*, (char *) NULL */);
     int execle(const char *pathname, const char *arg, ...
                     /*, (char *) NULL, char *const envp[] */);
     int execv(const char *pathname, char *const argv[]);
     int execvp(const char *file, char *const argv[]);
     int execvpe(const char *file, char *const argv[], char *const envp[]);

The p functions *require* a filename, while the non-p functions accept a
pathname.  I would change that manual page for consistency into either
pathname and filename, or path and file, but the current mix is bad.

>     The occasions upon which you want to refuse to traverse outside of a
>     directory is rare enough, and specialized enough, that it merits
>     case-specific discussion.  These are replete with complication.  Is
>     traversing only into a subdirectory of the current working directory
>     acceptable?  Should symlinks be followed?  If so, should they be
>     permitted to escape the part of the tree descended from the current
>     working directory?  Back in the day, about a thousand security
>     advisories were issued against FTP servers arising from confused or
>     unstated policy here, and the terminology of "pathname" did
>     _nothing_ to help resolve them.  (Did that term help create the
>     problems by fogging the minds of the application developers?  Who
>     knows?)

That is, filename is rare, and pathname is usually what tools use.
Agree.

> and
> 
> 3.  Throw away the term "pathname" entirely.  Banish it.

Nah, it's standard.  I like that one can go to POSIX and consult what it
means.  I'll try to use POSIXly correct terms.  Actually, it's the term
I'd use more often.

> And yes, I know, POSIXly correct people

I tend to be.  :-)

> can claim to "eliminate" this
> confusion by interrupting conversations with a raised finger:
> 
> "No, no--you don't mean 'path', but 'path_name_'."
> 
> In my life I have found that I have sufficient talent for being
> simultaneously right and annoying.  I don't need that kind of help.
> 
> So--will you ride with me, Sancho?  I mean, Alex?  ;-)

Hmmm, not this time, I think.  :-)

> 
> > > +.IP \[bu]
> > > +If you want to store a file on a vfat filesystem,
> > > +then its filename can’t contain a 0x3A byte (: in ASCII)
> > 
> > Is that the only one?  I expect there are several characters that are
> > not allowed in vfat.
> 
> You also can't _end_ a file name with "." (0x2E).  I think there are
> other restrictions.  Putting my own music collection on a file system
> that I needed to be able to share with Windows boxes, many years ago,
> was a tedious exercise in discovering VFAT's irritating limitations.

<https://unix.stackexchange.com/questions/92426/why-doesnt-the-linux-vfat-driver-allow-certain-characters>
seems to say there's a list of forbidden characters:

	?<>\:*|"

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: A modest proposal regarding pathnames (was: [PATCH v4] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-15 19:25       ` Alejandro Colomar
@ 2025-01-15 19:47         ` Alejandro Colomar
  0 siblings, 0 replies; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-15 19:47 UTC (permalink / raw)
  To: G. Branden Robinson; +Cc: linux-man, Jason Yundt, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

On Wed, Jan 15, 2025 at 08:25:36PM +0100, Alejandro Colomar wrote:
> A good example of what you're talking about is exec(3):
> 
>      int execl(const char *pathname, const char *arg, ...
>                      /*, (char *) NULL */);
>      int execlp(const char *file, const char *arg, ...
>                      /*, (char *) NULL */);
>      int execle(const char *pathname, const char *arg, ...
>                      /*, (char *) NULL, char *const envp[] */);
>      int execv(const char *pathname, char *const argv[]);
>      int execvp(const char *file, char *const argv[]);
>      int execvpe(const char *file, char *const argv[], char *const envp[]);
> 
> The p functions *require* a filename, while the non-p functions accept a
> pathname.  I would change that manual page for consistency into either
> pathname and filename, or path and file, but the current mix is bad.

I started The POSIXly Correct Reform.  :)
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/log/?h=posixly>

Cheers,
Alex


-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
                   ` (3 preceding siblings ...)
  2025-01-15 16:20 ` [PATCH v4] man/man7/pathname.7: " Jason Yundt
@ 2025-01-17 13:02 ` Jason Yundt
  2025-01-17 14:14   ` Alejandro Colomar
  2025-01-17 23:59 ` [PATCH v6] " Jason Yundt
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-17 13:02 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man, Florian Weimer

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from these sources:

• <https://unix.stackexchange.com/a/39179/316181>
• <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
• <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
• <man:unix(7)>
• <https://unix.stackexchange.com/q/92426/316181>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
Here’s what I changed from the previous version:

• I stopped saying that the kernel has a 255-byte limit on filenames.
  Florian was right, you can create files with names longer than 255
  characters.  I tried it, and I was able to create a file with a 355-character
  long name on both tmpfs and bcachefs.  This leaves us with one problem,
  though.  In <linux/limits.h>, NAME_MAX is defined as 255 and has a comment
  that says “chars in a file name” [1].  POSIX says that NAME_MAX is the
  “Maximum number of bytes in a filename (not including the terminating null of
  a filename string).”  Why is NAME_MAX set to 255 if you can have longer
  filenames?
• I from the Amiga filesystem back to the ext4 filesystem example.  The only
  reason why I had used the Amiga filesystem example was because I had
  previously thought that 255 bytes was the maximum for any filename,
  regardless of the filesystem.  I think that ext4 is better example because
  people are more likely to use an ext4 filesystem than an Amiga filesystem.
• I implemented all of Alex suggestions, except for the ones that
  no longer apply because they were suggestions for text that was deleted for
  other reasons.
• I added an example program.

[1]: <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/limits.h?h=v6.12.9#n12>
[2]: <https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/limits.h.html#tag_14_26_03_02>

 man/man7/pathname.7 | 151 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 man/man7/pathname.7

diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
new file mode 100644
index 000000000..9545c3b07
--- /dev/null
+++ b/man/man7/pathname.7
@@ -0,0 +1,151 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH pathname 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+pathname,
+filename
+\-
+how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with pathnames,
+there are kernel-space requirements that you must comply with,
+and user-space requirements that you should comply with.
+.P
+The kernel stores pathnames as null-terminated byte sequences.
+The kernel has a few general rules that apply to all pathnames:
+.IP \[bu] 3
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/)
+and cannot be part of a filename.
+.IP \[bu]
+A pathname can be at most 4,096 bytes long.
+A pathname that’s longer than 4,096 bytes
+can be split into multiple smaller pathnames and opened piecewise using
+.BR openat (2).
+.P
+The kernel also has some rules that only apply in certain situations.
+Here are some examples:
+.IP \[bu] 3
+Filenames on the ext4 filesystem can be at most 30 bytes long.
+.IP \[bu]
+Filenames on the vfat filesystem cannot a
+0x22, 0x2A, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C or 0x7C byte
+(", *, :, <, >, ?, \ or | in ASCII)
+unless the filesystem was mounted with iocharset set to something unusual.
+.IP \[bu]
+A UNIX domain socket’s sun_path can be at most 108 bytes long (see
+.BR unix (7)
+for details).
+.P
+User space treats pathnames differently.
+User space applications typically expect pathnames to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own pathnames to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH EXAMPLES
+The following program demonstrates
+how to ensure that a pathname uses the proper encoding.
+The program starts with a UTF-32 encoded pathname.
+It then calls
+.BR nl_langinfo (3)
+in order to determine what the current locale’s codeset is.
+After that, it uses
+.BR iconv (3)
+to convert the UTF-32 encoded pathname into a locale codeset encoded pathname.
+Finally, the program uses the locale codeset encoded pathname to create
+a file that contains the message “Hello, world!”
+.SS Program source
+.\" SRC BEGIN (pathname_encoding_example.c)
+.EX
+#include <iconv.h>
+#include <langinfo.h>
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <uchar.h>
+\&
+int
+main(void)
+{
+    if (setlocale(LC_ALL, "") == NULL) {
+        exit(EXIT_FAILURE);
+    }
+    char32_t *utf32_pathname = U"example";
+    size_t characters_in_pathname = (sizeof utf32_pathname) \- 1;
+    size_t bytes_in_locale_pathname =
+        characters_in_pathname * MB_CUR_MAX + 1;
+    // We use calloc() here to make sure that the output from iconv() is null
+    // terminated.
+    char *locale_pathname = calloc(1, bytes_in_locale_pathname);
+    if (locale_pathname == NULL) {
+        exit(EXIT_FAILURE);
+    }
+\&
+    iconv_t cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
+    if (cd == (iconv_t) \- 1) {
+        exit(EXIT_FAILURE);
+    }
+    char *inbuf = (char *) utf32_pathname;
+    size_t inbytesleft =
+        characters_in_pathname * (sizeof *utf32_pathname);
+    char *outbuf = locale_pathname;
+    size_t outbytesleft = bytes_in_locale_pathname;
+    size_t iconv_result;
+    // iconv() doesn’t necessarily convert everything all in one go, so we call
+    // it in a while loop just in case it takes multiple calls to finish
+    // converting everything.
+    while (inbytesleft > 0) {
+        iconv_result =
+            iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
+        if (iconv_result == \-1) {
+            exit(EXIT_FAILURE);
+        }
+    }
+    // This ensures that the conversion is 100% complete.  See iconv(3) for
+    // details.
+    iconv_result =
+        iconv(cd, NULL, &inbytesleft, &outbuf, &outbytesleft);
+    if (iconv_result == \-1) {
+        exit(EXIT_FAILURE);
+    }
+    if (iconv_close(cd) == \-1) {
+        exit(EXIT_FAILURE);
+    }
+\&
+    FILE *fp = fopen(locale_pathname, "w");
+    if (fp == NULL) {
+        exit(EXIT_FAILURE);
+    }
+    if (fputs("Hello, world!\\n", fp) == EOF) {
+        exit(EXIT_FAILURE);
+    }
+    if (fclose(fp) == EOF) {
+        exit(EXIT_FAILURE);
+    }
+\&
+    free(locale_pathname);
+    exit(EXIT_SUCCESS);
+}
+.EE
+.\" SRC END
+.SH SEE ALSO
+.BR open (2),
+.BR iconv (3),
+.BR nl_langinfo (3),
+.BR path_resolution (7),
+.BR mount (8)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-17 13:02 ` [PATCH v5] " Jason Yundt
@ 2025-01-17 14:14   ` Alejandro Colomar
  2025-01-18  0:01     ` Jason Yundt
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-17 14:14 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 11713 bytes --]

Hi Jason,

On Fri, Jan 17, 2025 at 08:02:03AM -0500, Jason Yundt wrote:
> The goal of this new manual page is to help people create programs that
> do the right thing even in the face of unusual paths.  The information
> that I used to create this new manual page came from these sources:
> 
> • <https://unix.stackexchange.com/a/39179/316181>
> • <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
> • <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
> • <man:unix(7)>
> • <https://unix.stackexchange.com/q/92426/316181>
> 
> Signed-off-by: Jason Yundt <jason@jasonyundt.email>
> ---
> Here’s what I changed from the previous version:
> 
> • I stopped saying that the kernel has a 255-byte limit on filenames.
>   Florian was right, you can create files with names longer than 255
>   characters.  I tried it, and I was able to create a file with a 355-character
>   long name on both tmpfs and bcachefs.  This leaves us with one problem,
>   though.  In <linux/limits.h>, NAME_MAX is defined as 255 and has a comment
>   that says “chars in a file name” [1].  POSIX says that NAME_MAX is the
>   “Maximum number of bytes in a filename (not including the terminating null of
>   a filename string).”  Why is NAME_MAX set to 255 if you can have longer
>   filenames?

There's fpathconf(3) which might give a different value.  I tend to use
the hardcoded macros in programs (although, I use PATH_MAX, since
usually I don't store single filenames).

I think for portability you should restrict yourself to creating stuff
shorter than the hard-coded macro, but accept up to the fpathconf(3)
value (similar to character sets).

You could test this in your system:

	alx@devuan:~/tmp/linux$ cat nm.c 
	#include <limits.h>
	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		printf("NAME_MAX: %d\n", NAME_MAX);
		printf("_PC_NAME_MAX: %ld\n", pathconf("/run/", _PC_NAME_MAX));
	}
	alx@devuan:~/tmp/linux$ gcc -Wall -Wextra nm.c 
	alx@devuan:~/tmp/linux$ ./a.out 
	NAME_MAX: 255
	_PC_NAME_MAX: 255
	alx@devuan:~/tmp/linux$ echo /run/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | wc
	      1       1     444
	alx@devuan:~/tmp/linux$ sudo touch /run/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
	[sudo] password for alx: 
	touch: cannot touch '/run/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa': File name too long

Curiously, my system is also limited to 255 for tmpfs filesystems but
yours is not?  I still get longer paths rejected.


> • I from the Amiga filesystem back to the ext4 filesystem example.  The only
>   reason why I had used the Amiga filesystem example was because I had
>   previously thought that 255 bytes was the maximum for any filename,
>   regardless of the filesystem.  I think that ext4 is better example because
>   people are more likely to use an ext4 filesystem than an Amiga filesystem.
> • I implemented all of Alex suggestions, except for the ones that
>   no longer apply because they were suggestions for text that was deleted for
>   other reasons.
> • I added an example program.
> 
> [1]: <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/limits.h?h=v6.12.9#n12>
> [2]: <https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/limits.h.html#tag_14_26_03_02>
> 
>  man/man7/pathname.7 | 151 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 151 insertions(+)
>  create mode 100644 man/man7/pathname.7
> 
> diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
> new file mode 100644
> index 000000000..9545c3b07
> --- /dev/null
> +++ b/man/man7/pathname.7
> @@ -0,0 +1,151 @@
> +.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH pathname 7 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +pathname,
> +filename
> +\-
> +how pathnames are encoded and interpreted
> +.SH DESCRIPTION
> +Some system calls allow you to pass a pathname as a parameter.
> +When writing code that deals with pathnames,
> +there are kernel-space requirements that you must comply with,
> +and user-space requirements that you should comply with.
> +.P
> +The kernel stores pathnames as null-terminated byte sequences.
> +The kernel has a few general rules that apply to all pathnames:
> +.IP \[bu] 3
> +The last byte in the sequence needs to be a null byte.
> +.IP \[bu]
> +Any other bytes in the sequence need to be non-null bytes.
> +.IP \[bu]
> +A 0x2F byte is always interpreted as a directory separator (/)
> +and cannot be part of a filename.
> +.IP \[bu]
> +A pathname can be at most 4,096 bytes long.
> +A pathname that’s longer than 4,096 bytes
> +can be split into multiple smaller pathnames and opened piecewise using
> +.BR openat (2).
> +.P
> +The kernel also has some rules that only apply in certain situations.
> +Here are some examples:
> +.IP \[bu] 3
> +Filenames on the ext4 filesystem can be at most 30 bytes long.
> +.IP \[bu]
> +Filenames on the vfat filesystem cannot a
> +0x22, 0x2A, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C or 0x7C byte
> +(", *, :, <, >, ?, \ or | in ASCII)
> +unless the filesystem was mounted with iocharset set to something unusual.
> +.IP \[bu]
> +A UNIX domain socket’s sun_path can be at most 108 bytes long (see
> +.BR unix (7)
> +for details).
> +.P
> +User space treats pathnames differently.
> +User space applications typically expect pathnames to use
> +a consistent character encoding.
> +For maximum interoperability, programs should use
> +.BR nl_langinfo (3)
> +to determine the current locale’s codeset.
> +Paths should be encoded and decoded using the current locale’s codeset
> +in order to help prevent mojibake.
> +For maximum interoperability,
> +programs and users should also limit
> +the characters that they use for their own pathnames to characters in
> +.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
> +the POSIX Portable Filename Character Set
> +.UE .
> +.SH EXAMPLES
> +The following program demonstrates
> +how to ensure that a pathname uses the proper encoding.
> +The program starts with a UTF-32 encoded pathname.
> +It then calls
> +.BR nl_langinfo (3)
> +in order to determine what the current locale’s codeset is.
> +After that, it uses
> +.BR iconv (3)
> +to convert the UTF-32 encoded pathname into a locale codeset encoded pathname.
> +Finally, the program uses the locale codeset encoded pathname to create
> +a file that contains the message “Hello, world!”
> +.SS Program source
> +.\" SRC BEGIN (pathname_encoding_example.c)
> +.EX
> +#include <iconv.h>
> +#include <langinfo.h>
> +#include <locale.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <uchar.h>
> +\&
> +int
> +main(void)
> +{
> +    if (setlocale(LC_ALL, "") == NULL) {
> +        exit(EXIT_FAILURE);

I prefer showing an error message on errors.  For example:

	err(EXIT_FAILURE, "setlocale");

> +    }
> +    char32_t *utf32_pathname = U"example";

You probably wanted an array, not a pointer.

	char32_t  utf8_pathname[] = U"example";

> +    size_t characters_in_pathname = (sizeof utf32_pathname) \- 1;

`sizeof utf32_pathname` is 4.  You're taking the size of a pointer, not
of an array.  Also, sizeof gives you the number of bytes, not elements.
Also, the number of characters in a string is called 'length' (this is
standard nomenclature; see strlen(3)).  You probably wanted this:

	size_t  len = nelementsof(utf8_pathname) - 1;

Oh, I'm too far into an uncertain future, and we don't yet know how that
operator will be called.
<https://thephd.dev/the-big-array-size-survey-for-c>
For now, you'll want this:

	#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))

	size_t  len = NELEMS(utf8_pathname) - 1;

> +    size_t bytes_in_locale_pathname =
> +        characters_in_pathname * MB_CUR_MAX + 1;

The number of bytes in an object is called 'size'.  This is also
standard nomenclature.

	size_t  size = len * MB_CUR_MAX + 1;


Have a lovely day!
Alex

> +    // We use calloc() here to make sure that the output from iconv() is null
> +    // terminated.

Doesn't iconv(3) terminate its output?  I've never used that API, so I
don't know.

> +    char *locale_pathname = calloc(1, bytes_in_locale_pathname);

I prefer it reversed:  we're allocating n bytes (of size 1), not
1 element of a weird size.  Remember the prototype is:

	void *calloc(size_t n, size_t size);

> +    if (locale_pathname == NULL) {
> +        exit(EXIT_FAILURE);
> +    }
> +\&
> +    iconv_t cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
> +    if (cd == (iconv_t) \- 1) {
> +        exit(EXIT_FAILURE);
> +    }
> +    char *inbuf = (char *) utf32_pathname;
> +    size_t inbytesleft =
> +        characters_in_pathname * (sizeof *utf32_pathname);
> +    char *outbuf = locale_pathname;
> +    size_t outbytesleft = bytes_in_locale_pathname;
> +    size_t iconv_result;
> +    // iconv() doesn’t necessarily convert everything all in one go, so we call
> +    // it in a while loop just in case it takes multiple calls to finish
> +    // converting everything.
> +    while (inbytesleft > 0) {
> +        iconv_result =
> +            iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
> +        if (iconv_result == \-1) {
> +            exit(EXIT_FAILURE);
> +        }
> +    }
> +    // This ensures that the conversion is 100% complete.  See iconv(3) for
> +    // details.
> +    iconv_result =
> +        iconv(cd, NULL, &inbytesleft, &outbuf, &outbytesleft);
> +    if (iconv_result == \-1) {
> +        exit(EXIT_FAILURE);
> +    }
> +    if (iconv_close(cd) == \-1) {
> +        exit(EXIT_FAILURE);
> +    }
> +\&
> +    FILE *fp = fopen(locale_pathname, "w");
> +    if (fp == NULL) {
> +        exit(EXIT_FAILURE);
> +    }
> +    if (fputs("Hello, world!\\n", fp) == EOF) {
> +        exit(EXIT_FAILURE);
> +    }
> +    if (fclose(fp) == EOF) {
> +        exit(EXIT_FAILURE);
> +    }
> +\&
> +    free(locale_pathname);
> +    exit(EXIT_SUCCESS);
> +}
> +.EE
> +.\" SRC END
> +.SH SEE ALSO
> +.BR open (2),
> +.BR iconv (3),
> +.BR nl_langinfo (3),
> +.BR path_resolution (7),
> +.BR mount (8)
> -- 
> 2.47.1
> 
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v6] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
                   ` (4 preceding siblings ...)
  2025-01-17 13:02 ` [PATCH v5] " Jason Yundt
@ 2025-01-17 23:59 ` Jason Yundt
  2025-01-20 16:24 ` [PATCH v8] " Jason Yundt
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Jason Yundt @ 2025-01-17 23:59 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man, Florian Weimer

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from these sources:

• <https://unix.stackexchange.com/a/39179/316181>
• <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
• <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
• <man:unix(7)>
• <https://unix.stackexchange.com/q/92426/316181>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
Here’s what I changed from the previous version:

• The man page now says “PATH_MAX bytes” instead of “4,096 bytes”, and tells
  the user to take a look at limits.h(0p).  I had originally gotten the 4,096
  bytes number by looking at limits.h(0p), so it makes sense to direct readers
  to that header because its the source of truth.
• The man page now mentions that each filesystem has its own filename length
  limit and that you can use fpathconf(3) in order to determine that limit.
• I removed the part that mentioned ext4’s filename length limit because the
  man page now has a different part that tells you how to use fpathconf(3) to
  figure out the limit for any filesystem.
• The man page now recommends that programs and users use at most NAME_MAX
  bytes for filenames.
• I implemented Alex’s suggested changes to the example program.

 man/man7/pathname.7 | 171 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 man/man7/pathname.7

diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
new file mode 100644
index 000000000..93bc9d225
--- /dev/null
+++ b/man/man7/pathname.7
@@ -0,0 +1,171 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH pathname 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+pathname,
+filename
+\-
+how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with pathnames,
+there are kernel-space requirements that you must comply with,
+and user-space requirements that you should comply with.
+.P
+The kernel stores pathnames as null-terminated byte sequences.
+The kernel has a few general rules that apply to all pathnames:
+.IP \[bu] 3
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/)
+and cannot be part of a filename.
+.IP \[bu]
+A pathname can be at most PATH_MAX bytes long.
+PATH_MAX is defined in
+.BR limits.h (0p)\
+\.
+A pathname that’s longer than PATH_MAX bytes
+can be split into multiple smaller pathnames and opened piecewise using
+.BR openat (2).
+.IP \[bu]
+A filename can be at most a certain number of bytes long.
+The number is filesystem-specific.
+You can get the filename length limit for a currently mounted filesystem
+by passing _PC_NAME_MAX to
+.BR fpathconf (3)\
+\.
+For maximum portability, programs should be able to handle filenames
+that are as long as the relevant filesystems will allow.
+For maximum portability, programs and users should limit the length
+of their own pathnames to NAME_MAX bytes.
+NAME_MAX is defined in
+.BR limits.h (0p)\
+\.
+.P
+The kernel also has some rules that only apply in certain situations.
+Here are some examples:
+.IP \[bu] 3
+Filenames on the ext4 filesystem can be at most 30 bytes long.
+.IP \[bu]
+Filenames on the vfat filesystem cannot a
+0x22, 0x2A, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C or 0x7C byte
+(", *, :, <, >, ?, \ or | in ASCII)
+unless the filesystem was mounted with iocharset set to something unusual.
+.IP \[bu]
+A UNIX domain socket’s sun_path can be at most 108 bytes long (see
+.BR unix (7)
+for details).
+.P
+User space treats pathnames differently.
+User space applications typically expect pathnames to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own pathnames to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH EXAMPLES
+The following program demonstrates
+how to ensure that a pathname uses the proper encoding.
+The program starts with a UTF-32 encoded pathname.
+It then calls
+.BR nl_langinfo (3)
+in order to determine what the current locale’s codeset is.
+After that, it uses
+.BR iconv (3)
+to convert the UTF-32 encoded pathname into a locale codeset encoded pathname.
+Finally, the program uses the locale codeset encoded pathname to create
+a file that contains the message “Hello, world!”
+.SS Program source
+.\" SRC BEGIN (pathname_encoding_example.c)
+.EX
+#include <err.h>
+#include <iconv.h>
+#include <langinfo.h>
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <uchar.h>
+\&
+#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
+\&
+int
+main(void)
+{
+    if (setlocale(LC_ALL, "") == NULL) {
+        err(EXIT_FAILURE, "setlocale");
+    }
+    char32_t utf32_pathname[] = U"example";
+    size_t pathname_len = NELEMS(utf32_pathname) \- 1;
+    size_t locale_pathname_size = pathname_len * MB_CUR_MAX + 1;
+    // We use calloc() here to make sure that the output from iconv() is
+    // null terminated.
+    char *locale_pathname = calloc(locale_pathname_size, 1);
+    if (locale_pathname == NULL) {
+	err(EXIT_FAILURE, "calloc");
+    }
+\&
+    iconv_t cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
+    if (cd == (iconv_t) \- 1) {
+        err(EXIT_FAILURE, "iconv_open");
+    }
+    char *inbuf = (char *) utf32_pathname;
+    size_t inbytesleft = pathname_len * (sizeof *utf32_pathname);
+    char *outbuf = locale_pathname;
+    size_t outbytesleft = locale_pathname_size;
+    size_t iconv_result;
+    // iconv() doesn’t necessarily convert everything all in one go, so
+    // we call it in a while loop just in case it takes multiple calls
+    // to finish converting everything.
+    while (inbytesleft > 0) {
+        iconv_result =
+            iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
+        if (iconv_result == \-1) {
+            err(EXIT_FAILURE, "iconv");
+        }
+    }
+    // This ensures that the conversion is 100% complete.
+    // See iconv(3) for details.
+    iconv_result =
+        iconv(cd, NULL, &inbytesleft, &outbuf, &outbytesleft);
+    if (iconv_result == \-1) {
+        err(EXIT_FAILURE, "iconv");
+    }
+    if (iconv_close(cd) == \-1) {
+        err(EXIT_FAILURE, "iconv_close");
+    }
+\&
+    FILE *fp = fopen(locale_pathname, "w");
+    if (fp == NULL) {
+        err(EXIT_FAILURE, "fopen");
+    }
+    if (fputs("Hello, world!\\n", fp) == EOF) {
+        err(EXIT_FAILURE, "fputs");
+    }
+    if (fclose(fp) == EOF) {
+        err(EXIT_FAILURE, "fclose");
+    }
+\&
+    free(locale_pathname);
+    exit(EXIT_SUCCESS);
+}
+.EE
+.\" SRC END
+.SH SEE ALSO
+.BR limits.h (0p),
+.BR open (2),
+.BR fpathconf (3),
+.BR iconv (3),
+.BR nl_langinfo (3),
+.BR path_resolution (7),
+.BR mount (8)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-17 14:14   ` Alejandro Colomar
@ 2025-01-18  0:01     ` Jason Yundt
  2025-01-18  0:23       ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-18  0:01 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: linux-man, Florian Weimer

On Fri, Jan 17, 2025 at 03:14:55PM +0100, Alejandro Colomar wrote:
> Curiously, my system is also limited to 255 for tmpfs filesystems but
> yours is not?  I still get longer paths rejected.

That was my mistake.  Running your program confirms that tmpfs filenames
are limited to 255 bytes as well.  I had originally assumed that /tmp
was tmpfs on my system, but it was actually bcachefs which has a
_PC_NAME_MAX of 512.

> Doesn't iconv(3) terminate its output?  I've never used that API, so I
> don't know.

I thought that at first because I hadn’t ever used iconv(3) either.  I
created a test program in order to make sure that it doesn’t terminate
its output:

	$ cat iconv_termination_test.c
	#include <err.h>
	#include <iconv.h>
	#include <langinfo.h>
	#include <locale.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <uchar.h>

	#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))


	void
	display_memory(char *memory_block, size_t len) {
	    for (size_t i = 0; i < len; i++) {
	        printf("%02hhX ", memory_block[i]);
	    }
	    printf("\n");
	}

	int
	main(void)
	{
	    char32_t utf32_pathname[] = U"example";
	    size_t pathname_len = NELEMS(utf32_pathname) - 1;
	    size_t utf8_pathname_size = pathname_len * 4 + 1;
	    char *utf8_pathname = malloc(utf8_pathname_size);
	    if (utf8_pathname == NULL) {
	     err(EXIT_FAILURE, "calloc");
	    }
	    memset(utf8_pathname, 0xFF, utf8_pathname_size);
	    printf("utf8_pathname before calling iconv: ");
	    display_memory(utf8_pathname, utf8_pathname_size);

	    iconv_t cd = iconv_open("UTF-8", "UTF-32");
	    if (cd == (iconv_t) - 1) {
	        err(EXIT_FAILURE, "iconv_open");
	    }
	    char *inbuf = (char *) utf32_pathname;
	    size_t inbytesleft = pathname_len * (sizeof *utf32_pathname);
	    char *outbuf = utf8_pathname;
	    size_t outbytesleft = utf8_pathname_size;
	    size_t iconv_result;
	    while (inbytesleft > 0) {
	        iconv_result =
	            iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
	        if (iconv_result == -1) {
	            err(EXIT_FAILURE, "iconv");
	        }
	    }
	    iconv_result =
	        iconv(cd, NULL, &inbytesleft, &outbuf, &outbytesleft);
	    if (iconv_result == -1) {
	        err(EXIT_FAILURE, "iconv");
	    }
	    if (iconv_close(cd) == -1) {
	        err(EXIT_FAILURE, "iconv_close");
	    }

	    printf("utf8_pathname after calling iconv: ");
	    display_memory(utf8_pathname, utf8_pathname_size);

	    free(utf8_pathname);
	    exit(EXIT_SUCCESS);
	}
	$ gcc -Wall iconv_termination_test.c
	$ ./a.out
	utf8_pathname before calling iconv: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
	utf8_pathname after calling iconv: 65 78 61 6D 70 6C 65 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
	$

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-18  0:01     ` Jason Yundt
@ 2025-01-18  0:23       ` Alejandro Colomar
  2025-01-19 13:17         ` Jason Yundt
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-18  0:23 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 3761 bytes --]

Hi Jason,

On Fri, Jan 17, 2025 at 07:01:46PM -0500, Jason Yundt wrote:
> On Fri, Jan 17, 2025 at 03:14:55PM +0100, Alejandro Colomar wrote:
> > Curiously, my system is also limited to 255 for tmpfs filesystems but
> > yours is not?  I still get longer paths rejected.
> 
> That was my mistake.  Running your program confirms that tmpfs filenames
> are limited to 255 bytes as well.  I had originally assumed that /tmp
> was tmpfs on my system, but it was actually bcachefs which has a
> _PC_NAME_MAX of 512.
> 
> > Doesn't iconv(3) terminate its output?  I've never used that API, so I
> > don't know.
> 
> I thought that at first because I hadn’t ever used iconv(3) either.  I
> created a test program in order to make sure that it doesn’t terminate
> its output:
> 
> 	$ cat iconv_termination_test.c
> 	#include <err.h>
> 	#include <iconv.h>
> 	#include <langinfo.h>
> 	#include <locale.h>
> 	#include <stdio.h>
> 	#include <stdlib.h>
> 	#include <string.h>
> 	#include <uchar.h>
> 
> 	#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
> 
> 
> 	void
> 	display_memory(char *memory_block, size_t len) {
> 	    for (size_t i = 0; i < len; i++) {
> 	        printf("%02hhX ", memory_block[i]);
> 	    }
> 	    printf("\n");
> 	}
> 
> 	int
> 	main(void)
> 	{
> 	    char32_t utf32_pathname[] = U"example";
> 	    size_t pathname_len = NELEMS(utf32_pathname) - 1;

There's no other length.  You could just call it len.  pathname_ just
adds noise here.  See the section on "Variable names" here:
<https://doc.cat-v.org/bell_labs/pikestyle>.

> 	    size_t utf8_pathname_size = pathname_len * 4 + 1;
> 	    char *utf8_pathname = malloc(utf8_pathname_size);
> 	    if (utf8_pathname == NULL) {
> 	     err(EXIT_FAILURE, "calloc");
> 	    }
> 	    memset(utf8_pathname, 0xFF, utf8_pathname_size);
> 	    printf("utf8_pathname before calling iconv: ");
> 	    display_memory(utf8_pathname, utf8_pathname_size);
> 
> 	    iconv_t cd = iconv_open("UTF-8", "UTF-32");
> 	    if (cd == (iconv_t) - 1) {
> 	        err(EXIT_FAILURE, "iconv_open");
> 	    }
> 	    char *inbuf = (char *) utf32_pathname;
> 	    size_t inbytesleft = pathname_len * (sizeof *utf32_pathname);
> 	    char *outbuf = utf8_pathname;
> 	    size_t outbytesleft = utf8_pathname_size;
> 	    size_t iconv_result;
> 	    while (inbytesleft > 0) {

I don't think we need a loop.  Do you?  iconv(3) should convert the
	entire string if it is valid and there's enough room.

> 	        iconv_result =
> 	            iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);

It seems you're passing a non-terminated string, and thus it's producing
a non-terminated string.  Why don't you pass a null-terminated string?

That is, inbytesleft should include be the length + 1.


Have a lovely night!
Alex

> 	        if (iconv_result == -1) {
> 	            err(EXIT_FAILURE, "iconv");
> 	        }
> 	    }
> 	    iconv_result =
> 	        iconv(cd, NULL, &inbytesleft, &outbuf, &outbytesleft);
> 	    if (iconv_result == -1) {
> 	        err(EXIT_FAILURE, "iconv");
> 	    }
> 	    if (iconv_close(cd) == -1) {
> 	        err(EXIT_FAILURE, "iconv_close");
> 	    }
> 
> 	    printf("utf8_pathname after calling iconv: ");
> 	    display_memory(utf8_pathname, utf8_pathname_size);
> 
> 	    free(utf8_pathname);
> 	    exit(EXIT_SUCCESS);
> 	}
> 	$ gcc -Wall iconv_termination_test.c
> 	$ ./a.out
> 	utf8_pathname before calling iconv: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
> 	utf8_pathname after calling iconv: 65 78 61 6D 70 6C 65 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
> 	$

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-18  0:23       ` Alejandro Colomar
@ 2025-01-19 13:17         ` Jason Yundt
  2025-01-19 15:24           ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-19 13:17 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: linux-man, Florian Weimer

On Sat, Jan 18, 2025 at 01:23:07AM +0100, Alejandro Colomar wrote:
> There's no other length.  You could just call it len.  pathname_ just
> adds noise here.  See the section on "Variable names" here:
> <https://doc.cat-v.org/bell_labs/pikestyle>.

OK.  The variable is now named len in my local version of pathname(7).
I’ll submit a new version of the patch once we wrap up the passing a
null-terminated string to iconv(3) discussion.

> I don't think we need a loop.  Do you?  iconv(3) should convert the
> 	entire string if it is valid and there's enough room.

You’re right.  After rereading iconv(3), I’m now realizing that iconv()
will only return if it finished converting the entire string, or it
encounters an error.  I’ve removed the loop locally.  I’ll submit a new
version of the patch once we wrap up the passing a null-terminated
string to iconv(3) discussion.

> > 	        iconv_result =
> > 	            iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
> 
> It seems you're passing a non-terminated string, and thus it's producing
> a non-terminated string.  Why don't you pass a null-terminated string?
> 
> That is, inbytesleft should include be the length + 1.

Here’re my concern: iconv(3) is going to see the final element of
utf32_pathname and interpret it as a U+0000 null character.  In some
character encodings, U+0000 null is encoded as a single null byte.  In
other character encodings, U+0000 null is encoded as something other
than a single null byte.  For example, in Modified UTF-8, U+0000 null is
encoded as the bytes C0 80.  Is there any guarantee that
nl_langinfo(CODESET) will return a character encoding where U+0000 is
represented as a single null byte?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-19 13:17         ` Jason Yundt
@ 2025-01-19 15:24           ` Alejandro Colomar
  2025-01-20  8:20             ` Florian Weimer
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-19 15:24 UTC (permalink / raw)
  To: Florian Weimer, Jason Yundt; +Cc: linux-man

[-- Attachment #1: Type: text/plain, Size: 1187 bytes --]

Hi Jason, Florian,

On Sun, Jan 19, 2025 at 08:17:46AM -0500, Jason Yundt wrote:
> > It seems you're passing a non-terminated string, and thus it's producing
> > a non-terminated string.  Why don't you pass a null-terminated string?
> > 
> > That is, inbytesleft should include be the length + 1.
> 
> Here’re my concern: iconv(3) is going to see the final element of
> utf32_pathname and interpret it as a U+0000 null character.  In some
> character encodings, U+0000 null is encoded as a single null byte.  In
> other character encodings, U+0000 null is encoded as something other
> than a single null byte.  For example, in Modified UTF-8, U+0000 null is
> encoded as the bytes C0 80.  Is there any guarantee that
> nl_langinfo(CODESET) will return a character encoding where U+0000 is
> represented as a single null byte?

Hmmm.

Florian, do you know this?

You could maybe overcommit, that is, provide space for 4 bytes, just in
case.  But I would prefer to not need to do that, so if we can get a
guarantee that the terminator will be a single null byte, it would be
much better.


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-19 15:24           ` Alejandro Colomar
@ 2025-01-20  8:20             ` Florian Weimer
  2025-01-20 11:14               ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: Florian Weimer @ 2025-01-20  8:20 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man

* Alejandro Colomar:

> Hi Jason, Florian,
>
> On Sun, Jan 19, 2025 at 08:17:46AM -0500, Jason Yundt wrote:
>> > It seems you're passing a non-terminated string, and thus it's producing
>> > a non-terminated string.  Why don't you pass a null-terminated string?
>> > 
>> > That is, inbytesleft should include be the length + 1.
>> 
>> Here’re my concern: iconv(3) is going to see the final element of
>> utf32_pathname and interpret it as a U+0000 null character.  In some
>> character encodings, U+0000 null is encoded as a single null byte.  In
>> other character encodings, U+0000 null is encoded as something other
>> than a single null byte.  For example, in Modified UTF-8, U+0000 null is
>> encoded as the bytes C0 80.  Is there any guarantee that
>> nl_langinfo(CODESET) will return a character encoding where U+0000 is
>> represented as a single null byte?

> Florian, do you know this?

Character sets used by glibc locales must be mostly ASCII-transparent.
This includes the mapping of the null byte.  It is possible to create
locales that do not follow these rules, but they tend to introduce
security vulnerabilities, particularly if shell metacharacters are
mapped differently.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-20  8:20             ` Florian Weimer
@ 2025-01-20 11:14               ` Alejandro Colomar
  2025-01-20 13:17                 ` Jason Yundt
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-20 11:14 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Jason Yundt, linux-man

[-- Attachment #1: Type: text/plain, Size: 734 bytes --]

Hi Florian, Jason,

On Mon, Jan 20, 2025 at 09:20:27AM +0100, Florian Weimer wrote:
> Character sets used by glibc locales must be mostly ASCII-transparent.
> This includes the mapping of the null byte.  It is possible to create
> locales that do not follow these rules, but they tend to introduce
> security vulnerabilities, particularly if shell metacharacters are
> mapped differently.

Thanks!  Then, Jason, please use terminated strings in the example, and
assume a glibc locale.

If one uses a locale that doesn't work like this, they'll have the call
fail because the converted null character won't fit, so the program
would still be safe.

Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-20 11:14               ` Alejandro Colomar
@ 2025-01-20 13:17                 ` Jason Yundt
  2025-01-20 13:25                   ` Alejandro Colomar
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-20 13:17 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Florian Weimer, linux-man

On Mon, Jan 20, 2025 at 12:14:42PM +0100, Alejandro Colomar wrote:
> Hi Florian, Jason,
> 
> On Mon, Jan 20, 2025 at 09:20:27AM +0100, Florian Weimer wrote:
> > Character sets used by glibc locales must be mostly ASCII-transparent.
> > This includes the mapping of the null byte.  It is possible to create
> > locales that do not follow these rules, but they tend to introduce
> > security vulnerabilities, particularly if shell metacharacters are
> > mapped differently.
> 
> Thanks!  Then, Jason, please use terminated strings in the example, and
> assume a glibc locale.

OK.  I’ll submit a new version of the patch that does that.

> If one uses a locale that doesn't work like this, they'll have the call
> fail because the converted null character won't fit, so the program
> would still be safe.

I disagree.  I don’t think that the code would necessarily be safe if
someone uses such a locale.  Specifically, I think that the converted
U+0000 null character would fit in the output buffer most of time.

Imagine this scenario:

1. We use malloc() instead of calloc().
2. The user uses a modified UTF-8 locale.

Here’s what the example code would do in that scenario.  First, it would
calculate locale_pathname_size:

	size_t locale_pathname_size = len * MB_CUR_MAX + 1;

There’s 8 characters in utf32_pathname, but lengths don’t include the
final null terminator, so len is going to be 7.  For modified UTF-8,
MB_CUR_MAX would be 6.  7 * 6 + 1 = 43.  We would then allocate 43
bytes:

	char *locale_pathname = malloc(locale_pathname_size);

Here are our 43 bytes:

UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU
UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU

When I write a byte as being “UU” that means that the byte’s value is
undefined.  Next, we have iconv() convert the UTF-32 string to modified
UTF-8.  Here’s what our memory block will look like after iconv() has
converted 7 out of the 8 characters in utf32_pathname:

65 78 61 6D 70 6C 65 UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU
UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU

Now it’s time for iconv() to convert the final U+0000 null character.
In modified UTF-8, U+0000 null is encoded as C0 80.  iconv() will check to
see if there’s enough room in outbuf for those two bytes.  There is
enough room for those two bytes, so iconv() will store those two bytes
and finish without error:

65 78 61 6D 70 6C 65 C0 80 UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU
UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU UU

In this scenario, iconv() finished successfully, but there aren’t any
bytes that we know for a fact are null.  The best we can hope for is
that one of the undefined bytes just so happens to be null so that we
don’t do an out-of-bounds read.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-20 13:17                 ` Jason Yundt
@ 2025-01-20 13:25                   ` Alejandro Colomar
  0 siblings, 0 replies; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-20 13:25 UTC (permalink / raw)
  To: Jason Yundt; +Cc: Florian Weimer, linux-man

[-- Attachment #1: Type: text/plain, Size: 1202 bytes --]

On Mon, Jan 20, 2025 at 08:17:00AM -0500, Jason Yundt wrote:
> On Mon, Jan 20, 2025 at 12:14:42PM +0100, Alejandro Colomar wrote:
> > Hi Florian, Jason,
> > 
> > On Mon, Jan 20, 2025 at 09:20:27AM +0100, Florian Weimer wrote:
> > > Character sets used by glibc locales must be mostly ASCII-transparent.
> > > This includes the mapping of the null byte.  It is possible to create
> > > locales that do not follow these rules, but they tend to introduce
> > > security vulnerabilities, particularly if shell metacharacters are
> > > mapped differently.
> > 
> > Thanks!  Then, Jason, please use terminated strings in the example, and
> > assume a glibc locale.
> 
> OK.  I’ll submit a new version of the patch that does that.
> 
> > If one uses a locale that doesn't work like this, they'll have the call
> > fail because the converted null character won't fit, so the program
> > would still be safe.
> 
> I disagree.  I don’t think that the code would necessarily be safe if
> someone uses such a locale.

D'oh!  Agree, I was wrong.  Anyway, if one creates an unsafe locale,
let's say the warranty is void.  :-)

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v8] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
                   ` (5 preceding siblings ...)
  2025-01-17 23:59 ` [PATCH v6] " Jason Yundt
@ 2025-01-20 16:24 ` Jason Yundt
  2025-01-20 16:36   ` Alejandro Colomar
  2025-01-20 19:06 ` [PATCH v9] " Jason Yundt
  2025-01-21 13:35 ` [PATCH v10] man/man7/pathname.7: Add file documenting format of pathnames Jason Yundt
  8 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-20 16:24 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man, Florian Weimer

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from these sources:

• <https://unix.stackexchange.com/a/39179/316181>
• <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
• <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
• <man:unix(7)>
• <https://unix.stackexchange.com/q/92426/316181>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
Here’s what I changed from the previous version:

• I made the changes to the example program that Alex requested.

 man/man7/pathname.7 | 167 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 man/man7/pathname.7

diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
new file mode 100644
index 000000000..5fc5e3a81
--- /dev/null
+++ b/man/man7/pathname.7
@@ -0,0 +1,167 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH pathname 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+pathname,
+filename
+\-
+how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with pathnames,
+there are kernel-space requirements that you must comply with,
+and user-space requirements that you should comply with.
+.P
+The kernel stores pathnames as null-terminated byte sequences.
+The kernel has a few general rules that apply to all pathnames:
+.IP \[bu] 3
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/)
+and cannot be part of a filename.
+.IP \[bu]
+A pathname can be at most PATH_MAX bytes long.
+PATH_MAX is defined in
+.BR limits.h (0p)\
+\.
+A pathname that’s longer than PATH_MAX bytes
+can be split into multiple smaller pathnames and opened piecewise using
+.BR openat (2).
+.IP \[bu]
+A filename can be at most a certain number of bytes long.
+The number is filesystem-specific.
+You can get the filename length limit for a currently mounted filesystem
+by passing _PC_NAME_MAX to
+.BR fpathconf (3)\
+\.
+For maximum portability, programs should be able to handle filenames
+that are as long as the relevant filesystems will allow.
+For maximum portability, programs and users should limit the length
+of their own pathnames to NAME_MAX bytes.
+NAME_MAX is defined in
+.BR limits.h (0p)\
+\.
+.P
+The kernel also has some rules that only apply in certain situations.
+Here are some examples:
+.IP \[bu] 3
+Filenames on the ext4 filesystem can be at most 30 bytes long.
+.IP \[bu]
+Filenames on the vfat filesystem cannot a
+0x22, 0x2A, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C or 0x7C byte
+(", *, :, <, >, ?, \ or | in ASCII)
+unless the filesystem was mounted with iocharset set to something unusual.
+.IP \[bu]
+A UNIX domain socket’s sun_path can be at most 108 bytes long (see
+.BR unix (7)
+for details).
+.P
+User space treats pathnames differently.
+User space applications typically expect pathnames to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own pathnames to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH EXAMPLES
+The following program demonstrates
+how to ensure that a pathname uses the proper encoding.
+The program starts with a UTF-32 encoded pathname.
+It then calls
+.BR nl_langinfo (3)
+in order to determine what the current locale’s codeset is.
+After that, it uses
+.BR iconv (3)
+to convert the UTF-32 encoded pathname into a locale codeset encoded pathname.
+Finally, the program uses the locale codeset encoded pathname to create
+a file that contains the message “Hello, world!”
+.SS Program source
+.\" SRC BEGIN (pathname_encoding_example.c)
+.EX
+#include <err.h>
+#include <iconv.h>
+#include <langinfo.h>
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <uchar.h>
+\&
+#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
+\&
+int
+main(void)
+{
+    size_t size;
+    char32_t utf32_pathname[] = U"example";
+    char *locale_pathname;
+    iconv_t cd;
+    char *inbuf;
+    size_t inbytesleft;
+    char *outbuf;
+    size_t outbytesleft;
+    size_t iconv_result;
+    FILE *fp;
+
+    if (setlocale(LC_ALL, "") == NULL) {
+        err(EXIT_FAILURE, "setlocale");
+    }
+    size = NELEMS(utf32_pathname) * MB_CUR_MAX;
+    locale_pathname = malloc(size);
+    if (locale_pathname == NULL) {
+      err(EXIT_FAILURE, "malloc");
+    }
+\&
+    cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
+    if (cd == (iconv_t) \- 1) {
+        err(EXIT_FAILURE, "iconv_open");
+    }
+    inbuf = (char *) utf32_pathname;
+    inbytesleft = sizeof utf32_pathname;
+    outbuf = locale_pathname;
+    outbytesleft = size;
+    iconv_result =
+        iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
+    if (iconv_result == \-1) {
+        err(EXIT_FAILURE, "iconv");
+    }
+    // This ensures that the conversion is 100% complete.
+    // See iconv(3) for details.
+    iconv_result =
+        iconv(cd, NULL, &inbytesleft, &outbuf, &outbytesleft);
+    if (iconv_result == \-1) {
+        err(EXIT_FAILURE, "iconv");
+    }
+    if (iconv_close(cd) == \-1) {
+        err(EXIT_FAILURE, "iconv_close");
+    }
+\&
+    fp = fopen(locale_pathname, "w");
+    fputs("Hello, world!\\n", fp);
+    if (fclose(fp) == EOF) {
+        err(EXIT_FAILURE, "fclose");
+    }
+\&
+    free(locale_pathname);
+    exit(EXIT_SUCCESS);
+}
+.EE
+.\" SRC END
+.SH SEE ALSO
+.BR limits.h (0p),
+.BR open (2),
+.BR fpathconf (3),
+.BR iconv (3),
+.BR nl_langinfo (3),
+.BR path_resolution (7),
+.BR mount (8)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v8] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-20 16:24 ` [PATCH v8] " Jason Yundt
@ 2025-01-20 16:36   ` Alejandro Colomar
  0 siblings, 0 replies; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-20 16:36 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 2887 bytes --]

Hi Jason,

On Mon, Jan 20, 2025 at 11:24:14AM -0500, Jason Yundt wrote:
> +.SS Program source
> +.\" SRC BEGIN (pathname_encoding_example.c)
> +.EX
> +#include <err.h>
> +#include <iconv.h>
> +#include <langinfo.h>
> +#include <locale.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <uchar.h>
> +\&
> +#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
> +\&
> +int
> +main(void)
> +{
> +    size_t size;
> +    char32_t utf32_pathname[] = U"example";

This should be const-qualified, since we won't be modifying it.  Please
add const where needed.

> +    char *locale_pathname;
> +    iconv_t cd;
> +    char *inbuf;
> +    size_t inbytesleft;
> +    char *outbuf;
> +    size_t outbytesleft;
> +    size_t iconv_result;
> +    FILE *fp;

Please group declarations of the same type in consecutive lines.
Shorter type names up and longer type names below.  For same length,
please use alphabetic order.

> +
> +    if (setlocale(LC_ALL, "") == NULL) {
> +        err(EXIT_FAILURE, "setlocale");
> +    }

Please don't use braces for a single statement.
Add a blank line where closing brace is now, for readability.

> +    size = NELEMS(utf32_pathname) * MB_CUR_MAX;
> +    locale_pathname = malloc(size);
> +    if (locale_pathname == NULL) {
> +      err(EXIT_FAILURE, "malloc");
> +    }
> +\&
> +    cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
> +    if (cd == (iconv_t) \- 1) {
> +        err(EXIT_FAILURE, "iconv_open");
> +    }
> +    inbuf = (char *) utf32_pathname;
> +    inbytesleft = sizeof utf32_pathname;
> +    outbuf = locale_pathname;
> +    outbytesleft = size;
> +    iconv_result =
> +        iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);

I would rename the variables {in,out}bytesleft to just {in,out}bytes.

> +    if (iconv_result == \-1) {
> +        err(EXIT_FAILURE, "iconv");
> +    }
> +    // This ensures that the conversion is 100% complete.
> +    // See iconv(3) for details.
> +    iconv_result =
> +        iconv(cd, NULL, &inbytesleft, &outbuf, &outbytesleft);
> +    if (iconv_result == \-1) {
> +        err(EXIT_FAILURE, "iconv");
> +    }
> +    if (iconv_close(cd) == \-1) {
> +        err(EXIT_FAILURE, "iconv_close");
> +    }
> +\&
> +    fp = fopen(locale_pathname, "w");
> +    fputs("Hello, world!\\n", fp);

For writing a '\', you should use \[rs] (that means "reverse solidus").


Have a lovely day!
Alex

> +    if (fclose(fp) == EOF) {
> +        err(EXIT_FAILURE, "fclose");
> +    }
> +\&
> +    free(locale_pathname);
> +    exit(EXIT_SUCCESS);
> +}
> +.EE
> +.\" SRC END
> +.SH SEE ALSO
> +.BR limits.h (0p),
> +.BR open (2),
> +.BR fpathconf (3),
> +.BR iconv (3),
> +.BR nl_langinfo (3),
> +.BR path_resolution (7),
> +.BR mount (8)
> -- 
> 2.47.1
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
                   ` (6 preceding siblings ...)
  2025-01-20 16:24 ` [PATCH v8] " Jason Yundt
@ 2025-01-20 19:06 ` Jason Yundt
  2025-01-20 22:26   ` Alejandro Colomar
  2025-01-21 13:35 ` [PATCH v10] man/man7/pathname.7: Add file documenting format of pathnames Jason Yundt
  8 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-20 19:06 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man, Florian Weimer

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from these sources:

• <https://unix.stackexchange.com/a/39179/316181>
• <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
• <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
• <man:unix(7)>
• <https://unix.stackexchange.com/q/92426/316181>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
Here’s what I changed from the previous version:

• I removed the second iconv() call.
• I made utf32_pathname const.  I think that that was the only one that could
  be made const, but correct me if I’m wrong.
• I changed the order of the variable declarations.  I think that they’re in
  the correct order now, but correct me if I’m wrong.
• I removed the curly brackets from all of the if statements.
• I renamed inbytesleft to inbytes and outbytesleft to outbytes.
• I replaced the \\ with \[rs].

 man/man7/pathname.7 | 160 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 160 insertions(+)
 create mode 100644 man/man7/pathname.7

diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
new file mode 100644
index 000000000..5864f230d
--- /dev/null
+++ b/man/man7/pathname.7
@@ -0,0 +1,160 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH pathname 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+pathname,
+filename
+\-
+how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with pathnames,
+there are kernel-space requirements that you must comply with,
+and user-space requirements that you should comply with.
+.P
+The kernel stores pathnames as null-terminated byte sequences.
+The kernel has a few general rules that apply to all pathnames:
+.IP \[bu] 3
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/)
+and cannot be part of a filename.
+.IP \[bu]
+A pathname can be at most PATH_MAX bytes long.
+PATH_MAX is defined in
+.BR limits.h (0p)\
+\.
+A pathname that’s longer than PATH_MAX bytes
+can be split into multiple smaller pathnames and opened piecewise using
+.BR openat (2).
+.IP \[bu]
+A filename can be at most a certain number of bytes long.
+The number is filesystem-specific.
+You can get the filename length limit for a currently mounted filesystem
+by passing _PC_NAME_MAX to
+.BR fpathconf (3)\
+\.
+For maximum portability, programs should be able to handle filenames
+that are as long as the relevant filesystems will allow.
+For maximum portability, programs and users should limit the length
+of their own pathnames to NAME_MAX bytes.
+NAME_MAX is defined in
+.BR limits.h (0p)\
+\.
+.P
+The kernel also has some rules that only apply in certain situations.
+Here are some examples:
+.IP \[bu] 3
+Filenames on the ext4 filesystem can be at most 30 bytes long.
+.IP \[bu]
+Filenames on the vfat filesystem cannot a
+0x22, 0x2A, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C or 0x7C byte
+(", *, :, <, >, ?, \ or | in ASCII)
+unless the filesystem was mounted with iocharset set to something unusual.
+.IP \[bu]
+A UNIX domain socket’s sun_path can be at most 108 bytes long (see
+.BR unix (7)
+for details).
+.P
+User space treats pathnames differently.
+User space applications typically expect pathnames to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own pathnames to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH EXAMPLES
+The following program demonstrates
+how to ensure that a pathname uses the proper encoding.
+The program starts with a UTF-32 encoded pathname.
+It then calls
+.BR nl_langinfo (3)
+in order to determine what the current locale’s codeset is.
+After that, it uses
+.BR iconv (3)
+to convert the UTF-32 encoded pathname into a locale codeset encoded pathname.
+Finally, the program uses the locale codeset encoded pathname to create
+a file that contains the message “Hello, world!”
+.SS Program source
+.\" SRC BEGIN (pathname_encoding_example.c)
+.EX
+#include <err.h>
+#include <iconv.h>
+#include <langinfo.h>
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <uchar.h>
+\&
+#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
+\&
+int
+main(void)
+{
+    char *inbuf;
+    char *locale_pathname;
+    char *outbuf;
+    FILE *fp;
+    size_t iconv_result;
+    size_t inbytes;
+    size_t outbytes;
+    size_t size;
+    iconv_t cd;
+    const char32_t utf32_pathname[] = U"example";
+\&
+    if (setlocale(LC_ALL, "") == NULL)
+        err(EXIT_FAILURE, "setlocale");
+\&
+    size = NELEMS(utf32_pathname) * MB_CUR_MAX;
+    locale_pathname = malloc(size);
+    if (locale_pathname == NULL)
+      err(EXIT_FAILURE, "malloc");
+\&
+    cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
+    if (cd == (iconv_t) \- 1)
+        err(EXIT_FAILURE, "iconv_open");
+\&
+    inbuf = (char *) utf32_pathname;
+    inbytes = sizeof utf32_pathname;
+    outbuf = locale_pathname;
+    outbytes = size;
+    iconv_result =
+        iconv(cd, &inbuf, &inbytes, &outbuf, &outbytes);
+    if (iconv_result == \-1)
+        err(EXIT_FAILURE, "iconv");
+\&
+    if (iconv_result == \-1)
+        err(EXIT_FAILURE, "iconv");
+\&
+    if (iconv_close(cd) == \-1)
+        err(EXIT_FAILURE, "iconv_close");
+\&
+    fp = fopen(locale_pathname, "w");
+    fputs("Hello, world!\[rs]n", fp);
+    if (fclose(fp) == EOF)
+        err(EXIT_FAILURE, "fclose");
+\&
+    free(locale_pathname);
+    exit(EXIT_SUCCESS);
+}
+.EE
+.\" SRC END
+.SH SEE ALSO
+.BR limits.h (0p),
+.BR open (2),
+.BR fpathconf (3),
+.BR iconv (3),
+.BR nl_langinfo (3),
+.BR path_resolution (7),
+.BR mount (8)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-20 19:06 ` [PATCH v9] " Jason Yundt
@ 2025-01-20 22:26   ` Alejandro Colomar
  2025-01-21  0:26     ` C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames) G. Branden Robinson
  0 siblings, 1 reply; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-20 22:26 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 3847 bytes --]

Hi Jason,

On Mon, Jan 20, 2025 at 02:06:26PM -0500, Jason Yundt wrote:
> The goal of this new manual page is to help people create programs that
> do the right thing even in the face of unusual paths.  The information
> that I used to create this new manual page came from these sources:
> 
> • <https://unix.stackexchange.com/a/39179/316181>
> • <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
> • <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
> • <man:unix(7)>
> • <https://unix.stackexchange.com/q/92426/316181>
> 
> Signed-off-by: Jason Yundt <jason@jasonyundt.email>
> ---
> Here’s what I changed from the previous version:
> 
> • I removed the second iconv() call.
> • I made utf32_pathname const.  I think that that was the only one that could
>   be made const, but correct me if I’m wrong.
> • I changed the order of the variable declarations.  I think that they’re in
>   the correct order now, but correct me if I’m wrong.
> • I removed the curly brackets from all of the if statements.
> • I renamed inbytesleft to inbytes and outbytesleft to outbytes.
> • I replaced the \\ with \[rs].

Thanks!

> +.SS Program source
> +.\" SRC BEGIN (pathname_encoding_example.c)
> +.EX
> +#include <err.h>
> +#include <iconv.h>
> +#include <langinfo.h>
> +#include <locale.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <uchar.h>
> +\&
> +#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
> +\&
> +int
> +main(void)
> +{
> +    char *inbuf;

I'd rename inbuf,outbut to in,out.

> +    char *locale_pathname;
> +    char *outbuf;
> +    FILE *fp;
> +    size_t iconv_result;

I've removed this variable (see below).

> +    size_t inbytes;
> +    size_t outbytes;
> +    size_t size;
> +    iconv_t cd;

Please align (and merge some) with spaces the above as:

    char     *locale_pathname;
    char     *in, *out;
    FILE     *fp;
    size_t   size;
    size_t   inbytes, outbytes;
    iconv_t  cd;

I've also reordered a few so that they appear in order of use (more or
less).

> +    const char32_t utf32_pathname[] = U"example";
> +\&
> +    if (setlocale(LC_ALL, "") == NULL)
> +        err(EXIT_FAILURE, "setlocale");
> +\&
> +    size = NELEMS(utf32_pathname) * MB_CUR_MAX;
> +    locale_pathname = malloc(size);
> +    if (locale_pathname == NULL)
> +      err(EXIT_FAILURE, "malloc");
> +\&
> +    cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
> +    if (cd == (iconv_t) \- 1)
> +        err(EXIT_FAILURE, "iconv_open");
> +\&
> +    inbuf = (char *) utf32_pathname;
> +    inbytes = sizeof utf32_pathname;

Please use sizeof(utf32_pathname), with parentheses.

> +    outbuf = locale_pathname;
> +    outbytes = size;
> +    iconv_result =
> +        iconv(cd, &inbuf, &inbytes, &outbuf, &outbytes);
> +    if (iconv_result == \-1)

This variable seems useless:

    if (iconv(cd, &in, &inbytes, &out, &outbytes) == \-1)

> +        err(EXIT_FAILURE, "iconv");
> +\&
> +    if (iconv_result == \-1)
> +        err(EXIT_FAILURE, "iconv");

This is a leftover from the previous version.


Have a lovely night!
Alex

> +\&
> +    if (iconv_close(cd) == \-1)
> +        err(EXIT_FAILURE, "iconv_close");
> +\&
> +    fp = fopen(locale_pathname, "w");
> +    fputs("Hello, world!\[rs]n", fp);
> +    if (fclose(fp) == EOF)
> +        err(EXIT_FAILURE, "fclose");
> +\&
> +    free(locale_pathname);
> +    exit(EXIT_SUCCESS);
> +}
> +.EE
> +.\" SRC END
> +.SH SEE ALSO
> +.BR limits.h (0p),
> +.BR open (2),
> +.BR fpathconf (3),
> +.BR iconv (3),
> +.BR nl_langinfo (3),
> +.BR path_resolution (7),
> +.BR mount (8)
> -- 
> 2.47.1
> 
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames)
  2025-01-20 22:26   ` Alejandro Colomar
@ 2025-01-21  0:26     ` G. Branden Robinson
  2025-01-21  1:05       ` Alejandro Colomar
  2025-01-21 13:39       ` Jason Yundt
  0 siblings, 2 replies; 38+ messages in thread
From: G. Branden Robinson @ 2025-01-21  0:26 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man

[-- Attachment #1: Type: text/plain, Size: 6152 bytes --]

[dropped Florian from CC]

Hi Alex,

I have some feedback on project management.

And then a couple of minor technical points.

At 2025-01-20T23:26:04+0100, Alejandro Colomar wrote:
[much snipped]
> I'd rename inbuf,outbut to in,out.
> 
> > +    char *locale_pathname;
> > +    char *outbuf;
> > +    FILE *fp;
> > +    size_t iconv_result;
[...]
> I've removed this variable (see below).
[...]
> Please align (and merge some) with spaces the above as:
> 
>     char     *locale_pathname;
>     char     *in, *out;
>     FILE     *fp;
>     size_t   size;
>     size_t   inbytes, outbytes;
>     iconv_t  cd;
> 
> I've also reordered a few so that they appear in order of use (more or
> less).
[...]
> Please use sizeof(utf32_pathname), with parentheses.
[...]
> This variable seems useless:
> 
>     if (iconv(cd, &in, &inbytes, &out, &outbytes) == \-1)
> 
> > +        err(EXIT_FAILURE, "iconv");
> > +\&
> > +    if (iconv_result == \-1)
> > +        err(EXIT_FAILURE, "iconv");
> 
> This is a leftover from the previous version.
[even earlier]
> Please group declarations of the same type in consecutive lines.
> Shorter type names up and longer type names below.  For same length,
> please use alphabetic order.

This style of feedback is producing a lot of churn.  Jason's going to be
well into the v-teens before this patch is accepted, at this rate.

It appears to me that what is happening here is that you are iteratively
developing a C code style guide under the banner of a man page review.
If Jason's okay with being the test subject for this procedure, then
there's not exactly a problem here, but if it were me submitting a man
page, I'd be getting frustrated by (or before) this point.  I just "git
pulled" https://git.kernel.org/pub/scm/docs/man-pages/man-pages/ and
checked "./man/man7/man-pages.7", and practically _none_ of the rules
you've been stating to Jason is expressed there.

I propose that the submissions of patches to the Linux man-pages not be
a black-box process, with you serving as the oracle that accepts or
rejects the input.  I admit you're offering a bit more information than
a binary semaphore (ACCEPT/REJECT), but still, it would be better if
Jason, and others, had a "Linux man-pages example C code style guide"
document they could consult so that they could anticipate more of your
objections.

If the construction of such a document is what this precise instance of
the process is groping toward, good.  If not, then I suggest that it's
about time to prepare that document.

I don't dispute that having a consistent style for code examples in the
Linux man-page corpus is worthwhile; I do think it will, ultimately, pay
dividends to harried hackers in a hurry.  But it is precisely to the
extent that style guidelines are arbitrary that they need to be
documented and easily located.

On different, nerdier subjects...

> Please don't use braces for a single statement.

I think they are helpful for clarity.  Yes, modern compilers will warn
about misleading indentation.  I still think braces around any block
guarded by control instructions are a good idea for the human brain
interpreting code.  And the presence of the braces costs nothing at
translation time.  Does any compiler construct a new stack frame just
because it saw an opening brace in the input (that wasn't part of an
initializer)?

> Please separate declarations from code.

I think this is considered old-fashioned in some quarters.  It has been
valid since ISO C99 to introduce declarations anywhere, and a common
style is to place them at, or adjacently to, the point where they're
used.

The traditional arrangement of placing all declarations at the top of a
function definition arises, as I understand it, from the limitations of
early compilers, which were often--and sometimes had to be--simple and
small.  When the compiler read the function definition, it could
generate an assembly language preamble for setting up a stack frame that
reserved all of the room necessary for any storage of automatic
duration, and then start translating statements into instructions at
once.[1]  (A test of this understanding would be whether any pre-C99
compilers rejected "late" declaration of automatic variables, but
happily accepted them for static or register variables, since those
would not complicate stack memory allocation.  I'm not quite old enough
to say; for the first <mumble> years of my programming career, GCC was
the only C compiler I ever used.)

Anyway, this is another of those matters of taste, so if mandatory early
declarations are to be the rule, you probably want to say so explicitly
so that you're not mistaken for a grognard who either isn't aware that
ISO C99 happened, or, like Dennis Ritchie, refused to countenance its
its existence with a 3rd edition of its central textbook, and eventually
ran out of time to do so.  (In a 2000 interview, he said it "needs to
quiesce for a while".)[2]

Finally, I'll note that asserting a dichotomy between "declarations" and
"code" can be misleading.  Declarations can generate assembly language
too, and not just when they are coupled with initializers.  I'd say
"declarations" and "statements" instead, or avoid the issue entirely and
say something like, "group all variable declarations at the top of each
function".

Regards,
Branden

[1] This is also borne out by other structural features of pre-ANSI C
    function definitions.  Return type first because the corresponding
    value will need to be visible in the enclosing scope.  Then,
    _outside_ the function parameter parentheses, the types of any
    arguments the function takes, because they'll be pushed onto the
    stack before the function is called.  Then, inside the corresponding
    assembly subroutine, stack memory is set aside to house whatever
    local--meaning non-static, non-register, storage is needed.

    Maybe if I actually wrote a compiler for pre-ANSI C, I'd know for
    sure.  ;-)

[2] http://www.gotw.ca/publications/c_family_interview.htm

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames)
  2025-01-21  0:26     ` C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames) G. Branden Robinson
@ 2025-01-21  1:05       ` Alejandro Colomar
  2025-01-21 13:39       ` Jason Yundt
  1 sibling, 0 replies; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-21  1:05 UTC (permalink / raw)
  To: G. Branden Robinson; +Cc: Jason Yundt, linux-man

[-- Attachment #1: Type: text/plain, Size: 7859 bytes --]

Hi Branden,

On Mon, Jan 20, 2025 at 06:26:06PM -0600, G. Branden Robinson wrote:
> [even earlier]
> > Please group declarations of the same type in consecutive lines.
> > Shorter type names up and longer type names below.  For same length,
> > please use alphabetic order.
> 
> This style of feedback is producing a lot of churn.  Jason's going to be
> well into the v-teens before this patch is accepted, at this rate.
> 
> It appears to me that what is happening here is that you are iteratively
> developing a C code style guide under the banner of a man page review.

Not really.  I'm just not looking at all the code at once, because it
was highly unreadable.  Also, I expected that if I told all the issues I
spotted at once, some might not make much sense at the same time.

I like reviewing a small number of issues in each iteration, and don't
load the contributor too much by pointing out 100 issues in their patch
at once, which might get them the feeling of "hell, where do I start?".

That coding style I already use it, and should be mostly the same as all
other manual pages in this repository are following (except that maybe I
haven't changed a few things in some existing code because the bar is
slightly higher for additions than for existing pages (hysteresis comes
into play)).

> If Jason's okay with being the test subject for this procedure, then
> there's not exactly a problem here, but if it were me submitting a man
> page, I'd be getting frustrated by (or before) this point.  I just "git
> pulled" https://git.kernel.org/pub/scm/docs/man-pages/man-pages/ and
> checked "./man/man7/man-pages.7", and practically _none_ of the rules
> you've been stating to Jason is expressed there.

Yep, the C coding style is not stated.  But that doesn't mean the
project doesn't have one.  I haven't written in paper one of my own, and
my C coding style is an interesting mix of

	<https://git.kernel.org/pub/scm/git/git.git/tree/Documentation/CodingGuidelines>
	<https://www.gnu.org/prep/standards/standards.html>
	<https://google.github.io/styleguide/>
	<https://www.kernel.org/doc/html/latest/process/coding-style.html>
	<https://nginx.org/en/docs/dev/development_guide.html#code_style>
	<http://doc.cat-v.org/bell_labs/pikestyle>
	<https://www.cis.upenn.edu/~lee/06cse480/data/cstyle.ms.pdf>

(in no particular order.)  I should probably write my own somewhere, but
that takes some time.  I'll try to, at some point.

> I propose that the submissions of patches to the Linux man-pages not be
> a black-box process, with you serving as the oracle that accepts or
> rejects the input.  I admit you're offering a bit more information than
> a binary semaphore (ACCEPT/REJECT), but still, it would be better if
> Jason, and others, had a "Linux man-pages example C code style guide"
> document they could consult so that they could anticipate more of your
> objections.
> 
> If the construction of such a document is what this precise instance of
> the process is groping toward, good.  If not, then I suggest that it's
> about time to prepare that document.
> 
> I don't dispute that having a consistent style for code examples in the
> Linux man-page corpus is worthwhile; I do think it will, ultimately, pay
> dividends to harried hackers in a hurry.  But it is precisely to the
> extent that style guidelines are arbitrary that they need to be
> documented and easily located.
> 
> On different, nerdier subjects...
> 
> > Please don't use braces for a single statement.
> 
> I think they are helpful for clarity.  Yes, modern compilers will warn
> about misleading indentation.  I still think braces around any block
> guarded by control instructions are a good idea for the human brain
> interpreting code.

In GNU code, with 2 spaces, definitely!  :)

In the civilised world, I think a blank line after the indented
statement is enough for the brain to see the structure.

On the contrary, I think that braces clutter the code, and take me more
time to read.  My brain doesn't even read indented stuff, on the premise
that it's unimportant code (usually handling error cases, whose handling
is unimportant for the main story).

>  And the presence of the braces costs nothing at
> translation time.  Does any compiler construct a new stack frame just
> because it saw an opening brace in the input (that wasn't part of an
> initializer)?

I only do it for readability reasons.  Performance should be the same.

> > Please separate declarations from code.
> 
> I think this is considered old-fashioned in some quarters.  It has been
> valid since ISO C99 to introduce declarations anywhere, and a common
> style is to place them at, or adjacently to, the point where they're
> used.

I like C99's for-loop variables.  But mixing statements and declarations
makes the code less readable.

I like being presented the protagonists of a function, and then told the
story around those protagonists.  Usually, by looking at the
declarations, you can already guess most of the story.  If the story is
so deep that some protagonists have to be presented mid-story, the story
is complex enough to be separated into a helper function.

C89 declarations serve as a reminder that you should break your
functions if they're big enough that you lost track of the declarations.

> The traditional arrangement of placing all declarations at the top of a
> function definition arises, as I understand it, from the limitations of
> early compilers, which were often--and sometimes had to be--simple and
> small.

Don't forget that brains keep being simple and small.

>  When the compiler read the function definition, it could
> generate an assembly language preamble for setting up a stack frame that
> reserved all of the room necessary for any storage of automatic
> duration, and then start translating statements into instructions at
> once.[1]  (A test of this understanding would be whether any pre-C99
> compilers rejected "late" declaration of automatic variables, but
> happily accepted them for static or register variables, since those
> would not complicate stack memory allocation.  I'm not quite old enough
> to say; for the first <mumble> years of my programming career, GCC was
> the only C compiler I ever used.)
> 
> Anyway, this is another of those matters of taste, so if mandatory early
> declarations are to be the rule, you probably want to say so explicitly

Yeah, there are many things I should mention there.

> so that you're not mistaken for a grognard who either isn't aware that
> ISO C99 happened, or, like Dennis Ritchie, refused to countenance its
> its existence with a 3rd edition of its central textbook, and eventually
> ran out of time to do so.  (In a 2000 interview, he said it "needs to
> quiesce for a while".)[2]

Heh!  While I wrote (literally) a couple of programs in the early 2000s
with the help of my father, I didn't program again in C until after
2011.  The first standard under which I programmed my third program
(first solo programming) was C11 already.  :-)

To me, C89 is a dead language, and C99 is already getting into the same
category.  BTW, I joined WG14 (the ISO C committee) last year.  :)

See also: <https://thephd.dev/the-big-array-size-survey-for-c>

> Finally, I'll note that asserting a dichotomy between "declarations" and
> "code" can be misleading.  Declarations can generate assembly language
> too, and not just when they are coupled with initializers.  I'd say
> "declarations" and "statements" instead,

Makes sense.

> or avoid the issue entirely and
> say something like, "group all variable declarations at the top of each
> function".

Have a lovely night!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v10] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
                   ` (7 preceding siblings ...)
  2025-01-20 19:06 ` [PATCH v9] " Jason Yundt
@ 2025-01-21 13:35 ` Jason Yundt
  2025-01-23 23:51   ` Alejandro Colomar
  8 siblings, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-21 13:35 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Jason Yundt, linux-man

The goal of this new manual page is to help people create programs that
do the right thing even in the face of unusual paths.  The information
that I used to create this new manual page came from these sources:

• <https://unix.stackexchange.com/a/39179/316181>
• <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
• <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
• <man:unix(7)>
• <https://unix.stackexchange.com/q/92426/316181>

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
Here’s what I changed from the previous version:

• I renamed inbuf to in and outbuf to out.
• I removed the iconv_result variable.
• I aligned and merged the variable declarations as requested.
• I added parentheses to my use of sizeof.
• I removed the leftover if statement.
• I removed some unintentional spaces.

 man/man7/pathname.7 | 152 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 152 insertions(+)
 create mode 100644 man/man7/pathname.7

diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
new file mode 100644
index 000000000..96e0009e1
--- /dev/null
+++ b/man/man7/pathname.7
@@ -0,0 +1,152 @@
+.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH pathname 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+pathname,
+filename
+\-
+how pathnames are encoded and interpreted
+.SH DESCRIPTION
+Some system calls allow you to pass a pathname as a parameter.
+When writing code that deals with pathnames,
+there are kernel-space requirements that you must comply with,
+and user-space requirements that you should comply with.
+.P
+The kernel stores pathnames as null-terminated byte sequences.
+The kernel has a few general rules that apply to all pathnames:
+.IP \[bu] 3
+The last byte in the sequence needs to be a null byte.
+.IP \[bu]
+Any other bytes in the sequence need to be non-null bytes.
+.IP \[bu]
+A 0x2F byte is always interpreted as a directory separator (/)
+and cannot be part of a filename.
+.IP \[bu]
+A pathname can be at most PATH_MAX bytes long.
+PATH_MAX is defined in
+.BR limits.h (0p)\
+\.
+A pathname that’s longer than PATH_MAX bytes
+can be split into multiple smaller pathnames and opened piecewise using
+.BR openat (2).
+.IP \[bu]
+A filename can be at most a certain number of bytes long.
+The number is filesystem-specific.
+You can get the filename length limit for a currently mounted filesystem
+by passing _PC_NAME_MAX to
+.BR fpathconf (3)\
+\.
+For maximum portability, programs should be able to handle filenames
+that are as long as the relevant filesystems will allow.
+For maximum portability, programs and users should limit the length
+of their own pathnames to NAME_MAX bytes.
+NAME_MAX is defined in
+.BR limits.h (0p)\
+\.
+.P
+The kernel also has some rules that only apply in certain situations.
+Here are some examples:
+.IP \[bu] 3
+Filenames on the ext4 filesystem can be at most 30 bytes long.
+.IP \[bu]
+Filenames on the vfat filesystem cannot a
+0x22, 0x2A, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C or 0x7C byte
+(", *, :, <, >, ?, \ or | in ASCII)
+unless the filesystem was mounted with iocharset set to something unusual.
+.IP \[bu]
+A UNIX domain socket’s sun_path can be at most 108 bytes long (see
+.BR unix (7)
+for details).
+.P
+User space treats pathnames differently.
+User space applications typically expect pathnames to use
+a consistent character encoding.
+For maximum interoperability, programs should use
+.BR nl_langinfo (3)
+to determine the current locale’s codeset.
+Paths should be encoded and decoded using the current locale’s codeset
+in order to help prevent mojibake.
+For maximum interoperability,
+programs and users should also limit
+the characters that they use for their own pathnames to characters in
+.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
+the POSIX Portable Filename Character Set
+.UE .
+.SH EXAMPLES
+The following program demonstrates
+how to ensure that a pathname uses the proper encoding.
+The program starts with a UTF-32 encoded pathname.
+It then calls
+.BR nl_langinfo (3)
+in order to determine what the current locale’s codeset is.
+After that, it uses
+.BR iconv (3)
+to convert the UTF-32 encoded pathname into a locale codeset encoded pathname.
+Finally, the program uses the locale codeset encoded pathname to create
+a file that contains the message “Hello, world!”
+.SS Program source
+.\" SRC BEGIN (pathname_encoding_example.c)
+.EX
+#include <err.h>
+#include <iconv.h>
+#include <langinfo.h>
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <uchar.h>
+\&
+#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
+\&
+int
+main(void)
+{
+    char     *locale_pathname;
+    char     *in, *out;
+    FILE     *fp;
+    size_t   size;
+    size_t   inbytes, outbytes;
+    iconv_t  cd;
+    const char32_t utf32_pathname[] = U"example";
+\&
+    if (setlocale(LC_ALL, "") == NULL)
+        err(EXIT_FAILURE, "setlocale");
+\&
+    size = NELEMS(utf32_pathname) * MB_CUR_MAX;
+    locale_pathname = malloc(size);
+    if (locale_pathname == NULL)
+      err(EXIT_FAILURE, "malloc");
+\&
+    cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
+    if (cd == (iconv_t)\-1)
+        err(EXIT_FAILURE, "iconv_open");
+\&
+    in = (char *) utf32_pathname;
+    inbytes = sizeof(utf32_pathname);
+    out = locale_pathname;
+    outbytes = size;
+    if (iconv(cd, &in, &inbytes, &out, &outbytes) == \-1)
+        err(EXIT_FAILURE, "iconv");
+\&
+    if (iconv_close(cd) == \-1)
+        err(EXIT_FAILURE, "iconv_close");
+\&
+    fp = fopen(locale_pathname, "w");
+    fputs("Hello, world!\[rs]n", fp);
+    if (fclose(fp) == EOF)
+        err(EXIT_FAILURE, "fclose");
+\&
+    free(locale_pathname);
+    exit(EXIT_SUCCESS);
+}
+.EE
+.\" SRC END
+.SH SEE ALSO
+.BR limits.h (0p),
+.BR open (2),
+.BR fpathconf (3),
+.BR iconv (3),
+.BR nl_langinfo (3),
+.BR path_resolution (7),
+.BR mount (8)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames)
  2025-01-21  0:26     ` C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames) G. Branden Robinson
  2025-01-21  1:05       ` Alejandro Colomar
@ 2025-01-21 13:39       ` Jason Yundt
  2025-01-21 14:00         ` Alejandro Colomar
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Yundt @ 2025-01-21 13:39 UTC (permalink / raw)
  To: G. Branden Robinson; +Cc: Alejandro Colomar, linux-man

On Mon, Jan 20, 2025 at 06:26:06PM -0600, G. Branden Robinson wrote:
> This style of feedback is producing a lot of churn.  Jason's going to be
> well into the v-teens before this patch is accepted, at this rate.
> 
> It appears to me that what is happening here is that you are iteratively
> developing a C code style guide under the banner of a man page review.
> If Jason's okay with being the test subject for this procedure, then
> there's not exactly a problem here, but if it were me submitting a man
> page, I'd be getting frustrated by (or before) this point.  I just "git
> pulled" https://git.kernel.org/pub/scm/docs/man-pages/man-pages/ and
> checked "./man/man7/man-pages.7", and practically _none_ of the rules
> you've been stating to Jason is expressed there.
> 
> I propose that the submissions of patches to the Linux man-pages not be
> a black-box process, with you serving as the oracle that accepts or
> rejects the input.  I admit you're offering a bit more information than
> a binary semaphore (ACCEPT/REJECT), but still, it would be better if
> Jason, and others, had a "Linux man-pages example C code style guide"
> document they could consult so that they could anticipate more of your
> objections.
> 
> If the construction of such a document is what this precise instance of
> the process is groping toward, good.  If not, then I suggest that it's
> about time to prepare that document.
> 
> I don't dispute that having a consistent style for code examples in the
> Linux man-page corpus is worthwhile; I do think it will, ultimately, pay
> dividends to harried hackers in a hurry.  But it is precisely to the
> extent that style guidelines are arbitrary that they need to be
> documented and easily located.

Thank you for standing up for me here, Branden.  I am going to continue
the back and forth with Alex, but I am frustrated by the process.  It
does indeed feel like a black-box process.  I would have much preferred
it if Alex had given me as many feedback points as possible each time.
I really dislike it when I receive feedback and think to myself “I could
have fixed this all the way back in v6.  Why wasn’t I told this
earlier?”

I agree that having a “Linux man-pages example C code style guide” would
be good.  Alex said in another email “I'm just not looking at all the
code at once, because it was highly unreadable.”  It was impossible for
me to produce code that was not highly unreadable to Alex.  I say that
because readability is in the eye of the beholder.  When I first created
the example program, I did many things to try and make my code as
readable as possible.  What I’m discovering now is that most of the
things that I did made the code more readable for me and less readable
for Alex.  If there had been a “Linux man-pages example C code style
guide” document, then I would have produced code that was more readable
to Alex to begin with and I wouldn’t have been frustrated by the
process.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames)
  2025-01-21 13:39       ` Jason Yundt
@ 2025-01-21 14:00         ` Alejandro Colomar
  0 siblings, 0 replies; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-21 14:00 UTC (permalink / raw)
  To: Jason Yundt; +Cc: G. Branden Robinson, linux-man

[-- Attachment #1: Type: text/plain, Size: 2103 bytes --]

Hi Jason,

On Tue, Jan 21, 2025 at 08:39:49AM -0500, Jason Yundt wrote:
> Thank you for standing up for me here, Branden.  I am going to continue
> the back and forth with Alex, but I am frustrated by the process.  It
> does indeed feel like a black-box process.  I would have much preferred
> it if Alex had given me as many feedback points as possible each time.
> I really dislike it when I receive feedback and think to myself “I could
> have fixed this all the way back in v6.  Why wasn’t I told this
> earlier?”

Thank you for expressing your frustration.  I will take it into account.

> I agree that having a “Linux man-pages example C code style guide” would
> be good.

I've put that in my mental TODO list, and will try to have it soon.

> Alex said in another email “I'm just not looking at all the
> code at once, because it was highly unreadable.”  It was impossible for
> me to produce code that was not highly unreadable to Alex.  I say that
> because readability is in the eye of the beholder.

Agree.

> When I first created
> the example program, I did many things to try and make my code as
> readable as possible.  What I’m discovering now is that most of the
> things that I did made the code more readable for me and less readable
> for Alex.  If there had been a “Linux man-pages example C code style
> guide” document, then I would have produced code that was more readable
> to Alex to begin with and I wouldn’t have been frustrated by the
> process.

Anyway, I did actually send all the feedback I had remaining in v9, and
v10 should already be good (at least the example).  And the wording is
already good enough, AFAIR.  So there shouldn't be much more iteration.

And I think iterating isn't all that bad, because it makes us read the
code again, which helps catch other issues unintentionally, which is why
I never worried by having many patch iterations in general.  But I
understand that it might be uncomfortable as a contributor.


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v10] man/man7/pathname.7: Add file documenting format of pathnames
  2025-01-21 13:35 ` [PATCH v10] man/man7/pathname.7: Add file documenting format of pathnames Jason Yundt
@ 2025-01-23 23:51   ` Alejandro Colomar
  0 siblings, 0 replies; 38+ messages in thread
From: Alejandro Colomar @ 2025-01-23 23:51 UTC (permalink / raw)
  To: Jason Yundt; +Cc: linux-man, Florian Weimer, G. Branden Robinson

[-- Attachment #1: Type: text/plain, Size: 7012 bytes --]

Hi Jason,

On Tue, Jan 21, 2025 at 08:35:20AM -0500, Jason Yundt wrote:
> The goal of this new manual page is to help people create programs that
> do the right thing even in the face of unusual paths.  The information
> that I used to create this new manual page came from these sources:
> 
> • <https://unix.stackexchange.com/a/39179/316181>
> • <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>
> • <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/ext4.h?h=v6.12.9#n2288>
> • <man:unix(7)>
> • <https://unix.stackexchange.com/q/92426/316181>
> 
> Signed-off-by: Jason Yundt <jason@jasonyundt.email>

Thanks!  I've applied the patch, with some tweaks:
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=5e0b1cb79b88d3a78f60bf85bfd3a76df7c10307>

Feel free to send further patches.


Have a lovely night!
Alex

> ---
> Here’s what I changed from the previous version:
> 
> • I renamed inbuf to in and outbuf to out.
> • I removed the iconv_result variable.
> • I aligned and merged the variable declarations as requested.
> • I added parentheses to my use of sizeof.
> • I removed the leftover if statement.
> • I removed some unintentional spaces.
> 
>  man/man7/pathname.7 | 152 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 152 insertions(+)
>  create mode 100644 man/man7/pathname.7
> 
> diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
> new file mode 100644
> index 000000000..96e0009e1
> --- /dev/null
> +++ b/man/man7/pathname.7
> @@ -0,0 +1,152 @@
> +.\" Copyright (C) 2025 Jason Yundt (jason@jasonyundt.email)
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH pathname 7 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +pathname,
> +filename
> +\-
> +how pathnames are encoded and interpreted
> +.SH DESCRIPTION
> +Some system calls allow you to pass a pathname as a parameter.
> +When writing code that deals with pathnames,
> +there are kernel-space requirements that you must comply with,
> +and user-space requirements that you should comply with.
> +.P
> +The kernel stores pathnames as null-terminated byte sequences.
> +The kernel has a few general rules that apply to all pathnames:
> +.IP \[bu] 3
> +The last byte in the sequence needs to be a null byte.
> +.IP \[bu]
> +Any other bytes in the sequence need to be non-null bytes.
> +.IP \[bu]
> +A 0x2F byte is always interpreted as a directory separator (/)
> +and cannot be part of a filename.
> +.IP \[bu]
> +A pathname can be at most PATH_MAX bytes long.
> +PATH_MAX is defined in
> +.BR limits.h (0p)\
> +\.
> +A pathname that’s longer than PATH_MAX bytes
> +can be split into multiple smaller pathnames and opened piecewise using
> +.BR openat (2).
> +.IP \[bu]
> +A filename can be at most a certain number of bytes long.
> +The number is filesystem-specific.
> +You can get the filename length limit for a currently mounted filesystem
> +by passing _PC_NAME_MAX to
> +.BR fpathconf (3)\
> +\.
> +For maximum portability, programs should be able to handle filenames
> +that are as long as the relevant filesystems will allow.
> +For maximum portability, programs and users should limit the length
> +of their own pathnames to NAME_MAX bytes.
> +NAME_MAX is defined in
> +.BR limits.h (0p)\
> +\.
> +.P
> +The kernel also has some rules that only apply in certain situations.
> +Here are some examples:
> +.IP \[bu] 3
> +Filenames on the ext4 filesystem can be at most 30 bytes long.
> +.IP \[bu]
> +Filenames on the vfat filesystem cannot a
> +0x22, 0x2A, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C or 0x7C byte
> +(", *, :, <, >, ?, \ or | in ASCII)
> +unless the filesystem was mounted with iocharset set to something unusual.
> +.IP \[bu]
> +A UNIX domain socket’s sun_path can be at most 108 bytes long (see
> +.BR unix (7)
> +for details).
> +.P
> +User space treats pathnames differently.
> +User space applications typically expect pathnames to use
> +a consistent character encoding.
> +For maximum interoperability, programs should use
> +.BR nl_langinfo (3)
> +to determine the current locale’s codeset.
> +Paths should be encoded and decoded using the current locale’s codeset
> +in order to help prevent mojibake.
> +For maximum interoperability,
> +programs and users should also limit
> +the characters that they use for their own pathnames to characters in
> +.UR https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265
> +the POSIX Portable Filename Character Set
> +.UE .
> +.SH EXAMPLES
> +The following program demonstrates
> +how to ensure that a pathname uses the proper encoding.
> +The program starts with a UTF-32 encoded pathname.
> +It then calls
> +.BR nl_langinfo (3)
> +in order to determine what the current locale’s codeset is.
> +After that, it uses
> +.BR iconv (3)
> +to convert the UTF-32 encoded pathname into a locale codeset encoded pathname.
> +Finally, the program uses the locale codeset encoded pathname to create
> +a file that contains the message “Hello, world!”
> +.SS Program source
> +.\" SRC BEGIN (pathname_encoding_example.c)
> +.EX
> +#include <err.h>
> +#include <iconv.h>
> +#include <langinfo.h>
> +#include <locale.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <uchar.h>
> +\&
> +#define NELEMS(a)  (sizeof(a) / sizeof(a[0]))
> +\&
> +int
> +main(void)
> +{
> +    char     *locale_pathname;
> +    char     *in, *out;
> +    FILE     *fp;
> +    size_t   size;
> +    size_t   inbytes, outbytes;
> +    iconv_t  cd;
> +    const char32_t utf32_pathname[] = U"example";
> +\&
> +    if (setlocale(LC_ALL, "") == NULL)
> +        err(EXIT_FAILURE, "setlocale");
> +\&
> +    size = NELEMS(utf32_pathname) * MB_CUR_MAX;
> +    locale_pathname = malloc(size);
> +    if (locale_pathname == NULL)
> +      err(EXIT_FAILURE, "malloc");
> +\&
> +    cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
> +    if (cd == (iconv_t)\-1)
> +        err(EXIT_FAILURE, "iconv_open");
> +\&
> +    in = (char *) utf32_pathname;
> +    inbytes = sizeof(utf32_pathname);
> +    out = locale_pathname;
> +    outbytes = size;
> +    if (iconv(cd, &in, &inbytes, &out, &outbytes) == \-1)
> +        err(EXIT_FAILURE, "iconv");
> +\&
> +    if (iconv_close(cd) == \-1)
> +        err(EXIT_FAILURE, "iconv_close");
> +\&
> +    fp = fopen(locale_pathname, "w");
> +    fputs("Hello, world!\[rs]n", fp);
> +    if (fclose(fp) == EOF)
> +        err(EXIT_FAILURE, "fclose");
> +\&
> +    free(locale_pathname);
> +    exit(EXIT_SUCCESS);
> +}
> +.EE
> +.\" SRC END
> +.SH SEE ALSO
> +.BR limits.h (0p),
> +.BR open (2),
> +.BR fpathconf (3),
> +.BR iconv (3),
> +.BR nl_langinfo (3),
> +.BR path_resolution (7),
> +.BR mount (8)
> -- 
> 2.47.1
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-01-23 23:51 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-13 21:32 [PATCH] man/man7/path-format.7: Add file documenting format of pathnames Jason Yundt
2025-01-14  0:20 ` Alejandro Colomar
2025-01-14 12:54 ` [PATCH v2] " Jason Yundt
2025-01-14 13:14   ` Alejandro Colomar
2025-01-14 21:00     ` Jason Yundt
2025-01-14 23:06       ` Alejandro Colomar
2025-01-15 16:21         ` Jason Yundt
2025-01-15 16:47           ` Alejandro Colomar
2025-01-15 17:44             ` G. Branden Robinson
2025-01-15  9:01   ` Florian Weimer
2025-01-14 21:01 ` [PATCH v3] man/man7/path_format.7: " Jason Yundt
2025-01-15 16:20 ` [PATCH v4] man/man7/pathname.7: " Jason Yundt
2025-01-15 17:12   ` Florian Weimer
2025-01-15 17:20   ` Alejandro Colomar
2025-01-15 18:37     ` A modest proposal regarding pathnames (was: " G. Branden Robinson
2025-01-15 19:25       ` Alejandro Colomar
2025-01-15 19:47         ` Alejandro Colomar
2025-01-17 13:02 ` [PATCH v5] " Jason Yundt
2025-01-17 14:14   ` Alejandro Colomar
2025-01-18  0:01     ` Jason Yundt
2025-01-18  0:23       ` Alejandro Colomar
2025-01-19 13:17         ` Jason Yundt
2025-01-19 15:24           ` Alejandro Colomar
2025-01-20  8:20             ` Florian Weimer
2025-01-20 11:14               ` Alejandro Colomar
2025-01-20 13:17                 ` Jason Yundt
2025-01-20 13:25                   ` Alejandro Colomar
2025-01-17 23:59 ` [PATCH v6] " Jason Yundt
2025-01-20 16:24 ` [PATCH v8] " Jason Yundt
2025-01-20 16:36   ` Alejandro Colomar
2025-01-20 19:06 ` [PATCH v9] " Jason Yundt
2025-01-20 22:26   ` Alejandro Colomar
2025-01-21  0:26     ` C code style for Linux man-pages examples (was: [PATCH v9] man/man7/pathname.7: Add file documenting format of pathnames) G. Branden Robinson
2025-01-21  1:05       ` Alejandro Colomar
2025-01-21 13:39       ` Jason Yundt
2025-01-21 14:00         ` Alejandro Colomar
2025-01-21 13:35 ` [PATCH v10] man/man7/pathname.7: Add file documenting format of pathnames Jason Yundt
2025-01-23 23:51   ` Alejandro Colomar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox