Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
From: Michael Kerrisk (man-pages) @ 2019-10-09 10:32 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: mtk.manpages, Al Viro, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191009101733.kgbg2aekjguykddu@yavin>

Hello Aleksa,

On 10/9/19 12:17 PM, Aleksa Sarai wrote:
> On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> Hello Aleksa,
>>
>> Thanks for this. It's a great piece of documentation work!
>>
>> I would prefer the path_resolution(7) piece as a separate patch.
> 
> Thanks, and will do.
> 
>> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
>>> Rather than trying to merge the new syscall documentation into open.2
>>> (which would probably result in the man-page being incomprehensible),
>>> instead the new syscall gets its own dedicated page with links between
>>> open(2) and openat2(2) to avoid duplicating information such as the list
>>> of O_* flags or common errors.
>>
>> Yes, looking at the size of the proposed openat2(2) page,
>> this seems best.
>>>
>>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
>>> ---

[...]

>>> diff --git a/man2/openat2.2 b/man2/openat2.2
>>> new file mode 100644
>>> index 000000000000..c43c76046243
>>> --- /dev/null
>>> +++ b/man2/openat2.2

[...]

>>> +.TP
>>> +.B RESOLVE_NO_SYMLINKS
>>> +Disallow all symlink resolution during path resolution. If the trailing
>>
>> Disallow resolution of symbolic links during path resolution
>>
>>> +component is a symlink, and
>>
>> symbolic link (throughout the page)
>>
>>> +.I flags
>>> +contains both
>>> +.BR O_PATH " and " O_NOFOLLOW ","
>>> +then an
>>> +.B O_PATH
>>> +file descriptor referencing the symlink will be returned. This option implies
>>> +.BR RESOLVE_NO_MAGICLINKS .
>>> +
>>> +Users of this flag are encouraged to make its use configurable (unless it is
>>> +used for a specific security purpose), as symlinks are very widely used by
>>> +end-users and thus enabling this flag globally may result in spurious errors on
>>> +some systems.
>>
>> It's not really clear what you mean by "enabling this flag globally".
>> Could you reword, or explain in a bit more detail?
> 
> A better word might be "indiscriminately" -- the point being that if
> a program uses it for every openat2() call (and users cannot disable
> it), then the program will break on all sorts of systems.

Okay -- could you please amend the text to say something more like what
you just clarified.

> 
>>> +.TP
>>> +.B RESOLVE_NO_MAGICLINKS
>>> +Disallow all magic-link resolution during path resolution. If the trailing
>>> +component is a magic-link, and
>>> +.I flags
>>> +contains both
>>> +.BR O_PATH " and " O_NOFOLLOW ","
>>> +then an
>>> +.B O_PATH
>>> +file descriptor referencing the magic-link will be returned.
>>> +
>>> +Magic-links are symlink-like objects that are most notably found in
>>> +.BR proc (5)
>>> +(examples include
>>> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
>>> +Due to the potential danger of unknowingly opening these magic-links, it may be
>>> +preferable for users to disable their resolution entirely (see
>>> +.BR symlink (7)
>>> +for more details.)
>>> +.TP
>>> +.B RESOLVE_BENEATH
>>> +Do not permit the path resolution to succeed if any component of the resolution
>>> +is not a descendant of the directory indicated by
>>> +.IR dirfd .
>>> +This results in absolute symlinks (and absolute values of
>>> +.IR pathname )
>>> +to be rejected. Magic-link resolution is also not permitted.
>>
>> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
>> it would be good to state that more explicitly,
> 
> It does, though this might change in the future (some magic-link
> resolutions might be safe -- but it's unclear what the semantics should
> be). Users should explicitly set RESOLVE_NO_MAGICLINKS if they really
> don't want to resolve them.

Okay -- I understand. Perhaps you could then at least say something like:

Currently, this flag also disable magic-link resolution. However, this
may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.

>>> +
>>> +.TP
>>> +.B RESOLVE_IN_ROOT
>>> +Temporarily treat
>>> +.I dirfd
>>> +as the root of the filesystem (as though the user called
>>
>> Perhaps better:
>>
>> Treat
>> .I dirfd
>> as the root directory while resolving
>> .I pathname
>> (as though...)
> 
> Yeah that sounds better.
> 
>>> +.BR chroot (2)
>>> +with
>>> +.IR dirfd
>>> +as the argument.) Absolute symlinks and ".." path components will be scoped to
>>> +.IR dirfd . Magic-link resolution is also not permitted.
>>
>> Insert a newline before "Magic" to fix a formatting problem.
>>
>> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
>> it would be good to state that more explicitly,
> 
> Same reply as above.

See above :-)

[...]

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
From: Aleksa Sarai @ 2019-10-09 10:17 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, linux-api,
	linux-kernel
In-Reply-To: <b52e4a93-a3de-dcbf-3684-bb2c355f3f24@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 22645 bytes --]

On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> Hello Aleksa,
> 
> Thanks for this. It's a great piece of documentation work!
> 
> I would prefer the path_resolution(7) piece as a separate patch.

Thanks, and will do.

> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Rather than trying to merge the new syscall documentation into open.2
> > (which would probably result in the man-page being incomprehensible),
> > instead the new syscall gets its own dedicated page with links between
> > open(2) and openat2(2) to avoid duplicating information such as the list
> > of O_* flags or common errors.
> 
> Yes, looking at the size of the proposed openat2(2) page,
> this seems best.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  man2/open.2            |   5 +
> >  man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
> >  man7/path_resolution.7 |  57 ++++--
> >  3 files changed, 426 insertions(+), 17 deletions(-)
> >  create mode 100644 man2/openat2.2
> > 
> > diff --git a/man2/open.2 b/man2/open.2
> > index 7217fe056e5e..a0b43394bbee 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
> >  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
> >  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
> >  ", mode_t " mode );
> > +.PP
> > +/* Docuented separately, in \fBopenat2\fP(2). */
> 
> Documented
> 
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> >  .fi
> >  .PP
> >  .in -4n
> > @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
> >  .B O_DIRECTORY
> >  is ignored).
> >  .SH SEE ALSO
> > +.BR openat2 (2),
> 
> Entries here should into alphabetical order (within
> sections).
> 
> >  .BR chmod (2),
> >  .BR chown (2),
> >  .BR close (2),
> > diff --git a/man2/openat2.2 b/man2/openat2.2
> > new file mode 100644
> > index 000000000000..c43c76046243
> > --- /dev/null
> > +++ b/man2/openat2.2
> > @@ -0,0 +1,381 @@
> > +.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +openat2 \- open and possibly create a file (extended)
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/stat.h>
> > +.B #include <fcntl.h>
> > +.PP
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> > +.fi
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call; see NOTES.
> > +.SH DESCRIPTION
> > +The
> > +.BR openat2 ()
> > +system call is an extension of
> > +.BR openat (2)
> > +and provides a superset of its functionality. Rather than taking a single
> 
> Please start new sentences on new source lines. I recently added this
> text in man-pages(7):
> 
>    Use semantic newlines
>        In the source of a manual page, new sentences should be started on
>        new  lines,  and  long sentences should split into lines at clause
>        breaks (commas, semicolons, colons, and so on).  This  convention,
>        sometimes known as "semantic newlines", makes it easier to see the
>        effect of patches, which often operate at the level of  individual
>        sentences or sentence clauses.
> 
> > +.I flag
> > +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> > +seamless future extensions.
> 
> s/seamless//
> 
> > +.PP
> > +.I size
> > +must be set to
> > +.IR "sizeof(struct open_how)" ,
> > +to facilitate future extensions (see the "Extensibility" section of the
> > +\fBNOTES\fP for more detail on how extensions are handled.)
> > +
> > +.SS The open_how structure
> > +The following structure indicates how
> > +.I pathname
> > +should be opened, and acts as a superset of the
> > +.IR flag " and " mode
> > +arguments to
> > +.BR openat (2).
> > +.PP
> > +.in +4n
> > +.EX
> > +struct open_how {
> > +    uint32_t flags;              /* open(2)-style O_* flags. */
> > +    union {
> > +        uint16_t mode;           /* File mode bits for new file creation. */
> > +        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
> > +    };
> > +    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
> > +};
> > +.EE
> > +.in
> > +.PP
> > +Any future extensions to
> > +.BR openat2 ()
> > +will be implemented as new fields appended to the above structure, with the
> > +zero value of the new fields acting as though the extension were not present.
> > +.PP
> > +The meaning of each field is as follows:
> > +.RS
> > +
> > +.I flags
> > +.RSall
> > +The file creation and status flags to use for this operation. All of the
> > +.B O_*
> > +flags defined for
> > +.BR openat (2)
> > +are valid
> > +.BR openat2 ()
> > +flag values.
> > +.RE
> > +
> > +.I upgrade_mask
> > +.RS
> > +Restrict with which
> > +.I access modes
> > +the returned
> > +.B O_PATH
> > +descriptor may be re-opened (either through
> > +.B O_EMPTYPATH
> > +or
> > +.IR /proc/self/fd/ .)
> > +This field may only be set to a non-zero value if
> > +.I flags
> > +contains
> > +.BR O_PATH .
> > +By default, an
> > +.B O_PATH
> > +file descriptor of an ordinary file may be re-opened with with any access mode (but an
> > +.B O_PATH
> > +file descriptor of a magic-link may only be re-opened with access modes that
> > +the original magic-link possessed). The full list of
> 
> magic link (throughout the page)
> 
> > +.I upgrade_mask
> > +flags is given below.
> > +.TP
> > +.B UPGRADE_NOREAD
> > +Do not permit the
> > +.B O_PATH
> > +file descriptor to be re-opened for reading (i.e.
> > +.BR O_RDONLY " or " O_RDWR .)
> > +.TP
> > +.B UPGRADE_NOWRITE
> > +Do not permit the
> > +.B O_PATH
> > +file descriptor to be re-opened for writing (i.e.
> > +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> > +.RE
> > +.I resolve
> > +.RS
> > +Change how the components of
> > +.I pathname
> > +will be resolved (see
> > +.BR path_resolution (7)
> > +for background information.) The primary use-case for these flags is to allow
> 
> use case
> 
> > +trusted programs to restrict how un-trusted paths (or paths inside un-trusted
> 
> untrusted
> 
> > +directories) are resolved. The full list of
> > +.I resolve
> > +flags is given below.
> > +.TP
> > +.B RESOLVE_NO_XDEV
> > +Disallow all mount-point crossings during path resolution (including
> 
> I think better would be: "Disallow traversal of mount points". Do you 
> agree?

Yes, that sounds better.

> > +all bind-mounts).
> 
> bind mounts
> 
> > +
> > +Users of this flag are encouraged to make its use configurable (unless it is
> > +used for a specific security purpose), as bind-mounts are very widely used by
> > +end-users and thus enabling this flag globally may result in spurious errors on
> > +some systems.
> > +.TP
> > +.B RESOLVE_NO_SYMLINKS
> > +Disallow all symlink resolution during path resolution. If the trailing
> 
> Disallow resolution of symbolic links during path resolution
> 
> > +component is a symlink, and
> 
> symbolic link (throughout the page)
> 
> > +.I flags
> > +contains both
> > +.BR O_PATH " and " O_NOFOLLOW ","
> > +then an
> > +.B O_PATH
> > +file descriptor referencing the symlink will be returned. This option implies
> > +.BR RESOLVE_NO_MAGICLINKS .
> > +
> > +Users of this flag are encouraged to make its use configurable (unless it is
> > +used for a specific security purpose), as symlinks are very widely used by
> > +end-users and thus enabling this flag globally may result in spurious errors on
> > +some systems.
> 
> It's not really clear what you mean by "enabling this flag globally".
> Could you reword, or explain in a bit more detail?

A better word might be "indiscriminately" -- the point being that if
a program uses it for every openat2() call (and users cannot disable
it), then the program will break on all sorts of systems.

> > +.TP
> > +.B RESOLVE_NO_MAGICLINKS
> > +Disallow all magic-link resolution during path resolution. If the trailing
> > +component is a magic-link, and
> > +.I flags
> > +contains both
> > +.BR O_PATH " and " O_NOFOLLOW ","
> > +then an
> > +.B O_PATH
> > +file descriptor referencing the magic-link will be returned.
> > +
> > +Magic-links are symlink-like objects that are most notably found in
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Due to the potential danger of unknowingly opening these magic-links, it may be
> > +preferable for users to disable their resolution entirely (see
> > +.BR symlink (7)
> > +for more details.)
> > +.TP
> > +.B RESOLVE_BENEATH
> > +Do not permit the path resolution to succeed if any component of the resolution
> > +is not a descendant of the directory indicated by
> > +.IR dirfd .
> > +This results in absolute symlinks (and absolute values of
> > +.IR pathname )
> > +to be rejected. Magic-link resolution is also not permitted.
> 
> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
> it would be good to state that more explicitly,

It does, though this might change in the future (some magic-link
resolutions might be safe -- but it's unclear what the semantics should
be). Users should explicitly set RESOLVE_NO_MAGICLINKS if they really
don't want to resolve them.

> > +
> > +.TP
> > +.B RESOLVE_IN_ROOT
> > +Temporarily treat
> > +.I dirfd
> > +as the root of the filesystem (as though the user called
> 
> Perhaps better:
> 
> Treat
> .I dirfd
> as the root directory while resolving
> .I pathname
> (as though...)

Yeah that sounds better.

> > +.BR chroot (2)
> > +with
> > +.IR dirfd
> > +as the argument.) Absolute symlinks and ".." path components will be scoped to
> > +.IR dirfd . Magic-link resolution is also not permitted.
> 
> Insert a newline before "Magic" to fix a formatting problem.
> 
> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
> it would be good to state that more explicitly,

Same reply as above.

> > +
> > +However, unlike
> > +.BR chroot (2)
> > +(which changes the filesystem root persistently for an entire thread-group),
> 
> s/persistently for an entire thread-group/
>  /permanently for a process/
> 
> > +.B RESOLVE_IN_ROOT
> > +allows a program to efficiently restrict path resolution for only certain
> > +operations. It also has several hardening features (such as not permitting
> > +magic-link resolution) which
> > +.BR chroot (2)
> > +does not.
> > +.RE
> > +
> > +.RE
> > +
> > +.PP
> > +Unlike
> > +.BR openat (2),
> > +any unknown flags set in fields of
> > +.I how
> > +will result in an error, rather than being ignored. 
> 
> Thank you, thank you, thank you. It was sad
> that openat() never fixed that antifeature.

No problem, it's bothered me for a long time as well. :D

> > In addition, an error will
> > +be returned if the value of the
> > +.IR mode " and " upgrade_mask
> > +union is non-zero unless:
> > +.RS
> > +.IP * 3
> > +.I flags
> > +indicates that a new file will be created (it contains
> > +.BR O_CREAT " or " O_TMPFILE ),
> > +in which case
> > +.I mode
> > +may be any valid file mode.
> > +.IP *
> > +.I flags
> > +contains
> > +.BR O_PATH ,
> > +in which case
> > +.I upgrade_mask
> > +must only contain valid
> > +.B UPGRADE_*
> > +flags.
> > +.RE
> > +
> > +.SH RETURN VALUE
> > +On success, a new file descriptor is returned. On error, -1 is returned, and
> > +.I errno
> > +is set appropriately.
> > +
> > +.SH ERRORS
> > +The set of errors returned by
> > +.BR openat2 ()
> > +includes all of the errors returned by
> > +.BR openat (2),
> > +as well as the following additional errors:
> > +.TP
> > +.B EINVAL
> > +An unknown flag or invalid value was specified in
> > +.IR how .
> > +.TP
> > +.B EINVAL
> > +.I size
> > +was smaller than any known version of
> > +.IR "struct open_how" .
> > +.TP
> > +.B E2BIG
> > +An extension was specified in
> > +.IR how ,
> > +which the current kernel does not support (see the "Extensibility" section of
> > +the \fBNOTES\fP for more detail on how extensions are handled.)
> > +.TP
> > +.B EAGAIN
> > +.I resolve
> > +contains either
> > +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> > +and the kernel could not ensure that a ".." component didn't escape (due to a
> > +race condition or potential attack). Callers may choose to retry the
> > +.BR openat2 ()
> > +call.
> > +.TP
> > +.B EXDEV
> > +.I resolve
> > +contains either
> > +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> > +and a path component attempted to escape the root of the resolution.
> > +
> > +.TP
> > +.B EXDEV
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_XDEV ,
> > +and a path component attempted to cross a mount-point.
> 
> mount point
> 
> > +
> > +.TP
> > +.B ELOOP
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_SYMLINKS ,
> > +and one of the path components was a symlink.
> > +.TP
> > +.B ELOOP
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_MAGICLINKS ,
> > +and one of the path components was a magic-link.
> > +
> > +.SH VERSIONS
> > +.BR openat2 ()
> > +was added to Linux in kernel 5.FOO.
> > +
> > +.SH CONFORMING TO
> > +This system call is Linux-specific.
> > +
> > +The semantics of
> > +.B RESOLVE_BENEATH
> > +were modelled after FreeBSD's
> > +.BR O_BENEATH .
> > +
> > +.SH NOTES
> > +Glibc does not provide a wrapper for this system call; call it using
> > +.BR syscall (2).
> > +
> > +.SS Extensibility
> > +In order to allow for
> > +.I struct open_how
> > +to be extended in future kernel revisions,
> > +.BR openat2 ()
> > +requires userspace to specify what sized
> 
> s/what sized/the size of/
> 
> > +.I struct open_how
> > +structure they are passing. By providing this information, it is possible for
> > +.BR openat2 ()
> > +to provide both forwards- and backwards-compatibility \(em with
> > +.I size
> > +acting as an implicit version number (because new extension fields will always
> > +be appended, the size will always increase.) This extensibility design is very
> > +similar to other system calls such as
> > +.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
> 
> The following explantion of uszie and ksize is great. Thanks for that.

Glad to hear you don't think it's too much fluff. :D

> > +If we let
> > +.I usize
> > +be the size of the structure according to userspace and
> > +.I ksize
> > +be the size of the structure which the kernel supports, then there are only
> > +three cases to consider:
> > +
> > +.RS
> > +.IP * 3
> > +If
> > +.IR ksize " equals " usize ,
> > +then there is no version mismatch and
> > +.I how
> > +can be used verbatim.
> > +.IP *
> > +If
> > +.IR ksize " is larger than " usize ,
> > +then there are some extensions the kernel supports which the userspace program
> > +is unaware of. Because all extensions must have their zero values be a no-op,
> > +the kernel treats all of the extension fields not set by userspace to have zero
> > +values. This provides backwards-compatibility.
> > +.IP *
> > +If
> > +.IR ksize " is smaller than " usize ,
> > +then there are some extensions which the userspace program is aware of but the
> > +kernel does not support. Because all extensions must have their zero values be
> > +a no-op, the kernel can safely ignore the unsupported extension fields if they
> > +are all-zero. If any unsupported extension fields are non-zero, then an error
> > +is returned. This provides forwards-compatibility.
> > +.RE
> > +
> > +Therefore, most userspace programs will not need to have any special handling
> > +of extensions. However, if a userspace program wishes to determine what
> > +extensions the running kernel supports, they may conduct a binary search on
> > +.IR size
> > +(to find the largest value which doesn't produce an error.)
> > +
> > +.SH SEE ALSO
> > +.BR openat (2),
> > +.BR path_resolution (7),
> > +.BR symlink (7)
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 85dd354e9a93..3da3e5b614c8 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
> >  Some UNIX/Linux system calls have as parameter one or more filenames.
> >  A filename (or pathname) is resolved as follows.
> >  .SS Step 1: start of the resolution process
> > -If the pathname starts with the \(aq/\(aq character,
> > -the starting lookup directory
> > -is the root directory of the calling process.
> > -(A process inherits its
> > -root directory from its parent.
> > -Usually this will be the root directory
> > -of the file hierarchy.
> > -A process may get a different root directory
> > -by use of the
> > +If the pathname starts with the \(aq/\(aq character, the starting lookup
> > +directory is the root directory of the calling process. (A process inherits its
> > +root directory from its parent. Usually this will be the root directory of the
> > +file hierarchy. A process may get a different root directory by use of the
> >  .BR chroot (2)
> > -system call.
> > +system call, or may temporarily use a different root directory by using
> > +.BR openat2 (2)
> > +with the
> > +.B RESOLVE_IN_ROOT
> > +flag set.
> > +.PP
> >  A process may get an entirely private mount namespace in case
> >  it\(emor one of its ancestors\(emwas started by an invocation of the
> >  .BR clone (2)
> > @@ -48,16 +48,24 @@ system call that had the
> >  flag set.)
> >  This handles the \(aq/\(aq part of the pathname.
> >  .PP
> > -If the pathname does not start with the \(aq/\(aq character, the
> > -starting lookup directory of the resolution process is the current working
> > -directory of the process.
> > -(This is also inherited from the parent.
> > -It can be changed by use of the
> > +If the pathname does not start with the \(aq/\(aq character, the starting
> > +lookup directory of the resolution process is the current working directory of
> > +the process \(em or in the case of
> > +.BR openat (2)-style
> > +syscalls, the
> 
> system calls
> 
> > +.I dfd
> > +argument (or the current working directory if
> > +.B AT_FDCWD
> > +is passed as the
> > +.I dfd
> > +argumnet). The current working directory is inherited from the parent, and can
> 
> argument
> 
> > +be changed by use of the
> >  .BR chdir (2)
> > -system call.)
> > +syscall.
> 
> "system call" please.
> 
> >  .PP
> >  Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
> >  Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> > +
> 
> No blank line here.
> 
> >  .SS Step 2: walk along the path
> >  Set the current lookup directory to the starting lookup directory.
> >  Now, for each nonfinal component of the pathname, where a component
> > @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
> >  was reworked to eliminate the use of recursion,
> >  so that the only limit that remains is the maximum of 40
> >  resolutions for the entire pathname.
> > +.PP
> > +The resolution of syscalls during this stage can be blocked by using
> 
> "resolution of syscall" seems wrong? "syscall" should be something 
> else?

Yeah, should be "resolution of symlinks". ;)

> > +.BR openat2 (2),
> > +with the
> > +.B RESOLVE_NO_SYMLINKS
> > +flag set.
> > +
> >  .SS Step 3: find the final entry
> >  The lookup of the final component of the pathname goes just like
> >  that of all other components, as described in the previous step,
> > @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
> >  their conventional meanings, regardless of whether they are
> >  actually present in the physical filesystem.
> >  .PP
> > -One cannot walk down past the root: "/.." is the same as "/".
> > +One cannot walk up past the root: "/.." is the same as "/".
> > +
> 
> No blank line please.
> 
> >  .SS Mount points
> >  After a "mount dev path" command, the pathname "path" refers to
> >  the root of the filesystem hierarchy on the device "dev", and no
> > @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
> >  One can walk out of a mounted filesystem: "path/.." refers to
> >  the parent directory of "path",
> >  outside of the filesystem hierarchy on "dev".
> > +.PP
> > +Mount-point crossings can be blocked by using
> 
> Traversal of mount points can be disallowed by...
> 
> > +.BR openat2 (2),
> > +with the
> > +.B RESOLVE_NO_XDEV
> > +flag set (though note that this also restricts bind-mount crossings).
> > +
> 
> No blank line please.
> 
> >  .SS Trailing slashes
> >  If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
> >  component as in Step 2: it has to exist and resolve to a directory.
> > 

Thanks so much, and I'll clean up your nits.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation
From: Aleksa Sarai @ 2019-10-09 10:00 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, linux-api,
	linux-kernel
In-Reply-To: <c4485b10-692d-ed24-a1d9-a047bb1054bf@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5326 bytes --]

On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> Hello Aleksa,
> 
> You write "5.FOO" in these patches. When do you expect these changes to 
> land in the kernel?

Probably 5.6 (I'd hope for 5.5, but I don't know how the v14 review will
go). I'm not too sure though, and the magic-link changes (plus
O_EMPTYPATH) will probably land after openat2(2) since there is some
remaining work to do.

> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Some of the wording around empty paths in path_resolution(7) also needed
> > to be reworked since it's now legal (if you pass O_EMPTYPATH).
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  man2/open.2            | 42 +++++++++++++++++++++++++++++++++++++++++-
> >  man7/path_resolution.7 | 17 ++++++++++++++++-
> >  2 files changed, 57 insertions(+), 2 deletions(-)
> > 
> > diff --git a/man2/open.2 b/man2/open.2
> > index b0f485b41589..7217fe056e5e 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -48,7 +48,7 @@
> >  .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
> >  .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
> >  .\"
> > -.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
> > +.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> 
> No need to update the timestamp. I have scripts that handle this
> automatically.
> 
> >  .SH NAME
> >  open, openat, creat \- open and possibly create a file
> >  .SH SYNOPSIS
> > @@ -421,6 +421,21 @@ was followed by a call to
> >  .BR fdatasync (2)).
> >  .IR "See NOTES below" .
> >  .TP
> > +.BR O_EMPTYPATH " (since Linux 5.FOO)"
> > +If \fIpathname\fP is an empty string, re-open the the file descriptor given as
> 
> In general, I prefer the general form
> 
> .I pathname
> 
> over \fIpathname\fP. 
> 
> If you would be willing to cahnge that, it would  save me a little work.
> (And likewise throughout the rest of the patch.)
> 
> > +the \fIdirfd\fP argument to
> > +.BR openat (2).
> > +This can be used with both ordinary (file and directory) and \fBO_PATH\fP file
> > +descriptors, but cannot be used with
> > +.BR AT_FDCWD
> > +(or as an argument to plain
> > +.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same "link
> 
> There's a formatting problem here which can be fixed by inserting a 
> newline before "When".
> 
> > +mode" restrictions apply as with re-opening through
> > +.BR proc (5)
> > +(see
> > +.BR path_resolution "(7) and " symlink (7)
> > +for more details.)
> > +.TP
> >  .B O_EXCL
> >  Ensure that this call creates the file:
> >  if this flag is specified in conjunction with
> > @@ -668,6 +683,13 @@ with
> >  (or via procfs using
> >  .BR AT_SYMLINK_FOLLOW )
> >  even if the file is not a directory.
> > +You can even "re-open" (or upgrade) an
> > +.BR O_PATH
> > +file descriptor by using
> > +.BR O_EMPTYPATH
> > +(see the section for
> > +.BR O_EMPTYPATH
> > +for more details.)
> >  .IP *
> >  Passing the file descriptor to another process via a UNIX domain socket
> >  (see
> > @@ -958,6 +980,15 @@ is not allowed.
> >  (See also
> >  .BR path_resolution (7).)
> >  .TP
> > +.B EBADF
> > +.I pathname
> > +was an empty string (and
> > +.B O_EMPTYPATH
> > +was passed) with
> > +.BR open (2)
> > +(instead of
> > +.BR openat (2).)
> > +.TP
> >  .B EDQUOT
> >  Where
> >  .B O_CREAT
> > @@ -1203,6 +1234,15 @@ The following additional errors can occur for
> >  .I dirfd
> >  is not a valid file descriptor.
> >  .TP
> > +.B EBADF
> > +.I pathname
> > +was an empty string (and
> > +.B O_EMPTYPATH
> > +was passed), but the provided
> > +.I dirfd
> > +was an invalid file descriptor (or was
> > +.BR AT_FDCWD .)
> > +.TP
> >  .B ENOTDIR
> >  .I pathname
> >  is a relative pathname and
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 46f25ec4cdfa..85dd354e9a93 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -22,7 +22,7 @@
> >  .\" the source, must acknowledge the copyright and authors of this work.
> >  .\" %%%LICENSE_END
> >  .\"
> > -.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
> > +.TH PATH_RESOLUTION 7 2019-10-03 "Linux" "Linux Programmer's Manual"
> >  .SH NAME
> >  path_resolution \- how a pathname is resolved to a file
> >  .SH DESCRIPTION
> > @@ -198,6 +198,21 @@ successfully.
> >  Linux returns
> >  .B ENOENT
> >  in this case.
> > +.PP
> > +As of Linux 5.FOO, an empty path argument can be used to indicate the "re-open"
> > +an existing file descriptor if
> > +.B O_EMPTYPATH
> > +is passed as a flag argument to
> > +.BR openat (2),
> > +with the
> > +.I dfd
> > +argument indicating which file descriptor to "re-open". This is approximately
> > +equivalent to opening
> > +.I /proc/self/fd/$fd
> 
> .IR /proc/self/fd/$fd ,
> 
> > +where
> > +.I $fd
> > +is the open file descriptor to be "re-opened".
> > +
> 
> No blank line here.
> 
> >  .SS Permissions
> >  The permission bits of a file consist of three groups of three bits; see
> >  .BR chmod (1)

Will fix all of the above -- thanks!


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
From: Aleksa Sarai @ 2019-10-09  9:57 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, linux-api,
	linux-kernel
In-Reply-To: <2fd9e82d-2a9c-cda9-0c17-3a20034eca1d@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5588 bytes --]

On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Traditionally, magic-links have not been a well-understood topic in
> > Linux. Given the new changes in their semantics (related to the link
> > mode of trailing magic-links), it seems like a good opportunity to shine
> > more light on magic-links and their semantics.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> Thanks for doing this. Some comments below.

No problem -- just a heads-up that I'm going to split off the magic-link
changes from the openat2(2) series (there are quite a few things that
need to be done). So I will drop this man page for now.

> > ---
> >  man7/path_resolution.7 | 15 +++++++++++++++
> >  man7/symlink.7         | 39 ++++++++++++++++++++++++++++++---------
> >  2 files changed, 45 insertions(+), 9 deletions(-)
> > 
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 07664ed8faec..46f25ec4cdfa 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -136,6 +136,21 @@ we are just creating it.
> >  The details on the treatment
> >  of the final entry are described in the manual pages of the specific
> >  system calls.
> > +.PP
> > +Since Linux 5.FOO, if the final entry is a "magic-link" (see
> 
> "magic link". As Jann points out, this is more normal English usage.
> 
> > +.BR symlink (7)),
> > +and the user is attempting to
> > +.BR open (2)
> > +it, then there is an additional permission-related restriction applied to the
> > +operation: the requested access mode must not exceed the "link mode" of the
> > +magic-link (unlike ordinary symlinks, magic-links have their own file mode.)
> 
> Remove the hyphens (magic link). And also, as someone else pointed out,
> manual pages fairly consistently uses the term "symbolic link"
> (written in full).

Will do.

> You use the term "file mode" here. Do you mean the file permissions bits?

Yes.

> If yes, it is a bit misleading to suggest that symbolic links don't
> have these mode bits. They do, but--as noted in the existing symlink(7)
> manual page text--these bits are ignored. I suggest just removing the
> parenthesized text.

I was trying to say that their file mode can be non-0777 -- but I can
just drop the entire thing.

> > +For example, if
> > +.I /proc/[pid]/fd/[num]
> > +has a link mode of
> > +.BR 0500 ,
> > +unprivileged users are not permitted to
> > +.BR open ()
> > +the magic-link for writing.
> >  .SS . and ..
> >  By convention, every directory has the entries "." and "..",
> >  which refer to the directory itself and to its parent directory,
> > diff --git a/man7/symlink.7 b/man7/symlink.7
> > index 9f5bddd5dc21..33f0ec703acd 100644
> > --- a/man7/symlink.7
> > +++ b/man7/symlink.7
> > @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
> >  are outlined here.
> >  It is important that site-local applications also conform to these rules,
> >  so that the user interface can be as consistent as possible.
> > +.SS Magic-links
> > +There is a special class of symlink-like objects known as "magic-links" which
> 
> "magic links" (and through the rest of the page).
> 
> > +can be found in certain pseudo-filesystems such as
> 
> pseudofilesystems
> 
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Unlike normal symlinks, magic-links are not resolved through
> 
> symbolic links
> 
> > +pathname-expansion, but instead act as direct references to the kernel's own
> 
> pathname expansion

Will do all of the above.

> > +representation of a file handle. As such, these magic-links allow users to
> > +access files which cannot be referenced with normal paths (such as unlinked
> > +files still referenced by a running program.)
> > +.PP
> > +Because they can bypass ordinary
> > +.BR mount_namespaces (7)-based
> > +restrictions, magic-links have been used as attack vectors in various exploits.
> > +As such (since Linux 5.FOO), there are additional restrictions placed on the
> > +re-opening of magic-links (see
> > +.BR path_resolution (7)
> > +for more details.)
> >  .SS Symbolic link ownership, permissions, and timestamps
> >  The owner and group of an existing symbolic link can be changed
> >  using
> > @@ -99,16 +118,18 @@ of a symbolic link can be changed using
> >  or
> >  .BR lutimes (3).
> >  .PP
> > -On Linux, the permissions of a symbolic link are not used
> > -in any operations; the permissions are always
> > -0777 (read, write, and execute for all user categories),
> >  .\" Linux does not currently implement an lchmod(2).
> > -and can't be changed.
> > -(Note that there are some "magic" symbolic links in the
> > -.I /proc
> > -directory tree\(emfor example, the
> > -.IR /proc/[pid]/fd/*
> > -files\(emthat have different permissions.)
> > +On Linux, the permissions of an ordinary symbolic link are not used in any
> > +operations; the permissions are always 0777 (read, write, and execute for all
> > +user categories), and can't be changed.
> > +.PP
> > +However, magic-links do not follow this rule. They can have a non-0777 mode,
> > +which is used for permission checks when the final
> > +component of an
> > +.BR open (2)'s
> > +path is a magic-link (see
> > +.BR path_resolution (7).)
> > +
> >  .\"
> >  .\" The
> >  .\" 4.4BSD

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [MANPAGE PATCH] Add manpage for fsinfo(2)
From: Michael Kerrisk (man-pages) @ 2019-10-09  9:52 UTC (permalink / raw)
  To: David Howells
  Cc: mtk.manpages, viro, linux-api, linux-fsdevel, torvalds,
	linux-kernel, linux-man, Eric W. Biederman
In-Reply-To: <15519.1531263314@warthog.procyon.org.uk>

Hello David,

See my previous mails.

There is no fsinfo(2) in the system call in the kernel currently.
Will that call still be added, or was it replaced by fsconfig(2),
which--as far as I can tell--dnot have a man-pages patch?

Thanks,

Michael

On 7/11/18 12:55 AM, David Howells wrote:
> Add a manual page to document the fsinfo() system call.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  man2/fsinfo.2       | 1017 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  man2/ioctl_iflags.2 |    6 
>  man2/stat.2         |    7 
>  man2/statx.2        |   13 +
>  man2/utime.2        |    7 
>  man2/utimensat.2    |    7 
>  6 files changed, 1057 insertions(+)
>  create mode 100644 man2/fsinfo.2
> 
> diff --git a/man2/fsinfo.2 b/man2/fsinfo.2
> new file mode 100644
> index 000000000..5710232df
> --- /dev/null
> +++ b/man2/fsinfo.2
> @@ -0,0 +1,1017 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH FSINFO 2 2018-06-06 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +fsinfo \- Get filesystem information
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/fsinfo.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int fsinfo(int " dirfd ", const char *" pathname ","
> +.BI "           struct fsinfo_params *" params ","
> +.BI "           void *" buffer ", size_t " buf_size );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for
> +.BR fsinfo ();
> +see NOTES.
> +.SH DESCRIPTION
> +.PP
> +fsinfo() retrieves the desired filesystem attribute, as selected by the
> +parameters pointed to by
> +.IR params ,
> +and stores its value in the buffer pointed to by
> +.IR buffer .
> +.PP
> +The parameter structure is optional, defaulting to all the parameters being 0
> +if the pointer is NULL.  The structure looks like the following:
> +.PP
> +.in +4n
> +.nf
> +struct fsinfo_params {
> +    __u32 at_flags;     /* AT_SYMLINK_NOFOLLOW and similar flags */
> +    __u32 request;      /* Requested attribute */
> +    __u32 Nth;          /* Instance of attribute */
> +    __u32 Mth;          /* Subinstance of Nth instance */
> +    __u32 __reserved[6]; /* Reserved params; all must be 0 */
> +};
> +.fi
> +.in
> +.PP
> +The filesystem to be queried is looked up using a combination of
> +.IR dfd ", " pathname " and " params->at_flags.
> +This is discussed in more detail below.
> +.PP
> +The desired attribute is indicated by
> +.IR params->request .
> +If
> +.I params
> +is NULL, this will default to
> +.BR fsinfo_attr_statfs ,
> +which retrieves some of the information returned by
> +.BR statfs ().
> +The available attributes are described below in the "THE ATTRIBUTES" section.
> +.PP
> +Some attributes can have multiple values and some can even have multiple
> +instances with multiple values.  For example, a network filesystem might use
> +multiple servers.  The names of each of these servers can be retrieved by
> +using
> +.I params->Nth
> +to iterate through all the instances until error
> +.B ENODATA
> +occurs, indicating the end of the list.  Further, each server might have
> +multiple addresses available; these can be enumerated using
> +.I params->Nth
> +to iterate the servers and
> +.I params->Mth
> +to iterate the addresses of the Nth server.
> +.PP
> +The amount of data written into the buffer depends on the attribute selected.
> +Some attributes return variable-length strings and some return fixed-size
> +structures.  If either
> +.IR buffer " is  NULL  or " buf_size " is 0"
> +then the size of the attribute value will be returned and nothing will be
> +written into the buffer.
> +.PP
> +The
> +.I params->__reserved
> +parameters must all be 0.
> +.\"_______________________________________________________
> +.SS
> +Allowance for Future Attribute Expansion
> +.PP
> +To allow for the future expansion and addition of fields to any fixed-size
> +structure attribute,
> +.BR fsinfo ()
> +makes the following guarantees:
> +.RS 4m
> +.IP (1) 4m
> +It will always clear any excess space in the buffer.
> +.IP (2) 4m
> +It will always return the actual size of the data.
> +.IP (3) 4m
> +It will truncate the data to fit it into the buffer rather than giving an
> +error.
> +.IP (4) 4m
> +Any new version of a structure will incorporate all the fields from the old
> +version at same offsets.
> +.RE
> +.PP
> +So, for example, if the caller is running on an older version of the kernel
> +with an older, smaller version of the structure than was asked for, the kernel
> +will write the smaller version into the buffer and will clear the remainder of
> +the buffer to make sure any additional fields are set to 0.  The function will
> +return the actual size of the data.
> +.PP
> +On the other hand, if the caller is running on a newer version of the kernel
> +with a newer version of the structure that is larger than the buffer, the write
> +to the buffer will be truncated to fit as necessary and the actual size of the
> +data will be returned.
> +.PP
> +Note that this doesn't apply to variable-length string attributes.
> +
> +.\"_______________________________________________________
> +.SS
> +Invoking \fBfsinfo\fR():
> +.PP
> +To access a file's status, no permissions are required on the file itself, but
> +in the case of
> +.BR fsinfo ()
> +with a path, execute (search) permission is required on all of the directories
> +in
> +.I pathname
> +that lead to the file.
> +.PP
> +.BR fsinfo ()
> +uses
> +.IR pathname ", " dirfd " and " params->at_flags
> +to locate the target file in one of a variety of ways:
> +.TP
> +[*] By absolute path.
> +.I pathname
> +points to an absolute path and
> +.I dirfd
> +is ignored.  The file is looked up by name, starting from the root of the
> +filesystem as seen by the calling process.
> +.TP
> +[*] By cwd-relative path.
> +.I pathname
> +points to a relative path and
> +.IR dirfd " is " AT_FDCWD .
> +The file is looked up by name, starting from the current working directory.
> +.TP
> +[*] By dir-relative path.
> +.I pathname
> +points to relative path and
> +.I dirfd
> +indicates a file descriptor pointing to a directory.  The file is looked up by
> +name, starting from the directory specified by
> +.IR dirfd .
> +.TP
> +[*] By file descriptor.
> +.IR pathname " is " NULL " and " dirfd
> +indicates a file descriptor.  The file attached to the file descriptor is
> +queried directly.  The file descriptor may point to any type of file, not just
> +a directory.
> +.PP
> +.I flags
> +can be used to influence a path-based lookup.  A value for
> +.I flags
> +is constructed by OR'ing together zero or more of the following constants:
> +.TP
> +.BR AT_EMPTY_PATH
> +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> +If
> +.I pathname
> +is an empty string, operate on the file referred to by
> +.IR dirfd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag).
> +If
> +.I dirfd
> +is
> +.BR AT_FDCWD ,
> +the call operates on the current working directory.
> +In this case,
> +.I dirfd
> +can refer to any type of file, not just a directory.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.BR AT_NO_AUTOMOUNT
> +Don't automount the terminal ("basename") component of
> +.I pathname
> +if it is a directory that is an automount point.  This allows the caller to
> +gather attributes of the filesystem holding an automount point (rather than
> +the filesystem it would mount).  This flag can be used in tools that scan
> +directories to prevent mass-automounting of a directory of automount points.
> +The
> +.B AT_NO_AUTOMOUNT
> +flag has no effect if the mount point has already been mounted over.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B AT_SYMLINK_NOFOLLOW
> +If
> +.I pathname
> +is a symbolic link, do not dereference it:
> +instead return information about the link itself, like
> +.BR lstat ().
> +.SH THE ATTRIBUTES
> +.PP
> +There is a range of attributes that can be selected from.  These are:
> +
> +.\" __________________ fsinfo_attr_statfs __________________
> +.TP
> +.B fsinfo_attr_statfs
> +This retrieves the "dynamic"
> +.B statfs
> +information, such as block and file counts, that are expected to change whilst
> +a filesystem is being used.  This fills in the following structure:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_statfs {
> +    __u64 f_blocks;	/* Total number of blocks in fs */
> +    __u64 f_bfree;	/* Total number of free blocks */
> +    __u64 f_bavail;	/* Number of free blocks available to ordinary user */
> +    __u64 f_files;	/* Total number of file nodes in fs */
> +    __u64 f_ffree;	/* Number of free file nodes */
> +    __u64 f_favail;	/* Number of free file nodes available to ordinary user */
> +    __u32 f_bsize;	/* Optimal block size */
> +    __u32 f_frsize;	/* Fragment size */
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +The fields correspond to those of the same name returned by
> +.BR statfs ().
> +
> +.\" __________________ fsinfo_attr_fsinfo __________________
> +.TP
> +.B fsinfo_attr_fsinfo
> +This retrieves information about the
> +.BR fsinfo ()
> +system call itself.  This fills in the following structure:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_fsinfo {
> +    __u32 max_attr;
> +    __u32 max_cap;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +The
> +.I max_attr
> +value indicates the number of attributes supported by the
> +.BR fsinfo ()
> +system call, and
> +.I max_cap
> +indicates the number of capability bits supported by the
> +.B fsinfo_attr_capabilities
> +attribute.  The first corresponds to
> +.I fsinfo_attr__nr
> +and the second to
> +.I fsinfo_cap__nr
> +in the header file.
> +
> +.\" __________________ fsinfo_attr_ids __________________
> +.TP
> +.B fsinfo_attr_ids
> +This retrieves a number of fixed IDs and other static information otherwise
> +available through
> +.BR statfs ().
> +The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_ids {
> +    char  f_fs_name[15 + 1]; /* Filesystem name */
> +    __u64 f_flags;	/* Filesystem mount flags (MS_*) */
> +    __u64 f_fsid;	/* Short 64-bit Filesystem ID */
> +    __u64 f_sb_id;	/* Internal superblock ID */
> +    __u32 f_fstype;	/* Filesystem type from linux/magic.h */
> +    __u32 f_dev_major;	/* As st_dev_* from struct statx */
> +    __u32 f_dev_minor;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +Most of these are filled in as for
> +.BR statfs (),
> +with the addition of the filesystem's symbolic name in
> +.I f_fs_name
> +and an identifier for use in notifications in
> +.IR f_sb_id .
> +
> +.\" __________________ fsinfo_attr_limits __________________
> +.TP
> +.B fsinfo_attr_limits
> +This retrieves information about the limits of what a filesystem can support.
> +The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_limits {
> +    __u64 max_file_size;
> +    __u64 max_uid;
> +    __u64 max_gid;
> +    __u64 max_projid;
> +    __u32 max_dev_major;
> +    __u32 max_dev_minor;
> +    __u32 max_hard_links;
> +    __u32 max_xattr_body_len;
> +    __u16 max_xattr_name_len;
> +    __u16 max_filename_len;
> +    __u16 max_symlink_len;
> +    __u16 __reserved[1];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +These indicate the maximum supported sizes for a variety of filesystem objects,
> +including the file size, the extended attribute name length and body length,
> +the filename length and the symlink body length.
> +.IP
> +It also indicates the maximum representable values for a User ID, a Group ID,
> +a Project ID, a device major number and a device minor number.
> +.IP
> +And finally, it indicates the maximum number of hard links that can be made to
> +a file.
> +.IP
> +Note that some of these values may be zero if the underlying object or concept
> +is not supported by the filesystem or the medium.
> +
> +.\" __________________ fsinfo_attr_supports __________________
> +.TP
> +.B fsinfo_attr_supports
> +This retrieves information about what bits a filesystem supports in various
> +masks.  The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_supports {
> +    __u64 stx_attributes;
> +    __u32 stx_mask;
> +    __u32 ioc_flags;
> +    __u32 win_file_attrs;
> +    __u32 __reserved[1];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +The
> +.IR stx_attributes " and " stx_mask
> +fields indicate what bits in the struct statx fields of the matching names
> +are supported by the filesystem.
> +.IP
> +The
> +.I ioc_flags
> +field indicates what FS_*_FL flag bits as used through the FS_IOC_GET/SETFLAGS
> +ioctls are supported by the filesystem.
> +.IP
> +The
> +.I win_file_attrs
> +indicates what DOS/Windows file attributes a filesystem supports, if any.
> +
> +.\" __________________ fsinfo_attr_capabilities __________________
> +.TP
> +.B fsinfo_attr_capabilities
> +This retrieves information about what features a filesystem supports as a
> +series of single bit indicators.  The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_capabilities {
> +    __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +where the bit of interest can be found by:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +	p->capabilities[bit / 8] & (1 << (bit % 8)))
> +.fi
> +.in
> +.RE
> +.IP
> +The bits are listed by
> +.I enum fsinfo_capability
> +and
> +.B fsinfo_cap__nr
> +is one more than the last capability bit listed in the header file.
> +.IP
> +Note that the number of capability bits actually supported by the kernel can be
> +found using the
> +.B fsinfo_attr_fsinfo
> +attribute.
> +.IP
> +The capability bits and their meanings are listed below in the "THE
> +CAPABILITIES" section.
> +
> +.\" __________________ fsinfo_attr_timestamp_info __________________
> +.TP
> +.B fsinfo_attr_timestamp_info
> +This retrieves information about what timestamp resolution and scope is
> +supported by a filesystem for each of the file timestamps.  The following
> +structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_timestamp_info {
> +	__s64 minimum_timestamp;
> +	__s64 maximum_timestamp;
> +	__u16 atime_gran_mantissa;
> +	__u16 btime_gran_mantissa;
> +	__u16 ctime_gran_mantissa;
> +	__u16 mtime_gran_mantissa;
> +	__s8  atime_gran_exponent;
> +	__s8  btime_gran_exponent;
> +	__s8  ctime_gran_exponent;
> +	__s8  mtime_gran_exponent;
> +	__u32 __reserved[1];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +where
> +.IR minimum_timestamp " and " maximum_timestamp
> +are the limits on the timestamps that the filesystem supports and
> +.IR *time_gran_mantissa " and " *time_gran_exponent
> +indicate the granularity of each timestamp in terms of seconds, using the
> +formula:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +mantissa * pow(10, exponent) Seconds
> +.fi
> +.in
> +.RE
> +.IP
> +where exponent may be negative and the result may be a fraction of a second.
> +.IP
> +Four timestamps are detailed: \fBA\fPccess time, \fBB\fPirth/creation time,
> +\fBC\fPhange time and \fBM\fPodification time.  Capability bits are defined
> +that specify whether each of these exist in the filesystem or not.
> +.IP
> +Note that the timestamp description may be approximated or inaccurate if the
> +file is actually remote or is the union of multiple objects.
> +
> +.\" __________________ fsinfo_attr_volume_id __________________
> +.TP
> +.B fsinfo_attr_volume_id
> +This retrieves the system's superblock volume identifier as a variable-length
> +string.  This does not necessarily represent a value stored in the medium but
> +might be constructed on the fly.
> +.IP
> +For instance, for a block device this is the block device identifier
> +(eg. "sdb2"); for AFS this would be the numeric volume identifier.
> +
> +.\" __________________ fsinfo_attr_volume_uuid __________________
> +.TP
> +.B fsinfo_attr_volume_uuid
> +This retrieves the volume UUID, if there is one, as a little-endian binary
> +UUID.  This fills in the following structure:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_volume_uuid {
> +    __u8 uuid[16];
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +
> +.\" __________________ fsinfo_attr_volume_name __________________
> +.TP
> +.B fsinfo_attr_volume_name
> +This retrieves the filesystem's volume name as a variable-length string.  This
> +is expected to represent a name stored in the medium.
> +.IP
> +For a block device, this might be a label stored in the superblock.  For a
> +network filesystem, this might be a logical volume name of some sort.
> +
> +.\" __________________ fsinfo_attr_cell/domain __________________
> +.PP
> +.B fsinfo_attr_cell_name
> +.br
> +.B fsinfo_attr_domain_name
> +.br
> +.IP
> +These two attributes are variable-length string attributes that may be used to
> +obtain information about network filesystems.  An AFS volume, for instance,
> +belongs to a named cell.  CIFS shares may belong to a domain.
> +
> +.\" __________________ fsinfo_attr_realm_name __________________
> +.TP
> +.B fsinfo_attr_realm_name
> +This attribute is variable-length string that indicates the Kerberos realm that
> +a filesystem's authentication tokens should come from.
> +
> +.\" __________________ fsinfo_attr_server_name __________________
> +.TP
> +.B fsinfo_attr_server_name
> +This attribute is a multiple-value attribute that lists the names of the
> +servers that are backing a network filesystem.  Each value is a variable-length
> +string.  The values are enumerated by calling
> +.BR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +each time until an ENODATA error occurs, thereby indicating the end of the
> +list.
> +
> +.\" __________________ fsinfo_attr_server_address __________________
> +.TP
> +.B fsinfo_attr_server_address
> +This attribute is a multiple-instance, multiple-value attribute that lists the
> +addresses of the servers that are backing a network filesystem.  Each value is
> +a structure of the following type:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_server_address {
> +    struct __kernel_sockaddr_storage address;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +Where the address may be AF_INET, AF_INET6, AF_RXRPC or any other type as
> +appropriate to the filesystem.
> +.IP
> +The values are enumerated by calling
> +.IR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +to step through the servers and
> +.I params->Mth
> +to step through the addresses of the Nth server each time until ENODATA errors
> +occur, thereby indicating either the end of a server's address list or the end
> +of the server list.
> +.IP
> +Barring the server list changing whilst being accessed, it is expected that the
> +.I params->Nth
> +will correspond to
> +.I params->Nth
> +for
> +.BR fsinfo_attr_server_name .
> +
> +.\" __________________ fsinfo_attr_parameter __________________
> +.TP
> +.B fsinfo_attr_parameter
> +This attribute is a multiple-value attribute that lists the values of the mount
> +parameters for a filesystem as variable-length strings.
> +.IP
> +The parameters are enumerated by calling
> +.BR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +to step through them until error ENODATA is given.
> +.IP
> +Parameter strings are presented in a form akin to the way they're passed to the
> +context created by the
> +.BR fsopen ()
> +system call.  For example, straight text parameters will be rendered as
> +something like:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +"o data=journal"
> +"o noquota"
> +.fi
> +.in
> +.RE
> +.IP
> +Where the initial "word" indicates the option form.
> +
> +.\" __________________ fsinfo_attr_source __________________
> +.TP
> +.B fsinfo_attr_source
> +This attribute is a multiple-value attribute that lists the mount sources for a
> +filesystem as variable-length strings.  Normally only one source will be
> +available, but the possibility of having more than one is allowed for.
> +.IP
> +The sources are enumerated by calling
> +.BR fsinfo ()
> +multiple times, incrementing
> +.I params->Nth
> +to step through them until error ENODATA is given.
> +.IP
> +Source strings are presented in a form akin to the way they're passed to the
> +context created by the
> +.BR fsopen ()
> +system call.  For example, they will be rendered as something like:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +"s /dev/sda1"
> +"s example.com/pub/linux/"
> +.fi
> +.in
> +.RE
> +.IP
> +Where the initial "word" indicates the option form.
> +
> +.\" __________________ fsinfo_attr_name_encoding __________________
> +.TP
> +.B fsinfo_attr_name_encoding
> +This attribute is variable-length string that indicates the filename encoding
> +used by the filesystem.  The default is "utf8".  Note that this may indicate a
> +non-8-bit encoding if that's what the underlying filesystem actually supports.
> +
> +.\" __________________ fsinfo_attr_name_codepage __________________
> +.TP
> +.B fsinfo_attr_name_codepage
> +This attribute is variable-length string that indicates the codepage used to
> +translate filenames from the filesystem to the system if this is applicable to
> +the filesystem.
> +
> +.\" __________________ fsinfo_attr_io_size __________________
> +.TP
> +.B fsinfo_attr_io_size
> +This retrieves information about the I/O sizes supported by the filesystem.
> +The following structure is filled in:
> +.PP
> +.RS
> +.in +4n
> +.nf
> +struct fsinfo_io_size {
> +    __u32 block_size;
> +    __u32 max_single_read_size;
> +    __u32 max_single_write_size;
> +    __u32 best_read_size;
> +    __u32 best_write_size;
> +};
> +.fi
> +.in
> +.RE
> +.IP
> +Where
> +.I block_size
> +indicates the fundamental I/O block size of the filesystem as something
> +O_DIRECT read/write sizes must be a multiple of;
> +.IR max_single_write_size " and " max_single_write_size
> +indicate the maximum sizes for individual unbuffered data transfer operations;
> +and
> +.IR best_read_size " and " best_write_size
> +indicate the recommended I/O sizes.
> +.IP
> +Note that any of these may be zero if inapplicable or indeterminable.
> +
> +
> +
> +.SH THE CAPABILITIES
> +.PP
> +There are number of capability bits in a bit array that can be retrieved using
> +.BR fsinfo_attr_capabilities .
> +These give information about features of the filesystem driver and the specific
> +filesystem.
> +
> +.\" __________________ fsinfo_cap_is_*_fs __________________
> +.PP
> +.B fsinfo_cap_is_kernel_fs
> +.br
> +.B fsinfo_cap_is_block_fs
> +.br
> +.B fsinfo_cap_is_flash_fs
> +.br
> +.B fsinfo_cap_is_network_fs
> +.br
> +.B fsinfo_cap_is_automounter_fs
> +.IP
> +These indicate the primary type of the filesystem.
> +.B kernel
> +filesystems are special communication interfaces that substitute files for
> +system calls; examples include procfs and sysfs.
> +.B block
> +filesystems require a block device on which to operate; examples include ext4
> +and XFS.
> +.B flash
> +filesystems require an MTD device on which to operate; examples include JFFS2.
> +.B network
> +filesystems require access to the network and contact one or more servers;
> +examples include NFS and AFS.
> +.B automounter
> +filesystems are kernel special filesystems that host automount points and
> +triggers to dynamically create automount points.  Examples include autofs and
> +AFS's dynamic root.
> +
> +.\" __________________ fsinfo_cap_automounts __________________
> +.TP
> +.B fsinfo_cap_automounts
> +The filesystem may have automount points that can be triggered by pathwalk.
> +
> +.\" __________________ fsinfo_cap_adv_locks __________________
> +.TP
> +.B fsinfo_cap_adv_locks
> +The filesystem supports advisory file locks.  For a network filesystem, this
> +indicates that the advisory file locks are cross-client (and also between
> +server and its local filesystem on something like NFS).
> +
> +.\" __________________ fsinfo_cap_mand_locks __________________
> +.TP
> +.B fsinfo_cap_mand_locks
> +The filesystem supports mandatory file locks.  For a network filesystem, this
> +indicates that the mandatory file locks are cross-client (and also between
> +server and its local filesystem on something like NFS).
> +
> +.\" __________________ fsinfo_cap_leases __________________
> +.TP
> +.B fsinfo_cap_leases
> +The filesystem supports leases.  For a network filesystem, this means that the
> +server will tell the client to clean up its state on a file before passing the
> +lease to another client.
> +
> +.\" __________________ fsinfo_cap_*ids __________________
> +.PP
> +.B fsinfo_cap_uids
> +.br
> +.B fsinfo_cap_gids
> +.br
> +.B fsinfo_cap_projids
> +.IP
> +These indicate that the filesystem supports numeric user IDs, group IDs and
> +project IDs respectively.
> +
> +.\" __________________ fsinfo_cap_id_* __________________
> +.PP
> +.B fsinfo_cap_id_names
> +.br
> +.B fsinfo_cap_id_guids
> +.IP
> +These indicate that the filesystem employs textual names and/or GUIDs as
> +identifiers.
> +
> +.\" __________________ fsinfo_cap_windows_attrs __________________
> +.TP
> +.B fsinfo_cap_windows_attrs
> +Indicates that the filesystem supports some Windows FILE_* attributes.
> +
> +.\" __________________ fsinfo_cap_*_quotas __________________
> +.PP
> +.B fsinfo_cap_user_quotas
> +.br
> +.B fsinfo_cap_group_quotas
> +.br
> +.B fsinfo_cap_project_quotas
> +.IP
> +These indicate that the filesystem supports quotas for users, groups and
> +projects respectively.
> +
> +.\" __________________ fsinfo_cap_xattrs/filetypes __________________
> +.PP
> +.B fsinfo_cap_xattrs
> +.br
> +.B fsinfo_cap_symlinks
> +.br
> +.B fsinfo_cap_hard_links
> +.br
> +.B fsinfo_cap_hard_links_1dir
> +.br
> +.B fsinfo_cap_device_files
> +.br
> +.B fsinfo_cap_unix_specials
> +.IP
> +These indicate that the filesystem supports respectively extended attributes;
> +symbolic links; hard links spanning direcories; hard links, but only within a
> +directory; block and character device files; and UNIX special files, such as
> +FIFO and socket.
> +
> +.\" __________________ fsinfo_cap_*journal* __________________
> +.PP
> +.B fsinfo_cap_journal
> +.br
> +.B fsinfo_cap_data_is_journalled
> +.IP
> +The first of these indicates that the filesystem has a journal and the second
> +that the file data changes are being journalled.
> +
> +.\" __________________ fsinfo_cap_o_* __________________
> +.PP
> +.B fsinfo_cap_o_sync
> +.br
> +.B fsinfo_cap_o_direct
> +.IP
> +These indicate that O_SYNC and O_DIRECT are supported respectively.
> +
> +.\" __________________ fsinfo_cap_o_* __________________
> +.PP
> +.B fsinfo_cap_volume_id
> +.br
> +.B fsinfo_cap_volume_uuid
> +.br
> +.B fsinfo_cap_volume_name
> +.br
> +.B fsinfo_cap_volume_fsid
> +.br
> +.B fsinfo_cap_cell_name
> +.br
> +.B fsinfo_cap_domain_name
> +.br
> +.B fsinfo_cap_realm_name
> +.IP
> +These indicate if various attributes are supported by the filesystem, where
> +.B fsinfo_cap_X
> +here corresponds to
> +.BR fsinfo_attr_X .
> +
> +.\" __________________ fsinfo_cap_iver_* __________________
> +.PP
> +.B fsinfo_cap_iver_all_change
> +.br
> +.B fsinfo_cap_iver_data_change
> +.br
> +.B fsinfo_cap_iver_mono_incr
> +.IP
> +These indicate if
> +.I i_version
> +on an inode in the filesystem is supported and
> +how it behaves.
> +.B all_change
> +indicates that i_version is incremented on metadata changes as well as data
> +changes.
> +.B data_change
> +indicates that i_version is only incremented on data changes, including
> +truncation.
> +.B mono_incr
> +indicates that i_version is incremented by exactly 1 for each change made.
> +
> +.\" __________________ fsinfo_cap_resource_forks __________________
> +.TP
> +.B fsinfo_cap_resource_forks
> +This indicates that the filesystem supports some sort of resource fork or
> +alternate data stream on a file.  This isn't the same as an extended attribute.
> +
> +.\" __________________ fsinfo_cap_name_* __________________
> +.PP
> +.B fsinfo_cap_name_case_indep
> +.br
> +.B fsinfo_cap_name_non_utf8
> +.br
> +.B fsinfo_cap_name_has_codepage
> +.IP
> +These indicate certain facts about the filenames in a filesystem: whether
> +they're case-independent; if they're not UTF-8; and if there's a codepage
> +employed to map the names.
> +
> +.\" __________________ fsinfo_cap_sparse __________________
> +.TP
> +.B fsinfo_cap_sparse
> +This indicates that the filesystem supports sparse files.
> +
> +.\" __________________ fsinfo_cap_not_persistent __________________
> +.TP
> +.B fsinfo_cap_not_persistent
> +This indicates that the filesystem is not persistent, and that any data stored
> +here will not be saved in the event that the filesystem is unmounted, the
> +machine is rebooted or the machine loses power.
> +
> +.\" __________________ fsinfo_cap_no_unix_mode __________________
> +.TP
> +.B fsinfo_cap_no_unix_mode
> +This indicates that the filesystem doesn't support the UNIX mode permissions
> +bits.
> +
> +.\" __________________ fsinfo_cap_has_*time __________________
> +.PP
> +.B fsinfo_cap_has_atime
> +.br
> +.B fsinfo_cap_has_btime
> +.br
> +.B fsinfo_cap_has_ctime
> +.br
> +.B fsinfo_cap_has_mtime
> +.IP
> +These indicate as to what timestamps a filesystem supports, including: Access
> +time, Birth/creation time, Change time (metadata and data) and Modification
> +time (data only).
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, the size of the value that the kernel has available is returned,
> +irrespective of whether the buffer is large enough to hold that.  The data
> +written to the buffer will be truncated if it is not.  On error, \-1 is
> +returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied for one of the directories
> +in the path prefix of
> +.IR pathname .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +.I dirfd
> +is not a valid open file descriptor.
> +.TP
> +.B EFAULT
> +.I pathname
> +is NULL or
> +.IR pathname ", " params " or " buffer
> +point to a location outside the process's accessible address space.
> +.TP
> +.B EINVAL
> +Reserved flag specified in
> +.IR params->at_flags " or one of " params->__reserved[]
> +is not 0.
> +.TP
> +.B EOPNOTSUPP
> +Unsupported attribute requested in
> +.IR params->request .
> +This may be beyond the limit of the supported attribute set or may just not be
> +one that's supported by the filesystem.
> +.TP
> +.B ENODATA
> +Unavailable attribute value requested by
> +.IR params->Nth " and/or " params->Mth .
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered while traversing the pathname.
> +.TP
> +.B ENAMETOOLONG
> +.I pathname
> +is too long.
> +.TP
> +.B ENOENT
> +A component of
> +.I pathname
> +does not exist, or
> +.I pathname
> +is an empty string and
> +.B AT_EMPTY_PATH
> +was not specified in
> +.IR params->at_flags .
> +.TP
> +.B ENOMEM
> +Out of memory (i.e., kernel memory).
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of
> +.I pathname
> +is not a directory or
> +.I pathname
> +is relative and
> +.I dirfd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR fsinfo ()
> +was added to Linux in kernel 4.18.
> +.SH CONFORMING TO
> +.BR fsinfo ()
> +is Linux-specific.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR fsinfo ()
> +system call; call it using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR ioctl_iflags (2),
> +.BR statx (2),
> +.BR statfs (2)
> diff --git a/man2/ioctl_iflags.2 b/man2/ioctl_iflags.2
> index 9c77b08b9..49ba4444e 100644
> --- a/man2/ioctl_iflags.2
> +++ b/man2/ioctl_iflags.2
> @@ -200,9 +200,15 @@ the effective user ID of the caller must match the owner of the file,
>  or the caller must have the
>  .BR CAP_FOWNER
>  capability.
> +.PP
> +The set of flags supported by a filesystem can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_supports .
>  .SH SEE ALSO
>  .BR chattr (1),
>  .BR lsattr (1),
> +.BR fsinfo (2),
>  .BR mount (2),
>  .BR btrfs (5),
>  .BR ext4 (5),
> diff --git a/man2/stat.2 b/man2/stat.2
> index dad9a01ac..ee4001f85 100644
> --- a/man2/stat.2
> +++ b/man2/stat.2
> @@ -532,6 +532,12 @@ If none of the aforementioned macros are defined,
>  then the nanosecond values are exposed with names of the form
>  .IR st_atimensec .
>  .\"
> +.PP
> +Which timestamps are supported by a filesystem and their the ranges and
> +granularities can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SS C library/kernel differences
>  Over time, increases in the size of the
>  .I stat
> @@ -707,6 +713,7 @@ main(int argc, char *argv[])
>  .BR access (2),
>  .BR chmod (2),
>  .BR chown (2),
> +.BR fsinfo (2),
>  .BR readlink (2),
>  .BR utime (2),
>  .BR capabilities (7),
> diff --git a/man2/statx.2 b/man2/statx.2
> index edac9f6f4..9a57c1b90 100644
> --- a/man2/statx.2
> +++ b/man2/statx.2
> @@ -534,12 +534,25 @@ Glibc does not (yet) provide a wrapper for the
>  .BR statx ()
>  system call; call it using
>  .BR syscall (2).
> +.PP
> +The sets of mask/stx_mask and stx_attributes bits supported by a filesystem
> +can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_supports .
> +.PP
> +Which timestamps are supported by a filesystem and their the ranges and
> +granularities can also be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SH SEE ALSO
>  .BR ls (1),
>  .BR stat (1),
>  .BR access (2),
>  .BR chmod (2),
>  .BR chown (2),
> +.BR fsinfo (2),
>  .BR readlink (2),
>  .BR stat (2),
>  .BR utime (2),
> diff --git a/man2/utime.2 b/man2/utime.2
> index 03a43a416..c6acdbac2 100644
> --- a/man2/utime.2
> +++ b/man2/utime.2
> @@ -181,9 +181,16 @@ on an append-only file.
>  .\" is just a wrapper for
>  .\" .BR utime ()
>  .\" and hence does not allow a subsecond resolution.
> +.PP
> +Which timestamps are supported by a filesystem and their the ranges and
> +granularities can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SH SEE ALSO
>  .BR chattr (1),
>  .BR touch (1),
> +.BR fsinfo (2),
>  .BR futimesat (2),
>  .BR stat (2),
>  .BR utimensat (2),
> diff --git a/man2/utimensat.2 b/man2/utimensat.2
> index d61b43e96..be8925548 100644
> --- a/man2/utimensat.2
> +++ b/man2/utimensat.2
> @@ -633,9 +633,16 @@ instead checks whether the
>  .\" conversely, a process with a read-only file descriptor won't
>  .\" be able to update the timestamps of a file,
>  .\" even if it has write permission on the file.
> +.PP
> +Which timestamps are supported by a filesystem and their the ranges and
> +granularities can be determined by calling
> +.IR fsinfo (2)
> +with attribute
> +.IR fsinfo_attr_timestamp_info .
>  .SH SEE ALSO
>  .BR chattr (1),
>  .BR touch (1),
> +.BR fsinfo (2),
>  .BR futimesat (2),
>  .BR openat (2),
>  .BR stat (2),
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [MANPAGE PATCH] Add manpage for fsopen(2), fspick(2) and fsmount(2)
From: Michael Kerrisk (man-pages) @ 2019-10-09  9:52 UTC (permalink / raw)
  To: David Howells
  Cc: mtk.manpages, viro, linux-api, linux-fsdevel, torvalds,
	linux-kernel, linux-man, Eric W. Biederman
In-Reply-To: <15488.1531263249@warthog.procyon.org.uk>

Hello David,

See my previous mail.

With respect to the patch below, would you be willing to review
the content of this man-pages patch to see if it accurately reflects 
what was merged into the kernel, and then resubmit please?

Thanks,

Michael

On 7/11/18 12:54 AM, David Howells wrote:
> Add a manual page to document the fsopen(), fspick() and fsmount() system
> calls.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  man2/fsmount.2 |    1 
>  man2/fsopen.2  |  357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  man2/fspick.2  |    1 
>  3 files changed, 359 insertions(+)
>  create mode 100644 man2/fsmount.2
>  create mode 100644 man2/fsopen.2
>  create mode 100644 man2/fspick.2
> 
> diff --git a/man2/fsmount.2 b/man2/fsmount.2
> new file mode 100644
> index 000000000..2bf59fc3e
> --- /dev/null
> +++ b/man2/fsmount.2
> @@ -0,0 +1 @@
> +.so man2/fsopen.2
> diff --git a/man2/fsopen.2 b/man2/fsopen.2
> new file mode 100644
> index 000000000..1bc761ab4
> --- /dev/null
> +++ b/man2/fsopen.2
> @@ -0,0 +1,357 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH FSOPEN 2 2018-06-07 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +fsopen, fsmount, fspick \- Handle filesystem (re-)configuration and mounting
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/mount.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int fsopen(const char *" fsname ", unsigned int " flags );
> +.PP
> +.BI "int fsmount(int " fd ", unsigned int " flags ", unsigned int " ms_flags );
> +.PP
> +.BI "int fspick(int " dirfd ", const char *" pathname ", unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There are no glibc wrappers for these system calls.
> +.SH DESCRIPTION
> +.PP
> +.BR fsopen ()
> +creates a new filesystem configuration context within the kernel for the
> +filesystem named in the
> +.I fsname
> +parameter and attaches it to a file descriptor, which it then returns.  The
> +file descriptor can be marked close-on-exec by setting
> +.B FSOPEN_CLOEXEC
> +in flags.
> +.PP
> +The
> +file descriptor can then be used to configure the desired filesystem parameters
> +and security parameters by using
> +.BR write (2)
> +to pass parameters to it and then writing a command to actually create the
> +filesystem representation.
> +.PP
> +The file descriptor also serves as a channel by which more comprehensive error,
> +warning and information messages may be retrieved from the kernel using
> +.BR read (2).
> +.PP
> +Once the kernel's filesystem representation has been created, it can be queried
> +by calling
> +.BR fsinfo (2)
> +on the file descriptor.  fsinfo() will spot that the target is actually a
> +creation context and look inside that.
> +.PP
> +.BR fsmount ()
> +can then be called to create a mount object that refers to the newly created
> +filesystem representation, with the propagation and mount restrictions to be
> +applied specified in
> +.IR ms_flags .
> +The mount object is then attached to a new file descriptor that looks like one
> +created by
> +.BR open "(2) with " O_PATH " or " open_tree (2).
> +This can be passed to
> +.BR move_mount (2)
> +to attach the mount object to a mountpoint, thereby completing the process.
> +.PP
> +The file descriptor returned by fsmount() is marked close-on-exec if
> +FSMOUNT_CLOEXEC is specified in
> +.IR flags .
> +.PP
> +After fsmount() has completed, the context created by fsopen() is reset and
> +moved to reconfiguration state, allowing the new superblock to be reconfigured.
> +.PP
> +.BR fspick ()
> +creates a new filesystem context within the kernel, attaches the superblock
> +specified by
> +.IR dfd ", " pathname ", " flags
> +and puts it into the reconfiguration state and attached the context to a new
> +file descriptor that can then be parameterised with
> +.BR write (2)
> +exactly the same as for the context created by fsopen() above.
> +.PP
> +.I flags
> +is an OR'd together mask of
> +.B FSPICK_CLOEXEC
> +which indicates that the returned file descriptor should be marked
> +close-on-exec and
> +.BR FSPICK_SYMLINK_NOFOLLOW ", " FSPICK_NO_AUTOMOUNT " and " FSPICK_EMPTY_PATH
> +which control the pathwalk to the target object (see below).
> +
> +.\"________________________________________________________
> +.SS Writable Command Interface
> +Superblock (re-)configuration is achieved by writing command strings to the
> +context file descriptor using
> +.BR write (2).
> +Each string is prefixed with a specifier indicating the class of command
> +being specified.  The available commands include:
> +.TP
> +\fB"o <option>"\fP
> +Specify a filesystem or security parameter.
> +.I <option>
> +is typically a key or key=val format string.  Since the length of the option is
> +given to write(), the option may include any sort of character, including
> +spaces and commas or even binary data.
> +.TP
> +\fB"s <name>"\fP
> +Specify a device file, network server or other other source specification.
> +This may be optional, depending on the filesystem, and it may be possible to
> +provide multiple of them to a filesystem.
> +.TP
> +\fB"x create"\fP
> +End the filesystem configuration phase and try and create a representation in
> +the kernel with the parameters specified.  After this, the context is shifted
> +to the mount-pending state waiting for an fsmount() call to occur.
> +.TP
> +\fB"x reconfigure"\fP
> +End a filesystem reconfiguration phase try to apply the parameters to the
> +filesystem representation.  After this, the context gets reset and put back to
> +the start of the reconfiguration phase again.
> +.PP
> +With this interface, option strings are not limited to 4096 bytes, either
> +individually or in sum, and they are also not restricted to text-only options.
> +Further, errors may be given individually for each option and not aggregated or
> +dumped into the kernel log.
> +
> +.\"________________________________________________________
> +.SS Message Retrieval Interface
> +The context file descriptor may be queried for message strings at any time by
> +calling
> +.BR read (2)
> +on the file descriptor.  This will return formatted messages that are prefixed
> +to indicate their class:
> +.TP
> +\fB"e <message>"\fP
> +An error message string was logged.
> +.TP
> +\fB"i <message>"\fP
> +An informational message string was logged.
> +.TP
> +\fB"w <message>"\fP
> +An warning message string was logged.
> +.PP
> +Messages are removed from the queue as they're read.
> +
> +.\"________________________________________________________
> +.SH EXAMPLES
> +To illustrate the process, here's an example whereby this can be used to mount
> +an ext4 filesystem on /dev/sdb1 onto /mnt.  Note that the example ignores the
> +fact that
> +.BR write (2)
> +has a length parameter and that errors might occur.
> +.PP
> +.in +4n
> +.nf
> +sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> +write(sfd, "s /dev/sdb1");
> +write(sfd, "o noatime");
> +write(sfd, "o acl");
> +write(sfd, "o user_attr");
> +write(sfd, "o iversion");
> +write(sfd, "x create");
> +fsinfo(sfd, NULL, ...);
> +mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
> +move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.in
> +.PP
> +Here, an ext4 context is created first and attached to sfd.  This is then told
> +where its source will be, given a bunch of options and created.
> +.BR fsinfo (2)
> +can then be used to query the filesystem.  Then fsmount() is called to create a
> +mount object and
> +.BR move_mount (2)
> +is called to attach it to its intended mountpoint.
> +.PP
> +And here's an example of mounting from an NFS server:
> +.PP
> +.in +4n
> +.nf
> +sfd = fsopen("nfs", 0);
> +write(sfd, "s example.com/pub/linux");
> +write(sfd, "o nfsvers=3");
> +write(sfd, "o rsize=65536");
> +write(sfd, "o wsize=65536");
> +write(sfd, "o rdma");
> +write(sfd, "x create");
> +mfd = fsmount(sfd, 0, MS_NODEV);
> +move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.in
> +.PP
> +Reconfiguration can be achieved by:
> +.PP
> +.in +4n
> +.nf
> +sfd = fspick(AT_FDCWD, "/mnt", FSPICK_NO_AUTOMOUNT | FSPICK_CLOEXEC);
> +write(sfd, "o ro");
> +write(sfd, "x reconfigure");
> +.fi
> +.in
> +.PP
> +or:
> +.PP
> +.in +4n
> +.nf
> +sfd = fsopen(...);
> +...
> +mfd = fsmount(sfd, ...);
> +...
> +write(sfd, "o ro");
> +write(sfd, "x reconfigure");
> +.fi
> +.in
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, all three functions return a file descriptor.  On error, \-1 is
> +returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +The error values given below result from filesystem type independent
> +errors.
> +Each filesystem type may have its own special errors and its
> +own special behavior.
> +See the Linux kernel source code for details.
> +.TP
> +.B EACCES
> +A component of a path was not searchable.
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EACCES
> +Mounting a read-only filesystem was attempted without giving the
> +.B MS_RDONLY
> +flag.
> +.TP
> +.B EACCES
> +The block device
> +.I source
> +is located on a filesystem mounted with the
> +.B MS_NODEV
> +option.
> +.\" mtk: Probably: write permission is required for MS_BIND, with
> +.\" the error EPERM if not present; CAP_DAC_OVERRIDE is required.
> +.TP
> +.B EBUSY
> +.I source
> +cannot be reconfigured read-only, because it still holds files open for
> +writing.
> +.TP
> +.B EFAULT
> +One of the pointer arguments points outside the user address space.
> +.TP
> +.B EINVAL
> +.I source
> +had an invalid superblock.
> +.TP
> +.B EINVAL
> +.I ms_flags
> +includes more than one of
> +.BR MS_SHARED ,
> +.BR MS_PRIVATE ,
> +.BR MS_SLAVE ,
> +or
> +.BR MS_UNBINDABLE .
> +.TP
> +.BR EINVAL
> +An attempt was made to bind mount an unbindable mount.
> +.TP
> +.B ELOOP
> +Too many links encountered during pathname resolution.
> +.TP
> +.B EMFILE
> +The system has too many open files to create more.
> +.TP
> +.B ENFILE
> +The process has too many open files to create more.
> +.TP
> +.B ENAMETOOLONG
> +A pathname was longer than
> +.BR MAXPATHLEN .
> +.TP
> +.B ENODEV
> +Filesystem
> +.I fsname
> +not configured in the kernel.
> +.TP
> +.B ENOENT
> +A pathname was empty or had a nonexistent component.
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the call.
> +.TP
> +.B ENOTBLK
> +.I source
> +is not a block device (and a device was required).
> +.TP
> +.B ENOTDIR
> +.IR pathname ,
> +or a prefix of
> +.IR source ,
> +is not a directory.
> +.TP
> +.B ENXIO
> +The major number of the block device
> +.I source
> +is out of range.
> +.TP
> +.B EPERM
> +The caller does not have the required privileges.
> +.SH CONFORMING TO
> +These functions are Linux-specific and should not be used in programs intended
> +to be portable.
> +.SH VERSIONS
> +.BR fsopen "(), " fsmount "() and " fspick ()
> +were added to Linux in kernel 4.18.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR fsopen "() , " fsmount "() or " fspick "()"
> +system calls; call them using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR mountpoint (1),
> +.BR move_mount (2),
> +.BR open_tree (2),
> +.BR umount (2),
> +.BR mount_namespaces (7),
> +.BR path_resolution (7),
> +.BR findmnt (8),
> +.BR lsblk (8),
> +.BR mount (8),
> +.BR umount (8)
> diff --git a/man2/fspick.2 b/man2/fspick.2
> new file mode 100644
> index 000000000..2bf59fc3e
> --- /dev/null
> +++ b/man2/fspick.2
> @@ -0,0 +1 @@
> +.so man2/fsopen.2
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [MANPAGE PATCH] Add manpages for move_mount(2) and open_tree(2)
From: Michael Kerrisk (man-pages) @ 2019-10-09  9:51 UTC (permalink / raw)
  To: David Howells
  Cc: mtk.manpages, viro, linux-api, linux-fsdevel, torvalds,
	linux-kernel, linux-man, Eric W. Biederman
In-Reply-To: <15449.1531263162@warthog.procyon.org.uk>

Hello David,

Your wrote a series of manual pages patches (of which the mail below is one)
for the new mount API about a year before the code patches were actually
released in the kernel.

I'd like to check that these man-pages patches are up to date before
merging them. I think they may not be, since there is one patch for
fsinfo(2) which does not exist in the kernel, and no manual page for
fsconfig(2). I imagine that details may also have changed
in the system calls that were ultimately merged.

Could you write a manual page for fsconfig(2) please?

With respect to the patch below, would you be willing to:
* split it into two pieces, one for each page.
* review the content to see if it accurately reflects what was
  merged into the kernel and then resubmit please?

Thanks,

Michael

On 7/11/18 12:52 AM, David Howells wrote:
> Add manual pages to document the move_mount and open_tree() system calls.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  man2/move_mount.2 |  274 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  man2/open_tree.2  |  260 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 534 insertions(+)
>  create mode 100644 man2/move_mount.2
>  create mode 100644 man2/open_tree.2
> 
> diff --git a/man2/move_mount.2 b/man2/move_mount.2
> new file mode 100644
> index 000000000..3a819fb84
> --- /dev/null
> +++ b/man2/move_mount.2
> @@ -0,0 +1,274 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH MOVE_MOUNT 2 2018-06-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +move_mount \- Move mount objects around the filesystem topology
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/mount.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int move_mount(int " from_dirfd ", const char *" from_pathname ","
> +.BI "               int " to_dirfd ", const char *" to_pathname ","
> +.BI "               unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There are no glibc wrappers for these system calls.
> +.SH DESCRIPTION
> +The
> +.BR move_mount ()
> +call moves a mount from one place to another; it can also be used to attach an
> +unattached mount created by
> +.BR fsmount "() or " open_tree "() with " OPEN_TREE_CLONE .
> +.PP
> +If
> +.BR move_mount ()
> +is called repeatedly with a file descriptor that refers to a mount object,
> +then the object will be attached/moved the first time and then moved again and
> +again and again, detaching it from the previous mountpoint each time.
> +.PP
> +To access the source mount object or the destination mountpoint, no
> +permissions are required on the object itself, but if either pathname is
> +supplied, execute (search) permission is required on all of the directories
> +specified in
> +.IR from_pathname " or " to_pathname .
> +.PP
> +The caller does, however, require the appropriate capabilities or permission
> +to effect a mount.
> +.PP
> +.BR move_mount ()
> +uses
> +.IR from_pathname ", " from_dirfd " and some " flags
> +to locate the mount object to be moved and
> +.IR to_pathname ", " to_dirfd " and some other " flags
> +to locate the destination mountpoint.  Each lookup can be done in one of a
> +variety of ways:
> +.TP
> +[*] By absolute path.
> +The pathname points to an absolute path and the dirfd is ignored.  The file is
> +looked up by name, starting from the root of the filesystem as seen by the
> +calling process.
> +.TP
> +[*] By cwd-relative path.
> +The pathname points to a relative path and the dirfd is
> +.IR AT_FDCWD .
> +The file is looked up by name, starting from the current working directory.
> +.TP
> +[*] By dir-relative path.
> +The pathname points to relative path and the dirfd indicates a file descriptor
> +pointing to a directory.  The file is looked up by name, starting from the
> +directory specified by
> +.IR dirfd .
> +.TP
> +[*] By file descriptor.
> +The pathname points to "", the dirfd points directly to the mount object to
> +move or the destination mount point and the appropriate
> +.B *_EMPTY_PATH
> +flag is set.
> +.PP
> +.I flags
> +can be used to influence a path-based lookup.  A value for
> +.I flags
> +is constructed by OR'ing together zero or more of the following constants:
> +.TP
> +.BR MOVE_MOUNT_F_EMPTY_PATH
> +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> +If
> +.I from_pathname
> +is an empty string, operate on the file referred to by
> +.IR from_dirfd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag or
> +.BR open_tree ())
> +If
> +.I from_dirfd
> +is
> +.BR AT_FDCWD ,
> +the call operates on the current working directory.
> +In this case,
> +.I from_dirfd
> +can refer to any type of file, not just a directory.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B MOVE_MOUNT_T_EMPTY_PATH
> +As above, but operating on
> +.IR to_pathname " and " to_dirfd .
> +.TP
> +.B MOVE_MOUNT_F_NO_AUTOMOUNT
> +Don't automount the terminal ("basename") component of
> +.I from_pathname
> +if it is a directory that is an automount point.  This allows a mount object
> +that has an automount point at its root to be moved and prevents unintended
> +triggering of an automount point.
> +The
> +.B MOVE_MOUNT_F_NO_AUTOMOUNT
> +flag has no effect if the automount point has already been mounted over.  This
> +flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B MOVE_MOUNT_T_NO_AUTOMOUNT
> +As above, but operating on
> +.IR to_pathname " and " to_dirfd .
> +This allows an automount point to be manually mounted over.
> +.TP
> +.B MOVE_MOUNT_F_SYMLINKS
> +If
> +.I from_pathname
> +is a symbolic link, then dereference it.  The default for
> +.BR move_mount ()
> +is to not follow symlinks.
> +.TP
> +.B MOVE_MOUNT_T_SYMLINKS
> +As above, but operating on
> +.IR to_pathname " and " to_dirfd .
> +
> +.SH EXAMPLES
> +The
> +.BR move_mount ()
> +function can be used like the following:
> +.PP
> +.RS
> +.nf
> +move_mount(AT_FDCWD, "/a", AT_FDCWD, "/b", 0);
> +.fi
> +.RE
> +.PP
> +This would move the object mounted on "/a" to "/b".  It can also be used in
> +conjunction with
> +.BR open_tree "(2) or " open "(2) with " O_PATH :
> +.PP
> +.RS
> +.nf
> +fd = open_tree(AT_FDCWD, "/mnt", 0);
> +move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
> +move_mount(fd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH);
> +move_mount(fd, "", AT_FDCWD, "/mnt4", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.RE
> +.PP
> +This would attach the path point for "/mnt" to fd, then it would move the
> +mount to "/mnt2", then move it to "/mnt3" and finally to "/mnt4".
> +.PP
> +It can also be used to attach new mounts:
> +.PP
> +.RS
> +.nf
> +sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> +write(sfd, "s /dev/sda1");
> +write(sfd, "o user_xattr");
> +mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_NODEV);
> +move_mount(mfd, "", AT_FDCWD, "/home", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.RE
> +.PP
> +Which would open the Ext4 filesystem mounted on "/dev/sda1", turn on user
> +extended attribute support and create a mount object for it.  Finally, the new
> +mount object would be attached with
> +.BR move_mount ()
> +to "/home".
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, 0 is returned.  On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied for one of the directories
> +in the path prefix of
> +.IR pathname .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +.IR from_dirfd " or " to_dirfd
> +is not a valid open file descriptor.
> +.TP
> +.B EFAULT
> +.IR from_pathname " or " to_pathname
> +is NULL or either one point to a location outside the process's accessible
> +address space.
> +.TP
> +.B EINVAL
> +Reserved flag specified in
> +.IR flags .
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered while traversing the pathname.
> +.TP
> +.B ENAMETOOLONG
> +.IR from_pathname " or " to_pathname
> +is too long.
> +.TP
> +.B ENOENT
> +A component of
> +.IR from_pathname " or " to_pathname
> +does not exist, or one is an empty string and the appropriate
> +.B *_EMPTY_PATH
> +was not specified in
> +.IR flags .
> +.TP
> +.B ENOMEM
> +Out of memory (i.e., kernel memory).
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of
> +.IR from_pathname " or " to_pathname
> +is not a directory or one or the other is relative and the appropriate
> +.I *_dirfd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR move_mount ()
> +was added to Linux in kernel 4.18.
> +.SH CONFORMING TO
> +.BR move_mount ()
> +is Linux-specific.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR move_mount ()
> +system call; call it using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR fsmount (2),
> +.BR fsopen (2),
> +.BR open_tree (2)
> diff --git a/man2/open_tree.2 b/man2/open_tree.2
> new file mode 100644
> index 000000000..7e9c86fe3
> --- /dev/null
> +++ b/man2/open_tree.2
> @@ -0,0 +1,260 @@
> +'\" t
> +.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH OPEN_TREE 2 2018-06-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +open_tree \- Pick or clone mount object and attach to fd
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.br
> +.B #include <sys/mount.h>
> +.br
> +.B #include <unistd.h>
> +.br
> +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> +.PP
> +.BI "int open_tree(int " dirfd ", const char *" pathname ", unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There are no glibc wrappers for these system calls.
> +.SH DESCRIPTION
> +.BR open_tree ()
> +picks the mount object specified by the pathname and attaches it to a new file
> +descriptor or clones it and attaches the clone to the file descriptor.  The
> +resultant file descriptor is indistinguishable from one produced by
> +.BR open "(2) with " O_PATH .
> +.PP
> +In the case that the mount object is cloned, the clone will be "unmounted" and
> +destroyed when the file descriptor is closed if it is not otherwise mounted
> +somewhere by calling
> +.BR move_mount (2).
> +.PP
> +To select a mount object, no permissions are required on the object referred
> +to by the path, but execute (search) permission is required on all of the
> +directories in
> +.I pathname
> +that lead to the object.
> +.PP
> +To clone an object, however, the caller must have mount capabilities and
> +permissions.
> +.PP
> +.BR open_tree ()
> +uses
> +.IR pathname ", " dirfd " and " flags
> +to locate the target object in one of a variety of ways:
> +.TP
> +[*] By absolute path.
> +.I pathname
> +points to an absolute path and
> +.I dirfd
> +is ignored.  The object is looked up by name, starting from the root of the
> +filesystem as seen by the calling process.
> +.TP
> +[*] By cwd-relative path.
> +.I pathname
> +points to a relative path and
> +.IR dirfd " is " AT_FDCWD .
> +The object is looked up by name, starting from the current working directory.
> +.TP
> +[*] By dir-relative path.
> +.I pathname
> +points to relative path and
> +.I dirfd
> +indicates a file descriptor pointing to a directory.  The object is looked up
> +by name, starting from the directory specified by
> +.IR dirfd .
> +.TP
> +[*] By file descriptor.
> +.I pathname
> +is "",
> +.I dirfd
> +indicates a file descriptor and
> +.B AT_EMPTY_PATH
> +is set in
> +.IR flags .
> +The mount attached to the file descriptor is queried directly.  The file
> +descriptor may point to any type of file, not just a directory.
> +
> +.\"______________________________________________________________
> +.PP
> +.I flags
> +can be used to control the operation of the function and to influence a
> +path-based lookup.  A value for
> +.I flags
> +is constructed by OR'ing together zero or more of the following constants:
> +.TP
> +.BR AT_EMPTY_PATH
> +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> +If
> +.I pathname
> +is an empty string, operate on the file referred to by
> +.IR dirfd
> +(which may have been obtained from
> +.BR open "(2) with"
> +.BR O_PATH ", from " fsmount (2)
> +or from another
> +.BR open_tree ()).
> +If
> +.I dirfd
> +is
> +.BR AT_FDCWD ,
> +the call operates on the current working directory.
> +In this case,
> +.I dirfd
> +can refer to any type of file, not just a directory.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.BR AT_NO_AUTOMOUNT
> +Don't automount the terminal ("basename") component of
> +.I pathname
> +if it is a directory that is an automount point.  This flag allows the
> +automount point itself to be picked up or a mount cloned that is rooted on the
> +automount point.  The
> +.B AT_NO_AUTOMOUNT
> +flag has no effect if the mount point has already been mounted over.
> +This flag is Linux-specific; define
> +.B _GNU_SOURCE
> +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> +to obtain its definition.
> +.TP
> +.B AT_SYMLINK_NOFOLLOW
> +If
> +.I pathname
> +is a symbolic link, do not dereference it: instead pick up or clone a mount
> +rooted on the link itself.
> +.TP
> +.B OPEN_TREE_CLOEXEC
> +Set the close-on-exec flag for the new file descriptor.  This will cause the
> +file descriptor to be closed automatically when a process exec's.
> +.TP
> +.B OPEN_TREE_CLONE
> +Rather than directly attaching the selected object to the file descriptor,
> +clone the object, set the root of the new mount object to that point and
> +attach the clone to the file descriptor.
> +.TP
> +.B AT_RECURSIVE
> +This is only permitted in conjunction with OPEN_TREE_CLONE.  It causes the
> +entire mount subtree rooted at the selected spot to be cloned rather than just
> +that one mount object.
> +
> +
> +.SH EXAMPLE
> +The
> +.BR open_tree ()
> +function can be used like the following:
> +.PP
> +.RS
> +.nf
> +fd1 = open_tree(AT_FDCWD, "/mnt", 0);
> +fd2 = open_tree(fd1, "",
> +                AT_EMPTY_PATH | OPEN_TREE_CLONE | AT_RECURSIVE);
> +move_mount(fd2, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
> +.fi
> +.RE
> +.PP
> +This would attach the path point for "/mnt" to fd1, then it would copy the
> +entire subtree at the point referred to by fd1 and attach that to fd2; lastly,
> +it would attach the clone to "/mnt2".
> +
> +
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> +.SH RETURN VALUE
> +On success, the new file descriptor is returned.  On error, \-1 is returned,
> +and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied for one of the directories
> +in the path prefix of
> +.IR pathname .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +.I dirfd
> +is not a valid open file descriptor.
> +.TP
> +.B EFAULT
> +.I pathname
> +is NULL or
> +.IR pathname
> +point to a location outside the process's accessible address space.
> +.TP
> +.B EINVAL
> +Reserved flag specified in
> +.IR flags .
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered while traversing the pathname.
> +.TP
> +.B ENAMETOOLONG
> +.I pathname
> +is too long.
> +.TP
> +.B ENOENT
> +A component of
> +.I pathname
> +does not exist, or
> +.I pathname
> +is an empty string and
> +.B AT_EMPTY_PATH
> +was not specified in
> +.IR flags .
> +.TP
> +.B ENOMEM
> +Out of memory (i.e., kernel memory).
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of
> +.I pathname
> +is not a directory or
> +.I pathname
> +is relative and
> +.I dirfd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR open_tree ()
> +was added to Linux in kernel 4.18.
> +.SH CONFORMING TO
> +.BR open_tree ()
> +is Linux-specific.
> +.SH NOTES
> +Glibc does not (yet) provide a wrapper for the
> +.BR open_tree ()
> +system call; call it using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR fsmount (2),
> +.BR move_mount (2),
> +.BR open (2)
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
From: Michael Kerrisk (man-pages) @ 2019-10-09  8:36 UTC (permalink / raw)
  To: Aleksa Sarai, Al Viro
  Cc: mtk.manpages, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-4-cyphar@cyphar.com>

Hello Aleksa,

Thanks for this. It's a great piece of documentation work!

I would prefer the path_resolution(7) piece as a separate patch.


On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> Rather than trying to merge the new syscall documentation into open.2
> (which would probably result in the man-page being incomprehensible),
> instead the new syscall gets its own dedicated page with links between
> open(2) and openat2(2) to avoid duplicating information such as the list
> of O_* flags or common errors.

Yes, looking at the size of the proposed openat2(2) page,
this seems best.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man2/open.2            |   5 +
>  man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
>  man7/path_resolution.7 |  57 ++++--
>  3 files changed, 426 insertions(+), 17 deletions(-)
>  create mode 100644 man2/openat2.2
> 
> diff --git a/man2/open.2 b/man2/open.2
> index 7217fe056e5e..a0b43394bbee 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
>  ", mode_t " mode );
> +.PP
> +/* Docuented separately, in \fBopenat2\fP(2). */

Documented

> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
>  .fi
>  .PP
>  .in -4n
> @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
>  .B O_DIRECTORY
>  is ignored).
>  .SH SEE ALSO
> +.BR openat2 (2),

Entries here should into alphabetical order (within
sections).

>  .BR chmod (2),
>  .BR chown (2),
>  .BR close (2),
> diff --git a/man2/openat2.2 b/man2/openat2.2
> new file mode 100644
> index 000000000000..c43c76046243
> --- /dev/null
> +++ b/man2/openat2.2
> @@ -0,0 +1,381 @@
> +.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +openat2 \- open and possibly create a file (extended)
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.B #include <sys/stat.h>
> +.B #include <fcntl.h>
> +.PP
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR openat2 ()
> +system call is an extension of
> +.BR openat (2)
> +and provides a superset of its functionality. Rather than taking a single

Please start new sentences on new source lines. I recently added this
text in man-pages(7):

   Use semantic newlines
       In the source of a manual page, new sentences should be started on
       new  lines,  and  long sentences should split into lines at clause
       breaks (commas, semicolons, colons, and so on).  This  convention,
       sometimes known as "semantic newlines", makes it easier to see the
       effect of patches, which often operate at the level of  individual
       sentences or sentence clauses.

> +.I flag
> +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> +seamless future extensions.

s/seamless//

> +.PP
> +.I size
> +must be set to
> +.IR "sizeof(struct open_how)" ,
> +to facilitate future extensions (see the "Extensibility" section of the
> +\fBNOTES\fP for more detail on how extensions are handled.)
> +
> +.SS The open_how structure
> +The following structure indicates how
> +.I pathname
> +should be opened, and acts as a superset of the
> +.IR flag " and " mode
> +arguments to
> +.BR openat (2).
> +.PP
> +.in +4n
> +.EX
> +struct open_how {
> +    uint32_t flags;              /* open(2)-style O_* flags. */
> +    union {
> +        uint16_t mode;           /* File mode bits for new file creation. */
> +        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
> +    };
> +    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
> +};
> +.EE
> +.in
> +.PP
> +Any future extensions to
> +.BR openat2 ()
> +will be implemented as new fields appended to the above structure, with the
> +zero value of the new fields acting as though the extension were not present.
> +.PP
> +The meaning of each field is as follows:
> +.RS
> +
> +.I flags
> +.RSall
> +The file creation and status flags to use for this operation. All of the
> +.B O_*
> +flags defined for
> +.BR openat (2)
> +are valid
> +.BR openat2 ()
> +flag values.
> +.RE
> +
> +.I upgrade_mask
> +.RS
> +Restrict with which
> +.I access modes
> +the returned
> +.B O_PATH
> +descriptor may be re-opened (either through
> +.B O_EMPTYPATH
> +or
> +.IR /proc/self/fd/ .)
> +This field may only be set to a non-zero value if
> +.I flags
> +contains
> +.BR O_PATH .
> +By default, an
> +.B O_PATH
> +file descriptor of an ordinary file may be re-opened with with any access mode (but an
> +.B O_PATH
> +file descriptor of a magic-link may only be re-opened with access modes that
> +the original magic-link possessed). The full list of

magic link (throughout the page)

> +.I upgrade_mask
> +flags is given below.
> +.TP
> +.B UPGRADE_NOREAD
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for reading (i.e.
> +.BR O_RDONLY " or " O_RDWR .)
> +.TP
> +.B UPGRADE_NOWRITE
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for writing (i.e.
> +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> +.RE
> +.I resolve
> +.RS
> +Change how the components of
> +.I pathname
> +will be resolved (see
> +.BR path_resolution (7)
> +for background information.) The primary use-case for these flags is to allow

use case

> +trusted programs to restrict how un-trusted paths (or paths inside un-trusted

untrusted

> +directories) are resolved. The full list of
> +.I resolve
> +flags is given below.
> +.TP
> +.B RESOLVE_NO_XDEV
> +Disallow all mount-point crossings during path resolution (including

I think better would be: "Disallow traversal of mount points". Do you 
agree?

> +all bind-mounts).

bind mounts

> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as bind-mounts are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_SYMLINKS
> +Disallow all symlink resolution during path resolution. If the trailing

Disallow resolution of symbolic links during path resolution

> +component is a symlink, and

symbolic link (throughout the page)

> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the symlink will be returned. This option implies
> +.BR RESOLVE_NO_MAGICLINKS .
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as symlinks are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.

It's not really clear what you mean by "enabling this flag globally".
Could you reword, or explain in a bit more detail?

> +.TP
> +.B RESOLVE_NO_MAGICLINKS
> +Disallow all magic-link resolution during path resolution. If the trailing
> +component is a magic-link, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the magic-link will be returned.
> +
> +Magic-links are symlink-like objects that are most notably found in
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Due to the potential danger of unknowingly opening these magic-links, it may be
> +preferable for users to disable their resolution entirely (see
> +.BR symlink (7)
> +for more details.)
> +.TP
> +.B RESOLVE_BENEATH
> +Do not permit the path resolution to succeed if any component of the resolution
> +is not a descendant of the directory indicated by
> +.IR dirfd .
> +This results in absolute symlinks (and absolute values of
> +.IR pathname )
> +to be rejected. Magic-link resolution is also not permitted.

So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
it would be good to state that more explicitly,

> +
> +.TP
> +.B RESOLVE_IN_ROOT
> +Temporarily treat
> +.I dirfd
> +as the root of the filesystem (as though the user called

Perhaps better:

Treat
.I dirfd
as the root directory while resolving
.I pathname
(as though...)

> +.BR chroot (2)
> +with
> +.IR dirfd
> +as the argument.) Absolute symlinks and ".." path components will be scoped to
> +.IR dirfd . Magic-link resolution is also not permitted.

Insert a newline before "Magic" to fix a formatting problem.

So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
it would be good to state that more explicitly,

> +
> +However, unlike
> +.BR chroot (2)
> +(which changes the filesystem root persistently for an entire thread-group),

s/persistently for an entire thread-group/
 /permanently for a process/

> +.B RESOLVE_IN_ROOT
> +allows a program to efficiently restrict path resolution for only certain
> +operations. It also has several hardening features (such as not permitting
> +magic-link resolution) which
> +.BR chroot (2)
> +does not.
> +.RE
> +
> +.RE
> +
> +.PP
> +Unlike
> +.BR openat (2),
> +any unknown flags set in fields of
> +.I how
> +will result in an error, rather than being ignored. 

Thank you, thank you, thank you. It was sad
that openat() never fixed that antifeature.

> In addition, an error will
> +be returned if the value of the
> +.IR mode " and " upgrade_mask
> +union is non-zero unless:
> +.RS
> +.IP * 3
> +.I flags
> +indicates that a new file will be created (it contains
> +.BR O_CREAT " or " O_TMPFILE ),
> +in which case
> +.I mode
> +may be any valid file mode.
> +.IP *
> +.I flags
> +contains
> +.BR O_PATH ,
> +in which case
> +.I upgrade_mask
> +must only contain valid
> +.B UPGRADE_*
> +flags.
> +.RE
> +
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned. On error, -1 is returned, and
> +.I errno
> +is set appropriately.
> +
> +.SH ERRORS
> +The set of errors returned by
> +.BR openat2 ()
> +includes all of the errors returned by
> +.BR openat (2),
> +as well as the following additional errors:
> +.TP
> +.B EINVAL
> +An unknown flag or invalid value was specified in
> +.IR how .
> +.TP
> +.B EINVAL
> +.I size
> +was smaller than any known version of
> +.IR "struct open_how" .
> +.TP
> +.B E2BIG
> +An extension was specified in
> +.IR how ,
> +which the current kernel does not support (see the "Extensibility" section of
> +the \fBNOTES\fP for more detail on how extensions are handled.)
> +.TP
> +.B EAGAIN
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and the kernel could not ensure that a ".." component didn't escape (due to a
> +race condition or potential attack). Callers may choose to retry the
> +.BR openat2 ()
> +call.
> +.TP
> +.B EXDEV
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and a path component attempted to escape the root of the resolution.
> +
> +.TP
> +.B EXDEV
> +.I resolve
> +contains
> +.BR RESOLVE_NO_XDEV ,
> +and a path component attempted to cross a mount-point.

mount point

> +
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_SYMLINKS ,
> +and one of the path components was a symlink.
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_MAGICLINKS ,
> +and one of the path components was a magic-link.
> +
> +.SH VERSIONS
> +.BR openat2 ()
> +was added to Linux in kernel 5.FOO.
> +
> +.SH CONFORMING TO
> +This system call is Linux-specific.
> +
> +The semantics of
> +.B RESOLVE_BENEATH
> +were modelled after FreeBSD's
> +.BR O_BENEATH .
> +
> +.SH NOTES
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +
> +.SS Extensibility
> +In order to allow for
> +.I struct open_how
> +to be extended in future kernel revisions,
> +.BR openat2 ()
> +requires userspace to specify what sized

s/what sized/the size of/

> +.I struct open_how
> +structure they are passing. By providing this information, it is possible for
> +.BR openat2 ()
> +to provide both forwards- and backwards-compatibility \(em with
> +.I size
> +acting as an implicit version number (because new extension fields will always
> +be appended, the size will always increase.) This extensibility design is very
> +similar to other system calls such as
> +.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).

The following explantion of uszie and ksize is great. Thanks for that.

> +If we let
> +.I usize
> +be the size of the structure according to userspace and
> +.I ksize
> +be the size of the structure which the kernel supports, then there are only
> +three cases to consider:
> +
> +.RS
> +.IP * 3
> +If
> +.IR ksize " equals " usize ,
> +then there is no version mismatch and
> +.I how
> +can be used verbatim.
> +.IP *
> +If
> +.IR ksize " is larger than " usize ,
> +then there are some extensions the kernel supports which the userspace program
> +is unaware of. Because all extensions must have their zero values be a no-op,
> +the kernel treats all of the extension fields not set by userspace to have zero
> +values. This provides backwards-compatibility.
> +.IP *
> +If
> +.IR ksize " is smaller than " usize ,
> +then there are some extensions which the userspace program is aware of but the
> +kernel does not support. Because all extensions must have their zero values be
> +a no-op, the kernel can safely ignore the unsupported extension fields if they
> +are all-zero. If any unsupported extension fields are non-zero, then an error
> +is returned. This provides forwards-compatibility.
> +.RE
> +
> +Therefore, most userspace programs will not need to have any special handling
> +of extensions. However, if a userspace program wishes to determine what
> +extensions the running kernel supports, they may conduct a binary search on
> +.IR size
> +(to find the largest value which doesn't produce an error.)
> +
> +.SH SEE ALSO
> +.BR openat (2),
> +.BR path_resolution (7),
> +.BR symlink (7)
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 85dd354e9a93..3da3e5b614c8 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
>  Some UNIX/Linux system calls have as parameter one or more filenames.
>  A filename (or pathname) is resolved as follows.
>  .SS Step 1: start of the resolution process
> -If the pathname starts with the \(aq/\(aq character,
> -the starting lookup directory
> -is the root directory of the calling process.
> -(A process inherits its
> -root directory from its parent.
> -Usually this will be the root directory
> -of the file hierarchy.
> -A process may get a different root directory
> -by use of the
> +If the pathname starts with the \(aq/\(aq character, the starting lookup
> +directory is the root directory of the calling process. (A process inherits its
> +root directory from its parent. Usually this will be the root directory of the
> +file hierarchy. A process may get a different root directory by use of the
>  .BR chroot (2)
> -system call.
> +system call, or may temporarily use a different root directory by using
> +.BR openat2 (2)
> +with the
> +.B RESOLVE_IN_ROOT
> +flag set.
> +.PP
>  A process may get an entirely private mount namespace in case
>  it\(emor one of its ancestors\(emwas started by an invocation of the
>  .BR clone (2)
> @@ -48,16 +48,24 @@ system call that had the
>  flag set.)
>  This handles the \(aq/\(aq part of the pathname.
>  .PP
> -If the pathname does not start with the \(aq/\(aq character, the
> -starting lookup directory of the resolution process is the current working
> -directory of the process.
> -(This is also inherited from the parent.
> -It can be changed by use of the
> +If the pathname does not start with the \(aq/\(aq character, the starting
> +lookup directory of the resolution process is the current working directory of
> +the process \(em or in the case of
> +.BR openat (2)-style
> +syscalls, the

system calls

> +.I dfd
> +argument (or the current working directory if
> +.B AT_FDCWD
> +is passed as the
> +.I dfd
> +argumnet). The current working directory is inherited from the parent, and can

argument

> +be changed by use of the
>  .BR chdir (2)
> -system call.)
> +syscall.

"system call" please.

>  .PP
>  Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
>  Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> +

No blank line here.

>  .SS Step 2: walk along the path
>  Set the current lookup directory to the starting lookup directory.
>  Now, for each nonfinal component of the pathname, where a component
> @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
>  was reworked to eliminate the use of recursion,
>  so that the only limit that remains is the maximum of 40
>  resolutions for the entire pathname.
> +.PP
> +The resolution of syscalls during this stage can be blocked by using

"resolution of syscall" seems wrong? "syscall" should be something 
else?

> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_SYMLINKS
> +flag set.
> +
>  .SS Step 3: find the final entry
>  The lookup of the final component of the pathname goes just like
>  that of all other components, as described in the previous step,
> @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
>  their conventional meanings, regardless of whether they are
>  actually present in the physical filesystem.
>  .PP
> -One cannot walk down past the root: "/.." is the same as "/".
> +One cannot walk up past the root: "/.." is the same as "/".
> +

No blank line please.

>  .SS Mount points
>  After a "mount dev path" command, the pathname "path" refers to
>  the root of the filesystem hierarchy on the device "dev", and no
> @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
>  One can walk out of a mounted filesystem: "path/.." refers to
>  the parent directory of "path",
>  outside of the filesystem hierarchy on "dev".
> +.PP
> +Mount-point crossings can be blocked by using

Traversal of mount points can be disallowed by...

> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_XDEV
> +flag set (though note that this also restricts bind-mount crossings).
> +

No blank line please.

>  .SS Trailing slashes
>  If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
>  component as in Step 2: it has to exist and resolve to a directory.
> 

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation
From: Michael Kerrisk (man-pages) @ 2019-10-09  8:01 UTC (permalink / raw)
  To: Aleksa Sarai, Al Viro
  Cc: mtk.manpages, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-3-cyphar@cyphar.com>

Hello Aleksa,

You write "5.FOO" in these patches. When do you expect these changes to 
land in the kernel?

On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> Some of the wording around empty paths in path_resolution(7) also needed
> to be reworked since it's now legal (if you pass O_EMPTYPATH).
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man2/open.2            | 42 +++++++++++++++++++++++++++++++++++++++++-
>  man7/path_resolution.7 | 17 ++++++++++++++++-
>  2 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/man2/open.2 b/man2/open.2
> index b0f485b41589..7217fe056e5e 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -48,7 +48,7 @@
>  .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
>  .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
>  .\"
> -.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
> +.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"

No need to update the timestamp. I have scripts that handle this
automatically.

>  .SH NAME
>  open, openat, creat \- open and possibly create a file
>  .SH SYNOPSIS
> @@ -421,6 +421,21 @@ was followed by a call to
>  .BR fdatasync (2)).
>  .IR "See NOTES below" .
>  .TP
> +.BR O_EMPTYPATH " (since Linux 5.FOO)"
> +If \fIpathname\fP is an empty string, re-open the the file descriptor given as

In general, I prefer the general form

.I pathname

over \fIpathname\fP. 

If you would be willing to cahnge that, it would  save me a little work.
(And likewise throughout the rest of the patch.)

> +the \fIdirfd\fP argument to
> +.BR openat (2).
> +This can be used with both ordinary (file and directory) and \fBO_PATH\fP file
> +descriptors, but cannot be used with
> +.BR AT_FDCWD
> +(or as an argument to plain
> +.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same "link

There's a formatting problem here which can be fixed by inserting a 
newline before "When".

> +mode" restrictions apply as with re-opening through
> +.BR proc (5)
> +(see
> +.BR path_resolution "(7) and " symlink (7)
> +for more details.)
> +.TP
>  .B O_EXCL
>  Ensure that this call creates the file:
>  if this flag is specified in conjunction with
> @@ -668,6 +683,13 @@ with
>  (or via procfs using
>  .BR AT_SYMLINK_FOLLOW )
>  even if the file is not a directory.
> +You can even "re-open" (or upgrade) an
> +.BR O_PATH
> +file descriptor by using
> +.BR O_EMPTYPATH
> +(see the section for
> +.BR O_EMPTYPATH
> +for more details.)
>  .IP *
>  Passing the file descriptor to another process via a UNIX domain socket
>  (see
> @@ -958,6 +980,15 @@ is not allowed.
>  (See also
>  .BR path_resolution (7).)
>  .TP
> +.B EBADF
> +.I pathname
> +was an empty string (and
> +.B O_EMPTYPATH
> +was passed) with
> +.BR open (2)
> +(instead of
> +.BR openat (2).)
> +.TP
>  .B EDQUOT
>  Where
>  .B O_CREAT
> @@ -1203,6 +1234,15 @@ The following additional errors can occur for
>  .I dirfd
>  is not a valid file descriptor.
>  .TP
> +.B EBADF
> +.I pathname
> +was an empty string (and
> +.B O_EMPTYPATH
> +was passed), but the provided
> +.I dirfd
> +was an invalid file descriptor (or was
> +.BR AT_FDCWD .)
> +.TP
>  .B ENOTDIR
>  .I pathname
>  is a relative pathname and
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 46f25ec4cdfa..85dd354e9a93 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -22,7 +22,7 @@
>  .\" the source, must acknowledge the copyright and authors of this work.
>  .\" %%%LICENSE_END
>  .\"
> -.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
> +.TH PATH_RESOLUTION 7 2019-10-03 "Linux" "Linux Programmer's Manual"
>  .SH NAME
>  path_resolution \- how a pathname is resolved to a file
>  .SH DESCRIPTION
> @@ -198,6 +198,21 @@ successfully.
>  Linux returns
>  .B ENOENT
>  in this case.
> +.PP
> +As of Linux 5.FOO, an empty path argument can be used to indicate the "re-open"
> +an existing file descriptor if
> +.B O_EMPTYPATH
> +is passed as a flag argument to
> +.BR openat (2),
> +with the
> +.I dfd
> +argument indicating which file descriptor to "re-open". This is approximately
> +equivalent to opening
> +.I /proc/self/fd/$fd

.IR /proc/self/fd/$fd ,

> +where
> +.I $fd
> +is the open file descriptor to be "re-opened".
> +

No blank line here.

>  .SS Permissions
>  The permission bits of a file consist of three groups of three bits; see
>  .BR chmod (1)
> 

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
From: Michael Kerrisk (man-pages) @ 2019-10-09  7:55 UTC (permalink / raw)
  To: Aleksa Sarai, Al Viro
  Cc: mtk.manpages, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-2-cyphar@cyphar.com>

Hello Aleksa,


On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> Traditionally, magic-links have not been a well-understood topic in
> Linux. Given the new changes in their semantics (related to the link
> mode of trailing magic-links), it seems like a good opportunity to shine
> more light on magic-links and their semantics.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks for doing this. Some comments below.

> ---
>  man7/path_resolution.7 | 15 +++++++++++++++
>  man7/symlink.7         | 39 ++++++++++++++++++++++++++++++---------
>  2 files changed, 45 insertions(+), 9 deletions(-)
> 
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 07664ed8faec..46f25ec4cdfa 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -136,6 +136,21 @@ we are just creating it.
>  The details on the treatment
>  of the final entry are described in the manual pages of the specific
>  system calls.
> +.PP
> +Since Linux 5.FOO, if the final entry is a "magic-link" (see

"magic link". As Jann points out, this is more normal English usage.

> +.BR symlink (7)),
> +and the user is attempting to
> +.BR open (2)
> +it, then there is an additional permission-related restriction applied to the
> +operation: the requested access mode must not exceed the "link mode" of the
> +magic-link (unlike ordinary symlinks, magic-links have their own file mode.)

Remove the hyphens (magic link). And also, as someone else pointed out,
manual pages fairly consistently uses the term "symbolic link"
(written in full).

You use the term "file mode" here. Do you mean the file permissions bits?
If yes, it is a bit misleading to suggest that symbolic links don't
have these mode bits. They do, but--as noted in the existing symlink(7)
manual page text--these bits are ignored. I suggest just removing the
parenthesized text.

> +For example, if
> +.I /proc/[pid]/fd/[num]
> +has a link mode of
> +.BR 0500 ,
> +unprivileged users are not permitted to
> +.BR open ()
> +the magic-link for writing.
>  .SS . and ..
>  By convention, every directory has the entries "." and "..",
>  which refer to the directory itself and to its parent directory,
> diff --git a/man7/symlink.7 b/man7/symlink.7
> index 9f5bddd5dc21..33f0ec703acd 100644
> --- a/man7/symlink.7
> +++ b/man7/symlink.7
> @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
>  are outlined here.
>  It is important that site-local applications also conform to these rules,
>  so that the user interface can be as consistent as possible.
> +.SS Magic-links
> +There is a special class of symlink-like objects known as "magic-links" which

"magic links" (and through the rest of the page).

> +can be found in certain pseudo-filesystems such as

pseudofilesystems

> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Unlike normal symlinks, magic-links are not resolved through

symbolic links

> +pathname-expansion, but instead act as direct references to the kernel's own

pathname expansion

> +representation of a file handle. As such, these magic-links allow users to
> +access files which cannot be referenced with normal paths (such as unlinked
> +files still referenced by a running program.)
> +.PP
> +Because they can bypass ordinary
> +.BR mount_namespaces (7)-based
> +restrictions, magic-links have been used as attack vectors in various exploits.
> +As such (since Linux 5.FOO), there are additional restrictions placed on the
> +re-opening of magic-links (see
> +.BR path_resolution (7)
> +for more details.)
>  .SS Symbolic link ownership, permissions, and timestamps
>  The owner and group of an existing symbolic link can be changed
>  using
> @@ -99,16 +118,18 @@ of a symbolic link can be changed using
>  or
>  .BR lutimes (3).
>  .PP
> -On Linux, the permissions of a symbolic link are not used
> -in any operations; the permissions are always
> -0777 (read, write, and execute for all user categories),
>  .\" Linux does not currently implement an lchmod(2).
> -and can't be changed.
> -(Note that there are some "magic" symbolic links in the
> -.I /proc
> -directory tree\(emfor example, the
> -.IR /proc/[pid]/fd/*
> -files\(emthat have different permissions.)
> +On Linux, the permissions of an ordinary symbolic link are not used in any
> +operations; the permissions are always 0777 (read, write, and execute for all
> +user categories), and can't be changed.
> +.PP
> +However, magic-links do not follow this rule. They can have a non-0777 mode,
> +which is used for permission checks when the final
> +component of an
> +.BR open (2)'s
> +path is a magic-link (see
> +.BR path_resolution (7).)
> +
>  .\"
>  .\" The
>  .\" 4.4BSD

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH -next] treewide: remove unused argument in lock_release()
From: Yuyang Du @ 2019-10-09  1:14 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Will Deacon, LKML,
	linux-api, maarten.lankhorst, mripard, sean, airlied, daniel,
	dri-devel, gregkh, jslaby, viro, linux-fsdevel, joonas.lahtinen,
	rodrigo.vivi, intel-gfx, tytso, jack, linux-ext4, tj, mark, jlbec,
	joseph.qi, ocfs2-devel, davem, st, daniel, netdev
In-Reply-To: <1568909380-32199-1-git-send-email-cai@lca.pw>

I didn't have the guts to do this, and I am glad you did it :)

Yuyang

On Fri, 20 Sep 2019 at 00:10, Qian Cai <cai@lca.pw> wrote:
>
> Since the commit b4adfe8e05f1 ("locking/lockdep: Remove unused argument
> in __lock_release"), @nested is no longer used in lock_release(), so
> remove it from all lock_release() calls and friends.
>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  drivers/gpu/drm/drm_connector.c               |  2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  6 +++---
>  drivers/gpu/drm/i915/gt/intel_engine_pm.c     |  2 +-
>  drivers/gpu/drm/i915/i915_request.c           |  2 +-
>  drivers/tty/tty_ldsem.c                       |  8 ++++----
>  fs/dcache.c                                   |  2 +-
>  fs/jbd2/transaction.c                         |  4 ++--
>  fs/kernfs/dir.c                               |  4 ++--
>  fs/ocfs2/dlmglue.c                            |  2 +-
>  include/linux/jbd2.h                          |  2 +-
>  include/linux/lockdep.h                       | 21 ++++++++++-----------
>  include/linux/percpu-rwsem.h                  |  4 ++--
>  include/linux/rcupdate.h                      |  2 +-
>  include/linux/rwlock_api_smp.h                | 16 ++++++++--------
>  include/linux/seqlock.h                       |  4 ++--
>  include/linux/spinlock_api_smp.h              |  8 ++++----
>  include/linux/ww_mutex.h                      |  2 +-
>  include/net/sock.h                            |  2 +-
>  kernel/bpf/stackmap.c                         |  2 +-
>  kernel/cpu.c                                  |  2 +-
>  kernel/locking/lockdep.c                      |  3 +--
>  kernel/locking/mutex.c                        |  4 ++--
>  kernel/locking/rtmutex.c                      |  6 +++---
>  kernel/locking/rwsem.c                        | 10 +++++-----
>  kernel/printk/printk.c                        | 10 +++++-----
>  kernel/sched/core.c                           |  2 +-
>  lib/locking-selftest.c                        | 24 ++++++++++++------------
>  mm/memcontrol.c                               |  2 +-
>  net/core/sock.c                               |  2 +-
>  tools/lib/lockdep/include/liblockdep/common.h |  3 +--
>  tools/lib/lockdep/include/liblockdep/mutex.h  |  2 +-
>  tools/lib/lockdep/include/liblockdep/rwlock.h |  2 +-
>  tools/lib/lockdep/preload.c                   | 16 ++++++++--------
>  33 files changed, 90 insertions(+), 93 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c
> index 4c766624b20d..4a8b2e5c2af6 100644
> --- a/drivers/gpu/drm/drm_connector.c
> +++ b/drivers/gpu/drm/drm_connector.c
> @@ -719,7 +719,7 @@ void drm_connector_list_iter_end(struct drm_connector_list_iter *iter)
>                 __drm_connector_put_safe(iter->conn);
>                 spin_unlock_irqrestore(&config->connector_list_lock, flags);
>         }
> -       lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
> +       lock_release(&connector_list_iter_dep_map, _RET_IP_);
>  }
>  EXPORT_SYMBOL(drm_connector_list_iter_end);
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> index edd21d14e64f..1a51b3598d63 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> @@ -509,14 +509,14 @@ void i915_gem_shrinker_taints_mutex(struct drm_i915_private *i915,
>                       I915_MM_SHRINKER, 0, _RET_IP_);
>
>         mutex_acquire(&mutex->dep_map, 0, 0, _RET_IP_);
> -       mutex_release(&mutex->dep_map, 0, _RET_IP_);
> +       mutex_release(&mutex->dep_map, _RET_IP_);
>
> -       mutex_release(&i915->drm.struct_mutex.dep_map, 0, _RET_IP_);
> +       mutex_release(&i915->drm.struct_mutex.dep_map, _RET_IP_);
>
>         fs_reclaim_release(GFP_KERNEL);
>
>         if (unlock)
> -               mutex_release(&i915->drm.struct_mutex.dep_map, 0, _RET_IP_);
> +               mutex_release(&i915->drm.struct_mutex.dep_map, _RET_IP_);
>  }
>
>  #define obj_to_i915(obj__) to_i915((obj__)->base.dev)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index 65b5ca74b394..7f647243b3b9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -52,7 +52,7 @@ static inline unsigned long __timeline_mark_lock(struct intel_context *ce)
>  static inline void __timeline_mark_unlock(struct intel_context *ce,
>                                           unsigned long flags)
>  {
> -       mutex_release(&ce->timeline->mutex.dep_map, 0, _THIS_IP_);
> +       mutex_release(&ce->timeline->mutex.dep_map, _THIS_IP_);
>         local_irq_restore(flags);
>  }
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index a53777dd371c..e1f1be4d0531 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1456,7 +1456,7 @@ long i915_request_wait(struct i915_request *rq,
>         dma_fence_remove_callback(&rq->fence, &wait.cb);
>
>  out:
> -       mutex_release(&rq->engine->gt->reset.mutex.dep_map, 0, _THIS_IP_);
> +       mutex_release(&rq->engine->gt->reset.mutex.dep_map, _THIS_IP_);
>         trace_i915_request_wait_end(rq);
>         return timeout;
>  }
> diff --git a/drivers/tty/tty_ldsem.c b/drivers/tty/tty_ldsem.c
> index 60ff236a3d63..ce8291053af3 100644
> --- a/drivers/tty/tty_ldsem.c
> +++ b/drivers/tty/tty_ldsem.c
> @@ -303,7 +303,7 @@ static int __ldsem_down_read_nested(struct ld_semaphore *sem,
>         if (count <= 0) {
>                 lock_contended(&sem->dep_map, _RET_IP_);
>                 if (!down_read_failed(sem, count, timeout)) {
> -                       rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +                       rwsem_release(&sem->dep_map, _RET_IP_);
>                         return 0;
>                 }
>         }
> @@ -322,7 +322,7 @@ static int __ldsem_down_write_nested(struct ld_semaphore *sem,
>         if ((count & LDSEM_ACTIVE_MASK) != LDSEM_ACTIVE_BIAS) {
>                 lock_contended(&sem->dep_map, _RET_IP_);
>                 if (!down_write_failed(sem, count, timeout)) {
> -                       rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +                       rwsem_release(&sem->dep_map, _RET_IP_);
>                         return 0;
>                 }
>         }
> @@ -390,7 +390,7 @@ void ldsem_up_read(struct ld_semaphore *sem)
>  {
>         long count;
>
> -       rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +       rwsem_release(&sem->dep_map, _RET_IP_);
>
>         count = atomic_long_add_return(-LDSEM_READ_BIAS, &sem->count);
>         if (count < 0 && (count & LDSEM_ACTIVE_MASK) == 0)
> @@ -404,7 +404,7 @@ void ldsem_up_write(struct ld_semaphore *sem)
>  {
>         long count;
>
> -       rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +       rwsem_release(&sem->dep_map, _RET_IP_);
>
>         count = atomic_long_add_return(-LDSEM_WRITE_BIAS, &sem->count);
>         if (count < 0)
> diff --git a/fs/dcache.c b/fs/dcache.c
> index e88cf0554e65..f7931b682a0d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1319,7 +1319,7 @@ static void d_walk(struct dentry *parent, void *data,
>
>                 if (!list_empty(&dentry->d_subdirs)) {
>                         spin_unlock(&this_parent->d_lock);
> -                       spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
> +                       spin_release(&dentry->d_lock.dep_map, _RET_IP_);
>                         this_parent = dentry;
>                         spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
>                         goto repeat;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index bee8498d7792..b25ebdcabfa3 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -713,7 +713,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
>         if (need_to_start)
>                 jbd2_log_start_commit(journal, tid);
>
> -       rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
> +       rwsem_release(&journal->j_trans_commit_map, _THIS_IP_);
>         handle->h_buffer_credits = nblocks;
>         /*
>          * Restore the original nofs context because the journal restart
> @@ -1848,7 +1848,7 @@ int jbd2_journal_stop(handle_t *handle)
>                         wake_up(&journal->j_wait_transaction_locked);
>         }
>
> -       rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
> +       rwsem_release(&journal->j_trans_commit_map, _THIS_IP_);
>
>         if (wait_for_commit)
>                 err = jbd2_log_wait_commit(journal, tid);
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6ebae6bbe6a5..c45b82feac9a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -438,7 +438,7 @@ void kernfs_put_active(struct kernfs_node *kn)
>                 return;
>
>         if (kernfs_lockdep(kn))
> -               rwsem_release(&kn->dep_map, 1, _RET_IP_);
> +               rwsem_release(&kn->dep_map, _RET_IP_);
>         v = atomic_dec_return(&kn->active);
>         if (likely(v != KN_DEACTIVATED_BIAS))
>                 return;
> @@ -476,7 +476,7 @@ static void kernfs_drain(struct kernfs_node *kn)
>
>         if (kernfs_lockdep(kn)) {
>                 lock_acquired(&kn->dep_map, _RET_IP_);
> -               rwsem_release(&kn->dep_map, 1, _RET_IP_);
> +               rwsem_release(&kn->dep_map, _RET_IP_);
>         }
>
>         kernfs_drain_open_files(kn);
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index ad594fef2ab0..71975b9b142c 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -1687,7 +1687,7 @@ static void __ocfs2_cluster_unlock(struct ocfs2_super *osb,
>         spin_unlock_irqrestore(&lockres->l_lock, flags);
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         if (lockres->l_lockdep_map.key != NULL)
> -               rwsem_release(&lockres->l_lockdep_map, 1, caller_ip);
> +               rwsem_release(&lockres->l_lockdep_map, caller_ip);
>  #endif
>  }
>
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 603fbc4e2f70..564793c24d12 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1170,7 +1170,7 @@ struct journal_s
>  #define jbd2_might_wait_for_commit(j) \
>         do { \
>                 rwsem_acquire(&j->j_trans_commit_map, 0, 0, _THIS_IP_); \
> -               rwsem_release(&j->j_trans_commit_map, 1, _THIS_IP_); \
> +               rwsem_release(&j->j_trans_commit_map, _THIS_IP_); \
>         } while (0)
>
>  /* journal feature predicate functions */
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index b8a835fd611b..c50d01ef1414 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -349,8 +349,7 @@ extern void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>                          int trylock, int read, int check,
>                          struct lockdep_map *nest_lock, unsigned long ip);
>
> -extern void lock_release(struct lockdep_map *lock, int nested,
> -                        unsigned long ip);
> +extern void lock_release(struct lockdep_map *lock, unsigned long ip);
>
>  /*
>   * Same "read" as for lock_acquire(), except -1 means any.
> @@ -428,7 +427,7 @@ static inline void lockdep_set_selftest_task(struct task_struct *task)
>  }
>
>  # define lock_acquire(l, s, t, r, c, n, i)     do { } while (0)
> -# define lock_release(l, n, i)                 do { } while (0)
> +# define lock_release(l, i)                    do { } while (0)
>  # define lock_downgrade(l, i)                  do { } while (0)
>  # define lock_set_class(l, n, k, s, i)         do { } while (0)
>  # define lock_set_subclass(l, s, i)            do { } while (0)
> @@ -591,42 +590,42 @@ static inline void print_irqtrace_events(struct task_struct *curr)
>
>  #define spin_acquire(l, s, t, i)               lock_acquire_exclusive(l, s, t, NULL, i)
>  #define spin_acquire_nest(l, s, t, n, i)       lock_acquire_exclusive(l, s, t, n, i)
> -#define spin_release(l, n, i)                  lock_release(l, n, i)
> +#define spin_release(l, i)                     lock_release(l, i)
>
>  #define rwlock_acquire(l, s, t, i)             lock_acquire_exclusive(l, s, t, NULL, i)
>  #define rwlock_acquire_read(l, s, t, i)                lock_acquire_shared_recursive(l, s, t, NULL, i)
> -#define rwlock_release(l, n, i)                        lock_release(l, n, i)
> +#define rwlock_release(l, i)                   lock_release(l, i)
>
>  #define seqcount_acquire(l, s, t, i)           lock_acquire_exclusive(l, s, t, NULL, i)
>  #define seqcount_acquire_read(l, s, t, i)      lock_acquire_shared_recursive(l, s, t, NULL, i)
> -#define seqcount_release(l, n, i)              lock_release(l, n, i)
> +#define seqcount_release(l, i)                 lock_release(l, i)
>
>  #define mutex_acquire(l, s, t, i)              lock_acquire_exclusive(l, s, t, NULL, i)
>  #define mutex_acquire_nest(l, s, t, n, i)      lock_acquire_exclusive(l, s, t, n, i)
> -#define mutex_release(l, n, i)                 lock_release(l, n, i)
> +#define mutex_release(l, i)                    lock_release(l, i)
>
>  #define rwsem_acquire(l, s, t, i)              lock_acquire_exclusive(l, s, t, NULL, i)
>  #define rwsem_acquire_nest(l, s, t, n, i)      lock_acquire_exclusive(l, s, t, n, i)
>  #define rwsem_acquire_read(l, s, t, i)         lock_acquire_shared(l, s, t, NULL, i)
> -#define rwsem_release(l, n, i)                 lock_release(l, n, i)
> +#define rwsem_release(l, i)                    lock_release(l, i)
>
>  #define lock_map_acquire(l)                    lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
>  #define lock_map_acquire_read(l)               lock_acquire_shared_recursive(l, 0, 0, NULL, _THIS_IP_)
>  #define lock_map_acquire_tryread(l)            lock_acquire_shared_recursive(l, 0, 1, NULL, _THIS_IP_)
> -#define lock_map_release(l)                    lock_release(l, 1, _THIS_IP_)
> +#define lock_map_release(l)                    lock_release(l, _THIS_IP_)
>
>  #ifdef CONFIG_PROVE_LOCKING
>  # define might_lock(lock)                                              \
>  do {                                                                   \
>         typecheck(struct lockdep_map *, &(lock)->dep_map);              \
>         lock_acquire(&(lock)->dep_map, 0, 0, 0, 1, NULL, _THIS_IP_);    \
> -       lock_release(&(lock)->dep_map, 0, _THIS_IP_);                   \
> +       lock_release(&(lock)->dep_map, _THIS_IP_);                      \
>  } while (0)
>  # define might_lock_read(lock)                                                 \
>  do {                                                                   \
>         typecheck(struct lockdep_map *, &(lock)->dep_map);              \
>         lock_acquire(&(lock)->dep_map, 0, 0, 1, 1, NULL, _THIS_IP_);    \
> -       lock_release(&(lock)->dep_map, 0, _THIS_IP_);                   \
> +       lock_release(&(lock)->dep_map, _THIS_IP_);                      \
>  } while (0)
>
>  #define lockdep_assert_irqs_enabled()  do {                            \
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 3998cdf9cd14..ad2ca2a89d5b 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -93,7 +93,7 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
>                 __percpu_up_read(sem); /* Unconditional memory barrier */
>         preempt_enable();
>
> -       rwsem_release(&sem->rw_sem.dep_map, 1, _RET_IP_);
> +       rwsem_release(&sem->rw_sem.dep_map, _RET_IP_);
>  }
>
>  extern void percpu_down_write(struct percpu_rw_semaphore *);
> @@ -118,7 +118,7 @@ extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
>  static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
>                                         bool read, unsigned long ip)
>  {
> -       lock_release(&sem->rw_sem.dep_map, 1, ip);
> +       lock_release(&sem->rw_sem.dep_map, ip);
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>         if (!read)
>                 atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 75a2eded7aa2..269b31eab3d6 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -210,7 +210,7 @@ static inline void rcu_lock_acquire(struct lockdep_map *map)
>
>  static inline void rcu_lock_release(struct lockdep_map *map)
>  {
> -       lock_release(map, 1, _THIS_IP_);
> +       lock_release(map, _THIS_IP_);
>  }
>
>  extern struct lockdep_map rcu_lock_map;
> diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
> index 86ebb4bf9c6e..abfb53ab11be 100644
> --- a/include/linux/rwlock_api_smp.h
> +++ b/include/linux/rwlock_api_smp.h
> @@ -215,14 +215,14 @@ static inline void __raw_write_lock(rwlock_t *lock)
>
>  static inline void __raw_write_unlock(rwlock_t *lock)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_write_unlock(lock);
>         preempt_enable();
>  }
>
>  static inline void __raw_read_unlock(rwlock_t *lock)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_read_unlock(lock);
>         preempt_enable();
>  }
> @@ -230,7 +230,7 @@ static inline void __raw_read_unlock(rwlock_t *lock)
>  static inline void
>  __raw_read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_read_unlock(lock);
>         local_irq_restore(flags);
>         preempt_enable();
> @@ -238,7 +238,7 @@ static inline void __raw_read_unlock(rwlock_t *lock)
>
>  static inline void __raw_read_unlock_irq(rwlock_t *lock)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_read_unlock(lock);
>         local_irq_enable();
>         preempt_enable();
> @@ -246,7 +246,7 @@ static inline void __raw_read_unlock_irq(rwlock_t *lock)
>
>  static inline void __raw_read_unlock_bh(rwlock_t *lock)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_read_unlock(lock);
>         __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
>  }
> @@ -254,7 +254,7 @@ static inline void __raw_read_unlock_bh(rwlock_t *lock)
>  static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
>                                              unsigned long flags)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_write_unlock(lock);
>         local_irq_restore(flags);
>         preempt_enable();
> @@ -262,7 +262,7 @@ static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
>
>  static inline void __raw_write_unlock_irq(rwlock_t *lock)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_write_unlock(lock);
>         local_irq_enable();
>         preempt_enable();
> @@ -270,7 +270,7 @@ static inline void __raw_write_unlock_irq(rwlock_t *lock)
>
>  static inline void __raw_write_unlock_bh(rwlock_t *lock)
>  {
> -       rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +       rwlock_release(&lock->dep_map, _RET_IP_);
>         do_raw_write_unlock(lock);
>         __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
>  }
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index bcf4cf26b8c8..0491d963d47e 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -79,7 +79,7 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>
>         local_irq_save(flags);
>         seqcount_acquire_read(&l->dep_map, 0, 0, _RET_IP_);
> -       seqcount_release(&l->dep_map, 1, _RET_IP_);
> +       seqcount_release(&l->dep_map, _RET_IP_);
>         local_irq_restore(flags);
>  }
>
> @@ -384,7 +384,7 @@ static inline void write_seqcount_begin(seqcount_t *s)
>
>  static inline void write_seqcount_end(seqcount_t *s)
>  {
> -       seqcount_release(&s->dep_map, 1, _RET_IP_);
> +       seqcount_release(&s->dep_map, _RET_IP_);
>         raw_write_seqcount_end(s);
>  }
>
> diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
> index b762eaba4cdf..19a9be9d97ee 100644
> --- a/include/linux/spinlock_api_smp.h
> +++ b/include/linux/spinlock_api_smp.h
> @@ -147,7 +147,7 @@ static inline void __raw_spin_lock(raw_spinlock_t *lock)
>
>  static inline void __raw_spin_unlock(raw_spinlock_t *lock)
>  {
> -       spin_release(&lock->dep_map, 1, _RET_IP_);
> +       spin_release(&lock->dep_map, _RET_IP_);
>         do_raw_spin_unlock(lock);
>         preempt_enable();
>  }
> @@ -155,7 +155,7 @@ static inline void __raw_spin_unlock(raw_spinlock_t *lock)
>  static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
>                                             unsigned long flags)
>  {
> -       spin_release(&lock->dep_map, 1, _RET_IP_);
> +       spin_release(&lock->dep_map, _RET_IP_);
>         do_raw_spin_unlock(lock);
>         local_irq_restore(flags);
>         preempt_enable();
> @@ -163,7 +163,7 @@ static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
>
>  static inline void __raw_spin_unlock_irq(raw_spinlock_t *lock)
>  {
> -       spin_release(&lock->dep_map, 1, _RET_IP_);
> +       spin_release(&lock->dep_map, _RET_IP_);
>         do_raw_spin_unlock(lock);
>         local_irq_enable();
>         preempt_enable();
> @@ -171,7 +171,7 @@ static inline void __raw_spin_unlock_irq(raw_spinlock_t *lock)
>
>  static inline void __raw_spin_unlock_bh(raw_spinlock_t *lock)
>  {
> -       spin_release(&lock->dep_map, 1, _RET_IP_);
> +       spin_release(&lock->dep_map, _RET_IP_);
>         do_raw_spin_unlock(lock);
>         __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
>  }
> diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
> index 3af7c0e03be5..d7554252404c 100644
> --- a/include/linux/ww_mutex.h
> +++ b/include/linux/ww_mutex.h
> @@ -182,7 +182,7 @@ static inline void ww_acquire_done(struct ww_acquire_ctx *ctx)
>  static inline void ww_acquire_fini(struct ww_acquire_ctx *ctx)
>  {
>  #ifdef CONFIG_DEBUG_MUTEXES
> -       mutex_release(&ctx->dep_map, 0, _THIS_IP_);
> +       mutex_release(&ctx->dep_map, _THIS_IP_);
>
>         DEBUG_LOCKS_WARN_ON(ctx->acquired);
>         if (!IS_ENABLED(CONFIG_PROVE_LOCKING))
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2c53f1a1d905..e46db0c846d2 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1484,7 +1484,7 @@ static inline void sock_release_ownership(struct sock *sk)
>                 sk->sk_lock.owned = 0;
>
>                 /* The sk_lock has mutex_unlock() semantics: */
> -               mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
> +               mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
>         }
>  }
>
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 052580c33d26..dcfe2d37ad15 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -338,7 +338,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>                  * up_read_non_owner(). The rwsem_release() is called
>                  * here to release the lock from lockdep's perspective.
>                  */
> -               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
> +               rwsem_release(&current->mm->mmap_sem.dep_map, _RET_IP_);
>         }
>  }
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index e1967e9eddc2..97ed88e0cf72 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -336,7 +336,7 @@ static void lockdep_acquire_cpus_lock(void)
>
>  static void lockdep_release_cpus_lock(void)
>  {
> -       rwsem_release(&cpu_hotplug_lock.rw_sem.dep_map, 1, _THIS_IP_);
> +       rwsem_release(&cpu_hotplug_lock.rw_sem.dep_map, _THIS_IP_);
>  }
>
>  /*
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 233459c03b5a..8123518f9045 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -4491,8 +4491,7 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>  }
>  EXPORT_SYMBOL_GPL(lock_acquire);
>
> -void lock_release(struct lockdep_map *lock, int nested,
> -                         unsigned long ip)
> +void lock_release(struct lockdep_map *lock, unsigned long ip)
>  {
>         unsigned long flags;
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 468a9b8422e3..5352ce50a97e 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -1091,7 +1091,7 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
>  err_early_kill:
>         spin_unlock(&lock->wait_lock);
>         debug_mutex_free_waiter(&waiter);
> -       mutex_release(&lock->dep_map, 1, ip);
> +       mutex_release(&lock->dep_map, ip);
>         preempt_enable();
>         return ret;
>  }
> @@ -1225,7 +1225,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>         DEFINE_WAKE_Q(wake_q);
>         unsigned long owner;
>
> -       mutex_release(&lock->dep_map, 1, ip);
> +       mutex_release(&lock->dep_map, ip);
>
>         /*
>          * Release the lock before (potentially) taking the spinlock such that
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 2874bf556162..851bbb10819d 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1517,7 +1517,7 @@ int __sched rt_mutex_lock_interruptible(struct rt_mutex *lock)
>         mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>         ret = rt_mutex_fastlock(lock, TASK_INTERRUPTIBLE, rt_mutex_slowlock);
>         if (ret)
> -               mutex_release(&lock->dep_map, 1, _RET_IP_);
> +               mutex_release(&lock->dep_map, _RET_IP_);
>
>         return ret;
>  }
> @@ -1561,7 +1561,7 @@ int __sched __rt_mutex_futex_trylock(struct rt_mutex *lock)
>                                        RT_MUTEX_MIN_CHAINWALK,
>                                        rt_mutex_slowlock);
>         if (ret)
> -               mutex_release(&lock->dep_map, 1, _RET_IP_);
> +               mutex_release(&lock->dep_map, _RET_IP_);
>
>         return ret;
>  }
> @@ -1600,7 +1600,7 @@ int __sched rt_mutex_trylock(struct rt_mutex *lock)
>   */
>  void __sched rt_mutex_unlock(struct rt_mutex *lock)
>  {
> -       mutex_release(&lock->dep_map, 1, _RET_IP_);
> +       mutex_release(&lock->dep_map, _RET_IP_);
>         rt_mutex_fastunlock(lock, rt_mutex_slowunlock);
>  }
>  EXPORT_SYMBOL_GPL(rt_mutex_unlock);
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index eef04551eae7..44e68761f432 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1504,7 +1504,7 @@ int __sched down_read_killable(struct rw_semaphore *sem)
>         rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
>
>         if (LOCK_CONTENDED_RETURN(sem, __down_read_trylock, __down_read_killable)) {
> -               rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +               rwsem_release(&sem->dep_map, _RET_IP_);
>                 return -EINTR;
>         }
>
> @@ -1546,7 +1546,7 @@ int __sched down_write_killable(struct rw_semaphore *sem)
>
>         if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
>                                   __down_write_killable)) {
> -               rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +               rwsem_release(&sem->dep_map, _RET_IP_);
>                 return -EINTR;
>         }
>
> @@ -1573,7 +1573,7 @@ int down_write_trylock(struct rw_semaphore *sem)
>   */
>  void up_read(struct rw_semaphore *sem)
>  {
> -       rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +       rwsem_release(&sem->dep_map, _RET_IP_);
>         __up_read(sem);
>  }
>  EXPORT_SYMBOL(up_read);
> @@ -1583,7 +1583,7 @@ void up_read(struct rw_semaphore *sem)
>   */
>  void up_write(struct rw_semaphore *sem)
>  {
> -       rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +       rwsem_release(&sem->dep_map, _RET_IP_);
>         __up_write(sem);
>  }
>  EXPORT_SYMBOL(up_write);
> @@ -1639,7 +1639,7 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
>
>         if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
>                                   __down_write_killable)) {
> -               rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +               rwsem_release(&sem->dep_map, _RET_IP_);
>                 return -EINTR;
>         }
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ca65327a6de8..c8be5a0f5259 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -248,7 +248,7 @@ static void __up_console_sem(unsigned long ip)
>  {
>         unsigned long flags;
>
> -       mutex_release(&console_lock_dep_map, 1, ip);
> +       mutex_release(&console_lock_dep_map, ip);
>
>         printk_safe_enter_irqsave(flags);
>         up(&console_sem);
> @@ -1679,20 +1679,20 @@ static int console_lock_spinning_disable_and_check(void)
>         raw_spin_unlock(&console_owner_lock);
>
>         if (!waiter) {
> -               spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +               spin_release(&console_owner_dep_map, _THIS_IP_);
>                 return 0;
>         }
>
>         /* The waiter is now free to continue */
>         WRITE_ONCE(console_waiter, false);
>
> -       spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +       spin_release(&console_owner_dep_map, _THIS_IP_);
>
>         /*
>          * Hand off console_lock to waiter. The waiter will perform
>          * the up(). After this, the waiter is the console_lock owner.
>          */
> -       mutex_release(&console_lock_dep_map, 1, _THIS_IP_);
> +       mutex_release(&console_lock_dep_map, _THIS_IP_);
>         return 1;
>  }
>
> @@ -1746,7 +1746,7 @@ static int console_trylock_spinning(void)
>         /* Owner will clear console_waiter on hand off */
>         while (READ_ONCE(console_waiter))
>                 cpu_relax();
> -       spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +       spin_release(&console_owner_dep_map, _THIS_IP_);
>
>         printk_safe_exit_irqrestore(flags);
>         /*
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f9a1346a5fa9..f845693e8e75 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3105,7 +3105,7 @@ static inline void finish_task(struct task_struct *prev)
>          * do an early lockdep release here:
>          */
>         rq_unpin_lock(rq, rf);
> -       spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> +       spin_release(&rq->lock.dep_map, _THIS_IP_);
>  #ifdef CONFIG_DEBUG_SPINLOCK
>         /* this is a valid case when another task releases the spinlock */
>         rq->lock.owner = next;
> diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
> index a1705545e6ac..14f44f59e733 100644
> --- a/lib/locking-selftest.c
> +++ b/lib/locking-selftest.c
> @@ -1475,7 +1475,7 @@ static void ww_test_edeadlk_normal(void)
>
>         mutex_lock(&o2.base);
>         o2.ctx = &t2;
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>
>         WWAI(&t);
>         t2 = t;
> @@ -1500,7 +1500,7 @@ static void ww_test_edeadlk_normal_slow(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         WWAI(&t);
> @@ -1527,7 +1527,7 @@ static void ww_test_edeadlk_no_unlock(void)
>
>         mutex_lock(&o2.base);
>         o2.ctx = &t2;
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>
>         WWAI(&t);
>         t2 = t;
> @@ -1551,7 +1551,7 @@ static void ww_test_edeadlk_no_unlock_slow(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         WWAI(&t);
> @@ -1576,7 +1576,7 @@ static void ww_test_edeadlk_acquire_more(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         WWAI(&t);
> @@ -1597,7 +1597,7 @@ static void ww_test_edeadlk_acquire_more_slow(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         WWAI(&t);
> @@ -1618,11 +1618,11 @@ static void ww_test_edeadlk_acquire_more_edeadlk(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         mutex_lock(&o3.base);
> -       mutex_release(&o3.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o3.base.dep_map, _THIS_IP_);
>         o3.ctx = &t2;
>
>         WWAI(&t);
> @@ -1644,11 +1644,11 @@ static void ww_test_edeadlk_acquire_more_edeadlk_slow(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         mutex_lock(&o3.base);
> -       mutex_release(&o3.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o3.base.dep_map, _THIS_IP_);
>         o3.ctx = &t2;
>
>         WWAI(&t);
> @@ -1669,7 +1669,7 @@ static void ww_test_edeadlk_acquire_wrong(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         WWAI(&t);
> @@ -1694,7 +1694,7 @@ static void ww_test_edeadlk_acquire_wrong_slow(void)
>         int ret;
>
>         mutex_lock(&o2.base);
> -       mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +       mutex_release(&o2.base.dep_map, _THIS_IP_);
>         o2.ctx = &t2;
>
>         WWAI(&t);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1c4c08b45e44..3956ab6dba14 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1800,7 +1800,7 @@ static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
>         struct mem_cgroup *iter;
>
>         spin_lock(&memcg_oom_lock);
> -       mutex_release(&memcg_oom_lock_dep_map, 1, _RET_IP_);
> +       mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
>         for_each_mem_cgroup_tree(iter, memcg)
>                 iter->oom_lock = false;
>         spin_unlock(&memcg_oom_lock);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 07863edbe6fc..a988e70cdac5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -521,7 +521,7 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
>
>                 rc = sk_backlog_rcv(sk, skb);
>
> -               mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
> +               mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
>         } else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
>                 bh_unlock_sock(sk);
>                 atomic_inc(&sk->sk_drops);
> diff --git a/tools/lib/lockdep/include/liblockdep/common.h b/tools/lib/lockdep/include/liblockdep/common.h
> index a81d91d4fc78..a6d7ee5f18ba 100644
> --- a/tools/lib/lockdep/include/liblockdep/common.h
> +++ b/tools/lib/lockdep/include/liblockdep/common.h
> @@ -42,8 +42,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
>  void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>                         int trylock, int read, int check,
>                         struct lockdep_map *nest_lock, unsigned long ip);
> -void lock_release(struct lockdep_map *lock, int nested,
> -                       unsigned long ip);
> +void lock_release(struct lockdep_map *lock, unsigned long ip);
>  void lockdep_reset_lock(struct lockdep_map *lock);
>  void lockdep_register_key(struct lock_class_key *key);
>  void lockdep_unregister_key(struct lock_class_key *key);
> diff --git a/tools/lib/lockdep/include/liblockdep/mutex.h b/tools/lib/lockdep/include/liblockdep/mutex.h
> index 783dd0df06f9..bd106b82759b 100644
> --- a/tools/lib/lockdep/include/liblockdep/mutex.h
> +++ b/tools/lib/lockdep/include/liblockdep/mutex.h
> @@ -42,7 +42,7 @@ static inline int liblockdep_pthread_mutex_lock(liblockdep_pthread_mutex_t *lock
>
>  static inline int liblockdep_pthread_mutex_unlock(liblockdep_pthread_mutex_t *lock)
>  {
> -       lock_release(&lock->dep_map, 0, (unsigned long)_RET_IP_);
> +       lock_release(&lock->dep_map, (unsigned long)_RET_IP_);
>         return pthread_mutex_unlock(&lock->mutex);
>  }
>
> diff --git a/tools/lib/lockdep/include/liblockdep/rwlock.h b/tools/lib/lockdep/include/liblockdep/rwlock.h
> index 365762e3a1ea..6d5d2932bf4d 100644
> --- a/tools/lib/lockdep/include/liblockdep/rwlock.h
> +++ b/tools/lib/lockdep/include/liblockdep/rwlock.h
> @@ -44,7 +44,7 @@ static inline int liblockdep_pthread_rwlock_rdlock(liblockdep_pthread_rwlock_t *
>
>  static inline int liblockdep_pthread_rwlock_unlock(liblockdep_pthread_rwlock_t *lock)
>  {
> -       lock_release(&lock->dep_map, 0, (unsigned long)_RET_IP_);
> +       lock_release(&lock->dep_map, (unsigned long)_RET_IP_);
>         return pthread_rwlock_unlock(&lock->rwlock);
>  }
>
> diff --git a/tools/lib/lockdep/preload.c b/tools/lib/lockdep/preload.c
> index 76245d16196d..8f1adbe887b2 100644
> --- a/tools/lib/lockdep/preload.c
> +++ b/tools/lib/lockdep/preload.c
> @@ -270,7 +270,7 @@ int pthread_mutex_lock(pthread_mutex_t *mutex)
>          */
>         r = ll_pthread_mutex_lock(mutex);
>         if (r)
> -               lock_release(&__get_lock(mutex)->dep_map, 0, (unsigned long)_RET_IP_);
> +               lock_release(&__get_lock(mutex)->dep_map, (unsigned long)_RET_IP_);
>
>         return r;
>  }
> @@ -284,7 +284,7 @@ int pthread_mutex_trylock(pthread_mutex_t *mutex)
>         lock_acquire(&__get_lock(mutex)->dep_map, 0, 1, 0, 1, NULL, (unsigned long)_RET_IP_);
>         r = ll_pthread_mutex_trylock(mutex);
>         if (r)
> -               lock_release(&__get_lock(mutex)->dep_map, 0, (unsigned long)_RET_IP_);
> +               lock_release(&__get_lock(mutex)->dep_map, (unsigned long)_RET_IP_);
>
>         return r;
>  }
> @@ -295,7 +295,7 @@ int pthread_mutex_unlock(pthread_mutex_t *mutex)
>
>         try_init_preload();
>
> -       lock_release(&__get_lock(mutex)->dep_map, 0, (unsigned long)_RET_IP_);
> +       lock_release(&__get_lock(mutex)->dep_map, (unsigned long)_RET_IP_);
>         /*
>          * Just like taking a lock, only in reverse!
>          *
> @@ -355,7 +355,7 @@ int pthread_rwlock_rdlock(pthread_rwlock_t *rwlock)
>         lock_acquire(&__get_lock(rwlock)->dep_map, 0, 0, 2, 1, NULL, (unsigned long)_RET_IP_);
>         r = ll_pthread_rwlock_rdlock(rwlock);
>         if (r)
> -               lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +               lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>
>         return r;
>  }
> @@ -369,7 +369,7 @@ int pthread_rwlock_tryrdlock(pthread_rwlock_t *rwlock)
>         lock_acquire(&__get_lock(rwlock)->dep_map, 0, 1, 2, 1, NULL, (unsigned long)_RET_IP_);
>         r = ll_pthread_rwlock_tryrdlock(rwlock);
>         if (r)
> -               lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +               lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>
>         return r;
>  }
> @@ -383,7 +383,7 @@ int pthread_rwlock_trywrlock(pthread_rwlock_t *rwlock)
>         lock_acquire(&__get_lock(rwlock)->dep_map, 0, 1, 0, 1, NULL, (unsigned long)_RET_IP_);
>         r = ll_pthread_rwlock_trywrlock(rwlock);
>         if (r)
> -                lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +               lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>
>         return r;
>  }
> @@ -397,7 +397,7 @@ int pthread_rwlock_wrlock(pthread_rwlock_t *rwlock)
>         lock_acquire(&__get_lock(rwlock)->dep_map, 0, 0, 0, 1, NULL, (unsigned long)_RET_IP_);
>         r = ll_pthread_rwlock_wrlock(rwlock);
>         if (r)
> -               lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +               lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>
>         return r;
>  }
> @@ -408,7 +408,7 @@ int pthread_rwlock_unlock(pthread_rwlock_t *rwlock)
>
>          init_preload();
>
> -       lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +       lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>         r = ll_pthread_rwlock_unlock(rwlock);
>         if (r)
>                 lock_acquire(&__get_lock(rwlock)->dep_map, 0, 0, 0, 1, NULL, (unsigned long)_RET_IP_);
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: [PATCH -next] treewide: remove unused argument in lock_release()
From: Peter Zijlstra @ 2019-10-08 19:18 UTC (permalink / raw)
  To: Qian Cai, akpm, mingo, will, linux-kernel, linux-api,
	maarten.lankhorst, mripard, sean, airlied, dri-devel, gregkh,
	jslaby, viro, linux-fsdevel, joonas.lahtinen, rodrigo.vivi,
	intel-gfx, tytso, jack, linux-ext4, tj, mark, jlbec, joseph.qi,
	ocfs2-devel, davem, st, daniel, netdev, bpf, duyuyang, juri.lelli,
	vincent.guittot
In-Reply-To: <20191008163351.GR16989@phenom.ffwll.local>

On Tue, Oct 08, 2019 at 06:33:51PM +0200, Daniel Vetter wrote:
> On Thu, Sep 19, 2019 at 12:09:40PM -0400, Qian Cai wrote:
> > Since the commit b4adfe8e05f1 ("locking/lockdep: Remove unused argument
> > in __lock_release"), @nested is no longer used in lock_release(), so
> > remove it from all lock_release() calls and friends.
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> 
> Ack on the concept and for the drm parts (and feel free to keep the ack if
> you inevitably have to respin this later on). Might result in some
> conflicts, but welp we need to keep Linus busy :-)
> 
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Thanks Daniel!

^ permalink raw reply

* Re: [PATCH -next] treewide: remove unused argument in lock_release()
From: Daniel Vetter @ 2019-10-08 16:33 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, mingo, peterz, will, linux-kernel, linux-api,
	maarten.lankhorst, mripard, sean, airlied, daniel, dri-devel,
	gregkh, jslaby, viro, linux-fsdevel, joonas.lahtinen,
	rodrigo.vivi, intel-gfx, tytso, jack, linux-ext4, tj, mark, jlbec,
	joseph.qi, ocfs2-devel, davem, st, daniel, netdev, bpf, duyuyang,
	juri.lelli
In-Reply-To: <1568909380-32199-1-git-send-email-cai@lca.pw>

On Thu, Sep 19, 2019 at 12:09:40PM -0400, Qian Cai wrote:
> Since the commit b4adfe8e05f1 ("locking/lockdep: Remove unused argument
> in __lock_release"), @nested is no longer used in lock_release(), so
> remove it from all lock_release() calls and friends.
> 
> Signed-off-by: Qian Cai <cai@lca.pw>

Ack on the concept and for the drm parts (and feel free to keep the ack if
you inevitably have to respin this later on). Might result in some
conflicts, but welp we need to keep Linus busy :-)

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/drm_connector.c               |  2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  6 +++---
>  drivers/gpu/drm/i915/gt/intel_engine_pm.c     |  2 +-
>  drivers/gpu/drm/i915/i915_request.c           |  2 +-
>  drivers/tty/tty_ldsem.c                       |  8 ++++----
>  fs/dcache.c                                   |  2 +-
>  fs/jbd2/transaction.c                         |  4 ++--
>  fs/kernfs/dir.c                               |  4 ++--
>  fs/ocfs2/dlmglue.c                            |  2 +-
>  include/linux/jbd2.h                          |  2 +-
>  include/linux/lockdep.h                       | 21 ++++++++++-----------
>  include/linux/percpu-rwsem.h                  |  4 ++--
>  include/linux/rcupdate.h                      |  2 +-
>  include/linux/rwlock_api_smp.h                | 16 ++++++++--------
>  include/linux/seqlock.h                       |  4 ++--
>  include/linux/spinlock_api_smp.h              |  8 ++++----
>  include/linux/ww_mutex.h                      |  2 +-
>  include/net/sock.h                            |  2 +-
>  kernel/bpf/stackmap.c                         |  2 +-
>  kernel/cpu.c                                  |  2 +-
>  kernel/locking/lockdep.c                      |  3 +--
>  kernel/locking/mutex.c                        |  4 ++--
>  kernel/locking/rtmutex.c                      |  6 +++---
>  kernel/locking/rwsem.c                        | 10 +++++-----
>  kernel/printk/printk.c                        | 10 +++++-----
>  kernel/sched/core.c                           |  2 +-
>  lib/locking-selftest.c                        | 24 ++++++++++++------------
>  mm/memcontrol.c                               |  2 +-
>  net/core/sock.c                               |  2 +-
>  tools/lib/lockdep/include/liblockdep/common.h |  3 +--
>  tools/lib/lockdep/include/liblockdep/mutex.h  |  2 +-
>  tools/lib/lockdep/include/liblockdep/rwlock.h |  2 +-
>  tools/lib/lockdep/preload.c                   | 16 ++++++++--------
>  33 files changed, 90 insertions(+), 93 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c
> index 4c766624b20d..4a8b2e5c2af6 100644
> --- a/drivers/gpu/drm/drm_connector.c
> +++ b/drivers/gpu/drm/drm_connector.c
> @@ -719,7 +719,7 @@ void drm_connector_list_iter_end(struct drm_connector_list_iter *iter)
>  		__drm_connector_put_safe(iter->conn);
>  		spin_unlock_irqrestore(&config->connector_list_lock, flags);
>  	}
> -	lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
> +	lock_release(&connector_list_iter_dep_map, _RET_IP_);
>  }
>  EXPORT_SYMBOL(drm_connector_list_iter_end);
>  
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> index edd21d14e64f..1a51b3598d63 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> @@ -509,14 +509,14 @@ void i915_gem_shrinker_taints_mutex(struct drm_i915_private *i915,
>  		      I915_MM_SHRINKER, 0, _RET_IP_);
>  
>  	mutex_acquire(&mutex->dep_map, 0, 0, _RET_IP_);
> -	mutex_release(&mutex->dep_map, 0, _RET_IP_);
> +	mutex_release(&mutex->dep_map, _RET_IP_);
>  
> -	mutex_release(&i915->drm.struct_mutex.dep_map, 0, _RET_IP_);
> +	mutex_release(&i915->drm.struct_mutex.dep_map, _RET_IP_);
>  
>  	fs_reclaim_release(GFP_KERNEL);
>  
>  	if (unlock)
> -		mutex_release(&i915->drm.struct_mutex.dep_map, 0, _RET_IP_);
> +		mutex_release(&i915->drm.struct_mutex.dep_map, _RET_IP_);
>  }
>  
>  #define obj_to_i915(obj__) to_i915((obj__)->base.dev)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index 65b5ca74b394..7f647243b3b9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -52,7 +52,7 @@ static inline unsigned long __timeline_mark_lock(struct intel_context *ce)
>  static inline void __timeline_mark_unlock(struct intel_context *ce,
>  					  unsigned long flags)
>  {
> -	mutex_release(&ce->timeline->mutex.dep_map, 0, _THIS_IP_);
> +	mutex_release(&ce->timeline->mutex.dep_map, _THIS_IP_);
>  	local_irq_restore(flags);
>  }
>  
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index a53777dd371c..e1f1be4d0531 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1456,7 +1456,7 @@ long i915_request_wait(struct i915_request *rq,
>  	dma_fence_remove_callback(&rq->fence, &wait.cb);
>  
>  out:
> -	mutex_release(&rq->engine->gt->reset.mutex.dep_map, 0, _THIS_IP_);
> +	mutex_release(&rq->engine->gt->reset.mutex.dep_map, _THIS_IP_);
>  	trace_i915_request_wait_end(rq);
>  	return timeout;
>  }
> diff --git a/drivers/tty/tty_ldsem.c b/drivers/tty/tty_ldsem.c
> index 60ff236a3d63..ce8291053af3 100644
> --- a/drivers/tty/tty_ldsem.c
> +++ b/drivers/tty/tty_ldsem.c
> @@ -303,7 +303,7 @@ static int __ldsem_down_read_nested(struct ld_semaphore *sem,
>  	if (count <= 0) {
>  		lock_contended(&sem->dep_map, _RET_IP_);
>  		if (!down_read_failed(sem, count, timeout)) {
> -			rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +			rwsem_release(&sem->dep_map, _RET_IP_);
>  			return 0;
>  		}
>  	}
> @@ -322,7 +322,7 @@ static int __ldsem_down_write_nested(struct ld_semaphore *sem,
>  	if ((count & LDSEM_ACTIVE_MASK) != LDSEM_ACTIVE_BIAS) {
>  		lock_contended(&sem->dep_map, _RET_IP_);
>  		if (!down_write_failed(sem, count, timeout)) {
> -			rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +			rwsem_release(&sem->dep_map, _RET_IP_);
>  			return 0;
>  		}
>  	}
> @@ -390,7 +390,7 @@ void ldsem_up_read(struct ld_semaphore *sem)
>  {
>  	long count;
>  
> -	rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +	rwsem_release(&sem->dep_map, _RET_IP_);
>  
>  	count = atomic_long_add_return(-LDSEM_READ_BIAS, &sem->count);
>  	if (count < 0 && (count & LDSEM_ACTIVE_MASK) == 0)
> @@ -404,7 +404,7 @@ void ldsem_up_write(struct ld_semaphore *sem)
>  {
>  	long count;
>  
> -	rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +	rwsem_release(&sem->dep_map, _RET_IP_);
>  
>  	count = atomic_long_add_return(-LDSEM_WRITE_BIAS, &sem->count);
>  	if (count < 0)
> diff --git a/fs/dcache.c b/fs/dcache.c
> index e88cf0554e65..f7931b682a0d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1319,7 +1319,7 @@ static void d_walk(struct dentry *parent, void *data,
>  
>  		if (!list_empty(&dentry->d_subdirs)) {
>  			spin_unlock(&this_parent->d_lock);
> -			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
> +			spin_release(&dentry->d_lock.dep_map, _RET_IP_);
>  			this_parent = dentry;
>  			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
>  			goto repeat;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index bee8498d7792..b25ebdcabfa3 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -713,7 +713,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
>  	if (need_to_start)
>  		jbd2_log_start_commit(journal, tid);
>  
> -	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
> +	rwsem_release(&journal->j_trans_commit_map, _THIS_IP_);
>  	handle->h_buffer_credits = nblocks;
>  	/*
>  	 * Restore the original nofs context because the journal restart
> @@ -1848,7 +1848,7 @@ int jbd2_journal_stop(handle_t *handle)
>  			wake_up(&journal->j_wait_transaction_locked);
>  	}
>  
> -	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
> +	rwsem_release(&journal->j_trans_commit_map, _THIS_IP_);
>  
>  	if (wait_for_commit)
>  		err = jbd2_log_wait_commit(journal, tid);
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6ebae6bbe6a5..c45b82feac9a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -438,7 +438,7 @@ void kernfs_put_active(struct kernfs_node *kn)
>  		return;
>  
>  	if (kernfs_lockdep(kn))
> -		rwsem_release(&kn->dep_map, 1, _RET_IP_);
> +		rwsem_release(&kn->dep_map, _RET_IP_);
>  	v = atomic_dec_return(&kn->active);
>  	if (likely(v != KN_DEACTIVATED_BIAS))
>  		return;
> @@ -476,7 +476,7 @@ static void kernfs_drain(struct kernfs_node *kn)
>  
>  	if (kernfs_lockdep(kn)) {
>  		lock_acquired(&kn->dep_map, _RET_IP_);
> -		rwsem_release(&kn->dep_map, 1, _RET_IP_);
> +		rwsem_release(&kn->dep_map, _RET_IP_);
>  	}
>  
>  	kernfs_drain_open_files(kn);
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index ad594fef2ab0..71975b9b142c 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -1687,7 +1687,7 @@ static void __ocfs2_cluster_unlock(struct ocfs2_super *osb,
>  	spin_unlock_irqrestore(&lockres->l_lock, flags);
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  	if (lockres->l_lockdep_map.key != NULL)
> -		rwsem_release(&lockres->l_lockdep_map, 1, caller_ip);
> +		rwsem_release(&lockres->l_lockdep_map, caller_ip);
>  #endif
>  }
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 603fbc4e2f70..564793c24d12 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1170,7 +1170,7 @@ struct journal_s
>  #define jbd2_might_wait_for_commit(j) \
>  	do { \
>  		rwsem_acquire(&j->j_trans_commit_map, 0, 0, _THIS_IP_); \
> -		rwsem_release(&j->j_trans_commit_map, 1, _THIS_IP_); \
> +		rwsem_release(&j->j_trans_commit_map, _THIS_IP_); \
>  	} while (0)
>  
>  /* journal feature predicate functions */
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index b8a835fd611b..c50d01ef1414 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -349,8 +349,7 @@ extern void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>  			 int trylock, int read, int check,
>  			 struct lockdep_map *nest_lock, unsigned long ip);
>  
> -extern void lock_release(struct lockdep_map *lock, int nested,
> -			 unsigned long ip);
> +extern void lock_release(struct lockdep_map *lock, unsigned long ip);
>  
>  /*
>   * Same "read" as for lock_acquire(), except -1 means any.
> @@ -428,7 +427,7 @@ static inline void lockdep_set_selftest_task(struct task_struct *task)
>  }
>  
>  # define lock_acquire(l, s, t, r, c, n, i)	do { } while (0)
> -# define lock_release(l, n, i)			do { } while (0)
> +# define lock_release(l, i)			do { } while (0)
>  # define lock_downgrade(l, i)			do { } while (0)
>  # define lock_set_class(l, n, k, s, i)		do { } while (0)
>  # define lock_set_subclass(l, s, i)		do { } while (0)
> @@ -591,42 +590,42 @@ static inline void print_irqtrace_events(struct task_struct *curr)
>  
>  #define spin_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
>  #define spin_acquire_nest(l, s, t, n, i)	lock_acquire_exclusive(l, s, t, n, i)
> -#define spin_release(l, n, i)			lock_release(l, n, i)
> +#define spin_release(l, i)			lock_release(l, i)
>  
>  #define rwlock_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
>  #define rwlock_acquire_read(l, s, t, i)		lock_acquire_shared_recursive(l, s, t, NULL, i)
> -#define rwlock_release(l, n, i)			lock_release(l, n, i)
> +#define rwlock_release(l, i)			lock_release(l, i)
>  
>  #define seqcount_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
>  #define seqcount_acquire_read(l, s, t, i)	lock_acquire_shared_recursive(l, s, t, NULL, i)
> -#define seqcount_release(l, n, i)		lock_release(l, n, i)
> +#define seqcount_release(l, i)			lock_release(l, i)
>  
>  #define mutex_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
>  #define mutex_acquire_nest(l, s, t, n, i)	lock_acquire_exclusive(l, s, t, n, i)
> -#define mutex_release(l, n, i)			lock_release(l, n, i)
> +#define mutex_release(l, i)			lock_release(l, i)
>  
>  #define rwsem_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
>  #define rwsem_acquire_nest(l, s, t, n, i)	lock_acquire_exclusive(l, s, t, n, i)
>  #define rwsem_acquire_read(l, s, t, i)		lock_acquire_shared(l, s, t, NULL, i)
> -#define rwsem_release(l, n, i)			lock_release(l, n, i)
> +#define rwsem_release(l, i)			lock_release(l, i)
>  
>  #define lock_map_acquire(l)			lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
>  #define lock_map_acquire_read(l)		lock_acquire_shared_recursive(l, 0, 0, NULL, _THIS_IP_)
>  #define lock_map_acquire_tryread(l)		lock_acquire_shared_recursive(l, 0, 1, NULL, _THIS_IP_)
> -#define lock_map_release(l)			lock_release(l, 1, _THIS_IP_)
> +#define lock_map_release(l)			lock_release(l, _THIS_IP_)
>  
>  #ifdef CONFIG_PROVE_LOCKING
>  # define might_lock(lock) 						\
>  do {									\
>  	typecheck(struct lockdep_map *, &(lock)->dep_map);		\
>  	lock_acquire(&(lock)->dep_map, 0, 0, 0, 1, NULL, _THIS_IP_);	\
> -	lock_release(&(lock)->dep_map, 0, _THIS_IP_);			\
> +	lock_release(&(lock)->dep_map, _THIS_IP_);			\
>  } while (0)
>  # define might_lock_read(lock) 						\
>  do {									\
>  	typecheck(struct lockdep_map *, &(lock)->dep_map);		\
>  	lock_acquire(&(lock)->dep_map, 0, 0, 1, 1, NULL, _THIS_IP_);	\
> -	lock_release(&(lock)->dep_map, 0, _THIS_IP_);			\
> +	lock_release(&(lock)->dep_map, _THIS_IP_);			\
>  } while (0)
>  
>  #define lockdep_assert_irqs_enabled()	do {				\
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 3998cdf9cd14..ad2ca2a89d5b 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -93,7 +93,7 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
>  		__percpu_up_read(sem); /* Unconditional memory barrier */
>  	preempt_enable();
>  
> -	rwsem_release(&sem->rw_sem.dep_map, 1, _RET_IP_);
> +	rwsem_release(&sem->rw_sem.dep_map, _RET_IP_);
>  }
>  
>  extern void percpu_down_write(struct percpu_rw_semaphore *);
> @@ -118,7 +118,7 @@ extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
>  static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
>  					bool read, unsigned long ip)
>  {
> -	lock_release(&sem->rw_sem.dep_map, 1, ip);
> +	lock_release(&sem->rw_sem.dep_map, ip);
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  	if (!read)
>  		atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 75a2eded7aa2..269b31eab3d6 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -210,7 +210,7 @@ static inline void rcu_lock_acquire(struct lockdep_map *map)
>  
>  static inline void rcu_lock_release(struct lockdep_map *map)
>  {
> -	lock_release(map, 1, _THIS_IP_);
> +	lock_release(map, _THIS_IP_);
>  }
>  
>  extern struct lockdep_map rcu_lock_map;
> diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
> index 86ebb4bf9c6e..abfb53ab11be 100644
> --- a/include/linux/rwlock_api_smp.h
> +++ b/include/linux/rwlock_api_smp.h
> @@ -215,14 +215,14 @@ static inline void __raw_write_lock(rwlock_t *lock)
>  
>  static inline void __raw_write_unlock(rwlock_t *lock)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_write_unlock(lock);
>  	preempt_enable();
>  }
>  
>  static inline void __raw_read_unlock(rwlock_t *lock)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_read_unlock(lock);
>  	preempt_enable();
>  }
> @@ -230,7 +230,7 @@ static inline void __raw_read_unlock(rwlock_t *lock)
>  static inline void
>  __raw_read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_read_unlock(lock);
>  	local_irq_restore(flags);
>  	preempt_enable();
> @@ -238,7 +238,7 @@ static inline void __raw_read_unlock(rwlock_t *lock)
>  
>  static inline void __raw_read_unlock_irq(rwlock_t *lock)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_read_unlock(lock);
>  	local_irq_enable();
>  	preempt_enable();
> @@ -246,7 +246,7 @@ static inline void __raw_read_unlock_irq(rwlock_t *lock)
>  
>  static inline void __raw_read_unlock_bh(rwlock_t *lock)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_read_unlock(lock);
>  	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
>  }
> @@ -254,7 +254,7 @@ static inline void __raw_read_unlock_bh(rwlock_t *lock)
>  static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
>  					     unsigned long flags)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_write_unlock(lock);
>  	local_irq_restore(flags);
>  	preempt_enable();
> @@ -262,7 +262,7 @@ static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
>  
>  static inline void __raw_write_unlock_irq(rwlock_t *lock)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_write_unlock(lock);
>  	local_irq_enable();
>  	preempt_enable();
> @@ -270,7 +270,7 @@ static inline void __raw_write_unlock_irq(rwlock_t *lock)
>  
>  static inline void __raw_write_unlock_bh(rwlock_t *lock)
>  {
> -	rwlock_release(&lock->dep_map, 1, _RET_IP_);
> +	rwlock_release(&lock->dep_map, _RET_IP_);
>  	do_raw_write_unlock(lock);
>  	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
>  }
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index bcf4cf26b8c8..0491d963d47e 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -79,7 +79,7 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>  
>  	local_irq_save(flags);
>  	seqcount_acquire_read(&l->dep_map, 0, 0, _RET_IP_);
> -	seqcount_release(&l->dep_map, 1, _RET_IP_);
> +	seqcount_release(&l->dep_map, _RET_IP_);
>  	local_irq_restore(flags);
>  }
>  
> @@ -384,7 +384,7 @@ static inline void write_seqcount_begin(seqcount_t *s)
>  
>  static inline void write_seqcount_end(seqcount_t *s)
>  {
> -	seqcount_release(&s->dep_map, 1, _RET_IP_);
> +	seqcount_release(&s->dep_map, _RET_IP_);
>  	raw_write_seqcount_end(s);
>  }
>  
> diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
> index b762eaba4cdf..19a9be9d97ee 100644
> --- a/include/linux/spinlock_api_smp.h
> +++ b/include/linux/spinlock_api_smp.h
> @@ -147,7 +147,7 @@ static inline void __raw_spin_lock(raw_spinlock_t *lock)
>  
>  static inline void __raw_spin_unlock(raw_spinlock_t *lock)
>  {
> -	spin_release(&lock->dep_map, 1, _RET_IP_);
> +	spin_release(&lock->dep_map, _RET_IP_);
>  	do_raw_spin_unlock(lock);
>  	preempt_enable();
>  }
> @@ -155,7 +155,7 @@ static inline void __raw_spin_unlock(raw_spinlock_t *lock)
>  static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
>  					    unsigned long flags)
>  {
> -	spin_release(&lock->dep_map, 1, _RET_IP_);
> +	spin_release(&lock->dep_map, _RET_IP_);
>  	do_raw_spin_unlock(lock);
>  	local_irq_restore(flags);
>  	preempt_enable();
> @@ -163,7 +163,7 @@ static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
>  
>  static inline void __raw_spin_unlock_irq(raw_spinlock_t *lock)
>  {
> -	spin_release(&lock->dep_map, 1, _RET_IP_);
> +	spin_release(&lock->dep_map, _RET_IP_);
>  	do_raw_spin_unlock(lock);
>  	local_irq_enable();
>  	preempt_enable();
> @@ -171,7 +171,7 @@ static inline void __raw_spin_unlock_irq(raw_spinlock_t *lock)
>  
>  static inline void __raw_spin_unlock_bh(raw_spinlock_t *lock)
>  {
> -	spin_release(&lock->dep_map, 1, _RET_IP_);
> +	spin_release(&lock->dep_map, _RET_IP_);
>  	do_raw_spin_unlock(lock);
>  	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
>  }
> diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
> index 3af7c0e03be5..d7554252404c 100644
> --- a/include/linux/ww_mutex.h
> +++ b/include/linux/ww_mutex.h
> @@ -182,7 +182,7 @@ static inline void ww_acquire_done(struct ww_acquire_ctx *ctx)
>  static inline void ww_acquire_fini(struct ww_acquire_ctx *ctx)
>  {
>  #ifdef CONFIG_DEBUG_MUTEXES
> -	mutex_release(&ctx->dep_map, 0, _THIS_IP_);
> +	mutex_release(&ctx->dep_map, _THIS_IP_);
>  
>  	DEBUG_LOCKS_WARN_ON(ctx->acquired);
>  	if (!IS_ENABLED(CONFIG_PROVE_LOCKING))
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2c53f1a1d905..e46db0c846d2 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1484,7 +1484,7 @@ static inline void sock_release_ownership(struct sock *sk)
>  		sk->sk_lock.owned = 0;
>  
>  		/* The sk_lock has mutex_unlock() semantics: */
> -		mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
> +		mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
>  	}
>  }
>  
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 052580c33d26..dcfe2d37ad15 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -338,7 +338,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>  		 * up_read_non_owner(). The rwsem_release() is called
>  		 * here to release the lock from lockdep's perspective.
>  		 */
> -		rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
> +		rwsem_release(&current->mm->mmap_sem.dep_map, _RET_IP_);
>  	}
>  }
>  
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index e1967e9eddc2..97ed88e0cf72 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -336,7 +336,7 @@ static void lockdep_acquire_cpus_lock(void)
>  
>  static void lockdep_release_cpus_lock(void)
>  {
> -	rwsem_release(&cpu_hotplug_lock.rw_sem.dep_map, 1, _THIS_IP_);
> +	rwsem_release(&cpu_hotplug_lock.rw_sem.dep_map, _THIS_IP_);
>  }
>  
>  /*
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 233459c03b5a..8123518f9045 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -4491,8 +4491,7 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>  }
>  EXPORT_SYMBOL_GPL(lock_acquire);
>  
> -void lock_release(struct lockdep_map *lock, int nested,
> -			  unsigned long ip)
> +void lock_release(struct lockdep_map *lock, unsigned long ip)
>  {
>  	unsigned long flags;
>  
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 468a9b8422e3..5352ce50a97e 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -1091,7 +1091,7 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
>  err_early_kill:
>  	spin_unlock(&lock->wait_lock);
>  	debug_mutex_free_waiter(&waiter);
> -	mutex_release(&lock->dep_map, 1, ip);
> +	mutex_release(&lock->dep_map, ip);
>  	preempt_enable();
>  	return ret;
>  }
> @@ -1225,7 +1225,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>  	DEFINE_WAKE_Q(wake_q);
>  	unsigned long owner;
>  
> -	mutex_release(&lock->dep_map, 1, ip);
> +	mutex_release(&lock->dep_map, ip);
>  
>  	/*
>  	 * Release the lock before (potentially) taking the spinlock such that
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 2874bf556162..851bbb10819d 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1517,7 +1517,7 @@ int __sched rt_mutex_lock_interruptible(struct rt_mutex *lock)
>  	mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>  	ret = rt_mutex_fastlock(lock, TASK_INTERRUPTIBLE, rt_mutex_slowlock);
>  	if (ret)
> -		mutex_release(&lock->dep_map, 1, _RET_IP_);
> +		mutex_release(&lock->dep_map, _RET_IP_);
>  
>  	return ret;
>  }
> @@ -1561,7 +1561,7 @@ int __sched __rt_mutex_futex_trylock(struct rt_mutex *lock)
>  				       RT_MUTEX_MIN_CHAINWALK,
>  				       rt_mutex_slowlock);
>  	if (ret)
> -		mutex_release(&lock->dep_map, 1, _RET_IP_);
> +		mutex_release(&lock->dep_map, _RET_IP_);
>  
>  	return ret;
>  }
> @@ -1600,7 +1600,7 @@ int __sched rt_mutex_trylock(struct rt_mutex *lock)
>   */
>  void __sched rt_mutex_unlock(struct rt_mutex *lock)
>  {
> -	mutex_release(&lock->dep_map, 1, _RET_IP_);
> +	mutex_release(&lock->dep_map, _RET_IP_);
>  	rt_mutex_fastunlock(lock, rt_mutex_slowunlock);
>  }
>  EXPORT_SYMBOL_GPL(rt_mutex_unlock);
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index eef04551eae7..44e68761f432 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1504,7 +1504,7 @@ int __sched down_read_killable(struct rw_semaphore *sem)
>  	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
>  
>  	if (LOCK_CONTENDED_RETURN(sem, __down_read_trylock, __down_read_killable)) {
> -		rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +		rwsem_release(&sem->dep_map, _RET_IP_);
>  		return -EINTR;
>  	}
>  
> @@ -1546,7 +1546,7 @@ int __sched down_write_killable(struct rw_semaphore *sem)
>  
>  	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
>  				  __down_write_killable)) {
> -		rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +		rwsem_release(&sem->dep_map, _RET_IP_);
>  		return -EINTR;
>  	}
>  
> @@ -1573,7 +1573,7 @@ int down_write_trylock(struct rw_semaphore *sem)
>   */
>  void up_read(struct rw_semaphore *sem)
>  {
> -	rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +	rwsem_release(&sem->dep_map, _RET_IP_);
>  	__up_read(sem);
>  }
>  EXPORT_SYMBOL(up_read);
> @@ -1583,7 +1583,7 @@ void up_read(struct rw_semaphore *sem)
>   */
>  void up_write(struct rw_semaphore *sem)
>  {
> -	rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +	rwsem_release(&sem->dep_map, _RET_IP_);
>  	__up_write(sem);
>  }
>  EXPORT_SYMBOL(up_write);
> @@ -1639,7 +1639,7 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
>  
>  	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
>  				  __down_write_killable)) {
> -		rwsem_release(&sem->dep_map, 1, _RET_IP_);
> +		rwsem_release(&sem->dep_map, _RET_IP_);
>  		return -EINTR;
>  	}
>  
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ca65327a6de8..c8be5a0f5259 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -248,7 +248,7 @@ static void __up_console_sem(unsigned long ip)
>  {
>  	unsigned long flags;
>  
> -	mutex_release(&console_lock_dep_map, 1, ip);
> +	mutex_release(&console_lock_dep_map, ip);
>  
>  	printk_safe_enter_irqsave(flags);
>  	up(&console_sem);
> @@ -1679,20 +1679,20 @@ static int console_lock_spinning_disable_and_check(void)
>  	raw_spin_unlock(&console_owner_lock);
>  
>  	if (!waiter) {
> -		spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +		spin_release(&console_owner_dep_map, _THIS_IP_);
>  		return 0;
>  	}
>  
>  	/* The waiter is now free to continue */
>  	WRITE_ONCE(console_waiter, false);
>  
> -	spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +	spin_release(&console_owner_dep_map, _THIS_IP_);
>  
>  	/*
>  	 * Hand off console_lock to waiter. The waiter will perform
>  	 * the up(). After this, the waiter is the console_lock owner.
>  	 */
> -	mutex_release(&console_lock_dep_map, 1, _THIS_IP_);
> +	mutex_release(&console_lock_dep_map, _THIS_IP_);
>  	return 1;
>  }
>  
> @@ -1746,7 +1746,7 @@ static int console_trylock_spinning(void)
>  	/* Owner will clear console_waiter on hand off */
>  	while (READ_ONCE(console_waiter))
>  		cpu_relax();
> -	spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +	spin_release(&console_owner_dep_map, _THIS_IP_);
>  
>  	printk_safe_exit_irqrestore(flags);
>  	/*
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f9a1346a5fa9..f845693e8e75 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3105,7 +3105,7 @@ static inline void finish_task(struct task_struct *prev)
>  	 * do an early lockdep release here:
>  	 */
>  	rq_unpin_lock(rq, rf);
> -	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> +	spin_release(&rq->lock.dep_map, _THIS_IP_);
>  #ifdef CONFIG_DEBUG_SPINLOCK
>  	/* this is a valid case when another task releases the spinlock */
>  	rq->lock.owner = next;
> diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
> index a1705545e6ac..14f44f59e733 100644
> --- a/lib/locking-selftest.c
> +++ b/lib/locking-selftest.c
> @@ -1475,7 +1475,7 @@ static void ww_test_edeadlk_normal(void)
>  
>  	mutex_lock(&o2.base);
>  	o2.ctx = &t2;
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  
>  	WWAI(&t);
>  	t2 = t;
> @@ -1500,7 +1500,7 @@ static void ww_test_edeadlk_normal_slow(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	WWAI(&t);
> @@ -1527,7 +1527,7 @@ static void ww_test_edeadlk_no_unlock(void)
>  
>  	mutex_lock(&o2.base);
>  	o2.ctx = &t2;
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  
>  	WWAI(&t);
>  	t2 = t;
> @@ -1551,7 +1551,7 @@ static void ww_test_edeadlk_no_unlock_slow(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	WWAI(&t);
> @@ -1576,7 +1576,7 @@ static void ww_test_edeadlk_acquire_more(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	WWAI(&t);
> @@ -1597,7 +1597,7 @@ static void ww_test_edeadlk_acquire_more_slow(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	WWAI(&t);
> @@ -1618,11 +1618,11 @@ static void ww_test_edeadlk_acquire_more_edeadlk(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	mutex_lock(&o3.base);
> -	mutex_release(&o3.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o3.base.dep_map, _THIS_IP_);
>  	o3.ctx = &t2;
>  
>  	WWAI(&t);
> @@ -1644,11 +1644,11 @@ static void ww_test_edeadlk_acquire_more_edeadlk_slow(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	mutex_lock(&o3.base);
> -	mutex_release(&o3.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o3.base.dep_map, _THIS_IP_);
>  	o3.ctx = &t2;
>  
>  	WWAI(&t);
> @@ -1669,7 +1669,7 @@ static void ww_test_edeadlk_acquire_wrong(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	WWAI(&t);
> @@ -1694,7 +1694,7 @@ static void ww_test_edeadlk_acquire_wrong_slow(void)
>  	int ret;
>  
>  	mutex_lock(&o2.base);
> -	mutex_release(&o2.base.dep_map, 1, _THIS_IP_);
> +	mutex_release(&o2.base.dep_map, _THIS_IP_);
>  	o2.ctx = &t2;
>  
>  	WWAI(&t);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1c4c08b45e44..3956ab6dba14 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1800,7 +1800,7 @@ static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
>  	struct mem_cgroup *iter;
>  
>  	spin_lock(&memcg_oom_lock);
> -	mutex_release(&memcg_oom_lock_dep_map, 1, _RET_IP_);
> +	mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
>  	for_each_mem_cgroup_tree(iter, memcg)
>  		iter->oom_lock = false;
>  	spin_unlock(&memcg_oom_lock);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 07863edbe6fc..a988e70cdac5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -521,7 +521,7 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
>  
>  		rc = sk_backlog_rcv(sk, skb);
>  
> -		mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
> +		mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
>  	} else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
>  		bh_unlock_sock(sk);
>  		atomic_inc(&sk->sk_drops);
> diff --git a/tools/lib/lockdep/include/liblockdep/common.h b/tools/lib/lockdep/include/liblockdep/common.h
> index a81d91d4fc78..a6d7ee5f18ba 100644
> --- a/tools/lib/lockdep/include/liblockdep/common.h
> +++ b/tools/lib/lockdep/include/liblockdep/common.h
> @@ -42,8 +42,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
>  void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>  			int trylock, int read, int check,
>  			struct lockdep_map *nest_lock, unsigned long ip);
> -void lock_release(struct lockdep_map *lock, int nested,
> -			unsigned long ip);
> +void lock_release(struct lockdep_map *lock, unsigned long ip);
>  void lockdep_reset_lock(struct lockdep_map *lock);
>  void lockdep_register_key(struct lock_class_key *key);
>  void lockdep_unregister_key(struct lock_class_key *key);
> diff --git a/tools/lib/lockdep/include/liblockdep/mutex.h b/tools/lib/lockdep/include/liblockdep/mutex.h
> index 783dd0df06f9..bd106b82759b 100644
> --- a/tools/lib/lockdep/include/liblockdep/mutex.h
> +++ b/tools/lib/lockdep/include/liblockdep/mutex.h
> @@ -42,7 +42,7 @@ static inline int liblockdep_pthread_mutex_lock(liblockdep_pthread_mutex_t *lock
>  
>  static inline int liblockdep_pthread_mutex_unlock(liblockdep_pthread_mutex_t *lock)
>  {
> -	lock_release(&lock->dep_map, 0, (unsigned long)_RET_IP_);
> +	lock_release(&lock->dep_map, (unsigned long)_RET_IP_);
>  	return pthread_mutex_unlock(&lock->mutex);
>  }
>  
> diff --git a/tools/lib/lockdep/include/liblockdep/rwlock.h b/tools/lib/lockdep/include/liblockdep/rwlock.h
> index 365762e3a1ea..6d5d2932bf4d 100644
> --- a/tools/lib/lockdep/include/liblockdep/rwlock.h
> +++ b/tools/lib/lockdep/include/liblockdep/rwlock.h
> @@ -44,7 +44,7 @@ static inline int liblockdep_pthread_rwlock_rdlock(liblockdep_pthread_rwlock_t *
>  
>  static inline int liblockdep_pthread_rwlock_unlock(liblockdep_pthread_rwlock_t *lock)
>  {
> -	lock_release(&lock->dep_map, 0, (unsigned long)_RET_IP_);
> +	lock_release(&lock->dep_map, (unsigned long)_RET_IP_);
>  	return pthread_rwlock_unlock(&lock->rwlock);
>  }
>  
> diff --git a/tools/lib/lockdep/preload.c b/tools/lib/lockdep/preload.c
> index 76245d16196d..8f1adbe887b2 100644
> --- a/tools/lib/lockdep/preload.c
> +++ b/tools/lib/lockdep/preload.c
> @@ -270,7 +270,7 @@ int pthread_mutex_lock(pthread_mutex_t *mutex)
>  	 */
>  	r = ll_pthread_mutex_lock(mutex);
>  	if (r)
> -		lock_release(&__get_lock(mutex)->dep_map, 0, (unsigned long)_RET_IP_);
> +		lock_release(&__get_lock(mutex)->dep_map, (unsigned long)_RET_IP_);
>  
>  	return r;
>  }
> @@ -284,7 +284,7 @@ int pthread_mutex_trylock(pthread_mutex_t *mutex)
>  	lock_acquire(&__get_lock(mutex)->dep_map, 0, 1, 0, 1, NULL, (unsigned long)_RET_IP_);
>  	r = ll_pthread_mutex_trylock(mutex);
>  	if (r)
> -		lock_release(&__get_lock(mutex)->dep_map, 0, (unsigned long)_RET_IP_);
> +		lock_release(&__get_lock(mutex)->dep_map, (unsigned long)_RET_IP_);
>  
>  	return r;
>  }
> @@ -295,7 +295,7 @@ int pthread_mutex_unlock(pthread_mutex_t *mutex)
>  
>  	try_init_preload();
>  
> -	lock_release(&__get_lock(mutex)->dep_map, 0, (unsigned long)_RET_IP_);
> +	lock_release(&__get_lock(mutex)->dep_map, (unsigned long)_RET_IP_);
>  	/*
>  	 * Just like taking a lock, only in reverse!
>  	 *
> @@ -355,7 +355,7 @@ int pthread_rwlock_rdlock(pthread_rwlock_t *rwlock)
>  	lock_acquire(&__get_lock(rwlock)->dep_map, 0, 0, 2, 1, NULL, (unsigned long)_RET_IP_);
>  	r = ll_pthread_rwlock_rdlock(rwlock);
>  	if (r)
> -		lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +		lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>  
>  	return r;
>  }
> @@ -369,7 +369,7 @@ int pthread_rwlock_tryrdlock(pthread_rwlock_t *rwlock)
>  	lock_acquire(&__get_lock(rwlock)->dep_map, 0, 1, 2, 1, NULL, (unsigned long)_RET_IP_);
>  	r = ll_pthread_rwlock_tryrdlock(rwlock);
>  	if (r)
> -		lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +		lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>  
>  	return r;
>  }
> @@ -383,7 +383,7 @@ int pthread_rwlock_trywrlock(pthread_rwlock_t *rwlock)
>  	lock_acquire(&__get_lock(rwlock)->dep_map, 0, 1, 0, 1, NULL, (unsigned long)_RET_IP_);
>  	r = ll_pthread_rwlock_trywrlock(rwlock);
>  	if (r)
> -                lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +		lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>  
>  	return r;
>  }
> @@ -397,7 +397,7 @@ int pthread_rwlock_wrlock(pthread_rwlock_t *rwlock)
>  	lock_acquire(&__get_lock(rwlock)->dep_map, 0, 0, 0, 1, NULL, (unsigned long)_RET_IP_);
>  	r = ll_pthread_rwlock_wrlock(rwlock);
>  	if (r)
> -		lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +		lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>  
>  	return r;
>  }
> @@ -408,7 +408,7 @@ int pthread_rwlock_unlock(pthread_rwlock_t *rwlock)
>  
>          init_preload();
>  
> -	lock_release(&__get_lock(rwlock)->dep_map, 0, (unsigned long)_RET_IP_);
> +	lock_release(&__get_lock(rwlock)->dep_map, (unsigned long)_RET_IP_);
>  	r = ll_pthread_rwlock_unlock(rwlock);
>  	if (r)
>  		lock_acquire(&__get_lock(rwlock)->dep_map, 0, 0, 0, 1, NULL, (unsigned long)_RET_IP_);
> -- 
> 1.8.3.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply

* Re: [REPOST][RFC][PATCH] sysctl: Remove the sysctl system call
From: Michael Kerrisk (man-pages) @ 2019-10-08 10:30 UTC (permalink / raw)
  To: Kees Cook, Eric W. Biederman
  Cc: mtk.manpages, Florian Weimer, linux-kernel, linux-arch, linux-api,
	Jann Horn, Arnd Bergmann, Helge Deller
In-Reply-To: <201910031404.C30A0F16@keescook>

On 10/3/19 11:05 PM, Kees Cook wrote:
> On Thu, Oct 03, 2019 at 03:44:32PM -0500, Eric W. Biederman wrote:
>>
>> This system call has been deprecated almost since it was introduced, and none
>> of the common distributions enable it.  The only indication that I can find that
>> anyone might care is that a few of the defconfigs in the kernel enable it.  However
>> that is a small fractions of the defconfigs so I suspect it just a lack of care
>> rather than a reflection of software using the the sysctl system call.
>>
>> As there appear to be no users of the sysctl system call, remove the
>> code so that the proc filesystem can be simplified.
> 
> nitpick: line lengths near 80 characters
> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> But, yes, I would love to see this gone. :)
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>

And for the record, the manual page has since 2007 documented that 
this system call is likely to go away in the future.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
From: Aleksa Sarai @ 2019-10-08  1:33 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, Michael Kerrisk, Christian Brauner, Aleksa Sarai,
	linux-man, Linux API, kernel list
In-Reply-To: <CAG48ez2LuOGAXgKftZKfDKxhdb6xcBTdoK468-HXdcpxCW4r4w@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2987 bytes --]

On 2019-10-07, Jann Horn <jannh@google.com> wrote:
> On Thu, Oct 3, 2019 at 4:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > Traditionally, magic-links have not been a well-understood topic in
> > Linux. Given the new changes in their semantics (related to the link
> > mode of trailing magic-links), it seems like a good opportunity to shine
> > more light on magic-links and their semantics.
> [...]
> > +++ b/man7/symlink.7
> > @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
> >  are outlined here.
> >  It is important that site-local applications also conform to these rules,
> >  so that the user interface can be as consistent as possible.
> > +.SS Magic-links
> > +There is a special class of symlink-like objects known as "magic-links" which
> 
> I think names like that normally aren't hypenated in english, and
> instead of "magic-links", it'd be "magic links"? Just like how you
> wouldn't write "symbolic-link", but "symbolic link". But this is
> bikeshedding, and if you disagree, feel free to ignore this comment.

Looking at it now, I think you're right -- I hyphenated it here because
that's how I wrote it when documenting the feature in comments. But I
think that's because "symlink" and "magic-link" (the "abbreviated"
versions) seem to match better than "symlink" and "magic link".

I'll use "magic link" in documentation, but "magic-link" for all cases
where I would normally write "symlink".

> > +can be found in certain pseudo-filesystems such as
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Unlike normal symlinks, magic-links are not resolved through
> 
> nit: AFAICS symlinks are always referred to as "symbolic links"
> throughout the manpages.

:+1:

> > +pathname-expansion, but instead act as direct references to the kernel's own
> > +representation of a file handle. As such, these magic-links allow users to
> > +access files which cannot be referenced with normal paths (such as unlinked
> > +files still referenced by a running program.)
> 
> Could maybe add "and files in different mount namespaces" as another
> example here; at least for me, that's the main usecases for
> /proc/*/root.

Will do.

> [...]
> > +However, magic-links do not follow this rule. They can have a non-0777 mode,
> > +which is used for permission checks when the final
> > +component of an
> > +.BR open (2)'s
> 
> Maybe leave out the "open" part, since the same restriction has to
> also apply to other syscalls operating on files, like truncate() and
> so on?

Yes (though I've just realised I hadn't implemented that -- oops.) Given
how expansive this patchset will get -- I might end up splitting it into
the magic-link stuff (and O_EMPTYPATH) and a separate series for
openat2(2) and the path resolution restrictions.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
From: Jann Horn @ 2019-10-07 16:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Michael Kerrisk, Christian Brauner, Aleksa Sarai,
	linux-man, Linux API, kernel list
In-Reply-To: <20191003145542.17490-2-cyphar@cyphar.com>

On Thu, Oct 3, 2019 at 4:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> Traditionally, magic-links have not been a well-understood topic in
> Linux. Given the new changes in their semantics (related to the link
> mode of trailing magic-links), it seems like a good opportunity to shine
> more light on magic-links and their semantics.
[...]
> +++ b/man7/symlink.7
> @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
>  are outlined here.
>  It is important that site-local applications also conform to these rules,
>  so that the user interface can be as consistent as possible.
> +.SS Magic-links
> +There is a special class of symlink-like objects known as "magic-links" which

I think names like that normally aren't hypenated in english, and
instead of "magic-links", it'd be "magic links"? Just like how you
wouldn't write "symbolic-link", but "symbolic link". But this is
bikeshedding, and if you disagree, feel free to ignore this comment.

> +can be found in certain pseudo-filesystems such as
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Unlike normal symlinks, magic-links are not resolved through

nit: AFAICS symlinks are always referred to as "symbolic links"
throughout the manpages.

> +pathname-expansion, but instead act as direct references to the kernel's own
> +representation of a file handle. As such, these magic-links allow users to
> +access files which cannot be referenced with normal paths (such as unlinked
> +files still referenced by a running program.)

Could maybe add "and files in different mount namespaces" as another
example here; at least for me, that's the main usecases for
/proc/*/root.

[...]
> +However, magic-links do not follow this rule. They can have a non-0777 mode,
> +which is used for permission checks when the final
> +component of an
> +.BR open (2)'s

Maybe leave out the "open" part, since the same restriction has to
also apply to other syscalls operating on files, like truncate() and
so on?

> +path is a magic-link (see
> +.BR path_resolution (7).)

^ permalink raw reply

* Re: trace_printk issue. Was: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-10-04 19:56 UTC (permalink / raw)
  To: Steven Rostedt, Alexei Starovoitov
  Cc: Kees Cook, Andy Lutomirski, Andy Lutomirski, Alexei Starovoitov,
	LSM List, James Morris, Jann Horn, Peter Zijlstra,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Network Development, bpf, Kernel Team, Linux API
In-Reply-To: <20191003124148.4b94a720@gandalf.local.home>

On 10/3/19 9:41 AM, Steven Rostedt wrote:
> On Thu, 3 Oct 2019 09:18:40 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> I think dropping last events is just as bad. Is there a mode to overwrite old
>> and keep the last N (like perf does) ?
> 
> Well, it drops it by pages. Thus you should always have the last page
> of events.
> 
>> Peter Wu brought this issue to my attention in
>> commit 55c33dfbeb83 ("bpf: clarify when bpf_trace_printk discards lines").
>> And later sent similar doc fix to ftrace.rst.
> 
> It was documented there, he just elaborated on it more:
> 
>          This file holds the output of the trace in a human
>          readable format (described below). Note, tracing is temporarily
> -       disabled while this file is being read (opened).
> +       disabled when the file is open for reading. Once all readers
> +       are closed, tracing is re-enabled.
> 
> 
>> To be honest if I knew of this trace_printk quirk I would not have picked it
>> as a debugging mechanism for bpf.
>> I urge you to fix it.
> 
> It's not a trivial fix by far.
> 
> Note, trying to read the trace file without disabling the writes to it,
> will most likely make reading it when function tracing enabled totally
> garbage, as the buffer will most likely be filled for every read event.
> That is, each read event will not be related to the next event that is
> read, making it very confusing.
> 
> Although, I may be able to make it work per page. That way you get at
> least a page worth of events.

That sounds much better. As long as trace_printk() doesn't disappear
into the void, it's good.

But the part I'm not getting is why trace_printk() has
if (tracing_disabled) goto out;

It's a concurrent ring buffer. One cpu can write into it while
another reading. What is the point disabling trace_printk in particular?
Each __buffer_unlock_commit is an atomic ring buffer update,
so read from trace will either see it as a whole or won't see it.
'trace_pipe' clearly works fine. Why 'trace' is any different?
Just keep tracing enabled and keep reading it until the end of current
ring buffer. Whether open() determines current or it reads until next=0
is an implementation detail.

^ permalink raw reply

* vger mail woes? (was: Re: [RFC][PATCH] sysctl: Remove the sysctl system call)
From: Florian Weimer @ 2019-10-04  7:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Eric W. Biederman, linux-kernel, linux-arch, linux-api,
	Jann Horn, Arnd Bergmann, Helge Deller, postmaster, David Miller
In-Reply-To: <20191003210814.gh7rbbv6bpxlhz3w@wittgenstein>

* Christian Brauner:

> On Thu, Oct 03, 2019 at 08:56:19AM +0200, Florian Weimer wrote:
>> Is anyone else getting a very incomplete set of messages in this
>> thread?
>> 
>> These changes likely matter to glibc, and I've yet to see the actual
>> patch.  Would someone please forward it to me?
>> 
>> The original message didn't make it into the lore.kernel.org archives
>> (the cross-post to linux-kernel should have taken care of that).
>
> Yeah, I didn't get it either and the repost too weirdly enough.

I got curious and tried to repost the repost to vger.kernel.org (in
the hope to bypass any SMTP callout verifications that may still be
failing for Eric), and got this:

2019-10-04 07:09:29 1iGHiT-00007b-Na <= fw@deneb.enyo.de H=(deneb.enyo.de) [172.17.203.2] P=esmtps X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=no S=72580 id=87h84pgf6h.fsf@mid.deneb.enyo.de
2019-10-04 07:09:37 1iGHiT-00007b-Na => linux-api@vger.kernel.org R=dnslookup T=remote_smtp_ext H=vger.kernel.org [209.132.180.67] C="250 2.7.1 Looks like Linux source DIFF email.. BF:<S 1>; S1728766AbfJDHJg"
2019-10-04 07:09:37 1iGHiT-00007b-Na -> linux-arch@vger.kernel.org R=dnslookup T=remote_smtp_ext H=vger.kernel.org [209.132.180.67] C="250 2.7.1 Looks like Linux source DIFF email.. BF:<S 1>; S1728766AbfJDHJg"
2019-10-04 07:09:37 1iGHiT-00007b-Na -> linux-kernel@vger.kernel.org R=dnslookup T=remote_smtp_ext H=vger.kernel.org [209.132.180.67] C="250 2.7.1 Looks like Linux source DIFF email.. BF:<S 1>; S1728766AbfJDHJg"
2019-10-04 07:09:37 1iGHiT-00007b-Na Completed

But nothing came back.  Timestamps are UTC.

Dave, could please have a look, assuming that you are still involved
with vger operations?

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Christian Brauner @ 2019-10-03 21:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Kees Cook, Eric W. Biederman, linux-kernel, linux-arch, linux-api,
	Jann Horn, Arnd Bergmann, Helge Deller
In-Reply-To: <87y2y271ws.fsf@mid.deneb.enyo.de>

On Thu, Oct 03, 2019 at 08:56:19AM +0200, Florian Weimer wrote:
> Is anyone else getting a very incomplete set of messages in this
> thread?
> 
> These changes likely matter to glibc, and I've yet to see the actual
> patch.  Would someone please forward it to me?
> 
> The original message didn't make it into the lore.kernel.org archives
> (the cross-post to linux-kernel should have taken care of that).

Yeah, I didn't get it either and the repost too weirdly enough.

Christian

^ permalink raw reply

* Re: [REPOST][RFC][PATCH] sysctl: Remove the sysctl system call
From: Kees Cook @ 2019-10-03 21:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Florian Weimer, linux-kernel, linux-arch, linux-api, Jann Horn,
	Arnd Bergmann, Helge Deller
In-Reply-To: <87tv8pftjj.fsf_-_@x220.int.ebiederm.org>

On Thu, Oct 03, 2019 at 03:44:32PM -0500, Eric W. Biederman wrote:
> 
> This system call has been deprecated almost since it was introduced, and none
> of the common distributions enable it.  The only indication that I can find that
> anyone might care is that a few of the defconfigs in the kernel enable it.  However
> that is a small fractions of the defconfigs so I suspect it just a lack of care
> rather than a reflection of software using the the sysctl system call.
> 
> As there appear to be no users of the sysctl system call, remove the
> code so that the proc filesystem can be simplified.

nitpick: line lengths near 80 characters

> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

But, yes, I would love to see this gone. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply

* Re: trace_printk issue. Was: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-10-03 16:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Andy Lutomirski,
	Alexei Starovoitov, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, David S. Miller,
	Daniel Borkmann, Network Development, bpf, Kernel Team, Linux API
In-Reply-To: <20191003161838.7lz746aa2lzl7qi4@ast-mbp.dhcp.thefacebook.com>

On Thu, 3 Oct 2019 09:18:40 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> I think dropping last events is just as bad. Is there a mode to overwrite old
> and keep the last N (like perf does) ?

Well, it drops it by pages. Thus you should always have the last page
of events.

> Peter Wu brought this issue to my attention in
> commit 55c33dfbeb83 ("bpf: clarify when bpf_trace_printk discards lines").
> And later sent similar doc fix to ftrace.rst.

It was documented there, he just elaborated on it more:

        This file holds the output of the trace in a human
        readable format (described below). Note, tracing is temporarily
-       disabled while this file is being read (opened).
+       disabled when the file is open for reading. Once all readers
+       are closed, tracing is re-enabled.

> To be honest if I knew of this trace_printk quirk I would not have picked it
> as a debugging mechanism for bpf.
> I urge you to fix it.

It's not a trivial fix by far.

Note, trying to read the trace file without disabling the writes to it,
will most likely make reading it when function tracing enabled totally
garbage, as the buffer will most likely be filled for every read event.
That is, each read event will not be related to the next event that is
read, making it very confusing.

Although, I may be able to make it work per page. That way you get at
least a page worth of events.

Now, I could also make it where you have to stop tracing to read the
trace file. That is, if you try to open the trace files while the
buffer is active, it will error -EBUSY. Forcing you to stop tracing to
read it, otherwise you would need to read the trace_pipe. At least this
way you will not get surprised that events were dropped.

-- Steve

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-10-03 16:20 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Kees Cook, Steven Rostedt, Andy Lutomirski, Andy Lutomirski,
	Alexei Starovoitov, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20191003151204.5857bb24245f9c3355f27e0d@kernel.org>

On Thu, Oct 03, 2019 at 03:12:04PM +0900, Masami Hiramatsu wrote:
> On Mon, 30 Sep 2019 11:31:29 -0700
> Kees Cook <keescook@chromium.org> wrote:
> 
> > On Sat, Sep 28, 2019 at 07:37:27PM -0400, Steven Rostedt wrote:
> > > On Wed, 28 Aug 2019 21:07:24 -0700
> > > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > > > 
> > > > > This won’t make me much more comfortable, since CAP_BPF lets it do an ever-growing set of nasty things. I’d much rather one or both of two things happen:
> > > > > 
> > > > > 1. Give it CAP_TRACING only. It can leak my data, but it’s rather hard for it to crash my laptop, lose data, or cause other shenanigans.
> > > > > 
> > > > > 2. Improve it a bit do all the privileged ops are wrapped by capset().
> > > > > 
> > > > > Does this make sense?  I’m a security person on occasion. I find
> > > > > vulnerabilities and exploit them deliberately and I break things by
> > > > > accident on a regular basis. In my considered opinion, CAP_TRACING
> > > > > alone, even extended to cover part of BPF as I’ve described, is
> > > > > decently safe. Getting root with just CAP_TRACING will be decently
> > > > > challenging, especially if I don’t get to read things like sshd’s
> > > > > memory, and improvements to mitigate even that could be added.  I
> > > > > am quite confident that attacks starting with CAP_TRACING will have
> > > > > clear audit signatures if auditing is on.  I am also confident that
> > > > > CAP_BPF *will* allow DoS and likely privilege escalation, and this
> > > > > will only get more likely as BPF gets more widely used. And, if
> > > > > BPF-based auditing ever becomes a thing, writing to the audit
> > > > > daemon’s maps will be a great way to cover one’s tracks.  
> > > > 
> > > > CAP_TRACING, as I'm proposing it, will allow full tracefs access.
> > > > I think Steven and Massami prefer that as well.
> > > > That includes kprobe with probe_kernel_read.
> > > > That also means mini-DoS by installing kprobes everywhere or running
> > > > too much ftrace.
> > > 
> > > I was talking with Kees at Plumbers about this, and we were talking
> > > about just using simple file permissions. I started playing with some
> > > patches to allow the tracefs be visible but by default it would only be
> > > visible by root.
> > > 
> > >  rwx------
> > > 
> > > Then a start up script (or perhaps mount options) could change the
> > > group owner, and change this to:
> > > 
> > >  rwxrwx---
> > > 
> > > Where anyone in the group assigned (say "tracing") gets full access to
> > > the file system.
> 
> Does it for "all" files under tracefs?
> 
> > > 
> > > The more I was playing with this, the less I see the need for
> > > CAP_TRACING for ftrace and reading the format files.
> > 
> > Nice! Thanks for playing with this. I like it because it gives us a way
> > to push policy into userspace (group membership, etc), and provides a
> > clean way (hopefully) do separate "read" (kernel memory confidentiality)
> > from "write" (kernel memory integrity), which wouldn't have been possible
> > with a single new CAP_...
> 
>  From the confidentiality point of view, if tracefs exposes traced data,
> it might include in-kernel pointer and symbols, but the user still can't
> see /proc/kallsyms. This means we still have several different confidentiality
> for each interface.
> 
> Anyway, adding a tracefs mount option for allowing a user group to access
> event format data will be a good idea. But even though, I  think we still
> need the CAP_TRACING for allowing control of intrusive tracing, like kprobes
> and bpf etc. (Or, do we keep those for CAP_SYS_ADMIN??)

No doubt. This thread is only about tracefs wanting to do its own fs based controls.

^ permalink raw reply

* trace_printk issue. Was: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-10-03 16:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Andy Lutomirski,
	Alexei Starovoitov, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, David S. Miller,
	Daniel Borkmann, Network Development, bpf, Kernel Team, Linux API
In-Reply-To: <20191002190027.4e204ea8@gandalf.local.home>

On Wed, Oct 02, 2019 at 07:00:27PM -0400, Steven Rostedt wrote:
> > >>>>
> > >>>> Both 'trace' and 'trace_pipe' have quirky side effects.
> > >>>> Like opening 'trace' file will make all parallel trace_printk() to be ignored.
> > >>>> While reading 'trace_pipe' file will clear it.
> > >>>> The point that traditional 'read' and 'write' ACLs don't map as-is
> > >>>> to tracefs, so I would be careful categorizing things into
> > >>>> confidentiality vs integrity only based on access type.  
> > >>>
> > >>> What exactly is the bpf_trace_printk() used for? I may have other ideas
> > >>> that can help.  
> > >>
> > >> It's debugging of bpf programs. Same is what printk() is used for
> > >> by kernel developers.
> > >>  
> > > 
> > > How is it extracted? Just read from the trace or trace_pipe file?  
> > 
> > yep. Just like kernel devs look at dmesg when they sprinkle printk.
> > btw, if you can fix 'trace' file issue that stops all trace_printk
> > while 'trace' file is open that would be great.
> > Some users have been bitten by this behavior. We even documented it.
> 
> The behavior is documented as well in the ftrace documentation. That's
> why we suggest the trace_pipe redirected into a file so that you don't
> lose data (unless the writer goes too fast). If you prefer a producer
> consumer where you lose newer events (like perf does), you can turn off
> overwrite mode, and it will drop events when the buffer is full (see
> options/overwrite).

I think dropping last events is just as bad. Is there a mode to overwrite old
and keep the last N (like perf does) ?
That aside having 'trace' file open should NOT drop trace_printks.
My point that bpf_trace_printk is just as important to bpf developers as
printk to kernel developers.
Imagine kernel developer losing their printk-s only because they typed
"dmesg" in another terminal?
It's completely unexpected and breaks developer trust in debugging mechanism.
Peter Wu brought this issue to my attention in
commit 55c33dfbeb83 ("bpf: clarify when bpf_trace_printk discards lines").
And later sent similar doc fix to ftrace.rst.
To be honest if I knew of this trace_printk quirk I would not have picked it
as a debugging mechanism for bpf.
I urge you to fix it.

^ permalink raw reply

* Re: [PATCH RFC 3/3] openat2.2: document new syscall
From: Aleksa Sarai @ 2019-10-03 15:00 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Christian Brauner, Aleksa Sarai, linux-man, linux-api,
	linux-kernel
In-Reply-To: <20191003145542.17490-5-cyphar@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 18111 bytes --]

Ignore this one (it's an older version of the openat2.2 patch) -- I sent
it by accident.

On 2019-10-04, Aleksa Sarai <cyphar@cyphar.com> wrote:
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man2/open.2            |   5 +
>  man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
>  man7/path_resolution.7 |  57 ++++--
>  3 files changed, 426 insertions(+), 17 deletions(-)
>  create mode 100644 man2/openat2.2
> 
> diff --git a/man2/open.2 b/man2/open.2
> index 7217fe056e5e..a0b43394bbee 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
>  ", mode_t " mode );
> +.PP
> +/* Docuented separately, in \fBopenat2\fP(2). */
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
>  .fi
>  .PP
>  .in -4n
> @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
>  .B O_DIRECTORY
>  is ignored).
>  .SH SEE ALSO
> +.BR openat2 (2),
>  .BR chmod (2),
>  .BR chown (2),
>  .BR close (2),
> diff --git a/man2/openat2.2 b/man2/openat2.2
> new file mode 100644
> index 000000000000..c43c76046243
> --- /dev/null
> +++ b/man2/openat2.2
> @@ -0,0 +1,381 @@
> +.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +openat2 \- open and possibly create a file (extended)
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.B #include <sys/stat.h>
> +.B #include <fcntl.h>
> +.PP
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR openat2 ()
> +system call is an extension of
> +.BR openat (2)
> +and provides a superset of its functionality. Rather than taking a single
> +.I flag
> +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> +seamless future extensions.
> +.PP
> +.I size
> +must be set to
> +.IR "sizeof(struct open_how)" ,
> +to facilitate future extensions (see the "Extensibility" section of the
> +\fBNOTES\fP for more detail on how extensions are handled.)
> +
> +.SS The open_how structure
> +The following structure indicates how
> +.I pathname
> +should be opened, and acts as a superset of the
> +.IR flag " and " mode
> +arguments to
> +.BR openat (2).
> +.PP
> +.in +4n
> +.EX
> +struct open_how {
> +    uint32_t flags;              /* open(2)-style O_* flags. */
> +    union {
> +        uint16_t mode;           /* File mode bits for new file creation. */
> +        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
> +    };
> +    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
> +};
> +.EE
> +.in
> +.PP
> +Any future extensions to
> +.BR openat2 ()
> +will be implemented as new fields appended to the above structure, with the
> +zero value of the new fields acting as though the extension were not present.
> +.PP
> +The meaning of each field is as follows:
> +.RS
> +
> +.I flags
> +.RS
> +The file creation and status flags to use for this operation. All of the
> +.B O_*
> +flags defined for
> +.BR openat (2)
> +are valid
> +.BR openat2 ()
> +flag values.
> +.RE
> +
> +.I upgrade_mask
> +.RS
> +Restrict with which
> +.I access modes
> +the returned
> +.B O_PATH
> +descriptor may be re-opened (either through
> +.B O_EMPTYPATH
> +or
> +.IR /proc/self/fd/ .)
> +This field may only be set to a non-zero value if
> +.I flags
> +contains
> +.BR O_PATH .
> +By default, an
> +.B O_PATH
> +file descriptor of an ordinary file may be re-opened with with any access mode (but an
> +.B O_PATH
> +file descriptor of a magic-link may only be re-opened with access modes that
> +the original magic-link possessed). The full list of
> +.I upgrade_mask
> +flags is given below.
> +.TP
> +.B UPGRADE_NOREAD
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for reading (i.e.
> +.BR O_RDONLY " or " O_RDWR .)
> +.TP
> +.B UPGRADE_NOWRITE
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for writing (i.e.
> +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> +.RE
> +
> +.I resolve
> +.RS
> +Change how the components of
> +.I pathname
> +will be resolved (see
> +.BR path_resolution (7)
> +for background information.) The primary use-case for these flags is to allow
> +trusted programs to restrict how un-trusted paths (or paths inside un-trusted
> +directories) are resolved. The full list of
> +.I resolve
> +flags is given below.
> +.TP
> +.B RESOLVE_NO_XDEV
> +Disallow all mount-point crossings during path resolution (including
> +all bind-mounts).
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as bind-mounts are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_SYMLINKS
> +Disallow all symlink resolution during path resolution. If the trailing
> +component is a symlink, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the symlink will be returned. This option implies
> +.BR RESOLVE_NO_MAGICLINKS .
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as symlinks are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_MAGICLINKS
> +Disallow all magic-link resolution during path resolution. If the trailing
> +component is a magic-link, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the magic-link will be returned.
> +
> +Magic-links are symlink-like objects that are most notably found in
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Due to the potential danger of unknowingly opening these magic-links, it may be
> +preferable for users to disable their resolution entirely (see
> +.BR symlink (7)
> +for more details.)
> +.TP
> +.B RESOLVE_BENEATH
> +Do not permit the path resolution to succeed if any component of the resolution
> +is not a descendant of the directory indicated by
> +.IR dirfd .
> +This results in absolute symlinks (and absolute values of
> +.IR pathname )
> +to be rejected. Magic-link resolution is also not permitted.
> +
> +.TP
> +.B RESOLVE_IN_ROOT
> +Temporarily treat
> +.I dirfd
> +as the root of the filesystem (as though the user called
> +.BR chroot (2)
> +with
> +.IR dirfd
> +as the argument.) Absolute symlinks and ".." path components will be scoped to
> +.IR dirfd . Magic-link resolution is also not permitted.
> +
> +However, unlike
> +.BR chroot (2)
> +(which changes the filesystem root persistently for an entire thread-group),
> +.B RESOLVE_IN_ROOT
> +allows a program to efficiently restrict path resolution for only certain
> +operations. It also has several hardening features (such as not permitting
> +magic-link resolution) which
> +.BR chroot (2)
> +does not.
> +.RE
> +
> +.RE
> +
> +.PP
> +Unlike
> +.BR openat (2),
> +any unknown flags set in fields of
> +.I how
> +will result in an error, rather than being ignored. In addition, an error will
> +be returned if the value of the
> +.IR mode " and " upgrade_mask
> +union is non-zero unless:
> +.RS
> +.IP * 3
> +.I flags
> +indicates that a new file will be created (it contains
> +.BR O_CREAT " or " O_TMPFILE ),
> +in which case
> +.I mode
> +may be any valid file mode.
> +.IP *
> +.I flags
> +contains
> +.BR O_PATH ,
> +in which case
> +.I upgrade_mask
> +must only contain valid
> +.B UPGRADE_*
> +flags.
> +.RE
> +
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned. On error, -1 is returned, and
> +.I errno
> +is set appropriately.
> +
> +.SH ERRORS
> +The set of errors returned by
> +.BR openat2 ()
> +includes all of the errors returned by
> +.BR openat (2),
> +as well as the following additional errors:
> +.TP
> +.B EINVAL
> +An unknown flag or invalid value was specified in
> +.IR how .
> +.TP
> +.B EINVAL
> +.I size
> +was smaller than any known version of
> +.IR "struct open_how" .
> +.TP
> +.B E2BIG
> +An extension was specified in
> +.IR how ,
> +which the current kernel does not support (see the "Extensibility" section of
> +the \fBNOTES\fP for more detail on how extensions are handled.)
> +.TP
> +.B EAGAIN
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and the kernel could not ensure that a ".." component didn't escape (due to a
> +race condition or potential attack). Callers may choose to retry the
> +.BR openat2 ()
> +call.
> +.TP
> +.B EXDEV
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and a path component attempted to escape the root of the resolution.
> +
> +.TP
> +.B EXDEV
> +.I resolve
> +contains
> +.BR RESOLVE_NO_XDEV ,
> +and a path component attempted to cross a mount-point.
> +
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_SYMLINKS ,
> +and one of the path components was a symlink.
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_MAGICLINKS ,
> +and one of the path components was a magic-link.
> +
> +.SH VERSIONS
> +.BR openat2 ()
> +was added to Linux in kernel 5.FOO.
> +
> +.SH CONFORMING TO
> +This system call is Linux-specific.
> +
> +The semantics of
> +.B RESOLVE_BENEATH
> +were modelled after FreeBSD's
> +.BR O_BENEATH .
> +
> +.SH NOTES
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +
> +.SS Extensibility
> +In order to allow for
> +.I struct open_how
> +to be extended in future kernel revisions,
> +.BR openat2 ()
> +requires userspace to specify what sized
> +.I struct open_how
> +structure they are passing. By providing this information, it is possible for
> +.BR openat2 ()
> +to provide both forwards- and backwards-compatibility \(em with
> +.I size
> +acting as an implicit version number (because new extension fields will always
> +be appended, the size will always increase.) This extensibility design is very
> +similar to other system calls such as
> +.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
> +
> +If we let
> +.I usize
> +be the size of the structure according to userspace and
> +.I ksize
> +be the size of the structure which the kernel supports, then there are only
> +three cases to consider:
> +
> +.RS
> +.IP * 3
> +If
> +.IR ksize " equals " usize ,
> +then there is no version mismatch and
> +.I how
> +can be used verbatim.
> +.IP *
> +If
> +.IR ksize " is larger than " usize ,
> +then there are some extensions the kernel supports which the userspace program
> +is unaware of. Because all extensions must have their zero values be a no-op,
> +the kernel treats all of the extension fields not set by userspace to have zero
> +values. This provides backwards-compatibility.
> +.IP *
> +If
> +.IR ksize " is smaller than " usize ,
> +then there are some extensions which the userspace program is aware of but the
> +kernel does not support. Because all extensions must have their zero values be
> +a no-op, the kernel can safely ignore the unsupported extension fields if they
> +are all-zero. If any unsupported extension fields are non-zero, then an error
> +is returned. This provides forwards-compatibility.
> +.RE
> +
> +Therefore, most userspace programs will not need to have any special handling
> +of extensions. However, if a userspace program wishes to determine what
> +extensions the running kernel supports, they may conduct a binary search on
> +.IR size
> +(to find the largest value which doesn't produce an error.)
> +
> +.SH SEE ALSO
> +.BR openat (2),
> +.BR path_resolution (7),
> +.BR symlink (7)
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 85dd354e9a93..3da3e5b614c8 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
>  Some UNIX/Linux system calls have as parameter one or more filenames.
>  A filename (or pathname) is resolved as follows.
>  .SS Step 1: start of the resolution process
> -If the pathname starts with the \(aq/\(aq character,
> -the starting lookup directory
> -is the root directory of the calling process.
> -(A process inherits its
> -root directory from its parent.
> -Usually this will be the root directory
> -of the file hierarchy.
> -A process may get a different root directory
> -by use of the
> +If the pathname starts with the \(aq/\(aq character, the starting lookup
> +directory is the root directory of the calling process. (A process inherits its
> +root directory from its parent. Usually this will be the root directory of the
> +file hierarchy. A process may get a different root directory by use of the
>  .BR chroot (2)
> -system call.
> +system call, or may temporarily use a different root directory by using
> +.BR openat2 (2)
> +with the
> +.B RESOLVE_IN_ROOT
> +flag set.
> +.PP
>  A process may get an entirely private mount namespace in case
>  it\(emor one of its ancestors\(emwas started by an invocation of the
>  .BR clone (2)
> @@ -48,16 +48,24 @@ system call that had the
>  flag set.)
>  This handles the \(aq/\(aq part of the pathname.
>  .PP
> -If the pathname does not start with the \(aq/\(aq character, the
> -starting lookup directory of the resolution process is the current working
> -directory of the process.
> -(This is also inherited from the parent.
> -It can be changed by use of the
> +If the pathname does not start with the \(aq/\(aq character, the starting
> +lookup directory of the resolution process is the current working directory of
> +the process \(em or in the case of
> +.BR openat (2)-style
> +syscalls, the
> +.I dfd
> +argument (or the current working directory if
> +.B AT_FDCWD
> +is passed as the
> +.I dfd
> +argumnet). The current working directory is inherited from the parent, and can
> +be changed by use of the
>  .BR chdir (2)
> -system call.)
> +syscall.
>  .PP
>  Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
>  Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> +
>  .SS Step 2: walk along the path
>  Set the current lookup directory to the starting lookup directory.
>  Now, for each nonfinal component of the pathname, where a component
> @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
>  was reworked to eliminate the use of recursion,
>  so that the only limit that remains is the maximum of 40
>  resolutions for the entire pathname.
> +.PP
> +The resolution of syscalls during this stage can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_SYMLINKS
> +flag set.
> +
>  .SS Step 3: find the final entry
>  The lookup of the final component of the pathname goes just like
>  that of all other components, as described in the previous step,
> @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
>  their conventional meanings, regardless of whether they are
>  actually present in the physical filesystem.
>  .PP
> -One cannot walk down past the root: "/.." is the same as "/".
> +One cannot walk up past the root: "/.." is the same as "/".
> +
>  .SS Mount points
>  After a "mount dev path" command, the pathname "path" refers to
>  the root of the filesystem hierarchy on the device "dev", and no
> @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
>  One can walk out of a mounted filesystem: "path/.." refers to
>  the parent directory of "path",
>  outside of the filesystem hierarchy on "dev".
> +.PP
> +Mount-point crossings can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_XDEV
> +flag set (though note that this also restricts bind-mount crossings).
> +
>  .SS Trailing slashes
>  If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
>  component as in Step 2: it has to exist and resolve to a directory.
> -- 
> 2.23.0
> 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* [PATCH RFC 3/3] openat2.2: document new syscall
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-1-cyphar@cyphar.com>

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man2/open.2            |   5 +
 man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
 man7/path_resolution.7 |  57 ++++--
 3 files changed, 426 insertions(+), 17 deletions(-)
 create mode 100644 man2/openat2.2

diff --git a/man2/open.2 b/man2/open.2
index 7217fe056e5e..a0b43394bbee 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Docuented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
 .B O_DIRECTORY
 is ignored).
 .SH SEE ALSO
+.BR openat2 (2),
 .BR chmod (2),
 .BR chown (2),
 .BR close (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 000000000000..c43c76046243
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,381 @@
+.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <sys/stat.h>
+.B #include <fcntl.h>
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality. Rather than taking a single
+.I flag
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+seamless future extensions.
+.PP
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+\fBNOTES\fP for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flag " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+    uint32_t flags;              /* open(2)-style O_* flags. */
+    union {
+        uint16_t mode;           /* File mode bits for new file creation. */
+        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
+    };
+    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation. All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+.RE
+
+.I upgrade_mask
+.RS
+Restrict with which
+.I access modes
+the returned
+.B O_PATH
+descriptor may be re-opened (either through
+.B O_EMPTYPATH
+or
+.IR /proc/self/fd/ .)
+This field may only be set to a non-zero value if
+.I flags
+contains
+.BR O_PATH .
+By default, an
+.B O_PATH
+file descriptor of an ordinary file may be re-opened with with any access mode (but an
+.B O_PATH
+file descriptor of a magic-link may only be re-opened with access modes that
+the original magic-link possessed). The full list of
+.I upgrade_mask
+flags is given below.
+.TP
+.B UPGRADE_NOREAD
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for reading (i.e.
+.BR O_RDONLY " or " O_RDWR .)
+.TP
+.B UPGRADE_NOWRITE
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for writing (i.e.
+.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
+.RE
+
+.I resolve
+.RS
+Change how the components of
+.I pathname
+will be resolved (see
+.BR path_resolution (7)
+for background information.) The primary use-case for these flags is to allow
+trusted programs to restrict how un-trusted paths (or paths inside un-trusted
+directories) are resolved. The full list of
+.I resolve
+flags is given below.
+.TP
+.B RESOLVE_NO_XDEV
+Disallow all mount-point crossings during path resolution (including
+all bind-mounts).
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as bind-mounts are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_SYMLINKS
+Disallow all symlink resolution during path resolution. If the trailing
+component is a symlink, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the symlink will be returned. This option implies
+.BR RESOLVE_NO_MAGICLINKS .
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as symlinks are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_MAGICLINKS
+Disallow all magic-link resolution during path resolution. If the trailing
+component is a magic-link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the magic-link will be returned.
+
+Magic-links are symlink-like objects that are most notably found in
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Due to the potential danger of unknowingly opening these magic-links, it may be
+preferable for users to disable their resolution entirely (see
+.BR symlink (7)
+for more details.)
+.TP
+.B RESOLVE_BENEATH
+Do not permit the path resolution to succeed if any component of the resolution
+is not a descendant of the directory indicated by
+.IR dirfd .
+This results in absolute symlinks (and absolute values of
+.IR pathname )
+to be rejected. Magic-link resolution is also not permitted.
+
+.TP
+.B RESOLVE_IN_ROOT
+Temporarily treat
+.I dirfd
+as the root of the filesystem (as though the user called
+.BR chroot (2)
+with
+.IR dirfd
+as the argument.) Absolute symlinks and ".." path components will be scoped to
+.IR dirfd . Magic-link resolution is also not permitted.
+
+However, unlike
+.BR chroot (2)
+(which changes the filesystem root persistently for an entire thread-group),
+.B RESOLVE_IN_ROOT
+allows a program to efficiently restrict path resolution for only certain
+operations. It also has several hardening features (such as not permitting
+magic-link resolution) which
+.BR chroot (2)
+does not.
+.RE
+
+.RE
+
+.PP
+Unlike
+.BR openat (2),
+any unknown flags set in fields of
+.I how
+will result in an error, rather than being ignored. In addition, an error will
+be returned if the value of the
+.IR mode " and " upgrade_mask
+union is non-zero unless:
+.RS
+.IP * 3
+.I flags
+indicates that a new file will be created (it contains
+.BR O_CREAT " or " O_TMPFILE ),
+in which case
+.I mode
+may be any valid file mode.
+.IP *
+.I flags
+contains
+.BR O_PATH ,
+in which case
+.I upgrade_mask
+must only contain valid
+.B UPGRADE_*
+flags.
+.RE
+
+.SH RETURN VALUE
+On success, a new file descriptor is returned. On error, -1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+The set of errors returned by
+.BR openat2 ()
+includes all of the errors returned by
+.BR openat (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+An unknown flag or invalid value was specified in
+.IR how .
+.TP
+.B EINVAL
+.I size
+was smaller than any known version of
+.IR "struct open_how" .
+.TP
+.B E2BIG
+An extension was specified in
+.IR how ,
+which the current kernel does not support (see the "Extensibility" section of
+the \fBNOTES\fP for more detail on how extensions are handled.)
+.TP
+.B EAGAIN
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and the kernel could not ensure that a ".." component didn't escape (due to a
+race condition or potential attack). Callers may choose to retry the
+.BR openat2 ()
+call.
+.TP
+.B EXDEV
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and a path component attempted to escape the root of the resolution.
+
+.TP
+.B EXDEV
+.I resolve
+contains
+.BR RESOLVE_NO_XDEV ,
+and a path component attempted to cross a mount-point.
+
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_SYMLINKS ,
+and one of the path components was a symlink.
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_MAGICLINKS ,
+and one of the path components was a magic-link.
+
+.SH VERSIONS
+.BR openat2 ()
+was added to Linux in kernel 5.FOO.
+
+.SH CONFORMING TO
+This system call is Linux-specific.
+
+The semantics of
+.B RESOLVE_BENEATH
+were modelled after FreeBSD's
+.BR O_BENEATH .
+
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+
+.SS Extensibility
+In order to allow for
+.I struct open_how
+to be extended in future kernel revisions,
+.BR openat2 ()
+requires userspace to specify what sized
+.I struct open_how
+structure they are passing. By providing this information, it is possible for
+.BR openat2 ()
+to provide both forwards- and backwards-compatibility \(em with
+.I size
+acting as an implicit version number (because new extension fields will always
+be appended, the size will always increase.) This extensibility design is very
+similar to other system calls such as
+.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
+
+If we let
+.I usize
+be the size of the structure according to userspace and
+.I ksize
+be the size of the structure which the kernel supports, then there are only
+three cases to consider:
+
+.RS
+.IP * 3
+If
+.IR ksize " equals " usize ,
+then there is no version mismatch and
+.I how
+can be used verbatim.
+.IP *
+If
+.IR ksize " is larger than " usize ,
+then there are some extensions the kernel supports which the userspace program
+is unaware of. Because all extensions must have their zero values be a no-op,
+the kernel treats all of the extension fields not set by userspace to have zero
+values. This provides backwards-compatibility.
+.IP *
+If
+.IR ksize " is smaller than " usize ,
+then there are some extensions which the userspace program is aware of but the
+kernel does not support. Because all extensions must have their zero values be
+a no-op, the kernel can safely ignore the unsupported extension fields if they
+are all-zero. If any unsupported extension fields are non-zero, then an error
+is returned. This provides forwards-compatibility.
+.RE
+
+Therefore, most userspace programs will not need to have any special handling
+of extensions. However, if a userspace program wishes to determine what
+extensions the running kernel supports, they may conduct a binary search on
+.IR size
+(to find the largest value which doesn't produce an error.)
+
+.SH SEE ALSO
+.BR openat (2),
+.BR path_resolution (7),
+.BR symlink (7)
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 85dd354e9a93..3da3e5b614c8 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
 Some UNIX/Linux system calls have as parameter one or more filenames.
 A filename (or pathname) is resolved as follows.
 .SS Step 1: start of the resolution process
-If the pathname starts with the \(aq/\(aq character,
-the starting lookup directory
-is the root directory of the calling process.
-(A process inherits its
-root directory from its parent.
-Usually this will be the root directory
-of the file hierarchy.
-A process may get a different root directory
-by use of the
+If the pathname starts with the \(aq/\(aq character, the starting lookup
+directory is the root directory of the calling process. (A process inherits its
+root directory from its parent. Usually this will be the root directory of the
+file hierarchy. A process may get a different root directory by use of the
 .BR chroot (2)
-system call.
+system call, or may temporarily use a different root directory by using
+.BR openat2 (2)
+with the
+.B RESOLVE_IN_ROOT
+flag set.
+.PP
 A process may get an entirely private mount namespace in case
 it\(emor one of its ancestors\(emwas started by an invocation of the
 .BR clone (2)
@@ -48,16 +48,24 @@ system call that had the
 flag set.)
 This handles the \(aq/\(aq part of the pathname.
 .PP
-If the pathname does not start with the \(aq/\(aq character, the
-starting lookup directory of the resolution process is the current working
-directory of the process.
-(This is also inherited from the parent.
-It can be changed by use of the
+If the pathname does not start with the \(aq/\(aq character, the starting
+lookup directory of the resolution process is the current working directory of
+the process \(em or in the case of
+.BR openat (2)-style
+syscalls, the
+.I dfd
+argument (or the current working directory if
+.B AT_FDCWD
+is passed as the
+.I dfd
+argumnet). The current working directory is inherited from the parent, and can
+be changed by use of the
 .BR chdir (2)
-system call.)
+syscall.
 .PP
 Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
 Pathnames not starting with a \(aq/\(aq are called relative pathnames.
+
 .SS Step 2: walk along the path
 Set the current lookup directory to the starting lookup directory.
 Now, for each nonfinal component of the pathname, where a component
@@ -124,6 +132,13 @@ the kernel's pathname-resolution code
 was reworked to eliminate the use of recursion,
 so that the only limit that remains is the maximum of 40
 resolutions for the entire pathname.
+.PP
+The resolution of syscalls during this stage can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_SYMLINKS
+flag set.
+
 .SS Step 3: find the final entry
 The lookup of the final component of the pathname goes just like
 that of all other components, as described in the previous step,
@@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
 their conventional meanings, regardless of whether they are
 actually present in the physical filesystem.
 .PP
-One cannot walk down past the root: "/.." is the same as "/".
+One cannot walk up past the root: "/.." is the same as "/".
+
 .SS Mount points
 After a "mount dev path" command, the pathname "path" refers to
 the root of the filesystem hierarchy on the device "dev", and no
@@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
 One can walk out of a mounted filesystem: "path/.." refers to
 the parent directory of "path",
 outside of the filesystem hierarchy on "dev".
+.PP
+Mount-point crossings can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_XDEV
+flag set (though note that this also restricts bind-mount crossings).
+
 .SS Trailing slashes
 If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
 component as in Step 2: it has to exist and resolve to a directory.
-- 
2.23.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox