* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Christian Brauner @ 2019-09-05 11:29 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Aleksa Sarai, Al Viro, Jeff Layton, J. Bruce Fields,
Arnd Bergmann, David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Peter Zijlstra, Christian Brauner, Eric Biederman,
Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
Oleg Nesterov
In-Reply-To: <ae415ea8-4442-d81c-3b46-2ae5fb35bbdf@rasmusvillemoes.dk>
On Thu, Sep 05, 2019 at 01:17:38PM +0200, Rasmus Villemoes wrote:
> On 05/09/2019 13.05, Christian Brauner wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
>
> >> + if (unlikely(!access_ok(dst, usize)))
> >> + return -EFAULT;
> >> +
> >> + /* Deal with trailing bytes. */
> >> + if (usize < ksize) {
> >> + if (memchr_inv(src + size, 0, rest))
> >> + return -EFBIG;
> >> + } else if (usize > ksize) {
> >> + if (__memzero_user(dst + size, rest))
> >> + return -EFAULT;
> >
> > Is zeroing that memory really our job? Seems to me we should just check
> > it is zeroed.
>
> Of course it is, otherwise you'd require userspace to clear the output
> buffer it gives us, which in the majority of cases is wasted work. It's
> much easier to reason about if we just say "the kernel populates [uaddr,
> uaddr + usize)".
I don't really mind either way so sure. :)
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Christian Brauner @ 2019-09-05 11:40 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg
In-Reply-To: <20190905112718.ojg3znly6x3m4mjq@yavin.dot.cyphar.com>
On Thu, Sep 05, 2019 at 09:27:18PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > A common pattern for syscall extensions is increasing the size of a
> > > struct passed from userspace, such that the zero-value of the new fields
> > > result in the old kernel behaviour (allowing for a mix of userspace and
> > > kernel vintages to operate on one another in most cases). This is done
> > > in both directions -- hence two helpers -- though it's more common to
> > > have to copy user space structs into kernel space.
> > >
> > > Previously there was no common lib/ function that implemented
> > > the necessary extension-checking semantics (and different syscalls
> > > implemented them slightly differently or incompletely[1]). A future
> > > patch replaces all of the common uses of this pattern to use the new
> > > copy_struct_{to,from}_user() helpers.
> > >
> > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > > similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > > always rejects differently-sized struct arguments.
> > >
> > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> >
> > I would probably split this out into a separate patchset. It can very
> > well go in before openat2(). Thoughts?
>
> Yeah, I'll split this and the related patches out -- though I will admit
> I'm not sure how you're supposed to deal with multiple independent
> patchsets that depend on each other. How will folks reviewing openat2(2)
> know to include the lib/struct_user.c changes?
The way I usually deal with this is to make two branches. One with the
changes the other depends on and then merge this branch into the other
and put the changes on top. Then you can provide a complete branch that
people can test when you send the patchset out by just linking to it in
the cover letter.
(But if it's too much hazzle just leave it.)
>
> Also, whose tree should it go through?
If people think splitting it out makes sense and we can settle the
technical details I can take it and let it stew in linux-next at least
for a little while.
I have changes to clone3() in there that touch
copy_clone_args_from_user() anyway and there are tests for clone3()
struct copying so we'd catch regressions (for clone3() at least) pretty
quickly.
If we don't see any major issues in the next two weeks it might even be
ok to send for 5.4.
Christian
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Aleksa Sarai @ 2019-09-05 13:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Christian Brauner, Rasmus Villemoes, Eric Biederman,
Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
Oleg Nesterov, Alexander
In-Reply-To: <20190905094305.GJ2349@hirez.programming.kicks-ass.net>
[-- Attachment #1: Type: text/plain, Size: 5535 bytes --]
On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +/**
> > > > + * copy_struct_to_user: copy a struct to user space
> > > > + * @dst: Destination address, in user space.
> > > > + * @usize: Size of @dst struct.
> > > > + * @src: Source address, in kernel space.
> > > > + * @ksize: Size of @src struct.
> > > > + *
> > > > + * Copies a struct from kernel space to user space, in a way that guarantees
> > > > + * backwards-compatibility for struct syscall arguments (as long as future
> > > > + * struct extensions are made such that all new fields are *appended* to the
> > > > + * old struct, and zeroed-out new fields have the same meaning as the old
> > > > + * struct).
> > > > + *
> > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > > > + * The recommended usage is something like the following:
> > > > + *
> > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > > + * {
> > > > + * int err;
> > > > + * struct foo karg = {};
> > > > + *
> > > > + * // do something with karg
> > > > + *
> > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > + * if (err)
> > > > + * return err;
> > > > + *
> > > > + * // ...
> > > > + * }
> > > > + *
> > > > + * There are three cases to consider:
> > > > + * * If @usize == @ksize, then it's copied verbatim.
> > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> > > > + * older user space. In order to avoid user space getting incomplete
> > > > + * information (new fields might be important), all trailing bytes in @src
> > > > + * (@ksize - @usize) must be zerored
> > >
> > > s/zerored/zero/, right?
> >
> > It should've been "zeroed".
>
> That reads wrong to me; that way it reads like this function must take
> that action and zero out the 'rest'; which is just wrong.
>
> This function must verify those bytes are zero, not make them zero.
Right, in my head I was thinking "must have been zeroed" which isn't
what it says. I'll switch to "zero".
> > > > , otherwise -EFBIG is returned.
> > >
> > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> >
> > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > struct passed to user-space. I would personally have preferred EMSGSIZE
> > instead of EFBIG, but felt using the existing error codes would be less
> > confusing.
>
> Sadly a recent commit:
>
> 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code")
>
> Made the situation even 'worse'.
I hadn't seen this patch before, and I have a few questions taking a
look at it:
* An error code for a particular behaviour was changed (EFBIG ->
E2BIG). Is this not a userspace breakage (I know Linus went ballistic
about something similar a while ago[1]), or did I misunderstand what
the issue was in [1]?
* At the risk of bike-shedding -- of we are changing it, wouldn't
-EMSGSIZE be more appropriate? To be fair, picking errno values has
always been more of an art than a science, but to my ears "Argument
list too long" doesn't make too much sense in the context of
"returning" a struct back to userspace (and the cause of the error
is that the argument passed by user space *isn't big enough*). If
there was an E2SMALL that would also work. ;)
* Do you want me to write a patch based on that, to switch it to
copy_struct_to_user()?
* That patch removes the "are there non-zero bytes in the tail that
userspace won't know about" check (which I have included in mine). I
understand that this caused issues specifically with sched_getattr(2)
due to the default value not being zero -- how should we rectify that
(given that we'd hopefully want to port everyone who uses that
interface to copy_struct_{to,from}_user())?
* Given that the [uk]attr->size construct is pretty important to the
usability of the sched and perf interfaces, should we require (or
encourage) it for all struct-extension syscall setups?
> > > > + if (unlikely(!access_ok(src, usize)))
> > > > + return -EFAULT;
> > > > +
> > > > + /* Deal with trailing bytes. */
> > > > + if (usize < ksize)
> > > > + memset(dst + size, 0, rest);
> > > > + else if (usize > ksize) {
> > > > + const void __user *addr = src + size;
> > > > + char buffer[BUFFER_SIZE] = {};
> > >
> > > Isn't that too big for on-stack?
> >
> > Is a 64-byte buffer too big? I picked the number "at random" to be the
> > size of a cache line, but I could shrink it down to 32 bytes if the size
> > is an issue (I wanted to avoid needless allocations -- hence it being
> > on-stack).
>
> Ah, my ctags gave me a definition of BUFFER_SIZE that was 512. I suppose
> 64 should be OK.
Good to know, though I'll rename it to avoid confusion.
[1]: https://lore.kernel.org/lkml/CA+55aFy98A+LJK4+GWMcbzaa1zsPBRo76q+ioEjbx-uaMKH6Uw@mail.gmail.com/
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Aleksa Sarai @ 2019-09-05 13:40 UTC (permalink / raw)
To: Christian Brauner
Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg
In-Reply-To: <20190905110544.d6c5t7rx25kvywmi@wittgenstein>
[-- Attachment #1: Type: text/plain, Size: 1978 bytes --]
On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases). This is done
> > in both directions -- hence two helpers -- though it's more common to
> > have to copy user space structs into kernel space.
> >
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[1]). A future
> > patch replaces all of the common uses of this pattern to use the new
> > copy_struct_{to,from}_user() helpers.
> >
> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > always rejects differently-sized struct arguments.
> >
> > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
[...]
> > + if (unlikely(!access_ok(src, usize)))
> > + return -EFAULT;
> > +
> > + /* Deal with trailing bytes. */
> > + if (usize < ksize)
> > + memset(dst + size, 0, rest);
[...]
> That's a change in behavior for clone3() and sched at least, no? Unless
> - which I guess you might have done - you have moved the "error out when
> the struct is too small" part before the call to copy_struct_from_user()
> for them.
Yes, I've put the minimum size check to the callers in all of the
cases (in the case of clone3() I've #define'd a CLONE_ARGS_SIZE_VER0 to
match the others -- see patch 2 of the series).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* [PATCH 2/2] ipc: fix sparc64 ipc() wrapper
From: Arnd Bergmann @ 2019-09-05 15:21 UTC (permalink / raw)
To: linux-kernel, Andrew Morton, David S. Miller, Arnd Bergmann,
Eric W. Biederman, Deepa Dinamani, Christian Brauner,
Manfred Spraul, Davidlohr Bueso
Cc: linux-arch, y2038, Dominik Brodowski, Matt Turner, stable,
sparclinux, linux-api
In-Reply-To: <20190905152155.1392871-1-arnd@arndb.de>
Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
to a commit from my y2038 series in linux-5.1, as I missed the custom
sys_ipc() wrapper that sparc64 uses in place of the generic version that
I patched.
The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
now do not allow being called with the IPC_64 flag any more, resulting
in a -EINVAL error when they don't recognize the command.
Instead, the correct way to do this now is to call the internal
ksys_old_{sem,shm,msg}ctl() functions to select the API version.
As we generally move towards these functions anyway, change all of
sparc_ipc() to consistently use those in place of the sys_*() versions,
and move the required ksys_*() declarations into linux/syscalls.h
Reported-by: Matt Turner <mattst88@gmail.com>
Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Hi Matt,
Can you check that this solves your problem?
---
arch/sparc/kernel/sys_sparc_64.c | 30 +++++++++++++++---------------
include/linux/syscalls.h | 19 +++++++++++++++++++
ipc/util.h | 25 ++-----------------------
3 files changed, 36 insertions(+), 38 deletions(-)
diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c
index ccc88926bc00..5ad0494df367 100644
--- a/arch/sparc/kernel/sys_sparc_64.c
+++ b/arch/sparc/kernel/sys_sparc_64.c
@@ -340,21 +340,21 @@ SYSCALL_DEFINE6(sparc_ipc, unsigned int, call, int, first, unsigned long, second
if (call <= SEMTIMEDOP) {
switch (call) {
case SEMOP:
- err = sys_semtimedop(first, ptr,
- (unsigned int)second, NULL);
+ err = ksys_semtimedop(first, ptr,
+ (unsigned int)second, NULL);
goto out;
case SEMTIMEDOP:
- err = sys_semtimedop(first, ptr, (unsigned int)second,
+ err = ksys_semtimedop(first, ptr, (unsigned int)second,
(const struct __kernel_timespec __user *)
- (unsigned long) fifth);
+ (unsigned long) fifth);
goto out;
case SEMGET:
- err = sys_semget(first, (int)second, (int)third);
+ err = ksys_semget(first, (int)second, (int)third);
goto out;
case SEMCTL: {
- err = sys_semctl(first, second,
- (int)third | IPC_64,
- (unsigned long) ptr);
+ err = ksys_old_semctl(first, second,
+ (int)third | IPC_64,
+ (unsigned long) ptr);
goto out;
}
default:
@@ -365,18 +365,18 @@ SYSCALL_DEFINE6(sparc_ipc, unsigned int, call, int, first, unsigned long, second
if (call <= MSGCTL) {
switch (call) {
case MSGSND:
- err = sys_msgsnd(first, ptr, (size_t)second,
+ err = ksys_msgsnd(first, ptr, (size_t)second,
(int)third);
goto out;
case MSGRCV:
- err = sys_msgrcv(first, ptr, (size_t)second, fifth,
+ err = ksys_msgrcv(first, ptr, (size_t)second, fifth,
(int)third);
goto out;
case MSGGET:
- err = sys_msgget((key_t)first, (int)second);
+ err = ksys_msgget((key_t)first, (int)second);
goto out;
case MSGCTL:
- err = sys_msgctl(first, (int)second | IPC_64, ptr);
+ err = ksys_old_msgctl(first, (int)second | IPC_64, ptr);
goto out;
default:
err = -ENOSYS;
@@ -396,13 +396,13 @@ SYSCALL_DEFINE6(sparc_ipc, unsigned int, call, int, first, unsigned long, second
goto out;
}
case SHMDT:
- err = sys_shmdt(ptr);
+ err = ksys_shmdt(ptr);
goto out;
case SHMGET:
- err = sys_shmget(first, (size_t)second, (int)third);
+ err = ksys_shmget(first, (size_t)second, (int)third);
goto out;
case SHMCTL:
- err = sys_shmctl(first, (int)second | IPC_64, ptr);
+ err = ksys_old_shmctl(first, (int)second | IPC_64, ptr);
goto out;
default:
err = -ENOSYS;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 88145da7d140..f7c561c4dcdd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1402,4 +1402,23 @@ static inline unsigned int ksys_personality(unsigned int personality)
return old;
}
+/* for __ARCH_WANT_SYS_IPC */
+long ksys_semtimedop(int semid, struct sembuf __user *tsops,
+ unsigned int nsops,
+ const struct __kernel_timespec __user *timeout);
+long ksys_semget(key_t key, int nsems, int semflg);
+long ksys_old_semctl(int semid, int semnum, int cmd, unsigned long arg);
+long ksys_msgget(key_t key, int msgflg);
+long ksys_old_msgctl(int msqid, int cmd, struct msqid_ds __user *buf);
+long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz,
+ long msgtyp, int msgflg);
+long ksys_msgsnd(int msqid, struct msgbuf __user *msgp, size_t msgsz,
+ int msgflg);
+long ksys_shmget(key_t key, size_t size, int shmflg);
+long ksys_shmdt(char __user *shmaddr);
+long ksys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
+long compat_ksys_semtimedop(int semid, struct sembuf __user *tsems,
+ unsigned int nsops,
+ const struct old_timespec32 __user *timeout);
+
#endif
diff --git a/ipc/util.h b/ipc/util.h
index 0fcf8e719b76..5766c61aed0e 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -276,29 +276,7 @@ static inline int compat_ipc_parse_version(int *cmd)
*cmd &= ~IPC_64;
return version;
}
-#endif
-/* for __ARCH_WANT_SYS_IPC */
-long ksys_semtimedop(int semid, struct sembuf __user *tsops,
- unsigned int nsops,
- const struct __kernel_timespec __user *timeout);
-long ksys_semget(key_t key, int nsems, int semflg);
-long ksys_old_semctl(int semid, int semnum, int cmd, unsigned long arg);
-long ksys_msgget(key_t key, int msgflg);
-long ksys_old_msgctl(int msqid, int cmd, struct msqid_ds __user *buf);
-long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz,
- long msgtyp, int msgflg);
-long ksys_msgsnd(int msqid, struct msgbuf __user *msgp, size_t msgsz,
- int msgflg);
-long ksys_shmget(key_t key, size_t size, int shmflg);
-long ksys_shmdt(char __user *shmaddr);
-long ksys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
-
-/* for CONFIG_ARCH_WANT_OLD_COMPAT_IPC */
-long compat_ksys_semtimedop(int semid, struct sembuf __user *tsems,
- unsigned int nsops,
- const struct old_timespec32 __user *timeout);
-#ifdef CONFIG_COMPAT
long compat_ksys_old_semctl(int semid, int semnum, int cmd, int arg);
long compat_ksys_old_msgctl(int msqid, int cmd, void __user *uptr);
long compat_ksys_msgrcv(int msqid, compat_uptr_t msgp, compat_ssize_t msgsz,
@@ -306,6 +284,7 @@ long compat_ksys_msgrcv(int msqid, compat_uptr_t msgp, compat_ssize_t msgsz,
long compat_ksys_msgsnd(int msqid, compat_uptr_t msgp,
compat_ssize_t msgsz, int msgflg);
long compat_ksys_old_shmctl(int shmid, int cmd, void __user *uptr);
-#endif /* CONFIG_COMPAT */
+
+#endif
#endif
--
2.20.0
^ permalink raw reply related
* Re: [PATCH v3 bpf-next 2/3] bpf: implement CAP_BPF
From: Song Liu @ 2019-09-05 16:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, David S . Miller, Daniel Borkmann,
Peter Zijlstra, Andy Lutomirski, Networking, bpf, Kernel Team,
linux-api@vger.kernel.org
In-Reply-To: <20190905031518.behyq7olkh6fjsoe@ast-mbp.dhcp.thefacebook.com>
> On Sep 4, 2019, at 8:15 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 05, 2019 at 02:51:51AM +0000, Song Liu wrote:
>>
>>
>>> On Sep 4, 2019, at 6:32 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Thu, Sep 05, 2019 at 12:34:36AM +0000, Song Liu wrote:
>>>>
>>>>
>>>>> On Sep 4, 2019, at 11:43 AM, Alexei Starovoitov <ast@kernel.org> wrote:
>>>>>
>>>>> Implement permissions as stated in uapi/linux/capability.h
>>>>>
>>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>>>>
>>>>
>>>> [...]
>>>>
>>>>> @@ -1648,11 +1648,11 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
>>>>> is_gpl = license_is_gpl_compatible(license);
>>>>>
>>>>> if (attr->insn_cnt == 0 ||
>>>>> - attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
>>>>> + attr->insn_cnt > (capable_bpf() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
>>>>> return -E2BIG;
>>>>> if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>>>>> type != BPF_PROG_TYPE_CGROUP_SKB &&
>>>>> - !capable(CAP_SYS_ADMIN))
>>>>> + !capable_bpf())
>>>>> return -EPERM;
>>>>
>>>> Do we allow load BPF_PROG_TYPE_SOCKET_FILTER and BPF_PROG_TYPE_CGROUP_SKB
>>>> without CAP_BPF? If so, maybe highlight in the header?
>>>
>>> of course. there is no change in behavior.
>>> 'highlight in the header'?
>>> you mean in commit log?
>>> I think it's a bit weird to describe things in commit that patch
>>> is _not_ changing vs things that patch does actually change.
>>> This type of comment would be great in a doc though.
>>> The doc will be coming separately in the follow up assuming
>>> the whole thing lands. I'll remember to note that bit.
>>
>> I meant capability.h:
>>
>> + * CAP_BPF allows the following BPF operations:
>> + * - Loading all types of BPF programs
>>
>> But CAP_BPF is not required to load all types of programs.
>
> yes, but above statement is still correct, right?
>
> And right below it says:
> * CAP_BPF allows the following BPF operations:
> * - Loading all types of BPF programs
> * - Creating all types of BPF maps except:
> * - stackmap that needs CAP_TRACING
> * - devmap that needs CAP_NET_ADMIN
> * - cpumap that needs CAP_SYS_ADMIN
> which is also correct, but CAP_BPF is not required
> for array, hash, prog_array, percpu, map-in-map ...
> except their lru variants...
> and except if they contain bpf_spin_lock...
> and if they need BTF it currently can be loaded with cap_sys_admin only...
>
> If we say something about socket_filter, cg_skb progs in capability.h
> we should clarify maps as well, but then it will become too big for .h
> The comments in capability.h already look too long to me.
> All that info and a lot more belongs in the doc.
Agreed. We cannot put all these details in capability.h. Doc/wikipages
would be better fit for these information.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Aleksa Sarai @ 2019-09-05 17:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Christian Brauner, Rasmus Villemoes, Eric Biederman,
Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
Oleg Nesterov, Alexander
In-Reply-To: <20190905094305.GJ2349@hirez.programming.kicks-ass.net>
[-- Attachment #1: Type: text/plain, Size: 1727 bytes --]
On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +
> > > > + while (rest > 0) {
> > > > + size_t bufsize = min(rest, sizeof(buffer));
> > > > +
> > > > + if (__copy_from_user(buffer, addr, bufsize))
> > > > + return -EFAULT;
> > > > + if (memchr_inv(buffer, 0, bufsize))
> > > > + return -E2BIG;
> > > > +
> > > > + addr += bufsize;
> > > > + rest -= bufsize;
> > > > + }
> > >
> > > The perf implementation uses get_user(); but if that is too slow, surely
> > > we can do something with uaccess_try() here?
> >
> > Is there a non-x86-specific way to do that (unless I'm mistaken only x86
> > has uaccess_try() or the other *_try() wrappers)? The main "performance
> > improvement" (if you can even call it that) is that we use memchr_inv()
> > which finds non-matching characters more efficiently than just doing a
> > loop.
>
> Oh, you're right, that's x86 only :/
Though, I just had an idea -- am I wrong to think that the following
would work just as well (without the need for an intermediate buffer)?
if (memchr_inv((const char __force *) src + size, 0, rest))
return -E2BIG;
Or is this type of thing very much frowned upon? What if it was a
separate memchr_inv_user() instead -- I feel as though there's not a
strong argument for needing to use a buffer when we're single-passing
the __user buffer and doing a basic boolean check.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Why add the general notification queue and its sources
From: David Howells @ 2019-09-05 17:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Greg Kroah-Hartman, rstrode, swhiteho, nicolas.dichtel,
raven, keyrings, linux-usb, linux-block, Christian Brauner,
LSM List, linux-fsdevel, Linux API, Linux List Kernel Mailing
In-Reply-To: <CAHk-=wh5ZNE9pBwrnr5MX3iqkUP4nspz17rtozrSxs5-OGygNw@mail.gmail.com>
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Here's a set of patches to add a general notification queue concept and to
> > add event sources such as:
>
> Why?
>
> I'm just going to be very blunt about this, and say that there is no
> way I can merge any of this *ever*, unless other people stand up and
> say that
>
> (a) they'll use it
>
> and
>
> (b) they'll actively develop it and participate in testing and coding
Besides the core notification buffer which ties this together, there are a
number of sources that I've implemented, not all of which are in this patch
series:
(1) Key/keyring notifications.
If you have your kerberos tickets in a file/directory, your gnome desktop
will monitor that using something like fanotify and tell you if your
credentials cache changes.
We also have the ability to cache your kerberos tickets in the session,
user or persistent keyring so that it isn't left around on disk across a
reboot or logout. Keyrings, however, cannot currently be monitored
asynchronously, so the desktop has to poll for it - not so good on a
laptop.
This source will allow the desktop to avoid the need to poll.
(2) USB notifications.
GregKH was looking for a way to do USB notifications as I was looking to
find additional sources to implement. I'm not sure how he wants to use
them, but I'll let him speak to that himself.
(3) Block notifications.
This one I was thinking that I could make something like ddrescue better
by letting it get notifications this way. This was a target of
convenience since I had a dodgy disk I was trying to rescue.
It could also potentially be used help systemd, say, detect broken
devices and avoid trying to unmount them when trying to reboot the machine.
I can drop this for now if you prefer.
(4) Mount notifications.
This one is wanted to avoid repeated trawling of /proc/mounts or similar
to work out changes to the mount object attributes and mount topology.
I'm told that the proc file holding the namespace_sem is a point of
contention, especially as the process of generating the text descriptions
of the mounts/superblocks can be quite involved.
The notifications directly indicate the mounts involved in any particular
event and what the change was. You can poll /proc/mounts, but all you
know is that something changed; you don't know what and you don't know
how and reading that file may race with multiple changed being effected.
I pair this with a new fsinfo() system call that allows, amongst other
things, the ability to retrieve in one go an { id, change counter } tuple
from all the children of a specified mount, allowing buffer overruns to
be cleaned up quickly.
It's not just Red Hat that's potentially interested in this:
https://lore.kernel.org/linux-fsdevel/293c9bd3-f530-d75e-c353-ddeabac27cf6@6wind.com/
(5) Superblock notifications.
This one is provided to allow systemd or the desktop to more easily
detect events such as I/O errors and EDQUOT/ENOSPC.
I've tried to make the core multipurpose so that the price of the code
footprint is mitigated.
David
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-05 17:19 UTC (permalink / raw)
To: David Howells
Cc: Greg Kroah-Hartman, rstrode, swhiteho, Nicolas Dichtel, raven,
keyrings, linux-usb, linux-block, Christian Brauner, LSM List,
linux-fsdevel, Linux API, Linux List Kernel Mailing
In-Reply-To: <17703.1567702907@warthog.procyon.org.uk>
On Thu, Sep 5, 2019 at 10:01 AM David Howells <dhowells@redhat.com> wrote:
> >
> > I'm just going to be very blunt about this, and say that there is no
> > way I can merge any of this *ever*, unless other people stand up and
> > say that
> >
> > (a) they'll use it
> >
> > and
> >
> > (b) they'll actively develop it and participate in testing and coding
>
> Besides the core notification buffer which ties this together, there are a
> number of sources that I've implemented, not all of which are in this patch
> series:
You've at least now answered part of the "Why", but you didn't
actually answer the whole "another developer" part.
I really don't like how nobody else than you seems to even look at any
of the key handling patches. Because nobody else seems to care.
This seems to be another new subsystem / driver that has the same
pattern. If it's all just you, I don't want to merge it, because I
really want more than just other developers doing "Reviewed-by" after
looking at somebody elses code that they don't actually use or really
care about.
See what I'm saying?
New features that go into the kernel should have multiple users. Not a
single developer who pushes both the kernel feature and the single use
of that feature.
This very much comes from me reverting the key ACL pull. Not only did
I revert it, ABSOLUTELY NOBODY even reacted to the revert. Nobody
stepped up and said they they want that new ACL code, and pushed for a
fix. There was some very little murmuring about it when Mimi at least
figured out _why_ it broke, but other than that all the noise I saw
about the revert was Eric Biggers pointing out it broke other things
too, and that it had actually broken some test suites. But since it
hadn't even been in linux-next, that too had been noticed much too
late.
See what I'm saying? This whole "David Howells does his own features
that nobody else uses" needs to stop. You need to have a champion. I
just don't feel safe pulling these kinds of changes from you, because
I get the feeling that ABSOLUTELY NOBODY ELSE ever really looked at it
or really cared.
Most of the patches has nobody else even Cc'd, and even the patches
that do have some "Reviewed-by" feel more like somebody else went "ok,
the change looks fine to me", without any other real attachment to the
code.
New kernel features and interfaces really need to have a higher
barrier of entry than one developer working on his or her own thing.
Is that a change from 25 years ago? Or yes it is. We can point to lots
of "single developer did a thing" from years past. But things have
changed. And once bitten, twice shy: I really am a _lot_ more nervous
about all these key changes now.
Linus
^ permalink raw reply
* Re: [PATCH 2/2] ipc: fix sparc64 ipc() wrapper
From: Matt Turner @ 2019-09-05 17:22 UTC (permalink / raw)
To: Arnd Bergmann
Cc: LKML, Andrew Morton, David S. Miller, Eric W. Biederman,
Deepa Dinamani, Christian Brauner, Manfred Spraul,
Davidlohr Bueso, Linux-Arch, y2038 Mailman List,
Dominik Brodowski, stable, sparclinux, Linux API
In-Reply-To: <20190905152155.1392871-2-arnd@arndb.de>
On Thu, Sep 5, 2019 at 8:23 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
> to a commit from my y2038 series in linux-5.1, as I missed the custom
> sys_ipc() wrapper that sparc64 uses in place of the generic version that
> I patched.
>
> The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
> now do not allow being called with the IPC_64 flag any more, resulting
> in a -EINVAL error when they don't recognize the command.
>
> Instead, the correct way to do this now is to call the internal
> ksys_old_{sem,shm,msg}ctl() functions to select the API version.
>
> As we generally move towards these functions anyway, change all of
> sparc_ipc() to consistently use those in place of the sys_*() versions,
> and move the required ksys_*() declarations into linux/syscalls.h
>
> Reported-by: Matt Turner <mattst88@gmail.com>
> Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
> Cc: stable@vger.kernel.org
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> Hi Matt,
>
> Can you check that this solves your problem?
Works great. Thank you Arnd!
Tested-by: Matt Turner <mattst88@gmail.com>
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Al Viro @ 2019-09-05 18:07 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
Christian Brauner, Rasmus Villemoes, Eric Biederman,
Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
Oleg Nesterov, Alexander
In-Reply-To: <20190904201933.10736-2-cyphar@cyphar.com>
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
> + while (s > 0) {
> + size_t n = min(s, sizeof(zeros));
> +
> + if (__copy_to_user(p, zeros, n))
> + return -EFAULT;
> +
> + p += n;
> + s -= n;
> + }
> + return 0;
> +}
That's called clear_user().
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Why?
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
Why not simply clear_user() and copy_to_user()?
> +int copy_struct_from_user(void *dst, size_t ksize,
> + const void __user *src, size_t usize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
Cute, but... you would be just as well without that 'rest' thing.
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Again, why?
> + if (unlikely(!access_ok(src, usize)))
> + return -EFAULT;
Why not simply copy_from_user() here?
> + /* Deal with trailing bytes. */
> + if (usize < ksize)
> + memset(dst + size, 0, rest);
> + else if (usize > ksize) {
> + const void __user *addr = src + size;
> + char buffer[BUFFER_SIZE] = {};
> +
> + while (rest > 0) {
> + size_t bufsize = min(rest, sizeof(buffer));
> +
> + if (__copy_from_user(buffer, addr, bufsize))
> + return -EFAULT;
> + if (memchr_inv(buffer, 0, bufsize))
> + return -E2BIG;
Frankly, that looks like a candidate for is_all_zeroes_user().
With the loop like above serving as a dumb default. And on
badly alighed address it _will_ be dumb. Probably too much
so - something like
if ((unsigned long)addr & 1) {
u8 v;
if (get_user(v, (__u8 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
addr++;
}
if ((unsigned long)addr & 2) {
u16 v;
if (get_user(v, (__u16 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
addr +=2;
}
if ((unsigned long)addr & 4) {
u32 v;
if (get_user(v, (__u32 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
}
<read the rest like you currently do>
would be saner, and things like x86 could trivially add an
asm variant - it's not hard. Incidentally, memchr_inv() is
an overkill in this case...
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Christian Brauner @ 2019-09-05 18:23 UTC (permalink / raw)
To: Al Viro
Cc: Aleksa Sarai, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov
In-Reply-To: <20190905180750.GQ1131@ZenIV.linux.org.uk>
On Thu, Sep 05, 2019 at 07:07:50PM +0100, Al Viro wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > + const char zeros[BUFFER_SIZE] = {};
> > + while (s > 0) {
> > + size_t n = min(s, sizeof(zeros));
> > +
> > + if (__copy_to_user(p, zeros, n))
> > + return -EFAULT;
> > +
> > + p += n;
> > + s -= n;
> > + }
> > + return 0;
> > +}
>
> That's called clear_user().
>
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > + const void *src, size_t ksize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
> > +
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> Why?
Because every caller of that function right now has that limit set
anyway iirc. So we can either remove it from here and place it back for
the individual callers or leave it in the helper.
Also, I'm really asking, why not? Is it unreasonable to have an upper
bound on the size (for a long time probably) or are you disagreeing with
PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
bpf, and clone3 and in a few other places.
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Al Viro @ 2019-09-05 18:28 UTC (permalink / raw)
To: Christian Brauner
Cc: Aleksa Sarai, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov
In-Reply-To: <20190905182303.7f6bxpa2enbgcegv@wittgenstein>
On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> Because every caller of that function right now has that limit set
> anyway iirc. So we can either remove it from here and place it back for
> the individual callers or leave it in the helper.
> Also, I'm really asking, why not? Is it unreasonable to have an upper
> bound on the size (for a long time probably) or are you disagreeing with
> PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> bpf, and clone3 and in a few other places.
For a primitive that can be safely used with any size (OK, any within
the usual 2Gb limit)? Why push the random policy into the place where
it doesn't belong?
Seriously, what's the point? If they want to have a large chunk of
userland memory zeroed or checked for non-zeroes - why would that
be a problem?
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: Ray Strode @ 2019-09-05 18:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Howells, Greg Kroah-Hartman, Steven Whitehouse,
Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
Christian Brauner, LSM List, linux-fsdevel, Linux API,
Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wjQ5Fpv0D7rxX0W=obx9xoOAxJ_Cr+pGCYOAi2S9FiCNg@mail.gmail.com>
Hi,
On Thu, Sep 5, 2019 at 1:20 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> You've at least now answered part of the "Why", but you didn't
> actually answer the whole "another developer" part.
It's certainly something we've wanted in the GNOME world for a long time:
See for instance
https://bugzilla.redhat.com/show_bug.cgi?id=991110
and
https://bugzilla.gnome.org/show_bug.cgi?id=707402
from all the way back 2013. These are the alternatives I can think of:
- poll? status quo, but not great for obvious wakeup reasons
- use a different credential cache collection type that does support
change notification?
some of the other types do support change notification, but have their
own set of
problems. But maybe we should just go back to DIR type credential
cache collections
and try to figure out the life cycle problems they pose, i don't
know... or get more
man power behind KCM...
- manage change notification entirely from userspace. assume credentials will
always be put in place from krb5-libs entry points, and just skip
notification if
it happens out from under the libraries. maybe upstream kerberos guys would
be onboard with this, I don't know. This seems less robust than having
the kernel
in the loop, though.
> I really don't like how nobody else than you seems to even look at any
> of the key handling patches. Because nobody else seems to care.
I've got no insight here, so i'll just throw a dart...
viro, is this something you have any interest in watching closer?
> See what I'm saying? This whole "David Howells does his own features
> that nobody else uses" needs to stop. You need to have a champion. I
> just don't feel safe pulling these kinds of changes from you, because
> I get the feeling that ABSOLUTELY NOBODY ELSE ever really looked at it
> or really cared.
I get the "one man is not enough for proper maintenance" argument, and
maybe it's true. I don't know.
But I just want to point out, I have been asking dhowells for this change
notification API for years, so it's not something he did on his own and for
no particularly good reason. It solves a real problem and has a real-world
use case.
He kindly did it because I (and Robbie Harwood and others) asked him about
it, off and on, and he was able to fit it onto his priority list for us.
>From this thread, it sounds like he solved a problem for Greg too, killing a
couple birds with one stone?
--Ray
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: Greg Kroah-Hartman @ 2019-09-05 18:33 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, rstrode, swhiteho, nicolas.dichtel, raven,
keyrings, linux-usb, linux-block, Christian Brauner, LSM List,
linux-fsdevel, Linux API, Linux List Kernel Mailing
In-Reply-To: <17703.1567702907@warthog.procyon.org.uk>
On Thu, Sep 05, 2019 at 06:01:47PM +0100, David Howells wrote:
> (2) USB notifications.
>
> GregKH was looking for a way to do USB notifications as I was looking to
> find additional sources to implement. I'm not sure how he wants to use
> them, but I'll let him speak to that himself.
We are getting people asking for all sorts of "error reporting" events
that can happen in the USB subsystem that we have started to abuse the
KOBJ_CHANGE uevent notification for. At the same time your patches were
submitted, someone else submitted yet-another-USB-error patchset. This
type of user/kernel interface is much easier to use than abusing uevents
for USB errors and general notifications about what happened with USB
devices (more than just add/remove that uevents have).
So yes, I would like this, and I am sure the ChromeOS people would like
it too given that I rejected their patcheset with the assumption that
this could be done with the notification queue api "soon" :)
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Christian Brauner @ 2019-09-05 18:35 UTC (permalink / raw)
To: Al Viro
Cc: Aleksa Sarai, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov
In-Reply-To: <20190905182801.GR1131@ZenIV.linux.org.uk>
On Thu, Sep 05, 2019 at 07:28:01PM +0100, Al Viro wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
>
> > Because every caller of that function right now has that limit set
> > anyway iirc. So we can either remove it from here and place it back for
> > the individual callers or leave it in the helper.
> > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > bound on the size (for a long time probably) or are you disagreeing with
> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > bpf, and clone3 and in a few other places.
>
> For a primitive that can be safely used with any size (OK, any within
> the usual 2Gb limit)? Why push the random policy into the place where
> it doesn't belong?
Ah, the "not in the helper part" makes sense.
As long as leave the check for the callers themselves.
>
> Seriously, what's the point? If they want to have a large chunk of
> userland memory zeroed or checked for non-zeroes - why would that
> be a problem?
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: Steven Whitehouse @ 2019-09-05 18:37 UTC (permalink / raw)
To: Linus Torvalds, David Howells
Cc: Greg Kroah-Hartman, rstrode, Nicolas Dichtel, raven, keyrings,
linux-usb, linux-block, Christian Brauner, LSM List,
linux-fsdevel, Linux API, Linux List Kernel Mailing, David Lehman,
Ian Kent
In-Reply-To: <CAHk-=wjQ5Fpv0D7rxX0W=obx9xoOAxJ_Cr+pGCYOAi2S9FiCNg@mail.gmail.com>
Hi,
On 05/09/2019 18:19, Linus Torvalds wrote:
> On Thu, Sep 5, 2019 at 10:01 AM David Howells <dhowells@redhat.com> wrote:
>>> I'm just going to be very blunt about this, and say that there is no
>>> way I can merge any of this *ever*, unless other people stand up and
>>> say that
>>>
>>> (a) they'll use it
>>>
>>> and
>>>
>>> (b) they'll actively develop it and participate in testing and coding
>> Besides the core notification buffer which ties this together, there are a
>> number of sources that I've implemented, not all of which are in this patch
>> series:
> You've at least now answered part of the "Why", but you didn't
> actually answer the whole "another developer" part.
>
> I really don't like how nobody else than you seems to even look at any
> of the key handling patches. Because nobody else seems to care.
>
> This seems to be another new subsystem / driver that has the same
> pattern. If it's all just you, I don't want to merge it, because I
> really want more than just other developers doing "Reviewed-by" after
> looking at somebody elses code that they don't actually use or really
> care about.
>
> See what I'm saying?
>
> New features that go into the kernel should have multiple users. Not a
> single developer who pushes both the kernel feature and the single use
> of that feature.
>
> This very much comes from me reverting the key ACL pull. Not only did
> I revert it, ABSOLUTELY NOBODY even reacted to the revert. Nobody
> stepped up and said they they want that new ACL code, and pushed for a
> fix. There was some very little murmuring about it when Mimi at least
> figured out _why_ it broke, but other than that all the noise I saw
> about the revert was Eric Biggers pointing out it broke other things
> too, and that it had actually broken some test suites. But since it
> hadn't even been in linux-next, that too had been noticed much too
> late.
>
> See what I'm saying? This whole "David Howells does his own features
> that nobody else uses" needs to stop. You need to have a champion. I
> just don't feel safe pulling these kinds of changes from you, because
> I get the feeling that ABSOLUTELY NOBODY ELSE ever really looked at it
> or really cared.
>
> Most of the patches has nobody else even Cc'd, and even the patches
> that do have some "Reviewed-by" feel more like somebody else went "ok,
> the change looks fine to me", without any other real attachment to the
> code.
>
> New kernel features and interfaces really need to have a higher
> barrier of entry than one developer working on his or her own thing.
>
> Is that a change from 25 years ago? Or yes it is. We can point to lots
> of "single developer did a thing" from years past. But things have
> changed. And once bitten, twice shy: I really am a _lot_ more nervous
> about all these key changes now.
>
> Linus
There are a number of potential users, some waiting just to have a
mechanism to avoid the racy alternatives to (for example) parsing
/proc/mounts repeatedly, others perhaps a bit further away, but who have
nonetheless expressed interest in having an interface which allows
notifications for mounts.
The subject of mount notifications has been discussed at LSF/MM in the
past too, I proposed it as a topic a little while back:
https://www.spinics.net/lists/linux-block/msg07653.html and David's
patch set is a potential solution to some of the issues that I raised
there. The original series for the new mount API came from an idea of
Al/Miklos which was also presented at LSF/MM 2017, and this is a follow
on project. So it has not come out of nowhere, but has been something
that has been discussed in various forums over a period of time.
Originally, there was a proposal to use netlink for the notifications,
however that didn't seem to meet with general approval, even though Ian
Kent did some work towards figuring out whether that would be a useful
direction to go in.
David has since come up with the proposal presented here, which is
intended to improve on the original proposal in various ways - mostly
making the notifications more efficient (i.e. smaller) and also generic
enough that it might have uses beyond the original intent of just being
a mount notification mechanism.
The original reason for the mount notification mechanism was so that we
are able to provide information to GUIs and similar filesystem and
storage management tools, matching the state of the filesystem with the
state of the underlying devices. This is part of a larger project
entitled "Project Springfield" to try and provide better management
tools for storage and filesystems. I've copied David Lehman in, since he
can provide a wider view on this topic.
It is something that I do expect will receive wide use, and which will
be tested carefully. I know that Ian Kent has started work on some
support for libmount for example, even outside of autofs.
We do regularly hear from customers that better storage and filesystem
management tools are something that they consider very important, so
that is why we are spending such a lot of effort in trying to improve
the support in this area.
I'm not sure if that really answers your question, except to say that it
is something that is much more than a personal project of David's and
that other people do care about it too,
Steve.
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: Ray Strode @ 2019-09-05 18:51 UTC (permalink / raw)
To: Steven Whitehouse
Cc: Linus Torvalds, David Howells, Greg Kroah-Hartman,
Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
Christian Brauner, LSM List, linux-fsdevel, Linux API,
Linux List Kernel Mailing, David Lehman, Ian Kent
In-Reply-To: <11667f69-fbb5-28d2-3c31-7f865f2b93e5@redhat.com>
Hi,
On Thu, Sep 5, 2019 at 2:37 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
> The original reason for the mount notification mechanism was so that we
> are able to provide information to GUIs and similar filesystem and
> storage management tools, matching the state of the filesystem with the
> state of the underlying devices. This is part of a larger project
> entitled "Project Springfield" to try and provide better management
> tools for storage and filesystems. I've copied David Lehman in, since he
> can provide a wider view on this topic.
So one problem that I've heard discussed before is what happens in a thinp
setup when the disk space is overallocated and gets used up. IIRC, the
volumes just sort of eat themselves?
Getting proper notification of looming catastrophic failure to the
workstation user
before it's too late would be useful, indeed.
I don't know if this new mechanism dhowells has development can help with that,
and/or if solving that problem is part of the Project Springfield
initiative or not. Do you
know off hand?
--Ray
^ permalink raw reply
* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Aleksa Sarai @ 2019-09-05 19:56 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min
In-Reply-To: <20190905182801.GR1131@ZenIV.linux.org.uk>
[-- Attachment #1: Type: text/plain, Size: 1283 bytes --]
On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
>
> > Because every caller of that function right now has that limit set
> > anyway iirc. So we can either remove it from here and place it back for
> > the individual callers or leave it in the helper.
> > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > bound on the size (for a long time probably) or are you disagreeing with
> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > bpf, and clone3 and in a few other places.
>
> For a primitive that can be safely used with any size (OK, any within
> the usual 2Gb limit)? Why push the random policy into the place where
> it doesn't belong?
>
> Seriously, what's the point? If they want to have a large chunk of
> userland memory zeroed or checked for non-zeroes - why would that
> be a problem?
Thinking about it some more, there isn't really any r/w amplification --
so there isn't much to gain by passing giant structs. Though, if we are
going to permit 2GB buffers, isn't that also an argument to use
memchr_inv()? :P
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: David Lehman @ 2019-09-05 20:09 UTC (permalink / raw)
To: Ray Strode, Steven Whitehouse
Cc: Linus Torvalds, David Howells, Greg Kroah-Hartman,
Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
Christian Brauner, LSM List, linux-fsdevel, Linux API,
Linux List Kernel Mailing, Ian Kent
In-Reply-To: <CAKCoTu7ms4ckwDA_-onuJg+famnMzGZE9gGUcqqMz0kCAAECRg@mail.gmail.com>
On Thu, 2019-09-05 at 14:51 -0400, Ray Strode wrote:
> Hi,
>
> On Thu, Sep 5, 2019 at 2:37 PM Steven Whitehouse <swhiteho@redhat.com
> > wrote:
> > The original reason for the mount notification mechanism was so
> > that we
> > are able to provide information to GUIs and similar filesystem and
> > storage management tools, matching the state of the filesystem with
> > the
> > state of the underlying devices. This is part of a larger project
> > entitled "Project Springfield" to try and provide better management
> > tools for storage and filesystems. I've copied David Lehman in,
> > since he
> > can provide a wider view on this topic.
> So one problem that I've heard discussed before is what happens in a
> thinp
> setup when the disk space is overallocated and gets used up. IIRC,
> the
> volumes just sort of eat themselves?
>
> Getting proper notification of looming catastrophic failure to the
> workstation user
> before it's too late would be useful, indeed.
>
> I don't know if this new mechanism dhowells has development can help
> with that,
My understanding is that there is already a dm devent that gets sent
when the low water mark is crossed for a thin pool, but there is
nothing in userspace that knows how to effectively get the user's
attention at that time.
> and/or if solving that problem is part of the Project Springfield
> initiative or not. Do you
> know off hand?
We have been looking into building a userspace event notification
service (for storage, initially) to aggregate and add context to low-
level events such as these, providing a single source for all kinds of
storage events with an excellent signal:noise ratio. Thin pool
exhaustion is high on the list of problems we would want to address.
David
^ permalink raw reply
* Re: [PATCH 2/2] ipc: fix sparc64 ipc() wrapper
From: Anatoly Pugachev @ 2019-09-05 20:38 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Linux Kernel list, Andrew Morton, David S. Miller,
Eric W. Biederman, Deepa Dinamani, Christian Brauner,
Manfred Spraul, Davidlohr Bueso, linux-arch, y2038,
Dominik Brodowski, Matt Turner, stable, Sparc kernel list,
linux-api
In-Reply-To: <20190905152155.1392871-2-arnd@arndb.de>
On Thu, Sep 5, 2019 at 9:39 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
> to a commit from my y2038 series in linux-5.1, as I missed the custom
> sys_ipc() wrapper that sparc64 uses in place of the generic version that
> I patched.
>
> The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
> now do not allow being called with the IPC_64 flag any more, resulting
> in a -EINVAL error when they don't recognize the command.
>
> Instead, the correct way to do this now is to call the internal
> ksys_old_{sem,shm,msg}ctl() functions to select the API version.
>
> As we generally move towards these functions anyway, change all of
> sparc_ipc() to consistently use those in place of the sys_*() versions,
> and move the required ksys_*() declarations into linux/syscalls.h
>
> Reported-by: Matt Turner <mattst88@gmail.com>
> Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
> Cc: stable@vger.kernel.org
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Not Matt,
but this patch fixes util-linux.git ipcs test-suite (make check)
regression for me on current sparc64 git kernel (5.3.0-rc7), which was
broken somewhere in between 4.19 (debian unstable kernel) and 5.3-rcX.
Thanks!
PS: wanted to bisect kernel, but Matt did it first.
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-05 20:39 UTC (permalink / raw)
To: Ray Strode
Cc: David Howells, Greg Kroah-Hartman, Steven Whitehouse,
Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
Christian Brauner, LSM List, linux-fsdevel, Linux API,
Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAKCoTu7ms_Mr-q08d9XB3uascpzwBa5LF9JTT2aq8uUsoFE8aQ@mail.gmail.com>
On Thu, Sep 5, 2019 at 11:33 AM Ray Strode <rstrode@redhat.com> wrote:
>
> Hi,
>
> On Thu, Sep 5, 2019 at 1:20 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > You've at least now answered part of the "Why", but you didn't
> > actually answer the whole "another developer" part.
> It's certainly something we've wanted in the GNOME world for a long time:
>
> See for instance
>
> https://bugzilla.redhat.com/show_bug.cgi?id=991110
That is *way* too specific to make for any kind of generic
notification mechanism.
Also, what is the security model here? Open a special character
device, and you get access to random notifications from random
sources?
That makes no sense. Do they have the same security permissions?
USB error reporting is one thing - and has completely different
security rules than some per-user key thing (or system? or namespace?
Or what?)
And why would you do a broken big-key thing in the kernel in the first
place? Why don't you just have a kernel key to indirectly encrypt
using a key and "additional user space data". The kernel should simply
not take care of insane 1MB keys.
Big keys just don't make sense for a kernel. Just use the backing
store THAT YOU HAVE TO HAVE ANYWAY. Introduce some "indirect key"
instead that is used to encrypt and authenticate the backing store.
And mix in /proc/mounts tracking, which has a namespace component and
completely different events and security model (likely "none" - since
you can always read your own /proc/mounts).
So honestly, this all just makes me go "user interfaces are hard, all
the users seem to have *completely* different requirements, and nobody
has apparently really tested this in practice".
Maybe a generic notification mechanism is sensible. But I don't see
how security issues could *possibly* be unified, and some of the
examples given (particularly "track changes to /proc/mounts") seem to
have obviously better alternatives (as in "just support poll() on
it").
All this discussion has convinced me of is that this whole thing is
half-baked and not ready even on a conceptual level.
So as far as I'm concerned, I think I want things like actual
"Tested-by:" lines from actual users, because it's not clear that this
makes sense. Gnome certainly should work as a regular user, if you
need a system daemon for it with root privileges you might as well
just do any notification entirely inside that daemon in user space.
Same goes for /proc/mounts - which as mentioned has a much more
obvious interface for waiting anyway.
User interfaces need a lot of thought and testing. They shouldn't be
ad-hoc "maybe this could work for X, Y and Z" theories.
Linus
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: David Howells @ 2019-09-05 21:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
Christian Brauner, LSM List, linux-fsdevel, Linux API,
Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wjcsxQ8QB_v=cwBQw4pkJg7pp-bBsdWyPivFO_OeF-y+g@mail.gmail.com>
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Also, what is the security model here? Open a special character
> device, and you get access to random notifications from random
> sources?
>
> That makes no sense. Do they have the same security permissions?
Sigh. It doesn't work like that. I tried to describe this in the manpages I
referred to in the cover note. Obviously I didn't do a good enough job. Let
me try and explain the general workings and the security model here.
(1) /dev/watch_queue just implements locked-in-memory buffers. It gets you
no events by simply opening it.
Each time you open it you get your own private buffer. Buffers are not
shares. Creation of buffers is limited by ENFILE, EMFILE and
RLIMIT_MEMLOCK.
(2) A buffer is implemented as a pollable ring buffer, with the head pointer
belonging to the kernel and the tail pointer belonging to userspace.
Userspace mmaps the buffer.
The kernel *only ever* reads the head and tail pointer from a buffer; it
never reads anything else.
When it wants to post a message to a buffer, the kernel reads the
pointers and then does one of three things:
(a) If the pointers were incoherent it drops the message.
(b) If the buffer was full the kernel writes a flag to indicate this
and drops the message.
(c) Otherwise, the kernel writes a message and maybe padding at the
place(s) it expects and writes the head pointer. If userspace was
busy trashing the place, that should not cause a problem for the
kernel.
The buffer pointers are expected to run to the end and wrap naturally;
they're only masked off at the point of actually accessing the buffer.
(3) You connect event sources to your buffer, e.g.:
fd = open("/dev/watch_queue", ...);
keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, ...);
or:
watch_mount(AT_FDCWD, "/net", 0, fd, ...);
Security is checked at the point of connection to make sure you have
permission to access that source. You have to have View permission on a
key/keyring for key events, for example, and you have to have execute
permission on a directory for mount events.
The LSM gets a look-in too: Smack checks you have read permission on a
key for example.
(4) You can connect multiple sources of different types to your buffer and a
source can be connected to multiple buffers at a time.
(5) Security is checked when an event is delivered to make sure the triggerer
of the event has permission to give you that event. Smack requires that
the triggerer has write permission on the opener of the buffer for
example.
(6) poll() signals POLLIN|POLLRDNORM if there is stuff in the buffer and
POLLERR if the pointers are incoherent.
David
^ permalink raw reply
* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
From: Alexei Starovoitov @ 2019-09-05 22:00 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Alexei Starovoitov, nicolas.dichtel@6wind.com, Alexei Starovoitov,
luto@amacapital.net, davem@davemloft.net, peterz@infradead.org,
rostedt@goodmis.org, netdev@vger.kernel.org, bpf@vger.kernel.org,
Kernel Team, linux-api@vger.kernel.org
In-Reply-To: <acc09eaf-5289-9457-3ce1-f27efb6013b8@iogearbox.net>
On Thu, Sep 05, 2019 at 10:37:03AM +0200, Daniel Borkmann wrote:
> On 9/4/19 5:21 PM, Alexei Starovoitov wrote:
> > On 9/4/19 8:16 AM, Daniel Borkmann wrote:
> > > opening/creating BPF maps" error="Unable to create map
> > > /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted"
> > > subsys=daemon
> > > 2019-09-04T14:11:47.28178666Z level=fatal msg="Error while creating
> > > daemon" error="Unable to create map
> > > /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted"
> > > subsys=daemon
> >
> > Ok. We have to include caps in both cap_sys_admin and cap_bpf then.
> >
> > > And /same/ deployment with reverted patches, hence no CAP_BPF gets it up
> > > and running again:
> > >
> > > # kubectl get pods --all-namespaces -o wide
> >
> > Can you share what this magic commands do underneath?
>
> What do you mean by magic commands? Latter is showing all pods in the cluster:
>
> https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-finding-resources
"magic" in a sense that I have no idea how k8s "services" and "pods" map
to kernel namespaces.
> Checking moby go code, it seems to exec with GetAllCapabilities which returns
> all of the capabilities it is aware of, and afaics, they seem to be using
> the below go library to get the hard-coded list from where obviously CAP_BPF
> is unknown which might also explain the breakage I've been seeing:
>
> https://github.com/syndtr/gocapability/blob/33e07d32887e1e06b7c025f27ce52f62c7990bc0/capability/enum_gen.go
thanks for the link.
That library is much better written than libcap.
capability_linux.go is reading cap_last_cap dynamically and it can understand
proposed CAP_BPF, CAP_TRACING without need o be recompiled (unlike libcap).
So passing new caps to k8s should be trivial. The library won't know
their names, but passing by number looks to be already supported.
I'm still not sure which part of k8s setup clears the caps and
why this v2 set doesn't work, but that doesn't matter any more.
I believe I addressed this compat issue in v3 set.
Could you please give a try just repeating your previous commands?
^ permalink raw reply
* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-05 22:08 UTC (permalink / raw)
To: David Howells
Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
Christian Brauner, LSM List, linux-fsdevel, Linux API,
Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <5396.1567719164@warthog.procyon.org.uk>
On Thu, Sep 5, 2019 at 2:32 PM David Howells <dhowells@redhat.com> wrote:
>
> (1) /dev/watch_queue just implements locked-in-memory buffers. It gets you
> no events by simply opening it.
Cool. In-memory buffers.
But I know - we *have* one of those. There's already a system call for
it, and has been forever. One that we then extended to allow people to
change the buffer size, and do a lot of other things with.
It's called "pipe()". And you can give the writing side to other user
space processes too, in case you are running an older kernel that
didn't have some "event pipe support". It comes with resource
management, because people already use those things.
If you want to make a message protocol on top of it, it has cool
atomicity guarantees for any message size less than PIPE_BUF, but to
make things simple, maybe just say "fixed record sizes of 64 bytes" or
something like that for events.
Then you can use them from things like perl scripts, not just magical
C programs.
Why do we need a new kind of super-fancy high-speed thing for event reporting?
If you have *so* many events that pipe handling is a performance
issue, you have something seriously wrong going on.
So no. I'm not interested in a magical memory-mapped pipe that is
actually more limited than the real thing.
Linus
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox