Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Christian Brauner @ 2019-04-26 15:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, luto, rostedt, dancol, sspatil, jannh, surenb,
	timmurray, Jonathan Kowalski, torvalds, kernel-team,
	Andrew Morton, Arnd Bergmann, Eric W. Biederman,
	Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
	Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
	Stephen Rothwell, Thomas Gleixner, Tycho Andersen
In-Reply-To: <20190426142337.GC261279@google.com>

On Fri, Apr 26, 2019 at 10:23:37AM -0400, Joel Fernandes wrote:
> On Fri, Apr 26, 2019 at 12:24:04AM +0200, Christian Brauner wrote:
> > On Thu, Apr 25, 2019 at 03:00:09PM -0400, Joel Fernandes (Google) wrote:
> > > pidfd are file descriptors referring to a process created with the
> > > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> > > polling support to replace code that currently checks for existence of
> > > /proc/pid for knowing that a process that is signalled to be killed has
> > > died, which is both racy and slow. The pidfd poll approach is race-free,
> > > and also allows the LMK to do other things (such as by polling on other
> > > fds) while awaiting the process being killed to die.
> > 
> > Thanks for the patch!
> > 
> > Ok, let me be a little bit anal.
> > Please start the commit message with what this patch does and then add
> 
> The subject title is "Add polling support to pidfd", but ok I should write a
> better commit message.

Yeah, it's really just that we should really just have a simple
paragraph that expresses this makes the codebase do X.

> 
> > the justification why. You just say the "pidfd-poll" approach. You can
> > probably assume that CLONE_PIDFD is available for this patch. So
> > something like:
> > 
> > "This patch makes pidfds pollable. Specifically, it allows listeners to
> > be informed when the process the pidfd referes to exits. This patch only
> > introduces the ability to poll thread-group leaders since pidfds
> > currently can only reference those..."
> > 
> > Then justify the use-case and then go into implementation details.
> > That's usually how I would think about this:
> > - Change the codebase to do X
> > - Why do we need X
> > - Are there any technical details worth mentioning in the commit message
> > (- Are there any controversial points that people stumbled upon but that
> >   have been settled sufficiently.)
> 
> Generally the "how" in the patch should be in the code, but ok.

That's why I said: technical details that are worth mentioning.
Sometimes you have controversial bits that are obviously to be
understood in the code but it still might be worth explaining why one
had to do it this way. Like say what we did for the pidfd_send_signal()
thing where we explained why O_PATH is disallowed.

> 
> I changed the first 3 paragraphs of the changelog to the following, is that
> better? :
> 
> Android low memory killer (LMK) needs to know when a process dies once
> it is sent the kill signal. It does so by checking for the existence of
> /proc/pid which is both racy and slow. For example, if a PID is reused
> between when LMK sends a kill signal and checks for existence of the
> PID, since the wrong PID is now possibly checked for existence.
> 
> This patch adds polling support to pidfd. Using the polling support, LMK
> will be able to get notified when a process exists in race-free and fast
> way, and allows the LMK to do other things (such as by polling on other
> fds) while awaiting the process being killed to die.
> 
> For notification to polling processes, we follow the same existing
> mechanism in the kernel used when the parent of the task group is to be
> notified of a child's death (do_notify_parent).  This is precisely when
> the tasks waiting on a poll of pidfd are also awakened in this patch.
> 
> > > pidfd are file descriptors referring to a process created with the
> > > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> > > polling support to replace code that currently checks for existence of
> > > /proc/pid for knowing that a process that is signalled to be killed has
> > > died, which is both racy and slow. The pidfd poll approach is race-free,
> > > and also allows the LMK to do other things (such as by polling on other
> > > fds) while awaiting the process being killed to die.
> > 
> > > 
> > > It prevents a situation where a PID is reused between when LMK sends a
> > > kill signal and checks for existence of the PID, since the wrong PID is
> > > now possibly checked for existence.
> > > 
> > > In this patch, we follow the same existing mechanism in the kernel used
> > > when the parent of the task group is to be notified (do_notify_parent).
> > > This is when the tasks waiting on a poll of pidfd are also awakened.
> > > 
> > > We have decided to include the waitqueue in struct pid for the following
> > > reasons:
> > > 1. The wait queue has to survive for the lifetime of the poll. Including
> > > it in task_struct would not be option in this case because the task can
> > > be reaped and destroyed before the poll returns.
> > > 
> > > 2. By including the struct pid for the waitqueue means that during
> > > de_thread(), the new thread group leader automatically gets the new
> > > waitqueue/pid even though its task_struct is different.
> > > 
> > > Appropriate test cases are added in the second patch to provide coverage
> > > of all the cases the patch is handling.
> > > 
> > > Andy had a similar patch [1] in the past which was a good reference
> > > however this patch tries to handle different situations properly related
> > > to thread group existence, and how/where it notifies. And also solves
> > > other bugs (waitqueue lifetime).  Daniel had a similar patch [2]
> > > recently which this patch supercedes.
> > > 
> > > [1] https://lore.kernel.org/patchwork/patch/345098/
> > > [2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
> > > 
> > > Cc: luto@amacapital.net
> > > Cc: rostedt@goodmis.org
> > > Cc: dancol@google.com
> > > Cc: sspatil@google.com
> > > Cc: christian@brauner.io
> > > Cc: jannh@google.com
> > > Cc: surenb@google.com
> > > Cc: timmurray@google.com
> > > Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
> > > Cc: torvalds@linux-foundation.org
> > > Cc: kernel-team@android.com
> > 
> > These should all be in the form:
> > 
> > Cc: Firstname Lastname <email@address.com>
> 
> If this bothers you too much, I can also just remove the CC list from the
> changelog here, and include it in my invocation of git-send-email instead..
> but I have seen commits in the tree that don't follow this rule.

Yeah, but they should. There are people with multiple emails over the
years and they might not necessarily contain their first and last
name. And I don't want to have to mailmap them or sm. Having their names
in there just makes it easier. Also, every single other DCO-*related*
line follows:

Random J Developer <random@developer.example.org>

This should too. If others are sloppy and allow this, fine. No reason we
should.

> 
> > 
> > There are people missing from the Cc that really should be there...
> 
> If you look at the CC list of the email, people in the get_maintainer.pl
> script were also added. I did run get_maintainer.pl and checkpatch. But ok, I
> will add the folks you are suggesting as well. Thanks.

get_maintainer.pl is not the last word. 

> 
> > Even though he usually doesn't respond that often, please Cc Al on this.
> > If he responds it's usually rather important.
> 
> No issues on that, but I am wondering if he should also be in MAINTAINERS
> file somewhere such that get_maintainer.pl does pick him up. I added him.

It's often not about someone being a maintainer but whether or not they
have valuable input.

"[...] This tag documents that potentially interested parties have been
included in the discussion."

> 
> > Oleg has reviewed your RFC patch quite substantially and given valuable
> > feedback and has an opinion on this thing and is best acquainted with
> > the exit code. So please add him to the Cc of the commit message in the
> > appropriate form and also add him to the Cc of the thread.
> 
> Done.

Thanks!

> 
> > Probably also want linux-api for good measure since a lot of people are
> > subscribed that would care about pollable pidfds. I'd also add Kees
> > since he had some interest in this work and David (Howells).
> 
> Done, I added all of them and CC will go out to them next time. Thanks.

Cool. That really wasn't a "you've done this wrong". It's rather really
just to make sure that everyone who might catch a big f*ck up on our
part has had a chance to tell us so. :)

> 
> > 
> > > Co-developed-by: Daniel Colascione <dancol@google.com>
> > 
> > Every CDB needs to give a SOB as well.
> 
> Ok, done. thanks.

Fwiw, I only learned this recently too.

> 
> > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > ---
> > > 
> > > RFC -> v1:
> > > * Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
> > > * Updated selftests.
> > > * Renamed poll wake function to do_notify_pidfd.
> > > * Removed depending on EXIT flags
> > > * Removed POLLERR flag since semantics are controversial and
> > >   we don't have usecases for it right now (later we can add if there's
> > >   a need for it).
> > > 
> > >  include/linux/pid.h |  3 +++
> > >  kernel/fork.c       | 33 +++++++++++++++++++++++++++++++++
> > >  kernel/pid.c        |  2 ++
> > >  kernel/signal.c     | 14 ++++++++++++++
> > >  4 files changed, 52 insertions(+)
> > > 
> > > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > > index 3c8ef5a199ca..1484db6ca8d1 100644
> > > --- a/include/linux/pid.h
> > > +++ b/include/linux/pid.h
> > > @@ -3,6 +3,7 @@
> > >  #define _LINUX_PID_H
> > >  
> > >  #include <linux/rculist.h>
> > > +#include <linux/wait.h>
> > >  
> > >  enum pid_type
> > >  {
> > > @@ -60,6 +61,8 @@ struct pid
> > >  	unsigned int level;
> > >  	/* lists of tasks that use this pid */
> > >  	struct hlist_head tasks[PIDTYPE_MAX];
> > > +	/* wait queue for pidfd notifications */
> > > +	wait_queue_head_t wait_pidfd;
> > >  	struct rcu_head rcu;
> > >  	struct upid numbers[1];
> > >  };
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 5525837ed80e..fb3b614f6456 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1685,8 +1685,41 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
> > >  }
> > >  #endif
> > >  
> > > +static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > > +{
> > > +	struct task_struct *task;
> > > +	struct pid *pid;
> > > +	int poll_flags = 0;
> > > +
> > > +	/*
> > > +	 * tasklist_lock must be held because to avoid racing with
> > > +	 * changes in exit_state and wake up. Basically to avoid:
> > > +	 *
> > > +	 * P0: read exit_state = 0
> > > +	 * P1: write exit_state = EXIT_DEAD
> > > +	 * P1: Do a wake up - wq is empty, so do nothing
> > > +	 * P0: Queue for polling - wait forever.
> > > +	 */
> > > +	read_lock(&tasklist_lock);
> > > +	pid = file->private_data;
> > > +	task = pid_task(pid, PIDTYPE_PID);
> > > +	WARN_ON_ONCE(task && !thread_group_leader(task));
> > > +
> > > +	if (!task || (task->exit_state && thread_group_empty(task)))
> > > +		poll_flags = POLLIN | POLLRDNORM;
> > 
> > So we block until the thread-group is empty? Hm, the thread-group leader
> > remains in zombie state until all threads are gone. Should probably just
> > be a short comment somewhere that callers are only informed about a
> > whole thread-group exit and not about when the thread-group leader has
> > actually exited.
> 
> Ok, I'll add a comment.
> 
> > I would like the ability to extend this interface in the future to allow
> > for actually reading data from the pidfd on EPOLLIN.
> > POSIX specifies that POLLIN and POLLRDNORM are set even if the
> > message to be read is zero. So one cheap way of doing this would
> > probably be to do a 0 read/ioctl. That wouldn't hurt your very limited
> > usecase and people could test whether the read returned non-0 data and
> > if so they know this interface got extended. If we never extend it here
> > it won't matter.
> 
> I am a bit confused. What specific changes to this patch are you proposing?
> This patch makes poll block until the process exits. In the future, we can
> make it unblock for a other reasons as well, that's fine with me. I don't see
> how this patch prevents such extensions.

I guess I should've asked the following:
What happens right now, when you get EPOLLIN on the pidfd and you and
out of ignorance you do:

read(pidfd, ...)

> 
> > > +	if (!poll_flags)
> > > +		poll_wait(file, &pid->wait_pidfd, pts);
> > > +
> > > +	read_unlock(&tasklist_lock);
> > > +
> > > +	return poll_flags;
> > > +}
> > 
> > 
> > > +
> > > +
> > >  const struct file_operations pidfd_fops = {
> > >  	.release = pidfd_release,
> > > +	.poll = pidfd_poll,
> > >  #ifdef CONFIG_PROC_FS
> > >  	.show_fdinfo = pidfd_show_fdinfo,
> > >  #endif
> > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > index 20881598bdfa..5c90c239242f 100644
> > > --- a/kernel/pid.c
> > > +++ b/kernel/pid.c
> > > @@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > >  	for (type = 0; type < PIDTYPE_MAX; ++type)
> > >  		INIT_HLIST_HEAD(&pid->tasks[type]);
> > >  
> > > +	init_waitqueue_head(&pid->wait_pidfd);
> > > +
> > >  	upid = pid->numbers + ns->level;
> > >  	spin_lock_irq(&pidmap_lock);
> > >  	if (!(ns->pid_allocated & PIDNS_ADDING))
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index 1581140f2d99..16e7718316e5 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -1800,6 +1800,17 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
> > >  	return ret;
> > >  }
> > >  
> > > +static void do_notify_pidfd(struct task_struct *task)
> > 
> > Maybe a short command that this helper can only be called when we know
> > that task is a thread-group leader wouldn't hurt so there's no confusion
> > later.
> 
> Ok, will do.
> 
> thanks,
> 
>  - Joel
> 

^ permalink raw reply

* [PATCH 2/2] y2038: remove CONFIG_64BIT_TIME
From: Arnd Bergmann @ 2019-04-26 14:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joseph Myers, libc-alpha, linux-api, linux-kernel, Deepa Dinamani,
	Arnd Bergmann, Benjamin LaHaise, Alexander Viro, John Stultz,
	Stephen Boyd, David S. Miller, linux-aio, linux-fsdevel, netdev
In-Reply-To: <20190426142531.1378357-1-arnd@arndb.de>

The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
This would make sense for 5.2, or could be part of a later
cleanup series when I have more patches for the remaining
y2038 bits
---
 arch/Kconfig          | 8 --------
 fs/aio.c              | 2 +-
 ipc/syscall.c         | 2 +-
 kernel/time/hrtimer.c | 2 +-
 kernel/time/time.c    | 4 ++--
 net/socket.c          | 2 +-
 6 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 9092e0ffe4d3..23ee740182aa 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -763,14 +763,6 @@ config OLD_SIGACTION
 config COMPAT_OLD_SIGACTION
 	bool
 
-config 64BIT_TIME
-	def_bool y
-	help
-	  This should be selected by all architectures that need to support
-	  new system calls with a 64-bit time_t. This is relevant on all 32-bit
-	  architectures, and 64-bit architectures as part of compat syscall
-	  handling.
-
 config COMPAT_32BIT_TIME
 	def_bool !64BIT || COMPAT
 	help
diff --git a/fs/aio.c b/fs/aio.c
index 3490d1fa0e16..b1b949ae1a93 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2057,7 +2057,7 @@ static long do_io_getevents(aio_context_t ctx_id,
  *	specifies an infinite timeout. Note that the timeout pointed to by
  *	timeout is relative.  Will fail with -ENOSYS if not implemented.
  */
-#if !defined(CONFIG_64BIT_TIME) || defined(CONFIG_64BIT)
+#ifdef CONFIG_64BIT
 
 SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
 		long, min_nr,
diff --git a/ipc/syscall.c b/ipc/syscall.c
index 581bdff4e7c5..dfb0e988d542 100644
--- a/ipc/syscall.c
+++ b/ipc/syscall.c
@@ -30,7 +30,7 @@ int ksys_ipc(unsigned int call, int first, unsigned long second,
 		return ksys_semtimedop(first, (struct sembuf __user *)ptr,
 				       second, NULL);
 	case SEMTIMEDOP:
-		if (IS_ENABLED(CONFIG_64BIT) || !IS_ENABLED(CONFIG_64BIT_TIME))
+		if (IS_ENABLED(CONFIG_64BIT))
 			return ksys_semtimedop(first, ptr, second,
 			        (const struct __kernel_timespec __user *)fifth);
 		else if (IS_ENABLED(CONFIG_COMPAT_32BIT_TIME))
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 41dfff23c1f9..61f03faf783a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1749,7 +1749,7 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
 	return ret;
 }
 
-#if !defined(CONFIG_64BIT_TIME) || defined(CONFIG_64BIT)
+#ifdef CONFIG_64BIT
 
 SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
 		struct __kernel_timespec __user *, rmtp)
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 74105fa3ce80..a4c72577aa92 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -264,7 +264,7 @@ COMPAT_SYSCALL_DEFINE2(settimeofday, struct old_timeval32 __user *, tv,
 }
 #endif
 
-#if !defined(CONFIG_64BIT_TIME) || defined(CONFIG_64BIT)
+#ifdef CONFIG_64BIT
 SYSCALL_DEFINE1(adjtimex, struct __kernel_timex __user *, txc_p)
 {
 	struct __kernel_timex txc;		/* Local copy of parameter */
@@ -871,7 +871,7 @@ int get_timespec64(struct timespec64 *ts,
 	ts->tv_sec = kts.tv_sec;
 
 	/* Zero out the padding for 32 bit systems or in compat mode */
-	if (IS_ENABLED(CONFIG_64BIT_TIME) && in_compat_syscall())
+	if (in_compat_syscall())
 		kts.tv_nsec &= 0xFFFFFFFFUL;
 
 	ts->tv_nsec = kts.tv_nsec;
diff --git a/net/socket.c b/net/socket.c
index a180e1a9ff23..2ff80e03d97b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2787,7 +2787,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
 				    a[2], true);
 		break;
 	case SYS_RECVMMSG:
-		if (IS_ENABLED(CONFIG_64BIT) || !IS_ENABLED(CONFIG_64BIT_TIME))
+		if (IS_ENABLED(CONFIG_64BIT))
 			err = __sys_recvmmsg(a0, (struct mmsghdr __user *)a1,
 					     a[2], a[3],
 					     (struct __kernel_timespec __user *)a[4],
-- 
2.20.0

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* [PATCH 1/2] y2038: make CONFIG_64BIT_TIME unconditional
From: Arnd Bergmann @ 2019-04-26 14:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joseph Myers, libc-alpha, linux-api, linux-kernel, Deepa Dinamani,
	Arnd Bergmann, Lukasz Majewski, Stepan Golosunov

As Stepan Golosunov points out, we made a small mistake in the
get_timespec64() function in the kernel. It was originally added under
the assumption that CONFIG_64BIT_TIME would get enabled on all 32-bit
and 64-bit architectures, but when I did the conversion, I only turned
it on for 32-bit ones.

The effect is that the get_timespec64() function never clears the upper
half of the tv_nsec field for 32-bit tasks in compat mode. Clearing this
is required for POSIX compliant behavior of functions that pass a
'timespec' structure with a 64-bit tv_sec and a 32-bit tv_nsec, plus
uninitialized padding.

The easiest fix for linux-5.1 is to just make the Kconfig symbol
unconditional, as it was originally intended. As a follow-up,
we should remove any #ifdef CONFIG_64BIT_TIME completely.

Link: https://lore.kernel.org/lkml/20190422090710.bmxdhhankurhafxq@sghpc.golosunov.pp.ru/
Cc: Lukasz Majewski <lukma@denx.de>
Cc: Stepan Golosunov <stepan@golosunov.pp.ru>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Please apply this one as a bugfix for 5.1
---
 arch/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 33687dddd86a..9092e0ffe4d3 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -764,7 +764,7 @@ config COMPAT_OLD_SIGACTION
 	bool

 config 64BIT_TIME
-	def_bool ARCH_HAS_64BIT_TIME
+	def_bool y
 	help
 	  This should be selected by all architectures that need to support
 	  new system calls with a 64-bit time_t. This is relevant on all 32-bit
-- 
2.20.0

^ permalink raw reply related

* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Joel Fernandes @ 2019-04-26 14:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, luto, rostedt, dancol, sspatil, jannh, surenb,
	timmurray, Jonathan Kowalski, torvalds, kernel-team,
	Andrew Morton, Arnd Bergmann, Eric W. Biederman,
	Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
	Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
	Stephen Rothwell, Thomas Gleixner, Tycho Andersen
In-Reply-To: <20190425222359.sqhboc4x4daznr6r@brauner.io>

On Fri, Apr 26, 2019 at 12:24:04AM +0200, Christian Brauner wrote:
> On Thu, Apr 25, 2019 at 03:00:09PM -0400, Joel Fernandes (Google) wrote:
> > pidfd are file descriptors referring to a process created with the
> > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> > polling support to replace code that currently checks for existence of
> > /proc/pid for knowing that a process that is signalled to be killed has
> > died, which is both racy and slow. The pidfd poll approach is race-free,
> > and also allows the LMK to do other things (such as by polling on other
> > fds) while awaiting the process being killed to die.
> 
> Thanks for the patch!
> 
> Ok, let me be a little bit anal.
> Please start the commit message with what this patch does and then add

The subject title is "Add polling support to pidfd", but ok I should write a
better commit message.

> the justification why. You just say the "pidfd-poll" approach. You can
> probably assume that CLONE_PIDFD is available for this patch. So
> something like:
> 
> "This patch makes pidfds pollable. Specifically, it allows listeners to
> be informed when the process the pidfd referes to exits. This patch only
> introduces the ability to poll thread-group leaders since pidfds
> currently can only reference those..."
> 
> Then justify the use-case and then go into implementation details.
> That's usually how I would think about this:
> - Change the codebase to do X
> - Why do we need X
> - Are there any technical details worth mentioning in the commit message
> (- Are there any controversial points that people stumbled upon but that
>   have been settled sufficiently.)

Generally the "how" in the patch should be in the code, but ok.

I changed the first 3 paragraphs of the changelog to the following, is that
better? :

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.

This patch adds polling support to pidfd. Using the polling support, LMK
will be able to get notified when a process exists in race-free and fast
way, and allows the LMK to do other things (such as by polling on other
fds) while awaiting the process being killed to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent).  This is precisely when
the tasks waiting on a poll of pidfd are also awakened in this patch.

> > pidfd are file descriptors referring to a process created with the
> > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> > polling support to replace code that currently checks for existence of
> > /proc/pid for knowing that a process that is signalled to be killed has
> > died, which is both racy and slow. The pidfd poll approach is race-free,
> > and also allows the LMK to do other things (such as by polling on other
> > fds) while awaiting the process being killed to die.
> 
> > 
> > It prevents a situation where a PID is reused between when LMK sends a
> > kill signal and checks for existence of the PID, since the wrong PID is
> > now possibly checked for existence.
> > 
> > In this patch, we follow the same existing mechanism in the kernel used
> > when the parent of the task group is to be notified (do_notify_parent).
> > This is when the tasks waiting on a poll of pidfd are also awakened.
> > 
> > We have decided to include the waitqueue in struct pid for the following
> > reasons:
> > 1. The wait queue has to survive for the lifetime of the poll. Including
> > it in task_struct would not be option in this case because the task can
> > be reaped and destroyed before the poll returns.
> > 
> > 2. By including the struct pid for the waitqueue means that during
> > de_thread(), the new thread group leader automatically gets the new
> > waitqueue/pid even though its task_struct is different.
> > 
> > Appropriate test cases are added in the second patch to provide coverage
> > of all the cases the patch is handling.
> > 
> > Andy had a similar patch [1] in the past which was a good reference
> > however this patch tries to handle different situations properly related
> > to thread group existence, and how/where it notifies. And also solves
> > other bugs (waitqueue lifetime).  Daniel had a similar patch [2]
> > recently which this patch supercedes.
> > 
> > [1] https://lore.kernel.org/patchwork/patch/345098/
> > [2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
> > 
> > Cc: luto@amacapital.net
> > Cc: rostedt@goodmis.org
> > Cc: dancol@google.com
> > Cc: sspatil@google.com
> > Cc: christian@brauner.io
> > Cc: jannh@google.com
> > Cc: surenb@google.com
> > Cc: timmurray@google.com
> > Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
> > Cc: torvalds@linux-foundation.org
> > Cc: kernel-team@android.com
> 
> These should all be in the form:
> 
> Cc: Firstname Lastname <email@address.com>

If this bothers you too much, I can also just remove the CC list from the
changelog here, and include it in my invocation of git-send-email instead..
but I have seen commits in the tree that don't follow this rule.

> 
> There are people missing from the Cc that really should be there...

If you look at the CC list of the email, people in the get_maintainer.pl
script were also added. I did run get_maintainer.pl and checkpatch. But ok, I
will add the folks you are suggesting as well. Thanks.

> Even though he usually doesn't respond that often, please Cc Al on this.
> If he responds it's usually rather important.

No issues on that, but I am wondering if he should also be in MAINTAINERS
file somewhere such that get_maintainer.pl does pick him up. I added him.

> Oleg has reviewed your RFC patch quite substantially and given valuable
> feedback and has an opinion on this thing and is best acquainted with
> the exit code. So please add him to the Cc of the commit message in the
> appropriate form and also add him to the Cc of the thread.

Done.

> Probably also want linux-api for good measure since a lot of people are
> subscribed that would care about pollable pidfds. I'd also add Kees
> since he had some interest in this work and David (Howells).

Done, I added all of them and CC will go out to them next time. Thanks.

> 
> > Co-developed-by: Daniel Colascione <dancol@google.com>
> 
> Every CDB needs to give a SOB as well.

Ok, done. thanks.

> 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > ---
> > 
> > RFC -> v1:
> > * Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
> > * Updated selftests.
> > * Renamed poll wake function to do_notify_pidfd.
> > * Removed depending on EXIT flags
> > * Removed POLLERR flag since semantics are controversial and
> >   we don't have usecases for it right now (later we can add if there's
> >   a need for it).
> > 
> >  include/linux/pid.h |  3 +++
> >  kernel/fork.c       | 33 +++++++++++++++++++++++++++++++++
> >  kernel/pid.c        |  2 ++
> >  kernel/signal.c     | 14 ++++++++++++++
> >  4 files changed, 52 insertions(+)
> > 
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index 3c8ef5a199ca..1484db6ca8d1 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -3,6 +3,7 @@
> >  #define _LINUX_PID_H
> >  
> >  #include <linux/rculist.h>
> > +#include <linux/wait.h>
> >  
> >  enum pid_type
> >  {
> > @@ -60,6 +61,8 @@ struct pid
> >  	unsigned int level;
> >  	/* lists of tasks that use this pid */
> >  	struct hlist_head tasks[PIDTYPE_MAX];
> > +	/* wait queue for pidfd notifications */
> > +	wait_queue_head_t wait_pidfd;
> >  	struct rcu_head rcu;
> >  	struct upid numbers[1];
> >  };
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5525837ed80e..fb3b614f6456 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1685,8 +1685,41 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
> >  }
> >  #endif
> >  
> > +static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > +{
> > +	struct task_struct *task;
> > +	struct pid *pid;
> > +	int poll_flags = 0;
> > +
> > +	/*
> > +	 * tasklist_lock must be held because to avoid racing with
> > +	 * changes in exit_state and wake up. Basically to avoid:
> > +	 *
> > +	 * P0: read exit_state = 0
> > +	 * P1: write exit_state = EXIT_DEAD
> > +	 * P1: Do a wake up - wq is empty, so do nothing
> > +	 * P0: Queue for polling - wait forever.
> > +	 */
> > +	read_lock(&tasklist_lock);
> > +	pid = file->private_data;
> > +	task = pid_task(pid, PIDTYPE_PID);
> > +	WARN_ON_ONCE(task && !thread_group_leader(task));
> > +
> > +	if (!task || (task->exit_state && thread_group_empty(task)))
> > +		poll_flags = POLLIN | POLLRDNORM;
> 
> So we block until the thread-group is empty? Hm, the thread-group leader
> remains in zombie state until all threads are gone. Should probably just
> be a short comment somewhere that callers are only informed about a
> whole thread-group exit and not about when the thread-group leader has
> actually exited.

Ok, I'll add a comment.

> I would like the ability to extend this interface in the future to allow
> for actually reading data from the pidfd on EPOLLIN.
> POSIX specifies that POLLIN and POLLRDNORM are set even if the
> message to be read is zero. So one cheap way of doing this would
> probably be to do a 0 read/ioctl. That wouldn't hurt your very limited
> usecase and people could test whether the read returned non-0 data and
> if so they know this interface got extended. If we never extend it here
> it won't matter.

I am a bit confused. What specific changes to this patch are you proposing?
This patch makes poll block until the process exits. In the future, we can
make it unblock for a other reasons as well, that's fine with me. I don't see
how this patch prevents such extensions.

> > +	if (!poll_flags)
> > +		poll_wait(file, &pid->wait_pidfd, pts);
> > +
> > +	read_unlock(&tasklist_lock);
> > +
> > +	return poll_flags;
> > +}
> 
> 
> > +
> > +
> >  const struct file_operations pidfd_fops = {
> >  	.release = pidfd_release,
> > +	.poll = pidfd_poll,
> >  #ifdef CONFIG_PROC_FS
> >  	.show_fdinfo = pidfd_show_fdinfo,
> >  #endif
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..5c90c239242f 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  	for (type = 0; type < PIDTYPE_MAX; ++type)
> >  		INIT_HLIST_HEAD(&pid->tasks[type]);
> >  
> > +	init_waitqueue_head(&pid->wait_pidfd);
> > +
> >  	upid = pid->numbers + ns->level;
> >  	spin_lock_irq(&pidmap_lock);
> >  	if (!(ns->pid_allocated & PIDNS_ADDING))
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1581140f2d99..16e7718316e5 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -1800,6 +1800,17 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
> >  	return ret;
> >  }
> >  
> > +static void do_notify_pidfd(struct task_struct *task)
> 
> Maybe a short command that this helper can only be called when we know
> that task is a thread-group leader wouldn't hurt so there's no confusion
> later.

Ok, will do.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Michal Hocko @ 2019-04-26 14:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Garrett, Linux-MM, kernel list, Matthew Garrett,
	Linux API
In-Reply-To: <CAG48ez1OS7DeeEtv5jhO6wtMA2M2A_Bp3-ndS+sP=UoFuMiREw@mail.gmail.com>

On Fri 26-04-19 16:03:26, Jann Horn wrote:
> On Fri, Apr 26, 2019 at 3:47 PM Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 26-04-19 15:33:25, Jann Horn wrote:
> > > On Fri, Apr 26, 2019 at 7:31 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Thu 25-04-19 14:42:52, Jann Horn wrote:
> > > > > On Thu, Apr 25, 2019 at 2:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > [...]
> > > > > > On Wed 24-04-19 14:10:39, Matthew Garrett wrote:
> > > > > > > From: Matthew Garrett <mjg59@google.com>
> > > > > > >
> > > > > > > Applications that hold secrets and wish to avoid them leaking can use
> > > > > > > mlock() to prevent the page from being pushed out to swap and
> > > > > > > MADV_DONTDUMP to prevent it from being included in core dumps. Applications
> > > > > > > can also use atexit() handlers to overwrite secrets on application exit.
> > > > > > > However, if an attacker can reboot the system into another OS, they can
> > > > > > > dump the contents of RAM and extract secrets. We can avoid this by setting
> > > > > > > CONFIG_RESET_ATTACK_MITIGATION on UEFI systems in order to request that the
> > > > > > > firmware wipe the contents of RAM before booting another OS, but this means
> > > > > > > rebooting takes a *long* time - the expected behaviour is for a clean
> > > > > > > shutdown to remove the request after scrubbing secrets from RAM in order to
> > > > > > > avoid this.
> > > > > > >
> > > > > > > Unfortunately, if an application exits uncleanly, its secrets may still be
> > > > > > > present in RAM. This can't be easily fixed in userland (eg, if the OOM
> > > > > > > killer decides to kill a process holding secrets, we're not going to be able
> > > > > > > to avoid that), so this patch adds a new flag to madvise() to allow userland
> > > > > > > to request that the kernel clear the covered pages whenever the page
> > > > > > > reference count hits zero. Since vm_flags is already full on 32-bit, it
> > > > > > > will only work on 64-bit systems.
> > > > > [...]
> > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > > index 21a7881a2db4..989c2fde15cf 100644
> > > > > > > --- a/mm/madvise.c
> > > > > > > +++ b/mm/madvise.c
> > > > > > > @@ -92,6 +92,22 @@ static long madvise_behavior(struct vm_area_struct *vma,
> > > > > > >       case MADV_KEEPONFORK:
> > > > > > >               new_flags &= ~VM_WIPEONFORK;
> > > > > > >               break;
> > > > > > > +     case MADV_WIPEONRELEASE:
> > > > > > > +             /* MADV_WIPEONRELEASE is only supported on anonymous memory. */
> > > > > > > +             if (VM_WIPEONRELEASE == 0 || vma->vm_file ||
> > > > > > > +                 vma->vm_flags & VM_SHARED) {
> > > > > > > +                     error = -EINVAL;
> > > > > > > +                     goto out;
> > > > > > > +             }
> > > > > > > +             new_flags |= VM_WIPEONRELEASE;
> > > > > > > +             break;
> > > > >
> > > > > An interesting effect of this is that it will be possible to set this
> > > > > on a CoW anon VMA in a fork() child, and then the semantics in the
> > > > > parent will be subtly different - e.g. if the parent vmsplice()d a
> > > > > CoWed page into a pipe, then forked an unprivileged child, the child
> > > >
> > > > Maybe a stupid question. How do you fork an unprivileged child (without
> > > > exec)? Child would have to drop priviledges on its own, no?
> > >
> > > Sorry, yes, that's what I meant.
> >
> > But then the VMA is gone along with the flag so why does it matter?
> 
> But in theory, the page might still be used somewhere, e.g. as data in
> a pipe (into which the parent wrote it) or whatever. Parent
> vmsplice()s a page into a pipe, parent exits, child marks the VMA as
> WIPEONRELEASE and exits, page gets wiped, someone else reads the page
> from the pipe.
> 
> Yes, this is very theoretical, and you'd have to write some pretty
> weird software for this to matter. But it doesn't seem clean to me to
> allow a child to affect the data in e.g. a pipe that it isn't supposed
> to have access to like this.
> 
> Then again, this could probably already happen, since do_wp_page()
> reuses pages depending on only the mapcount, without looking at the
> refcount.

OK, now I see your point. I was confused about the unprivileged child.
You are right that this looks weird but we have traditionally trusted
child processes to not do a harm. I guess this falls down to the same
bucket. An early CoW on these mapping should solve the problem AFAICS.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Jann Horn @ 2019-04-26 14:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Garrett, Linux-MM, kernel list, Matthew Garrett,
	Linux API
In-Reply-To: <20190426134722.GH22245@dhcp22.suse.cz>

On Fri, Apr 26, 2019 at 3:47 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 26-04-19 15:33:25, Jann Horn wrote:
> > On Fri, Apr 26, 2019 at 7:31 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Thu 25-04-19 14:42:52, Jann Horn wrote:
> > > > On Thu, Apr 25, 2019 at 2:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > [...]
> > > > > On Wed 24-04-19 14:10:39, Matthew Garrett wrote:
> > > > > > From: Matthew Garrett <mjg59@google.com>
> > > > > >
> > > > > > Applications that hold secrets and wish to avoid them leaking can use
> > > > > > mlock() to prevent the page from being pushed out to swap and
> > > > > > MADV_DONTDUMP to prevent it from being included in core dumps. Applications
> > > > > > can also use atexit() handlers to overwrite secrets on application exit.
> > > > > > However, if an attacker can reboot the system into another OS, they can
> > > > > > dump the contents of RAM and extract secrets. We can avoid this by setting
> > > > > > CONFIG_RESET_ATTACK_MITIGATION on UEFI systems in order to request that the
> > > > > > firmware wipe the contents of RAM before booting another OS, but this means
> > > > > > rebooting takes a *long* time - the expected behaviour is for a clean
> > > > > > shutdown to remove the request after scrubbing secrets from RAM in order to
> > > > > > avoid this.
> > > > > >
> > > > > > Unfortunately, if an application exits uncleanly, its secrets may still be
> > > > > > present in RAM. This can't be easily fixed in userland (eg, if the OOM
> > > > > > killer decides to kill a process holding secrets, we're not going to be able
> > > > > > to avoid that), so this patch adds a new flag to madvise() to allow userland
> > > > > > to request that the kernel clear the covered pages whenever the page
> > > > > > reference count hits zero. Since vm_flags is already full on 32-bit, it
> > > > > > will only work on 64-bit systems.
> > > > [...]
> > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > index 21a7881a2db4..989c2fde15cf 100644
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -92,6 +92,22 @@ static long madvise_behavior(struct vm_area_struct *vma,
> > > > > >       case MADV_KEEPONFORK:
> > > > > >               new_flags &= ~VM_WIPEONFORK;
> > > > > >               break;
> > > > > > +     case MADV_WIPEONRELEASE:
> > > > > > +             /* MADV_WIPEONRELEASE is only supported on anonymous memory. */
> > > > > > +             if (VM_WIPEONRELEASE == 0 || vma->vm_file ||
> > > > > > +                 vma->vm_flags & VM_SHARED) {
> > > > > > +                     error = -EINVAL;
> > > > > > +                     goto out;
> > > > > > +             }
> > > > > > +             new_flags |= VM_WIPEONRELEASE;
> > > > > > +             break;
> > > >
> > > > An interesting effect of this is that it will be possible to set this
> > > > on a CoW anon VMA in a fork() child, and then the semantics in the
> > > > parent will be subtly different - e.g. if the parent vmsplice()d a
> > > > CoWed page into a pipe, then forked an unprivileged child, the child
> > >
> > > Maybe a stupid question. How do you fork an unprivileged child (without
> > > exec)? Child would have to drop priviledges on its own, no?
> >
> > Sorry, yes, that's what I meant.
>
> But then the VMA is gone along with the flag so why does it matter?

But in theory, the page might still be used somewhere, e.g. as data in
a pipe (into which the parent wrote it) or whatever. Parent
vmsplice()s a page into a pipe, parent exits, child marks the VMA as
WIPEONRELEASE and exits, page gets wiped, someone else reads the page
from the pipe.

Yes, this is very theoretical, and you'd have to write some pretty
weird software for this to matter. But it doesn't seem clean to me to
allow a child to affect the data in e.g. a pipe that it isn't supposed
to have access to like this.

Then again, this could probably already happen, since do_wp_page()
reuses pages depending on only the mapcount, without looking at the
refcount.

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Michal Hocko @ 2019-04-26 13:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Garrett, Linux-MM, kernel list, Matthew Garrett,
	Linux API
In-Reply-To: <CAG48ez1MGyAd5tE=JLmjkFqou-VvsQHcJ5TU5f8_L43km9eoYA@mail.gmail.com>

On Fri 26-04-19 15:33:25, Jann Horn wrote:
> On Fri, Apr 26, 2019 at 7:31 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Thu 25-04-19 14:42:52, Jann Horn wrote:
> > > On Thu, Apr 25, 2019 at 2:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > [...]
> > > > On Wed 24-04-19 14:10:39, Matthew Garrett wrote:
> > > > > From: Matthew Garrett <mjg59@google.com>
> > > > >
> > > > > Applications that hold secrets and wish to avoid them leaking can use
> > > > > mlock() to prevent the page from being pushed out to swap and
> > > > > MADV_DONTDUMP to prevent it from being included in core dumps. Applications
> > > > > can also use atexit() handlers to overwrite secrets on application exit.
> > > > > However, if an attacker can reboot the system into another OS, they can
> > > > > dump the contents of RAM and extract secrets. We can avoid this by setting
> > > > > CONFIG_RESET_ATTACK_MITIGATION on UEFI systems in order to request that the
> > > > > firmware wipe the contents of RAM before booting another OS, but this means
> > > > > rebooting takes a *long* time - the expected behaviour is for a clean
> > > > > shutdown to remove the request after scrubbing secrets from RAM in order to
> > > > > avoid this.
> > > > >
> > > > > Unfortunately, if an application exits uncleanly, its secrets may still be
> > > > > present in RAM. This can't be easily fixed in userland (eg, if the OOM
> > > > > killer decides to kill a process holding secrets, we're not going to be able
> > > > > to avoid that), so this patch adds a new flag to madvise() to allow userland
> > > > > to request that the kernel clear the covered pages whenever the page
> > > > > reference count hits zero. Since vm_flags is already full on 32-bit, it
> > > > > will only work on 64-bit systems.
> > > [...]
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 21a7881a2db4..989c2fde15cf 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -92,6 +92,22 @@ static long madvise_behavior(struct vm_area_struct *vma,
> > > > >       case MADV_KEEPONFORK:
> > > > >               new_flags &= ~VM_WIPEONFORK;
> > > > >               break;
> > > > > +     case MADV_WIPEONRELEASE:
> > > > > +             /* MADV_WIPEONRELEASE is only supported on anonymous memory. */
> > > > > +             if (VM_WIPEONRELEASE == 0 || vma->vm_file ||
> > > > > +                 vma->vm_flags & VM_SHARED) {
> > > > > +                     error = -EINVAL;
> > > > > +                     goto out;
> > > > > +             }
> > > > > +             new_flags |= VM_WIPEONRELEASE;
> > > > > +             break;
> > >
> > > An interesting effect of this is that it will be possible to set this
> > > on a CoW anon VMA in a fork() child, and then the semantics in the
> > > parent will be subtly different - e.g. if the parent vmsplice()d a
> > > CoWed page into a pipe, then forked an unprivileged child, the child
> >
> > Maybe a stupid question. How do you fork an unprivileged child (without
> > exec)? Child would have to drop priviledges on its own, no?
> 
> Sorry, yes, that's what I meant.

But then the VMA is gone along with the flag so why does it matter?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v17 1/3] proc: add /proc/<pid>/arch_status
From: Thomas Gleixner @ 2019-04-26 13:38 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Li, Aubrey, mingo, peterz, hpa, ak, tim.c.chen, dave.hansen,
	arjan, adobriyan, akpm, aubrey.li, linux-api, linux-kernel,
	Andy Lutomirski
In-Reply-To: <c6977015-094f-ad28-bdfa-7d350247fa46@metux.net>

On Fri, 26 Apr 2019, Enrico Weigelt, metux IT consult wrote:
> On 25.04.19 12:50, Thomas Gleixner wrote:
> > On Thu, 25 Apr 2019, Enrico Weigelt, metux IT consult wrote:
> > 
> >> On 25.04.19 12:42, Li, Aubrey wrote:
> >>>
> >>> Yep, I'll make it disabled by default and not switchable and let arch select it.
> >>>
> >>
> >> That's not quite what I've suggested. Instead:
> >>
> >> #1: make the switch depend on the arch's that support it
> > 
> > No. That's what select is for.
> 
> Just for clarification: I've proposed the depend, because not only some
> archs will support it - and avoid masses of #ifdef's in the code.
> Therefore, it can only be enabled, when the archi supports it.

What has the way how you enable support to do with masses of #ifdefs?
Absolutely nothing.

> But if you insist in not having it configurable, letting the arch just
> select this feature, your approach makes sense.

Even if you make it configurable, then having:

     depends on ARCH1 ... ARCHN

is just wrong. That's what dependency config symbols are for which can be
selected by the arch.
 
> >> #2: still leave it selectable to the user, so somebody who doesn't need
> >>     it, can just disable it.
> > 
> > Well, the number of knobs is exploding over time and the number of people
> > actually tweaking them is close to 0. So no, we don't want to have the
> > extra tunable for everything and the world.
> 
> The great configurability often is one of the major arguments for using
> Linux in the first place.
> 
> Would you propose killing all the CONFIG_EMBEDDED/CONFIG_EXPERT
> related knobs ?

No, but adding knobs for every tiny piece of code does not make the whole
thing any better in terms of usability and maintainability.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Jann Horn @ 2019-04-26 13:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Garrett, Linux-MM, kernel list, Matthew Garrett,
	Linux API
In-Reply-To: <20190426053135.GC12337@dhcp22.suse.cz>

On Fri, Apr 26, 2019 at 7:31 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 25-04-19 14:42:52, Jann Horn wrote:
> > On Thu, Apr 25, 2019 at 2:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
> > > On Wed 24-04-19 14:10:39, Matthew Garrett wrote:
> > > > From: Matthew Garrett <mjg59@google.com>
> > > >
> > > > Applications that hold secrets and wish to avoid them leaking can use
> > > > mlock() to prevent the page from being pushed out to swap and
> > > > MADV_DONTDUMP to prevent it from being included in core dumps. Applications
> > > > can also use atexit() handlers to overwrite secrets on application exit.
> > > > However, if an attacker can reboot the system into another OS, they can
> > > > dump the contents of RAM and extract secrets. We can avoid this by setting
> > > > CONFIG_RESET_ATTACK_MITIGATION on UEFI systems in order to request that the
> > > > firmware wipe the contents of RAM before booting another OS, but this means
> > > > rebooting takes a *long* time - the expected behaviour is for a clean
> > > > shutdown to remove the request after scrubbing secrets from RAM in order to
> > > > avoid this.
> > > >
> > > > Unfortunately, if an application exits uncleanly, its secrets may still be
> > > > present in RAM. This can't be easily fixed in userland (eg, if the OOM
> > > > killer decides to kill a process holding secrets, we're not going to be able
> > > > to avoid that), so this patch adds a new flag to madvise() to allow userland
> > > > to request that the kernel clear the covered pages whenever the page
> > > > reference count hits zero. Since vm_flags is already full on 32-bit, it
> > > > will only work on 64-bit systems.
> > [...]
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 21a7881a2db4..989c2fde15cf 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -92,6 +92,22 @@ static long madvise_behavior(struct vm_area_struct *vma,
> > > >       case MADV_KEEPONFORK:
> > > >               new_flags &= ~VM_WIPEONFORK;
> > > >               break;
> > > > +     case MADV_WIPEONRELEASE:
> > > > +             /* MADV_WIPEONRELEASE is only supported on anonymous memory. */
> > > > +             if (VM_WIPEONRELEASE == 0 || vma->vm_file ||
> > > > +                 vma->vm_flags & VM_SHARED) {
> > > > +                     error = -EINVAL;
> > > > +                     goto out;
> > > > +             }
> > > > +             new_flags |= VM_WIPEONRELEASE;
> > > > +             break;
> >
> > An interesting effect of this is that it will be possible to set this
> > on a CoW anon VMA in a fork() child, and then the semantics in the
> > parent will be subtly different - e.g. if the parent vmsplice()d a
> > CoWed page into a pipe, then forked an unprivileged child, the child
>
> Maybe a stupid question. How do you fork an unprivileged child (without
> exec)? Child would have to drop priviledges on its own, no?

Sorry, yes, that's what I meant.

^ permalink raw reply

* Re: [PATCH v17 1/3] proc: add /proc/<pid>/arch_status
From: Enrico Weigelt, metux IT consult @ 2019-04-26 11:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Li, Aubrey, mingo, peterz, hpa, ak, tim.c.chen, dave.hansen,
	arjan, adobriyan, akpm, aubrey.li, linux-api, linux-kernel,
	Andy Lutomirski
In-Reply-To: <alpine.DEB.2.21.1904251249200.1960@nanos.tec.linutronix.de>

On 25.04.19 12:50, Thomas Gleixner wrote:
> On Thu, 25 Apr 2019, Enrico Weigelt, metux IT consult wrote:
> 
>> On 25.04.19 12:42, Li, Aubrey wrote:
>>>
>>> Yep, I'll make it disabled by default and not switchable and let arch select it.
>>>
>>
>> That's not quite what I've suggested. Instead:
>>
>> #1: make the switch depend on the arch's that support it
> 
> No. That's what select is for.

Just for clarification: I've proposed the depend, because not only some
archs will support it - and avoid masses of #ifdef's in the code.
Therefore, it can only be enabled, when the archi supports it.

But if you insist in not having it configurable, letting the arch just
select this feature, your approach makes sense.

>> #2: still leave it selectable to the user, so somebody who doesn't need
>>     it, can just disable it.
> 
> Well, the number of knobs is exploding over time and the number of people
> actually tweaking them is close to 0. So no, we don't want to have the
> extra tunable for everything and the world.

The great configurability often is one of the major arguments for using
Linux in the first place.

Would you propose killing all the CONFIG_EMBEDDED/CONFIG_EXPERT
related knobs ?


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply

* Re: [PATCH V3] mm: Allow userland to request that the kernel clear memory on release
From: Vlastimil Babka @ 2019-04-26  7:45 UTC (permalink / raw)
  To: Matthew Garrett, linux-mm; +Cc: linux-kernel, linux-api, Matthew Garrett
In-Reply-To: <20190425225828.212472-1-matthewgarrett@google.com>

On 4/26/19 12:58 AM, Matthew Garrett wrote:
> From: Matthew Garrett <mjg59@google.com>
> 
> Applications that hold secrets and wish to avoid them leaking can use
> mlock() to prevent the page from being pushed out to swap and
> MADV_DONTDUMP to prevent it from being included in core dumps. Applications
> can also use atexit() handlers to overwrite secrets on application exit.
> However, if an attacker can reboot the system into another OS, they can
> dump the contents of RAM and extract secrets. We can avoid this by setting
> CONFIG_RESET_ATTACK_MITIGATION on UEFI systems in order to request that the
> firmware wipe the contents of RAM before booting another OS, but this means
> rebooting takes a *long* time - the expected behaviour is for a clean
> shutdown to remove the request after scrubbing secrets from RAM in order to
> avoid this.
> 
> Unfortunately, if an application exits uncleanly, its secrets may still be
> present in RAM. This can't be easily fixed in userland (eg, if the OOM
> killer decides to kill a process holding secrets, we're not going to be able
> to avoid that), so this patch adds a new flag to madvise() to allow userland
> to request that the kernel clear the covered pages whenever the page
> map count hits zero. Since vm_flags is already full on 32-bit, it
> will only work on 64-bit systems. This is currently only permitted on
> private mappings that have not yet been populated in order to simplify
> implementation, which should suffice for the envisaged use cases. We can
> extend the behaviour later if we come up with a robust set of semantics.
> 
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> ---
> 
> Updated based on feedback from Jann - for now let's just prevent setting
> the flag on anything that has already mapped some pages, which avoids
> child processes being able to interfere with the parent. In addition,

That makes the API quite tricky and different from existing madvise()
modes that don't care. One would for example have to call
madvise(MADV_WIPEONRELEASE) before mlock(), otherwise mlock() would
fault the pages in (unless MLOCK_ONFAULT). As such it really looks like
a mmap() flag, but that's less flexible.

How bout just doing the CoW on any such pre-existing pages as part of
the madvise(MADV_WIPEONRELEASE) call?

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Michal Hocko @ 2019-04-26  5:31 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Garrett, Linux-MM, kernel list, Matthew Garrett,
	Linux API
In-Reply-To: <CAG48ez0x6QiFpqXbimB9ZV-jS5UJJWhzg9XiAWncQL+phfKkPA@mail.gmail.com>

On Thu 25-04-19 14:42:52, Jann Horn wrote:
> On Thu, Apr 25, 2019 at 2:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > On Wed 24-04-19 14:10:39, Matthew Garrett wrote:
> > > From: Matthew Garrett <mjg59@google.com>
> > >
> > > Applications that hold secrets and wish to avoid them leaking can use
> > > mlock() to prevent the page from being pushed out to swap and
> > > MADV_DONTDUMP to prevent it from being included in core dumps. Applications
> > > can also use atexit() handlers to overwrite secrets on application exit.
> > > However, if an attacker can reboot the system into another OS, they can
> > > dump the contents of RAM and extract secrets. We can avoid this by setting
> > > CONFIG_RESET_ATTACK_MITIGATION on UEFI systems in order to request that the
> > > firmware wipe the contents of RAM before booting another OS, but this means
> > > rebooting takes a *long* time - the expected behaviour is for a clean
> > > shutdown to remove the request after scrubbing secrets from RAM in order to
> > > avoid this.
> > >
> > > Unfortunately, if an application exits uncleanly, its secrets may still be
> > > present in RAM. This can't be easily fixed in userland (eg, if the OOM
> > > killer decides to kill a process holding secrets, we're not going to be able
> > > to avoid that), so this patch adds a new flag to madvise() to allow userland
> > > to request that the kernel clear the covered pages whenever the page
> > > reference count hits zero. Since vm_flags is already full on 32-bit, it
> > > will only work on 64-bit systems.
> [...]
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 21a7881a2db4..989c2fde15cf 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -92,6 +92,22 @@ static long madvise_behavior(struct vm_area_struct *vma,
> > >       case MADV_KEEPONFORK:
> > >               new_flags &= ~VM_WIPEONFORK;
> > >               break;
> > > +     case MADV_WIPEONRELEASE:
> > > +             /* MADV_WIPEONRELEASE is only supported on anonymous memory. */
> > > +             if (VM_WIPEONRELEASE == 0 || vma->vm_file ||
> > > +                 vma->vm_flags & VM_SHARED) {
> > > +                     error = -EINVAL;
> > > +                     goto out;
> > > +             }
> > > +             new_flags |= VM_WIPEONRELEASE;
> > > +             break;
> 
> An interesting effect of this is that it will be possible to set this
> on a CoW anon VMA in a fork() child, and then the semantics in the
> parent will be subtly different - e.g. if the parent vmsplice()d a
> CoWed page into a pipe, then forked an unprivileged child, the child

Maybe a stupid question. How do you fork an unprivileged child (without
exec)? Child would have to drop priviledges on its own, no?

> set MADV_WIPEONRELEASE on its VMA, the parent died somehow, and then
> the child died, the page in the pipe would be zeroed out. A child
> should not be able to affect its parent like this, I think. If this
> was an mmap() flag instead of a madvise() command, that issue could be
> avoided.

With a VMA flag underneath, I think you can do an early CoW during fork
to prevent from that.

> Alternatively, if adding more mmap() flags doesn't work,
> perhaps you could scan the VMA and ensure that it contains no pages
> yet, or something like that?
> 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index ab650c21bccd..ff78b527660e 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1091,6 +1091,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > >                       page_remove_rmap(page, false);
> > >                       if (unlikely(page_mapcount(page) < 0))
> > >                               print_bad_pte(vma, addr, ptent, page);
> > > +                     if (unlikely(vma->vm_flags & VM_WIPEONRELEASE) &&
> > > +                         page_mapcount(page) == 0)
> > > +                             clear_highpage(page);
> > >                       if (unlikely(__tlb_remove_page(tlb, page))) {
> > >                               force_flush = 1;
> > >                               addr += PAGE_SIZE;
> 
> Should something like this perhaps be added in page_remove_rmap()
> instead? That's where the mapcount is decremented; and looking at
> other callers of page_remove_rmap(), in particular the following ones
> look interesting:

Well spotted!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Michal Hocko @ 2019-04-26  5:25 UTC (permalink / raw)
  To: Matthew Garrett; +Cc: linux-mm, Linux Kernel Mailing List, Linux API
In-Reply-To: <CACdnJuutwmBn_ASY1N1+ZK8g4MbpjTnUYbarR+CPhC5BAy0oZA@mail.gmail.com>

On Thu 25-04-19 13:39:01, Matthew Garrett wrote:
> On Thu, Apr 25, 2019 at 5:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > Besides that you inherently assume that the user would do mlock because
> > you do not try to wipe the swap content. Is this intentional?
> 
> Yes, given MADV_DONTDUMP doesn't imply mlock I thought it'd be more
> consistent to keep those independent.

Do we want to fail madvise call on VMAs that are not mlocked then? What
if the munlock happens later after the madvise is called?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* [PATCH V3] mm: Allow userland to request that the kernel clear memory on release
From: Matthew Garrett @ 2019-04-25 22:58 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-api, Matthew Garrett
In-Reply-To: <CACdnJuup-y1xAO93wr+nr6ARacxJ9YXgaceQK9TLktE7shab1w@mail.gmail.com>

From: Matthew Garrett <mjg59@google.com>

Applications that hold secrets and wish to avoid them leaking can use
mlock() to prevent the page from being pushed out to swap and
MADV_DONTDUMP to prevent it from being included in core dumps. Applications
can also use atexit() handlers to overwrite secrets on application exit.
However, if an attacker can reboot the system into another OS, they can
dump the contents of RAM and extract secrets. We can avoid this by setting
CONFIG_RESET_ATTACK_MITIGATION on UEFI systems in order to request that the
firmware wipe the contents of RAM before booting another OS, but this means
rebooting takes a *long* time - the expected behaviour is for a clean
shutdown to remove the request after scrubbing secrets from RAM in order to
avoid this.

Unfortunately, if an application exits uncleanly, its secrets may still be
present in RAM. This can't be easily fixed in userland (eg, if the OOM
killer decides to kill a process holding secrets, we're not going to be able
to avoid that), so this patch adds a new flag to madvise() to allow userland
to request that the kernel clear the covered pages whenever the page
map count hits zero. Since vm_flags is already full on 32-bit, it
will only work on 64-bit systems. This is currently only permitted on
private mappings that have not yet been populated in order to simplify
implementation, which should suffice for the envisaged use cases. We can
extend the behaviour later if we come up with a robust set of semantics.

Signed-off-by: Matthew Garrett <mjg59@google.com>
---

Updated based on feedback from Jann - for now let's just prevent setting
the flag on anything that has already mapped some pages, which avoids
child processes being able to interfere with the parent. In addition,
move the page clearing logic into page_remove_rmap() to ensure that we
cover the full set of cases (eg, handling page migration properly).

 include/linux/mm.h                     |  6 ++++++
 include/linux/rmap.h                   |  2 +-
 include/uapi/asm-generic/mman-common.h |  2 ++
 kernel/events/uprobes.c                |  2 +-
 mm/huge_memory.c                       | 12 ++++++------
 mm/hugetlb.c                           |  4 ++--
 mm/khugepaged.c                        |  2 +-
 mm/ksm.c                               |  2 +-
 mm/madvise.c                           | 25 +++++++++++++++++++++++++
 mm/memory.c                            |  6 +++---
 mm/migrate.c                           |  4 ++--
 mm/rmap.c                              |  9 +++++++--
 12 files changed, 57 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b10c21630f5..64bdab679275 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -257,6 +257,8 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+
+#define VM_WIPEONRELEASE BIT(37)       /* Clear pages when releasing them */
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -298,6 +300,10 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_GROWSUP	VM_NONE
 #endif
 
+#ifndef VM_WIPEONRELEASE
+# define VM_WIPEONRELEASE VM_NONE
+#endif
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..abb47d623edd 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -177,7 +177,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long, bool);
 void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
+void page_remove_rmap(struct page *, struct vm_area_struct *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..82dfff4a8e3d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -64,6 +64,8 @@
 #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
 
+#define MADV_WIPEONRELEASE 20
+#define MADV_DONTWIPEONRELEASE 21
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c5cde87329c7..2230a1717fe3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	set_pte_at_notify(mm, addr, pvmw.pte,
 			mk_pte(new_page, vma->vm_page_prot));
 
-	page_remove_rmap(old_page, false);
+	page_remove_rmap(old_page, vma, false);
 	if (!page_mapped(old_page))
 		try_to_free_swap(old_page);
 	page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 165ea46bf149..1ad6ee5857b7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1260,7 +1260,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(vma->vm_mm, vmf->pmd, pgtable);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, vma, true);
 	spin_unlock(vmf->ptl);
 
 	/*
@@ -1410,7 +1410,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, vma, true);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -1783,7 +1783,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pmd_present(orig_pmd)) {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, vma, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else if (thp_migration_supported()) {
@@ -2146,7 +2146,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			set_page_dirty(page);
 		if (!PageReferenced(page) && pmd_young(_pmd))
 			SetPageReferenced(page);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, vma, true);
 		put_page(page);
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
@@ -2266,7 +2266,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (freeze) {
 		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, false);
+			page_remove_rmap(page + i, vma, false);
 			put_page(page + i);
 		}
 	}
@@ -2954,7 +2954,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 	set_pmd_at(mm, address, pvmw->pmd, pmdswp);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, vma, true);
 	put_page(page);
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6cdc7b2d9100..1df046525861 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3419,7 +3419,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			set_page_dirty(page);
 
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, vma, true);
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -3643,7 +3643,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, haddr, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page, true);
+		page_remove_rmap(old_page, vma, true);
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		set_page_huge_active(new_page);
 		/* Make the old page be freed below */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 449044378782..20df74dfd954 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -673,7 +673,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page, false);
+			page_remove_rmap(src_page, vma, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
diff --git a/mm/ksm.c b/mm/ksm.c
index fc64874dc6f4..280705f65af7 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1193,7 +1193,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
-	page_remove_rmap(page, false);
+	page_remove_rmap(page, vma, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 21a7881a2db4..2a6e616e5c0d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -92,6 +92,26 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	case MADV_KEEPONFORK:
 		new_flags &= ~VM_WIPEONFORK;
 		break;
+	case MADV_WIPEONRELEASE:
+		/*
+		 * MADV_WIPEONRELEASE is only supported on as-yet unallocated
+		 * anonymous memory.
+		 */
+		if (VM_WIPEONRELEASE == 0 || vma->vm_file || vma->anon_vma ||
+		    vma->vm_flags & VM_SHARED) {
+			error = -EINVAL;
+			goto out;
+		}
+
+		new_flags |= VM_WIPEONRELEASE;
+		break;
+	case MADV_DONTWIPEONRELEASE:
+		if (VM_WIPEONRELEASE == 0) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_WIPEONRELEASE;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -727,6 +747,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_DODUMP:
 	case MADV_WIPEONFORK:
 	case MADV_KEEPONFORK:
+	case MADV_WIPEONRELEASE:
+	case MADV_DONTWIPEONRELEASE:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
@@ -785,6 +807,9 @@ madvise_behavior_valid(int behavior)
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *  MADV_WIPEONRELEASE - clear the contents of the memory after the last
+ *		reference to it has been released
+ *  MADV_DONTWIPEONRELEASE - cancel MADV_WIPEONRELEASE
  *
  * return values:
  *  zero    - success
diff --git a/mm/memory.c b/mm/memory.c
index ab650c21bccd..dd9555bb9aec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1088,7 +1088,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, vma, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1116,7 +1116,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, vma, false);
 			put_page(page);
 			continue;
 		}
@@ -2340,7 +2340,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page, false);
+			page_remove_rmap(old_page, vma, false);
 		}
 
 		/* Free the old page.. */
diff --git a/mm/migrate.c b/mm/migrate.c
index 663a5449367a..5d3437a6541d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2083,7 +2083,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	page_ref_unfreeze(page, 2);
 	mlock_migrate_page(new_page, page);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, vma, true);
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
@@ -2313,7 +2313,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * drop page refcount. Page won't be freed, as we took
 			 * a reference just above.
 			 */
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, vma, false);
 			put_page(page);
 
 			if (pte_present(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index b30c7c71d1d9..46dc9946a516 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1292,11 +1292,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
+ * @vma:	VMA the page belongs to
  * @compound:	uncharge the page as compound or small page
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
+		      bool compound)
 {
 	if (!PageAnon(page))
 		return page_remove_file_rmap(page, compound);
@@ -1321,6 +1323,9 @@ void page_remove_rmap(struct page *page, bool compound)
 	if (PageTransCompound(page))
 		deferred_split_huge_page(compound_head(page));
 
+	if (unlikely(vma->vm_flags & VM_WIPEONRELEASE))
+		clear_highpage(page);
+
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
 	 * but that might overwrite a racing page_add_anon_rmap
@@ -1652,7 +1657,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, vma, PageHuge(page));
 		put_page(page);
 	}
 
-- 
2.21.0.593.g511ec345e18-goog

^ permalink raw reply related

* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Christian Brauner @ 2019-04-25 22:24 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, luto, rostedt, dancol, sspatil, jannh, surenb,
	timmurray, Jonathan Kowalski, torvalds, kernel-team,
	Andrew Morton, Arnd Bergmann, Eric W. Biederman,
	Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
	Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
	Stephen Rothwell, Thomas Gleixner, Tycho Andersen
In-Reply-To: <20190425190010.46489-1-joel@joelfernandes.org>

On Thu, Apr 25, 2019 at 03:00:09PM -0400, Joel Fernandes (Google) wrote:
> pidfd are file descriptors referring to a process created with the
> CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> polling support to replace code that currently checks for existence of
> /proc/pid for knowing that a process that is signalled to be killed has
> died, which is both racy and slow. The pidfd poll approach is race-free,
> and also allows the LMK to do other things (such as by polling on other
> fds) while awaiting the process being killed to die.

Thanks for the patch!

Ok, let me be a little bit anal.
Please start the commit message with what this patch does and then add
the justification why. You just say the "pidfd-poll" approach. You can
probably assume that CLONE_PIDFD is available for this patch. So
something like:

"This patch makes pidfds pollable. Specifically, it allows listeners to
be informed when the process the pidfd referes to exits. This patch only
introduces the ability to poll thread-group leaders since pidfds
currently can only reference those..."

Then justify the use-case and then go into implementation details.
That's usually how I would think about this:
- Change the codebase to do X
- Why do we need X
- Are there any technical details worth mentioning in the commit message
(- Are there any controversial points that people stumbled upon but that
  have been settled sufficiently.)

> pidfd are file descriptors referring to a process created with the
> CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> polling support to replace code that currently checks for existence of
> /proc/pid for knowing that a process that is signalled to be killed has
> died, which is both racy and slow. The pidfd poll approach is race-free,
> and also allows the LMK to do other things (such as by polling on other
> fds) while awaiting the process being killed to die.

> 
> It prevents a situation where a PID is reused between when LMK sends a
> kill signal and checks for existence of the PID, since the wrong PID is
> now possibly checked for existence.
> 
> In this patch, we follow the same existing mechanism in the kernel used
> when the parent of the task group is to be notified (do_notify_parent).
> This is when the tasks waiting on a poll of pidfd are also awakened.
> 
> We have decided to include the waitqueue in struct pid for the following
> reasons:
> 1. The wait queue has to survive for the lifetime of the poll. Including
> it in task_struct would not be option in this case because the task can
> be reaped and destroyed before the poll returns.
> 
> 2. By including the struct pid for the waitqueue means that during
> de_thread(), the new thread group leader automatically gets the new
> waitqueue/pid even though its task_struct is different.
> 
> Appropriate test cases are added in the second patch to provide coverage
> of all the cases the patch is handling.
> 
> Andy had a similar patch [1] in the past which was a good reference
> however this patch tries to handle different situations properly related
> to thread group existence, and how/where it notifies. And also solves
> other bugs (waitqueue lifetime).  Daniel had a similar patch [2]
> recently which this patch supercedes.
> 
> [1] https://lore.kernel.org/patchwork/patch/345098/
> [2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
> 
> Cc: luto@amacapital.net
> Cc: rostedt@goodmis.org
> Cc: dancol@google.com
> Cc: sspatil@google.com
> Cc: christian@brauner.io
> Cc: jannh@google.com
> Cc: surenb@google.com
> Cc: timmurray@google.com
> Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
> Cc: torvalds@linux-foundation.org
> Cc: kernel-team@android.com

These should all be in the form:

Cc: Firstname Lastname <email@address.com>

There are people missing from the Cc that really should be there...

Even though he usually doesn't respond that often, please Cc Al on this.
If he responds it's usually rather important.

Oleg has reviewed your RFC patch quite substantially and given valuable
feedback and has an opinion on this thing and is best acquainted with
the exit code. So please add him to the Cc of the commit message in the
appropriate form and also add him to the Cc of the thread.

Probably also want linux-api for good measure since a lot of people are
subscribed that would care about pollable pidfds. I'd also add Kees
since he had some interest in this work and David (Howells).

> Co-developed-by: Daniel Colascione <dancol@google.com>

Every CDB needs to give a SOB as well.

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> ---
> 
> RFC -> v1:
> * Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
> * Updated selftests.
> * Renamed poll wake function to do_notify_pidfd.
> * Removed depending on EXIT flags
> * Removed POLLERR flag since semantics are controversial and
>   we don't have usecases for it right now (later we can add if there's
>   a need for it).
> 
>  include/linux/pid.h |  3 +++
>  kernel/fork.c       | 33 +++++++++++++++++++++++++++++++++
>  kernel/pid.c        |  2 ++
>  kernel/signal.c     | 14 ++++++++++++++
>  4 files changed, 52 insertions(+)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 3c8ef5a199ca..1484db6ca8d1 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -3,6 +3,7 @@
>  #define _LINUX_PID_H
>  
>  #include <linux/rculist.h>
> +#include <linux/wait.h>
>  
>  enum pid_type
>  {
> @@ -60,6 +61,8 @@ struct pid
>  	unsigned int level;
>  	/* lists of tasks that use this pid */
>  	struct hlist_head tasks[PIDTYPE_MAX];
> +	/* wait queue for pidfd notifications */
> +	wait_queue_head_t wait_pidfd;
>  	struct rcu_head rcu;
>  	struct upid numbers[1];
>  };
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5525837ed80e..fb3b614f6456 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1685,8 +1685,41 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
>  }
>  #endif
>  
> +static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
> +{
> +	struct task_struct *task;
> +	struct pid *pid;
> +	int poll_flags = 0;
> +
> +	/*
> +	 * tasklist_lock must be held because to avoid racing with
> +	 * changes in exit_state and wake up. Basically to avoid:
> +	 *
> +	 * P0: read exit_state = 0
> +	 * P1: write exit_state = EXIT_DEAD
> +	 * P1: Do a wake up - wq is empty, so do nothing
> +	 * P0: Queue for polling - wait forever.
> +	 */
> +	read_lock(&tasklist_lock);
> +	pid = file->private_data;
> +	task = pid_task(pid, PIDTYPE_PID);
> +	WARN_ON_ONCE(task && !thread_group_leader(task));
> +
> +	if (!task || (task->exit_state && thread_group_empty(task)))
> +		poll_flags = POLLIN | POLLRDNORM;

So we block until the thread-group is empty? Hm, the thread-group leader
remains in zombie state until all threads are gone. Should probably just
be a short comment somewhere that callers are only informed about a
whole thread-group exit and not about when the thread-group leader has
actually exited.

I would like the ability to extend this interface in the future to allow
for actually reading data from the pidfd on EPOLLIN.
POSIX specifies that POLLIN and POLLRDNORM are set even if the
message to be read is zero. So one cheap way of doing this would
probably be to do a 0 read/ioctl. That wouldn't hurt your very limited
usecase and people could test whether the read returned non-0 data and
if so they know this interface got extended. If we never extend it here
it won't matter.

> +
> +	if (!poll_flags)
> +		poll_wait(file, &pid->wait_pidfd, pts);
> +
> +	read_unlock(&tasklist_lock);
> +
> +	return poll_flags;
> +}


> +
> +
>  const struct file_operations pidfd_fops = {
>  	.release = pidfd_release,
> +	.poll = pidfd_poll,
>  #ifdef CONFIG_PROC_FS
>  	.show_fdinfo = pidfd_show_fdinfo,
>  #endif
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..5c90c239242f 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	for (type = 0; type < PIDTYPE_MAX; ++type)
>  		INIT_HLIST_HEAD(&pid->tasks[type]);
>  
> +	init_waitqueue_head(&pid->wait_pidfd);
> +
>  	upid = pid->numbers + ns->level;
>  	spin_lock_irq(&pidmap_lock);
>  	if (!(ns->pid_allocated & PIDNS_ADDING))
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1581140f2d99..16e7718316e5 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1800,6 +1800,17 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
>  	return ret;
>  }
>  
> +static void do_notify_pidfd(struct task_struct *task)

Maybe a short command that this helper can only be called when we know
that task is a thread-group leader wouldn't hurt so there's no confusion
later.

> +{
> +	struct pid *pid;
> +
> +	lockdep_assert_held(&tasklist_lock);
> +
> +	pid = get_task_pid(task, PIDTYPE_PID);
> +	wake_up_all(&pid->wait_pidfd);
> +	put_pid(pid);
> +}
> +
>  /*
>   * Let a parent know about the death of a child.
>   * For a stopped/continued status change, use do_notify_parent_cldstop instead.
> @@ -1823,6 +1834,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  	BUG_ON(!tsk->ptrace &&
>  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
>  
> +	/* Wake up all pidfd waiters */
> +	do_notify_pidfd(tsk);
> +
>  	if (sig != SIGCHLD) {
>  		/*
>  		 * This is only possible if parent == real_parent.
> -- 
> 2.21.0.593.g511ec345e18-goog

^ permalink raw reply

* Re: [PATCHv3 06/27] posix-timers/timens: Take into account clock offsets
From: Thomas Gleixner @ 2019-04-25 21:45 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Vincenzo Frascino, containers, criu,
	linux-api, x86
In-Reply-To: <20190425161416.26600-7-dima@arista.com>

On Thu, 25 Apr 2019, Dmitry Safonov wrote:

> From: Andrei Vagin <avagin@gmail.com>
> 
> Wire timer_settime() syscall into time namespace virtualization.
> 
> Signed-off-by: Andrei Vagin <avagin@openvz.org>
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>  kernel/time/alarmtimer.c   | 3 +++
>  kernel/time/posix-timers.c | 2 ++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
> index 31f99361342e..23f5c39c8d23 100644
> --- a/kernel/time/alarmtimer.c
> +++ b/kernel/time/alarmtimer.c
> @@ -26,6 +26,7 @@
>  #include <linux/freezer.h>
>  #include <linux/compat.h>
>  #include <linux/module.h>
> +#include <linux/time_namespace.h>
>  
>  #include "posix-timers.h"
>  
> @@ -621,6 +622,8 @@ static void alarm_timer_arm(struct k_itimer *timr, ktime_t expires,
>  
>  	if (!absolute)
>  		expires = ktime_add_safe(expires, base->gettime());
> +	else
> +		expires = timens_ktime_to_host(base->base_clockid, expires);
>
>  	if (sigev_none)
>  		alarm->node.expires = expires;
>  	else
> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> index a723e63d55fd..7922731f98cd 100644
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -803,6 +803,8 @@ static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
>  
>  	if (!absolute)
>  		expires = ktime_add_safe(expires, timer->base->get_time());
> +	else
> +		expires = timens_ktime_to_host(timer->base->clockid, expires);
>  	hrtimer_set_expires(timer, expires);

So you have now two places where you do that adjustment. Why? And why in
the timer_arm() callback? The obvious place to do that is in
common_timer_set().

-	expires = timespec64_to_ktime(new_setting->it_value);
+	expires = timespec64_to_host(timr->it_clock, &new_setting->it_value);

static ktime_t timespec64_to_host(clockid_t which, struct timespec64 val)
{
	ktime_t expires = timespec64_to_ktime(val);

#ifdef NAMESPACE
	switch (which) {
	case CLOCK_MONOTONIC:
		return timens_sub_monotonic(expires);
	case CLOCK_BOOTTINE:
	case CLOCK_BOOTTIME_ALARM:
		return timens_sub_boottime(expires);
	}
#endif
	return expires;
}

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCHv3 05/27] timerfd/timens: Take into account ns clock offsets
From: Thomas Gleixner @ 2019-04-25 21:28 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Vincenzo Frascino, containers, criu,
	linux-api, x86
In-Reply-To: <20190425161416.26600-6-dima@arista.com>

On Thu, 25 Apr 2019, Dmitry Safonov wrote:
> From: Andrei Vagin <avagin@gmail.com>
> 
> Make timerfd respect timens offsets.
> Provide a helper timens_ktime_to_host() that is useful to wire up
> timens to different kernel subsystems.

Yet another changelog which lacks meat.

> @@ -179,6 +180,8 @@ static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
>  	htmode = (flags & TFD_TIMER_ABSTIME) ?
>  		HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
>  
> +	htmode |= HRTIMER_MODE_NS;

Without looking further this time. My gut reaction is that this is
wrong. Name space adjustment is only valid for absolute timers not for
relative timers.

Aside of that the name sucks. MODE_NS is really not intuitive. It could be
NanoSeconds or whatever and quite some time(r) functions have a _ns element
already. Please look for something more inuitive and clearly related to
namespaces. We are not short of letters.

>  	texp = timespec64_to_ktime(ktmr->it_value);
>  	ctx->expired = 0;
>  	ctx->ticks = 0;
> @@ -197,9 +200,10 @@ static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
>  
>  	if (texp != 0) {
>  		if (isalarm(ctx)) {
> -			if (flags & TFD_TIMER_ABSTIME)
> +			if (flags & TFD_TIMER_ABSTIME) {
> +				texp = timens_ktime_to_host(clockid, texp);

You are not serious about that inline function here? It's huge and
pointless bloat because the only time affected here is boot time, but the
compiler does not know that.

>  				alarm_start(&ctx->t.alarm, texp);

Make that:

   alarm_start_namespace(.....)

and that does:

void alarm_start_namespace(struct alarm *alarm, ktime_t expires)
{
	if (alarm->type == ALARM_BOOTTIME)
		expires = timens_sub_boottime(expires);
	alarm_start(alarm, expires);
}

Hmm?

> -			else
> +			} else
>  				alarm_start_relative(&ctx->t.alarm, texp);
>  		} else {
>  			hrtimer_start(&ctx->t.tmr, texp, htmode);
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 2e8957eac4d4..4b9c89c797ee 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -38,6 +38,7 @@ enum hrtimer_mode {
>  	HRTIMER_MODE_REL	= 0x01,
>  	HRTIMER_MODE_PINNED	= 0x02,
>  	HRTIMER_MODE_SOFT	= 0x04,
> +	HRTIMER_MODE_NS		= 0x08,
>  
>  	HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
>  	HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
> diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
> index 5f0da6858b10..988414f7f791 100644
> --- a/include/linux/time_namespace.h
> +++ b/include/linux/time_namespace.h
> @@ -56,6 +56,41 @@ static inline void timens_add_boottime(struct timespec64 *ts)
>                  *ts = timespec64_add(*ts, ns_offsets->monotonic_boottime_offset);
>  }
>  
> +static inline ktime_t timens_ktime_to_host(clockid_t clockid, ktime_t tim)
> +{
> +	struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
> +	struct timespec64 *offset;
> +	ktime_t koff;
> +
> +	if (!ns_offsets)
> +		return tim;
> +
> +	switch (clockid) {
> +		case CLOCK_MONOTONIC:
> +		case CLOCK_MONOTONIC_RAW:
> +		case CLOCK_MONOTONIC_COARSE:

What's the point of COARSE and RAW? Neither of them can be used to arm
timers.

> +			offset = &ns_offsets->monotonic_time_offset;
> +			break;
> +		case CLOCK_BOOTTIME:
> +		case CLOCK_BOOTTIME_ALARM:
> +			offset = &ns_offsets->monotonic_boottime_offset;
> +			break;
> +		default:
> +			return tim;
> +	}
> +
> +	koff = timespec64_to_ktime(*offset);

What about storing both the timespec and the ktime_t representation?

> +	if (tim < koff)
> +		tim = 0;
> +	else if (KTIME_MAX - tim < -koff)
> +		tim = KTIME_MAX;

Blink!?! This is completely nonobvious and you're going to stare at it in
disbelief half a year from now. Comments exist for a reason.

> +	else
> +		tim = ktime_sub(tim, koff);
> +
> +	return tim;

This whole thing is way too large for inlining.

Please create a function which does the magic substraction, something like
ktime_sub_namespace_offset() and invoke it from the proper places, i.e. the
alarmtimer one.

> @@ -1069,6 +1070,8 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>  
>  	if (mode & HRTIMER_MODE_REL)
>  		tim = ktime_add_safe(tim, base->get_time());
> +	else if (mode & HRTIMER_MODE_NS)
> +		tim = timens_ktime_to_host(base->clockid, tim);

You can do the same as for alarmtime above:

hrtimer_start_namespace(struct hrtimer *timer, ktime_t tim,
			const enum hrtimer_mode mode)
{
	if (mode & HRTIMER_MODE_ABS) {
		switch(timer->base->clockid) {
		case CLOCK_MONOTONIC:
			tim = timens_sub_monotonic(tim);
			break;
		case CLOCK_BOOTTIME:
			tim = timens_sub_boottime(tim);
			break;
		}
	}
	hrtimer_start(timer, tim, mode);
}

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Matthew Garrett @ 2019-04-25 20:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Linux-MM, Linux Kernel Mailing List, Linux API
In-Reply-To: <1df0ef0c-4219-c259-18a2-9abfb2782c08@suse.cz>

On Thu, Apr 25, 2019 at 5:44 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/25/19 2:14 PM, Michal Hocko wrote:
> > Please cc linux-api for user visible API proposals (now done). Keep the
> > rest of the email intact for reference.
> >
> > On Wed 24-04-19 14:10:39, Matthew Garrett wrote:
> >> From: Matthew Garrett <mjg59@google.com>
> >>
> >> Applications that hold secrets and wish to avoid them leaking can use
> >> mlock() to prevent the page from being pushed out to swap and
> >> MADV_DONTDUMP to prevent it from being included in core dumps. Applications
>
> So, do we really need a new madvise() flag and VMA flag, or can we just
> infer this page clearing from mlock+MADV_DONTDUMP being both applied?

I think the combination would probably imply that this is the
behaviour you want, but I'm a little concerned about changing the
semantics given the corner cases described earlier in the thread. If
we can figure those out in a way that won't break any existing code, I
could buy this.

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Matthew Garrett @ 2019-04-25 20:43 UTC (permalink / raw)
  To: Jann Horn; +Cc: Michal Hocko, Linux-MM, kernel list, Linux API
In-Reply-To: <CAG48ez0x6QiFpqXbimB9ZV-jS5UJJWhzg9XiAWncQL+phfKkPA@mail.gmail.com>

On Thu, Apr 25, 2019 at 5:43 AM Jann Horn <jannh@google.com> wrote:
> An interesting effect of this is that it will be possible to set this
> on a CoW anon VMA in a fork() child, and then the semantics in the
> parent will be subtly different - e.g. if the parent vmsplice()d a
> CoWed page into a pipe, then forked an unprivileged child, the child
> set MADV_WIPEONRELEASE on its VMA, the parent died somehow, and then
> the child died, the page in the pipe would be zeroed out. A child
> should not be able to affect its parent like this, I think. If this
> was an mmap() flag instead of a madvise() command, that issue could be
> avoided. Alternatively, if adding more mmap() flags doesn't work,
> perhaps you could scan the VMA and ensure that it contains no pages
> yet, or something like that?

I /think/ my argument here would be not to do that? I agree that it's
unexpected, but I guess the other alternative would be to force a copy
on any existing COW pages in the VMA at madvise() time, and maybe also
at fork() time (sort of like the behaviour of MADV_WIPEONFORK, but
copying the page rather than providing a new zero page)

> I think all the callers have a reference to the VMA, so perhaps you
> could add a VMA parameter to page_remove_rmap() and then look at the
> VMA in there?

I'll dig into that, thanks!

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Matthew Garrett @ 2019-04-25 20:39 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Linux Kernel Mailing List, Linux API
In-Reply-To: <20190425123755.GX12751@dhcp22.suse.cz>

On Thu, Apr 25, 2019 at 5:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> Besides that you inherently assume that the user would do mlock because
> you do not try to wipe the swap content. Is this intentional?

Yes, given MADV_DONTDUMP doesn't imply mlock I thought it'd be more
consistent to keep those independent.

> Another question would be regarding the targeted user API. There are
> some attempts to make all the freed memory to be zeroed/poisoned. Are
> users who would like to use this feature also be interested in using
> system wide setting as well?

I think that depends on the performance overhead of a global setting,
but it's also influenced by the semantics around when the freeing
occurs, which is something I haven't nailed down yet. If the
expectation is that the page is freed whenever the process exits, even
if the page is in use somewhere else, then we'd still want this to be
separate to poisoning on final page free.

^ permalink raw reply

* Re: [PATCHv3 04/27] timens: Introduce CLOCK_BOOTTIME offset
From: Cyrill Gorcunov @ 2019-04-25 20:08 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Dmitry Safonov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jeff Dike,
	Oleg Nesterov, Pavel Emelyanov, Shuah Khan, Thomas Gleixner,
	Vincenzo Frascino, containers, criu, linux-api, x86, Andrei Vagin
In-Reply-To: <20190425161416.26600-5-dima@arista.com>

On Thu, Apr 25, 2019 at 05:13:53PM +0100, Dmitry Safonov wrote:
> 
> diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
> index 8f75d34cf34a..5f0da6858b10 100644
> --- a/include/linux/time_namespace.h
> +++ b/include/linux/time_namespace.h
> @@ -48,6 +48,14 @@ static inline void timens_add_monotonic(struct timespec64 *ts)
>                  *ts = timespec64_add(*ts, ns_offsets->monotonic_time_offset);
>  }
>  
> +static inline void timens_add_boottime(struct timespec64 *ts)
> +{
> +        struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
> +
> +        if (ns_offsets)
> +                *ts = timespec64_add(*ts, ns_offsets->monotonic_boottime_offset);
> +}
>
...

>  struct timens_offsets {
>  	struct timespec64  monotonic_time_offset;
> +	struct timespec64  monotonic_boottime_offset;
>  };

Guys, is not the _offset postfix here redundant? timens_offsets
already has it and ns_offsets->monotonic_boottime_offset looks
too much.

static always_inline current_time_ns_offsets(void)
{
	return current->nsproxy->time_ns->offsets;
}

static inline void timens_add_boottime(struct timespec64 *ts)
{
	struct timens_offsets *tns_off = current_time_ns_offsets();

	if (tns_off)
		*ts = timespec64_add(*ts, tns_off->monotonic_boottime);
}

Hmm? Up to you though.

^ permalink raw reply

* Re: [PATCHv3 03/27] timens: Introduce CLOCK_MONOTONIC offsets
From: Thomas Gleixner @ 2019-04-25 19:52 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov, Dmitry Safonov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jeff Dike,
	Oleg Nesterov, Pavel Emelyanov, Shuah Khan, Vincenzo Frascino,
	containers, criu, linux-api, x86
In-Reply-To: <20190425161416.26600-4-dima@arista.com>

On Thu, 25 Apr 2019, Dmitry Safonov wrote:
> From: Andrei Vagin <avagin@openvz.org>
> 
> Add monotonic time virtualisation for time namespace.
> Introduce timespec for monotionic clock into timens offsets and wire
> clock_gettime() syscall.

That's a bit meager. It should at least explain the concept how this is
supposed to work.
 
>  #define CLOCKS_MASK			(CLOCK_REALTIME | CLOCK_MONOTONIC)
>  #define CLOCKS_MONO			CLOCK_MONOTONIC

Unrelated to this, but those two defines can go away.
  
> +#define CLOCK_TIMENS			1024

Random number pulled out of thin air without any explanation. Aside of that
why is this exposed to user space?

> @@ -77,6 +78,7 @@ int do_clock_gettime(clockid_t which_clock, struct timespec64 *tp)
>  		break;
>  	case CLOCK_MONOTONIC:
>  		ktime_get_ts64(tp);
> +		timens_add_monotonic(tp);
>  		break;
>  	case CLOCK_BOOTTIME:
>  		ktime_get_boottime_ts64(tp);
> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> index 29176635991f..a047d6b7c768 100644
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -30,6 +30,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/compat.h>
>  #include <linux/nospec.h>
> +#include <linux/time_namespace.h>
>  
>  #include "timekeeping.h"
>  #include "posix-timers.h"
> @@ -190,6 +191,8 @@ static int posix_clock_realtime_adj(const clockid_t which_clock,
>  static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
>  {
>  	ktime_get_ts64(tp);
> +	if (which_clock & CLOCK_TIMENS)
> +		timens_add_monotonic(tp);

> @@ -1039,7 +1046,7 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
>  	if (!kc)
>  		return -EINVAL;
>  
> -	error = kc->clock_get(which_clock, &kernel_tp);
> +	error = kc->clock_get(which_clock | CLOCK_TIMENS, &kernel_tp);

Bah. After some banging the head against the wall I figured out what you
are doing here. It's because kc->clock_get() is used in common_timer_get()
as well and there you don't want the adjustment there. Sorry, but this is
uncomprehensible and unmaintainable and adds yet another conditional into
the code flow.

The callsite in common_timer_get() has already a comment:
        /*
         * The timespec64 based conversion is suboptimal, but it's not
         * worth to implement yet another callback.
         */
        kc->clock_get(timr->it_clock, &ts64);
        now = timespec64_to_ktime(ts64);

So now we have a proper justification to add that extra callback:
separation of functionality. It's trivial and not a massive code size
overhead.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
From: Aleksa Sarai @ 2019-04-25 19:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, David Howells, Eric Biederman, Jann Horn,
	Christian Brauner, David Drysdale, Tycho Andersen,
	Linux Containers, Linux FS Devel, Linux API, Andrew Morton,
	Alexei Starovoitov, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds
In-Reply-To: <20190424153806.64qkkmkudzodxnz2@yavin>

[-- Attachment #1: Type: text/plain, Size: 1043 bytes --]

On 2019-04-25, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-04-23, Kees Cook <keescook@chromium.org> wrote:
> > This series provides solutions to so many different race and confusion
> > issues, I'd really like to see it land. What's the next step here? Is
> > this planned to go directly to Linus for v5.2, or is it going to live
> > in -mm for a while? I'd really like to see this moving forward.
> 
> Given some of the security requirements of this interface, I think
> getting it to live in -mm wouldn't be a bad idea so folks can shake the
> bugs out before it's depended on by container runtimes.

Scratch my mention of -mm, it should be in Al's tree since it touches
quite a few of the namei seqlocks. My point was that it should live in
someone's tree for a little bit before it goes into a release.

I will put together a PoC of a resolveat(2) variation of this series and
re-send it out with both versions.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v10 00/18] Introduce the Counter subsystem
From: Greg KH @ 2019-04-25 19:36 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: William Breathitt Gray, linux-arm-kernel, devicetree,
	linux-kernel, linux-iio, fabrice.gasnier, benjamin.gaignard,
	knaack.h, lars, pmeerw, akpm, david, robh+dt, mark.rutland,
	shawnguo, leoyang.li, daniel.lezcano, tglx, thierry.reding, esben,
	linux-pwm, linuxppc-dev, patrick.havelange, linux-api
In-Reply-To: <20190407152550.451a7f63@archlinux>

On Sun, Apr 07, 2019 at 03:25:50PM +0100, Jonathan Cameron wrote:
> On Tue,  2 Apr 2019 15:30:35 +0900
> William Breathitt Gray <vilhelm.gray@gmail.com> wrote:
> 
> > Changes in v10:
> >   - Fix minor typographical errors in documentation
> >   - Merge the FlexTimer Module Quadrature decoder counter driver patches
> > 
> > This revision is functionally identical to the last; changes in this
> > version were made to fix minor typos in the documentation files and also
> > to pull in the new FTM quadrature decoder counter driver.
> > 
> > The Generic Counter API has been and is still in a feature freeze until
> > it is merged into the mainline. The following features will be
> > investigated after the merge: interrupt support for counter devices, and
> > a character device interface for low-latency applications.
> 
> Hi William / al,
> 
> So the question is how to move this forwards?  I'm happy with how it turned
> out and the existing drivers we had in IIO are a lot cleaner under
> the counter subsystem (other than the backwards compatibility for those that
> ever existed in IIO).  For those  not following closely the situation is:

I've now sucked this into my staging-testing branch and if 0-day is fine
with it, I'll merge it to staging-next in a day or so.  This way you can
build on it for any iio drivers that might be coming.

I do have reservations about that one sysfs file that is multi-line, and
I think it will come to bite you in the end over time, so I reserve the
right to say "I told you so" when that happens...

But, I don't have a better answer for it now, so don't really worry
about it :)

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCHv3 01/27] ns: Introduce Time Namespace
From: Jann Horn @ 2019-04-25 19:10 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: kernel list, Andrei Vagin, Adrian Reber, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov, Dmitry Safonov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jeff Dike,
	Oleg Nesterov, Pavel Emelyanov, Shuah Khan, Thomas Gleixner,
	Vincenzo Frascino, containers, criu, Linux API,
	the arch/x86 maintainers <x86>
In-Reply-To: <20190425161416.26600-2-dima@arista.com>

On Thu, Apr 25, 2019 at 6:14 PM Dmitry Safonov <dima@arista.com> wrote:
> Time Namespace isolates clock values.
>
> The kernel provides access to several clocks CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
>
> CLOCK_REALTIME
>       System-wide clock that measures real (i.e., wall-clock) time.
>
> CLOCK_MONOTONIC
>       Clock that cannot be set and represents monotonic time since
>       some unspecified starting point.
>
> CLOCK_BOOTTIME
>       Identical to CLOCK_MONOTONIC, except it also includes any time
>       that the system is suspended.
>
> For many users, the time namespace means the ability to changes date and
> time in a container (CLOCK_REALTIME).
>
> But in a context of the checkpoint/restore functionality, monotonic and
> bootime clocks become interesting. Both clocks are monotonic with
> unspecified staring points. These clocks are widely used to measure time
> slices and set timers. After restoring or migrating processes, we have to
> guarantee that they never go backward. In an ideal case, the behavior of
> these clocks should be the same as for a case when a whole system is
> suspended. All this means that we need to be able to set CLOCK_MONOTONIC
> and CLOCK_BOOTTIME clocks, what can be done by adding per-namespace
> offsets for clocks.
>
> A time namespace is similar to a pid namespace in a way how it is
> created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
> but doesn't set it to the current process. Then all children of
> the process will be born in the new time namespace, or a process can
> use the setns() system call to join a namespace.
>
> This scheme allows setting clock offsets for a namespace, before any
> processes appear in it.

Is there a check anywhere to make sure that you can't use
clone(CLONE_VM) after calling unshare(CLONE_NEWTIME), and that you
can't use clone(CLONE_VM|CLONE_NEWTIME)? Those things probably
shouldn't be allowed, right?

CLONE_NEWPID has similar rules; from the clone.2 manpage:

       EINVAL CLONE_THREAD was specified, but the current process previously
              called unshare(2) with the CLONE_NEWPID flag or used setns(2)
              to reassociate itself with a PID namespace.
[...]
       EINVAL One (or both) of CLONE_NEWPID or CLONE_NEWUSER and one (or
              both) of CLONE_THREAD or CLONE_PARENT were specified in flags.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox