Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
From: Michal Hocko @ 2019-05-16  7:47 UTC (permalink / raw)
  To: Greg KH
  Cc: Oleksandr Natalenko, linux-kernel, Kirill Tkhai, Vlastimil Babka,
	Matthew Wilcox, Pavel Tatashin, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api, Hugh Dickins,
	Suren Baghdasaryan, Minchan Kim
In-Reply-To: <20190515151557.GA23969@kroah.com>

On Wed 15-05-19 17:15:57, Greg KH wrote:
> On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote:
> > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
> > 
> > On Wed 15-05-19 08:53:11, Michal Hocko wrote:
> > [...]
> > > I will try to comment on the interface itself later. But I have to say
> > > that I am not impressed. Abusing sysfs for per process features is quite
> > > gross to be honest.
> > 
> > I have already commented on this in other email. I consider sysfs an
> > unsuitable interface for per-process API.
> 
> Wait, what?  A new sysfs file/directory per process?  That's crazy, no
> one must have benchmarked it :)

Just to clarify, that was not a per process file but rather per process API.
Essentially echo $PID > $SYSFS_SPECIAL_FILE

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
From: Greg KH @ 2019-05-16  7:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleksandr Natalenko, linux-kernel, Kirill Tkhai, Vlastimil Babka,
	Matthew Wilcox, Pavel Tatashin, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api, Hugh Dickins,
	Suren Baghdasaryan, Minchan Kim
In-Reply-To: <20190516074713.GK16651@dhcp22.suse.cz>

On Thu, May 16, 2019 at 09:47:13AM +0200, Michal Hocko wrote:
> On Wed 15-05-19 17:15:57, Greg KH wrote:
> > On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote:
> > > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
> > > 
> > > On Wed 15-05-19 08:53:11, Michal Hocko wrote:
> > > [...]
> > > > I will try to comment on the interface itself later. But I have to say
> > > > that I am not impressed. Abusing sysfs for per process features is quite
> > > > gross to be honest.
> > > 
> > > I have already commented on this in other email. I consider sysfs an
> > > unsuitable interface for per-process API.
> > 
> > Wait, what?  A new sysfs file/directory per process?  That's crazy, no
> > one must have benchmarked it :)
> 
> Just to clarify, that was not a per process file but rather per process API.
> Essentially echo $PID > $SYSFS_SPECIAL_FILE

Ick, no, that's not ok either.  sysfs files are not a replacement for
syscalls :)

thanks,

greg k-h

^ permalink raw reply

* [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

It all began with the fact that KSM works only on memory that is marked
by madvise(). And the only way to get around that is to either:

  * use LD_PRELOAD; or
  * patch the kernel with something like UKSM or PKSM.

(i skip ptrace can of worms here intentionally)

To overcome this restriction, lets implement a per-process /proc knob,
which allows calling madvise remotely. This can be used manually on a
task in question or by some small userspace helper daemon that will do
auto-KSM job for us.

Also, following the discussions from the previous submissions [2] and
[3], make the interface more generic, so that it can be used for other
madvise hints in the future. At this point, I'd like Android people to
speak up, for instance, and clarify in which form they need page
granularity or other things I've missed or have never heard about.

So, I think of three major consumers of this interface:

  * hosts, that run containers, especially similar ones and especially in
    a trusted environment, sharing the same runtime like Node.js;

  * heavy applications, that can be run in multiple instances, not
    limited to opensource ones like Firefox, but also those that cannot be
	modified since they are binary-only and, maybe, statically linked;

  * Android environment that wants to do tricks with
    MADV_WILLNEED/DONTNEED or something similar.

On to the actual implementation. The per-process knob is named "madvise",
and it is write-only. It accepts a madvise hint name to be executed.
Currently, only KSM hints are implemented:

* to mark all the eligible VMAs as mergeable, use:

   # echo merge > /proc/<pid>/madvise

* to unmerge all the VMAs, use:

   # echo unmerge > /proc/<pid>/madvise

I've implemented address space level granularity instead of VMA/page
granularity intentionally for simplicity. If the discussion goes in
other directions, this can be re-implemented to act on a specific VMA
(via map_files?) or page-wise.

Speaking of statistics, more numbers can be found in the very first
submission, that is related to this one [1]. For my current setup with
two Firefox instances I get 100 to 200 MiB saved for the second instance
depending on the amount of tabs.

1 FF instance with 15 tabs:

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   410

2 FF instances, second one has 12 tabs (all the tabs are different):

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   592

At the very moment I do not have specific numbers for containerised
workload, but those should be comparable in case the containers share
similar/same runtime.

The history of this patchset:

  * [2] was based on Timofey's submission [1], but it didn't use a
    dedicated kthread to walk through the list of tasks/VMAs. Instead,
	do_anonymous_page() was amended to implement fully automatic mode,
	but this approach was incorrect due to improper locking and not
	desired due to excessive complexity and being KSM-specific;
  * [3] implemented KSM-specific madvise hints via sysfs, leaving
    traversing /proc to userspace if needed. The approach was not
	desired due to the fact that sysfs shouldn't implement any
	per-process API. Also, the interface was not generic enough to
	extend it for other users.

I drop all the "Reviewed-by" tags from previous submissions because of
code changes and because the objective of this series is now somewhat
different.

Please comment!

Thanks.

[1] https://lore.kernel.org/patchwork/patch/1012142/
[2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/05076.html

Oleksandr Natalenko (5):
  proc: introduce madvise placeholder
  mm/ksm: introduce ksm_madvise_merge() helper
  mm/ksm: introduce ksm_madvise_unmerge() helper
  mm/ksm, proc: introduce remote merge
  mm/ksm, proc: add remote madvise documentation

 Documentation/filesystems/proc.txt | 13 +++++
 fs/proc/base.c                     | 70 +++++++++++++++++++++++
 include/linux/ksm.h                |  4 ++
 mm/ksm.c                           | 92 +++++++++++++++++++-----------
 4 files changed, 145 insertions(+), 34 deletions(-)

-- 
2.21.0

^ permalink raw reply

* [PATCH RFC 1/5] proc: introduce madvise placeholder
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api
In-Reply-To: <20190516094234.9116-1-oleksandr@redhat.com>

Add a write-only /proc/<pid>/madvise file to handle remote madvise
operations.

For now, this is just a soother that does nothing.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 fs/proc/base.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6a803a0b75df..f69532d6b74f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2957,6 +2957,25 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_STACKLEAK_METRICS */
 
+static ssize_t madvise_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	put_task_struct(task);
+
+	return count;
+}
+
+static const struct file_operations proc_madvise_operations = {
+	.write		= madvise_write,
+	.llseek		= noop_llseek,
+};
+
 /*
  * Thread groups
  */
@@ -3061,6 +3080,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_STACKLEAK_METRICS
 	ONE("stack_depth", S_IRUGO, proc_stack_depth),
 #endif
+	REG("madvise", S_IRUGO|S_IWUSR, proc_madvise_operations),
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.21.0

^ permalink raw reply related

* [PATCH RFC 2/5] mm/ksm: introduce ksm_madvise_merge() helper
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api
In-Reply-To: <20190516094234.9116-1-oleksandr@redhat.com>

Move MADV_MERGEABLE part of ksm_madvise() into a dedicated helper since
it will be further used for marking VMAs to be merged forcibly.

This does not bring any functional changes.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 include/linux/ksm.h |  2 ++
 mm/ksm.c            | 60 +++++++++++++++++++++++++++------------------
 2 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index e48b1e453ff5..e824b3141677 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -19,6 +19,8 @@ struct stable_node;
 struct mem_cgroup;
 
 #ifdef CONFIG_KSM
+int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long *vm_flags);
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
diff --git a/mm/ksm.c b/mm/ksm.c
index fc64874dc6f4..1fdcf2fbd58d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2442,41 +2442,53 @@ static int ksm_scan_thread(void *nothing)
 	return 0;
 }
 
-int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags)
+int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long *vm_flags)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
-	switch (advice) {
-	case MADV_MERGEABLE:
-		/*
-		 * Be somewhat over-protective for now!
-		 */
-		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
-				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_MIXEDMAP))
-			return 0;		/* just ignore the advice */
+	/*
+	 * Be somewhat over-protective for now!
+	 */
+	if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
+			 VM_HUGETLB | VM_MIXEDMAP))
+		return 0;		/* just ignore the advice */
 
-		if (vma_is_dax(vma))
-			return 0;
+	if (vma_is_dax(vma))
+		return 0;
 
 #ifdef VM_SAO
-		if (*vm_flags & VM_SAO)
-			return 0;
+	if (*vm_flags & VM_SAO)
+		return 0;
 #endif
 #ifdef VM_SPARC_ADI
-		if (*vm_flags & VM_SPARC_ADI)
-			return 0;
+	if (*vm_flags & VM_SPARC_ADI)
+		return 0;
 #endif
 
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
-			err = __ksm_enter(mm);
-			if (err)
-				return err;
-		}
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		err = __ksm_enter(mm);
+		if (err)
+			return err;
+	}
+
+	*vm_flags |= VM_MERGEABLE;
+
+	return 0;
+}
 
-		*vm_flags |= VM_MERGEABLE;
+int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, int advice, unsigned long *vm_flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int err;
+
+	switch (advice) {
+	case MADV_MERGEABLE:
+		err = ksm_madvise_merge(mm, vma, vm_flags);
+		if (err)
+			return err;
 		break;
 
 	case MADV_UNMERGEABLE:
-- 
2.21.0

^ permalink raw reply related

* [PATCH RFC 3/5] mm/ksm: introduce ksm_madvise_unmerge() helper
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api
In-Reply-To: <20190516094234.9116-1-oleksandr@redhat.com>

Move MADV_UNMERGEABLE part of ksm_madvise() into a dedicated helper
since it will be further used for unmerging VMAs forcibly.

This does not bring any functional changes.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 include/linux/ksm.h |  2 ++
 mm/ksm.c            | 32 ++++++++++++++++++++++----------
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index e824b3141677..a91a7cfc87a1 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -21,6 +21,8 @@ struct mem_cgroup;
 #ifdef CONFIG_KSM
 int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long *vm_flags);
+int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, unsigned long *vm_flags);
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
diff --git a/mm/ksm.c b/mm/ksm.c
index 1fdcf2fbd58d..e0357e25e09f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2478,6 +2478,25 @@ int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, unsigned long *vm_flags)
+{
+	int err;
+
+	if (!(*vm_flags & VM_MERGEABLE))
+		return 0;		/* just ignore the advice */
+
+	if (vma->anon_vma) {
+		err = unmerge_ksm_pages(vma, start, end);
+		if (err)
+			return err;
+	}
+
+	*vm_flags &= ~VM_MERGEABLE;
+
+	return 0;
+}
+
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags)
 {
@@ -2492,16 +2511,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		break;
 
 	case MADV_UNMERGEABLE:
-		if (!(*vm_flags & VM_MERGEABLE))
-			return 0;		/* just ignore the advice */
-
-		if (vma->anon_vma) {
-			err = unmerge_ksm_pages(vma, start, end);
-			if (err)
-				return err;
-		}
-
-		*vm_flags &= ~VM_MERGEABLE;
+		err = ksm_madvise_unmerge(vma, start, end, vm_flags);
+		if (err)
+			return err;
 		break;
 	}
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api
In-Reply-To: <20190516094234.9116-1-oleksandr@redhat.com>

Use previously introduced remote madvise knob to mark task's
anonymous memory as mergeable.

To force merging task's VMAs, "merge" hint is used:

   # echo merge > /proc/<pid>/madvise

Force unmerging is done similarly:

   # echo unmerge > /proc/<pid>/madvise

To achieve this, previously introduced ksm_madvise_*() helpers
are used.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 fs/proc/base.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f69532d6b74f..6677580080ed 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,8 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/stat.h>
 #include <linux/posix-timers.h>
+#include <linux/mman.h>
+#include <linux/ksm.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
 static ssize_t madvise_write(struct file *file, const char __user *buf,
 		size_t count, loff_t *ppos)
 {
+	/* For now, only KSM hints are implemented */
+#ifdef CONFIG_KSM
+	char buffer[PROC_NUMBUF];
+	int behaviour;
 	struct task_struct *task;
+	struct mm_struct *mm;
+	int err = 0;
+	struct vm_area_struct *vma;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))
+		behaviour = MADV_MERGEABLE;
+	else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count)))
+		behaviour = MADV_UNMERGEABLE;
+	else
+		return -EINVAL;
 
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		return -ESRCH;
 
+	mm = get_task_mm(task);
+	if (!mm) {
+		err = -EINVAL;
+		goto out_put_task_struct;
+	}
+
+	down_write(&mm->mmap_sem);
+	switch (behaviour) {
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
+		vma = mm->mmap;
+		while (vma) {
+			if (behaviour == MADV_MERGEABLE)
+				ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags);
+			else
+				ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags);
+			vma = vma->vm_next;
+		}
+		break;
+	}
+	up_write(&mm->mmap_sem);
+
+	mmput(mm);
+
+out_put_task_struct:
 	put_task_struct(task);
 
-	return count;
+	return err ? err : count;
+#else
+	return -EINVAL;
+#endif /* CONFIG_KSM */
 }
 
 static const struct file_operations proc_madvise_operations = {
-- 
2.21.0

^ permalink raw reply related

* [PATCH RFC 5/5] mm/ksm, proc: add remote madvise documentation
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api
In-Reply-To: <20190516094234.9116-1-oleksandr@redhat.com>

Document respective /proc/<pid>/madvise knob.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 Documentation/filesystems/proc.txt | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..17106e435bba 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc/<pid>/map_files - Information about memory mapped files
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+  3.12  /proc/<pid>/madvise - Remote madvise
 
   4	Configuring procfs
   4.1	Mount options
@@ -1948,6 +1949,18 @@ patched.  If the patch is being enabled, then the task has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12    /proc/<pid>/madvise - Remote madvise
+--------------------------------------------
+This write-only file allows executing madvise operation for another task.
+
+If CONFIG_KSM is enabled, the following actions are available:
+
+  * marking task's memory as mergeable:
+    # echo merge > /proc/<pid>/madvise
+
+  * unmerging all the task's memory:
+    # echo unmerge > /proc/<pid>/madvise
+
 
 ------------------------------------------------------------------------------
 Configuring procfs
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
From: Jann Horn @ 2019-05-16 10:00 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API
In-Reply-To: <20190516094234.9116-5-oleksandr@redhat.com>

On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
<oleksandr@redhat.com> wrote:
> Use previously introduced remote madvise knob to mark task's
> anonymous memory as mergeable.
>
> To force merging task's VMAs, "merge" hint is used:
>
>    # echo merge > /proc/<pid>/madvise
>
> Force unmerging is done similarly:
>
>    # echo unmerge > /proc/<pid>/madvise
>
> To achieve this, previously introduced ksm_madvise_*() helpers
> are used.

Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target
process? Enabling KSM on another process is hazardous because it
significantly increases the attack surface for side channels.

(Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS,
you'll want to use mm_access() in the ->open handler and drop the mm
in ->release. mm_access() from a ->write handler is not permitted.)

[...]
> @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
>  static ssize_t madvise_write(struct file *file, const char __user *buf,
>                 size_t count, loff_t *ppos)
>  {
> +       /* For now, only KSM hints are implemented */
> +#ifdef CONFIG_KSM
> +       char buffer[PROC_NUMBUF];
> +       int behaviour;
>         struct task_struct *task;
> +       struct mm_struct *mm;
> +       int err = 0;
> +       struct vm_area_struct *vma;
> +
> +       memset(buffer, 0, sizeof(buffer));
> +       if (count > sizeof(buffer) - 1)
> +               count = sizeof(buffer) - 1;
> +       if (copy_from_user(buffer, buf, count))
> +               return -EFAULT;
> +
> +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))

This means that you also match on something like "mergeblah". Just use strcmp().

> +               behaviour = MADV_MERGEABLE;
> +       else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count)))
> +               behaviour = MADV_UNMERGEABLE;
> +       else
> +               return -EINVAL;
>
>         task = get_proc_task(file_inode(file));
>         if (!task)
>                 return -ESRCH;
>
> +       mm = get_task_mm(task);
> +       if (!mm) {
> +               err = -EINVAL;
> +               goto out_put_task_struct;
> +       }
> +
> +       down_write(&mm->mmap_sem);

Should a check for mmget_still_valid(mm) be inserted here? See commit
04f5866e41fb70690e28397487d8bd8eea7d712a.

> +       switch (behaviour) {
> +       case MADV_MERGEABLE:
> +       case MADV_UNMERGEABLE:

This switch isn't actually necessary at this point, right?

> +               vma = mm->mmap;
> +               while (vma) {
> +                       if (behaviour == MADV_MERGEABLE)
> +                               ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags);
> +                       else
> +                               ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags);
> +                       vma = vma->vm_next;
> +               }
> +               break;
> +       }
> +       up_write(&mm->mmap_sem);
> +
> +       mmput(mm);
> +
> +out_put_task_struct:
>         put_task_struct(task);
>
> -       return count;
> +       return err ? err : count;
> +#else
> +       return -EINVAL;
> +#endif /* CONFIG_KSM */
>  }

^ permalink raw reply

* Re: [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
From: Michal Hocko @ 2019-05-16 10:44 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: linux-kernel, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api
In-Reply-To: <20190516094234.9116-1-oleksandr@redhat.com>

On Thu 16-05-19 11:42:29, Oleksandr Natalenko wrote:
[...]
> * to mark all the eligible VMAs as mergeable, use:
> 
>    # echo merge > /proc/<pid>/madvise
> 
> * to unmerge all the VMAs, use:
> 
>    # echo unmerge > /proc/<pid>/madvise

Please do not open a new thread until a previous one reaches some
conclusion. I have outlined some ways to go forward in
http://lkml.kernel.org/r/20190515145151.GG16651@dhcp22.suse.cz.
I haven't heard any feedback on that, yet you open a 3rd way in a
different thread. This will not help to move on with the discussion.

Please follow up on that thread.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Roberto Sassu @ 2019-05-16 11:42 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: James Bottomley, Rob Landley, Andy Lutomirski, Arvind Sankar,
	LKML, Linux API, Linux FS Devel, linux-integrity, initramfs,
	Silviu Vlasceanu
In-Reply-To: <20190516052934.GA68777@rani.riverdale.lan>

On 5/16/2019 7:29 AM, Arvind Sankar wrote:
> On Wed, May 15, 2019 at 07:06:52PM +0200, Roberto Sassu wrote:
>> On 5/15/2019 6:08 PM, Arvind Sankar wrote:
>>> On Wed, May 15, 2019 at 01:19:04PM +0200, Roberto Sassu wrote:
>>>> On 5/15/2019 2:52 AM, Arvind Sankar wrote:
>>> I don't understand what you mean? The IMA hashes are signed by some key,
>>> but I don't see how what that key is needs to be different between the
>>> two proposals. If the only files used are from the distro, in my scheme
>>> as well you can use the signatures and key provided by the distro. If
>>> they're not, then in your scheme as well you would have to allow for a
>>> local signing key to be used. Both schemes are using the same
>>> .xattr-list file, no?
>>
>> I was referring to James's proposal to load an external initramfs from
>> the embedded initramfs. If the embedded initramfs opens the external
>> initramfs when IMA is enabled, the external initramfs needs to be
>> signed with a local signing key. But I read your answer that this
>> wouldn't be feasible. You have to specify all initramfs in the boot
>> loader configuration.
>>
>> I think deferring IMA initialization is not the safest approach, as it
>> cannot be guaranteed for all possible scenarios that there won't be any
>> file read before /init is executed.
>>
>> But if IMA is enabled, there is the problem of who signs .xattr-list.
>> There should be a local signing key that it is not necessary if the user
>> only accesses distro files.
>>
> I think that's a separate issue. If you want to allow people to be able
> to put files onto the system that will be IMA verified, they need to
> have some way to locally sign them whether it's inside an initramfs or
> on a real root filesystem.

Yes. But this shouldn't be a requirement. If I have only files signed by
the distro, I should be able to do appraisal without a local signing
key.

I made an IMA extension called IMA Digest Lists, that extracts reference
digests from RPM headers and performs appraisal based on the loaded
white lists.  The only keys that must be in the kernel for signature
verification are the PGP keys of the distro (plus the public key for the
RPM parser, which at the moment is different).

.xattr-list is generated by my custom dracut module and contains the
signature of the digest lists and the parser.


>>> Right, I guess this would be sort of the minimal "modification" to the
>>> CPIO format to allow it to support xattrs.
>>
>> I would try to do it without modification of the CPIO format. However,
>> at the time .xattr-list is parsed (in do_copy() before .xattr-list is
>> closed), it is not guaranteed that all files are extracted. These must
>> be created before xattrs are added, but the file type must be correct,
>> otherwise clean_path() removes the existing file with xattrs.
>>
>> Roberto
>>
>> -- 
>> HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
>> Managing Director: Bo PENG, Jian LI, Yanli SHI
> 
> Right by "modification" in quotes I meant the format is actually the
> same, but the kernel now interprets it a bit differently.
> 
> Regarding the order you don't have to handle that in the kernel. The
> kernel CPIO format is already restricted in that directories have to be
> specified before the files that contain them for example. It can very
> well be restricted so that an .xattr-list can only specify xattrs for
> files that were already extracted, else you bail out with an error. The
> archive creation tooling can easily handle that. If someone wants to
> shoot themselves in the foot by trying to add more files/replace
> existing files after the .xattr-list its ok, the IMA policy will prevent
> such files from being accessed and they can fix the archive for the next
> boot.

Unfortunately, dracut sorts the files before adding them to the CPIO
image (.xattr-list is at the beginning). I could move xattrs from the
existing file to the file with different mode, but this makes the code
more complex. I think it is better to call do_readxattrs() after files
are extracted, or when .xattr-list is going to be replaced by another
one in the next initramfs.

Roberto

-- 
HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
Managing Director: Bo PENG, Jian LI, Yanli SHI

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-16 13:08 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jann Horn, Oleg Nesterov, Al Viro, Linus Torvalds, linux-kernel,
	Arnd Bergmann, David Howells, Andrew Morton, Aleksa Sarai,
	Eric W. Biederman, elena.reshetova, Kees Cook, Andy Lutomirski,
	Andy Lutomirski, Thomas Gleixner, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-mips, linux-parisc, linuxppc-dev
In-Reply-To: <CAKOZuesPF+ftwqsNDMBy1LpwJgWTNuQm9-E=C90sSTBYEEsDww@mail.gmail.com>

On Wed, May 15, 2019 at 10:45:06AM -0700, Daniel Colascione wrote:
> On Wed, May 15, 2019 at 3:04 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> > pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> > process that is created via traditional fork()/clone() calls that is only
> > referenced by a PID:
> 
> Thanks for doing this work. I'm really looking forward to this new
> approach to process management.

Thanks! Glad to hear!

> 
> > int pidfd = pidfd_open(1234, 0);
> > ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
> >
> > With the introduction of pidfds through CLONE_PIDFD it is possible to
> > created pidfds at process creation time.
> > However, a lot of processes get created with traditional PID-based calls
> > such as fork() or clone() (without CLONE_PIDFD). For these processes a
> > caller can currently not create a pollable pidfd. This is a huge problem
> > for Android's low memory killer (LMK) and service managers such as systemd.
> > Both are examples of tools that want to make use of pidfds to get reliable
> > notification of process exit for non-parents (pidfd polling) and race-free
> > signal sending (pidfd_send_signal()). They intend to switch to this API for
> > process supervision/management as soon as possible. Having no way to get
> > pollable pidfds from PID-only processes is one of the biggest blockers for
> > them in adopting this api. With pidfd_open() making it possible to retrieve
> > pidfd for PID-based processes we enable them to adopt this api.
> >
> > In line with Arnd's recent changes to consolidate syscall numbers across
> > architectures, I have added the pidfd_open() syscall to all architectures
> > at the same time.
> 
> I'm glad it's easier now.
> 
> >  arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
> >  arch/arm64/include/asm/unistd32.h           |  2 +
> >  arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
> >  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
> >  arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
> >  arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
> >  arch/s390/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/sh/kernel/syscalls/syscall.tbl         |  1 +
> >  arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
> >  arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
> >  arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
> 
> It'd be nice to arrange the system call tables so that we need to
> change only one file when adding a new system call.
> 
> [Snip system call wiring]
> 
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -67,6 +67,7 @@ struct pid
> >  extern struct pid init_struct_pid;
> >
> >  extern const struct file_operations pidfd_fops;
> > +extern int pidfd_create(struct pid *pid);
> >
> >  static inline struct pid *get_pid(struct pid *pid)
> >  {
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e2870fe1be5b..989055e0b501 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
> >                                 struct old_timex32 __user *tx);
> >  asmlinkage long sys_syncfs(int fd);
> >  asmlinkage long sys_setns(int fd, int nstype);
> > +asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
> >  asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
> >                              unsigned int vlen, unsigned flags);
> >  asmlinkage long sys_process_vm_readv(pid_t pid,
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..94a257a93d20 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_pidfd_open 428
> > +__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
> >
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 428
> > +#define __NR_syscalls 429
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 737db1828437..980cc1d2b8d4 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
> >   * Return: On success, a cloexec pidfd is returned.
> >   *         On error, a negative errno number will be returned.
> >   */
> > -static int pidfd_create(struct pid *pid)
> > +int pidfd_create(struct pid *pid)
> >  {
> >         int fd;
> >
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..237d18d6ecb8 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -38,6 +38,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/proc_ns.h>
> >  #include <linux/proc_fs.h>
> > +#include <linux/sched/signal.h>
> >  #include <linux/sched/task.h>
> >  #include <linux/idr.h>
> >
> > @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> >         return idr_get_next(&ns->idr, &nr);
> >  }
> >
> > +/**
> > + * pidfd_open() - Open new pid file descriptor.
> > + *
> > + * @pid:   pid for which to retrieve a pidfd
> > + * @flags: flags to pass
> > + *
> > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > + * the process identified by @pid. Currently, the process identified by
> > + * @pid must be a thread-group leader. This restriction currently exists
> > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > + * leaders).
> > + *
> > + * Return: On success, a cloexec pidfd is returned.
> > + *         On error, a negative errno number will be returned.
> > + */
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > +       int fd, ret;
> > +       struct pid *p;
> > +       struct task_struct *tsk;
> > +
> > +       if (flags)
> > +               return -EINVAL;
> 
> If we support blocking operations on pidfds, we'll want to be able to
> put them in non-blocking mode. Does it make sense to accept and ignore
> O_NONBLOCK here now?

Hm, is there a race condition you see that would prevent you from using
fcntl()? If you're ok with using fcntl() I would argue we should hold on
tight to every bit we have for flags.

> 
> > +       if (pid <= 0)
> > +               return -EINVAL;
> 
> WDYT of defining pid == 0 to mean "open myself"?

I'm torn. It be a nice shortcut of course but pid being 0 is usually an
indicator for child processes. So unless the getpid() before
pidfd_open() is an issue I'd say let's leave it as is. If you really
want the shortcut might -1 be better?

> 
> > +       p = find_get_pid(pid);
> > +       if (!p)
> > +               return -ESRCH;
> > +
> > +       rcu_read_lock();
> > +       tsk = pid_task(p, PIDTYPE_PID);
> > +       if (!tsk)
> > +               ret = -ESRCH;
> > +       else if (unlikely(!thread_group_leader(tsk)))
> > +               ret = -EINVAL;
> > +       else
> > +               ret = 0;
> > +       rcu_read_unlock();
> > +
> > +       fd = ret ?: pidfd_create(p);
> > +       put_pid(p);
> > +       return fd;
> > +}

^ permalink raw reply

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Michal Hocko @ 2019-05-16 13:30 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm, linux-api
In-Reply-To: <155793276388.13922.18064660723547377633.stgit@localhost.localdomain>

[You are defining a new user visible API, please always add linux-api
 mailing list - now done]

On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
> 
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
> 
> 	struct iovec local_iov, remote_iov;
> 	void *buf;
> 
> 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
> 	recv(sock, buf, n * PAGE_SIZE, 0);
> 
> 	local_iov->iov_base = buf;
> 	local_iov->iov_len = n * PAGE_SIZE;
> 	remove_iov = ...;
> 
> 	process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
> 	munmap(buf, n * PAGE_SIZE);
> 
> 	(Note, that above completely ignores error handling)
> 
> There are several problems with process_vm_writev() in this example:
> 
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);
> 
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;
> 
> 3)received data has no a chance to be properly swapped for
>   a long time.
> 
> The third is the most critical in case of remote process touches
> the data pages some time after process_vm_writev() was made.
> Imagine, node is under memory pressure:
> 
> a)kernel moves @buf pages into swap right after recv();
> b)process_vm_writev() reads the data back from swap to pages;
> c)process_vm_writev() allocates duplicate pages in remote
>   process and populates them;
> d)munmap() unmaps @buf;
> e)5 minutes later remote task touches data.
> 
> In stages "a" and "b" kernel submits unneeded IO and makes
> system IO throughput worse. To make "b" and "c", kernel
> reclaims memory, and moves pages of some other processes
> to swap, so they have to read pages from swap back. Also,
> unneeded copying of pages is occured, while zero-copy is
> more preferred.
> 
> We observe similar problem during online migration of big enough
> containers, when after doubling of container's size, the time
> increases 100 times. The system resides under high IO and
> throwing out of useful cashes.
> 
> The proposed syscall aims to introduce an interface, which
> supplements currently existing process_vm_writev() and
> process_vm_readv(), and allows to solve the problem with
> anonymous memory transfer. The above example may be rewritten as:
> 
> 	void *buf;
> 
> 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
> 	recv(sock, buf, n * PAGE_SIZE, 0);
> 
> 	/* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> 	process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> 	munmap(buf, n * PAGE_SIZE);
> 
> It is swap-friendly: in case of memory is swapped right after recv(),
> the syscall just copies pagetable entries like we do on fork(),
> so real access to pages does not occurs, and no IO is needed.
> No excess pages are reclaimed, and number of pages is not doubled.
> Also, zero-copy takes a place, and this also reduces overhead.
> 
> The patchset does not introduce much new code, since we simply
> reuse existing copy_page_range() and copy_vma() functions.
> We extend copy_vma() to be able merge VMAs in remote task [2/5],
> and teach copy_page_range() to work with different local and
> remote addresses [3/5]. Patch [5/5] introduces the syscall logic,
> which mostly consists of sanity checks. The rest of patches
> are preparations.
> 
> This syscall may be used for page servers like in example
> above, for migration (I assume, even virtual machines may
> want something like this), for zero-copy desiring users
> of process_vm_writev() and process_vm_readv(), for debug
> purposes, etc. It requires the same permittions like
> existing proc_vm_xxx() syscalls have.
> 
> The tests I used may be obtained here:
> 
> [1]https://gist.github.com/tkhai/198d32fdc001ec7812a5e1ccf091f275
> [2]https://gist.github.com/tkhai/f52dbaeedad5a699f3fb386fda676562
> 
> ---
> 
> Kirill Tkhai (5):
>       mm: Add process_vm_mmap() syscall declaration
>       mm: Extend copy_vma()
>       mm: Extend copy_page_range()
>       mm: Export round_hint_to_min()
>       mm: Add process_vm_mmap()
> 
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    2 
>  include/linux/huge_mm.h                |    6 +
>  include/linux/mm.h                     |   11 ++
>  include/linux/mm_types.h               |    2 
>  include/linux/mman.h                   |   14 +++
>  include/linux/syscalls.h               |    5 +
>  include/uapi/asm-generic/mman-common.h |    5 +
>  include/uapi/asm-generic/unistd.h      |    5 +
>  init/Kconfig                           |    9 +-
>  kernel/fork.c                          |    5 +
>  kernel/sys_ni.c                        |    2 
>  mm/huge_memory.c                       |   30 ++++--
>  mm/memory.c                            |  165 +++++++++++++++++++++-----------
>  mm/mmap.c                              |  154 ++++++++++++++++++++++++++----
>  mm/mremap.c                            |    4 -
>  mm/process_vm_access.c                 |   71 ++++++++++++++
>  17 files changed, 392 insertions(+), 99 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Mimi Zohar @ 2019-05-16 13:31 UTC (permalink / raw)
  To: Arvind Sankar, Roberto Sassu, Mehmet Kayaalp
  Cc: James Bottomley, Rob Landley, Andy Lutomirski, Arvind Sankar,
	LKML, Linux API, Linux FS Devel, linux-integrity, initramfs,
	Silviu Vlasceanu
In-Reply-To: <20190516052934.GA68777@rani.riverdale.lan>

On Thu, 2019-05-16 at 01:29 -0400, Arvind Sankar wrote:

> I think that's a separate issue. If you want to allow people to be able
> to put files onto the system that will be IMA verified, they need to
> have some way to locally sign them whether it's inside an initramfs or
> on a real root filesystem.

Anyone building their own kernel can build their own key into the
kernel image.  Another option is to build the kernel with  
CONFIG_SYSTEM_EXTRA_CERTIFICATE enabled, allowing an additional
certificate to be inserted into the kernel image post build.  The
additional certificate will be loaded onto the builtin kernel keyring.
 Certificates signed with the private key can then be added to the IMA
keyring.  By modifying the kernel image, the kernel image obviously
needs to be resigned.  Additional patches "Certificate insertion
support for x86 bzImages" were posted, but have not been upstreamed.

This patch set adds the security xattrs needed by IMA.

Mimi

^ permalink raw reply

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Jann Horn @ 2019-05-16 13:32 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
	Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
	Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse
In-Reply-To: <155793276388.13922.18064660723547377633.stgit@localhost.localdomain>

On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
[...]
> The proposed syscall aims to introduce an interface, which
> supplements currently existing process_vm_writev() and
> process_vm_readv(), and allows to solve the problem with
> anonymous memory transfer. The above example may be rewritten as:
>
>         void *buf;
>
>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
>         /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>         process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>         munmap(buf, n * PAGE_SIZE);

In this specific example, an alternative would be to splice() from the
socket into /proc/$pid/mem, or something like that, right?
proc_mem_operations has no ->splice_read() at the moment, and it'd
need that to be more efficient, but that could be built without
creating new UAPI, right?

But I guess maybe your workload is not that simple? What do you
actually do with the received data between receiving it and shoving it
over into the other process?

^ permalink raw reply

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Michal Hocko @ 2019-05-16 13:52 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm, linux-api
In-Reply-To: <20190516133034.GT16651@dhcp22.suse.cz>

On Thu 16-05-19 15:30:34, Michal Hocko wrote:
> [You are defining a new user visible API, please always add linux-api
>  mailing list - now done]
> 
> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
[...]
> > The proposed syscall aims to introduce an interface, which
> > supplements currently existing process_vm_writev() and
> > process_vm_readv(), and allows to solve the problem with
> > anonymous memory transfer. The above example may be rewritten as:
> > 
> > 	void *buf;
> > 
> > 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> > 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
> > 	recv(sock, buf, n * PAGE_SIZE, 0);
> > 
> > 	/* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> > 	process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> > 	munmap(buf, n * PAGE_SIZE);

AFAIU this means that you actually want to do an mmap of an anonymous
memory with a COW semantic to the remote process right? How does the
remote process find out where and what has been mmaped? What if the
range collides? This sounds quite scary to me TBH. Why cannot you simply
use shared memory for that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-16 13:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
	Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
	Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse
In-Reply-To: <CAG48ez3EOwLd8A6Ku53vKLdofmZAh1ZYfkK4rVgSgM8ZfcR4zg@mail.gmail.com>

On 16.05.2019 16:32, Jann Horn wrote:
> On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a mapping from a process to another process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
> [...]
>> The proposed syscall aims to introduce an interface, which
>> supplements currently existing process_vm_writev() and
>> process_vm_readv(), and allows to solve the problem with
>> anonymous memory transfer. The above example may be rewritten as:
>>
>>         void *buf;
>>
>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>>         /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>>         process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>>         munmap(buf, n * PAGE_SIZE);
> 
> In this specific example, an alternative would be to splice() from the
> socket into /proc/$pid/mem, or something like that, right?
> proc_mem_operations has no ->splice_read() at the moment, and it'd
> need that to be more efficient, but that could be built without
> creating new UAPI, right?

I have just never seen, a socket memory may be preempted into swap.
If so, there is a fundamental problem.
But, anyway, like you guessed below:
 
> But I guess maybe your workload is not that simple? What do you
> actually do with the received data between receiving it and shoving it
> over into the other process?

Data are usually sent encrypted and compressed by socket, so there is no
possibility to go this way. You may want to do everything with data,
before passing to another process.

Kirill

^ permalink raw reply

* [PATCH v1 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-16 13:59 UTC (permalink / raw)
  To: jannh, oleg, viro, torvalds, linux-kernel, arnd
  Cc: akpm, cyphar, dhowells, ebiederm, elena.reshetova, keescook, luto,
	luto, tglx, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-api, linux-arch, linux-kselftest,
	joel, dancol, serge, Christian Brauner, Geert Uytterhoeven

This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:

int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a huge problem
for Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfd for PID-based processes we enable them to adopt this api.

In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.

Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
---
v1:
- kbuild test robot <lkp@intel.com>:
  - add missing entry for pidfd_open to arch/arm/tools/syscall.tbl
- Oleg Nesterov <oleg@redhat.com>:
  - use simpler thread-group leader check
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/pid.h                         |  1 +
 include/linux/syscalls.h                    |  1 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 kernel/fork.c                               |  2 +-
 kernel/pid.c                                | 50 +++++++++++++++++++++
 20 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 165f268beafc..ddc3c93ad7a7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -467,3 +467,4 @@
 535	common	io_uring_setup			sys_io_uring_setup
 536	common	io_uring_enter			sys_io_uring_enter
 537	common	io_uring_register		sys_io_uring_register
+538	common	pidfd_open			sys_pidfd_open
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0393917eaa57..fc41fb34a636 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -441,3 +441,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 23f1a44acada..350e2049b4a9 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -874,6 +874,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 #define __NR_io_uring_register 427
 __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 56e3d0b685e1..7115f6dd347a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -348,3 +348,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index df4ec3ec71d1..44bf12b16ffe 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4964947732af..0d32e5152dc0 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9392dfe33f97..726e107b3c9f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -366,3 +366,4 @@
 425	n32	io_uring_setup			sys_io_uring_setup
 426	n32	io_uring_enter			sys_io_uring_enter
 427	n32	io_uring_register		sys_io_uring_register
+428	n32	pidfd_open			sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index fe8ca623add8..83b46b568d51 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -424,3 +424,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 00f5a63c8d9a..5294d04d7fa5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 061418f787c3..dcdb838adf49 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
 425  common	io_uring_setup		sys_io_uring_setup              sys_io_uring_setup
 426  common	io_uring_enter		sys_io_uring_enter              sys_io_uring_enter
 427  common	io_uring_register	sys_io_uring_register           sys_io_uring_register
+428  common	pidfd_open		sys_pidfd_open			sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 480b057556ee..8e66edfbc521 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index a1dd24307b00..d6f3bc686939 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..1af6b469160a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	pidfd_open		sys_pidfd_open			__ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c18e6ebe3387 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	pidfd_open		__x64_sys_pidfd_open
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 30084eaf8422..21ee795f3003 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
 425	common	io_uring_setup			sys_io_uring_setup
 426	common	io_uring_enter			sys_io_uring_enter
 427	common	io_uring_register		sys_io_uring_register
+428	common	pidfd_open			sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
 extern struct pid init_struct_pid;
 
 extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
 				struct old_timex32 __user *tx);
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..94a257a93d20 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 #define __NR_io_uring_register 427
 __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 
 #undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 737db1828437..980cc1d2b8d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
  * Return: On success, a cloexec pidfd is returned.
  *         On error, a negative errno number will be returned.
  */
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
 {
 	int fd;
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..4afca3d6dcb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -38,6 +38,7 @@
 #include <linux/syscalls.h>
 #include <linux/proc_ns.h>
 #include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
 
@@ -451,6 +452,55 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 	return idr_get_next(&ns->idr, &nr);
 }
 
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid:   pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ *         On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+	int fd, ret;
+	struct pid *p;
+	struct task_struct *tsk;
+
+	if (flags)
+		return -EINVAL;
+
+	if (pid <= 0)
+		return -EINVAL;
+
+	p = find_get_pid(pid);
+	if (!p)
+		return -ESRCH;
+
+	ret = 0;
+	rcu_read_lock();
+	/*
+	 * If this returns non-NULL the pid was used as a thread-group
+	 * leader. Note, we race with exec here: If it changes the
+	 * thread-group leader we might return the old leader.
+	 */
+	tsk = pid_task(p, PIDTYPE_TGID);
+	if (!tsk)
+		ret = -ESRCH;
+	rcu_read_unlock();
+
+	fd = ret ?: pidfd_create(p);
+	put_pid(p);
+	return fd;
+}
+
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */
-- 
2.21.0

^ permalink raw reply related

* [PATCH v1 2/2] tests: add pidfd_open() tests
From: Christian Brauner @ 2019-05-16 13:59 UTC (permalink / raw)
  To: jannh, oleg, viro, torvalds, linux-kernel, arnd
  Cc: akpm, cyphar, dhowells, ebiederm, elena.reshetova, keescook, luto,
	luto, tglx, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-api, linux-arch, linux-kselftest,
	joel, dancol, serge, Christian Brauner,
	Michael Kerrisk (man-pages)
In-Reply-To: <20190516135944.7205-1-christian@brauner.io>

This adds testing for the new pidfd_open() syscalls. Specifically, we test:
- that no invalid flags can be passed to pidfd_open()
- that no invalid pid can be passed to pidfd_open()
- that a pidfd can be retrieved with pidfd_open()
- that the retrieved pidfd references the correct pid

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
---
v1: unchanged
---
 tools/testing/selftests/pidfd/Makefile        |   2 +-
 tools/testing/selftests/pidfd/pidfd.h         |  57 ++++++
 .../testing/selftests/pidfd/pidfd_open_test.c | 170 ++++++++++++++++++
 tools/testing/selftests/pidfd/pidfd_test.c    |  41 +----
 4 files changed, 229 insertions(+), 41 deletions(-)
 create mode 100644 tools/testing/selftests/pidfd/pidfd.h
 create mode 100644 tools/testing/selftests/pidfd/pidfd_open_test.c

diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index deaf8073bc06..b36c0be70848 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,6 +1,6 @@
 CFLAGS += -g -I../../../../usr/include/
 
-TEST_GEN_PROGS := pidfd_test
+TEST_GEN_PROGS := pidfd_test pidfd_open_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
new file mode 100644
index 000000000000..8452e910463f
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __PIDFD_H
+#define __PIDFD_H
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+
+#include "../kselftest.h"
+
+/*
+ * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
+ * That means, when it wraps around any pid < 300 will be skipped.
+ * So we need to use a pid > 300 in order to test recycling.
+ */
+#define PID_RECYCLE 1000
+
+/*
+ * Define a few custom error codes for the child process to clearly indicate
+ * what is happening. This way we can tell the difference between a system
+ * error, a test error, etc.
+ */
+#define PIDFD_PASS 0
+#define PIDFD_FAIL 1
+#define PIDFD_ERROR 2
+#define PIDFD_SKIP 3
+#define PIDFD_XFAIL 4
+
+int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+
+#endif /* __PIDFD_H */
diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
new file mode 100644
index 000000000000..9b073c1ac618
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+
+static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
+{
+	return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static int safe_int(const char *numstr, int *converted)
+{
+	char *err = NULL;
+	long sli;
+
+	errno = 0;
+	sli = strtol(numstr, &err, 0);
+	if (errno == ERANGE && (sli == LONG_MAX || sli == LONG_MIN))
+		return -ERANGE;
+
+	if (errno != 0 && sli == 0)
+		return -EINVAL;
+
+	if (err == numstr || *err != '\0')
+		return -EINVAL;
+
+	if (sli > INT_MAX || sli < INT_MIN)
+		return -ERANGE;
+
+	*converted = (int)sli;
+	return 0;
+}
+
+static int char_left_gc(const char *buffer, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++) {
+		if (buffer[i] == ' ' ||
+		    buffer[i] == '\t')
+			continue;
+
+		return i;
+	}
+
+	return 0;
+}
+
+static int char_right_gc(const char *buffer, size_t len)
+{
+	int i;
+
+	for (i = len - 1; i >= 0; i--) {
+		if (buffer[i] == ' '  ||
+		    buffer[i] == '\t' ||
+		    buffer[i] == '\n' ||
+		    buffer[i] == '\0')
+			continue;
+
+		return i + 1;
+	}
+
+	return 0;
+}
+
+static char *trim_whitespace_in_place(char *buffer)
+{
+	buffer += char_left_gc(buffer, strlen(buffer));
+	buffer[char_right_gc(buffer, strlen(buffer))] = '\0';
+	return buffer;
+}
+
+static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen)
+{
+	int ret;
+	char path[512];
+	FILE *f;
+	size_t n = 0;
+	pid_t result = -1;
+	char *line = NULL;
+
+	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
+
+	f = fopen(path, "re");
+	if (!f)
+		return -1;
+
+	while (getline(&line, &n, f) != -1) {
+		char *numstr;
+
+		if (strncmp(line, key, keylen))
+			continue;
+
+		numstr = trim_whitespace_in_place(line + 4);
+		ret = safe_int(numstr, &result);
+		if (ret < 0)
+			goto out;
+
+		break;
+	}
+
+out:
+	free(line);
+	fclose(f);
+	return result;
+}
+
+int main(int argc, char **argv)
+{
+	int pidfd = -1, ret = 1;
+	pid_t pid;
+
+	pidfd = sys_pidfd_open(-1, 0);
+	if (pidfd >= 0) {
+		ksft_print_msg(
+			"%s - succeeded to open pidfd for invalid pid -1\n",
+			strerror(errno));
+		goto on_error;
+	}
+	ksft_test_result_pass("do not allow invalid pid test: passed\n");
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidfd_open(getpid(), 1);
+	if (pidfd >= 0) {
+		ksft_print_msg(
+			"%s - succeeded to open pidfd with invalid flag value specified\n",
+			strerror(errno));
+		goto on_error;
+	}
+	ksft_test_result_pass("do not allow invalid flag test: passed\n");
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidfd_open(getpid(), 0);
+	if (pidfd < 0) {
+		ksft_print_msg("%s - failed to open pidfd\n", strerror(errno));
+		goto on_error;
+	}
+	ksft_test_result_pass("open a new pidfd test: passed\n");
+	ksft_inc_pass_cnt();
+
+	pid = get_pid_from_fdinfo_file(pidfd, "Pid:", sizeof("Pid:") - 1);
+	ksft_print_msg("pidfd %d refers to process with pid %d\n", pidfd, pid);
+
+	ret = 0;
+
+on_error:
+	if (pidfd >= 0)
+		close(pidfd);
+
+	return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
diff --git a/tools/testing/selftests/pidfd/pidfd_test.c b/tools/testing/selftests/pidfd/pidfd_test.c
index d59378a93782..f01de87249c9 100644
--- a/tools/testing/selftests/pidfd/pidfd_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_test.c
@@ -14,6 +14,7 @@
 #include <sys/wait.h>
 #include <unistd.h>
 
+#include "pidfd.h"
 #include "../kselftest.h"
 
 static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
@@ -62,28 +63,6 @@ static int test_pidfd_send_signal_simple_success(void)
 	return 0;
 }
 
-static int wait_for_pid(pid_t pid)
-{
-	int status, ret;
-
-again:
-	ret = waitpid(pid, &status, 0);
-	if (ret == -1) {
-		if (errno == EINTR)
-			goto again;
-
-		return -1;
-	}
-
-	if (ret != pid)
-		goto again;
-
-	if (!WIFEXITED(status))
-		return -1;
-
-	return WEXITSTATUS(status);
-}
-
 static int test_pidfd_send_signal_exited_fail(void)
 {
 	int pidfd, ret, saved_errno;
@@ -128,13 +107,6 @@ static int test_pidfd_send_signal_exited_fail(void)
 	return 0;
 }
 
-/*
- * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
- * That means, when it wraps around any pid < 300 will be skipped.
- * So we need to use a pid > 300 in order to test recycling.
- */
-#define PID_RECYCLE 1000
-
 /*
  * Maximum number of cycles we allow. This is equivalent to PID_MAX_DEFAULT.
  * If users set a higher limit or we have cycled PIDFD_MAX_DEFAULT number of
@@ -143,17 +115,6 @@ static int test_pidfd_send_signal_exited_fail(void)
  */
 #define PIDFD_MAX_DEFAULT 0x8000
 
-/*
- * Define a few custom error codes for the child process to clearly indicate
- * what is happening. This way we can tell the difference between a system
- * error, a test error, etc.
- */
-#define PIDFD_PASS 0
-#define PIDFD_FAIL 1
-#define PIDFD_ERROR 2
-#define PIDFD_SKIP 3
-#define PIDFD_XFAIL 4
-
 static int test_pidfd_send_signal_recycled_pid_fail(void)
 {
 	int i, ret;
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Jann Horn @ 2019-05-16 14:03 UTC (permalink / raw)
  To: Christian Brauner, Daniel Colascione
  Cc: Oleg Nesterov, Al Viro, Linus Torvalds, linux-kernel,
	Arnd Bergmann, David Howells, Andrew Morton, Aleksa Sarai,
	Eric W. Biederman, Elena Reshetova, Kees Cook, Andy Lutomirski,
	Andy Lutomirski, Thomas Gleixner, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-mips, linux-parisc, linuxppc-dev
In-Reply-To: <20190516130813.i66ujfzftbgpqhnh@brauner.io>

On Thu, May 16, 2019 at 3:08 PM Christian Brauner <christian@brauner.io> wrote:
> On Wed, May 15, 2019 at 10:45:06AM -0700, Daniel Colascione wrote:
> > On Wed, May 15, 2019 at 3:04 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> > > pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> > > process that is created via traditional fork()/clone() calls that is only
> > > referenced by a PID:
[...]
> > > +/**
> > > + * pidfd_open() - Open new pid file descriptor.
> > > + *
> > > + * @pid:   pid for which to retrieve a pidfd
> > > + * @flags: flags to pass
> > > + *
> > > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > > + * the process identified by @pid. Currently, the process identified by
> > > + * @pid must be a thread-group leader. This restriction currently exists
> > > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > > + * leaders).
> > > + *
> > > + * Return: On success, a cloexec pidfd is returned.
> > > + *         On error, a negative errno number will be returned.
> > > + */
> > > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > > +{
[...]
> > > +       if (pid <= 0)
> > > +               return -EINVAL;
> >
> > WDYT of defining pid == 0 to mean "open myself"?
>
> I'm torn. It be a nice shortcut of course but pid being 0 is usually an
> indicator for child processes. So unless the getpid() before
> pidfd_open() is an issue I'd say let's leave it as is. If you really
> want the shortcut might -1 be better?

Joining the bikeshed painting club: Please don't allow either 0 or -1
as shortcut for "self". James Forshaw found an Android security bug a
while back (https://bugs.chromium.org/p/project-zero/issues/detail?id=727)
that passed a PID to getpidcon(), except that the PID was 0
(placeholder for oneway binder transactions), and then the service
thought it was talking to itself. You could pick some other number and
provide a #define for that, but I think pidfd_open(getpid(), ...)
makes more sense.

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-16 14:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Daniel Colascione, Oleg Nesterov, Al Viro, Linus Torvalds,
	linux-kernel, Arnd Bergmann, David Howells, Andrew Morton,
	Aleksa Sarai, Eric W. Biederman, Elena Reshetova, Kees Cook,
	Andy Lutomirski, Andy Lutomirski, Thomas Gleixner, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc
In-Reply-To: <CAG48ez05OtBi_yX+071TrrfK3zKOn9h1kFyPr5rttiqQAZ0sEA@mail.gmail.com>

On Thu, May 16, 2019 at 04:03:27PM +0200, Jann Horn wrote:
> On Thu, May 16, 2019 at 3:08 PM Christian Brauner <christian@brauner.io> wrote:
> > On Wed, May 15, 2019 at 10:45:06AM -0700, Daniel Colascione wrote:
> > > On Wed, May 15, 2019 at 3:04 AM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> > > > pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> > > > process that is created via traditional fork()/clone() calls that is only
> > > > referenced by a PID:
> [...]
> > > > +/**
> > > > + * pidfd_open() - Open new pid file descriptor.
> > > > + *
> > > > + * @pid:   pid for which to retrieve a pidfd
> > > > + * @flags: flags to pass
> > > > + *
> > > > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > > > + * the process identified by @pid. Currently, the process identified by
> > > > + * @pid must be a thread-group leader. This restriction currently exists
> > > > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > > > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > > > + * leaders).
> > > > + *
> > > > + * Return: On success, a cloexec pidfd is returned.
> > > > + *         On error, a negative errno number will be returned.
> > > > + */
> > > > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > > > +{
> [...]
> > > > +       if (pid <= 0)
> > > > +               return -EINVAL;
> > >
> > > WDYT of defining pid == 0 to mean "open myself"?
> >
> > I'm torn. It be a nice shortcut of course but pid being 0 is usually an
> > indicator for child processes. So unless the getpid() before
> > pidfd_open() is an issue I'd say let's leave it as is. If you really
> > want the shortcut might -1 be better?
> 
> Joining the bikeshed painting club: Please don't allow either 0 or -1
> as shortcut for "self". James Forshaw found an Android security bug a
> while back (https://bugs.chromium.org/p/project-zero/issues/detail?id=727)
> that passed a PID to getpidcon(), except that the PID was 0
> (placeholder for oneway binder transactions), and then the service
> thought it was talking to itself. You could pick some other number and
> provide a #define for that, but I think pidfd_open(getpid(), ...)
> makes more sense.

Yes, I agree. I left it as is for v1, i.e. no shortcut; getpid() should
do.

Christian

^ permalink raw reply

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
From: Oleksandr Natalenko @ 2019-05-16 14:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API
In-Reply-To: <CAG48ez2yXw_PJXO-mS=Qw5rkLpG6zDPd0saMhhGk09-du2bpaA@mail.gmail.com>

Hi.

On Thu, May 16, 2019 at 12:00:24PM +0200, Jann Horn wrote:
> On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
> <oleksandr@redhat.com> wrote:
> > Use previously introduced remote madvise knob to mark task's
> > anonymous memory as mergeable.
> >
> > To force merging task's VMAs, "merge" hint is used:
> >
> >    # echo merge > /proc/<pid>/madvise
> >
> > Force unmerging is done similarly:
> >
> >    # echo unmerge > /proc/<pid>/madvise
> >
> > To achieve this, previously introduced ksm_madvise_*() helpers
> > are used.
> 
> Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target
> process? Enabling KSM on another process is hazardous because it
> significantly increases the attack surface for side channels.
> 
> (Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS,
> you'll want to use mm_access() in the ->open handler and drop the mm
> in ->release. mm_access() from a ->write handler is not permitted.)

Sounds reasonable. So, something similar to what mem_open() & friends do
now:

static int madvise_open(...)
...
	struct task_struct *task = get_proc_task(inode);
...
	if (task) {
		mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
		put_task_struct(task);
		if (!IS_ERR_OR_NULL(mm)) {
			mmgrab(mm);
			mmput(mm);
...

Then:

static ssize_t madvise_write(...)
...
	if (!mmget_not_zero(mm))
		goto out;

	down_write(&mm->mmap_sem);
	if (!mmget_still_valid(mm))
		goto skip_mm;
...
skip_mm:
	up_write(&mm->mmap_sem);

	mmput(mm);
out:
	return ...;

And, finally:

static int madvise_release(...)
...
		mmdrop(mm);
...

Right?

> [...]
> > @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> >  static ssize_t madvise_write(struct file *file, const char __user *buf,
> >                 size_t count, loff_t *ppos)
> >  {
> > +       /* For now, only KSM hints are implemented */
> > +#ifdef CONFIG_KSM
> > +       char buffer[PROC_NUMBUF];
> > +       int behaviour;
> >         struct task_struct *task;
> > +       struct mm_struct *mm;
> > +       int err = 0;
> > +       struct vm_area_struct *vma;
> > +
> > +       memset(buffer, 0, sizeof(buffer));
> > +       if (count > sizeof(buffer) - 1)
> > +               count = sizeof(buffer) - 1;
> > +       if (copy_from_user(buffer, buf, count))
> > +               return -EFAULT;
> > +
> > +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))
> 
> This means that you also match on something like "mergeblah". Just use strcmp().

I agree. Just to make it more interesting I must say that

   /sys/kernel/mm/transparent_hugepage/enabled

uses memcmp in the very same way, and thus echoing "alwaysssss" or
"madviseeee" works perfectly there, and it was like that from the very
beginning, it seems. Should we fix it, or it became (zomg) a public API?

> > +               behaviour = MADV_MERGEABLE;
> > +       else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count)))
> > +               behaviour = MADV_UNMERGEABLE;
> > +       else
> > +               return -EINVAL;
> >
> >         task = get_proc_task(file_inode(file));
> >         if (!task)
> >                 return -ESRCH;
> >
> > +       mm = get_task_mm(task);
> > +       if (!mm) {
> > +               err = -EINVAL;
> > +               goto out_put_task_struct;
> > +       }
> > +
> > +       down_write(&mm->mmap_sem);
> 
> Should a check for mmget_still_valid(mm) be inserted here? See commit
> 04f5866e41fb70690e28397487d8bd8eea7d712a.

Yeah, it seems so :/. Thanks for the pointer. I've put it into the
madvise_write snippet above.

> > +       switch (behaviour) {
> > +       case MADV_MERGEABLE:
> > +       case MADV_UNMERGEABLE:
> 
> This switch isn't actually necessary at this point, right?

Yup, but it is there to highlight a possibility of adding other, non-KSM
options. So, let it be, and I'll just re-arrange CONFIG_KSM ifdef
instead.

Thank you.

> > +               vma = mm->mmap;
> > +               while (vma) {
> > +                       if (behaviour == MADV_MERGEABLE)
> > +                               ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags);
> > +                       else
> > +                               ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags);
> > +                       vma = vma->vm_next;
> > +               }
> > +               break;
> > +       }
> > +       up_write(&mm->mmap_sem);
> > +
> > +       mmput(mm);
> > +
> > +out_put_task_struct:
> >         put_task_struct(task);
> >
> > -       return count;
> > +       return err ? err : count;
> > +#else
> > +       return -EINVAL;
> > +#endif /* CONFIG_KSM */
> >  }

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply

* Re: [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
From: Oleksandr Natalenko @ 2019-05-16 14:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api
In-Reply-To: <20190516104412.GN16651@dhcp22.suse.cz>

Hi.

On Thu, May 16, 2019 at 12:44:12PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 11:42:29, Oleksandr Natalenko wrote:
> [...]
> > * to mark all the eligible VMAs as mergeable, use:
> > 
> >    # echo merge > /proc/<pid>/madvise
> > 
> > * to unmerge all the VMAs, use:
> > 
> >    # echo unmerge > /proc/<pid>/madvise
> 
> Please do not open a new thread until a previous one reaches some
> conclusion. I have outlined some ways to go forward in
> http://lkml.kernel.org/r/20190515145151.GG16651@dhcp22.suse.cz.
> I haven't heard any feedback on that, yet you open a 3rd way in a
> different thread. This will not help to move on with the discussion.
> 
> Please follow up on that thread.

Sure, I will follow the thread once and if there are responses. Consider
this one to be an intermediate summary of current suggestions and also
an indication that it is better to have the code early for public eyes.

Thank you.

> -- 
> Michal Hocko
> SUSE Labs

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-16 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm, linux-api
In-Reply-To: <20190516135259.GU16651@dhcp22.suse.cz>

On 16.05.2019 16:52, Michal Hocko wrote:
> On Thu 16-05-19 15:30:34, Michal Hocko wrote:
>> [You are defining a new user visible API, please always add linux-api
>>  mailing list - now done]
>>
>> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> [...]
>>> The proposed syscall aims to introduce an interface, which
>>> supplements currently existing process_vm_writev() and
>>> process_vm_readv(), and allows to solve the problem with
>>> anonymous memory transfer. The above example may be rewritten as:
>>>
>>> 	void *buf;
>>>
>>> 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>> 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>> 	recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> 	/* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>>> 	process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>>> 	munmap(buf, n * PAGE_SIZE);
> 
> AFAIU this means that you actually want to do an mmap of an anonymous
> memory with a COW semantic to the remote process right?

Yes.

> How does the remote process find out where and what has been mmaped?

Any way. Isn't this a trivial task? :) You may use socket or any
of appropriate linux features to communicate between them.

>What if the range collides? This sounds quite scary to me TBH.

In case of range collides, the part of old VMA becomes unmapped.
The same way we behave on ordinary mmap. You may intersect a range,
which another thread mapped, so you need a synchronization between
them. There is no a principle difference.

Also I'm going to add a flag to prevent unmapping like Kees suggested.
Please, see his message.

> Why cannot you simply use shared memory for that?

Because of remote task may want specific type of VMA. It may want not to
share a VMA with its children.

Speaking about online migration, a task wants its anonymous private VMAs
remain the same after the migration. Otherwise, imagine the situation,
when task's stack becomes a shared VMA after the migration.
Also, task wants anonymous mapping remains anonymous.

In general, in case of shared memory is enough for everything, we would
have never had process_vm_writev() and process_vm_readv() syscalls.

Kirill

^ permalink raw reply

* Re: [PATCH v1 1/2] pid: add pidfd_open()
From: Oleg Nesterov @ 2019-05-16 14:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: jannh, viro, torvalds, linux-kernel, arnd, akpm, cyphar, dhowells,
	ebiederm, elena.reshetova, keescook, luto, luto, tglx,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-api, linux-arch, linux-kselftest, joel,
	dancol, serge
In-Reply-To: <20190516135944.7205-1-christian@brauner.io>

On 05/16, Christian Brauner wrote:
>
> With the introduction of pidfds through CLONE_PIDFD it is possible to
> created pidfds at process creation time.

Now I am wondering why do we need CLONE_PIDFD, you can just do

	pid = fork();
	pidfd_open(pid);

> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> +	int fd, ret;
> +	struct pid *p;
> +	struct task_struct *tsk;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (pid <= 0)
> +		return -EINVAL;
> +
> +	p = find_get_pid(pid);
> +	if (!p)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	rcu_read_lock();
> +	/*
> +	 * If this returns non-NULL the pid was used as a thread-group
> +	 * leader. Note, we race with exec here: If it changes the
> +	 * thread-group leader we might return the old leader.
> +	 */
> +	tsk = pid_task(p, PIDTYPE_TGID);
> +	if (!tsk)
> +		ret = -ESRCH;
> +	rcu_read_unlock();
> +
> +	fd = ret ?: pidfd_create(p);
> +	put_pid(p);
> +	return fd;
> +}

Looks correct, feel free to add Reviewed-by: Oleg Nesterov <oleg@redhat.com>

But why do we need task_struct *tsk?

	rcu_read_lock();
	if (!pid_task(PIDTYPE_TGID))
		ret = -ESRCH;
	rcu_read_unlock();

and in fact we do not even need rcu_read_lock(), we could do

	// shut up rcu_dereference_check()
	rcu_lock_acquire(&rcu_lock_map);
	if (!pid_task(PIDTYPE_TGID))
		ret = -ESRCH;
	rcu_lock_release(&rcu_lock_map);

Well... I won't insist, but the comment about the race with exec looks a bit
confusing to me. It is true, but we do not care at all, we are not going to
use the task_struct returned by pid_task().

Oleg.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox