Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Minchan Kim @ 2019-05-28 10:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190528090821.GU1658@dhcp22.suse.cz>

On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > if we went with the per vma fd approach then you would get this
> > > > > feature automatically because map_files would refer to file backed
> > > > > mappings while map_anon could refer only to anonymous mappings.
> > > >
> > > > The reason to add such filter option is to avoid the parsing overhead
> > > > so map_anon wouldn't be helpful.
> > > 
> > > Without chiming on whether the filter option is a good idea, I'd like
> > > to suggest that providing an efficient binary interfaces for pulling
> > > memory map information out of processes.  Some single-system-call
> > > method for retrieving a binary snapshot of a process's address space
> > > complete with attributes (selectable, like statx?) for each VMA would
> > > reduce complexity and increase performance in a variety of areas,
> > > e.g., Android memory map debugging commands.
> > 
> > I agree it's the best we can get *generally*.
> > Michal, any opinion?
> 
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want to have
> the remote madvise call process or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far unfortunately.

With current usecase, it's per-process API with distinguishable anon/file
but thought it could be easily extended later for each address range
operation as userspace getting smarter with more information.

^ permalink raw reply

* Re: [PATCH] [RFC] Remove bdflush syscall stub
From: Andreas Schwab @ 2019-05-28 10:25 UTC (permalink / raw)
  To: Cyril Hrubis
  Cc: lkml, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
	Michal Simek, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, sparclinux, linux-xtensa, linux-fsdevel, linux-api
In-Reply-To: <20190528101012.11402-1-chrubis@suse.cz>

On Mai 28 2019, Cyril Hrubis <chrubis@suse.cz> wrote:

> I've tested the patch on i386. Before the patch calling bdflush() with
> attempt to tune a variable returned 0 and after the patch the syscall
> fails with EINVAL.

Should be ENOSYS, doesn't it?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply

* [PATCH] [RFC] Remove bdflush syscall stub
From: Cyril Hrubis @ 2019-05-28 10:10 UTC (permalink / raw)
  To: lkml
  Cc: Cyril Hrubis, linux-alpha, linux-arm-kernel, linux-ia64,
	linux-m68k, Michal Simek, linux-mips, linux-parisc, linuxppc-dev,
	linux-s390, linux-sh, sparclinux, linux-xtensa, linux-fsdevel,
	linux-api

While reviewing LTP testcases I've found that we still carry workaround
for 17 years old distributions that attempted to start bdflush from
userspace. I guess it's about the time to remove it.

I've tested the patch on i386. Before the patch calling bdflush() with
attempt to tune a variable returned 0 and after the patch the syscall
fails with EINVAL. You can use the now deleted LTP bdflush testcase for
testing:

https://github.com/linux-test-project/ltp/commit/53ede74305ff7f6498d7456b6e3ea3053ed4b7dd

Also I'm not 100% sure that I patched all the syscall tables correctly,
so this needs a proper review from arch maintainers.

CC: linux-alpha@vger.kernel.org
CC: linux-arm-kernel@lists.infradead.org
CC: linux-ia64@vger.kernel.org
CC: linux-m68k@lists.linux-m68k.org
CC: Michal Simek <monstr@monstr.eu>
CC: linux-mips@vger.kernel.org
CC: linux-parisc@vger.kernel.org
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-s390@vger.kernel.org
CC: linux-sh@vger.kernel.org
CC: sparclinux@vger.kernel.org
CC: linux-xtensa@linux-xtensa.org
CC: linux-fsdevel@vger.kernel.org
CC: linux-api@vger.kernel.org
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
---
 arch/alpha/kernel/syscalls/syscall.tbl        |  2 +-
 arch/arm/tools/syscall.tbl                    |  2 +-
 arch/arm64/include/asm/unistd32.h             |  3 +--
 arch/ia64/kernel/syscalls/syscall.tbl         |  2 +-
 arch/m68k/kernel/syscalls/syscall.tbl         |  2 +-
 arch/microblaze/kernel/syscalls/syscall.tbl   |  2 +-
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  2 +-
 arch/parisc/kernel/syscalls/syscall.tbl       |  2 +-
 arch/powerpc/kernel/syscalls/syscall.tbl      |  2 +-
 arch/s390/kernel/syscalls/syscall.tbl         |  2 +-
 arch/sh/include/uapi/asm/unistd_64.h          |  2 +-
 arch/sh/kernel/syscalls/syscall.tbl           |  2 +-
 arch/sh/kernel/syscalls_64.S                  |  2 +-
 arch/sparc/kernel/syscalls/syscall.tbl        |  2 +-
 arch/x86/entry/syscalls/syscall_32.tbl        |  2 +-
 arch/xtensa/kernel/syscalls/syscall.tbl       |  2 +-
 fs/buffer.c                                   | 27 -------------------
 include/linux/syscalls.h                      |  1 -
 kernel/sys_ni.c                               |  1 -
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  2 +-
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  2 +-
 21 files changed, 18 insertions(+), 48 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..3a08a2708a8a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -230,7 +230,7 @@
 259	common	osf_swapctl			sys_ni_syscall
 260	common	osf_memcntl			sys_ni_syscall
 261	common	osf_fdatasync			sys_ni_syscall
-300	common	bdflush				sys_bdflush
+# 300 was sys_bdflush
 301	common	sethae				sys_sethae
 302	common	mount				sys_mount
 303	common	old_adjtimex			sys_old_adjtimex
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..3f247c627b5f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -147,7 +147,7 @@
 131	common	quotactl		sys_quotactl
 132	common	getpgid			sys_getpgid
 133	common	fchdir			sys_fchdir
-134	common	bdflush			sys_bdflush
+# 134 was sys_bdflush
 135	common	sysfs			sys_sysfs
 136	common	personality		sys_personality
 # 137 was sys_afs_syscall
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..fd23e4db9d76 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -289,8 +289,7 @@ __SYSCALL(__NR_quotactl, sys_quotactl)
 __SYSCALL(__NR_getpgid, sys_getpgid)
 #define __NR_fchdir 133
 __SYSCALL(__NR_fchdir, sys_fchdir)
-#define __NR_bdflush 134
-__SYSCALL(__NR_bdflush, sys_bdflush)
+			/* 134 was sys_bdflush */
 #define __NR_sysfs 135
 __SYSCALL(__NR_sysfs, sys_sysfs)
 #define __NR_personality 136
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..9cd63d82de53 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -123,7 +123,7 @@
 # 1135 was get_kernel_syms
 # 1136 was query_module
 113	common	quotactl			sys_quotactl
-114	common	bdflush				sys_bdflush
+# 114 was bdflush
 115	common	sysfs				sys_sysfs
 116	common	personality			sys_personality
 117	common	afs_syscall			sys_ni_syscall
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..0c44ee777964 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131	common	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+# 134 was bdflush
 135	common	sysfs				sys_sysfs
 136	common	personality			sys_personality
 # 137 was afs_syscall
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..1f1288d6212d 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131	common	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+# 134 was bdflush
 135	common	sysfs				sys_sysfs
 136	common	personality			sys_personality
 137	common	afs_syscall			sys_ni_syscall
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..537a3828b9a0 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -145,7 +145,7 @@
 131	o32	quotactl			sys_quotactl
 132	o32	getpgid				sys_getpgid
 133	o32	fchdir				sys_fchdir
-134	o32	bdflush				sys_bdflush
+# 134 was sys_bdflush
 135	o32	sysfs				sys_sysfs
 136	o32	personality			sys_personality			sys_32_personality
 137	o32	afs_syscall			sys_ni_syscall
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..3ab53f3ed358 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -147,7 +147,7 @@
 131	common	quotactl		sys_quotactl
 132	common	getpgid			sys_getpgid
 133	common	fchdir			sys_fchdir
-134	common	bdflush			sys_bdflush
+# 134 was sys_bdflush
 135	common	sysfs			sys_sysfs
 136	32	personality		parisc_personality
 136	64	personality		sys_personality
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..5ce016e6122c 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -176,7 +176,7 @@
 131	nospu	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+# 134 was bdflush
 135	common	sysfs				sys_sysfs
 136	32	personality			sys_personality			ppc64_personality
 136	64	personality			ppc64_personality
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..3364da213530 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
 131  common	quotactl		sys_quotactl			sys_quotactl
 132  common	getpgid			sys_getpgid			sys_getpgid
 133  common	fchdir			sys_fchdir			sys_fchdir
-134  common	bdflush			sys_bdflush			sys_bdflush
+# 134 was bdflush
 135  common	sysfs			sys_sysfs			sys_sysfs
 136  common	personality		sys_s390_personality		sys_s390_personality
 137  common	afs_syscall		-				-
diff --git a/arch/sh/include/uapi/asm/unistd_64.h b/arch/sh/include/uapi/asm/unistd_64.h
index 75da54851f02..5bd0f0c29a95 100644
--- a/arch/sh/include/uapi/asm/unistd_64.h
+++ b/arch/sh/include/uapi/asm/unistd_64.h
@@ -149,7 +149,7 @@
 #define __NR_quotactl		131
 #define __NR_getpgid		132
 #define __NR_fchdir		133
-#define __NR_bdflush		134
+				/* 134 was sys_bdflush */
 #define __NR_sysfs		135
 #define __NR_personality	136
 				/* 137 was sys_afs_syscall */
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..328edef02905 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131	common	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+# 134 was bdflush
 135	common	sysfs				sys_sysfs
 136	common	personality			sys_personality
 # 137 was afs_syscall
diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
index 1bcb86f0b728..1b7ad08f5d2f 100644
--- a/arch/sh/kernel/syscalls_64.S
+++ b/arch/sh/kernel/syscalls_64.S
@@ -151,7 +151,7 @@ sys_call_table:
 	.long sys_quotactl
 	.long sys_getpgid
 	.long sys_fchdir
-	.long sys_bdflush
+	.long sys_ni_syscall		/* 134 old bdflush */
 	.long sys_sysfs			/* 135 */
 	.long sys_personality
 	.long sys_ni_syscall	/* for afs_syscall */
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..e199062c38c7 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -270,7 +270,7 @@
 222	common	delete_module		sys_delete_module
 223	common	get_kernel_syms		sys_ni_syscall
 224	common	getpgid			sys_getpgid
-225	common	bdflush			sys_bdflush
+225	common	bdflush			sys_ni_syscall
 226	common	sysfs			sys_sysfs
 227	common	afs_syscall		sys_nis_syscall
 228	common	setfsuid		sys_setfsuid16
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..b699144d52fc 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -145,7 +145,7 @@
 131	i386	quotactl		sys_quotactl			__ia32_compat_sys_quotactl32
 132	i386	getpgid			sys_getpgid			__ia32_sys_getpgid
 133	i386	fchdir			sys_fchdir			__ia32_sys_fchdir
-134	i386	bdflush			sys_bdflush			__ia32_sys_bdflush
+134	i386	bdflush			sys_ni_syscall			sys_ni_syscall
 135	i386	sysfs			sys_sysfs			__ia32_sys_sysfs
 136	i386	personality		sys_personality			__ia32_sys_personality
 137	i386	afs_syscall
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..6b9fdddd7994 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -223,7 +223,7 @@
 # 205 was old nfsservctl
 205	common	nfsservctl			sys_ni_syscall
 206	common	_sysctl				sys_sysctl
-207	common	bdflush				sys_bdflush
+207	common	bdflush				sys_ni_syscall
 208	common	uname				sys_newuname
 209	common	sysinfo				sys_sysinfo
 210	common	init_module			sys_init_module
diff --git a/fs/buffer.c b/fs/buffer.c
index e450c55f6434..6f052ada051c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3307,33 +3307,6 @@ int try_to_free_buffers(struct page *page)
 }
 EXPORT_SYMBOL(try_to_free_buffers);
 
-/*
- * There are no bdflush tunables left.  But distributions are
- * still running obsolete flush daemons, so we terminate them here.
- *
- * Use of bdflush() is deprecated and will be removed in a future kernel.
- * The `flush-X' kernel threads fully replace bdflush daemons and this call.
- */
-SYSCALL_DEFINE2(bdflush, int, func, long, data)
-{
-	static int msg_count;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
-	if (msg_count < 5) {
-		msg_count++;
-		printk(KERN_INFO
-			"warning: process `%s' used the obsolete bdflush"
-			" system call\n", current->comm);
-		printk(KERN_INFO "Fix your initscripts?\n");
-	}
-
-	if (func == 1)
-		do_exit(0);
-	return 0;
-}
-
 /*
  * Buffer-head allocation
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..3ede2f17a044 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1104,7 +1104,6 @@ asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
 asmlinkage long sys_vfork(void);
 asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
 asmlinkage long sys_send(int, void __user *, size_t, unsigned);
-asmlinkage long sys_bdflush(int func, long data);
 asmlinkage long sys_oldumount(char __user *name);
 asmlinkage long sys_uselib(const char __user *library);
 asmlinkage long sys_sysctl(struct __sysctl_args __user *args);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..5bab44795a17 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -405,7 +405,6 @@ COND_SYSCALL(epoll_wait);
 COND_SYSCALL(recv);
 COND_SYSCALL_COMPAT(recv);
 COND_SYSCALL(send);
-COND_SYSCALL(bdflush);
 COND_SYSCALL(uselib);
 
 
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index db3bbb8744af..56b5567de7ff 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -169,7 +169,7 @@
 131	nospu	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+134	common	bdflush				sys_ni_syscall
 135	common	sysfs				sys_sysfs
 136	32	personality			sys_personality			ppc64_personality
 136	64	personality			ppc64_personality
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index b38d48464368..a83eba11c877 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
 131  common	quotactl		sys_quotactl			compat_sys_quotactl
 132  common	getpgid			sys_getpgid			sys_getpgid
 133  common	fchdir			sys_fchdir			sys_fchdir
-134  common	bdflush			sys_bdflush			compat_sys_bdflush
+134  common	bdflush			-				-
 135  common	sysfs			sys_sysfs			compat_sys_sysfs
 136  common	personality		sys_s390_personality		sys_s390_personality
 137  common	afs_syscall		-				-
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH 1/2] fork: add clone6
From: Christian Brauner @ 2019-05-28 10:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <CAHk-=wjnbK5ob9JE0H1Ge_R4BL6D0ztsAvrM6DN+S+zyDWE=7A@mail.gmail.com>

On Mon, May 27, 2019 at 12:27:08PM -0700, Linus Torvalds wrote:
> On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > Hm, still pondering whether having one unsigned int argument passed
> > through registers that captures all the flags from the old clone() would
> > be a good idea.
> 
> That sounds like a reasonable thing to do.
> 
> Maybe we could continue to call the old flags CLONE_XYZ and continue
> to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> flags for a new 64-bit flag field that comes in through memory in the
> new clone_args thing?

Hm. I think I'll try a first version without an additional register
flags argument. And here's why: I'm not sure it buys us a lot especially
if we're giving up on making this convenient for seccomp anyway.
And with that out of the way (at least for the moment) I would really
like to make this interface consistent. But we can revisit this when I
have the code.

Christian

^ permalink raw reply

* Re: [PATCH v2 2/2] tests: add close_range() tests
From: Christian Brauner @ 2019-05-28  9:57 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer,
	jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <8736kzqpdm.fsf@concordia.ellerman.id.au>

On Tue, May 28, 2019 at 12:33:41PM +1000, Michael Ellerman wrote:
> Christian Brauner <christian@brauner.io> writes:
> > This adds basic tests for the new close_range() syscall.
> > - test that no invalid flags can be passed
> > - test that a range of file descriptors is correctly closed
> > - test that a range of file descriptors is correctly closed if there there
> >   are already closed file descriptors in the range
> > - test that max_fd is correctly capped to the current fdtable maximum
> >
> > Signed-off-by: Christian Brauner <christian@brauner.io>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: David Howells <dhowells@redhat.com>
> > Cc: Dmitry V. Levin <ldv@altlinux.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Florian Weimer <fweimer@redhat.com>
> > Cc: linux-api@vger.kernel.org
> > ---
> > v1: unchanged
> > v2:
> > - Christian Brauner <christian@brauner.io>:
> >   - verify that close_range() correctly closes a single file descriptor
> > ---
> >  tools/testing/selftests/Makefile              |   1 +
> >  tools/testing/selftests/core/.gitignore       |   1 +
> >  tools/testing/selftests/core/Makefile         |   6 +
> >  .../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
> >  4 files changed, 150 insertions(+)
> >  create mode 100644 tools/testing/selftests/core/.gitignore
> >  create mode 100644 tools/testing/selftests/core/Makefile
> >  create mode 100644 tools/testing/selftests/core/close_range_test.c
> >
> > diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
> > new file mode 100644
> > index 000000000000..6e6712ce5817
> > --- /dev/null
> > +++ b/tools/testing/selftests/core/.gitignore
> > @@ -0,0 +1 @@
> > +close_range_test
> > diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
> > new file mode 100644
> > index 000000000000..de3ae68aa345
> > --- /dev/null
> > +++ b/tools/testing/selftests/core/Makefile
> > @@ -0,0 +1,6 @@
> > +CFLAGS += -g -I../../../../usr/include/ -I../../../../include
> 
> Your second -I pulls the unexported kernel headers in, userspace
> programs shouldn't include unexported kernel headers.
> 
> It breaks the build on powerpc with eg:
> 
>   powerpc64le-linux-gnu-gcc -g -I../../../../usr/include/ -I../../../../include    close_range_test.c  -o /output/kselftest/core/close_range_test
>   In file included from /usr/powerpc64le-linux-gnu/include/bits/fcntl-linux.h:346,
>                    from /usr/powerpc64le-linux-gnu/include/bits/fcntl.h:62,
>                    from /usr/powerpc64le-linux-gnu/include/fcntl.h:35,
>                    from close_range_test.c:5:
>   ../../../../include/linux/falloc.h:13:2: error: unknown type name '__s16'
>     __s16  l_type;
>     ^~~~~
> 
> 
> Did you do that on purpose or just copy it from one of the other
> Makefiles? :)

I originally did that on purpose because checkpatch was yammering on
about me not having used ARRAY_SIZE(). But that include can go, you are
right.

Christian

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Daniel Colascione @ 2019-05-28  9:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190528090821.GU1658@dhcp22.suse.cz>

On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > if we went with the per vma fd approach then you would get this
> > > > > feature automatically because map_files would refer to file backed
> > > > > mappings while map_anon could refer only to anonymous mappings.
> > > >
> > > > The reason to add such filter option is to avoid the parsing overhead
> > > > so map_anon wouldn't be helpful.
> > >
> > > Without chiming on whether the filter option is a good idea, I'd like
> > > to suggest that providing an efficient binary interfaces for pulling
> > > memory map information out of processes.  Some single-system-call
> > > method for retrieving a binary snapshot of a process's address space
> > > complete with attributes (selectable, like statx?) for each VMA would
> > > reduce complexity and increase performance in a variety of areas,
> > > e.g., Android memory map debugging commands.
> >
> > I agree it's the best we can get *generally*.
> > Michal, any opinion?
>
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want to have
> the remote madvise call process or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far unfortunately.

I don't think the vma fd approach is viable. We have some processes
with a *lot* of VMAs --- system_server had 4204 when I checked just
now (and that's typical) --- and an FD operation per VMA would be
excessive. VMAs also come and go pretty easily depending on changes in
protections and various faults. It's also not entirely clear what the
semantics of vma FDs should be over address space mutations, while the
semantics of address ranges are well-understood. I would much prefer
an interface operating on address ranges to one operating on VMA FDs,
both for efficiency and for consistency with other memory management
APIs.

> An interface to query address range information is a separate but
> although a related topic. We have /proc/<pid>/[s]maps for that right
> now and I understand it is not a general win for all usecases because
> it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> provide per vma information in a binary form via a fd based interface.
> But I would rather not conflate those two discussions much - well except
> if it could give one of the approaches more justification but let's
> focus on the madvise part first.

I don't think it's a good idea to focus on one feature in a
multi-feature change when the interactions between features can be
very important for overall design of the multi-feature system and the
design of each feature.

Here's my thinking on the high-level design:

I'm imagining an address-range system that would work like this: we'd
create some kind of process_vm_getinfo(2) system call [1] that would
accept a statx-like attribute map and a pid/fd parameter as input and
return, on output, two things: 1) an array [2] of VMA descriptors
containing the requested information, and 2) a VMA configuration
sequence number. We'd then have process_madvise() and other
cross-process VM interfaces accept both address ranges and this
sequence number; they'd succeed only if the VMA configuration sequence
number is still current, i.e., the target process hasn't changed its
VMA configuration (implicitly or explicitly) since the call to
process_vm_getinfo().

This way, a process A that wants to perform some VM operation on
process B can slurp B's VMA configuration using process_vm_getinfo(),
figure out what it wants to do, and attempt to do it. If B modifies
its memory map in the meantime, If A finds that its local knowledge of
B's memory map has become invalid between the process_vm_getinfo() and
A taking some action based on the result, A can retry [3]. While A
could instead ptrace or otherwise suspend B, *then* read B's memory
map (knowing B is quiescent), *then* operate on B, the optimistic
approach I'm describing would be much lighter-weight in the typical
case. It's also pretty simple, IMHO. If the "operate on B" step is
some kind of vectorized operation over multiple address ranges, this
approach also gets us all-or-nothing semantics.

Or maybe the whole sequence number thing is overkill and we don't need
atomicity? But if there's a concern  that A shouldn't operate on B's
memory without knowing what it's operating on, then the scheme I've
proposed above solves this knowledge problem in a pretty lightweight
way.

[1] or some other interface
[2] or something more complicated if we want the descriptors to
contain variable-length elements, e.g., strings
[3] or override the sequence number check if it's feeling bold?

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-28  9:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <20190527233030.hpnnbi4aqnu34ova@box>

On 28.05.2019 02:30, Kirill A. Shutemov wrote:
> On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
>> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
>>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
>>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>>>> This patchset adds a new syscall, which makes possible
>>>>>> to clone a VMA from a process to current process.
>>>>>> The syscall supplements the functionality provided
>>>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>>>> and it may be useful in many situation.
>>>>>
>>>>> Kirill, could you explain how the change affects rmap and how it is safe.
>>>>>
>>>>> My concern is that the patchset allows to map the same page multiple times
>>>>> within one process or even map page allocated by child to the parrent.
>>>>>
>>>>> It was not allowed before.
>>>>>
>>>>> In the best case it makes reasoning about rmap substantially more difficult.
>>>>>
>>>>> But I'm worry it will introduce hard-to-debug bugs, like described in
>>>>> https://lwn.net/Articles/383162/.
>>>>
>>>> Andy suggested to unmap PTEs from source page table, and this make the single
>>>> page never be mapped in the same process twice. This is OK for my use case,
>>>> and here we will just do a small step "allow to inherit VMA by a child process",
>>>> which we didn't have before this. If someone still needs to continue the work
>>>> to allow the same page be mapped twice in a single process in the future, this
>>>> person will have a supported basis we do in this small step. I believe, someone
>>>> like debugger may want to have this to make a fast snapshot of a process private
>>>> memory (when the task is stopped for a small time to get its memory). But for
>>>> me remapping is enough at the moment.
>>>>
>>>> What do you think about this?
>>>
>>> I don't think that unmapping alone will do. Consider the following
>>> scenario:
>>>
>>> 1. Task A creates and populates the mapping.
>>> 2. Task A forks. We have now Task B mapping the same pages, but
>>> write-protected.
>>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
>>>
>>> After this Task A will have the same anon pages mapped twice.
>>
>> Ah, sure.
>>
>>> One possible way out would be to force CoW on all pages in the mapping,
>>> before passing the mapping to the new process.
>>
>> This will pop all swapped pages up, which is the thing the patchset aims
>> to prevent.
>>
>> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain
>> only chain and which vma->anon_vma_chain contains single entry? This is
>> a vma, which were faulted, but its mm never were duplicated (or which
>> forks already died).
> 
> The requirement for the VMA to be faulted (have any pages mapped) looks
> excessive to me, but the general idea may work.
> 
> One issue I see is that userspace may not have full control to create such
> VMA. vma_merge() can merge the VMA to the next one without any consent
> from userspace and you'll get anon_vma inherited from the VMA you've
> justed merged with.
> 
> I don't have any valid idea on how to get around this.

Technically it is possible by creating boundary 1-page VMAs with another protection:
one above and one below the desired region, then map the desired mapping. But this
is not comfortable.

I don't think it's difficult to find a natural limitation, which prevents mapping
a single page twice if we want to avoid this at least on start. Another suggestion:

prohibit to map a remote process's VMA only in case of its vm_area_struct::anon_vma::root
is the same as root of one of local process's VMA.

What about this?

Thanks,
Kirill

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Michal Hocko @ 2019-05-28  9:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190528084927.GB159710@google.com>

On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > if we went with the per vma fd approach then you would get this
> > > > feature automatically because map_files would refer to file backed
> > > > mappings while map_anon could refer only to anonymous mappings.
> > >
> > > The reason to add such filter option is to avoid the parsing overhead
> > > so map_anon wouldn't be helpful.
> > 
> > Without chiming on whether the filter option is a good idea, I'd like
> > to suggest that providing an efficient binary interfaces for pulling
> > memory map information out of processes.  Some single-system-call
> > method for retrieving a binary snapshot of a process's address space
> > complete with attributes (selectable, like statx?) for each VMA would
> > reduce complexity and increase performance in a variety of areas,
> > e.g., Android memory map debugging commands.
> 
> I agree it's the best we can get *generally*.
> Michal, any opinion?

I am not really sure this is directly related. I think the primary
question that we have to sort out first is whether we want to have
the remote madvise call process or vma fd based. This is an important
distinction wrt. usability. I have only seen pid vs. pidfd discussions
so far unfortunately.

An interface to query address range information is a separate but
although a related topic. We have /proc/<pid>/[s]maps for that right
now and I understand it is not a general win for all usecases because
it tends to be slow for some. I can see how /proc/<pid>/map_anons could
provide per vma information in a binary form via a fd based interface.
But I would rather not conflate those two discussions much - well except
if it could give one of the approaches more justification but let's
focus on the madvise part first.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Konstantin Khlebnikov @ 2019-05-28  8:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <20190528084243.GT1658@dhcp22.suse.cz>

On 28.05.2019 11:42, Michal Hocko wrote:
> On Tue 28-05-19 11:04:46, Konstantin Khlebnikov wrote:
>> On 28.05.2019 10:38, Michal Hocko wrote:
> [...]
>>> Could you define the exact semantic? Ideally something for the manual
>>> page please?
>>>
>>
>> Like kswapd which works with thresholds of free memory this one reclaims
>> until 'free' (i.e. memory which could be allocated without invoking
>> direct recliam of any kind) is lower than passed 'size' argument.
> 
> s@lower@higher@ I guess

Yep. My wording still bad.
'size' argument should be called 'watermark' or 'threshold'.

I.e. reclaim while 'free' memory is lower passed 'threshold'.

> 
>> Thus right after madvise(NULL, size, MADV_STOCKPILE) 'size' bytes
>> could be allocated in this memory cgroup without extra latency from
>> reclaimer if there is no other memory consumers.
>>
>> Reclaimed memory is simply put into free lists in common buddy allocator,
>> there is no reserves for particular task or cgroup.
>>
>> If overall memory allocation rate is smooth without rough spikes then
>> calling MADV_STOCKPILE in loop periodically provides enough room for
>> allocations and eliminates direct reclaim from all other tasks.
>> As a result this eliminates unpredictable delays caused by
>> direct reclaim in random places.
> 
> OK, this makes it more clear to me. Thanks for the clarification!
> I have clearly misunderstood and misinterpreted target as the reclaim
> target rather than free memory target.  Sorry about the confusion.
> I sill think that this looks like an abuse of the madvise but if there
> is a wider consensus this is acceptable I will not stand in the way.
> 

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Minchan Kim @ 2019-05-28  8:49 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <CAKOZuesnS6kBFX-PKJ3gvpkv8i-ysDOT2HE2Z12=vnnHQv0FDA@mail.gmail.com>

On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > if we went with the per vma fd approach then you would get this
> > > feature automatically because map_files would refer to file backed
> > > mappings while map_anon could refer only to anonymous mappings.
> >
> > The reason to add such filter option is to avoid the parsing overhead
> > so map_anon wouldn't be helpful.
> 
> Without chiming on whether the filter option is a good idea, I'd like
> to suggest that providing an efficient binary interfaces for pulling
> memory map information out of processes.  Some single-system-call
> method for retrieving a binary snapshot of a process's address space
> complete with attributes (selectable, like statx?) for each VMA would
> reduce complexity and increase performance in a variety of areas,
> e.g., Android memory map debugging commands.

I agree it's the best we can get *generally*.
Michal, any opinion?

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Michal Hocko @ 2019-05-28  8:42 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <5af1ba69-61d1-1472-4aa3-20beb4ae44ae@yandex-team.ru>

On Tue 28-05-19 11:04:46, Konstantin Khlebnikov wrote:
> On 28.05.2019 10:38, Michal Hocko wrote:
[...]
> > Could you define the exact semantic? Ideally something for the manual
> > page please?
> > 
> 
> Like kswapd which works with thresholds of free memory this one reclaims
> until 'free' (i.e. memory which could be allocated without invoking
> direct recliam of any kind) is lower than passed 'size' argument.

s@lower@higher@ I guess

> Thus right after madvise(NULL, size, MADV_STOCKPILE) 'size' bytes
> could be allocated in this memory cgroup without extra latency from
> reclaimer if there is no other memory consumers.
> 
> Reclaimed memory is simply put into free lists in common buddy allocator,
> there is no reserves for particular task or cgroup.
> 
> If overall memory allocation rate is smooth without rough spikes then
> calling MADV_STOCKPILE in loop periodically provides enough room for
> allocations and eliminates direct reclaim from all other tasks.
> As a result this eliminates unpredictable delays caused by
> direct reclaim in random places.

OK, this makes it more clear to me. Thanks for the clarification!
I have clearly misunderstood and misinterpreted target as the reclaim
target rather than free memory target.  Sorry about the confusion.
I sill think that this looks like an abuse of the madvise but if there
is a wider consensus this is acceptable I will not stand in the way.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Daniel Colascione @ 2019-05-28  8:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190528081351.GA159710@google.com>

On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> if we went with the per vma fd approach then you would get this
> > feature automatically because map_files would refer to file backed
> > mappings while map_anon could refer only to anonymous mappings.
>
> The reason to add such filter option is to avoid the parsing overhead
> so map_anon wouldn't be helpful.

Without chiming on whether the filter option is a good idea, I'd like
to suggest that providing an efficient binary interfaces for pulling
memory map information out of processes.  Some single-system-call
method for retrieving a binary snapshot of a process's address space
complete with attributes (selectable, like statx?) for each VMA would
reduce complexity and increase performance in a variety of areas,
e.g., Android memory map debugging commands.

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Minchan Kim @ 2019-05-28  8:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190528062947.GL1658@dhcp22.suse.cz>

On Tue, May 28, 2019 at 08:29:47AM +0200, Michal Hocko wrote:
> On Tue 28-05-19 12:26:32, Minchan Kim wrote:
> > On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote:
> > > On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> > > > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > > > > [cc linux-api]
> > > > > > > 
> > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > > > > In this configuration, userspace could handle different strategy for each
> > > > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > > > > until memory pressure happens.
> > > > > > > > 
> > > > > > > > To support such strategy easier, this patch introduces
> > > > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > > > > of (int behavior).
> > > > > > > 
> > > > > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > > > > filtering from the userspace so why do we need any additional filtering?
> > > > > > > 
> > > > > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > > > > either anonymous or file-backed.
> > > > > > > > 
> > > > > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > > > > 
> > > > > > > OK, so here is the reason you want that. The immediate question is why
> > > > > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > > > > work, all right, but less of an API to expose and that itself is a
> > > > > > > strong argument against.
> > > > > > 
> > > > > > What I should do if we don't have such filter option is to enumerate all of
> > > > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > > > > which would be painful for 2000+ vmas.
> > > > > 
> > > > > Painful is not an argument to add a new user API. If the existing API
> > > > > suits the purpose then it should be used. If it is not usable, we can
> > > > > think of a different way.
> > > > 
> > > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> > > > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> > > > It's never trivial.
> > > 
> > > This is not the only option. Have you tried to simply use
> > > /proc/<pid>/map_files interface? This will provide you with all the file
> > > backed mappings.
> > 
> > I compared maps vs. map_files with 3036 file-backed vma.
> > Test scenario is to dump all of vmas of the process and parse address
> > ranges.
> > For map_files, it's easy to parse each address range because directory name
> > itself is range. However, in case of maps, I need to parse each range
> > line by line so need to scan all of lines.
> > 
> > (maps cover additional non-file-backed vmas so nr_vma is a little bigger)
> > 
> > performance mode:
> > map_files: nr_vma 3036 usec 13387
> > maps     : nr_vma 3078 usec 12923
> > 
> > powersave mode:
> > 
> > map_files: nr_vma 3036 usec 52614
> > maps     : nr_vma 3078 usec 41089
> > 
> > map_files is slower than maps if we dump all of vmas. I guess directory
> > operation needs much more jobs(e.g., dentry lookup, instantiation)
> > compared to maps.
> 
> OK, that is somehow surprising. I am still not convinced the filter is a
> good idea though. The primary reason is that it encourages using madvise
> on a wide range without having a clue what the range contains. E.g. the
> full address range and rely the right thing will happen. Do we really
> want madvise to operate in that mode?

If user space daemon(e.g., activity manager service) could know a certain
process is bakground and idle for a while, yeb, that would be good option.

> 
> Btw. if we went with the per vma fd approach then you would get this
> feature automatically because map_files would refer to file backed
> mappings while map_anon could refer only to anonymous mappings.

The reason to add such filter option is to avoid the parsing overhead
so map_anon wouldn't be helpful.


> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Konstantin Khlebnikov @ 2019-05-28  8:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <20190528073835.GP1658@dhcp22.suse.cz>



On 28.05.2019 10:38, Michal Hocko wrote:
> On Tue 28-05-19 10:30:12, Konstantin Khlebnikov wrote:
>> On 28.05.2019 9:51, Michal Hocko wrote:
>>> On Tue 28-05-19 09:25:13, Konstantin Khlebnikov wrote:
>>>> On 27.05.2019 17:39, Michal Hocko wrote:
>>>>> On Mon 27-05-19 16:21:56, Michal Hocko wrote:
>>>>>> On Mon 27-05-19 16:12:23, Michal Hocko wrote:
>>>>>>> [Cc linux-api. Please always cc this list when proposing a new user
>>>>>>>     visible api. Keeping the rest of the email intact for reference]
>>>>>>>
>>>>>>> On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
>>>>>> [...]
>>>>>>>> This implements manual kswapd-style memory reclaim initiated by userspace.
>>>>>>>> It reclaims both physical memory and cgroup pages. It works in context of
>>>>>>>> task who calls syscall madvise thus cpu time is accounted correctly.
>>>>>>
>>>>>> I do not follow. Does this mean that the madvise always reclaims from
>>>>>> the memcg the process is member of?
>>>>>
>>>>> OK, I've had a quick look at the implementation (the semantic should be
>>>>> clear from the patch descrition btw.) and it goes all the way up the
>>>>> hierarchy and finally try to impose the same limit to the global state.
>>>>> This doesn't really make much sense to me. For few reasons.
>>>>>
>>>>> First of all it breaks isolation where one subgroup can influence a
>>>>> different hierarchy via parent reclaim.
>>>>
>>>> madvise(NULL, size, MADV_STOCKPILE) is the same as memory allocation and
>>>> freeing immediately, but without pinning memory and provoking oom.
>>>>
>>>> So, there is shouldn't be any isolation or security issues.
>>>>
>>>> At least probably it should be limited with portion of limit (like half)
>>>> instead of whole limit as it does now.
>>>
>>> I do not think so. If a process is running inside a memcg then it is
>>> a subject of a limit and that implies an isolation. What you are
>>> proposing here is to allow escaping that restriction unless I am missing
>>> something. Just consider the following setup
>>>
>>> 		root (total memory = 2G)
>>> 		 / \
>>>              (1G) A   B (1G)
>>>                      / \
>>>              (500M) C   D (500M)
>>>
>>> all of them used up close to the limit and a process inside D requests
>>> shrinking to 250M. Unless I am misunderstanding this implementation
>>> will shrink D, B root to 250M (which means reclaiming C and A as well)
>>> and then globally if that was not sufficient. So you have allowed D to
>>> "allocate" 1,75G of memory effectively, right?
>>
>> It shrinks not 'size' memory - only while usage + size > limit.
>> So, after reclaiming 250M in D all other levels will have 250M free.
> 
> Could you define the exact semantic? Ideally something for the manual
> page please?
> 

Like kswapd which works with thresholds of free memory this one reclaims
until 'free' (i.e. memory which could be allocated without invoking
direct recliam of any kind) is lower than passed 'size' argument.

Thus right after madvise(NULL, size, MADV_STOCKPILE) 'size' bytes
could be allocated in this memory cgroup without extra latency from
reclaimer if there is no other memory consumers.

Reclaimed memory is simply put into free lists in common buddy allocator,
there is no reserves for particular task or cgroup.

If overall memory allocation rate is smooth without rough spikes then
calling MADV_STOCKPILE in loop periodically provides enough room for
allocations and eliminates direct reclaim from all other tasks.
As a result this eliminates unpredictable delays caused by
direct reclaim in random places.

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Michal Hocko @ 2019-05-28  7:38 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <a4e5eeb8-3560-d4b4-08a0-8a22c677c0f7@yandex-team.ru>

On Tue 28-05-19 10:30:12, Konstantin Khlebnikov wrote:
> On 28.05.2019 9:51, Michal Hocko wrote:
> > On Tue 28-05-19 09:25:13, Konstantin Khlebnikov wrote:
> > > On 27.05.2019 17:39, Michal Hocko wrote:
> > > > On Mon 27-05-19 16:21:56, Michal Hocko wrote:
> > > > > On Mon 27-05-19 16:12:23, Michal Hocko wrote:
> > > > > > [Cc linux-api. Please always cc this list when proposing a new user
> > > > > >    visible api. Keeping the rest of the email intact for reference]
> > > > > > 
> > > > > > On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
> > > > > [...]
> > > > > > > This implements manual kswapd-style memory reclaim initiated by userspace.
> > > > > > > It reclaims both physical memory and cgroup pages. It works in context of
> > > > > > > task who calls syscall madvise thus cpu time is accounted correctly.
> > > > > 
> > > > > I do not follow. Does this mean that the madvise always reclaims from
> > > > > the memcg the process is member of?
> > > > 
> > > > OK, I've had a quick look at the implementation (the semantic should be
> > > > clear from the patch descrition btw.) and it goes all the way up the
> > > > hierarchy and finally try to impose the same limit to the global state.
> > > > This doesn't really make much sense to me. For few reasons.
> > > > 
> > > > First of all it breaks isolation where one subgroup can influence a
> > > > different hierarchy via parent reclaim.
> > > 
> > > madvise(NULL, size, MADV_STOCKPILE) is the same as memory allocation and
> > > freeing immediately, but without pinning memory and provoking oom.
> > > 
> > > So, there is shouldn't be any isolation or security issues.
> > > 
> > > At least probably it should be limited with portion of limit (like half)
> > > instead of whole limit as it does now.
> > 
> > I do not think so. If a process is running inside a memcg then it is
> > a subject of a limit and that implies an isolation. What you are
> > proposing here is to allow escaping that restriction unless I am missing
> > something. Just consider the following setup
> > 
> > 		root (total memory = 2G)
> > 		 / \
> >             (1G) A   B (1G)
> >                     / \
> >             (500M) C   D (500M)
> > 
> > all of them used up close to the limit and a process inside D requests
> > shrinking to 250M. Unless I am misunderstanding this implementation
> > will shrink D, B root to 250M (which means reclaiming C and A as well)
> > and then globally if that was not sufficient. So you have allowed D to
> > "allocate" 1,75G of memory effectively, right?
> 
> It shrinks not 'size' memory - only while usage + size > limit.
> So, after reclaiming 250M in D all other levels will have 250M free.

Could you define the exact semantic? Ideally something for the manual
page please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Konstantin Khlebnikov @ 2019-05-28  7:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <20190528065153.GB1803@dhcp22.suse.cz>

On 28.05.2019 9:51, Michal Hocko wrote:
> On Tue 28-05-19 09:25:13, Konstantin Khlebnikov wrote:
>> On 27.05.2019 17:39, Michal Hocko wrote:
>>> On Mon 27-05-19 16:21:56, Michal Hocko wrote:
>>>> On Mon 27-05-19 16:12:23, Michal Hocko wrote:
>>>>> [Cc linux-api. Please always cc this list when proposing a new user
>>>>>    visible api. Keeping the rest of the email intact for reference]
>>>>>
>>>>> On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
>>>> [...]
>>>>>> This implements manual kswapd-style memory reclaim initiated by userspace.
>>>>>> It reclaims both physical memory and cgroup pages. It works in context of
>>>>>> task who calls syscall madvise thus cpu time is accounted correctly.
>>>>
>>>> I do not follow. Does this mean that the madvise always reclaims from
>>>> the memcg the process is member of?
>>>
>>> OK, I've had a quick look at the implementation (the semantic should be
>>> clear from the patch descrition btw.) and it goes all the way up the
>>> hierarchy and finally try to impose the same limit to the global state.
>>> This doesn't really make much sense to me. For few reasons.
>>>
>>> First of all it breaks isolation where one subgroup can influence a
>>> different hierarchy via parent reclaim.
>>
>> madvise(NULL, size, MADV_STOCKPILE) is the same as memory allocation and
>> freeing immediately, but without pinning memory and provoking oom.
>>
>> So, there is shouldn't be any isolation or security issues.
>>
>> At least probably it should be limited with portion of limit (like half)
>> instead of whole limit as it does now.
> 
> I do not think so. If a process is running inside a memcg then it is
> a subject of a limit and that implies an isolation. What you are
> proposing here is to allow escaping that restriction unless I am missing
> something. Just consider the following setup
> 
> 		root (total memory = 2G)
> 		 / \
>             (1G) A   B (1G)
>                     / \
>             (500M) C   D (500M)
> 
> all of them used up close to the limit and a process inside D requests
> shrinking to 250M. Unless I am misunderstanding this implementation
> will shrink D, B root to 250M (which means reclaiming C and A as well)
> and then globally if that was not sufficient. So you have allowed D to
> "allocate" 1,75G of memory effectively, right?

It shrinks not 'size' memory - only while usage + size > limit.
So, after reclaiming 250M in D all other levels will have 250M free.

Of course there might be race because reclaimer works with one level
at the time. Probably it should start from inner level at each iteration.

>   
>>>
>>> I also have a problem with conflating the global and memcg states. Does
>>> it really make any sense to have the same target to the global state
>>> as per-memcg? How are you supposed to use this interface to shrink a
>>> particular memcg or for the global situation with a proportional
>>> distribution to all memcgs?
>>
>> For now this is out of my use cease. This could be done in userspace
>> with multiple daemons in different contexts and connection between them.
>> In this case each daemon should apply pressure only its own level.
> 
> Do you expect all daemons to agree on their shrinking target? Could you
> elaborate? I simply do not see how this can work with memcgs lower in
> the hierarchy having a smaller limit than their parents.
> 

Daemons could distribute pressure among leaves and propagate it into parents.
Together with low-limit this gives enough control over pressure distribution.

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Michal Hocko @ 2019-05-28  6:51 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <9c55a343-2a91-46c6-166d-41b94bf5e9c8@yandex-team.ru>

On Tue 28-05-19 09:25:13, Konstantin Khlebnikov wrote:
> On 27.05.2019 17:39, Michal Hocko wrote:
> > On Mon 27-05-19 16:21:56, Michal Hocko wrote:
> > > On Mon 27-05-19 16:12:23, Michal Hocko wrote:
> > > > [Cc linux-api. Please always cc this list when proposing a new user
> > > >   visible api. Keeping the rest of the email intact for reference]
> > > > 
> > > > On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
> > > [...]
> > > > > This implements manual kswapd-style memory reclaim initiated by userspace.
> > > > > It reclaims both physical memory and cgroup pages. It works in context of
> > > > > task who calls syscall madvise thus cpu time is accounted correctly.
> > > 
> > > I do not follow. Does this mean that the madvise always reclaims from
> > > the memcg the process is member of?
> > 
> > OK, I've had a quick look at the implementation (the semantic should be
> > clear from the patch descrition btw.) and it goes all the way up the
> > hierarchy and finally try to impose the same limit to the global state.
> > This doesn't really make much sense to me. For few reasons.
> > 
> > First of all it breaks isolation where one subgroup can influence a
> > different hierarchy via parent reclaim.
> 
> madvise(NULL, size, MADV_STOCKPILE) is the same as memory allocation and
> freeing immediately, but without pinning memory and provoking oom.
>
> So, there is shouldn't be any isolation or security issues.
> 
> At least probably it should be limited with portion of limit (like half)
> instead of whole limit as it does now.

I do not think so. If a process is running inside a memcg then it is
a subject of a limit and that implies an isolation. What you are
proposing here is to allow escaping that restriction unless I am missing
something. Just consider the following setup

		root (total memory = 2G)
		 / \
           (1G) A   B (1G)
                   / \
           (500M) C   D (500M)

all of them used up close to the limit and a process inside D requests
shrinking to 250M. Unless I am misunderstanding this implementation
will shrink D, B root to 250M (which means reclaiming C and A as well)
and then globally if that was not sufficient. So you have allowed D to
"allocate" 1,75G of memory effectively, right?
 
> > 
> > I also have a problem with conflating the global and memcg states. Does
> > it really make any sense to have the same target to the global state
> > as per-memcg? How are you supposed to use this interface to shrink a
> > particular memcg or for the global situation with a proportional
> > distribution to all memcgs?
> 
> For now this is out of my use cease. This could be done in userspace
> with multiple daemons in different contexts and connection between them.
> In this case each daemon should apply pressure only its own level.

Do you expect all daemons to agree on their shrinking target? Could you
elaborate? I simply do not see how this can work with memcgs lower in
the hierarchy having a smaller limit than their parents.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Michal Hocko @ 2019-05-28  6:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190528032632.GF6879@google.com>

On Tue 28-05-19 12:26:32, Minchan Kim wrote:
> On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote:
> > On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > > > [cc linux-api]
> > > > > > 
> > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > > > In this configuration, userspace could handle different strategy for each
> > > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > > > until memory pressure happens.
> > > > > > > 
> > > > > > > To support such strategy easier, this patch introduces
> > > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > > > of (int behavior).
> > > > > > 
> > > > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > > > filtering from the userspace so why do we need any additional filtering?
> > > > > > 
> > > > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > > > either anonymous or file-backed.
> > > > > > > 
> > > > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > > > 
> > > > > > OK, so here is the reason you want that. The immediate question is why
> > > > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > > > work, all right, but less of an API to expose and that itself is a
> > > > > > strong argument against.
> > > > > 
> > > > > What I should do if we don't have such filter option is to enumerate all of
> > > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > > > which would be painful for 2000+ vmas.
> > > > 
> > > > Painful is not an argument to add a new user API. If the existing API
> > > > suits the purpose then it should be used. If it is not usable, we can
> > > > think of a different way.
> > > 
> > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> > > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> > > It's never trivial.
> > 
> > This is not the only option. Have you tried to simply use
> > /proc/<pid>/map_files interface? This will provide you with all the file
> > backed mappings.
> 
> I compared maps vs. map_files with 3036 file-backed vma.
> Test scenario is to dump all of vmas of the process and parse address
> ranges.
> For map_files, it's easy to parse each address range because directory name
> itself is range. However, in case of maps, I need to parse each range
> line by line so need to scan all of lines.
> 
> (maps cover additional non-file-backed vmas so nr_vma is a little bigger)
> 
> performance mode:
> map_files: nr_vma 3036 usec 13387
> maps     : nr_vma 3078 usec 12923
> 
> powersave mode:
> 
> map_files: nr_vma 3036 usec 52614
> maps     : nr_vma 3078 usec 41089
> 
> map_files is slower than maps if we dump all of vmas. I guess directory
> operation needs much more jobs(e.g., dentry lookup, instantiation)
> compared to maps.

OK, that is somehow surprising. I am still not convinced the filter is a
good idea though. The primary reason is that it encourages using madvise
on a wide range without having a clue what the range contains. E.g. the
full address range and rely the right thing will happen. Do we really
want madvise to operate in that mode?

Btw. if we went with the per vma fd approach then you would get this
feature automatically because map_files would refer to file backed
mappings while map_anon could refer only to anonymous mappings.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Konstantin Khlebnikov @ 2019-05-28  6:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <20190527143926.GF1658@dhcp22.suse.cz>

On 27.05.2019 17:39, Michal Hocko wrote:
> On Mon 27-05-19 16:21:56, Michal Hocko wrote:
>> On Mon 27-05-19 16:12:23, Michal Hocko wrote:
>>> [Cc linux-api. Please always cc this list when proposing a new user
>>>   visible api. Keeping the rest of the email intact for reference]
>>>
>>> On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
>> [...]
>>>> This implements manual kswapd-style memory reclaim initiated by userspace.
>>>> It reclaims both physical memory and cgroup pages. It works in context of
>>>> task who calls syscall madvise thus cpu time is accounted correctly.
>>
>> I do not follow. Does this mean that the madvise always reclaims from
>> the memcg the process is member of?
> 
> OK, I've had a quick look at the implementation (the semantic should be
> clear from the patch descrition btw.) and it goes all the way up the
> hierarchy and finally try to impose the same limit to the global state.
> This doesn't really make much sense to me. For few reasons.
> 
> First of all it breaks isolation where one subgroup can influence a
> different hierarchy via parent reclaim.

madvise(NULL, size, MADV_STOCKPILE) is the same as memory allocation and
freeing immediately, but without pinning memory and provoking oom.

So, there is shouldn't be any isolation or security issues.

At least probably it should be limited with portion of limit (like half)
instead of whole limit as it does now.

> 
> I also have a problem with conflating the global and memcg states. Does
> it really make any sense to have the same target to the global state
> as per-memcg? How are you supposed to use this interface to shrink a
> particular memcg or for the global situation with a proportional
> distribution to all memcgs?

For now this is out of my use cease. This could be done in userspace
with multiple daemons in different contexts and connection between them.
In this case each daemon should apply pressure only its own level.

Also kernel could remember static pressure applied from each cgroup which
fades away when memory is allocated. And each call adds this pressure to
own requests to cooperate with neighbours. But rhight I don't know how to
implement this without over-engineering. Pure userspace solution looks
much better.

> 
> There also doens't seem to be anything about security model for this
> operation. There is no capability check from a quick look. Is it really
> safe to expose such a functionality for a common user?

Yep, it seems save. This is same as memory allocation and freeing.

> 
> Last but not least, I am not really convinced that madvise is a proper
> interface. It stretches the API which is address range based and it has
> per-process implications.
> 

Well, this is silly but semantic could be explained as preparation for
memory allocation via faulting into region. But since it doesn't need
to know exact range starting address could be arbitrary.

Also we employ MADV_POPULATE which implements batched faults into range
for robust memory allocation and undo for MADV_FREE. Will publish later.

^ permalink raw reply

* Re: [PATCH v3 2/3] arch: wire-up close_range()
From: Michael Ellerman @ 2019-05-28  3:43 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	Christian Brauner, linux-api, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-mips, linux-parisc, linuxppc-dev,
	linux-s390, linux-sh, sparclinux, linux-xtensa, linux-arch, x86
In-Reply-To: <20190524111047.6892-3-christian@brauner.io>

Christian Brauner <christian@brauner.io> writes:
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 103655d84b4b..ba2c1f078cbd 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -515,3 +515,4 @@
>  431	common	fsconfig			sys_fsconfig
>  432	common	fsmount				sys_fsmount
>  433	common	fspick				sys_fspick
> +435	common	close_range			sys_close_range

With a minor build fix the selftest passes for me on ppc64le:

  # ./close_range_test 
  1..9
  ok 1 do not allow invalid flag values for close_range()
  ok 2 close_range() from 3 to 53
  ok 3 fcntl() verify closed range from 3 to 53
  ok 4 close_range() from 54 to 95
  ok 5 fcntl() verify closed range from 54 to 95
  ok 6 close_range() from 96 to 102
  ok 7 fcntl() verify closed range from 96 to 102
  ok 8 close_range() closed single file descriptor
  ok 9 fcntl() verify closed single file descriptor
  # Pass 9 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0


Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)

cheers

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Minchan Kim @ 2019-05-28  3:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190527124411.GC1658@dhcp22.suse.cz>

On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote:
> On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > > [cc linux-api]
> > > > > 
> > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > > In this configuration, userspace could handle different strategy for each
> > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > > until memory pressure happens.
> > > > > > 
> > > > > > To support such strategy easier, this patch introduces
> > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > > of (int behavior).
> > > > > 
> > > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > > filtering from the userspace so why do we need any additional filtering?
> > > > > 
> > > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > > either anonymous or file-backed.
> > > > > > 
> > > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > > 
> > > > > OK, so here is the reason you want that. The immediate question is why
> > > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > > work, all right, but less of an API to expose and that itself is a
> > > > > strong argument against.
> > > > 
> > > > What I should do if we don't have such filter option is to enumerate all of
> > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > > which would be painful for 2000+ vmas.
> > > 
> > > Painful is not an argument to add a new user API. If the existing API
> > > suits the purpose then it should be used. If it is not usable, we can
> > > think of a different way.
> > 
> > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> > It's never trivial.
> 
> This is not the only option. Have you tried to simply use
> /proc/<pid>/map_files interface? This will provide you with all the file
> backed mappings.

I compared maps vs. map_files with 3036 file-backed vma.
Test scenario is to dump all of vmas of the process and parse address
ranges.
For map_files, it's easy to parse each address range because directory name
itself is range. However, in case of maps, I need to parse each range
line by line so need to scan all of lines.

(maps cover additional non-file-backed vmas so nr_vma is a little bigger)

performance mode:
map_files: nr_vma 3036 usec 13387
maps     : nr_vma 3078 usec 12923

powersave mode:

map_files: nr_vma 3036 usec 52614
maps     : nr_vma 3078 usec 41089

map_files is slower than maps if we dump all of vmas. I guess directory
operation needs much more jobs(e.g., dentry lookup, instantiation)
compared to maps.

^ permalink raw reply

* Re: [PATCH v3 3/3] tests: add close_range() tests
From: Michael Ellerman @ 2019-05-28  3:22 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	Christian Brauner, linux-api, linux-kselftest
In-Reply-To: <20190524111047.6892-4-christian@brauner.io>

Christian Brauner <christian@brauner.io> writes:
> This adds basic tests for the new close_range() syscall.
> - test that no invalid flags can be passed
> - test that a range of file descriptors is correctly closed
> - test that a range of file descriptors is correctly closed if there there
>   are already closed file descriptors in the range
> - test that max_fd is correctly capped to the current fdtable maximum
>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Dmitry V. Levin <ldv@altlinux.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: linux-api@vger.kernel.org
> Cc: linux-kselftest@vger.kernel.org
> ---
> v1: unchanged
> v2:
> - Christian Brauner <christian@brauner.io>:
>   - verify that close_range() correctly closes a single file descriptor
> v3:
> - Christian Brauner <christian@brauner.io>:
>   - add missing Cc for Shuah
>   - add missing Cc for linux-kselftest

Sorry I replied to v2, but I think my comments still apply:

https://lore.kernel.org/lkml/8736kzqpdm.fsf@concordia.ellerman.id.au/

cheers

^ permalink raw reply

* Re: [PATCH v2 2/2] tests: add close_range() tests
From: Michael Ellerman @ 2019-05-28  2:33 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86, Christian Brauner
In-Reply-To: <20190523154747.15162-3-christian@brauner.io>

Christian Brauner <christian@brauner.io> writes:
> This adds basic tests for the new close_range() syscall.
> - test that no invalid flags can be passed
> - test that a range of file descriptors is correctly closed
> - test that a range of file descriptors is correctly closed if there there
>   are already closed file descriptors in the range
> - test that max_fd is correctly capped to the current fdtable maximum
>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Dmitry V. Levin <ldv@altlinux.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org
> ---
> v1: unchanged
> v2:
> - Christian Brauner <christian@brauner.io>:
>   - verify that close_range() correctly closes a single file descriptor
> ---
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/core/.gitignore       |   1 +
>  tools/testing/selftests/core/Makefile         |   6 +
>  .../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
>  4 files changed, 150 insertions(+)
>  create mode 100644 tools/testing/selftests/core/.gitignore
>  create mode 100644 tools/testing/selftests/core/Makefile
>  create mode 100644 tools/testing/selftests/core/close_range_test.c
>
> diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
> new file mode 100644
> index 000000000000..6e6712ce5817
> --- /dev/null
> +++ b/tools/testing/selftests/core/.gitignore
> @@ -0,0 +1 @@
> +close_range_test
> diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
> new file mode 100644
> index 000000000000..de3ae68aa345
> --- /dev/null
> +++ b/tools/testing/selftests/core/Makefile
> @@ -0,0 +1,6 @@
> +CFLAGS += -g -I../../../../usr/include/ -I../../../../include

Your second -I pulls the unexported kernel headers in, userspace
programs shouldn't include unexported kernel headers.

It breaks the build on powerpc with eg:

  powerpc64le-linux-gnu-gcc -g -I../../../../usr/include/ -I../../../../include    close_range_test.c  -o /output/kselftest/core/close_range_test
  In file included from /usr/powerpc64le-linux-gnu/include/bits/fcntl-linux.h:346,
                   from /usr/powerpc64le-linux-gnu/include/bits/fcntl.h:62,
                   from /usr/powerpc64le-linux-gnu/include/fcntl.h:35,
                   from close_range_test.c:5:
  ../../../../include/linux/falloc.h:13:2: error: unknown type name '__s16'
    __s16  l_type;
    ^~~~~


Did you do that on purpose or just copy it from one of the other
Makefiles? :)

If you're just wanting to get the syscall number when the headers
haven't been exported, I think the best solution is to do eg:

diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index d6e6079d3d53..34c6f02f25de 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -14,6 +14,10 @@

 #include "../kselftest.h"

+#ifndef __NR_close_range
+#define __NR_close_range       435
+#endif
+
 static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
                                  unsigned int flags)
 {


cheers

^ permalink raw reply related

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill A. Shutemov @ 2019-05-27 23:30 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <c3cd3719-0a5e-befe-89f2-328526bb714d@virtuozzo.com>

On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
> > On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
> >> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> >>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >>>> This patchset adds a new syscall, which makes possible
> >>>> to clone a VMA from a process to current process.
> >>>> The syscall supplements the functionality provided
> >>>> by process_vm_writev() and process_vm_readv() syscalls,
> >>>> and it may be useful in many situation.
> >>>
> >>> Kirill, could you explain how the change affects rmap and how it is safe.
> >>>
> >>> My concern is that the patchset allows to map the same page multiple times
> >>> within one process or even map page allocated by child to the parrent.
> >>>
> >>> It was not allowed before.
> >>>
> >>> In the best case it makes reasoning about rmap substantially more difficult.
> >>>
> >>> But I'm worry it will introduce hard-to-debug bugs, like described in
> >>> https://lwn.net/Articles/383162/.
> >>
> >> Andy suggested to unmap PTEs from source page table, and this make the single
> >> page never be mapped in the same process twice. This is OK for my use case,
> >> and here we will just do a small step "allow to inherit VMA by a child process",
> >> which we didn't have before this. If someone still needs to continue the work
> >> to allow the same page be mapped twice in a single process in the future, this
> >> person will have a supported basis we do in this small step. I believe, someone
> >> like debugger may want to have this to make a fast snapshot of a process private
> >> memory (when the task is stopped for a small time to get its memory). But for
> >> me remapping is enough at the moment.
> >>
> >> What do you think about this?
> > 
> > I don't think that unmapping alone will do. Consider the following
> > scenario:
> > 
> > 1. Task A creates and populates the mapping.
> > 2. Task A forks. We have now Task B mapping the same pages, but
> > write-protected.
> > 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
> > 
> > After this Task A will have the same anon pages mapped twice.
> 
> Ah, sure.
> 
> > One possible way out would be to force CoW on all pages in the mapping,
> > before passing the mapping to the new process.
> 
> This will pop all swapped pages up, which is the thing the patchset aims
> to prevent.
> 
> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain
> only chain and which vma->anon_vma_chain contains single entry? This is
> a vma, which were faulted, but its mm never were duplicated (or which
> forks already died).

The requirement for the VMA to be faulted (have any pages mapped) looks
excessive to me, but the general idea may work.

One issue I see is that userspace may not have full control to create such
VMA. vma_merge() can merge the VMA to the next one without any consent
from userspace and you'll get anon_vma inherited from the VMA you've
justed merged with.

I don't have any valid idea on how to get around this.

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH 1/2] fork: add clone6
From: Jann Horn @ 2019-05-27 19:36 UTC (permalink / raw)
  To: Linus Torvalds, Kees Cook, Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, Florian Weimer, Oleg Nesterov,
	Arnd Bergmann, David Howells, Pavel Emelyanov, Andrew Morton,
	Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <CAHk-=wjnbK5ob9JE0H1Ge_R4BL6D0ztsAvrM6DN+S+zyDWE=7A@mail.gmail.com>

+Kees

On Mon, May 27, 2019 at 9:27 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> > Hm, still pondering whether having one unsigned int argument passed
> > through registers that captures all the flags from the old clone() would
> > be a good idea.
>
> That sounds like a reasonable thing to do.
>
> Maybe we could continue to call the old flags CLONE_XYZ and continue
> to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> flags for a new 64-bit flag field that comes in through memory in the
> new clone_args thing?

With the current seccomp model, that would have the unfortunate effect
of making it impossible to filter out new clone flags - which would
likely mean that people who want to sandbox their code would not use
the new clone() because they don't want their sandboxed code to be
able to create time namespaces and whatever other new fancy things
clone() might support in the future. This is why I convinced Christian
to pass flags in registers for the first patch version.

The alternative I see would be to somehow extend seccomp to support
argument structures that are passed in memory - that would probably
require quite a bit of new plumbing though, both in the kernel and in
userspace code that configures seccomp filters.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox