Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Linus Torvalds @ 2019-06-17 18:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <87a7egdqgr.fsf@oldenburg2.str.redhat.com>

On Mon, Jun 17, 2019 at 11:19 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > Unlike the "val[]" thing, I don't think anybody is supposed to access
> > those fields directly.
>
> Well, glibc already calls it __val …

Hmm. If user space already doesn't see the "val[]" array anyway, I
guess we could just do that in the kernel too.

Looking at the glibc headers I have for fds_bits, glibc seems to do
*both* fds_bits[] and __fds_bits[] depending on __USE_XOPEN or not.

Anyway, that all implies to me that we might as well just go the truly
mindless way, and just do the double underscores and not bother with
renaming any files.

I thought people actually might care about the "val[]" name because I
find that in documentation, but since apparently it's already not
visible to user space anyway, that can't be true.

I guess that makes the original patch acceptable, and we should just
do the same thing to fds_bits..

                     Linus

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Florian Weimer @ 2019-06-17 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <CAHk-=wgiZNERDN7p-bsCzzYGRjeqTQw7kJxJnXAHVjqqO8PGrg@mail.gmail.com>

* Linus Torvalds:

> On Mon, Jun 17, 2019 at 11:03 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> There's also __kernel_fd_set in <linux/posix_types.h>.  I may have
>> lumped this up with <asm/posix_types.h>, but it has the same problem.
>
> Hmm.
>
> That one we might be able to just fix by renaming "fds_bits" to "__fds_bits".
>
> Unlike the "val[]" thing, I don't think anybody is supposed to access
> those fields directly.

Well, glibc already calls it __val …

> I think fd_set and friends are now supposed to be in <sys/select.h>
> anyway, and the "it was in <sys/types.h>" is all legacy.

Do you suggest to create a <linux/select.h> header to mirror this?

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Linus Torvalds @ 2019-06-17 18:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <CAHk-=wgiZNERDN7p-bsCzzYGRjeqTQw7kJxJnXAHVjqqO8PGrg@mail.gmail.com>

On Mon, Jun 17, 2019 at 11:13 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That one we might be able to just fix by renaming "fds_bits" to "__fds_bits".

That said, moving it to another file might be the better option anyway.

I think fd_set and friends are now supposed to be in <sys/select.h>
anyway, and the "it was in <sys/types.h>" is all legacy.

                 Linus

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Linus Torvalds @ 2019-06-17 18:13 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <87k1dkdr9c.fsf@oldenburg2.str.redhat.com>

On Mon, Jun 17, 2019 at 11:03 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> There's also __kernel_fd_set in <linux/posix_types.h>.  I may have
> lumped this up with <asm/posix_types.h>, but it has the same problem.

Hmm.

That one we might be able to just fix by renaming "fds_bits" to "__fds_bits".

Unlike the "val[]" thing, I don't think anybody is supposed to access
those fields directly.

Of course, "supposed to" and "does" are two very very different things ;)

                  Linus

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Florian Weimer @ 2019-06-17 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <CAHk-=wjCwnk0nfgCcMYqqX6o9bBrutDtut_fzZ-2VwiZR1y4kw@mail.gmail.com>

* Linus Torvalds:

>> A different approach would rename <asm/posix_types.h> to something more
>> basic, exclude the two structs, and move all internal #includes which do
>> need the structs to the new header.
>
> In fact, I wouldn't even rename <posix_types.h> at all, I'd just make
> sure it's namespace-clean.
>
> I _think_ the only thing causing problems is  '__kernel_fsid_t' due to
> that "val[]" thing, so just remove ity entirely, and add it to
> <statfs.h> instead.

There's also __kernel_fd_set in <linux/posix_types.h>.  I may have
lumped this up with <asm/posix_types.h>, but it has the same problem.

If it's okay to move them both to more natural places (maybe
<asm/statfs.h> and <linux/socket.h>), I think that should work well for
glibc.

However, application code may have to include additional header files.
I think the GCC/LLVM sanitizers currently get __kernel_fd_set from
<linux/posix_types.h> (but I think we discussed it before, they really
shouldn't use this type because it's misleading).

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Linus Torvalds @ 2019-06-17 17:49 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <87sgs8igfj.fsf@oldenburg2.str.redhat.com>

On Mon, Jun 17, 2019 at 4:45 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> I wanted to introduce a new header, <asm/kernel_long_t.h>, and include
> it where the definition of __kernel_long_t is needed, something like
> this (incomplete, untested):

So this doesn't look interesting to me: __kernel_long_t is neither
interesting as a type anyway (it's just a way for user space to
override "long"), nor is it a namespace violation.

So honestly, user space could do whatever it wants for __kernel_long_t anyway.

The thing that I think we should try to fix is just the "val[]" thing, ie

> A different approach would rename <asm/posix_types.h> to something more
> basic, exclude the two structs, and move all internal #includes which do
> need the structs to the new header.

In fact, I wouldn't even rename <posix_types.h> at all, I'd just make
sure it's namespace-clean.

I _think_ the only thing causing problems is  '__kernel_fsid_t' due to
that "val[]" thing, so just remove ity entirely, and add it to
<statfs.h> instead.

And yeah, then we'd need to maybe make sure that the (couple) of
__kernel_fsid_t users properly include that statfs.h file.

Hmm?

               Linus

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Peter Zijlstra @ 2019-06-17 16:24 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening
In-Reply-To: <15401.1559322762@warthog.procyon.org.uk>

On Fri, May 31, 2019 at 06:12:42PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > (and it has already been established that refcount_t doesn't work for
> > > > usage count scenarios)
> > > 
> > > ?
> > > 
> > > Does that mean struct kref doesn't either?
> > 
> > Indeed, since kref is just a pointless wrapper around refcount_t it does
> > not either.
> > 
> > The main distinction between a reference count and a usage count is that
> > 0 means different things. For a refcount 0 means dead. For a usage count
> > 0 is merely unused but valid.
> 
> Ah - I consider the terms interchangeable.
> 
> Take Documentation/filesystems/vfs.txt for instance:
> 
>   dget: open a new handle for an existing dentry (this just increments
> 	the usage count)
> 
>   dput: close a handle for a dentry (decrements the usage count). ...
> 
>   ...
> 
>   d_lookup: look up a dentry given its parent and path name component
> 	It looks up the child of that given name from the dcache
> 	hash table. If it is found, the reference count is incremented
> 	and the dentry is returned. The caller must use dput()
> 	to free the dentry when it finishes using it.
> 
> Here we interchange the terms.
> 
> Or https://www.kernel.org/doc/gorman/html/understand/understand013.html
> which seems to interchange the terms in reference to struct page.

Right, but we have two distinct set of semantics, I figured it makes
sense to have two different names for them. Do you have an alternative
naming scheme we could use?

Or should we better document our distinction between reference and usage
count?

^ permalink raw reply

* [RFC PATCH 1/1] Revert "rseq/selftests: arm: use udf instruction for RSEQ_SIG"
From: Mathieu Desnoyers @ 2019-06-17 15:23 UTC (permalink / raw)
  To: Shuah Khan, Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner,
	Joel Fernandes, Catalin Marinas, Dave Watson, Andi Kleen,
	linux-kselftest, H . Peter Anvin, Chris Lameter, Russell King,
	Michael Kerrisk, Paul E . McKenney, Paul Turner, Boqun Feng,
	Josh Triplett, Steven Rostedt, Ben Maurer, linux-api,
	Andy Lutomirski <lu>

This reverts commit 2b845d4b4acd9422bbb668989db8dc36dfc8f438.

That commit introduces build issues for programs compiled in Thumb mode.
Rather than try to be clever and emit a valid trap instruction on arm32,
which requires special care about big/little endian handling on that
architecture, just emit plain data. Data in the instruction stream is
technically expected on arm32: this is how literal pools are
implemented. Reverting to the prior behavior does exactly that.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Joel Fernandes <joelaf@google.com>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Andi Kleen <andi@firstfloor.org>
CC: linux-kselftest@vger.kernel.org
CC: "H . Peter Anvin" <hpa@zytor.com>
CC: Chris Lameter <cl@linux.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
CC: Paul Turner <pjt@google.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Maurer <bmaurer@fb.com>
CC: linux-api@vger.kernel.org
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
---
 tools/testing/selftests/rseq/rseq-arm.h | 52 ++-------------------------------
 1 file changed, 2 insertions(+), 50 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 84f28f147fb6..5f262c54364f 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -5,54 +5,7 @@
  * (C) Copyright 2016-2018 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
  */
 
-/*
- * RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
- * value 0x5de3. This traps if user-space reaches this instruction by mistake,
- * and the uncommon operand ensures the kernel does not move the instruction
- * pointer to attacker-controlled code on rseq abort.
- *
- * The instruction pattern in the A32 instruction set is:
- *
- * e7f5def3    udf    #24035    ; 0x5de3
- *
- * This translates to the following instruction pattern in the T16 instruction
- * set:
- *
- * little endian:
- * def3        udf    #243      ; 0xf3
- * e7f5        b.n    <7f5>
- *
- * pre-ARMv6 big endian code:
- * e7f5        b.n    <7f5>
- * def3        udf    #243      ; 0xf3
- *
- * ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
- * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
- * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
- * (which match), so there is no need to reverse the endianness of the data
- * representation of the signature. However, the choice between BE32 and BE8
- * is done by the linker, so we cannot know whether code and data endianness
- * will be mixed before the linker is invoked.
- */
-
-#define RSEQ_SIG_CODE	0xe7f5def3
-
-#ifndef __ASSEMBLER__
-
-#define RSEQ_SIG_DATA							\
-	({								\
-		int sig;						\
-		asm volatile ("b 2f\n\t"				\
-			      "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
-			      "2:\n\t"					\
-			      "ldr %[sig], 1b\n\t"			\
-			      : [sig] "=r" (sig));			\
-		sig;							\
-	})
-
-#define RSEQ_SIG	RSEQ_SIG_DATA
-
-#endif
+#define RSEQ_SIG	0x53053053
 
 #define rseq_smp_mb()	__asm__ __volatile__ ("dmb" ::: "memory", "cc")
 #define rseq_smp_rmb()	__asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -125,8 +78,7 @@ do {									\
 		__rseq_str(table_label) ":\n\t"				\
 		".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
 		".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
-		".arm\n\t"						\
-		".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t"		\
+		".word " __rseq_str(RSEQ_SIG) "\n\t"			\
 		__rseq_str(label) ":\n\t"				\
 		teardown						\
 		"b %l[" __rseq_str(abort_label) "]\n\t"
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH] mm, memcg: Report number of memcg caches in slabinfo
From: Waiman Long @ 2019-06-17 14:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-api
In-Reply-To: <20190617143842.GC1492@dhcp22.suse.cz>

On 6/17/19 10:38 AM, Michal Hocko wrote:
> [Cc linux-api]
>
> On Mon 17-06-19 10:21:49, Waiman Long wrote:
>> There are concerns about memory leaks from extensive use of memory
>> cgroups as each memory cgroup creates its own set of kmem caches. There
>> is a possiblity that the memcg kmem caches may remain even after the
>> memory cgroup removal.
>>
>> Therefore, it will be useful to show how many memcg caches are present
>> for each of the kmem caches.
> How is a user going to use that information?  Btw. Don't we have an
> interface to display the number of (dead) cgroups?

The interface to report dead cgroups is for cgroup v2 (cgroup.stat)
only. I don't think there is a way to find that for cgroup v1. Also the
number of memcg kmem caches may not be the same as the number of
memcg's. It can range from 0 to above the number of memcg's.  So it is
an interesting number by itself.

>From the user perspective, if the numbers is way above the number of
memcg's, there is probably something wrong there.

Cheers,
Longman

^ permalink raw reply

* Re: [PATCH] mm, memcg: Report number of memcg caches in slabinfo
From: Michal Hocko @ 2019-06-17 14:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-api
In-Reply-To: <20190617142149.5245-1-longman@redhat.com>

[Cc linux-api]

On Mon 17-06-19 10:21:49, Waiman Long wrote:
> There are concerns about memory leaks from extensive use of memory
> cgroups as each memory cgroup creates its own set of kmem caches. There
> is a possiblity that the memcg kmem caches may remain even after the
> memory cgroup removal.
> 
> Therefore, it will be useful to show how many memcg caches are present
> for each of the kmem caches.

How is a user going to use that information?  Btw. Don't we have an
interface to display the number of (dead) cgroups?

Keeping the rest of the email for the reference.

> As slabinfo reporting code has to iterate
> through all the memcg caches to get the final numbers anyway, there is
> no additional cost in reporting the number of memcg caches available.
> 
> The slabinfo version is bumped up to 2.2 as a new "<num_caches>" column
> is added at the end.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/slab_common.c | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 58251ba63e4a..c7aa47a99b2b 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1308,13 +1308,13 @@ static void print_slabinfo_header(struct seq_file *m)
>  	 * without _too_ many complaints.
>  	 */
>  #ifdef CONFIG_DEBUG_SLAB
> -	seq_puts(m, "slabinfo - version: 2.1 (statistics)\n");
> +	seq_puts(m, "slabinfo - version: 2.2 (statistics)\n");
>  #else
> -	seq_puts(m, "slabinfo - version: 2.1\n");
> +	seq_puts(m, "slabinfo - version: 2.2\n");
>  #endif
>  	seq_puts(m, "# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>");
>  	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
> -	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
> +	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail> <num_caches>");
>  #ifdef CONFIG_DEBUG_SLAB
>  	seq_puts(m, " : globalstat <listallocs> <maxobjs> <grown> <reaped> <error> <maxfreeable> <nodeallocs> <remotefrees> <alienoverflow>");
>  	seq_puts(m, " : cpustat <allochit> <allocmiss> <freehit> <freemiss>");
> @@ -1338,14 +1338,18 @@ void slab_stop(struct seq_file *m, void *p)
>  	mutex_unlock(&slab_mutex);
>  }
>  
> -static void
> +/*
> + * Return number of memcg caches.
> + */
> +static unsigned int
>  memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
>  {
>  	struct kmem_cache *c;
>  	struct slabinfo sinfo;
> +	unsigned int cnt = 0;
>  
>  	if (!is_root_cache(s))
> -		return;
> +		return 0;
>  
>  	for_each_memcg_cache(c, s) {
>  		memset(&sinfo, 0, sizeof(sinfo));
> @@ -1356,17 +1360,20 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
>  		info->shared_avail += sinfo.shared_avail;
>  		info->active_objs += sinfo.active_objs;
>  		info->num_objs += sinfo.num_objs;
> +		cnt++;
>  	}
> +	return cnt;
>  }
>  
>  static void cache_show(struct kmem_cache *s, struct seq_file *m)
>  {
>  	struct slabinfo sinfo;
> +	unsigned int nr_memcg_caches;
>  
>  	memset(&sinfo, 0, sizeof(sinfo));
>  	get_slabinfo(s, &sinfo);
>  
> -	memcg_accumulate_slabinfo(s, &sinfo);
> +	nr_memcg_caches = memcg_accumulate_slabinfo(s, &sinfo);
>  
>  	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
>  		   cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
> @@ -1374,8 +1381,9 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
>  
>  	seq_printf(m, " : tunables %4u %4u %4u",
>  		   sinfo.limit, sinfo.batchcount, sinfo.shared);
> -	seq_printf(m, " : slabdata %6lu %6lu %6lu",
> -		   sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail);
> +	seq_printf(m, " : slabdata %6lu %6lu %6lu %3u",
> +		   sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail,
> +		   nr_memcg_caches);
>  	slabinfo_show_stats(m, s);
>  	seq_putc(m, '\n');
>  }
> -- 
> 2.18.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Thomas Gleixner @ 2019-06-17 12:20 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Dave Martin, Yu-cheng Yu, x86, H. Peter Anvin, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg
In-Reply-To: <87imt4jwpt.fsf@oldenburg2.str.redhat.com>

On Mon, 17 Jun 2019, Florian Weimer wrote:
> * Dave Martin:
> > On Tue, Jun 11, 2019 at 12:31:34PM -0700, Yu-cheng Yu wrote:
> >> We can probably check PT_GNU_PROPERTY first, and fallback (based on ld-linux
> >> version?) to PT_NOTE scanning?
> >
> > For arm64, we can check for PT_GNU_PROPERTY and then give up
> > unconditionally.
> >
> > For x86, we would fall back to PT_NOTE scanning, but this will add a bit
> > of cost to binaries that don't have NT_GNU_PROPERTY_TYPE_0.  The ld.so
> > version doesn't tell you what ELF ABI a given executable conforms to.
> >
> > Since this sounds like it's largely a distro-specific issue, maybe there
> > could be a Kconfig option to turn the fallback PT_NOTE scanning on?
> 
> I'm worried that this causes interop issues similarly to what we see
> with VSYSCALL today.  If we need both and a way to disable it, it should
> be something like a personality flag which can be configured for each
> process tree separately.  Ideally, we'd settle on one correct approach
> (i.e., either always process both, or only process PT_GNU_PROPERTY) and
> enforce that.

Chose one and only the one which makes technically sense and is not some
horrible vehicle.

Everytime we did those 'oh we need to make x fly workarounds' we regretted
it sooner than later.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Florian Weimer @ 2019-06-17 11:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joseph Myers, Arnd Bergmann, Linux API, linux-arch, Netdev,
	Laura Abbott, Paul Burton, Deepa Dinamani,
	Linux List Kernel Mailing
In-Reply-To: <CAHk-=wg4ijSoPq-w7ct_VuZvgHx+tUv_QX-We-62dEwK+AOf2w@mail.gmail.com>

* Linus Torvalds:

> On Fri, Jun 7, 2019 at 11:43 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> On the glibc side, we nowadays deal with this by splitting headers
>> further.  (We used to suppress definitions with macros, but that tended
>> to become convoluted.)  In this case, moving the definition of
>> __kernel_long_t to its own header, so that
>> include/uapi/asm-generic/socket.h can include that should fix it.
>
> I think we should strive to do that on the kernel side too, since
> clearly we shouldn't expose that "val[]" thing in the core posix types
> due to namespace rules, but at the same time I think the patch to
> rename val[] is fundamentally broken too.
>
> Can you describe how you split things (perhaps even with a patch ;)?
> Is this literally the only issue you currently have? Because I'd
> expect similar issues to show up elsewhere too, but who knows.. You
> presumably do.

I wanted to introduce a new header, <asm/kernel_long_t.h>, and include
it where the definition of __kernel_long_t is needed, something like
this (incomplete, untested):

diff --git a/arch/sparc/include/uapi/asm/posix_types.h b/arch/sparc/include/uapi/asm/posix_types.h
index f139e0048628..6510d7538605 100644
--- a/arch/sparc/include/uapi/asm/posix_types.h
+++ b/arch/sparc/include/uapi/asm/posix_types.h
@@ -8,6 +8,8 @@
 #ifndef __SPARC_POSIX_TYPES_H
 #define __SPARC_POSIX_TYPES_H
 
+#include <asm/kernel_long_t.h>
+
 #if defined(__sparc__) && defined(__arch64__)
 /* sparc 64 bit */
 
@@ -19,10 +21,6 @@ typedef unsigned short         __kernel_old_gid_t;
 typedef int		       __kernel_suseconds_t;
 #define __kernel_suseconds_t __kernel_suseconds_t
 
-typedef long		__kernel_long_t;
-typedef unsigned long	__kernel_ulong_t;
-#define __kernel_long_t __kernel_long_t
-
 struct __kernel_old_timeval {
 	__kernel_long_t tv_sec;
 	__kernel_suseconds_t tv_usec;
diff --git a/arch/x86/include/uapi/asm/kernel_long_t.h b/arch/x86/include/uapi/asm/kernel_long_t.h
new file mode 100644
index 000000000000..ed3bff40e1e8
--- /dev/null
+++ b/arch/x86/include/uapi/asm/kernel_long_t.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __KERNEL__
+# ifdef defined(__x86_64__) && defined(__ILP32__)
+#  include <asm/kernel_long_t_x32.h>
+# else
+#  include <asm-generic/kernel_long_t.h>
+# endif
+#endif
diff --git a/arch/x86/include/uapi/asm/kernel_long_t_x32.h b/arch/x86/include/uapi/asm/kernel_long_t_x32.h
new file mode 100644
index 000000000000..a71cbce7e966
--- /dev/null
+++ b/arch/x86/include/uapi/asm/kernel_long_t_x32.h
@@ -0,0 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ASM_X86_KERNEL_LONG_T_X32_H
+#define _ASM_X86_KERNEL_LONG_T_X32_H
+typedef long long __kernel_long_t;
+typedef unsigned long long __kernel_ulong_t;
+#endif /* _ASM_X86_KERNEL_LONG_T_X32_H */
diff --git a/arch/x86/include/uapi/asm/posix_types_x32.h b/arch/x86/include/uapi/asm/posix_types_x32.h
index f60479b07fc8..92c7af21da9e 100644
--- a/arch/x86/include/uapi/asm/posix_types_x32.h
+++ b/arch/x86/include/uapi/asm/posix_types_x32.h
@@ -11,10 +11,6 @@
  *
  */
 
-typedef long long __kernel_long_t;
-typedef unsigned long long __kernel_ulong_t;
-#define __kernel_long_t __kernel_long_t
-
 #include <asm/posix_types_64.h>
 
 #endif /* _ASM_X86_POSIX_TYPES_X32_H */
diff --git a/include/uapi/asm-generic/kernel_long_t.h b/include/uapi/asm-generic/kernel_long_t.h
new file mode 100644
index 000000000000..649a97a8c304
--- /dev/null
+++ b/include/uapi/asm-generic/kernel_long_t.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __ASM_GENERIC_KERNEL_LONG_T_H
+#define __ASM_GENERIC_KERNEL_LONG_T_H
+
+typedef long		__kernel_long_t;
+typedef unsigned long	__kernel_ulong_t;
+
+#endif /* __ASM_GENERIC_POSIX_TYPES_H */
diff --git a/include/uapi/asm-generic/posix_types.h b/include/uapi/asm-generic/posix_types.h
index f0733a26ebfc..2715ba4599bd 100644
--- a/include/uapi/asm-generic/posix_types.h
+++ b/include/uapi/asm-generic/posix_types.h
@@ -11,10 +11,7 @@
  * architectures, so that you can override them.
  */
 
-#ifndef __kernel_long_t
-typedef long		__kernel_long_t;
-typedef unsigned long	__kernel_ulong_t;
-#endif
+#include <asm/kernel_long_t.h>
 
 #ifndef __kernel_ino_t
 typedef __kernel_ulong_t __kernel_ino_t;

Additional architectures need conversion as well, but I think this
suggests where this is going.  Would that be acceptable?

A different approach would rename <asm/posix_types.h> to something more
basic, exclude the two structs, and move all internal #includes which do
need the structs to the new header.  A new <asm/posix_types.h> would
include the renamed header and add back the two structs, for
compatibility.

For a less strict definition of compatibility, it would also be possible
to introduce <asm/fsid_t.h> (for __kernel_fsid_t) and <linux/fd_set.h>
(for __kernel_fd_set), and remove the definition of those from
<asm/posix_types.h>.

The other question is whether this __kernel_long_t dependency in
<asm/socket.h> is even valid because it makes the constants SO_RCVTIMEO
etc. unusable in a preprocessor expression (although POSIX does not make
such a requirement as far as I can see).

Thanks,
Florian

^ permalink raw reply related

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Florian Weimer @ 2019-06-17 11:08 UTC (permalink / raw)
  To: Dave Martin
  Cc: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190612093238.GQ28398@e103592.cambridge.arm.com>

* Dave Martin:

> On Tue, Jun 11, 2019 at 12:31:34PM -0700, Yu-cheng Yu wrote:
>> On Tue, 2019-06-11 at 12:41 +0100, Dave Martin wrote:
>> > On Mon, Jun 10, 2019 at 07:24:43PM +0200, Florian Weimer wrote:
>> > > * Yu-cheng Yu:
>> > > 
>> > > > To me, looking at PT_GNU_PROPERTY and not trying to support anything is a
>> > > > logical choice.  And it breaks only a limited set of toolchains.
>> > > > 
>> > > > I will simplify the parser and leave this patch as-is for anyone who wants
>> > > > to
>> > > > back-port.  Are there any objections or concerns?
>> > > 
>> > > Red Hat Enterprise Linux 8 does not use PT_GNU_PROPERTY and is probably
>> > > the largest collection of CET-enabled binaries that exists today.
>> > 
>> > For clarity, RHEL is actively parsing these properties today?
>> > 
>> > > My hope was that we would backport the upstream kernel patches for CET,
>> > > port the glibc dynamic loader to the new kernel interface, and be ready
>> > > to run with CET enabled in principle (except that porting userspace
>> > > libraries such as OpenSSL has not really started upstream, so many
>> > > processes where CET is particularly desirable will still run without
>> > > it).
>> > > 
>> > > I'm not sure if it is a good idea to port the legacy support if it's not
>> > > part of the mainline kernel because it comes awfully close to creating
>> > > our own private ABI.
>> > 
>> > I guess we can aim to factor things so that PT_NOTE scanning is
>> > available as a fallback on arches for which the absence of
>> > PT_GNU_PROPERTY is not authoritative.
>> 
>> We can probably check PT_GNU_PROPERTY first, and fallback (based on ld-linux
>> version?) to PT_NOTE scanning?
>
> For arm64, we can check for PT_GNU_PROPERTY and then give up
> unconditionally.
>
> For x86, we would fall back to PT_NOTE scanning, but this will add a bit
> of cost to binaries that don't have NT_GNU_PROPERTY_TYPE_0.  The ld.so
> version doesn't tell you what ELF ABI a given executable conforms to.
>
> Since this sounds like it's largely a distro-specific issue, maybe there
> could be a Kconfig option to turn the fallback PT_NOTE scanning on?

I'm worried that this causes interop issues similarly to what we see
with VSYSCALL today.  If we need both and a way to disable it, it should
be something like a personality flag which can be configured for each
process tree separately.  Ideally, we'd settle on one correct approach
(i.e., either always process both, or only process PT_GNU_PROPERTY) and
enforce that.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCHv4 17/28] x86/vdso: Switch image on setns()/unshare()/clone()
From: Dmitry Safonov @ 2019-06-16 17:51 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Safonov
  Cc: linux-kernel, Adrian Reber, Andrei Vagin, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86, Andrei Vagin
In-Reply-To: <alpine.DEB.2.21.1906141603500.1722@nanos.tec.linutronix.de>

On 6/14/19 3:05 PM, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
>>  
>> +#ifdef CONFIG_TIME_NS
>> +int vdso_join_timens(struct task_struct *task)
>> +{
>> +	struct mm_struct *mm = task->mm;
>> +	struct vm_area_struct *vma;
>> +
>> +	if (down_write_killable(&mm->mmap_sem))
>> +		return -EINTR;
>> +
>> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>> +		unsigned long size = vma->vm_end - vma->vm_start;
>> +
>> +		if (vma_is_special_mapping(vma, &vvar_mapping) ||
>> +		    vma_is_special_mapping(vma, &vdso_mapping))
>> +			zap_page_range(vma, vma->vm_start, size);
>> +	}
>> +
>> +	up_write(&mm->mmap_sem);
>> +	return 0;
>> +}
>> +#else /* CONFIG_TIME_NS */
>> +int vdso_join_timens(struct task_struct *task)
>> +{
>> +	return -ENXIO;
>> +}
> 
> Is that else path really required? The callsite is only compiled when
> CONFIG_TIME_NS is enabled, right?

Oh, yes - will drop this.

Thanks,
          Dmitry

^ permalink raw reply

* Re: [PATCHv4 15/28] x86/vdso: Add offsets page in vvar
From: Dmitry Safonov @ 2019-06-16 17:49 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86, Andrei Vagin
In-Reply-To: <alpine.DEB.2.21.1906141553070.1722@nanos.tec.linutronix.de>

On 6/14/19 2:58 PM, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
>>  
>> +#ifdef CONFIG_TIME_NS
>> +notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec *ts)
>> +{
>> +	struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
>> +	struct timespec64 *offset64;
>> +
>> +	switch (clk) {
>> +	case CLOCK_MONOTONIC:
>> +	case CLOCK_MONOTONIC_COARSE:
>> +	case CLOCK_MONOTONIC_RAW:
>> +		offset64 = &timens->monotonic;
>> +		break;
>> +	case CLOCK_BOOTTIME:
>> +		offset64 = &timens->boottime;
>> +	default:
>> +		return;
>> +	}
>> +
>> +	ts->tv_nsec += offset64->tv_nsec;
>> +	ts->tv_sec += offset64->tv_sec;
>> +	if (ts->tv_nsec >= NSEC_PER_SEC) {
>> +		ts->tv_nsec -= NSEC_PER_SEC;
>> +		ts->tv_sec++;
>> +	}
>> +	if (ts->tv_nsec < 0) {
>> +		ts->tv_nsec += NSEC_PER_SEC;
>> +		ts->tv_sec--;
>> +	}
> 
> I had to think twice why adding the offset (which can be negative) can
> never result in negative time being returned. A comment explaining this
> would be appreciated.
> 
> As I'm planning to merge Vincezos VDSO consolidation into 5.3, can you
> please start to work on top of his series, which should be available as
> final v7 next week hopefully.

Yes, will rebase on the top of his series.

Thanks much,
          Dmitry

^ permalink raw reply

* Re: [PATCHv4 09/28] timens: Shift /proc/uptime
From: Dmitry Safonov @ 2019-06-16 17:48 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Safonov
  Cc: linux-kernel, Adrian Reber, Andrei Vagin, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1906141549560.1722@nanos.tec.linutronix.de>

On 6/14/19 2:50 PM, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
> 
> Again, please use the usual prefix and bolt not everything to
> timens. timens: is the proper prefix for the actual time namespace core
> code.

Yep, will do.

Thanks,
          Dmitry

^ permalink raw reply

* Re: [PATCHv4 07/28] posix-timers/timens: Take into account clock offsets
From: Dmitry Safonov @ 2019-06-16 17:45 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1906141538240.1722@nanos.tec.linutronix.de>

On 6/14/19 2:42 PM, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
> 
>> Subject: posix-timers/timens: Take into account clock offsets
> 
> Please avoid that '/timens' appendix. It's not really a new subsystem or
> subfunction of posix-timers.
> 
> posix-timers: Add time namespace support to common_timer_set()

Ok

> 
>> From: Andrei Vagin <avagin@gmail.com>
>>
>> Wire timer_settime() syscall into time namespace virtualization.
> 
> Please explain why this only affects common_timer_set() and not any other
> incarnation along with an explanation why only ABSTIME timers need to be
> converted.

Will do.

Thanks,
          Dmitry

^ permalink raw reply

* Re: [PATCHv4 06/28] timerfd/timens: Take into account ns clock offsets
From: Dmitry Safonov @ 2019-06-16 17:43 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1906141534090.1722@nanos.tec.linutronix.de>

On 6/14/19 2:37 PM, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
>> ---
>>  fs/timerfd.c                   |  3 +++
>>  include/linux/time_namespace.h | 18 ++++++++++++++++++
>>  kernel/time_namespace.c        | 27 +++++++++++++++++++++++++++
>>  3 files changed, 48 insertions(+)
> 
> Again, please split that into:
> 
>    1) Introduce the new function
> 
>    2) Make use of it

Will do

> 
>> diff --git a/fs/timerfd.c b/fs/timerfd.c
>> index 6a6fc8aa1de7..9b0c2f65e7e8 100644
>> --- a/fs/timerfd.c
>> +++ b/fs/timerfd.c
>> @@ -26,6 +26,7 @@
>>  #include <linux/syscalls.h>
>>  #include <linux/compat.h>
>>  #include <linux/rcupdate.h>
>> +#include <linux/time_namespace.h>
>>  
>>  struct timerfd_ctx {
>>  	union {
>> @@ -196,6 +197,8 @@ static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
>>  	}
>>  
>>  	if (texp != 0) {
>> +		if (flags & TFD_TIMER_ABSTIME)
>> +			texp = timens_ktime_to_host(clockid, texp);
>>  		if (isalarm(ctx)) {
>>  			if (flags & TFD_TIMER_ABSTIME)
>>  				alarm_start(&ctx->t.alarm, texp);
>> diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
>> index 1dda8af6b9fe..d32b55fad953 100644
>> --- a/include/linux/time_namespace.h
>> +++ b/include/linux/time_namespace.h
>> @@ -56,6 +56,19 @@ static inline void timens_add_boottime(struct timespec64 *ts)
>>                  *ts = timespec64_add(*ts, ns_offsets->boottime);
>>  }
>>  
>> +ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
>> +				struct timens_offsets *offsets);
>> +static inline ktime_t timens_ktime_to_host(clockid_t clockid, ktime_t tim)
>> +{
>> +	struct timens_offsets *offsets = current->nsproxy->time_ns->offsets;
>> +
>> +	if (!offsets) /* fast-path for the root time namespace */
> 
> Can you please avoid tail comments. They break the reading flow. Aside of
> that I don't see the value of documenting the obvious.
> 
>> +ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim, struct timens_offsets *ns_offsets)
> 
> Please line break the arguments
> 
> ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
> 				struct timens_offsets *ns_offsets)

Sure

> 
>> +{
>> +	ktime_t koff;
>> +
>> +	switch (clockid) {
>> +	case CLOCK_MONOTONIC:
>> +		koff = timespec64_to_ktime(ns_offsets->monotonic);
>> +		break;
>> +	case CLOCK_BOOTTIME:
>> +	case CLOCK_BOOTTIME_ALARM:
>> +		koff = timespec64_to_ktime(ns_offsets->boottime);
>> +		break;
>> +	default:
>> +		return tim;
>> +	}
>> +
>> +	/* tim - off has to be in [0, KTIME_MAX) */
> 
> Please be more elaborate why the below conditions can happen at all.
> 
>> +	if (tim < koff)
>> +		tim = 0;
>> +	else if (KTIME_MAX - tim < -koff)
>> +		tim = KTIME_MAX;
>> +	else
>> +		tim = ktime_sub(tim, koff);

Thanks,
          Dmitry

^ permalink raw reply

* [PATCH NOTFORMERGE 5/5] mm/madvise: allow KSM hints for remote API
From: Oleksandr Natalenko @ 2019-06-16  8:58 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, linux-api
In-Reply-To: <20190616085835.953-1-oleksandr@redhat.com>

It all began with the fact that KSM works only on memory that is marked
by madvise(). And the only way to get around that is to either:

  * use LD_PRELOAD; or
  * patch the kernel with something like UKSM or PKSM.

(i skip ptrace can of worms here intentionally)

To overcome this restriction, lets employ a new remote madvise API. This
can be used by some small userspace helper daemon that will do auto-KSM
job for us.

I think of two major consumers of remote KSM hints:

  * hosts, that run containers, especially similar ones and especially in
    a trusted environment, sharing the same runtime like Node.js;

  * heavy applications, that can be run in multiple instances, not
    limited to opensource ones like Firefox, but also those that cannot be
    modified since they are binary-only and, maybe, statically linked.

Speaking of statistics, more numbers can be found in the very first
submission, that is related to this one [1]. For my current setup with
two Firefox instances I get 100 to 200 MiB saved for the second instance
depending on the amount of tabs.

1 FF instance with 15 tabs:

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   410

2 FF instances, second one has 12 tabs (all the tabs are different):

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   592

At the very moment I do not have specific numbers for containerised
workload, but those should be comparable in case the containers share
similar/same runtime.

[1] https://lore.kernel.org/patchwork/patch/1012142/

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 mm/madvise.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 84f899b1b6da..e8f9c49794a3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -991,6 +991,8 @@ process_madvise_behavior_valid(int behavior)
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
 		return true;

 	default:
-- 
2.22.0

^ permalink raw reply related

* [PATCH NOTFORMERGE 4/5] mm/madvise: employ mmget_still_valid for write lock
From: Oleksandr Natalenko @ 2019-06-16  8:58 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, linux-api
In-Reply-To: <20190616085835.953-1-oleksandr@redhat.com>

Do the very same trick as we already do since 04f5866e41fb. KSM hints
will require locking mmap_sem for write since they modify vm_flags, so
for remote KSM hinting this additional check is needed.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 mm/madvise.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 9755340da157..84f899b1b6da 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1049,6 +1049,8 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
 	if (write) {
 		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
+		if (current->mm != mm && !mmget_still_valid(mm))
+			goto skip_mm;
 	} else {
 		down_read(&mm->mmap_sem);
 	}
@@ -1099,6 +1101,7 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
 	}
 out:
 	blk_finish_plug(&plug);
+skip_mm:
 	if (write)
 		up_write(&mm->mmap_sem);
 	else
-- 
2.22.0

^ permalink raw reply related

* [PATCH NOTFORMERGE 3/5] mm: include uio.h to madvise.c
From: Oleksandr Natalenko @ 2019-06-16  8:58 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, linux-api
In-Reply-To: <20190616085835.953-1-oleksandr@redhat.com>

I couldn't compile it w/o this header.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 mm/madvise.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 70aeb54f3e1c..9755340da157 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -25,6 +25,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/uio.h>
 
 #include <asm/tlb.h>
 
-- 
2.22.0

^ permalink raw reply related

* [PATCH NOTFORMERGE 2/5] mm: revert madvise_inject_error line split
From: Oleksandr Natalenko @ 2019-06-16  8:58 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, linux-api
In-Reply-To: <20190616085835.953-1-oleksandr@redhat.com>

Just to highlight it after our conversation.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 mm/madvise.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index edb7184f665c..70aeb54f3e1c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1041,8 +1041,7 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
 
 #ifdef CONFIG_MEMORY_FAILURE
 	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
-		return madvise_inject_error(behavior,
-					start, start + len_in);
+		return madvise_inject_error(behavior, start, start + len_in);
 #endif
 
 	write = madvise_need_mmap_write(behavior);
-- 
2.22.0

^ permalink raw reply related

* [PATCH NOTFORMERGE 1/5] mm: rename madvise_core to madvise_common
From: Oleksandr Natalenko @ 2019-06-16  8:58 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, linux-api
In-Reply-To: <20190616085835.953-1-oleksandr@redhat.com>

"core" usually means something very different within the kernel land,
thus lets just follow the way it is handled in mutexes, rw_semaphores
etc and name common things as "_common".

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 mm/madvise.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 94d782097afd..edb7184f665c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -998,7 +998,7 @@ process_madvise_behavior_valid(int behavior)
 }
 
 /*
- * madvise_core - request behavior hint to address range of the target process
+ * madvise_common - request behavior hint to address range of the target process
  *
  * @task: task_struct got behavior hint, not giving the hint
  * @mm: mm_struct got behavior hint, not giving the hint
@@ -1009,7 +1009,7 @@ process_madvise_behavior_valid(int behavior)
  * @task could be a zombie leader if it calls sys_exit so accessing mm_struct
  * via task->mm is prohibited. Please use @mm insetad of task->mm.
  */
-static int madvise_core(struct task_struct *task, struct mm_struct *mm,
+static int madvise_common(struct task_struct *task, struct mm_struct *mm,
 			unsigned long start, size_t len_in, int behavior)
 {
 	unsigned long end, tmp;
@@ -1132,7 +1132,7 @@ static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
 	return ret;
 }
 
-static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm,
+static int process_madvise_common(struct task_struct *tsk, struct mm_struct *mm,
 				int *behaviors,
 				struct iov_iter *iter,
 				const struct iovec *range_vec,
@@ -1144,7 +1144,7 @@ static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm,
 	for (i = 0; i < riovcnt && iov_iter_count(iter); i++) {
 		err = -EINVAL;
 		if (process_madvise_behavior_valid(behaviors[i]))
-			err = madvise_core(tsk, mm,
+			err = madvise_common(tsk, mm,
 				(unsigned long)range_vec[i].iov_base,
 				range_vec[i].iov_len, behaviors[i]);
 
@@ -1220,7 +1220,7 @@ static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm,
  */
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return madvise_core(current, current->mm, start, len_in, behavior);
+	return madvise_common(current, current->mm, start, len_in, behavior);
 }
 
 
@@ -1252,7 +1252,7 @@ SYSCALL_DEFINE3(process_madvise, int, pidfd,
 
 	/*
 	 * We don't support cookie to gaurantee address space atomicity yet.
-	 * Once we implment cookie, process_madvise_core need to hold mmap_sme
+	 * Once we implment cookie, process_madvise_common need to hold mmap_sme
 	 * during entire operation to guarantee atomicity.
 	 */
 	if (params.cookie != 0)
@@ -1316,7 +1316,7 @@ SYSCALL_DEFINE3(process_madvise, int, pidfd,
 		goto release_task;
 	}
 
-	ret = process_madvise_core(task, mm, behaviors, &iter, iov_r, nr_elem);
+	ret = process_madvise_common(task, mm, behaviors, &iter, iov_r, nr_elem);
 	mmput(mm);
 release_task:
 	put_task_struct(task);
-- 
2.22.0

^ permalink raw reply related

* [PATCH NOTFORMERGE 0/5] Extend remote madvise API to KSM hints
From: Oleksandr Natalenko @ 2019-06-16  8:58 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, linux-api

Hi, Minchan.

This is a set of commits based on our discussion on your submission [1].

First 2 implement minor suggestions just for you to not forget to take
them into account.

uio.h inclusion was needed for me to be able to compile your series
successfully. Also please note I had to enable "Transparent Hugepage
Support" as well as "Enable idle page tracking" options, otherwise the
build failed. I guess this can be addressed by you better since the
errors are introduced with MADV_COLD introduction.

Last 2 commits are the actual KSM hints enablement. The first one
implements additional check for the case where the mmap_sem is taken for
write, and the second one just allows KSM hints to be used by the remote
interface.

I'm not Cc'ing else anyone except two mailing lists to not distract
people unnecessarily. If you are fine with this addition, please use it
for your next iteration of process_madvise(), and then you'll Cc all the
people needed.

Thanks.

[1] https://lore.kernel.org/lkml/20190531064313.193437-1-minchan@kernel.org/

Oleksandr Natalenko (5):
  mm: rename madvise_core to madvise_common
  mm: revert madvise_inject_error line split
  mm: include uio.h to madvise.c
  mm/madvise: employ mmget_still_valid for write lock
  mm/madvise: allow KSM hints for remote API

 mm/madvise.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

-- 
2.22.0

^ permalink raw reply

* Re: [PATCH v4 14/16] ext4: add basic fs-verity support
From: Theodore Ts'o @ 2019-06-15 15:31 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Darrick J . Wong, linux-api, Dave Chinner, linux-f2fs-devel,
	linux-fscrypt, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190606155205.2872-15-ebiggers@kernel.org>

On Thu, Jun 06, 2019 at 08:52:03AM -0700, Eric Biggers wrote:
> +/*
> + * Format of ext4 verity xattr.  This points to the location of the verity
> + * descriptor within the file data rather than containing it directly because
> + * the verity descriptor *must* be encrypted when ext4 encryption is used.  But,
> + * ext4 encryption does not encrypt xattrs.
> + */
> +struct fsverity_descriptor_location {
> +	__le32 version;
> +	__le32 size;
> +	__le64 pos;
> +};

What's the benefit of storing the location in an xattr as opposed to
just keying it off the end of i_size, rounded up to next page size (or
64k) as I had suggested earlier?

Using an xattr burns xattr space, which is a limited resource, and it
adds some additional code complexity.  Does the benefits outweigh the
added complexity?

						- Ted

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox