Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/2] fork: add clone3
From: Christian Brauner @ 2019-06-08  8:15 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds
  Cc: jannh, Serge E. Hallyn, keescook, fweimer, oleg, arnd, dhowells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api
In-Reply-To: <20190606214645.GA31599@mail.hallyn.com>

On Thu, Jun 06, 2019 at 04:46:45PM -0500, Serge Hallyn wrote:
> On Tue, Jun 04, 2019 at 06:09:43PM +0200, Christian Brauner wrote:
> > This adds the clone3 system call.
> > 
> > As mentioned several times already (cf. [7], [8]) here's the promised
> > patchset for clone3().
> > 
> > We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> > free flag from clone().
> > 
> > Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> > at Linux Plumber Conference last year and has been sent out and reviewed
> > (cf. [5]). It is expected that it will go upstream in the not too distant
> > future. However, it relies on the addition of the CLONE_NEWTIME flag to
> > clone(). The only other good candidate - CLONE_DETACHED - is currently not
> > recyclable as we have identified at least two large or widely used
> > codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> > CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> > blocked. clone3() has the advantage that it will unblock this patchset
> > again. In general, clone3() is extensible and allows for the implementation
> > of new features.
> > 
> > The idea is to keep clone3() very simple and close to the original clone(),
> > specifically, to keep on supporting old clone()-based workloads.
> > We know there have been various creative proposals how a new process
> > creation syscall or even api is supposed to look like. Some people even
> > going so far as to argue that the traditional fork()+exec() split should be
> > abandoned in favor of an in-kernel version of spawn(). Independent of
> > whether or not we personally think spawn() is a good idea this patchset has
> > and does not want to have anything to do with this.
> > One stance we take is that there's no real good alternative to
> > clone()+exec() and we need and want to support this model going forward;
> > independent of spawn().
> > The following requirements guided clone3():
> > - bump the number of available flags
> > - move arguments that are currently passed as separate arguments
> >   in clone() into a dedicated struct clone_args
> >   - choose a struct layout that is easy to handle on 32 and on 64 bit
> >   - choose a struct layout that is extensible
> >   - give new flags that currently need to abuse another flag's dedicated
> >     return argument in clone() their own dedicated return argument
> >     (e.g. CLONE_PIDFD)
> >   - use a separate kernel internal struct kernel_clone_args that is
> >     properly typed according to current kernel conventions in fork.c and is
> >     different from  the uapi struct clone_args
> > - port _do_fork() to use kernel_clone_args so that all process creation
> >   syscalls such as fork(), vfork(), clone(), and clone3() behave identical
> >   (Arnd suggested, that we can probably also port do_fork() itself in a
> >    separate patchset.)
> > - ease of transition for userspace from clone() to clone3()
> >   This very much means that we do *not* remove functionality that userspace
> >   currently relies on as the latter is a good way of creating a syscall
> >   that won't be adopted.
> > - do not try to be clever or complex: keep clone3() as dumb as possible
> > 
> > In accordance with Linus suggestions (cf. [11]), clone3() has the following
> > signature:
> > 
> > /* uapi */
> > struct clone_args {
> >         __aligned_u64 flags;
> >         __aligned_u64 pidfd;
> >         __aligned_u64 child_tid;
> >         __aligned_u64 parent_tid;
> >         __aligned_u64 exit_signal;
> >         __aligned_u64 stack;
> >         __aligned_u64 stack_size;
> >         __aligned_u64 tls;
> > };
> > 
> > /* kernel internal */
> > struct kernel_clone_args {
> >         u64 flags;
> >         int __user *pidfd;
> >         int __user *child_tid;
> >         int __user *parent_tid;
> >         int exit_signal;
> >         unsigned long stack;
> >         unsigned long stack_size;
> >         unsigned long tls;
> > };
> > 
> > long sys_clone3(struct clone_args __user *uargs, size_t size)
> > 
> > clone3() cleanly supports all of the supported flags from clone() and thus
> > all legacy workloads.
> > The advantage of sticking close to the old clone() is the low cost for
> > userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> > pthreads) are based on the clone() syscall. With the new clone3() syscall
> > supporting all of the old workloads and opening up the ability to add new
> > features should make switching to it for userspace more appealing. In
> > essence, glibc can just write a simple wrapper to switch from clone() to
> > clone3().
> > 
> > There has been some interest in this patchset already. We have received a
> > patch from the CRIU corner for clone3() that would set the PID/TID of a
> > restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
> > 
> > /* User visible differences to legacy clone() */
> > - CLONE_DETACHED will cause EINVAL with clone3()
> > - CSIGNAL is deprecated
> >   It is superseeded by a dedicated "exit_signal" argument in struct
> >   clone_args freeing up space for additional flags.
> >   This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
> > 
> > /* References */
> > [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> > [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> > [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> > [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> > [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
> > [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
> > [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
> > [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
> > [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
> > [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
> > [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Christian Brauner <christian@brauner.io>
> 
> Acked-by: Serge Hallyn <serge@hallyn.com>

This also carries an Ack by Arnd and there don't seem to be technical
issues anymore.
So I'm going to move this over into my for-next branch targeting 5.3 to
see some testing.

Thanks!
Christian

> 
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: David Howells <dhowells@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Adrian Reber <adrian@lisas.de>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Florian Weimer <fweimer@redhat.com>
> > Cc: linux-api@vger.kernel.org
> > ---
> > v1:
> > - Linus Torvalds <torvalds@linux-foundation.org>:
> >   - redesign based on Linus proposal
> >   - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> > - Arnd Bergmann <arnd@arndb.de>:
> >   - use a single copy_from_user() instead of multiple get_user() calls
> >     since the latter have a constant overhead on some architectures
> >   - a range of other tweaks and suggestions
> > v2:
> > - Linus Torvalds <torvalds@linux-foundation.org>,
> >   Andrei Vagin <avagin@gmail.com>:
> >   - replace CSIGNAL flag with dedicated exit_signal argument in struct
> >     clone_args
> > - Christian Brauner <christian@brauner.io>:
> >   - improve naming for some struct clone_args members
> > v3:
> > - Arnd Bergmann <arnd@arndb.de>:
> >   - replace memset with constructor for clarity and better object code
> >   - call flag verification function clone3_flags_valid() on
> >     kernel_clone_args instead of clone_args
> >   - remove __ARCH_WANT_SYS_CLONE ifdefine around sys_clone3()
> > - Christian Brauner <christian@brauner.io>:
> >   - replace clone3_flags_valid() with clone3_args_valid() and call in
> >     clone3() directly rather than in copy_clone_args_from_user()
> >     This cleanly separates copying the args from userspace from the
> >     verification whether those args are sane.
> > - David Howells <dhowells@redhat.com>:
> >   - align new struct member assignments with tabs
> >   - replace CLONE_MAX by with a non-uapi exported CLONE_LEGACY_FLAGS and
> >     define it as  0xffffffffULL for clarity
> >   - make copy_clone_args_from_user() noinline
> >   - avoid assigning to local variables from struct kernel_clone_args
> >     members in cases where it makes sense
> > ---
> >  arch/x86/ia32/sys_ia32.c   |  12 ++-
> >  include/linux/sched/task.h |  17 +++-
> >  include/linux/syscalls.h   |   4 +
> >  include/uapi/linux/sched.h |  16 +++
> >  kernel/fork.c              | 201 ++++++++++++++++++++++++++++---------
> >  5 files changed, 199 insertions(+), 51 deletions(-)

^ permalink raw reply

* Re: [PATCH v2 for 5.2 08/12] rseq/selftests: arm: use udf instruction for RSEQ_SIG
From: Mathieu Desnoyers @ 2019-06-08 15:52 UTC (permalink / raw)
  To: Will Deacon, Russell King
  Cc: linux-kernel, linux-api, Thomas Gleixner, Peter Zijlstra,
	Paul E . McKenney, Boqun Feng, shuah, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas,
	Michael Kerrisk <mtk.manpages@
In-Reply-To: <1975020343.35751.1559844149532.JavaMail.zimbra@efficios.com>

----- On Jun 6, 2019, at 8:02 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On May 3, 2019, at 3:38 PM, Mathieu Desnoyers
> mathieu.desnoyers@efficios.com wrote:
> 
>> Use udf as the guard instruction for the restartable sequence abort
>> handler.
>> 
>> Previously, the chosen signature was not a valid instruction, based
>> on the assumption that it could always sit in a literal pool. However,
>> there are compilation environments in which literal pools are not
>> available, for instance execute-only code. Therefore, we need to
>> choose a signature value that is also a valid instruction.
>> 
>> Handle compiling with -mbig-endian on ARMv6+, which generates binaries
>> with mixed code vs data endianness (little endian code, big endian
>> data).
>> 
>> Else mismatch between code endianness for the generated signatures and
>> data endianness for the RSEQ_SIG parameter passed to the rseq
>> registration will trigger application segmentation faults when the
>> kernel try to abort rseq critical sections.
>> 
>> Prior to ARMv6, -mbig-endian generates big-endian code and data, so
>> endianness should not be reversed in that case.
> 
> And of course it cannot be that easy. This breaks when building in
> thumb mode (-mthumb). Output from librseq arm32 build [1] (code similar
> to what is found in the rseq selftests):
> 
>  CC       rseq.lo
> /tmp/ccu6Jw1b.s: Assembler messages:
> /tmp/ccu6Jw1b.s:297: Error: cannot determine Thumb instruction size. Use
> .inst.n/.inst.w instead
> /tmp/ccu6Jw1b.s:490: Error: cannot determine Thumb instruction size. Use
> .inst.n/.inst.w instead
> Makefile:460: recipe for target 'rseq.lo' failed
> 
> This appears to be caused by a missing .arm directive in RSEQ_SIG_DATA.
> Fixing with:
> 
> -               asm volatile ("b 2f\n\t"                                \
> +               asm volatile (".arm\n\t"                                \
> +                             "b 2f\n\t"                                \
> 
> gets the build to go further, but breaks at:
> 
>  CC       basic_percpu_ops_test.o
> /tmp/ccpHOMHZ.s: Assembler messages:
> /tmp/ccpHOMHZ.s:148: Error: misaligned branch destination
> /tmp/ccpHOMHZ.s:956: Error: misaligned branch destination
> Makefile:378: recipe for target 'basic_percpu_ops_test.o' failed
> 
> I suspect it's caused by the change from:
> 
> -               ".word " __rseq_str(RSEQ_SIG) "\n\t"                    \
> 
> to
> 
> +               ".arm\n\t"                                              \
> +               ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t"               \
> 
> which changes the mode from thumb to arm for the rest of the
> inline asm within __RSEQ_ASM_DEFINE_ABORT. Better yet, there appears
> to be no way to save the arm/thumb state and restore it afterwards.
> 
> I'm really starting to wonder if we should go our of our way to try
> to get this signature to be a valid instruction on arm32. Perhaps
> we should consider going back to use ".word" on arm32 so it ensures
> it uses data endianness (which matches the parameter received by the
> sys_rseq system call), let objdump and friends print it as a literal
> pool (which it is), and just choose an instruction which has little
> chances to appear for the cases we care about between ARM32 BE, LE
> and THUMB. Perhaps a 32-bit palindrome ? Bonus points if this is a
> trap instruction in common configurations for odd-cases-debugging
> purposes.

So I'm not particularly proud of the result, but I found a rather
ugly way to figure out if we are currently in thumb mode within an
inline asm, and restore that mode: test the length of a nop
instruction with a ".if" asm statement.

Do we want to go for this kind of approach, or should we revert
back to a ".word" and accept that the rseq signature before the
abort handler will be seen as data rather than an instruction
on arm32 ?

Is there a better way to do this ?

Thanks,

Mathieu

diff --git a/include/rseq/rseq-arm.h b/include/rseq/rseq-arm.h
index 1ce9231..b6c36dd 100644
--- a/include/rseq/rseq-arm.h
+++ b/include/rseq/rseq-arm.h
@@ -43,7 +43,14 @@
        ({                                                              \
                int sig;                                                \
                asm volatile ("b 2f\n\t"                                \
+                             "3:\n\t"                                  \
+                             "nop\n\t"                                 \
+                             "4:\n\t"                                  \
+                             ".arm\n\t"                                \
                              "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
+                             ".if ((4b - 3b) == 2)\n\t"                \
+                             ".thumb\n\t"                              \
+                             ".endif\n\t"                              \
                              "2:\n\t"                                  \
                              "ldr %[sig], 1b\n\t"                      \
                              : [sig] "=r" (sig));                      \
@@ -125,8 +132,14 @@ do {                                                                       \
                __rseq_str(table_label) ":\n\t"                         \
                ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
                ".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
+               "333:\n\t"                                              \
+               "nop\n\t"                                               \
+               "444:\n\t"                                              \
                ".arm\n\t"                                              \
                ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t"               \
+               ".if ((444b - 333b) == 2)\n\t"                          \
+               ".thumb\n\t"                                            \
+               ".endif\n\t"                                            \
                __rseq_str(label) ":\n\t"                               \
                teardown                                                \
                "b %l[" __rseq_str(abort_label) "]\n\t"






-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply related

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Pavel Machek @ 2019-06-08 20:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook
In-Reply-To: <b3de4110-5366-fdc7-a960-71dea543a42f@intel.com>

Hi!

> > I've no idea what the kernel should do; since you failed to answer the
> > question what happens when you point this to garbage.
> > 
> > Does it then fault or what?
> 
> Yeah, I think you'll fault with a rather mysterious CR2 value since
> you'll go look at the instruction that faulted and not see any
> references to the CR2 value.
> 
> I think this new MSR probably needs to get included in oops output when
> CET is enabled.
> 
> Why don't we require that a VMA be in place for the entire bitmap?
> Don't we need a "get" prctl function too in case something like a JIT is
> running and needs to find the location of this bitmap to set bits itself?
> 
> Or, do we just go whole-hog and have the kernel manage the bitmap
> itself. Our interface here could be:
> 
> 	prctl(PR_MARK_CODE_AS_LEGACY, start, size);
> 
> and then have the kernel allocate and set the bitmap for those code
> locations.

For the record, that sounds like a better interface than userspace knowing
about the bitmap formats...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply

* Re: [PATCH 02/13] uapi: General notification ring definitions [ver #4]
From: Randy Dunlap @ 2019-06-09  4:35 UTC (permalink / raw)
  To: David Howells, Darrick J. Wong
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <29222.1559922719@warthog.procyon.org.uk>

On 6/7/19 8:51 AM, David Howells wrote:
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 

>>> +	__u32			info;
>>> +#define WATCH_INFO_OVERRUN	0x00000001	/* Event(s) lost due to overrun */
>>> +#define WATCH_INFO_ENOMEM	0x00000002	/* Event(s) lost due to ENOMEM */
>>> +#define WATCH_INFO_RECURSIVE	0x00000004	/* Change was recursive */
>>> +#define WATCH_INFO_LENGTH	0x000001f8	/* Length of record / sizeof(watch_notification) */
>>
>> This is a mask, isn't it?  Could we perhaps have some helpers here?
>> Something along the lines of...
>>
>> #define WATCH_INFO_LENGTH_MASK	0x000001f8
>> #define WATCH_INFO_LENGTH_SHIFT	3
>>
>> static inline size_t watch_notification_length(struct watch_notification *wn)
>> {
>> 	return (wn->info & WATCH_INFO_LENGTH_MASK) >> WATCH_INFO_LENGTH_SHIFT *
>> 			sizeof(struct watch_notification);
>> }
>>
>> static inline struct watch_notification *watch_notification_next(
>> 		struct watch_notification *wn)
>> {
>> 	return wn + ((wn->info & WATCH_INFO_LENGTH_MASK) >>
>> 			WATCH_INFO_LENGTH_SHIFT);
>> }
> 
> No inline functions in UAPI headers, please.  I'd love to kill off the ones
> that we have, but that would break things.

Hi David,

What is the problem with inline functions in UAPI headers?

>> ...so that we don't have to opencode all of the ring buffer walking
>> magic and stuff?
> 
> There'll end up being a small userspace library, I think.

>>> +};
>>> +
>>> +#define WATCH_LENGTH_SHIFT	3
>>> +
>>> +struct watch_queue_buffer {
>>> +	union {
>>> +		/* The first few entries are special, containing the
>>> +		 * ring management variables.
>>
>> The first /two/ entries, correct?
> 
> Currently two.
> 
>> Also, weird multiline comment style.
> 
> Not really.

Yes really.

>>> +		 */

It does not match the preferred coding style for multi-line comments
according to coding-style.rst.


thanks.
-- 
~Randy

^ permalink raw reply

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-10 10:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton
In-Reply-To: <20190604203841.GC228607@google.com>

Hi Joel,

On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use.  This could reduce
> > workingset eviction so it ends up increasing performance.
> > 
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> > 
> > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > 
> > 	active file page -> inactive file LRU
> > 	active anon page -> inacdtive anon LRU
> > 
> > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > files's head because MADV_COLD is a little bit different symantic.
> > MADV_FREE means it's okay to discard when the memory pressure because
> > the content of the page is *garbage* so freeing such pages is almost zero
> > overhead since we don't need to swap out and access afterward causes just
> > minor fault. Thus, it would make sense to put those freeable pages in
> > inactive file LRU to compete other used-once pages. Even, it could
> > give a bonus to make them be reclaimed on swapless system. However,
> > MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> > in the end. So it's better to move inactive anon's LRU list, not file LRU.
> > Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> > if system doesn't have a swap device.
> > 
> > All of error rule is same with MADV_DONTNEED.
> > 
> > Note:
> > This hint works with only private pages(IOW, page_mapcount(page) < 2)
> > because shared page could have more chance to be accessed from other
> > processes sharing the page although the caller reset the reference bits.
> > It ends up preventing the reclaim of the page and wastes CPU cycle.
> > 
> > * RFCv2
> >  * add more description - mhocko
> > 
> > * RFCv1
> >  * renaming from MADV_COOL to MADV_COLD - hannes
> > 
> > * internal review
> >  * use clear_page_youn in deactivate_page - joelaf
> >  * Revise the description - surenb
> >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/page-flags.h             |   1 +
> >  include/linux/page_idle.h              |  15 ++++
> >  include/linux/swap.h                   |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/internal.h                          |   2 +-
> >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> >  mm/oom_kill.c                          |   2 +-
> >  mm/swap.c                              |  43 +++++++++
> >  8 files changed, 176 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 9f8712a4b1a5..58b06654c8dd 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> >  TESTPAGEFLAG(Young, young, PF_ANY)
> >  SETPAGEFLAG(Young, young, PF_ANY)
> >  TESTCLEARFLAG(Young, young, PF_ANY)
> > +CLEARPAGEFLAG(Young, young, PF_ANY)
> >  PAGEFLAG(Idle, idle, PF_ANY)
> >  #endif
> >  
> > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > index 1e894d34bdce..f3f43b317150 100644
> > --- a/include/linux/page_idle.h
> > +++ b/include/linux/page_idle.h
> > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> >  	SetPageYoung(page);
> >  }
> >  
> > +static inline void clear_page_young(struct page *page)
> > +{
> > +	ClearPageYoung(page);
> > +}
> > +
> >  static inline bool test_and_clear_page_young(struct page *page)
> >  {
> >  	return TestClearPageYoung(page);
> > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> >  }
> >  
> > +static void clear_page_young(struct page *page)
> > +{
> > +	struct page_ext *page_ext = lookup_page_ext(page);
> > +
> > +	if (unlikely(!page_ext))
> > +		return;
> > +
> > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > +}
> > +
> >  static inline bool test_and_clear_page_young(struct page *page)
> >  {
> >  	struct page_ext *page_ext = lookup_page_ext(page);
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index de2c67a33b7e..0ce997edb8bb 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> >  extern void lru_add_drain_all(void);
> >  extern void rotate_reclaimable_page(struct page *page);
> >  extern void deactivate_file_page(struct page *page);
> > +extern void deactivate_page(struct page *page);
> >  extern void mark_page_lazyfree(struct page *page);
> >  extern void swap_setup(void);
> >  
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index bea0278f65ab..1190f4e7f7b9 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -43,6 +43,7 @@
> >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> > +#define MADV_COLD	5		/* deactivatie these pages */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 9eeaf2b95166..75a4f96ec0fb 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> >  		unsigned long floor, unsigned long ceiling);
> >  
> > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> >  {
> >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> >  }
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 628022e674a7..ab158766858a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_REMOVE:
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> > +	case MADV_COLD:
> >  	case MADV_FREE:
> >  		return 0;
> >  	default:
> > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	unsigned long next;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		if (is_huge_zero_pmd(*pmd))
> > +			goto huge_unlock;
> > +
> > +		page = pmd_page(*pmd);
> > +		if (page_mapcount(page) > 1)
> > +			goto huge_unlock;
> > +
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> > +
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > +		deactivate_page(page);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +regular_page:
> > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +
> > +		if (pte_none(ptent))
> > +			continue;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (page_mapcount(page) > 1)
> > +			continue;
> > +
> > +		ptep_test_and_clear_young(vma, addr, pte);
> 
> Wondering here how it interacts with idle page tracking. Here since young
> flag is cleared by the cold hint, page_referenced_one() or
> page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> if it was previously set since it does not know any more that a page was
> actively referenced.

ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
doesn't affect.

> 
> Should you also be clearing the page-idle flag here if the ptep young/accessed
> bit was previously set, just so that page-idle tracking works smoothly when
> this hint is concurrently applied?

deactivate_page will remove PG_young bit so that the page will be reclaimed.
Do I miss your point?

> 
> > +		deactivate_page(page);
> > +	}
> > +
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> > +
> > +static void madvise_cold_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	struct mm_walk cool_walk = {
> > +		.pmd_entry = madvise_cold_pte_range,
> > +		.mm = vma->vm_mm,
> > +	};
> > +
> > +	tlb_start_vma(tlb, vma);
> > +	walk_page_range(addr, end, &cool_walk);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> > +static long madvise_cold(struct vm_area_struct *vma,
> > +			struct vm_area_struct **prev,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	*prev = vma;
> > +	if (!can_madv_lru_vma(vma))
> > +		return -EINVAL;
> > +
> > +	lru_add_drain();
> > +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> > +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> > +
> > +	return 0;
> > +}
> > +
> >  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  				unsigned long end, struct mm_walk *walk)
> >  
> > @@ -519,7 +627,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  				  int behavior)
> >  {
> >  	*prev = vma;
> > -	if (!can_madv_dontneed_vma(vma))
> > +	if (!can_madv_lru_vma(vma))
> >  		return -EINVAL;
> >  
> >  	if (!userfaultfd_remove(vma, start, end)) {
> > @@ -541,7 +649,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  			 */
> >  			return -ENOMEM;
> >  		}
> > -		if (!can_madv_dontneed_vma(vma))
> > +		if (!can_madv_lru_vma(vma))
> >  			return -EINVAL;
> >  		if (end > vma->vm_end) {
> >  			/*
> > @@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		return madvise_remove(vma, prev, start, end);
> >  	case MADV_WILLNEED:
> >  		return madvise_willneed(vma, prev, start, end);
> > +	case MADV_COLD:
> > +		return madvise_cold(vma, prev, start, end);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > @@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> >  	case MADV_FREE:
> > +	case MADV_COLD:
> >  #ifdef CONFIG_KSM
> >  	case MADV_MERGEABLE:
> >  	case MADV_UNMERGEABLE:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5a58778c91d4..f73d5f5145f0 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >  
> >  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > -		if (!can_madv_dontneed_vma(vma))
> > +		if (!can_madv_lru_vma(vma))
> >  			continue;
> >  
> >  		/*
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 7b079976cbec..cebedab15aa2 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -47,6 +47,7 @@ int page_cluster;
> >  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> > +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
> >  #ifdef CONFIG_SMP
> >  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> > @@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> >  	update_page_reclaim_stat(lruvec, file, 0);
> >  }
> >  
> > +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> > +			    void *arg)
> > +{
> > +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> > +		int file = page_is_file_cache(page);
> > +		int lru = page_lru_base_type(page);
> > +
> > +		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> > +		ClearPageActive(page);
> > +		ClearPageReferenced(page);
> > +		clear_page_young(page);
> 
> I was curious, how does the above interact with clear_refs?

clear_refs couldn't find the access any longer. However, it's the
intention because we couldn't reclaim the page without it.

> 
> It appears that a range that is marked as cold will appear to be unreferenced
> (since referenced flag is cleared) even though it may have been referenced in
> the past. Very least, I believe this behavior should be documented somewhere.

Okay, I will add some comment.

Thanks.

^ permalink raw reply

* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
From: Minchan Kim @ 2019-06-10 10:12 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, hdanton
In-Reply-To: <20190605132728.mihzzw7galqjf5uz@butterfly.localdomain>

Hi Oleksandr,

On Wed, Jun 05, 2019 at 03:27:28PM +0200, Oleksandr Natalenko wrote:
< snip >
> > > > > >  	write = madvise_need_mmap_write(behavior);
> > > > > >  	if (write) {
> > > > > > -		if (down_write_killable(&current->mm->mmap_sem))
> > > > > > +		if (down_write_killable(&mm->mmap_sem))
> > > > > >  			return -EINTR;
> > > > > 
> > > > > Do you still need that trick with mmget_still_valid() here?
> > > > > Something like:
> > > > 
> > > > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > > > vma->vm_flags, technically, we don't need it if I understand
> > > > correctly. Right?
> > > 
> > > I'd expect so, yes. But.
> > > 
> > > Since we want this interface to be universal and to be able to cover
> > > various needs, and since my initial intention with working in this
> > > direction involved KSM, I'd ask you to enable KSM hints too, and once
> > > (and if) that happens, the work there is done under write lock, and
> > > you'll need this trick to be applied.
> > > 
> > > Of course, I can do that myself later in a subsequent patch series once
> > > (and, again, if) your series is merged, but, maybe, we can cover this
> > > already especially given the fact that KSM hinting is a relatively easy
> > > task in this pile. I did some preliminary tests with it, and so far no
> > > dragons have started to roar.
> > 
> > Then, do you mind sending a patch based upon this series to expose
> > MADV_MERGEABLE to process_madvise? It will have the right description
> > why you want to have such feature which I couldn't provide since I don't
> > have enough material to write the motivation. And the patch also could
> > include the logic to prevent coredump race, which is more proper since
> > finally we need to hold mmap_sem write-side lock, finally.
> > I will pick it up and will rebase since then.
> 
> Sure, I can. Would you really like to have it being based on this exact
> revision, or I should wait till you deal with MADV_COLD & Co and re-iterate
> this part again?

I'm okay you to send your patch against this revision. I'm happy to
include it when I start a new thread for process_madvise discussion
after resolving MADV_COLD|PAGEOUT.

Thanks.

^ permalink raw reply

* [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Minchan Kim @ 2019-06-10 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb, Minchan Kim

This patch is part of previous series:
https://lore.kernel.org/lkml/20190531064313.193437-1-minchan@kernel.org/T/#u
Originally, it was created for external madvise hinting feature.

https://lkml.org/lkml/2019/5/31/463
Michal wanted to separte the discussion from external hinting interface
so this patchset includes only first part of my entire patchset

  - introduce MADV_COLD and MADV_PAGEOUT hint to madvise.

However, I keep entire description for others for easier understanding
why this kinds of hint was born.

Thanks.

This patchset is against on next-20190530.

Below is description of previous entire patchset.
================= &< =====================

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster so
we are trying to make hot start more likely than cold start.

To increase hot start, Android userspace manages the order that apps should
be killed in a process called ActivityManagerService. ActivityManagerService
tracks every Android app or service that the user could be interacting with
at any time and translates that into a ranked list for lmkd(low memory
killer daemon). They are likely to be killed by lmkd if the system has to
reclaim memory. In that sense they are similar to entries in any other cache.
Those apps are kept alive for opportunistic performance improvements but
those performance improvements will vary based on the memory requirements of
individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a cached
process to be killed(we measured performance swapping out vs. zapping the
memory by killing a process. Unsurprisingly, zapping is 10x times faster
even though we use zram which is much faster than real storage) so kill
from lmkd will often satisfy the high zone watermark, resulting in very
few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they entered the cached state. Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.

To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed when memory pressure rises.

This approach is similar in spirit to madvise(MADV_WONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To solve the concern, this patch introduces new syscall -

    struct pr_madvise_param {
            int size;               /* the size of this structure */
            int cookie;             /* reserved to support atomicity */
            int nr_elem;            /* count of below arrary fields */
            int __user *hints;      /* hints for each range */
            /* to store result of each operation */
            const struct iovec __user *results;
            /* input address ranges */
            const struct iovec __user *ranges;
    };
    
    int process_madvise(int pidfd, struct pr_madvise_param *u_param,
                            unsigned long flags);

The syscall get pidfd to give hints to external process and provides
pair of result/ranges vector arguments so that it could give several
hints to each address range all at once. It also has cookie variable
to support atomicity of the API for address ranges operations. IOW, if
target process changes address space since monitor process has parsed
address ranges via map_files or maps, the API can detect the race so
could cancel entire address space operation. It's not implemented yet.
Daniel Colascione suggested a idea(Please read description in patch[6/6])
and this patchset adds cookie a variable for the future.

- Experiment

We did bunch of testing with several hundreds of real users, not artificial
benchmark on android. We saw about 17% cold start decreasement without any
significant battery/app startup latency issues. And with artificial benchmark
which launches and switching apps, we saw average 7% app launching improvement,
18% less lmkd kill and good stat from vmstat.

A is vanilla and B is process_madvise.

                                       A          B      delta   ratio(%)
               allocstall_dma          0          0          0       0.00
           allocstall_movable       1464        457      -1007     -69.00
            allocstall_normal     263210     190763     -72447     -28.00
             allocstall_total     264674     191220     -73454     -28.00
          compact_daemon_wake      26912      25294      -1618      -7.00
                 compact_fail      17885      14151      -3734     -21.00
         compact_free_scanned 4204766409 3835994922 -368771487      -9.00
             compact_isolated    3446484    2967618    -478866     -14.00
      compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
                compact_stall      19387      15343      -4044     -21.00
              compact_success       1502       1192       -310     -21.00
kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
            kswapd_inodesteal     221635     233093      11458       5.00
 kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
                   nr_dirtied     259934     296476      36542      14.00
  nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
              nr_vmscan_write    1274232    2661733    1387501     108.00
                   nr_written    1514060    2937560    1423500      94.00
                   pageoutrun      67561      55133     -12428     -19.00
                   pgactivate    2335060    1984882    -350178     -15.00
                  pgalloc_dma   13743011   14096463     353452       2.00
              pgalloc_movable          0          0          0       0.00
               pgalloc_normal   18742440   16802065   -1940375     -11.00
                pgalloc_total   32485451   30898528   -1586923      -5.00
                 pgdeactivate    4262210    2930670   -1331540     -32.00
                      pgfault   30812334   31085065     272731       0.00
                       pgfree   33553970   31765164   -1788806      -6.00
                 pginodesteal      33411      15084     -18327     -55.00
                  pglazyfreed          0          0          0       0.00
                   pgmajfault     551312    1508299     956987     173.00
               pgmigrate_fail      43927      29330     -14597     -34.00
            pgmigrate_success    1399851    1203922    -195929     -14.00
                       pgpgin   24141776   19032156   -5109620     -22.00
                      pgpgout     959344    1103316     143972      15.00
                 pgpgoutclean    4639732    3765868    -873864     -19.00
                     pgrefill    4884560    3006938   -1877622     -39.00
                    pgrotated      37828      25897     -11931     -32.00
                pgscan_direct    1456037     957567    -498470     -35.00
       pgscan_direct_throttle          0          0          0       0.00
                pgscan_kswapd    6667767    5047360   -1620407     -25.00
                 pgscan_total    8123804    6004927   -2118877     -27.00
                   pgskip_dma          0          0          0       0.00
               pgskip_movable          0          0          0       0.00
                pgskip_normal      14907      25382      10475      70.00
                 pgskip_total      14907      25382      10475      70.00
               pgsteal_direct    1118986     690215    -428771     -39.00
               pgsteal_kswapd    4750223    3657107   -1093116     -24.00
                pgsteal_total    5869209    4347322   -1521887     -26.00
                       pswpin     417613    1392647     975034     233.00
                      pswpout    1274224    2661731    1387507     108.00
                slabs_scanned   13686905   10807200   -2879705     -22.00
          workingset_activate     668966     569444     -99522     -15.00
       workingset_nodereclaim      38957      32621      -6336     -17.00
           workingset_refault    2816795    2179782    -637013     -23.00
           workingset_restore     294320     168601    -125719     -43.00

pgmajfault is increased by 173% because swapin is increased by 200% by
process_madvise hint. However, swap read based on zram is much cheaper
than file IO in performance point of view and app hot start by swapin is
also cheaper than cold start from the beginning of app which needs many IO
from storage and initialization steps.

Brian Geffon in ChromeOS team had an experiment with process_madvise(2)
Quote form him:
"What I found is that by using process_madvise after a tab has been back
grounded for more than 45 seconds reduced the average tab switch times by
25%! This is a huge result and very obvious validation that process_madvise
hints works well for the ChromeOS use case."

This patchset is against on next-20190607.

Minchan Kim (5):
  mm: introduce MADV_COLD
  mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  mm: account nr_isolated_xxx in [isolate|putback]_lru_page
  mm: introduce MADV_PAGEOUT
  mm: factor out pmd young/dirty bit handling and THP split

 include/linux/huge_mm.h                |   3 -
 include/linux/swap.h                   |   2 +
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/compaction.c                        |   2 -
 mm/gup.c                               |   7 +-
 mm/huge_memory.c                       |  74 -----
 mm/internal.h                          |   2 +-
 mm/khugepaged.c                        |   3 -
 mm/madvise.c                           | 358 ++++++++++++++++++++++++-
 mm/memory-failure.c                    |   3 -
 mm/memory_hotplug.c                    |   4 -
 mm/mempolicy.c                         |   6 +-
 mm/migrate.c                           |  37 +--
 mm/oom_kill.c                          |   2 +-
 mm/swap.c                              |  42 +++
 mm/vmscan.c                            |  86 +++++-
 16 files changed, 486 insertions(+), 147 deletions(-)

-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

^ permalink raw reply

* [PATCH v2 1/5] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-10 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb, Minchan Kim
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

	active file page -> inactive file LRU
	active anon page -> inacdtive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
file LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because
the content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault. Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages. It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied. Even, it could give a bonus to make
them be reclaimed on swapless system. However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost. Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU. Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device. Let's start simpler way without adding
complexity at this moment.

All of error rule is same with MADV_DONTNEED.

* v1
 * remove page_mapcount filter - hannes, mhocko
 * fix idle page handling - joelaf

* RFCv2
 * add more description - mhocko

* RFCv1
 * renaming from MADV_COOL to MADV_COLD - hannes

* internal review
 * use clear_page_youn in deactivate_page - joelaf
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/internal.h                          |   2 +-
 mm/madvise.c                           | 151 ++++++++++++++++++++++++-
 mm/oom_kill.c                          |   2 +-
 mm/swap.c                              |  42 +++++++
 6 files changed, 195 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ef4623f03156..d7b4231eea63 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -47,6 +47,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COLD	5		/* deactivatie these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/internal.h b/mm/internal.h
index e32390802fd3..0d5f720c75ab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -39,7 +39,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
 	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..67c0379f64a7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +308,149 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		pmd_t orig_pmd;
+
+		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		orig_pmd = *pmd;
+		if (is_huge_zero_pmd(orig_pmd))
+			goto huge_unlock;
+
+		if (unlikely(!pmd_present(orig_pmd))) {
+			VM_BUG_ON(thp_migration_supported() &&
+					!is_pmd_migration_entry(orig_pmd));
+			goto huge_unlock;
+		}
+
+		page = pmd_page(orig_pmd);
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			if (page_mapcount(page) != 1)
+				goto huge_unlock;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		if (pmd_young(orig_pmd)) {
+			pmdp_invalidate(vma, addr, pmd);
+			orig_pmd = pmd_mkold(orig_pmd);
+
+			set_pmd_at(mm, addr, pmd, orig_pmd);
+			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		}
+
+		test_and_clear_page_young(page);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	tlb_change_page_size(tlb, PAGE_SIZE);
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
+	arch_enter_lazy_mmu_mode();
+	for (; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (pte_young(ptent)) {
+			ptent = ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+			ptent = pte_mkold(ptent);
+			set_pte_at(mm, addr, pte, ptent);
+			tlb_remove_tlb_entry(tlb, pte, addr);
+		}
+
+		/*
+		 * We are deactivating a page for accelerating reclaiming.
+		 * VM couldn't reclaim the page unless we clear PG_young.
+		 * As a side effect, it makes confuse idle-page tracking
+		 * because they will miss recent referenced history.
+		 */
+		test_and_clear_page_young(page);
+		deactivate_page(page);
+	}
+
+	arch_enter_lazy_mmu_mode();
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cold_walk = {
+		.pmd_entry = madvise_cold_pte_range,
+		.mm = vma->vm_mm,
+		.private = tlb,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cold_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (!can_madv_lru_vma(vma))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -519,7 +663,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  int behavior)
 {
 	*prev = vma;
-	if (!can_madv_dontneed_vma(vma))
+	if (!can_madv_lru_vma(vma))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -541,7 +685,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
@@ -695,6 +839,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +862,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5a58778c91d4..f73d5f5145f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			continue;
 
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 6d153ce4cb8c..7e44f5b50774 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,22 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -590,6 +607,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +644,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -687,6 +728,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

^ permalink raw reply related

* [PATCH v2 2/5] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
From: Minchan Kim @ 2019-06-10 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb, Minchan Kim
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
as default. It is for preventing to reclaim dirty pages when CMA try to
migrate pages. Strictly speaking, we don't need it because CMA didn't allow
to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.

Moreover, it has a problem to prevent anonymous pages's swap out even
though force_reclaim = true in shrink_page_list on upcoming patch.
So this patch makes references's default value to PAGEREF_RECLAIM and
rename force_reclaim with ignore_references to make it more clear.

This is a preparatory work for next patch.

* RFCv1
 * use ignore_referecnes as parameter name - hannes

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..0973a46a0472 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      struct reclaim_stat *stat,
-				      bool force_reclaim)
+				      bool ignore_references)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		int may_enter_fs;
-		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;
 		unsigned int nr_pages;
 
@@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!force_reclaim)
+		if (!ignore_references)
 			references = page_check_references(page, sc);
 
 		switch (references) {
-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

^ permalink raw reply related

* [PATCH v2 3/5] mm: account nr_isolated_xxx in [isolate|putback]_lru_page
From: Minchan Kim @ 2019-06-10 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb, Minchan Kim
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

The isolate counting is pecpu counter so it would be not huge gain
to work them by batch. Rather than complicating to make them batch,
let's make it more stright-foward via adding the counting logic
into [isolate|putback]_lru_page API.

* v1
 * fix accounting bug - Hillf

Link: http://lkml.kernel.org/r/20190531165927.GA20067@cmpxchg.org
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/compaction.c     |  2 --
 mm/gup.c            |  7 +------
 mm/khugepaged.c     |  3 ---
 mm/memory-failure.c |  3 ---
 mm/memory_hotplug.c |  4 ----
 mm/mempolicy.c      |  6 +-----
 mm/migrate.c        | 37 ++++++++-----------------------------
 mm/vmscan.c         | 22 ++++++++++++++++------
 8 files changed, 26 insertions(+), 58 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..c6591682deda 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -982,8 +982,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		inc_node_page_state(page,
-				NR_ISOLATED_ANON + page_is_file_cache(page));
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
diff --git a/mm/gup.c b/mm/gup.c
index 63ac50e48072..2d9a9bc358c7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1360,13 +1360,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 					drain_allow = false;
 				}
 
-				if (!isolate_lru_page(head)) {
+				if (!isolate_lru_page(head))
 					list_add_tail(&head->lru, &cma_page_list);
-					mod_node_page_state(page_pgdat(head),
-							    NR_ISOLATED_ANON +
-							    page_is_file_cache(head),
-							    hpage_nr_pages(head));
-				}
 			}
 		}
 	}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..3359df994fb4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -503,7 +503,6 @@ void __khugepaged_exit(struct mm_struct *mm)
 
 static void release_pte_page(struct page *page)
 {
-	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -602,8 +601,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
-		inc_node_page_state(page,
-				NR_ISOLATED_ANON + page_is_file_cache(page));
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b9cc36a284f9..430946cf9c8a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1796,9 +1796,6 @@ static int __soft_offline_page(struct page *page, int flags)
 		 * so use !__PageMovable instead for LRU page's mapping
 		 * cannot have PAGE_MAPPING_MOVABLE.
 		 */
-		if (!__PageMovable(page))
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-						page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a88c5f334e5a..a41bea24d0c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1390,10 +1390,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
 		if (!ret) { /* Success */
 			list_add_tail(&page->lru, &source);
-			if (!__PageMovable(page))
-				inc_node_page_state(page, NR_ISOLATED_ANON +
-						    page_is_file_cache(page));
-
 		} else {
 			pr_warn("failed to isolate pfn %lx\n", pfn);
 			dump_page(page, "isolation failed");
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fdcb73536319..89bb25fe7553 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,12 +948,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 	 * Avoid migrating a page that is shared with others.
 	 */
 	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
-		if (!isolate_lru_page(head)) {
+		if (!isolate_lru_page(head))
 			list_add_tail(&head->lru, pagelist);
-			mod_node_page_state(page_pgdat(head),
-				NR_ISOLATED_ANON + page_is_file_cache(head),
-				hpage_nr_pages(head));
-		}
 	}
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 572b4bc85d76..5583324c01e7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -190,8 +190,6 @@ void putback_movable_pages(struct list_head *l)
 			unlock_page(page);
 			put_page(page);
 		} else {
-			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_cache(page), -hpage_nr_pages(page));
 			putback_lru_page(page);
 		}
 	}
@@ -1181,10 +1179,17 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		return -ENOMEM;
 
 	if (page_count(page) == 1) {
+		bool is_lru = !__PageMovable(page);
+
 		/* page was freed from under us. So we are done. */
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
-		if (unlikely(__PageMovable(page))) {
+		if (likely(is_lru))
+			mod_node_page_state(page_pgdat(page),
+						NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						-hpage_nr_pages(page));
+		else {
 			lock_page(page);
 			if (!PageMovable(page))
 				__ClearPageIsolated(page);
@@ -1210,15 +1215,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 * restored.
 		 */
 		list_del(&page->lru);
-
-		/*
-		 * Compaction can migrate also non-LRU pages which are
-		 * not accounted to NR_ISOLATED_*. They can be recognized
-		 * as __PageMovable
-		 */
-		if (likely(!__PageMovable(page)))
-			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_cache(page), -hpage_nr_pages(page));
 	}
 
 	/*
@@ -1572,9 +1568,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 
 		err = 0;
 		list_add_tail(&head->lru, pagelist);
-		mod_node_page_state(page_pgdat(head),
-			NR_ISOLATED_ANON + page_is_file_cache(head),
-			hpage_nr_pages(head));
 	}
 out_putpage:
 	/*
@@ -1890,8 +1883,6 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
-	int page_lru;
-
 	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
 
 	/* Avoid migrating to a node that is nearly full */
@@ -1913,10 +1904,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 		return 0;
 	}
 
-	page_lru = page_is_file_cache(page);
-	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
-				hpage_nr_pages(page));
-
 	/*
 	 * Isolating the page has taken another reference, so the
 	 * caller's reference can be safely dropped without the page
@@ -1971,8 +1958,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
-			dec_node_page_state(page, NR_ISOLATED_ANON +
-					page_is_file_cache(page));
 			putback_lru_page(page);
 		}
 		isolated = 0;
@@ -2002,7 +1987,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	int page_lru = page_is_file_cache(page);
 	unsigned long start = address & HPAGE_PMD_MASK;
 
 	new_page = alloc_pages_node(node,
@@ -2048,8 +2032,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		/* Retake the callers reference and putback on LRU */
 		get_page(page);
 		putback_lru_page(page);
-		mod_node_page_state(page_pgdat(page),
-			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
 
 		goto out_unlock;
 	}
@@ -2099,9 +2081,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-	mod_node_page_state(page_pgdat(page),
-			NR_ISOLATED_ANON + page_lru,
-			-HPAGE_PMD_NR);
 	return isolated;
 
 out_fail:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0973a46a0472..56df55e8afcd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -999,6 +999,9 @@ int remove_mapping(struct address_space *mapping, struct page *page)
 void putback_lru_page(struct page *page)
 {
 	lru_cache_add(page);
+	mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_cache(page),
+				-hpage_nr_pages(page));
 	put_page(page);		/* drop ref from isolate */
 }
 
@@ -1464,6 +1467,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		nr_reclaimed += nr_pages;
 
+		mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						-nr_pages);
 		/*
 		 * Is there need to periodically free_page_list? It would
 		 * appear not as the counts should be low
@@ -1539,7 +1545,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_IGNORE_ACCESS, &dummy_stat, true);
 	list_splice(&clean_pages, page_list);
-	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
 }
 
@@ -1615,6 +1620,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 		 */
 		ClearPageLRU(page);
 		ret = 0;
+		__mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						hpage_nr_pages(page));
 	}
 
 	return ret;
@@ -1746,6 +1754,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				    total_scan, skipped, nr_taken, mode, lru);
 	update_lru_sizes(lruvec, lru, nr_zone_taken);
+
 	return nr_taken;
 }
 
@@ -1794,6 +1803,9 @@ int isolate_lru_page(struct page *page)
 			ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
+			mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						hpage_nr_pages(page));
 		}
 		spin_unlock_irq(&pgdat->lru_lock);
 	}
@@ -1885,6 +1897,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
 		list_move(&page->lru, &lruvec->lists[lru]);
 
+		__mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						-hpage_nr_pages(page));
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
@@ -1962,7 +1977,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
@@ -1988,8 +2002,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	move_pages_to_lru(lruvec, &page_list);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
@@ -2048,7 +2060,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	__count_vm_events(PGREFILL, nr_scanned);
@@ -2117,7 +2128,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGDEACTIVATE, nr_deactivate);
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

^ permalink raw reply related

* [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-06-10 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb, Minchan Kim
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

When a process expects no accesses to a certain memory range
for a long time, it could hint kernel that the pages can be
reclaimed instantly but data should be preserved for future use.
This could reduce workingset eviction so it ends up increasing
performance.

This patch introduces the new MADV_PAGEOUT hint to madvise(2)
syscall. MADV_PAGEOUT can be used by a process to mark a memory
range as not expected to be used for a long time so that kernel
reclaims *any LRU* pages instantly. The hint can help kernel in
deciding which pages to evict proactively.

All of error rule is same with MADV_DONTNEED.

* v1
 * change pte to old and rely on the other's reference - hannes
 * remove page_mapcount to check shared page - mhocko

* RFC v2
 * make reclaim_pages simple via factoring out isolate logic - hannes

* RFCv1
 * rename from MADV_COLD to MADV_PAGEOUT - hannes
 * bail out if process is being killed - Hillf
 * fix reclaim_pages bugs - Hillf

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 161 +++++++++++++++++++++++++
 mm/vmscan.c                            |  58 +++++++++
 4 files changed, 221 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
+extern unsigned long reclaim_pages(struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index d7b4231eea63..f545e159b472 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -48,6 +48,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
 #define MADV_COLD	5		/* deactivatie these pages */
+#define MADV_PAGEOUT	6		/* reclaim these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 67c0379f64a7..3b9d2ba421b1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,7 @@
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
 #include <linux/page-isolation.h>
+#include <linux/page_idle.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
@@ -41,6 +42,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_COLD:
+	case MADV_PAGEOUT:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -451,6 +453,162 @@ static long madvise_cold(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	LIST_HEAD(page_list);
+	struct page *page;
+	int isolated = 0;
+	unsigned long next;
+
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		pmd_t orig_pmd;
+
+		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		orig_pmd = *pmd;
+		if (is_huge_zero_pmd(orig_pmd))
+			goto huge_unlock;
+
+		if (unlikely(!pmd_present(orig_pmd))) {
+			VM_BUG_ON(thp_migration_supported() &&
+					!is_pmd_migration_entry(orig_pmd));
+			goto huge_unlock;
+		}
+
+		page = pmd_page(orig_pmd);
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			if (page_mapcount(page) != 1)
+				goto huge_unlock;
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		if (isolate_lru_page(page))
+			goto huge_unlock;
+
+		if (pmd_young(orig_pmd)) {
+			pmdp_invalidate(vma, addr, pmd);
+			orig_pmd = pmd_mkold(orig_pmd);
+
+			set_pmd_at(mm, addr, pmd, orig_pmd);
+			tlb_remove_tlb_entry(tlb, pmd, addr);
+		}
+
+		ClearPageReferenced(page);
+		test_and_clear_page_young(page);
+		list_add(&page->lru, &page_list);
+huge_unlock:
+		spin_unlock(ptl);
+		reclaim_pages(&page_list);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+regular_page:
+	tlb_change_page_size(tlb, PAGE_SIZE);
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
+	arch_enter_lazy_mmu_mode();
+	for (; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (isolate_lru_page(page))
+			continue;
+
+		isolated++;
+		if (pte_young(ptent)) {
+			ptent = ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+			ptent = pte_mkold(ptent);
+			set_pte_at(mm, addr, pte, ptent);
+			tlb_remove_tlb_entry(tlb, pte, addr);
+		}
+		ClearPageReferenced(page);
+		test_and_clear_page_young(page);
+		list_add(&page->lru, &page_list);
+		if (isolated >= SWAP_CLUSTER_MAX) {
+			arch_leave_lazy_mmu_mode();
+			pte_unmap_unlock(orig_pte, ptl);
+			reclaim_pages(&page_list);
+			isolated = 0;
+			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+			arch_enter_lazy_mmu_mode();
+			orig_pte = pte;
+		}
+	}
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_pte, ptl);
+	reclaim_pages(&page_list);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_pageout_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk pageout_walk = {
+		.pmd_entry = madvise_pageout_pte_range,
+		.mm = vma->vm_mm,
+		.private = tlb,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &pageout_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+
+static long madvise_pageout(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (!can_madv_lru_vma(vma))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -841,6 +999,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COLD:
 		return madvise_cold(vma, prev, start, end);
+	case MADV_PAGEOUT:
+		return madvise_pageout(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -863,6 +1023,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_DONTNEED:
 	case MADV_FREE:
 	case MADV_COLD:
+	case MADV_PAGEOUT:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56df55e8afcd..04061185677f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2136,6 +2136,64 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			nr_deactivate, nr_rotated, sc->priority, file);
 }
 
+unsigned long reclaim_pages(struct list_head *page_list)
+{
+	int nid = -1;
+	unsigned long nr_reclaimed = 0;
+	LIST_HEAD(node_page_list);
+	struct reclaim_stat dummy_stat;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.priority = DEF_PRIORITY,
+		.may_writepage = 1,
+		.may_unmap = 1,
+		.may_swap = 1,
+	};
+
+	while (!list_empty(page_list)) {
+		struct page *page;
+
+		page = lru_to_page(page_list);
+		if (nid == -1) {
+			nid = page_to_nid(page);
+			INIT_LIST_HEAD(&node_page_list);
+		}
+
+		if (nid == page_to_nid(page)) {
+			list_move(&page->lru, &node_page_list);
+			continue;
+		}
+
+		nr_reclaimed += shrink_page_list(&node_page_list,
+						NODE_DATA(nid),
+						&sc, 0,
+						&dummy_stat, false);
+		while (!list_empty(&node_page_list)) {
+			struct page *page = lru_to_page(&node_page_list);
+
+			list_del(&page->lru);
+			putback_lru_page(page);
+		}
+
+		nid = -1;
+	}
+
+	if (!list_empty(&node_page_list)) {
+		nr_reclaimed += shrink_page_list(&node_page_list,
+						NODE_DATA(nid),
+						&sc, 0,
+						&dummy_stat, false);
+		while (!list_empty(&node_page_list)) {
+			struct page *page = lru_to_page(&node_page_list);
+
+			list_del(&page->lru);
+			putback_lru_page(page);
+		}
+	}
+
+	return nr_reclaimed;
+}
+
 /*
  * The inactive anon list should be small enough that the VM never has
  * to do too much work.
-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

^ permalink raw reply related

* [PATCH v2 5/5] mm: factor out pmd young/dirty bit handling and THP split
From: Minchan Kim @ 2019-06-10 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb, Minchan Kim, Kirill A. Shutemov,
	Christopher Lameter
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

Now, there are common part among MADV_COLD|PAGEOUT|FREE to reset
access/dirty bit resetting or split the THP page to handle part
of subpages in the THP page. This patch factor out the common part.

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/huge_mm.h |   3 -
 mm/huge_memory.c        |  74 -------------
 mm/madvise.c            | 234 +++++++++++++++++++++++-----------------
 3 files changed, 135 insertions(+), 176 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7cd5c150c21d..2667e1aa3ce5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -29,9 +29,6 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 					  unsigned long addr,
 					  pmd_t *pmd,
 					  unsigned int flags);
-extern bool madvise_free_huge_pmd(struct mmu_gather *tlb,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, unsigned long addr, unsigned long next);
 extern int zap_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9a6b32..22e20f929463 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1668,80 +1668,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	return 0;
 }
 
-/*
- * Return true if we do MADV_FREE successfully on entire pmd page.
- * Otherwise, return false.
- */
-bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
-		pmd_t *pmd, unsigned long addr, unsigned long next)
-{
-	spinlock_t *ptl;
-	pmd_t orig_pmd;
-	struct page *page;
-	struct mm_struct *mm = tlb->mm;
-	bool ret = false;
-
-	tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
-
-	ptl = pmd_trans_huge_lock(pmd, vma);
-	if (!ptl)
-		goto out_unlocked;
-
-	orig_pmd = *pmd;
-	if (is_huge_zero_pmd(orig_pmd))
-		goto out;
-
-	if (unlikely(!pmd_present(orig_pmd))) {
-		VM_BUG_ON(thp_migration_supported() &&
-				  !is_pmd_migration_entry(orig_pmd));
-		goto out;
-	}
-
-	page = pmd_page(orig_pmd);
-	/*
-	 * If other processes are mapping this page, we couldn't discard
-	 * the page unless they all do MADV_FREE so let's skip the page.
-	 */
-	if (page_mapcount(page) != 1)
-		goto out;
-
-	if (!trylock_page(page))
-		goto out;
-
-	/*
-	 * If user want to discard part-pages of THP, split it so MADV_FREE
-	 * will deactivate only them.
-	 */
-	if (next - addr != HPAGE_PMD_SIZE) {
-		get_page(page);
-		spin_unlock(ptl);
-		split_huge_page(page);
-		unlock_page(page);
-		put_page(page);
-		goto out_unlocked;
-	}
-
-	if (PageDirty(page))
-		ClearPageDirty(page);
-	unlock_page(page);
-
-	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
-		pmdp_invalidate(vma, addr, pmd);
-		orig_pmd = pmd_mkold(orig_pmd);
-		orig_pmd = pmd_mkclean(orig_pmd);
-
-		set_pmd_at(mm, addr, pmd, orig_pmd);
-		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-	}
-
-	mark_page_lazyfree(page);
-	ret = true;
-out:
-	spin_unlock(ptl);
-out_unlocked:
-	return ret;
-}
-
 static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t pgtable;
diff --git a/mm/madvise.c b/mm/madvise.c
index 3b9d2ba421b1..bb1906bb75fd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -310,6 +310,91 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+enum madv_pmdp_reset_t {
+	MADV_PMDP_RESET,	/* pmd was reset successfully */
+	MADV_PMDP_SPLIT,	/* pmd was split */
+	MADV_PMDP_ERROR,
+};
+
+static enum madv_pmdp_reset_t madvise_pmdp_reset_or_split(struct mm_walk *walk,
+				pmd_t *pmd, spinlock_t *ptl,
+				unsigned long addr, unsigned long end,
+				bool young, bool dirty)
+{
+	pmd_t orig_pmd;
+	unsigned long next;
+	struct page *page;
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = walk->mm;
+	struct vm_area_struct *vma = walk->vma;
+	bool reset_young = false;
+	bool reset_dirty = false;
+	enum madv_pmdp_reset_t ret = MADV_PMDP_ERROR;
+
+	orig_pmd = *pmd;
+	if (is_huge_zero_pmd(orig_pmd))
+		return ret;
+
+	if (unlikely(!pmd_present(orig_pmd))) {
+		VM_BUG_ON(thp_migration_supported() &&
+				!is_pmd_migration_entry(orig_pmd));
+		return ret;
+	}
+
+	next = pmd_addr_end(addr, end);
+	page = pmd_page(orig_pmd);
+	if (next - addr != HPAGE_PMD_SIZE) {
+		/*
+		 * THP collapsing is not cheap so only split the page is
+		 * private to the this process.
+		 */
+		if (page_mapcount(page) != 1)
+			return ret;
+		get_page(page);
+		spin_unlock(ptl);
+		lock_page(page);
+		if (!split_huge_page(page))
+			ret = MADV_PMDP_SPLIT;
+		unlock_page(page);
+		put_page(page);
+		return ret;
+	}
+
+	if (young && pmd_young(orig_pmd))
+		reset_young = true;
+	if (dirty && pmd_dirty(orig_pmd))
+		reset_dirty = true;
+
+	/*
+	 * Other process could rely on the PG_dirty for data consistency,
+	 * not pte_dirty so we could reset PG_dirty only when we are owner
+	 * of the page.
+	 */
+	if (reset_dirty) {
+		if (page_mapcount(page) != 1)
+			goto out;
+		if (!trylock_page(page))
+			goto out;
+		if (PageDirty(page))
+			ClearPageDirty(page);
+		unlock_page(page);
+	}
+
+	ret = MADV_PMDP_RESET;
+	if (reset_young || reset_dirty) {
+		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+		pmdp_invalidate(vma, addr, pmd);
+		if (reset_young)
+			orig_pmd = pmd_mkold(orig_pmd);
+		if (reset_dirty)
+			orig_pmd = pmd_mkclean(orig_pmd);
+		set_pmd_at(mm, addr, pmd, orig_pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+	}
+out:
+	return ret;
+}
+
 static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
@@ -319,64 +404,31 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte, *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
-	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
-		pmd_t orig_pmd;
-
-		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
 		if (!ptl)
 			return 0;
 
-		orig_pmd = *pmd;
-		if (is_huge_zero_pmd(orig_pmd))
-			goto huge_unlock;
-
-		if (unlikely(!pmd_present(orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					!is_pmd_migration_entry(orig_pmd));
-			goto huge_unlock;
-		}
-
-		page = pmd_page(orig_pmd);
-		if (next - addr != HPAGE_PMD_SIZE) {
-			int err;
-
-			if (page_mapcount(page) != 1)
-				goto huge_unlock;
-
-			get_page(page);
+		switch (madvise_pmdp_reset_or_split(walk, pmd, ptl, addr, end,
+							true, false)) {
+		case MADV_PMDP_RESET:
 			spin_unlock(ptl);
-			lock_page(page);
-			err = split_huge_page(page);
-			unlock_page(page);
-			put_page(page);
-			if (!err)
-				goto regular_page;
-			return 0;
-		}
-
-		if (pmd_young(orig_pmd)) {
-			pmdp_invalidate(vma, addr, pmd);
-			orig_pmd = pmd_mkold(orig_pmd);
-
-			set_pmd_at(mm, addr, pmd, orig_pmd);
-			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+			page = pmd_page(*pmd);
+			test_and_clear_page_young(page);
+			deactivate_page(page);
+			goto next;
+		case MADV_PMDP_ERROR:
+			spin_unlock(ptl);
+			goto next;
+		case MADV_PMDP_SPLIT:
+			; /* go through */
 		}
-
-		test_and_clear_page_young(page);
-		deactivate_page(page);
-huge_unlock:
-		spin_unlock(ptl);
-		return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-regular_page:
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	flush_tlb_batched_pending(mm);
@@ -414,6 +466,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 
 	arch_enter_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
+next:
 	cond_resched();
 
 	return 0;
@@ -464,70 +517,38 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
 	LIST_HEAD(page_list);
 	struct page *page;
 	int isolated = 0;
-	unsigned long next;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
 
-	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
-		pmd_t orig_pmd;
-
-		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
 		if (!ptl)
 			return 0;
 
-		orig_pmd = *pmd;
-		if (is_huge_zero_pmd(orig_pmd))
-			goto huge_unlock;
-
-		if (unlikely(!pmd_present(orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					!is_pmd_migration_entry(orig_pmd));
-			goto huge_unlock;
-		}
-
-		page = pmd_page(orig_pmd);
-		if (next - addr != HPAGE_PMD_SIZE) {
-			int err;
-
-			if (page_mapcount(page) != 1)
-				goto huge_unlock;
-			get_page(page);
+		switch (madvise_pmdp_reset_or_split(walk, pmd, ptl, addr, end,
+							true, false)) {
+		case MADV_PMDP_RESET:
+			page = pmd_page(*pmd);
 			spin_unlock(ptl);
-			lock_page(page);
-			err = split_huge_page(page);
-			unlock_page(page);
-			put_page(page);
-			if (!err)
-				goto regular_page;
-			return 0;
-		}
-
-		if (isolate_lru_page(page))
-			goto huge_unlock;
-
-		if (pmd_young(orig_pmd)) {
-			pmdp_invalidate(vma, addr, pmd);
-			orig_pmd = pmd_mkold(orig_pmd);
-
-			set_pmd_at(mm, addr, pmd, orig_pmd);
-			tlb_remove_tlb_entry(tlb, pmd, addr);
+			if (isolate_lru_page(page))
+				return 0;
+			ClearPageReferenced(page);
+			test_and_clear_page_young(page);
+			list_add(&page->lru, &page_list);
+			reclaim_pages(&page_list);
+			goto next;
+		case MADV_PMDP_ERROR:
+			spin_unlock(ptl);
+			goto next;
+		case MADV_PMDP_SPLIT:
+			; /* go through */
 		}
-
-		ClearPageReferenced(page);
-		test_and_clear_page_young(page);
-		list_add(&page->lru, &page_list);
-huge_unlock:
-		spin_unlock(ptl);
-		reclaim_pages(&page_list);
-		return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
-regular_page:
+
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	flush_tlb_batched_pending(mm);
@@ -569,6 +590,7 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	reclaim_pages(&page_list);
+next:
 	cond_resched();
 
 	return 0;
@@ -620,12 +642,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte, *pte, ptent;
 	struct page *page;
 	int nr_swap = 0;
-	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
-	if (pmd_trans_huge(*pmd))
-		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		switch (madvise_pmdp_reset_or_split(walk, pmd, ptl, addr, end,
+							true, true)) {
+		case MADV_PMDP_RESET:
+			page = pmd_page(*pmd);
+			spin_unlock(ptl);
+			mark_page_lazyfree(page);
 			goto next;
+		case MADV_PMDP_ERROR:
+			spin_unlock(ptl);
+			goto next;
+		case MADV_PMDP_SPLIT:
+			; /* go through */
+		}
+	}
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
@@ -737,8 +773,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
-	cond_resched();
 next:
+	cond_resched();
 	return 0;
 }
 
-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

^ permalink raw reply related

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Carlos O'Donell @ 2019-06-10 14:43 UTC (permalink / raw)
  To: Florian Weimer, Mathieu Desnoyers
  Cc: Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <87wohzorj0.fsf@oldenburg2.str.redhat.com>

On 6/6/19 7:57 AM, Florian Weimer wrote:
> Let me ask the key question again: Does it matter if code observes the
> rseq area first without kernel support, and then with kernel support?
> If we don't expect any problems immediately, we do not need to worry
> much about the constructor ordering right now.  I expect that over time,
> fixing this properly will become easier.

I just wanted to chime in and say that splitting this into:

* Ownership (__rseq_handled)

* Initialization (__rseq_abi)

Makes sense to me.

I agree we need an answer to this question of ownership but not yet
initialized, to owned and initialized.

I like the idea of having __rseq_handled in ld.so.

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Stephen Smalley @ 2019-06-10 15:21 UTC (permalink / raw)
  To: David Howells, viro
  Cc: linux-usb, linux-security-module, Casey Schaufler,
	Greg Kroah-Hartman, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-kernel, Paul Moore
In-Reply-To: <155991702981.15579.6007568669839441045.stgit@warthog.procyon.org.uk>

On 6/7/19 10:17 AM, David Howells wrote:
> 
> Hi Al,
> 
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
> 
>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>       mount reconfiguration.
> 
>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>       errors (not complete yet).
> 
>   (3) Key/keyring events, such as creating, linking and removal of keys.
> 
>   (4) General device events (single common queue) including:
> 
>       - Block layer events, such as device errors
> 
>       - USB subsystem events, such as device/bus attach/remove, device
>         reset, device errors.
> 
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
> 
> 
> LSM support is included, but controversial:
> 
>   (1) The creds of the process that did the fput() that reduced the refcount
>       to zero are cached in the file struct.
> 
>   (2) __fput() overrides the current creds with the creds from (1) whilst
>       doing the cleanup, thereby making sure that the creds seen by the
>       destruction notification generated by mntput() appears to come from
>       the last fputter.
> 
>   (3) security_post_notification() is called for each queue that we might
>       want to post a notification into, thereby allowing the LSM to prevent
>       covert communications.
> 
>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>       may be set in the first place?  I might need to add a variant per
>       watch-type.
> 
>   (?) Do I really need to keep track of the process creds in which an
>       implicit object destruction happened?  For example, imagine you create
>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>       refers to on close unless move_mount() clears that flag.  Now, imagine
>       someone looking at that fd through procfs at the same time as you exit
>       due to an error.  The LSM sees the destruction notification come from
>       the looker if they happen to do their fput() after yours.

I remain unconvinced that (1), (2), (3), and the final (?) above are a 
good idea.

For SELinux, I would expect that one would implement a collection of per 
watch-type WATCH permission checks on the target object (or to some 
well-defined object label like the kernel SID if there is no object) 
that allow receipt of all notifications of that watch-type for objects 
related to the target object, where "related to" is defined per watch-type.

I wouldn't expect SELinux to implement security_post_notification() at 
all.  I can't see how one can construct a meaningful, stable policy for 
it.  I'd argue that the triggering process is not posting the 
notification; the kernel is posting the notification and the watcher has 
been authorized to receive it.

> 
> 
> Design decisions:
> 
>   (1) A misc chardev is used to create and open a ring buffer:
> 
> 	fd = open("/dev/watch_queue", O_RDWR);
> 
>       which is then configured and mmap'd into userspace:
> 
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> 	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> 		   MAP_SHARED, fd, 0);
> 
>       The fd cannot be read or written (though there is a facility to use
>       write to inject records for debugging) and userspace just pulls data
>       directly out of the buffer.
> 
>   (2) The ring index pointers are stored inside the ring and are thus
>       accessible to userspace.  Userspace should only update the tail
>       pointer and never the head pointer or risk breaking the buffer.  The
>       kernel checks that the pointers appear valid before trying to use
>       them.  A 'skip' record is maintained around the pointers.
> 
>   (3) poll() can be used to wait for data to appear in the buffer.
> 
>   (4) Records in the buffer are binary, typed and have a length so that they
>       can be of varying size.
> 
>       This means that multiple heterogeneous sources can share a common
>       buffer.  Tags may be specified when a watchpoint is created to help
>       distinguish the sources.
> 
>   (5) The queue is reusable as there are 16 million types available, of
>       which I've used 4, so there is scope for others to be used.
> 
>   (6) Records are filterable as types have up to 256 subtypes that can be
>       individually filtered.  Other filtration is also available.
> 
>   (7) Each time the buffer is opened, a new buffer is created - this means
>       that there's no interference between watchers.
> 
>   (8) When recording a notification, the kernel will not sleep, but will
>       rather mark a queue as overrun if there's insufficient space, thereby
>       avoiding userspace causing the kernel to hang.
> 
>   (9) The 'watchpoint' should be specific where possible, meaning that you
>       specify the object that you want to watch.
> 
> (10) The buffer is created and then watchpoints are attached to it, using
>       one of:
> 
> 	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> 	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> 	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> 
>       where in all three cases, fd indicates the queue and the number after
>       is a tag between 0 and 255.
> 
> (11) The watch must be removed if either the watch buffer is destroyed or
>       the watched object is destroyed.
> 
> 
> Things I want to avoid:
> 
>   (1) Introducing features that make the core VFS dependent on the network
>       stack or networking namespaces (ie. usage of netlink).
> 
>   (2) Dumping all this stuff into dmesg and having a daemon that sits there
>       parsing the output and distributing it as this then puts the
>       responsibility for security into userspace and makes handling
>       namespaces tricky.  Further, dmesg might not exist or might be
>       inaccessible inside a container.
> 
>   (3) Letting users see events they shouldn't be able to see.
> 
> 
> Further things that could be considered:
> 
>   (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>       "children" of that keyring, such that the watch is removed from the
>       child if it is unlinked from the keyring.
> 
>   (2) Adding global superblock event queue.
> 
>   (3) Propagating watches to child superblock over automounts.
> 
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> 
> Changes:
> 
>   v4: Split the basic UAPI bits out into their own patch and then split the
>       LSM hooks out into an intermediate patch.  Add LSM hooks for setting
>       watches.
> 
>       Rename the *_notify() system calls to watch_*() for consistency.
> 
>   v3: I've added a USB notification source and reformulated the block
>       notification source so that there's now a common watch list, for which
>       the system call is now device_notify().
> 
>       I've assigned a pair of unused ioctl numbers in the 'W' series to the
>       ioctls added by this series.
> 
>       I've also added a description of the kernel API to the documentation.
> 
>   v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>       krefs for refcounting.  I've added some security features to try and
>       give Casey Schaufler the LSM control he wants.
> 
> David
> ---
> David Howells (13):
>        security: Override creds in __fput() with last fputter's creds
>        uapi: General notification ring definitions
>        security: Add hooks to rule on setting a watch
>        security: Add a hook for the point of notification insertion
>        General notification queue with user mmap()'able ring buffer
>        keys: Add a notification facility
>        vfs: Add a mount-notification facility
>        vfs: Add superblock notifications
>        fsinfo: Export superblock notification counter
>        Add a general, global device notification watch list
>        block: Add block layer notifications
>        usb: Add USB subsystem notifications
>        Add sample notification program
> 
> 
>   Documentation/ioctl/ioctl-number.txt   |    1
>   Documentation/security/keys/core.rst   |   58 ++
>   Documentation/watch_queue.rst          |  492 ++++++++++++++++++
>   arch/x86/entry/syscalls/syscall_32.tbl |    3
>   arch/x86/entry/syscalls/syscall_64.tbl |    3
>   block/Kconfig                          |    9
>   block/blk-core.c                       |   29 +
>   drivers/base/Kconfig                   |    9
>   drivers/base/Makefile                  |    1
>   drivers/base/watch.c                   |   89 +++
>   drivers/misc/Kconfig                   |   13
>   drivers/misc/Makefile                  |    1
>   drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
>   drivers/usb/core/Kconfig               |   10
>   drivers/usb/core/devio.c               |   55 ++
>   drivers/usb/core/hub.c                 |    3
>   fs/Kconfig                             |   21 +
>   fs/Makefile                            |    1
>   fs/file_table.c                        |   12
>   fs/fsinfo.c                            |   12
>   fs/mount.h                             |   33 +
>   fs/mount_notify.c                      |  187 +++++++
>   fs/namespace.c                         |    9
>   fs/super.c                             |  122 ++++
>   include/linux/blkdev.h                 |   15 +
>   include/linux/dcache.h                 |    1
>   include/linux/device.h                 |    7
>   include/linux/fs.h                     |   79 +++
>   include/linux/key.h                    |    4
>   include/linux/lsm_hooks.h              |   48 ++
>   include/linux/security.h               |   35 +
>   include/linux/syscalls.h               |    5
>   include/linux/usb.h                    |   19 +
>   include/linux/watch_queue.h            |   87 +++
>   include/uapi/linux/fsinfo.h            |   10
>   include/uapi/linux/keyctl.h            |    1
>   include/uapi/linux/watch_queue.h       |  213 ++++++++
>   kernel/sys_ni.c                        |    7
>   samples/Kconfig                        |    6
>   samples/Makefile                       |    1
>   samples/vfs/test-fsinfo.c              |   13
>   samples/watch_queue/Makefile           |    9
>   samples/watch_queue/watch_test.c       |  308 +++++++++++
>   security/keys/Kconfig                  |   10
>   security/keys/compat.c                 |    2
>   security/keys/gc.c                     |    5
>   security/keys/internal.h               |   30 +
>   security/keys/key.c                    |   37 +
>   security/keys/keyctl.c                 |   95 +++
>   security/keys/keyring.c                |   17 -
>   security/keys/request_key.c            |    4
>   security/security.c                    |   29 +
>   52 files changed, 3121 insertions(+), 38 deletions(-)
>   create mode 100644 Documentation/watch_queue.rst
>   create mode 100644 drivers/base/watch.c
>   create mode 100644 drivers/misc/watch_queue.c
>   create mode 100644 fs/mount_notify.c
>   create mode 100644 include/linux/watch_queue.h
>   create mode 100644 include/uapi/linux/watch_queue.h
>   create mode 100644 samples/watch_queue/Makefile
>   create mode 100644 samples/watch_queue/watch_test.c
> 

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 15:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <25281DB3-FCE4-40C2-BADB-B3B05C5F8DD3@amacapital.net>

On Fri, 2019-06-07 at 13:43 -0700, Andy Lutomirski wrote:
> > On Jun 7, 2019, at 12:49 PM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > 
> > On Fri, 2019-06-07 at 11:29 -0700, Andy Lutomirski wrote:
> > > > On Jun 7, 2019, at 10:59 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > > > 
> > > > > On 6/7/19 10:43 AM, Peter Zijlstra wrote:
> > > > > I've no idea what the kernel should do; since you failed to answer the
> > > > > question what happens when you point this to garbage.
> > > > > 
> > > > > Does it then fault or what?
> > > > 
> > > > Yeah, I think you'll fault with a rather mysterious CR2 value since
> > > > you'll go look at the instruction that faulted and not see any
> > > > references to the CR2 value.
> > > > 
> > > > I think this new MSR probably needs to get included in oops output when
> > > > CET is enabled.
> > > 
> > > This shouldn’t be able to OOPS because it only happens at CPL 3,
> > > right?  We
> > > should put it into core dumps, though.
> > > 
> > > > 
> > > > Why don't we require that a VMA be in place for the entire bitmap?
> > > > Don't we need a "get" prctl function too in case something like a JIT is
> > > > running and needs to find the location of this bitmap to set bits
> > > > itself?
> > > > 
> > > > Or, do we just go whole-hog and have the kernel manage the bitmap
> > > > itself. Our interface here could be:
> > > > 
> > > >   prctl(PR_MARK_CODE_AS_LEGACY, start, size);
> > > > 
> > > > and then have the kernel allocate and set the bitmap for those code
> > > > locations.
> > > 
> > > Given that the format depends on the VA size, this might be a good
> > > idea.  I
> > > bet we can reuse the special mapping infrastructure for this — the VMA
> > > could
> > > be a MAP_PRIVATE special mapping named [cet_legacy_bitmap] or similar, and
> > > we
> > > can even make special rules to core dump it intelligently if needed.  And
> > > we
> > > can make mremap() on it work correctly if anyone (CRIU?) cares.
> > > 
> > > Hmm.  Can we be creative and skip populating it with zeros?  The CPU
> > > should
> > > only ever touch a page if we miss an ENDBR on it, so, in normal operation,
> > > we
> > > don’t need anything to be there.  We could try to prevent anyone from
> > > *reading* it outside of ENDBR tracking if we want to avoid people
> > > accidentally
> > > wasting lots of memory by forcing it to be fully populated when the read
> > > it.
> > > 
> > > The one downside is this forces it to be per-mm, but that seems like a
> > > generally reasonable model anyway.
> > > 
> > > This also gives us an excellent opportunity to make it read-only as seen
> > > from
> > > userspace to prevent exploits from just poking it full of ones before
> > > redirecting execution.
> > 
> > GLIBC sets bits only for legacy code, and then makes the bitmap read-
> > only.  That
> > avoids most issues:
> 
> How does glibc know the linear address space size?  We don’t want LA64 to
> break old binaries because the address calculation changed.

When an application starts, its highest stack address is determined.
It uses that as the maximum the bitmap needs to cover.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 15:47 UTC (permalink / raw)
  To: Pavel Machek, Dave Hansen
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <20190608205218.GA2359@xo-6d-61-c0.localdomain>

On Sat, 2019-06-08 at 22:52 +0200, Pavel Machek wrote:
> Hi!
> 
> > > I've no idea what the kernel should do; since you failed to answer the
> > > question what happens when you point this to garbage.
> > > 
> > > Does it then fault or what?
> > 
> > Yeah, I think you'll fault with a rather mysterious CR2 value since
> > you'll go look at the instruction that faulted and not see any
> > references to the CR2 value.
> > 
> > I think this new MSR probably needs to get included in oops output when
> > CET is enabled.
> > 
> > Why don't we require that a VMA be in place for the entire bitmap?
> > Don't we need a "get" prctl function too in case something like a JIT is
> > running and needs to find the location of this bitmap to set bits itself?
> > 
> > Or, do we just go whole-hog and have the kernel manage the bitmap
> > itself. Our interface here could be:
> > 
> > 	prctl(PR_MARK_CODE_AS_LEGACY, start, size);
> > 
> > and then have the kernel allocate and set the bitmap for those code
> > locations.
> 
> For the record, that sounds like a better interface than userspace knowing
> about the bitmap formats...
> 									Pavel

Initially we implemented the bitmap that way.  To manage the bitmap, every time
the application issues a syscall for a .so it loads, and the kernel does
copy_from_user() & copy_to_user() (or similar things).  If a system has a few
legacy .so files and every application does the same, it can take a long time to
boot up.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 16:03 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <4F7D0C3C-F239-4B67-BB05-31350F809293@amacapital.net>

On Fri, 2019-06-07 at 15:27 -0700, Andy Lutomirski wrote:
> > On Jun 7, 2019, at 2:09 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
> > > > Huh, how does glibc know about all possible past and future legacy code
> > > > in the application?
> > > 
> > > When dlopen() gets a legacy binary and the policy allows that, it will
> > > manage
> > > the bitmap:
> > > 
> > >  If a bitmap has not been created, create one.
> > >  Set bits for the legacy code being loaded.
> > 
> > I was thinking about code that doesn't go through GLIBC like JITs.
> 
> CRIU is another consideration: it would be rather annoying if CET programs
> can’t migrate between LA57 and normal machines.

When a machine migrates, does its applications' addresses change?
If no, then the bitmap should still work, right?

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 16:05 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <f6de9073-9939-a20d-2196-25fa223cf3fc@intel.com>

On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote:
> On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
> > > Huh, how does glibc know about all possible past and future legacy code
> > > in the application?
> > 
> > When dlopen() gets a legacy binary and the policy allows that, it will
> > manage
> > the bitmap:
> > 
> >   If a bitmap has not been created, create one.
> >   Set bits for the legacy code being loaded.
> 
> I was thinking about code that doesn't go through GLIBC like JITs.

If JIT manages the bitmap, it knows where it is.
It can always read the bitmap again, right?

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Yu-cheng Yu @ 2019-06-10 16:29 UTC (permalink / raw)
  To: Dave Martin
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190607180115.GJ28398@e103592.cambridge.arm.com>

On Fri, 2019-06-07 at 19:01 +0100, Dave Martin wrote:
> On Thu, Jun 06, 2019 at 01:06:41PM -0700, Yu-cheng Yu wrote:
> > An ELF file's .note.gnu.property indicates features the executable file
> > can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> > indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> > GNU_PROPERTY_X86_FEATURE_1_SHSTK.
> > 
> > With this patch, if an arch needs to setup features from ELF properties,
> > it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and a specific
> > arch_setup_property().
> > 
> > For example, for X86_64:
> > 
> > int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
> > {
> > 	int r;
> > 	uint32_t property;
> > 
> > 	r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
> > 			     &property);
> > 	...
> > }
> 
> Although this code works for the simple case, I have some concerns about
> some aspects of the implementation here.  There appear to be some bounds
> checking / buffer overrun issues, and the code seems quite complex.
> 
> Maybe this patch tries too hard to be compatible with toolchains that do
> silly things such as embedding huge notes in an executable, or mixing
> NT_GNU_PROPERTY_TYPE_0 in a single PT_NOTE with a load of junk not
> relevant to the loader.  I wonder whether Linux can dictate what
> interpretation(s) of the ELF specs it is prepared to support, rather than
> trying to support absolutely anything.

To me, looking at PT_GNU_PROPERTY and not trying to support anything is a
logical choice.  And it breaks only a limited set of toolchains.

I will simplify the parser and leave this patch as-is for anyone who wants to
back-port.  Are there any objections or concerns?

Yu-cheng

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Casey Schaufler @ 2019-06-10 16:33 UTC (permalink / raw)
  To: Stephen Smalley, David Howells, viro
  Cc: linux-usb, linux-security-module, Greg Kroah-Hartman, raven,
	linux-fsdevel, linux-api, linux-block, keyrings, linux-kernel,
	Paul Moore, casey
In-Reply-To: <be966d9c-e38d-7a30-8d80-fad5f25ab230@tycho.nsa.gov>

On 6/10/2019 8:21 AM, Stephen Smalley wrote:
> On 6/7/19 10:17 AM, David Howells wrote:
>>
>> Hi Al,
>>
>> Here's a set of patches to add a general variable-length notification queue
>> concept and to add sources of events for:
>>
>>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>       mount reconfiguration.
>>
>>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>       errors (not complete yet).
>>
>>   (3) Key/keyring events, such as creating, linking and removal of keys.
>>
>>   (4) General device events (single common queue) including:
>>
>>       - Block layer events, such as device errors
>>
>>       - USB subsystem events, such as device/bus attach/remove, device
>>         reset, device errors.
>>
>> One of the reasons for this is so that we can remove the issue of processes
>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>> be a system performance problem.  To further aid this, the fsinfo() syscall
>> on which this patch series depends, provides a way to access superblock and
>> mount information in binary form without the need to parse /proc/mounts.
>>
>>
>> LSM support is included, but controversial:
>>
>>   (1) The creds of the process that did the fput() that reduced the refcount
>>       to zero are cached in the file struct.
>>
>>   (2) __fput() overrides the current creds with the creds from (1) whilst
>>       doing the cleanup, thereby making sure that the creds seen by the
>>       destruction notification generated by mntput() appears to come from
>>       the last fputter.
>>
>>   (3) security_post_notification() is called for each queue that we might
>>       want to post a notification into, thereby allowing the LSM to prevent
>>       covert communications.
>>
>>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>       may be set in the first place?  I might need to add a variant per
>>       watch-type.
>>
>>   (?) Do I really need to keep track of the process creds in which an
>>       implicit object destruction happened?  For example, imagine you create
>>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>       refers to on close unless move_mount() clears that flag.  Now, imagine
>>       someone looking at that fd through procfs at the same time as you exit
>>       due to an error.  The LSM sees the destruction notification come from
>>       the looker if they happen to do their fput() after yours.
>
> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>
> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>
> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.

I cannot agree. There is an explicit action by a subject that results
in information being delivered to an object. Just like a signal or a
UDP packet delivery. Smack handles this kind of thing just fine. The
internal mechanism that results in the access is irrelevant from
this viewpoint. I can understand how a mechanism like SELinux that
works on finer granularity might view it differently.



>
>>
>>
>> Design decisions:
>>
>>   (1) A misc chardev is used to create and open a ring buffer:
>>
>>     fd = open("/dev/watch_queue", O_RDWR);
>>
>>       which is then configured and mmap'd into userspace:
>>
>>     ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>>     ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>>     buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>>            MAP_SHARED, fd, 0);
>>
>>       The fd cannot be read or written (though there is a facility to use
>>       write to inject records for debugging) and userspace just pulls data
>>       directly out of the buffer.
>>
>>   (2) The ring index pointers are stored inside the ring and are thus
>>       accessible to userspace.  Userspace should only update the tail
>>       pointer and never the head pointer or risk breaking the buffer.  The
>>       kernel checks that the pointers appear valid before trying to use
>>       them.  A 'skip' record is maintained around the pointers.
>>
>>   (3) poll() can be used to wait for data to appear in the buffer.
>>
>>   (4) Records in the buffer are binary, typed and have a length so that they
>>       can be of varying size.
>>
>>       This means that multiple heterogeneous sources can share a common
>>       buffer.  Tags may be specified when a watchpoint is created to help
>>       distinguish the sources.
>>
>>   (5) The queue is reusable as there are 16 million types available, of
>>       which I've used 4, so there is scope for others to be used.
>>
>>   (6) Records are filterable as types have up to 256 subtypes that can be
>>       individually filtered.  Other filtration is also available.
>>
>>   (7) Each time the buffer is opened, a new buffer is created - this means
>>       that there's no interference between watchers.
>>
>>   (8) When recording a notification, the kernel will not sleep, but will
>>       rather mark a queue as overrun if there's insufficient space, thereby
>>       avoiding userspace causing the kernel to hang.
>>
>>   (9) The 'watchpoint' should be specific where possible, meaning that you
>>       specify the object that you want to watch.
>>
>> (10) The buffer is created and then watchpoints are attached to it, using
>>       one of:
>>
>>     keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>>     mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>>     sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>>
>>       where in all three cases, fd indicates the queue and the number after
>>       is a tag between 0 and 255.
>>
>> (11) The watch must be removed if either the watch buffer is destroyed or
>>       the watched object is destroyed.
>>
>>
>> Things I want to avoid:
>>
>>   (1) Introducing features that make the core VFS dependent on the network
>>       stack or networking namespaces (ie. usage of netlink).
>>
>>   (2) Dumping all this stuff into dmesg and having a daemon that sits there
>>       parsing the output and distributing it as this then puts the
>>       responsibility for security into userspace and makes handling
>>       namespaces tricky.  Further, dmesg might not exist or might be
>>       inaccessible inside a container.
>>
>>   (3) Letting users see events they shouldn't be able to see.
>>
>>
>> Further things that could be considered:
>>
>>   (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>>       "children" of that keyring, such that the watch is removed from the
>>       child if it is unlinked from the keyring.
>>
>>   (2) Adding global superblock event queue.
>>
>>   (3) Propagating watches to child superblock over automounts.
>>
>>
>> The patches can be found here also:
>>
>>     http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
>>
>> Changes:
>>
>>   v4: Split the basic UAPI bits out into their own patch and then split the
>>       LSM hooks out into an intermediate patch.  Add LSM hooks for setting
>>       watches.
>>
>>       Rename the *_notify() system calls to watch_*() for consistency.
>>
>>   v3: I've added a USB notification source and reformulated the block
>>       notification source so that there's now a common watch list, for which
>>       the system call is now device_notify().
>>
>>       I've assigned a pair of unused ioctl numbers in the 'W' series to the
>>       ioctls added by this series.
>>
>>       I've also added a description of the kernel API to the documentation.
>>
>>   v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>>       krefs for refcounting.  I've added some security features to try and
>>       give Casey Schaufler the LSM control he wants.
>>
>> David
>> ---
>> David Howells (13):
>>        security: Override creds in __fput() with last fputter's creds
>>        uapi: General notification ring definitions
>>        security: Add hooks to rule on setting a watch
>>        security: Add a hook for the point of notification insertion
>>        General notification queue with user mmap()'able ring buffer
>>        keys: Add a notification facility
>>        vfs: Add a mount-notification facility
>>        vfs: Add superblock notifications
>>        fsinfo: Export superblock notification counter
>>        Add a general, global device notification watch list
>>        block: Add block layer notifications
>>        usb: Add USB subsystem notifications
>>        Add sample notification program
>>
>>
>>   Documentation/ioctl/ioctl-number.txt   |    1
>>   Documentation/security/keys/core.rst   |   58 ++
>>   Documentation/watch_queue.rst          |  492 ++++++++++++++++++
>>   arch/x86/entry/syscalls/syscall_32.tbl |    3
>>   arch/x86/entry/syscalls/syscall_64.tbl |    3
>>   block/Kconfig                          |    9
>>   block/blk-core.c                       |   29 +
>>   drivers/base/Kconfig                   |    9
>>   drivers/base/Makefile                  |    1
>>   drivers/base/watch.c                   |   89 +++
>>   drivers/misc/Kconfig                   |   13
>>   drivers/misc/Makefile                  |    1
>>   drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
>>   drivers/usb/core/Kconfig               |   10
>>   drivers/usb/core/devio.c               |   55 ++
>>   drivers/usb/core/hub.c                 |    3
>>   fs/Kconfig                             |   21 +
>>   fs/Makefile                            |    1
>>   fs/file_table.c                        |   12
>>   fs/fsinfo.c                            |   12
>>   fs/mount.h                             |   33 +
>>   fs/mount_notify.c                      |  187 +++++++
>>   fs/namespace.c                         |    9
>>   fs/super.c                             |  122 ++++
>>   include/linux/blkdev.h                 |   15 +
>>   include/linux/dcache.h                 |    1
>>   include/linux/device.h                 |    7
>>   include/linux/fs.h                     |   79 +++
>>   include/linux/key.h                    |    4
>>   include/linux/lsm_hooks.h              |   48 ++
>>   include/linux/security.h               |   35 +
>>   include/linux/syscalls.h               |    5
>>   include/linux/usb.h                    |   19 +
>>   include/linux/watch_queue.h            |   87 +++
>>   include/uapi/linux/fsinfo.h            |   10
>>   include/uapi/linux/keyctl.h            |    1
>>   include/uapi/linux/watch_queue.h       |  213 ++++++++
>>   kernel/sys_ni.c                        |    7
>>   samples/Kconfig                        |    6
>>   samples/Makefile                       |    1
>>   samples/vfs/test-fsinfo.c              |   13
>>   samples/watch_queue/Makefile           |    9
>>   samples/watch_queue/watch_test.c       |  308 +++++++++++
>>   security/keys/Kconfig                  |   10
>>   security/keys/compat.c                 |    2
>>   security/keys/gc.c                     |    5
>>   security/keys/internal.h               |   30 +
>>   security/keys/key.c                    |   37 +
>>   security/keys/keyctl.c                 |   95 +++
>>   security/keys/keyring.c                |   17 -
>>   security/keys/request_key.c            |    4
>>   security/security.c                    |   29 +
>>   52 files changed, 3121 insertions(+), 38 deletions(-)
>>   create mode 100644 Documentation/watch_queue.rst
>>   create mode 100644 drivers/base/watch.c
>>   create mode 100644 drivers/misc/watch_queue.c
>>   create mode 100644 fs/mount_notify.c
>>   create mode 100644 include/linux/watch_queue.h
>>   create mode 100644 include/uapi/linux/watch_queue.h
>>   create mode 100644 samples/watch_queue/Makefile
>>   create mode 100644 samples/watch_queue/watch_test.c
>>
>

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Andy Lutomirski @ 2019-06-10 16:42 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Stephen Smalley, David Howells, Al Viro, USB list, LSM List,
	Greg Kroah-Hartman, raven, Linux FS Devel, Linux API, linux-block,
	keyrings, LKML, Paul Moore
In-Reply-To: <0cf7a49d-85f6-fba9-62ec-a378e0b76adf@schaufler-ca.com>

On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
> > On 6/7/19 10:17 AM, David Howells wrote:
> >>
> >> Hi Al,
> >>
> >> Here's a set of patches to add a general variable-length notification queue
> >> concept and to add sources of events for:
> >>
> >>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
> >>       mount reconfiguration.
> >>
> >>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> >>       errors (not complete yet).
> >>
> >>   (3) Key/keyring events, such as creating, linking and removal of keys.
> >>
> >>   (4) General device events (single common queue) including:
> >>
> >>       - Block layer events, such as device errors
> >>
> >>       - USB subsystem events, such as device/bus attach/remove, device
> >>         reset, device errors.
> >>
> >> One of the reasons for this is so that we can remove the issue of processes
> >> having to repeatedly and regularly scan /proc/mounts, which has proven to
> >> be a system performance problem.  To further aid this, the fsinfo() syscall
> >> on which this patch series depends, provides a way to access superblock and
> >> mount information in binary form without the need to parse /proc/mounts.
> >>
> >>
> >> LSM support is included, but controversial:
> >>
> >>   (1) The creds of the process that did the fput() that reduced the refcount
> >>       to zero are cached in the file struct.
> >>
> >>   (2) __fput() overrides the current creds with the creds from (1) whilst
> >>       doing the cleanup, thereby making sure that the creds seen by the
> >>       destruction notification generated by mntput() appears to come from
> >>       the last fputter.
> >>
> >>   (3) security_post_notification() is called for each queue that we might
> >>       want to post a notification into, thereby allowing the LSM to prevent
> >>       covert communications.
> >>
> >>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
> >>       may be set in the first place?  I might need to add a variant per
> >>       watch-type.
> >>
> >>   (?) Do I really need to keep track of the process creds in which an
> >>       implicit object destruction happened?  For example, imagine you create
> >>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
> >>       refers to on close unless move_mount() clears that flag.  Now, imagine
> >>       someone looking at that fd through procfs at the same time as you exit
> >>       due to an error.  The LSM sees the destruction notification come from
> >>       the looker if they happen to do their fput() after yours.
> >
> > I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
> >
> > For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
> >
> > I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>
> I cannot agree. There is an explicit action by a subject that results
> in information being delivered to an object. Just like a signal or a
> UDP packet delivery. Smack handles this kind of thing just fine. The
> internal mechanism that results in the access is irrelevant from
> this viewpoint. I can understand how a mechanism like SELinux that
> works on finer granularity might view it differently.

I think you really need to give an example of a coherent policy that
needs this.  As it stands, your analogy seems confusing.  If someone
changes the system clock, we don't restrict who is allowed to be
notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
was changed based on who changed the clock.  Similarly, if someone
tries to receive a packet on a socket, we check whether they have the
right to receive on that socket (from the endpoint in question) and,
if the sender is local, whether the sender can send to that socket.
We do not check whether the sender can send to the receiver.

The signal example is inapplicable.  Sending a signal to a process is
an explicit action done to that process, and it can easily adversely
affect the target.  Of course it requires permission.

--Andy

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-06-10 16:57 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <94b9c55b3b874825fda485af40ab2a6bc3dad171.camel@intel.com>

On Mon, Jun 10, 2019 at 09:29:04AM -0700, Yu-cheng Yu wrote:
> On Fri, 2019-06-07 at 19:01 +0100, Dave Martin wrote:
> > On Thu, Jun 06, 2019 at 01:06:41PM -0700, Yu-cheng Yu wrote:
> > > An ELF file's .note.gnu.property indicates features the executable file
> > > can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> > > indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> > > GNU_PROPERTY_X86_FEATURE_1_SHSTK.
> > > 
> > > With this patch, if an arch needs to setup features from ELF properties,
> > > it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and a specific
> > > arch_setup_property().
> > > 
> > > For example, for X86_64:
> > > 
> > > int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
> > > {
> > > 	int r;
> > > 	uint32_t property;
> > > 
> > > 	r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
> > > 			     &property);
> > > 	...
> > > }
> > 
> > Although this code works for the simple case, I have some concerns about
> > some aspects of the implementation here.  There appear to be some bounds
> > checking / buffer overrun issues, and the code seems quite complex.
> > 
> > Maybe this patch tries too hard to be compatible with toolchains that do
> > silly things such as embedding huge notes in an executable, or mixing
> > NT_GNU_PROPERTY_TYPE_0 in a single PT_NOTE with a load of junk not
> > relevant to the loader.  I wonder whether Linux can dictate what
> > interpretation(s) of the ELF specs it is prepared to support, rather than
> > trying to support absolutely anything.
> 
> To me, looking at PT_GNU_PROPERTY and not trying to support anything is a
> logical choice.  And it breaks only a limited set of toolchains.
> 
> I will simplify the parser and leave this patch as-is for anyone who wants to
> back-port.  Are there any objections or concerns?

No objection from me ;)  But I'm biased.

Hopefully this change should allow substantial simplification.  For one
thing, PT_GNU_PROPERTY tells its file offset and size directly in its
phdrs entry.  That should save us a lot of effort on the kernel side.

Cheers
---Dave

^ permalink raw reply

* Re: [PATCH 06/13] keys: Add a notification facility [ver #4]
From: Jonathan Corbet @ 2019-06-10 17:11 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <155991709983.15579.13232123365803197237.stgit@warthog.procyon.org.uk>

On Fri, 07 Jun 2019 15:18:19 +0100
David Howells <dhowells@redhat.com> wrote:

> Add a key/keyring change notification facility whereby notifications about
> changes in key and keyring content and attributes can be received.
> 
> Firstly, an event queue needs to be created:
> 
> 	fd = open("/dev/event_queue", O_RDWR);
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, page_size << n);
> 
> then a notification can be set up to report notifications via that queue:
> 
> 	struct watch_notification_filter filter = {
> 		.nr_filters = 1,
> 		.filters = {
> 			[0] = {
> 				.type = WATCH_TYPE_KEY_NOTIFY,
> 				.subtype_filter[0] = UINT_MAX,
> 			},
> 		},
> 	};
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> 	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);

One little nit: it seems that keyctl_watch_key is actually spelled
keyctl(KEYCTL_WATCH_KEY, ...).

jon

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Florian Weimer @ 2019-06-10 17:24 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Dave Martin, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <94b9c55b3b874825fda485af40ab2a6bc3dad171.camel@intel.com>

* Yu-cheng Yu:

> To me, looking at PT_GNU_PROPERTY and not trying to support anything is a
> logical choice.  And it breaks only a limited set of toolchains.
>
> I will simplify the parser and leave this patch as-is for anyone who wants to
> back-port.  Are there any objections or concerns?

Red Hat Enterprise Linux 8 does not use PT_GNU_PROPERTY and is probably
the largest collection of CET-enabled binaries that exists today.

My hope was that we would backport the upstream kernel patches for CET,
port the glibc dynamic loader to the new kernel interface, and be ready
to run with CET enabled in principle (except that porting userspace
libraries such as OpenSSL has not really started upstream, so many
processes where CET is particularly desirable will still run without
it).

I'm not sure if it is a good idea to port the legacy support if it's not
part of the mainline kernel because it comes awfully close to creating
our own private ABI.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Florian Weimer @ 2019-06-10 17:28 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, linux-kernel, linux-doc, linux-mm,
	linux-arch, linux-api, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H.J. Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz
In-Reply-To: <5dc357f5858f8036cad5847cfe214401bb9138bf.camel@intel.com>

* Yu-cheng Yu:

> On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote:
>> On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
>> > > Huh, how does glibc know about all possible past and future legacy code
>> > > in the application?
>> > 
>> > When dlopen() gets a legacy binary and the policy allows that, it will
>> > manage
>> > the bitmap:
>> > 
>> >   If a bitmap has not been created, create one.
>> >   Set bits for the legacy code being loaded.
>> 
>> I was thinking about code that doesn't go through GLIBC like JITs.
>
> If JIT manages the bitmap, it knows where it is.
> It can always read the bitmap again, right?

The problem are JIT libraries without assembler code which can be marked
non-CET, such as liborc.  Our builds (e.g., orc-0.4.29-2.fc30.x86_64)
currently carries the IBT and SHSTK flag, although the entry points into
the generated code do not start with ENDBR, so that a jump to them will
fault with the CET enabled.

Thanks,
Florian

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox