Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v2 1/2] fork: add clone3
From: Arnd Bergmann @ 2019-06-04 10:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Kees Cook, Florian Weimer, Oleg Nesterov, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API
In-Reply-To: <20190603144331.16760-1-christian@brauner.io>

On Mon, Jun 3, 2019 at 4:44 PM Christian Brauner <christian@brauner.io> wrote:

> +
> +#ifdef __ARCH_WANT_SYS_CLONE
> +asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
> +#endif

I would leave it outside of __ARCH_WANT_SYS_CLONE, as far
as I can tell the only reason for that #ifdef is so architectures that
have their own sys_clone implementation can opt out of the generic
one, but we don't want that for new syscalls.

In fact, I'd prefer to drop the symbol entirely and have a different
symbol with the opposite meaning such as
__ARCH_NONSTANDARD_SYS_CLONE that only gets
selected by sparc, ia64 and m68k. That should be a separate
patch though, and I'm not asking you to do it, unless you
want to clean up a little more.

       Arnd

^ permalink raw reply

* Re: [PATCH v2 1/2] fork: add clone3
From: Christian Brauner @ 2019-06-04  9:56 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-kernel, torvalds, jannh, keescook, fweimer, oleg,
	arnd, Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api
In-Reply-To: <20190604094317.4wfelmbw4lgxzide@brauner.io>

On Tue, Jun 04, 2019 at 11:43:17AM +0200, Christian Brauner wrote:
> On Tue, Jun 04, 2019 at 10:28:12AM +0100, David Howells wrote:
> > Christian Brauner <christian@brauner.io> wrote:
> > 
> > > +#include <linux/compiler_types.h>
> > 
> > I suspect you don't want to include that directly.
> > 
> > Also, to avoid bloating linux/sched/task.h yet further, maybe put this in
> > linux/sched/clone.h?
> 
> Yeah, not the worst idea.
> Though I'd leave the flags where they are and just add struct
> kernel_clone_args in there. But I assume that's what you meant anyway.

Actually, I would like to defer this to the cleanup patch too.
This way the patch stays small and clean and task.h is currently the
right place to put it.

> 
> > 
> > > -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> > > +extern long _do_fork(struct kernel_clone_args *kargs);
> > >  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
> > 
> > Maybe these could move into linux/sched/clone.h too.
> 
> Meh, that could be a separate cleanup patch after clone3() has been
> merged.
> 
> > 
> > > +#define CLONE_MAX ~0U
> > 
> > Can you add a comment summarising the meaning?
> 
> Yes, can do.
> 
> > 
> > > +	u64 clone_flags = args->flags;
> > > +	int __user *child_tidptr = args->child_tid;
> > > +	unsigned long tls = args->tls;
> > > +	unsigned long stack_start = args->stack;
> > > +	unsigned long stack_size = args->stack_size;
> > 
> > Some of these are only used once, so it's probably not worth sticking them in
> > local variables.
> 
> [1]:
> Ok, will double check.
> This was just to minimize copy-paste erros for variables which were used
> multiple times.
> 
> > 
> > > -		if (clone_flags &
> > > -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> > > -			return ERR_PTR(-EINVAL);
> > 
> > Did this error check get lost?  I can see part of it further on, but the check
> > on CLONE_PARENT_SETTID is absent.
> 
> No, it's only relevant for legacy clone() since it uses the
> parent_tidptr argument to return the pidfd. clone3() has a dedicated
> return argument for that in clone_args.
> The check for legacy clone() is now done in legacy clone() directly.
> copy_process() should only do generic checks for all version of
> clone(),fork(),vfork(), etc.
> 
> > 
> > > +	int __user *parent_tidptr = args->parent_tid;
> > 
> > There's only one usage remaining after this patch, so a local var doesn't gain
> > a lot.
> 
> Yes, that leads back to [1].
> 
> > 
> > >  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
> > >  {
> > > -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> > > -		(unsigned long)arg, NULL, NULL, 0);
> > > +	struct kernel_clone_args args = {
> > > +		.flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
> > > +		.exit_signal = (flags & CSIGNAL),
> > 
> > Kernel threads can have exit signals?
> 
> Yes,
> 
> kernel/kthread.c:       pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
> kernel/umh.c:   pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);
> 
> And even if they couldn't have. This is just to make sure that if they
> ever would we'd be prepared.
> 
> > 
> > > +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> > > +				     struct clone_args __user *uargs,
> > > +				     size_t size)
> > 
> > I would make this "noinline".  If it gets inlined, local variable "args" may
> > still be on the stack when _do_fork() gets called.
> 
> Hm, can do.
> 
> Thanks!
> Christian

^ permalink raw reply

* Re: [PATCH v2 1/2] fork: add clone3
From: Christian Brauner @ 2019-06-04  9:43 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-kernel, torvalds, jannh, keescook, fweimer, oleg,
	arnd, Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	linux-api
In-Reply-To: <4020.1559640492@warthog.procyon.org.uk>

On Tue, Jun 04, 2019 at 10:28:12AM +0100, David Howells wrote:
> Christian Brauner <christian@brauner.io> wrote:
> 
> > +#include <linux/compiler_types.h>
> 
> I suspect you don't want to include that directly.
> 
> Also, to avoid bloating linux/sched/task.h yet further, maybe put this in
> linux/sched/clone.h?

Yeah, not the worst idea.
Though I'd leave the flags where they are and just add struct
kernel_clone_args in there. But I assume that's what you meant anyway.

> 
> > -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> > +extern long _do_fork(struct kernel_clone_args *kargs);
> >  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
> 
> Maybe these could move into linux/sched/clone.h too.

Meh, that could be a separate cleanup patch after clone3() has been
merged.

> 
> > +#define CLONE_MAX ~0U
> 
> Can you add a comment summarising the meaning?

Yes, can do.

> 
> > +	u64 clone_flags = args->flags;
> > +	int __user *child_tidptr = args->child_tid;
> > +	unsigned long tls = args->tls;
> > +	unsigned long stack_start = args->stack;
> > +	unsigned long stack_size = args->stack_size;
> 
> Some of these are only used once, so it's probably not worth sticking them in
> local variables.

[1]:
Ok, will double check.
This was just to minimize copy-paste erros for variables which were used
multiple times.

> 
> > -		if (clone_flags &
> > -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> > -			return ERR_PTR(-EINVAL);
> 
> Did this error check get lost?  I can see part of it further on, but the check
> on CLONE_PARENT_SETTID is absent.

No, it's only relevant for legacy clone() since it uses the
parent_tidptr argument to return the pidfd. clone3() has a dedicated
return argument for that in clone_args.
The check for legacy clone() is now done in legacy clone() directly.
copy_process() should only do generic checks for all version of
clone(),fork(),vfork(), etc.

> 
> > +	int __user *parent_tidptr = args->parent_tid;
> 
> There's only one usage remaining after this patch, so a local var doesn't gain
> a lot.

Yes, that leads back to [1].

> 
> >  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
> >  {
> > -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> > -		(unsigned long)arg, NULL, NULL, 0);
> > +	struct kernel_clone_args args = {
> > +		.flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
> > +		.exit_signal = (flags & CSIGNAL),
> 
> Kernel threads can have exit signals?

Yes,

kernel/kthread.c:       pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
kernel/umh.c:   pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);

And even if they couldn't have. This is just to make sure that if they
ever would we'd be prepared.

> 
> > +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> > +				     struct clone_args __user *uargs,
> > +				     size_t size)
> 
> I would make this "noinline".  If it gets inlined, local variable "args" may
> still be on the stack when _do_fork() gets called.

Hm, can do.

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-06-04  9:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <20190603174706.t4cby7f5ni4gvvom@box>

On 03.06.2019 20:47, Kirill A. Shutemov wrote:> On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote:
>> On 03.06.2019 17:38, Kirill Tkhai wrote:
>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>>> This patchset adds a new syscall, which makes possible
>>>>> to clone a VMA from a process to current process.
>>>>> The syscall supplements the functionality provided
>>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>>> and it may be useful in many situation.
>>>>
>>>> Kirill, could you explain how the change affects rmap and how it is safe.
>>>>
>>>> My concern is that the patchset allows to map the same page multiple times
>>>> within one process or even map page allocated by child to the parrent.
>>>
>>> Speaking honestly, we already support this model, since ZERO_PAGE() may
>>> be mapped multiply times in any number of mappings.
>>
>> Picking of huge_zero_page and mremapping its VMA to unaligned address also gives
>> the case, when the same huge page is mapped as huge page and as set of ordinary
>> pages in the same process.
>>
>> Summing up two above cases, is there really a fundamental problem with
>> the functionality the patch set introduces? It looks like we already have
>> these cases in stable kernel supported.
> 
> It *might* work. But it requires a lot of audit to prove that it actually
> *does* work.

Please, give the represent of the way the audit results should look like
for you. In case of I hadn't done some audit before patchset preparing,
I wouldn't have sent it. So, give an idea that you expect from this.

> For instance, are you sure it will not break KSM?

Yes, it does not break KSM. The main point is that in case of KSM we already
may have not just only a page mapped twice in a single process, but even
a page mapped twice in a single VMA. And this is just a particular case of
generic supported set. (Ordinary page still can't be mapped twice in a single
VMA, since pgoff differences won't allow to merge such two hunks together).

The generic rule of ksm is "everything may happen with a page in a real time,
and all of this will be reflected in stable and unstable trees and rmap_items
some time later". Pages of a duplicated VMA will be interpreted as KSM fork,
and the corresponding checks in unstable_tree_search_insert() and
stable_tree_search() provide this.

When both of source and destination VMAs are mergeable,
1)if page was added to stable tree before the duplication of related VMA,
  then during scanning destination VMA in cmp_and_merge_page() it will be
  detected as a duplicate, and we will just add related rmap_item
  to stable node chain;
2)if page was added to unstable tree before the duplication of related VMA,
  and it is remaining there, then the page will be detected as a duplicate
  in destination VMA, and the scan of page will be skipped till next turn;
3)if page was not added to any tree before the duplication, it may be added
  to one of the trees and it will be handled by one of two rules above.

When one of source or destination VMAs is not mergeable, while a page become
PageKsm() during scanning other of them, the unmergeable VMA becomes to refer
to PageKsm(), which does not have rmap_item. But it still possible to unmap
that page from unmergeable VMA, since rmap_walk_ksm() goes over all anon_vma
under rb_root. Just the same as what happens, when process forks, and its
child makes VMA unmergeable.

> What does it mean for memory accounting? memcg?

Once assigned memcg remains the same after VMA duplication. Mapped page range
advances counters in vm_stat_account(). Since we keep fork() semantics,
the same thing occurs as after fork()+mremap().

> My point is that you breaking long standing invariant in Linux MM and it
> has to be properly justified.

I'm not against that. Please, say, which form of the justification you expect.
I assume you do not mean retelling of every string of existing code, because
this way the words will take 10 times more, than the code, and just not human
possible.

Please, give the specific request what you expect, and how this should look like.

> I would expect to see some strange deadlocks or permanent trylock failure
> as result of such change.

Do you hint some specific area? Do you expect I run some specific test cases?
Do you want we add some debugging engine on top of page locking to detect such
the trylock failures?

Thanks,
Kirill

^ permalink raw reply

* Re: [PATCH v2 1/2] fork: add clone3
From: David Howells @ 2019-06-04  9:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, viro, linux-kernel, torvalds, jannh, keescook, fweimer,
	oleg, arnd, Pavel Emelyanov, Andrew Morton, Adrian Reber,
	Andrei Vagin, linux-api
In-Reply-To: <20190603144331.16760-1-christian@brauner.io>

Christian Brauner <christian@brauner.io> wrote:

> +#include <linux/compiler_types.h>

I suspect you don't want to include that directly.

Also, to avoid bloating linux/sched/task.h yet further, maybe put this in
linux/sched/clone.h?

> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(struct kernel_clone_args *kargs);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);

Maybe these could move into linux/sched/clone.h too.

> +#define CLONE_MAX ~0U

Can you add a comment summarising the meaning?

> +	u64 clone_flags = args->flags;
> +	int __user *child_tidptr = args->child_tid;
> +	unsigned long tls = args->tls;
> +	unsigned long stack_start = args->stack;
> +	unsigned long stack_size = args->stack_size;

Some of these are only used once, so it's probably not worth sticking them in
local variables.

> -		if (clone_flags &
> -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> -			return ERR_PTR(-EINVAL);

Did this error check get lost?  I can see part of it further on, but the check
on CLONE_PARENT_SETTID is absent.

> +	int __user *parent_tidptr = args->parent_tid;

There's only one usage remaining after this patch, so a local var doesn't gain
a lot.

>  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
>  {
> -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> -		(unsigned long)arg, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
> +		.exit_signal = (flags & CSIGNAL),

Kernel threads can have exit signals?

> +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> +				     struct clone_args __user *uargs,
> +				     size_t size)

I would make this "noinline".  If it gets inlined, local variable "args" may
still be on the stack when _do_fork() gets called.

David

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-06-04  7:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190604042651.GC43390@google.com>

On Tue 04-06-19 13:26:51, Minchan Kim wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
[...]
> > Right. But there is still the page cache reclaim. Is it expected that
> > an explicitly cold memory doesn't get reclaimed because we have a
> > sufficient amount of page cache (a very common case) and we never age
> > anonymous memory because of that?
> 
> If there are lots of used-once pages in file-LRU, I think there is no
> need to reclaim anonymous pages because it needs bigger overhead due to
> IO. It has been true for a long time in current VM policy.

You are making an assumption which is not universally true. If I _know_
that there is a considerable amount of idle anonymous memory then I
would really prefer if it goes to the swap rather than make a pressure
on caching. Inactive list is not guaranteed to contain only used-once
pages, right?

Anyway, as already mentioned, we can start with a simpler implementation
for now and explicitly note that pagecache biased reclaim is known to be
a problem potentially. I am pretty sure somebody will come sooner or
later and we can address the problem then with some good numbers to back
the additional complexity.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-06-04  6:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, linux-mm, LKML, linux-api,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603230205.GA43390@google.com>

On Tue 04-06-19 08:02:05, Minchan Kim wrote:
> Hi Johannes,
> 
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > > > 
> > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > > > 
> > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > > active pages unless there is no access until the time.
> > > > > > > > > 
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > > 
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > > 
> > > > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > > > 
> > > > > > > >     if (page_mapcount(page) <= 1)
> > > > > > > >         deactivate_page(page);
> > > > > > > 
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > > 
> > > > > > Because page table in one of process shared the page would have access bit
> > > > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > > > the more fail to reclaim.
> > > > > 
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count. I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > > 
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > > 
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > > 
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > > 
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with a reasoning behind. Consistency is certainly an argument
> > > to use.
> > > 
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them after all. Being able to age
> > > a pagecache with a sufficient privileges sounds like a useful usecase to
> > > me. In other words you are able to cause the same effect indirectly
> > > without the madvise operation so it kinda makes sense to allow it in a
> > > more sophisticated way.
> > 
> > Right, I don't think it's about permission - as you say, you can do
> > this indirectly. Page reclaim is all about relative page order, so if
> > we thwarted you from demoting some pages, you could instead promote
> > other pages to cause a similar end result.
> > 
> > I think it's about intent. You're advising the kernel that *you're*
> > not using this memory and would like to have it cleared out based on
> > that knowledge. You could do the same by simply allocating the new
> > pages and have the kernel sort it out. However, if the kernel sorts it
> > out, it *will* look at other users of the page, and it might decide
> > that other pages are actually colder when considering all users.
> > 
> > When you ignore shared state, on the other hand, the pages you advise
> > out could refault right after. And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> > 
> > So I don't think it ever makes sense to override it.
> > 
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
>                      ^
>                      old?
> 
> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
> 
> You are confused from the name change. Here, MADV_COLD is deactivating
> , not pageing out. Therefore, shrink_page_list doesn't matter.
> And madvise_cold_pte_range already makes the local pte *old*(I guess
> your saying was typo).
> I guess that's exactly what Michal wanted: just removing page_mapcount
> check and defers to decision on normal page reclaim policy:
> If I didn't miss your intention, it seems you and Michal are on same page.
> (Please correct me if you want to say something other)

Indeed.

> I could drop the page_mapcount check at next revision.

Yes please.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-06-04  6:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603215059.GA16824@cmpxchg.org>

On Mon 03-06-19 17:50:59, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > > 
> > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > > 
> > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > active pages unless there is no access until the time.
> > > > > > > > 
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > 
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > 
> > > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > > 
> > > > > > >     if (page_mapcount(page) <= 1)
> > > > > > >         deactivate_page(page);
> > > > > > 
> > > > > > Why do we restrict to pages that are single mapped?
> > > > > 
> > > > > Because page table in one of process shared the page would have access bit
> > > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > > the more fail to reclaim.
> > > > 
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > > 
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > > 
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > > 
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> > 
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with a reasoning behind. Consistency is certainly an argument
> > to use.
> > 
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > a pagecache with a sufficient privileges sounds like a useful usecase to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
> 
> Right, I don't think it's about permission - as you say, you can do
> this indirectly. Page reclaim is all about relative page order, so if
> we thwarted you from demoting some pages, you could instead promote
> other pages to cause a similar end result.

There is one notable difference. If we allow an easy way to demote a
shared resource _easily_ then we have to think about potential side
channel attacks. Sure you can generate a memory pressure to cause the
same but that is much harder and impractical in many cases.

> I think it's about intent. You're advising the kernel that *you're*
> not using this memory and would like to have it cleared out based on
> that knowledge. You could do the same by simply allocating the new
> pages and have the kernel sort it out. However, if the kernel sorts it
> out, it *will* look at other users of the page, and it might decide
> that other pages are actually colder when considering all users.
> 
> When you ignore shared state, on the other hand, the pages you advise
> out could refault right after. And then, not only did you not free up
> the memory, but you also caused IO that may interfere with bringing in
> the new data for which you tried to create room in the first place.

That is a fair argument and I would tend to agree. On the other hand we
are talking about potential usecases which tend to _know_ what they are
doing and removing the possibility completely sounds like they will not
exploit the existing interface to the maximum. But as already mentioned
starting simpler and more restricted is usually a better choice when
the semantic is not carved in stone from the very beginning and
documented that way.

> So I don't think it ever makes sense to override it.
> 
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.

Yeah that makes sense to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v1 3/4] mm: account nr_isolated_xxx in [isolate|putback]_lru_page
From: Minchan Kim @ 2019-06-04  4:51 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, linux-mm, LKML, linux-api@vger.kernel.org,
	Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao,
	Brian Geffon, jannh@google.com, oleg@redhat.com,
	christian@brauner.io, oleksandr@redhat.com
In-Reply-To: <20190604042047.13492-1-hdanton@sina.com>

Hi Hillf,

On Tue, Jun 04, 2019 at 12:20:47PM +0800, Hillf Danton wrote:
> 
> Hi Minchan
> 
> On Mon, 3 Jun 2019 13:37:27 +0800 Minchan Kim wrote:
> > @@ -1181,10 +1179,17 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  		return -ENOMEM;
> > 
> >  	if (page_count(page) == 1) {
> > +		bool is_lru = !__PageMovable(page);
> > +
> >  		/* page was freed from under us. So we are done. */
> >  		ClearPageActive(page);
> >  		ClearPageUnevictable(page);
> > -		if (unlikely(__PageMovable(page))) {
> > +		if (likely(is_lru))
> > +			mod_node_page_state(page_pgdat(page),
> > +						NR_ISOLATED_ANON +
> > +						page_is_file_cache(page),
> > +						hpage_nr_pages(page));

That should be -hpage_nr_pages(page). It's a bug.

> > +		else {
> >  			lock_page(page);
> >  			if (!PageMovable(page))
> >  				__ClearPageIsolated(page);
> 
> As this page will go down the path only through the MIGRATEPAGE_SUCCESS branches,
> with no putback ahead, the current code is, I think, doing right things for this
> work to keep isolated stat balanced.

I guess that's the one you pointed out. Right?
Thanks for the review!

> 
> > @@ -1210,15 +1215,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  		 * restored.
> >  		 */
> >  		list_del(&page->lru);
> > -
> > -		/*
> > -		 * Compaction can migrate also non-LRU pages which are
> > -		 * not accounted to NR_ISOLATED_*. They can be recognized
> > -		 * as __PageMovable
> > -		 */
> > -		if (likely(!__PageMovable(page)))
> > -			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> > -					page_is_file_cache(page), -hpage_nr_pages(page));
> >  	}
> > 
> 
> BR
> Hillf
> 

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-04  4:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603071607.GB4531@dhcp22.suse.cz>

On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > workingset eviction so it ends up increasing performance.
> > > > > > 
> > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > pages to evict early during memory pressure.
> > > > > > 
> > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > head if the page is private because inactive list could be full of
> > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > active pages unless there is no access until the time.
> > > > > 
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > 
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > > 
> > > > Both file and private pages could be deactived from each active LRU
> > > > to each inactive LRU if the page has one map_count. In other words,
> > > > 
> > > >     if (page_mapcount(page) <= 1)
> > > >         deactivate_page(page);
> > > 
> > > Why do we restrict to pages that are single mapped?
> > 
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
> 
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

I will go with removing the part so that defer to decision to the VM reclaim
based on the review.

>  
> [...]
> 
> > > Please document this, if this is really a desirable semantic because
> > > then you have the same set of problems as we've had with the early
> > > MADV_FREE implementation mentioned above.
> > 
> > IIRC, the problem of MADV_FREE was that we couldn't discard freeable
> > pages because VM never scan anonymous LRU with swapless system.
> > However, it's not the our case because we should reclaim them, not
> > discarding.
> 
> Right. But there is still the page cache reclaim. Is it expected that
> an explicitly cold memory doesn't get reclaimed because we have a
> sufficient amount of page cache (a very common case) and we never age
> anonymous memory because of that?

If there are lots of used-once pages in file-LRU, I think there is no
need to reclaim anonymous pages because it needs bigger overhead due to
IO. It has been true for a long time in current VM policy.

Reclaim preference model based on hints is as following based on cost:

MADV_DONTNEED >> MADV_PAGEOUT > used-once pages > MADV_FREE >= MADV_COLD

It is desirable for the new hints to be placed in the reclaiming preference
order such that a) they don't overlap functionally with existing hints and
b) we have a balanced ordering of disruptive and non-disruptive hints.

^ permalink raw reply

* Re: [PATCH v1 4/4] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-06-04  3:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603203911.GA14953@cmpxchg.org>

On Mon, Jun 03, 2019 at 04:39:11PM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 02:36:55PM +0900, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > for a long time, it could hint kernel that the pages can be
> > reclaimed instantly but data should be preserved for future use.
> > This could reduce workingset eviction so it ends up increasing
> > performance.
> > 
> > This patch introduces the new MADV_PAGEOUT hint to madvise(2)
> > syscall. MADV_PAGEOUT can be used by a process to mark a memory
> > range as not expected to be used for a long time so that kernel
> > reclaims *any LRU* pages instantly. The hint can help kernel in deciding
> > which pages to evict proactively.
> > 
> > All of error rule is same with MADV_DONTNEED.
> > 
> > Note:
> >     This hint works with only private pages(IOW, page_mapcount(page) < 2)
> >     because shared page could have more chance to be accessed from other
> >     processes sharing the page so that it could cause major fault soon,
> >     which is inefficient.
> > 
> > * RFC v2
> >  * make reclaim_pages simple via factoring out isolate logic - hannes
> > 
> > * RFCv1
> >  * rename from MADV_COLD to MADV_PAGEOUT - hannes
> >  * bail out if process is being killed - Hillf
> >  * fix reclaim_pages bugs - Hillf
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/swap.h                   |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/madvise.c                           | 126 +++++++++++++++++++++++++
> >  mm/vmscan.c                            |  34 +++++++
> >  4 files changed, 162 insertions(+)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 0ce997edb8bb..063c0c1e112b 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -365,6 +365,7 @@ extern int vm_swappiness;
> >  extern int remove_mapping(struct address_space *mapping, struct page *page);
> >  extern unsigned long vm_total_pages;
> >  
> > +extern unsigned long reclaim_pages(struct list_head *page_list);
> >  #ifdef CONFIG_NUMA
> >  extern int node_reclaim_mode;
> >  extern int sysctl_min_unmapped_ratio;
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 1190f4e7f7b9..92e347a89ddc 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -44,6 +44,7 @@
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> >  #define MADV_COLD	5		/* deactivatie these pages */
> > +#define MADV_PAGEOUT	6		/* reclaim these pages */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index ab158766858a..b010249cb8b6 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> >  	case MADV_COLD:
> > +	case MADV_PAGEOUT:
> >  	case MADV_FREE:
> >  		return 0;
> >  	default:
> > @@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	LIST_HEAD(page_list);
> > +	struct page *page;
> > +	int isolated = 0;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	unsigned long next;
> > +
> > +	if (fatal_signal_pending(current))
> > +		return -EINTR;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		if (is_huge_zero_pmd(*pmd))
> > +			goto huge_unlock;
> > +
> > +		page = pmd_page(*pmd);
> > +		if (page_mapcount(page) > 1)
> > +			goto huge_unlock;
> > +
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> > +
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		if (isolate_lru_page(page))
> > +			goto huge_unlock;
> > +
> > +		list_add(&page->lru, &page_list);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		reclaim_pages(&page_list);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +regular_page:
> > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (page_mapcount(page) > 1)
> > +			continue;
> > +
> > +		if (isolate_lru_page(page))
> > +			continue;
> > +
> > +		isolated++;
> > +		list_add(&page->lru, &page_list);
> > +		if (isolated >= SWAP_CLUSTER_MAX) {
> > +			pte_unmap_unlock(orig_pte, ptl);
> > +			reclaim_pages(&page_list);
> > +			isolated = 0;
> > +			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +			orig_pte = pte;
> > +		}
> > +	}
> > +
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	reclaim_pages(&page_list);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> > +
> > +static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	struct mm_walk warm_walk = {
> > +		.pmd_entry = madvise_pageout_pte_range,
> > +		.mm = vma->vm_mm,
> > +	};
> > +
> > +	tlb_start_vma(tlb, vma);
> > +	walk_page_range(addr, end, &warm_walk);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> > +
> > +static long madvise_pageout(struct vm_area_struct *vma,
> > +			struct vm_area_struct **prev,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	*prev = vma;
> > +	if (!can_madv_lru_vma(vma))
> > +		return -EINVAL;
> > +
> > +	lru_add_drain();
> > +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > +	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> > +
> > +	return 0;
> > +}
> > +
> >  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  				unsigned long end, struct mm_walk *walk)
> >  
> > @@ -805,6 +928,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		return madvise_willneed(vma, prev, start, end);
> >  	case MADV_COLD:
> >  		return madvise_cold(vma, prev, start, end);
> > +	case MADV_PAGEOUT:
> > +		return madvise_pageout(vma, prev, start, end);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > @@ -827,6 +952,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_DONTNEED:
> >  	case MADV_FREE:
> >  	case MADV_COLD:
> > +	case MADV_PAGEOUT:
> >  #ifdef CONFIG_KSM
> >  	case MADV_MERGEABLE:
> >  	case MADV_UNMERGEABLE:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 56df55e8afcd..2c2cf442db58 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2136,6 +2136,40 @@ static void shrink_active_list(unsigned long nr_to_scan,
> >  			nr_deactivate, nr_rotated, sc->priority, file);
> >  }
> >  
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +	LIST_HEAD(node_page_list);
> > +	struct reclaim_stat dummy_stat;
> > +	struct scan_control sc = {
> > +		.gfp_mask = GFP_KERNEL,
> > +		.priority = DEF_PRIORITY,
> > +		.may_writepage = 1,
> > +		.may_unmap = 1,
> > +		.may_swap = 1,
> > +	};
> > +
> > +	while (!list_empty(page_list)) {
> > +		struct page *page;
> > +
> > +		page = lru_to_page(page_list);
> > +		list_move(&page->lru, &node_page_list);
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +						page_pgdat(page),
> > +						&sc, TTU_IGNORE_ACCESS,
> > +						&dummy_stat, true);
> > +		if (!list_empty(&node_page_list)) {
> > +			struct page *page = lru_to_page(&node_page_list);
> > +
> > +			list_del(&page->lru);
> > +			putback_lru_page(page);
> > +
> > +		}
> > +	}
> 
> Awesome, this is way more readable now. Thanks for the cleanup!
> 
> Regarding the loop, for the vast majority of instances, pages on
> page_list will actually be from the same node. It would make sense to
> do batching here and collect pages until last_pgdat != pgdat. That
> should reduce the number of tlb flushes and memcg uncharge flushes in
> shrink_page_list().

Sure, Thanks!

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-03 23:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, linux-mm, LKML, linux-api,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603215059.GA16824@cmpxchg.org>

Hi Johannes,

On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > > 
> > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > > 
> > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > active pages unless there is no access until the time.
> > > > > > > > 
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > 
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > 
> > > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > > 
> > > > > > >     if (page_mapcount(page) <= 1)
> > > > > > >         deactivate_page(page);
> > > > > > 
> > > > > > Why do we restrict to pages that are single mapped?
> > > > > 
> > > > > Because page table in one of process shared the page would have access bit
> > > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > > the more fail to reclaim.
> > > > 
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > > 
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > > 
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > > 
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> > 
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with a reasoning behind. Consistency is certainly an argument
> > to use.
> > 
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > a pagecache with a sufficient privileges sounds like a useful usecase to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
> 
> Right, I don't think it's about permission - as you say, you can do
> this indirectly. Page reclaim is all about relative page order, so if
> we thwarted you from demoting some pages, you could instead promote
> other pages to cause a similar end result.
> 
> I think it's about intent. You're advising the kernel that *you're*
> not using this memory and would like to have it cleared out based on
> that knowledge. You could do the same by simply allocating the new
> pages and have the kernel sort it out. However, if the kernel sorts it
> out, it *will* look at other users of the page, and it might decide
> that other pages are actually colder when considering all users.
> 
> When you ignore shared state, on the other hand, the pages you advise
> out could refault right after. And then, not only did you not free up
> the memory, but you also caused IO that may interfere with bringing in
> the new data for which you tried to create room in the first place.
> 
> So I don't think it ever makes sense to override it.
> 
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
                     ^
                     old?

> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.

You are confused from the name change. Here, MADV_COLD is deactivating
, not pageing out. Therefore, shrink_page_list doesn't matter.
And madvise_cold_pte_range already makes the local pte *old*(I guess
your saying was typo).
I guess that's exactly what Michal wanted: just removing page_mapcount
check and defers to decision on normal page reclaim policy:
If I didn't miss your intention, it seems you and Michal are on same page.
(Please correct me if you want to say something other)
I could drop the page_mapcount check at next revision.

Thanks for the review!

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Johannes Weiner @ 2019-06-03 21:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603203230.GB22799@dhcp22.suse.cz>

On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > 
> > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > pages to evict early during memory pressure.
> > > > > > > > 
> > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > active pages unless there is no access until the time.
> > > > > > > 
> > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > 
> > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > 
> > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > 
> > > > > >     if (page_mapcount(page) <= 1)
> > > > > >         deactivate_page(page);
> > > > > 
> > > > > Why do we restrict to pages that are single mapped?
> > > > 
> > > > Because page table in one of process shared the page would have access bit
> > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > the more fail to reclaim.
> > > 
> > > So what? In other words why should it be restricted solely based on the
> > > map count. I can see a reason to restrict based on the access
> > > permissions because we do not want to simplify all sorts of side channel
> > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > far I haven't heard any sound argument why madvise should skip those.
> > > Again if there are any reasons, then document them in the changelog.
> > 
> > I think it makes sense. It could be explained, but it also follows
> > established madvise semantics, and I'm not sure it's necessarily
> > Minchan's job to re-iterate those.
> > 
> > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > ksm etc. When you madvise, you can really only speak for your own
> > reference to that memory - "*I* am not using this."
> > 
> > This is in line with other madvise calls: MADV_DONTNEED clears the
> > local page table entries and drops the corresponding references, so
> > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > also has explicit mapcount checks before clearing PG_dirty, so again
> > shared pages don't get freed.
> 
> Right, being consistent with other madvise syscalls is certainly a way
> to go. And I am not pushing one way or another, I just want this to be
> documented with a reasoning behind. Consistency is certainly an argument
> to use.
> 
> On the other hand these non-destructive madvise operations are quite
> different and the shared policy might differ as a result as well. We are
> aging objects rather than destroying them after all. Being able to age
> a pagecache with a sufficient privileges sounds like a useful usecase to
> me. In other words you are able to cause the same effect indirectly
> without the madvise operation so it kinda makes sense to allow it in a
> more sophisticated way.

Right, I don't think it's about permission - as you say, you can do
this indirectly. Page reclaim is all about relative page order, so if
we thwarted you from demoting some pages, you could instead promote
other pages to cause a similar end result.

I think it's about intent. You're advising the kernel that *you're*
not using this memory and would like to have it cleared out based on
that knowledge. You could do the same by simply allocating the new
pages and have the kernel sort it out. However, if the kernel sorts it
out, it *will* look at other users of the page, and it might decide
that other pages are actually colder when considering all users.

When you ignore shared state, on the other hand, the pages you advise
out could refault right after. And then, not only did you not free up
the memory, but you also caused IO that may interfere with bringing in
the new data for which you tried to create room in the first place.

So I don't think it ever makes sense to override it.

But it might be better to drop the explicit mapcount check and instead
make the local pte young and call shrink_page_list() without the
TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
to handle references and shared pages exactly the same way it would if
those pages came fresh off the LRU tail, excluding only the reference
from the mapping that we're madvising.

^ permalink raw reply

* Re: [PATCH v1 4/4] mm: introduce MADV_PAGEOUT
From: Johannes Weiner @ 2019-06-03 20:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603053655.127730-5-minchan@kernel.org>

On Mon, Jun 03, 2019 at 02:36:55PM +0900, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> for a long time, it could hint kernel that the pages can be
> reclaimed instantly but data should be preserved for future use.
> This could reduce workingset eviction so it ends up increasing
> performance.
> 
> This patch introduces the new MADV_PAGEOUT hint to madvise(2)
> syscall. MADV_PAGEOUT can be used by a process to mark a memory
> range as not expected to be used for a long time so that kernel
> reclaims *any LRU* pages instantly. The hint can help kernel in deciding
> which pages to evict proactively.
> 
> All of error rule is same with MADV_DONTNEED.
> 
> Note:
>     This hint works with only private pages(IOW, page_mapcount(page) < 2)
>     because shared page could have more chance to be accessed from other
>     processes sharing the page so that it could cause major fault soon,
>     which is inefficient.
> 
> * RFC v2
>  * make reclaim_pages simple via factoring out isolate logic - hannes
> 
> * RFCv1
>  * rename from MADV_COLD to MADV_PAGEOUT - hannes
>  * bail out if process is being killed - Hillf
>  * fix reclaim_pages bugs - Hillf
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/swap.h                   |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c                           | 126 +++++++++++++++++++++++++
>  mm/vmscan.c                            |  34 +++++++
>  4 files changed, 162 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0ce997edb8bb..063c0c1e112b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -365,6 +365,7 @@ extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern unsigned long vm_total_pages;
>  
> +extern unsigned long reclaim_pages(struct list_head *page_list);
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 1190f4e7f7b9..92e347a89ddc 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -44,6 +44,7 @@
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
>  #define MADV_COLD	5		/* deactivatie these pages */
> +#define MADV_PAGEOUT	6		/* reclaim these pages */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_FREE	8		/* free pages only if memory pressure */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index ab158766858a..b010249cb8b6 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
>  	case MADV_COLD:
> +	case MADV_PAGEOUT:
>  	case MADV_FREE:
>  		return 0;
>  	default:
> @@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	LIST_HEAD(page_list);
> +	struct page *page;
> +	int isolated = 0;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		if (isolate_lru_page(page))
> +			goto huge_unlock;
> +
> +		list_add(&page->lru, &page_list);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		reclaim_pages(&page_list);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +regular_page:
> +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (page_mapcount(page) > 1)
> +			continue;
> +
> +		if (isolate_lru_page(page))
> +			continue;
> +
> +		isolated++;
> +		list_add(&page->lru, &page_list);
> +		if (isolated >= SWAP_CLUSTER_MAX) {
> +			pte_unmap_unlock(orig_pte, ptl);
> +			reclaim_pages(&page_list);
> +			isolated = 0;
> +			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +			orig_pte = pte;
> +		}
> +	}
> +
> +	pte_unmap_unlock(orig_pte, ptl);
> +	reclaim_pages(&page_list);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static void madvise_pageout_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct mm_walk warm_walk = {
> +		.pmd_entry = madvise_pageout_pte_range,
> +		.mm = vma->vm_mm,
> +	};
> +
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &warm_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +
> +static long madvise_pageout(struct vm_area_struct *vma,
> +			struct vm_area_struct **prev,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	*prev = vma;
> +	if (!can_madv_lru_vma(vma))
> +		return -EINVAL;
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> +	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> +	return 0;
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				unsigned long end, struct mm_walk *walk)
>  
> @@ -805,6 +928,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_COLD:
>  		return madvise_cold(vma, prev, start, end);
> +	case MADV_PAGEOUT:
> +		return madvise_pageout(vma, prev, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> @@ -827,6 +952,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_DONTNEED:
>  	case MADV_FREE:
>  	case MADV_COLD:
> +	case MADV_PAGEOUT:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 56df55e8afcd..2c2cf442db58 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2136,6 +2136,40 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  			nr_deactivate, nr_rotated, sc->priority, file);
>  }
>  
> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	LIST_HEAD(node_page_list);
> +	struct reclaim_stat dummy_stat;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = 1,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +	};
> +
> +	while (!list_empty(page_list)) {
> +		struct page *page;
> +
> +		page = lru_to_page(page_list);
> +		list_move(&page->lru, &node_page_list);
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +						page_pgdat(page),
> +						&sc, TTU_IGNORE_ACCESS,
> +						&dummy_stat, true);
> +		if (!list_empty(&node_page_list)) {
> +			struct page *page = lru_to_page(&node_page_list);
> +
> +			list_del(&page->lru);
> +			putback_lru_page(page);
> +
> +		}
> +	}

Awesome, this is way more readable now. Thanks for the cleanup!

Regarding the loop, for the vast majority of instances, pages on
page_list will actually be from the same node. It would make sense to
do batching here and collect pages until last_pgdat != pgdat. That
should reduce the number of tlb flushes and memcg uncharge flushes in
shrink_page_list().

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-06-03 20:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603172717.GA30363@cmpxchg.org>

On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > 
> > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > pages to evict early during memory pressure.
> > > > > > > 
> > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > head if the page is private because inactive list could be full of
> > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > active pages unless there is no access until the time.
> > > > > > 
> > > > > > [I am intentionally not looking at the implementation because below
> > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > 
> > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > Private/shared? If shared, are there any restrictions?
> > > > > 
> > > > > Both file and private pages could be deactived from each active LRU
> > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > 
> > > > >     if (page_mapcount(page) <= 1)
> > > > >         deactivate_page(page);
> > > > 
> > > > Why do we restrict to pages that are single mapped?
> > > 
> > > Because page table in one of process shared the page would have access bit
> > > so finally we couldn't reclaim the page. The more process it is shared,
> > > the more fail to reclaim.
> > 
> > So what? In other words why should it be restricted solely based on the
> > map count. I can see a reason to restrict based on the access
> > permissions because we do not want to simplify all sorts of side channel
> > attacks but memory reclaim is capable of reclaiming shared pages and so
> > far I haven't heard any sound argument why madvise should skip those.
> > Again if there are any reasons, then document them in the changelog.
> 
> I think it makes sense. It could be explained, but it also follows
> established madvise semantics, and I'm not sure it's necessarily
> Minchan's job to re-iterate those.
> 
> Sharing isn't exactly transparent to userspace. The kernel does COW,
> ksm etc. When you madvise, you can really only speak for your own
> reference to that memory - "*I* am not using this."
> 
> This is in line with other madvise calls: MADV_DONTNEED clears the
> local page table entries and drops the corresponding references, so
> shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> also has explicit mapcount checks before clearing PG_dirty, so again
> shared pages don't get freed.

Right, being consistent with other madvise syscalls is certainly a way
to go. And I am not pushing one way or another, I just want this to be
documented with a reasoning behind. Consistency is certainly an argument
to use.

On the other hand these non-destructive madvise operations are quite
different and the shared policy might differ as a result as well. We are
aging objects rather than destroying them after all. Being able to age
a pagecache with a sufficient privileges sounds like a useful usecase to
me. In other words you are able to cause the same effect indirectly
without the madvise operation so it kinda makes sense to allow it in a
more sophisticated way.

That being said, madvise is just a _hint_ and the kernel will be always
free to ignore it so the future implementation might change so we can
start simple and consistent with existing MADV_$FOO operations now and
extend later on. But let's document the intention in the changelog and
make the decision clear. I am sorry to be so anal about this but I have
seen so many ad-hoc policies that were undocumented and it was so hard
to guess when revisiting later on and make some sense of it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Steve Grubb @ 2019-06-03 20:24 UTC (permalink / raw)
  To: Paul Moore
  Cc: Richard Guy Briggs, Tycho Andersen, Serge E. Hallyn, containers,
	linux-api, Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris, ebiederm,
	nhorman
In-Reply-To: <CAHC9VhTrM1op_EH=YAn9pU8dMOr=jB-Ph4SxFeqGFskwLmFnCA@mail.gmail.com>

Hello Paul,

I am curious about this. We seemed to be close to getting this patch pulled 
in. A lot of people are waiting for it. Can you summarize what you think the 
patches need and who we think needs to do it? I'm lost. Does LXC people need 
to propose something? Does Richard? Someone else? Who's got the ball?

Thank,
-Steve

On Friday, May 31, 2019 8:44:45 AM EDT Paul Moore wrote:
> On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-05-30 19:26, Paul Moore wrote:
> > > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > > > 
> > > > This sort of seems like a distinction without a difference;
> > > > presumably
> > > > audit is going to want to differentiate between everything that
> > > > people
> > > > in userspace call a container. So you'll have to support all this
> > > > insanity anyway, even if it's "not a container ID".
> > > 
> > > That's not quite right.  Audit doesn't care about what a container is,
> > > or is not, it also doesn't care if the "audit container ID" actually
> > > matches the ID used by the container engine in userspace and I think
> > > that is a very important line to draw.  Audit is simply given a value
> > > which it calls the "audit container ID", it ensures that the value is
> > > inherited appropriately (e.g. children inherit their parent's audit
> > > container ID), and it uses the value in audit records to provide some
> > > additional context for log analysis.  The distinction isn't limited to
> > > the value itself, but also to how it is used; it is an "audit
> > > container ID" and not a "container ID" because this value is
> > > exclusively for use by the audit subsystem.  We are very intentionally
> > > not adding a generic container ID to the kernel.  If the kernel does
> > > ever grow a general purpose container ID we will be one of the first
> > > ones in line to make use of it, but we are not going to be the ones to
> > > generically add containers to the kernel.  Enough people already hate
> > > audit ;)
> > > 
> > > > > I'm not interested in supporting/merging something that isn't
> > > > > useful;
> > > > > if this doesn't work for your use case then we need to figure out
> > > > > what
> > > > > would work.  It sounds like nested containers are much more common
> > > > > in
> > > > > the lxc world, can you elaborate a bit more on this?
> > > > > 
> > > > > As far as the possible solutions you mention above, I'm not sure I
> > > > > like the per-userns audit container IDs, I'd much rather just emit
> > > > > the
> > > > > necessary tracking information via the audit record stream and let
> > > > > the
> > > > > log analysis tools figure it out.  However, the bigger question is
> > > > > how
> > > > > to limit (re)setting the audit container ID when you are in a
> > > > > non-init
> > > > > userns.  For reasons already mentioned, using capable() is a non
> > > > > starter for everything but the initial userns, and using
> > > > > ns_capable()
> > > > > is equally poor as it essentially allows any userns the ability to
> > > > > munge it's audit container ID (obviously not good).  It appears we
> > > > > need a different method for controlling access to the audit
> > > > > container
> > > > > ID.
> > > > 
> > > > One option would be to make it a string, and have it be append only.
> > > > That should be safe with no checks.
> > > > 
> > > > I know there was a long thread about what type to make this thing. I
> > > > think you could accomplish the append-only-ness with a u64 if you had
> > > > some rule about only allowing setting lower order bits than those
> > > > that
> > > > are already set. With 4 bits for simplicity:
> > > > 
> > > > 1100         # initial container id
> > > > 1100 -> 1011 # not allowed
> > > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > > > 
> > > >              # no lower order bits left
> > > > 
> > > > There are probably fancier ways to do it if you actually understand
> > > > math :)
> > >  
> > >  ;)
> > >  
> > > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > > you have 64 bits, this might be reasonable. You could just teach
> > > > container engines to use the first say N bits for themselves, with a
> > > > 1
> > > > bit for the barrier at the end.
> > > 
> > > I like the creativity, but I worry that at some point these
> > > limitations are going to be raised (limits have a funny way of doing
> > > that over time) and we will be in trouble.  I say "trouble" because I
> > > want to be able to quickly do an audit container ID comparison and
> > > we're going to pay a penalty for these larger values (we'll need this
> > > when we add multiple auditd support and the requisite record routing).
> > > 
> > > Thinking about this makes me also realize we probably need to think a
> > > bit longer about audit container ID conflicts between orchestrators.
> > > Right now we just take the value that is given to us by the
> > > orchestrator, but if we want to allow multiple container orchestrators
> > > to work without some form of cooperation in userspace (I think we have
> > > to assume the orchestrators will not talk to each other) we likely
> > > need to have some way to block reuse of an audit container ID.  We
> > > would either need to prevent the orchestrator from explicitly setting
> > > an audit container ID to a currently in use value, or instead generate
> > > the audit container ID in the kernel upon an event triggered by the
> > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > start looking at the idr code, I think we will need to make use of it.
> > 
> > My first reaction to using the IDR code is that once an idr is given up,
> > it can be reused.  I suppose we request IDRs and then never give them up
> > to avoid reuse...
> 
> I'm not sure we ever what to guarantee that an audit container ID
> won't be reused during the lifetime of the system, it is a discrete
> integer after all.  What I think we do want to guarantee is that we
> won't allow an unintentional audit container ID collision between
> different orchestrators; if a single orchestrator wants to reuse an
> audit container ID then that is its choice.
> 
> > I already had some ideas of preventing an existing ID from being reused,
> 
> Cool.  I only made the idr suggestion since it is used for PIDs and
> solves a very similar problem.
> 
> > but that makes the practice of some container engines injecting
> > processes into existing containers difficult if not impossible.
> 
> Yes, we'll need some provision to indicate which orchestrator
> "controls" that particular audit container ID, and allow that
> orchestrator to reuse that particular audit container ID (until all
> those containers disappear and the audit container ID is given back to
> the pool).

^ permalink raw reply

* Re: [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Rob Landley @ 2019-06-03 18:32 UTC (permalink / raw)
  To: Roberto Sassu, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan,
	niveditas98
In-Reply-To: <cf9d08ca-74c7-c945-5bf9-7c3495907d1e@huawei.com>

On 6/3/19 4:31 AM, Roberto Sassu wrote:
>> This patch set aims at solving the following use case: appraise files from
>> the initial ram disk. To do that, IMA checks the signature/hash from the
>> security.ima xattr. Unfortunately, this use case cannot be implemented
>> currently, as the CPIO format does not support xattrs.
>>
>> This proposal consists in including file metadata as additional files named
>> METADATA!!!, for each file added to the ram disk. The CPIO parser in the
>> kernel recognizes these special files from the file name, and calls the
>> appropriate parser to add metadata to the previously extracted file. It has
>> been proposed to use bit 17:16 of the file mode as a way to recognize files
>> with metadata, but both the kernel and the cpio tool declare the file mode
>> as unsigned short.
>
> Any opinion on this patch set?
> 
> Thanks
> 
> Roberto

Sorry, I've had the window open since you posted it but haven't gotten around to
it. I'll try to build it later today.

It does look interesting, and I have no objections to the basic approach. I
should be able to add support to toybox cpio over a weekend once I've got the
kernel doing it to test against.

Rob

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill A. Shutemov @ 2019-06-03 17:47 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <5ae7e3c1-3875-ea1e-54b3-ac3c493a11f0@virtuozzo.com>

On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote:
> On 03.06.2019 17:38, Kirill Tkhai wrote:
> > On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> >> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >>> This patchset adds a new syscall, which makes possible
> >>> to clone a VMA from a process to current process.
> >>> The syscall supplements the functionality provided
> >>> by process_vm_writev() and process_vm_readv() syscalls,
> >>> and it may be useful in many situation.
> >>
> >> Kirill, could you explain how the change affects rmap and how it is safe.
> >>
> >> My concern is that the patchset allows to map the same page multiple times
> >> within one process or even map page allocated by child to the parrent.
> > 
> > Speaking honestly, we already support this model, since ZERO_PAGE() may
> > be mapped multiply times in any number of mappings.
> 
> Picking of huge_zero_page and mremapping its VMA to unaligned address also gives
> the case, when the same huge page is mapped as huge page and as set of ordinary
> pages in the same process.
> 
> Summing up two above cases, is there really a fundamental problem with
> the functionality the patch set introduces? It looks like we already have
> these cases in stable kernel supported.

It *might* work. But it requires a lot of audit to prove that it actually
*does* work.

For instance, are you sure it will not break KSM? What does it mean for
memory accounting? memcg?

My point is that you breaking long standing invariant in Linux MM and it
has to be properly justified.

I would expect to see some strange deadlocks or permanent trylock failure
as result of such change.

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Johannes Weiner @ 2019-06-03 17:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton
In-Reply-To: <20190603071607.GB4531@dhcp22.suse.cz>

On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > workingset eviction so it ends up increasing performance.
> > > > > > 
> > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > pages to evict early during memory pressure.
> > > > > > 
> > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > head if the page is private because inactive list could be full of
> > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > active pages unless there is no access until the time.
> > > > > 
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > 
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > > 
> > > > Both file and private pages could be deactived from each active LRU
> > > > to each inactive LRU if the page has one map_count. In other words,
> > > > 
> > > >     if (page_mapcount(page) <= 1)
> > > >         deactivate_page(page);
> > > 
> > > Why do we restrict to pages that are single mapped?
> > 
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
> 
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

I think it makes sense. It could be explained, but it also follows
established madvise semantics, and I'm not sure it's necessarily
Minchan's job to re-iterate those.

Sharing isn't exactly transparent to userspace. The kernel does COW,
ksm etc. When you madvise, you can really only speak for your own
reference to that memory - "*I* am not using this."

This is in line with other madvise calls: MADV_DONTNEED clears the
local page table entries and drops the corresponding references, so
shared pages won't get freed. MADV_FREE clears the pte dirty bit and
also has explicit mapcount checks before clearing PG_dirty, so again
shared pages don't get freed.

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: David Howells @ 2019-06-03 16:30 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: dhowells, Jann Horn, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Andy Lutomirski
In-Reply-To: <c95dd6cd-5530-6b70-68f6-4038edd72352@schaufler-ca.com>

Casey Schaufler <casey@schaufler-ca.com> wrote:

> >> should be used. Someone or something caused the event. It can
> >> be important who it was.
> > The kernel's normal security model means that you should be able to
> > e.g. accept FDs that random processes send you and perform
> > read()/write() calls on them without acting as a subject in any
> > security checks; let alone close().
> 
> Passed file descriptors are an anomaly in the security model
> that (in this developer's opinion) should have never been
> included. More than one of the "B" level UNIX systems disabled
> them outright. 

Considering further on this, I think the only way to implement what you're
suggesting is to add a field to struct file to record the last fputter's creds
as the procedure of fputting is offloaded to a workqueue.

Note that's last fputter, not the last closer, as we don't track the number of
open fds linked to a file struct.

In the case of AF_UNIX sockets that contain in-the-process-of-being-passed fds
at the time of closure, this is further complicated by the socket fput being
achieved in the work item - thereby adding layers of indirection.

It might be possible to replace f_cred rather than adding a new field, but
that might get used somewhere after that point.

Note also that fsnotify_close() doesn't appear to use the last fputter's path
since it's not available if called from deferred fput.

David

^ permalink raw reply

* Re: [PATCH ghak90 V6] fixup! audit: add containerid filtering
From: Paul Moore @ 2019-06-03 16:01 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <fadb320e38a899441fcc693bbbc822a3b57f1a46.1559239558.git.rgb@redhat.com>

On Fri, May 31, 2019 at 1:54 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> Remove the BUG() call since we will never have an invalid op value as
> audit_data_to_entry()/audit_to_op() ensure that the op value is a a
> known good value.
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  kernel/auditfilter.c | 1 -
>  1 file changed, 1 deletion(-)

Thanks for sending this out.  However, in light of the discussion in
the patchset's cover letter it looks like we need to better support
nested container orchestrators which is likely going to require some
non-trivial changes to the kernel/userspace API.  Because of this I'm
going to hold off pulling these patches into a "working" branch,
hopefully the next revision of these patches will solve the nested
orchestrator issue enough that we can continue to move forward with
testing.

> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> index 407b5bb3b4c6..385a114a1254 100644
> --- a/kernel/auditfilter.c
> +++ b/kernel/auditfilter.c
> @@ -1244,7 +1244,6 @@ int audit_comparator64(u64 left, u32 op, u64 right)
>         case Audit_bittest:
>                 return ((left & right) == right);
>         default:
> -               BUG();
>                 return 0;
>         }
>  }
> --
> 1.8.3.1
>


-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Daniel Colascione @ 2019-06-03 15:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, Linux API,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn, Oleg Nesterov,
	Christian Brauner, oleksandr, hdanton
In-Reply-To: <20190603071607.GB4531@dhcp22.suse.cz>

On Mon, Jun 3, 2019 at 12:16 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > workingset eviction so it ends up increasing performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > head if the page is private because inactive list could be full of
> > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > active pages unless there is no access until the time.
> > > > >
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > >
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > >
> > > > Both file and private pages could be deactived from each active LRU
> > > > to each inactive LRU if the page has one map_count. In other words,
> > > >
> > > >     if (page_mapcount(page) <= 1)
> > > >         deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

Whether to reclaim shared pages is a policy decision best left to
userland, IMHO.

^ permalink raw reply

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Renzo Davoli @ 2019-06-03 15:00 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner
In-Reply-To: <cd20672aaf13f939b4f798d0839d2438@suse.de>

Hi Roman,

	 I sorry for the delay in my answer, but I needed to set up a minimal
tutorial to show what I am working on and why I need a feature like the
one I am proposing.

Please, have a look of the README.md page here:
https://github.com/virtualsquare/vuos
(everything can be downloaded and tested)

On Fri, May 31, 2019 at 01:48:39PM +0200, Roman Penyaev wrote:
> Since each such a stack has a set of read/write/etc functions you always
> can extend you stack with another call which returns you event mask,
> specifying what exactly you have to do, e.g.:
> 
>     nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
>     for (n = 0; n < nfds; ++n) {
>          struct sock *sock;
> 
>          sock = events[n].data.ptr;
>          events = sock->get_events(sock, &events[n]);
> 
>          if (events & EPOLLIN)
>              sock->read(sock);
>          if (events & EPOLLOUT)
>              sock->write(sock);
>     }
> 
> 
> With such a virtual table you can mix all userspace stacks and even
> with normal sockets, for which 'get_events' function can be declared as
> 
> static poll_t kernel_sock_get_events(struct sock *sock, struct epoll_event
> *ev)
> {
>     return ev->events;
> }
> 
> Do I miss something?

I am not trying to port some tools to use user-space implemented stacks or device
drivers/emulators, I am seeking to a general purpose approach.

I think that the example in the section of the README "mount a user-level 
networking stack" explains the situation.

The submodule vunetvdestack uses a namespace to define a networking stack connected
to a VDE network (see https://github.com/rd235/vdeplug4).

The API is clean (as it can be seen at the end of the file vunet_modules/vunetvdestack.c).
All the methods but "socket" are directly mapped to their system call counterparts:

struct vunet_operations vunet_ops = {
  .socket = vdestack_socket,
  .bind = bind,
  .connect = connect,
  .listen = listen,
  .accept4 = accept4,
....
	.epoll_ctl = epoll_ctl,
...
}

(the elegance of the API can be seen also in vunet_modules/vunetreal.c: a 38 lines module
 implementing a gateway to the real networking of the hosting machine)

Unfortunately I cannot use the same clean interface to support user-library implemented
stacks like lwip/lwipv6/picotcp because I cannot generate EPOLL events...

Bizantine workarounds based on data structures exchanged in the data.ptr field of epoll_event
that must be decoded by the hypervisor to retrieve the missing information about the event
can be implemented... but it would be a pity ;-)

The same problem arises in umdev modules: virtual devices should generate the same
EPOLL events of their real couterparts.

I feel that the ability to generate/synthesize EPOLL events could be useful for many projects.
(In my first message I included some URLs of people seeking for this feature, retrieved by
 some queries on a web search engine)

Implementations may vary as well as the kernel API to support such a feature.
As I told, my proposal has a minimal impact on the code, it does not require the definition
of new syscalls, it simply enhances the features of eventfd.

> 
> Eventually you come up with such a lock to protect your tcp or whatever
> state machine.  Or you have a real example where read and write paths
> can work completely independently?

Actually umvu hypervisor uses concurrent tracing of concurrent processes.
We have named this technique "guardian angels": each process/thread running in the
partial virtual machine has a correspondent thread in the hypervisor.
So if a process uses two threads to manage a network connection (say a TCP stream),
the two guardian angels replicate their requests towards the networking module.

So I am looking for a general solution, not to a pattern to port some projects.
(and I cannot use two different approaches for event driven and multi-threaded
 implementations as I have to support both).

If you reached this point...  Thank you for your patience.
I am more than pleased to receive further comments or proposals.

	renzo

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-06-03 14:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <4228b541-d31c-b76a-2570-1924df0d4724@virtuozzo.com>

On 03.06.2019 17:38, Kirill Tkhai wrote:
> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>> This patchset adds a new syscall, which makes possible
>>> to clone a VMA from a process to current process.
>>> The syscall supplements the functionality provided
>>> by process_vm_writev() and process_vm_readv() syscalls,
>>> and it may be useful in many situation.
>>
>> Kirill, could you explain how the change affects rmap and how it is safe.
>>
>> My concern is that the patchset allows to map the same page multiple times
>> within one process or even map page allocated by child to the parrent.
> 
> Speaking honestly, we already support this model, since ZERO_PAGE() may
> be mapped multiply times in any number of mappings.

Picking of huge_zero_page and mremapping its VMA to unaligned address also gives
the case, when the same huge page is mapped as huge page and as set of ordinary
pages in the same process.

Summing up two above cases, is there really a fundamental problem with
the functionality the patch set introduces? It looks like we already have
these cases in stable kernel supported.

Thanks,
Kirill

^ permalink raw reply

* [PATCH v2 2/2] arch: wire-up clone3() syscall on x86
From: Christian Brauner @ 2019-06-03 14:43 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: keescook, fweimer, oleg, arnd, dhowells, Christian Brauner,
	Andrew Morton, Adrian Reber, linux-api, linux-arch, x86
In-Reply-To: <20190603144331.16760-1-christian@brauner.io>

Wire up the clone3() call on x86.

This patch only wires up clone3() on x86. Some of the arches look like they
need special assembly massaging and it is probably smarter if the
appropriate arch maintainers would do the actual wiring.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
---
v1: unchanged
v2: unchanged
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/uapi/asm-generic/unistd.h      | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..80e26211feff 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+436	i386	clone3			sys_clone3			__ia32_sys_clone3
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..7968f0b5b5e8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+436	common	clone3			__x64_sys_clone3/ptregs
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..45bc87687c47 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_clone3 436
+__SYSCALL(__NR_clone3, sys_clone3)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 437
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox