Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>,
	David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Minchan Kim <minchan@kernel.org>, Colin Cross <ccross@google.com>,
	Suren Baghdasarya <surenb@google.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>
Subject: Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)
Date: Wed, 13 Oct 2021 08:47:11 -0700	[thread overview]
Message-ID: <595A6581-86CF-4372-98AF-532DF65186C6@gmail.com> (raw)
In-Reply-To: <YWYWyUMcgoAJqi3V@t490s>

[-- Attachment #1: Type: text/plain, Size: 5900 bytes --]



> On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote:
>>> 
>>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>>> 
>>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
>>>>> 
>>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>>>> [...]
>>>>>> The manager is notified on memory regions that it should monitor
>>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>>>> to reclaim (anonymous) memory, it:
>>>>>> 
>>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>>>> 3. Write it back to “swap”.
>>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>>> 
>>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>>> 
>>>> Providing hints to the kernel takes you so far to a certain extent.
>>>> The kernel does not want to (for a good reason) to be completely
>>>> configurable when it comes to reclaim and prefetch policies. Doing
>>>> so from userspace allows you to be fully configurable.
>>> 
>>> I am sorry but I do not follow. Your scenario is describing a user
>>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
>>> designed for. What are you missing in the existing functionality?
>> 
>> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
>> many aspects of paging out memory:
>> 
>> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
>> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
>>  on non-contiguous memory).
>> 3. No guarantee the page is actually reclaimed (e.g., writeback)
>>  and the time it takes place.
>> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
>>  as non-performant as it is cannot be used for swap AFAIK).
>> 5. Other operations (e.g., locking, working set tracking) that
>>  might not be necessary or interfere.
>> 
>> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
>> of userfaultfd to trap page-faults and react accordingly, so you
>> are also prevented from:
>> 
>> 6. Having your own custom prefetching policy in response to #PF.
>> 
>> There are additional use-cases I can try to formalize in which
>> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
>> is pretty clear, I think: one is a hint that only applied to
>> page reclamation. The other enables the direct control of
>> userspace over (almost) all aspects of paging.
>> 
>> As I suggested before, if it is preferred, this can be a UFFD
>> IOCTL instead of process_madvise() behavior, thereby lowering
>> the risk of a misuse.
> 
> (Sorry to join so late..)
> 
> Yeah I'm wondering whether that could add one extra layer of security.  But as
> you mentioned, we've already have process_vm_writev(), then it's indeed not
> strong reason to reject process_madvise(DONTNEED) too, it seems.
> 
> Not sure whether you're aware of the umap project from LLNL:
> 
> https://github.com/LLNL/umap
> 
> From what I can tell, that's really doing very similar thing as what you
> proposed here, but it's just a local version of things.  IOW in umap the
> DONTNEED can be done locally with madvise() already in the umap maintained
> threads.  That close the need to introduce the new process_madvise() interface
> and it's definitely safer as it's per-mm and per-task.
> 
> I think you mentioned above that the tracee program will need to cooperate in
> this case, I'm wondering whether some solution like umap would be fine too as
> that also requires cooperation of the tracee program, it's just that the
> cooperation may be slightly more than your solution but frankly I think that's
> still trivial and before I understand the details of your solution I can't
> really tell..
> 
> E.g. for a program to use umap, I think it needs to replace mmap() to umap()
> where we want the buffers to be managed by umap library rather than the kernel,
> then link against the umap library should work.  If the remote solution you're
> proposing requires similar (or even more complicated) cooperation, then it'll
> be controversial whether that can be done per-mm just like how umap designed
> and used.  So IMHO it'll be great to share more details on those parts if umap
> cannot satisfy the current need - IMHO it satisfies all the features you
> described on fully customized pageout and page faulting in, it's just done in a
> single mm.

Thanks for you feedback, Peter.

I am familiar with umap, perhaps not enough, but I am aware.

From my experience, the existing interfaces are not sufficient if you look
for high performance (low overhead) solution for multiple processes. The
level of cooperation that I mentioned is something that I mentioned
preemptively to avoid unnecessary discussion, but I believe they can be
resolved (I have just deferred handling them).

Specifically for performance, several new kernel features are needed, for
instance, support for iouring with async operations, a vectored
UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs and a
vectored madvise(). Even if we talk on the context of a single mm, I
cannot see umap being performant for low latency devices without those
facilities.

Anyhow, I take your feedback and I will resend the patch for enabling
MADV_DONTNEED with other patches once I am done. As for the TLB batching
itself, I think it has an independent value - but I am not going to
argue about it now if there is a pushback against it.


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

next prev parent reply	other threads:[~2021-10-13 15:47 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-26 16:12 [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED) Nadav Amit
2021-09-26 16:12 ` [RFC PATCH 1/8] mm/madvise: propagate vma->vm_end changes Nadav Amit
2021-09-27  9:08   ` Kirill A. Shutemov
2021-09-27 10:11     ` Nadav Amit
2021-09-27 11:55       ` Kirill A. Shutemov
2021-09-27 12:33         ` Nadav Amit
2021-09-27 12:45           ` Kirill A. Shutemov
2021-09-27 12:59             ` Nadav Amit
2021-09-26 16:12 ` [RFC PATCH 2/8] mm/madvise: remove unnecessary check on madvise_dontneed_free() Nadav Amit
2021-09-27  9:11   ` Kirill A. Shutemov
2021-09-27 11:05     ` Nadav Amit
2021-09-27 12:19       ` Kirill A. Shutemov
2021-09-27 12:52         ` Nadav Amit
2021-09-26 16:12 ` [RFC PATCH 3/8] mm/madvise: remove unnecessary checks on madvise_free_single_vma() Nadav Amit
2021-09-27  9:17   ` Kirill A. Shutemov
2021-09-27  9:24     ` Kirill A. Shutemov
2021-09-26 16:12 ` [RFC PATCH 4/8] mm/madvise: define madvise behavior in a struct Nadav Amit
2021-09-27  9:31   ` Kirill A. Shutemov
2021-09-27 10:31     ` Nadav Amit
2021-09-27 12:14       ` Kirill A. Shutemov
2021-09-27 20:36         ` Nadav Amit
2021-09-26 16:12 ` [RFC PATCH 5/8] mm/madvise: perform certain operations once on process_madvise() Nadav Amit
2021-09-26 16:12 ` [RFC PATCH 6/8] mm/madvise: more aggressive TLB batching Nadav Amit
2021-09-26 16:12 ` [RFC PATCH 7/8] mm/madvise: deduplicate code in madvise_dontneed_free() Nadav Amit
2021-09-26 16:12 ` [RFC PATCH 8/8] mm/madvise: process_madvise(MADV_DONTNEED) Nadav Amit
2021-09-27  9:24 ` [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED) David Hildenbrand
2021-09-27 10:41   ` Nadav Amit
2021-09-27 10:58     ` David Hildenbrand
2021-09-27 12:00       ` Nadav Amit
2021-09-27 12:16         ` Michal Hocko
2021-09-27 19:12           ` Nadav Amit
2021-09-29  7:52             ` Michal Hocko
2021-09-29 18:31               ` Nadav Amit
2021-10-12 23:14                 ` Peter Xu
2021-10-13 15:47                   ` Nadav Amit [this message]
2021-10-13 23:09                     ` Peter Xu
2021-09-27 17:05         ` David Hildenbrand
2021-09-27 19:59           ` Nadav Amit
2021-09-28  8:53             ` David Hildenbrand
2021-09-28 22:56               ` Nadav Amit
2021-10-04 17:58                 ` David Hildenbrand
2021-10-07 16:19                   ` Nadav Amit
2021-10-07 16:46                     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=595A6581-86CF-4372-98AF-532DF65186C6@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=ccross@google.com \
    --cc=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=peterx@redhat.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=surenb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox