From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>,
David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linux-MM <linux-mm@kvack.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Minchan Kim <minchan@kernel.org>, Colin Cross <ccross@google.com>,
Suren Baghdasarya <surenb@google.com>,
Mike Rapoport <rppt@linux.vnet.ibm.com>
Subject: Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)
Date: Wed, 13 Oct 2021 08:47:11 -0700
Message-ID: <595A6581-86CF-4372-98AF-532DF65186C6@gmail.com>
In-Reply-To: <YWYWyUMcgoAJqi3V@t490s>
> On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>>
>>
>>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote:
>>>
>>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>>>
>>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
>>>>>
>>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>>>> [...]
>>>>>> The manager is notified on memory regions that it should monitor
>>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>>>> to reclaim (anonymous) memory, it:
>>>>>>
>>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>>>> 3. Write it back to “swap”.
>>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>>>
>>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>>>
>>>> Providing hints to the kernel takes you so far to a certain extent.
>>>> The kernel does not want to (for a good reason) to be completely
>>>> configurable when it comes to reclaim and prefetch policies. Doing
>>>> so from userspace allows you to be fully configurable.
>>>
>>> I am sorry but I do not follow. Your scenario is describing a user
>>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
>>> designed for. What are you missing in the existing functionality?
>>
>> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
>> many aspects of paging out memory:
>>
>> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
>> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
>> on non-contiguous memory).
>> 3. No guarantee the page is actually reclaimed (e.g., writeback)
>> and the time it takes place.
>> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
>> as non-performant as it is cannot be used for swap AFAIK).
>> 5. Other operations (e.g., locking, working set tracking) that
>> might not be necessary or interfere.
>>
>> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
>> of userfaultfd to trap page-faults and react accordingly, so you
>> are also prevented from:
>>
>> 6. Having your own custom prefetching policy in response to #PF.
>>
>> There are additional use-cases I can try to formalize in which
>> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
>> is pretty clear, I think: one is a hint that only applied to
>> page reclamation. The other enables the direct control of
>> userspace over (almost) all aspects of paging.
>>
>> As I suggested before, if it is preferred, this can be a UFFD
>> IOCTL instead of process_madvise() behavior, thereby lowering
>> the risk of a misuse.
>
> (Sorry to join so late..)
>
> Yeah I'm wondering whether that could add one extra layer of security. But as
> you mentioned, we've already have process_vm_writev(), then it's indeed not
> strong reason to reject process_madvise(DONTNEED) too, it seems.
>
> Not sure whether you're aware of the umap project from LLNL:
>
> https://github.com/LLNL/umap
>
> From what I can tell, that's really doing very similar thing as what you
> proposed here, but it's just a local version of things. IOW in umap the
> DONTNEED can be done locally with madvise() already in the umap maintained
> threads. That close the need to introduce the new process_madvise() interface
> and it's definitely safer as it's per-mm and per-task.
>
> I think you mentioned above that the tracee program will need to cooperate in
> this case, I'm wondering whether some solution like umap would be fine too as
> that also requires cooperation of the tracee program, it's just that the
> cooperation may be slightly more than your solution but frankly I think that's
> still trivial and before I understand the details of your solution I can't
> really tell..
>
> E.g. for a program to use umap, I think it needs to replace mmap() to umap()
> where we want the buffers to be managed by umap library rather than the kernel,
> then link against the umap library should work. If the remote solution you're
> proposing requires similar (or even more complicated) cooperation, then it'll
> be controversial whether that can be done per-mm just like how umap designed
> and used. So IMHO it'll be great to share more details on those parts if umap
> cannot satisfy the current need - IMHO it satisfies all the features you
> described on fully customized pageout and page faulting in, it's just done in a
> single mm.
Thanks for your feedback, Peter.

I am aware of umap and familiar with it, though perhaps not enough.

From my experience, the existing interfaces are not sufficient if you are
looking for a high-performance (low-overhead) solution for multiple
processes. I mentioned the required level of cooperation preemptively, to
avoid unnecessary discussion, but I believe the cooperation issues can be
resolved (I have just deferred handling them).

Specifically for performance, several new kernel features are needed: for
instance, support for io_uring with async operations, a vectored
UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs, and a
vectored madvise(). Even if we talk in the context of a single mm, I
cannot see umap being performant for low-latency devices without those
facilities.

Anyhow, I take your feedback and I will resend the patch for enabling
MADV_DONTNEED with other patches once I am done. As for the TLB batching
itself, I think it has independent value, but I am not going to
argue about it now if there is pushback against it.
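
To make the single-mm case concrete, here is a minimal sketch of the
"save, then zap" part of the flow using only the existing local madvise().
It is not code from the patch series. The helper names evict_page() and
demo_evict() are hypothetical, and the memcpy() into a heap buffer only
stands in for writing the page to "swap". No userfaultfd is registered
here, so a later access to the zapped page simply gets a fresh zero-filled
page instead of being trapped by a manager.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy one private anonymous page into `backup`
 * (a stand-in for the write to "swap"), then zap the page with
 * madvise(MADV_DONTNEED). Returns 0 on success, -1 on failure.
 */
int evict_page(void *page, size_t page_size, void *backup)
{
	memcpy(backup, page, page_size);
	return madvise(page, page_size, MADV_DONTNEED);
}

/*
 * Hypothetical demo: returns 1 if, after eviction, the page reads back
 * as zeroes while the backup still holds the old contents, 0 if not,
 * and -1 on error.
 */
int demo_evict(void)
{
	long ps = sysconf(_SC_PAGESIZE);
	unsigned char *page, *backup;
	int ok = 1;
	long i;

	if (ps <= 0)
		return -1;

	page = mmap(NULL, (size_t)ps, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;

	backup = malloc((size_t)ps);
	if (!backup) {
		munmap(page, (size_t)ps);
		return -1;
	}

	/* Dirty the page so there is something to save. */
	memset(page, 0xAB, (size_t)ps);

	if (evict_page(page, (size_t)ps, backup) != 0) {
		ok = -1;
		goto out;
	}

	/* Touching the page again faults in a zero-filled page. */
	for (i = 0; i < ps; i++) {
		if (page[i] != 0 || backup[i] != 0xAB) {
			ok = 0;
			break;
		}
	}
out:
	free(backup);
	munmap(page, (size_t)ps);
	return ok;
}
```

Calling demo_evict() from a test program is enough to exercise it. Note
that this issues one madvise() call per range, which is exactly the
per-call cost that a vectored madvise() would be meant to amortize.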