From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
"Darrick J . Wong" <djwong@kernel.org>,
John Hubbard <jhubbard@nvidia.com>,
Jason Gunthorpe <jgg@nvidia.com>, Hugh Dickins <hughd@google.com>
Subject: [PATCH v1 2/2] mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE)
Date: Thu, 14 Mar 2024 17:13:00 +0100 [thread overview]
Message-ID: <20240314161300.382526-3-david@redhat.com> (raw)
In-Reply-To: <20240314161300.382526-1-david@redhat.com>
We changed faultin_page_range() to no longer consume a VMA, because
faultin_page_range() might internally release the mm lock to lookup
the VMA again -- required to cleanly handle VM_FAULT_RETRY. But
independent of that, __get_user_pages() will always lookup the VMA
itself.
Now that we let __get_user_pages() just handle VMA checks in a way that
is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise()
is just overhead. So let's just call madvise_populate()
on the full range instead.
There is one change in behavior: madvise_walk_vmas() would skip any VMA
holes, and if everything succeeded, it would return -ENOMEM after
processing all VMAs.
However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to
notice any difference: -ENOMEM might either indicate that there were VMA
holes or that populating page tables failed because there was not enough
memory. So it's unlikely that user space will notice the difference, and
that special handling likely only makes sense for some other madvise()
actions.
Further, we'd already fail with -ENOMEM early in the past if looking up the
VMA after dropping the MM lock failed because of concurrent VMA
modifications. So let's just keep it simple and avoid the madvise VMA
walk, and consistently fail early if we find a VMA hole.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/madvise.c | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 1a073fcc4c0c0..a2dd70c4a2e6b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -901,26 +901,19 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return -EINVAL;
}
-static long madvise_populate(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
- int behavior)
+static long madvise_populate(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int behavior)
{
const bool write = behavior == MADV_POPULATE_WRITE;
- struct mm_struct *mm = vma->vm_mm;
int locked = 1;
long pages;
- *prev = vma;
-
while (start < end) {
/* Populate (prefault) page tables readable/writable. */
pages = faultin_page_range(mm, start, end, write, &locked);
if (!locked) {
mmap_read_lock(mm);
locked = 1;
- *prev = NULL;
- vma = NULL;
}
if (pages < 0) {
switch (pages) {
@@ -1021,9 +1014,6 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
- return madvise_populate(vma, prev, start, end, behavior);
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1425,8 +1415,16 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
end = start + len;
blk_start_plug(&plug);
- error = madvise_walk_vmas(mm, start, end, behavior,
- madvise_vma_behavior);
+ switch (behavior) {
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ error = madvise_populate(mm, start, end, behavior);
+ break;
+ default:
+ error = madvise_walk_vmas(mm, start, end, behavior,
+ madvise_vma_behavior);
+ break;
+ }
blk_finish_plug(&plug);
if (write)
mmap_write_unlock(mm);
--
2.43.2
next prev parent reply other threads:[~2024-03-14 16:13 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-14 16:12 [PATCH v1 0/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly David Hildenbrand
2024-03-14 16:12 ` [PATCH v1 1/2] " David Hildenbrand
2024-03-14 16:13 ` David Hildenbrand [this message]
2024-03-15 2:25 ` [PATCH v1 0/2] " Darrick J. Wong
2024-03-17 16:50 ` Darrick J. Wong
2024-03-17 16:51 ` [RFC PATCH] xfs_io: add linux madvise advice codes Darrick J. Wong
2024-03-17 16:53 ` [RFC PATCH] fstests: test MADV_POPULATE_READ with IO errors Darrick J. Wong
2024-03-17 21:14 ` Christoph Hellwig
2024-03-19 8:59 ` David Hildenbrand
2024-03-17 21:14 ` [RFC PATCH] xfs_io: add linux madvise advice codes Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240314161300.382526-3-david@redhat.com \
--to=david@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=djwong@kernel.org \
--cc=hughd@google.com \
--cc=jgg@nvidia.com \
--cc=jhubbard@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).