From: Jason Gunthorpe <jgg@ziepe.ca>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>, John Hubbard <jhubbard@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>,
Linux-MM <linux-mm@kvack.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
"Maya B . Gokhale" <gokhale2@llnl.gov>,
Yang Shi <yang.shi@linux.alibaba.com>,
Marty Mcfadden <mcfadden8@llnl.gov>,
Kirill Shutemov <kirill@shutemov.name>,
Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
Jan Kara <jack@suse.cz>, Kirill Tkhai <ktkhai@virtuozzo.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Christoph Hellwig <hch@lst.de>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
Date: Thu, 17 Sep 2020 16:38:24 -0300 [thread overview]
Message-ID: <20200917193824.GL8409@ziepe.ca> (raw)
In-Reply-To: <CAHk-=wjtfjB3TqTFRzVmOrB9Mii6Yzc-=wKq0fu4ruDE6AsJgg@mail.gmail.com>
On Thu, Sep 17, 2020 at 11:11:06AM -0700, Linus Torvalds wrote:
> (a) if the pinner is going to change the page, it will have to get
> the pin with FOLL_WRITE in addition to FOLL_PIN
>
> (b) that means it will do the COW and mark the page writable and dirty
Yep
> (c) the whole _point_ of the FOLL_PIN is that subsequent operations
> shouldn't make it non-writable any more (ie it can't be unmapped, and
> we should synchronize on fork etc)
It is the ideal, but FOLL_PIN has been troubled for a long time:
https://lwn.net/Articles/753027/
ie writeprotect is known to happen due to writeback. I had understood
that fork() will also cause write protect. Eg
copy_process()
copy_mm()
dup_mm()
dup_mmap()
copy_page_range()
[..]
copy_one_pte()
Gets to:
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}
Which blindly write protects a FOLL_PIN page. Here src_pte will be
under a pin.
I *think* the failing test is basically:
1) pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE)
2) pid = fork()
3) child: does a bit, then exec
4) parent: waitpid(pid)
5) *mem = 0
Here #5 is the WP triggered COW. We know a WP triggered COW is
happening from the bisect and success result with MADV_DONTFORK.
#2 triggers the dup_mmap() and the ptep_set_wrprotect() (From
inspection, at least)
This "Trial do_wp_page() simplification" patch means that when #5 goes
to do COW it gets a copy instead of a re-use because the
reuse_swap_page() was aborting the copy before.
So, to your point, yes ideally FOLL_PIN would never write-protect
pages!
Looking for awhile, this now looks reasonable and
doable. page_maybe_dma_pinned() was created for exactly this kind of
case.
I've attached a dumb sketch for the pte level (surely wrong! I have
never looked at this part of the mm before!) at the end of this
message.
Peter, do you know this better? Does this inspire you to make a patch? :)
Would really love to see this. We have such a huge expensive mess with
MADV_DONTFORK, this would eliminate all that overhead.
Thanks,
Jason
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..6bc19a43da1391 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -689,6 +689,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
#endif
+static int duplicate_page(pte_t *newpte, struct vm_area_struct *vma,
+ unsigned long address, struct page *page)
+{
+ struct page *new_page;
+
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!new_page)
+ return -ENOMEM;
+ copy_user_highpage(new_page, page, address, vma);
+
+ /* FIXME: surely more than this */
+ *newpte = mk_pte(new_page, vma->vm_page_prot);
+ return 0;
+}
+
/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
@@ -703,6 +718,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ bool do_src_wp;
/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -775,15 +791,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_set_pte;
}
- /*
- * If it's a COW mapping, write protect it both
- * in the parent and the child
- */
- if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
- pte = pte_wrprotect(pte);
- }
-
/*
* If it's a shared mapping, mark it clean in
* the child
@@ -800,11 +807,34 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (!(vm_flags & VM_UFFD_WP))
pte = pte_clear_uffd_wp(pte);
+ do_src_wp = is_cow_mapping(vm_flags) && pte_write(pte);
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
page_dup_rmap(page, false);
rss[mm_counter(page)]++;
+
+ /*
+ * If a page is DMA pinned we never want to write protect it,
+ * copy it now.
+ */
+ if (do_src_wp && page_maybe_dma_pinned(page)) {
+ do_src_wp = false;
+ ret = duplicate_page(&pte, vma, addr, page);
+ if (ret)
+ /* FIXME: need to restructure a bit to handle this */
+ return ret;
+ }
+ }
+
+ /*
+ * If it's a COW mapping, write protect it both
+ * in the parent and the child
+ * FIXME check carefully this is new order is OK
+ */
+ if (do_src_wp) {
+ ptep_set_wrprotect(src_mm, addr, src_pte);
+ pte = pte_wrprotect(pte);
}
out_set_pte:
next prev parent reply other threads:[~2020-09-17 19:38 UTC|newest]
Thread overview: 98+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-21 23:49 [PATCH 0/4] mm: Simplfy cow handling Peter Xu
2020-08-21 23:49 ` [PATCH 1/4] mm: Trial do_wp_page() simplification Peter Xu
2020-08-24 8:36 ` Kirill Tkhai
2020-08-24 14:30 ` Jan Kara
2020-08-24 15:37 ` Kirill Tkhai
2020-08-24 18:22 ` Linus Torvalds
2020-09-01 7:01 ` Hugh Dickins
2020-09-14 14:38 ` Jason Gunthorpe
2020-09-14 17:32 ` Linus Torvalds
2020-09-14 18:34 ` Peter Xu
2020-09-14 21:15 ` Peter Xu
2020-09-14 22:55 ` Jason Gunthorpe
2020-09-14 22:59 ` Linus Torvalds
2020-09-14 23:28 ` Jason Gunthorpe
2020-09-15 0:19 ` Linus Torvalds
2020-09-15 14:50 ` Peter Xu
2020-09-15 15:17 ` Peter Xu
2020-09-15 16:05 ` Jason Gunthorpe
2020-09-15 18:29 ` Jason Gunthorpe
2020-09-15 19:13 ` Peter Xu
2020-09-15 19:38 ` Jason Gunthorpe
2020-09-15 21:33 ` Peter Xu
2020-09-15 23:22 ` Jason Gunthorpe
2020-09-16 1:50 ` John Hubbard
2020-09-16 17:48 ` Jason Gunthorpe
2020-09-16 18:46 ` Peter Xu
2020-09-17 11:25 ` Jason Gunthorpe
2020-09-17 18:11 ` Linus Torvalds
2020-09-17 19:38 ` Jason Gunthorpe [this message]
2020-09-17 19:51 ` Linus Torvalds
2020-09-18 16:40 ` Peter Xu
2020-09-18 17:16 ` Linus Torvalds
2020-09-18 19:57 ` Peter Xu
2020-09-18 17:32 ` Jason Gunthorpe
2020-09-18 20:40 ` Peter Xu
2020-09-18 20:59 ` Linus Torvalds
2020-09-19 0:28 ` Jason Gunthorpe
2020-09-18 21:06 ` John Hubbard
2020-09-19 0:01 ` Jason Gunthorpe
2020-09-21 8:35 ` Jan Kara
2020-09-21 12:03 ` Jason Gunthorpe
2022-02-16 16:59 ` Oded Gabbay
2022-02-16 17:24 ` Oded Gabbay
2022-02-16 19:04 ` Linus Torvalds
2022-02-16 19:20 ` Oded Gabbay
2022-02-16 19:24 ` David Hildenbrand
2020-09-21 13:42 ` Michal Hocko
2020-09-21 14:18 ` Peter Xu
2020-09-21 14:28 ` Michal Hocko
2020-09-21 14:38 ` Tejun Heo
2020-09-21 14:43 ` Christian Brauner
2020-09-21 14:55 ` Michal Hocko
2020-09-21 15:04 ` Christian Brauner
2020-09-21 16:06 ` Michal Hocko
2020-09-23 7:53 ` Michal Hocko
2020-09-21 14:41 ` Christian Brauner
2020-09-21 14:57 ` Michal Hocko
2020-09-21 16:31 ` Peter Xu
2020-09-17 18:14 ` Peter Xu
2020-09-17 18:26 ` Linus Torvalds
2020-09-17 19:03 ` Peter Xu
2020-09-17 19:42 ` Linus Torvalds
2020-09-17 19:55 ` John Hubbard
2020-09-17 20:06 ` Jason Gunthorpe
2020-09-17 20:19 ` John Hubbard
2020-09-17 20:25 ` Jason Gunthorpe
2020-09-17 20:35 ` Linus Torvalds
2020-09-17 21:40 ` Peter Xu
2020-09-17 22:09 ` Jason Gunthorpe
2020-09-17 22:25 ` Linus Torvalds
2020-09-17 22:48 ` Ira Weiny
2020-09-18 9:36 ` Jan Kara
2020-09-18 9:44 ` Jan Kara
2020-09-18 16:19 ` Jason Gunthorpe
2020-09-15 10:23 ` Leon Romanovsky
2020-09-15 15:56 ` Jason Gunthorpe
2020-09-15 15:03 ` Oleg Nesterov
2020-09-15 16:18 ` Peter Xu
2020-08-21 23:49 ` [PATCH 2/4] mm/ksm: Remove reuse_ksm_page() Peter Xu
2020-08-21 23:49 ` [PATCH 3/4] mm/gup: Remove enfornced COW mechanism Peter Xu
2020-09-14 14:27 ` Oleg Nesterov
2020-09-14 17:59 ` Peter Xu
2020-09-14 19:03 ` Linus Torvalds
2020-08-21 23:49 ` [PATCH 4/4] mm: Add PGREUSE counter Peter Xu
2020-08-22 16:14 ` Linus Torvalds
2020-08-24 0:24 ` Peter Xu
2020-08-22 16:05 ` [PATCH 0/4] mm: Simplfy cow handling Linus Torvalds
2020-08-23 23:58 ` Peter Xu
2020-08-24 8:38 ` Kirill Tkhai
2020-08-27 14:15 ` Peter Xu
-- strict thread matches above, loose matches on Subject: below --
2021-02-02 14:40 [PATCH 1/4] mm: Trial do_wp_page() simplification Gal Pressman
2021-02-02 16:31 ` Peter Xu
2021-02-02 16:44 ` Jason Gunthorpe
2021-02-02 17:05 ` Peter Xu
2021-02-02 17:13 ` Jason Gunthorpe
2021-02-03 12:43 ` Gal Pressman
2021-02-03 14:00 ` Jason Gunthorpe
2021-02-03 14:47 ` Gal Pressman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200917193824.GL8409@ziepe.ca \
--to=jgg@ziepe.ca \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=gokhale2@llnl.gov \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=jannh@google.com \
--cc=jhubbard@nvidia.com \
--cc=kirill@shutemov.name \
--cc=ktkhai@virtuozzo.com \
--cc=leonro@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mcfadden8@llnl.gov \
--cc=oleg@redhat.com \
--cc=peterx@redhat.com \
--cc=torvalds@linux-foundation.org \
--cc=yang.shi@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.