From: Chih-En Lin <shiyn.lin@gmail.com>
To: Nadav Amit <namit@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Christophe Leroy <christophe.leroy@csgroup.eu>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Luis Chamberlain <mcgrof@kernel.org>,
Kees Cook <keescook@chromium.org>,
Iurii Zaikin <yzaikin@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
William Kucharski <william.kucharski@oracle.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Peter Xu <peterx@redhat.com>,
Suren Baghdasaryan <surenb@google.com>,
Arnd Bergmann <arnd@arndb.de>,
Tong Tiangen <tongtiangen@huawei.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Li kunyu <kunyu@nfschina.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Minchan Kim <minchan@kernel.org>, Yang Shi <shy828301@gmail.com>,
Song Liu <song@kernel.org>, Miaohe Lin <linmiaohe@huawei.com>,
Thomas Gleixner <tglx@linutronix.de>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Andy Lutomirski <luto@kernel.org>,
Fenghua Yu <fenghua.yu@intel.com>,
Dinglan Peng <peng301@purdue.edu>,
Pedro Fonseca <pfonseca@purdue.edu>,
Jim Huang <jserv@ccns.ncku.edu.tw>,
Huichun Feng <foxhoundsk.tw@gmail.com>
Subject: Re: [RFC PATCH v2 7/9] mm: Add the break COW PTE handler
Date: Wed, 28 Sep 2022 03:23:13 +0800
Message-ID: <YzNNoWfTDHNWDLa0@strix-laptop>
In-Reply-To: <8F98262B-206B-434C-88B9-9F3A6919782D@vmware.com>
On Tue, Sep 27, 2022 at 06:15:34PM +0000, Nadav Amit wrote:
> On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:
>
> > To handle the COW PTE with write fault, introduce the helper function
> > handle_cow_pte(). The function provides two behaviors. One is breaking
> > COW by decreasing the refcount, pgtables_bytes, and RSS. Another is
> > copying all the information in the shared PTE table by using
> > copy_pte_page() with a wrapper.
> >
> > Also, add the wrapper functions to help us find out the COWed or
> > COW-available PTE table.
> >
>
> [ snip ]
>
> > +static inline int copy_cow_pte_range(struct vm_area_struct *vma,
> > +				     pmd_t *dst_pmd, pmd_t *src_pmd,
> > +				     unsigned long start, unsigned long end)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_notifier_range range;
> > +	int ret;
> > +	bool is_cow;
> > +
> > +	is_cow = is_cow_mapping(vma->vm_flags);
> > +	if (is_cow) {
> > +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > +					0, vma, mm, start, end);
> > +		mmu_notifier_invalidate_range_start(&range);
> > +		mmap_assert_write_locked(mm);
> > +		raw_write_seqcount_begin(&mm->write_protect_seq);
> > +	}
> > +
> > +	ret = copy_pte_range(vma, vma, dst_pmd, src_pmd, start, end);
> > +
> > +	if (is_cow) {
> > +		raw_write_seqcount_end(&mm->write_protect_seq);
> > +		mmu_notifier_invalidate_range_end(&range);
>
> Usually, I would expect mmu-notifiers and TLB flushes to be initiated at the
> same point in the code. Presumably you changed protection, so you do need a
> TLB flush, right? Is it done elsewhere?

You're right. I will add TLB flushes here. Thanks.
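
To make the intended ordering concrete, here is a minimal userspace sketch. It is only an illustration under assumptions: every `mock_*` name below is a hypothetical stand-in that records call order, none of them is a kernel API, and the real change would use the kernel's own range flush at the marked point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: each step is only recorded, nothing is flushed. */
enum mock_step {
	MOCK_NOTIFY_START,	/* stands in for invalidate_range_start */
	MOCK_COPY_WRPROTECT,	/* stands in for the write-protecting copy */
	MOCK_TLB_FLUSH,		/* stands in for a range TLB flush */
	MOCK_NOTIFY_END,	/* stands in for invalidate_range_end */
};

#define MOCK_TRACE_MAX 8
static enum mock_step mock_trace[MOCK_TRACE_MAX];
static size_t mock_trace_len;

static void mock_record(enum mock_step step)
{
	assert(mock_trace_len < MOCK_TRACE_MAX);
	mock_trace[mock_trace_len++] = step;
}

/*
 * Mirrors the shape of copy_cow_pte_range(): for a COW mapping the
 * flush sits after the protection change and before the notifier end.
 */
static int mock_copy_cow_pte_range(int is_cow)
{
	int ret;

	if (is_cow)
		mock_record(MOCK_NOTIFY_START);

	mock_record(MOCK_COPY_WRPROTECT);
	ret = 0;

	if (is_cow) {
		mock_record(MOCK_TLB_FLUSH);
		mock_record(MOCK_NOTIFY_END);
	}
	return ret;
}
```

Calling `mock_copy_cow_pte_range(1)` records the four steps in that order, while a non-COW call records only the copy.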
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Break COW PTE, two state here:
> > + *   - After fork :   [parent, rss=1, ref=2, write=NO , owner=parent]
> > + *                 to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> > + *     COW PTE become [ref=1, write=NO , owner=NULL  ]
> > + *                    [child , rss=0, ref=2, write=NO , owner=parent]
> > + *                 to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> > + *     COW PTE become [ref=1, write=NO , owner=parent]
> > + *   NOTE
> > + *     - Copy the COW PTE to new PTE.
> > + *     - Clear the owner of COW PTE and set PMD entry writable when it is owner.
> > + *     - Increase RSS if it is not owner.
> > + */
> > +static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +			 unsigned long addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	unsigned long pte_start, pte_end;
> > +	unsigned long start, end;
> > +	struct vm_area_struct *prev = vma->vm_prev;
> > +	struct vm_area_struct *next = vma->vm_next;
> > +	pmd_t cowed_entry = *pmd;
> > +
> > +	if (cow_pte_count(&cowed_entry) == 1) {
> > +		cow_pte_fallback(vma, pmd, addr);
> > +		return 1;
> > +	}
> > +
> > +	pte_start = start = addr & PMD_MASK;
> > +	pte_end = end = (addr + PMD_SIZE) & PMD_MASK;
> > +
> > +	pmd_clear(pmd);
> > +	/*
> > +	 * If the vma does not cover the entire address range of the PTE table,
> > +	 * it should check the previous and next.
> > +	 */
> > +	if (start < vma->vm_start && prev) {
> > +		/* The part of address range is covered by previous. */
> > +		if (start < prev->vm_end)
> > +			copy_cow_pte_range(prev, pmd, &cowed_entry,
> > +					   start, prev->vm_end);
> > +		start = vma->vm_start;
> > +	}
> > +	if (end > vma->vm_end && next) {
> > +		/* The part of address range is covered by next. */
> > +		if (end > next->vm_start)
> > +			copy_cow_pte_range(next, pmd, &cowed_entry,
> > +					   next->vm_start, end);
> > +		end = vma->vm_end;
> > +	}
> > +	if (copy_cow_pte_range(vma, pmd, &cowed_entry, start, end))
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * Here, it is the owner, so clear the ownership. To keep RSS state and
> > +	 * page table bytes correct, it needs to decrease them.
> > +	 * Also, handle the address range issue here.
> > +	 */
> > +	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
> > +		set_cow_pte_owner(&cowed_entry, NULL);
>
> Presumably there is some assumption on atomicity here. Otherwise, two
> threads can run the following code, which is wrong, no? Yet, I do not see
> anything that provides such atomicity.

Multiple processes may reach this code at the same time. For threads,
though, I assume that they need to hold the mmap_lock. Maybe I need to
add the assert here too.
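
As an illustration of the atomicity concern, here is a minimal userspace sketch using C11 atomics. It is not kernel code and not what the patch does: `struct mock_pte_table`, `mock_clear_owner_if_same()` and `mock_break_cow()` are hypothetical names, and the single `rss` counter stands in for the accounting that the owner path adjusts. The point is only that the check and the clear happen as one compare-and-exchange, so at most one caller performs the decrement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a shared PTE table's bookkeeping. */
struct mock_pte_table {
	_Atomic(void *) owner;	/* PMD entry that owns the table, or NULL */
	atomic_int rss;		/* stand-in for the owner's accounting */
};

/*
 * Clear the owner only if it is still @pmd. The comparison and the
 * store are a single atomic step, so two racing callers cannot both
 * observe "I am the owner".
 */
static bool mock_clear_owner_if_same(struct mock_pte_table *table, void *pmd)
{
	void *expected = pmd;

	assert(table != NULL);
	return atomic_compare_exchange_strong(&table->owner, &expected,
					      (void *)NULL);
}

/* Only the caller that wins the exchange adjusts the accounting. */
static bool mock_break_cow(struct mock_pte_table *table, void *pmd)
{
	if (!mock_clear_owner_if_same(table, pmd))
		return false;

	atomic_fetch_sub(&table->rss, 1);
	return true;
}
```

With a separate "is same owner" test followed by a plain store, two callers could both pass the test before either store lands and the decrement would run twice. Holding a lock across both steps, as assumed above for threads, avoids the same problem by other means.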
>
> > +		if (pte_start < vma->vm_start && prev &&
> > +		    pte_start < prev->vm_end)
> > +			cow_pte_rss(mm, vma->vm_prev, pmd,
> > +				    pte_start, prev->vm_end, false /* dec */);
> > +		if (pte_end > vma->vm_end && next &&
> > +		    pte_end > next->vm_start)
> > +			cow_pte_rss(mm, vma->vm_next, pmd,
> > +				    next->vm_start, pte_end, false /* dec */);
> > +		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
> > +		mm_dec_nr_ptes(mm);
> > +	}
> > +
> > +	/* Already handled it, don't reuse cowed table. */
> > +	pmd_put_pte(vma, &cowed_entry, addr, false);
> > +
> > +	VM_BUG_ON(cow_pte_count(pmd) != 1);
>
> Don’t use VM_BUG_ON().

Sure. I will change it to VM_WARN_ON().

Thanks,
Chih-En Lin