From: Jiaqi Yan <jiaqiyan@google.com>
To: Yang Shi <shy828301@gmail.com>
Cc: "Tong Tiangen" <tongtiangen@huawei.com>,
"Tony Luck" <tony.luck@intel.com>,
"HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"Miaohe Lin" <linmiaohe@huawei.com>, "Jue Wang" <juew@google.com>,
"Linux MM" <linux-mm@kvack.org>
Subject: Re: [PATCH v3 1/2] mm: khugepaged: recover from poisoned anonymous memory
Date: Fri, 27 May 2022 12:03:23 -0700
Message-ID: <CACw3F534HQuNP-NReyB9Dx72UnunfCDDSOEXnGdibTmA1ZXoBA@mail.gmail.com>
In-Reply-To: <CAHbLzkoTPRJKV9odCCUhiqoer3ckh0rhCiVp20aR05WVX4GOKQ@mail.gmail.com>
On Fri, May 27, 2022 at 8:42 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, May 26, 2022 at 5:06 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
> >
> > On Tue, May 24, 2022 at 11:41 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Mon, May 23, 2022 at 7:54 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
> > > >
> > > > Make __collapse_huge_page_copy return whether
> > > > collapsing/copying anonymous pages succeeded,
> > > > and make collapse_huge_page handle the return status.
> > > >
> > > > Break existing PTE scan loop into two for-loops.
> > > > The first loop copies source pages into target huge page,
> > > > and can fail gracefully when running into memory errors in
> > > > source pages. Roll back the page table and page states
> > > > when copying failed:
> > > > 1) re-establish the PTEs-to-PMD connection.
> > > > 2) release pages back to their LRU list.
> > > >
> > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > > ---
> > > > include/linux/highmem.h | 19 ++++
> > > > include/trace/events/huge_memory.h | 27 +++++-
> > > > mm/khugepaged.c | 146 ++++++++++++++++++++++-------
> > > > 3 files changed, 158 insertions(+), 34 deletions(-)
> > > >
> > > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > > > index 39bb9b47fa9cd..0ccb1e92c4b06 100644
> > > > --- a/include/linux/highmem.h
> > > > +++ b/include/linux/highmem.h
> > > > @@ -298,6 +298,25 @@ static inline void copy_highpage(struct page *to, struct page *from)
> > > >
> > > > #endif
> > > >
> > > > +/*
> > > > + * Machine check exception handled version of copy_highpage.
> > > > + * Return true if copying page content failed; otherwise false.
> > > > + * Note handling #MC requires arch opt-in.
> > > > + */
> > > > +static inline bool copy_highpage_mc(struct page *to, struct page *from)
> > > > +{
> > > > + char *vfrom, *vto;
> > > > + unsigned long ret;
> > > > +
> > > > + vfrom = kmap_local_page(from);
> > > > + vto = kmap_local_page(to);
> > > > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > > > + kunmap_local(vto);
> > > > + kunmap_local(vfrom);
> > > > +
> > > > + return ret > 0;
> > > > +}
> > > > +
> > > > static inline void memcpy_page(struct page *dst_page, size_t dst_off,
> > > > struct page *src_page, size_t src_off,
> > > > size_t len)
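Side note that may help review: copy_mc_to_kernel() returns the number of
bytes it could not copy, so "ret > 0" means the copy stopped on a poisoned
cacheline and the destination contents must not be trusted. A minimal
caller, with dst/src as placeholder names, looks like:

	if (copy_highpage_mc(dst, src)) {
		/* dst is incomplete; roll back and bail out */
	}
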
> > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > index 4fdb14a81108b..7e02c0aa2f728 100644
> > > > --- a/include/trace/events/huge_memory.h
> > > > +++ b/include/trace/events/huge_memory.h
> > > > @@ -34,7 +34,8 @@
> > > > EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \
> > > > EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
> > > > EM( SCAN_TRUNCATED, "truncated") \
> > > > - EMe(SCAN_PAGE_HAS_PRIVATE, "page_has_private") \
> > > > + EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \
> > > > + EMe(SCAN_COPY_MC, "copy_poisoned_page") \
> > > >
> > > > #undef EM
> > > > #undef EMe
> > > > @@ -167,5 +168,29 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
> > > > __entry->ret)
> > > > );
> > > >
> > > > +TRACE_EVENT(mm_collapse_huge_page_copy,
> > >
> > > You may misunderstand my point. I don't mean we need a new tracepoint,
> > > IMHO trace_mm_collapse_huge_page() is good enough. When the copy fails
> > > you do have "result = SCAN_COPY_MC" and the result will be consumed by
> > > trace_mm_collapse_huge_page(). You just need to have the tracepoint
> > > interpret the new result as the above section does.
> >
> > I see. To make sure we are on the same page, I think we only need to add:
> >
> > EMe(SCAN_COPY_MC, "copy_poisoned_page")
>
> Yes
>
> >
> > I don't think we need to add a 'copied' arg to mm_collapse_huge_page(), as
> > "copy_poisoned_page" is the only reason copied=false.
> >
> > My intention in adding a new tracepoint was to show the poisoned PFN, but I
> > take it you don't consider that very important information w.r.t. khugepaged;
> > the important thing is that we skipped the poisoned page(s).
>
> I think memory failure will report the detailed information later.
> Khugepaged just needs to know the reason IMHO.
>
Eventually, yes. For example, if userspace later accesses the poisoned
memory, memory_failure() will be invoked from the MCE handling code path.
A future work item could be to make memory_failure() happen right away in
the khugepaged context, but for now I don't have a concrete scenario
demonstrating that doing so is better than waiting for memory_failure()
to happen later.
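A rough sketch of that possible follow-up, assuming memory_failure_queue()
is a suitable entry point from khugepaged's sleepable context, would be to
queue the poisoned source page right where the copy fails:

	if (copy_highpage_mc(page, src_page)) {
		copy_succeeded = false;
		/* hand the poisoned source page over to memory_failure() */
		memory_failure_queue(page_to_pfn(src_page), 0);
		break;
	}
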
Anyway, this is not intended to be included in this patch series. I will
send out v4 asap.
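To summarize the tracing plan for v4 (a sketch of the direction, not the
final diff): drop the new tracepoint, keep the new scan status, and let
the existing exit path of collapse_huge_page() report it, roughly:

	copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
			vma, address, pte_ptl, &compound_pagelist);
	pte_unmap(pte);
	if (!copied) {
		result = SCAN_COPY_MC;
		goto out_up_write;
	}
	...
	/* the existing tracepoint consumes the new status on exit */
	trace_mm_collapse_huge_page(mm, isolated, result);

plus the one-line EM()/EMe() change in huge_memory.h quoted above.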
> >
> > >
> > > > +
> > > > + TP_PROTO(struct page *start, struct page *poisoned, int status),
> > > > +
> > > > + TP_ARGS(start, poisoned, status),
> > > > +
> > > > + TP_STRUCT__entry(
> > > > + __field(unsigned long, start_pfn)
> > > > + __field(unsigned long, poisoned_pfn)
> > > > + __field(int, status)
> > > > + ),
> > > > +
> > > > + TP_fast_assign(
> > > > + __entry->start_pfn = start ? page_to_pfn(start) : -1;
> > > > + __entry->poisoned_pfn = poisoned ? page_to_pfn(poisoned) : -1;
> > > > + __entry->status = status;
> > > > + ),
> > > > +
> > > > + TP_printk("start_pfn=0x%lx, poisoned_pfn=0x%lx, status=%s",
> > > > + __entry->start_pfn,
> > > > + __entry->poisoned_pfn,
> > > > + __print_symbolic(__entry->status, SCAN_STATUS))
> > > > +);
> > > > +
> > > > #endif /* __HUGE_MEMORY_H */
> > > > #include <trace/define_trace.h>
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 131492fd1148b..1b08e31ba081a 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -52,6 +52,7 @@ enum scan_result {
> > > > SCAN_CGROUP_CHARGE_FAIL,
> > > > SCAN_TRUNCATED,
> > > > SCAN_PAGE_HAS_PRIVATE,
> > > > + SCAN_COPY_MC,
> > > > };
> > > >
> > > > #define CREATE_TRACE_POINTS
> > > > @@ -739,44 +740,104 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > > return 0;
> > > > }
> > > >
> > > > -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > > - struct vm_area_struct *vma,
> > > > - unsigned long address,
> > > > - spinlock_t *ptl,
> > > > - struct list_head *compound_pagelist)
> > > > +/*
> > > > + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> > > > + * pages to a hugepage. Cleanup the normal pages if copying succeeds;
> > > > + * otherwise restore the original page table and release isolated normal pages.
> > > > + * Returns true if copying succeeds, otherwise returns false.
> > > > + *
> > > > + * @pte: starting of the PTEs to copy from
> > > > + * @page: the new hugepage to copy contents to
> > > > + * @pmd: pointer to the new hugepage's PMD
> > > > + * @rollback: the orginal normal pages' PMD
> > >
> > > s/orginal/original
> > >
> > > > + * @address: starting address to copy
> > > > + * @pte_ptl: lock on normal pages' PTEs
> > > > + * @compound_pagelist: list that stores compound pages
> > > > + */
> > > > +static bool __collapse_huge_page_copy(pte_t *pte,
> > > > + struct page *page,
> > > > + pmd_t *pmd,
> > > > + pmd_t rollback,
> > > > + struct vm_area_struct *vma,
> > > > + unsigned long address,
> > > > + spinlock_t *pte_ptl,
> > > > + struct list_head *compound_pagelist)
> > > > {
> > > > struct page *src_page, *tmp;
> > > > pte_t *_pte;
> > > > - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > > - _pte++, page++, address += PAGE_SIZE) {
> > > > - pte_t pteval = *_pte;
> > > > + pte_t pteval;
> > > > + unsigned long _address;
> > > > + spinlock_t *pmd_ptl;
> > > > + bool copy_succeeded = true;
> > > >
> > > > - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > > + /*
> > > > + * Copying pages' contents is subject to memory poison at any iteration.
> > > > + */
> > > > + for (_pte = pte, _address = address;
> > > > + _pte < pte + HPAGE_PMD_NR;
> > > > + _pte++, page++, _address += PAGE_SIZE) {
> > > > + pteval = *_pte;
> > > > +
> > > > + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> > > > clear_user_highpage(page, address);
> > > > - add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > > - if (is_zero_pfn(pte_pfn(pteval))) {
> > > > - /*
> > > > - * ptl mostly unnecessary.
> > > > - */
> > > > - spin_lock(ptl);
> > > > - ptep_clear(vma->vm_mm, address, _pte);
> > > > - spin_unlock(ptl);
> > > > + else {
> > > > + src_page = pte_page(pteval);
> > > > + if (copy_highpage_mc(page, src_page)) {
> > > > + copy_succeeded = false;
> > > > + trace_mm_collapse_huge_page_copy(pte_page(*pte),
> > > > + src_page, SCAN_COPY_MC);
> > >
> > > As aforementioned I don't think we need a new trace point.
> > >
> > > > + break;
> > > > + }
> > > > + }
> > > > + }
> > > > +
> > > > + if (copy_succeeded)
> > > > + trace_mm_collapse_huge_page_copy(pte_page(*pte),
> > > > + NULL, SCAN_SUCCEED);
> > > > + else {
> > > > + /*
> > > > + * Copying failed, re-establish the regular PMD that points to
> > > > + * the regular page table. Restoring PMD needs to be done prior
> > > > + * to releasing pages. Since pages are still isolated and locked
> > > > + * here, acquiring anon_vma_lock_write is unnecessary.
> > > > + */
> > > > + pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> > > > + pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> > > > + spin_unlock(pmd_ptl);
> > > > + }
> > > > +
> > > > + for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > > > + _pte++, _address += PAGE_SIZE) {
> > > > + pteval = *_pte;
> > > > + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > > + if (copy_succeeded) {
> > > > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > > + if (is_zero_pfn(pte_pfn(pteval))) {
> > > > + /*
> > > > + * ptl mostly unnecessary.
> > > > + */
> > > > + spin_lock(pte_ptl);
> > > > + pte_clear(vma->vm_mm, _address, _pte);
> > > > + spin_unlock(pte_ptl);
> > > > + }
> > > > }
> > > > } else {
> > > > src_page = pte_page(pteval);
> > > > - copy_user_highpage(page, src_page, address, vma);
> > > > if (!PageCompound(src_page))
> > > > release_pte_page(src_page);
> > > > - /*
> > > > - * ptl mostly unnecessary, but preempt has to
> > > > - * be disabled to update the per-cpu stats
> > > > - * inside page_remove_rmap().
> > > > - */
> > > > - spin_lock(ptl);
> > > > - ptep_clear(vma->vm_mm, address, _pte);
> > > > - page_remove_rmap(src_page, false);
> > > > - spin_unlock(ptl);
> > > > - free_page_and_swap_cache(src_page);
> > > > +
> > > > + if (copy_succeeded) {
> > > > + /*
> > > > + * ptl mostly unnecessary, but preempt has to
> > > > + * be disabled to update the per-cpu stats
> > > > + * inside page_remove_rmap().
> > > > + */
> > > > + spin_lock(pte_ptl);
> > > > + pte_clear(vma->vm_mm, _address, _pte);
> > > > + page_remove_rmap(src_page, false);
> > > > + spin_unlock(pte_ptl);
> > > > + free_page_and_swap_cache(src_page);
> > > > + }
> > > > }
> > > > }
> > > >
> > > > @@ -784,6 +845,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > > list_del(&src_page->lru);
> > > > release_pte_page(src_page);
> > > > }
> > > > +
> > > > + return copy_succeeded;
> > > > }
> > > >
> > > > static void khugepaged_alloc_sleep(void)
> > > > @@ -1066,6 +1129,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > struct vm_area_struct *vma;
> > > > struct mmu_notifier_range range;
> > > > gfp_t gfp;
> > > > + bool copied = false;
> > > >
> > > > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > >
> > > > @@ -1177,9 +1241,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > */
> > > > anon_vma_unlock_write(vma->anon_vma);
> > > >
> > > > - __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> > > > - &compound_pagelist);
> > > > + copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> > > > + vma, address, pte_ptl, &compound_pagelist);
> > > > pte_unmap(pte);
> > > > + if (!copied) {
> > > > + result = SCAN_COPY_MC;
> > > > + goto out_up_write;
> > > > + }
> > > > /*
> > > > * spin_lock() below is not the equivalent of smp_wmb(), but
> > > > * the smp_wmb() inside __SetPageUptodate() can be reused to
> > > > @@ -1364,9 +1432,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > > > pte_unmap_unlock(pte, ptl);
> > > > if (ret) {
> > > > node = khugepaged_find_target_node();
> > > > - /* collapse_huge_page will return with the mmap_lock released */
> > > > - collapse_huge_page(mm, address, hpage, node,
> > > > - referenced, unmapped);
> > > > + /*
> > > > + * collapse_huge_page will return with the mmap_r+w_lock released.
> > > > + * It is uncertain if *hpage is NULL or not when collapse_huge_page
> > >
> > > Why is it uncertain?
> >
> > By "uncertain" I mean *hpage may not be set to NULL near the end of
> > collapse_huge_page
> > if swapin/isolation/copying failed.
> >
> > >
> > > > + * returns, so keep ret=1 to jump to breakouterloop_mmap_lock
> > >
> > > "ret=1" means scan succeeded so mmap_lock is released. But this
> > > happens in the caller.
> > >
> > > > + * in khugepaged_scan_mm_slot, then *hpage will be freed
> > > > + * if collapse failed.
> > >
> > > It depends. For CONFIG_NUMA it will be freed, since the next scan may need
> > > a page from a different node; but with !CONFIG_NUMA it may not be freed and
> > > may be reused IIUC. We do intend to remove this optimization, but it has
> > > not happened yet.
> >
> > Yes, you are right: with !CONFIG_NUMA, khugepaged_prealloc_page only
> > frees *hpage when page_count(*hpage) > 1.
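For other reviewers following along, the !CONFIG_NUMA variant is roughly
the following (abridged from my reading of the tree, so please
double-check me):

	static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
	{
		/* release only if racing lookups still hold a reference */
		if (*hpage && page_count(*hpage) > 1) {
			put_page(*hpage);
			*hpage = NULL;
		}
		...
	}
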
> >
> > >
> > > TBH I don't think the comment is helpful here; it is actually confusing.
> > > I'd prefer to have it right before calling khugepaged_scan_pmd() if
> > > you would like to keep it.
> >
> > I will remove the comment and revert this part.
> >
> > >
> > > > + */
> > > > + collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
> > > > }
> > > > out:
> > > > trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > > > @@ -2168,6 +2241,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > > > khugepaged_scan_file(mm, file, pgoff, hpage);
> > > > fput(file);
> > > > } else {
> > > > + /*
> > > > + * mmap_read_lock is
> > > > + * 1) still held if scan failed;
> > > > + * 2) released if scan succeeded.
> > > > + * It is not affected by collapsing or copying
> > > > + * operations.
> > > > + */
> > > > ret = khugepaged_scan_pmd(mm, vma,
> > > > khugepaged_scan.address,
> > > > hpage);
> > > > --
> > > > 2.36.1.124.g0e6072fb45-goog
> > > >