All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: "Jiaqi Yan" <jiaqiyan@google.com>,
	"HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>,
	"songmuchun@bytedance.com" <songmuchun@bytedance.com>,
	"shy828301@gmail.com" <shy828301@gmail.com>,
	"linmiaohe@huawei.com" <linmiaohe@huawei.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"duenwen@google.com" <duenwen@google.com>,
	"axelrasmussen@google.com" <axelrasmussen@google.com>,
	"jthoughton@google.com" <jthoughton@google.com>
Subject: Re: [PATCH v1 1/3] mm/hwpoison: find subpage in hugetlb HWPOISON list
Date: Tue, 20 Jun 2023 11:05:33 -0700	[thread overview]
Message-ID: <20230620180533.GA3567@monkey> (raw)
In-Reply-To: <20230619082330.GA1612447@ik1-406-35019.vs.sakura.ne.jp>

On 06/19/23 17:23, Naoya Horiguchi wrote:
> On Sat, Jun 17, 2023 at 03:59:27PM -0700, Mike Kravetz wrote:
> > On 06/16/23 19:18, Jiaqi Yan wrote:
> > > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > > On 06/16/23 14:19, Jiaqi Yan wrote:
> > > > >
> > > > > Now looking again this, I think concurrent adding and deleting are
> > > > > fine with each other and with themselves, because raw_hwp_list is
> > > > > lock-less llist.
> > > >
> > > > Correct.
> > > >
> > > > > As for synchronizing traversal with adding and deleting, I wonder is
> > > > > it a good idea to make __update_and_free_hugetlb_folio hold
> > > > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse +
> > > > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already
> > > > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is
> > > > > missing the lock.
> > > >
> > > > I do not think the lock is needed.  However, while looking more closely
> > > > at this I think I discovered another issue.
> > > > This is VERY subtle.
> > > > Perhaps Naoya can help verify if my reasoning below is correct.
> > > >
> > > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page.
> > > > Why is this?
> > > > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio.
> > > > The purpose of remove_hugetlb_folio is to remove the huge page from the
> > > > list AND compound page destructor indicating this is a hugetlb page is changed.
> > > > This is all done while holding the hugetlb lock.  So, the test for
> > > > folio_test_hugetlb(folio) is false.
> > > >
> > > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list.
> > > >
> > > > Important note: at this time we have not reallocated vmemmap pages if
> > > > hugetlb page was vmemmap optimized.  That is done later in
> > > > __update_and_free_hugetlb_folio.
> > > 
> > > 
> > > >
> > > > The 'good news' is that after this point get_huge_page_for_hwpoison will
> > > > not recognize this as a hugetlb page, so nothing will be added to the
> > > > list.  There is no need to worry about entries being added to the list
> > > > during traversal.
> > > >
> > > > The 'bad news' is that if we get a memory error at this time we will
> > > > treat it as a memory error on a regular compound page.  So,
> > > > TestSetPageHWPoison(p) in memory_failure() may try to write a read only
> > > > struct page. :(
> > > 
> > > At least I think this is an issue.
> > > 
> > > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock
> > > until update_and_free_hugetlb_folio is done, or basically until
> > > dissolve_free_huge_page is done?
> > 
> > Unfortunately, update_and_free_hugetlb_folio is designed to be called
> > without locks held.  This is because we can not hold any locks while
> > allocating vmemmap pages.
> > 
> > I'll try to think of some way to restructure the code.  IIUC, this is a
> > potential general issue, not just isolated to memory error handling.
> 
> Considering this issue as one specific to memory error handling, checking
> HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to
> detect the race.  Then, an idea like the below diff (not tested) can make
> try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait
> for complete the allocation of vmemmap pages.
> 
> @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
>         int ret = 2;    /* fallback to normal page handling */
>         bool count_increased = false;
> 
> -       if (!folio_test_hugetlb(folio))
> +       if (!folio_test_hugetlb(folio)) {
> +               if (folio_test_hugetlb_vmemmap_optimized(folio))
> +                       ret = -EBUSY;

The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in
the folio->private field.

In the case where the folio is a non-hugetlb folio, the folio->private field
could be any arbitrary value.  As such, the test for vmemmap_optimized may
return a false positive.  We could end up retrying for an arbitrarily
long time.

I am looking at how to restructure the code which removes and frees
hugetlb pages so that folio_test_hugetlb() would remain true until
vmemmap pages are allocated.  The easiest way to do this would introduce
another hugetlb lock/unlock cycle in the page freeing path.  This would
undo some of the speedups in the series:
https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab

-- 
Mike Kravetz

>                 goto out;
> +       }
> 
>         if (flags & MF_COUNT_INCREASED) {
>                 ret = 1;
> 
> 
> Thanks,
> Naoya Horiguchi
> 
> > -- 
> > Mike Kravetz
> > 
> > > 
> > > TestSetPageHWPoison in memory_failure is called after
> > > try_memory_failure_hugetlb, and folio_test_hugetlb is tested within
> > > __get_huge_page_for_hwpoison, which is wrapped by the hugetlb_lock. So
> > > by the time dissolve_free_huge_page returns, subpages already go
> > > through hugetlb_vmemmap_restore and __destroy_compound_gigantic_folio
> > > and become non-compound raw pages (folios). Now
> > > folio_test_hugetlb(p)=false will be correct for memory_failure, and it
> > > can recover p as a dissolved non-hugetlb page.


  reply	other threads:[~2023-06-20 18:06 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-17 16:09 [PATCH v1 0/3] Improve hugetlbfs read on HWPOISON hugepages Jiaqi Yan
2023-05-17 16:09 ` [PATCH v1 1/3] mm/hwpoison: find subpage in hugetlb HWPOISON list Jiaqi Yan
2023-05-17 23:53   ` Mike Kravetz
2023-05-19 20:54     ` Jiaqi Yan
2023-05-19 22:42       ` Mike Kravetz
2023-05-22  4:50         ` HORIGUCHI NAOYA(堀口 直也)
2023-05-22 18:22           ` Jiaqi Yan
2023-05-23  2:43             ` HORIGUCHI NAOYA(堀口 直也)
2023-05-26  0:28               ` Jiaqi Yan
2023-06-10  5:48                 ` Jiaqi Yan
2023-06-12  4:19                   ` Naoya Horiguchi
2023-06-16 21:19                     ` Jiaqi Yan
2023-06-16 23:34                       ` Mike Kravetz
2023-06-17  2:18                         ` Jiaqi Yan
2023-06-17 22:59                           ` Mike Kravetz
2023-06-19  8:23                             ` Naoya Horiguchi
2023-06-20 18:05                               ` Mike Kravetz [this message]
2023-06-20 22:39                                 ` Mike Kravetz
2023-06-23  0:45                                   ` Jiaqi Yan
2023-06-23  4:19                                     ` Mike Kravetz
2023-06-23 16:40                                       ` Jiaqi Yan
2023-05-17 16:09 ` [PATCH v1 2/3] hugetlbfs: improve read HWPOISON hugepage Jiaqi Yan
2023-05-18 22:18   ` Mike Kravetz
2023-05-19 20:54     ` Jiaqi Yan
2023-05-17 16:09 ` [PATCH v1 3/3] selftests/mm: add tests for HWPOISON hugetlbfs read Jiaqi Yan
2023-05-23  7:35   ` kernel test robot
2023-05-17 23:30 ` [PATCH v1 0/3] Improve hugetlbfs read on HWPOISON hugepages Mike Kravetz
2023-05-18 16:02   ` Jiaqi Yan
2023-05-18 16:10   ` Jiaqi Yan
2023-05-18 22:24     ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230620180533.GA3567@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=duenwen@google.com \
    --cc=jiaqiyan@google.com \
    --cc=jthoughton@google.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=naoya.horiguchi@linux.dev \
    --cc=naoya.horiguchi@nec.com \
    --cc=shy828301@gmail.com \
    --cc=songmuchun@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.