From: Mike Kravetz <mike.kravetz@oracle.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>,
akpm@linux-foundation.org, songmuchun@bytedance.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page
Date: Thu, 25 Aug 2022 11:30:19 -0700
Message-ID: <Ywe/u6c0yPMuEcR7@monkey>
In-Reply-To: <f3ee3581-5d0b-f564-7016-783a0d91fea2@linux.alibaba.com>

On 08/25/22 09:43, Baolin Wang wrote:
>
>
> On 8/25/2022 7:34 AM, Mike Kravetz wrote:
> > On 08/24/22 17:41, Baolin Wang wrote:
> > >
> > >
> > > On 8/24/2022 3:31 PM, David Hildenbrand wrote:
> > > > > > > >
> > > > > > > > IMHO, these follow_huge_xxx() functions were arch-specific at first and
> > > > > > > > were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> > > > > > > > hugetlb: Copy general hugetlb code from x86 to mm"), and there are
> > > > > > > > still some arch-specific follow_huge_xxx() definitions, for example:
> > > > > > > > ia64: follow_huge_addr
> > > > > > > > powerpc: follow_huge_pd
> > > > > > > > s390: follow_huge_pud
> > > > > > > >
> > > > > > > > What I mean is that follow_hugetlb_page() is a common,
> > > > > > > > non-arch-specific function; is it suitable to make it
> > > > > > > > arch-specific?
> > > > > > > > Thinking about it more, can we rename follow_hugetlb_page() to
> > > > > > > > hugetlb_page_faultin() and simplify it to only handle hugetlb page
> > > > > > > > faults, like faultin_page() does for normal pages? That way we can make
> > > > > > > > sure only follow_page_mask() handles hugetlb.
> > > > > > > >
> > > > > >
> > > > > > Something like that might work, but you still have two page table walkers
> > > > > > for hugetlb. I like David's idea (if I understand it correctly) of
> > > > >
> > > > > > What I mean is we may change the hugetlb handling to work like normal pages:
> > > > > > 1) use follow_page_mask() to look up the hugetlb page first.
> > > > > > 2) if the hugetlb page cannot be found, then try to fault it in with
> > > > > > hugetlb_page_faultin().
> > > > > > 3) if the page fault succeeded, then retry the lookup with
> > > > > > follow_page_mask().
> > > >
> > > > That implies putting more hugetlbfs special code into generic GUP,
> > > > > making it even more complicated. But of course, it depends on what the
> > > > > end result looks like. My gut feeling was that hugetlb is better handled
> > > > in follow_hugetlb_page() separately (just like we do with a lot of other
> > > > page table walkers).
> > >
> > > OK, fair enough.
> > >
> > > > >
> > > > > Just a rough thought, and I need more investigation for my idea and
> > > > > David's idea.
> > > > >
> > > > > > using follow_hugetlb_page for both cases. As noted, it will need to be
> > > > > > taught how to not trigger faults in the follow_page_mask case.
> > > > >
> > > > > > Anyway, I also agree we need some cleanup, and first I think we should
> > > > > > clean up the arch-specific follow_huge_xxx() on architectures where
> > > > > > they are similar to the common ones. I will look into these.
> > > >
> > > > There was a recent discussion on that, e.g.:
> > > >
> > > > https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
> > >
> > > Thanks.
> > >
> > > >
> > > > >
> > > > > > However, since the cleanup may need more investigation and
> > > > > > refactoring, I would prefer to get the bug-fix patches of this patchset
> > > > > > into mainline first; they are suitable for backporting to older versions to
> > > > > > fix potential race issues. Mike and David, what do you think? Could you
> > > > > > help to review these patches? Thanks.
> > > >
> > > > > Patch #1 certainly adds more special code just to handle another hugetlb
> > > > corner case (CONT pages), and maybe just making it all use
> > > > follow_hugetlb_page() would be even cleaner and less error prone.
> > > >
> > > > I agree that locking is shaky, but I'm not sure if we really want to
> > > > backport this to stable trees:
> > > >
> > > > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > >
> > > > "It must fix a real bug that bothers people (not a, “This could be a
> > > > problem...” type thing)."
> > > >
> > > >
> > > > Do we actually have any instance of this being a real (and not a
> > > > theoretical) problem? If not, I'd rather clean it all up right away.
> > >
> > > > I think this is a real problem (not theoretical), and it is easy to write some
> > > > code to show the issue. For example, suppose thread A is trying to look up a
> > > > CONT-PTE size hugetlb page under the lock, but another thread B can
> > > > migrate the CONT-PTE hugetlb page at the same time, which will cause thread
> > > > A to get an incorrect page. If thread A then tries to do something with this
> > > > incorrect page, an error occurs.
> >
> > Is the primary concern the locking? If so, I am not sure we have an issue.
>
> Yes.
>
> > As mentioned in your commit message, current code will use
> > pte_offset_map_lock(). pte_offset_map_lock uses pte_lockptr, and pte_lockptr
> > will either be the mm wide lock or pmd_page lock. To me, it seems that
>
> ALLOC_SPLIT_PTLOCKS can always be true on my machine, which means
> pte_lockptr() will always use the PTE page lock, whereas huge_pte_lock()
> will use the mm wide lock.
Yes, the different calling context/path to the locking code will cause a
different lock to be used. I thought of that AFTER sending the above.
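
To spell that out, here is a rough userspace model (not kernel code) of
how the two paths end up picking a lock. The model_* types and function
names are made up for illustration; only the selection logic is meant to
mirror pte_lockptr() and huge_pte_lockptr() as I read them, assuming
split PTE locks are enabled, 4K base pages and a CONT-PTE size hstate.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the kernel's types. */
struct model_lock { int unused; };

struct model_mm {
	struct model_lock page_table_lock;	/* mm wide lock */
};

struct model_pte_page {
	struct model_lock ptl;			/* per PTE page (split) lock */
};

/* Assumption for the example: 4K base pages, so a PMD maps 2M. */
#define MODEL_PMD_SIZE	(2UL << 20)

/* Regular PTE path: with split PTE locks, the PTE page's own lock. */
static struct model_lock *model_pte_lockptr(struct model_mm *mm,
					    struct model_pte_page *ptpage,
					    bool split_ptlocks)
{
	return split_ptlocks ? &ptpage->ptl : &mm->page_table_lock;
}

/* hugetlb path: any size other than PMD size gets the mm wide lock. */
static struct model_lock *model_huge_pte_lockptr(struct model_mm *mm,
						 struct model_lock *pmd_lock,
						 unsigned long hpage_size)
{
	if (hpage_size == MODEL_PMD_SIZE)
		return pmd_lock;
	return &mm->page_table_lock;
}
```

With split_ptlocks true and a 64K huge page size, the first helper
returns the PTE page lock and the second returns the mm wide lock, so
the two callers do not serialize against each other.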
>
> > either would provide correct synchronization for CONT-PTE entries. Am I
> > missing something or misreading the code?
> >
> > I started looking at code cleanup suggested by David. Here is a quick
> > patch (not tested and likely containing errors) to see if this is a step
> > in the right direction.
> >
> > I like it because we get rid of/combine all those follow_huge_p*d
> > routines.
>
> Great, this looks straightforward to me (some nits below).
> David, what do you think?
>
I will continue to refine this based on suggestions from you and David.
> > +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > +				unsigned long address, unsigned int flags)
> > +{
> > +	struct hstate *h = hstate_vma(vma);
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	unsigned long haddr = address & huge_page_mask(h);
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> > +	pte_t *pte, entry;
> > +
> > +	/*
> > +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > +	 * follow_hugetlb_page().
> > +	 */
> > +	if (WARN_ON_ONCE(flags & FOLL_PIN))
> > +		return NULL;
> > +
> > +	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> > +	if (!pte)
> > +		return NULL;
> > +
> > +retry:
> > +	ptl = huge_pte_lock(h, mm, pte);
> > +	entry = huge_ptep_get(pte);
> > +	if (pte_present(entry)) {
> > +		page = pte_page(entry);
>
> Should this follow the previous logic?
> page = pte_page(entry) + ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
>
Yes, this needs to be PAGE aligned, not HUGETLB_PAGE aligned.
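
As a quick sanity check of that arithmetic, here is a userspace sketch.
It assumes 4K base pages and a 64K CONT-PTE huge page; the EX_* constants
and the helper name are made up for the example, and the result is just
the index that would be added to pte_page(entry).

```c
/* Assumptions for the example: 4K base pages, 64K huge page. */
#define EX_PAGE_SHIFT	12
#define EX_HPAGE_SIZE	(64UL << 10)
#define EX_HPAGE_MASK	(~(EX_HPAGE_SIZE - 1))

/*
 * Index of the base page, within its huge page, that contains address.
 * Same expression as (address & ~huge_page_mask(h)) >> PAGE_SHIFT.
 */
static unsigned long ex_subpage_index(unsigned long address)
{
	return (address & ~EX_HPAGE_MASK) >> EX_PAGE_SHIFT;
}
```

An address 0x3000 bytes into the huge page gives index 3, while an
address on the huge page boundary gives index 0, which is the only case
where a bare pte_page(entry) would be the right page.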
--
Mike Kravetz
> > +		/*
> > +		 * try_grab_page() should always succeed here, because we hold
> > +		 * the ptl lock and have verified pte_present().
> > +		 */
> > +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> > +			page = NULL;
> > +			goto out;
> > +		}
> > +	} else {