Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, songmuchun@bytedance.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page
Date: Thu, 25 Aug 2022 11:30:19 -0700
Message-ID: <Ywe/u6c0yPMuEcR7@monkey>
In-Reply-To: <f3ee3581-5d0b-f564-7016-783a0d91fea2@linux.alibaba.com>

On 08/25/22 09:43, Baolin Wang wrote:
> 
> 
> On 8/25/2022 7:34 AM, Mike Kravetz wrote:
> > On 08/24/22 17:41, Baolin Wang wrote:
> > > 
> > > 
> > > On 8/24/2022 3:31 PM, David Hildenbrand wrote:
> > > > > > > > 
> > > > > > > > IMHO, these follow_huge_xxx() functions are arch-specified at first and
> > > > > > > > were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> > > > > > > > hugetlb: Copy general hugetlb code from x86 to mm"), and now there are
> > > > > > > > still some arch-specified follow_huge_xxx() definition, for example:
> > > > > > > > ia64: follow_huge_addr
> > > > > > > > powerpc: follow_huge_pd
> > > > > > > > s390: follow_huge_pud
> > > > > > > > 
> > > > > > > > What I mean is that follow_hugetlb_page() is a common and
> > > > > > > > not-arch-specified function, is it suitable to change it to be
> > > > > > > > arch-specified?
> > > > > > > > And thinking more, can we rename follow_hugetlb_page() as
> > > > > > > > hugetlb_page_faultin() and simplify it to only handle the page faults of
> > > > > > > > hugetlb like the faultin_page() for normal page? That means we can make
> > > > > > > > sure only follow_page_mask() can handle hugetlb.
> > > > > > > > 
> > > > > > 
> > > > > > Something like that might work, but you still have two page table walkers
> > > > > > for hugetlb.  I like David's idea (if I understand it correctly) of
> > > > > 
> > > > > What I mean is we may change the hugetlb handling like normal page:
> > > > > 1) use follow_page_mask() to look up a hugetlb firstly.
> > > > > > 2) if the hugetlb page cannot be found, then try to fault it in with
> > > > > > hugetlb_page_faultin().
> > > > > > 3) if the page fault succeeded, then retry the lookup with
> > > > > > follow_page_mask().
> > > > 
> > > > That implies putting more hugetlbfs special code into generic GUP,
> > > > turning it even more complicated. But of course, it depends on how the
> > > > end result looks like. My gut feeling was that hugetlb is better handled
> > > > in follow_hugetlb_page() separately (just like we do with a lot of other
> > > > page table walkers).
> > > 
> > > OK, fair enough.
> > > 
> > > > > 
> > > > > Just a rough thought, and I need more investigation for my idea and
> > > > > David's idea.
> > > > > 
> > > > > > using follow_hugetlb_page for both cases.  As noted, it will need to be
> > > > > > taught how to not trigger faults in the follow_page_mask case.
> > > > > 
> > > > > Anyway, I also agree we need some cleanup, and firstly I think we should
> > > > > cleanup these arch-specified follow_huge_xxx() on some architectures
> > > > > which are similar with the common ones. I will look into these.
> > > > 
> > > > There was a recent discussion on that, e.g.:
> > > > 
> > > > https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
> > > 
> > > Thanks.
> > > 
> > > > 
> > > > > 
> > > > > However, considering cleanup may need more investigation and
> > > > > refactoring, now I prefer to make these bug-fix patches of this patchset
> > > > > into mainline firstly, which are suitable to backport to old version to
> > > > > fix potential race issues. Mike and David, how do you think? Could you
> > > > > help to review these patches? Thanks.
> > > > 
> > > > Patch #1 certainly add more special code just to handle another hugetlb
> > > > corner case (CONT pages), and maybe just making it all use
> > > > follow_hugetlb_page() would be even cleaner and less error prone.
> > > > 
> > > > I agree that locking is shaky, but I'm not sure if we really want to
> > > > backport this to stable trees:
> > > > 
> > > > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > > 
> > > > "It must fix a real bug that bothers people (not a, “This could be a
> > > > problem...” type thing)."
> > > > 
> > > > 
> > > > Do we actually have any instance of this being a real (and not a
> > > > theoretical) problem? If not, I'd rather clean it all up right away.
> > > 
> > > I think this is a real problem (not theoretical), and it is easy to write some
> > > code to show the issue. For example, suppose thread A is trying to look up a
> > > CONT-PTE size hugetlb page under the lock, while another thread B
> > > migrates the CONT-PTE hugetlb page at the same time. Thread A can then
> > > get an incorrect page, and an error occurs if thread A does anything with
> > > that page.
> > 
> > Is the primary concern the locking?  If so, I am not sure we have an issue.
> 
> Yes.
> 
> > As mentioned in your commit message, current code will use
> > pte_offset_map_lock().  pte_offset_map_lock uses pte_lockptr, and pte_lockptr
> > will either be the mm wide lock or pmd_page lock.  To me, it seems that
> 
> The ALLOC_SPLIT_PTLOCKS can be always true on my machine, that means the
> pte_lockptr() will always use the PTE page lock, however huge_pte_lock()
> will use the mm wide lock.

Yes, the different calling context/path to the locking code will cause a
different lock to be used.  I thought of that AFTER sending the above.
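
To make the mismatch concrete, here is a simplified userspace model (not
kernel code) of how the two paths pick a lock.  All ex_* names and the
example sizes are made up for illustration.  The model only mirrors the
behaviour described above: with split PTE locks, pte_lockptr() hands back
the lock of the PTE table page, while huge_pte_lock() falls back to the
mm wide lock for any huge page size that is not PMD sized.

```c
#include <assert.h>

/* Assumed example sizes: 4 KiB base page, 64 KiB CONT-PTE, 2 MiB PMD. */
#define EX_PAGE_SIZE      (4UL * 1024UL)
#define EX_CONT_PTE_SIZE  (64UL * 1024UL)
#define EX_PMD_SIZE       (2UL * 1024UL * 1024UL)

/* Hypothetical stand-ins for spinlock_t, mm_struct and a page table page. */
struct ex_lock { int dummy; };
struct ex_mm { struct ex_lock page_table_lock; };
struct ex_table_page { struct ex_lock ptl; };

/* Models pte_lockptr() with split PTE locks: per table page lock. */
static struct ex_lock *ex_pte_lockptr(struct ex_mm *mm,
				      struct ex_table_page *table)
{
	(void)mm;
	return &table->ptl;
}

/*
 * Models the lock choice behind huge_pte_lock(): only PMD sized huge
 * pages get a per table page lock, everything else gets the mm wide lock.
 * (The model collapses the PMD table page and PTE table page into one
 * ex_table_page for brevity.)
 */
static struct ex_lock *ex_huge_pte_lockptr(unsigned long hsize,
					   struct ex_mm *mm,
					   struct ex_table_page *table)
{
	assert(hsize != EX_PAGE_SIZE);	/* base pages are not huge pages */
	if (hsize == EX_PMD_SIZE)
		return &table->ptl;
	return &mm->page_table_lock;
}

/* Returns 1 if both paths would serialize on the same lock, else 0. */
static int ex_same_lock(unsigned long hsize)
{
	struct ex_mm mm = { { 0 } };
	struct ex_table_page table = { { 0 } };

	return ex_pte_lockptr(&mm, &table) ==
	       ex_huge_pte_lockptr(hsize, &mm, &table);
}
```

In this model ex_same_lock(EX_CONT_PTE_SIZE) is 0, so a lookup going
through one path and a migration going through the other would not
exclude each other.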

> 
> > either would provide correct synchronization for CONT-PTE entries.  Am I
> > missing something or misreading the code?
> > 
> > I started looking at code cleanup suggested by David.  Here is a quick
> > patch (not tested and likely containing errors) to see if this is a step
> > in the right direction.
> > 
> > I like it because we get rid of/combine all those follow_huge_p*d
> > routines.
> 
> Great, this looks straight forward to me (some nits as below).
> David, how do you think?
> 

I will continue to refine this based on suggestions from you and David.

> > +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > +				unsigned long address, unsigned int flags)
> > +{
> > +	struct hstate *h = hstate_vma(vma);
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	unsigned long haddr = address & huge_page_mask(h);
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> > +	pte_t *pte, entry;
> > +
> > +	/*
> > +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > +	 * follow_hugetlb_page().
> > +	 */
> > +	if (WARN_ON_ONCE(flags & FOLL_PIN))
> > +		return NULL;
> > +
> > +	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> > +	if (!pte)
> > +		return NULL;
> > +
> > +retry:
> > +	ptl = huge_pte_lock(h, mm, pte);
> > +	entry = huge_ptep_get(pte);
> > +	if (pte_present(entry)) {
> > +		page = pte_page(entry);
> 
> Should follow previous logic?
> page = pte_page(entry) + ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> 

Yes, this needs to be PAGE aligned, not HUGETLB_PAGE aligned.
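
For reference, the offset arithmetic in your suggestion can be checked in
isolation with a small userspace sketch.  The EX_* constants (4 KiB base
pages, 64 KiB CONT-PTE huge page) and the function name are assumptions
picked for illustration, not values taken from the patch.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB base pages, 64 KiB CONT-PTE huge page. */
#define EX_PAGE_SHIFT  12UL
#define EX_HPAGE_SIZE  (64UL * 1024UL)
#define EX_HPAGE_MASK  (~(EX_HPAGE_SIZE - 1UL))

/*
 * Index of the base page that 'address' falls in, counted from the head
 * page of the huge page.  This is the value that gets added to
 * pte_page(entry) in the expression quoted above.
 */
static unsigned long ex_subpage_index(unsigned long address)
{
	unsigned long idx = (address & ~EX_HPAGE_MASK) >> EX_PAGE_SHIFT;

	/* The index can never reach past the end of the huge page. */
	assert(idx < (EX_HPAGE_SIZE >> EX_PAGE_SHIFT));
	return idx;
}
```

With these sizes, an address 0x3000 bytes into the huge page gives index
3, while the huge page aligned address itself gives index 0, which is
what the draft was returning for every address.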
-- 
Mike Kravetz

> > +		/*
> > +		 * try_grab_page() should always succeed here, because we hold
> > +		 * the ptl lock and have verified pte_present().
> > +		 */
> > +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> > +			page = NULL;
> > +			goto out;
> > +		}
> > +	} else {
