Re: [RFC PATCH v4 6/8] hugetlb: add vma based lock for pmd sharing synchronization

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Miaohe Lin <linmiaohe@huawei.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Muchun Song <songmuchun@bytedance.com>,
	Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	David Hildenbrand <david@redhat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH v4 6/8] hugetlb: add vma based lock for pmd sharing synchronization
Date: Fri, 29 Jul 2022 11:00:03 -0700	[thread overview]
Message-ID: <YuQgIwT+bjqX7Kcx@monkey> (raw)
In-Reply-To: <5b8c6b49-e17a-2c0b-4440-ccf3c5493cb2@huawei.com>

On 07/29/22 10:55, Miaohe Lin wrote:
> On 2022/7/7 4:23, Mike Kravetz wrote:
> > Allocate a rw semaphore and hang off vm_private_data for
> > synchronization use by vmas that could be involved in pmd sharing.  Only
> > add infrastructure for the new lock here.  Actual use will be added in
> > subsequent patch.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  include/linux/hugetlb.h |  36 +++++++++-
> >  kernel/fork.c           |   6 +-
> >  mm/hugetlb.c            | 150 ++++++++++++++++++++++++++++++++++++----
> >  mm/rmap.c               |   8 ++-
> >  4 files changed, 178 insertions(+), 22 deletions(-)
> > 
> 
> <snip>
> 
> >  
> >  /* Forward declaration */
> >  static int hugetlb_acct_memory(struct hstate *h, long delta);
> > +static bool vma_pmd_shareable(struct vm_area_struct *vma);
> >  
> >  static inline bool subpool_is_free(struct hugepage_subpool *spool)
> >  {
> > @@ -904,6 +905,89 @@ resv_map_set_hugetlb_cgroup_uncharge_info(struct resv_map *resv_map,
> >  #endif
> >  }
> >  
> > +static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
> > +{
> > +	return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
> 
> Should me make __vma_aligned_range_pmd_shareable check (VM_MAYSHARE | VM_SHARED) like above
> instead of VM_MAYSHARE to make code more consistent?
> 

I 'think' we want them to be different.  Note this subtle code and
explanation in __unmap_hugepage_range_final().

	/*
	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
	 * test will fail on a vma being torn down, and not grab a page table
	 * on its way out.  We're lucky that the flag has such an appropriate
	 * name, and can in fact be safely cleared here. We could clear it
	 * before the __unmap_hugepage_range above, but all that's necessary
	 * is to clear it before releasing the i_mmap_rwsem. This works
	 * because in the context this is called, the VMA is about to be
	 * destroyed and the i_mmap_rwsem is held.
	 */
	vma->vm_flags &= ~VM_MAYSHARE;

So, when making a decision to share or not we need to only check VM_MAYSHARE.
When making decisions about about the vma_lock, we need to check both.  In most
cases, just VM_MAYSHARE would be sufficient but we need to handle this case
where VM_SHARED and !VM_MAYSHARE.  Mostly in the unmap/cleanup cases.

> > +		vma->vm_private_data;
> > +}
> > +
> > +void hugetlb_vma_lock_read(struct vm_area_struct *vma)
> > +{
> > +	if (__vma_shareable_flags_pmd(vma))
> > +		down_read((struct rw_semaphore *)vma->vm_private_data);
> > +}
> > +
> > +void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
> > +{
> > +	if (__vma_shareable_flags_pmd(vma))
> > +		up_read((struct rw_semaphore *)vma->vm_private_data);
> > +}
> > +
> > +void hugetlb_vma_lock_write(struct vm_area_struct *vma)
> > +{
> > +	if (__vma_shareable_flags_pmd(vma))
> > +		down_write((struct rw_semaphore *)vma->vm_private_data);
> > +}
> > +
> > +void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
> > +{
> > +	if (__vma_shareable_flags_pmd(vma))
> > +		up_write((struct rw_semaphore *)vma->vm_private_data);
> > +}
> > +
> > +int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
> > +{
> > +	if (!__vma_shareable_flags_pmd(vma))
> > +		return 1;
> > +
> > +	return down_write_trylock((struct rw_semaphore *)vma->vm_private_data);
> > +}
> > +
> > +void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
> > +{
> > +	if (__vma_shareable_flags_pmd(vma))
> > +		lockdep_assert_held((struct rw_semaphore *)
> > +				vma->vm_private_data);
> > +}
> > +
> > +static void hugetlb_free_vma_lock(struct vm_area_struct *vma)
> > +{
> > +	/* Only present in sharable vmas */
> > +	if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
> > +		return;
> > +
> > +	if (vma->vm_private_data) {
> > +		kfree(vma->vm_private_data);
> > +		vma->vm_private_data = NULL;
> > +	}
> > +}
> > +
> > +static void hugetlb_alloc_vma_lock(struct vm_area_struct *vma)
> > +{
> > +	struct rw_semaphore *vma_sema;
> > +
> > +	/* Only establish in (flags) sharable vmas */
> > +	if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))

Based on my explanation above, I think this should only check VM_MAYSHARE.

> > +		return;
> > +> +	if (!vma_pmd_shareable(vma)) {
> > +		vma->vm_private_data = NULL;
> > +		return;
> > +	}
> > +
> > +	vma_sema = kmalloc(sizeof(*vma_sema), GFP_KERNEL);
> > +	if (!vma_sema) {
> > +		/*
> > +		 * If we can not allocate semaphore, then vma can not
> > +		 * participate in pmd sharing.
> > +		 */
> > +		vma->vm_private_data = NULL;
> > +	} else {
> > +		init_rwsem(vma_sema);
> > +		vma->vm_private_data = vma_sema;
> > +	}
> 
> This code is really subtle. If it's called from hugetlb_vm_op_open during fork after
> hugetlb_dup_vma_private is done, there should already be a kmalloc-ed vma_sema for this
> vma (because hugetlb_alloc_vma_lock is also called by hugetlb_dup_vma_private). So we
> can't simply change the value of vm_private_data here or vma_sema will be leaked ?

Yes, I believe your analysis is correct.

>                                                                                    But
> when hugetlb_alloc_vma_lock is called from hugetlb_reserve_pages, it should work fine.
> Or am I miss something?

You are right.  This is an issue in the current code.  I will address in
the next version.

Thanks for all your comments on this series!
-- 
Mike Kravetz

next prev parent reply	other threads:[~2022-07-29 20:15 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-06 20:23 [RFC PATCH v4 0/8] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 1/8] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 2/8] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 3/8] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
2022-08-05 16:11   ` David Hildenbrand
2022-07-06 20:23 ` [RFC PATCH v4 4/8] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
2022-07-27  9:20   ` Miaohe Lin
2022-07-27 19:00     ` Mike Kravetz
2022-07-28  2:02       ` Miaohe Lin
2022-07-28 16:45         ` Mike Kravetz
2022-08-05 16:28           ` David Hildenbrand
2022-08-05 22:41             ` Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 5/8] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 6/8] hugetlb: add vma based lock for pmd sharing synchronization Mike Kravetz
2022-07-29  2:55   ` Miaohe Lin
2022-07-29 18:00     ` Mike Kravetz [this message]
2022-07-30  2:12       ` Miaohe Lin
2022-07-06 20:23 ` [RFC PATCH v4 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
2022-07-29  2:02   ` Miaohe Lin
2022-07-29 18:11     ` Mike Kravetz
2022-07-30  2:15       ` Miaohe Lin
2022-07-06 20:23 ` [RFC PATCH v4 8/8] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
2022-07-28  6:51   ` Miaohe Lin
2022-07-28 17:47     ` Mike Kravetz
2022-07-29  1:41       ` Miaohe Lin
2022-07-29 17:41         ` Mike Kravetz
2022-07-30  1:57           ` Miaohe Lin
2022-07-20 14:16 ` [RFC PATCH v4 0/8] hugetlb: Change huge pmd sharing synchronization again Ray Fucillo
  -- strict thread matches above, loose matches on Subject: below --
2022-07-10  6:17 [RFC PATCH v4 6/8] hugetlb: add vma based lock for pmd sharing synchronization kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YuQgIwT+bjqX7Kcx@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=Ray.Fucillo@intersystems.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=axelrasmussen@google.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=jthoughton@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=naoya.horiguchi@linux.dev \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=prakash.sangappa@oracle.com \
    --cc=songmuchun@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.