linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>,
	David Hildenbrand <david@redhat.com>,
	Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization
Date: Fri, 2 Sep 2022 16:07:55 -0700	[thread overview]
Message-ID: <YxKMy3sDsWPEOMMJ@monkey> (raw)
In-Reply-To: <08edc08e-08ab-0706-3c8d-804080f37bd7@huawei.com>

On 08/30/22 10:02, Miaohe Lin wrote:
> On 2022/8/25 1:57, Mike Kravetz wrote:
> > The new hugetlb vma lock (rw semaphore) is used to address this race:
> > 
> > Faulting thread                                 Unsharing thread
> > ...                                                  ...
> > ptep = huge_pte_offset()
> >       or
> > ptep = huge_pte_alloc()
> > ...
> >                                                 i_mmap_lock_write
> >                                                 lock page table
> > ptep invalid   <------------------------        huge_pmd_unshare()
> > Could be in a previously                        unlock_page_table
> > sharing process or worse                        i_mmap_unlock_write
> > ...
> > 
> > The vma_lock is used as follows:
> > - During fault processing. the lock is acquired in read mode before
> >   doing a page table lock and allocation (huge_pte_alloc).  The lock is
> >   held until code is finished with the page table entry (ptep).
> > - The lock must be held in write mode whenever huge_pmd_unshare is
> >   called.
> > 
> > Lock ordering issues come into play when unmapping a page from all
> > vmas mapping the page.  The i_mmap_rwsem must be held to search for the
> > vmas, and the vma lock must be held before calling unmap which will
> > call huge_pmd_unshare.  This is done today in:
> > - try_to_migrate_one and try_to_unmap_ for page migration and memory
> >   error handling.  In these routines we 'try' to obtain the vma lock and
> >   fail to unmap if unsuccessful.  Calling routines already deal with the
> >   failure of unmapping.
> > - hugetlb_vmdelete_list for truncation and hole punch.  This routine
> >   also tries to acquire the vma lock.  If it fails, it skips the
> >   unmapping.  However, we can not have file truncation or hole punch
> >   fail because of contention.  After hugetlb_vmdelete_list, truncation
> >   and hole punch call remove_inode_hugepages.  remove_inode_hugepages
> >   check for mapped pages and call hugetlb_unmap_file_page to unmap them.
> >   hugetlb_unmap_file_page is designed to drop locks and reacquire in the
> >   correct order to guarantee unmap success.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  fs/hugetlbfs/inode.c |  46 +++++++++++++++++++
> >  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
> >  mm/memory.c          |   2 +
> >  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
> >  mm/userfaultfd.c     |   9 +++-
> >  5 files changed, 214 insertions(+), 45 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index b93d131b0cb5..52d9b390389b 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  					struct folio *folio, pgoff_t index)
> >  {
> >  	struct rb_root_cached *root = &mapping->i_mmap;
> > +	unsigned long skipped_vm_start;
> > +	struct mm_struct *skipped_mm;
> >  	struct page *page = &folio->page;
> >  	struct vm_area_struct *vma;
> >  	unsigned long v_start;
> > @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  	end = ((index + 1) * pages_per_huge_page(h));
> >  
> >  	i_mmap_lock_write(mapping);
> > +retry:
> > +	skipped_mm = NULL;
> >  
> >  	vma_interval_tree_foreach(vma, root, start, end - 1) {
> >  		v_start = vma_offset_start(vma, start);
> > @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >  			continue;
> >  
> > +		if (!hugetlb_vma_trylock_write(vma)) {
> > +			/*
> > +			 * If we can not get vma lock, we need to drop
> > +			 * immap_sema and take locks in order.
> > +			 */
> > +			skipped_vm_start = vma->vm_start;
> > +			skipped_mm = vma->vm_mm;
> > +			/* grab mm-struct as we will be dropping i_mmap_sema */
> > +			mmgrab(skipped_mm);
> > +			break;
> > +		}
> > +
> >  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >  				NULL, ZAP_FLAG_DROP_MARKER);
> > +		hugetlb_vma_unlock_write(vma);
> >  	}
> >  
> >  	i_mmap_unlock_write(mapping);
> > +
> > +	if (skipped_mm) {
> > +		mmap_read_lock(skipped_mm);
> > +		vma = find_vma(skipped_mm, skipped_vm_start);
> > +		if (!vma || !is_vm_hugetlb_page(vma) ||
> > +					vma->vm_file->f_mapping != mapping ||
> > +					vma->vm_start != skipped_vm_start) {
> 
> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
> 

Yes, that is missing.  I will add here.

> > +			mmap_read_unlock(skipped_mm);
> > +			mmdrop(skipped_mm);
> > +			goto retry;
> > +		}
> > +
> 
> IMHO, above check is not enough. Think about the below scene:
> 
> CPU 1					CPU 2
> hugetlb_unmap_file_folio		exit_mmap
>   mmap_read_lock(skipped_mm);		  mmap_read_lock(mm);
>   check vma is wanted.
>   					  unmap_vmas
>   mmap_read_unlock(skipped_mm);		  mmap_read_unlock
>   					  mmap_write_lock(mm);
>   					  free_pgtables
>   					  remove_vma
> 					    hugetlb_vma_lock_free
>   vma, hugetlb_vma_lock is still *used after free*
>   					  mmap_write_unlock(mm);
> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?

In the retry case, we are OK because go back and look up the vma again.  Right?

After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
Before that, we do the following:

> > +		hugetlb_vma_lock_write(vma);
> > +		i_mmap_lock_write(mapping);

IIUC, vma can not go away while we hold i_mmap_lock_write.  So, after this we
can.

> > +		mmap_read_unlock(skipped_mm);
> > +		mmdrop(skipped_mm);

We continue to hold i_mmap_lock_write as we goto retry.

I could be missing something as well.  This was how I intended to keep
vma valid while dropping and acquiring locks.

> > +
> > +		v_start = vma_offset_start(vma, start);
> > +		v_end = vma_offset_end(vma, end);
> > +		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> > +				NULL, ZAP_FLAG_DROP_MARKER);
> > +		hugetlb_vma_unlock_write(vma);
> > +
> > +		goto retry;
> 
> Should here be one cond_resched() here in case this function will take a really long time?
> 

I think we will at most retry once.

> > +	}
> >  }
> >  
> >  static void
> > @@ -474,11 +516,15 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> >  		unsigned long v_start;
> >  		unsigned long v_end;
> >  
> > +		if (!hugetlb_vma_trylock_write(vma))
> > +			continue;
> > +
> >  		v_start = vma_offset_start(vma, start);
> >  		v_end = vma_offset_end(vma, end);
> >  
> >  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >  				     NULL, zap_flags);
> > +		hugetlb_vma_unlock_write(vma);
> >  	}
> 
> unmap_hugepage_range is not called under hugetlb_vma_lock in unmap_ref_private since it's private vma?
> Add a comment to avoid future confusion?
> 
> >  }

Sure, will add a comment before hugetlb_vma_lock.

> >  
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6fb0bff2c7ee..5912c2b97ddf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4801,6 +4801,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  		mmu_notifier_invalidate_range_start(&range);
> >  		mmap_assert_write_locked(src);
> >  		raw_write_seqcount_begin(&src->write_protect_seq);
> > +	} else {
> > +		/*
> > +		 * For shared mappings the vma lock must be held before
> > +		 * calling huge_pte_offset in the src vma. Otherwise, the
> 
> s/huge_pte_offset/huge_pte_alloc/, i.e. huge_pte_alloc could return shared pmd, not huge_pte_offset which
> might lead to confusion. But this is really trivial...

Actually, it is huge_pte_offset.  While looking up ptes in the source vma, we
do not want to race with other threads in the source process which could
be doing a huge_pmd_unshare.  Otherwise, the returned pte could be invalid.

FYI - Most of this code is now 'dead' because of bcd51a3c679d "Lazy page table
copies in fork()".  We will not copy shared mappigns at fork time.

> 
> Except from above comments, this patch looks good to me.
> 

Thank you! Thank you! Thank you!  For looking at this series and all
your comments.  I hope to send out v2 next week.
-- 
Mike Kravetz


  reply	other threads:[~2022-09-02 23:08 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-24 17:57 [PATCH 0/8] hugetlb: Use new vma mutex for huge pmd sharing synchronization Mike Kravetz
2022-08-24 17:57 ` [PATCH 1/8] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-08-27  2:50   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 2/8] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-08-27  2:59   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 3/8] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
2022-08-27  3:08   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 4/8] hugetlb: handle truncate racing with page faults Mike Kravetz
2022-08-25 17:00   ` Mike Kravetz
2022-08-27  8:02   ` Miaohe Lin
2022-08-29 21:53     ` Mike Kravetz
2022-09-06 13:57   ` Sven Schnelle
2022-09-06 16:48     ` Mike Kravetz
2022-09-06 18:05       ` Mike Kravetz
2022-09-06 23:08         ` Mike Kravetz
2022-09-07  2:11           ` Miaohe Lin
2022-09-07  2:37             ` Mike Kravetz
2022-09-07  3:07               ` Miaohe Lin
2022-09-07  3:30                 ` Mike Kravetz
2022-09-07  8:22                   ` Sven Schnelle
2022-09-07 14:50                     ` Mike Kravetz
2022-08-24 17:57 ` [PATCH 5/8] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
2022-08-27  8:07   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 6/8] hugetlb: add vma based lock for pmd sharing Mike Kravetz
2022-08-27  9:30   ` Miaohe Lin
2022-08-29 22:24     ` Mike Kravetz
2022-08-30  2:34       ` Miaohe Lin
2022-09-07 20:50       ` Mike Kravetz
2022-09-08  2:04         ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
2022-08-29  2:44   ` Miaohe Lin
2022-08-29 22:37     ` Mike Kravetz
2022-08-30  2:46       ` Miaohe Lin
2022-09-02 21:35         ` Mike Kravetz
2022-09-05  2:32           ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
2022-08-30  2:02   ` Miaohe Lin
2022-09-02 23:07     ` Mike Kravetz [this message]
2022-09-05  3:08       ` Miaohe Lin
2022-09-12 23:02         ` Mike Kravetz
2022-09-13  2:14           ` Miaohe Lin
2022-09-14  0:50             ` Mike Kravetz
2022-09-14  2:08               ` Miaohe Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YxKMy3sDsWPEOMMJ@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=Ray.Fucillo@intersystems.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=axelrasmussen@google.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=jthoughton@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=naoya.horiguchi@linux.dev \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=prakash.sangappa@oracle.com \
    --cc=songmuchun@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).