All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Muchun Song <songmuchun@bytedance.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	David Hildenbrand <david@redhat.com>,
	Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults
Date: Thu, 25 Aug 2022 10:00:19 -0700	[thread overview]
Message-ID: <Yweqo6R7qPpmy1S6@monkey> (raw)
In-Reply-To: <20220824175757.20590-5-mike.kravetz@oracle.com>

On 08/24/22 10:57, Mike Kravetz wrote:
> When page fault code needs to allocate and instantiate a new hugetlb
> page (huegtlb_no_page), it checks early to determine if the fault is
> beyond i_size.  When discovered early, it is easy to abort the fault and
> return an error.  However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache.  Backing out changes in such instances
> becomes difficult and error prone.
> 
> Instead of trying to catch and backout all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults.  The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch).  Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
>   fault.  After fault code takes mutex, it will notice fault beyond
>   i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
>   checks in fault code.  fault code will add page beyond i_size.
>   When truncate code takes mutex for page/index, it will remove the
>   page.
> - truncate updates i_size, but fault code obtains mutex first.  If
>   fault code sees updated i_size it will abort early.  If fault code
>   does not see updated i_size, it will add page beyond i_size and
>   truncate code will remove page when it obtains fault mutex.
> 
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups.  For indicies not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
> 
<snip>
>  /*
>   * remove_inode_hugepages handles two distinct cases: truncation and hole
>   * punch.  There are subtle differences in operation for each case.
> @@ -418,11 +507,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>   * truncation is indicated by end of range being LLONG_MAX
>   *	In this case, we first scan the range and release found pages.
>   *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
> - *	maps and global counts.  Page faults can not race with truncation
> - *	in this routine.  hugetlb_no_page() prevents page faults in the
> - *	truncated range.  It checks i_size before allocation, and again after
> - *	with the page table lock for the page held.  The same lock must be
> - *	acquired to unmap a page.
> + *	maps and global counts.  Page faults can race with truncation.
> + *	During faults, hugetlb_no_page() checks i_size before page allocation,
> + *	and again after	obtaining page table lock.  It will 'back out'
> + *	allocations in the truncated range.
>   * hole punch is indicated if end is not LLONG_MAX
>   *	In the hole punch case we scan the range and release found pages.
>   *	Only when releasing a page is the associated region/reserve map
> @@ -431,75 +519,69 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>   *	This is indicated if we find a mapped page.
>   * Note: If the passed end of range value is beyond the end of file, but
>   * not LLONG_MAX this routine still performs a hole punch operation.
> + *
> + * Since page faults can race with this routine, care must be taken as both
> + * modify huge page reservation data.  To somewhat synchronize these operations
> + * the hugetlb fault mutex is taken for EVERY index in the range to be hole
> + * punched or truncated.  In this way, we KNOW either:
> + * - fault code has added a page beyond i_size, and we will remove here
> + * - fault code will see updated i_size and not add a page beyond
> + * The parameter 'lm__end' indicates the offset of the end of hole or file
> + * before truncation.  For hole punch lm_end == lend.
>   */
>  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> -				   loff_t lend)
> +				   loff_t lend, loff_t lm_end)
>  {
>  	struct hstate *h = hstate_inode(inode);
>  	struct address_space *mapping = &inode->i_data;
>  	const pgoff_t start = lstart >> huge_page_shift(h);
>  	const pgoff_t end = lend >> huge_page_shift(h);
> +	pgoff_t m_end = lm_end >> huge_page_shift(h);
> +	pgoff_t m_start, m_index;
>  	struct folio_batch fbatch;
> +	struct folio *folio;
>  	pgoff_t next, index;
> -	int i, freed = 0;
> +	unsigned int i;
> +	long freed = 0;
> +	u32 hash;
>  	bool truncate_op = (lend == LLONG_MAX);
>  
>  	folio_batch_init(&fbatch);
> -	next = start;
> +	next = m_start = start;
>  	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
>  		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> -			struct folio *folio = fbatch.folios[i];
> -			u32 hash = 0;
> +			folio = fbatch.folios[i];
>  
>  			index = folio->index;
> -			hash = hugetlb_fault_mutex_hash(mapping, index);
> -			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> -
>  			/*
> -			 * If folio is mapped, it was faulted in after being
> -			 * unmapped in caller.  Unmap (again) now after taking
> -			 * the fault mutex.  The mutex will prevent faults
> -			 * until we finish removing the folio.
> -			 *
> -			 * This race can only happen in the hole punch case.
> -			 * Getting here in a truncate operation is a bug.
> +			 * Take fault mutex for missing folios before index,
> +			 * while checking folios that might have been added
> +			 * due to a race with fault code.
>  			 */
> -			if (unlikely(folio_mapped(folio))) {
> -				BUG_ON(truncate_op);
> -
> -				i_mmap_lock_write(mapping);
> -				hugetlb_vmdelete_list(&mapping->i_mmap,
> -					index * pages_per_huge_page(h),
> -					(index + 1) * pages_per_huge_page(h),
> -					ZAP_FLAG_DROP_MARKER);
> -				i_mmap_unlock_write(mapping);
> -			}
> +			freed += fault_lock_inode_indicies(h, inode, mapping,
> +						m_start, m_index, truncate_op);

This should be 'index' instead of 'm_index' as discovered here:
https://lore.kernel.org/linux-mm/CA+G9fYsHVdu0toduQqk6vsR8Z8mOVzZ9-_p3O5fjQ5mOpSxsDA@mail.gmail.com/

-- 
Mike Kravetz


  reply	other threads:[~2022-08-25 17:00 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-24 17:57 [PATCH 0/8] hugetlb: Use new vma mutex for huge pmd sharing synchronization Mike Kravetz
2022-08-24 17:57 ` [PATCH 1/8] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-08-27  2:50   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 2/8] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-08-27  2:59   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 3/8] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
2022-08-27  3:08   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 4/8] hugetlb: handle truncate racing with page faults Mike Kravetz
2022-08-25 17:00   ` Mike Kravetz [this message]
2022-08-27  8:02   ` Miaohe Lin
2022-08-29 21:53     ` Mike Kravetz
2022-09-06 13:57   ` Sven Schnelle
2022-09-06 16:48     ` Mike Kravetz
2022-09-06 18:05       ` Mike Kravetz
2022-09-06 23:08         ` Mike Kravetz
2022-09-07  2:11           ` Miaohe Lin
2022-09-07  2:37             ` Mike Kravetz
2022-09-07  3:07               ` Miaohe Lin
2022-09-07  3:30                 ` Mike Kravetz
2022-09-07  8:22                   ` Sven Schnelle
2022-09-07 14:50                     ` Mike Kravetz
2022-08-24 17:57 ` [PATCH 5/8] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
2022-08-27  8:07   ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 6/8] hugetlb: add vma based lock for pmd sharing Mike Kravetz
2022-08-27  9:30   ` Miaohe Lin
2022-08-29 22:24     ` Mike Kravetz
2022-08-30  2:34       ` Miaohe Lin
2022-09-07 20:50       ` Mike Kravetz
2022-09-08  2:04         ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
2022-08-29  2:44   ` Miaohe Lin
2022-08-29 22:37     ` Mike Kravetz
2022-08-30  2:46       ` Miaohe Lin
2022-09-02 21:35         ` Mike Kravetz
2022-09-05  2:32           ` Miaohe Lin
2022-08-24 17:57 ` [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
2022-08-30  2:02   ` Miaohe Lin
2022-09-02 23:07     ` Mike Kravetz
2022-09-05  3:08       ` Miaohe Lin
2022-09-12 23:02         ` Mike Kravetz
2022-09-13  2:14           ` Miaohe Lin
2022-09-14  0:50             ` Mike Kravetz
2022-09-14  2:08               ` Miaohe Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yweqo6R7qPpmy1S6@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=Ray.Fucillo@intersystems.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=axelrasmussen@google.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=jthoughton@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=naoya.horiguchi@linux.dev \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=prakash.sangappa@oracle.com \
    --cc=songmuchun@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.