Date: Tue, 17 Jan 2023 16:06:40 -0500
From: Peter Xu <peterx@redhat.com>
To: James Houghton
Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)",
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb
Message-ID:
References: <20230105101844.1893104-1-jthoughton@google.com>
	<20230105101844.1893104-36-jthoughton@google.com>
In-Reply-To: <20230105101844.1893104-36-jthoughton@google.com>

On Thu, Jan 05, 2023 at 10:18:33AM +0000, James Houghton wrote:
> This is a necessary extension to the UFFDIO_CONTINUE changes. When
> userspace finishes mapping an entire hugepage with UFFDIO_CONTINUE, the
> kernel has no mechanism to automatically collapse the page table to map
> the whole hugepage normally. We require userspace to inform us that they
> would like the mapping to be collapsed; they do this with MADV_COLLAPSE.
>
> If userspace has not mapped all of a hugepage with UFFDIO_CONTINUE, but
> only some, hugetlb_collapse will cause the requested range to be mapped
> as if it were UFFDIO_CONTINUE'd already. The effects of any
> UFFDIO_WRITEPROTECT calls may be undone by a call to MADV_COLLAPSE for
> intersecting address ranges.
>
> This commit is co-opting the same madvise mode that has been introduced
> to synchronously collapse THPs. The function that does THP collapsing
> has been renamed to madvise_collapse_thp.
>
> As with the rest of the high-granularity mapping support, MADV_COLLAPSE
> is only supported for shared VMAs right now.
>
> MADV_COLLAPSE has the same synchronization as huge_pmd_unshare.
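For reference, the userspace flow this enables looks roughly like the
sketch below. It is only an illustration: it assumes a userfaultfd `uffd`
already registered with UFFDIO_REGISTER_MODE_MINOR over a hugetlb mapping
at `addr`, and the helper name, error handling, and the MADV_COLLAPSE
fallback define are illustrative rather than taken from the series.

    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/userfaultfd.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25	/* value per the upstream THP definition */
    #endif

    /*
     * Resolve minor faults on one hugepage 4K (base page) at a time with
     * UFFDIO_CONTINUE, then ask the kernel to collapse the mapping.
     */
    static int continue_then_collapse(int uffd, char *addr,
                                      size_t hpage_size, size_t base_page_size)
    {
            struct uffdio_continue cont;
            size_t off;

            /* Install PTEs for each base page already present in the page cache. */
            for (off = 0; off < hpage_size; off += base_page_size) {
                    memset(&cont, 0, sizeof(cont));
                    cont.range.start = (unsigned long)(addr + off);
                    cont.range.len = base_page_size;
                    if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
                            return -1;
            }

            /*
             * With this patch applied, MADV_COLLAPSE on a shared hugetlb VMA
             * collapses the high-granularity page tables back to a single
             * huge mapping.
             */
            return madvise(addr, hpage_size, MADV_COLLAPSE);
    }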
>
> Signed-off-by: James Houghton
> ---
>  include/linux/huge_mm.h |  12 +--
>  include/linux/hugetlb.h |   8 ++
>  mm/hugetlb.c            | 164 ++++++++++++++++++++++++++++++++++++++++
>  mm/khugepaged.c         |   4 +-
>  mm/madvise.c            |  18 ++++-
>  5 files changed, 197 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a1341fdcf666..5d1e3c980f74 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -218,9 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>
>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>                       int advice);
> -int madvise_collapse(struct vm_area_struct *vma,
> -                     struct vm_area_struct **prev,
> -                     unsigned long start, unsigned long end);
> +int madvise_collapse_thp(struct vm_area_struct *vma,
> +                         struct vm_area_struct **prev,
> +                         unsigned long start, unsigned long end);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>                             unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -367,9 +367,9 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>          return -EINVAL;
>  }
>
> -static inline int madvise_collapse(struct vm_area_struct *vma,
> -                                   struct vm_area_struct **prev,
> -                                   unsigned long start, unsigned long end)
> +static inline int madvise_collapse_thp(struct vm_area_struct *vma,
> +                                       struct vm_area_struct **prev,
> +                                       unsigned long start, unsigned long end)
>  {
>          return -EINVAL;
>  }
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c8524ac49b24..e1baf939afb6 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1298,6 +1298,8 @@ bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
>  int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
>                                struct vm_area_struct *vma, unsigned long start,
>                                unsigned long end);
> +int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +                     unsigned long start, unsigned long end);
>  #else
>  static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
>  {
> @@ -1318,6 +1320,12 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
>  {
>          return -EINVAL;
>  }
> +static inline
> +int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +                     unsigned long start, unsigned long end)
> +{
> +        return -EINVAL;
> +}
>  #endif
>
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5b6215e03fe1..388c46c7e77a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7852,6 +7852,170 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
>          return 0;
>  }
>
> +static bool hugetlb_hgm_collapsable(struct vm_area_struct *vma)
> +{
> +        if (!hugetlb_hgm_eligible(vma))
> +                return false;
> +        if (!vma->vm_private_data) /* vma lock required for collapsing */
> +                return false;
> +        return true;
> +}
> +
> +/*
> + * Collapse the address range from @start to @end to be mapped optimally.
> + *
> + * This is only valid for shared mappings. The main use case for this function
> + * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
> + * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
> + * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
> + * it up to userspace to tell us to do so, via MADV_COLLAPSE.
> + *
> + * Any holes in the mapping will be filled. If there is no page in the
> + * pagecache for a region we're collapsing, the PTEs will be cleared.
> + *
> + * If high-granularity PTEs are uffd-wp markers, those markers will be dropped.
> + */
> +int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +                     unsigned long start, unsigned long end)
> +{
> +        struct hstate *h = hstate_vma(vma);
> +        struct address_space *mapping = vma->vm_file->f_mapping;
> +        struct mmu_notifier_range range;
> +        struct mmu_gather tlb;
> +        unsigned long curr = start;
> +        int ret = 0;
> +        struct page *hpage, *subpage;
> +        pgoff_t idx;
> +        bool writable = vma->vm_flags & VM_WRITE;
> +        bool shared = vma->vm_flags & VM_SHARED;
> +        struct hugetlb_pte hpte;
> +        pte_t entry;
> +
> +        /*
> +         * This is only supported for shared VMAs, because we need to look up
> +         * the page to use for any PTEs we end up creating.
> +         */
> +        if (!shared)
> +                return -EINVAL;
> +
> +        /* If HGM is not enabled, there is nothing to collapse. */
> +        if (!hugetlb_hgm_enabled(vma))
> +                return 0;
> +
> +        /*
> +         * We lost the VMA lock after splitting, so we can't safely collapse.
> +         * We could improve this in the future (like take the mmap_lock for
> +         * writing and try again), but for now just fail with ENOMEM.
> +         */
> +        if (unlikely(!hugetlb_hgm_collapsable(vma)))
> +                return -ENOMEM;
> +
> +        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
> +                                start, end);
> +        mmu_notifier_invalidate_range_start(&range);
> +        tlb_gather_mmu(&tlb, mm);
> +
> +        /*
> +         * Grab the VMA lock and mapping sem for writing. This will prevent
> +         * concurrent high-granularity page table walks, so that we can safely
> +         * collapse and free page tables.
> +         *
> +         * This is the same locking that huge_pmd_unshare requires.
> +         */
> +        hugetlb_vma_lock_write(vma);
> +        i_mmap_lock_write(vma->vm_file->f_mapping);
> +
> +        while (curr < end) {
> +                ret = hugetlb_alloc_largest_pte(&hpte, mm, vma, curr, end);
> +                if (ret)
> +                        goto out;
> +
> +                entry = huge_ptep_get(hpte.ptep);
> +
> +                /*
> +                 * There is no work to do if the PTE doesn't point to page
> +                 * tables.
> +                 */
> +                if (!pte_present(entry))
> +                        goto next_hpte;
> +                if (hugetlb_pte_present_leaf(&hpte, entry))
> +                        goto next_hpte;
> +
> +                idx = vma_hugecache_offset(h, vma, curr);
> +                hpage = find_get_page(mapping, idx);
> +
> +                if (hpage && !HPageMigratable(hpage)) {
> +                        /*
> +                         * Don't collapse a mapping to a page that is pending
> +                         * a migration. Migration swap entries may have placed
> +                         * in the page table.
> +                         */
> +                        ret = -EBUSY;
> +                        put_page(hpage);
> +                        goto out;
> +                }
> +
> +                if (hpage && PageHWPoison(hpage)) {
> +                        /*
> +                         * Don't collapse a mapping to a page that is
> +                         * hwpoisoned.
> +                         */
> +                        ret = -EHWPOISON;
> +                        put_page(hpage);
> +                        /*
> +                         * By setting ret to -EHWPOISON, if nothing else
> +                         * happens, we will tell userspace that we couldn't
> +                         * fully collapse everything due to poison.
> +                         *
> +                         * Skip this page, and continue to collapse the rest
> +                         * of the mapping.
> +                         */
> +                        curr = (curr & huge_page_mask(h)) + huge_page_size(h);
> +                        continue;
> +                }
> +
> +                /*
> +                 * Clear all the PTEs, and drop ref/mapcounts
> +                 * (on tlb_finish_mmu).
> +                 */
> +                __unmap_hugepage_range(&tlb, vma, curr,
> +                                       curr + hugetlb_pte_size(&hpte),
> +                                       NULL,
> +                                       ZAP_FLAG_DROP_MARKER);
> +                /* Free the PTEs. */
> +                hugetlb_free_pgd_range(&tlb,
> +                                       curr, curr + hugetlb_pte_size(&hpte),
> +                                       curr, curr + hugetlb_pte_size(&hpte));
> +                if (!hpage) {
> +                        huge_pte_clear(mm, curr, hpte.ptep,
> +                                       hugetlb_pte_size(&hpte));
> +                        goto next_hpte;
> +                }
> +
> +                page_dup_file_rmap(hpage, true);
> +
> +                subpage = hugetlb_find_subpage(h, hpage, curr);
> +                entry = make_huge_pte_with_shift(vma, subpage,
> +                                                 writable, hpte.shift);
> +                set_huge_pte_at(mm, curr, hpte.ptep, entry);
> +next_hpte:
> +                curr += hugetlb_pte_size(&hpte);
> +
> +                if (curr < end) {
> +                        /* Don't hold the VMA lock for too long. */
> +                        hugetlb_vma_unlock_write(vma);
> +                        cond_resched();
> +                        hugetlb_vma_lock_write(vma);

The intention is good here, but IIUC this will cause the vma lock to be
taken after the i_mmap_rwsem, which can cause circular deadlocks. To do
this properly we'll need to also release the i_mmap_rwsem.

However, that may make the resched() logic overcomplicated. Meanwhile, for
2M huge pages I think this will be run for each 2M range, which can be too
fine grained, so the "curr < end" check looks a bit too aggressive.

The other thing I noticed is that the long mmu notifier invalidate period
between start -> end will (in a real-life VM context) cause vcpu threads
to spin. I _think_ it's because is_page_fault_stale() (during a vmexit
following a kvm page fault) keeps reporting true for the whole duration of
a MADV_COLLAPSE over a large range, so even if we release both locks here
it may not help tremendously for the VM migration use case because of the
long-standing mmu notifier invalidation.

To summarize: I think a simpler first version of hugetlb MADV_COLLAPSE
could drop this "if" block and let the userspace app decide the step size
of COLLAPSE (see the sketch at the end of this mail)?

> +                }
> +        }
> +out:
> +        i_mmap_unlock_write(vma->vm_file->f_mapping);
> +        hugetlb_vma_unlock_write(vma);
> +        tlb_finish_mmu(&tlb);
> +        mmu_notifier_invalidate_range_end(&range);
> +        return ret;
> +}
> +
>  #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

-- 
Peter Xu
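For the "userspace decides the step size" idea, a minimal sketch is below.
The chunk-size policy, helper name, and error handling are assumptions,
and MADV_COLLAPSE only behaves this way on hugetlb with this series
applied; each madvise() call then bounds how long the kernel holds the
hugetlb VMA lock and keeps an mmu notifier invalidation window open.

    #include <sys/mman.h>
    #include <stddef.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25	/* value per the upstream THP definition */
    #endif

    /*
     * Collapse a large UFFDIO_CONTINUE'd region in bounded steps.  'step'
     * would typically be one or a few huge pages (e.g. a multiple of 2M);
     * the exact value is a policy choice left to userspace.
     */
    static int collapse_in_steps(char *start, size_t len, size_t step)
    {
            size_t done;

            for (done = 0; done < len; done += step) {
                    size_t chunk = (len - done < step) ? len - done : step;

                    /* Errors (e.g. EBUSY, EHWPOISON) are reported via errno. */
                    if (madvise(start + done, chunk, MADV_COLLAPSE))
                            return -1;
            }
            return 0;
    }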