* [PATCH V2] mm/gup: Fix longterm pin on slow gup regression
@ 2024-06-28 6:01 yangge1116
2024-06-28 20:42 ` Andrew Morton
0 siblings, 1 reply; 3+ messages in thread
From: yangge1116 @ 2024-06-28 6:01 UTC (permalink / raw)
To: akpm
Cc: linux-mm, linux-kernel, stable, 21cnbao, peterx, yang,
baolin.wang, liuzixing, yangge
From: yangge <yangge1116@126.com>
If a large amount of CMA memory is configured in the system (for
example, CMA memory accounts for 50% of system memory), starting a
SEV virtual machine will fail. While starting the SEV virtual
machine, pin_user_pages_fast(..., FOLL_LONGTERM, ...) is called to
pin its memory. Normally, if a page is present and lies in a CMA
area, pin_user_pages_fast() first calls __get_user_pages_locked()
to pin the page, and then calls check_and_migrate_movable_pages()
to migrate the page out of the CMA area. But currently the call to
__get_user_pages_locked() fails, because it uses try_grab_folio()
to pin the page in the gup slow path.
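
For context, the trigger is the standard longterm-pin pattern. The
sketch below is illustrative only: the helper name and error handling
are invented here rather than taken from sev_pin_memory(), but it
shows roughly how such a caller ends up in pin_user_pages_fast() with
FOLL_LONGTERM:

#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Hypothetical helper, illustration only: longterm-pin a user buffer. */
static struct page **demo_longterm_pin(unsigned long uaddr, unsigned long size)
{
	int npages = DIV_ROUND_UP(size + offset_in_page(uaddr), PAGE_SIZE);
	struct page **pages;
	int pinned;

	pages = kvcalloc(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return ERR_PTR(-ENOMEM);

	/*
	 * FOLL_LONGTERM asks GUP to migrate CMA/ZONE_MOVABLE pages to
	 * pinnable memory before taking the pin.  With the regression
	 * described above, the slow-path grab fails before that
	 * migration step is reached when the pages sit in a CMA area.
	 */
	pinned = pin_user_pages_fast(uaddr & PAGE_MASK, npages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned != npages) {
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		kvfree(pages);
		return ERR_PTR(pinned < 0 ? pinned : -ENOMEM);
	}

	return pages;
}
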
Commit 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages
!= NULL"") switched the gup slow path to try_grab_folio(), which
turns out to be problematic because try_grab_folio() checks whether
the page can be longterm pinned. That check may fail and cause
__get_user_pages_locked() to fail. However, these checks are not
required in the gup slow path, so we can use try_grab_page() instead
of try_grab_folio(). In addition, try_grab_page() can currently only
add 1 to a page's refcount; extend it so that the refcount can be
increased by the number of references passed in.
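
For reference, the refcount accounting the extended helper is expected
to perform for a grab of refs pages is condensed below. This is only a
sketch mirroring the mm/gup.c hunks in this patch: the sketch_ name is
made up, error paths are omitted, and it assumes mm/gup.c context
(linux/mm.h plus mm/internal.h for FOLL_GET/FOLL_PIN).
GUP_PIN_COUNTING_BIAS is the existing bias used to encode order-0 pins
directly in the refcount.

/*
 * Condensed sketch of what the extended try_grab_page(page, refs, flags)
 * does for FOLL_GET/FOLL_PIN; not the real function.
 */
static int sketch_try_grab_page(struct page *page, int refs, unsigned int flags)
{
	struct folio *folio = page_folio(page);

	if (flags & FOLL_GET) {
		folio_ref_add(folio, refs);
	} else if (flags & FOLL_PIN) {
		/* the zero page is never pinned; it is not going anywhere */
		if (is_zero_page(page))
			return 0;
		if (folio_test_large(folio)) {
			/* large folios count pins separately in _pincount */
			folio_ref_add(folio, refs);
			atomic_add(refs, &folio->_pincount);
		} else {
			/* order-0 folios encode each pin as a refcount bias */
			folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS);
		}
		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
	}
	return 0;
}
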
The following log reveals it:
[ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[ 464.325515] Call Trace:
[ 464.325520] <TASK>
[ 464.325523] ? __get_user_pages+0x423/0x520
[ 464.325528] ? __warn+0x81/0x130
[ 464.325536] ? __get_user_pages+0x423/0x520
[ 464.325541] ? report_bug+0x171/0x1a0
[ 464.325549] ? handle_bug+0x3c/0x70
[ 464.325554] ? exc_invalid_op+0x17/0x70
[ 464.325558] ? asm_exc_invalid_op+0x1a/0x20
[ 464.325567] ? __get_user_pages+0x423/0x520
[ 464.325575] __gup_longterm_locked+0x212/0x7a0
[ 464.325583] internal_get_user_pages_fast+0xfb/0x190
[ 464.325590] pin_user_pages_fast+0x47/0x60
[ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd]
[ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd]
In another thread [1], the hugepd path was reported to have a similar
problem, so the relevant handling is included here as well.
[1] https://lore.kernel.org/all/20240604234858.948986-2-yang@os.amperecomputing.com/
Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Cc: <stable@vger.kernel.org>
Signed-off-by: yangge <yangge1116@126.com>
---
mm/gup.c | 55 +++++++++++++++++++++++++++++--------------------------
mm/huge_memory.c | 2 +-
mm/internal.h | 2 +-
3 files changed, 31 insertions(+), 28 deletions(-)
V2:
1. Use unlikely() instead of WARN_ON_ONCE().
2. Rework the code and commit log to include the hugepd path handling from Yang Shi.
diff --git a/mm/gup.c b/mm/gup.c
index 6ff9f95..070cf58 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -222,7 +222,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
* -ENOMEM FOLL_GET or FOLL_PIN was set, but the page could not
* be grabbed.
*/
-int __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check try_grab_page(struct page *page, int refs, unsigned int flags)
{
struct folio *folio = page_folio(page);
@@ -233,7 +233,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
return -EREMOTEIO;
if (flags & FOLL_GET)
- folio_ref_inc(folio);
+ folio_ref_add(folio, refs);
else if (flags & FOLL_PIN) {
/*
* Don't take a pin on the zero page - it's not going anywhere
@@ -248,13 +248,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
* so that the page really is pinned.
*/
if (folio_test_large(folio)) {
- folio_ref_add(folio, 1);
- atomic_add(1, &folio->_pincount);
+ folio_ref_add(folio, refs);
+ atomic_add(refs, &folio->_pincount);
} else {
- folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+ folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS);
}
- node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+ node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
}
return 0;
@@ -535,7 +535,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
*/
static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz,
unsigned long addr, unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
+ struct page **pages, int *nr, bool fast)
{
unsigned long pte_end;
struct page *page;
@@ -558,9 +558,14 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
page = pte_page(pte);
refs = record_subpages(page, sz, addr, end, pages + *nr);
- folio = try_grab_folio(page, refs, flags);
- if (!folio)
- return 0;
+ if (!fast) {
+ folio = page_folio(page);
+ if (try_grab_page(page, refs, flags))
+ return 0;
+ } else {
+ folio = try_grab_folio(page, refs, flags);
+ if (!folio)
+ return 0;
+ }
if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
gup_put_folio(folio, refs, flags);
@@ -588,7 +593,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
unsigned long addr, unsigned int pdshift,
unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
+ struct page **pages, int *nr, bool fast)
{
pte_t *ptep;
unsigned long sz = 1UL << hugepd_shift(hugepd);
@@ -598,7 +603,7 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
ptep = hugepte_offset(hugepd, addr, pdshift);
do {
next = hugepte_addr_end(addr, end, sz);
- ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr);
+ ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast);
if (ret != 1)
return ret;
} while (ptep++, addr = next, addr != end);
@@ -625,7 +630,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
ptep = hugepte_offset(hugepd, addr, pdshift);
ptl = huge_pte_lock(h, vma->vm_mm, ptep);
ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE,
- flags, &page, &nr);
+ flags, &page, &nr, false);
spin_unlock(ptl);
if (ret == 1) {
@@ -642,7 +647,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
unsigned long addr, unsigned int pdshift,
unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
+ struct page **pages, int *nr, bool fast)
{
return 0;
}
@@ -729,7 +734,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
gup_must_unshare(vma, flags, page))
return ERR_PTR(-EMLINK);
- ret = try_grab_page(page, flags);
+ ret = try_grab_page(page, 1, flags);
if (ret)
page = ERR_PTR(ret);
else
@@ -806,7 +811,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
!PageAnonExclusive(page), page);
- ret = try_grab_page(page, flags);
+ ret = try_grab_page(page, 1, flags);
if (ret)
return ERR_PTR(ret);
@@ -969,7 +974,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
!PageAnonExclusive(page), page);
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
- ret = try_grab_page(page, flags);
+ ret = try_grab_page(page, 1, flags);
if (unlikely(ret)) {
page = ERR_PTR(ret);
goto out;
@@ -1233,7 +1238,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
goto unmap;
*page = pte_page(entry);
}
- ret = try_grab_page(*page, gup_flags);
+ ret = try_grab_page(*page, 1, gup_flags);
if (unlikely(ret))
goto unmap;
out:
@@ -1636,22 +1641,20 @@ static long __get_user_pages(struct mm_struct *mm,
* pages.
*/
if (page_increm > 1) {
- struct folio *folio;
/*
* Since we already hold refcount on the
* large folio, this should never fail.
*/
- folio = try_grab_folio(page, page_increm - 1,
+ ret = try_grab_page(page, page_increm - 1,
foll_flags);
- if (WARN_ON_ONCE(!folio)) {
+ if (unlikely(ret)) {
/*
* Release the 1st page ref if the
* folio is problematic, fail hard.
*/
gup_put_folio(page_folio(page), 1,
foll_flags);
- ret = -EFAULT;
goto out;
}
}
@@ -3276,7 +3279,7 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
* pmd format and THP pmd format
*/
if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr,
- PMD_SHIFT, next, flags, pages, nr) != 1)
+ PMD_SHIFT, next, flags, pages, nr, true) != 1)
return 0;
} else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags,
pages, nr))
@@ -3306,7 +3309,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
return 0;
} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr,
- PUD_SHIFT, next, flags, pages, nr) != 1)
+ PUD_SHIFT, next, flags, pages, nr, true) != 1)
return 0;
} else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags,
pages, nr))
@@ -3333,7 +3336,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr,
BUILD_BUG_ON(p4d_leaf(p4d));
if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr,
- P4D_SHIFT, next, flags, pages, nr) != 1)
+ P4D_SHIFT, next, flags, pages, nr, true) != 1)
return 0;
} else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags,
pages, nr))
@@ -3362,7 +3365,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end,
return;
} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr,
- PGDIR_SHIFT, next, flags, pages, nr) != 1)
+ PGDIR_SHIFT, next, flags, pages, nr, true) != 1)
return;
} else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags,
pages, nr))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 425374a..18604e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1332,7 +1332,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
if (!*pgmap)
return ERR_PTR(-EFAULT);
page = pfn_to_page(pfn);
- ret = try_grab_page(page, flags);
+ ret = try_grab_page(page, 1, flags);
if (ret)
page = ERR_PTR(ret);
diff --git a/mm/internal.h b/mm/internal.h
index 2ea9a88..5305bbf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1227,7 +1227,7 @@ int migrate_device_coherent_page(struct page *page);
* mm/gup.c
*/
struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
-int __must_check try_grab_page(struct page *page, unsigned int flags);
+int __must_check try_grab_page(struct page *page, int refs, unsigned int flags);
/*
* mm/huge_memory.c
--
2.7.4
* Re: [PATCH V2] mm/gup: Fix longterm pin on slow gup regression
2024-06-28 6:01 [PATCH V2] mm/gup: Fix longterm pin on slow gup regression yangge1116
@ 2024-06-28 20:42 ` Andrew Morton
2024-06-28 21:19 ` Yang Shi
0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2024-06-28 20:42 UTC (permalink / raw)
To: yangge1116
Cc: linux-mm, linux-kernel, stable, 21cnbao, peterx, yang,
baolin.wang, liuzixing
On Fri, 28 Jun 2024 14:01:58 +0800 yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
>
> [...]
>
Well, we also have Yang Shi's patch
(https://lkml.kernel.org/r/20240627231601.1713119-1-yang@os.amperecomputing.com)
which takes a significantly different approach. Which way should we
go?
* Re: [PATCH V2] mm/gup: Fix longterm pin on slow gup regression
2024-06-28 20:42 ` Andrew Morton
@ 2024-06-28 21:19 ` Yang Shi
0 siblings, 0 replies; 3+ messages in thread
From: Yang Shi @ 2024-06-28 21:19 UTC (permalink / raw)
To: Andrew Morton
Cc: yangge1116, linux-mm, linux-kernel, stable, 21cnbao, peterx, yang,
baolin.wang, liuzixing
On Fri, Jun 28, 2024 at 1:42 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 28 Jun 2024 14:01:58 +0800 yangge1116@126.com wrote:
>
> > From: yangge <yangge1116@126.com>
> >
> > [...]
> >
>
> Well, we also have Yang Shi's patch
> (https://lkml.kernel.org/r/20240627231601.1713119-1-yang@os.amperecomputing.com)
> which takes a significantly different approach. Which way should we
> go?
IMO, my patch is more complete and should be the one sent to mainline.
This patch can be considered if it is hard to backport my patch to the
stable tree.
>