[patch 02/15] ksm: reinstate memcg charge on copied pages

Linux kernel -stable discussions
 help / color / mirror / Atom feed

* [patch 02/15] ksm: reinstate memcg charge on copied pages
       [not found] <20200918211925.7e97f0ef63d92f5cfe5ccbc5@linux-foundation.org>
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, stable, torvalds

From: Hugh Dickins <hughd@google.com>
Subject: ksm: reinstate memcg charge on copied pages

Patch series "mm: fixes to past from future testing".

Here's a set of independent fixes against 5.9-rc2: prompted by
testing Alex Shi's "warning on !memcg" and lru_lock series, but
I think fit for 5.9 - though maybe only the first for stable.

This patch (of 5):

In 5.8 some instances of memcg charging in do_swap_page() and unuse_pte()
were removed, on the understanding that swap cache is now already charged
at those points; but a case was missed, when ksm_might_need_to_copy() has
decided it must allocate a substitute page: such pages were never charged.
Fix it inside ksm_might_need_to_copy().

This was discovered by Alex Shi's prospective commit "mm/memcg: warning on
!memcg after readahead page charged".

But there is a another surprise: this also fixes some rarer uncharged
PageAnon cases, when KSM is configured in, but has never been activated. 
ksm_might_need_to_copy()'s anon_vma->root and linear_page_index() check
sometimes catches a case which would need to have been copied if KSM were
turned on.  Or that's my optimistic interpretation (of my own old code),
but it leaves some doubt as to whether everything is working as intended
there - might it hint at rare anon ptes which rmap cannot find?  A
question not easily answered: put in the fix for missed memcg charges.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301343270.5954@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301358020.5954@eggly.anvils
Fixes: 4c6355b25e8b ("mm: memcontrol: charge swapin pages on instantiation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc; Matthew Wilcox <willy@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[5.8]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/ksm.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/ksm.c~ksm-reinstate-memcg-charge-on-copied-pages
+++ a/mm/ksm.c
@@ -2586,6 +2586,10 @@ struct page *ksm_might_need_to_copy(stru
 		return page;		/* let do_swap_page report the error */

 	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (new_page && mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL)) {
+		put_page(new_page);
+		new_page = NULL;
+	}
 	if (new_page) {
 		copy_user_highpage(new_page, page, address, vma);

_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
       [not found] <20200918211925.7e97f0ef63d92f5cfe5ccbc5@linux-foundation.org>
  2020-09-19  4:20 ` [patch 02/15] ksm: reinstate memcg charge on copied pages Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:44   ` Matthew Wilcox
  2020-09-19  4:20 ` [patch 08/15] kprobes: fix kill kprobe which has been marked as gone Andrew Morton
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

From: Hugh Dickins <hughd@google.com>
Subject: shmem: shmem_writepage() split unlikely i915 THP

drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback()
which calls shmem_writepage() from a shrinker: that usually works well
enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been
set to "force" (documented as "Force the huge option on for all - very
useful for testing"), shmem_writepage() is surprised to be called with a
huge page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find
out where the crash happens when CONFIG_DEBUG_VM is off).

LRU page reclaim always splits the shmem huge page first: I'd prefer not
to demand that of i915, so check and split compound in shmem_writepage().

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils
Fixes: 2d6692e642e7 ("drm/i915: Start writeback from the shrinker")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[5.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/mm/shmem.c~shmem-shmem_writepage-split-unlikely-i915-thp
+++ a/mm/shmem.c
@@ -1362,7 +1362,15 @@ static int shmem_writepage(struct page *
 	swp_entry_t swap;
 	pgoff_t index;
 
-	VM_BUG_ON_PAGE(PageCompound(page), page);
+	/*
+	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
+	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
+	 * and its shmem_writeback() needs them to be split when swapping.
+	 */
+	if (PageTransCompound(page))
+		if (split_huge_page(page) < 0)
+			goto redirty;
+
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
 	index = page->index;
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
@ 2020-09-19  4:44   ` Matthew Wilcox
  2020-09-19  5:44     ` Hugh Dickins
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2020-09-19  4:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: alex.shi, cai, hannes, hughd, linux-mm, mhocko, mike.kravetz,
	mm-commits, shakeelb, shy828301, stable, torvalds

On Fri, Sep 18, 2020 at 09:20:09PM -0700, Andrew Morton wrote:
> LRU page reclaim always splits the shmem huge page first: I'd prefer not
> to demand that of i915, so check and split compound in shmem_writepage().

Sorry for not checking this earlier, but I don't think this is right.

        for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
...
                if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
...
                        ret = mapping->a_ops->writepage(page, &wbc);

so we cleared the dirty bit on the entire hugepage, but then split
it after clearing the dirty bit, so the subpages are now not dirty.
I think we'll lose writes as a result?  At least we won't swap pages
out that deserve to be paged out.

>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page))
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +
>  	BUG_ON(!PageLocked(page));
>  	mapping = page->mapping;
>  	index = page->index;
> _
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19  4:44   ` Matthew Wilcox
@ 2020-09-19  5:44     ` Hugh Dickins
  2020-09-19 16:18       ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2020-09-19  5:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 09:20:09PM -0700, Andrew Morton wrote:
> > LRU page reclaim always splits the shmem huge page first: I'd prefer not
> > to demand that of i915, so check and split compound in shmem_writepage().
> 
> Sorry for not checking this earlier, but I don't think this is right.
> 
>         for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
> ...
>                 if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
> ...
>                         ret = mapping->a_ops->writepage(page, &wbc);
> 
> so we cleared the dirty bit on the entire hugepage, but then split
> it after clearing the dirty bit, so the subpages are now not dirty.
> I think we'll lose writes as a result?  At least we won't swap pages
> out that deserve to be paged out.

Very good observation, thank you.

It behaves a lot better with this patch in than without it; but you're
right, only the head will get written to swap, and the tails left in
memory; with dirty cleared, so they may be left indefinitely (I've
not yet looked to see when if ever PageDirty might get set later).

Hmm. It may just be a matter of restyling the i915 code with

		if (!page_mapped(page)) {
			clear_page_dirty_for_io(page);

but I don't want to rush to that conclusion - there might turn
out to be a good way of doing it at the shmem_writepage() end, but
probably only hacks available.  I'll mull it over: it deserves some
thought about what would suit, if a THP arrived here some other way.

We did have some i915 guys Cc'ed on the original posting, but no
need to Cc them again until I'm closer to deciding what's right.

Linus, Andrew, probably best to drop this patch for now, since
no-one else has reported the problem here than me, testing with 
/sys/kernel/mm/transparent_hugepage/shmem_enabled set to "force";
and what it fixes is not a new regression in 5.9.

Though no harm done if the patch goes in: it is an improvement,
but seriously incomplete, as Matthew has observed.

Hugh

> 
> >  
> > -	VM_BUG_ON_PAGE(PageCompound(page), page);
> > +	/*
> > +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> > +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> > +	 * and its shmem_writeback() needs them to be split when swapping.
> > +	 */
> > +	if (PageTransCompound(page))
> > +		if (split_huge_page(page) < 0)
> > +			goto redirty;
> > +
> >  	BUG_ON(!PageLocked(page));
> >  	mapping = page->mapping;
> >  	index = page->index;
> > _
> > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19  5:44     ` Hugh Dickins
@ 2020-09-19 16:18       ` Matthew Wilcox
  2020-09-20  0:16         ` Hugh Dickins
  2020-10-02 18:37         ` Matthew Wilcox
  0 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2020-09-19 16:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, alex.shi, cai, hannes, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> It behaves a lot better with this patch in than without it; but you're
> right, only the head will get written to swap, and the tails left in
> memory; with dirty cleared, so they may be left indefinitely (I've
> not yet looked to see when if ever PageDirty might get set later).
> 
> Hmm. It may just be a matter of restyling the i915 code with
> 
> 		if (!page_mapped(page)) {
> 			clear_page_dirty_for_io(page);
> 
> but I don't want to rush to that conclusion - there might turn
> out to be a good way of doing it at the shmem_writepage() end, but
> probably only hacks available.  I'll mull it over: it deserves some
> thought about what would suit, if a THP arrived here some other way.

I think the ultimate solution is to do as I have done for iomap and make
->writepage handle arbitrary sized pages.  However, I don't know the
swap I/O path particularly well, and I would rather not learn it just yet.

How about this for a band-aid until we sort that out properly?  Just mark
the page as dirty before splitting it so subsequent iterations see the
subpages as dirty.  Arguably, we should use set_page_dirty() instead of
SetPageDirty, but I don't think i915 cares.  In particular, it uses
an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.

diff --git a/mm/shmem.c b/mm/shmem.c
index 271548ca20f3..6231207ab1eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	swp_entry_t swap;
 	pgoff_t index;
 
-	VM_BUG_ON_PAGE(PageCompound(page), page);
 	BUG_ON(!PageLocked(page));
+
+	/*
+	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
+	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
+	 * and its shmem_writeback() needs them to be split when swapping.
+	 */
+	if (PageTransCompound(page)) {
+		/* Ensure the subpages are still dirty */
+		SetPageDirty(page);
+		if (split_huge_page(page) < 0)
+			goto redirty;
+		ClearPageDirty(page);
+	}
+
 	mapping = page->mapping;
 	index = page->index;
 	inode = mapping->host;


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19 16:18       ` Matthew Wilcox
@ 2020-09-20  0:16         ` Hugh Dickins
  2020-09-20  3:32           ` Matthew Wilcox
  2020-10-02 18:37         ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2020-09-20  0:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> > It behaves a lot better with this patch in than without it; but you're
> > right, only the head will get written to swap, and the tails left in
> > memory; with dirty cleared, so they may be left indefinitely (I've
> > not yet looked to see when if ever PageDirty might get set later).
> > 
> > Hmm. It may just be a matter of restyling the i915 code with
> > 
> > 		if (!page_mapped(page)) {
> > 			clear_page_dirty_for_io(page);
> > 
> > but I don't want to rush to that conclusion - there might turn
> > out to be a good way of doing it at the shmem_writepage() end, but
> > probably only hacks available.  I'll mull it over: it deserves some
> > thought about what would suit, if a THP arrived here some other way.
> 
> I think the ultimate solution is to do as I have done for iomap and make
> ->writepage handle arbitrary sized pages.  However, I don't know the
> swap I/O path particularly well, and I would rather not learn it just yet.

Understood ;)

> 
> How about this for a band-aid until we sort that out properly?  Just mark
> the page as dirty before splitting it so subsequent iterations see the
> subpages as dirty.

This is certainly much better than my original, and I've had no problem
in running with it (briefly); and it's what I'll now keep in my testing
tree - thanks.  But it is more obviously a hack, or as you put it, a
band-aid: fit for Linus's tree?

This issue has only come up with i915 and shmem_enabled "force", nobody
has been bleeding except me: but if it's something that impedes or may
impede your own testing, or that you feel should go in anyway, say so and
I'll push your new improved version to akpm (adding your signoff to mine).

> Arguably, we should use set_page_dirty() instead of SetPageDirty,

My position on that has changed down the years: I went through a
phase of replacing shmem.c's SetPageDirtys by set_page_dirtys, with
the idea that its "if (!PageDirty)" is good to avoid setting cacheline
dirty.  Then Spectre changed the balance, so now I'd rather avoid the
indirect function call, and go with your SetPageDirty.  But the bdi
flags are such that they do the same thing, and if they ever diverge,
then we make the necessary changes at that time.

> but I don't think i915 cares.  In particular, it uses
> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.

PAGECACHE_TAG_DIRTY comes with bdi flags that shmem does not use:
with one exception, every shmem page is always dirty (since its swap
is freed as soon as the page is moved from swap cache to file cache);
unless I'm forgetting something (like the tails in my patch!), the
only exception is a page allocated to a hole on fault (after which
a write fault will soon set pte dirty).

So I didn't quite get what i915 was doing with its set_page_dirty()s:
but very probably being a good citizen, marking when it exposed the
page to changes itself, making no assumption of what shmem.c does.

It would be good to have a conversation with i915 guys about huge pages
sometime (they forked off their own mount point in i915_gemfs_init(),
precisely in order to make use of huge pages, but couldn't get them
working, so removed the option to ask for them, but kept the separate
mount point.  But not a conversation I'll be responsive on at present.)

Hugh

> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 271548ca20f3..6231207ab1eb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	swp_entry_t swap;
>  	pgoff_t index;
>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>  	BUG_ON(!PageLocked(page));
> +
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page)) {
> +		/* Ensure the subpages are still dirty */
> +		SetPageDirty(page);
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +		ClearPageDirty(page);
> +	}
> +
>  	mapping = page->mapping;
>  	index = page->index;
>  	inode = mapping->host;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-20  0:16         ` Hugh Dickins
@ 2020-09-20  3:32           ` Matthew Wilcox
  0 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2020-09-20  3:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, alex.shi, cai, hannes, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

On Sat, Sep 19, 2020 at 05:16:57PM -0700, Hugh Dickins wrote:
> On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> > How about this for a band-aid until we sort that out properly?  Just mark
> > the page as dirty before splitting it so subsequent iterations see the
> > subpages as dirty.
> 
> This is certainly much better than my original, and I've had no problem
> in running with it (briefly); and it's what I'll now keep in my testing
> tree - thanks.  But it is more obviously a hack, or as you put it, a
> band-aid: fit for Linus's tree?
> 
> This issue has only come up with i915 and shmem_enabled "force", nobody
> has been bleeding except me: but if it's something that impedes or may
> impede your own testing, or that you feel should go in anyway, say so and
> I'll push your new improved version to akpm (adding your signoff to mine).

No, it's not impeding my testing; I don't have i915 even compiled into
my kernel.

> > Arguably, we should use set_page_dirty() instead of SetPageDirty,
> 
> My position on that has changed down the years: I went through a
> phase of replacing shmem.c's SetPageDirtys by set_page_dirtys, with
> the idea that its "if (!PageDirty)" is good to avoid setting cacheline
> dirty.  Then Spectre changed the balance, so now I'd rather avoid the
> indirect function call, and go with your SetPageDirty.  But the bdi
> flags are such that they do the same thing, and if they ever diverge,
> then we make the necessary changes at that time.

That makes sense.

> > but I don't think i915 cares.  In particular, it uses
> > an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> 
> PAGECACHE_TAG_DIRTY comes with bdi flags that shmem does not use:
> with one exception, every shmem page is always dirty (since its swap
> is freed as soon as the page is moved from swap cache to file cache);
> unless I'm forgetting something (like the tails in my patch!), the
> only exception is a page allocated to a hole on fault (after which
> a write fault will soon set pte dirty).

Ah, of course.  I keep forgetting that shmem doesn't use the
dirty/writeback bits.

> So I didn't quite get what i915 was doing with its set_page_dirty()s:
> but very probably being a good citizen, marking when it exposed the
> page to changes itself, making no assumption of what shmem.c does.
> 
> It would be good to have a conversation with i915 guys about huge pages
> sometime (they forked off their own mount point in i915_gemfs_init(),
> precisely in order to make use of huge pages, but couldn't get them
> working, so removed the option to ask for them, but kept the separate
> mount point.  But not a conversation I'll be responsive on at present.)

Yes, I have a strong feeling the i915 shmem code is cargo-culted.  There
are some very questionable constructs in there.  It's a little unfair to
expect graphics developers to suddenly become experts on the page cache,
so I've taken this as impetus to clean up our APIs a little (eg moving
find_lock_entry() to mm/internal.h)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19 16:18       ` Matthew Wilcox
  2020-09-20  0:16         ` Hugh Dickins
@ 2020-10-02 18:37         ` Matthew Wilcox
  2020-10-09  8:14           ` Huang, Ying
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2020-10-02 18:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Huang Ying, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> > It behaves a lot better with this patch in than without it; but you're
> > right, only the head will get written to swap, and the tails left in
> > memory; with dirty cleared, so they may be left indefinitely (I've
> > not yet looked to see when if ever PageDirty might get set later).
> > 
> > Hmm. It may just be a matter of restyling the i915 code with
> > 
> > 		if (!page_mapped(page)) {
> > 			clear_page_dirty_for_io(page);
> > 
> > but I don't want to rush to that conclusion - there might turn
> > out to be a good way of doing it at the shmem_writepage() end, but
> > probably only hacks available.  I'll mull it over: it deserves some
> > thought about what would suit, if a THP arrived here some other way.
> 
> I think the ultimate solution is to do as I have done for iomap and make
> ->writepage handle arbitrary sized pages.  However, I don't know the
> swap I/O path particularly well, and I would rather not learn it just yet.
> 
> How about this for a band-aid until we sort that out properly?  Just mark
> the page as dirty before splitting it so subsequent iterations see the
> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
> SetPageDirty, but I don't think i915 cares.  In particular, it uses
> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 271548ca20f3..6231207ab1eb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	swp_entry_t swap;
>  	pgoff_t index;
>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>  	BUG_ON(!PageLocked(page));
> +
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page)) {
> +		/* Ensure the subpages are still dirty */
> +		SetPageDirty(page);
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +		ClearPageDirty(page);
> +	}
> +
>  	mapping = page->mapping;
>  	index = page->index;
>  	inode = mapping->host;

It turns out that I have an entirely different reason for wanting
->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
we currently try to split file-backed THPs.  This always fails for XFS
file-backed THPs because they have page_private set which increments
the refcount by 1.  And so we OOM when the page cache is full of XFS
THPs.  I've been running successfully for a few days with this patch:

@@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
                                /* Adding to swap updated mapping */
                                mapping = page_mapping(page);
                        }
-               } else if (unlikely(PageTransHuge(page))) {
-                       /* Split file THP */
-                       if (split_huge_page_to_list(page, page_list))
-                               goto keep_locked;
                }
 
                /*


Kirill points out that this will probably make shmem unhappy (it's
possible that said pages will get split anyway if they're mapped
because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
->writepage().

The patch above is probably not exactly the right solution for this
case, since pageout() calls writepage only once, not once for each
sub-page.  This is hard to write a cute patch for because the
pages get unlocked by split_huge_page().  I think I'm going to have
to learn about the swap path, unless someone can save me from that.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-10-02 18:37         ` Matthew Wilcox
@ 2020-10-09  8:14           ` Huang, Ying
  2020-10-10 15:32             ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2020-10-09  8:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

Hi, Matthew,

Sorry for late reply.  I just come back from a long holiday.

Matthew Wilcox <willy@infradead.org> writes:

> On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
>> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
>> > It behaves a lot better with this patch in than without it; but you're
>> > right, only the head will get written to swap, and the tails left in
>> > memory; with dirty cleared, so they may be left indefinitely (I've
>> > not yet looked to see when if ever PageDirty might get set later).
>> > 
>> > Hmm. It may just be a matter of restyling the i915 code with
>> > 
>> > 		if (!page_mapped(page)) {
>> > 			clear_page_dirty_for_io(page);
>> > 
>> > but I don't want to rush to that conclusion - there might turn
>> > out to be a good way of doing it at the shmem_writepage() end, but
>> > probably only hacks available.  I'll mull it over: it deserves some
>> > thought about what would suit, if a THP arrived here some other way.
>> 
>> I think the ultimate solution is to do as I have done for iomap and make
>> ->writepage handle arbitrary sized pages.  However, I don't know the
>> swap I/O path particularly well, and I would rather not learn it just yet.
>> 
>> How about this for a band-aid until we sort that out properly?  Just mark
>> the page as dirty before splitting it so subsequent iterations see the
>> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
>> SetPageDirty, but I don't think i915 cares.  In particular, it uses
>> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 271548ca20f3..6231207ab1eb 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>>  	swp_entry_t swap;
>>  	pgoff_t index;
>>  
>> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>>  	BUG_ON(!PageLocked(page));
>> +
>> +	/*
>> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
>> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
>> +	 * and its shmem_writeback() needs them to be split when swapping.
>> +	 */
>> +	if (PageTransCompound(page)) {
>> +		/* Ensure the subpages are still dirty */
>> +		SetPageDirty(page);
>> +		if (split_huge_page(page) < 0)
>> +			goto redirty;
>> +		ClearPageDirty(page);
>> +	}
>> +
>>  	mapping = page->mapping;
>>  	index = page->index;
>>  	inode = mapping->host;
>
> It turns out that I have an entirely different reason for wanting
> ->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
> we currently try to split file-backed THPs.  This always fails for XFS
> file-backed THPs because they have page_private set which increments
> the refcount by 1.  And so we OOM when the page cache is full of XFS
> THPs.  I've been running successfully for a few days with this patch:
>
> @@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>                                 /* Adding to swap updated mapping */
>                                 mapping = page_mapping(page);
>                         }
> -               } else if (unlikely(PageTransHuge(page))) {
> -                       /* Split file THP */
> -                       if (split_huge_page_to_list(page, page_list))
> -                               goto keep_locked;
>                 }
>  
>                 /*
>
>
> Kirill points out that this will probably make shmem unhappy (it's
> possible that said pages will get split anyway if they're mapped
> because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
> they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
> ->writepage().

We may distinguish the shmem THPs from the XFS file cache THPs via
PageSwapBacked()?

Best Regards,
Huang, Ying

> The patch above is probably not exactly the right solution for this
> case, since pageout() calls writepage only once, not once for each
> sub-page.  This is hard to write a cute patch for because the
> pages get unlocked by split_huge_page().  I think I'm going to have
> to learn about the swap path, unless someone can save me from that.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-10-09  8:14           ` Huang, Ying
@ 2020-10-10 15:32             ` Matthew Wilcox
  2020-10-12  2:01               ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2020-10-10 15:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

On Fri, Oct 09, 2020 at 04:14:25PM +0800, Huang, Ying wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
> >> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> >> > It behaves a lot better with this patch in than without it; but you're
> >> > right, only the head will get written to swap, and the tails left in
> >> > memory; with dirty cleared, so they may be left indefinitely (I've
> >> > not yet looked to see when if ever PageDirty might get set later).
> >> > 
> >> > Hmm. It may just be a matter of restyling the i915 code with
> >> > 
> >> > 		if (!page_mapped(page)) {
> >> > 			clear_page_dirty_for_io(page);
> >> > 
> >> > but I don't want to rush to that conclusion - there might turn
> >> > out to be a good way of doing it at the shmem_writepage() end, but
> >> > probably only hacks available.  I'll mull it over: it deserves some
> >> > thought about what would suit, if a THP arrived here some other way.
> >> 
> >> I think the ultimate solution is to do as I have done for iomap and make
> >> ->writepage handle arbitrary sized pages.  However, I don't know the
> >> swap I/O path particularly well, and I would rather not learn it just yet.
> >> 
> >> How about this for a band-aid until we sort that out properly?  Just mark
> >> the page as dirty before splitting it so subsequent iterations see the
> >> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
> >> SetPageDirty, but I don't think i915 cares.  In particular, it uses
> >> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> >> 
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 271548ca20f3..6231207ab1eb 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> >>  	swp_entry_t swap;
> >>  	pgoff_t index;
> >>  
> >> -	VM_BUG_ON_PAGE(PageCompound(page), page);
> >>  	BUG_ON(!PageLocked(page));
> >> +
> >> +	/*
> >> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> >> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> >> +	 * and its shmem_writeback() needs them to be split when swapping.
> >> +	 */
> >> +	if (PageTransCompound(page)) {
> >> +		/* Ensure the subpages are still dirty */
> >> +		SetPageDirty(page);
> >> +		if (split_huge_page(page) < 0)
> >> +			goto redirty;
> >> +		ClearPageDirty(page);
> >> +	}
> >> +
> >>  	mapping = page->mapping;
> >>  	index = page->index;
> >>  	inode = mapping->host;
> >
> > It turns out that I have an entirely different reason for wanting
> > ->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
> > we currently try to split file-backed THPs.  This always fails for XFS
> > file-backed THPs because they have page_private set which increments
> > the refcount by 1.  And so we OOM when the page cache is full of XFS
> > THPs.  I've been running successfully for a few days with this patch:
> >
> > @@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> >                                 /* Adding to swap updated mapping */
> >                                 mapping = page_mapping(page);
> >                         }
> > -               } else if (unlikely(PageTransHuge(page))) {
> > -                       /* Split file THP */
> > -                       if (split_huge_page_to_list(page, page_list))
> > -                               goto keep_locked;
> >                 }
> >  
> >                 /*
> >
> >
> > Kirill points out that this will probably make shmem unhappy (it's
> > possible that said pages will get split anyway if they're mapped
> > because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
> > they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
> > ->writepage().
> 
> We may distinguish the shmem THPs from the XFS file cache THPs via
> PageSwapBacked()?

Yes, we _can_, but we now have two reasons for wanting to be able to write
THPs to swap without splitting them.  Another thing this solves is that,
in my tree, we don't allocate the bottom layer of the XArray for THPs.
So when we split, we have to allocate memory to store the split pages,
and it seems like a bad idea to allocate memory in order to free memory,
particularly when we don't have to.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-10-10 15:32             ` Matthew Wilcox
@ 2020-10-12  2:01               ` Huang, Ying
  0 siblings, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2020-10-12  2:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Oct 09, 2020 at 04:14:25PM +0800, Huang, Ying wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> > On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
>> >> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
>> >> > It behaves a lot better with this patch in than without it; but you're
>> >> > right, only the head will get written to swap, and the tails left in
>> >> > memory; with dirty cleared, so they may be left indefinitely (I've
>> >> > not yet looked to see when if ever PageDirty might get set later).
>> >> > 
>> >> > Hmm. It may just be a matter of restyling the i915 code with
>> >> > 
>> >> > 		if (!page_mapped(page)) {
>> >> > 			clear_page_dirty_for_io(page);
>> >> > 
>> >> > but I don't want to rush to that conclusion - there might turn
>> >> > out to be a good way of doing it at the shmem_writepage() end, but
>> >> > probably only hacks available.  I'll mull it over: it deserves some
>> >> > thought about what would suit, if a THP arrived here some other way.
>> >> 
>> >> I think the ultimate solution is to do as I have done for iomap and make
>> >> ->writepage handle arbitrary sized pages.  However, I don't know the
>> >> swap I/O path particularly well, and I would rather not learn it just yet.
>> >> 
>> >> How about this for a band-aid until we sort that out properly?  Just mark
>> >> the page as dirty before splitting it so subsequent iterations see the
>> >> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
>> >> SetPageDirty, but I don't think i915 cares.  In particular, it uses
>> >> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
>> >> 
>> >> diff --git a/mm/shmem.c b/mm/shmem.c
>> >> index 271548ca20f3..6231207ab1eb 100644
>> >> --- a/mm/shmem.c
>> >> +++ b/mm/shmem.c
>> >> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>> >>  	swp_entry_t swap;
>> >>  	pgoff_t index;
>> >>  
>> >> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>> >>  	BUG_ON(!PageLocked(page));
>> >> +
>> >> +	/*
>> >> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
>> >> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
>> >> +	 * and its shmem_writeback() needs them to be split when swapping.
>> >> +	 */
>> >> +	if (PageTransCompound(page)) {
>> >> +		/* Ensure the subpages are still dirty */
>> >> +		SetPageDirty(page);
>> >> +		if (split_huge_page(page) < 0)
>> >> +			goto redirty;
>> >> +		ClearPageDirty(page);
>> >> +	}
>> >> +
>> >>  	mapping = page->mapping;
>> >>  	index = page->index;
>> >>  	inode = mapping->host;
>> >
>> > It turns out that I have an entirely different reason for wanting
>> > ->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
>> > we currently try to split file-backed THPs.  This always fails for XFS
>> > file-backed THPs because they have page_private set which increments
>> > the refcount by 1.  And so we OOM when the page cache is full of XFS
>> > THPs.  I've been running successfully for a few days with this patch:
>> >
>> > @@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>> >                                 /* Adding to swap updated mapping */
>> >                                 mapping = page_mapping(page);
>> >                         }
>> > -               } else if (unlikely(PageTransHuge(page))) {
>> > -                       /* Split file THP */
>> > -                       if (split_huge_page_to_list(page, page_list))
>> > -                               goto keep_locked;
>> >                 }
>> >  
>> >                 /*
>> >
>> >
>> > Kirill points out that this will probably make shmem unhappy (it's
>> > possible that said pages will get split anyway if they're mapped
>> > because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
>> > they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
>> > ->writepage().
>> 
>> We may distinguish the shmem THPs from the XFS file cache THPs via
>> PageSwapBacked()?
>
> Yes, we _can_, but we now have two reasons for wanting to be able to write
> THPs to swap without splitting them.  Another thing this solves is that,
> in my tree, we don't allocate the bottom layer of the XArray for THPs.
> So when we split, we have to allocate memory to store the split pages,
> and it seems like a bad idea to allocate memory in order to free memory,
> particularly when we don't have to.

I am afraid we cannot avoid to allocate memory during swapping.  Because
the anonymous page (strictly PageSwapBacked()) will be put in swap cache
(a special page cache) during page reclaiming.  This means we need to
allocate XArray node for them.

And, we cannot guarantee there are always large continuous free space
available in the swap devices to accommodate the THP as a whole, so we
need to be prepared to split the THP anyway.

Things may be different for the page cache.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 08/15] kprobes: fix kill kprobe which has been marked as gone
       [not found] <20200918211925.7e97f0ef63d92f5cfe5ccbc5@linux-foundation.org>
  2020-09-19  4:20 ` [patch 02/15] ksm: reinstate memcg charge on copied pages Andrew Morton
  2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD Andrew Morton
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, anil.s.keshavamurthy, davem, linux-mm, mhiramat, mm-commits,
	naveen.n.rao, rostedt, songliubraving, songmuchun, stable,
	torvalds, zhouchengming

From: Muchun Song <songmuchun@bytedance.com>
Subject: kprobes: fix kill kprobe which has been marked as gone

If a kprobe is marked as gone, we should not kill it again.  Otherwise, we
can disarm the kprobe more than once.  In that case, the statistics of
kprobe_ftrace_enabled can unbalance which can lead to that kprobe do not
work.

Link: https://lkml.kernel.org/r/20200822030055.32383-1-songmuchun@bytedance.com
Fixes: e8386a0cb22f ("kprobes: support probing module __exit function")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Song Liu <songliubraving@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/kprobes.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/kernel/kprobes.c~kprobes-fix-kill-kprobe-which-has-been-marked-as-gone
+++ a/kernel/kprobes.c
@@ -2140,6 +2140,9 @@ static void kill_kprobe(struct kprobe *p
 
 	lockdep_assert_held(&kprobe_mutex);
 
+	if (WARN_ON_ONCE(kprobe_gone(p)))
+		return;
+
 	p->flags |= KPROBE_FLAG_GONE;
 	if (kprobe_aggrprobe(p)) {
 		/*
@@ -2419,7 +2422,10 @@ static int kprobes_module_callback(struc
 	mutex_lock(&kprobe_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
-		hlist_for_each_entry(p, head, hlist)
+		hlist_for_each_entry(p, head, hlist) {
+			if (kprobe_gone(p))
+				continue;
+
 			if (within_module_init((unsigned long)p->addr, mod) ||
 			    (checkcore &&
 			     within_module_core((unsigned long)p->addr, mod))) {
@@ -2436,6 +2442,7 @@ static int kprobes_module_callback(struc
 				 */
 				kill_kprobe(p);
 			}
+		}
 	}
 	if (val == MODULE_STATE_GOING)
 		remove_module_kprobe_blacklist(mod);
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD
       [not found] <20200918211925.7e97f0ef63d92f5cfe5ccbc5@linux-foundation.org>
                   ` (2 preceding siblings ...)
  2020-09-19  4:20 ` [patch 08/15] kprobes: fix kill kprobe which has been marked as gone Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 10/15] selftests/vm: fix display of page size in map_hugetlb Andrew Morton
  2020-09-19  4:20 ` [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline Andrew Morton
  5 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, apopple, bharata, bskeggs, hch, jgg, jglisse, jhubbard,
	linux-mm, mm-commits, rcampbell, shuah, shy828301, stable,
	torvalds, ziy

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/thp: fix __split_huge_pmd_locked() for migration PMD

A migrating transparent huge page has to already be unmapped.  Otherwise,
the page could be modified while it is being copied to a new page and data
could be lost.  The function __split_huge_pmd() checks for a PMD migration
entry before calling __split_huge_pmd_locked() leading one to think that
__split_huge_pmd_locked() can handle splitting a migrating PMD.

However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.

Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).  Fix these problems by
checking for a PMD migration entry.

Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |   42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

--- a/mm/huge_memory.c~mm-thp-fix-__split_huge_pmd_locked-for-migration-pmd
+++ a/mm/huge_memory.c
@@ -2022,7 +2022,7 @@ static void __split_huge_pmd_locked(stru
 		put_page(page);
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
-	} else if (is_huge_zero_pmd(*pmd)) {
+	} else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
 		/*
 		 * FIXME: Do we want to invalidate secondary mmu by calling
 		 * mmu_notifier_invalidate_range() see comments below inside
@@ -2116,30 +2116,34 @@ static void __split_huge_pmd_locked(stru
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
-		atomic_inc(&page[i]._mapcount);
-		pte_unmap(pte);
-	}
-
-	/*
-	 * Set PG_double_map before dropping compound_mapcount to avoid
-	 * false-negative page_mapped().
-	 */
-	if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
-		for (i = 0; i < HPAGE_PMD_NR; i++)
+		if (!pmd_migration)
 			atomic_inc(&page[i]._mapcount);
+		pte_unmap(pte);
 	}
 
-	lock_page_memcg(page);
-	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
-		/* Last compound_mapcount is gone. */
-		__dec_lruvec_page_state(page, NR_ANON_THPS);
-		if (TestClearPageDoubleMap(page)) {
-			/* No need in mapcount reference anymore */
+	if (!pmd_migration) {
+		/*
+		 * Set PG_double_map before dropping compound_mapcount to avoid
+		 * false-negative page_mapped().
+		 */
+		if (compound_mapcount(page) > 1 &&
+		    !TestSetPageDoubleMap(page)) {
 			for (i = 0; i < HPAGE_PMD_NR; i++)
-				atomic_dec(&page[i]._mapcount);
+				atomic_inc(&page[i]._mapcount);
+		}
+
+		lock_page_memcg(page);
+		if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+			/* Last compound_mapcount is gone. */
+			__dec_lruvec_page_state(page, NR_ANON_THPS);
+			if (TestClearPageDoubleMap(page)) {
+				/* No need in mapcount reference anymore */
+				for (i = 0; i < HPAGE_PMD_NR; i++)
+					atomic_dec(&page[i]._mapcount);
+			}
 		}
+		unlock_page_memcg(page);
 	}
-	unlock_page_memcg(page);
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 10/15] selftests/vm: fix display of page size in map_hugetlb
       [not found] <20200918211925.7e97f0ef63d92f5cfe5ccbc5@linux-foundation.org>
                   ` (3 preceding siblings ...)
  2020-09-19  4:20 ` [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline Andrew Morton
  5 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, christophe.leroy, linux-mm, mm-commits, stable, torvalds

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: selftests/vm: fix display of page size in map_hugetlb

The displayed size is in bytes while the text says it is in kB.

Shift it by 10 to really display kBytes.

Link: https://lkml.kernel.org/r/e27481224564a93d14106e750de31189deaa8bc8.1598861977.git.christophe.leroy@csgroup.eu
Fixes: fa7b9a805c79 ("tools/selftest/vm: allow choosing mem size and page size in map_hugetlb")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/map_hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/map_hugetlb.c~selftests-vm-fix-display-of-page-size-in-map_hugetlb
+++ a/tools/testing/selftests/vm/map_hugetlb.c
@@ -83,7 +83,7 @@ int main(int argc, char **argv)
 	}
 
 	if (shift)
-		printf("%u kB hugepages\n", 1 << shift);
+		printf("%u kB hugepages\n", 1 << (shift - 10));
 	else
 		printf("Default size hugepages\n");
 	printf("Mapping %lu Mbytes\n", (unsigned long)length >> 20);
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline
       [not found] <20200918211925.7e97f0ef63d92f5cfe5ccbc5@linux-foundation.org>
                   ` (4 preceding siblings ...)
  2020-09-19  4:20 ` [patch 10/15] selftests/vm: fix display of page size in map_hugetlb Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  5 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, david, linux-mm, mhocko, mm-commits, osalvador,
	pasha.tatashin, richard.weiyang, rientjes, stable, torvalds,
	vbabka

From: Pavel Tatashin <pasha.tatashin@soleen.com>
Subject: mm/memory_hotplug: drain per-cpu pages again during memory offline

There is a race during page offline that can lead to infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying infinitely or until a termination signal is received.

Thread#1 - a new process:

load_elf_binary
 begin_new_exec
  exec_mmap
   mmput
    exit_mmap
     tlb_finish_mmu
      tlb_flush_mmu
       release_pages
        free_unref_page_list
         free_unref_page_prepare
          set_pcppage_migratetype(page, migratetype);
             // Set page->index migration type below  MIGRATE_PCPTYPES

Thread#2 - hot-removes memory
__offline_pages
  start_isolate_page_range
    set_migratetype_isolate
      set_pageblock_migratetype(page, MIGRATE_ISOLATE);
        Set migration type to MIGRATE_ISOLATE-> set
        drain_all_pages(zone);
             // drain per-cpu page lists to buddy allocator.

Thread#1 - continue
         free_unref_page_commit
           migratetype = get_pcppage_migratetype(page);
              // get old migration type
           list_add(&page->lru, &pcp->lists[migratetype]);
              // add new page to already drained pcp list

Thread#2
Never drains pcp again, and therefore gets stuck in the loop.

The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.

Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   14 ++++++++++++++
 mm/page_isolation.c |    8 ++++++++
 2 files changed, 22 insertions(+)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-drain-per-cpu-pages-again-during-memory-offline
+++ a/mm/memory_hotplug.c
@@ -1575,6 +1575,20 @@ static int __ref __offline_pages(unsigne
 		/* check again */
 		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
 					    NULL, check_pages_isolated_cb);
+		/*
+		 * per-cpu pages are drained in start_isolate_page_range, but if
+		 * there are still pages that are not free, make sure that we
+		 * drain again, because when we isolated range we might
+		 * have raced with another thread that was adding pages to pcp
+		 * list.
+		 *
+		 * Forward progress should be still guaranteed because
+		 * pages on the pcp list can only belong to MOVABLE_ZONE
+		 * because has_unmovable_pages explicitly checks for
+		 * PageBuddy on freed pages on other zones.
+		 */
+		if (ret)
+			drain_all_pages(zone);
 	} while (ret);
 
 	/* Ok, all of our target is isolated.
--- a/mm/page_isolation.c~mm-memory_hotplug-drain-per-cpu-pages-again-during-memory-offline
+++ a/mm/page_isolation.c
@@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, un
  * pageblocks we may have modified and return -EBUSY to caller. This
  * prevents two threads from simultaneously working on overlapping ranges.
  *
+ * Please note that there is no strong synchronization with the page allocator
+ * either. Pages might be freed while their page blocks are marked ISOLATED.
+ * In some cases pages might still end up on pcp lists and that would allow
+ * for their allocation even when they are in fact isolated already. Depending
+ * on how strong of a guarantee the caller needs drain_all_pages might be needed
+ * (e.g. __offline_pages will need to call it after check for isolated range for
+ * a next retry).
+ *
  * Return: the number of isolated pageblocks on success and -EBUSY if any part
  * of range cannot be isolated.
  */
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-10-12  2:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20200918211925.7e97f0ef63d92f5cfe5ccbc5@linux-foundation.org>
2020-09-19  4:20 ` [patch 02/15] ksm: reinstate memcg charge on copied pages Andrew Morton
2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
2020-09-19  4:44   ` Matthew Wilcox
2020-09-19  5:44     ` Hugh Dickins
2020-09-19 16:18       ` Matthew Wilcox
2020-09-20  0:16         ` Hugh Dickins
2020-09-20  3:32           ` Matthew Wilcox
2020-10-02 18:37         ` Matthew Wilcox
2020-10-09  8:14           ` Huang, Ying
2020-10-10 15:32             ` Matthew Wilcox
2020-10-12  2:01               ` Huang, Ying
2020-09-19  4:20 ` [patch 08/15] kprobes: fix kill kprobe which has been marked as gone Andrew Morton
2020-09-19  4:20 ` [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD Andrew Morton
2020-09-19  4:20 ` [patch 10/15] selftests/vm: fix display of page size in map_hugetlb Andrew Morton
2020-09-19  4:20 ` [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox