* mm: delay rmap removal until after TLB flush
@ 2022-10-31 18:43 ` Linus Torvalds

From: Linus Torvalds @ 2022-10-31 18:43 UTC
To: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
    Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle
Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
    Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
    Kirill A. Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch

Updated subject line, and here's the link to the original discussion
for new people:

  https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/

On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok. At that point we no longer have the pte or the virtual address, so
> it's not going to be exactly the same debug output.
>
> But I think it ends up being fairly natural to do
>
>         VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>
> instead, and I've fixed that last patch up to do that.

Ok, so I've got a fixed set of patches based on the feedback from
PeterZ, and also tried to do the s390 updates for this blindly, and
pushed them out into a git branch:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix

If people really want to see the patches in email again, I can do
that, but most of you already have, and the changes are either trivial
fixes or the s390 updates.

For the s390 people that I've now added to the participant list maybe
the git tree is fine - and the fundamental explanation of the problem
is in that top-most commit (with the three preceding commits being
prep-work). Or that link to the thread about this all.

That top-most commit is also where I tried to fix things up for s390
that uses its own non-gathering TLB flush due to
CONFIG_MMU_GATHER_NO_GATHER.

NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
rebased etc for further comments and fixes. So don't consider that
stable, it's still more of an RFC branch.

At a minimum I'll update it with Ack's etc, assuming I get those, and
my s390 changes are entirely untested and probably won't work.

As far as I can tell, s390 doesn't actually *have* the problem that
causes this change, because of its synchronous TLB flush, but it
obviously needs to deal with the change of rmap zapping logic.

Also added a few people who are explicitly listed as being mmu_gather
maintainers. Maybe people saw the discussion on the linux-mm list, but
let's make it explicit.

Do people have any objections to this approach, or other suggestions?

I do *not* consider this critical, so it's a "queue for 6.2" issue for me.

It probably makes most sense to queue in the -MM tree (after the thing
is acked and people agree), but I can keep that branch alive too and
just deal with it all myself as well.

Anybody?

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02  9:14 ` Christian Borntraeger

From: Christian Borntraeger @ 2022-11-02  9:14 UTC

On 31.10.22 at 19:43, Linus Torvalds wrote:
> [...]
>
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
>
> [...]
>
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.

Adding Gerald.

> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
>
> [...]
>
> Anybody?
>
> Linus

It certainly needs a build fix for s390:

  In file included from kernel/sched/core.c:78:
  ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
  ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
     50 |                 page_zap_pte_rmap(page);
        |                 ^~~~~~~~~~~~~~~~~
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02  9:23 ` Christian Borntraeger

From: Christian Borntraeger @ 2022-11-02  9:23 UTC

On 02.11.22 at 10:14, Christian Borntraeger wrote:
> On 31.10.22 at 19:43, Linus Torvalds wrote:
>> [...]
>>
>> For the s390 people that I've now added to the participant list maybe
>> the git tree is fine - and the fundamental explanation of the problem
>> is in that top-most commit (with the three preceding commits being
>> prep-work). Or that link to the thread about this all.
>
> Adding Gerald.

now the correct Gerald....

> [...]
>
> It certainly needs a build fix for s390:
>
>   In file included from kernel/sched/core.c:78:
>   ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
>   ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>      50 |                 page_zap_pte_rmap(page);
>         |                 ^~~~~~~~~~~~~~~~~
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 17:55 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-02 17:55 UTC

On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> It certainly needs a build fix for s390:
>
>   In file included from kernel/sched/core.c:78:
>   ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
>   ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>      50 |                 page_zap_pte_rmap(page);
>         |                 ^~~~~~~~~~~~~~~~~

Hmm. I'm not sure if I can add a

    #include <linux/rmap.h>

to that s390 asm header file without causing more issues.

The minimal damage would probably be to duplicate the declaration of
page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.

Not pretty to have two different declarations of that thing, but
anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
is much of mm) would then verify the consistency of them.

So I'll do that minimal fix and update that branch, but if s390 people
end up having a better fix, please holler.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 18:28 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-02 18:28 UTC

On Wed, Nov 2, 2022 at 10:55 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

I've updated the branch with that, so hopefully s390 builds now.

I also fixed a typo in the commit message and added Peter's ack. Other
than that it's all the same as it was before.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 22:29 ` Gerald Schaefer

From: Gerald Schaefer @ 2022-11-02 22:29 UTC

On Wed, 2 Nov 2022 10:55:10 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
>
> The minimal damage would probably be to duplicate the declaration of
> page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.
>
> Not pretty to have two different declarations of that thing, but
> anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
> is much of mm) would then verify the consistency of them.
>
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

It compiles now with your duplicate declaration, but adding the
#include also did not cause any damage, so that should also be OK.
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 12:45 ` Peter Zijlstra

From: Peter Zijlstra @ 2022-11-02 12:45 UTC

On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote:

>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 22:31 ` Gerald Schaefer

From: Gerald Schaefer @ 2022-11-02 22:31 UTC

On Mon, 31 Oct 2022 11:43:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
>
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
>
> [...]
>
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.

Correct, we need to flush already when we change a PTE, which is done
in ptep_get_and_clear() etc. Only exception would be lazy flushing when
only one active thread is attached, then we would flush later in
flush_tlb_mm/range(), or as soon as another thread is attached (IIRC).

So it seems straight forward to just call page_zap_pte_rmap() from our
private __tlb_remove_page_size() implementation.

Just wondering a bit why you did not also add the
VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like in the
generic change.

Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 23:13 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-02 23:13 UTC

On Wed, Nov 2, 2022 at 3:31 PM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> Just wondering a bit why you did not also add the
> VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like
> in the generic change.

Heh, I had considered dropping it entirely even from the generic code,
since I don't remember seeing that ever trigger, but PeterZ convinced
me otherwise.

For the s390 side I really wanted to keep things minimal since I
(obviously) didn't even build-test it, so..

I'm perfectly happy with s390 people adding it later, of course.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03  9:52 ` David Hildenbrand

From: David Hildenbrand @ 2022-11-03  9:52 UTC

On 31.10.22 19:43, Linus Torvalds wrote:
> [...]
>
> Do people have any objections to this approach, or other suggestions?
>
> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
>
> It probably makes most sense to queue in the -MM tree (after the thing
> is acked and people agree), but I can keep that branch alive too and
> just deal with it all myself as well.
>
> Anybody?

Happy to see that we're still decrementing the mapcount before
decrementing the refcount, I was briefly concerned.

I was not able to come up quickly with something that would be
fundamentally wrong here, but the devil is in the detail.

Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is
unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON
instead of VM_BUG_ON).

I agree that 6.2 is good enough and that upstreaming this via the -MM
tree would be a good way to move forward.

-- 
Thanks,

David / dhildenb
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03 16:54 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-03 16:54 UTC

On Thu, Nov 3, 2022 at 2:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> Happy to see that we're still decrementing the mapcount before
> decrementing the refcount, I was briefly concerned.

Oh, that would have been horribly wrong.

> I was not able to come up quickly with something that would be
> fundamentally wrong here, but the devil is in the detail.

So I tried to be very careful.

The biggest change in the whole series (visible in the last patch, but
there in the prep-patches too) is how it narrows down some lock
coverage.

Now, that locking didn't *do* anything valid, but I did try to point
it out when it happens - how first the mapcount is decremented outside
the memcg lock (in preparatory patches), and then later on the
__dec_lruvec_page_state() turns into a dec_lruvec_page_state() because
it's then done outside the page table lock.

The locking in the second case did do something - not locking-wise,
but simply the "running under the spinlock means we are not
preemptable without RT". And in the memcg case it was just plain
overly wide lock regions.

None of the other changes should have any real semantic meaning
*apart* from just keeping ->_mapcount elevated slightly longer.

> Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is
> unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON
> instead of VM_BUG_ON).

That VM_BUG_ON() is a case of "what to do if this ever triggers?"

So a WARN_ON() would be fatal too, it's some seriously bogus stuff to
try to put bits in that won't fit. It probably should be removed,
since the value should always be pretty much a simple constant. It was
more of a "let's be careful with new code, not for production".
Probably like pretty much all VM_BUG_ON's are - test code that just
got left around.

I considered just getting rid of ENCODE_PAGE_BITS entirely, since
there is only one bit. But it was always "let's keep the option open
for dirty bits etc", so I kept it, but I agree that the name isn't
wonderful.

And in fact I wanted the encoding to really be done by the caller (so
that TLB_ZAP_RMAP wouldn't be a new argument, but the 'page' argument
to __tlb_remove_page_*() would simply be an 'encoded page' pointer),
but that would have caused the patch to be much bigger (and expanded
the s390 side too). Which I didn't want to do.

Long-term that's probably still the right thing to do, including
passing the encoded pointers all the way to
free_pages_and_swap_cache(). Because it's pretty disgusting how it
cleans up that array in place and then does that cast to a new array
type, but it is also disgusting how it traverses that array multiple
times (ie free_pages_and_swap_cache() will just have *another* loop).

But again, those changes would have made the patch bigger, which I
didn't want at this point (and 'release_pages()' would need that
clean-in-place anyway, unless we changed *that* too and made the whole
page encoding be something widely available).

That's also why I then went with that fairly generic
"ENCODE_PAGE_BITS" name. The *use* of it right now is very very
specific to just the TLB gather, and the TLB_ZAP_RMAP bit shows that
in the name. But then I went for a fairly generic "encode extra bits
in the page pointer" name because it felt like it might expand beyond
the initial intentionally small patch in the long run.

So it's a combination of "we might want to expand on this in the
future" and yet also "I really want to keep the changes localized in
this patch". And the two are kind of inverses of each other, which
hopefully explains the seemingly disparate naming logic.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03 17:09 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-03 17:09 UTC

On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But again, those changes would have made the patch bigger, which I
> didn't want at this point (and 'release_pages()' would need that
> clean-in-place anyway, unless we changed *that* too and made the whole
> page encoding be something widely available).

And just to clarify: this is not just me trying to expand the reach of
my patch.

I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE
and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits
that TLB_ZAP_RMAP is.

Except the mlock code does *not* show that in the type system, and
instead just passes a "struct page **" array around in pvec->pages,
and then you'd just better know that "oh, it's not *really* just a
page pointer".

So I really think that the "array of encoded page pointers" thing is a
generic notion that we *already* have.

It's just that we've done it disgustingly in the past, and I didn't
want to do that disgusting thing again.

So I would hope that the mlock code would some day use the same page
pointer encoding logic to actually make the whole "this is not a page
pointer that you can use directly, it has low bits set for flags" very
explicit.

I am *not* sure if then the actual encoded bits would be unified.
Probably not - you might have very different and distinct uses of the
encode_page() thing where the bits mean different things in different
contexts.

Anyway, this is me just explaining the thinking behind it all. The
page bit encoding is a very generic thing (well, "very generic" in
this case means "has at least one other independent user"), explaining
the very generic naming.

But at the same time, the particular _patch_ was meant to be very targeted.

So slightly schizophrenic name choices as a result.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03 17:36 ` David Hildenbrand

From: David Hildenbrand @ 2022-11-03 17:36 UTC

On 03.11.22 18:09, Linus Torvalds wrote:
> [...]
>
> Anyway, this is me just explaining the thinking behind it all. The
> page bit encoding is a very generic thing (well, "very generic" in
> this case means "has at least one other independent user"), explaining
> the very generic naming.
>
> But at the same time, the particular _patch_ was meant to be very targeted.
>
> So slightly schizophrenic name choices as a result.

Thanks for the explanation. I brought it up because the generic name
somehow felt weird in include/asm-generic/tlb.h. Skimming over the
code I'd have expected something like TLB_ENCODE_PAGE_BITS, so making
the "very generic" things "very specific" as long as it lives in
tlb.h :)

-- 
Thanks,

David / dhildenb
* Re: mm: delay rmap removal until after TLB flush 2022-10-31 18:43 ` mm: delay rmap removal until after TLB flush Linus Torvalds ` (3 preceding siblings ...) 2022-11-03 9:52 ` David Hildenbrand @ 2022-11-04 6:33 ` Alexander Gordeev 2022-11-04 17:35 ` Linus Torvalds 4 siblings, 1 reply; 30+ messages in thread From: Alexander Gordeev @ 2022-11-04 6:33 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote: [...] > If people really want to see the patches in email again, I can do > that, but most of you already have, and the changes are either trivial > fixes or the s390 updates. > > For the s390 people that I've now added to the participant list maybe > the git tree is fine - and the fundamental explanation of the problem > is in that top-most commit (with the three preceding commits being > prep-work). Or that link to the thread about this all. I have a question about the generic part (had to master the code quoting).

> static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> {
>	for (unsigned int i = 0; i < nr; i++) {
>		struct encoded_page *encoded = pages[i];
>		unsigned int flags = encoded_page_flags(encoded);
>		if (flags) {
>			/* Clean the flagged pointer in-place */
>			struct page *page = encoded_page_ptr(encoded);
>			pages[i] = encode_page(page, 0);
>
>			/* The flag bit being set means that we should zap the rmap */

Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390 version? (I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling page_zap_pte_rmap() without such a check would be a bug).

>			page_zap_pte_rmap(page);
>			VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>		}
>	}
>
>	/*
>	 * Now all entries have been un-encoded, and changed to plain
>	 * page pointers, so we can cast the 'encoded_page' array to
>	 * a plain page array and free them
>	 */
>	free_pages_and_swap_cache((struct page **)pages, nr);
> }

Thanks! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-04 6:33 ` Alexander Gordeev @ 2022-11-04 17:35 ` Linus Torvalds 2022-11-06 21:06 ` Hugh Dickins 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-04 17:35 UTC (permalink / raw) To: Alexander Gordeev Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev <agordeev@linux.ibm.com> wrote: > > I have a question about the generic part (had to master the code quoting). Sure. Although now I think the series is in Andrew's -mm tree, or just about to get moved in there, so I'm not going to touch my actual branch any more.

> > static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> > {
> >	for (unsigned int i = 0; i < nr; i++) {
> >		struct encoded_page *encoded = pages[i];
> >		unsigned int flags = encoded_page_flags(encoded);
> >		if (flags) {
> >			/* Clean the flagged pointer in-place */
> >			struct page *page = encoded_page_ptr(encoded);
> >			pages[i] = encode_page(page, 0);
> >
> >			/* The flag bit being set means that we should zap the rmap */
>
> Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390 version?
> (I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling
> page_zap_pte_rmap() without such a check would be a bug).

No major reason. This is basically the same issue as the naming, which I touched on in https://lore.kernel.org/all/CAHk-=wiDg_1up8K4PhK4+kzPN7xJG297=nw+tvgrGn7aVgZdqw@mail.gmail.com/ and the follow-up note about how I hope the "encoded page pointer with flags" thing gets used by the mlock code some day too. 
IOW, there's kind of a generic "I have extra flags associated with the pointer", and then the specific "this case uses this flag", and depending on which mindset you have at the time, you might do one or the other. So in that clean_and_free_pages_and_swap_cache(), the code basically knows "I have a pointer with extra flags", and it's written that way. And that's partly historical, because it actually started with the original code tracking the dirty bit as the extra piece of information, and then transformed into this "no, the information is TLB_ZAP_RMAP". So "unsigned int flags" at one point was "bool dirty" instead, but then became more of a "I think this whole page pointer with flags is general", and the naming changed, and I had both cases in mind, and then the code is perhaps not so specifically named. I'm not sure the zap_page_range() case will ever use more than one flag, but the mlock case already has two different flags. So the "encode_page" thing is very much written to be about more than just the zap_page_range() case. But yes, that code could (and maybe should) use "if (flags & TLB_ZAP_RMAP)" to make it clear that in this case, the single flags bit is that one bit. But the "if ()" there actually does two conceptually *separate* things: it needs to clean the pointer in-place (which happens regardless of "which" flag bit is set), and then it does that page_zap_pte_rmap(), which is just for the TLB_ZAP_RMAP bit. So to be really explicit about it, you'd have two different tests: one for "do I have flags that need to be cleaned up" and then an inner test for each flag. And since there is only one flag in this particular use case, it's essentially that inner test that I dropped as pointless. In contrast, in the s390 version, that bit was never encoded as a general "flags associated with a page pointer" in the first place, so there was never any such duality. There is only TLB_ZAP_RMAP. Hope that explains the thinking. 
Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-04 17:35 ` Linus Torvalds @ 2022-11-06 21:06 ` Hugh Dickins 2022-11-06 22:34 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Hugh Dickins @ 2022-11-06 21:06 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Adding Stephen to Cc, for next-20221107 alert. Adding Johannes to Cc, particularly for lock_page_memcg discussion. On Fri, 4 Nov 2022, Linus Torvalds wrote: > On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev > <agordeev@linux.ibm.com> wrote: > > > > I rather have a question to the generic part (had to master the code quotting). > > Sure. > > Although now I think the series in in Andrew's -mm tree, or just about > to get moved in there, so I'm not going to touch my actual branch any > more. Linus, we've been exchanging about my mm/rmap.c mods in private mail, I need to raise some points about your mods here in public mail. Four topics - munlock (good), anon_vma (bad), mm-unstable (bad), lock_page_memcg (uncertain). I'm asking Andrew here to back your patchset out of mm-unstable for now, until we have its fixes in: otherwise I'm worried that next-20221107 will waste everyone's time. munlock (good) -------------- You've separated out the munlock_vma_page() from the rest of PTE remove rmap work. Worried me at first, but I think that's fine: bundling it into page_remove_rmap() was mainly so that we wouldn't forget to do it anywhere (and other rmap funcs already took vma arg). 
Certainly during development, I had been using page mapcount somewhere inside munlock_vma_page(), either for optimization or for sanity check, I forget; but gave up on that as unnecessary complication, and I think it became impossible with the pagevec; so not an issue now. If it were my change, I'd certainly do it by changing the name of the vma arg in page_remove_rmap() from vma to munlock_vma, not doing the munlock when that is NULL, and avoid the need for duplicating code: whereas you're very pleased with cutting out the unnecessary stuff in slimmed-down page_zap_pte_rmap(). Differing tastes (or perhaps you'll say taste versus no taste). anon_vma (bad) -------------- My first reaction to your patchset was, that it wasn't obvious to me that delaying the decrementation of mapcount is safe: I needed time to think about it. But happily, testing saved me from needing much thought. The first problem, came immediately in the first iteration of my huge tmpfs kbuild swapping loads, was BUG at mm/migrate.c:1105! VM_BUG_ON_FOLIO(folio_test_anon(src) && !folio_test_ksm(src) && !anon_vma, src); Okay, that's interesting but not necessarily fatal, we can very easily make it a recoverable condition. I didn't bother to think on that one, just patched around it. The second problem was more significant: came after nearly five hours of load, BUG NULL dereference (address 8) in down_read_trylock() in rmap_walk_anon(), while kswapd was doing a folio_referenced(). That one goes right to the heart of the matter, and instinct had been correct to worry about delaying the decrementation of mapcount, that is, extending the life of page_mapped() or folio_mapped(). See folio_lock_anon_vma_read(): folio_mapped() plays a key role in establishing the continued validity of an anon_vma. See comments above folio_get_anon_vma(), some by me but most by PeterZ IIRC. 
I believe what has happened is that your patchset has, very intentionally, kept the page as "folio_mapped" until after free_pgtables() does its unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read() that the anon_vma is safe to use when actually it has been freed. (It looked like a page table when I peeped at it.) I'm not certain, but I think that you made page_zap_pte_rmap() handle anon as well as file, just for the righteous additional simplification; but I'm afraid that (without opening a huge anon_vma refcounting can of worms) that unification has to be reverted, and anon left to go the same old way it did before. I didn't look into whether reverting one of your patches would achieve that, I just adjusted the code in zap_pte_range() to go the old way for PageAnon; and that has been running successfully, hitting neither BUG, for 15 hours now. mm-unstable (bad) ----------------- Aside from that PageAnon issue, mm-unstable is in an understandably bad state because you could not have foreseen my subpages_mapcount addition to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the PageCompound (but not the "compound") case too. I rushed you and akpm an emergency patch for that on Friday night, but you, let's say, had reservations about it. So I haven't posted it, and while the PageAnon issue remains, I think your patchset has to be removed from mm-unstable and linux-next anyway. What happens to mm-unstable with page_zap_pte_rmap() not handling subpages_mapcount? In my CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y case, I get a "Bad page state" even before reaching first login after boot, "page dumped because: nonzero subpages_mapcount". Yes, I make it worse for myself by having a BUG() in bad_page(); others will limp along with more and more of those "Bad page" messages. lock_page_memcg (uncertain) --------------------------- Johannes, please help! 
Linus has got quite upset by lock_page_memcg(), its calls from mm/rmap.c anyway, and most particularly by the way in which it is called at the start of page_remove_rmap(), before anyone's critical atomic_add_negative(), yet its use is to guarantee the stability of page memcg while doing the stats updates, done only when atomic_add_negative() says so. I do have one relevant insight on this. It (or its antecedents under other names) date from the days when we did "reparenting" of memcg charges from an LRU: and in those days the lock_page_memcg() before mapcount adjustment was vital, to pair with the uses of folio_mapped() or page_mapped() in mem_cgroup_move_account() - those "mapped" checks are precisely around the stats which the rmap functions affect. But nowadays mem_cgroup_move_account() is only called, with page table lock held, on matching pages found in a task's page table: so its "mapped" checks are redundant - I've sometimes thought in the past of removing them, but held back, because I always have the notion (not hope!) that "reparenting" may one day be brought back from the grave. I'm too out of touch with memcg to know where that notion stands today. I've gone through a multiverse of opinions on those lock_page_memcg()s in the last day: I currently believe that Linus is right, that the lock_page_memcg()s could and should be moved just before the stats updates. But I am not 100% certain of that - is there still some reason why it's important that the page memcg at the instant of the critical mapcount transition be kept unchanged until the stats are updated? I've tried running scenarios through my mind but given up. (And note that the answer might be different with Linus's changes than without them: since he delays the mapcount decrementation until long after pte was removed and page table lock dropped). 
And I do wish that we could settle this lock_page_memcg() question in an entirely separate patch: as it stands, page_zap_pte_rmap() gets to benefit from Linus's insight (or not), and all the other rmap functions carry on with the mis?placed lock_page_memcg() as before. Let's press Send, Hugh ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 21:06 ` Hugh Dickins @ 2022-11-06 22:34 ` Linus Torvalds 2022-11-06 23:14 ` Andrew Morton 2022-11-07 9:12 ` Peter Zijlstra 2022-11-07 20:07 ` Johannes Weiner 2 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-06 22:34 UTC (permalink / raw) To: Hugh Dickins Cc: Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch [ Editing down to just the bare-bones problem cases ] On Sun, Nov 6, 2022 at 1:06 PM Hugh Dickins <hughd@google.com> wrote: > > anon_vma (bad) > -------------- > > See folio_lock_anon_vma_read(): folio_mapped() plays a key role in > establishing the continued validity of an anon_vma. See comments > above folio_get_anon_vma(), some by me but most by PeterZ IIRC. > > I believe what has happened is that your patchset has, very intentionally, > kept the page as "folio_mapped" until after free_pgtables() does its > unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read() > that the anon_vma is safe to use when actually it has been freed. > (It looked like a page table when I peeped at it.) > > I'm not certain, but I think that you made page_zap_pte_rmap() handle > anon as well as file, just for the righteous additional simplification; > but I'm afraid that (without opening a huge anon_vma refcounting can of > worms) that unification has to be reverted, and anon left to go the > same old way it did before. Indeed. I made them separate initially, just because the only case that mattered for the dirty bit was the file-mapped case. But then the two functions ended up being basically the identical function, so I unified them again. 
But the anon_vma lifetime issue looks very real, and so doing the "delay rmap only for file mappings" seems sane. In fact, I wonder if we should delay it only for *dirty* file mappings, since it doesn't matter for the clean case. Hmm. I already threw away my branch (since Andrew picked the patches up), so a question for Andrew: do you want me to re-do the branch entirely, or do you want me to just send you an incremental patch? To make for minimal changes, I'd drop the 're-unification' patch, and then small updates to the zap_pte_range() code to keep the anon (and possibly non-dirty) case synchronous. And btw, this one is interesting: for anonymous (and non-dirty file-mapped) pages, we actually can end up delaying the final page free (and the rmap zapping) all the way to "tlb_finish_mmu()". Normally we still have the vma's all available, but yes, free_pgtables() can and does happen before the final TLB flush. The file-mapped dirty case doesn't have that issue - not just because it doesn't have an anon_vma at all, but because it also does that "force_flush" thing that just means that the page freeing never gets delayed that far in the first place. > mm-unstable (bad) > ----------------- > Aside from that PageAnon issue, mm-unstable is in an understandably bad > state because you could not have foreseen my subpages_mapcount addition > to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the > PageCompound (but not the "compound") case too. I rushed you and akpm > an emergency patch for that on Friday night, but you, let's say, had > reservations about it. So I haven't posted it, and while the PageAnon > issue remains, I think your patchset has to be removed from mm-unstable > and linux-next anyway. So I think I'm fine with your patch, I just want to move the memcg accounting to outside of it. I can re-do my series on top of mm-unstable, I guess. That's probably the easiest way to handle this all. 
Andrew - can you remove those patches again, and I'll create a new series for you? Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 22:34 ` Linus Torvalds @ 2022-11-06 23:14 ` Andrew Morton 2022-11-07 0:06 ` Stephen Rothwell 2022-11-07 16:19 ` Linus Torvalds 0 siblings, 2 replies; 30+ messages in thread From: Andrew Morton @ 2022-11-06 23:14 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Sun, 6 Nov 2022 14:34:51 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > mm-unstable (bad) > > ----------------- > > Aside from that PageAnon issue, mm-unstable is in an understandably bad > > state because you could not have foreseen my subpages_mapcount addition > > to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the > > PageCompound (but not the "compound") case too. I rushed you and akpm > > an emergency patch for that on Friday night, but you, let's say, had > > reservations about it. So I haven't posted it, and while the PageAnon > > issue remains, I think your patchset has to be removed from mm-unstable > > and linux-next anyway. > > So I think I'm fine with your patch, I just want to move the memcg > accounting to outside of it. > > I can re-do my series on top of mm-unstable, I guess. That's probably > the easiest way to handle this all. > > Andrew - can you remove those patches again, and I'll create a new > series for you? Yes, I've removed both serieses and shall push the tree out within half an hour from now. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 23:14 ` Andrew Morton @ 2022-11-07 0:06 ` Stephen Rothwell 0 siblings, 0 replies; 30+ messages in thread From: Stephen Rothwell @ 2022-11-07 0:06 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Hugh Dickins, Johannes Weiner, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hi all, On Sun, 6 Nov 2022 15:14:16 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > Yes, I've removed both serieses and shall push the tree out within half > an hour from now. And I have picked up the new version for today's linux-next. Thanks all. -- Cheers, Stephen Rothwell ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 23:14 ` Andrew Morton 2022-11-07 0:06 ` Stephen Rothwell @ 2022-11-07 16:19 ` Linus Torvalds 2022-11-07 23:02 ` Andrew Morton 1 sibling, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-07 16:19 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > Yes, I've removed both serieses and shall push the tree out within half > an hour from now. Oh, can you please just put Hugh's series back in? I don't love the mapcount changes, but I'm going to re-do my parts so that there is no clash with them. And while I'd much rather see the mapcounts done another way, that's a completely separate issue. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 16:19 ` Linus Torvalds @ 2022-11-07 23:02 ` Andrew Morton 2022-11-07 23:44 ` Stephen Rothwell 0 siblings, 1 reply; 30+ messages in thread From: Andrew Morton @ 2022-11-07 23:02 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Yes, I've removed both serieses and shall push the tree out within half > > an hour from now. > > Oh, can you please just put Hugh's series back in? > Done, all pushed out to git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 23:02 ` Andrew Morton @ 2022-11-07 23:44 ` Stephen Rothwell 0 siblings, 0 replies; 30+ messages in thread From: Stephen Rothwell @ 2022-11-07 23:44 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Hugh Dickins, Johannes Weiner, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hi all, On Mon, 7 Nov 2022 15:02:42 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Done, all pushed out to > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable. Will be in today's linux-next. -- Cheers, Stephen Rothwell ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 21:06 ` Hugh Dickins 2022-11-06 22:34 ` Linus Torvalds @ 2022-11-07 9:12 ` Peter Zijlstra 2022-11-07 20:07 ` Johannes Weiner 2 siblings, 0 replies; 30+ messages in thread From: Peter Zijlstra @ 2022-11-07 9:12 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hi Hugh! On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote: > See folio_lock_anon_vma_read(): folio_mapped() plays a key role in > establishing the continued validity of an anon_vma. See comments > above folio_get_anon_vma(), some by me but most by PeterZ IIRC. Ohhh, you're quite right. Unfortunately I seem to have completely forgotten about that :-( ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 21:06 ` Hugh Dickins 2022-11-06 22:34 ` Linus Torvalds 2022-11-07 9:12 ` Peter Zijlstra @ 2022-11-07 20:07 ` Johannes Weiner 2022-11-07 20:29 ` Linus Torvalds 2 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2022-11-07 20:07 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hello, On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote: > lock_page_memcg (uncertain) > --------------------------- > Johannes, please help! Linus has got quite upset by lock_page_memcg(), > its calls from mm/rmap.c anyway, and most particularly by the way > in which it is called at the start of page_remove_rmap(), before > anyone's critical atomic_add_negative(), yet its use is to guarantee > the stability of page memcg while doing the stats updates, done only > when atomic_add_negative() says so. As you mentioned, the pte lock historically wasn't always taken on the move side. And so the reason lock_page_memcg() covers the mapcount update is that move_account() needs to atomically see either a) the page is mapped and counted, or b) unmapped and uncounted. 
If we lock after the mapcount decrement, move_account() could miss pending updates that need to be transferred, and the scheme breaks thusly:

	memcg1->nr_mapped = 1
	memcg2->nr_mapped = 0

	page_remove_rmap:                      mem_cgroup_move_account():
	if atomic_add_negative(page->mapcount):
	                                       lock_page_memcg()
	                                       if page->mapcount: // NOT TAKEN
	                                           memcg1->nr_mapped--
	                                           memcg2->nr_mapped++
	                                       page->memcg = memcg2
	                                       unlock_page_memcg()
	  lock_page_memcg()
	  page->memcg->nr_mapped-- // UNDERFLOW memcg2->nr_mapped
	  unlock_page_memcg()

> I do have one relevant insight on this. It (or its antecedents under > other names) date from the days when we did "reparenting" of memcg > charges from an LRU: and in those days the lock_page_memcg() before > mapcount adjustment was vital, to pair with the uses of folio_mapped() > or page_mapped() in mem_cgroup_move_account() - those "mapped" checks > are precisely around the stats which the rmap functions affect. > > But nowadays mem_cgroup_move_account() is only called, with page table > lock held, on matching pages found in a task's page table: so its > "mapped" checks are redundant - I've sometimes thought in the past of > removing them, but held back, because I always have the notion (not > hope!) that "reparenting" may one day be brought back from the grave. > I'm too out of touch with memcg to know where that notion stands today. > > I've gone through a multiverse of opinions on those lock_page_memcg()s > in the last day: I currently believe that Linus is right, that the > lock_page_memcg()s could and should be moved just before the stats > updates. But I am not 100% certain of that - is there still some > reason why it's important that the page memcg at the instant of the > critical mapcount transition be kept unchanged until the stats are > updated? I've tried running scenarios through my mind but given up. Okay, I think there are two options. 
- If we don't want to codify the pte lock requirement on the move side, then moving the lock_page_memcg() like that would break the locking scheme, as per above. - If we DO want to codify the pte lock requirement, we should just remove the lock_page_memcg() altogether, as it's fully redundant. I'm leaning toward the second option. If somebody brings back reparenting they can bring the lock back along with it. [ If it's even still necessary by then. It's conceivable reparenting is brought back only for cgroup2, where the race wouldn't matter because of the hierarchical stats. The reparenting side wouldn't have to move page state to the parent - it's already there. And whether rmap would see the dying child or the parent doesn't matter much either: the parent will see the update anyway, directly or recursively, and we likely don't care to balance the books on a dying cgroup. It's then just a matter of lifetime - which should be guaranteed also, as long as the pte lock prevents an RCU quiescent state. ] So how about something like below? UNTESTED, just for illustration. This is cgroup1 code, which I haven't looked at too closely in a while. If you can't spot an immediate hole in it, I'd go ahead and test it and send a proper patch. --- From 88a32b1b5737630fb981114f6333d8fd057bd8e9 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Mon, 7 Nov 2022 12:05:09 -0500 Subject: [PATCH] mm: remove lock_page_memcg() from rmap rmap changes (mapping and unmapping) of a page currently take lock_page_memcg() to serialize 1) update of the mapcount and the cgroup mapped counter with 2) cgroup moving the page and updating the old cgroup and the new cgroup counters based on page_mapped(). Before b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), we used to reassign all pages that could be found on a cgroup's LRU list on deletion - something that rmap didn't naturally serialize against. 
Since that commit, however, the only pages that get moved are those mapped into page tables of a task that's being migrated. In that case, the pte lock is always held (and we know the page is mapped), which keeps rmap changes at bay already. The additional lock_page_memcg() by rmap is redundant. Remove it. NOT-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/memcontrol.c | 27 ++++++++++++--------------- mm/rmap.c | 30 ++++++++++++------------------ 2 files changed, 24 insertions(+), 33 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2d8549ae1b30..f7716e9038e9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5666,7 +5666,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma, * @from: mem_cgroup which the page is moved from. * @to: mem_cgroup which the page is moved to. @from != @to. * - * The caller must make sure the page is not on LRU (isolate_page() is useful.) + * This function acquires folio_lock() and folio_lock_memcg(). The + * caller must exclude all other possible ways of accessing + * page->memcg, such as LRU isolation (to lock out isolation) and + * having the page mapped and pte-locked (to lock out rmap). * * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" * from old cgroup. 
@@ -5685,6 +5688,7 @@ static int mem_cgroup_move_account(struct page *page, VM_BUG_ON(from == to); VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); VM_BUG_ON(compound && !folio_test_large(folio)); + VM_WARN_ON_ONCE(!folio_mapped(folio)); /* * Prevent mem_cgroup_migrate() from looking at @@ -5705,30 +5709,23 @@ static int mem_cgroup_move_account(struct page *page, folio_memcg_lock(folio); if (folio_test_anon(folio)) { - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); - if (folio_test_transhuge(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_THPS, - -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_THPS, - nr_pages); - } + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); + if (folio_test_transhuge(folio)) { + __mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages); } } else { __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); + __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); if (folio_test_swapbacked(folio)) { __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages); __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages); } - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); - } - if (folio_test_dirty(folio)) { struct address_space *mapping = folio_mapping(folio); diff --git a/mm/rmap.c b/mm/rmap.c index 2ec925e5fa6a..60c31375f274 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1197,11 +1197,6 @@ void page_add_anon_rmap(struct page *page, bool compound = flags & RMAP_COMPOUND; bool first; - if (unlikely(PageKsm(page))) - lock_page_memcg(page); - else - VM_BUG_ON_PAGE(!PageLocked(page), page); - if (compound) { atomic_t *mapcount; VM_BUG_ON_PAGE(!PageLocked(page), page); 
@@ -1217,19 +1212,19 @@ void page_add_anon_rmap(struct page *page, if (first) { int nr = compound ? thp_nr_pages(page) : 1; /* - * We use the irq-unsafe __{inc|mod}_zone_page_stat because - * these counters are not modified in interrupt context, and - * pte lock(a spinlock) is held, which implies preemption - * disabled. + * We use the irq-unsafe __{inc|mod}_zone_page_stat + * because these counters are not modified in + * interrupt context, and pte lock(a spinlock) is + * held, which implies preemption disabled. + * + * The pte lock also stabilizes page->memcg wrt + * mem_cgroup_move_account(). */ if (compound) __mod_lruvec_page_state(page, NR_ANON_THPS, nr); __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr); } - if (unlikely(PageKsm(page))) - unlock_page_memcg(page); - /* address might be in next vma when migration races vma_adjust */ else if (first) __page_set_anon_rmap(page, vma, address, @@ -1290,7 +1285,6 @@ void page_add_file_rmap(struct page *page, int i, nr = 0; VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page); - lock_page_memcg(page); if (compound && PageTransHuge(page)) { int nr_pages = thp_nr_pages(page); @@ -1311,6 +1305,7 @@ void page_add_file_rmap(struct page *page, if (nr == nr_pages && PageDoubleMap(page)) ClearPageDoubleMap(page); + /* The pte lock stabilizes page->memcg */ if (PageSwapBacked(page)) __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED, nr_pages); @@ -1328,7 +1323,6 @@ void page_add_file_rmap(struct page *page, out: if (nr) __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); - unlock_page_memcg(page); mlock_vma_page(page, vma, compound); } @@ -1356,6 +1350,7 @@ static void page_remove_file_rmap(struct page *page, bool compound) } if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) goto out; + /* The pte lock stabilizes page->memcg */ if (PageSwapBacked(page)) __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED, -nr_pages); @@ -1423,8 +1418,6 @@ static void page_remove_anon_compound_rmap(struct page *page) void 
page_remove_rmap(struct page *page, struct vm_area_struct *vma, bool compound) { - lock_page_memcg(page); - if (!PageAnon(page)) { page_remove_file_rmap(page, compound); goto out; @@ -1443,6 +1436,9 @@ void page_remove_rmap(struct page *page, * We use the irq-unsafe __{inc|mod}_zone_page_stat because * these counters are not modified in interrupt context, and * pte lock(a spinlock) is held, which implies preemption disabled. + * + * The pte lock also stabilizes page->memcg wrt + * mem_cgroup_move_account(). */ __dec_lruvec_page_state(page, NR_ANON_MAPPED); @@ -1459,8 +1455,6 @@ void page_remove_rmap(struct page *page, * faster for those pages still in swapcache. */ out: - unlock_page_memcg(page); - munlock_vma_page(page, vma, compound); } -- 2.38.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 20:07 ` Johannes Weiner @ 2022-11-07 20:29 ` Linus Torvalds 2022-11-07 23:47 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-07 20:29 UTC (permalink / raw) To: Johannes Weiner Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, Nov 7, 2022 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > - If we DO want to codify the pte lock requirement, we should just > remove the lock_page_memcg() altogether, as it's fully redundant. > > I'm leaning toward that second option. The thing is, that's very much the case we do *not* want. We need to delay the rmap removal until at least after the TLB flush. At least for dirty filemapped pages - because the page cleaning needs to see that they exist as mapped entities until all CPU's have *actually* dropped them. Now, we do the TLB flush still under the page table lock, so we could still then do the rmap removal before dropping the lock. But it would be much cleaner from the TLB flushing standpoint to delay it until the page freeing, which ends up being delayed until after the lock is dropped. That said, if always doing the rmap removal under the page table lock means that that memcg lock can just be deleted in that whole path, I will certainly bow to _that_ simplification instead, and just handle the dirty pages after the TLB flush but before the page table drop. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 20:29 ` Linus Torvalds @ 2022-11-07 23:47 ` Linus Torvalds 2022-11-08 4:28 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-07 23:47 UTC (permalink / raw) To: Johannes Weiner Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, Nov 7, 2022 at 12:29 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That said, if always doing the rmap removal under the page table lock > means that that memcg lock can just be deleted in that whole path, I > will certainly bow to _that_ simplification instead, and just handle > the dirty pages after the TLB flush but before the page table drop. Ok, so I think I have a fairly clean way to do this. Let me try to make that series look reasonable, although it might be until tomorrow. I'll need to massage my mess into not just prettier code, but a sane history. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 23:47 ` Linus Torvalds @ 2022-11-08 4:28 ` Linus Torvalds 2022-11-08 19:56 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-08 4:28 UTC (permalink / raw) To: Johannes Weiner Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch [-- Attachment #1: Type: text/plain, Size: 3263 bytes --] On Mon, Nov 7, 2022 at 3:47 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ok, so I think I have a fairly clean way to do this. > > Let me try to make that series look reasonable, although it might be > until tomorrow. I'll need to massage my mess into not just prettier > code, but a sane history. Ugh. Ok, so massaging it into a saner form and splitting it out into a pretty history took a lot longer than writing the initial ugly "something like this". Anyway, the end result looks very different from the previous series, since I very consciously tried to keep away from rmap changes to not clash with Hugh's work, and also made sure we still do the page_remove_rmap() under the page table lock, just *after* the TLB flush. It does share some basic stuff - I'm still using that "struct encoded_page" thing, I just moved it into a separate commit, and moved it to where it conceptually belongs (I'd love to say that I also made the mlock.c code use it, but I didn't - that 'pagevec' thing is a pretty pointless abstraction and I didn't want to fight it). So you'll see some familiar support structures and scaffolding, but the actual approach to zap_pte_range() is very different. The approach taken to the "s390 is very different" issue is also completely different. 
It should actually be better for s390: we used to cause that "force_flush" whenever we hit a dirty shared page, and s390 doesn't care. The new model makes that "s390 doesn't care" be part of the whole design, so now s390 treats dirty shared pages basically as if they weren't anything special at all. Which they aren't on s390. But, as with the previous case, I don't even try to cross-compile for s390, so my attempts at handling s390 well are just that: attempts. The code may be broken. Of course, the code may be broken on x86-64 too, but at least there I've compiled it and am running it right now. Oh, and because I copied large parts of the commit message from the previous approach (because the problem description is the same), I noticed that I also kept the "Acked-by:". Those are bogus, because the code is sufficiently different that any previous acks are just not valid any more, but I just hadn't fixed that yet. The meat of it all is in that last patch, the rest is just type system cleanups to prep for it. But it was also that last patch that I spent hours on just tweaking it to look sensible. The *code* was pretty easy. Making it have sensible naming and a sensible abstraction interface that worked well for s390 too, that was 90% of the effort. But also I hope the end result is reasonably easy to review as a result. I'm sending this out because I'm stepping away from the keyboard, because that whole "let's massage it into something legible" was really somewhat exhausting. You don't see all the small side turns it took only to go "that's ugly, let's try again" ;) Anybody interested in giving this a peek? (Patch 2/4 might make some people pause. It's fairly small and simple. It's effective and makes it easy to do some of the later changes. And it's also quite different from our usual model. It was "inspired" by the signed-vs-unsigned char thread from a few weeks ago. But patch 4/4 is the one that matters). 
Linus [-- Attachment #2: 0001-mm-introduce-encoded-page-pointers-with-embedded-ext.patch --] [-- Type: text/x-patch, Size: 2749 bytes --] From 675b73aaa7718e93e9f2492a3b9cc417f9e820b4 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 15:48:27 -0800 Subject: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits We already have this notion in parts of the MM code (see the mlock code with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new case, and I refuse to do the same thing we've done before where we just put bits in the raw pointer and say it's still a normal pointer. So this introduces a 'struct encoded_page' pointer that cannot be used for anything else than to encode a real page pointer and a couple of extra bits in the low bits. That way the compiler can trivially track the state of the pointer and you just explicitly encode and decode the extra bits. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/linux/mm_types.h | 33 ++++++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 500e536796ca..b5cffd250784 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -67,7 +67,7 @@ struct mem_cgroup; #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE #define _struct_page_alignment __aligned(2 * sizeof(unsigned long)) #else -#define _struct_page_alignment +#define _struct_page_alignment __aligned(sizeof(unsigned long)) #endif struct page { @@ -241,6 +241,37 @@ struct page { #endif } _struct_page_alignment; +/** + * struct encoded_page - a nonexistent type marking this pointer + * + * An 'encoded_page' pointer is a pointer to a regular 'struct page', but + * with the low bits of the pointer indicating extra context-dependent + * information. Not super-common, but happens in mmu_gather and mlock + * handling, and this acts as a type system check on that use. 
+ * We only really have two guaranteed bits in general, although you could + * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE) + * for more. + * + * Use the supplied helper functions to encode/decode the pointer and bits. + */ +struct encoded_page; +#define ENCODE_PAGE_BITS 3ul +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags) +{ + return (struct encoded_page *)(flags | (unsigned long)page); +} + +static inline bool encoded_page_flags(struct encoded_page *page) +{ + return ENCODE_PAGE_BITS & (unsigned long)page; +} + +static inline struct page *encoded_page_ptr(struct encoded_page *page) +{ + return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page); +} + /** * struct folio - Represents a contiguous set of bytes. * @flags: Identical to the page flags. -- 2.37.1.289.g45aa1e5c72.dirty [-- Attachment #3: 0002-mm-teach-release_pages-to-take-an-array-of-encoded-p.patch --] [-- Type: text/x-patch, Size: 4058 bytes --] From 31e4135eeedbb6ae12bfb9b17a7f6d9d815ff289 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 16:46:16 -0800 Subject: [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too release_pages() already could take either an array of page pointers, or an array of folio pointers. Expand it to also accept an array of encoded page pointers, which is what both the existing mlock() use and the upcoming mmu_gather use of encoded page pointers want. Note that release_pages() won't actually use, or react to, any extra encoded bits. Instead, this is very much a case of "I have walked the array of encoded pages and done everything the extra bits tell me to do, now release it all". 
Also, while the "either page or folio pointers" dual use was handled with a cast of the pointer in "release_folios()", this takes a slightly different approach and uses the "transparent union" attribute to describe the set of arguments to the function: https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html which has been supported by gcc forever, but the kernel hasn't used before. That allows us to avoid using various wrappers with casts, and just use the same function regardless of use. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/linux/mm.h | 21 +++++++++++++++++++-- mm/swap.c | 16 ++++++++++++---- 2 files changed, 31 insertions(+), 6 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bbcccbc5565..d9fb5c3e3045 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1179,7 +1179,24 @@ static inline void folio_put_refs(struct folio *folio, int refs) __folio_put(folio); } -void release_pages(struct page **pages, int nr); +/** + * release_pages - release an array of pages or folios + * + * This just releases a simple array of multiple pages, and + * accepts various different forms of said page array: either + * a regular old boring array of pages, an array of folios, or + * an array of encoded page pointers. + * + * The transparent union syntax for this kind of "any of these + * argument types" is all kinds of ugly, so look away. + */ +typedef union { + struct page **pages; + struct folio **folios; + struct encoded_page **encoded_pages; +} release_pages_arg __attribute__ ((__transparent_union__)); + +void release_pages(release_pages_arg, int nr); /** * folios_put - Decrement the reference count on an array of folios. 
@@ -1195,7 +1212,7 @@ void release_pages(struct page **pages, int nr); */ static inline void folios_put(struct folio **folios, unsigned int nr) { - release_pages((struct page **)folios, nr); + release_pages(folios, nr); } static inline void put_page(struct page *page) diff --git a/mm/swap.c b/mm/swap.c index 955930f41d20..596ed226ddb8 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -968,22 +968,30 @@ void lru_cache_disable(void) /** * release_pages - batched put_page() - * @pages: array of pages to release + * @arg: array of pages to release * @nr: number of pages * - * Decrement the reference count on all the pages in @pages. If it + * Decrement the reference count on all the pages in @arg. If it * fell to zero, remove the page from the LRU and free it. + * + * Note that the argument can be an array of pages, encoded pages, + * or folio pointers. We ignore any encoded bits, and turn any of + * them into just a folio that gets free'd. */ -void release_pages(struct page **pages, int nr) +void release_pages(release_pages_arg arg, int nr) { int i; + struct encoded_page **encoded = arg.encoded_pages; LIST_HEAD(pages_to_free); struct lruvec *lruvec = NULL; unsigned long flags = 0; unsigned int lock_batch; for (i = 0; i < nr; i++) { - struct folio *folio = page_folio(pages[i]); + struct folio *folio; + + /* Turn any of the argument types into a folio */ + folio = page_folio(encoded_page_ptr(encoded[i])); /* * Make sure the IRQ-safe lock-holding time does not get -- 2.37.1.289.g45aa1e5c72.dirty [-- Attachment #4: 0003-mm-mmu_gather-prepare-to-gather-encoded-page-pointer.patch --] [-- Type: text/x-patch, Size: 3376 bytes --] From 0e6863a5389a10e984122c6dca143f9be71da310 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 17:36:43 -0800 Subject: [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags This is purely a preparatory patch that makes all the data structures ready for encoding flags with the 
mmu_gather page pointers. The code currently always sets the flag to zero and doesn't use it yet, but now it's tracking the type state along. The next step will be to actually start using it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/asm-generic/tlb.h | 2 +- include/linux/swap.h | 2 +- mm/mmu_gather.c | 4 ++-- mm/swap_state.c | 11 ++++------- 4 files changed, 8 insertions(+), 11 deletions(-) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 492dce43236e..faca23e87278 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -242,7 +242,7 @@ struct mmu_gather_batch { struct mmu_gather_batch *next; unsigned int nr; unsigned int max; - struct page *pages[]; + struct encoded_page *encoded_pages[]; }; #define MAX_GATHER_BATCH \ diff --git a/include/linux/swap.h b/include/linux/swap.h index a18cf4b7c724..40e418e3461b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -470,7 +470,7 @@ static inline unsigned long total_swapcache_pages(void) extern void free_swap_cache(struct page *page); extern void free_page_and_swap_cache(struct page *); -extern void free_pages_and_swap_cache(struct page **, int); +extern void free_pages_and_swap_cache(struct encoded_page **, int); /* linux/mm/swapfile.c */ extern atomic_long_t nr_swap_pages; extern long total_swap_pages; diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index add4244e5790..57b7850c1b5e 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -48,7 +48,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb) struct mmu_gather_batch *batch; for (batch = &tlb->local; batch && batch->nr; batch = batch->next) { - struct page **pages = batch->pages; + struct encoded_page **pages = batch->encoded_pages; do { /* @@ -92,7 +92,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_ * Add the page and check if we are full. If so * force a flush. 
*/ - batch->pages[batch->nr++] = page; + batch->encoded_pages[batch->nr++] = encode_page(page, 0); if (batch->nr == batch->max) { if (!tlb_next_batch(tlb)) return true; diff --git a/mm/swap_state.c b/mm/swap_state.c index 438d0676c5be..8bf08c313872 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -303,15 +303,12 @@ void free_page_and_swap_cache(struct page *page) * Passed an array of pages, drop them all from swapcache and then release * them. They are removed from the LRU and freed if this is their last use. */ -void free_pages_and_swap_cache(struct page **pages, int nr) +void free_pages_and_swap_cache(struct encoded_page **pages, int nr) { - struct page **pagep = pages; - int i; - lru_add_drain(); - for (i = 0; i < nr; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, nr); + for (int i = 0; i < nr; i++) + free_swap_cache(encoded_page_ptr(pages[i])); + release_pages(pages, nr); } static inline bool swap_use_vma_readahead(void) -- 2.37.1.289.g45aa1e5c72.dirty [-- Attachment #5: 0004-mm-delay-page_remove_rmap-until-after-the-TLB-has-be.patch --] [-- Type: text/x-patch, Size: 11173 bytes --] From 7ef5220cd5825d6e4a770286d8949b9b838bbc30 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 19:38:38 -0800 Subject: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed When we remove a page table entry, we are very careful to only free the page after we have flushed the TLB, because other CPUs could still be using the page through stale TLB entries until after the flush. However, we have removed the rmap entry for that page early, which means that functions like folio_mkclean() would end up not serializing with the page table lock because the page had already been made invisible to rmap. 
And that is a problem, because while the TLB entry exists, we could end up with the following situation: (a) one CPU could come in and clean it, never seeing our mapping of the page (b) another CPU could continue to use the stale and dirty TLB entry and continue to write to said page resulting in a page that has been dirtied, but then marked clean again, all while another CPU might have dirtied it some more. End result: possibly lost dirty data. This extends our current TLB gather infrastructure to optionally track a "should I do a delayed page_remove_rmap() for this page after flushing the TLB". It uses the newly introduced 'encoded page pointer' to do that without having to keep separate data around. Note, this is complicated by a couple of issues: - s390 has its own mmu_gather model that doesn't delay TLB flushing, and as a result also does not want the delayed rmap - we want to delay the rmap removal, but not past the page table lock - we can track an enormous number of pages in our mmu_gather structure, with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each, all set up to be approximately 10k pending pages. We do not want to have a huge number of batched pages that we then need to check for delayed rmap handling inside the page table lock. Particularly that last point results in a noteworthy detail, where the normal page batch gathering is limited once we have delayed rmaps pending, in such a way that only the last batch (the so-called "active batch") in the mmu_gather structure can have any delayed entries. NOTE! While the "possibly lost dirty data" sounds catastrophic, for this all to happen you need to have a user thread doing either madvise() with MADV_DONTNEED or a full re-mmap() of the area concurrently with another thread continuing to use said mapping. So arguably this is about user space doing crazy things, but from a VM consistency standpoint it's better if we track the dirty bit properly even when user space goes off the rails. 
Reported-by: Nadav Amit <nadav.amit@gmail.com> Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/ Cc: Will Deacon <will@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- arch/s390/include/asm/tlb.h | 21 +++++++++++++++++++-- include/asm-generic/tlb.h | 21 +++++++++++++++++---- mm/memory.c | 23 +++++++++++++++++------ mm/mmu_gather.c | 35 +++++++++++++++++++++++++++++++++-- 4 files changed, 86 insertions(+), 14 deletions(-) diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 3a5c8fb590e5..e5903ee2f1ca 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -25,7 +25,8 @@ void __tlb_remove_table(void *_table); static inline void tlb_flush(struct mmu_gather *tlb); static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct page *page, int page_size); + struct page *page, int page_size, + unsigned int flags); #define tlb_flush tlb_flush #define pte_free_tlb pte_free_tlb @@ -36,13 +37,24 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, #include <asm/tlbflush.h> #include <asm-generic/tlb.h> +/* + * s390 never needs to delay page_remove_rmap, because + * the ptep_get_and_clear_full() will have flushed the + * TLB across CPUs + */ +static inline bool tlb_delay_rmap(struct mmu_gather *tlb) +{ + return false; +} + /* * Release the page cache reference for a pte removed by * tlb_ptep_clear_flush. 
In both flush modes the tlb for a page cache page * has already been freed, so just do free_page_and_swap_cache. */ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct page *page, int page_size) + struct page *page, int page_size, + unsigned int flags) { free_page_and_swap_cache(page); return false; @@ -53,6 +65,11 @@ static inline void tlb_flush(struct mmu_gather *tlb) __tlb_flush_mm_lazy(tlb->mm); } +static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) +{ + /* Nothing to do, s390 does not delay rmaps */ +} + /* * pte_free_tlb frees a pte table and clears the CRSTE for the * page table from the tlb. diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index faca23e87278..9df513e5ad28 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -257,7 +257,15 @@ struct mmu_gather_batch { #define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH) extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, - int page_size); + int page_size, unsigned int flags); +extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma); + +/* + * This both sets 'delayed_rmap', and returns true. It would be an inline + * function, except we define it before the 'struct mmu_gather'. + */ +#define tlb_delay_rmap(tlb) (((tlb)->delayed_rmap = 1), true) + #endif /* @@ -290,6 +298,11 @@ struct mmu_gather { */ unsigned int freed_tables : 1; + /* + * Do we have pending delayed rmap removals? + */ + unsigned int delayed_rmap : 1; + /* * at which levels have we cleared entries? 
 */
@@ -431,13 +444,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, page, page_size, 0))
		tlb_flush_mmu(tlb);
 }

-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, PAGE_SIZE, flags);
 }

 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..60a0f44f6e72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1432,6 +1432,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
			break;

		if (pte_present(ptent)) {
+			unsigned int delay_rmap;
+
			page = vm_normal_page(vma, addr, ptent);
			if (unlikely(!should_zap_page(details, page)))
				continue;
@@ -1443,20 +1445,26 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
			if (unlikely(!page))
				continue;

+			delay_rmap = 0;
			if (!PageAnon(page)) {
				if (pte_dirty(ptent)) {
-					force_flush = 1;
					set_page_dirty(page);
+					if (tlb_delay_rmap(tlb)) {
+						delay_rmap = 1;
+						force_flush = 1;
+					}
				}
				if (pte_young(ptent) &&
				    likely(!(vma->vm_flags & VM_SEQ_READ)))
					mark_page_accessed(page);
			}
			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (!delay_rmap) {
+				page_remove_rmap(page, vma, false);
+				if (unlikely(page_mapcount(page) < 0))
+					print_bad_pte(vma, addr, ptent, page);
+			}
+			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
				force_flush = 1;
				addr += PAGE_SIZE;
				break;
@@ -1513,8 +1521,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
	arch_leave_lazy_mmu_mode();

	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (force_flush) {
		tlb_flush_mmu_tlbonly(tlb);
+		if (tlb->delayed_rmap)
+			tlb_flush_rmaps(tlb, vma);
+	}
	pte_unmap_unlock(start_pte, ptl);

	/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 57b7850c1b5e..136f5fad43e3 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/swap.h>
+#include <linux/rmap.h>

 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -19,6 +20,10 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 {
	struct mmu_gather_batch *batch;

+	/* No more batching if we have delayed rmaps pending */
+	if (tlb->delayed_rmap)
+		return false;
+
	batch = tlb->active;
	if (batch->next) {
		tlb->active = batch->next;
@@ -43,6 +48,31 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
	return true;
 }

+/**
+ * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB
+ * @tlb: the current mmu_gather
+ *
+ * Note that because of how tlb_next_batch() above works, we will
+ * never start new batches with pending delayed rmaps, so we only
+ * need to walk through the current active batch.
+ */
+void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	struct mmu_gather_batch *batch;
+
+	batch = tlb->active;
+	for (int i = 0; i < batch->nr; i++) {
+		struct encoded_page *enc = batch->encoded_pages[i];
+
+		if (encoded_page_flags(enc)) {
+			struct page *page = encoded_page_ptr(enc);
+			page_remove_rmap(page, vma, false);
+		}
+	}
+
+	tlb->delayed_rmap = 0;
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
	struct mmu_gather_batch *batch;
@@ -77,7 +107,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
	tlb->local.next = NULL;
 }

-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, unsigned int flags)
 {
	struct mmu_gather_batch *batch;

@@ -92,7 +122,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
	 * Add the page and check if we are full. If so
	 * force a flush.
	 */
-	batch->encoded_pages[batch->nr++] = encode_page(page, 0);
+	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
	if (batch->nr == batch->max) {
		if (!tlb_next_batch(tlb))
			return true;
@@ -286,6 +316,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
	tlb->active = &tlb->local;
	tlb->batch_count = 0;
 #endif
+	tlb->delayed_rmap = 0;
	tlb_table_init(tlb);

 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
-- 
2.37.1.289.g45aa1e5c72.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush
  2022-11-08  4:28 ` Linus Torvalds
@ 2022-11-08 19:56 ` Linus Torvalds
  2022-11-08 20:03 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 30+ messages in thread

From: Linus Torvalds @ 2022-11-08 19:56 UTC (permalink / raw)
To: Johannes Weiner, Konstantin Ryabitsev
Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra,
    Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik,
    Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
    John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
    Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
    Uros Bizjak, Alistair Popple, linux-arch

On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm sending this out because I'm stepping away from the keyboard,
> because that whole "let's massage it into something legible" was
> really somewhat exhausting. You don't see all the small side turns it
> took only to go "that's ugly, let's try again" ;)

Ok, I actually sent the individual patches with 'git-send-email',
although I only sent them to the mailing list and to people who were
mentioned in the commit descriptions.

I hope that makes review easier.

See

    https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org

for the series if you weren't mentioned and are interested.

Oh, and because I decided to just use the email in this thread as the
reference and cover letter, it turns out that this all confuses 'b4',
because it actually walks up the whole thread all the way to the
original 13-patch series by PeterZ that started this whole discussion.

I've seen that before with other people's patch series, but now that it
has happened to my own, I'm cc'ing Konstantin here too to see if there's
some magic for b4 to say "look, I pointed you at a msg-id that is
clearly a new series, don't walk all the way up and then take patches
from a completely different one". Oh well.
I guess I should just not have been lazy, and done a cover letter and a
whole new thread. My bad.

Konstantin, please help me look like less of a tool?

          Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush
  2022-11-08 19:56 ` Linus Torvalds
@ 2022-11-08 20:03 ` Konstantin Ryabitsev
  2022-11-08 20:18 ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread

From: Konstantin Ryabitsev @ 2022-11-08 20:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Weiner, Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
    Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit,
    Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton,
    kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov,
    Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch

On Tue, Nov 08, 2022 at 11:56:13AM -0800, Linus Torvalds wrote:
> On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'm sending this out because I'm stepping away from the keyboard,
> > because that whole "let's massage it into something legible" was
> > really somewhat exhausting. You don't see all the small side turns it
> > took only to go "that's ugly, let's try again" ;)
>
> Ok, I actually sent the individual patches with 'git-send-email',
> although I only sent them to the mailing list and to people who were
> mentioned in the commit descriptions.
>
> I hope that makes review easier.
>
> See
>
>     https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org
>
> for the series if you weren't mentioned and are interested.
>
> Oh, and because I decided to just use the email in this thread as the
> reference and cover letter, it turns out that this all confuses 'b4',
> because it actually walks up the whole thread all the way to the
> original 13-patch series by PeterZ that started this whole discussion.
> I've seen that before with other people's patch series, but now that it
> has happened to my own, I'm cc'ing Konstantin here too to see if there's
> some magic for b4 to say "look, I pointed you at a msg-id that is
> clearly a new series, don't walk all the way up and then take patches
> from a completely different one".

Yes, --no-parent. It's slightly more complicated in your case because
the patches aren't threaded to the first patch/cover letter, but you can
choose an arbitrary msgid upthread and tell b4 to ignore anything that
came before it. E.g.:

    b4 am -o/tmp --no-parent 20221108194139.57604-1-torvalds@linux-foundation.org

-K

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush
  2022-11-08 20:03 ` Konstantin Ryabitsev
@ 2022-11-08 20:18 ` Linus Torvalds
  0 siblings, 0 replies; 30+ messages in thread

From: Linus Torvalds @ 2022-11-08 20:18 UTC (permalink / raw)
To: Konstantin Ryabitsev
Cc: Johannes Weiner, Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
    Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit,
    Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton,
    kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov,
    Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch

On Tue, Nov 8, 2022 at 12:03 PM Konstantin Ryabitsev <mricon@kernel.org> wrote:
>
> Yes, --no-parent.

Ahh, that's new. I guess I need to update my ancient b4-0.8.0 install...

But yes, with that, and the manual parent lookup (because otherwise
"--no-parent" will fetch *just* the patch itself, not even walking up
the single parent chain), it works.

Maybe a "--single-parent" or "--deadbeat-parent" option would be a good idea?

Anyway, with a more recent b4 version, the command

    b4 am --no-parent CAHk-=wh6MxaCA4pXpt1F5Bn2__6MxCq0Dr-rES4i=MOL9ibjpg@mail.gmail.com

gets that series and only that series.

          Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2022-11-08 20:19 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com>
[not found] ` <CAHk-=wgzT1QsSCF-zN+eS06WGVTBg4sf=6oTMg95+AEq7QrSCQ@mail.gmail.com>
[not found] ` <47678198-C502-47E1-B7C8-8A12352CDA95@gmail.com>
[not found] ` <CAHk-=wjzngbbwHw4nAsqo_RpyOtUDk5G+Wus=O0w0A6goHvBWA@mail.gmail.com>
[not found] ` <CAHk-=wijU_YHSZq5N7vYK+qHPX0aPkaePaGOyWk4aqMvvSXxJA@mail.gmail.com>
[not found] ` <140B437E-B994-45B7-8DAC-E9B66885BEEF@gmail.com>
[not found] ` <CAHk-=wjX_P78xoNcGDTjhkgffs-Bhzcwp-mdsE1maeF57Sh0MA@mail.gmail.com>
[not found] ` <CAHk-=wio=UKK9fX4z+0CnyuZG7L+U9OB7t7Dcrg4FuFHpdSsfw@mail.gmail.com>
[not found] ` <CAHk-=wgz0QQd6KaRYQ8viwkZBt4xDGuZTFiTB8ifg7E3F2FxHg@mail.gmail.com>
[not found] ` <CAHk-=wiwt4LC-VmqvYrphraF0=yQV=CQimDCb0XhtXwk8oKCCA@mail.gmail.com>
[not found] ` <Y1+XCALog8bW7Hgl@hirez.programming.kicks-ass.net>
[not found] ` <CAHk-=wjnvPA7mi-E3jVEfCWXCNJNZEUjm6XODbbzGOh9c8mhgw@mail.gmail.com>
2022-10-31 18:43 ` mm: delay rmap removal until after TLB flush Linus Torvalds
2022-11-02 9:14 ` Christian Borntraeger
2022-11-02 9:23 ` Christian Borntraeger
2022-11-02 17:55 ` Linus Torvalds
2022-11-02 18:28 ` Linus Torvalds
2022-11-02 22:29 ` Gerald Schaefer
2022-11-02 12:45 ` Peter Zijlstra
2022-11-02 22:31 ` Gerald Schaefer
2022-11-02 23:13 ` Linus Torvalds
2022-11-03 9:52 ` David Hildenbrand
2022-11-03 16:54 ` Linus Torvalds
2022-11-03 17:09 ` Linus Torvalds
2022-11-03 17:36 ` David Hildenbrand
2022-11-04 6:33 ` Alexander Gordeev
2022-11-04 17:35 ` Linus Torvalds
2022-11-06 21:06 ` Hugh Dickins
2022-11-06 22:34 ` Linus Torvalds
2022-11-06 23:14 ` Andrew Morton
2022-11-07 0:06 ` Stephen Rothwell
2022-11-07 16:19 ` Linus Torvalds
2022-11-07 23:02 ` Andrew Morton
2022-11-07 23:44 ` Stephen Rothwell
2022-11-07 9:12 ` Peter Zijlstra
2022-11-07 20:07 ` Johannes Weiner
2022-11-07 20:29 ` Linus Torvalds
2022-11-07 23:47 ` Linus Torvalds
2022-11-08 4:28 ` Linus Torvalds
2022-11-08 19:56 ` Linus Torvalds
2022-11-08 20:03 ` Konstantin Ryabitsev
2022-11-08 20:18 ` Linus Torvalds