* mm: delay rmap removal until after TLB flush
@ 2022-10-31 18:43 ` Linus Torvalds

From: Linus Torvalds @ 2022-10-31 18:43 UTC
To: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
    Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle
Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
    Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
    Kirill A. Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch

Updated subject line, and here's the link to the original discussion
for new people:

  https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/

On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok. At that point we no longer have the pte or the virtual address, so
> it's not going to be exactly the same debug output.
>
> But I think it ends up being fairly natural to do
>
>         VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>
> instead, and I've fixed that last patch up to do that.

Ok, so I've got a fixed set of patches based on the feedback from
PeterZ, and also tried to do the s390 updates for this blindly, and
pushed them out into a git branch:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix

If people really want to see the patches in email again, I can do
that, but most of you already have, and the changes are either trivial
fixes or the s390 updates.

For the s390 people that I've now added to the participant list maybe
the git tree is fine - and the fundamental explanation of the problem
is in that top-most commit (with the three preceding commits being
prep-work). Or that link to the thread about this all.

That top-most commit is also where I tried to fix things up for s390
that uses its own non-gathering TLB flush due to
CONFIG_MMU_GATHER_NO_GATHER.

NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
rebased etc for further comments and fixes. So don't consider that
stable, it's still more of an RFC branch.

At a minimum I'll update it with Ack's etc, assuming I get those, and
my s390 changes are entirely untested and probably won't work.

As far as I can tell, s390 doesn't actually *have* the problem that
causes this change, because of its synchronous TLB flush, but it
obviously needs to deal with the change of rmap zapping logic.

Also added a few people who are explicitly listed as being mmu_gather
maintainers. Maybe people saw the discussion on the linux-mm list, but
let's make it explicit.

Do people have any objections to this approach, or other suggestions?

I do *not* consider this critical, so it's a "queue for 6.2" issue for me.

It probably makes most sense to queue in the -MM tree (after the thing
is acked and people agree), but I can keep that branch alive too and
just deal with it all myself as well.

Anybody?

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02  9:14 ` Christian Borntraeger

From: Christian Borntraeger @ 2022-11-02  9:14 UTC

On 31.10.22 at 19:43, Linus Torvalds wrote:
> [...]
>
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
>
> [...]
>
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.

Adding Gerald.

> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
>
> [...]
>
> Anybody?
>
> Linus

It certainly needs a build fix for s390:

  In file included from kernel/sched/core.c:78:
  ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
  ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
     50 |                 page_zap_pte_rmap(page);
        |                 ^~~~~~~~~~~~~~~~~
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02  9:23 ` Christian Borntraeger

From: Christian Borntraeger @ 2022-11-02  9:23 UTC

On 02.11.22 at 10:14, Christian Borntraeger wrote:
> On 31.10.22 at 19:43, Linus Torvalds wrote:
>> [...]
>>
>> For the s390 people that I've now added to the participant list maybe
>> the git tree is fine - and the fundamental explanation of the problem
>> is in that top-most commit (with the three preceding commits being
>> prep-work). Or that link to the thread about this all.
>
> Adding Gerald.

now the correct Gerald....

> [...]
>
> It certainly needs a build fix for s390:
>
>   In file included from kernel/sched/core.c:78:
>   ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
>   ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>      50 |                 page_zap_pte_rmap(page);
>         |                 ^~~~~~~~~~~~~~~~~
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 17:55 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-02 17:55 UTC

On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> It certainly needs a build fix for s390:
>
>   In file included from kernel/sched/core.c:78:
>   ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
>   ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>      50 |                 page_zap_pte_rmap(page);
>         |                 ^~~~~~~~~~~~~~~~~

Hmm. I'm not sure if I can add a

    #include <linux/rmap.h>

to that s390 asm header file without causing more issues.

The minimal damage would probably be to duplicate the declaration of
page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.

Not pretty to have two different declarations of that thing, but
anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
is much of mm) would then verify the consistency of them.

So I'll do that minimal fix and update that branch, but if s390 people
end up having a better fix, please holler.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 18:28 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-02 18:28 UTC

On Wed, Nov 2, 2022 at 10:55 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

I've updated the branch with that, so hopefully s390 builds now.

I also fixed a typo in the commit message and added Peter's ack. Other
than that it's all the same as it was before.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 22:29 ` Gerald Schaefer

From: Gerald Schaefer @ 2022-11-02 22:29 UTC

On Wed, 2 Nov 2022 10:55:10 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
>
> The minimal damage would probably be to duplicate the declaration of
> page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.
>
> Not pretty to have two different declarations of that thing, but
> anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
> is much of mm) would then verify the consistency of them.
>
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

It compiles now with your duplicate declaration, but adding the
#include also did not cause any damage, so that should also be OK.
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 12:45 ` Peter Zijlstra

From: Peter Zijlstra @ 2022-11-02 12:45 UTC

On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote:

>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 22:31 ` Gerald Schaefer

From: Gerald Schaefer @ 2022-11-02 22:31 UTC

On Mon, 31 Oct 2022 11:43:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
>
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
>
> [...]
>
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.

Correct, we need to flush already when we change a PTE, which is done
in ptep_get_and_clear() etc. Only exception would be lazy flushing when
only one active thread is attached, then we would flush later in
flush_tlb_mm/range(), or as soon as another thread is attached (IIRC).

So it seems straight forward to just call page_zap_pte_rmap() from our
private __tlb_remove_page_size() implementation.

Just wondering a bit why you did not also add the
VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like in the
generic change.

Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-02 23:13 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-02 23:13 UTC

On Wed, Nov 2, 2022 at 3:31 PM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> Just wondering a bit why you did not also add the
> VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like
> in the generic change.

Heh, I had considered dropping it entirely even from the generic code,
since I don't remember seeing that ever trigger, but PeterZ convinced
me otherwise.

For the s390 side I really wanted to keep things minimal since I
(obviously) didn't even build-test it, so..

I'm perfectly happy with s390 people adding it later, of course.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03  9:52 ` David Hildenbrand

From: David Hildenbrand @ 2022-11-03  9:52 UTC

On 31.10.22 19:43, Linus Torvalds wrote:
> [...]
>
> Do people have any objections to this approach, or other suggestions?
>
> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
>
> It probably makes most sense to queue in the -MM tree (after the thing
> is acked and people agree), but I can keep that branch alive too and
> just deal with it all myself as well.
>
> Anybody?

Happy to see that we're still decrementing the mapcount before
decrementing the refcount, I was briefly concerned.

I was not able to come up quickly with something that would be
fundamentally wrong here, but the devil is in the detail.

Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is
unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON
instead of VM_BUG_ON).

I agree that 6.2 is good enough and that upstreaming this via the -MM
tree would be a good way to move forward.

-- 
Thanks,

David / dhildenb
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03 16:54 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-03 16:54 UTC

On Thu, Nov 3, 2022 at 2:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> Happy to see that we're still decrementing the mapcount before
> decrementing the refcount, I was briefly concerned.

Oh, that would have been horribly wrong.

> I was not able to come up quickly with something that would be
> fundamentally wrong here, but the devil is in the detail.

So I tried to be very careful.

The biggest change in the whole series (visible in the last patch, but
there in the prep-patches too) is how it narrows down some lock
coverage.

Now, that locking didn't *do* anything valid, but I did try to point
it out when it happens - how first the mapcount is decremented outside
the memcg lock (in preparatory patches), and then later on the
__dec_lruvec_page_state() turns into a dec_lruvec_page_state() because
it's then done outside the page table lock.

The locking in the second case did do something - not locking-wise,
but simply the "running under the spinlock means we are not
preemptable without RT". And in the memcg case it was just plain
overly wide lock regions.

None of the other changes should have any real semantic meaning
*apart* from just keeping ->_mapcount elevated slightly longer.

> Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is
> unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON
> instead of VM_BUG_ON).

That VM_BUG_ON() is a case of "what to do if this ever triggers?"

So a WARN_ON() would be fatal too, it's some seriously bogus stuff to
try to put bits in that won't fit. It probably should be removed,
since the value should always be pretty much a simple constant. It was
more of a "let's be careful with new code, not for production".
Probably like pretty much all VM_BUG_ON's are - test code that just
got left around.

I considered just getting rid of ENCODE_PAGE_BITS entirely, since
there is only one bit. But it was always "let's keep the option open
for dirty bits etc", so I kept it, but I agree that the name isn't
wonderful.

And in fact I wanted the encoding to really be done by the caller (so
that TLB_ZAP_RMAP wouldn't be a new argument, but the 'page' argument
to __tlb_remove_page_*() would simply be an 'encoded page' pointer),
but that would have caused the patch to be much bigger (and expanded
the s390 side too). Which I didn't want to do.

Long-term that's probably still the right thing to do, including
passing the encoded pointers all the way to
free_pages_and_swap_cache(). Because it's pretty disgusting how it
cleans up that array in place and then does that cast to a new array
type, but it is also disgusting how it traverses that array multiple
times (ie free_pages_and_swap_cache() will just have *another* loop).

But again, those changes would have made the patch bigger, which I
didn't want at this point (and 'release_pages()' would need that
clean-in-place anyway, unless we changed *that* too and made the whole
page encoding be something widely available).

That's also why I then went with that fairly generic
"ENCODE_PAGE_BITS" name. The *use* of it right now is very very
specific to just the TLB gather, and the TLB_ZAP_RMAP bit shows that
in the name. But then I went for a fairly generic "encode extra bits
in the page pointer" name because it felt like it might expand beyond
the initial intentionally small patch in the long run.

So it's a combination of "we might want to expand on this in the
future" and yet also "I really want to keep the changes localized in
this patch". And the two are kind of inverses of each other, which
hopefully explains the seemingly disparate naming logic.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03 17:09 ` Linus Torvalds

From: Linus Torvalds @ 2022-11-03 17:09 UTC

On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But again, those changes would have made the patch bigger, which I
> didn't want at this point (and 'release_pages()' would need that
> clean-in-place anyway, unless we changed *that* too and made the whole
> page encoding be something widely available).

And just to clarify: this is not just me trying to expand the reach of
my patch.

I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE
and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits
that TLB_ZAP_RMAP is.

Except the mlock code does *not* show that in the type system, and
instead just passes a "struct page **" array around in pvec->pages,
and then you'd just better know that "oh, it's not *really* just a
page pointer".

So I really think that the "array of encoded page pointers" thing is a
generic notion that we *already* have.

It's just that we've done it disgustingly in the past, and I didn't
want to do that disgusting thing again.

So I would hope that the mlock code would some day use the same page
pointer encoding logic to actually make the whole "this is not a page
pointer that you can use directly, it has low bits set for flags" very
explicit.

I am *not* sure if then the actual encoded bits would be unified.
Probably not - you might have very different and distinct uses of the
encode_page() thing where the bits mean different things in different
contexts.

Anyway, this is me just explaining the thinking behind it all. The
page bit encoding is a very generic thing (well, "very generic" in
this case means "has at least one other independent user"), explaining
the very generic naming.

But at the same time, the particular _patch_ was meant to be very targeted.

So slightly schizophrenic name choices as a result.

               Linus
* Re: mm: delay rmap removal until after TLB flush
@ 2022-11-03 17:36 ` David Hildenbrand

From: David Hildenbrand @ 2022-11-03 17:36 UTC

On 03.11.22 18:09, Linus Torvalds wrote:
> [...]
>
> Anyway, this is me just explaining the thinking behind it all. The
> page bit encoding is a very generic thing (well, "very generic" in
> this case means "has at least one other independent user"), explaining
> the very generic naming.
>
> But at the same time, the particular _patch_ was meant to be very targeted.
>
> So slightly schizophrenic name choices as a result.

Thanks for the explanation. I brought it up because the generic name
somehow felt weird in include/asm-generic/tlb.h. Skimming over the
code I'd have expected something like TLB_ENCODE_PAGE_BITS, so making
the "very generic" things "very specific" as long as it lives in
tlb.h :)

-- 
Thanks,

David / dhildenb
* Re: mm: delay rmap removal until after TLB flush 2022-10-31 18:43 ` mm: delay rmap removal until after TLB flush Linus Torvalds ` (3 preceding siblings ...) 2022-11-03 9:52 ` David Hildenbrand @ 2022-11-04 6:33 ` Alexander Gordeev 2022-11-04 17:35 ` Linus Torvalds 4 siblings, 1 reply; 30+ messages in thread From: Alexander Gordeev @ 2022-11-04 6:33 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote: [...] > If people really want to see the patches in email again, I can do > that, but most of you already have, and the changes are either trivial > fixes or the s390 updates. > > For the s390 people that I've now added to the participant list maybe > the git tree is fine - and the fundamental explanation of the problem > is in that top-most commit (with the three preceding commits being > prep-work). Or that link to the thread about this all. I have a question about the generic part (had to master the code quoting).

> static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> {
>	for (unsigned int i = 0; i < nr; i++) {
>		struct encoded_page *encoded = pages[i];
>		unsigned int flags = encoded_page_flags(encoded);
>		if (flags) {
>			/* Clean the flagged pointer in-place */
>			struct page *page = encoded_page_ptr(encoded);
>			pages[i] = encode_page(page, 0);
>
>			/* The flag bit being set means that we should zap the rmap */

Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390 version? (I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling page_zap_pte_rmap() without such a check would be a bug).

>			page_zap_pte_rmap(page);
>			VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>		}
>	}
>
>	/*
>	 * Now all entries have been un-encoded, and changed to plain
>	 * page pointers, so we can cast the 'encoded_page' array to
>	 * a plain page array and free them
>	 */
>	free_pages_and_swap_cache((struct page **)pages, nr);
> }

Thanks! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-04 6:33 ` Alexander Gordeev @ 2022-11-04 17:35 ` Linus Torvalds 2022-11-06 21:06 ` Hugh Dickins 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-04 17:35 UTC (permalink / raw) To: Alexander Gordeev Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev <agordeev@linux.ibm.com> wrote: > > I have a question about the generic part (had to master the code quoting). Sure. Although now I think the series is in Andrew's -mm tree, or just about to get moved in there, so I'm not going to touch my actual branch any more.

> > static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> > {
> >	for (unsigned int i = 0; i < nr; i++) {
> >		struct encoded_page *encoded = pages[i];
> >		unsigned int flags = encoded_page_flags(encoded);
> >		if (flags) {
> >			/* Clean the flagged pointer in-place */
> >			struct page *page = encoded_page_ptr(encoded);
> >			pages[i] = encode_page(page, 0);
> >
> >			/* The flag bit being set means that we should zap the rmap */
>
> Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390 version?
> (I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling
> page_zap_pte_rmap() without such a check would be a bug).

No major reason. This is basically the same issue as the naming, which I touched on in https://lore.kernel.org/all/CAHk-=wiDg_1up8K4PhK4+kzPN7xJG297=nw+tvgrGn7aVgZdqw@mail.gmail.com/ and the follow-up note about how I hope the "encoded page pointer with flags" thing gets used by the mlock code some day too. 
IOW, there's kind of a generic "I have extra flags associated with the pointer", and then the specific "this case uses this flag", and depending on which mindset you have at the time, you might do one or the other. So in that clean_and_free_pages_and_swap_cache(), the code basically knows "I have a pointer with extra flags", and it's written that way. And that's partly historical, because it actually started with the original code tracking the dirty bit as the extra piece of information, and then transformed into this "no, the information is TLB_ZAP_RMAP". So "unsigned int flags" at one point was "bool dirty" instead, but then became more of a "I think this whole page pointer with flags is general", and the naming changed, and I had both cases in mind, and then the code is perhaps not so specifically named. I'm not sure the zap_page_range() case will ever use more than one flag, but the mlock case already has two different flags. So the "encode_page" thing is very much written to be about more than just the zap_page_range() case. But yes, that code could (and maybe should) use "if (flags & TLB_ZAP_RMAP)" to make it clear that in this case, the single flags bit is that one bit. But the "if ()" there actually does two conceptually *separate* things: it needs to clean the pointer in-place (which happens regardless of "which" flag bit is set), and then it does that page_zap_pte_rmap(), which is just for the TLB_ZAP_RMAP bit. So to be really explicit about it, you'd have two different tests: one for "do I have flags that need to be cleaned up" and then an inner test for each flag. And since there is only one flag in this particular use case, it's essentially that inner test that I dropped as pointless. In contrast, in the s390 version, that bit was never encoded as a general "flags associated with a page pointer" in the first place, so there was never any such duality. There is only TLB_ZAP_RMAP. Hope that explains the thinking. 
Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-04 17:35 ` Linus Torvalds @ 2022-11-06 21:06 ` Hugh Dickins 2022-11-06 22:34 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Hugh Dickins @ 2022-11-06 21:06 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Adding Stephen to Cc, for next-20221107 alert. Adding Johannes to Cc, particularly for lock_page_memcg discussion. On Fri, 4 Nov 2022, Linus Torvalds wrote: > On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev > <agordeev@linux.ibm.com> wrote: > > > > I rather have a question to the generic part (had to master the code quotting). > > Sure. > > Although now I think the series in in Andrew's -mm tree, or just about > to get moved in there, so I'm not going to touch my actual branch any > more. Linus, we've been exchanging about my mm/rmap.c mods in private mail, I need to raise some points about your mods here in public mail. Four topics - munlock (good), anon_vma (bad), mm-unstable (bad), lock_page_memcg (uncertain). I'm asking Andrew here to back your patchset out of mm-unstable for now, until we have its fixes in: otherwise I'm worried that next-20221107 will waste everyone's time. munlock (good) -------------- You've separated out the munlock_vma_page() from the rest of PTE remove rmap work. Worried me at first, but I think that's fine: bundling it into page_remove_rmap() was mainly so that we wouldn't forget to do it anywhere (and other rmap funcs already took vma arg). 
Certainly during development, I had been using page mapcount somewhere inside munlock_vma_page(), either for optimization or for sanity check, I forget; but gave up on that as unnecessary complication, and I think it became impossible with the pagevec; so not an issue now. If it were my change, I'd certainly do it by changing the name of the vma arg in page_remove_rmap() from vma to munlock_vma, not doing the munlock when that is NULL, and avoid the need for duplicating code: whereas you're very pleased with cutting out the unnecessary stuff in slimmed-down page_zap_pte_rmap(). Differing tastes (or perhaps you'll say taste versus no taste). anon_vma (bad) -------------- My first reaction to your patchset was, that it wasn't obvious to me that delaying the decrementation of mapcount is safe: I needed time to think about it. But happily, testing saved me from needing much thought. The first problem, came immediately in the first iteration of my huge tmpfs kbuild swapping loads, was BUG at mm/migrate.c:1105! VM_BUG_ON_FOLIO(folio_test_anon(src) && !folio_test_ksm(src) && !anon_vma, src); Okay, that's interesting but not necessarily fatal, we can very easily make it a recoverable condition. I didn't bother to think on that one, just patched around it. The second problem was more significant: came after nearly five hours of load, BUG NULL dereference (address 8) in down_read_trylock() in rmap_walk_anon(), while kswapd was doing a folio_referenced(). That one goes right to the heart of the matter, and instinct had been correct to worry about delaying the decrementation of mapcount, that is, extending the life of page_mapped() or folio_mapped(). See folio_lock_anon_vma_read(): folio_mapped() plays a key role in establishing the continued validity of an anon_vma. See comments above folio_get_anon_vma(), some by me but most by PeterZ IIRC. 
I believe what has happened is that your patchset has, very intentionally, kept the page as "folio_mapped" until after free_pgtables() does its unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read() that the anon_vma is safe to use when actually it has been freed. (It looked like a page table when I peeped at it.) I'm not certain, but I think that you made page_zap_pte_rmap() handle anon as well as file, just for the righteous additional simplification; but I'm afraid that (without opening a huge anon_vma refcounting can of worms) that unification has to be reverted, and anon left to go the same old way it did before. I didn't look into whether reverting one of your patches would achieve that, I just adjusted the code in zap_pte_range() to go the old way for PageAnon; and that has been running successfully, hitting neither BUG, for 15 hours now. mm-unstable (bad) ----------------- Aside from that PageAnon issue, mm-unstable is in an understandably bad state because you could not have foreseen my subpages_mapcount addition to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the PageCompound (but not the "compound") case too. I rushed you and akpm an emergency patch for that on Friday night, but you, let's say, had reservations about it. So I haven't posted it, and while the PageAnon issue remains, I think your patchset has to be removed from mm-unstable and linux-next anyway. What happens to mm-unstable with page_zap_pte_rmap() not handling subpages_mapcount? In my CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y case, I get a "Bad page state" even before reaching first login after boot, "page dumped because: nonzero subpages_mapcount". Yes, I make it worse for myself by having a BUG() in bad_page(); others will limp along with more and more of those "Bad page" messages. lock_page_memcg (uncertain) --------------------------- Johannes, please help! 
Linus has got quite upset by lock_page_memcg(), its calls from mm/rmap.c anyway, and most particularly by the way in which it is called at the start of page_remove_rmap(), before anyone's critical atomic_add_negative(), yet its use is to guarantee the stability of page memcg while doing the stats updates, done only when atomic_add_negative() says so. I do have one relevant insight on this. It (or its antecedents under other names) date from the days when we did "reparenting" of memcg charges from an LRU: and in those days the lock_page_memcg() before mapcount adjustment was vital, to pair with the uses of folio_mapped() or page_mapped() in mem_cgroup_move_account() - those "mapped" checks are precisely around the stats which the rmap functions affect. But nowadays mem_cgroup_move_account() is only called, with page table lock held, on matching pages found in a task's page table: so its "mapped" checks are redundant - I've sometimes thought in the past of removing them, but held back, because I always have the notion (not hope!) that "reparenting" may one day be brought back from the grave. I'm too out of touch with memcg to know where that notion stands today. I've gone through a multiverse of opinions on those lock_page_memcg()s in the last day: I currently believe that Linus is right, that the lock_page_memcg()s could and should be moved just before the stats updates. But I am not 100% certain of that - is there still some reason why it's important that the page memcg at the instant of the critical mapcount transition be kept unchanged until the stats are updated? I've tried running scenarios through my mind but given up. (And note that the answer might be different with Linus's changes than without them: since he delays the mapcount decrementation until long after pte was removed and page table lock dropped). 
And I do wish that we could settle this lock_page_memcg() question in an entirely separate patch: as it stands, page_zap_pte_rmap() gets to benefit from Linus's insight (or not), and all the other rmap functions carry on with the mis?placed lock_page_memcg() as before. Let's press Send, Hugh ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 21:06 ` Hugh Dickins @ 2022-11-06 22:34 ` Linus Torvalds 2022-11-06 23:14 ` Andrew Morton 2022-11-07 9:12 ` Peter Zijlstra 2022-11-07 20:07 ` Johannes Weiner 2 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-06 22:34 UTC (permalink / raw) To: Hugh Dickins Cc: Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch [ Editing down to just the bare-bones problem cases ] On Sun, Nov 6, 2022 at 1:06 PM Hugh Dickins <hughd@google.com> wrote: > > anon_vma (bad) > -------------- > > See folio_lock_anon_vma_read(): folio_mapped() plays a key role in > establishing the continued validity of an anon_vma. See comments > above folio_get_anon_vma(), some by me but most by PeterZ IIRC. > > I believe what has happened is that your patchset has, very intentionally, > kept the page as "folio_mapped" until after free_pgtables() does its > unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read() > that the anon_vma is safe to use when actually it has been freed. > (It looked like a page table when I peeped at it.) > > I'm not certain, but I think that you made page_zap_pte_rmap() handle > anon as well as file, just for the righteous additional simplification; > but I'm afraid that (without opening a huge anon_vma refcounting can of > worms) that unification has to be reverted, and anon left to go the > same old way it did before. Indeed. I made them separate initially, just because the only case that mattered for the dirty bit was the file-mapped case. But then the two functions ended up being basically the identical function, so I unified them again. 
But the anon_vma lifetime issue looks very real, and so doing the "delay rmap only for file mappings" seems sane. In fact, I wonder if we should delay it only for *dirty* file mappings, since it doesn't matter for the clean case. Hmm. I already threw away my branch (since Andrew picked the patches up), so a question for Andrew: do you want me to re-do the branch entirely, or do you want me to just send you an incremental patch? To make for minimal changes, I'd drop the 're-unification' patch, and then small updates to the zap_pte_range() code to keep the anon (and possibly non-dirty) case synchronous. And btw, this one is interesting: for anonymous (and non-dirty file-mapped) pages, we actually can end up delaying the final page free (and the rmap zapping) all the way to "tlb_finish_mmu()". Normally we still have the vma's all available, but yes, free_pgtables() can and does happen before the final TLB flush. The file-mapped dirty case doesn't have that issue - not just because it doesn't have an anon_vma at all, but because it also does that "force_flush" thing that just means that the page freeing never gets delayed that far in the first place. > mm-unstable (bad) > ----------------- > Aside from that PageAnon issue, mm-unstable is in an understandably bad > state because you could not have foreseen my subpages_mapcount addition > to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the > PageCompound (but not the "compound") case too. I rushed you and akpm > an emergency patch for that on Friday night, but you, let's say, had > reservations about it. So I haven't posted it, and while the PageAnon > issue remains, I think your patchset has to be removed from mm-unstable > and linux-next anyway. So I think I'm fine with your patch, I just want to move the memcg accounting to outside of it. I can re-do my series on top of mm-unstable, I guess. That's probably the easiest way to handle this all. 
Andrew - can you remove those patches again, and I'll create a new series for you? Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 22:34 ` Linus Torvalds @ 2022-11-06 23:14 ` Andrew Morton 2022-11-07 0:06 ` Stephen Rothwell 2022-11-07 16:19 ` Linus Torvalds 0 siblings, 2 replies; 30+ messages in thread From: Andrew Morton @ 2022-11-06 23:14 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Sun, 6 Nov 2022 14:34:51 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > mm-unstable (bad) > > ----------------- > > Aside from that PageAnon issue, mm-unstable is in an understandably bad > > state because you could not have foreseen my subpages_mapcount addition > > to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the > > PageCompound (but not the "compound") case too. I rushed you and akpm > > an emergency patch for that on Friday night, but you, let's say, had > > reservations about it. So I haven't posted it, and while the PageAnon > > issue remains, I think your patchset has to be removed from mm-unstable > > and linux-next anyway. > > So I think I'm fine with your patch, I just want to move the memcg > accounting to outside of it. > > I can re-do my series on top of mm-unstable, I guess. That's probably > the easiest way to handle this all. > > Andrew - can you remove those patches again, and I'll create a new > series for you? Yes, I've removed both serieses and shall push the tree out within half an hour from now. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 23:14 ` Andrew Morton @ 2022-11-07 0:06 ` Stephen Rothwell 0 siblings, 0 replies; 30+ messages in thread From: Stephen Rothwell @ 2022-11-07 0:06 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Hugh Dickins, Johannes Weiner, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hi all, On Sun, 6 Nov 2022 15:14:16 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > Yes, I've removed both serieses and shall push the tree out within half > an hour from now. And I have picked up the new version for today's linux-next. Thanks all. -- Cheers, Stephen Rothwell ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 23:14 ` Andrew Morton 2022-11-07 0:06 ` Stephen Rothwell @ 2022-11-07 16:19 ` Linus Torvalds 2022-11-07 23:02 ` Andrew Morton 1 sibling, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-07 16:19 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > Yes, I've removed both serieses and shall push the tree out within half > an hour from now. Oh, can you please just put Hugh's series back in? I don't love the mapcount changes, but I'm going to re-do my parts so that there is no clash with them. And while I'd much rather see the mapcounts done another way, that's a completely separate issue. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 16:19 ` Linus Torvalds @ 2022-11-07 23:02 ` Andrew Morton 2022-11-07 23:44 ` Stephen Rothwell 0 siblings, 1 reply; 30+ messages in thread From: Andrew Morton @ 2022-11-07 23:02 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Yes, I've removed both serieses and shall push the tree out within half > > an hour from now. > > Oh, can you please just put Hugh's series back in? > Done, all pushed out to git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 23:02 ` Andrew Morton @ 2022-11-07 23:44 ` Stephen Rothwell 0 siblings, 0 replies; 30+ messages in thread From: Stephen Rothwell @ 2022-11-07 23:44 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Hugh Dickins, Johannes Weiner, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hi all, On Mon, 7 Nov 2022 15:02:42 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Done, all pushed out to > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable. Will be in today's linux-next. -- Cheers, Stephen Rothwell ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 21:06 ` Hugh Dickins 2022-11-06 22:34 ` Linus Torvalds @ 2022-11-07 9:12 ` Peter Zijlstra 2022-11-07 20:07 ` Johannes Weiner 2 siblings, 0 replies; 30+ messages in thread From: Peter Zijlstra @ 2022-11-07 9:12 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Johannes Weiner, Stephen Rothwell, Alexander Gordeev, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hi Hugh! On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote: > See folio_lock_anon_vma_read(): folio_mapped() plays a key role in > establishing the continued validity of an anon_vma. See comments > above folio_get_anon_vma(), some by me but most by PeterZ IIRC. Ohhh, you're quite right. Unfortunately I seem to have completely forgotten about that :-( ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-06 21:06 ` Hugh Dickins 2022-11-06 22:34 ` Linus Torvalds 2022-11-07 9:12 ` Peter Zijlstra @ 2022-11-07 20:07 ` Johannes Weiner 2022-11-07 20:29 ` Linus Torvalds 2 siblings, 1 reply; 30+ messages in thread From: Johannes Weiner @ 2022-11-07 20:07 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch Hello, On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote: > lock_page_memcg (uncertain) > --------------------------- > Johannes, please help! Linus has got quite upset by lock_page_memcg(), > its calls from mm/rmap.c anyway, and most particularly by the way > in which it is called at the start of page_remove_rmap(), before > anyone's critical atomic_add_negative(), yet its use is to guarantee > the stability of page memcg while doing the stats updates, done only > when atomic_add_negative() says so. As you mentioned, the pte lock historically wasn't always taken on the move side. And so the reason lock_page_memcg() covers the mapcount update is that move_account() needs to atomically see either a) the page is mapped and counted, or b) unmapped and uncounted. 
If we lock after the mapcount decrement, move_account() could miss pending updates that need to be transferred, and the scheme breaks thusly:

	memcg1->nr_mapped = 1
	memcg2->nr_mapped = 0

	page_remove_rmap:                      mem_cgroup_move_account():
	if atomic_add_negative(page->mapcount):
	                                       lock_page_memcg()
	                                       if page->mapcount: // NOT TAKEN
	                                           memcg1->nr_mapped--
	                                           memcg2->nr_mapped++
	                                       page->memcg = memcg2
	                                       unlock_page_memcg()
	  lock_page_memcg()
	  page->memcg->nr_mapped-- // UNDERFLOW memcg2->nr_mapped
	  unlock_page_memcg()

> I do have one relevant insight on this. It (or its antecedents under > other names) date from the days when we did "reparenting" of memcg > charges from an LRU: and in those days the lock_page_memcg() before > mapcount adjustment was vital, to pair with the uses of folio_mapped() > or page_mapped() in mem_cgroup_move_account() - those "mapped" checks > are precisely around the stats which the rmap functions affect. > > But nowadays mem_cgroup_move_account() is only called, with page table > lock held, on matching pages found in a task's page table: so its > "mapped" checks are redundant - I've sometimes thought in the past of > removing them, but held back, because I always have the notion (not > hope!) that "reparenting" may one day be brought back from the grave. > I'm too out of touch with memcg to know where that notion stands today. > > I've gone through a multiverse of opinions on those lock_page_memcg()s > in the last day: I currently believe that Linus is right, that the > lock_page_memcg()s could and should be moved just before the stats > updates. But I am not 100% certain of that - is there still some > reason why it's important that the page memcg at the instant of the > critical mapcount transition be kept unchanged until the stats are > updated? I've tried running scenarios through my mind but given up. Okay, I think there are two options. 
- If we don't want to codify the pte lock requirement on the move side, then moving the lock_page_memcg() like that would break the locking scheme, as per above. - If we DO want to codify the pte lock requirement, we should just remove the lock_page_memcg() altogether, as it's fully redundant. I'm leaning toward the second option. If somebody brings back reparenting they can bring the lock back along with it. [ If it's even still necessary by then. It's conceivable reparenting is brought back only for cgroup2, where the race wouldn't matter because of the hierarchical stats. The reparenting side wouldn't have to move page state to the parent - it's already there. And whether rmap would see the dying child or the parent doesn't matter much either: the parent will see the update anyway, directly or recursively, and we likely don't care to balance the books on a dying cgroup. It's then just a matter of lifetime - which should be guaranteed also, as long as the pte lock prevents an RCU quiescent state. ] So how about something like below? UNTESTED, just for illustration. This is cgroup1 code, which I haven't looked at too closely in a while. If you can't spot an immediate hole in it, I'd go ahead and test it and send a proper patch. --- From 88a32b1b5737630fb981114f6333d8fd057bd8e9 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Mon, 7 Nov 2022 12:05:09 -0500 Subject: [PATCH] mm: remove lock_page_memcg() from rmap rmap changes (mapping and unmapping) of a page currently take lock_page_memcg() to serialize 1) update of the mapcount and the cgroup mapped counter with 2) cgroup moving the page and updating the old cgroup and the new cgroup counters based on page_mapped(). Before b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), we used to reassign all pages that could be found on a cgroup's LRU list on deletion - something that rmap didn't naturally serialize against. 
Since that commit, however, the only pages that get moved are those mapped into page tables of a task that's being migrated. In that case, the pte lock is always held (and we know the page is mapped), which keeps rmap changes at bay already. The additional lock_page_memcg() by rmap is redundant. Remove it. NOT-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/memcontrol.c | 27 ++++++++++++--------------- mm/rmap.c | 30 ++++++++++++------------------ 2 files changed, 24 insertions(+), 33 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2d8549ae1b30..f7716e9038e9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5666,7 +5666,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma, * @from: mem_cgroup which the page is moved from. * @to: mem_cgroup which the page is moved to. @from != @to. * - * The caller must make sure the page is not on LRU (isolate_page() is useful.) + * This function acquires folio_lock() and folio_lock_memcg(). The + * caller must exclude all other possible ways of accessing + * page->memcg, such as LRU isolation (to lock out isolation) and + * having the page mapped and pte-locked (to lock out rmap). * * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" * from old cgroup. 
@@ -5685,6 +5688,7 @@ static int mem_cgroup_move_account(struct page *page, VM_BUG_ON(from == to); VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); VM_BUG_ON(compound && !folio_test_large(folio)); + VM_WARN_ON_ONCE(!folio_mapped(folio)); /* * Prevent mem_cgroup_migrate() from looking at @@ -5705,30 +5709,23 @@ static int mem_cgroup_move_account(struct page *page, folio_memcg_lock(folio); if (folio_test_anon(folio)) { - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); - if (folio_test_transhuge(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_THPS, - -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_THPS, - nr_pages); - } + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); + if (folio_test_transhuge(folio)) { + __mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages); } } else { __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); + __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); if (folio_test_swapbacked(folio)) { __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages); __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages); } - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); - } - if (folio_test_dirty(folio)) { struct address_space *mapping = folio_mapping(folio); diff --git a/mm/rmap.c b/mm/rmap.c index 2ec925e5fa6a..60c31375f274 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1197,11 +1197,6 @@ void page_add_anon_rmap(struct page *page, bool compound = flags & RMAP_COMPOUND; bool first; - if (unlikely(PageKsm(page))) - lock_page_memcg(page); - else - VM_BUG_ON_PAGE(!PageLocked(page), page); - if (compound) { atomic_t *mapcount; VM_BUG_ON_PAGE(!PageLocked(page), page); 
@@ -1217,19 +1212,19 @@ void page_add_anon_rmap(struct page *page, if (first) { int nr = compound ? thp_nr_pages(page) : 1; /* - * We use the irq-unsafe __{inc|mod}_zone_page_stat because - * these counters are not modified in interrupt context, and - * pte lock(a spinlock) is held, which implies preemption - * disabled. + * We use the irq-unsafe __{inc|mod}_zone_page_stat + * because these counters are not modified in + * interrupt context, and pte lock(a spinlock) is + * held, which implies preemption disabled. + * + * The pte lock also stabilizes page->memcg wrt + * mem_cgroup_move_account(). */ if (compound) __mod_lruvec_page_state(page, NR_ANON_THPS, nr); __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr); } - if (unlikely(PageKsm(page))) - unlock_page_memcg(page); - /* address might be in next vma when migration races vma_adjust */ else if (first) __page_set_anon_rmap(page, vma, address, @@ -1290,7 +1285,6 @@ void page_add_file_rmap(struct page *page, int i, nr = 0; VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page); - lock_page_memcg(page); if (compound && PageTransHuge(page)) { int nr_pages = thp_nr_pages(page); @@ -1311,6 +1305,7 @@ void page_add_file_rmap(struct page *page, if (nr == nr_pages && PageDoubleMap(page)) ClearPageDoubleMap(page); + /* The pte lock stabilizes page->memcg */ if (PageSwapBacked(page)) __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED, nr_pages); @@ -1328,7 +1323,6 @@ void page_add_file_rmap(struct page *page, out: if (nr) __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); - unlock_page_memcg(page); mlock_vma_page(page, vma, compound); } @@ -1356,6 +1350,7 @@ static void page_remove_file_rmap(struct page *page, bool compound) } if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) goto out; + /* The pte lock stabilizes page->memcg */ if (PageSwapBacked(page)) __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED, -nr_pages); @@ -1423,8 +1418,6 @@ static void page_remove_anon_compound_rmap(struct page *page) void 
page_remove_rmap(struct page *page, struct vm_area_struct *vma, bool compound) { - lock_page_memcg(page); - if (!PageAnon(page)) { page_remove_file_rmap(page, compound); goto out; @@ -1443,6 +1436,9 @@ void page_remove_rmap(struct page *page, * We use the irq-unsafe __{inc|mod}_zone_page_stat because * these counters are not modified in interrupt context, and * pte lock(a spinlock) is held, which implies preemption disabled. + * + * The pte lock also stabilizes page->memcg wrt + * mem_cgroup_move_account(). */ __dec_lruvec_page_state(page, NR_ANON_MAPPED); @@ -1459,8 +1455,6 @@ void page_remove_rmap(struct page *page, * faster for those pages still in swapcache. */ out: - unlock_page_memcg(page); - munlock_vma_page(page, vma, compound); } -- 2.38.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 20:07 ` Johannes Weiner @ 2022-11-07 20:29 ` Linus Torvalds 2022-11-07 23:47 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-07 20:29 UTC (permalink / raw) To: Johannes Weiner Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, Nov 7, 2022 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > - If we DO want to codify the pte lock requirement, we should just > remove the lock_page_memcg() altogether, as it's fully redundant. > > I'm leaning toward that second option. The thing is, that's very much the case we do *not* want. We need to delay the rmap removal until at least after the TLB flush. At least for dirty filemapped pages - because the page cleaning needs to see that they exist as mapped entities until all CPU's have *actually* dropped them. Now, we do the TLB flush still under the page table lock, so we could still then do the rmap removal before dropping the lock. But it would be much cleaner from the TLB flushing standpoint to delay it until the page freeing, which ends up being delayed until after the lock is dropped. That said, if always doing the rmap removal under the page table lock means that that memcg lock can just be deleted in that whole path, I will certainly bow to _that_ simplification instead, and just handle the dirty pages after the TLB flush but before the page table drop. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 20:29 ` Linus Torvalds @ 2022-11-07 23:47 ` Linus Torvalds 2022-11-08 4:28 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-07 23:47 UTC (permalink / raw) To: Johannes Weiner Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch On Mon, Nov 7, 2022 at 12:29 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That said, if always doing the rmap removal under the page table lock > means that that memcg lock can just be deleted in that whole path, I > will certainly bow to _that_ simplification instead, and just handle > the dirty pages after the TLB flush but before the page table drop. Ok, so I think I have a fairly clean way to do this. Let me try to make that series look reasonable, although it might be until tomorrow. I'll need to massage my mess into not just prettier code, but a sane history. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush 2022-11-07 23:47 ` Linus Torvalds @ 2022-11-08 4:28 ` Linus Torvalds 2022-11-08 19:56 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2022-11-08 4:28 UTC (permalink / raw) To: Johannes Weiner Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch [-- Attachment #1: Type: text/plain, Size: 3263 bytes --] On Mon, Nov 7, 2022 at 3:47 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ok, so I think I have a fairly clean way to do this. > > Let me try to make that series look reasonable, although it might be > until tomorrow. I'll need to massage my mess into not just prettier > code, but a sane history. Ugh. Ok, so massaging it into a saner form and splitting it out into a pretty history took a lot longer than writing the initial ugly "something like this". Anyway, the end result looks very different from the previous series, since I very consciously tried to keep away from rmap changes to not clash with Hugh's work, and also made sure we still do the page_remove_rmap() under the page table lock, just *after* the TLB flush. It does share some basic stuff - I'm still using that "struct encoded_page" thing, I just moved it into a separate commit, and moved it to where it conceptually belongs (I'd love to say that I also made the mlock.c code use it, but I didn't - that 'pagevec' thing is a pretty pointless abstraction and I didn't want to fight it). So you'll see some familiar support structures and scaffolding, but the actual approach to zap_pte_range() is very different. The approach taken to the "s390 is very different" issue is also completely different. 
It should actually be better for s390: we used to cause that "force_flush" whenever we hit a dirty shared page, and s390 doesn't care. The new model makes that "s390 doesn't care" be part of the whole design, so now s390 treats dirty shared pages basically as if they weren't anything special at all. Which they aren't on s390. But, as with the previous case, I don't even try to cross-compile for s390, so my attempts at handling s390 well are just that: attempts. The code may be broken. Of course, the code may be broken on x86-64 too, but at least there I've compiled it and am running it right now. Oh, and because I copied large parts of the commit message from the previous approach (because the problem description is the same), I noticed that I also kept the "Acked-by:". Those are bogus, because the code is sufficiently different that any previous acks are just not valid any more, but I just hadn't fixed that yet. The meat of it all is in that last patch, the rest is just type system cleanups to prep for it. But it was also that last patch that I spent hours on just tweaking it to look sensible. The *code* was pretty easy. Making it have sensible naming and a sensible abstraction interface that worked well for s390 too, that was 90% of the effort. But also I hope the end result is reasonably easy to review as a result. I'm sending this out because I'm stepping away from the keyboard, because that whole "let's massage it into something legible" was really somewhat exhausting. You don't see all the small side turns it took only to go "that's ugly, let's try again" ;) Anybody interested in giving this a peek? (Patch 2/4 might make some people pause. It's fairly small and simple. It's effective and makes it easy to do some of the later changes. And it's also quite different from our usual model. It was "inspired" by the signed-vs-unsigned char thread from a few weeks ago. But patch 4/4 is the one that matters). 
Linus [-- Attachment #2: 0001-mm-introduce-encoded-page-pointers-with-embedded-ext.patch --] [-- Type: text/x-patch, Size: 2749 bytes --] From 675b73aaa7718e93e9f2492a3b9cc417f9e820b4 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 15:48:27 -0800 Subject: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits We already have this notion in parts of the MM code (see the mlock code with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new case, and I refuse to do the same thing we've done before where we just put bits in the raw pointer and say it's still a normal pointer. So this introduces a 'struct encoded_page' pointer that cannot be used for anything else than to encode a real page pointer and a couple of extra bits in the low bits. That way the compiler can trivially track the state of the pointer and you just explicitly encode and decode the extra bits. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/linux/mm_types.h | 33 ++++++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 500e536796ca..b5cffd250784 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -67,7 +67,7 @@ struct mem_cgroup; #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE #define _struct_page_alignment __aligned(2 * sizeof(unsigned long)) #else -#define _struct_page_alignment +#define _struct_page_alignment __aligned(sizeof(unsigned long)) #endif struct page { @@ -241,6 +241,37 @@ struct page { #endif } _struct_page_alignment; +/** + * struct encoded_page - a nonexistent type marking this pointer + * + * An 'encoded_page' pointer is a pointer to a regular 'struct page', but + * with the low bits of the pointer indicating extra context-dependent + * information. Not super-common, but happens in mmu_gather and mlock + * handling, and this acts as a type system check on that use. 
+ * We only really have two guaranteed bits in general, although you could + * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE) + * for more. + * + * Use the supplied helper functions to encode/decode the pointer and bits. + */ +struct encoded_page; +#define ENCODE_PAGE_BITS 3ul +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags) +{ + return (struct encoded_page *)(flags | (unsigned long)page); +} + +static inline bool encoded_page_flags(struct encoded_page *page) +{ + return ENCODE_PAGE_BITS & (unsigned long)page; +} + +static inline struct page *encoded_page_ptr(struct encoded_page *page) +{ + return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page); +} + /** * struct folio - Represents a contiguous set of bytes. * @flags: Identical to the page flags. -- 2.37.1.289.g45aa1e5c72.dirty [-- Attachment #3: 0002-mm-teach-release_pages-to-take-an-array-of-encoded-p.patch --] [-- Type: text/x-patch, Size: 4058 bytes --] From 31e4135eeedbb6ae12bfb9b17a7f6d9d815ff289 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 16:46:16 -0800 Subject: [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too release_pages() already could take either an array of page pointers, or an array of folio pointers. Expand it to also accept an array of encoded page pointers, which is what both the existing mlock() use and the upcoming mmu_gather use of encoded page pointers want. Note that release_pages() won't actually use, or react to, any extra encoded bits. Instead, this is very much a case of "I have walked the array of encoded pages and done everything the extra bits tell me to do, now release it all". 
Also, while the "either page or folio pointers" dual use was handled with a cast of the pointer in "release_folios()", this takes a slightly different approach and uses the "transparent union" attribute to describe the set of arguments to the function: https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html which has been supported by gcc forever, but the kernel hasn't used before. That allows us to avoid using various wrappers with casts, and just use the same function regardless of use. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/linux/mm.h | 21 +++++++++++++++++++-- mm/swap.c | 16 ++++++++++++---- 2 files changed, 31 insertions(+), 6 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bbcccbc5565..d9fb5c3e3045 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1179,7 +1179,24 @@ static inline void folio_put_refs(struct folio *folio, int refs) __folio_put(folio); } -void release_pages(struct page **pages, int nr); +/** + * release_pages - release an array of pages or folios + * + * This just releases a simple array of multiple pages, and + * accepts various different forms of said page array: either + * a regular old boring array of pages, an array of folios, or + * an array of encoded page pointers. + * + * The transparent union syntax for this kind of "any of these + * argument types" is all kinds of ugly, so look away. + */ +typedef union { + struct page **pages; + struct folio **folios; + struct encoded_page **encoded_pages; +} release_pages_arg __attribute__ ((__transparent_union__)); + +void release_pages(release_pages_arg, int nr); /** * folios_put - Decrement the reference count on an array of folios. 
@@ -1195,7 +1212,7 @@ void release_pages(struct page **pages, int nr); */ static inline void folios_put(struct folio **folios, unsigned int nr) { - release_pages((struct page **)folios, nr); + release_pages(folios, nr); } static inline void put_page(struct page *page) diff --git a/mm/swap.c b/mm/swap.c index 955930f41d20..596ed226ddb8 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -968,22 +968,30 @@ void lru_cache_disable(void) /** * release_pages - batched put_page() - * @pages: array of pages to release + * @arg: array of pages to release * @nr: number of pages * - * Decrement the reference count on all the pages in @pages. If it + * Decrement the reference count on all the pages in @arg. If it * fell to zero, remove the page from the LRU and free it. + * + * Note that the argument can be an array of pages, encoded pages, + * or folio pointers. We ignore any encoded bits, and turn any of + * them into just a folio that gets free'd. */ -void release_pages(struct page **pages, int nr) +void release_pages(release_pages_arg arg, int nr) { int i; + struct encoded_page **encoded = arg.encoded_pages; LIST_HEAD(pages_to_free); struct lruvec *lruvec = NULL; unsigned long flags = 0; unsigned int lock_batch; for (i = 0; i < nr; i++) { - struct folio *folio = page_folio(pages[i]); + struct folio *folio; + + /* Turn any of the argument types into a folio */ + folio = page_folio(encoded_page_ptr(encoded[i])); /* * Make sure the IRQ-safe lock-holding time does not get -- 2.37.1.289.g45aa1e5c72.dirty [-- Attachment #4: 0003-mm-mmu_gather-prepare-to-gather-encoded-page-pointer.patch --] [-- Type: text/x-patch, Size: 3376 bytes --] From 0e6863a5389a10e984122c6dca143f9be71da310 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 17:36:43 -0800 Subject: [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags This is purely a preparatory patch that makes all the data structures ready for encoding flags with the 
mmu_gather page pointers. The code currently always sets the flag to zero and doesn't use it yet, but now it's tracking the type state along. The next step will be to actually start using it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/asm-generic/tlb.h | 2 +- include/linux/swap.h | 2 +- mm/mmu_gather.c | 4 ++-- mm/swap_state.c | 11 ++++------- 4 files changed, 8 insertions(+), 11 deletions(-) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 492dce43236e..faca23e87278 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -242,7 +242,7 @@ struct mmu_gather_batch { struct mmu_gather_batch *next; unsigned int nr; unsigned int max; - struct page *pages[]; + struct encoded_page *encoded_pages[]; }; #define MAX_GATHER_BATCH \ diff --git a/include/linux/swap.h b/include/linux/swap.h index a18cf4b7c724..40e418e3461b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -470,7 +470,7 @@ static inline unsigned long total_swapcache_pages(void) extern void free_swap_cache(struct page *page); extern void free_page_and_swap_cache(struct page *); -extern void free_pages_and_swap_cache(struct page **, int); +extern void free_pages_and_swap_cache(struct encoded_page **, int); /* linux/mm/swapfile.c */ extern atomic_long_t nr_swap_pages; extern long total_swap_pages; diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index add4244e5790..57b7850c1b5e 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -48,7 +48,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb) struct mmu_gather_batch *batch; for (batch = &tlb->local; batch && batch->nr; batch = batch->next) { - struct page **pages = batch->pages; + struct encoded_page **pages = batch->encoded_pages; do { /* @@ -92,7 +92,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_ * Add the page and check if we are full. If so * force a flush. 
*/ - batch->pages[batch->nr++] = page; + batch->encoded_pages[batch->nr++] = encode_page(page, 0); if (batch->nr == batch->max) { if (!tlb_next_batch(tlb)) return true; diff --git a/mm/swap_state.c b/mm/swap_state.c index 438d0676c5be..8bf08c313872 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -303,15 +303,12 @@ void free_page_and_swap_cache(struct page *page) * Passed an array of pages, drop them all from swapcache and then release * them. They are removed from the LRU and freed if this is their last use. */ -void free_pages_and_swap_cache(struct page **pages, int nr) +void free_pages_and_swap_cache(struct encoded_page **pages, int nr) { - struct page **pagep = pages; - int i; - lru_add_drain(); - for (i = 0; i < nr; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, nr); + for (int i = 0; i < nr; i++) + free_swap_cache(encoded_page_ptr(pages[i])); + release_pages(pages, nr); } static inline bool swap_use_vma_readahead(void) -- 2.37.1.289.g45aa1e5c72.dirty [-- Attachment #5: 0004-mm-delay-page_remove_rmap-until-after-the-TLB-has-be.patch --] [-- Type: text/x-patch, Size: 11173 bytes --] From 7ef5220cd5825d6e4a770286d8949b9b838bbc30 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 7 Nov 2022 19:38:38 -0800 Subject: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed When we remove a page table entry, we are very careful to only free the page after we have flushed the TLB, because other CPUs could still be using the page through stale TLB entries until after the flush. However, we have removed the rmap entry for that page early, which means that functions like folio_mkclean() would end up not serializing with the page table lock because the page had already been made invisible to rmap. 
And that is a problem, because while the TLB entry exists, we could end up with the following situation: (a) one CPU could come in and clean it, never seeing our mapping of the page (b) another CPU could continue to use the stale and dirty TLB entry and continue to write to said page resulting in a page that has been dirtied, but then marked clean again, all while another CPU might have dirtied it some more. End result: possibly lost dirty data. This extends our current TLB gather infrastructure to optionally track a "should I do a delayed page_remove_rmap() for this page after flushing the TLB". It uses the newly introduced 'encoded page pointer' to do that without having to keep separate data around. Note, this is complicated by a couple of issues: - s390 has its own mmu_gather model that doesn't delay TLB flushing, and as a result also does not want the delayed rmap - we want to delay the rmap removal, but not past the page table lock - we can track an enormous number of pages in our mmu_gather structure, with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each, all set up to be approximately 10k pending pages. We do not want to have a huge number of batched pages that we then need to check for delayed rmap handling inside the page table lock. Particularly that last point results in a noteworthy detail, where the normal page batch gathering is limited once we have delayed rmaps pending, in such a way that only the last batch (the so-called "active batch") in the mmu_gather structure can have any delayed entries. NOTE! While the "possibly lost dirty data" sounds catastrophic, for this all to happen you need to have a user thread doing either madvise() with MADV_DONTNEED or a full re-mmap() of the area concurrently with another thread continuing to use said mapping. So arguably this is about user space doing crazy things, but from a VM consistency standpoint it's better if we track the dirty bit properly even when user space goes off the rails. 
Reported-by: Nadav Amit <nadav.amit@gmail.com> Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/ Cc: Will Deacon <will@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- arch/s390/include/asm/tlb.h | 21 +++++++++++++++++++-- include/asm-generic/tlb.h | 21 +++++++++++++++++---- mm/memory.c | 23 +++++++++++++++++------ mm/mmu_gather.c | 35 +++++++++++++++++++++++++++++++++-- 4 files changed, 86 insertions(+), 14 deletions(-) diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 3a5c8fb590e5..e5903ee2f1ca 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -25,7 +25,8 @@ void __tlb_remove_table(void *_table); static inline void tlb_flush(struct mmu_gather *tlb); static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct page *page, int page_size); + struct page *page, int page_size, + unsigned int flags); #define tlb_flush tlb_flush #define pte_free_tlb pte_free_tlb @@ -36,13 +37,24 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, #include <asm/tlbflush.h> #include <asm-generic/tlb.h> +/* + * s390 never needs to delay page_remove_rmap, because + * the ptep_get_and_clear_full() will have flushed the + * TLB across CPUs + */ +static inline bool tlb_delay_rmap(struct mmu_gather *tlb) +{ + return false; +} + /* * Release the page cache reference for a pte removed by * tlb_ptep_clear_flush. 
In both flush modes the tlb for a page cache page * has already been freed, so just do free_page_and_swap_cache. */ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct page *page, int page_size) + struct page *page, int page_size, + unsigned int flags) { free_page_and_swap_cache(page); return false; @@ -53,6 +65,11 @@ static inline void tlb_flush(struct mmu_gather *tlb) __tlb_flush_mm_lazy(tlb->mm); } +static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) +{ + /* Nothing to do, s390 does not delay rmaps */ +} + /* * pte_free_tlb frees a pte table and clears the CRSTE for the * page table from the tlb. diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index faca23e87278..9df513e5ad28 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -257,7 +257,15 @@ struct mmu_gather_batch { #define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH) extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, - int page_size); + int page_size, unsigned int flags); +extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma); + +/* + * This both sets 'delayed_rmap', and returns true. It would be an inline + * function, except we define it before the 'struct mmu_gather'. + */ +#define tlb_delay_rmap(tlb) (((tlb)->delayed_rmap = 1), true) + #endif /* @@ -290,6 +298,11 @@ struct mmu_gather { */ unsigned int freed_tables : 1; + /* + * Do we have pending delayed rmap removals? + */ + unsigned int delayed_rmap : 1; + /* * at which levels have we cleared entries? 
 */
@@ -431,13 +444,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, page, page_size, 0))
		tlb_flush_mmu(tlb);
 }

-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, PAGE_SIZE, flags);
 }

 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..60a0f44f6e72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1432,6 +1432,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
			break;

		if (pte_present(ptent)) {
+			unsigned int delay_rmap;
+
			page = vm_normal_page(vma, addr, ptent);
			if (unlikely(!should_zap_page(details, page)))
				continue;
@@ -1443,20 +1445,26 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
			if (unlikely(!page))
				continue;

+			delay_rmap = 0;
			if (!PageAnon(page)) {
				if (pte_dirty(ptent)) {
-					force_flush = 1;
					set_page_dirty(page);
+					if (tlb_delay_rmap(tlb)) {
+						delay_rmap = 1;
+						force_flush = 1;
+					}
				}
				if (pte_young(ptent) &&
				    likely(!(vma->vm_flags & VM_SEQ_READ)))
					mark_page_accessed(page);
			}
			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (!delay_rmap) {
+				page_remove_rmap(page, vma, false);
+				if (unlikely(page_mapcount(page) < 0))
+					print_bad_pte(vma, addr, ptent, page);
+			}
+			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
				force_flush = 1;
				addr += PAGE_SIZE;
				break;
@@ -1513,8 +1521,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
	arch_leave_lazy_mmu_mode();

	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (force_flush) {
		tlb_flush_mmu_tlbonly(tlb);
+		if (tlb->delayed_rmap)
+			tlb_flush_rmaps(tlb, vma);
+	}
	pte_unmap_unlock(start_pte, ptl);

	/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 57b7850c1b5e..136f5fad43e3 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/swap.h>
+#include <linux/rmap.h>

 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -19,6 +20,10 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 {
	struct mmu_gather_batch *batch;

+	/* No more batching if we have delayed rmaps pending */
+	if (tlb->delayed_rmap)
+		return false;
+
	batch = tlb->active;
	if (batch->next) {
		tlb->active = batch->next;
@@ -43,6 +48,31 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
	return true;
 }

+/**
+ * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB
+ * @tlb: the current mmu_gather
+ *
+ * Note that because of how tlb_next_batch() above works, we will
+ * never start new batches with pending delayed rmaps, so we only
+ * need to walk through the current active batch.
+ */
+void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	struct mmu_gather_batch *batch;
+
+	batch = tlb->active;
+	for (int i = 0; i < batch->nr; i++) {
+		struct encoded_page *enc = batch->encoded_pages[i];
+
+		if (encoded_page_flags(enc)) {
+			struct page *page = encoded_page_ptr(enc);
+			page_remove_rmap(page, vma, false);
+		}
+	}
+
+	tlb->delayed_rmap = 0;
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
	struct mmu_gather_batch *batch;
@@ -77,7 +107,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
	tlb->local.next = NULL;
 }

-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, unsigned int flags)
 {
	struct mmu_gather_batch *batch;

@@ -92,7 +122,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
	 * Add the page and check if we are full. If so
	 * force a flush.
	 */
-	batch->encoded_pages[batch->nr++] = encode_page(page, 0);
+	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
	if (batch->nr == batch->max) {
		if (!tlb_next_batch(tlb))
			return true;
@@ -286,6 +316,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
	tlb->active = &tlb->local;
	tlb->batch_count = 0;
 #endif
+	tlb->delayed_rmap = 0;
	tlb_table_init(tlb);

 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
-- 
2.37.1.289.g45aa1e5c72.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush
  2022-11-08  4:28 ` Linus Torvalds
@ 2022-11-08 19:56 ` Linus Torvalds
  2022-11-08 20:03 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 30+ messages in thread

From: Linus Torvalds @ 2022-11-08 19:56 UTC (permalink / raw)
To: Johannes Weiner, Konstantin Ryabitsev
Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev, Peter Zijlstra,
    Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens, Vasily Gorbik,
    Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
    John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
    Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
    Uros Bizjak, Alistair Popple, linux-arch

On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm sending this out because I'm stepping away from the keyboard,
> because that whole "let's massage it into something legible" was
> really somewhat exhausting. You don't see all the small side turns it
> took only to go "that's ugly, let's try again" ;)

Ok, I actually sent the individual patches with 'git-send-email',
although I only sent them to the mailing list and to people who were
mentioned in the commit descriptions.

I hope that makes review easier.

See

    https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org

for the series if you weren't mentioned and are interested.

Oh, and because I decided to just use the email in this thread as the
reference and cover letter, it turns out that this all confuses 'b4',
because it actually walks up the whole thread all the way to the
original 13-patch series by PeterZ that started this whole discussion.

I've seen that before with other people's patch series, but now that it
has happened to my own, I'm cc'ing Konstantin here too to see if there's
some magic for b4 to say "look, I pointed you at a msg-id that is
clearly a new series, don't walk all the way up and then take patches
from a completely different one". Oh well.
I guess I should just not have been lazy, and done a cover letter and a
whole new thread. My bad.

Konstantin, please help me look like less of a tool?

          Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush
  2022-11-08 19:56 ` Linus Torvalds
@ 2022-11-08 20:03 ` Konstantin Ryabitsev
  2022-11-08 20:18 ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread

From: Konstantin Ryabitsev @ 2022-11-08 20:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Weiner, Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
    Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit,
    Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton,
    kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov,
    Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch

On Tue, Nov 08, 2022 at 11:56:13AM -0800, Linus Torvalds wrote:
> On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'm sending this out because I'm stepping away from the keyboard,
> > because that whole "let's massage it into something legible" was
> > really somewhat exhausting. You don't see all the small side turns it
> > took only to go "that's ugly, let's try again" ;)
>
> Ok, I actually sent the individual patches with 'git-send-email',
> although I only sent them to the mailing list and to people who were
> mentioned in the commit descriptions.
>
> I hope that makes review easier.
>
> See
>
>     https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org
>
> for the series if you weren't mentioned and are interested.
>
> Oh, and because I decided to just use the email in this thread as the
> reference and cover letter, it turns out that this all confuses 'b4',
> because it actually walks up the whole thread all the way to the
> original 13-patch series by PeterZ that started this whole discussion.
> I've seen that before with other people's patch series, but now that it
> has happened to my own, I'm cc'ing Konstantin here too to see if there's
> some magic for b4 to say "look, I pointed you at a msg-id that is
> clearly a new series, don't walk all the way up and then take patches
> from a completely different one".

Yes, --no-parent. It's slightly more complicated in your case because
the patches aren't threaded to the first patch/cover letter, but you can
choose an arbitrary msgid upthread and tell b4 to ignore anything that
came before it. E.g.:

    b4 am -o/tmp --no-parent 20221108194139.57604-1-torvalds@linux-foundation.org

-K

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: mm: delay rmap removal until after TLB flush
  2022-11-08 20:03 ` Konstantin Ryabitsev
@ 2022-11-08 20:18 ` Linus Torvalds
  0 siblings, 0 replies; 30+ messages in thread

From: Linus Torvalds @ 2022-11-08 20:18 UTC (permalink / raw)
To: Konstantin Ryabitsev
Cc: Johannes Weiner, Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
    Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Nadav Amit,
    Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton,
    kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov,
    Joerg Roedel, Uros Bizjak, Alistair Popple, linux-arch

On Tue, Nov 8, 2022 at 12:03 PM Konstantin Ryabitsev <mricon@kernel.org> wrote:
>
> Yes, --no-parent.

Ahh, that's new. I guess I need to update my ancient b4-0.8.0 install...

But yes, with that, and the manual parent lookup (because otherwise
"--no-parent" will fetch *just* the patch itself, not even walking up
the single parent chain), it works.

Maybe a "--single-parent" or "--deadbeat-parent" option would be a good idea?

Anyway, with a more recent b4 version, the command

    b4 am --no-parent CAHk-=wh6MxaCA4pXpt1F5Bn2__6MxCq0Dr-rES4i=MOL9ibjpg@mail.gmail.com

gets that series and only that series.

          Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2022-11-08 20:19 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com>
[not found] ` <CAHk-=wgzT1QsSCF-zN+eS06WGVTBg4sf=6oTMg95+AEq7QrSCQ@mail.gmail.com>
[not found] ` <47678198-C502-47E1-B7C8-8A12352CDA95@gmail.com>
[not found] ` <CAHk-=wjzngbbwHw4nAsqo_RpyOtUDk5G+Wus=O0w0A6goHvBWA@mail.gmail.com>
[not found] ` <CAHk-=wijU_YHSZq5N7vYK+qHPX0aPkaePaGOyWk4aqMvvSXxJA@mail.gmail.com>
[not found] ` <140B437E-B994-45B7-8DAC-E9B66885BEEF@gmail.com>
[not found] ` <CAHk-=wjX_P78xoNcGDTjhkgffs-Bhzcwp-mdsE1maeF57Sh0MA@mail.gmail.com>
[not found] ` <CAHk-=wio=UKK9fX4z+0CnyuZG7L+U9OB7t7Dcrg4FuFHpdSsfw@mail.gmail.com>
[not found] ` <CAHk-=wgz0QQd6KaRYQ8viwkZBt4xDGuZTFiTB8ifg7E3F2FxHg@mail.gmail.com>
[not found] ` <CAHk-=wiwt4LC-VmqvYrphraF0=yQV=CQimDCb0XhtXwk8oKCCA@mail.gmail.com>
[not found] ` <Y1+XCALog8bW7Hgl@hirez.programming.kicks-ass.net>
[not found] ` <CAHk-=wjnvPA7mi-E3jVEfCWXCNJNZEUjm6XODbbzGOh9c8mhgw@mail.gmail.com>
2022-10-31 18:43 ` mm: delay rmap removal until after TLB flush Linus Torvalds
2022-11-02 9:14 ` Christian Borntraeger
2022-11-02 9:23 ` Christian Borntraeger
2022-11-02 17:55 ` Linus Torvalds
2022-11-02 18:28 ` Linus Torvalds
2022-11-02 22:29 ` Gerald Schaefer
2022-11-02 12:45 ` Peter Zijlstra
2022-11-02 22:31 ` Gerald Schaefer
2022-11-02 23:13 ` Linus Torvalds
2022-11-03 9:52 ` David Hildenbrand
2022-11-03 16:54 ` Linus Torvalds
2022-11-03 17:09 ` Linus Torvalds
2022-11-03 17:36 ` David Hildenbrand
2022-11-04 6:33 ` Alexander Gordeev
2022-11-04 17:35 ` Linus Torvalds
2022-11-06 21:06 ` Hugh Dickins
2022-11-06 22:34 ` Linus Torvalds
2022-11-06 23:14 ` Andrew Morton
2022-11-07 0:06 ` Stephen Rothwell
2022-11-07 16:19 ` Linus Torvalds
2022-11-07 23:02 ` Andrew Morton
2022-11-07 23:44 ` Stephen Rothwell
2022-11-07 9:12 ` Peter Zijlstra
2022-11-07 20:07 ` Johannes Weiner
2022-11-07 20:29 ` Linus Torvalds
2022-11-07 23:47 ` Linus Torvalds
2022-11-08 4:28 ` Linus Torvalds
2022-11-08 19:56 ` Linus Torvalds
2022-11-08 20:03 ` Konstantin Ryabitsev
2022-11-08 20:18 ` Linus Torvalds