* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
2018-06-29 21:01 ` Linus Torvalds
@ 2018-06-29 21:46 ` Kirill A. Shutemov
2018-06-30 2:22 ` Linus Torvalds
2018-06-30 6:23 ` Aneesh Kumar K.V
2018-06-30 0:55 ` Segher Boessenkool
` (2 subsequent siblings)
3 siblings, 2 replies; 11+ messages in thread
From: Kirill A. Shutemov @ 2018-06-29 21:46 UTC (permalink / raw)
To: Linus Torvalds, Aneesh Kumar K.V
Cc: Larry Finger, Matthew Wilcox, Kirill A. Shutemov, Vlastimil Babka,
Christoph Lameter, Dave Hansen, Jerome Glisse, Lai Jiangshan,
Martin Schwidefsky, Pekka Enberg, Randy Dunlap, Andrey Ryabinin,
Andrew Morton, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman, ppc-dev, Linux Kernel Mailing List
On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
> >
> > I have more information regarding this BUG. Line 700 of page-flags.h is the
> > macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
> > the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
> > in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
> > include/linux/mm.h. I also added a printk call to PageTable() that logs
> > page->page_type. The routine was called twice. The first had page_type of
> > 0xfffffbff, which would have been expected for a page table. The second call had
> > 0xffffffff, which led to the BUG.
>
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
>
> Then it took an exception (that "interrupt: 700") and that causes
> do_exit() again, and it tries to free the same page table - and now
> it's no longer marked as a page table, because it already went through
> the __ClearPageTable() dance once.
>
> So on the second path through, it catches that "the bit already said
> it wasn't a page table" and does the BUG.
>
> But the real question is what the problem was the *first* time around.
+Aneesh.
Looks like pgtable_page_dtor() gets called in __pte_free_tlb() path twice.
Once in __pte_free_tlb() itself and the second time in pgtable_free().
Would this help?
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 6a6673907e45..e7a2f0e6b695 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -137,7 +137,6 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_BOOK3S_32_PGALLOC_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 1707781d2f20..30a13b80fd58 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -139,7 +139,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
 	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
--
Kirill A. Shutemov
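
[Editorial sketch, not part of the original message.] The failure mode described above can be modelled in a few lines of user-space C. Everything here is an assumption made for illustration: the `toy_*` names are hypothetical, the constants are taken from the two values Larry logged (0xfffffbff and 0xffffffff), and the type check is simplified compared with what PAGE_TYPE_OPS() generates in the kernel. The point is only to show why a second destructor call on the same page trips the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, derived from the thread: an untyped page reads
 * 0xffffffff and a page-table page reads 0xfffffbff (bit 0x400 is 0). */
#define PAGE_TYPE_NONE 0xffffffffu
#define PG_TABLE_BIT   0x00000400u

/* Hypothetical stand-in for the page_type field of struct page. */
struct toy_page {
	uint32_t page_type;
};

/* The bits are inverted: a page has the type when its bit is 0. */
static bool toy_page_is_table(const struct toy_page *page)
{
	return (page->page_type & PG_TABLE_BIT) == 0;
}

/* Marking a page as a page table clears the bit. */
static void toy_set_page_table(struct toy_page *page)
{
	page->page_type &= ~PG_TABLE_BIT;
}

/* Stand-in for __ClearPageTable(): returns false where the real code
 * would hit VM_BUG_ON_PAGE(!PageTable(page), page). */
static bool toy_clear_page_table(struct toy_page *page)
{
	if (!toy_page_is_table(page))
		return false;
	page->page_type |= PG_TABLE_BIT;
	return true;
}

/* Models a free path that runs the destructor dtor_calls times and
 * returns how many of those calls would have triggered the BUG. */
static int toy_free_pte_page(struct toy_page *page, int dtor_calls)
{
	int bugs = 0;

	for (int i = 0; i < dtor_calls; i++)
		if (!toy_clear_page_table(page))
			bugs++;
	return bugs;
}

/* Walks through the two observations from the report. */
static void toy_demo(void)
{
	struct toy_page page = { .page_type = PAGE_TYPE_NONE };

	toy_set_page_table(&page);
	assert(page.page_type == 0xfffffbffu);    /* first logged value */

	assert(toy_free_pte_page(&page, 1) == 0); /* one dtor call is fine */
	assert(page.page_type == 0xffffffffu);    /* second logged value */

	toy_set_page_table(&page);
	assert(toy_free_pte_page(&page, 2) == 1); /* second call would BUG */
}
```

Calling toy_demo() from a main() exercises both cases. In this model the patch above corresponds to going from two destructor calls to one on the __pte_free_tlb() path.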
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
2018-06-29 21:46 ` Kirill A. Shutemov
@ 2018-06-30 2:22 ` Linus Torvalds
2018-06-30 6:23 ` Aneesh Kumar K.V
1 sibling, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2018-06-30 2:22 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: aneesh.kumar, Larry Finger, Matthew Wilcox, Kirill A. Shutemov,
Vlastimil Babka, Christoph Lameter, Dave Hansen, Jerome Glisse,
Lai Jiangshan, Martin Schwidefsky, Pekka Enberg, Randy Dunlap,
Andrey Ryabinin, Andrew Morton, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, ppc-dev,
Linux Kernel Mailing List
On Fri, Jun 29, 2018 at 2:46 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Looks like pgtable_page_dtor() gets called in __pte_free_tlb() path twice.
> Once in __pte_free_tlb() itself and the second time in pgtable_free().
Ahh, that would certainly do it, and explains why this hits ppc32 but
not x86, for example.
Linus
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
2018-06-29 21:46 ` Kirill A. Shutemov
2018-06-30 2:22 ` Linus Torvalds
@ 2018-06-30 6:23 ` Aneesh Kumar K.V
1 sibling, 0 replies; 11+ messages in thread
From: Aneesh Kumar K.V @ 2018-06-30 6:23 UTC (permalink / raw)
To: Kirill A. Shutemov, Linus Torvalds
Cc: Larry Finger, Matthew Wilcox, Kirill A. Shutemov, Vlastimil Babka,
Christoph Lameter, Dave Hansen, Jerome Glisse, Lai Jiangshan,
Martin Schwidefsky, Pekka Enberg, Randy Dunlap, Andrey Ryabinin,
Andrew Morton, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman, ppc-dev, Linux Kernel Mailing List
On 06/30/2018 03:16 AM, Kirill A. Shutemov wrote:
> On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:
>> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
>>>
>>> I have more information regarding this BUG. Line 700 of page-flags.h is the
>>> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
>>> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
>>> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
>>> include/linux/mm.h. I also added a printk call to PageTable() that logs
>>> page->page_type. The routine was called twice. The first had page_type of
>>> 0xfffffbff, which would have been expected for a page table. The second call had
>>> 0xffffffff, which led to the BUG.
>>
>> So it looks to me like the tear-down of the page tables first found a
>> page that is indeed a page table, and cleared the page table bit
>> (well, it set it - the bits are reversed).
>>
>> Then it took an exception (that "interrupt: 700") and that causes
>> do_exit() again, and it tries to free the same page table - and now
>> it's no longer marked as a page table, because it already went through
>> the __ClearPageTable() dance once.
>>
>> So on the second path through, it catches that "the bit already said
>> it wasn't a page table" and does the BUG.
>>
>> But the real question is what the problem was the *first* time around.
>
> +Aneesh.
>
> Looks like pgtable_page_dtor() gets called in __pte_free_tlb() path twice.
> Once in __pte_free_tlb() itself and the second time in pgtable_free().
>
> Would this help?
>
> diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
> index 6a6673907e45..e7a2f0e6b695 100644
> --- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
> +++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
> @@ -137,7 +137,6 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
>  				  unsigned long address)
>  {
> -	pgtable_page_dtor(table);
>  	pgtable_free_tlb(tlb, page_address(table), 0);
>  }
>  #endif /* _ASM_POWERPC_BOOK3S_32_PGALLOC_H */
> diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
> index 1707781d2f20..30a13b80fd58 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
> @@ -139,7 +139,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
>  				  unsigned long address)
>  {
>  	tlb_flush_pgtable(tlb, address);
> -	pgtable_page_dtor(table);
>  	pgtable_free_tlb(tlb, page_address(table), 0);
>  }
>  #endif /* _ASM_POWERPC_PGALLOC_32_H */
>
https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-June/175015.html
Also part of pull request from Michael Ellerman
-aneesh
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
2018-06-29 21:01 ` Linus Torvalds
2018-06-29 21:46 ` Kirill A. Shutemov
@ 2018-06-30 0:55 ` Segher Boessenkool
2018-06-30 2:38 ` Denise Finger
2018-07-02 4:16 ` Michael Ellerman
3 siblings, 0 replies; 11+ messages in thread
From: Segher Boessenkool @ 2018-06-30 0:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, Randy Dunlap, Dave Hansen, Lai Jiangshan,
Linux Kernel Mailing List, Matthew Wilcox, Pekka Enberg,
Jerome Glisse, Paul Mackerras, Kirill A. Shutemov,
Martin Schwidefsky, Andrey Ryabinin, Christoph Lameter, ppc-dev,
Andrew Morton, Vlastimil Babka
On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
> But the real question is what the problem was the *first* time around.
> I assume that has scrolled off the screen? This part:
>
> _exception_pkey+0x58/0x128
> ret_from_except_full+0x0/0x4
> --- interrupt: 700 at free_pgd_range+0x19c/0x30c
> LR = free_pgd_range+0x19c/0x30c
> free_pgtables+0xa/0xb
> exit_mnap+0xf4/0x16c
> mmput+0x64/0xf0
>
> Does reverting that commit 1d40a5ea01d5 make everything work for you?
> Because if so, judging by the deafening silence on this so far, I
> think that's what we should do.
>
> That said, can some ppc person who knows the 32-bit ppc code and maybe
> knows what that "interrupt: 700" means talk about that oddity in the
> trace, please?
700 is "program interrupt"; here it probably means a BUG() happened (which
does a trap instruction, which causes a 700). The stuff that scrolled away
should tell more.
Segher
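
[Editorial sketch, not part of the original message.] The number in "interrupt: 700" is an exception vector offset. A small lookup table makes that reading concrete. The table below lists vector offsets commonly documented for classic 32-bit PowerPC. Only the 0x700 entry is confirmed by this thread; the other entries and all `ppc_vector*` names are assumptions added for illustration and should be checked against the manual for the specific CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table type: vector offset and a short description. */
struct ppc_vector {
	unsigned int offset;
	const char *name;
};

/* Assumed list of classic 32-bit PowerPC vectors; not exhaustive. */
static const struct ppc_vector ppc_vectors[] = {
	{ 0x100, "system reset" },
	{ 0x200, "machine check" },
	{ 0x300, "data storage" },
	{ 0x400, "instruction storage" },
	{ 0x500, "external interrupt" },
	{ 0x600, "alignment" },
	{ 0x700, "program" },	/* traps, including the one BUG() executes */
	{ 0x800, "floating-point unavailable" },
	{ 0x900, "decrementer" },
	{ 0xc00, "system call" },
};

/* Returns the description for a vector offset, or "unknown". */
static const char *ppc_vector_name(unsigned int offset)
{
	size_t n = sizeof(ppc_vectors) / sizeof(ppc_vectors[0]);

	for (size_t i = 0; i < n; i++)
		if (ppc_vectors[i].offset == offset)
			return ppc_vectors[i].name;
	return "unknown";
}

/* The value from the trace maps to the program interrupt. */
static void ppc_vector_demo(void)
{
	assert(strcmp(ppc_vector_name(0x700), "program") == 0);
	assert(strcmp(ppc_vector_name(0x123), "unknown") == 0);
}
```

Read this way, the "--- interrupt: 700" line in the trace marks the point where a trap instruction fired inside free_pgd_range(), which is consistent with a BUG() rather than a stray memory fault.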
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
2018-06-29 21:01 ` Linus Torvalds
2018-06-29 21:46 ` Kirill A. Shutemov
2018-06-30 0:55 ` Segher Boessenkool
@ 2018-06-30 2:38 ` Denise Finger
2018-07-02 4:16 ` Michael Ellerman
3 siblings, 0 replies; 11+ messages in thread
From: Denise Finger @ 2018-06-30 2:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Kirill A. Shutemov, Vlastimil Babka,
Christoph Lameter, Dave Hansen, Jerome Glisse, Lai Jiangshan,
Martin Schwidefsky, Pekka Enberg, Randy Dunlap, Andrey Ryabinin,
Andrew Morton, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman, ppc-dev, Linux Kernel Mailing List
On 06/29/2018 04:01 PM, Linus Torvalds wrote:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
>>
>> I have more information regarding this BUG. Line 700 of page-flags.h is the
>> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
>> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
>> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
>> include/linux/mm.h. I also added a printk call to PageTable() that logs
>> page->page_type. The routine was called twice. The first had page_type of
>> 0xfffffbff, which would have been expected for a page table. The second call had
>> 0xffffffff, which led to the BUG.
>
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
>
> Then it took an exception (that "interrupt: 700") and that causes
> do_exit() again, and it tries to free the same page table - and now
> it's no longer marked as a page table, because it already went through
> the __ClearPageTable() dance once.
>
> So on the second path through, it catches that "the bit already said
> it wasn't a page table" and does the BUG.
>
> But the real question is what the problem was the *first* time around.
> I assume that has scrolled off the screen? This part:
>
> _exception_pkey+0x58/0x128
> ret_from_except_full+0x0/0x4
> --- interrupt: 700 at free_pgd_range+0x19c/0x30c
> LR = free_pgd_range+0x19c/0x30c
> free_pgtables+0xa/0xb
> exit_mnap+0xf4/0x16c
> mmput+0x64/0xf0
>
> Does reverting that commit 1d40a5ea01d5 make everything work for you?
> Because if so, judging by the deafening silence on this so far, I
> think that's what we should do.
>
> That said, can some ppc person who knows the 32-bit ppc code and maybe
> knows what that "interrupt: 700" means talk about that oddity in the
> trace, please?
The deafening silence may be due to my having an old Microsoft address for
Matthew Wilcox in my first posting. He should now have received the BUG report,
and he may have some suggestions. Yes, reverting commit 1d40a5ea01d5 does permit
the box to boot.
Kirill's patch also works, which seems like a better solution. If any other
architecture bugs on boot, at least we will know where to look. :)
@Kirill: You may add a Reported-by: and Tested-by: Larry Finger
<Larry.Finger@lwfinger.net> to the patch.
Thanks for the help,
Larry
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
2018-06-29 21:01 ` Linus Torvalds
` (2 preceding siblings ...)
2018-06-30 2:38 ` Denise Finger
@ 2018-07-02 4:16 ` Michael Ellerman
2018-07-02 20:51 ` Larry Finger
3 siblings, 1 reply; 11+ messages in thread
From: Michael Ellerman @ 2018-07-02 4:16 UTC (permalink / raw)
To: Linus Torvalds, Larry Finger
Cc: Matthew Wilcox, Kirill A. Shutemov, Vlastimil Babka,
Christoph Lameter, Dave Hansen, Jerome Glisse, Lai Jiangshan,
Martin Schwidefsky, Pekka Enberg, Randy Dunlap, Andrey Ryabinin,
Andrew Morton, Benjamin Herrenschmidt, Paul Mackerras, ppc-dev,
Linux Kernel Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
>>
>> I have more information regarding this BUG. Line 700 of page-flags.h is the
>> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
>> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
>> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
>> include/linux/mm.h. I also added a printk call to PageTable() that logs
>> page->page_type. The routine was called twice. The first had page_type of
>> 0xfffffbff, which would have been expected for a page table. The second call had
>> 0xffffffff, which led to the BUG.
>
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
...
>
> That said, can some ppc person who knows the 32-bit ppc code and maybe
> knows what that "interrupt: 700" means talk about that oddity in the
> trace, please?
I think everyone else answered your questions here, and it should be
fixed now in your tree.
Larry let me know if you're still seeing a crash with 4.18-rc3.
cheers
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
2018-07-02 4:16 ` Michael Ellerman
@ 2018-07-02 20:51 ` Larry Finger
0 siblings, 0 replies; 11+ messages in thread
From: Larry Finger @ 2018-07-02 20:51 UTC (permalink / raw)
To: Michael Ellerman, Linus Torvalds
Cc: Matthew Wilcox, Kirill A. Shutemov, Vlastimil Babka,
Christoph Lameter, Dave Hansen, Jerome Glisse, Lai Jiangshan,
Martin Schwidefsky, Pekka Enberg, Randy Dunlap, Andrey Ryabinin,
Andrew Morton, Benjamin Herrenschmidt, Paul Mackerras, ppc-dev,
Linux Kernel Mailing List
On 07/01/2018 11:16 PM, Michael Ellerman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
>>>
>>> I have more information regarding this BUG. Line 700 of page-flags.h is the
>>> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
>>> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
>>> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
>>> include/linux/mm.h. I also added a printk call to PageTable() that logs
>>> page->page_type. The routine was called twice. The first had page_type of
>>> 0xfffffbff, which would have been expected for a page table. The second call had
>>> 0xffffffff, which led to the BUG.
>>
>> So it looks to me like the tear-down of the page tables first found a
>> page that is indeed a page table, and cleared the page table bit
>> (well, it set it - the bits are reversed).
> ...
>>
>> That said, can some ppc person who knows the 32-bit ppc code and maybe
>> knows what that "interrupt: 700" means talk about that oddity in the
>> trace, please?
>
> I think everyone else answered your questions here, and it should be
> fixed now in your tree.
>
> Larry let me know if you're still seeing a crash with 4.18-rc3.
The problem is fixed in 4.18-rc3. Thanks to all that helped.
Larry
^ permalink raw reply [flat|nested] 11+ messages in thread