From: Mike Rapoport <rppt@linux.ibm.com>
To: Qian Cai <quic_qiancai@quicinc.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Ard Biesheuvel <ardb@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: Arm64 crash while reading memory sysfs
Date: Thu, 27 May 2021 11:56:48 +0300	[thread overview]
Message-ID: <YK9e0LgDOfCFo6TM@linux.ibm.com> (raw)
In-Reply-To: <d55f915c-ad01-e729-1e29-b57d78257cbb@quicinc.com>

On Wed, May 26, 2021 at 08:16:14PM -0400, Qian Cai wrote:
> 
> On 5/26/2021 1:24 PM, Mike Rapoport wrote:
> > On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> >>>
> >>> On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> >>>> Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while reading files under /sys/devices/system/memory.
> > 
> > Does the issue persist if you only revert the latest patch in the series?
> > In next-20210525 it would be commit 
> > 89fb47db72f2 ("arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix")
> > and commit
> > dfe215e9bac2 ("arm64: drop pfn_valid_within() and simplify pfn_valid()").
> 
> Reverting those two commits alone is enough to fix the issue.
> 
> > 
> >>> Can you please send the beginning of the boot log, up to the
> >>> 	 "Memory: xK/yK available ..."
> >>> line?
> >>
> >> [    0.000000] NUMA: Failed to initialise from firmware
> >> [    0.000000] NUMA: Faking a node at [mem 0x0000000090000000-0x0000009fffffffff]
> >> [    0.000000] NUMA: NODE_DATA [mem 0x9ffefbabc0-0x9ffefbffff]
> >> [    0.000000] Zone ranges:
> >> [    0.000000]   Normal   [mem 0x0000000090000000-0x0000009fffffffff]
> >> [    0.000000] Movable zone start for each node
> >> [    0.000000] Early memory node ranges
> >> [    0.000000]   node   0: [mem 0x0000000090000000-0x0000000091ffffff]
> >> [    0.000000]   node   0: [mem 0x0000000092000000-0x00000000928fffff]
> >> [    0.000000]   node   0: [mem 0x0000000092900000-0x00000000fffbffff]
> >> [    0.000000]   node   0: [mem 0x00000000fffc0000-0x00000000ffffffff]
> >> [    0.000000]   node   0: [mem 0x0000000880000000-0x0000000fffffffff]
> >> [    0.000000]   node   0: [mem 0x0000008800000000-0x0000009ff5aeffff]
> >> [    0.000000]   node   0: [mem 0x0000009ff5af0000-0x0000009ff5b2ffff]
> >> [    0.000000]   node   0: [mem 0x0000009ff5b30000-0x0000009ff5baffff]
> >> [    0.000000]   node   0: [mem 0x0000009ff5bb0000-0x0000009ff7deffff]
> >> [    0.000000]   node   0: [mem 0x0000009ff7df0000-0x0000009ff7e5ffff]
> >> [    0.000000]   node   0: [mem 0x0000009ff7e60000-0x0000009ff7ffffff]
> >> [    0.000000]   node   0: [mem 0x0000009ff8000000-0x0000009fffffffff]
> >> [    0.000000] Initmem setup node 0 [mem 0x0000000090000000-0x0000009fffffffff]
> >> [    0.000000] mem auto-init: stack:off, heap alloc:on, heap free:off
> >> [    0.000000] Memory: 777216K/133955584K available (17920K kernel code, 118786K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)
> > 
> > The available and reserved sizes look weird. Can you post the log with
> > memblock=debug and mminit_loglevel=4 added to the kernel command line?
> 
> http://www.lsbug.org/tmp/dmesg.txt

It seems to be cut off in the middle, and even then it's too long to be useful.

Let's drop memblock=debug for now and add this instead:

diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..3f888bef1994 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2055,6 +2055,8 @@ void __init memblock_free_all(void)
 {
 	unsigned long pages;
 
+	__memblock_dump_all();
+
 	free_unused_memmap();
 	reset_all_zones_managed_pages();
 
> >>>> [1] https://lore.kernel.org/kvmarm/20210511100550.28178-1-rppt@kernel.org/
> >>>>
> >>>> [  247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
> >>>> [  247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
> >>>> [  247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
> >>>> [  247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
> >>>> [  247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> >>>> [  247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> >>>> [  247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
> >>>> [  247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300
> > 
> > Do we know what PFN triggers it? Can you please run with this patch:
> 
> Nothing useful showed up with this patch. Yes, I double-checked that the patch was applied.

Sorry, I missed that the BUG is apparently triggered for pfn + i. Can
you please try this instead:


diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..d0e42e09ad84 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,6 +1447,13 @@ struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 			if (zone && !zone_spans_pfn(zone, pfn + i))
 				return NULL;
 			page = pfn_to_page(pfn + i);
+
+			if (!pfn_valid(pfn + i))
+				pr_info("%s: pfn %lx is not valid\n", __func__, pfn + i);
+			else if (PagePoisoned(page))
+				dump_page(page, "");
+
+
 			if (zone && page_zone(page) != zone)
 				return NULL;
 			zone = page_zone(page);
 
-- 
Sincerely yours,
Mike.
