All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Rapoport <rppt@kernel.org>
To: Nam Cao <namcao@linutronix.de>
Cc: "Björn Töpel" <bjorn@kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Andreas Dilger" <adilger@dilger.ca>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"Jan Kara" <jack@suse.cz>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-riscv@lists.infradead.org, "Theodore Ts'o" <tytso@mit.edu>,
	"Ext4 Developers List" <linux-ext4@vger.kernel.org>,
	"Conor Dooley" <conor@kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Anders Roxell" <anders.roxell@linaro.org>,
	"Alexandre Ghiti" <alex@ghiti.fr>
Subject: Re: riscv32 EXT4 splat, 6.8 regression?
Date: Tue, 16 Apr 2024 19:19:27 +0300	[thread overview]
Message-ID: <Zh6lD8d7cUZdkZJb@kernel.org> (raw)
In-Reply-To: <20240416171713.7d76fe7d@namcao>

On Tue, Apr 16, 2024 at 05:17:13PM +0200, Nam Cao wrote:
> On 2024-04-16 Mike Rapoport wrote:
> > Hi,
> > 
> > On Tue, Apr 16, 2024 at 01:02:20PM +0200, Björn Töpel wrote:
> > > Christian Brauner <brauner@kernel.org> writes:
> > > 
> > > > [Adding Mike who's knowledgeable in this area]
> > > 
> > > >> > Further, it seems like riscv32 indeed inserts a page like that to the
> > > >> > buddy allocator, when the memblock is free'd:
> > > >> > 
> > > >> >   | [<c024961c>] __free_one_page+0x2a4/0x3ea
> > > >> >   | [<c024a448>] __free_pages_ok+0x158/0x3cc
> > > >> >   | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> > > >> >   | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> > > >> >   | [<c0c17676>] memblock_free_all+0x1ee/0x278
> > > >> >   | [<c0c050b0>] mem_init+0x10/0xa4
> > > >> >   | [<c0c1447c>] mm_core_init+0x11a/0x2da
> > > >> >   | [<c0c00bb6>] start_kernel+0x3c4/0x6de
> > > >> > 
> > > >> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> > > >> > lucky (unlucky?) that page was used for the page cache.
> > > >> 
> > > >> I just educated myself about memory mapping last night, so the below
> > > >> may be complete nonsense. Take it with a grain of salt.
> > > >> 
> > > >> In riscv's setup_bootmem(), we have this line:
> > > >> 	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >> 
> > > >> I think this is the root cause: max_low_pfn indicates the last page
> > > >> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> > > >> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> > > >> to the last page, we get the reported problem.
> > > >> 
> > > >> There seems to be some code to make sure the last page is not used
> > > >> (the call to memblock_set_current_limit() right above this line). It is
> > > >> unclear to me why this still lets the problem slip through.
> > > >> 
> > > >> The fix is simple: never let max_low_pfn gets mapped to the last page.
> > > >> The below patch fixes the problem for me. But I am not entirely sure if
> > > >> this is the correct fix, further investigation needed.
> > > >> 
> > > >> Best regards,
> > > >> Nam
> > > >> 
> > > >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > >> index fa34cf55037b..17cab0a52726 100644
> > > >> --- a/arch/riscv/mm/init.c
> > > >> +++ b/arch/riscv/mm/init.c
> > > >> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
> > > >>  	}
> > > >>  
> > > >>  	min_low_pfn = PFN_UP(phys_ram_base);
> > > >> -	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >> +	max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> > > >> +	max_pfn = PFN_DOWN(phys_ram_end);
> > > >>  	high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
> > > >>  
> > > >>  	dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));
> > > 
> > > Yeah, AFAIU memblock_set_current_limit() only limits the allocation from
> > > memblock. The "forbidden" page (PA 0xc03ff000 VA 0xfffff000) will still
> > > be allowed in the zone.
> > > 
> > > I think your patch requires memblock_set_current_limit() is
> > > unconditionally called, which currently is not done.
> > > 
> > > The hack I tried was (which seems to work):
> > > 
> > > --
> > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > index fe8e159394d8..3a1f25d41794 100644
> > > --- a/arch/riscv/mm/init.c
> > > +++ b/arch/riscv/mm/init.c
> > > @@ -245,8 +245,10 @@ static void __init setup_bootmem(void)
> > >          */
> > >         if (!IS_ENABLED(CONFIG_64BIT)) {
> > >                 max_mapped_addr = __pa(~(ulong)0);
> > > -               if (max_mapped_addr == (phys_ram_end - 1))
> > > +               if (max_mapped_addr == (phys_ram_end - 1)) {
> > >                         memblock_set_current_limit(max_mapped_addr - 4096);
> > > +                       phys_ram_end -= 4096;
> > > +               }
> > >         }
> > 
> > You can just memblock_reserve() the last page of the first gigabyte, e.g.
> 
> "last page of the first gigabyte" - why first gigabyte? Do you mean
> last page of *last* gigabyte?
 
With 3G-1G split linear map can map only 1G from 0xc0000000 to 0xffffffff
(or 0x00000000 with 32-bit overflow):

[    0.000000]       lowmem : 0xc0000000 - 0x00000000   (1024 MB)

> > 	if (!IS_ENABLED(CONFIG_64BIT)
> > 		memblock_reserve(SZ_1G - PAGE_SIZE, PAGE_SIZE);
> > 
> > The page will still be mapped, but it will never make it to the page
> > allocator.
> > 
> > The nice thing about it is, that memblock lets you to reserve regions that are
> > not necessarily populated, so there's no need to check where the actual RAM
> > ends.
> 
> I tried the suggested code, it didn't work. I think there are 2
> mistakes:
>  - last gigabyte, not first
>  - memblock_reserve() takes physical addresses as arguments, not
>    virtual addresses
> 
> The below patch fixes the problem. Is this what you really meant?
> 
> Best regards,
> Nam
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..ac7efdd77be8 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -245,6 +245,7 @@ static void __init setup_bootmem(void)
>  	 * be done as soon as the kernel mapping base address is determined.
>  	 */
>  	if (!IS_ENABLED(CONFIG_64BIT)) {
> +		memblock_reserve(__pa(-PAGE_SIZE), __pa(PAGE_SIZE));

__pa(-PAGE_SIZE) is what I meant, yes. 

>  		max_mapped_addr = __pa(~(ulong)0);
>  		if (max_mapped_addr == (phys_ram_end - 1))
>  			memblock_set_current_limit(max_mapped_addr - 4096);
> 

-- 
Sincerely yours,
Mike.

WARNING: multiple messages have this Message-ID (diff)
From: Mike Rapoport <rppt@kernel.org>
To: Nam Cao <namcao@linutronix.de>
Cc: "Björn Töpel" <bjorn@kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Andreas Dilger" <adilger@dilger.ca>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"Jan Kara" <jack@suse.cz>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-riscv@lists.infradead.org, "Theodore Ts'o" <tytso@mit.edu>,
	"Ext4 Developers List" <linux-ext4@vger.kernel.org>,
	"Conor Dooley" <conor@kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Anders Roxell" <anders.roxell@linaro.org>,
	"Alexandre Ghiti" <alex@ghiti.fr>
Subject: Re: riscv32 EXT4 splat, 6.8 regression?
Date: Tue, 16 Apr 2024 19:19:27 +0300	[thread overview]
Message-ID: <Zh6lD8d7cUZdkZJb@kernel.org> (raw)
In-Reply-To: <20240416171713.7d76fe7d@namcao>

On Tue, Apr 16, 2024 at 05:17:13PM +0200, Nam Cao wrote:
> On 2024-04-16 Mike Rapoport wrote:
> > Hi,
> > 
> > On Tue, Apr 16, 2024 at 01:02:20PM +0200, Björn Töpel wrote:
> > > Christian Brauner <brauner@kernel.org> writes:
> > > 
> > > > [Adding Mike who's knowledgeable in this area]
> > > 
> > > >> > Further, it seems like riscv32 indeed inserts a page like that to the
> > > >> > buddy allocator, when the memblock is free'd:
> > > >> > 
> > > >> >   | [<c024961c>] __free_one_page+0x2a4/0x3ea
> > > >> >   | [<c024a448>] __free_pages_ok+0x158/0x3cc
> > > >> >   | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> > > >> >   | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> > > >> >   | [<c0c17676>] memblock_free_all+0x1ee/0x278
> > > >> >   | [<c0c050b0>] mem_init+0x10/0xa4
> > > >> >   | [<c0c1447c>] mm_core_init+0x11a/0x2da
> > > >> >   | [<c0c00bb6>] start_kernel+0x3c4/0x6de
> > > >> > 
> > > >> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> > > >> > lucky (unlucky?) that page was used for the page cache.
> > > >> 
> > > >> I just educated myself about memory mapping last night, so the below
> > > >> may be complete nonsense. Take it with a grain of salt.
> > > >> 
> > > >> In riscv's setup_bootmem(), we have this line:
> > > >> 	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >> 
> > > >> I think this is the root cause: max_low_pfn indicates the last page
> > > >> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> > > >> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> > > >> to the last page, we get the reported problem.
> > > >> 
> > > >> There seems to be some code to make sure the last page is not used
> > > >> (the call to memblock_set_current_limit() right above this line). It is
> > > >> unclear to me why this still lets the problem slip through.
> > > >> 
> > > >> The fix is simple: never let max_low_pfn gets mapped to the last page.
> > > >> The below patch fixes the problem for me. But I am not entirely sure if
> > > >> this is the correct fix, further investigation needed.
> > > >> 
> > > >> Best regards,
> > > >> Nam
> > > >> 
> > > >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > >> index fa34cf55037b..17cab0a52726 100644
> > > >> --- a/arch/riscv/mm/init.c
> > > >> +++ b/arch/riscv/mm/init.c
> > > >> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
> > > >>  	}
> > > >>  
> > > >>  	min_low_pfn = PFN_UP(phys_ram_base);
> > > >> -	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >> +	max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> > > >> +	max_pfn = PFN_DOWN(phys_ram_end);
> > > >>  	high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
> > > >>  
> > > >>  	dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));
> > > 
> > > Yeah, AFAIU memblock_set_current_limit() only limits the allocation from
> > > memblock. The "forbidden" page (PA 0xc03ff000 VA 0xfffff000) will still
> > > be allowed in the zone.
> > > 
> > > I think your patch requires memblock_set_current_limit() is
> > > unconditionally called, which currently is not done.
> > > 
> > > The hack I tried was (which seems to work):
> > > 
> > > --
> > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > index fe8e159394d8..3a1f25d41794 100644
> > > --- a/arch/riscv/mm/init.c
> > > +++ b/arch/riscv/mm/init.c
> > > @@ -245,8 +245,10 @@ static void __init setup_bootmem(void)
> > >          */
> > >         if (!IS_ENABLED(CONFIG_64BIT)) {
> > >                 max_mapped_addr = __pa(~(ulong)0);
> > > -               if (max_mapped_addr == (phys_ram_end - 1))
> > > +               if (max_mapped_addr == (phys_ram_end - 1)) {
> > >                         memblock_set_current_limit(max_mapped_addr - 4096);
> > > +                       phys_ram_end -= 4096;
> > > +               }
> > >         }
> > 
> > You can just memblock_reserve() the last page of the first gigabyte, e.g.
> 
> "last page of the first gigabyte" - why first gigabyte? Do you mean
> last page of *last* gigabyte?
 
With 3G-1G split linear map can map only 1G from 0xc0000000 to 0xffffffff
(or 0x00000000 with 32-bit overflow):

[    0.000000]       lowmem : 0xc0000000 - 0x00000000   (1024 MB)

> > 	if (!IS_ENABLED(CONFIG_64BIT)
> > 		memblock_reserve(SZ_1G - PAGE_SIZE, PAGE_SIZE);
> > 
> > The page will still be mapped, but it will never make it to the page
> > allocator.
> > 
> > The nice thing about it is, that memblock lets you to reserve regions that are
> > not necessarily populated, so there's no need to check where the actual RAM
> > ends.
> 
> I tried the suggested code, it didn't work. I think there are 2
> mistakes:
>  - last gigabyte, not first
>  - memblock_reserve() takes physical addresses as arguments, not
>    virtual addresses
> 
> The below patch fixes the problem. Is this what you really meant?
> 
> Best regards,
> Nam
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..ac7efdd77be8 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -245,6 +245,7 @@ static void __init setup_bootmem(void)
>  	 * be done as soon as the kernel mapping base address is determined.
>  	 */
>  	if (!IS_ENABLED(CONFIG_64BIT)) {
> +		memblock_reserve(__pa(-PAGE_SIZE), __pa(PAGE_SIZE));

__pa(-PAGE_SIZE) is what I meant, yes. 

>  		max_mapped_addr = __pa(~(ulong)0);
>  		if (max_mapped_addr == (phys_ram_end - 1))
>  			memblock_set_current_limit(max_mapped_addr - 4096);
> 

-- 
Sincerely yours,
Mike.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

  parent reply	other threads:[~2024-04-16 16:20 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-12 14:57 riscv32 EXT4 splat, 6.8 regression? Björn Töpel
2024-04-12 14:57 ` Björn Töpel
2024-04-12 15:43 ` Theodore Ts'o
2024-04-12 15:43   ` Theodore Ts'o
2024-04-12 16:59   ` Björn Töpel
2024-04-12 16:59     ` Björn Töpel
2024-04-13  4:35     ` Theodore Ts'o
2024-04-13  4:35       ` Theodore Ts'o
2024-04-13 10:01       ` Conor Dooley
2024-04-13 10:01         ` Conor Dooley
2024-04-13 14:43 ` Nam Cao
2024-04-13 14:43   ` Nam Cao
2024-04-14  0:24   ` Theodore Ts'o
2024-04-14  0:24     ` Theodore Ts'o
2024-04-14  1:46   ` Andreas Dilger
2024-04-14  1:46     ` Andreas Dilger
2024-04-14  2:04     ` Theodore Ts'o
2024-04-14  2:04       ` Theodore Ts'o
2024-04-14  2:18       ` Al Viro
2024-04-14  2:18         ` Al Viro
2024-04-14  2:15     ` Al Viro
2024-04-14  2:15       ` Al Viro
2024-04-14  4:16       ` Andreas Dilger
2024-04-14  4:16         ` Andreas Dilger
2024-04-14 14:08         ` Björn Töpel
2024-04-14 14:08           ` Björn Töpel
2024-04-15 13:04           ` Christian Brauner
2024-04-15 13:04             ` Christian Brauner
2024-04-15 16:04             ` Björn Töpel
2024-04-15 16:04               ` Björn Töpel
2024-04-16  6:44               ` Nam Cao
2024-04-16  6:44                 ` Nam Cao
2024-04-16  8:25                 ` Christian Brauner
2024-04-16  8:25                   ` Christian Brauner
2024-04-16 11:02                   ` Björn Töpel
2024-04-16 11:02                     ` Björn Töpel
2024-04-16 14:24                     ` Mike Rapoport
2024-04-16 14:24                       ` Mike Rapoport
2024-04-16 15:17                       ` Nam Cao
2024-04-16 15:17                         ` Nam Cao
2024-04-16 15:30                         ` Nam Cao
2024-04-16 15:30                           ` Nam Cao
2024-04-16 15:56                           ` Björn Töpel
2024-04-16 15:56                             ` Björn Töpel
2024-04-16 16:19                             ` Nam Cao
2024-04-16 16:19                               ` Nam Cao
2024-04-16 16:31                               ` Mike Rapoport
2024-04-16 16:31                                 ` Mike Rapoport
2024-04-16 17:00                                 ` Matthew Wilcox
2024-04-16 17:00                                   ` Matthew Wilcox
2024-04-16 18:34                                   ` Mike Rapoport
2024-04-16 18:34                                     ` Mike Rapoport
2024-04-16 22:36                                     ` Nam Cao
2024-04-16 22:36                                       ` Nam Cao
2024-04-17 15:31                                       ` Theodore Ts'o
2024-04-17 15:31                                         ` Theodore Ts'o
2024-04-17 18:06                                         ` Nam Cao
2024-04-17 18:06                                           ` Nam Cao
2024-04-17 19:34                                       ` Mike Rapoport
2024-04-17 19:34                                         ` Mike Rapoport
2024-04-17 22:09                                       ` Andreas Dilger
2024-04-17 22:09                                         ` Andreas Dilger
2024-04-18  9:17                                         ` Nam Cao
2024-04-18  9:17                                           ` Nam Cao
2024-04-16 18:05                               ` Björn Töpel
2024-04-16 18:05                                 ` Björn Töpel
2024-04-16 18:09                                 ` Nam Cao
2024-04-16 18:09                                   ` Nam Cao
2024-04-16 16:19                         ` Mike Rapoport [this message]
2024-04-16 16:19                           ` Mike Rapoport
2024-04-16 16:31                           ` Matthew Wilcox
2024-04-16 16:31                             ` Matthew Wilcox
2024-04-16 18:18                             ` Mike Rapoport
2024-04-16 18:18                               ` Mike Rapoport

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zh6lD8d7cUZdkZJb@kernel.org \
    --to=rppt@kernel.org \
    --cc=adilger@dilger.ca \
    --cc=alex@ghiti.fr \
    --cc=anders.roxell@linaro.org \
    --cc=bjorn@kernel.org \
    --cc=brauner@kernel.org \
    --cc=conor@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=namcao@linutronix.de \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.