linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Björn Töpel" <bjorn@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Andreas Dilger <adilger@dilger.ca>,
	Al Viro <viro@zeniv.linux.org.uk>, Nam Cao <namcao@linutronix.de>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-riscv@lists.infradead.org, Theodore Ts'o <tytso@mit.edu>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	Conor Dooley <conor@kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Anders Roxell <anders.roxell@linaro.org>
Subject: Re: riscv32 EXT4 splat, 6.8 regression?
Date: Mon, 15 Apr 2024 18:04:50 +0200	[thread overview]
Message-ID: <87le5e393x.fsf@all.your.base.are.belong.to.us> (raw)
In-Reply-To: <20240415-festland-unattraktiv-2b5953a6dbc9@brauner>

Christian Brauner <brauner@kernel.org> writes:

> On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
>> Andreas Dilger <adilger@dilger.ca> writes:
>> 
>> > On Apr 13, 2024, at 8:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >> 
>> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
>> >> 
>> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
>> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
>> >>> to have 4GB+ of RAM and want to use it all.  The riscv32 might consider
>> >>> reserving this page address from allocation to avoid similar issues in
>> >>> other parts of the code, as is done with the NULL/0 page address.
>> >> 
>> >> Not a chance.  *Any* page mapped there is a serious bug on any 32bit
>> >> box.  Recall what ERR_PTR() is...
>> >> 
>> >> On any architecture the virtual addresses in range (unsigned long)-512..
>> >> (unsigned long)-1 must never resolve to valid kernel objects.
>> >> In other words, any kind of wraparound here is asking for an oops on
>> >> attempts to access the elements of buffer - kernel dereference of
>> >> (char *)0xfffff000 on a 32bit box is already a bug.
>> >> 
>> >> It might be getting an invalid pointer, but arithmetical overflows
>> >> are irrelevant.
>> >
>> > The original bug report stated that search_buf = 0xfffff000 on entry,
>> > and I'd quoted that at the start of my email:
>> >
>> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <bjorn@kernel.org> wrote:
>> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
>> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
>> >> is a sane address.
>> >
>> > Now that you mention ERR_PTR() it definitely makes sense that this last
>> > page HAS to be excluded.
>> >
>> > So some other bug is passing the bad pointer to this code before this
>> > error, or the arch is not correctly excluding this page from allocation.
>> 
>> Yeah, something is off for sure.
>> 
>> (FWIW, I manage to hit this for Linus' master as well.)
>> 
>> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
>> this BT:
>> 
>>   [<c01e8b34>] __filemap_add_folio+0x322/0x508
>>   [<c01e8d6e>] filemap_add_folio+0x54/0xce
>>   [<c01ea076>] __filemap_get_folio+0x156/0x2aa
>>   [<c02df346>] __getblk_slow+0xcc/0x302
>>   [<c02df5f2>] bdev_getblk+0x76/0x7a
>>   [<c03519da>] ext4_getblk+0xbc/0x2c4
>>   [<c0351cc2>] ext4_bread_batch+0x56/0x186
>>   [<c036bcaa>] __ext4_find_entry+0x156/0x578
>>   [<c036c152>] ext4_lookup+0x86/0x1f4
>>   [<c02a3252>] __lookup_slow+0x8e/0x142
>>   [<c02a6d70>] walk_component+0x104/0x174
>>   [<c02a793c>] path_lookupat+0x78/0x182
>>   [<c02a8c7c>] filename_lookup+0x96/0x158
>>   [<c02a8d76>] kern_path+0x38/0x56
>>   [<c0c1cb7a>] init_mount+0x5c/0xac
>>   [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
>>   [<c0c01cce>] prepare_namespace+0x226/0x27c
>>   [<c0c011c6>] kernel_init_freeable+0x286/0x2a8
>>   [<c0b97ab8>] kernel_init+0x2a/0x156
>>   [<c0ba22ca>] ret_from_fork+0xe/0x20
>> 
>> I get a folio where folio_address(folio) == 0xfffff000 (which is
>> broken).
>> 
>> Need to go into the weeds here...
>
> I don't see anything obvious that could explain this right away. Did you
> manage to reproduce this on any other architecture and/or filesystem?
>
> Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
> through the mm/ layer between v6.7 and v6.8 that might also be
> interesting. But really I'm poking in the dark currently.

Thanks for getting back! Spent some more time one it today.

It seems that the buddy allocator *can* return a page with a VA that can
wrap (0xfffff000 -- pointed out by Nam and myself).

Further, it seems like riscv32 indeed inserts a page like that to the
buddy allocator, when the memblock is free'd:

  | [<c024961c>] __free_one_page+0x2a4/0x3ea
  | [<c024a448>] __free_pages_ok+0x158/0x3cc
  | [<c024b1a4>] __free_pages_core+0xe8/0x12c
  | [<c0c1435a>] memblock_free_pages+0x1a/0x22
  | [<c0c17676>] memblock_free_all+0x1ee/0x278
  | [<c0c050b0>] mem_init+0x10/0xa4
  | [<c0c1447c>] mm_core_init+0x11a/0x2da
  | [<c0c00bb6>] start_kernel+0x3c4/0x6de

Here, a page with VA 0xfffff000 is a added to the freelist. We were just
lucky (unlucky?) that page was used for the page cache.

A nasty patch like:
--8<--
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..a6a6abbe71b0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
 							unsigned int order)
 {
+	if ((long)page_address(page) == 0xfffff000L) {
+		return; // leak it
+	}
 
 	if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
 		int nid = early_pfn_to_nid(pfn);
--8<--

...and it's gone.

I need to think more about what a proper fix is. Regardless; Christian,
Al, and Ted can all relax. ;-)


Björn

  reply	other threads:[~2024-04-15 16:04 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-12 14:57 riscv32 EXT4 splat, 6.8 regression? Björn Töpel
2024-04-12 15:43 ` Theodore Ts'o
2024-04-12 16:59   ` Björn Töpel
2024-04-13  4:35     ` Theodore Ts'o
2024-04-13 10:01       ` Conor Dooley
2024-04-13 14:43 ` Nam Cao
2024-04-14  0:24   ` Theodore Ts'o
2024-04-14  1:46   ` Andreas Dilger
2024-04-14  2:04     ` Theodore Ts'o
2024-04-14  2:18       ` Al Viro
2024-04-14  2:15     ` Al Viro
2024-04-14  4:16       ` Andreas Dilger
2024-04-14 14:08         ` Björn Töpel
2024-04-15 13:04           ` Christian Brauner
2024-04-15 16:04             ` Björn Töpel [this message]
2024-04-16  6:44               ` Nam Cao
2024-04-16  8:25                 ` Christian Brauner
2024-04-16 11:02                   ` Björn Töpel
2024-04-16 14:24                     ` Mike Rapoport
2024-04-16 15:17                       ` Nam Cao
2024-04-16 15:30                         ` Nam Cao
2024-04-16 15:56                           ` Björn Töpel
2024-04-16 16:19                             ` Nam Cao
2024-04-16 16:31                               ` Mike Rapoport
2024-04-16 17:00                                 ` Matthew Wilcox
2024-04-16 18:34                                   ` Mike Rapoport
2024-04-16 22:36                                     ` Nam Cao
2024-04-17 15:31                                       ` Theodore Ts'o
2024-04-17 18:06                                         ` Nam Cao
2024-04-17 19:34                                       ` Mike Rapoport
2024-04-17 22:09                                       ` Andreas Dilger
2024-04-18  9:17                                         ` Nam Cao
2024-04-16 18:05                               ` Björn Töpel
2024-04-16 18:09                                 ` Nam Cao
2024-04-16 16:19                         ` Mike Rapoport
2024-04-16 16:31                           ` Matthew Wilcox
2024-04-16 18:18                             ` Mike Rapoport

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87le5e393x.fsf@all.your.base.are.belong.to.us \
    --to=bjorn@kernel.org \
    --cc=adilger@dilger.ca \
    --cc=anders.roxell@linaro.org \
    --cc=brauner@kernel.org \
    --cc=conor@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=namcao@linutronix.de \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).