Date: Fri, 13 Sep 2024 19:15:26 +0100
From: Matthew Wilcox
To: Chris Mason
Cc: Linus Torvalds, Jens Axboe, Christian Theune, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Daniel Dao, Dave Chinner,
	regressions@lists.linux.dev, regressions@leemhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
	folios since Dec 2021 (any kernel from 6.1 upwards)
References: <0fc8c3e7-e5d2-40db-8661-8c7199f84e43@kernel.dk>

On Fri, Sep 13, 2024 at 12:33:49PM -0400, Chris Mason wrote:
> > If you could get the precise index numbers, that would be an important
> > clue.
> > It would be interesting to know the index number in the xarray
> > where the folio was found rather than folio->index (as I suspect that
> > folio->index is completely bogus because folio->mapping is wrong).
> > But gathering that info is going to be hard.
>
> This particular debug session was late at night while we were urgently
> trying to roll out some NFS features.  I didn't really save many of the
> details because my plan was to reproduce it and make a full bug report.
>
> Also, I was explaining the details to people in workplace chat, which is
> wildly bad at rendering long lines of structured text, especially when
> half the people in the chat are on a mobile device.
>
> You're probably wondering why all of that is important... what I'm really
> trying to say is that I've attached a screenshot of the debugging output.
>
> It came from an older drgn script, where I'm still clinging to "radix",
> and you probably can't trust the string representation of the page flags
> because I wasn't yet using Omar's helpers and may have hard-coded them
> from an older kernel.

That's all _fine_.  This is enormously helpful.

First, we see the same folio appear three times; I think that's
particularly significant.  Modulo 64 (the number of entries per node),
the indices at which the bad folio is found are 16, 32 and 48.  So I
think the _current_ order of the folio is 4, but at the time the folio
was put in the xarray, it was order 6.  Except ... at order 6 we elide
a level of the xarray, so we shouldn't be able to see this.  Hm.

Oh!  I think split is the key.  Let's say we have an order-6 (or
larger) folio, and we call split_huge_page() (or whatever it's called
in your kernel version).  That calls xas_split_alloc() followed by
xas_split().  xas_split_alloc() puts the entry in node->slots[0] and
initialises node->slots[1..XA_CHUNK_SIZE] to sibling entries.

Now, if we do allocate those nodes in xas_split_alloc(), we're supposed
to free them with radix_tree_node_rcu_free(), which zeroes all the
slots.  But what if, somehow, we don't?  (This is my best current
theory.)  Then the node gets allocated to a different tree, and any
time we try to look something up in it, unless it's the index for which
the node was allocated, we find a sibling entry and it points to a
stale pointer.

I'm going to think on this a bit more, but so far this is all good
evidence for my leading theory.
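
To make the sibling-entry mechanics concrete, here is a minimal
userspace sketch of that theory.  It is not the kernel's xarray code:
CHUNK_SIZE, the sibling encoding and the helper names are simplified
stand-ins for XA_CHUNK_SIZE, xa_mk_sibling() and the real tree walk,
and the slot the "new owner" writes to is an arbitrary choice.

/*
 * Minimal userspace model of the stale-node theory above.  NOT the
 * kernel's xarray code; everything here is a simplified stand-in.
 */
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE 64			/* slots per node, like XA_CHUNK_SIZE */

/* Tag the low pointer bit to mean "sibling of slot n". */
#define MK_SIBLING(n)	((void *)(((uintptr_t)(n) << 1) | 1))
#define IS_SIBLING(p)	((uintptr_t)(p) & 1)
#define SIBLING_OF(p)	((unsigned int)((uintptr_t)(p) >> 1))

struct node { void *slots[CHUNK_SIZE]; };

/* What an xas_split_alloc()-style init leaves behind for an order-6
 * entry: the folio in slot 0, sibling entries in every other slot. */
static void fill_order6(struct node *node, void *folio)
{
	node->slots[0] = folio;
	for (unsigned int i = 1; i < CHUNK_SIZE; i++)
		node->slots[i] = MK_SIBLING(0);
}

/* A lookup that chases sibling entries, as the real tree walk does. */
static void *lookup(struct node *node, unsigned int offset)
{
	void *entry = node->slots[offset];
	return IS_SIBLING(entry) ? node->slots[SIBLING_OF(entry)] : entry;
}

int main(void)
{
	struct node node;
	int stale_folio, new_entry;	/* stand-ins for real pointers */

	fill_order6(&node, &stale_folio);

	/* Bug model: the node is recycled into another tree *without*
	 * the radix_tree_node_rcu_free() path zeroing the slots; the new
	 * owner only stores the one entry it cares about. */
	node.slots[5] = &new_entry;

	printf("offset  5 -> %s\n",
	       lookup(&node, 5) == &new_entry ? "new entry (ok)" : "?");

	/* Every other offset still chases a stale sibling back to slot 0,
	 * so an order-4 walk probing 16, 32, 48 sees the same stale folio. */
	for (unsigned int off = 16; off < CHUNK_SIZE; off += 16)
		printf("offset %2u -> %s\n", off,
		       lookup(&node, off) == &stale_folio ? "stale folio" : "ok");

	return 0;
}

The sibling encoding here just steals the low pointer bit; the real
xarray uses internal entries for this, but the recycled-node failure
mode being modelled is the same idea.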