From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9BD3C0032E for ; Wed, 25 Oct 2023 15:50:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232469AbjJYPuu (ORCPT ); Wed, 25 Oct 2023 11:50:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57956 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234524AbjJYPuu (ORCPT ); Wed, 25 Oct 2023 11:50:50 -0400 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 96F3C13A for ; Wed, 25 Oct 2023 08:50:47 -0700 (PDT) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 31DD6C433C7; Wed, 25 Oct 2023 15:50:47 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1698249047; bh=1xD61dKZzRe/rzmJ3m58Uq9VTAKiJn6b6NclNNX5YMA=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=nMB5jeOCCQafw5aeOhAUTEr0SyusQGJTpK6uIgHFgcwX0wK8MDG625I6Doc0nP/mm /ZnhVESGeLW3OkAE2duS+2BNxuI2pkKcWMEY4I3yi6B7eI0tx3RHelaYp1hh620YEj /cxz2pla3admhUsB2eRDzMIvFFWB7v7HNP44JL+SpLDo7tXfpCzkg9jV9fwN48YsAb Tp3SHNsAAOYAzbdDdr/YJdz4CgrSIKfqNGdetxJy5g2d0YQYQiyojmxCeCW84WVDiN WcUVPQx0KYd62st84nLikPC+KAXRHB/7gRcbY96gZUVqhTkBm/kH8G4lqbQMh+VJmV 1MRu4ZdN5AjOQ== Date: Wed, 25 Oct 2023 08:50:46 -0700 From: "Darrick J. Wong" To: Omar Sandoval Cc: Dave Chinner , linux-xfs@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH] xfs: fix internal error from AGFL exhaustion Message-ID: <20231025155046.GF3195650@frogsfrogsfrogs> References: <013168d2de9d25c56fe45ad75e9257cf9664f2d6.1698190191.git.osandov@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote: > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote: > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote: > > > From: Omar Sandoval > > > > > > We've been seeing XFS errors like the following: > > > > > > XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_btree_insert+0x1ec/0x280 > > > ... > > > Call Trace: > > > xfs_corruption_error+0x94/0xa0 > > > xfs_btree_insert+0x221/0x280 > > > xfs_alloc_fixup_trees+0x104/0x3e0 > > > xfs_alloc_ag_vextent_size+0x667/0x820 > > > xfs_alloc_fix_freelist+0x5d9/0x750 > > > xfs_free_extent_fix_freelist+0x65/0xa0 > > > __xfs_free_extent+0x57/0x180 > > > ... > > > > Good find, Omar! > > > > For completeness, what's the rest of the trace? We've recently > > changed how extent freeing path works w.r.t. busy extents so I'm > > curious as to what the path into this code is. > > This one is from my fpunch reproducer, but there were two call traces we > actually saw in production. One from writeback: > > xfs_corruption_error+0x90/0xa0 > xfs_btree_insert+0x1b1/0x1d0 > xfs_alloc_fixup_trees+0x28e/0x450 > xfs_alloc_ag_vextent_size+0x4e0/0x780 > xfs_alloc_ag_vextent+0x11c/0x140 > xfs_alloc_fix_freelist+0x3f2/0x510 > xfs_alloc_vextent+0x254/0x490 > xfs_bmap_btalloc+0x301/0x6e0 > xfs_bmapi_allocate+0x206/0x2d0 > xfs_bmapi_convert_delalloc+0x2b9/0x450 > xfs_map_blocks+0x1cc/0x470 > iomap_do_writepage+0x22b/0x860 > write_cache_pages+0x1ae/0x4c0 > iomap_writepages+0x1c/0x40 > xfs_vm_writepages+0x6b/0xa0 > do_writepages+0x6f/0x1d0 > __writeback_single_inode+0x39/0x2b0 > writeback_sb_inodes+0x277/0x510 > __writeback_inodes_wb+0x6e/0xe0 > wb_writeback+0xda/0x250 > wb_workfn+0x1f4/0x430 > process_one_work+0x12c/0x490 > worker_thread+0x143/0x3e0 > kthread+0xa3/0xd0 > ret_from_fork+0x1f/0x30 > > And one from inodegc: > > xfs_corruption_error+0x90/0xa0 > xfs_btree_insert+0x1b1/0x1d0 > xfs_alloc_fixup_trees+0x28e/0x450 > xfs_alloc_ag_vextent_size+0x4e0/0x780 > xfs_alloc_ag_vextent+0x11c/0x140 > xfs_alloc_fix_freelist+0x3f2/0x510 > __xfs_free_extent+0xde/0x1f0 > xfs_trans_free_extent+0x54/0xd0 > xfs_extent_free_finish_item+0x69/0xa0 > xfs_defer_finish_noroll+0x163/0x510 > xfs_defer_finish+0x10/0x70 > xfs_itruncate_extents_flags+0x13a/0x250 > xfs_inactive_truncate+0xac/0xe0 > xfs_inactive+0x14f/0x160 > xfs_inodegc_worker+0x81/0x100 > process_one_work+0x12c/0x490 > worker_thread+0x143/0x3e0 > kthread+0xa3/0xd0 > ret_from_fork+0x1f/0x30 > > These are both from 5.19. Hey, I've seen that backtrace before! Only a handful of times, fortuntely. > > > (It's guaranteed that > > > none of the internal nodes need to be split because they were just > > > split.) > > > > Hmmm. That doesn't sound right - btrees don't work that way. > > > > We don't know what path the tree split along in the previous insert > > operation - a split only guarantees the next insert along that same > > path won't split if the insert hits one of the two leaf blocks from > > the split. Insert into a different leaf can still split interior > > nodes all the way back up the tree the point where it intersects > > with the previous full height split path. > > > > e.g. if the previous full height split was down the left side of the > > tree and the next insert is in the right half of the tree, the right > > half can split on insert all the way up to the level below the old > > root. > > You're right, I was only thinking about the 2->3 level case. My > assumption is only true in that case. > > > > Fix it by adding an extra block of slack in the freelist for the > > > potential leaf split in each of the bnobt and cntbt. I see how the cntbt can split because changing the length of a freespace extent can require the record to move to a totally different part of the cntbt. The reinsertion causes the split. How does the bnobt split during a refresh of the AGFL? Will we ever allocate a block from the /middle/ of a free space extent? > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to > > handle a single level leaf split before path intersection occurs. > > Yup, adding an extra block will make the exact problem being seen go > > away. > > > > Ok, what's the general solution? 4-level, 5-level or larger trees? > > > > Worst split after a full split is up to the level below the old root > > block. the root block was split, so it won't need a split again, so > > only it's children could split again. OK, so that's (height - 1) > > needed for a max split to refill the AGFL after a full height split > > occurred. > > > > Hence it looks like the minimum AGFL space for any given > > allocation/free operation needs to be: > > > > (full height split reservation) + (AGFL refill split height) > > > > which is: > > > > = (new height) + (new height - 2) > > = 2 * new height - 2 > > = 2 * (current height + 1) - 2 > > = 2 * current height > > > > Ok, so we essentially have to double the AGFL minimum size to handle > > the generic case of AGFL refill splitting free space trees after a > > transaction that has exhausted it's AGFL reservation. > > > > Now, did I get that right? > > That sounds right, but I'll have to think about it more after some sleep > :) I think that's correct, assuming that (2 * current_height) is the per-btree calcluation. > Assuming that is correct, your LE search suggestion sounds kind of nice > rather than such a drastic change to the AGFL size. The absolute maximum height of a free space btree is 7 blocks, according to xfs_db: # xfs_db /dev/sda -c 'btheight -w absmax all' bnobt: 7 cntbt: 7 inobt: 6 finobt: 6 bmapbt: 14 refcountbt: 6 rmapbt: 10 The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't think it's /that/ drastic. For a filesystem with rmapbt (V5, 1k blocks) that minimum jumps to 121 blocks. > > The rmapbt case will need this change, too, because rmapbt blocks > > are allocated from the free list and so an rmapbt update can exhaust > > the free list completely, too. > > Ah, I missed where the rmapbt is updated since it was slightly removed > from the xfs_alloc_fixup_trees() code path I was looking at. The rmapbt has its own accounting tricks (XFS_AG_RESV_RMAPBT) to ensure that there's always enough free space to refill the AGFL. Is that why the testcase turns off rmapbt? --D > Thanks for the quick review! > > Omar