From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 94B43C433B4 for ; Tue, 11 May 2021 11:49:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5D2D761207 for ; Tue, 11 May 2021 11:49:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231341AbhEKLue (ORCPT ); Tue, 11 May 2021 07:50:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40632 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231518AbhEKLuc (ORCPT ); Tue, 11 May 2021 07:50:32 -0400 Received: from mail-pj1-x1030.google.com (mail-pj1-x1030.google.com [IPv6:2607:f8b0:4864:20::1030]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6F677C061574 for ; Tue, 11 May 2021 04:49:25 -0700 (PDT) Received: by mail-pj1-x1030.google.com with SMTP id gq14-20020a17090b104eb029015be008ab0fso228512pjb.1 for ; Tue, 11 May 2021 04:49:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=6GzUXw/5sVyUjFwJsWC5Mre/0Yn3N9FQiCKmLqdwxb4=; b=p5YvlIlIzbszhRHVd+Pl5UILBmOeVa0DhM6hBK4XL8HUyeCuJjnBqoGuMaW9Ih1vIj 1vBZyHZ7IIfKCyu7ZJwmLns91gemHf2fFEB2Hazylmk3byEllEFK50eCxiTSl4KawUp3 9gd52QFKLsIpE11SlYqulmKDz0Wtnu8KlMkE7lCtqoi/Mi9JPk6mFl9aLohuxsAJqSIW 65oKHTJ+2fXmCwbGtw9PVSCR2wpRhzJg55jWSXg6QOmDWBi9UevaJFtE8VI6uGsjOeAC 48TNHwPIp4qSEFUbPcoJOoSVxPAjyf6wnk2wnzF6l9sjN0hvG0ZvB2WKKkb2e/WbXAAf kfyA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=6GzUXw/5sVyUjFwJsWC5Mre/0Yn3N9FQiCKmLqdwxb4=; b=NVHsNdTu6cg1T9CbQH/H8N/r/slLP0ins6JZ6JU+C9f8F07leqkuQiwXV1Rs95PVNn dlW7Vtj1JMkAfpb9YUGSiy+F8zPbt8xnFXC5DUlAGTZULYq3DiwMs4pIopN5QDRmGqZf Ve8MuZXyCCCmWnyaS0fvP4yEL6jy2GQjoWQxbfABRjQ32jnzaa2CCf6H9jITEPdkTxII wjYfZctGXQD9IPjlvScmh6IhEeOK8c2AxHPjHkXZTYhC+xXmyoHciGQIF/aFKFU7MZeN 4JDPwedZ7PD5eyr3yiHdnsxr1NqIDYGaCa3C1kZnr9xzUOt6Dl7qWgS3xLp2JIaHZTc5 oL7w== X-Gm-Message-State: AOAM532FtTiZFmIUqSMuosBovVWw62IcirbZOW0U340JGWmNpM/ooukp Y4nyA0cpT3zYk07zGxIfZ7RuvjW7qrc= X-Google-Smtp-Source: ABdhPJwMGlJTh5yWtNc42++S8vikP8NuiAx0PVf4WSeOvCRvw55QViUugWFZggWiyR6NawMiI+fVWA== X-Received: by 2002:a17:902:6901:b029:ee:e531:ca5f with SMTP id j1-20020a1709026901b02900eee531ca5fmr29577638plk.37.1620733764665; Tue, 11 May 2021 04:49:24 -0700 (PDT) Received: from garuda ([122.171.167.72]) by smtp.gmail.com with ESMTPSA id c12sm3861017pfr.154.2021.05.11.04.49.23 (version=TLS1_2 cipher=ECDHE-ECDSA-CHACHA20-POLY1305 bits=256/256); Tue, 11 May 2021 04:49:24 -0700 (PDT) References: <20210428065152.77280-1-chandanrlinux@gmail.com> <20210428065152.77280-2-chandanrlinux@gmail.com> <20210429011231.GF63242@dread.disaster.area> <875z0399gw.fsf@garuda> <20210430224415.GG63242@dread.disaster.area> <87y2cwnnzp.fsf@garuda> <20210504000306.GJ63242@dread.disaster.area> <874kfh5p32.fsf@garuda> <20210506032751.GN63242@dread.disaster.area> User-agent: mu4e 1.0; emacs 26.1 From: Chandan Babu R To: Dave Chinner Cc: linux-xfs@vger.kernel.org Subject: Re: [PATCH 2/2] xfs: Prevent deadlock when allocating blocks for AGFL In-reply-to: <20210506032751.GN63242@dread.disaster.area> Date: Tue, 11 May 2021 17:19:21 +0530 Message-ID: <87cztxwkvy.fsf@garuda> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On 06 May 2021 at 08:57, Dave Chinner wrote: > On Wed, May 05, 2021 at 06:12:41PM +0530, Chandan Babu R wrote: >> > Hence when doing allocation for the free list, we need to fail the >> > allocation rather than block on the only remaining free extent in >> > the AG. If we are freeing extents, the AGFL not being full is not an >> > issue at all. And if we are allocating extents, the transaction >> > reservations should have ensured that the AG had sufficient space in >> > it to complete the entire operation without deadlocking waiting for >> > space. >> > >> > Either way, I don't see a problem with making sure the AGFL >> > allocations just skip busy extents and fail if the only free extents >> > are ones this transaction has freed itself. >> > >> >> Hmm. In the scenario where *all* free extents in the AG were originally freed >> by the current transaction (and hence busy in the transaction), > > How does that happen? I tried in vain to arrive at the above mentioned scenario by consuming away as many blocks as possible from the filesystem. At best, I could arrive at an AG with just one free extent record in the cntbt (NOTE: I had to disable global reservation by invoking "xfs_io -x -c 'resblks 0' $mntpnt"): recs[1] = [startblock,blockcount] 1:[32767,1] For each AG available in an FS instance, we take away 8 (i.e. XFS_ALLOC_AGFL_RESERVE + 4) blocks from the global free data blocks counter. This reservation is applied to the FS as a whole rather than each AG individually. Hence we could get to a scenario where an AG could have less than 8 free blocks. I could not find any other restriction in the code that explicitly prevents an AG from having zero free extents. However, I could not create such an AG because any fs operation that needs extent allocation to be done would try to reserve more than 1 extent causing the above cited AG to not be chosen. > >> we would need >> to be able to recognize this situation and skip invoking >> xfs_extent_busy_flush() altogether. > > If we are freeing extents (i.e XFS_ALLOC_FLAG_FREEING is set) and > we are doing allocation for AGFL and we only found busy extents, > then it's OK to fail the allocation. When freeing an extent, the following patch skips allocation of blocks to AGFL if all the free extents found are busy, diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index aaa19101bb2a..5310e311d5c6 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -1694,6 +1694,7 @@ xfs_alloc_ag_vextent_size( * are no smaller extents available. */ if (!i) { +alloc_small_extent: error = xfs_alloc_ag_vextent_small(args, cnt_cur, &fbno, &flen, &i); if (error) @@ -1710,6 +1711,9 @@ xfs_alloc_ag_vextent_size( /* * Search for a non-busy extent that is large enough. */ + xfs_agblock_t orig_fbno = NULLAGBLOCK; + xfs_extlen_t orig_flen; + for (;;) { error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i); if (error) @@ -1719,6 +1723,11 @@ xfs_alloc_ag_vextent_size( goto error0; } + if (orig_fbno == NULLAGBLOCK) { + orig_fbno = fbno; + orig_flen = flen; + } + busy = xfs_alloc_compute_aligned(args, fbno, flen, &rbno, &rlen, &busy_gen); @@ -1729,6 +1738,13 @@ xfs_alloc_ag_vextent_size( if (error) goto error0; if (i == 0) { + if (args->freeing_extent) { + error = xfs_alloc_lookup_eq(cnt_cur, + orig_fbno, orig_flen, &i); + ASSERT(!error && i); + goto alloc_small_extent; + } + /* * Our only valid extents must have been busy. * Make it unbusy by forcing the log out and @@ -1819,7 +1835,7 @@ xfs_alloc_ag_vextent_size( */ args->len = rlen; if (rlen < args->minlen) { - if (busy) { + if (busy && !args->freeing_extent) { xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); trace_xfs_alloc_size_busy(args); xfs_extent_busy_flush(args->mp, args->pag, busy_gen); @@ -2641,6 +2657,7 @@ xfs_alloc_fix_freelist( targs.alignment = targs.minlen = targs.prod = 1; targs.type = XFS_ALLOCTYPE_THIS_AG; targs.pag = pag; + targs.freeing_extent = flags & XFS_ALLOC_FLAG_FREEING; error = xfs_alloc_read_agfl(mp, tp, targs.agno, &agflbp); if (error) goto out_agbp_relse; diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index a4427c5775c2..1e0fc28ef87a 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -78,6 +78,7 @@ typedef struct xfs_alloc_arg { #ifdef DEBUG bool alloc_minlen_only; /* allocate exact minlen extent */ #endif + bool freeing_extent; } xfs_alloc_arg_t; /* With the above patch, xfs/538 cause the following call trace to be printed, XFS (vdc2): Internal error i != 1 at line 3426 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_btree_insert+0x15c/0x1f0 CPU: 2 PID: 1284 Comm: punch-alternati Not tainted 5.12.0-rc8-next-20210419-chandan #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0x64/0x7c xfs_corruption_error+0x85/0x90 ? xfs_btree_insert+0x15c/0x1f0 xfs_btree_insert+0x18d/0x1f0 ? xfs_btree_insert+0x15c/0x1f0 ? xfs_allocbt_init_common+0x30/0xf0 xfs_free_ag_extent+0x463/0x9d0 __xfs_free_extent+0xe5/0x200 xfs_trans_free_extent+0x3e/0x100 xfs_extent_free_finish_item+0x24/0x40 xfs_defer_finish_noroll+0x1f7/0x5c0 __xfs_trans_commit+0x12f/0x300 xfs_free_file_space+0x1af/0x2c0 xfs_file_fallocate+0x1ca/0x430 ? __cond_resched+0x16/0x40 ? inode_security+0x22/0x60 ? selinux_file_permission+0xe2/0x120 vfs_fallocate+0x146/0x2e0 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae The above call trace occurs during execution of the step #2 listed below, 1. Filling up 90% of the free space of the filesystem. 2. Punch alternate blocks of files. Just before the failure, the filesystem had ~9000 busy extents. So I think we have to flush busy extents even when refilling AGFL for the purpose of freeing an extent. > > We have options here - once we get to the end of the btree and > haven't found a candidate that isn't busy, we could fail > immediately. Or maybe we try an optimisitic flush which forces the > log and waits for as short while (instead of forever) for the > generation to change and then fail if we get a timeout response. Or > maybe there's a more elegant way of doing this that hasn't yet > rattled out of my poor, overloaded brain right now. > > Just because we currently do a blocking flush doesn't mean we always > must do a blocking flush.... I will try to work out a solution. -- chandan