From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A1376C47080 for ; Mon, 31 May 2021 10:02:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7708C60E08 for ; Mon, 31 May 2021 10:02:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231177AbhEaKDu (ORCPT ); Mon, 31 May 2021 06:03:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40496 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230518AbhEaKDu (ORCPT ); Mon, 31 May 2021 06:03:50 -0400 Received: from mail-pj1-x1036.google.com (mail-pj1-x1036.google.com [IPv6:2607:f8b0:4864:20::1036]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BC94CC061574 for ; Mon, 31 May 2021 03:02:09 -0700 (PDT) Received: by mail-pj1-x1036.google.com with SMTP id m8-20020a17090a4148b029015fc5d36343so6452366pjg.1 for ; Mon, 31 May 2021 03:02:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=QKqjyRs3mWkJjJvo73kAm/6h0AV6R4c66zLdPzlFTns=; b=csRmHh8Pr5/MxOKd1H62jc5+lMgQ9T2d6O4pWwH/HBZLa83BpClnitxfdlIPjY1580 1jQ0qq7nbCNox0AISF6mOU2Cq88EoNvBVY7/oc98Kx7drik+mQ3f5QmL25Z55WQwvVbN qe086ZWHNP9eOfsZ3e4x+2E0XLwTTcjJIb8blFSzFTBqsqay4lBu5MhCZoFKp7H8dq+Y WzlxBUjE38xGJ/7QswJQ2xcufpknuOs7RWzs7eoBIKQ0L2Z206UvmOxHfv8xCFPI7PdI lf2a392tIcYHenHDLKo89PBXLz0zVXezt18BfI2RxS8jW0DqRxmpL3nQAkDWQokd3Nvq X9Rg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=QKqjyRs3mWkJjJvo73kAm/6h0AV6R4c66zLdPzlFTns=; b=NynzmBZwcKud0YKUDfsxb9c/XxoRkPf/JkamGsnYqcgPyxeZ0TCNE9Ori0ppDSIw8y /dOlCRjowQDy3Zvp1gZbXhoCBPYzTw2RTmfBSxGEi92usYAm/hTgaQVb3+8yRjJHXy13 omUsJxoKuE70YOZETE78Rtq1d6CPtjxsVPlkcbB+dRsEOzx+O+HzMoU+sd2JJZ+r1ml2 i7fgooacXodBFNdQTNbY/vcPtqBm4gHzWxJqvpLAn+oybdNnvVGXSXhp+YqqhQ7adFmH MxzIciAkNTi/4ZdCmwq7lZITdgqeTcT2vPjmnOwBiqam2LB5uTw3iEGYVRU16i1pCX0U tzwA== X-Gm-Message-State: AOAM530YODUIQ901+DXXs2V5q2EUyg66EPHT71LEeDlK8oe9FvN7AqMG otfAPIW40Zgi5LKP6HG/2W/VSMbNXq8oGg== X-Google-Smtp-Source: ABdhPJxeE9reWXH4gjLYeDZueLzmR/CqjZrztAX33YAABgrnvv9w5exRRFBUW01tRH9vf9RFUvCtwg== X-Received: by 2002:a17:90a:4fc2:: with SMTP id q60mr18045283pjh.64.1622455329058; Mon, 31 May 2021 03:02:09 -0700 (PDT) Received: from garuda ([122.171.220.253]) by smtp.gmail.com with ESMTPSA id d18sm11292078pgm.93.2021.05.31.03.02.07 (version=TLS1_2 cipher=ECDHE-ECDSA-CHACHA20-POLY1305 bits=256/256); Mon, 31 May 2021 03:02:08 -0700 (PDT) References: <20210527045202.1155628-1-david@fromorbit.com> User-agent: mu4e 1.0; emacs 26.1 From: Chandan Babu R To: Dave Chinner Cc: linux-xfs@vger.kernel.org Subject: Re: [PATCH 0/6] xfs: bunmapi needs updating for deferred freeing In-reply-to: <20210527045202.1155628-1-david@fromorbit.com> Date: Mon, 31 May 2021 15:32:05 +0530 Message-ID: <87fsy3uspu.fsf@garuda> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On 27 May 2021 at 10:21, Dave Chinner wrote: > Hi folks, > > I pulled on a loose thread when I started looking into the 64kB > directory block size assert failure I was seeing while trying to > test the bulk page allocation changes. > > I posted the first patch in the series separately - it fixed the > immediate assert failure (5.13-rc1 regression) I was seeing, but in > fixing that it only then dropped back to the previous assert failure > that g/538 was triggering with 64kb directory block sizes. This can > only be reproduced on 5.12, because that's when the error injection > that g/538 uses was added. So I went looking deeper. > > It turns out that xfs_bunmapi() has some code in it to avoid locking > AGFs in the wrong order and this is what was triggering. Many of the > xfs_bunmapi() callers can not/do not handle partial unmaps that > return success, and that's what the directory code is tripping over > trying to free badly fragmented directory blocks. > > This AGF locking order constraint was added to xfs_bunmapu in 2017 > to avoid a deadlock in g/299. Sad thing is that shortly after this, > we converted xfs-bunmapi to use deferred freeing, so it never > actually locks AGFs anymore. But the deadlock avoiding landmine > remained. And xfs_bmap_finish() went away, too, and we now only ever > put one extent in any EFI we log for deferred freeing. I did come across a scenario (when executing xfs/538 with 1k fs block size and 64k directory block size) where an EFI item contained three extents: - Two of those extents belonged to the file whose extents were being freed. - One more extent was added by xfs_bmap_btree_to_extents(). The corresponding call trace was, CPU: 3 PID: 1367 Comm: fsstress Not tainted 5.12.0-rc8-next-20210419-chandan #125 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0x64/0x7c xfs_defer_add.cold+0x1d/0x22 xfs_bmap_btree_to_extents+0x1f6/0x470 __xfs_bunmapi+0x50a/0xe60 ? xfs_trans_alloc_inode+0xbb/0x180 xfs_bunmapi+0x15/0x30 xfs_free_file_space+0x241/0x2c0 xfs_file_fallocate+0x1ca/0x430 ? __cond_resched+0x16/0x40 ? inode_security+0x22/0x60 ? selinux_file_permission+0xe2/0x120 vfs_fallocate+0x146/0x2e0 ioctl_preallocate+0x8f/0xc0 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae > > That means we now only free one extent per transaction via deferred > freeing, With three instances of xfs_extent_free_items associated with one instance of xfs_defer_pending, xfs_defer_finish_noroll() would, 1. Create an EFI item containing information about the three extents to be freed. - The extents in xfs_defer_pending->dfp_work list are sorted based on AG number. 2. Roll the transaction. 3. The new transaction would, - Create an EFD item to hold information about the three extents to be freed. - Free the three extents in a single transaction. > and there are no limitations on what order xfs_bunmapi() > can unmap extents. I think the sorting of extent items mentioned above is the reason that AG locks are obtained in increasing AGNO order while freeing extents. > 64kB directories on a 1kB block size filesystem > already unmap 64 extents in a single loop, so there's no real > limitation here. I think, in the worst case, we can free atmost XFS_EFI_MAX_FAST_EXTENTS (i.e. 16) extents in a single transaction assuming that they were all added in a sequence without any non-XFS_DEFER_OPS_TYPE_FREE deferred objects added in between. > > This means that the limitations of how many extents we can unmap per > loop in xfs_itruncate_extents_flags() goes away for data device > extents (and will eventually go away for RT devices, too, when > Darrick's RT EFI stuff gets merged). > > This "one data deveice extent free per transaction" change now means > that all of the transaction reservations that include > "xfs_bmap_finish" based freeing reservations are wrong. These extent > frees are now done by deferred freeing, and so they only need a > single extent free reservation instead of up to 4 (as truncate was > reserving). > > This series fixes the btree fork regression, the bunmapi partial > unmap regression from 2017, extends xfs_itruncate_extents to unmap > 64 extents at a time for data device (AG) resident extents, and > reworks the transaction reservations to use a consistent and correct > reservation for allocation and freeing extents. The size of some > transaction reservations drops dramatically as a result. > > The first two patches are -rcX candidates, the rest are for the next > merge cycle.... > -- chandan