From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A33F3C54E94 for ; Wed, 25 Jan 2023 15:45:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235709AbjAYPpE (ORCPT ); Wed, 25 Jan 2023 10:45:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60250 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235020AbjAYPpD (ORCPT ); Wed, 25 Jan 2023 10:45:03 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 747663867B for ; Wed, 25 Jan 2023 07:44:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1674661455; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type; bh=tEt7Q1/r8y0k1WkcEB/B05kL8WxhpGE85spzlaRCJ+4=; b=IDejAEYp+uNNuuasN6xfqJXvc/eHrD1HA7f+ICBvA9Anj5Yi4oLzBWRJs3kxDtbcYN3zog 5jmY41XhR59vqaeDsSrYKy2ne7Sq2rlRlDeBHUBznS4VCqStVGIeOGvNYmaRWmYzVZMybj bL0XwB5mlzERsJ80sglP33xBje4FaxQ= Received: from mail-ua1-f70.google.com (mail-ua1-f70.google.com [209.85.222.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-186-51PQfgjCNiShvKXw6lty0g-1; Wed, 25 Jan 2023 10:44:13 -0500 X-MC-Unique: 51PQfgjCNiShvKXw6lty0g-1 Received: by mail-ua1-f70.google.com with SMTP id y10-20020ab0560a000000b003af33bfa8c4so5703168uaa.21 for ; Wed, 25 Jan 2023 07:44:13 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-disposition:mime-version:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=tEt7Q1/r8y0k1WkcEB/B05kL8WxhpGE85spzlaRCJ+4=; b=vMhMaLtLC5HBMdfR1FerslpeF6m6mEfHVJl8VgoLpIhBV4OQjxopaimkYw/fswSbII UGZz9sewTlwcNa/PaFPFv5ePnBuVNPf55oZ3cK5P/zhf89odxb33ZSjD4GByu1MfM8Zj 9MnvMhrDSryUEyjKSqy/qcmxB5uwTpQhSGc6m29sPCtEBL8SuslgEjJmGfQJPwPj+fOj DyoccWct/L/UUSRMBI5zqUmq6Z5oYpvWbpYFAkPeinyNzJtYbC413RxBKjh682foORSa sIsMclIxxhvHYy+e7CGngqQz7e5CBybkW/trhwCzL8OhocNrFmk6uvHg5muBHl0LY+64 id5Q== X-Gm-Message-State: AFqh2koxaVPlSK0Q6hKKUTN5eK0rsKwD9sv9iO0xXjTQuqISgGqVX7JP 4oudJC9OBCqvyjRlM6Op7HvLZaEQy6vMjoPUca6dGX4rCezLunGVL0a09QaReVz5dgEepCaUEIe BgXXM+4bdXxGLN4h7YkOT8PXRoIBXkO27bBfM/w/S/42Sg9rJ0Y+Pkc9nJ0EjdcIykFBFwDjWnH y4iA== X-Received: by 2002:a05:6102:290b:b0:3d0:ec6c:239 with SMTP id cz11-20020a056102290b00b003d0ec6c0239mr22844816vsb.14.1674661451798; Wed, 25 Jan 2023 07:44:11 -0800 (PST) X-Google-Smtp-Source: AMrXdXtL2LHkRFSoUiIUI2QfFeRIgeuTisTHazq+xuLmR2IhK2FnY/obIa/8MfFvj2ckcK5X1cLLow== X-Received: by 2002:a05:6102:290b:b0:3d0:ec6c:239 with SMTP id cz11-20020a056102290b00b003d0ec6c0239mr22844795vsb.14.1674661451325; Wed, 25 Jan 2023 07:44:11 -0800 (PST) Received: from bfoster (c-24-61-119-116.hsd1.ma.comcast.net. [24.61.119.116]) by smtp.gmail.com with ESMTPSA id x3-20020a05620a12a300b006fa4ac86bfbsm3658662qki.55.2023.01.25.07.44.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 25 Jan 2023 07:44:10 -0800 (PST) Date: Wed, 25 Jan 2023 10:45:24 -0500 From: Brian Foster To: linux-bcachefs@vger.kernel.org Cc: Kent Overstreet Subject: fstests generic/441 -- occasional bcachefs failure Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Precedence: bulk List-ID: X-Mailing-List: linux-bcachefs@vger.kernel.org Hi Kent, list, I've noticed occasional failures of generic/441 on bcachefs that look something like this: generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad) --- tests/generic/441.out 2022-10-11 14:20:36.986202162 -0400 +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500 @@ -1,3 +1,3 @@ QA output created by 441 Format and mount -Test passed! +First fsync after reopen of fd[0] failed: Input/output error ... (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad' to see the entire diff) This test covers fsync error reporting across multiple fds. What it does in a nutshell is open a bunch of fds, write to them and fsync them, and then repeat that a couple times over after switching to an error table (i.e. dm-error) and back to a working table, verifying error reporting expectations at each step along the way. The error above is associated with an fsync failure after switching from an error table back to a working table, and essentially occurs because the filesystem has shutdown and the bcachefs inode has a journal sequence number ahead of what the journal shows has been flushed to disk. That makes sense in general because even though the vfs error reporting might not expect an error, the filesystem has been shutdown and fsync has work to do. I've been digging into it a bit regardless to try and understand why the behavior seems transient.. What I'm seeing looks like a matter of timing at the point the test switches to an error table and issues the first fsync (where failure is expected). IIUC, what I think is happening here in the scenario where the test passes is something like this: 1. With the working table in place, the bch2_vfs_write_inode() -> bch2_write_inode() path commits a transaction with pending changes to the inode. These changes end up in some sort of cache (i.e. key cache?). 2. The bch2_flush_inode() path flushes the journal based on the sequence number in the inode. 3. Sometime shortly after that, journal reclaim kicks in presumably to write out pending btree node data and release the item from the journal. Since the test switches to an error table after fsync() return, this async (workqueue) journal reclaim context is what typically first encounters I/O failures and essentially shuts down the fs (i.e. we end up in bch2_fs_emergency_read_only()). 4. The test switches back to working context. The next fsync loop appears to succeed even though the fs is shutdown because bch2_vfs_write_inode() transaction commit fails with -EROFS in the commit path, which means the inode sequence number is not updated by the bch2_mark_inode() trigger, and thus bch2_flush_inode() appears to have nothing to do and returns 0. In the case where the test fails, the journal reclaim context in step 3 appears to race with the transaction commit in step 4 such that bch2_mark_inode() is called, updates the inode seq number, and transaction commit actually succeeds before the fs is shutdown. By the time bch2_flush_inode() is called, however, the journal->err_seq value has been set and the flush returns -EIO back to userspace. So all in all this doesn't really seem like a bug. The fs is shutdown either way and the error depends on whether fsync needs to flush the log. I was trying to think about whether this could be improved slightly to perhaps be more consistent and avoid spurious test failure. For example, an artificial delay after the dm-table switchover seems to always prevent the problem because it ensures journal reclaim has processed the error before a subsequent fsync has a chance to bump the inode sequence number. There's still a lot I don't understand here on the bcachefs side, so after poking around a bit I didn't see anything terribly obvious that might prevent this in the fs itself. AFAICS there is no serialization between journal reclaim and the transaction commit, so something like rechecking journal error state after the transaction is locked (but before the mem triggers are run) seems like it could help reduce the race window, but not necessarily prevent the race. Is that accurate? If so, any other thoughts on how we might prevent this particular failure? Sorry for the longish writeup on what likely amounts to a minor behavioral quirk... I dug into it as sort of a proxy to help get more familiar with the code, so if the right approach is basically to leave things as is (and/or potentially tweak the test to keep it quiet), then that seems reasonable to me as well. Brian