From: Chandan Babu R <chandan.babu@oracle.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Filipe Manana <fdmanana@kernel.org>,
fstests@vger.kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] fstests: UDEV_SETTLE_PROG before dmsetup create
Date: Mon, 06 Dec 2021 19:37:31 +0530 [thread overview]
Message-ID: <87fsr5rfmk.fsf@debian-BULLSEYE-live-builder-AMD64> (raw)
In-Reply-To: <87ilw2qo6z.fsf@debian-BULLSEYE-live-builder-AMD64>
On 06 Dec 2021 at 11:17, Chandan Babu R wrote:
> On 02 Dec 2021 at 21:15, Josef Bacik wrote:
>> On Thu, Dec 02, 2021 at 02:45:24PM +0000, Filipe Manana wrote:
>>> On Wed, Dec 01, 2021 at 01:18:52PM -0500, Josef Bacik wrote:
>>> > We've been seeing transient errors with any test that uses a dm device
>>> > for the entirety of the time that we've been running nightly xfstests
>>>
>>> I have been having it on my tests vms since ever as well.
>>> It's really annoying, but fortunatelly it doesn't happen too often.
>>>
>>
>> Yeah before this change we'd fail ~6 tests on every configruation on every
>> overnight run. With this change we fail 0. It's rare, but with our continual
>> testing it happens sooooooo much.
>>
>>> > runs. This turns out to be because sometimes we get EBUSY while trying
>>> > to create our new dm device. Generally this is because the test comes
>>> > right after another test that messes with the dm device, and thus we
>>> > still have udev messing around with the device when DM tries to O_EXCL
>>> > the block device.
>>> >
>>> > Add a UDEV_SETTLE_PROG before creating the device to make sure we can
>>> > create our new dm device without getting this transient error.
>>>
>>> I suspect this might only make it seem the problem goes away but does not
>>> really fix it.
>>>
>>> I say that for 2 reasons:
>>>
>>> 1) All tests that use dm end up calling _dmsetup_remove(), like through
>>> _log_writes_remove() or _cleanup_flakey() for example. Normally those
>>> are called in the _cleanup() function, which ensures it's done even if
>>> the test fails for some reason.
>>>
>>> So I don't understand why we need that UDEV_SETTLE_PROG at _dmsetup_create().
>>>
>>> And I've seen the ebusy failure happen even when the previous tests did
>>> not use any dm device;
>>>
>>> 2) Some tests fail after creating the dm device and using it. For example
>>> btrfs/206 often fails when it tries to fsck the filesystem:
>>>
>>> btrfs/206 3s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad)
>>> --- tests/btrfs/206.out 2020-10-16 23:13:46.554152652 +0100
>>> +++ /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad 2021-12-01 21:09:46.317632589 +0000
>>> @@ -3,3 +3,5 @@
>>> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>> wrote 8192/8192 bytes at offset 0
>>> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>> +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>> +(see /home/fdmanana/git/hub/xfstests/results//btrfs/206.full for details)
>>> ...
>>>
>>> (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/206.out /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad' to see the entire diff)
>>>
>>> In the .full file I got:
>>>
>>> (...)
>>> replaying 1239@11201: sector 2173408, size 16384, flags 0x10(METADATA)
>>> replaying 1240@11234: sector 0, size 0, flags 0x1(FLUSH)
>>> replaying 1241@11235: sector 128, size 4096, flags 0x12(FUA|METADATA)
>>> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>> *** fsck.btrfs output ***
>>> ERROR: cannot open device '/dev/sdc': Device or resource busy
>>> ERROR: cannot open file system
>>> Opening filesystem to check...
>>> *** end fsck.btrfs output
>>> *** mount output ***
>>>
>>> The ebusy failure is not when the test starts, but when somewhere in the middle
>>> of the replay loop when it calls fsck, or when it ends and the fstests framework
>>> calls fsck.
>>>
>>> I've seen that with btrfs/172 too, which also uses dm logwrites in a similar way.
>>>
>>> So to me this suggests 2 things:
>>>
>>> 1) Calling UDEV_SETTLE_PROG at _dmsetup_create() doesn't solve that problem with
>>> btrfs/206 (and other tests) - the problem is fsck failing to open the scratch
>>> device after it called _log_writes_remove() -> _dmsetup_remove(), and not a
>>> failure to create the dm device;
>>>
>>> 2) The problem is likely something missing at _dmsetup_remove(). Perhaps add
>>> another UDEV_SETTLE_PROG there:
>>>
>>> diff --git a/common/rc b/common/rc
>>> index 8e351f17..22b34677 100644
>>> --- a/common/rc
>>> +++ b/common/rc
>>> @@ -4563,6 +4563,7 @@ _dmsetup_remove()
>>> $UDEV_SETTLE_PROG >/dev/null 2>&1
>>> $DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
>>> $DMSETUP_PROG mknodes >/dev/null 2>&1
>>> + $UDEV_SETTLE_PROG >/dev/null 2>&1
>>> }
>>>
>>> _dmsetup_create()
>>>
>>> I can't say if that change to _dmsetup_remove() is correct, or what it's
>>> needed, as I really haven't spent time trying to figure out why the issue
>>> happens.
>>>
>>
>> I actually tried a few iterations before I settled on this one, but I was only
>> trying to reproduce the EBUSY when creating the flakey device, I hadn't seen it
>> with fsck. So I originally started with your change, but it didn't fix the
>> problem. Then I did both, UDEV_SETTLE at the end of remove and at the beginning
>> of create and the problem went away, and then I removed the one from remove and
>> the problem still was gone.
>>
>> But since I've made this change I also have been investigating another problem
>> where we'll get EBUSY at mkfs time when we use SCRATCH_DEV_POOL. I have a test
>> patch in our staging branch to make sure it actuall fixes it, but I have to add
>> this to the start of _scratch_pool_mkfs as well.
>>
>> It turns out that udev is doing this thing where it writes to
>> /sys/block/whatever/uevent to make sure that a KOBJ_CHANGE event gets sent out
>> for the device.
>>
>> This is racing with the test doing a mount. So the mount gets O_EXCL, which
>> means the uevent doesn't get emitted until umount. This would explain what
>> you're seeing, we umount, we get the KOBJ_CHANGE event once the O_EXCL is
>> dropped, udev runs, and then fsck gets an EBUSY.
>
> I started debugging this issue late last week. Among several tests which
> failed, xfs/033 was failing once in a while because the umount syscall
> returned -EBUSY. Debugging this further led me to the following,
>
> 1. Mounting an fs causes udev to invoke xfs_spaceman (via udisksd => xfs_info).
> This causes the per-cpu counter at mount->mnt_pcp->mnt_count to increase by
> 1.
> 2. Unmount at this stage causes the umount syscall to fail since the refcount
> on struct mount is greater than 2.
>
> I changed my test disks from the regular /dev/sd* devices to loop devices. I
> then added loop devices to be ignored by udev with the following change in
> /lib/udev/rules.d/60-persistent-storage.rules,
>
> KERNEL!="sr*|loop*", IMPORT{builtin}="blkid"
>
> This led to xfs/033 execute fine for 100 times without failure.
> However, other tests which use device mapper devices are still failing
> arbitrarily.
This failure occurs because 60-persistent-storage-dm.rules has the following,
# Add inotify watch to track changes on this device.
# Using the watch rule is not optimal - it generates a lot of spurious
# and useless events whenever the device opened for read-write is closed.
# The best would be to generete the event directly in the tool changing
# relevant information so only relevant events will be processed
# (like creating a filesystem, changing filesystem label etc.).
#
# But let's use this until we have something better...
LABEL="dm_watch"
OPTIONS+="watch"
Hence any change to the device will generate a uevent causing udev's internal
blkid to be invoked. This ends up forking and executing xfs_spaceman.
>
> Looks like a simple mount operation causes udev to indirectly invoke
> xfs_spaceman as described by the output of perf script,
>
> bash 50034 [002] 2541.601278: sched:sched_process_fork: comm=bash pid=50034 child_comm=bash child_pid=50044
> mount 50044 [003] 2541.601851: sched:sched_process_exec: filename=/usr/bin/mount pid=50044 old_pid=50044
> systemd-udevd 173 [000] 2541.620525: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50048
> systemd-udevd 173 [000] 2541.621071: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50051
> systemd-udevd 173 [000] 2541.621562: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50052
> systemd-udevd 173 [000] 2541.621971: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50053
> systemd-udevd 173 [000] 2541.622657: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50054
> udisksd 588 [000] 2541.675852: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50058
> xfs_admin 50058 [000] 2541.676645: sched:sched_process_exec: filename=/usr/sbin/xfs_admin pid=50058 old_pid=50058
> xfs_admin 50058 [000] 2541.677317: sched:sched_process_fork: comm=xfs_admin pid=50058 child_comm=xfs_admin child_pid=50059
> xfs_db 50059 [000] 2541.677569: sched:sched_process_exec: filename=/usr/sbin/xfs_db pid=50059 old_pid=50059
> udisksd 588 [003] 2541.687075: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50060
> xfs_info 50060 [001] 2541.687843: sched:sched_process_exec: filename=/usr/sbin/xfs_info pid=50060 old_pid=50060
> xfs_info 50060 [001] 2541.688362: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50061
> xfs_info 50061 [003] 2541.688615: sched:sched_process_fork: comm=xfs_info pid=50061 child_comm=xfs_info child_pid=50062
> xfs_info 50062 [001] 2541.688810: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50063
> xfs_info 50062 [001] 2541.688944: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50064
> losetup 50063 [003] 2541.689058: sched:sched_process_exec: filename=/usr/sbin/losetup pid=50063 old_pid=50063
> tail 50064 [000] 2541.689169: sched:sched_process_exec: filename=/usr/bin/tail pid=50064 old_pid=50064
> xfs_info 50060 [001] 2541.690200: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50065
> findmnt 50065 [003] 2541.690501: sched:sched_process_exec: filename=/usr/bin/findmnt pid=50065 old_pid=50065
> xfs_info 50060 [001] 2541.692154: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50066
> xfs_spaceman 50066 [003] 2541.692345: sched:sched_process_exec: filename=/usr/sbin/xfs_spaceman pid=50066 old_pid=50066
>
> xfs_spaceman in turn causes ref count on struct mount to increase. Hence an
> unmount operation can fail with -EBUSY error.
>
>>
>> This is a very long email to say that udev is causing spurious failures because
>> of behavior I don't entirely understand. We're going to need to sprinkle in
>> UDEV_SETTLE_PROG in different places to kill all of these different scenarios.
>>
>> What do we think is best here? Put UDEV_SETTLE_PROG at the start of any
>> function that needs to do O_EXCL? So this would mean
>>
>> _dmsetup_create
>> _dmsetup_remove
>> *_mkfs
>> *_mount
>> *_check
>>
>
> To be honest, I don't understand udev well enough. But as pointed above, a
> simple invocation of mount is causing xfs_spaceman to be executed (indirectly)
> by udevd. Hence, may be executing UDEV_SETTLE_PROG is probably not
> sufficient.
>
>> That would be safest, and I can certainly do that. My initial approach was just
>> to do it where it was problematic, but the debugging I did yesterday around
>> btrfs/223 failures and the fact that udev is queue'ing up events that get
>> delivered at some point in the future makes it kind of hard to handle on a
>> case-by-case basis. Thanks,
>>
>> Josef
--
chandan
next prev parent reply other threads:[~2021-12-06 14:08 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-01 18:18 [PATCH] fstests: UDEV_SETTLE_PROG before dmsetup create Josef Bacik
2021-12-02 14:45 ` Filipe Manana
2021-12-02 15:45 ` Josef Bacik
2021-12-06 5:47 ` Chandan Babu R
2021-12-06 14:07 ` Chandan Babu R [this message]
2021-12-06 19:28 ` Josef Bacik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87fsr5rfmk.fsf@debian-BULLSEYE-live-builder-AMD64 \
--to=chandan.babu@oracle.com \
--cc=fdmanana@kernel.org \
--cc=fstests@vger.kernel.org \
--cc=josef@toxicpanda.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox