* [BUG?] bcachefs: keep writing to device when there is no high-level I/O activity.
@ 2024-08-27 9:49 David Wang
2024-08-27 16:17 ` Kent Overstreet
2024-09-05 4:26 ` Kent Overstreet
0 siblings, 2 replies; 4+ messages in thread
From: David Wang @ 2024-08-27 9:49 UTC (permalink / raw)
To: kent.overstreet; +Cc: linux-bcachefs, linux-kernel
Hi,
I was using two partitions on the same NVMe device to compare filesystem performance,
and I consistently observed a strange behavior:
After a 10-minute fio test with bcachefs on one partition, performance degrades
significantly for other filesystems on the other partition (same device).
ext4 150M/s --> 143M/s
xfs 150M/s --> 134M/s
btrfs 127M/s --> 108M/s
Several rounds of tests show the same pattern: bcachefs seems to occupy some device resource
even when there is no high-level I/O.
I monitored /proc/diskstats, and it confirmed that bcachefs does keep writing to the device.
Following are the time-series samples of "writes_completed" on my bcachefs partition:
writes_completed @timestamp
0 @1724748233.712
4 @1724748248.712 <--- mkfs
4 @1724748263.712
65 @1724748278.712
25350 @1724748293.712
63839 @1724748308.712 <--- fio started
352228 @1724748323.712
621350 @1724748338.712
903487 @1724748353.712
...
12790311 @1724748863.712
13100041 @1724748878.712
13419642 @1724748893.712
13701685 @1724748908.712 <--- fio done (10minutes)
13701769 @1724748923.712 <--- from here, average 5~7 writes/second for 2000 seconds
13701852 @1724748938.712
13701953 @1724748953.712
13702032 @1724748968.712
13702133 @1724748983.712
13702213 @1724748998.712
13702265 @1724749013.712
13702357 @1724749028.712
...
13712984 @1724750858.712
13713076 @1724750873.712
13713196 @1724750888.712
13713299 @1724750903.712
13713386 @1724750918.712
13713463 @1724750933.712
13713501 @1724750948.712 <--- writes stopped here
13713501 @1724750963.712
13713501 @1724750978.712
...
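The samples above can be produced with a small sketch like this (the partition name is an assumption, pass your own; field 3 of each /proc/diskstats line is the device name and field 8 is the completed-write count, per Documentation/admin-guide/iostats.rst):

```shell
#!/bin/sh
# Print one "writes_completed @timestamp" sample for a partition.
# Run from a loop (e.g. `while sleep 15; do ...; done`) to get a
# time series like the one above.
DEV=${1:-nvme0n1p1}   # partition to watch (assumption; pass yours)
# $3 is the device name, $8 the count of completed writes.
wr=$(awk -v d="$DEV" '$3 == d { print $8 }' /proc/diskstats)
printf '%s @%s\n' "${wr:-0}" "$(date +%s.%3N)"
```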
Is this behavior expected?
My test script:
set -e
for fsa in "btrfs" "ext4" "bcachefs" "xfs"
do
    if [ "$fsa" = "ext4" ]; then
        mkfs -t ext4 -F /dev/nvme0n1p1
    else
        mkfs -t "$fsa" -f /dev/nvme0n1p1
    fi
    mount -t "$fsa" /dev/nvme0n1p1 /disk02/dir1
    for fsb in "ext4" "bcachefs" "xfs" "btrfs"
    do
        if [ "$fsb" = "ext4" ]; then
            mkfs -t ext4 -F /dev/nvme0n1p2
        else
            mkfs -t "$fsb" -f /dev/nvme0n1p2
        fi
        mount -t "$fsb" /dev/nvme0n1p2 /disk02/dir2
        cd /disk02/dir1 && fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --runtime=600 --numjobs=8 --time_based=1 --output=/disk02/fio.${fsa}.${fsb}.0
        sleep 30
        cd /disk02/dir2 && fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --runtime=600 --numjobs=8 --time_based=1 --output=/disk02/fio.${fsa}.${fsb}.1
        sleep 30
        cd /disk02
        umount /disk02/dir2
    done
    umount /disk02/dir1
done
And here is a report for one round of the test matrix:
+----------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
| R|W | ext4 | bcachefs | xfs | btrfs |
+----------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
| ext4 | [ext4]147MB/s|147MB/s | [ext4]146MB/s|146MB/s | [ext4]150MB/s|150MB/s | [ext4]149MB/s|149MB/s |
| | [ext4]146MB/s|146MB/s | [bcachefs]72.2MB/s|72.2MB/s | [xfs]149MB/s|149MB/s | [btrfs]132MB/s|132MB/s |
| bcachefs | [bcachefs]71.9MB/s|71.9MB/s | [bcachefs]65.1MB/s|65.1MB/s | [bcachefs]69.6MB/s|69.6MB/s | [bcachefs]65.8MB/s|65.8MB/s |
| | [ext4]143MB/s|143MB/s | [bcachefs]71.5MB/s|71.5MB/s | [xfs]134MB/s|133MB/s | [btrfs]108MB/s|108MB/s |
| xfs | [xfs]148MB/s|148MB/s | [xfs]147MB/s|147MB/s | [xfs]152MB/s|152MB/s | [xfs]151MB/s|151MB/s |
| | [ext4]147MB/s|147MB/s | [bcachefs]71.3MB/s|71.3MB/s | [xfs]148MB/s|148MB/s | [btrfs]127MB/s|127MB/s |
| btrfs | [btrfs]132MB/s|132MB/s | [btrfs]112MB/s|111MB/s | [btrfs]110MB/s|110MB/s | [btrfs]110MB/s|110MB/s |
| | [ext4]147MB/s|146MB/s | [bcachefs]69.7MB/s|69.7MB/s | [xfs]146MB/s|146MB/s | [btrfs]125MB/s|125MB/s |
+----------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
(The rows are for the FS on the first partition, and the columns are for the FS on the second partition.)
The version of bcachefs-tools on my system is 1.9.1.
(The impact was worse, with ext4 dropping to 80M/s, when I was using bcachefs-tools from the Debian repos, which is too *old*
and known to cause bcachefs problems. That is the reason I ran this kind of test.)
Thanks
David
* Re: [BUG?] bcachefs: keep writing to device when there is no high-level I/O activity.
2024-08-27 9:49 [BUG?] bcachefs: keep writing to device when there is no high-level I/O activity David Wang
@ 2024-08-27 16:17 ` Kent Overstreet
2024-08-30 3:08 ` David Wang
2024-09-05 4:26 ` Kent Overstreet
1 sibling, 1 reply; 4+ messages in thread
From: Kent Overstreet @ 2024-08-27 16:17 UTC (permalink / raw)
To: David Wang; +Cc: linux-bcachefs, linux-kernel, linux-fsdevel
On Tue, Aug 27, 2024 at 05:49:33PM GMT, David Wang wrote:
> Hi,
>
> I was using two partitions on the same NVMe device to compare filesystem performance,
> and I consistently observed a strange behavior:
>
> After a 10-minute fio test with bcachefs on one partition, performance degrades
> significantly for other filesystems on the other partition (same device).
>
> ext4 150M/s --> 143M/s
> xfs 150M/s --> 134M/s
> btrfs 127M/s --> 108M/s
>
> Several rounds of tests show the same pattern: bcachefs seems to occupy some device resource
> even when there is no high-level I/O.
This is a known issue; it should be either journal reclaim or
rebalance.
(We could use some better stats to see exactly which it is)
The algorithm for how we do background work needs to change; I've
written up a new one, but I'm a ways off from having time to implement it:
https://evilpiepirate.org/git/bcachefs.git/commit/?h=bcachefs-garbage&id=47a4b574fb420aa824aad222436f4c294daf66ae
Could be a fun one for someone new to take on.
>
> I monitored /proc/diskstats, and it confirmed that bcachefs does keep writing to the device.
> Following are the time-series samples of "writes_completed" on my bcachefs partition:
>
> writes_completed @timestamp
> 0 @1724748233.712
> 4 @1724748248.712 <--- mkfs
> 4 @1724748263.712
> 65 @1724748278.712
> 25350 @1724748293.712
> 63839 @1724748308.712 <--- fio started
> 352228 @1724748323.712
> 621350 @1724748338.712
> 903487 @1724748353.712
> ...
> 12790311 @1724748863.712
> 13100041 @1724748878.712
> 13419642 @1724748893.712
> 13701685 @1724748908.712 <--- fio done (10minutes)
> 13701769 @1724748923.712 <--- from here, average 5~7 writes/second for 2000 seconds
> 13701852 @1724748938.712
> 13701953 @1724748953.712
> 13702032 @1724748968.712
> 13702133 @1724748983.712
> 13702213 @1724748998.712
> 13702265 @1724749013.712
> 13702357 @1724749028.712
> ...
> 13712984 @1724750858.712
> 13713076 @1724750873.712
> 13713196 @1724750888.712
> 13713299 @1724750903.712
> 13713386 @1724750918.712
> 13713463 @1724750933.712
> 13713501 @1724750948.712 <--- writes stopped here
> 13713501 @1724750963.712
> 13713501 @1724750978.712
> ...
>
> Is this behavior expected?
>
> My test script:
> set -e
> for fsa in "btrfs" "ext4" "bcachefs" "xfs"
> do
> if [ $fsa == 'ext4' ]; then
> mkfs -t ext4 -F /dev/nvme0n1p1
> else
> mkfs -t $fsa -f /dev/nvme0n1p1
> fi
> mount -t $fsa /dev/nvme0n1p1 /disk02/dir1
> for fsb in "ext4" "bcachefs" "xfs" "btrfs"
> do
> if [ $fsb == 'ext4' ]; then
> mkfs -t ext4 -F /dev/nvme0n1p2
> else
> mkfs -t $fsb -f /dev/nvme0n1p2
> fi
> mount -t $fsb /dev/nvme0n1p2 /disk02/dir2
>
> cd /disk02/dir1 && fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --runtime=600 --numjobs=8 --time_based=1 --output=/disk02/fio.${fsa}.${fsb}.0
> sleep 30
> cd /disk02/dir2 && fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --runtime=600 --numjobs=8 --time_based=1 --output=/disk02/fio.${fsa}.${fsb}.1
> sleep 30
> cd /disk02
> umount /disk02/dir2
> done
> umount /disk02/dir1
> done
>
> And here is a report for one round of test matrix:
> +----------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
> | R|W | ext4 | bcachefs | xfs | btrfs |
> +----------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
> | ext4 | [ext4]147MB/s|147MB/s | [ext4]146MB/s|146MB/s | [ext4]150MB/s|150MB/s | [ext4]149MB/s|149MB/s |
> | | [ext4]146MB/s|146MB/s | [bcachefs]72.2MB/s|72.2MB/s | [xfs]149MB/s|149MB/s | [btrfs]132MB/s|132MB/s |
> | bcachefs | [bcachefs]71.9MB/s|71.9MB/s | [bcachefs]65.1MB/s|65.1MB/s | [bcachefs]69.6MB/s|69.6MB/s | [bcachefs]65.8MB/s|65.8MB/s |
> | | [ext4]143MB/s|143MB/s | [bcachefs]71.5MB/s|71.5MB/s | [xfs]134MB/s|133MB/s | [btrfs]108MB/s|108MB/s |
> | xfs | [xfs]148MB/s|148MB/s | [xfs]147MB/s|147MB/s | [xfs]152MB/s|152MB/s | [xfs]151MB/s|151MB/s |
> | | [ext4]147MB/s|147MB/s | [bcachefs]71.3MB/s|71.3MB/s | [xfs]148MB/s|148MB/s | [btrfs]127MB/s|127MB/s |
> | btrfs | [btrfs]132MB/s|132MB/s | [btrfs]112MB/s|111MB/s | [btrfs]110MB/s|110MB/s | [btrfs]110MB/s|110MB/s |
> | | [ext4]147MB/s|146MB/s | [bcachefs]69.7MB/s|69.7MB/s | [xfs]146MB/s|146MB/s | [btrfs]125MB/s|125MB/s |
> +----------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
> (The rows are for the FS on the first partition, and the cols are on the second partition)
>
> The version of bcachefs-tools on my system is 1.9.1.
> (The impact was worse, with ext4 dropping to 80M/s, when I was using bcachefs-tools from the Debian repos, which is too *old*
> and known to cause bcachefs problems. That is the reason I ran this kind of test.)
>
>
> Thanks
> David
>
* Re:Re: [BUG?] bcachefs: keep writing to device when there is no high-level I/O activity.
2024-08-27 16:17 ` Kent Overstreet
@ 2024-08-30 3:08 ` David Wang
0 siblings, 0 replies; 4+ messages in thread
From: David Wang @ 2024-08-30 3:08 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel, linux-fsdevel
Hi,
At 2024-08-28 00:17:12, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Tue, Aug 27, 2024 at 05:49:33PM GMT, David Wang wrote:
>> Hi,
>>
>> I was using two partitions on the same NVMe device to compare filesystem performance,
>> and I consistently observed a strange behavior:
>>
>> After a 10-minute fio test with bcachefs on one partition, performance degrades
>> significantly for other filesystems on the other partition (same device).
>>
>> ext4 150M/s --> 143M/s
>> xfs 150M/s --> 134M/s
>> btrfs 127M/s --> 108M/s
>>
>> Several rounds of tests show the same pattern: bcachefs seems to occupy some device resource
>> even when there is no high-level I/O.
>
>This is a known issue; it should be either journal reclaim or
>rebalance.
>
>(We could use some better stats to see exactly which it is)
>
I kprobed bch2_submit_wbio_replicas and then bch2_btree_node_write, and confirmed that
the background writes were from bch2_journal_reclaim_thread.
(And then, by skimming the code in __bch2_journal_reclaim, I noticed those trace_and_count stats.)
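For reference, a sketch of that kind of probe using the tracefs kprobe_events interface (needs root; the tracefs mount point and the probe name "bch_wbio" are assumptions, and the function symbol must exist in your kernel):

```shell
#!/bin/sh
# Count which tasks hit bch2_submit_wbio_replicas over a 10s window,
# via a dynamic kprobe. Tracefs is often at /sys/kernel/tracing or
# /sys/kernel/debug/tracing (assumption; adjust as needed).
T=${TRACEFS:-/sys/kernel/tracing}
[ -w "$T/kprobe_events" ] || { echo "needs root and tracefs at $T" >&2; exit 0; }
echo 'p:bch_wbio bch2_submit_wbio_replicas' >> "$T/kprobe_events"
echo 1 > "$T/events/kprobes/bch_wbio/enable"
sleep 10
# Column 1 of the trace buffer is "TASK-PID"; tally hits per task.
grep ' bch_wbio:' "$T/trace" | awk '{print $1}' | sort | uniq -c | sort -rn
echo 0 > "$T/events/kprobes/bch_wbio/enable"
echo '-:bch_wbio' >> "$T/kprobe_events"
```

If journal reclaim is the writer, the top task should be the bch-reclaim kernel thread.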
>The algorithm for how we do background work needs to change; I've
>written up a new one but I'm a ways off from having time to implement it
>
>https://evilpiepirate.org/git/bcachefs.git/commit/?h=bcachefs-garbage&id=47a4b574fb420aa824aad222436f4c294daf66ae
>
>Could be a fun one for someone new to take on.
>
>>
A fun and scary one....
For the issue in this thread,
I think *idle* should be defined device-wide:
when bcachefs is idle while another FS on the same block device is busy, those background threads should be throttled to some degree.
Thanks
David
* Re: [BUG?] bcachefs: keep writing to device when there is no high-level I/O activity.
2024-08-27 9:49 [BUG?] bcachefs: keep writing to device when there is no high-level I/O activity David Wang
2024-08-27 16:17 ` Kent Overstreet
@ 2024-09-05 4:26 ` Kent Overstreet
1 sibling, 0 replies; 4+ messages in thread
From: Kent Overstreet @ 2024-09-05 4:26 UTC (permalink / raw)
To: David Wang; +Cc: linux-bcachefs, linux-kernel
On Tue, Aug 27, 2024 at 05:49:33PM GMT, David Wang wrote:
> Hi,
>
> I was using two partitions on the same NVMe device to compare filesystem performance,
> and I consistently observed a strange behavior:
>
> After a 10-minute fio test with bcachefs on one partition, performance degrades
> significantly for other filesystems on the other partition (same device).
>
> ext4 150M/s --> 143M/s
> xfs 150M/s --> 134M/s
> btrfs 127M/s --> 108M/s
>
> Several rounds of tests show the same pattern: bcachefs seems to occupy some device resource
> even when there is no high-level I/O.
>
> I monitored /proc/diskstats, and it confirmed that bcachefs does keep writing to the device.
> Following are the time-series samples of "writes_completed" on my bcachefs partition:
>
> writes_completed @timestamp
> 0 @1724748233.712
> 4 @1724748248.712 <--- mkfs
> 4 @1724748263.712
> 65 @1724748278.712
> 25350 @1724748293.712
> 63839 @1724748308.712 <--- fio started
> 352228 @1724748323.712
> 621350 @1724748338.712
> 903487 @1724748353.712
> ...
> 12790311 @1724748863.712
> 13100041 @1724748878.712
> 13419642 @1724748893.712
> 13701685 @1724748908.712 <--- fio done (10minutes)
> 13701769 @1724748923.712 <--- from here, average 5~7 writes/second for 2000 seconds
> 13701852 @1724748938.712
> 13701953 @1724748953.712
> 13702032 @1724748968.712
> 13702133 @1724748983.712
> 13702213 @1724748998.712
> 13702265 @1724749013.712
> 13702357 @1724749028.712
> ...
> 13712984 @1724750858.712
> 13713076 @1724750873.712
> 13713196 @1724750888.712
> 13713299 @1724750903.712
> 13713386 @1724750918.712
> 13713463 @1724750933.712
> 13713501 @1724750948.712 <--- writes stopped here
> 13713501 @1724750963.712
> 13713501 @1724750978.712
> ...
>
> Is this behavior expected?
We've got tracepoints and counters (in sysfs) for checking what's doing
the writing.
If it's a moderate amount of IO, it's probably journal reclaim; it
trickles out writes if the journal has room (it's trying to spread out
bursty work). If it's a lot of IO, it might be a bug.
Changing journal reclaim behaviour (and other background tasks) is on
the todo list; the current behaviour is not great for machines where we
want to "race to idle" to save power.
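A sketch of how one might dump those counters (the /sys/fs/bcachefs/<uuid>/counters layout and the counter names matched here are assumptions; check your kernel version):

```shell
#!/bin/sh
# List journal/rebalance-related event counters for each mounted
# bcachefs filesystem. BASE is overridable for testing; the
# counters directory layout is an assumption.
BASE=${BCACHEFS_SYSFS:-/sys/fs/bcachefs}
for fs in "$BASE"/*/; do
    [ -d "${fs}counters" ] || continue
    echo "== ${fs}"
    # Each counter is a file; print "name:value" pairs of interest.
    grep -H . "${fs}counters"/* 2>/dev/null \
        | grep -Ei 'journal|reclaim|rebalance' || true
done
```

Sampling these before and after the fio run should show which counter keeps climbing during the quiet period.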