* [PATCH] btrfs: don't force DIO writes to be serialized
@ 2026-04-22 14:03 Mark Harmstone
2026-04-22 20:57 ` David Sterba
2026-04-28 15:13 ` David Sterba
0 siblings, 2 replies; 7+ messages in thread
From: Mark Harmstone @ 2026-04-22 14:03 UTC (permalink / raw)
To: linux-btrfs; +Cc: josef, boris, Mark Harmstone
Before btrfs switched to the new mount API in 2023, we were setting
SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
filesystem may have files which don't have security xattrs, enabling it
to do some optimizations.
Unfortunately this was missed in the transition, meaning that IS_NOSEC
will always return false for a btrfs inode. This means that
btrfs_direct_write() calls will always get the inode lock exclusively,
meaning that DIO writes to the same file will be serialized.
On my machine, this one-line change results in a ~59% improvement in DIO
throughput:
Before patch:
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
...
fio-3.39
Starting 32 processes
test: Laying out IO file (1 file / 1024MiB)
Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
bw ( KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
iops : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
cpu : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec
After patch:
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
...
fio-3.39
Starting 32 processes
test: Laying out IO file (1 file / 1024MiB)
Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
bw ( MiB/s): min= 619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
iops : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
cpu : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec
Fixes: ad21f15b0f79 ("btrfs: switch to the new mount API")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/super.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a60bce413d33b5..fb15decb086189 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1872,6 +1872,7 @@ static int btrfs_get_tree_super(struct fs_context *fc)
fs_info->fs_devices = fs_devices;
mutex_unlock(&uuid_mutex);
+ fc->sb_flags |= SB_NOSEC;
sb = sget_fc(fc, btrfs_fc_test_super, set_anon_super_fc);
if (IS_ERR(sb)) {
--
2.52.0
* Re: [PATCH] btrfs: don't force DIO writes to be serialized
2026-04-22 14:03 [PATCH] btrfs: don't force DIO writes to be serialized Mark Harmstone
@ 2026-04-22 20:57 ` David Sterba
2026-04-23 10:04 ` Mark Harmstone
2026-04-28 15:13 ` David Sterba
1 sibling, 1 reply; 7+ messages in thread
From: David Sterba @ 2026-04-22 20:57 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs, josef, boris
On Wed, Apr 22, 2026 at 03:03:35PM +0100, Mark Harmstone wrote:
> Before btrfs switched to the new mount API in 2023, we were setting
> SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
> filesystem may have files which don't have security xattrs, enabling it
> to do some optimizations.
>
> Unfortunately this was missed in the transition, meaning that IS_NOSEC
> will always return false for a btrfs inode. This means that
> btrfs_direct_write() calls will always get the inode lock exclusively,
> meaning that DIO writes to the same file will be serialized.
>
> On my machine, this one-line change results in a ~59% improvement in DIO
> throughput:
That's quite an improvement. What's the actual fio script you've used?
Also the DIO depends on the block group profile wrt the buffered
fallback so that would be good to know too.
* Re: [PATCH] btrfs: don't force DIO writes to be serialized
2026-04-22 20:57 ` David Sterba
@ 2026-04-23 10:04 ` Mark Harmstone
2026-04-23 10:20 ` Qu Wenruo
2026-04-24 10:28 ` David Sterba
0 siblings, 2 replies; 7+ messages in thread
From: Mark Harmstone @ 2026-04-23 10:04 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs, josef, boris
On 22/04/2026 9.57 pm, David Sterba wrote:
> On Wed, Apr 22, 2026 at 03:03:35PM +0100, Mark Harmstone wrote:
>> Before btrfs switched to the new mount API in 2023, we were setting
>> SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
>> filesystem may have files which don't have security xattrs, enabling it
>> to do some optimizations.
>>
>> Unfortunately this was missed in the transition, meaning that IS_NOSEC
>> will always return false for a btrfs inode. This means that
>> btrfs_direct_write() calls will always get the inode lock exclusively,
>> meaning that DIO writes to the same file will be serialized.
>>
>> On my machine, this one-line change results in a ~59% improvement in DIO
>> throughput:
>
> That's quite an improvement. What's the actual fio script you've used?
> Also the DIO depends on the block group profile wrt the buffered
> fallback so that would be good to know too.
It is. There's a big dropoff in DIO write performance in 6.8 that we
never recovered from. I'm going to look into some sort of automated
performance testing so this kind of thing can't happen again.
This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
through PCI passthrough. The figures for XFS and ext4 in comparison are
both around 3GB/s.
# cat go
#!/bin/bash
mkfs.btrfs -f /dev/nvme0n1
mount /dev/nvme0n1 /mnt/test
mkdir /mnt/test/nocow
chattr +C /mnt/test/nocow
fio /root/test.fio
# cat /root/test.fio
[global]
rw=randwrite
ioengine=io_uring
iodepth=64
size=1g
direct=1
startdelay=20
force_async=4
ramp_time=5
runtime=60
group_reporting=1
numjobs=32
time_based
disk_util=0
clat_percentiles=0
disable_lat=1
disable_clat=1
disable_slat=1
filename=/mnt/test/nocow/fiofile
[test]
name=test
bs=4k
stonewall
* Re: [PATCH] btrfs: don't force DIO writes to be serialized
2026-04-23 10:04 ` Mark Harmstone
@ 2026-04-23 10:20 ` Qu Wenruo
2026-04-23 10:26 ` Mark Harmstone
2026-04-24 10:28 ` David Sterba
1 sibling, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2026-04-23 10:20 UTC (permalink / raw)
To: Mark Harmstone, dsterba; +Cc: linux-btrfs, josef, boris
On 2026/4/23 19:34, Mark Harmstone wrote:
> On 22/04/2026 9.57 pm, David Sterba wrote:
>> On Wed, Apr 22, 2026 at 03:03:35PM +0100, Mark Harmstone wrote:
>>> Before btrfs switched to the new mount API in 2023, we were setting
>>> SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
>>> filesystem may have files which don't have security xattrs, enabling it
>>> to do some optimizations.
>>>
>>> Unfortunately this was missed in the transition, meaning that IS_NOSEC
>>> will always return false for a btrfs inode. This means that
>>> btrfs_direct_write() calls will always get the inode lock exclusively,
>>> meaning that DIO writes to the same file will be serialized.
>>>
>>> On my machine, this one-line change results in a ~59% improvement in DIO
>>> throughput:
>>
>> That's quite an improvement. What's the actual fio script you've used?
>> Also the DIO depends on the block group profile wrt the buffered
>> fallback so that would be good to know too.
>
> It is. There's a big dropoff in DIO write performance in 6.8 that we
> never recovered from.
There is already the bounce-page solution from iomap, which no longer
falls back to buffered IO but instead uses an extra page copy to make
sure the final bio's contents won't change halfway through.
IIRC it's one extra flag plus removing btrfs' specific fallback checks,
but I haven't verified the behavior/code yet.
Thanks,
Qu
> I'm going to look into some sort of automated
> performance testing so this kind of thing can't happen again.
>
> This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
> through PCI passthrough. The figures for XFS and ext4 in comparison are
> both about ~3GB/s.
>
> # cat go
> #!/bin/bash
> mkfs.btrfs -f /dev/nvme0n1
> mount /dev/nvme0n1 /mnt/test
> mkdir /mnt/test/nocow
> chattr +C /mnt/test/nocow
> fio /root/test.fio
>
> # cat /root/test.fio
> [global]
> rw=randwrite
> ioengine=io_uring
> iodepth=64
> size=1g
> direct=1
> startdelay=20
> force_async=4
> ramp_time=5
> runtime=60
> group_reporting=1
> numjobs=32
> time_based
> disk_util=0
> clat_percentiles=0
> disable_lat=1
> disable_clat=1
> disable_slat=1
> filename=/mnt/test/nocow/fiofile
> [test]
> name=test
> bs=4k
> stonewall
>
>
* Re: [PATCH] btrfs: don't force DIO writes to be serialized
2026-04-23 10:20 ` Qu Wenruo
@ 2026-04-23 10:26 ` Mark Harmstone
0 siblings, 0 replies; 7+ messages in thread
From: Mark Harmstone @ 2026-04-23 10:26 UTC (permalink / raw)
To: Qu Wenruo, dsterba; +Cc: linux-btrfs, josef, boris
On 23/04/2026 11.20 am, Qu Wenruo wrote:
>
>
On 2026/4/23 19:34, Mark Harmstone wrote:
>> On 22/04/2026 9.57 pm, David Sterba wrote:
>>> On Wed, Apr 22, 2026 at 03:03:35PM +0100, Mark Harmstone wrote:
>>>> Before btrfs switched to the new mount API in 2023, we were setting
>>>> SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
>>>> filesystem may have files which don't have security xattrs, enabling it
>>>> to do some optimizations.
>>>>
>>>> Unfortunately this was missed in the transition, meaning that IS_NOSEC
>>>> will always return false for a btrfs inode. This means that
>>>> btrfs_direct_write() calls will always get the inode lock exclusively,
>>>> meaning that DIO writes to the same file will be serialized.
>>>>
>>>> On my machine, this one-line change results in a ~59% improvement in
>>>> DIO
>>>> throughput:
>>>
>>> That's quite an improvement. What's the actual fio script you've used?
>>> Also the DIO depends on the block group profile wrt the buffered
>>> fallback so that would be good to know too.
>>
>> It is. There's a big dropoff in DIO write performance in 6.8 that we
>> never recovered from.
>
> There is already the bounce-page solution from iomap, which no longer
> falls back to buffered IO but instead uses an extra page copy to make
> sure the final bio's contents won't change halfway through.
>
> IIRC it's one extra flag plus removing btrfs' specific fallback checks,
> but I haven't verified the behavior/code yet.
>
> Thanks,
> Qu
That sounds like a different thing - this is just making it so that
we're not forced to take the inode rwsem exclusively for each write.
>> I'm going to look into some sort of automated performance testing so
>> this kind of thing can't happen again.
>>
>> This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
>> through PCI passthrough. The figures for XFS and ext4 in comparison
>> are both about ~3GB/s.
>>
>> # cat go
>> #!/bin/bash
>> mkfs.btrfs -f /dev/nvme0n1
>> mount /dev/nvme0n1 /mnt/test
>> mkdir /mnt/test/nocow
>> chattr +C /mnt/test/nocow
>> fio /root/test.fio
>>
>> # cat /root/test.fio
>> [global]
>> rw=randwrite
>> ioengine=io_uring
>> iodepth=64
>> size=1g
>> direct=1
>> startdelay=20
>> force_async=4
>> ramp_time=5
>> runtime=60
>> group_reporting=1
>> numjobs=32
>> time_based
>> disk_util=0
>> clat_percentiles=0
>> disable_lat=1
>> disable_clat=1
>> disable_slat=1
>> filename=/mnt/test/nocow/fiofile
>> [test]
>> name=test
>> bs=4k
>> stonewall
>>
>>
>
* Re: [PATCH] btrfs: don't force DIO writes to be serialized
2026-04-23 10:04 ` Mark Harmstone
2026-04-23 10:20 ` Qu Wenruo
@ 2026-04-24 10:28 ` David Sterba
1 sibling, 0 replies; 7+ messages in thread
From: David Sterba @ 2026-04-24 10:28 UTC (permalink / raw)
To: Mark Harmstone; +Cc: dsterba, linux-btrfs, josef, boris
On Thu, Apr 23, 2026 at 11:04:37AM +0100, Mark Harmstone wrote:
> On 22/04/2026 9.57 pm, David Sterba wrote:
> > On Wed, Apr 22, 2026 at 03:03:35PM +0100, Mark Harmstone wrote:
> >> Before btrfs switched to the new mount API in 2023, we were setting
> >> SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
> >> filesystem may have files which don't have security xattrs, enabling it
> >> to do some optimizations.
> >>
> >> Unfortunately this was missed in the transition, meaning that IS_NOSEC
> >> will always return false for a btrfs inode. This means that
> >> btrfs_direct_write() calls will always get the inode lock exclusively,
> >> meaning that DIO writes to the same file will be serialized.
> >>
> >> On my machine, this one-line change results in a ~59% improvement in DIO
> >> throughput:
> >
> > That's quite an improvement. What's the actual fio script you've used?
> > Also the DIO depends on the block group profile wrt the buffered
> > fallback so that would be good to know too.
>
> It is. There's a big dropoff in DIO write performance in 6.8 that we
> never recovered from. I'm going to look into some sort of automated
> performance testing so this kind of thing can't happen again.
As you've used fio for that, please add the test case to
fstests/tests/perf. Though the 'perf' tests haven't caught up, I'd like
to at least collect them in our btrfs/fstests repository.
> This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
> through PCI passthrough. The figures for XFS and ext4 in comparison are
> both about ~3GB/s.
>
> # cat go
> #!/bin/bash
> mkfs.btrfs -f /dev/nvme0n1
> mount /dev/nvme0n1 /mnt/test
> mkdir /mnt/test/nocow
> chattr +C /mnt/test/nocow
> fio /root/test.fio
>
> # cat /root/test.fio
> [global]
> rw=randwrite
> ioengine=io_uring
> iodepth=64
> size=1g
> direct=1
> startdelay=20
> force_async=4
> ramp_time=5
> runtime=60
> group_reporting=1
> numjobs=32
> time_based
> disk_util=0
> clat_percentiles=0
> disable_lat=1
> disable_clat=1
> disable_slat=1
> filename=/mnt/test/nocow/fiofile
> [test]
> name=test
> bs=4k
> stonewall
Thanks.
* Re: [PATCH] btrfs: don't force DIO writes to be serialized
2026-04-22 14:03 [PATCH] btrfs: don't force DIO writes to be serialized Mark Harmstone
2026-04-22 20:57 ` David Sterba
@ 2026-04-28 15:13 ` David Sterba
1 sibling, 0 replies; 7+ messages in thread
From: David Sterba @ 2026-04-28 15:13 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs, josef, boris
On Wed, Apr 22, 2026 at 03:03:35PM +0100, Mark Harmstone wrote:
> Before btrfs switched to the new mount API in 2023, we were setting
> SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
> filesystem may have files which don't have security xattrs, enabling it
> to do some optimizations.
>
> Unfortunately this was missed in the transition, meaning that IS_NOSEC
> will always return false for a btrfs inode. This means that
> btrfs_direct_write() calls will always get the inode lock exclusively,
> meaning that DIO writes to the same file will be serialized.
>
> On my machine, this one-line change results in a ~59% improvement in DIO
> throughput:
>
> Before patch:
>
> test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> ...
> fio-3.39
> Starting 32 processes
> test: Laying out IO file (1 file / 1024MiB)
> Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
> write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
> bw ( KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
> iops : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
> cpu : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec
>
> After patch:
>
> test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> ...
> fio-3.39
> Starting 32 processes
> test: Laying out IO file (1 file / 1024MiB)
> Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
> write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
> bw ( MiB/s): min= 619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
> iops : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
> cpu : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec
>
> Fixes: ad21f15b0f79 ("btrfs: switch to the new mount API")
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
I've updated the changelog with the fio script and added the patch to
for-next. We want to get this backported to stable trees, ETA 2 weeks,
so we get some coverage.