* [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
@ 2024-11-20 21:35 Saeed Mirzamohammadi
2024-11-21 0:00 ` Chaitanya Kulkarni
0 siblings, 1 reply; 19+ messages in thread
From: Saeed Mirzamohammadi @ 2024-11-20 21:35 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org, Keith Busch, axboe@kernel.dk,
Christoph Hellwig, Sagi Grimberg, linux-nvme@lists.infradead.org
Cc: Ramanan Govindarajan, Paul Webb
Hi,
I’m reporting a performance regression of up to 9-10% with the FIO randomwrite benchmark on ext4 when comparing the 6.12.0-rc2 kernel with v5.15.161. Also, the standard deviation grows by up to 5-6% after this change.
Bisect root cause commit
===================
- commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
Test details
=========
- readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
- Test is on ext4 filesystem
- System has 4 NVMe disks
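
For reference, the parameters above correspond to a fio job file along these lines (a sketch only; the job name, numjobs=128, and the target directory are assumptions inferred from the fio output below, not part of the original report):

```ini
; hypothetical reconstruction of the reported fio job
[fio.test]
readwrite=randwrite
bs=4k
size=1G
ioengine=libaio
iodepth=16
direct=1
time_based=1
ramp_time=180
runtime=1800
randrepeat=1
gtod_reduce=1
numjobs=128          ; "Starting 128 processes" in the logs below
directory=/mnt/ext4  ; assumed mount point of the ext4 filesystem under test
```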
5.15.y base
========
fio.test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.35
Starting 128 processes
fio.test: Laying out IO file (1 file / 1024MiB)
…[cut here]
fio.test: Laying out IO file (1 file / 1024MiB)
fio.test: (groupid=0, jobs=128): err= 0: pid=4226: Fri Sep 13 00:34:07 2024
write: IOPS=2550k, BW=9962MiB/s (10.4GB/s)(17.1TiB/1800006msec); 0 zone resets
bw ( MiB/s): min= 5326, max=15283, per=100.00%, avg=9972.35, stdev=12.37, samples=460672
iops : min=1363492, max=3912552, avg=2552897.74, stdev=3166.48, samples=460672
cpu : usr=4.00%, sys=13.81%, ctx=4730536027, majf=0, minf=71229
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4590594600,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=9962MiB/s (10.4GB/s), 9962MiB/s-9962MiB/s (10.4GB/s-10.4GB/s), io=17.1TiB (18.8TB), run=1800006-1800006msec
Disk stats (read/write):
dm-0: ios=0/5006840732, merge=0/0, ticks=0/3893644388, in_queue=3893644388, util=100.00%, aggrios=0/1251134829, aggrmerge=0/737033, aggrticks=0/973009387, aggrin_queue=973009387, aggrutil=100.00%
nvme3n1: ios=0/1251035509, merge=0/829083, ticks=0/1443792479, in_queue=1443792479, util=100.00%
nvme0n1: ios=0/1251231344, merge=0/638993, ticks=0/1011756001, in_queue=1011756002, util=100.00%
nvme1n1: ios=0/1251224162, merge=0/639192, ticks=0/672688952, in_queue=672688953, util=100.00%
nvme2n1: ios=0/1251048302, merge=0/840864, ticks=0/763800117, in_queue=763800117, util=100.00%
Throughput Results:
WRITE:10649.6:2550000:0
6.12.y test
========
fio.test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.35
Starting 128 processes
fio.test: Laying out IO file (1 file / 1024MiB)
…[cut here]
fio.test: Laying out IO file (1 file / 1024MiB)
fio.test: (groupid=0, jobs=128): err= 0: pid=4308: Fri Sep 13 08:03:37 2024
write: IOPS=2270k, BW=8868MiB/s (9299MB/s)(15.2TiB/1800006msec); 0 zone resets
bw ( MiB/s): min= 6, max=13343, per=100.00%, avg=9066.78, stdev=14.33, samples=451008
iops : min= 1743, max=3415839, avg=2321069.38, stdev=3669.20, samples=451008
cpu : usr=3.65%, sys=11.28%, ctx=3420577602, majf=0, minf=12682
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4086517562,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=8868MiB/s (9299MB/s), 8868MiB/s-8868MiB/s (9299MB/s-9299MB/s), io=15.2TiB (16.7TB), run=1800006-1800006msec
Disk stats (read/write):
dm-0: ios=0/4404382767, merge=0/0, ticks=0/3614104412, in_queue=3614104412, util=99.91%, aggrios=0/1099847046, aggrmerge=0/1309757, aggrticks=0/908073333, aggrin_queue=908073333, aggrutil=82.28%
nvme3n1: ios=0/1099844453, merge=0/1309770, ticks=0/606919121, in_queue=606919121, util=82.28%
nvme0n1: ios=0/1099847110, merge=0/1310120, ticks=0/1007261464, in_queue=1007261464, util=80.97%
nvme1n1: ios=0/1099847041, merge=0/1309430, ticks=0/1327975386, in_queue=1327975386, util=81.71%
nvme2n1: ios=0/1099849583, merge=0/1309709, ticks=0/690137362, in_queue=690137362, util=80.29%
Throughput Results:
WRITE:9299:2270000:0
Thanks,
Saeed
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-20 21:35 [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel Saeed Mirzamohammadi
@ 2024-11-21 0:00 ` Chaitanya Kulkarni
2024-11-21 1:20 ` Jens Axboe
0 siblings, 1 reply; 19+ messages in thread
From: Chaitanya Kulkarni @ 2024-11-21 0:00 UTC (permalink / raw)
To: Saeed Mirzamohammadi
Cc: linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Paul Webb, Christoph Hellwig,
Keith Busch, axboe@kernel.dk
On 11/20/24 13:35, Saeed Mirzamohammadi wrote:
> Hi,
>
> I’m reporting a performance regression of up to 9-10% with FIO randomwrite benchmark on ext4 comparing 6.12.0-rc2 kernel and v5.15.161. Also, standard deviation after this change grows up to 5-6%.
>
> Bisect root cause commit
> ===================
> - commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
>
>
> Test details
> =========
> - readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
> - Test is on ext4 filesystem
> - System has 4 NVMe disks
>
Thanks a lot for the report. To narrow down this problem, can you
please:
1. Run the same test on the raw nvme device /dev/nvme0n1 that you
have used for this benchmark?
2. Run the same test on an XFS-formatted nvme device instead of ext4?
This way we will know whether the issue is limited to ext4, whether
other file systems suffer from it too, or whether it lies below the
file system layer, e.g. in the block layer or the nvme pci driver.
It would also help if you could repeat these numbers with the io_uring
fio ioengine, to determine whether the issue is ioengine specific.
Looking at the commit [1], it only sets the max write-zeroes sectors
value to UINT_MAX if NVME_QUIRK_DEALLOCATE_ZEROES is set, and otherwise
uses the controller's max write-zeroes value.
So I am not sure how this commit can slow things down, unless there is a
change in the behavior of write-zeroes: instead of offloading
(REQ_OP_WRITE_ZEROES), it now falls back to REQ_OP_WRITE with the zero
page when called from ext4's sb_issue_zeroout() :-
fs/ext4/ialloc.c ext4_init_inode_table sb_issue_zeroout()
fs/ext4/inode.c ext4_issue_zeroout sb_issue_zeroout()
fs/ext4/resize.c setup_new_flex_group_blocks sb_issue_zeroout()
fs/ext4/resize.c setup_new_flex_group_blocks sb_issue_zeroout()
-ck
From 63dfa1004322d596417f23da43cdc43cf6298c71 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Mon, 4 Mar 2024 07:04:46 -0700
Subject: [PATCH] nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of
nvme_config_discard
Move the handling of the NVME_QUIRK_DEALLOCATE_ZEROES quirk out of
nvme_config_discard so that it is combined with the normal write_zeroes
limit handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/core.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6ae9aedf7bc2..a6c0b2f4cf79 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1816,9 +1816,6 @@ static void nvme_config_discard(struct nvme_ctrl *ctrl, struct gendisk *disk,
 	else
 		blk_queue_max_discard_segments(queue, NVME_DSM_MAX_RANGES);
 	queue->limits.discard_granularity = queue_logical_block_size(queue);
-
-	if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
-		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }

 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
@@ -2029,8 +2026,12 @@ static void nvme_update_disk_info(struct nvme_ctrl *ctrl, struct gendisk *disk,
 	set_capacity_and_notify(disk, capacity);

 	nvme_config_discard(ctrl, disk, head);
-	blk_queue_max_write_zeroes_sectors(disk->queue,
-			ctrl->max_zeroes_sectors);
+
+	if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
+		blk_queue_max_write_zeroes_sectors(disk->queue, UINT_MAX);
+	else
+		blk_queue_max_write_zeroes_sectors(disk->queue,
+				ctrl->max_zeroes_sectors);
 }

 static bool nvme_ns_is_readonly(struct nvme_ns *ns, struct nvme_ns_info *info)
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-21 0:00 ` Chaitanya Kulkarni
@ 2024-11-21 1:20 ` Jens Axboe
2024-11-21 4:57 ` Christoph Hellwig
2024-11-21 11:30 ` Phil Auld
0 siblings, 2 replies; 19+ messages in thread
From: Jens Axboe @ 2024-11-21 1:20 UTC (permalink / raw)
To: Chaitanya Kulkarni, Saeed Mirzamohammadi
Cc: linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Paul Webb, Christoph Hellwig,
Keith Busch
On 11/20/24 5:00 PM, Chaitanya Kulkarni wrote:
> On 11/20/24 13:35, Saeed Mirzamohammadi wrote:
>> Hi,
>>
>> I'm reporting a performance regression of up to 9-10% with FIO randomwrite benchmark on ext4 comparing 6.12.0-rc2 kernel and v5.15.161. Also, standard deviation after this change grows up to 5-6%.
>>
>> Bisect root cause commit
>> ===================
>> - commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
>>
>>
>> Test details
>> =========
>> - readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
>> - Test is on ext4 filesystem
>> - System has 4 NVMe disks
>>
>
> Thanks a lot for the report, to narrow down this problem can you
> please :-
>
> 1. Run the same test on the raw nvme device /dev/nvme0n1 that you
> have used for this benchmark ?
> 2. Run the same test on the XFS formatted nvme device instead of ext4 ?
>
> This way we will know if there is an issue only with the ext4 or
> with other file systems are suffering from this problem too or
> it is below the file system layer such as block layer and nvme pci driver ?
>
> It will also help if you can repeat these numbers for io_uring fio io_engine
> to narrow down this problem to know if the issue is ioengine specific.
>
> Looking at the commit [1], it only sets the max value to write zeroes
> sectors
> if NVME_QUIRK_DEALLOCATE_ZEROES is set, else uses the controller max
> write zeroes value.
There's no way that commit is involved, the test as quoted doesn't even
touch write zeroes. Hence if there really is a regression here, then
it's either not easily bisectable, some error was injected while
bisecting, or the test itself is bimodal.
--
Jens Axboe
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-21 1:20 ` Jens Axboe
@ 2024-11-21 4:57 ` Christoph Hellwig
2024-11-21 14:48 ` Jens Axboe
2024-11-21 11:30 ` Phil Auld
1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-11-21 4:57 UTC (permalink / raw)
To: Jens Axboe
Cc: Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Paul Webb, Christoph Hellwig,
Keith Busch
On Wed, Nov 20, 2024 at 06:20:12PM -0700, Jens Axboe wrote:
> There's no way that commit is involved, the test as quoted doesn't even
> touch write zeroes. Hence if there really is a regression here, then
> it's either not easily bisectable, some error was injected while
> bisecting, or the test itself is bimodal.
ext4 actually has some weird lazy-init code that uses write zeroes. So
if the test wasn't a steady-state one but was only run for a short time
after init, and the mentioned commit dropped the intel hack for
deallocate as write zeroes, it might actually make a difference.
To check for that, compare:
cat /sys/block/nvmeXn1/queue/write_zeroes_max_bytes
with and without that commit.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-21 1:20 ` Jens Axboe
2024-11-21 4:57 ` Christoph Hellwig
@ 2024-11-21 11:30 ` Phil Auld
2024-11-21 14:49 ` Jens Axboe
1 sibling, 1 reply; 19+ messages in thread
From: Phil Auld @ 2024-11-21 11:30 UTC (permalink / raw)
To: Jens Axboe
Cc: Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Paul Webb, Christoph Hellwig,
Keith Busch
Hi,
On Wed, Nov 20, 2024 at 06:20:12PM -0700 Jens Axboe wrote:
> On 11/20/24 5:00 PM, Chaitanya Kulkarni wrote:
> > On 11/20/24 13:35, Saeed Mirzamohammadi wrote:
> >> Hi,
> >>
> >> I'm reporting a performance regression of up to 9-10% with FIO randomwrite benchmark on ext4 comparing 6.12.0-rc2 kernel and v5.15.161. Also, standard deviation after this change grows up to 5-6%.
> >>
> >> Bisect root cause commit
> >> ===================
> >> - commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
> >>
> >>
> >> Test details
> >> =========
> >> - readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
> >> - Test is on ext4 filesystem
> >> - System has 4 NVMe disks
> >>
> >
> > Thanks a lot for the report, to narrow down this problem can you
> > please :-
> >
> > 1. Run the same test on the raw nvme device /dev/nvme0n1 that you
> > have used for this benchmark ?
> > 2. Run the same test on the XFS formatted nvme device instead of ext4 ?
> >
> > This way we will know if there is an issue only with the ext4 or
> > with other file systems are suffering from this problem too or
> > it is below the file system layer such as block layer and nvme pci driver ?
> >
> > It will also help if you can repeat these numbers for io_uring fio io_engine
> > to narrow down this problem to know if the issue is ioengine specific.
> >
> > Looking at the commit [1], it only sets the max value to write zeroes
> > sectors
> > if NVME_QUIRK_DEALLOCATE_ZEROES is set, else uses the controller max
> > write zeroes value.
>
> There's no way that commit is involved, the test as quoted doesn't even
> touch write zeroes. Hence if there really is a regression here, then
> it's either not easily bisectable, some error was injected while
> bisecting, or the test itself is bimodal.
I was just going to ask how confident we are in that bisect result.
I suspect this is the same issue I've been fighting here:
https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
Saeed, can you try your randwrite test after
"echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
please?
We don't as yet have a general fix for it as it seems to be a bit of
a trade off.
Cheers,
Phil
>
> --
> Jens Axboe
>
--
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-21 4:57 ` Christoph Hellwig
@ 2024-11-21 14:48 ` Jens Axboe
0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2024-11-21 14:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Paul Webb, Keith Busch
On 11/20/24 9:57 PM, Christoph Hellwig wrote:
> On Wed, Nov 20, 2024 at 06:20:12PM -0700, Jens Axboe wrote:
>> There's no way that commit is involved, the test as quoted doesn't even
>> touch write zeroes. Hence if there really is a regression here, then
>> it's either not easily bisectable, some error was injected while
>> bisecting, or the test itself is bimodal.
>
> ext4 actually has some weird lazy init code using write zeroes. So
> if the test actually wasn't a steady state one but only run for a short
> time after init, and the mentioned commit dropped the intel hack for
> deallocate as write zeroes it might actually make a difference.
Ah, good point, I forgot about the ext4 lazy init. But any test should
surely quiesce that first; it's not great to have background activity
like that running.
--
Jens Axboe
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-21 11:30 ` Phil Auld
@ 2024-11-21 14:49 ` Jens Axboe
[not found] ` <181bcb70-e0bf-4024-80b7-e79276d6eaf7@oracle.com>
2024-11-22 17:13 ` Paul Webb
0 siblings, 2 replies; 19+ messages in thread
From: Jens Axboe @ 2024-11-21 14:49 UTC (permalink / raw)
To: Phil Auld
Cc: Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Paul Webb, Christoph Hellwig,
Keith Busch
On 11/21/24 4:30 AM, Phil Auld wrote:
>
> Hi,
>
> On Wed, Nov 20, 2024 at 06:20:12PM -0700 Jens Axboe wrote:
>> On 11/20/24 5:00 PM, Chaitanya Kulkarni wrote:
>>> On 11/20/24 13:35, Saeed Mirzamohammadi wrote:
>>>> Hi,
>>>>
>>>> I'm reporting a performance regression of up to 9-10% with FIO randomwrite benchmark on ext4 comparing 6.12.0-rc2 kernel and v5.15.161. Also, standard deviation after this change grows up to 5-6%.
>>>>
>>>> Bisect root cause commit
>>>> ===================
>>>> - commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
>>>>
>>>>
>>>> Test details
>>>> =========
>>>> - readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
>>>> - Test is on ext4 filesystem
>>>> - System has 4 NVMe disks
>>>>
>>>
>>> Thanks a lot for the report, to narrow down this problem can you
>>> please :-
>>>
>>> 1. Run the same test on the raw nvme device /dev/nvme0n1 that you
>>> have used for this benchmark ?
>>> 2. Run the same test on the XFS formatted nvme device instead of ext4 ?
>>>
>>> This way we will know if there is an issue only with the ext4 or
>>> with other file systems are suffering from this problem too or
>>> it is below the file system layer such as block layer and nvme pci driver ?
>>>
>>> It will also help if you can repeat these numbers for io_uring fio io_engine
>>> to narrow down this problem to know if the issue is ioengine specific.
>>>
>>> Looking at the commit [1], it only sets the max value to write zeroes
>>> sectors
>>> if NVME_QUIRK_DEALLOCATE_ZEROES is set, else uses the controller max
>>> write zeroes value.
>>
>> There's no way that commit is involved, the test as quoted doesn't even
>> touch write zeroes. Hence if there really is a regression here, then
>> it's either not easily bisectable, some error was injected while
>> bisecting, or the test itself is bimodal.
>
> I was just going to ask how confident we are in that bisect result.
>
> I suspect this is the same issue I've been fighting here:
>
> https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
>
> Saeed, can you try your randwrite test after
>
> "echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
>
> please?
>
> We don't as yet have a general fix for it as it seems to be a bit of
> a trade off.
Interesting. Might explain some regressions I've seen too related to
performance.
--
Jens Axboe
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
[not found] ` <181bcb70-e0bf-4024-80b7-e79276d6eaf7@oracle.com>
@ 2024-11-21 21:19 ` Phil Auld
2024-11-22 12:13 ` Christoph Hellwig
1 sibling, 0 replies; 19+ messages in thread
From: Phil Auld @ 2024-11-21 21:19 UTC (permalink / raw)
To: Paul Webb
Cc: Jens Axboe, Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Christoph Hellwig,
Keith Busch, Nicky Veitch
On Thu, Nov 21, 2024 at 09:07:32PM +0000 Paul Webb wrote:
> Hi,
>
> To answer the various questions/suggestions, I'll just group them here:
>
> Phil:
> can you try your randwrite test after
> "echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
>
> Performance regression still persists with this setting being used.
>
Okay, thanks. Different FIO randwrite issue I guess. Nevermind, I'll
go back over to scheduler land...
Cheers,
Phil
>
> Christoph:
> To check for weird lazy init code using write zeroes
>
> Values in the 5.15 kernel baseline prior to the commit:
> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
> 0
> 0
> 0
> 0
>
> Values in the 6.11 kernel that contains the commit:
> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
> 2199023255040
> 2199023255040
> 2199023255040
> 2199023255040
>
>
>
> Chaitanya:
>
> Run the same test on the XFS formatted nvme device instead of ext4 ?
> - XFS runs did not show the performance regression.
>
> Run the same test on the raw nvme device /dev/nvme0n1 that you have used for
> this benchmark
> - Will have to check if this was done, and if not, get that test run
>
> repeat these numbers for io_uring fio io_engine
> - Will look into getting those too
>
>
> Another interesting datapoint is that while performing some runs I am seeing
> the following output on the console in the 6.11/6.12 kernels that contain the
> commit:
>
> [ 473.398188] operation not supported error, dev nvme2n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
> [ 473.534550] nvme0n1: Dataset Management(0x9) @ LBA 14000, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
> [ 473.660502] operation not supported error, dev nvme0n1, sector 14000 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
> [ 473.796859] nvme3n1: Dataset Management(0x9) @ LBA 13952, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
> [ 473.922810] operation not supported error, dev nvme3n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
> [ 474.059169] nvme1n1: Dataset Management(0x9) @ LBA 13952, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
>
>
> Regards,
> Paul.
>
>
>
> On 21/11/2024 14:49, Jens Axboe wrote:
>
> On 11/21/24 4:30 AM, Phil Auld wrote:
>
> Hi,
>
> On Wed, Nov 20, 2024 at 06:20:12PM -0700 Jens Axboe wrote:
>
> On 11/20/24 5:00 PM, Chaitanya Kulkarni wrote:
>
> On 11/20/24 13:35, Saeed Mirzamohammadi wrote:
>
> Hi,
>
> I'm reporting a performance regression of up to 9-10% with FIO randomwrite benchmark on ext4 comparing 6.12.0-rc2 kernel and v5.15.161. Also, standard deviation after this change grows up to 5-6%.
>
> Bisect root cause commit
> ===================
> - commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
>
>
> Test details
> =========
> - readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
> - Test is on ext4 filesystem
> - System has 4 NVMe disks
>
>
> Thanks a lot for the report, to narrow down this problem can you
> please :-
>
> 1. Run the same test on the raw nvme device /dev/nvme0n1 that you
> have used for this benchmark ?
> 2. Run the same test on the XFS formatted nvme device instead of ext4 ?
>
> This way we will know if there is an issue only with the ext4 or
> with other file systems are suffering from this problem too or
> it is below the file system layer such as block layer and nvme pci driver ?
>
> It will also help if you can repeat these numbers for io_uring fio io_engine
> to narrow down this problem to know if the issue is ioengine specific.
>
> Looking at the commit [1], it only sets the max value to write zeroes
> sectors
> if NVME_QUIRK_DEALLOCATE_ZEROES is set, else uses the controller max
> write zeroes value.
>
> There's no way that commit is involved, the test as quoted doesn't even
> touch write zeroes. Hence if there really is a regression here, then
> it's either not easily bisectable, some error was injected while
> bisecting, or the test itself is bimodal.
>
> I was just going to ask how confident we are in that bisect result.
>
> I suspect this is the same issue I've been fighting here:
>
> [1] https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
>
> Saeed, can you try your randwrite test after
>
> "echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
>
> please?
>
> We don't as yet have a general fix for it as it seems to be a bit of
> a trade off.
>
> Interesting. Might explain some regressions I've seen too related to
> performance.
>
>
>
> References:
>
> [1] https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
--
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
[not found] ` <181bcb70-e0bf-4024-80b7-e79276d6eaf7@oracle.com>
2024-11-21 21:19 ` [External] : " Phil Auld
@ 2024-11-22 12:13 ` Christoph Hellwig
2024-11-22 17:18 ` Paul Webb
1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-11-22 12:13 UTC (permalink / raw)
To: Paul Webb
Cc: Jens Axboe, Phil Auld, Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Christoph Hellwig,
Keith Busch, Nicky Veitch
On Thu, Nov 21, 2024 at 09:07:32PM +0000, Paul Webb wrote:
> Christoph:
> To check for weird lazy init code using write zeroes
>
> Values in the 5.15 kernel baseline prior to the commit:
> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
> 0
> 0
> 0
> 0
>
> Values in the 6.11 kernel that contains the commit:
> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
> 2199023255040
> 2199023255040
> 2199023255040
> 2199023255040
Thanks! So 6.11 actually enables write zeroes for your controller.
> Another interesting datapoint is that while performing some runs I am
> seeing the following output on the console in the 6.11/6.12 kernels that
> contain the commit:
>
> [ 473.398188] operation not supported error, dev nvme2n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
.. which it doesn't handle well.
> [ 473.534550] nvme0n1: Dataset Management(0x9) @ LBA 14000, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
.. and interestingly, this one is for a Deallocate, which should only
happen with the quirk for certain Intel controllers from the very first
days of nvme.
What controller do you have? Can you post the output of lspci and
"nvme list"?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-21 14:49 ` Jens Axboe
[not found] ` <181bcb70-e0bf-4024-80b7-e79276d6eaf7@oracle.com>
@ 2024-11-22 17:13 ` Paul Webb
1 sibling, 0 replies; 19+ messages in thread
From: Paul Webb @ 2024-11-22 17:13 UTC (permalink / raw)
To: Jens Axboe, Phil Auld
Cc: Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Christoph Hellwig,
Keith Busch
On 21/11/2024 14:49, Jens Axboe wrote:
> On 11/21/24 4:30 AM, Phil Auld wrote:
>> Hi,
>>
>> On Wed, Nov 20, 2024 at 06:20:12PM -0700 Jens Axboe wrote:
>>> On 11/20/24 5:00 PM, Chaitanya Kulkarni wrote:
>>>> On 11/20/24 13:35, Saeed Mirzamohammadi wrote:
>>>>> Hi,
>>>>>
>>>>> I'm reporting a performance regression of up to 9-10% with FIO randomwrite benchmark on ext4 comparing 6.12.0-rc2 kernel and v5.15.161. Also, standard deviation after this change grows up to 5-6%.
>>>>>
>>>>> Bisect root cause commit
>>>>> ===================
>>>>> - commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
>>>>>
>>>>>
>>>>> Test details
>>>>> =========
>>>>> - readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
>>>>> - Test is on ext4 filesystem
>>>>> - System has 4 NVMe disks
>>>>>
>>>> Thanks a lot for the report, to narrow down this problem can you
>>>> please :-
>>>>
>>>> 1. Run the same test on the raw nvme device /dev/nvme0n1 that you
>>>> have used for this benchmark ?
>>>> 2. Run the same test on the XFS formatted nvme device instead of ext4 ?
>>>>
>>>> This way we will know if there is an issue only with the ext4 or
>>>> with other file systems are suffering from this problem too or
>>>> it is below the file system layer such as block layer and nvme pci driver ?
>>>>
>>>> It will also help if you can repeat these numbers for io_uring fio io_engine
>>>> to narrow down this problem to know if the issue is ioengine specific.
>>>>
>>>> Looking at the commit [1], it only sets the max value to write zeroes
>>>> sectors
>>>> if NVME_QUIRK_DEALLOCATE_ZEROES is set, else uses the controller max
>>>> write zeroes value.
>>> There's no way that commit is involved, the test as quoted doesn't even
>>> touch write zeroes. Hence if there really is a regression here, then
>>> it's either not easily bisectable, some error was injected while
>>> bisecting, or the test itself is bimodal.
>> I was just going to ask how confident we are in that bisect result.
>>
>> I suspect this is the same issue I've been fighting here:
>>
>> https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
>>
>> Saeed, can you try your randwrite test after
>>
>> "echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
>>
>> please?
>>
>> We don't as yet have a general fix for it as it seems to be a bit of
>> a trade off.
> Interesting. Might explain some regressions I've seen too related to
> performance.
Apologies to those receiving this twice; I'm resending because my mail
client did not send the message as text content, which caused it to be
rejected by the lists.
Also, a little more info to update.
Hi,
To answer the various questions/suggestions, I'll just group them here:
Phil:
can you try your randwrite test after
"echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
Performance regression still persists with this setting being used.
Christoph:
To check for weird lazy init code using write zeroes
Values in the 5.15 kernel baseline prior to the commit:
$ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
0
0
0
0
Values in the 6.11 kernel that contains the commit:
$ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
2199023255040
2199023255040
2199023255040
2199023255040
Chaitanya:
Run the same test on the XFS formatted nvme device instead of ext4 ?
- XFS runs did not show the performance regression.
Run the same test on the raw nvme device /dev/nvme0n1 that you have used
for this benchmark
- Will have to check if this was done, and if not, get that test run
repeat these numbers for io_uring fio io_engine
- Will look into getting those too
Another interesting datapoint is that while performing some runs I am
seeing the following output on the console in the 6.11/6.12 kernels that
contain the commit:
[ 473.398188] operation not supported error, dev nvme2n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
[ 473.534550] nvme0n1: Dataset Management(0x9) @ LBA 14000, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
[ 473.660502] operation not supported error, dev nvme0n1, sector 14000 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
[ 473.796859] nvme3n1: Dataset Management(0x9) @ LBA 13952, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
[ 473.922810] operation not supported error, dev nvme3n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
[ 474.059169] nvme1n1: Dataset Management(0x9) @ LBA 13952, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
The errors start as soon as the mkfs command is initiated for ext4 and
continue throughout the fio test run.
They are also seen when using mkfs to create an XFS filesystem, but in
that case the error appears only once, at filesystem creation time.
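For anyone correlating these messages with the fio runs, here is a throwaway parser for the block-layer error lines (the regex is fitted to the dmesg lines quoted above, not taken from any tool):

```python
import re

# Matches the block-layer "operation not supported" lines quoted above.
ERR_RE = re.compile(
    r"dev (?P<dev>\S+), sector (?P<sector>\d+) "
    r"op (?P<op>0x[0-9a-f]+):\((?P<opname>\w+)\)"
)

line = ("[  473.398188] operation not supported error, dev nvme2n1, "
        "sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0")
m = ERR_RE.search(line)
print(m.group("dev"), m.group("sector"), m.group("opname"))  # nvme2n1 13952 WRITE_ZEROES
```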
Regards,
Paul.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-22 12:13 ` Christoph Hellwig
@ 2024-11-22 17:18 ` Paul Webb
2024-11-22 18:26 ` Saeed Mirzamohammadi
0 siblings, 1 reply; 19+ messages in thread
From: Paul Webb @ 2024-11-22 17:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Phil Auld, Chaitanya Kulkarni, Saeed Mirzamohammadi,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Keith Busch, Nicky Veitch
On 22/11/2024 12:13, Christoph Hellwig wrote:
> On Thu, Nov 21, 2024 at 09:07:32PM +0000, Paul Webb wrote:
>> Christoph:
>> To check for weird lazy init code using write zeroes
>>
>> Values in the 5.15 kernel baseline prior to the commit:
>> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
>> 0
>> 0
>> 0
>> 0
>>
>> Values in the 6.11 kernel that contains the commit:
>> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
>> 2199023255040
>> 2199023255040
>> 2199023255040
>> 2199023255040
> Thanks! So 6.11 actually enables write zeroes for your controller.
>
>> Another interesting datapoint is that while performing some runs I am
>> seeing the following output on the console in the 6.11/6.12 kernels that
>> contain the commit:
>>
>> [ 473.398188] operation not supported error, dev nvme2n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
> .. which it doesn't handle well.
>
>> [ 473.534550] nvme0n1: Dataset Management(0x9) @ LBA 14000, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
> .. and interesting this is for a Deallocate, which should only happen
> with the quirk for certain Intel controllers from the very first days of
> nvme.
>
> What controller do you have? Can you post the output of lspci and
> "nvme list"?
Hi Christoph,
The nvme related output from lspci is as follows:
$ lspci | grep -i nvme
19:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
20:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
94:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
9b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
$ sudo nvme list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            PHLN942100EQ6P4CGN   7361456_ICRPC2DD2ORA6.4T                 0x1        6.40 TB / 6.40 TB          512 B + 0 B      VDV1RL06
/dev/nvme1n1          /dev/ng1n1            PHLN942100PE6P4CGN   7361456_ICRPC2DD2ORA6.4T                 0x1        6.40 TB / 6.40 TB          512 B + 0 B      VDV1RL06
/dev/nvme2n1          /dev/ng2n1            PHLN9415002B6P4CGN   7361456_ICRPC2DD2ORA6.4T                 0x1        6.40 TB / 6.40 TB          512 B + 0 B      VDV1RL06
/dev/nvme3n1          /dev/ng3n1            PHLN942100DQ6P4CGN   7361456_ICRPC2DD2ORA6.4T                 0x1        6.40 TB / 6.40 TB          512 B + 0 B      VDV1RL06
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-22 17:18 ` Paul Webb
@ 2024-11-22 18:26 ` Saeed Mirzamohammadi
2024-11-22 21:09 ` Keith Busch
0 siblings, 1 reply; 19+ messages in thread
From: Saeed Mirzamohammadi @ 2024-11-22 18:26 UTC (permalink / raw)
To: Paul Webb
Cc: Christoph Hellwig, Jens Axboe, Phil Auld, Chaitanya Kulkarni,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Keith Busch, Nicky Veitch
FYI, I tried disabling Write Zeroes (patch below), but am still getting the same errors:
[ 326.097275] operation not supported error, dev nvme2n1, sector 10624 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
[ 338.496217] nvme0n1: Dataset Management(0x9) @ LBA 10928, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
…
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d3bde17c818d5..ad2ce6008062e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3425,7 +3425,8 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_STRIPE_SIZE |
NVME_QUIRK_DEALLOCATE_ZEROES |
NVME_QUIRK_IGNORE_DEV_SUBNQN |
- NVME_QUIRK_BOGUS_NID, },
+ NVME_QUIRK_BOGUS_NID |
+ NVME_QUIRK_DISABLE_WRITE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a55), /* Dell Express Flash P4600 */
.driver_data = NVME_QUIRK_STRIPE_SIZE |
NVME_QUIRK_DEALLOCATE_ZEROES, },
This change is to the entry for:
{ PCI_VDEVICE(INTEL, 0x0a54), /* Intel P4500/P4600 */
which, as we can see from lspci, matches our devices:
$ lspci -nn | grep -i nvme
19:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]
20:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]
94:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]
9b:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]
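In other words, the [8086:0a54] ID from lspci -nn selects the P4500/P4600 entry in nvme_id_table. A toy lookup modelling that match (the quirk names come from the driver source quoted above; the table here is an illustrative subset, not the real structure):

```python
PCI_VENDOR_ID_INTEL = 0x8086

# Illustrative subset of drivers/nvme/host/pci.c:nvme_id_table
nvme_id_table = {
    (PCI_VENDOR_ID_INTEL, 0x0a54):  # Intel P4500/P4600
        {"STRIPE_SIZE", "DEALLOCATE_ZEROES", "IGNORE_DEV_SUBNQN", "BOGUS_NID"},
    (PCI_VENDOR_ID_INTEL, 0x0a55):  # Dell Express Flash P4600
        {"STRIPE_SIZE", "DEALLOCATE_ZEROES"},
}

quirks = nvme_id_table[(0x8086, 0x0a54)]
print("DEALLOCATE_ZEROES" in quirks)  # True: these disks get the Deallocate quirk
```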
> On Nov 22, 2024, at 9:18 AM, Paul Webb <paul.x.webb@oracle.com> wrote:
>
>
> On 22/11/2024 12:13, Christoph Hellwig wrote:
>> On Thu, Nov 21, 2024 at 09:07:32PM +0000, Paul Webb wrote:
>>> Christoph:
>>> To check for weird lazy init code using write zeroes
>>>
>>> Values in the 5.15 kernel baseline prior to the commit:
>>> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
>>> 0
>>> 0
>>> 0
>>> 0
>>>
>>> Values in the 6.11 kernel that contains the commit:
>>> $ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
>>> 2199023255040
>>> 2199023255040
>>> 2199023255040
>>> 2199023255040
>> Thanks! So 6.11 actually enables write zeroes for your controller.
>>
>>> Another interesting datapoint is that while performing some runs I am
>>> seeing the following output on the console in the 6.11/6.12 kernels that
>>> contain the commit:
>>>
>>> [ 473.398188] operation not supported error, dev nvme2n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
>> .. which it doesn't handle well.
>>
>>> [ 473.534550] nvme0n1: Dataset Management(0x9) @ LBA 14000, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
>> .. and interesting this is for a Deallocate, which should only happen
>> with the quirk for certain Intel controllers from the very first days of
>> nvme.
>>
>> What controller do you have? Can you post the output of lspci and
>> "nvme list"?
>
> Hi Christoph,
>
> The nvme related output from lspci is as follows:
> $ lspci | grep -i nvme
> 19:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
> 20:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
> 94:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
> 9b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
>
>
> $ sudo nvme list
> Node Generic SN Model Namespace Usage Format FW Rev
> --------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
> /dev/nvme0n1 /dev/ng0n1 PHLN942100EQ6P4CGN 7361456_ICRPC2DD2ORA6.4T 0x1 6.40 TB / 6.40 TB 512 B + 0 B VDV1RL06
> /dev/nvme1n1 /dev/ng1n1 PHLN942100PE6P4CGN 7361456_ICRPC2DD2ORA6.4T 0x1 6.40 TB / 6.40 TB 512 B + 0 B VDV1RL06
> /dev/nvme2n1 /dev/ng2n1 PHLN9415002B6P4CGN 7361456_ICRPC2DD2ORA6.4T 0x1 6.40 TB / 6.40 TB 512 B + 0 B VDV1RL06
> /dev/nvme3n1 /dev/ng3n1 PHLN942100DQ6P4CGN 7361456_ICRPC2DD2ORA6.4T 0x1 6.40 TB / 6.40 TB 512 B + 0 B VDV1RL06
>
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-22 18:26 ` Saeed Mirzamohammadi
@ 2024-11-22 21:09 ` Keith Busch
2024-11-25 6:46 ` Christoph Hellwig
2024-11-25 18:28 ` Saeed Mirzamohammadi
0 siblings, 2 replies; 19+ messages in thread
From: Keith Busch @ 2024-11-22 21:09 UTC (permalink / raw)
To: Saeed Mirzamohammadi
Cc: Paul Webb, Christoph Hellwig, Jens Axboe, Phil Auld,
Chaitanya Kulkarni, linux-kernel@vger.kernel.org,
linux-nvme@lists.infradead.org, Ramanan Govindarajan,
Sagi Grimberg, Nicky Veitch
On Fri, Nov 22, 2024 at 06:26:46PM +0000, Saeed Mirzamohammadi wrote:
> FYI, I tried disabling Write Zeroes (patch below), but am still getting the same errors:
> [ 326.097275] operation not supported error, dev nvme2n1, sector 10624 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
> [ 338.496217] nvme0n1: Dataset Management(0x9) @ LBA 10928, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
> ...
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index d3bde17c818d5..ad2ce6008062e 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3425,7 +3425,8 @@ static const struct pci_device_id nvme_id_table[] = {
> .driver_data = NVME_QUIRK_STRIPE_SIZE |
> NVME_QUIRK_DEALLOCATE_ZEROES |
> NVME_QUIRK_IGNORE_DEV_SUBNQN |
> - NVME_QUIRK_BOGUS_NID, },
> + NVME_QUIRK_BOGUS_NID |
> + NVME_QUIRK_DISABLE_WRITE_ZEROES, },
> { PCI_VDEVICE(INTEL, 0x0a55), /* Dell Express Flash P4600 */
> .driver_data = NVME_QUIRK_STRIPE_SIZE |
> NVME_QUIRK_DEALLOCATE_ZEROES, },
Could you instead try deleting the NVME_QUIRK_DEALLOCATE_ZEROES quirk
for this device? The driver apparently uses this to assume you meant to
do a Discard, but it sounds like the device wants an actual Write Zeroes
command here.
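Keith's point can be sketched as a small model (illustrative only, not driver code): with the quirk set, the host turns a block-layer zero-out into a Deallocate (DSM), and on firmware without DSM support that fails with Invalid Command Opcode even though a plain Write Zeroes would have worked:

```python
def issue_zeroout(quirk_deallocate_zeroes, dev_supports_dsm, dev_supports_write_zeroes):
    """Model which NVMe command a block-layer zero-out becomes, and its fate."""
    if quirk_deallocate_zeroes:
        # Quirked path: assume deallocated blocks read back as zeroes.
        cmd, ok = "Dataset Management (Deallocate)", dev_supports_dsm
    else:
        cmd, ok = "Write Zeroes", dev_supports_write_zeroes
    return cmd, "completed" if ok else "Invalid Command Opcode"

# This firmware: no DSM support, but Write Zeroes works.
print(issue_zeroout(True, False, True))   # ('Dataset Management (Deallocate)', 'Invalid Command Opcode')
print(issue_zeroout(False, False, True))  # ('Write Zeroes', 'completed')
```

This also explains why adding NVME_QUIRK_DISABLE_WRITE_ZEROES did not help: the quirked path never reaches the Write Zeroes branch.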
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-22 21:09 ` Keith Busch
@ 2024-11-25 6:46 ` Christoph Hellwig
2024-11-25 18:28 ` Saeed Mirzamohammadi
1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2024-11-25 6:46 UTC (permalink / raw)
To: Keith Busch
Cc: Saeed Mirzamohammadi, Paul Webb, Christoph Hellwig, Jens Axboe,
Phil Auld, Chaitanya Kulkarni, linux-kernel@vger.kernel.org,
linux-nvme@lists.infradead.org, Ramanan Govindarajan,
Sagi Grimberg, Nicky Veitch
On Fri, Nov 22, 2024 at 02:09:04PM -0700, Keith Busch wrote:
> Could you instead try deleting the NVME_QUIRK_DEALLOCATE_ZEROES quirk
> for this device? The driver apparently uses this to assume you meant to
> do a Discard, but it sounds like the device wants an actual Write Zeroes
> command here.
From the logs it sounds like the device does not support the DSM command
at all. Which is a bit weird, but might be an odd OEM firmware of some
kind. If that's the case, the patch below should fix it:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a8d32a4a5c3..ca57086ba038 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2068,7 +2068,8 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
lim->physical_block_size = min(phys_bs, atomic_bs);
lim->io_min = phys_bs;
lim->io_opt = io_opt;
- if (ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
+ if ((ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES) &&
+ (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM))
lim->max_write_zeroes_sectors = UINT_MAX;
else
lim->max_write_zeroes_sectors = ns->ctrl->max_zeroes_sectors;
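The patched condition can be modelled directly (NVME_CTRL_ONCS_DSM is ONCS bit 2 in the NVMe spec; the quirk bit value below is just a placeholder for illustration):

```python
UINT_MAX = 2**32 - 1
NVME_QUIRK_DEALLOCATE_ZEROES = 1 << 2   # placeholder bit for illustration
NVME_CTRL_ONCS_DSM = 1 << 2             # ONCS bit 2: Dataset Management supported

def max_write_zeroes_sectors(quirks, oncs, ctrl_max_zeroes_sectors):
    """Sketch of the patched nvme_update_disk_info() limit selection."""
    if (quirks & NVME_QUIRK_DEALLOCATE_ZEROES) and (oncs & NVME_CTRL_ONCS_DSM):
        return UINT_MAX
    return ctrl_max_zeroes_sectors

# Quirked controller without DSM (this report): fall back to the controller
# limit instead of advertising UINT_MAX and then failing every Deallocate.
print(max_write_zeroes_sectors(NVME_QUIRK_DEALLOCATE_ZEROES, 0, 0))  # 0
```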
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-22 21:09 ` Keith Busch
2024-11-25 6:46 ` Christoph Hellwig
@ 2024-11-25 18:28 ` Saeed Mirzamohammadi
2024-11-26 4:55 ` Christoph Hellwig
1 sibling, 1 reply; 19+ messages in thread
From: Saeed Mirzamohammadi @ 2024-11-25 18:28 UTC (permalink / raw)
To: Keith Busch
Cc: Paul Webb, Christoph Hellwig, Jens Axboe, Phil Auld,
Chaitanya Kulkarni, linux-kernel@vger.kernel.org,
linux-nvme@lists.infradead.org, Ramanan Govindarajan,
Sagi Grimberg, Nicky Veitch
> On Nov 22, 2024, at 1:09 PM, Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Nov 22, 2024 at 06:26:46PM +0000, Saeed Mirzamohammadi wrote:
>> FYI, I tried disabling Write Zeroes (patch below), but am still getting the same errors:
>> [ 326.097275] operation not supported error, dev nvme2n1, sector 10624 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
>> [ 338.496217] nvme0n1: Dataset Management(0x9) @ LBA 10928, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
>> ...
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index d3bde17c818d5..ad2ce6008062e 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -3425,7 +3425,8 @@ static const struct pci_device_id nvme_id_table[] = {
>> .driver_data = NVME_QUIRK_STRIPE_SIZE |
>> NVME_QUIRK_DEALLOCATE_ZEROES |
>> NVME_QUIRK_IGNORE_DEV_SUBNQN |
>> - NVME_QUIRK_BOGUS_NID, },
>> + NVME_QUIRK_BOGUS_NID |
>> + NVME_QUIRK_DISABLE_WRITE_ZEROES, },
>> { PCI_VDEVICE(INTEL, 0x0a55), /* Dell Express Flash P4600 */
>> .driver_data = NVME_QUIRK_STRIPE_SIZE |
>> NVME_QUIRK_DEALLOCATE_ZEROES, },
>
> Could you instead try deleting the NVME_QUIRK_DEALLOCATE_ZEROES quirk
> for this device? The driver apparently uses this to assume you meant to
> do a Discard, but it sounds like the device wants an actual Write Zeroes
> command here.
Deleting the NVME_QUIRK_DEALLOCATE_ZEROES quirk recovers the performance:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d3bde17c818d5..09f6fce2fec54 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3423,7 +3423,6 @@ static const struct pci_device_id nvme_id_table[] = {
NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a54), /* Intel P4500/P4600 */
.driver_data = NVME_QUIRK_STRIPE_SIZE |
- NVME_QUIRK_DEALLOCATE_ZEROES |
NVME_QUIRK_IGNORE_DEV_SUBNQN |
NVME_QUIRK_BOGUS_NID, },
{ PCI_VDEVICE(INTEL, 0x0a55), /* Dell Express Flash P4600 */
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-25 18:28 ` Saeed Mirzamohammadi
@ 2024-11-26 4:55 ` Christoph Hellwig
2024-11-26 18:06 ` Saeed Mirzamohammadi
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-11-26 4:55 UTC (permalink / raw)
To: Saeed Mirzamohammadi
Cc: Keith Busch, Paul Webb, Christoph Hellwig, Jens Axboe, Phil Auld,
Chaitanya Kulkarni, linux-kernel@vger.kernel.org,
linux-nvme@lists.infradead.org, Ramanan Govindarajan,
Sagi Grimberg, Nicky Veitch
Hi Saeed,
can you please also test the patch I sent yesterday?
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-26 4:55 ` Christoph Hellwig
@ 2024-11-26 18:06 ` Saeed Mirzamohammadi
2024-11-26 18:09 ` Christoph Hellwig
0 siblings, 1 reply; 19+ messages in thread
From: Saeed Mirzamohammadi @ 2024-11-26 18:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Paul Webb, Jens Axboe, Phil Auld, Chaitanya Kulkarni,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Nicky Veitch
I was waiting for the results, but yes, that recovered the regression as well (snippet below), thanks! I think that's the best way to go here. Should I make it into a patch and send it for review?
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 983909a600adb..d252c9651fc99 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2044,7 +2044,7 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
lim->physical_block_size = min(phys_bs, atomic_bs);
lim->io_min = phys_bs;
lim->io_opt = io_opt;
- if (ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
+ if ((ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES) && (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM))
lim->max_write_zeroes_sectors = UINT_MAX;
else
lim->max_write_zeroes_sectors = ns->ctrl->max_zeroes_sectors;
Saeed
> On Nov 25, 2024, at 8:55 PM, Christoph Hellwig <hch@lst.de> wrote:
>
> Hi Saeed,
>
> can you please also test the patch I sent yesterday?
>
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-26 18:06 ` Saeed Mirzamohammadi
@ 2024-11-26 18:09 ` Christoph Hellwig
2024-11-26 18:13 ` Saeed Mirzamohammadi
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-11-26 18:09 UTC (permalink / raw)
To: Saeed Mirzamohammadi
Cc: Christoph Hellwig, Keith Busch, Paul Webb, Jens Axboe, Phil Auld,
Chaitanya Kulkarni, linux-kernel@vger.kernel.org,
linux-nvme@lists.infradead.org, Ramanan Govindarajan,
Sagi Grimberg, Nicky Veitch
On Tue, Nov 26, 2024 at 06:06:19PM +0000, Saeed Mirzamohammadi wrote:
> I was waiting for the results but yes that recovered the regression as well (snippet below), thanks! I think that’s the best way to go here. Should I make it into a patch and send it for review?
I'll send out the formal patch tomorrow morning, need to do a little
more digging for a good detailed commit log describing what went
wrong here.
* Re: [External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
2024-11-26 18:09 ` Christoph Hellwig
@ 2024-11-26 18:13 ` Saeed Mirzamohammadi
0 siblings, 0 replies; 19+ messages in thread
From: Saeed Mirzamohammadi @ 2024-11-26 18:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Paul Webb, Jens Axboe, Phil Auld, Chaitanya Kulkarni,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
Ramanan Govindarajan, Sagi Grimberg, Nicky Veitch
> On Nov 26, 2024, at 10:09 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Nov 26, 2024 at 06:06:19PM +0000, Saeed Mirzamohammadi wrote:
>> I was waiting for the results but yes that recovered the regression as well (snippet below), thanks! I think that’s the best way to go here. Should I make it into a patch and send it for review?
>
> I'll send out the formal patch tomorrow morning, need to do a little
> more digging for a good detailed commit log describing what went
> wrong here.
Sure, thanks Christoph.
>
end of thread, other threads:[~2024-11-26 18:13 UTC | newest]
Thread overview: 19+ messages
2024-11-20 21:35 [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel Saeed Mirzamohammadi
2024-11-21 0:00 ` Chaitanya Kulkarni
2024-11-21 1:20 ` Jens Axboe
2024-11-21 4:57 ` Christoph Hellwig
2024-11-21 14:48 ` Jens Axboe
2024-11-21 11:30 ` Phil Auld
2024-11-21 14:49 ` Jens Axboe
[not found] ` <181bcb70-e0bf-4024-80b7-e79276d6eaf7@oracle.com>
2024-11-21 21:19 ` [External] : " Phil Auld
2024-11-22 12:13 ` Christoph Hellwig
2024-11-22 17:18 ` Paul Webb
2024-11-22 18:26 ` Saeed Mirzamohammadi
2024-11-22 21:09 ` Keith Busch
2024-11-25 6:46 ` Christoph Hellwig
2024-11-25 18:28 ` Saeed Mirzamohammadi
2024-11-26 4:55 ` Christoph Hellwig
2024-11-26 18:06 ` Saeed Mirzamohammadi
2024-11-26 18:09 ` Christoph Hellwig
2024-11-26 18:13 ` Saeed Mirzamohammadi
2024-11-22 17:13 ` Paul Webb