* Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel
@ 2025-07-15 5:17 Thomas Glanzmann
2025-07-15 20:44 ` Sitsofe Wheeler
From: Thomas Glanzmann @ 2025-07-15 5:17 UTC (permalink / raw)
To: fio
Hello,
I have a 1 TiB NVMe namespace from a NetApp connected via two distinct
direct links to a Linux system over NVMe/TCP. I would like to generate
read and write I/O using multiple jobs/iodepth so that every byte of the
device is being written to in parallel with the maximum number of
available parallel inflight I/Os. The NetApp does deduplication
and compression by default, so I want to generate random data; I think
that without refill_buffers the NetApp would notice the same data being
written over and over and dedup it. I tried:
fio --ioengine=libaio --refill_buffers --filesize=25G --ramp_time=2s \
--runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
--group_reporting --filename=/dev/nvme0n1 --name=1mhqd --blocksize=1m \
--iodepth=1638 --readwrite=write
I'm on a Linux system with 40 hyperthreads and a Mellanox 2x 25
Gbit/s card hooked up to the NetApp:
(live) [~] ip -br a s
...
eth6 UP 192.168.0.100/24
eth7 UP 192.168.1.100/24
(live) [~] nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.1992-08.com.netapp:sn.e0a0273a60b711f09deed039ead647e8:subsystem.svm1_subsystem_553
hostnqn=nqn.2014-08.org.nvmexpress:uuid:20f011e6-9ab8-584f-abb0-a260d2d685c4
\
+- nvme0 tcp traddr=192.168.0.2,trsvcid=4420,src_addr=192.168.0.100 live optimized
+- nvme1 tcp traddr=192.168.1.2,trsvcid=4420,src_addr=192.168.1.100 live optimized
na2501::*> network interface show
Logical Status Network Current Current Is
Vserver Interface Admin/Oper Address/Mask Node Port Home
----------- ---------- ---------- ------------------ ------------- ------- ----
...
svm1
lif_svm1_2660 up/up 192.168.1.2/24 na2501-02 e4c true
lif_svm1_9354 up/up 192.168.0.2/24 na2501-01 e4c true
So when I run the above command, the NetApp only reports a few hundred GiB of
physically allocated space:
na2501::*> aggr show -fields physical-used
aggregate physical-used
-------------- -------------
dataFA_4_p0_i1 169.5GB
So, I ran:
(live) [~] pv < /dev/urandom > /dev/nvme0n1
1.00TiB 0:59:14 [ 294MiB/s] [======================>] 100%
And afterwards more physical space was used:
na2501::*> aggr show -fields physical-used
aggregate physical-used
-------------- -------------
dataFA_4_p0_i1 1.15TB
So, what is the best way to use fio to write random data to every byte of this
1 TiB device in parallel?
- Is there a command line parameter?
- Or should I create 40 partitions of 25.6 GiB each (1024/40) and pass
them as a colon-separated list to fio?
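For context, here is the per-job region math I have in mind (a sketch with illustrative numbers, just the arithmetic — not something I have run against the device):

```shell
# Sketch: divide the 1 TiB device evenly across 40 jobs so each job
# writes its own non-overlapping region. 1 TiB = 1048576 MiB.
device_mib=1048576
jobs=40
per_job_mib=$((device_mib / jobs))   # 26214 MiB per job (~25.6 GiB)
echo "per-job region: ${per_job_mib} MiB"
echo "job 0 offset: 0 MiB"
echo "job 1 offset: ${per_job_mib} MiB"
```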
I would also like to determine the number of queues and the queue depth.
Is there a command available? When I run:
fio --ioengine=libaio --refill_buffers --filesize=8G --ramp_time=2s \
--runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
--group_reporting --filename=/dev/nvme0n1 --name=4khqd --blocksize=4k \
--iodepth=1638 --readwrite=randwrite
And also watch 'iostat -xm 2' I can see aqu-sz is 194.87 per path and
391.96 for the multipathed device nvme0n1. So I kind of know it but
would like to have a command on Linux that shows me the available queues
and queue depths.
avg-cpu: %user %nice %system %iowait %steal %idle
1.67 0.00 6.48 85.13 0.00 6.72
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0c0n1 0.00 0.00 0.00 0.00 0.00 0.00 60855.00 237.71 0.00 0.00 3.20 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 194.87 100.00
nvme0c1n1 0.00 0.00 0.00 0.00 0.00 0.00 58719.00 229.37 0.00 0.00 3.32 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 194.94 100.00
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 119570.50 467.07 0.00 0.00 3.28 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 391.96 100.00
Cheers,
Thomas
* Re: Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel
2025-07-15 5:17 Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel Thomas Glanzmann
@ 2025-07-15 20:44 ` Sitsofe Wheeler
2025-07-17 7:52 ` Thomas Glanzmann
From: Sitsofe Wheeler @ 2025-07-15 20:44 UTC (permalink / raw)
To: Thomas Glanzmann; +Cc: fio
Hello Thomas,
On Tue, 15 Jul 2025 at 06:18, Thomas Glanzmann <thomas@glanzmann.de> wrote:
>
> I have a 1 TiB NVMe namespace from a NetApp connected via two distinct
> direct links to a Linux system over NVMe/TCP. I would like to generate
> read and write I/O using multiple jobs/iodepth so that every byte of the
> device is being written to in parallel with the maximum number of
> available parallel inflight I/Os. The NetApp does deduplication
> and compression by default, so I want to generate random data; I think
> that without refill_buffers the NetApp would notice the same data being
> written over and over and dedup it. I tried:
It depends on just how clever it is. By default fio uses
scramble_buffers
(https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-scramble_buffers
) which may be enough but you would have to check. The trick would be
to see what happens with a single stream as that's easier to reason
about.
> fio --ioengine=libaio --refill_buffers --filesize=25G --ramp_time=2s \
> --runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
> --group_reporting --filename=/dev/nvme0n1 --name=1mhqd --blocksize=1m \
> --iodepth=1638 --readwrite=write
If I'm reading this correctly you have 40 jobs all sequentially
writing the same 25G of the device at the same time. This is
problematic because if you send two or more write I/Os for the same
area to storage at the same time then something in your data path
could legitimately just throw all but one of them away and say "I'm
done" (because one I/O overwrites the others) - you're essentially
saying "I don't care about the data" and in certain setups such
behaviour is undefined. One change you could make is to have each of
the 40 jobs write to a different region from the others, e.g. by using
offset_increment
https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-offset_increment
and size https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-size
.
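As an untested sketch (my numbers are illustrative, not a recommendation for your setup), building the invocation might look something like this — with offset_increment equal to size, each job's region starts right where the previous one ends:

```shell
# Hypothetical sketch: construct a fio command line where each of 40
# jobs writes its own 25G region via offset_increment, so writes never
# overlap. The command string is only built and printed here, not run.
jobs=40
cmd="fio --name=evenfill --ioengine=libaio --direct=1 --blocksize=1m \
--readwrite=write --refill_buffers --numjobs=${jobs} \
--size=25G --offset_increment=25G --filename=/dev/nvme0n1"
echo "$cmd"
```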
<snip>
> So, I ran:
>
> (live) [~] pv < /dev/urandom > /dev/nvme0n1
> 1.00TiB 0:59:14 [ 294MiB/s] [======================>] 100%
>
> And afterwards more physical space was used:
>
> na2501::*> aggr show -fields physical-used
> aggregate physical-used
> -------------- -------------
> dataFA_4_p0_i1 1.15TB
Do you get similar results in terms of space used with a single fio
stream? Start small and then work your way up!
> So, what is the best way to use fio to write random data to every byte of this
> 1 TiB device in parallel?
>
> - Is there a command line parameter?
> - Or should I create 40 partitions of 25.6 GiB each (1024/40) and pass
> them as a colon-separated list to fio?
See above.
> I would also like to determine the number of queues and the queue depth.
> Is there a command available? When I run:
fio does report statistics about what queue depth each job internally
reached but these may be different from what your device sees for a
variety of reasons (e.g. splitting or coalescing done by the block
layer). See IO depths/IO submit/IO complete over on
https://fio.readthedocs.io/en/latest/fio_doc.html#interpreting-the-output
. But perhaps you're thinking of device queue depths?
> fio --ioengine=libaio --refill_buffers --filesize=8G --ramp_time=2s \
> --runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
> --group_reporting --filename=/dev/nvme0n1 --name=4khqd --blocksize=4k \
> --iodepth=1638 --readwrite=randwrite
Given you're running 40 jobs I'd be surprised if you can hit a depth
of over 1000 per job (that would be over 65000 I/Os in total) without
some serious tuning. You may want to look at
/sys/block/[disk]/queue/nr_requests (see
https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst ) and
/sys/block/[disk]/device/queue_depth but you may also find you run
into libaio limits...
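As a back-of-the-envelope sketch (the numbers below are made up, not read from your sysfs), the rough ceiling on in-flight I/Os is the number of hardware queues times the per-queue depth:

```shell
# Rough ceiling on in-flight I/Os: hardware queues times per-queue
# depth. Values are hypothetical placeholders for what you would read
# from sysfs, e.g.:
#   queues: ls -1 /sys/block/nvme0n1/mq/ | wc -l
#   depth:  cat /sys/block/nvme0n1/queue/nr_requests
queues=8
depth=128
echo "max in-flight ~= $((queues * depth))"
```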
> And also watch 'iostat -xm 2' I can see aqu-sz is 194.87 per path and
> 391.96 for the multipathed device nvme0n1. So I kind of know it but
> would like to have a command on Linux that shows me the available queues
> and queue depths.
I'm fairly sure iostat is the right way to go (unless you wanted to
write some BPF tracing).
--
Sitsofe
* Re: Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel
2025-07-15 20:44 ` Sitsofe Wheeler
@ 2025-07-17 7:52 ` Thomas Glanzmann
From: Thomas Glanzmann @ 2025-07-17 7:52 UTC (permalink / raw)
To: Sitsofe Wheeler; +Cc: fio
Hello Sitsofe,
* Sitsofe Wheeler <sitsofe@gmail.com> [2025-07-15 22:44]:
> (https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-scramble_buffers)
> which may be enough but you would have to check. The trick would be to
> see what happens with a single stream as that's easier to reason
> about.
I see. I stuck with refill_buffers because it was good enough.
> https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-offset_increment
> and size https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-size
Thank you that resolved my problem. I used the following command line:
fio --ioengine=libaio --refill_buffers --offset=0 --offset_increment=26G \
--size=25G --ramp_time=2s --numjobs=80 --direct=1 --verify=0 \
--randrepeat=0 --group_reporting --filename /dev/nvme0n1 --name=1mhqd \
--blocksize=1m --iodepth=3 --readwrite=write
> Do you get similar results in terms of space used with a single fio
> stream? Start small and then work your way up!
Yes, that works, but I also wanted to benchmark parallel performance.
Also, the single fio stream takes an hour while the parallel one only
takes 8.5 minutes.
> But perhaps you're thinking of device queue depths?
Yes, that is what I was looking for; Keith Busch answered me. The
commands I was looking for are:
# How many IO queues are there:
ls -1 /sys/block/nvme0n1/mq/ | wc -l
# How large is each IO queue:
cat /sys/block/nvme0n1/queue/nr_requests
> Given you're running 40 jobs I'd be surprised if you can hit a depth
> of over 1000 per job (that would be over 65000 I/Os in total) without
> some serious tuning. You may want to look at
> /sys/block/[disk]/queue/nr_requests (see
> https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst ) and
> /sys/block/[disk]/device/queue_depth but you may also find you run
> into libaio limits...
I could not. The NetApp only has 8 queues with a depth of 128 each, so
now that I knew the exact values, I matched them in fio.
Cheers,
Thomas