* Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel
@ 2025-07-15 5:17 Thomas Glanzmann
2025-07-15 20:44 ` Sitsofe Wheeler
From: Thomas Glanzmann @ 2025-07-15 5:17 UTC (permalink / raw)
To: fio
Hello,
I have a 1 TiB NVMe namespace from a NetApp connected via two distinct
direct links to a Linux system over NVMe/TCP. I would like to generate
read and write I/O using multiple jobs/iodepth so that every byte of the
device is being written to in parallel with the maximum number of
available parallel inflight I/Os. The NetApp does deduplication
and compression by default, so I want to generate random data; I think
that without refill_buffers the NetApp would notice the same data being
written over and over and dedup it. I tried:
fio --ioengine=libaio --refill_buffers --filesize=25G --ramp_time=2s \
--runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
--group_reporting --filename=/dev/nvme0n1 --name=1mhqd --blocksize=1m \
--iodepth=1638 --readwrite=write
I'm on a Linux system with 40 hyperthreads and a Mellanox 2x 25
Gbit/s card hooked up to the NetApp:
(live) [~] ip -br a s
...
eth6 UP 192.168.0.100/24
eth7 UP 192.168.1.100/24
(live) [~] nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.1992-08.com.netapp:sn.e0a0273a60b711f09deed039ead647e8:subsystem.svm1_subsystem_553
hostnqn=nqn.2014-08.org.nvmexpress:uuid:20f011e6-9ab8-584f-abb0-a260d2d685c4
\
+- nvme0 tcp traddr=192.168.0.2,trsvcid=4420,src_addr=192.168.0.100 live optimized
+- nvme1 tcp traddr=192.168.1.2,trsvcid=4420,src_addr=192.168.1.100 live optimized
na2501::*> network interface show
Logical Status Network Current Current Is
Vserver Interface Admin/Oper Address/Mask Node Port Home
----------- ---------- ---------- ------------------ ------------- ------- ----
...
svm1
lif_svm1_2660 up/up 192.168.1.2/24 na2501-02 e4c true
lif_svm1_9354 up/up 192.168.0.2/24 na2501-01 e4c true
So when I run the above command, the NetApp only reports a few hundred GiB of
physically allocated space:
na2501::*> aggr show -fields physical-used
aggregate physical-used
-------------- -------------
dataFA_4_p0_i1 169.5GB
So, I ran:
(live) [~] pv < /dev/urandom > /dev/nvme0n1
1.00TiB 0:59:14 [ 294MiB/s] [======================>] 100%
And afterwards more physical space was used:
na2501::*> aggr show -fields physical-used
aggregate physical-used
-------------- -------------
dataFA_4_p0_i1 1.15TB
So, what is the best way to use fio to write random data to every byte of this
1 TiB device in parallel?
- Is there a command line parameter?
- Or should I create 40 partitions of 25.6 GiB each (1024/40) and pass
them as a colon-separated list to fio?
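For context, here is the per-job region math I have in mind (a sketch with illustrative numbers, just the arithmetic — not something I have run against the device):

```shell
# Sketch: divide the 1 TiB device evenly across 40 jobs so each job
# writes its own non-overlapping region. 1 TiB = 1048576 MiB.
device_mib=1048576
jobs=40
per_job_mib=$((device_mib / jobs))   # 26214 MiB per job (~25.6 GiB)
echo "per-job region: ${per_job_mib} MiB"
echo "job 0 offset: 0 MiB"
echo "job 1 offset: ${per_job_mib} MiB"
```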
I would also like to determine the number of queues and the queue depth.
Is there a command available? When I run:
fio --ioengine=libaio --refill_buffers --filesize=8G --ramp_time=2s \
--runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
--group_reporting --filename=/dev/nvme0n1 --name=4khqd --blocksize=4k \
--iodepth=1638 --readwrite=randwrite
And also watch 'iostat -xm 2' I can see aqu-sz is 194.87 per path and
391.96 for the multipathed device nvme0n1. So I kind of know it but
would like to have a command on Linux that shows me the available queues
and queue depths.
avg-cpu: %user %nice %system %iowait %steal %idle
1.67 0.00 6.48 85.13 0.00 6.72
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0c0n1 0.00 0.00 0.00 0.00 0.00 0.00 60855.00 237.71 0.00 0.00 3.20 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 194.87 100.00
nvme0c1n1 0.00 0.00 0.00 0.00 0.00 0.00 58719.00 229.37 0.00 0.00 3.32 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 194.94 100.00
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 119570.50 467.07 0.00 0.00 3.28 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 391.96 100.00
Cheers,
Thomas
* Re: Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel
2025-07-15 5:17 Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel Thomas Glanzmann
@ 2025-07-15 20:44 ` Sitsofe Wheeler
2025-07-17 7:52 ` Thomas Glanzmann
From: Sitsofe Wheeler @ 2025-07-15 20:44 UTC (permalink / raw)
To: Thomas Glanzmann; +Cc: fio
Hello Thomas,
On Tue, 15 Jul 2025 at 06:18, Thomas Glanzmann <thomas@glanzmann.de> wrote:
>
> I have a 1 TiB NVMe namespace from a NetApp connected via two distinct
> direct links to a Linux system over NVMe/TCP. I would like to generate
> read and write I/O using multiple jobs/iodepth so that every byte of the
> device is being written to in parallel with the maximum number of
> available parallel inflight I/Os. The NetApp does deduplication
> and compression by default, so I want to generate random data; I think
> that without refill_buffers the NetApp would notice the same data being
> written over and over and dedup it. I tried:
It depends on just how clever it is. By default fio uses
scramble_buffers
(https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-scramble_buffers
) which may be enough but you would have to check. The trick would be
to see what happens with a single stream as that's easier to reason
about.
> fio --ioengine=libaio --refill_buffers --filesize=25G --ramp_time=2s \
> --runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
> --group_reporting --filename=/dev/nvme0n1 --name=1mhqd --blocksize=1m \
> --iodepth=1638 --readwrite=write
If I'm reading this correctly you have 40 jobs all sequentially
writing the same 25G of the device at the same time. This is
problematic because if you send two or more write I/Os for the same
area to storage at the same time then something in your data path
could legitimately just throw all but one of them away and say "I'm
done" (because one I/O overwrites the others) - you're essentially
saying "I don't care about the data" and in certain setups such
behaviour is undefined. One change you could make is to have each of
the 40 jobs write to a different region from the others, e.g. by using
offset_increment
https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-offset_increment
and size https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-size
.
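As an untested sketch (my numbers are illustrative, not a recommendation for your setup), building the invocation might look something like this — with offset_increment equal to size, each job's region starts right where the previous one ends:

```shell
# Hypothetical sketch: construct a fio command line where each of 40
# jobs writes its own 25G region via offset_increment, so writes never
# overlap. The command string is only built and printed here, not run.
jobs=40
cmd="fio --name=evenfill --ioengine=libaio --direct=1 --blocksize=1m \
--readwrite=write --refill_buffers --numjobs=${jobs} \
--size=25G --offset_increment=25G --filename=/dev/nvme0n1"
echo "$cmd"
```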
<snip>
> So, I ran:
>
> (live) [~] pv < /dev/urandom > /dev/nvme0n1
> 1.00TiB 0:59:14 [ 294MiB/s] [======================>] 100%
>
> And afterwards more physical space was used:
>
> na2501::*> aggr show -fields physical-used
> aggregate physical-used
> -------------- -------------
> dataFA_4_p0_i1 1.15TB
Do you get similar results in terms of space used with a single fio
stream? Start small and then work your way up!
> So, what is the best way to use fio to write random data to every byte of this
> 1 TiB device in parallel?
>
> - Is there a command line parameter?
> - Or should I create 40 partitions of 25.6 GiB each (1024/40) and pass
> them as a colon-separated list to fio?
See above.
> I would also like to determine the number of queues and the queue depth.
> Is there a command available? When I run:
fio does report statistics about what queue depth each job internally
reached but these may be different from what your device sees for a
variety of reasons (e.g. splitting or coalescing done by the block
layer). See IO depths/IO submit/IO complete over on
https://fio.readthedocs.io/en/latest/fio_doc.html#interpreting-the-output
. But perhaps you're thinking of device queue depths?
> fio --ioengine=libaio --refill_buffers --filesize=8G --ramp_time=2s \
> --runtime=1m --numjobs=40 --direct=1 --verify=0 --randrepeat=0 \
> --group_reporting --filename=/dev/nvme0n1 --name=4khqd --blocksize=4k \
> --iodepth=1638 --readwrite=randwrite
Given you're running 40 jobs I'd be surprised if you can hit a depth
of over 1000 per job (that would be over 65000 I/Os in total) without
some serious tuning. You may want to look at
/sys/block/[disk]/queue/nr_requests (see
https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst ) and
/sys/block/[disk]/device/queue_depth but you may also find you run
into libaio limits...
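As a back-of-the-envelope sketch (the numbers below are made up, not read from your sysfs), the rough ceiling on in-flight I/Os is the number of hardware queues times the per-queue depth:

```shell
# Rough ceiling on in-flight I/Os: hardware queues times per-queue
# depth. Values are hypothetical placeholders for what you would read
# from sysfs, e.g.:
#   queues: ls -1 /sys/block/nvme0n1/mq/ | wc -l
#   depth:  cat /sys/block/nvme0n1/queue/nr_requests
queues=8
depth=128
echo "max in-flight ~= $((queues * depth))"
```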
> And also watch 'iostat -xm 2' I can see aqu-sz is 194.87 per path and
> 391.96 for the multipathed device nvme0n1. So I kind of know it but
> would like to have a command on Linux that shows me the available queues
> and queue depths.
I'm fairly sure iostat is the right way to go (unless you wanted to
write some BPF tracing).
--
Sitsofe
* Re: Evenly distribute jobs and iodepth over a 1 TiB device so that every byte is written to in parallel
2025-07-15 20:44 ` Sitsofe Wheeler
@ 2025-07-17 7:52 ` Thomas Glanzmann
From: Thomas Glanzmann @ 2025-07-17 7:52 UTC (permalink / raw)
To: Sitsofe Wheeler; +Cc: fio
Hello Sitsofe,
* Sitsofe Wheeler <sitsofe@gmail.com> [2025-07-15 22:44]:
> (https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-scramble_buffers)
> which may be enough but you would have to check. The trick would be to
> see what happens with a single stream as that's easier to reason
> about.
I see. I stuck with refill_buffers because it was good enough.
> https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-offset_increment
> and size https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-size
Thank you that resolved my problem. I used the following command line:
fio --ioengine=libaio --refill_buffers --offset=0 --offset_increment=26G \
--size=25G --ramp_time=2s --numjobs=80 --direct=1 --verify=0 \
--randrepeat=0 --group_reporting --filename /dev/nvme0n1 --name=1mhqd \
--blocksize=1m --iodepth=3 --readwrite=write
> Do you get similar results in terms of space used with a single fio
> stream? Start small and then work your way up!
Yes, that works, but I also wanted to benchmark parallel performance.
Also, the single fio stream takes an hour while the parallel one only
takes 8.5 minutes.
> But perhaps you're thinking of device queue depths?
Yes, that is what I was looking for; Keith Busch answered me. The
commands I was looking for are:
# How many IO queues are there:
ls -1 /sys/block/nvme0n1/mq/ | wc -l
# How large is each IO queue:
cat /sys/block/nvme0n1/queue/nr_requests
> Given you're running 40 jobs I'd be surprised if you can hit a depth
> of over 1000 per job (that would be over 65000 I/Os in total) without
> some serious tuning. You may want to look at
> /sys/block/[disk]/queue/nr_requests (see
> https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst ) and
> /sys/block/[disk]/device/queue_depth but you may also find you run
> into libaio limits...
I could not. The NetApp only has 8 queues with a depth of 128 each, so
now that I knew the exact values, I matched them in fio.
Cheers,
Thomas