public inbox for linux-xfs@vger.kernel.org
* read-modify-write occurring for direct I/O on RAID-5
@ 2023-08-04  5:44 Corey Hickey
From: Corey Hickey @ 2023-08-04  5:44 UTC (permalink / raw)
  To: linux-xfs

Hello,

I am having a problem with write performance via direct I/O. My setup is:
* Debian Sid
* Linux 6.3.0-2 (Debian Kernel)
* 3-disk MD RAID-5 of hard disks
* XFS

When I do large sequential writes via direct I/O, sometimes the writes 
are fast, but sometimes the RAID ends up doing read-modify-write (RMW) 
cycles and performance drops sharply.

If I use regular buffered I/O, then performance is better, presumably 
due to the MD stripe cache. I could just use buffered writes, of course, 
but I am really trying to make sure I get the alignment correct to start 
with.
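
(For reference, the stripe cache I'm referring to is the raid456 one, 
tunable via sysfs; 256 is the default on my kernel, and the larger 
value below is just an example:)
-----------------------------------------------------------------------
$ cat /sys/block/md10/md/stripe_cache_size
256
$ echo 4096 | sudo tee /sys/block/md10/md/stripe_cache_size
-----------------------------------------------------------------------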


I can reproduce the problem on a fresh RAID.
-----------------------------------------------------------------------
$ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md10 started.
-----------------------------------------------------------------------
For testing, I'm using "-z 30G" to limit the duration of the initial 
RAID resync.
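
(Resync progress can be watched in /proc/mdstat while waiting for it 
to finish:)
-----------------------------------------------------------------------
$ watch cat /proc/mdstat
-----------------------------------------------------------------------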


For XFS I can use default options:
-----------------------------------------------------------------------
$ sudo mkfs.xfs /dev/md10
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md10              isize=512    agcount=16, agsize=983040 blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=15728640, imaxpct=25
          =                       sunit=128    swidth=68352 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
          =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount /dev/md10 /mnt/tmp
-----------------------------------------------------------------------
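
(After mounting, the geometry XFS actually picked up can be 
double-checked; it should report the same sunit/swidth as the mkfs 
output above:)
-----------------------------------------------------------------------
$ xfs_info /mnt/tmp | grep sunit
-----------------------------------------------------------------------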


I am testing via dd:
-----------------------------------------------------------------------
$ sudo dd if=/dev/zero of=/mnt/tmp/test.bin iflag=fullblock oflag=direct bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 100.664 s, 107 MB/s
-----------------------------------------------------------------------

I can monitor performance with dstat (the I/O reported at the start 
seems to be an artifact of dstat's monitoring).
-----------------------------------------------------------------------
$ dstat -dD sdg,sdh,sdi 2
--dsk/sdg-----dsk/sdh-----dsk/sdi--
  read  writ: read  writ: read  writ
   16G 5673M:  16G 5673M: 537M   21G  # <--not a real reading
    0     0 :   0     0 :   0     0
    0     0 :   0     0 :   0     0
    0    29M:   0    29M:   0    29M  # <--test starts here
    0   126M:   0   126M:   0   126M
    0   134M:   0   134M:   0   134M
    0   145M:   0   145M:   0   144M
   16k  137M:   0   137M:   0   138M
    0   152M:   0   152M:   0   152M
    0   140M:   0   140M:   0   140M
5632k  110M:5376k  110M:5376k  111M  # <--RMW begins here
   12M   49M:  12M   49M:  12M   49M
   14M   53M:  13M   54M:  13M   53M
   12M   50M:  12M   50M:  12M   50M
   12M   49M:  12M   50M:  12M   49M
   12M   50M:  12M   49M:  12M   49M
   13M   50M:  13M   51M:  12M   51M
   12M   50M:  12M   50M:  12M   50M
   12M   48M:  12M   48M:  12M   48M
   13M   53M:  13M   52M:  13M   53M
   13M   50M:  12M   50M:  13M   50M
   13M   52M:  13M   52M:  13M   52M
   12M   47M:  12M   46M:  12M   46M
   13M   52M:  13M   52M:  13M   52M
-----------------------------------------------------------------------
(I truncated the output--the rest looks the same)

Note how the I/O starts out as pure writes but then shifts to a mix of 
reads and writes. I am fairly sure this is RAID-5 read-modify-write 
caused by misaligned writes.
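
(The raid5 stripe cache occupancy can be watched alongside dstat; as I 
understand it, it stays high while stripes are being read back in for 
RMW:)
-----------------------------------------------------------------------
$ watch -n1 cat /sys/block/md10/md/stripe_cache_active
-----------------------------------------------------------------------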


The default chunk size is 512K:
-----------------------------------------------------------------------
$ sudo mdadm --detail /dev/md10 | grep Chunk
         Chunk Size : 512K
$ sudo blkid -i /dev/md10
/dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792" PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
-----------------------------------------------------------------------
I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I 
would expect it to be 1024K (two data disks in a three-disk RAID-5, 
times the 512K chunk). For what it's worth, 279969792 bytes is exactly 
534 chunks, which doesn't correspond to any geometry I recognize.
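
(blkid should just be reporting what the kernel exposes; the same 
values can be read straight from sysfs:)
-----------------------------------------------------------------------
$ cat /sys/block/md10/queue/minimum_io_size
524288
$ cat /sys/block/md10/queue/optimal_io_size
279969792
-----------------------------------------------------------------------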

Translating into 512-byte sectors, I think the topology should be:
chunk size (sunit): 1024 sectors
stripe size (swidth): 2048 sectors
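
(Spelled out:)
-----------------------------------------------------------------------
$ echo $(( 512 * 1024 / 512 ))   # 512K chunk in 512-byte sectors
1024
$ echo $(( 2 * 1024 ))           # 2 data disks * chunk = full stripe
2048
-----------------------------------------------------------------------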


I can see the write alignment with blktrace.
-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
   9,10  15        1     0.000000000 186548  Q  WS 3829760 + 2048 [dd]
   9,10  15        3     0.021087119 186548  Q  WS 3831808 + 2048 [dd]
   9,10  15        5     0.023605705 186548  Q  WS 3833856 + 2048 [dd]
   9,10  15        7     0.026093572 186548  Q  WS 3835904 + 2048 [dd]
   9,10  15        9     0.028595887 186548  Q  WS 3837952 + 2048 [dd]
   9,10  15       11     0.031171221 186548  Q  WS 3840000 + 2048 [dd]
[...]
   9,10   5      441    14.601942400 186608  Q  WS 8082432 + 2048 [dd]
   9,10   5      443    14.620316654 186608  Q  WS 8084480 + 2048 [dd]
   9,10   5      445    14.646707430 186608  Q  WS 8086528 + 2048 [dd]
   9,10   5      447    14.654519976 186608  Q  WS 8088576 + 2048 [dd]
   9,10   5      449    14.680901605 186608  Q  WS 8090624 + 2048 [dd]
   9,10   5      451    14.689156421 186608  Q  WS 8092672 + 2048 [dd]
   9,10   5      453    14.706529362 186608  Q  WS 8094720 + 2048 [dd]
   9,10   5      455    14.732451407 186608  Q  WS 8096768 + 2048 [dd]
-----------------------------------------------------------------------
In the beginning, the queued writes are stripe-aligned. For example:
3829760 / 2048 == 1870

Later on, writes end up getting misaligned by half a stripe. For example:
8082432 / 2048 == 3946.5
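
(A quick sketch to flag misaligned queue events as they happen; field 8 
of the blkparse output is the starting sector, and 2048 is the full 
stripe in sectors:)
-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | \
      awk '$6 == "Q" && $8 % 2048 { print "misaligned:", $0 }'
-----------------------------------------------------------------------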

I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, 
but that had pretty much the same behavior when writing (the RMW starts 
later, but it still starts).
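
(In case it matters, I believe the equivalent byte-based spelling of 
that geometry is '-d su=512k,sw=2':)
-----------------------------------------------------------------------
$ sudo mkfs.xfs -f -d su=512k,sw=2 /dev/md10
-----------------------------------------------------------------------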


Am I doing something wrong, or is there a bug, or are my expectations 
incorrect? I had expected that large sequential writes would be aligned 
with swidth.

Thank you,
Corey


Thread overview: 9+ messages
2023-08-04  5:44 read-modify-write occurring for direct I/O on RAID-5 Corey Hickey
2023-08-04  8:07 ` Dave Chinner
2023-08-04 19:26   ` Corey Hickey
2023-08-04 21:52     ` Dave Chinner
2023-08-05  1:44       ` Corey Hickey
2023-08-05 22:37         ` Dave Chinner
2023-08-06 18:21           ` Corey Hickey
2023-08-06 22:38             ` Dave Chinner
2023-08-06 18:54         ` Corey Hickey
