* read-modify-write occurring for direct I/O on RAID-5
@ 2023-08-04 5:44 Corey Hickey
2023-08-04 8:07 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2023-08-04 5:44 UTC (permalink / raw)
To: linux-xfs
Hello,
I am having a problem with write performance via direct I/O. My setup is:
* Debian Sid
* Linux 6.3.0-2 (Debian Kernel)
* 3-disk MD RAID-5 of hard disks
* XFS
When I do large sequential writes via direct I/O, sometimes the writes
are fast, but sometimes the RAID ends up doing RMW and performance gets
slow.
If I use regular buffered I/O, then performance is better, presumably
due to the MD stripe cache. I could just use buffered writes, of course,
but I am really trying to make sure I get the alignment correct to start
with.
I can reproduce the problem on a fresh RAID.
-----------------------------------------------------------------------
$ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md10 started.
-----------------------------------------------------------------------
For testing, I'm using "-z 30G" to limit the duration of the initial
RAID resync.
For XFS I can use default options:
-----------------------------------------------------------------------
$ sudo mkfs.xfs /dev/md10
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md10 isize=512 agcount=16, agsize=983040 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=1 inobtcount=1 nrext64=0
data = bsize=4096 blocks=15728640, imaxpct=25
= sunit=128 swidth=68352 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ sudo mount /dev/md10 /mnt/tmp
-----------------------------------------------------------------------
I am testing via dd:
-----------------------------------------------------------------------
$ sudo dd if=/dev/zero of=/mnt/tmp/test.bin iflag=fullblock oflag=direct bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 100.664 s, 107 MB/s
-----------------------------------------------------------------------
I can monitor performance with dstat (the I/O reported at the start
seems to be an artifact of dstat's monitoring).
-----------------------------------------------------------------------
$ dstat -dD sdg,sdh,sdi 2
--dsk/sdg-----dsk/sdh-----dsk/sdi--
read writ: read writ: read writ
16G 5673M: 16G 5673M: 537M 21G # <--not a real reading
0 0 : 0 0 : 0 0
0 0 : 0 0 : 0 0
0 29M: 0 29M: 0 29M # <--test starts here
0 126M: 0 126M: 0 126M
0 134M: 0 134M: 0 134M
0 145M: 0 145M: 0 144M
16k 137M: 0 137M: 0 138M
0 152M: 0 152M: 0 152M
0 140M: 0 140M: 0 140M
5632k 110M:5376k 110M:5376k 111M # <--RMW begins here
12M 49M: 12M 49M: 12M 49M
14M 53M: 13M 54M: 13M 53M
12M 50M: 12M 50M: 12M 50M
12M 49M: 12M 50M: 12M 49M
12M 50M: 12M 49M: 12M 49M
13M 50M: 13M 51M: 12M 51M
12M 50M: 12M 50M: 12M 50M
12M 48M: 12M 48M: 12M 48M
13M 53M: 13M 52M: 13M 53M
13M 50M: 12M 50M: 13M 50M
13M 52M: 13M 52M: 13M 52M
12M 47M: 12M 46M: 12M 46M
13M 52M: 13M 52M: 13M 52M
-----------------------------------------------------------------------
(I truncated the output--the rest looks the same)
Note how the I/O starts out fully as writes, but then continues with
many reads. I am fairly sure this is RAID-5 read-modify-write due to
misaligned writes.
The default chunk size is 512K:
-----------------------------------------------------------------------
$ sudo mdadm --detail /dev/md10 | grep Chunk
Chunk Size : 512K
$ sudo blkid -i /dev/md10
/dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
-----------------------------------------------------------------------
I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I
would expect this to be 1024K (due to two data disks in a three-disk
RAID-5).
Translating into 512-byte sectors, I think the topology should be:
chunk size (sunit): 1024 sectors
stripe size (swidth): 2048 sectors
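As a sanity check, the translation above can be reproduced with shell arithmetic (assumed geometry from the mdadm output: 512 KiB chunk, 3 disks, one parity chunk per stripe):

```shell
# Assumed geometry: 512 KiB chunk, 3-disk RAID-5 (2 data chunks per stripe).
chunk_bytes=$((512 * 1024))
data_disks=2

echo "sunit  = $((chunk_bytes / 512)) sectors"                  # 1024
echo "swidth = $((chunk_bytes / 512 * data_disks)) sectors"     # 2048
echo "expected OPTIMAL_IO_SIZE = $((chunk_bytes * data_disks))" # 1048576
```

The last line is what blkid should have reported: 1048576 bytes (1024K), not 279969792.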
I can see the write alignment with blktrace.
-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
9,10 15 1 0.000000000 186548 Q WS 3829760 + 2048 [dd]
9,10 15 3 0.021087119 186548 Q WS 3831808 + 2048 [dd]
9,10 15 5 0.023605705 186548 Q WS 3833856 + 2048 [dd]
9,10 15 7 0.026093572 186548 Q WS 3835904 + 2048 [dd]
9,10 15 9 0.028595887 186548 Q WS 3837952 + 2048 [dd]
9,10 15 11 0.031171221 186548 Q WS 3840000 + 2048 [dd]
[...]
9,10 5 441 14.601942400 186608 Q WS 8082432 + 2048 [dd]
9,10 5 443 14.620316654 186608 Q WS 8084480 + 2048 [dd]
9,10 5 445 14.646707430 186608 Q WS 8086528 + 2048 [dd]
9,10 5 447 14.654519976 186608 Q WS 8088576 + 2048 [dd]
9,10 5 449 14.680901605 186608 Q WS 8090624 + 2048 [dd]
9,10 5 451 14.689156421 186608 Q WS 8092672 + 2048 [dd]
9,10 5 453 14.706529362 186608 Q WS 8094720 + 2048 [dd]
9,10 5 455 14.732451407 186608 Q WS 8096768 + 2048 [dd]
-----------------------------------------------------------------------
In the beginning, the queued writes are stripe-aligned. For example:
3829760 / 2048 == 1870
Later on, writes end up getting misaligned by half a stripe. For example:
8082432 / 2048 == 3946.5
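The two checks above reduce to a modulo: the offset within the stripe, in sectors, is the LBA modulo swidth (2048 sectors here):

```shell
# Offset of each queued write within the 2048-sector stripe.
echo $((3829760 % 2048))   # 0    -> stripe-aligned
echo $((8082432 % 2048))   # 1024 -> misaligned by one chunk (half a stripe)
```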
I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs,
but that had pretty much the same behavior when writing (the RMW starts
later, but it still starts).
Am I doing something wrong, or is there a bug, or are my expectations
incorrect? I had expected that large sequential writes would be aligned
with swidth.
Thank you,
Corey
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04  5:44 read-modify-write occurring for direct I/O on RAID-5 Corey Hickey
@ 2023-08-04  8:07 ` Dave Chinner
  2023-08-04 19:26   ` Corey Hickey
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-08-04 8:07 UTC (permalink / raw)
To: Corey Hickey; +Cc: linux-xfs

On Thu, Aug 03, 2023 at 10:44:31PM -0700, Corey Hickey wrote:
> Hello,
>
> I am having a problem with write performance via direct I/O. My setup is:
> * Debian Sid
> * Linux 6.3.0-2 (Debian Kernel)
> * 3-disk MD RAID-5 of hard disks
> * XFS
>
> When I do large sequential writes via direct I/O, sometimes the writes are
> fast, but sometimes the RAID ends up doing RMW and performance gets slow.
>
> If I use regular buffered I/O, then performance is better, presumably due to
> the MD stripe cache. I could just use buffered writes, of course, but I am
> really trying to make sure I get the alignment correct to start with.
>
> I can reproduce the problem on a fresh RAID.
> -----------------------------------------------------------------------
> $ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
> mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
> Continue creating array? y
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md10 started.
> -----------------------------------------------------------------------
> For testing, I'm using "-z 30G" to limit the duration of the initial RAID
> resync.
>
> For XFS I can use default options:
> -----------------------------------------------------------------------
> $ sudo mkfs.xfs /dev/md10
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10  isize=512   agcount=16, agsize=983040 blks

So an AG size of just under 2GB.

>          =           sectsz=512  attr=2, projid32bit=1
>          =           crc=1       finobt=1, sparse=1, rmapbt=0
>          =           reflink=1   bigtime=1 inobtcount=1 nrext64=0
> data     =           bsize=4096  blocks=15728640, imaxpct=25
>          =           sunit=128   swidth=68352 blks
                                   ^^^^^^^^^^^^^^^^^

Something is badly broken in MD land.

.....

> The default chunk size is 512K
> -----------------------------------------------------------------------
> $ sudo mdadm --detail /dev/md10 | grep Chunk
>       Chunk Size : 512K
> $ sudo blkid -i /dev/md10
> /dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yup, that's definitely broken.

> PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
> -----------------------------------------------------------------------
> I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would
> expect this to be 1024K (due to two data disks in a three-disk RAID-5).

Yup, it's broken. :/

> Translating into 512-byte sectors, I think the topology should be:
> chunk size (sunit): 1024 sectors
> stripe size (swidth): 2048 sectors

Yup, or as it reports from mkfs, sunit=128 fsbs, swidth=256 fsbs.

> -----------------------------------------------------------------------
> $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
>   9,10  15   1   0.000000000 186548  Q  WS 3829760 + 2048 [dd]
>   9,10  15   3   0.021087119 186548  Q  WS 3831808 + 2048 [dd]
>   9,10  15   5   0.023605705 186548  Q  WS 3833856 + 2048 [dd]
>   9,10  15   7   0.026093572 186548  Q  WS 3835904 + 2048 [dd]
>   9,10  15   9   0.028595887 186548  Q  WS 3837952 + 2048 [dd]
>   9,10  15  11   0.031171221 186548  Q  WS 3840000 + 2048 [dd]
> [...]
>   9,10   5 441  14.601942400 186608  Q  WS 8082432 + 2048 [dd]
>   9,10   5 443  14.620316654 186608  Q  WS 8084480 + 2048 [dd]
>   9,10   5 445  14.646707430 186608  Q  WS 8086528 + 2048 [dd]
>   9,10   5 447  14.654519976 186608  Q  WS 8088576 + 2048 [dd]
>   9,10   5 449  14.680901605 186608  Q  WS 8090624 + 2048 [dd]
>   9,10   5 451  14.689156421 186608  Q  WS 8092672 + 2048 [dd]
>   9,10   5 453  14.706529362 186608  Q  WS 8094720 + 2048 [dd]
>   9,10   5 455  14.732451407 186608  Q  WS 8096768 + 2048 [dd]
> -----------------------------------------------------------------------
> In the beginning, writes queued are stripe-aligned. For example:
> 3829760 / 2048 == 1870
>
> Later on, writes end up getting misaligned by half a stripe. For example:
> 8082432 / 2048 == 3946.5

So it's aligned to sunit, not swidth. That will match up with a
discontiguity in the file layout, i.e. an extent boundary.

And given this is at just under 4GB written, and the AG size is
just under 2GB, this discontiguity is going to occur as writing
fills AG 1 and allocation switches to AG 2.

> I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but
> that had pretty much the same behavior when writing (the RMW starts later,
> but it still starts).

It won't change anything, actually. The first allocation in an AG
will determine which stripe unit the new extent starts on, and then
for the entire AG the write will be aligned to that choice.

If you do IOs much larger than the stripe width (e.g. 16MB at a
time) the impact of the head/tail RMW will largely go away. The
problem is that you are doing exactly stripe width sized IOs, so
this is the worst case for any allocation misalignment that might
occur.

> Am I doing something wrong, or is there a bug, or are my expectations
> incorrect? I had expected that large sequential writes would be aligned with
> swidth.

Expectations are wrong. Large allocations are aligned to stripe unit
in XFS by default.

This is because XFS was tuned for *large* multi-layer RAID setups
like RAID-50 that had hardware RAID 5 luns striped together via
RAID-0 in the volume manager. In these setups, the stripe unit is
the hardware RAID-5 lun stripe width (the minimum size that avoids
RMW) and the stripe width is the RAID-0 width.

Hence for performance, it didn't matter which sunit allocation
aligned to as long as writes spanned the entire stripe width. That
way they would hit every lun.

In general, we don't want stripe width aligned allocation, because
that hot-spots the first stripe unit in the stripe as all file data
first writes to that unit. A raid stripe is only as fast as its
slowest disk, and so having a hot stripe unit slows everything down.
Hence by default we move the initial allocation around the stripe
units, and that largely removes the hotspots in the RAID luns...

So, yeah, there are good reasons for stripe unit aligned allocation
rather than stripe width aligned.

The problem is that MD has never behaved this way - it has always
exposed its individual disk chunk size as the minimum IO size (i.e.
the stripe unit) and the stripe width as the optimal IO size to
avoid RMW cycles.

If you want to force XFS to do stripe width aligned allocation for
large files to match with how MD exposes its topology to
filesystems, use the 'swalloc' mount option. The down side is that
you'll hotspot the first disk in the MD array....

-Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
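[Editorial note: the hotspot argument above can be sketched with a toy model. This is illustrative only, not XFS's allocator: with stripe-width-aligned allocation ('swalloc'-style) every file's first chunk lands on stripe unit 0, while rotating sunit-aligned allocation spreads first chunks across the data units.]

```shell
# Toy model: place the first chunk of 100 files on a 2-data-unit stripe
# (3-disk RAID-5). Rotating placement alternates units; swalloc-style
# placement would put all 100 on unit 0.
data_units=2
files=100

unit0=0; unit1=0
i=0
while [ "$i" -lt "$files" ]; do
    unit=$((i % data_units))    # rotating sunit-aligned placement
    if [ "$unit" -eq 0 ]; then unit0=$((unit0 + 1)); else unit1=$((unit1 + 1)); fi
    i=$((i + 1))
done
echo "rotating: unit0=$unit0 unit1=$unit1"   # 50/50 -> no hotspot
echo "swalloc:  unit0=$files unit1=0"        # all on unit 0 -> hotspot
```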
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04  8:07 ` Dave Chinner
@ 2023-08-04 19:26   ` Corey Hickey
  2023-08-04 21:52     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2023-08-04 19:26 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-04 01:07, Dave Chinner wrote:
>> =          sunit=128   swidth=68352 blks
>                         ^^^^^^^^^^^^^^^^^
>
> Something is badly broken in MD land.
>
> .....
>
>> The default chunk size is 512K
>> -----------------------------------------------------------------------
>> $ sudo mdadm --detail /dev/md10 | grep Chunk
>>       Chunk Size : 512K
>> $ sudo blkid -i /dev/md10
>> /dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
>                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Yup, that's definitely broken.
>
>> PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
>> -----------------------------------------------------------------------
>> I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would
>> expect this to be 1024K (due to two data disks in a three-disk RAID-5).
>
> Yup, it's broken. :/

For what it's worth, this test was on older disks:
* 2 TB Seagate Constellation ES.2
* running in an external USB enclosure

If I use newer disks:
* 12 TB Toshiba N300
* hooked up via internal SATA

...then I see the expected OPTIMAL_IO_SIZE.

Maybe the issue is due to the USB enclosure or due to the older disks
having 512-byte physical sectors. I don't know what other differences
could be relevant.

>> Later on, writes end up getting misaligned by half a stripe. For example:
>> 8082432 / 2048 == 3946.5
>
> So it's aligned to sunit, not swidth. That will match up with a
> discontiguity in the file layout, i.e. an extent boundary.
>
> And given this is at just under 4GB written, and the AG size is
> just under 2GB, this discontiguity is going to occur as writing
> fills AG 1 and allocation switches to AG 2.

Thanks. I figured I was seeing something like that, but I didn't know
the details.

>> I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but
>> that had pretty much the same behavior when writing (the RMW starts later,
>> but it still starts).
>
> It won't change anything, actually. The first allocation in an AG
> will determine which stripe unit the new extent starts on, and then
> for the entire AG the write will be aligned to that choice.
>
> If you do IOs much larger than the stripe width (e.g. 16MB at a
> time) the impact of the head/tail RMW will largely go away. The
> problem is that you are doing exactly stripe width sized IOs, so
> this is the worst case for any allocation misalignment that might
> occur.

Thank you, yes, I have seen that behavior in testing.

>> Am I doing something wrong, or is there a bug, or are my expectations
>> incorrect? I had expected that large sequential writes would be aligned
>> with swidth.
>
> Expectations are wrong. Large allocations are aligned to stripe unit
> in XFS by default.
>
> This is because XFS was tuned for *large* multi-layer RAID setups
> like RAID-50 that had hardware RAID 5 luns striped together via
> RAID-0 in the volume manager.
> In these setups, the stripe unit is the hardware RAID-5 lun stripe
> width (the minimum size that avoids RMW) and the stripe width is the
> RAID-0 width.
>
> Hence for performance, it didn't matter which sunit allocation
> aligned to as long as writes spanned the entire stripe width. That
> way they would hit every lun.

That is very interesting and definitely makes sense.

> In general, we don't want stripe width aligned allocation, because
> that hot-spots the first stripe unit in the stripe as all file data
> first writes to that unit. A raid stripe is only as fast as its
> slowest disk, and so having a hot stripe unit slows everything down.
> Hence by default we move the initial allocation around the stripe
> units, and that largely removes the hotspots in the RAID luns...

That makes sense. So the data allocation alignment controls the
alignment of the writes. I wasn't quite making that connection before.

> So, yeah, there are good reasons for stripe unit aligned allocation
> rather than stripe width aligned.
>
> The problem is that MD has never behaved this way - it has always
> exposed its individual disk chunk size as the minimum IO size (i.e.
> the stripe unit) and the stripe width as the optimal IO size to
> avoid RMW cycles.
>
> If you want to force XFS to do stripe width aligned allocation for
> large files to match with how MD exposes its topology to
> filesystems, use the 'swalloc' mount option. The down side is that
> you'll hotspot the first disk in the MD array....

If I use 'swalloc' with the autodetected (wrong) swidth, I don't see
any unaligned writes.

If I manually specify the (I think) correct values, I do still get
writes aligned to sunit but not swidth, as before.

-----------------------------------------------------------------------
$ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
mkfs.xfs: Specified data stripe width 2048 is not the same as the volume
stripe width 546816
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md10     isize=512    agcount=16, agsize=982912 blks
         =              sectsz=512   attr=2, projid32bit=1
         =              crc=1        finobt=1, sparse=1, rmapbt=0
         =              reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =              bsize=4096   blocks=15726592, imaxpct=25
         =              sunit=128    swidth=256 blks
naming   =version 2     bsize=4096   ascii-ci=0, ftype=1
log      =internal log  bsize=4096   blocks=16384, version=2
         =              sectsz=512   sunit=8 blks, lazy-count=1
realtime =none          extsz=4096   blocks=0, rtextents=0

$ sudo mount -o swalloc /dev/md10 /mnt/tmp
-----------------------------------------------------------------------

There's probably something else I'm doing wrong there.

Still, I'll heed your advice about not making a hotspot disk and allow
XFS to allocate as default.

Now that I understand that XFS is behaving as intended and I
can't/shouldn't necessarily aim for further alignment, I'll try
recreating my real RAID, trust in buffered writes and the MD stripe
cache, and see how that goes.

Thank you very much for your detailed answers; I learned a lot.

-Corey

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04 19:26 ` Corey Hickey
@ 2023-08-04 21:52   ` Dave Chinner
  2023-08-05  1:44     ` Corey Hickey
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-08-04 21:52 UTC (permalink / raw)
To: Corey Hickey; +Cc: linux-xfs

On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> On 2023-08-04 01:07, Dave Chinner wrote:
> > If you want to force XFS to do stripe width aligned allocation for
> > large files to match with how MD exposes its topology to
> > filesystems, use the 'swalloc' mount option. The down side is that
> > you'll hotspot the first disk in the MD array....
>
> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> unaligned writes.
>
> If I manually specify the (I think) correct values, I do still get writes
> aligned to sunit but not swidth, as before.

Hmmm, it should not be doing that - where is the misalignment
happening in the file? swalloc isn't widely used/tested, so there's
every chance there's something unexpected going on in the code...

> -----------------------------------------------------------------------
> $ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
> mkfs.xfs: Specified data stripe width 2048 is not the same as the volume
> stripe width 546816
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10     isize=512    agcount=16, agsize=982912 blks
>          =              sectsz=512   attr=2, projid32bit=1
>          =              crc=1        finobt=1, sparse=1, rmapbt=0
>          =              reflink=1    bigtime=1 inobtcount=1 nrext64=0
> data     =              bsize=4096   blocks=15726592, imaxpct=25
>          =              sunit=128    swidth=256 blks
> naming   =version 2     bsize=4096   ascii-ci=0, ftype=1
> log      =internal log  bsize=4096   blocks=16384, version=2
>          =              sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none          extsz=4096   blocks=0, rtextents=0
>
> $ sudo mount -o swalloc /dev/md10 /mnt/tmp
> -----------------------------------------------------------------------
>
> There's probably something else I'm doing wrong there.

Looks sensible, but it's likely still tripping over some non-obvious
corner case in the allocation code. The allocation code is not
simple (allocation alone has roughly 20 parameters that determine
behaviour), especially with all the alignment setup stuff done
before we even get to the allocation code...

One thing to try is to set extent size hints for the directories
these large files are going to be written to. That takes a lot of
the allocation decisions away from the size/shape of the individual
IO and instead does large file offset aligned/sized allocations
which are much more likely to be stripe width aligned. e.g. set an
extent size hint of 16MB, and the first write into a hole will
allocate a 16MB chunk around the write instead of just the size that
covers the write IO.

> Still, I'll heed your advice about not making a hotspot disk and allow XFS
> to allocate as default.
>
> Now that I understand that XFS is behaving as intended and I can't/shouldn't
> necessarily aim for further alignment, I'll try recreating my real RAID,
> trust in buffered writes and the MD stripe cache, and see how that goes.

Buffered writes won't guarantee you alignment, either. In fact, it's
much more likely to do weird stuff than direct IO. If your
filesystem is empty, then buffered writes can look *really good*,
but once the filesystem starts being used and has lots of
discontiguous free space or the system is busy enough that writeback
can't lock contiguous ranges of pages, writeback IO will look a
whole lot less pretty and you have little control over what it
does....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04 21:52 ` Dave Chinner
@ 2023-08-05  1:44   ` Corey Hickey
  2023-08-05 22:37     ` Dave Chinner
  2023-08-06 18:54     ` Corey Hickey
  0 siblings, 2 replies; 9+ messages in thread
From: Corey Hickey @ 2023-08-05 1:44 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-04 14:52, Dave Chinner wrote:
> On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
>> On 2023-08-04 01:07, Dave Chinner wrote:
>>> If you want to force XFS to do stripe width aligned allocation for
>>> large files to match with how MD exposes its topology to
>>> filesystems, use the 'swalloc' mount option. The down side is that
>>> you'll hotspot the first disk in the MD array....
>>
>> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
>> unaligned writes.
>>
>> If I manually specify the (I think) correct values, I do still get writes
>> aligned to sunit but not swidth, as before.
>
> Hmmm, it should not be doing that - where is the misalignment
> happening in the file? swalloc isn't widely used/tested, so there's
> every chance there's something unexpected going on in the code...

I don't know how to tell the file position, but I wrote a one-liner for
blktrace that may help. This should tell the position within the block
device of writes enqueued.

For every time the alignment _changes_, the awk program prints:
* the previous line (if it exists and was not already printed)
* the current line

Lines from blktrace are prefixed by:
* a 'c' or 'p' for debugging the awk program
* the offset from a 2048-sector alignment
* a '--' as a separator

I have manually inserted blank lines into the output in order to
visually separate it into three sections:
1. writes predominantly stripe-aligned
2. writes predominantly offset by one chunk
3. writes predominantly stripe-aligned again

-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | awk '
    BEGIN { prev=""; prev_offset=-1; }
    / Q / {
        offset=$8 % 2048;
        if (offset != prev_offset) {
            if (prev) { printf("p %4d -- %s\n", prev_offset, prev); prev="" };
            printf("c %4d -- %s\n", offset, $0);
            prev_offset=offset;
            fflush();
        } else { prev=$0 }
    }'
c   32 --  9,10  11    1     0.000000000 213852  Q  RM 32 + 8 [dd]
c   24 --  9,10  11    2     0.000253462 213852  Q  RM 24 + 8 [dd]
c 1024 --  9,10  11    3     0.000434115 213852  Q  RM 1024 + 32 [dd]
c    3 --  9,10  11    4     0.001008057 213852  Q  RM 3 + 1 [dd]
c   16 --  9,10  11    5     0.001165978 213852  Q  RM 16 + 8 [dd]
c    8 --  9,10  11    6     0.001328206 213852  Q  RM 8 + 8 [dd]
c    0 --  9,10  11    7     0.001496647 213852  Q  WS 2048 + 2048 [dd]
p    0 --  9,10   1  469    10.544416303 213852  Q  WS 6301696 + 2048 [dd]
c  128 --  9,10   1  471    10.545831615 213789  Q  FWFSM 62906496 + 64 [kworker/1:3]
c    0 --  9,10   1  472    10.548127201 213852  Q  WS 6303744 + 2048 [dd]

p    0 --  9,10   0 5791    13.109985396 213852  Q  WS 7804928 + 2048 [dd]
c 1027 --  9,10   0 5793    13.113192558 213852  Q  RM 7863299 + 1 [dd]
c 1040 --  9,10   0 5794    13.136165405 213852  Q  RM 7863312 + 8 [dd]
c 1032 --  9,10   0 5795    13.136458182 213852  Q  RM 7863304 + 8 [dd]
c 1024 --  9,10   0 5796    13.136568992 213852  Q  WS 7865344 + 2048 [dd]
p 1024 --  9,10   1 2818    41.250430374 213852  Q  WS 12133376 + 2048 [dd]
c  192 --  9,10   1 2820    41.266187726 213789  Q  FWFSM 62906560 + 64 [kworker/1:3]
c 1024 --  9,10   1 2821    41.275578120 213852  Q  WS 12135424 + 2048 [dd]
c    2 --  9,10   5    1    41.266226029 213819  Q  WM 2 + 1 [xfsaild/md10]
c   24 --  9,10   5    2    41.266236639 213819  Q  WM 24 + 8 [xfsaild/md10]
c   32 --  9,10   5    3    41.266242160 213819  Q  WM 32 + 8 [xfsaild/md10]
c 1024 --  9,10   5    4    41.266246318 213819  Q  WM 1024 + 32 [xfsaild/md10]
p 1024 --  9,10   1 2823    41.308444405 213852  Q  WS 12137472 + 2048 [dd]
c  256 --  9,10  10  706    41.322338854 207685  Q  FWFSM 62906624 + 64 [kworker/u64:11]
c 1024 --  9,10   1 2825    41.334778677 213852  Q  WS 12139520 + 2048 [dd]

p 1024 --  9,10   3 3739    64.424114908 213852  Q  WS 15668224 + 2048 [dd]
c    3 --  9,10   3 3741    64.445830212 213852  Q  RM 15726595 + 1 [dd]
c   16 --  9,10   3 3742    64.455104423 213852  Q  RM 15726608 + 8 [dd]
c    8 --  9,10   3 3743    64.463494822 213852  Q  RM 15726600 + 8 [dd]
c    0 --  9,10   3 3744    64.470414156 213852  Q  WS 15728640 + 2048 [dd]
p    0 --  9,10   1 6911    71.983449607 213852  Q  WS 20101120 + 2048 [dd]
c  320 --  9,10   1 6913    71.985823522 213789  Q  FWFSM 62906688 + 64 [kworker/1:3]
c    0 --  9,10   1 6914    71.987115410 213852  Q  WS 20103168 + 2048 [dd]
c    1 --  9,10   5    6    71.985857777 213819  Q  WM 1 + 1 [xfsaild/md10]
c    8 --  9,10   5    7    71.985869209 213819  Q  WM 8 + 8 [xfsaild/md10]
c   16 --  9,10   5    8    71.985874249 213819  Q  WM 16 + 8 [xfsaild/md10]
c    0 --  9,10   1 6916    72.002414341 213852  Q  WS 20105216 + 2048 [dd]
p    0 --  9,10   1 6924    72.041196270 213852  Q  WS 20113408 + 2048 [dd]
c  384 --  9,10   4    1    72.041820949 211757  Q  FWFSM 62906752 + 64 [kworker/u64:1]
c    0 --  9,10   1 6926    72.043596586 213852  Q  WS 20115456 + 2048 [dd]
-----------------------------------------------------------------------

I don't know if that's quite what you wanted, but hopefully it helps
for something.

> One thing to try is to set extent size hints for the directories
> these large files are going to be written to. That takes a lot of
> the allocation decisions away from the size/shape of the individual
> IO and instead does large file offset aligned/sized allocations
> which are much more likely to be stripe width aligned. e.g. set an
> extent size hint of 16MB, and the first write into a hole will
> allocate a 16MB chunk around the write instead of just the size that
> covers the write IO.

Can you please give me a documentation pointer for that? I wasn't able
to find the right thing via searching. I see some references to size
hints in mkfs.xfs, but it seems like you refer to something to be set
for specific directories at run-time.
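[Editorial note: for reference, the run-time mechanism Dave describes is xfs_io's `extsize` command (documented in xfs_io(8)); a hint set on a directory is inherited by files subsequently created in it. The path below is hypothetical. A sketch:]

```shell
# Hypothetical path; requires xfsprogs. Set a 16 MiB extent size hint on
# a directory so that new files created inside it inherit the hint.
xfs_io -c "extsize 16m" /mnt/tmp/bigfiles

# Read the hint back to verify.
xfs_io -c "extsize" /mnt/tmp/bigfiles
```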
>> Still, I'll heed your advice about not making a hotspot disk and allow XFS
>> to allocate as default.
>>
>> Now that I understand that XFS is behaving as intended and I can't/shouldn't
>> necessarily aim for further alignment, I'll try recreating my real RAID,
>> trust in buffered writes and the MD stripe cache, and see how that goes.
>
> Buffered writes won't guarantee you alignment, either. In fact, it's
> much more likely to do weird stuff than direct IO. If your
> filesystem is empty, then buffered writes can look *really good*,
> but once the filesystem starts being used and has lots of
> discontiguous free space or the system is busy enough that writeback
> can't lock contiguous ranges of pages, writeback IO will look a
> whole lot less pretty and you have little control over what
> it does....

I'll keep that in mind. This filesystem doesn't get extensive writes
except when restoring from backup.

That is why I started looking at alignment, though--restoring from
backup onto a new array with new disks was incurring lots of RMW, reads
were very delayed, and the kernel was warning about hung tasks. It
probably didn't help that my RAID-5 was degraded due to a failed disk I
had to return. I audited my alignment choices anyway and found some
things I could do better, but I got stuck on XFS, hence this thread.

My intended full stack is:
* RAID-5
* bcache (default settings--writethrough)
* dm-crypt
* XFS

...and I've operated that before without noticing anything so bad.

The alignment gets tricky, especially because bcache has a fixed
default data offset and doesn't quite propagate the topology of the
underlying backing device.

$ sudo blkid -i /dev/md5
/dev/md5: MINIMUM_IO_SIZE="131072" OPTIMAL_IO_SIZE="262144"
PHYSICAL_SECTOR_SIZE="4096" LOGICAL_SECTOR_SIZE="512"
$ sudo blkid -i /dev/bcache0
/dev/bcache0: MINIMUM_IO_SIZE="512" OPTIMAL_IO_SIZE="262144"
PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"

Some of that makes sense for a writeback scenario, but I think for
writethrough I want to align to the topology of the underlying backing
device.

Thanks again for all your time.

-Corey

^ permalink raw reply	[flat|nested] 9+ messages in thread
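[Editorial note: the alignment concern above can be checked numerically. The 16-sector (8 KiB) default data offset below is an assumption about bcache's on-disk format, not a value from this thread; read the real value from your superblock (e.g. `bcache-super-show` on the backing device) before relying on it.]

```shell
# Assumed bcache default data offset: 16 sectors = 8 KiB. The moduli use
# the MINIMUM_IO_SIZE/OPTIMAL_IO_SIZE reported for /dev/md5 above.
data_offset_bytes=$((16 * 512))
min_io=131072
opt_io=262144

echo $((data_offset_bytes % min_io))   # nonzero -> chunk-misaligned
echo $((data_offset_bytes % opt_io))   # nonzero -> stripe-misaligned
```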
* Re: read-modify-write occurring for direct I/O on RAID-5 2023-08-05 1:44 ` Corey Hickey @ 2023-08-05 22:37 ` Dave Chinner 2023-08-06 18:21 ` Corey Hickey 2023-08-06 18:54 ` Corey Hickey 1 sibling, 1 reply; 9+ messages in thread From: Dave Chinner @ 2023-08-05 22:37 UTC (permalink / raw) To: Corey Hickey; +Cc: linux-xfs On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote: > On 2023-08-04 14:52, Dave Chinner wrote: > > On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote: > > > On 2023-08-04 01:07, Dave Chinner wrote: > > > > If you want to force XFS to do stripe width aligned allocation for > > > > large files to match with how MD exposes it's topology to > > > > filesytsems, use the 'swalloc' mount option. The down side is that > > > > you'll hotspot the first disk in the MD array.... > > > > > > If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any > > > unaligned writes. > > > > > > If I manually specify the (I think) correct values, I do still get writes > > > aligned to sunit but not swidth, as before. > > > > Hmmm, it should not be doing that - where is the misalignment > > happening in the file? swalloc isn't widely used/tested, so there's > > every chance there's something unexpected going on in the code... > > I don't know how to tell the file position, but I wrote a one-liner for > blktrace that may help. This should tell the position within the block > device of writes enqueued. xfs_bmap will tell you the file extent layout (offset to lba relationship). (`xfs_bmap -vvp <file>` output is prefered if you are going to paste it into an email.) 
> For every time the alignment _changes_, the awk program prints:
> * the previous line (if it exists and was not already printed)
> * the current line
>
> Lines from blktrace are prefixed by:
> * a 'c' or 'p' for debugging the awk program
> * the offset from a 2048-sector alignment
> * a '--' as a separator
>
> I have manually inserted blank lines into the output in order to
> visually separate into three sections:
> 1. writes predominantly stripe-aligned
> 2. writes predominantly offset by one chunk
> 3. writes predominantly stripe-aligned again
>
> -----------------------------------------------------------------------
> $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | awk 'BEGIN { prev=""; prev_offset=-1; } / Q / { offset=$8 % 2048; if (offset != prev_offset) { if (prev) { printf("p %4d -- %s\n", prev_offset, prev); prev="" }; printf("c %4d -- %s\n", offset, $0); prev_offset=offset; fflush(); } else { prev=$0 }} '
> c   32 -- 9,10   11    1     0.000000000 213852  Q  RM 32 + 8 [dd]
> c   24 -- 9,10   11    2     0.000253462 213852  Q  RM 24 + 8 [dd]

inobt + finobt metadata reads.

> c 1024 -- 9,10   11    3     0.000434115 213852  Q  RM 1024 + 32 [dd]

Inode cluster read.

> c    3 -- 9,10   11    4     0.001008057 213852  Q  RM 3 + 1 [dd]

AGFL read.

> c   16 -- 9,10   11    5     0.001165978 213852  Q  RM 16 + 8 [dd]
> c    8 -- 9,10   11    6     0.001328206 213852  Q  RM 8 + 8 [dd]

AG freespace btree block reads.

<inode now allocated>

> c    0 -- 9,10   11    7     0.001496647 213852  Q  WS 2048 + 2048 [dd]

Data writes.

> p    0 -- 9,10    1  469    10.544416303 213852  Q  WS 6301696 + 2048 [dd]
> c  128 -- 9,10    1  471    10.545831615 213789  Q FWFSM 62906496 + 64 [kworker/1:3]
> c    0 -- 9,10    1  472    10.548127201 213852  Q  WS 6303744 + 2048 [dd]

Seek for journal IO between two sequential, contiguous data writes.
> p    0 -- 9,10    0  5791    13.109985396 213852  Q  WS 7804928 + 2048 [dd]
> c 1027 -- 9,10    0  5793    13.113192558 213852  Q  RM 7863299 + 1 [dd]
> c 1040 -- 9,10    0  5794    13.136165405 213852  Q  RM 7863312 + 8 [dd]
> c 1032 -- 9,10    0  5795    13.136458182 213852  Q  RM 7863304 + 8 [dd]

Data write at the tail end of the AG, followed by reads of the AGF and
AG freespace btree blocks in the next AG...

> c 1024 -- 9,10    0  5796    13.136568992 213852  Q  WS 7865344 + 2048 [dd]

... and the data write continues, but I don't think that is aligned.

$ echo $(((7865344 / 2048) * 2048))
7864320
$

Yeah, so if that was aligned, it would start at LBA 7864320, not 7865344.

> p 1024 -- 9,10    1  2818    41.250430374 213852  Q  WS 12133376 + 2048 [dd]
> c  192 -- 9,10    1  2820    41.266187726 213789  Q FWFSM 62906560 + 64 [kworker/1:3]
> c 1024 -- 9,10    1  2821    41.275578120 213852  Q  WS 12135424 + 2048 [dd]

Journal IO breaking up two unaligned contiguous data writes.

> c    2 -- 9,10    5    1    41.266226029 213819  Q  WM 2 + 1 [xfsaild/md10]
> c   24 -- 9,10    5    2    41.266236639 213819  Q  WM 24 + 8 [xfsaild/md10]
> c   32 -- 9,10    5    3    41.266242160 213819  Q  WM 32 + 8 [xfsaild/md10]
> c 1024 -- 9,10    5    4    41.266246318 213819  Q  WM 1024 + 32 [xfsaild/md10]

Metadata writeback of the AGI 0, inobt, finobt and inode cluster blocks.

> p 1024 -- 9,10    1  2823    41.308444405 213852  Q  WS 12137472 + 2048 [dd]
> c  256 -- 9,10   10   706    41.322338854 207685  Q FWFSM 62906624 + 64 [kworker/u64:11]
> c 1024 -- 9,10    1  2825    41.334778677 213852  Q  WS 12139520 + 2048 [dd]

Journal IO.

> p 1024 -- 9,10    3  3739    64.424114908 213852  Q  WS 15668224 + 2048 [dd]
> c    3 -- 9,10    3  3741    64.445830212 213852  Q  RM 15726595 + 1 [dd]
> c   16 -- 9,10    3  3742    64.455104423 213852  Q  RM 15726608 + 8 [dd]
> c    8 -- 9,10    3  3743    64.463494822 213852  Q  RM 15726600 + 8 [dd]

Next AG. So the entire AG was written unaligned - that is expected,
because this is appending and that aims for contiguous allocation, not
aligned allocation.
> c    0 -- 9,10    3  3744    64.470414156 213852  Q  WS 15728640 + 2048 [dd]

And the first allocation in the next AG is properly aligned.

Ok. So it appears that something is not working 100% w.r.t. aligned
allocation on the transition from one AG to the next. I wonder if we've
failed the "at EOF" allocation because there isn't space in the AG, and
then done an "any AG" unaligned allocation as the fallback?

I'll have to see if I can replicate this, now that I know it is the
"full AG -> first allocation in next AG" fallback that appears to be
going astray....

> > One thing to try is to set extent size hints for the directories
> > these large files are going to be written to. That takes a lot of
> > the allocation decisions away from the size/shape of the individual
> > IO and instead does large file offset aligned/sized allocations
> > which are much more likely to be stripe width aligned. e.g. set an
> > extent size hint of 16MB, and the first write into a hole will
> > allocate a 16MB chunk around the write instead of just the size that
> > covers the write IO.
>
> Can you please give me a documentation pointer for that? I wasn't able
> to find the right thing via searching.

$ man 2 ioctl_xfs_fsgetxattr
....
       fsx_extsize is the preferred extent allocation size for data
       blocks mapped to this file, in units of filesystem blocks. If
       this value is zero, the filesystem will choose a default option,
       which is currently zero. If XFS_IOC_FSSETXATTR is called with
       XFS_XFLAG_EXTSIZE set in fsx_xflags and this field set to zero,
       the XFLAG will also be cleared.
....
       XFS_XFLAG_EXTSIZE
              Extent size bit - if a basic extent size value is set on
              the file then the allocator will allocate in multiples of
              the set size for this file (see fsx_extsize below). The
              extent size can only be changed on a file when it has no
              allocated extents.
....

$ man xfs_io
....
       extsize [ -R | -D ] [ value ]
              Display and/or modify the preferred extent size used when
              allocating space for the currently open file.
              If the -R option is specified, a recursive descent is
              performed for all directory entries below the currently
              open file (-D can be used to restrict the output to
              directories only).

              If the target file is a directory, then the inherited
              extent size is set for that directory (new files created
              in that directory inherit that extent size).

              The value should be specified in bytes, or using one of
              the usual units suffixes (k, m, g, b, etc). The extent
              size is always reported in units of bytes.
....

$ man mkfs.xfs
....
              extszinherit=value
                     All inodes created by mkfs.xfs will have this
                     extent size hint applied. The value must be
                     provided in units of filesystem blocks.

                     Directories will pass on this hint to newly
                     created regular files and directories.
....

> I see some references to size hints in mkfs.xfs, but it seems like you
> refer to something to be set for specific directories at run-time.

It's the same thing, just set up different ways.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-05 22:37           ` Dave Chinner
@ 2023-08-06 18:21             ` Corey Hickey
  2023-08-06 22:38               ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2023-08-06 18:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-05 15:37, Dave Chinner wrote:
> On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote:
>> On 2023-08-04 14:52, Dave Chinner wrote:
>>> On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
>>>> On 2023-08-04 01:07, Dave Chinner wrote:
>>>>> If you want to force XFS to do stripe width aligned allocation for
>>>>> large files to match with how MD exposes its topology to
>>>>> filesystems, use the 'swalloc' mount option. The down side is that
>>>>> you'll hotspot the first disk in the MD array....
>>>>
>>>> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
>>>> unaligned writes.
>>>>
>>>> If I manually specify the (I think) correct values, I do still get writes
>>>> aligned to sunit but not swidth, as before.
>>>
>>> Hmmm, it should not be doing that - where is the misalignment
>>> happening in the file? swalloc isn't widely used/tested, so there's
>>> every chance there's something unexpected going on in the code...
>>
>> I don't know how to tell the file position, but I wrote a one-liner for
>> blktrace that may help. This should tell the position within the block
>> device of writes enqueued.
>
> xfs_bmap will tell you the file extent layout (offset to lba relationship).
> (`xfs_bmap -vvp <file>` output is preferred if you are going to paste
> it into an email.)

Ah, nice; the flags even show the alignment.

Here are the results for a filesystem on a 2-data-disk RAID-5 with
128 KB chunk size.
$ sudo mkfs.xfs -s size=4096 -d sunit=256,swidth=512 /dev/md5 -f
meta-data=/dev/md5               isize=512    agcount=16, agsize=983008 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=15728128, imaxpct=25
         =                       sunit=32     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

$ sudo mount -o noatime,swalloc /dev/md5 /mnt/tmp

$ sudo dd if=/dev/zero of=/mnt/tmp/test.bin iflag=fullblock oflag=direct bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 62.6102 s, 171 MB/s

$ sudo xfs_bmap -vvp /mnt/tmp/test.bin
/mnt/tmp/test.bin:
 EXT: FILE-OFFSET           BLOCK-RANGE         AG AG-OFFSET         TOTAL FLAGS
   0: [0..7806975]:         512..7807487         0 (512..7807487)  7806976 000000
   1: [7806976..15613951]:  7864576..15671551    1 (512..7807487)  7806976 000011
   2: [15613952..20971519]: 15728640..21086207   2 (512..5358079)  5357568 000000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end on stripe width

>>> One thing to try is to set extent size hints for the directories
>>> these large files are going to be written to. That takes a lot of
>>> the allocation decisions away from the size/shape of the individual
>>> IO and instead does large file offset aligned/sized allocations
>>> which are much more likely to be stripe width aligned. e.g. set an
>>> extent size hint of 16MB, and the first write into a hole will
>>> allocate a 16MB chunk around the write instead of just the size that
>>> covers the write IO.
>>
>> Can you please give me a documentation pointer for that? I wasn't able
>> to find the right thing via searching.
> [...]
> $ man xfs_io
> ....
>        extsize [ -R | -D ] [ value ]
> [...]
Aha, thanks. That's what I was looking for.

-Corey

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-06 18:21             ` Corey Hickey
@ 2023-08-06 22:38               ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2023-08-06 22:38 UTC (permalink / raw)
To: Corey Hickey; +Cc: linux-xfs

On Sun, Aug 06, 2023 at 11:21:38AM -0700, Corey Hickey wrote:
> On 2023-08-05 15:37, Dave Chinner wrote:
> > On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote:
> > > On 2023-08-04 14:52, Dave Chinner wrote:
> > > > On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> > > > > On 2023-08-04 01:07, Dave Chinner wrote:
> > > > > > If you want to force XFS to do stripe width aligned allocation for
> > > > > > large files to match with how MD exposes its topology to
> > > > > > filesystems, use the 'swalloc' mount option. The down side is that
> > > > > > you'll hotspot the first disk in the MD array....
> > > > >
> > > > > If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> > > > > unaligned writes.
> > > > >
> > > > > If I manually specify the (I think) correct values, I do still get writes
> > > > > aligned to sunit but not swidth, as before.
> > > >
> > > > Hmmm, it should not be doing that - where is the misalignment
> > > > happening in the file? swalloc isn't widely used/tested, so there's
> > > > every chance there's something unexpected going on in the code...
> > >
> > > I don't know how to tell the file position, but I wrote a one-liner for
> > > blktrace that may help. This should tell the position within the block
> > > device of writes enqueued.
> >
> > xfs_bmap will tell you the file extent layout (offset to lba relationship).
> > (`xfs_bmap -vvp <file>` output is preferred if you are going to paste
> > it into an email.)
>
> Ah, nice; the flags even show the alignment.
>
> Here are the results for a filesystem on a 2-data-disk RAID-5 with 128 KB
> chunk size.
....
> $ sudo xfs_bmap -vvp /mnt/tmp/test.bin
> /mnt/tmp/test.bin:
>  EXT: FILE-OFFSET           BLOCK-RANGE         AG AG-OFFSET         TOTAL FLAGS
>    0: [0..7806975]:         512..7807487         0 (512..7807487)  7806976 000000
>    1: [7806976..15613951]:  7864576..15671551    1 (512..7807487)  7806976 000011
>    2: [15613952..20971519]: 15728640..21086207   2 (512..5358079)  5357568 000000

Thanks for that, I think it points out the problem quite clearly.

The stripe width allocation alignment looks to be working as intended -
the "AG-OFFSET" column has the same values in each extent, so within
the AG address space everything is correctly "stripe width" aligned.

What we see here is a mkfs.xfs "anti-hotspot" behaviour with striped
layouts. That is, it automagically sizes the AGs such that each AG
header sits on a different stripe unit within the stripe, so that the
AG headers don't all end up on the same physical stripe unit.

That results in the entire AG being aligned to the stripe unit rather
than the stripe width. And so when we do stripe width aligned
allocation within the AG, it assumes that the AG itself is stripe width
aligned, which it isn't....

So, if you were to do something like this:

# mkfs.xfs -d agsize=1048576b ....

to force the AG size to be a multiple of the stripe width, mkfs will
issue a warning that it is going to place all the AG headers on the
same stripe unit, but then go and do what you asked it to do.

That should work around the problem you are seeing; meanwhile I suspect
the swalloc mechanism might need a tweak to do physical LBA alignment,
not AG offset alignment....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-05  1:44         ` Corey Hickey
  2023-08-05 22:37           ` Dave Chinner
@ 2023-08-06 18:54           ` Corey Hickey
  1 sibling, 0 replies; 9+ messages in thread
From: Corey Hickey @ 2023-08-06 18:54 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-04 18:44, Corey Hickey wrote:
>> Buffered writes won't guarantee you alignment, either. In fact, it's
>> much more likely to do weird stuff than direct IO. If your
>> filesystem is empty, then buffered writes can look *really good*,
>> but once the filesystem starts being used and has lots of
>> discontiguous free space, or the system is busy enough that writeback
>> can't lock contiguous ranges of pages, writeback IO will look a
>> whole lot less pretty and you have little control over what
>> it does....
>
> I'll keep that in mind. This filesystem doesn't get extensive writes
> except when restoring from backup. That is why I started looking at
> alignment, though--restoring from backup onto a new array with new
> disks was incurring lots of RMW, reads were very delayed, and the
> kernel was warning about hung tasks.

I did some further testing and learned more. The root cause does not
seem to be excessive RMW--the root cause seems to be that the drives in
my new array do not handle the RMW nearly as well as the drives I had
used before.

Under different usage, I had previously noticed reduced performance on
"parallel" reads of the new drives as compared to my older drives,
though I didn't investigate further at the time.

I don't know a great way to test this--there's probably a better way
with fio or something. I wrote a small program to _roughly_ simulate
the non-sequential activity of a RAID-5 RMW. Mostly I just wanted to
induce lots of seeks over small intervals.

I see consistent results across different drives attached via different
cables to different SATA controllers. It's not just that I have one
malfunctioning component.
Differences in performance between runs are negligible, so I'm only
reporting one run of each test.

For 512 KB chunks, the Toshiba performs 11.5% worse.

----------------------------------------------------------------------
$ sudo ./rmw /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L "$((512 * 1024))" "$((2 * 1024))"
testing path: /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L buffer_size: 524288 count: 2048
1073741824 bytes in 34.633402 seconds: 29.6 MiB/sec

$ sudo ./rmw /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG "$((512 * 1024))" "$((2 * 1024))"
testing path: /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG buffer_size: 524288 count: 2048
1073741824 bytes in 39.147649 seconds: 26.2 MiB/sec
----------------------------------------------------------------------

For 128 KB chunks, the Toshiba performs 29.4% worse.

----------------------------------------------------------------------
$ sudo ./rmw /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L "$((128 * 1024))" "$((8 * 1024))"
testing path: /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L buffer_size: 131072 count: 8192
1073741824 bytes in 100.036280 seconds: 10.2 MiB/sec

$ sudo ./rmw /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG "$((128 * 1024))" "$((8 * 1024))"
testing path: /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG buffer_size: 131072 count: 8192
1073741824 bytes in 142.250680 seconds: 7.2 MiB/sec
----------------------------------------------------------------------

I don't know if the MD behavior tends toward better or worse as
compared to my synthetic testing, but there's definitely a difference
in performance between drives--apparently higher latency on the
Toshiba.

The RAID-5 write-back journal feature seems interesting, but I hit a
reproducible bug early on:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1043078

Making RAID-5 work well under these circumstances doesn't seem worth
it. I'm probably going to use RAID-10 instead.

The test program follows.
-Corey

----------------------------------------------------------------------
#define _GNU_SOURCE

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

long parse_positive_long(char *);
int rmw(char *, long, long);
size_t rmw_once(int, char *, long);

int main(int argc, char **argv)
{
	char *path;
	long buffer_size, count;

	if (argc != 4) {
		printf("usage: %s path buffer_size count\n", argv[0]);
		printf("WARNING: this overwrites the target file/device\n");
		exit(1);
	}
	path = argv[1];
	if (! (buffer_size = parse_positive_long(argv[2]))) {
		exit(1);
	}
	if (buffer_size % 512) {
		printf("buffer size must be a multiple of 512\n");
		exit(1);
	}
	if (! (count = parse_positive_long(argv[3]))) {
		exit(1);
	}

	printf("testing path: %s buffer_size: %ld count: %ld\n",
			path, buffer_size, count);
	if (! rmw(path, buffer_size, count)) {
		exit(1);
	}
	exit(0);
}

/* returns 0 on failure */
long parse_positive_long(char *str)
{
	long ret;
	char *endptr;

	ret = strtol(str, &endptr, 0);
	if (str[0] != 0 && endptr[0] == 0) {
		if (ret <= 0) {
			printf("expected positive number instead of: %ld\n", ret);
			return 0;
		}
		return ret;
	} else {
		printf("error parsing number: %s\n", str);
		return 0;
	}
}

int rmw(char *path, long buffer_size, long count)
{
	int fd, i;
	size_t bytes_handled, bytes_total;
	char *buffer;
	struct timespec start_time, end_time;
	double elapsed;

	buffer = aligned_alloc(512, buffer_size);
	if (! buffer) {
		printf("error allocating buffer: %s\n", strerror(errno));
		return 0;
	}
	fd = open(path, O_RDWR|O_DIRECT|O_SYNC);
	if (fd == -1) {
		printf("error opening %s: %s\n", path, strerror(errno));
		return 0;
	}

	bytes_total = 0;
	clock_gettime(CLOCK_MONOTONIC, &start_time);
	for (i = 0; i < count; ++i) {
		bytes_handled = rmw_once(fd, buffer, buffer_size);
		if (! bytes_handled) {
			return 0;
		}
		bytes_total += bytes_handled;
		if (bytes_handled != buffer_size) {
			printf("warning: encountered EOF\n");
			break;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end_time);

	if (close(fd)) {
		printf("error closing %s: %s\n", path, strerror(errno));
	}
	free(buffer);

	if (! bytes_total) {
		return 0;
	}
	elapsed = (double)(end_time.tv_sec - start_time.tv_sec)
		+ (double)(end_time.tv_nsec - start_time.tv_nsec) / 1.0e9;
	if (elapsed == 0.0) {
		printf("no time elapsed???\n");
		return 0;
	}
	printf("%ld bytes in %lf seconds: %.1lf MiB/sec\n",
			bytes_total, elapsed,
			(double)bytes_total/elapsed/1024/1024);
	return 1;
}

size_t rmw_once(int fd, char *buffer, long buffer_size)
{
	size_t bytes_read, bytes_written;
	ssize_t last_size;
	long i;
	int attempts;

	/* ----- READ ----- */
	bytes_read = 0;
	attempts = 0;
	do {
		++attempts;
		last_size = read(fd, buffer + bytes_read,
				buffer_size - bytes_read);
		bytes_read += last_size;
	} while (bytes_read < buffer_size && last_size > 0);
	if (attempts > 1) {
		printf("warning: took %d attempts to read into buffer\n",
				attempts);
	}
	if (last_size < 0) {
		printf("error reading: %s\n", strerror(errno));
		return 0;
	}

	/* ----- MODIFY ----- */
	for (i = 0; i < bytes_read; ++i) {
		/* do something... doesn't matter what */
		buffer[i] = ~buffer[i];
	}

	/* ----- WRITE ----- */
	if (lseek(fd, -bytes_read, SEEK_CUR) == -1) {
		printf("error seeking: %s\n", strerror(errno));
		return 0;
	}
	bytes_written = 0;
	attempts = 0;
	do {
		attempts += 1;
		/* retry from the unwritten remainder of the buffer */
		last_size = write(fd, buffer + bytes_written,
				bytes_read - bytes_written);
		bytes_written += last_size;
	} while (bytes_written < bytes_read && last_size > 0);
	if (attempts > 1) {
		printf("warning: took %d attempts to write from buffer\n",
				attempts);
	}
	if (last_size < 0) {
		printf("error writing: %s\n", strerror(errno));
		return 0;
	}
	assert(bytes_read == bytes_written);
	return bytes_read;
}
----------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-08-06 22:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-04  5:44 read-modify-write occurring for direct I/O on RAID-5 Corey Hickey
2023-08-04  8:07 ` Dave Chinner
2023-08-04 19:26   ` Corey Hickey
2023-08-04 21:52     ` Dave Chinner
2023-08-05  1:44       ` Corey Hickey
2023-08-05 22:37         ` Dave Chinner
2023-08-06 18:21           ` Corey Hickey
2023-08-06 22:38             ` Dave Chinner
2023-08-06 18:54         ` Corey Hickey