* read-modify-write occurring for direct I/O on RAID-5
@ 2023-08-04 5:44 Corey Hickey
2023-08-04 8:07 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2023-08-04 5:44 UTC (permalink / raw)
To: linux-xfs
Hello,
I am having a problem with write performance via direct I/O. My setup is:
* Debian Sid
* Linux 6.3.0-2 (Debian Kernel)
* 3-disk MD RAID-5 of hard disks
* XFS
When I do large sequential writes via direct I/O, sometimes the writes
are fast, but sometimes the RAID ends up doing RMW and performance gets
slow.
If I use regular buffered I/O, then performance is better, presumably
due to the MD stripe cache. I could just use buffered writes, of course,
but I am really trying to make sure I get the alignment correct to start
with.
I can reproduce the problem on a fresh RAID.
-----------------------------------------------------------------------
$ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md10 started.
-----------------------------------------------------------------------
For testing, I'm using "-z 30G" to limit the duration of the initial
RAID resync.
For XFS I can use default options:
-----------------------------------------------------------------------
$ sudo mkfs.xfs /dev/md10
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md10 isize=512 agcount=16, agsize=983040 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=1 inobtcount=1 nrext64=0
data = bsize=4096 blocks=15728640, imaxpct=25
= sunit=128 swidth=68352 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ sudo mount /dev/md10 /mnt/tmp
-----------------------------------------------------------------------
I am testing via dd:
-----------------------------------------------------------------------
$ sudo dd if=/dev/zero of=/mnt/tmp/test.bin iflag=fullblock oflag=direct bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 100.664 s, 107 MB/s
-----------------------------------------------------------------------
I can monitor performance with dstat (the I/O reported at the start
seems to be an artifact of dstat's monitoring).
-----------------------------------------------------------------------
$ dstat -dD sdg,sdh,sdi 2
--dsk/sdg-----dsk/sdh-----dsk/sdi--
read writ: read writ: read writ
16G 5673M: 16G 5673M: 537M 21G # <--not a real reading
0 0 : 0 0 : 0 0
0 0 : 0 0 : 0 0
0 29M: 0 29M: 0 29M # <--test starts here
0 126M: 0 126M: 0 126M
0 134M: 0 134M: 0 134M
0 145M: 0 145M: 0 144M
16k 137M: 0 137M: 0 138M
0 152M: 0 152M: 0 152M
0 140M: 0 140M: 0 140M
5632k 110M:5376k 110M:5376k 111M # <--RMW begins here
12M 49M: 12M 49M: 12M 49M
14M 53M: 13M 54M: 13M 53M
12M 50M: 12M 50M: 12M 50M
12M 49M: 12M 50M: 12M 49M
12M 50M: 12M 49M: 12M 49M
13M 50M: 13M 51M: 12M 51M
12M 50M: 12M 50M: 12M 50M
12M 48M: 12M 48M: 12M 48M
13M 53M: 13M 52M: 13M 53M
13M 50M: 12M 50M: 13M 50M
13M 52M: 13M 52M: 13M 52M
12M 47M: 12M 46M: 12M 46M
13M 52M: 13M 52M: 13M 52M
-----------------------------------------------------------------------
(I truncated the output--the rest looks the same)
Note how the I/O starts out fully as writes, but then continues with
many reads. I am fairly sure this is RAID-5 read-modify-write due to
misaligned writes.
The default chunk size is 512K:
-----------------------------------------------------------------------
$ sudo mdadm --detail /dev/md10 | grep Chunk
Chunk Size : 512K
$ sudo blkid -i /dev/md10
/dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
-----------------------------------------------------------------------
I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I
would expect this to be 1024K (due to two data disks in a three-disk
RAID-5).
Translating into 512-byte sectors, I think the topology should be:
chunk size (sunit): 1024 sectors
stripe size (swidth): 2048 sectors
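As a sanity check, the translation above can be reproduced with shell arithmetic (assumed geometry from the mdadm output: 512 KiB chunk, 3 disks, one parity chunk per stripe):

```shell
# Assumed geometry: 512 KiB chunk, 3-disk RAID-5 (2 data chunks per stripe).
chunk_bytes=$((512 * 1024))
data_disks=2

echo "sunit  = $((chunk_bytes / 512)) sectors"                  # 1024
echo "swidth = $((chunk_bytes / 512 * data_disks)) sectors"     # 2048
echo "expected OPTIMAL_IO_SIZE = $((chunk_bytes * data_disks))" # 1048576
```

The last line is what blkid should have reported: 1048576 bytes (1024K), not 279969792.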
I can see the write alignment with blktrace.
-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
9,10 15 1 0.000000000 186548 Q WS 3829760 + 2048 [dd]
9,10 15 3 0.021087119 186548 Q WS 3831808 + 2048 [dd]
9,10 15 5 0.023605705 186548 Q WS 3833856 + 2048 [dd]
9,10 15 7 0.026093572 186548 Q WS 3835904 + 2048 [dd]
9,10 15 9 0.028595887 186548 Q WS 3837952 + 2048 [dd]
9,10 15 11 0.031171221 186548 Q WS 3840000 + 2048 [dd]
[...]
9,10 5 441 14.601942400 186608 Q WS 8082432 + 2048 [dd]
9,10 5 443 14.620316654 186608 Q WS 8084480 + 2048 [dd]
9,10 5 445 14.646707430 186608 Q WS 8086528 + 2048 [dd]
9,10 5 447 14.654519976 186608 Q WS 8088576 + 2048 [dd]
9,10 5 449 14.680901605 186608 Q WS 8090624 + 2048 [dd]
9,10 5 451 14.689156421 186608 Q WS 8092672 + 2048 [dd]
9,10 5 453 14.706529362 186608 Q WS 8094720 + 2048 [dd]
9,10 5 455 14.732451407 186608 Q WS 8096768 + 2048 [dd]
-----------------------------------------------------------------------
In the beginning, the queued writes are stripe-aligned. For example:
3829760 / 2048 == 1870
Later on, writes end up getting misaligned by half a stripe. For example:
8082432 / 2048 == 3946.5
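The two checks above reduce to a modulo: the offset within the stripe, in sectors, is the LBA modulo swidth (2048 sectors here):

```shell
# Offset of each queued write within the 2048-sector stripe.
echo $((3829760 % 2048))   # 0    -> stripe-aligned
echo $((8082432 % 2048))   # 1024 -> misaligned by one chunk (half a stripe)
```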
I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs,
but that had pretty much the same behavior when writing (the RMW starts
later, but it still starts).
Am I doing something wrong, or is there a bug, or are my expectations
incorrect? I had expected that large sequential writes would be aligned
with swidth.
Thank you,
Corey
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04  5:44 read-modify-write occurring for direct I/O on RAID-5 Corey Hickey
@ 2023-08-04  8:07 ` Dave Chinner
  2023-08-04 19:26   ` Corey Hickey
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-08-04 8:07 UTC (permalink / raw)
To: Corey Hickey; +Cc: linux-xfs

On Thu, Aug 03, 2023 at 10:44:31PM -0700, Corey Hickey wrote:
> Hello,
>
> I am having a problem with write performance via direct I/O. My setup is:
> * Debian Sid
> * Linux 6.3.0-2 (Debian Kernel)
> * 3-disk MD RAID-5 of hard disks
> * XFS
>
> When I do large sequential writes via direct I/O, sometimes the writes are
> fast, but sometimes the RAID ends up doing RMW and performance gets slow.
>
> If I use regular buffered I/O, then performance is better, presumably due to
> the MD stripe cache. I could just use buffered writes, of course, but I am
> really trying to make sure I get the alignment correct to start with.
>
> I can reproduce the problem on a fresh RAID.
> -----------------------------------------------------------------------
> $ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
> mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
> Continue creating array? y
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md10 started.
> -----------------------------------------------------------------------
> For testing, I'm using "-z 30G" to limit the duration of the initial RAID
> resync.
>
> For XFS I can use default options:
> -----------------------------------------------------------------------
> $ sudo mkfs.xfs /dev/md10
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10  isize=512   agcount=16, agsize=983040 blks

So an AG size of just under 2GB.

>          =           sectsz=512  attr=2, projid32bit=1
>          =           crc=1       finobt=1, sparse=1, rmapbt=0
>          =           reflink=1   bigtime=1 inobtcount=1 nrext64=0
> data     =           bsize=4096  blocks=15728640, imaxpct=25
>          =           sunit=128   swidth=68352 blks
                                   ^^^^^^^^^^^^^^^^^

Something is badly broken in MD land.

.....

> The default chunk size is 512K
> -----------------------------------------------------------------------
> $ sudo mdadm --detail /dev/md10 | grep Chunk
>       Chunk Size : 512K
> $ sudo blkid -i /dev/md10
> /dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yup, that's definitely broken.

> PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
> -----------------------------------------------------------------------
> I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would
> expect this to be 1024K (due to two data disks in a three-disk RAID-5).

Yup, it's broken. :/

> Translating into 512-byte sectors, I think the topology should be:
> chunk size (sunit): 1024 sectors
> stripe size (swidth): 2048 sectors

Yup, or as it reports from mkfs, sunit=128 fsbs, swidth=256 fsbs.

> -----------------------------------------------------------------------
> $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
>   9,10  15   1   0.000000000 186548  Q  WS 3829760 + 2048 [dd]
>   9,10  15   3   0.021087119 186548  Q  WS 3831808 + 2048 [dd]
>   9,10  15   5   0.023605705 186548  Q  WS 3833856 + 2048 [dd]
>   9,10  15   7   0.026093572 186548  Q  WS 3835904 + 2048 [dd]
>   9,10  15   9   0.028595887 186548  Q  WS 3837952 + 2048 [dd]
>   9,10  15  11   0.031171221 186548  Q  WS 3840000 + 2048 [dd]
> [...]
>   9,10   5 441  14.601942400 186608  Q  WS 8082432 + 2048 [dd]
>   9,10   5 443  14.620316654 186608  Q  WS 8084480 + 2048 [dd]
>   9,10   5 445  14.646707430 186608  Q  WS 8086528 + 2048 [dd]
>   9,10   5 447  14.654519976 186608  Q  WS 8088576 + 2048 [dd]
>   9,10   5 449  14.680901605 186608  Q  WS 8090624 + 2048 [dd]
>   9,10   5 451  14.689156421 186608  Q  WS 8092672 + 2048 [dd]
>   9,10   5 453  14.706529362 186608  Q  WS 8094720 + 2048 [dd]
>   9,10   5 455  14.732451407 186608  Q  WS 8096768 + 2048 [dd]
> -----------------------------------------------------------------------
> In the beginning, writes queued are stripe-aligned. For example:
> 3829760 / 2048 == 1870
>
> Later on, writes end up getting misaligned by half a stripe. For example:
> 8082432 / 2048 == 3946.5

So it's aligned to sunit, not swidth. That will match up with a
discontiguity in the file layout, i.e. an extent boundary.

And given this is at just under 4GB written, and the AG size is
just under 2GB, this discontiguity is going to occur as writing
fills AG 1 and allocation switches to AG 2.

> I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but
> that had pretty much the same behavior when writing (the RMW starts later,
> but it still starts).

It won't change anything, actually. The first allocation in an AG
will determine which stripe unit the new extent starts on, and then
for the entire AG the write will be aligned to that choice.

If you do IOs much larger than the stripe width (e.g. 16MB at a
time) the impact of the head/tail RMW will largely go away. The
problem is that you are doing exactly stripe width sized IOs, so
this is the worst case for any allocation misalignment that might
occur.

> Am I doing something wrong, or is there a bug, or are my expectations
> incorrect? I had expected that large sequential writes would be aligned with
> swidth.

Expectations are wrong. Large allocations are aligned to stripe unit
in XFS by default.

This is because XFS was tuned for *large* multi-layer RAID setups
like RAID-50 that had hardware RAID 5 luns striped together via
RAID-0 in the volume manager. In these setups, the stripe unit is
the hardware RAID-5 lun stripe width (the minimum size that avoids
RMW) and the stripe width is the RAID-0 width.

Hence for performance, it didn't matter which sunit allocation
aligned to as long as writes spanned the entire stripe width. That
way they would hit every lun.

In general, we don't want stripe width aligned allocation, because
that hot-spots the first stripe unit in the stripe as all file data
first writes to that unit. A raid stripe is only as fast as its
slowest disk, and so having a hot stripe unit slows everything down.
Hence by default we move the initial allocation around the stripe
units, and that largely removes the hotspots in the RAID luns...

So, yeah, there are good reasons for stripe unit aligned allocation
rather than stripe width aligned.

The problem is that MD has never behaved this way - it has always
exposed its individual disk chunk size as the minimum IO size (i.e.
the stripe unit) and the stripe width as the optimal IO size to
avoid RMW cycles.

If you want to force XFS to do stripe width aligned allocation for
large files to match with how MD exposes its topology to
filesystems, use the 'swalloc' mount option. The down side is that
you'll hotspot the first disk in the MD array....

-Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
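[Editorial note: the hotspot argument above can be sketched with a toy model. This is illustrative only, not XFS's allocator: with stripe-width-aligned allocation ('swalloc'-style) every file's first chunk lands on stripe unit 0, while rotating sunit-aligned allocation spreads first chunks across the data units.]

```shell
# Toy model: place the first chunk of 100 files on a 2-data-unit stripe
# (3-disk RAID-5). Rotating placement alternates units; swalloc-style
# placement would put all 100 on unit 0.
data_units=2
files=100

unit0=0; unit1=0
i=0
while [ "$i" -lt "$files" ]; do
    unit=$((i % data_units))    # rotating sunit-aligned placement
    if [ "$unit" -eq 0 ]; then unit0=$((unit0 + 1)); else unit1=$((unit1 + 1)); fi
    i=$((i + 1))
done
echo "rotating: unit0=$unit0 unit1=$unit1"   # 50/50 -> no hotspot
echo "swalloc:  unit0=$files unit1=0"        # all on unit 0 -> hotspot
```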
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04  8:07 ` Dave Chinner
@ 2023-08-04 19:26   ` Corey Hickey
  2023-08-04 21:52     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2023-08-04 19:26 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-04 01:07, Dave Chinner wrote:
>> =          sunit=128   swidth=68352 blks
>                         ^^^^^^^^^^^^^^^^^
>
> Something is badly broken in MD land.
>
> .....
>
>> The default chunk size is 512K
>> -----------------------------------------------------------------------
>> $ sudo mdadm --detail /dev/md10 | grep Chunk
>>       Chunk Size : 512K
>> $ sudo blkid -i /dev/md10
>> /dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
>                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Yup, that's definitely broken.
>
>> PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
>> -----------------------------------------------------------------------
>> I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would
>> expect this to be 1024K (due to two data disks in a three-disk RAID-5).
>
> Yup, it's broken. :/

For what it's worth, this test was on older disks:
* 2 TB Seagate Constellation ES.2
* running in an external USB enclosure

If I use newer disks:
* 12 TB Toshiba N300
* hooked up via internal SATA

...then I see the expected OPTIMAL_IO_SIZE.

Maybe the issue is due to the USB enclosure or due to the older disks
having 512-byte physical sectors. I don't know what other differences
could be relevant.

>> Later on, writes end up getting misaligned by half a stripe. For example:
>> 8082432 / 2048 == 3946.5
>
> So it's aligned to sunit, not swidth. That will match up with a
> discontiguity in the file layout, i.e. an extent boundary.
>
> And given this is at just under 4GB written, and the AG size is
> just under 2GB, this discontiguity is going to occur as writing
> fills AG 1 and allocation switches to AG 2.

Thanks. I figured I was seeing something like that, but I didn't know
the details.

>> I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but
>> that had pretty much the same behavior when writing (the RMW starts later,
>> but it still starts).
>
> It won't change anything, actually. The first allocation in an AG
> will determine which stripe unit the new extent starts on, and then
> for the entire AG the write will be aligned to that choice.
>
> If you do IOs much larger than the stripe width (e.g. 16MB at a
> time) the impact of the head/tail RMW will largely go away. The
> problem is that you are doing exactly stripe width sized IOs, so
> this is the worst case for any allocation misalignment that might
> occur.

Thank you, yes, I have seen that behavior in testing.

>> Am I doing something wrong, or is there a bug, or are my expectations
>> incorrect? I had expected that large sequential writes would be aligned
>> with swidth.
>
> Expectations are wrong. Large allocations are aligned to stripe unit
> in XFS by default.
>
> This is because XFS was tuned for *large* multi-layer RAID setups
> like RAID-50 that had hardware RAID 5 luns striped together via
> RAID-0 in the volume manager.
> In these setups, the stripe unit is the hardware RAID-5 lun stripe
> width (the minimum size that avoids RMW) and the stripe width is the
> RAID-0 width.
>
> Hence for performance, it didn't matter which sunit allocation
> aligned to as long as writes spanned the entire stripe width. That
> way they would hit every lun.

That is very interesting and definitely makes sense.

> In general, we don't want stripe width aligned allocation, because
> that hot-spots the first stripe unit in the stripe as all file data
> first writes to that unit. A raid stripe is only as fast as its
> slowest disk, and so having a hot stripe unit slows everything down.
> Hence by default we move the initial allocation around the stripe
> units, and that largely removes the hotspots in the RAID luns...

That makes sense. So the data allocation alignment controls the
alignment of the writes. I wasn't quite making that connection before.

> So, yeah, there are good reasons for stripe unit aligned allocation
> rather than stripe width aligned.
>
> The problem is that MD has never behaved this way - it has always
> exposed its individual disk chunk size as the minimum IO size (i.e.
> the stripe unit) and the stripe width as the optimal IO size to
> avoid RMW cycles.
>
> If you want to force XFS to do stripe width aligned allocation for
> large files to match with how MD exposes its topology to
> filesystems, use the 'swalloc' mount option. The down side is that
> you'll hotspot the first disk in the MD array....

If I use 'swalloc' with the autodetected (wrong) swidth, I don't see
any unaligned writes.

If I manually specify the (I think) correct values, I do still get
writes aligned to sunit but not swidth, as before.

-----------------------------------------------------------------------
$ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
mkfs.xfs: Specified data stripe width 2048 is not the same as the volume
stripe width 546816
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md10     isize=512    agcount=16, agsize=982912 blks
         =              sectsz=512   attr=2, projid32bit=1
         =              crc=1        finobt=1, sparse=1, rmapbt=0
         =              reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =              bsize=4096   blocks=15726592, imaxpct=25
         =              sunit=128    swidth=256 blks
naming   =version 2     bsize=4096   ascii-ci=0, ftype=1
log      =internal log  bsize=4096   blocks=16384, version=2
         =              sectsz=512   sunit=8 blks, lazy-count=1
realtime =none          extsz=4096   blocks=0, rtextents=0

$ sudo mount -o swalloc /dev/md10 /mnt/tmp
-----------------------------------------------------------------------

There's probably something else I'm doing wrong there.

Still, I'll heed your advice about not making a hotspot disk and allow
XFS to allocate as default.

Now that I understand that XFS is behaving as intended and I
can't/shouldn't necessarily aim for further alignment, I'll try
recreating my real RAID, trust in buffered writes and the MD stripe
cache, and see how that goes.

Thank you very much for your detailed answers; I learned a lot.

-Corey

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04 19:26 ` Corey Hickey
@ 2023-08-04 21:52   ` Dave Chinner
  2023-08-05  1:44     ` Corey Hickey
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-08-04 21:52 UTC (permalink / raw)
To: Corey Hickey; +Cc: linux-xfs

On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> On 2023-08-04 01:07, Dave Chinner wrote:
> > If you want to force XFS to do stripe width aligned allocation for
> > large files to match with how MD exposes its topology to
> > filesystems, use the 'swalloc' mount option. The down side is that
> > you'll hotspot the first disk in the MD array....
>
> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> unaligned writes.
>
> If I manually specify the (I think) correct values, I do still get writes
> aligned to sunit but not swidth, as before.

Hmmm, it should not be doing that - where is the misalignment
happening in the file? swalloc isn't widely used/tested, so there's
every chance there's something unexpected going on in the code...

> -----------------------------------------------------------------------
> $ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
> mkfs.xfs: Specified data stripe width 2048 is not the same as the volume
> stripe width 546816
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10     isize=512    agcount=16, agsize=982912 blks
>          =              sectsz=512   attr=2, projid32bit=1
>          =              crc=1        finobt=1, sparse=1, rmapbt=0
>          =              reflink=1    bigtime=1 inobtcount=1 nrext64=0
> data     =              bsize=4096   blocks=15726592, imaxpct=25
>          =              sunit=128    swidth=256 blks
> naming   =version 2     bsize=4096   ascii-ci=0, ftype=1
> log      =internal log  bsize=4096   blocks=16384, version=2
>          =              sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none          extsz=4096   blocks=0, rtextents=0
>
> $ sudo mount -o swalloc /dev/md10 /mnt/tmp
> -----------------------------------------------------------------------
>
> There's probably something else I'm doing wrong there.

Looks sensible, but it's likely still tripping over some non-obvious
corner case in the allocation code. The allocation code is not
simple (allocation alone has roughly 20 parameters that determine
behaviour), especially with all the alignment setup stuff done
before we even get to the allocation code...

One thing to try is to set extent size hints for the directories
these large files are going to be written to. That takes a lot of
the allocation decisions away from the size/shape of the individual
IO and instead does large file offset aligned/sized allocations
which are much more likely to be stripe width aligned. e.g. set an
extent size hint of 16MB, and the first write into a hole will
allocate a 16MB chunk around the write instead of just the size that
covers the write IO.

> Still, I'll heed your advice about not making a hotspot disk and allow XFS
> to allocate as default.
>
> Now that I understand that XFS is behaving as intended and I can't/shouldn't
> necessarily aim for further alignment, I'll try recreating my real RAID,
> trust in buffered writes and the MD stripe cache, and see how that goes.

Buffered writes won't guarantee you alignment, either. In fact, it's
much more likely to do weird stuff than direct IO. If your
filesystem is empty, then buffered writes can look *really good*,
but once the filesystem starts being used and has lots of
discontiguous free space or the system is busy enough that writeback
can't lock contiguous ranges of pages, writeback IO will look a
whole lot less pretty and you have little control over what it
does....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-04 21:52 ` Dave Chinner
@ 2023-08-05  1:44   ` Corey Hickey
  2023-08-05 22:37     ` Dave Chinner
  2023-08-06 18:54     ` Corey Hickey
  0 siblings, 2 replies; 9+ messages in thread
From: Corey Hickey @ 2023-08-05 1:44 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-04 14:52, Dave Chinner wrote:
> On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
>> On 2023-08-04 01:07, Dave Chinner wrote:
>>> If you want to force XFS to do stripe width aligned allocation for
>>> large files to match with how MD exposes its topology to
>>> filesystems, use the 'swalloc' mount option. The down side is that
>>> you'll hotspot the first disk in the MD array....
>>
>> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
>> unaligned writes.
>>
>> If I manually specify the (I think) correct values, I do still get writes
>> aligned to sunit but not swidth, as before.
>
> Hmmm, it should not be doing that - where is the misalignment
> happening in the file? swalloc isn't widely used/tested, so there's
> every chance there's something unexpected going on in the code...

I don't know how to tell the file position, but I wrote a one-liner for
blktrace that may help. This should tell the position within the block
device of writes enqueued.

For every time the alignment _changes_, the awk program prints:
* the previous line (if it exists and was not already printed)
* the current line

Lines from blktrace are prefixed by:
* a 'c' or 'p' for debugging the awk program
* the offset from a 2048-sector alignment
* a '--' as a separator

I have manually inserted blank lines into the output in order to
visually separate it into three sections:
1. writes predominantly stripe-aligned
2. writes predominantly offset by one chunk
3. writes predominantly stripe-aligned again

-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | awk '
    BEGIN { prev=""; prev_offset=-1; }
    / Q / {
        offset=$8 % 2048;
        if (offset != prev_offset) {
            if (prev) { printf("p %4d -- %s\n", prev_offset, prev); prev="" };
            printf("c %4d -- %s\n", offset, $0);
            prev_offset=offset;
            fflush();
        } else { prev=$0 }
    }'
c   32 --  9,10  11    1     0.000000000 213852  Q  RM 32 + 8 [dd]
c   24 --  9,10  11    2     0.000253462 213852  Q  RM 24 + 8 [dd]
c 1024 --  9,10  11    3     0.000434115 213852  Q  RM 1024 + 32 [dd]
c    3 --  9,10  11    4     0.001008057 213852  Q  RM 3 + 1 [dd]
c   16 --  9,10  11    5     0.001165978 213852  Q  RM 16 + 8 [dd]
c    8 --  9,10  11    6     0.001328206 213852  Q  RM 8 + 8 [dd]
c    0 --  9,10  11    7     0.001496647 213852  Q  WS 2048 + 2048 [dd]
p    0 --  9,10   1  469    10.544416303 213852  Q  WS 6301696 + 2048 [dd]
c  128 --  9,10   1  471    10.545831615 213789  Q  FWFSM 62906496 + 64 [kworker/1:3]
c    0 --  9,10   1  472    10.548127201 213852  Q  WS 6303744 + 2048 [dd]

p    0 --  9,10   0 5791    13.109985396 213852  Q  WS 7804928 + 2048 [dd]
c 1027 --  9,10   0 5793    13.113192558 213852  Q  RM 7863299 + 1 [dd]
c 1040 --  9,10   0 5794    13.136165405 213852  Q  RM 7863312 + 8 [dd]
c 1032 --  9,10   0 5795    13.136458182 213852  Q  RM 7863304 + 8 [dd]
c 1024 --  9,10   0 5796    13.136568992 213852  Q  WS 7865344 + 2048 [dd]
p 1024 --  9,10   1 2818    41.250430374 213852  Q  WS 12133376 + 2048 [dd]
c  192 --  9,10   1 2820    41.266187726 213789  Q  FWFSM 62906560 + 64 [kworker/1:3]
c 1024 --  9,10   1 2821    41.275578120 213852  Q  WS 12135424 + 2048 [dd]
c    2 --  9,10   5    1    41.266226029 213819  Q  WM 2 + 1 [xfsaild/md10]
c   24 --  9,10   5    2    41.266236639 213819  Q  WM 24 + 8 [xfsaild/md10]
c   32 --  9,10   5    3    41.266242160 213819  Q  WM 32 + 8 [xfsaild/md10]
c 1024 --  9,10   5    4    41.266246318 213819  Q  WM 1024 + 32 [xfsaild/md10]
p 1024 --  9,10   1 2823    41.308444405 213852  Q  WS 12137472 + 2048 [dd]
c  256 --  9,10  10  706    41.322338854 207685  Q  FWFSM 62906624 + 64 [kworker/u64:11]
c 1024 --  9,10   1 2825    41.334778677 213852  Q  WS 12139520 + 2048 [dd]

p 1024 --  9,10   3 3739    64.424114908 213852  Q  WS 15668224 + 2048 [dd]
c    3 --  9,10   3 3741    64.445830212 213852  Q  RM 15726595 + 1 [dd]
c   16 --  9,10   3 3742    64.455104423 213852  Q  RM 15726608 + 8 [dd]
c    8 --  9,10   3 3743    64.463494822 213852  Q  RM 15726600 + 8 [dd]
c    0 --  9,10   3 3744    64.470414156 213852  Q  WS 15728640 + 2048 [dd]
p    0 --  9,10   1 6911    71.983449607 213852  Q  WS 20101120 + 2048 [dd]
c  320 --  9,10   1 6913    71.985823522 213789  Q  FWFSM 62906688 + 64 [kworker/1:3]
c    0 --  9,10   1 6914    71.987115410 213852  Q  WS 20103168 + 2048 [dd]
c    1 --  9,10   5    6    71.985857777 213819  Q  WM 1 + 1 [xfsaild/md10]
c    8 --  9,10   5    7    71.985869209 213819  Q  WM 8 + 8 [xfsaild/md10]
c   16 --  9,10   5    8    71.985874249 213819  Q  WM 16 + 8 [xfsaild/md10]
c    0 --  9,10   1 6916    72.002414341 213852  Q  WS 20105216 + 2048 [dd]
p    0 --  9,10   1 6924    72.041196270 213852  Q  WS 20113408 + 2048 [dd]
c  384 --  9,10   4    1    72.041820949 211757  Q  FWFSM 62906752 + 64 [kworker/u64:1]
c    0 --  9,10   1 6926    72.043596586 213852  Q  WS 20115456 + 2048 [dd]
-----------------------------------------------------------------------

I don't know if that's quite what you wanted, but hopefully it helps
for something.

> One thing to try is to set extent size hints for the directories
> these large files are going to be written to. That takes a lot of
> the allocation decisions away from the size/shape of the individual
> IO and instead does large file offset aligned/sized allocations
> which are much more likely to be stripe width aligned. e.g. set an
> extent size hint of 16MB, and the first write into a hole will
> allocate a 16MB chunk around the write instead of just the size that
> covers the write IO.

Can you please give me a documentation pointer for that? I wasn't able
to find the right thing via searching. I see some references to size
hints in mkfs.xfs, but it seems like you refer to something to be set
for specific directories at run-time.
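[Editorial note: for reference, the run-time mechanism Dave describes is xfs_io's `extsize` command (documented in xfs_io(8)); a hint set on a directory is inherited by files subsequently created in it. The path below is hypothetical. A sketch:]

```shell
# Hypothetical path; requires xfsprogs. Set a 16 MiB extent size hint on
# a directory so that new files created inside it inherit the hint.
xfs_io -c "extsize 16m" /mnt/tmp/bigfiles

# Read the hint back to verify.
xfs_io -c "extsize" /mnt/tmp/bigfiles
```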
>> Still, I'll heed your advice about not making a hotspot disk and allow XFS
>> to allocate as default.
>>
>> Now that I understand that XFS is behaving as intended and I can't/shouldn't
>> necessarily aim for further alignment, I'll try recreating my real RAID,
>> trust in buffered writes and the MD stripe cache, and see how that goes.
>
> Buffered writes won't guarantee you alignment, either. In fact, it's
> much more likely to do weird stuff than direct IO. If your
> filesystem is empty, then buffered writes can look *really good*,
> but once the filesystem starts being used and has lots of
> discontiguous free space or the system is busy enough that writeback
> can't lock contiguous ranges of pages, writeback IO will look a
> whole lot less pretty and you have little control over what
> it does....

I'll keep that in mind. This filesystem doesn't get extensive writes
except when restoring from backup.

That is why I started looking at alignment, though--restoring from
backup onto a new array with new disks was incurring lots of RMW, reads
were very delayed, and the kernel was warning about hung tasks. It
probably didn't help that my RAID-5 was degraded due to a failed disk I
had to return. I audited my alignment choices anyway and found some
things I could do better, but I got stuck on XFS, hence this thread.

My intended full stack is:
* RAID-5
* bcache (default settings--writethrough)
* dm-crypt
* XFS

...and I've operated that before without noticing anything so bad.

The alignment gets tricky, especially because bcache has a fixed
default data offset and doesn't quite propagate the topology of the
underlying backing device.

$ sudo blkid -i /dev/md5
/dev/md5: MINIMUM_IO_SIZE="131072" OPTIMAL_IO_SIZE="262144"
PHYSICAL_SECTOR_SIZE="4096" LOGICAL_SECTOR_SIZE="512"
$ sudo blkid -i /dev/bcache0
/dev/bcache0: MINIMUM_IO_SIZE="512" OPTIMAL_IO_SIZE="262144"
PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"

Some of that makes sense for a writeback scenario, but I think for
writethrough I want to align to the topology of the underlying backing
device.

Thanks again for all your time.

-Corey

^ permalink raw reply	[flat|nested] 9+ messages in thread
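[Editorial note: the alignment concern above can be checked numerically. The 16-sector (8 KiB) default data offset below is an assumption about bcache's on-disk format, not a value from this thread; read the real value from your superblock (e.g. `bcache-super-show` on the backing device) before relying on it.]

```shell
# Assumed bcache default data offset: 16 sectors = 8 KiB. The moduli use
# the MINIMUM_IO_SIZE/OPTIMAL_IO_SIZE reported for /dev/md5 above.
data_offset_bytes=$((16 * 512))
min_io=131072
opt_io=262144

echo $((data_offset_bytes % min_io))   # nonzero -> chunk-misaligned
echo $((data_offset_bytes % opt_io))   # nonzero -> stripe-misaligned
```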
* Re: read-modify-write occurring for direct I/O on RAID-5 2023-08-05 1:44 ` Corey Hickey @ 2023-08-05 22:37 ` Dave Chinner 2023-08-06 18:21 ` Corey Hickey 2023-08-06 18:54 ` Corey Hickey 1 sibling, 1 reply; 9+ messages in thread From: Dave Chinner @ 2023-08-05 22:37 UTC (permalink / raw) To: Corey Hickey; +Cc: linux-xfs On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote: > On 2023-08-04 14:52, Dave Chinner wrote: > > On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote: > > > On 2023-08-04 01:07, Dave Chinner wrote: > > > > If you want to force XFS to do stripe width aligned allocation for > > > > large files to match with how MD exposes it's topology to > > > > filesytsems, use the 'swalloc' mount option. The down side is that > > > > you'll hotspot the first disk in the MD array.... > > > > > > If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any > > > unaligned writes. > > > > > > If I manually specify the (I think) correct values, I do still get writes > > > aligned to sunit but not swidth, as before. > > > > Hmmm, it should not be doing that - where is the misalignment > > happening in the file? swalloc isn't widely used/tested, so there's > > every chance there's something unexpected going on in the code... > > I don't know how to tell the file position, but I wrote a one-liner for > blktrace that may help. This should tell the position within the block > device of writes enqueued. xfs_bmap will tell you the file extent layout (offset to lba relationship). (`xfs_bmap -vvp <file>` output is prefered if you are going to paste it into an email.) 
> For every time the alignment _changes_, the awk program prints:
> * the previous line (if it exists and was not already printed)
> * the current line
>
> Lines from blktrace are prefixed by:
> * a 'c' or 'p' for debugging the awk program
> * the offset from a 2048-sector alignment
> * a '--' as a separator
>
> I have manually inserted blank lines into the output in order to
> visually separate into three sections:
> 1. writes predominantly stripe-aligned
> 2. writes predominantly offset by one chunk
> 3. writes predominantly stripe-aligned again
>
> -----------------------------------------------------------------------
> $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | awk 'BEGIN { prev=""; prev_offset=-1; } / Q / { offset=$8 % 2048; if (offset != prev_offset) { if (prev) { printf("p %4d -- %s\n", prev_offset, prev); prev="" }; printf("c %4d -- %s\n", offset, $0); prev_offset=offset; fflush(); } else { prev=$0 }} '
> c   32 -- 9,10   11    1     0.000000000 213852  Q  RM 32 + 8 [dd]
> c   24 -- 9,10   11    2     0.000253462 213852  Q  RM 24 + 8 [dd]

inobt + finobt metadata reads.

> c 1024 -- 9,10   11    3     0.000434115 213852  Q  RM 1024 + 32 [dd]

Inode cluster read.

> c    3 -- 9,10   11    4     0.001008057 213852  Q  RM 3 + 1 [dd]

AGFL read.

> c   16 -- 9,10   11    5     0.001165978 213852  Q  RM 16 + 8 [dd]
> c    8 -- 9,10   11    6     0.001328206 213852  Q  RM 8 + 8 [dd]

AG freespace btree block reads.

<inode now allocated>

> c    0 -- 9,10   11    7     0.001496647 213852  Q  WS 2048 + 2048 [dd]

Data writes.

> p    0 -- 9,10    1  469    10.544416303 213852  Q  WS 6301696 + 2048 [dd]
> c  128 -- 9,10    1  471    10.545831615 213789  Q FWFSM 62906496 + 64 [kworker/1:3]
> c    0 -- 9,10    1  472    10.548127201 213852  Q  WS 6303744 + 2048 [dd]

Seek for journal IO between two sequential, contiguous data writes.
> p    0 -- 9,10    0  5791    13.109985396 213852  Q  WS 7804928 + 2048 [dd]
> c 1027 -- 9,10    0  5793    13.113192558 213852  Q  RM 7863299 + 1 [dd]
> c 1040 -- 9,10    0  5794    13.136165405 213852  Q  RM 7863312 + 8 [dd]
> c 1032 -- 9,10    0  5795    13.136458182 213852  Q  RM 7863304 + 8 [dd]

Data write at the tail end of the AG, followed by reads of the AGF and
AG freespace btree blocks in the next AG...

> c 1024 -- 9,10    0  5796    13.136568992 213852  Q  WS 7865344 + 2048 [dd]

... and the data write continues, but I don't think that is aligned.

$ echo $(((7865344 / 2048) * 2048))
7864320
$

Yeah, so if that was aligned, it would start at LBA 7864320, not 7865344.

> p 1024 -- 9,10    1  2818    41.250430374 213852  Q  WS 12133376 + 2048 [dd]
> c  192 -- 9,10    1  2820    41.266187726 213789  Q FWFSM 62906560 + 64 [kworker/1:3]
> c 1024 -- 9,10    1  2821    41.275578120 213852  Q  WS 12135424 + 2048 [dd]

Journal IO breaking up two unaligned contiguous data writes.

> c    2 -- 9,10    5    1    41.266226029 213819  Q  WM 2 + 1 [xfsaild/md10]
> c   24 -- 9,10    5    2    41.266236639 213819  Q  WM 24 + 8 [xfsaild/md10]
> c   32 -- 9,10    5    3    41.266242160 213819  Q  WM 32 + 8 [xfsaild/md10]
> c 1024 -- 9,10    5    4    41.266246318 213819  Q  WM 1024 + 32 [xfsaild/md10]

Metadata writeback of the AGI 0, inobt, finobt and inode cluster blocks.

> p 1024 -- 9,10    1  2823    41.308444405 213852  Q  WS 12137472 + 2048 [dd]
> c  256 -- 9,10   10   706    41.322338854 207685  Q FWFSM 62906624 + 64 [kworker/u64:11]
> c 1024 -- 9,10    1  2825    41.334778677 213852  Q  WS 12139520 + 2048 [dd]

Journal IO.

> p 1024 -- 9,10    3  3739    64.424114908 213852  Q  WS 15668224 + 2048 [dd]
> c    3 -- 9,10    3  3741    64.445830212 213852  Q  RM 15726595 + 1 [dd]
> c   16 -- 9,10    3  3742    64.455104423 213852  Q  RM 15726608 + 8 [dd]
> c    8 -- 9,10    3  3743    64.463494822 213852  Q  RM 15726600 + 8 [dd]

Next AG. So the entire AG was written unaligned - that is expected,
because this is appending and that aims for contiguous allocation, not
aligned allocation.
> c    0 -- 9,10    3  3744    64.470414156 213852  Q  WS 15728640 + 2048 [dd]

And the first allocation in the next AG is properly aligned.

Ok. So it appears that something is not working 100% w.r.t. aligned
allocation on the transition from one AG to the next. I wonder if we've
failed the "at EOF" allocation because there isn't space in the AG, and
then done an "any AG" unaligned allocation as the fallback?

I'll have to see if I can replicate this, now that I know it is the
"full AG -> first allocation in next AG" fallback that appears to be
going astray....

> > One thing to try is to set extent size hints for the directories
> > these large files are going to be written to. That takes a lot of
> > the allocation decisions away from the size/shape of the individual
> > IO and instead does large file offset aligned/sized allocations
> > which are much more likely to be stripe width aligned. e.g. set an
> > extent size hint of 16MB, and the first write into a hole will
> > allocate a 16MB chunk around the write instead of just the size that
> > covers the write IO.
>
> Can you please give me a documentation pointer for that? I wasn't able
> to find the right thing via searching.

$ man 2 ioctl_xfs_fsgetxattr
....
       fsx_extsize is the preferred extent allocation size for data
       blocks mapped to this file, in units of filesystem blocks. If
       this value is zero, the filesystem will choose a default option,
       which is currently zero. If XFS_IOC_FSSETXATTR is called with
       XFS_XFLAG_EXTSIZE set in fsx_xflags and this field set to zero,
       the XFLAG will also be cleared.
....
       XFS_XFLAG_EXTSIZE
              Extent size bit - if a basic extent size value is set on
              the file then the allocator will allocate in multiples of
              the set size for this file (see fsx_extsize below). The
              extent size can only be changed on a file when it has no
              allocated extents.
....

$ man xfs_io
....
       extsize [ -R | -D ] [ value ]
              Display and/or modify the preferred extent size used when
              allocating space for the currently open file.
              If the -R option is specified, a recursive descent is
              performed for all directory entries below the currently
              open file (-D can be used to restrict the output to
              directories only).

              If the target file is a directory, then the inherited
              extent size is set for that directory (new files created
              in that directory inherit that extent size).

              The value should be specified in bytes, or using one of
              the usual units suffixes (k, m, g, b, etc). The extent
              size is always reported in units of bytes.
....

$ man mkfs.xfs
....
              extszinherit=value
                     All inodes created by mkfs.xfs will have this
                     extent size hint applied. The value must be
                     provided in units of filesystem blocks.

                     Directories will pass on this hint to newly
                     created regular files and directories.
....

> I see some references to size hints in mkfs.xfs, but it seems like you
> refer to something to be set for specific directories at run-time.

It's the same thing, just set up different ways.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-05 22:37           ` Dave Chinner
@ 2023-08-06 18:21             ` Corey Hickey
  2023-08-06 22:38               ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Corey Hickey @ 2023-08-06 18:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-05 15:37, Dave Chinner wrote:
> On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote:
>> On 2023-08-04 14:52, Dave Chinner wrote:
>>> On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
>>>> On 2023-08-04 01:07, Dave Chinner wrote:
>>>>> If you want to force XFS to do stripe width aligned allocation for
>>>>> large files to match with how MD exposes its topology to
>>>>> filesystems, use the 'swalloc' mount option. The down side is that
>>>>> you'll hotspot the first disk in the MD array....
>>>>
>>>> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
>>>> unaligned writes.
>>>>
>>>> If I manually specify the (I think) correct values, I do still get writes
>>>> aligned to sunit but not swidth, as before.
>>>
>>> Hmmm, it should not be doing that - where is the misalignment
>>> happening in the file? swalloc isn't widely used/tested, so there's
>>> every chance there's something unexpected going on in the code...
>>
>> I don't know how to tell the file position, but I wrote a one-liner for
>> blktrace that may help. This should tell the position within the block
>> device of writes enqueued.
>
> xfs_bmap will tell you the file extent layout (offset to lba relationship).
> (`xfs_bmap -vvp <file>` output is preferred if you are going to paste
> it into an email.)

Ah, nice; the flags even show the alignment.

Here are the results for a filesystem on a 2-data-disk RAID-5 with
128 KB chunk size.
$ sudo mkfs.xfs -s size=4096 -d sunit=256,swidth=512 /dev/md5 -f
meta-data=/dev/md5               isize=512    agcount=16, agsize=983008 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=15728128, imaxpct=25
         =                       sunit=32     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

$ sudo mount -o noatime,swalloc /dev/md5 /mnt/tmp

$ sudo dd if=/dev/zero of=/mnt/tmp/test.bin iflag=fullblock oflag=direct bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 62.6102 s, 171 MB/s

$ sudo xfs_bmap -vvp /mnt/tmp/test.bin
/mnt/tmp/test.bin:
 EXT: FILE-OFFSET           BLOCK-RANGE         AG AG-OFFSET         TOTAL FLAGS
   0: [0..7806975]:         512..7807487         0 (512..7807487)  7806976 000000
   1: [7806976..15613951]:  7864576..15671551    1 (512..7807487)  7806976 000011
   2: [15613952..20971519]: 15728640..21086207   2 (512..5358079)  5357568 000000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end on stripe width

>>> One thing to try is to set extent size hints for the directories
>>> these large files are going to be written to. That takes a lot of
>>> the allocation decisions away from the size/shape of the individual
>>> IO and instead does large file offset aligned/sized allocations
>>> which are much more likely to be stripe width aligned. e.g. set an
>>> extent size hint of 16MB, and the first write into a hole will
>>> allocate a 16MB chunk around the write instead of just the size that
>>> covers the write IO.
>>
>> Can you please give me a documentation pointer for that? I wasn't able
>> to find the right thing via searching.
> [...]
> $ man xfs_io
> ....
>        extsize [ -R | -D ] [ value ]
> [...]
Aha, thanks. That's what I was looking for.

-Corey

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-06 18:21             ` Corey Hickey
@ 2023-08-06 22:38               ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2023-08-06 22:38 UTC (permalink / raw)
To: Corey Hickey; +Cc: linux-xfs

On Sun, Aug 06, 2023 at 11:21:38AM -0700, Corey Hickey wrote:
> On 2023-08-05 15:37, Dave Chinner wrote:
> > On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote:
> > > On 2023-08-04 14:52, Dave Chinner wrote:
> > > > On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> > > > > On 2023-08-04 01:07, Dave Chinner wrote:
> > > > > > If you want to force XFS to do stripe width aligned allocation for
> > > > > > large files to match with how MD exposes its topology to
> > > > > > filesystems, use the 'swalloc' mount option. The down side is that
> > > > > > you'll hotspot the first disk in the MD array....
> > > > >
> > > > > If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> > > > > unaligned writes.
> > > > >
> > > > > If I manually specify the (I think) correct values, I do still get writes
> > > > > aligned to sunit but not swidth, as before.
> > > >
> > > > Hmmm, it should not be doing that - where is the misalignment
> > > > happening in the file? swalloc isn't widely used/tested, so there's
> > > > every chance there's something unexpected going on in the code...
> > >
> > > I don't know how to tell the file position, but I wrote a one-liner for
> > > blktrace that may help. This should tell the position within the block
> > > device of writes enqueued.
> >
> > xfs_bmap will tell you the file extent layout (offset to lba relationship).
> > (`xfs_bmap -vvp <file>` output is preferred if you are going to paste
> > it into an email.)
>
> Ah, nice; the flags even show the alignment.
>
> Here are the results for a filesystem on a 2-data-disk RAID-5 with 128 KB
> chunk size.
....
> $ sudo xfs_bmap -vvp /mnt/tmp/test.bin
> /mnt/tmp/test.bin:
>  EXT: FILE-OFFSET           BLOCK-RANGE         AG AG-OFFSET         TOTAL FLAGS
>    0: [0..7806975]:         512..7807487         0 (512..7807487)  7806976 000000
>    1: [7806976..15613951]:  7864576..15671551    1 (512..7807487)  7806976 000011
>    2: [15613952..20971519]: 15728640..21086207   2 (512..5358079)  5357568 000000

Thanks for that, I think it points out the problem quite clearly.

The stripe width allocation alignment looks to be working as intended -
the "AG-OFFSET" column has the same values in each extent, so within
the AG address space everything is correctly "stripe width" aligned.

What we see here is a mkfs.xfs "anti-hotspot" behaviour with striped
layouts. That is, it automagically sizes the AGs such that each AG
header sits on a different stripe unit within the stripe, so that the
AG headers don't all end up on the same physical stripe unit.

That results in the entire AG being aligned to the stripe unit rather
than the stripe width. And so when we do stripe width aligned
allocation within the AG, it assumes that the AG itself is stripe width
aligned, which it isn't....

So, if you were to do something like this:

# mkfs.xfs -d agsize=1048576b ....

to force the AG size to be a multiple of the stripe width, mkfs will
issue a warning that it is going to place all the AG headers on the
same stripe unit, but then go and do what you asked it to do.

That should work around the problem you are seeing; meanwhile I suspect
the swalloc mechanism might need a tweak to do physical LBA alignment,
not AG offset alignment....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: read-modify-write occurring for direct I/O on RAID-5
  2023-08-05  1:44         ` Corey Hickey
  2023-08-05 22:37           ` Dave Chinner
@ 2023-08-06 18:54           ` Corey Hickey
  1 sibling, 0 replies; 9+ messages in thread
From: Corey Hickey @ 2023-08-06 18:54 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 2023-08-04 18:44, Corey Hickey wrote:
>> Buffered writes won't guarantee you alignment, either. In fact, it's
>> much more likely to do weird stuff than direct IO. If your
>> filesystem is empty, then buffered writes can look *really good*,
>> but once the filesystem starts being used and has lots of
>> discontiguous free space, or the system is busy enough that writeback
>> can't lock contiguous ranges of pages, writeback IO will look a
>> whole lot less pretty and you have little control over what
>> it does....
>
> I'll keep that in mind. This filesystem doesn't get extensive writes
> except when restoring from backup. That is why I started looking at
> alignment, though--restoring from backup onto a new array with new
> disks was incurring lots of RMW, reads were very delayed, and the
> kernel was warning about hung tasks.

I did some further testing and learned more. The root cause does not
seem to be excessive RMW--the root cause seems to be that the drives in
my new array do not handle the RMW nearly as well as the drives I had
used before.

Under different usage, I had previously noticed reduced performance on
"parallel" reads of the new drives as compared to my older drives,
though I didn't investigate further at the time.

I don't know a great way to test this--there's probably a better way
with fio or something. I wrote a small program to _roughly_ simulate
the non-sequential activity of a RAID-5 RMW. Mostly I just wanted to
induce lots of seeks over small intervals.

I see consistent results across different drives attached via different
cables to different SATA controllers. It's not just that I have one
malfunctioning component.
Differences in performance between runs are negligible, so I'm only
reporting one run of each test.

For 512 KB chunks, the Toshiba performs 11.5% worse.

----------------------------------------------------------------------
$ sudo ./rmw /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L "$((512 * 1024))" "$((2 * 1024))"
testing path: /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L buffer_size: 524288 count: 2048
1073741824 bytes in 34.633402 seconds: 29.6 MiB/sec

$ sudo ./rmw /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG "$((512 * 1024))" "$((2 * 1024))"
testing path: /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG buffer_size: 524288 count: 2048
1073741824 bytes in 39.147649 seconds: 26.2 MiB/sec
----------------------------------------------------------------------

For 128 KB chunks, the Toshiba performs 29.4% worse.

----------------------------------------------------------------------
$ sudo ./rmw /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L "$((128 * 1024))" "$((8 * 1024))"
testing path: /dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA71YR1L buffer_size: 131072 count: 8192
1073741824 bytes in 100.036280 seconds: 10.2 MiB/sec

$ sudo ./rmw /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG "$((128 * 1024))" "$((8 * 1024))"
testing path: /dev/disk/by-id/ata-TOSHIBA_HDWG21C_2290A04EFPBG buffer_size: 131072 count: 8192
1073741824 bytes in 142.250680 seconds: 7.2 MiB/sec
----------------------------------------------------------------------

I don't know if the MD behavior tends toward better or worse as
compared to my synthetic testing, but there's definitely a difference
in performance between drives--apparently higher latency on the
Toshiba.

The RAID-5 write-back journal feature seems interesting, but I hit a
reproducible bug early on:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1043078

Making RAID-5 work well under these circumstances doesn't seem worth
it. I'm probably going to use RAID-10 instead.

The test program follows.
-Corey

----------------------------------------------------------------------
#define _GNU_SOURCE

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

long parse_positive_long(char *);
int rmw(char *, long, long);
size_t rmw_once(int, char *, long);

int main(int argc, char **argv)
{
	char *path;
	long buffer_size, count;

	if (argc != 4) {
		printf("usage: %s path buffer_size count\n", argv[0]);
		printf("WARNING: this overwrites the target file/device\n");
		exit(1);
	}
	path = argv[1];
	if (! (buffer_size = parse_positive_long(argv[2]))) {
		exit(1);
	}
	if (buffer_size % 512) {
		printf("buffer size must be a multiple of 512\n");
		exit(1);
	}
	if (! (count = parse_positive_long(argv[3]))) {
		exit(1);
	}

	printf("testing path: %s buffer_size: %ld count: %ld\n",
			path, buffer_size, count);
	if (! rmw(path, buffer_size, count)) {
		exit(1);
	}
	exit(0);
}

/* returns 0 on failure */
long parse_positive_long(char *str)
{
	long ret;
	char *endptr;

	ret = strtol(str, &endptr, 0);
	if (str[0] != 0 && endptr[0] == 0) {
		if (ret <= 0) {
			printf("expected positive number instead of: %ld\n", ret);
			return 0;
		}
		return ret;
	} else {
		printf("error parsing number: %s\n", str);
		return 0;
	}
}

int rmw(char *path, long buffer_size, long count)
{
	int fd, i;
	size_t bytes_handled, bytes_total;
	char *buffer;
	struct timespec start_time, end_time;
	double elapsed;

	buffer = aligned_alloc(512, buffer_size);
	if (! buffer) {
		printf("error allocating buffer: %s\n", strerror(errno));
		return 0;
	}
	fd = open(path, O_RDWR|O_DIRECT|O_SYNC);
	if (fd == -1) {
		printf("error opening %s: %s\n", path, strerror(errno));
		return 0;
	}

	bytes_total = 0;
	clock_gettime(CLOCK_MONOTONIC, &start_time);
	for (i = 0; i < count; ++i) {
		bytes_handled = rmw_once(fd, buffer, buffer_size);
		if (! bytes_handled) {
			return 0;
		}
		bytes_total += bytes_handled;
		if (bytes_handled != buffer_size) {
			printf("warning: encountered EOF\n");
			break;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end_time);

	if (close(fd)) {
		printf("error closing %s: %s\n", path, strerror(errno));
	}
	free(buffer);

	if (! bytes_total) {
		return 0;
	}
	elapsed = (double)(end_time.tv_sec - start_time.tv_sec)
		+ (double)(end_time.tv_nsec - start_time.tv_nsec) / 1.0e9;
	if (elapsed == 0.0) {
		printf("no time elapsed???\n");
		return 0;
	}
	printf("%ld bytes in %lf seconds: %.1lf MiB/sec\n",
			bytes_total, elapsed,
			(double)bytes_total/elapsed/1024/1024);
	return 1;
}

size_t rmw_once(int fd, char *buffer, long buffer_size)
{
	size_t bytes_read, bytes_written;
	ssize_t last_size;
	long i;
	int attempts;

	/* ----- READ ----- */
	bytes_read = 0;
	attempts = 0;
	do {
		++attempts;
		last_size = read(fd, buffer + bytes_read,
				buffer_size - bytes_read);
		bytes_read += last_size;
	} while (bytes_read < buffer_size && last_size > 0);
	if (attempts > 1) {
		printf("warning: took %d attempts to read into buffer\n",
				attempts);
	}
	if (last_size < 0) {
		printf("error reading: %s\n", strerror(errno));
		return 0;
	}

	/* ----- MODIFY ----- */
	for (i = 0; i < bytes_read; ++i) {
		/* do something... doesn't matter what */
		buffer[i] = ~buffer[i];
	}

	/* ----- WRITE ----- */
	if (lseek(fd, -bytes_read, SEEK_CUR) == -1) {
		printf("error seeking: %s\n", strerror(errno));
		return 0;
	}
	bytes_written = 0;
	attempts = 0;
	do {
		attempts += 1;
		/* retry from the unwritten remainder of the buffer */
		last_size = write(fd, buffer + bytes_written,
				bytes_read - bytes_written);
		bytes_written += last_size;
	} while (bytes_written < bytes_read && last_size > 0);
	if (attempts > 1) {
		printf("warning: took %d attempts to write from buffer\n",
				attempts);
	}
	if (last_size < 0) {
		printf("error writing: %s\n", strerror(errno));
		return 0;
	}
	assert(bytes_read == bytes_written);
	return bytes_read;
}
----------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-08-06 22:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-04  5:44 read-modify-write occurring for direct I/O on RAID-5 Corey Hickey
2023-08-04  8:07 ` Dave Chinner
2023-08-04 19:26   ` Corey Hickey
2023-08-04 21:52     ` Dave Chinner
2023-08-05  1:44       ` Corey Hickey
2023-08-05 22:37         ` Dave Chinner
2023-08-06 18:21           ` Corey Hickey
2023-08-06 22:38             ` Dave Chinner
2023-08-06 18:54         ` Corey Hickey