From: Dave Chinner <david@fromorbit.com>
To: Corey Hickey <bugfood-ml@fatooh.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: read-modify-write occurring for direct I/O on RAID-5
Date: Sat, 5 Aug 2023 07:52:56 +1000
Message-ID: <ZM1zOFWVm9lD8pNc@dread.disaster.area>
In-Reply-To: <db157228-3687-57bf-d090-10517847404d@fatooh.org>
On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> On 2023-08-04 01:07, Dave Chinner wrote:
> > If you want to force XFS to do stripe width aligned allocation for
> > large files to match with how MD exposes its topology to
> > filesystems, use the 'swalloc' mount option. The down side is that
> > you'll hotspot the first disk in the MD array....
>
> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> unaligned writes.
>
> If I manually specify the (I think) correct values, I do still get writes
> aligned to sunit but not swidth, as before.
Hmmm, it should not be doing that - where is the misalignment
happening in the file? swalloc isn't widely used/tested, so there's
every chance there's something unexpected going on in the code...
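One way to narrow that down is to take the starting disk address of each extent (xfs_bmap -v reports these in 512-byte units) and test it against sunit and swidth. The following is only a sketch: it assumes the sunit=1024,swidth=2048 sector values from the mkfs command quoted below, and the helper name classify_start and the sample addresses are made up for illustration.

```shell
#!/bin/sh
# Classify an extent's starting disk address (in 512-byte sectors) against
# the stripe geometry. SUNIT/SWIDTH are assumed from the mkfs command in
# this thread; adjust them for the real array.
SUNIT=1024
SWIDTH=2048

classify_start() {
    start=$1
    if [ $((start % SWIDTH)) -eq 0 ]; then
        echo "swidth-aligned"
    elif [ $((start % SUNIT)) -eq 0 ]; then
        echo "sunit-aligned only"
    else
        echo "unaligned"
    fi
}

# Hypothetical start sectors, e.g. copied by hand from `xfs_bmap -v <file>`.
for s in 4096 5120 5121; do
    echo "$s: $(classify_start "$s")"
done
```

Extents that come back as "sunit-aligned only" are the ones that would match the behaviour described above.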
> -----------------------------------------------------------------------
> $ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
> mkfs.xfs: Specified data stripe width 2048 is not the same as the volume
> stripe width 546816
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10 isize=512 agcount=16, agsize=982912 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=0
> = reflink=1 bigtime=1 inobtcount=1
> nrext64=0
> data = bsize=4096 blocks=15726592, imaxpct=25
> = sunit=128 swidth=256 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> log =internal log bsize=4096 blocks=16384, version=2
> = sectsz=512 sunit=8 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> $ sudo mount -o swalloc /dev/md10 /mnt/tmp
> -----------------------------------------------------------------------
>
> There's probably something else I'm doing wrong there.
Looks sensible, but it's likely still tripping over some non-obvious
corner case in the allocation code. The allocation code is not
simple (allocation alone has roughly 20 parameters that determine
behaviour), especially with all the alignment setup stuff done
before we even get to the allocation code...
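For reference, the numbers in the quoted mkfs output are the same geometry in different units: the -d sunit/swidth options take 512-byte sectors, while the output reports filesystem blocks. A small sketch of the conversion, assuming the 4096-byte block size shown in that output (the function names are hypothetical):

```shell
#!/bin/sh
# Convert mkfs.xfs -d sunit/swidth values (512-byte sectors) into
# filesystem blocks and bytes. BSIZE is taken from the quoted mkfs output.
SECTOR=512
BSIZE=4096

sectors_to_blocks() { echo $(( $1 * SECTOR / BSIZE )); }
sectors_to_bytes()  { echo $(( $1 * SECTOR )); }

echo "sunit:  $(sectors_to_blocks 1024) blks, $(sectors_to_bytes 1024) bytes"
echo "swidth: $(sectors_to_blocks 2048) blks, $(sectors_to_bytes 2048) bytes"
```

This gives sunit=128 and swidth=256 blocks, matching the data section of the output, and 524288 bytes for the stripe unit, matching the log stripe unit warning.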
One thing to try is to set extent size hints for the directories
these large files are going to be written to. That takes a lot of
the allocation decisions away from the size/shape of the individual
IO and instead does large file offset aligned/sized allocations
which are much more likely to be stripe width aligned. e.g. set an
extent size hint of 16MB, and the first write into a hole will
allocate a 16MB chunk around the write instead of just the size that
covers the write IO.
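The hint can be set on a directory with something like xfs_io -c "extsize 16m" <dir>, and new files created there inherit it. To illustrate the rounding described above, here is a rough sketch of the file range that a hint-aligned allocation would cover for a given write. It models only the file-offset arithmetic, not the real allocator, and the function name hint_range is made up for illustration.

```shell
#!/bin/sh
# Model of extent-size-hint rounding: a write of LEN bytes at OFF into a
# hole is expanded to the surrounding HINT-aligned file range.
# This is an illustration of the arithmetic only, not XFS allocator code.
HINT=$((16 * 1024 * 1024))   # 16MB hint, as in the example above

hint_range() {
    off=$1
    len=$2
    start=$(( off / HINT * HINT ))                    # round start down
    end=$(( (off + len + HINT - 1) / HINT * HINT ))   # round end up
    echo "$start $end"
}

# A 1MB write at file offset 20MB maps to the 16MB..32MB chunk.
hint_range $((20 * 1024 * 1024)) $((1024 * 1024))
```

Because 16MB is a multiple of the 1MB stripe width assumed in this thread, chunks of this size and file alignment have a better chance of landing on stripe width boundaries on disk, though that is not guaranteed.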
> Still, I'll heed your advice about not making a hotspot disk and allow XFS
> to allocate as default.
>
> Now that I understand that XFS is behaving as intended and I can't/shouldn't
> necessarily aim for further alignment, I'll try recreating my real RAID,
> trust in buffered writes and the MD stripe cache, and see how that goes.
Buffered writes won't guarantee you alignment, either. In fact, they're
much more likely to do weird stuff than direct IO. If your
filesystem is empty, then buffered writes can look *really good*,
but once the filesystem starts being used and has lots of
discontiguous free space or the system is busy enough that writeback
can't lock contiguous ranges of pages, writeback IO will look a
whole lot less pretty and you have little control over what
it does....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com