* Disk aligned (but not block aligned) DIO write woes
@ 2020-12-28 15:57 Avi Kivity
2021-01-04 15:06 ` Brian Foster
0 siblings, 1 reply; 4+ messages in thread
From: Avi Kivity @ 2020-12-28 15:57 UTC (permalink / raw)
To: linux-xfs
I observe that XFS takes an exclusive lock for DIO writes that are not
block aligned:
xfs_file_dio_aio_write(
{
	...
	/*
	 * Don't take the exclusive iolock here unless the I/O is unaligned to
	 * the file system block size. We don't need to consider the EOF
	 * extension case here because xfs_file_aio_write_checks() will relock
	 * the inode as necessary for EOF zeroing cases and fill out the new
	 * inode size as appropriate.
	 */
	if ((iocb->ki_pos & mp->m_blockmask) ||
	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
		unaligned_io = 1;

		/*
		 * We can't properly handle unaligned direct I/O to reflink
		 * files yet, as we can't unshare a partial block.
		 */
		if (xfs_is_cow_inode(ip)) {
			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
					count);
			return -ENOTBLK;
		}
		iolock = XFS_IOLOCK_EXCL;
	} else {
		iolock = XFS_IOLOCK_SHARED;
	}
I also see that such writes cause io_submit to block, even if they hit a
written extent (and are also not size-changing, by implication) and
therefore do not require a metadata write. Probably due to "||
unaligned_io" in
	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
			   &xfs_dio_write_ops,
			   is_sync_kiocb(iocb) || unaligned_io);
Can this be relaxed to allow writes to written extents to proceed in
parallel? I explain the motivation below.
My thinking (from a position of blissful ignorance) is that if the
extent is already written, then no metadata changes and block zeroing
are needed. If we can detect that favorable conditions exist (perhaps
with the extra constraint that the mapping be already cached), then we
can handle this particular case asynchronously.
My motivation is a database commit log. NVMe drives can serve small
writes with ridiculously low latency - around 20 microseconds. Let's say
a commitlog entry is around 100 bytes; we fill a 4k block with 41
entries. To achieve that in 20 microseconds requires 2 million
records/sec. Even if we add artificial delay and commit every 1ms,
filling this 4k block requires 41,000 commits/sec. If the entry write
rate is lower, then we will be forced to pad the rest of the block. This
increases the write amplification, impacting other activities using the
disk (such as reads).
41,000 commits/sec may not sound like much, but in a thread-per-core
design (where each core commits independently) this translates to
millions of commits per second for the entire machine. If the real
throughput is below that, then we are forced to either increase the
latency to collect more writes into a full block, or we have to tolerate
the increased write amplification.
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk aligned (but not block aligned) DIO write woes
2020-12-28 15:57 Disk aligned (but not block aligned) DIO write woes Avi Kivity
@ 2021-01-04 15:06 ` Brian Foster
2021-01-04 15:16 ` Avi Kivity
2021-01-08 20:45 ` Dave Chinner
0 siblings, 2 replies; 4+ messages in thread
From: Brian Foster @ 2021-01-04 15:06 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
> I observe that XFS takes an exclusive lock for DIO writes that are not block
> aligned:
>
>
> xfs_file_dio_aio_write(
>
> {
>
> ...
>
> 	/*
> 	 * Don't take the exclusive iolock here unless the I/O is unaligned to
> 	 * the file system block size. We don't need to consider the EOF
> 	 * extension case here because xfs_file_aio_write_checks() will relock
> 	 * the inode as necessary for EOF zeroing cases and fill out the new
> 	 * inode size as appropriate.
> 	 */
> 	if ((iocb->ki_pos & mp->m_blockmask) ||
> 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
> 		unaligned_io = 1;
>
> 		/*
> 		 * We can't properly handle unaligned direct I/O to reflink
> 		 * files yet, as we can't unshare a partial block.
> 		 */
> 		if (xfs_is_cow_inode(ip)) {
> 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
> 					count);
> 			return -ENOTBLK;
> 		}
> 		iolock = XFS_IOLOCK_EXCL;
> 	} else {
> 		iolock = XFS_IOLOCK_SHARED;
> 	}
>
>
> I also see that such writes cause io_submit to block, even if they hit a
> written extent (and are also not size-changing, by implication) and
> therefore do not require a metadata write. Probably due to "|| unaligned_io"
> in
>
>
> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> &xfs_dio_write_ops,
> is_sync_kiocb(iocb) || unaligned_io);
>
>
> Can this be relaxed to allow writes to written extents to proceed in
> parallel? I explain the motivation below.
>
From the above code, it looks as though we require the exclusive lock to
accommodate sub-block zeroing that might occur down in the dio code.
Without it, concurrent sector granular I/Os to the same block could
presumably cause corruption if the data bios and associated zeroing bios
were reordered between two requests.
Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
sub-block zeroing if the associated range is unaligned and the mapping
is either unwritten or new (i.e. newly allocated for this write). So as
you note above, there is no such zeroing for a pre-existing written
block within EOF. From that, it seems reasonable to me that in principle
the filesystem could check for these conditions in the higher layer and
further restrict usage of the exclusive lock. That said, we'd probably
have to rework the above code to acquire the shared lock first,
read/check the extent state for the start/end blocks of the I/O and then
cycle out to the exclusive lock in the cases where it is required. It's
not immediately clear to me if we'd still want to set the wait flag in
those unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio
code a bit better can chime in further on all of this..
Brian
>
> My thinking (from a position of blissful ignorance) is that if the extent is
> already written, then no metadata changes and block zeroing are needed. If
> we can detect that favorable conditions exist (perhaps with the extra
> constraint that the mapping be already cached), then we can handle this
> particular case asynchronously.
>
>
> My motivation is a database commit log. NVMe drives can serve small writes
> with ridiculously low latency - around 20 microseconds. Let's say a
> commitlog entry is around 100 bytes; we fill a 4k block with 41 entries. To
> achieve that in 20 microseconds requires 2 million records/sec. Even if we
> add artificial delay and commit every 1ms, filling this 4k block requires
> 41,000 commits/sec. If the entry write rate is lower, then we will be forced
> to pad the rest of the block. This increases the write amplification,
> impacting other activities using the disk (such as reads).
>
>
> 41,000 commits/sec may not sound like much, but in a thread-per-core design
> (where each core commits independently) this translates to millions of
> commits per second for the entire machine. If the real throughput is below
> that, then we are forced to either increase the latency to collect more
> writes into a full block, or we have to tolerate the increased write
> amplification.
>
>
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk aligned (but not block aligned) DIO write woes
2021-01-04 15:06 ` Brian Foster
@ 2021-01-04 15:16 ` Avi Kivity
2021-01-08 20:45 ` Dave Chinner
1 sibling, 0 replies; 4+ messages in thread
From: Avi Kivity @ 2021-01-04 15:16 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-xfs
On 04/01/2021 17.06, Brian Foster wrote:
> On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
>> I observe that XFS takes an exclusive lock for DIO writes that are not block
>> aligned:
>>
>>
>> xfs_file_dio_aio_write(
>>
>> {
>>
>> ...
>>
>> 	/*
>> 	 * Don't take the exclusive iolock here unless the I/O is unaligned to
>> 	 * the file system block size. We don't need to consider the EOF
>> 	 * extension case here because xfs_file_aio_write_checks() will relock
>> 	 * the inode as necessary for EOF zeroing cases and fill out the new
>> 	 * inode size as appropriate.
>> 	 */
>> 	if ((iocb->ki_pos & mp->m_blockmask) ||
>> 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
>> 		unaligned_io = 1;
>>
>> 		/*
>> 		 * We can't properly handle unaligned direct I/O to reflink
>> 		 * files yet, as we can't unshare a partial block.
>> 		 */
>> 		if (xfs_is_cow_inode(ip)) {
>> 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
>> 					count);
>> 			return -ENOTBLK;
>> 		}
>> 		iolock = XFS_IOLOCK_EXCL;
>> 	} else {
>> 		iolock = XFS_IOLOCK_SHARED;
>> 	}
>>
>>
>> I also see that such writes cause io_submit to block, even if they hit a
>> written extent (and are also not size-changing, by implication) and
>> therefore do not require a metadata write. Probably due to "|| unaligned_io"
>> in
>>
>>
>> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
>> &xfs_dio_write_ops,
>> is_sync_kiocb(iocb) || unaligned_io);
>>
>>
>> Can this be relaxed to allow writes to written extents to proceed in
>> parallel? I explain the motivation below.
>>
> From the above code, it looks as though we require the exclusive lock to
> accommodate sub-block zeroing that might occur down in the dio code.
> Without it, concurrent sector granular I/Os to the same block could
> presumably cause corruption if the data bios and associated zeroing bios
> were reordered between two requests.
>
> Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
> sub-block zeroing if the associated range is unaligned and the mapping
> is either unwritten or new (i.e. newly allocated for this write). So as
> you note above, there is no such zeroing for a pre-existing written
> block within EOF. From that, it seems reasonable to me that in principle
> the filesystem could check for these conditions in the higher layer and
> further restrict usage of the exclusive lock. That said, we'd probably
> have to rework the above code to acquire the shared lock first,
> read/check the extent state for the start/end blocks of the I/O and then
> cycle out to the exclusive lock in the cases where it is required. It's
> not immediately clear to me if we'd still want to set the wait flag in
> those unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio
> code a bit better can chime in further on all of this..
>
>
Thanks. I'll try to convince someone to attempt to do this.
Meanwhile we'll try using -b size=1024 (a 1k filesystem block size at
mkfs time). I have hopes that the
extent-based nature of the filesystem will not cause this to increase
overhead too much.
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk aligned (but not block aligned) DIO write woes
2021-01-04 15:06 ` Brian Foster
2021-01-04 15:16 ` Avi Kivity
@ 2021-01-08 20:45 ` Dave Chinner
1 sibling, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2021-01-08 20:45 UTC (permalink / raw)
To: Brian Foster; +Cc: Avi Kivity, linux-xfs
On Mon, Jan 04, 2021 at 10:06:01AM -0500, Brian Foster wrote:
> On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
> > I observe that XFS takes an exclusive lock for DIO writes that are not block
> > aligned:
> >
> >
> > xfs_file_dio_aio_write(
> >
> > {
> >
> > ...
> >
> > 	/*
> > 	 * Don't take the exclusive iolock here unless the I/O is unaligned to
> > 	 * the file system block size. We don't need to consider the EOF
> > 	 * extension case here because xfs_file_aio_write_checks() will relock
> > 	 * the inode as necessary for EOF zeroing cases and fill out the new
> > 	 * inode size as appropriate.
> > 	 */
> > 	if ((iocb->ki_pos & mp->m_blockmask) ||
> > 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
> > 		unaligned_io = 1;
> >
> > 		/*
> > 		 * We can't properly handle unaligned direct I/O to reflink
> > 		 * files yet, as we can't unshare a partial block.
> > 		 */
> > 		if (xfs_is_cow_inode(ip)) {
> > 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
> > 					count);
> > 			return -ENOTBLK;
> > 		}
> > 		iolock = XFS_IOLOCK_EXCL;
> > 	} else {
> > 		iolock = XFS_IOLOCK_SHARED;
> > 	}
> >
> >
> > I also see that such writes cause io_submit to block, even if they hit a
> > written extent (and are also not size-changing, by implication) and
> > therefore do not require a metadata write. Probably due to "|| unaligned_io"
> > in
> >
> >
> > ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> > &xfs_dio_write_ops,
> > is_sync_kiocb(iocb) || unaligned_io);
> >
> >
> > Can this be relaxed to allow writes to written extents to proceed in
> > parallel? I explain the motivation below.
> >
>
> From the above code, it looks as though we require the exclusive lock to
> accommodate sub-block zeroing that might occur down in the dio code.
> Without it, concurrent sector granular I/Os to the same block could
> presumably cause corruption if the data bios and associated zeroing bios
> were reordered between two requests.
>
> Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
> sub-block zeroing if the associated range is unaligned and the mapping
> is either unwritten or new (i.e. newly allocated for this write). So as
> you note above, there is no such zeroing for a pre-existing written
> block within EOF. From that, it seems reasonable to me that in principle
> the filesystem could check for these conditions in the higher layer and
> further restrict usage of the exclusive lock.
Yuck, no.
> That said, we'd probably
> have to rework the above code to acquire the shared lock first,
> read/check the extent state for the start/end blocks of the I/O and then
> cycle out to the exclusive lock in the cases where it is required.
Yes, that requires mapping lookups above the iomap layer, then
repeating them again in the iomap layer and having to juggle the
locking to ensure the iomap code gets the same result. This also
requires taking the ILOCK to decide how to lock the IOLOCK, which
inverts all the IO path locking in nasty, bug prone ways. We tried
very hard to get rid of all those layering violations and lock
dances in the IO path with iomap - the old DIO code was a never
ending source of locking bugs because we had to play locking games
like this....
>
> It's
> not immediately clear to me if we'd still want to set the wait flag in
> those unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio
> code a bit better can chime in further on all of this..
As I explained quickly on #xfs over xmas/ny time - because it seems
like 4 or 5 people all asked for this at the same time - I suspect
we can do this with IOMAP_NOWAIT directives for unaligned IO under
shared locking.
That is, if the unaligned dio is going to require allocation or
sub-block zeroing or would otherwise block, the filesystem mapping
code returns EAGAIN to iomap_apply(), which propagates back to the
filesystem submitting the DIO. The filesystem can then take the
exclusive lock and resubmit the unaligned DIO, this time without the
nowait semantics.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2021-01-08 20:46 UTC | newest]
Thread overview: 4+ messages
2020-12-28 15:57 Disk aligned (but not block aligned) DIO write woes Avi Kivity
2021-01-04 15:06 ` Brian Foster
2021-01-04 15:16 ` Avi Kivity
2021-01-08 20:45 ` Dave Chinner