* Disk aligned (but not block aligned) DIO write woes
@ 2020-12-28 15:57 Avi Kivity
2021-01-04 15:06 ` Brian Foster
0 siblings, 1 reply; 4+ messages in thread
From: Avi Kivity @ 2020-12-28 15:57 UTC (permalink / raw)
To: linux-xfs
I observe that XFS takes an exclusive lock for DIO writes that are not
block aligned:
xfs_file_dio_aio_write(
{
	...
	/*
	 * Don't take the exclusive iolock here unless the I/O is unaligned to
	 * the file system block size. We don't need to consider the EOF
	 * extension case here because xfs_file_aio_write_checks() will relock
	 * the inode as necessary for EOF zeroing cases and fill out the new
	 * inode size as appropriate.
	 */
	if ((iocb->ki_pos & mp->m_blockmask) ||
	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
		unaligned_io = 1;

		/*
		 * We can't properly handle unaligned direct I/O to reflink
		 * files yet, as we can't unshare a partial block.
		 */
		if (xfs_is_cow_inode(ip)) {
			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
					count);
			return -ENOTBLK;
		}
		iolock = XFS_IOLOCK_EXCL;
	} else {
		iolock = XFS_IOLOCK_SHARED;
	}
I also see that such writes cause io_submit to block, even if they hit a
written extent (and are also not size-changing, by implication) and
therefore do not require a metadata write. Probably due to "||
unaligned_io" in
	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
			   &xfs_dio_write_ops,
			   is_sync_kiocb(iocb) || unaligned_io);
Can this be relaxed to allow writes to written extents to proceed in
parallel? I explain the motivation below.
My thinking (from a position of blissful ignorance) is that if the
extent is already written, then no metadata changes and block zeroing
are needed. If we can detect that favorable conditions exist (perhaps
with the extra constraint that the mapping be already cached), then we
can handle this particular case asynchronously.
My motivation is a database commit log. NVMe drives can serve small
writes with ridiculously low latency - around 20 microseconds. Let's say
a commitlog entry is around 100 bytes; we fill a 4k block with 41
entries. To achieve that in 20 microseconds requires 2 million
records/sec. Even if we add artificial delay and commit every 1ms,
filling this 4k block requires 41,000 commits/sec. If the entry write
rate is lower, then we will be forced to pad the rest of the block. This
increases the write amplification, impacting other activities using the
disk (such as reads).
41,000 commits/sec may not sound like much, but in a thread-per-core
design (where each core commits independently) this translates to
millions of commits per second for the entire machine. If the real
throughput is below that, then we are forced to either increase the
latency to collect more writes into a full block, or we have to tolerate
the increased write amplification.
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk aligned (but not block aligned) DIO write woes
2020-12-28 15:57 Disk aligned (but not block aligned) DIO write woes Avi Kivity
@ 2021-01-04 15:06 ` Brian Foster
2021-01-04 15:16 ` Avi Kivity
2021-01-08 20:45 ` Dave Chinner
0 siblings, 2 replies; 4+ messages in thread
From: Brian Foster @ 2021-01-04 15:06 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
> I observe that XFS takes an exclusive lock for DIO writes that are not block
> aligned:
>
>
> xfs_file_dio_aio_write(
>
> {
>
> ...
>
> 	/*
> 	 * Don't take the exclusive iolock here unless the I/O is unaligned to
> 	 * the file system block size. We don't need to consider the EOF
> 	 * extension case here because xfs_file_aio_write_checks() will relock
> 	 * the inode as necessary for EOF zeroing cases and fill out the new
> 	 * inode size as appropriate.
> 	 */
> 	if ((iocb->ki_pos & mp->m_blockmask) ||
> 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
> 		unaligned_io = 1;
>
> 		/*
> 		 * We can't properly handle unaligned direct I/O to reflink
> 		 * files yet, as we can't unshare a partial block.
> 		 */
> 		if (xfs_is_cow_inode(ip)) {
> 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
> 					count);
> 			return -ENOTBLK;
> 		}
> 		iolock = XFS_IOLOCK_EXCL;
> 	} else {
> 		iolock = XFS_IOLOCK_SHARED;
> 	}
>
>
> I also see that such writes cause io_submit to block, even if they hit a
> written extent (and are also not size-changing, by implication) and
> therefore do not require a metadata write. Probably due to "|| unaligned_io"
> in
>
>
> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> &xfs_dio_write_ops,
> is_sync_kiocb(iocb) || unaligned_io);
>
>
> Can this be relaxed to allow writes to written extents to proceed in
> parallel? I explain the motivation below.
>
From the above code, it looks as though we require the exclusive lock to
accommodate sub-block zeroing that might occur down in the dio code.
Without it, concurrent sector granular I/Os to the same block could
presumably cause corruption if the data bios and associated zeroing bios
were reordered between two requests.
Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
sub-block zeroing if the associated range is unaligned and the mapping
is either unwritten or new (i.e. newly allocated for this write). So as
you note above, there is no such zeroing for a pre-existing written
block within EOF. From that, it seems reasonable to me that in principle
the filesystem could check for these conditions in the higher layer and
further restrict usage of the exclusive lock. That said, we'd probably
have to rework the above code to acquire the shared lock first,
read/check the extent state for the start/end blocks of the I/O and then
cycle out to the exclusive lock in the cases where it is required. It's
not immediately clear to me if we'd still want to set the wait flag in
those unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio
code a bit better can chime in further on all of this..
Brian
>
> My thinking (from a position of blissful ignorance) is that if the extent is
> already written, then no metadata changes and block zeroing are needed. If
> we can detect that favorable conditions exist (perhaps with the extra
> constraint that the mapping be already cached), then we can handle this
> particular case asynchronously.
>
>
> My motivation is a database commit log. NVMe drives can serve small writes
> with ridiculously low latency - around 20 microseconds. Let's say a
> commitlog entry is around 100 bytes; we fill a 4k block with 41 entries. To
> achieve that in 20 microseconds requires 2 million records/sec. Even if we
> add artificial delay and commit every 1ms, filling this 4k block requires
> 41,000 commits/sec. If the entry write rate is lower, then we will be forced
> to pad the rest of the block. This increases the write amplification,
> impacting other activities using the disk (such as reads).
>
>
> 41,000 commits/sec may not sound like much, but in a thread-per-core design
> (where each core commits independently) this translates to millions of
> commits per second for the entire machine. If the real throughput is below
> that, then we are forced to either increase the latency to collect more
> writes into a full block, or we have to tolerate the increased write
> amplification.
>
>
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk aligned (but not block aligned) DIO write woes
2021-01-04 15:06 ` Brian Foster
@ 2021-01-04 15:16 ` Avi Kivity
2021-01-08 20:45 ` Dave Chinner
1 sibling, 0 replies; 4+ messages in thread
From: Avi Kivity @ 2021-01-04 15:16 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-xfs
On 04/01/2021 17.06, Brian Foster wrote:
> On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
>> I observe that XFS takes an exclusive lock for DIO writes that are not block
>> aligned:
>>
>>
>> xfs_file_dio_aio_write(
>>
>> {
>>
>> ...
>>
>> 	/*
>> 	 * Don't take the exclusive iolock here unless the I/O is unaligned to
>> 	 * the file system block size. We don't need to consider the EOF
>> 	 * extension case here because xfs_file_aio_write_checks() will relock
>> 	 * the inode as necessary for EOF zeroing cases and fill out the new
>> 	 * inode size as appropriate.
>> 	 */
>> 	if ((iocb->ki_pos & mp->m_blockmask) ||
>> 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
>> 		unaligned_io = 1;
>>
>> 		/*
>> 		 * We can't properly handle unaligned direct I/O to reflink
>> 		 * files yet, as we can't unshare a partial block.
>> 		 */
>> 		if (xfs_is_cow_inode(ip)) {
>> 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
>> 					count);
>> 			return -ENOTBLK;
>> 		}
>> 		iolock = XFS_IOLOCK_EXCL;
>> 	} else {
>> 		iolock = XFS_IOLOCK_SHARED;
>> 	}
>>
>>
>> I also see that such writes cause io_submit to block, even if they hit a
>> written extent (and are also not size-changing, by implication) and
>> therefore do not require a metadata write. Probably due to "|| unaligned_io"
>> in
>>
>>
>> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
>> &xfs_dio_write_ops,
>> is_sync_kiocb(iocb) || unaligned_io);
>>
>>
>> Can this be relaxed to allow writes to written extents to proceed in
>> parallel? I explain the motivation below.
>>
> From the above code, it looks as though we require the exclusive lock to
> accommodate sub-block zeroing that might occur down in the dio code.
> Without it, concurrent sector granular I/Os to the same block could
> presumably cause corruption if the data bios and associated zeroing bios
> were reordered between two requests.
>
> Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
> sub-block zeroing if the associated range is unaligned and the mapping
> is either unwritten or new (i.e. newly allocated for this write). So as
> you note above, there is no such zeroing for a pre-existing written
> block within EOF. From that, it seems reasonable to me that in principle
> the filesystem could check for these conditions in the higher layer and
> further restrict usage of the exclusive lock. That said, we'd probably
> have to rework the above code to acquire the shared lock first,
> read/check the extent state for the start/end blocks of the I/O and then
> cycle out to the exclusive lock in the cases where it is required. It's
> not immediately clear to me if we'd still want to set the wait flag in
> those unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio
> code a bit better can chime in further on all of this..
>
>
Thanks. I'll try to convince someone to attempt to do this.
Meanwhile we'll try using -b size=1024 (a 1k filesystem block size at
mkfs time). I have hopes that the
extent-based nature of the filesystem will not cause this to increase
overhead too much.
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk aligned (but not block aligned) DIO write woes
2021-01-04 15:06 ` Brian Foster
2021-01-04 15:16 ` Avi Kivity
@ 2021-01-08 20:45 ` Dave Chinner
1 sibling, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2021-01-08 20:45 UTC (permalink / raw)
To: Brian Foster; +Cc: Avi Kivity, linux-xfs
On Mon, Jan 04, 2021 at 10:06:01AM -0500, Brian Foster wrote:
> On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
> > I observe that XFS takes an exclusive lock for DIO writes that are not block
> > aligned:
> >
> >
> > xfs_file_dio_aio_write(
> >
> > {
> >
> > ...
> >
> > 	/*
> > 	 * Don't take the exclusive iolock here unless the I/O is unaligned to
> > 	 * the file system block size. We don't need to consider the EOF
> > 	 * extension case here because xfs_file_aio_write_checks() will relock
> > 	 * the inode as necessary for EOF zeroing cases and fill out the new
> > 	 * inode size as appropriate.
> > 	 */
> > 	if ((iocb->ki_pos & mp->m_blockmask) ||
> > 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
> > 		unaligned_io = 1;
> >
> > 		/*
> > 		 * We can't properly handle unaligned direct I/O to reflink
> > 		 * files yet, as we can't unshare a partial block.
> > 		 */
> > 		if (xfs_is_cow_inode(ip)) {
> > 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
> > 					count);
> > 			return -ENOTBLK;
> > 		}
> > 		iolock = XFS_IOLOCK_EXCL;
> > 	} else {
> > 		iolock = XFS_IOLOCK_SHARED;
> > 	}
> >
> >
> > I also see that such writes cause io_submit to block, even if they hit a
> > written extent (and are also not size-changing, by implication) and
> > therefore do not require a metadata write. Probably due to "|| unaligned_io"
> > in
> >
> >
> > ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> > &xfs_dio_write_ops,
> > is_sync_kiocb(iocb) || unaligned_io);
> >
> >
> > Can this be relaxed to allow writes to written extents to proceed in
> > parallel? I explain the motivation below.
> >
>
> From the above code, it looks as though we require the exclusive lock to
> accommodate sub-block zeroing that might occur down in the dio code.
> Without it, concurrent sector granular I/Os to the same block could
> presumably cause corruption if the data bios and associated zeroing bios
> were reordered between two requests.
>
> Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
> sub-block zeroing if the associated range is unaligned and the mapping
> is either unwritten or new (i.e. newly allocated for this write). So as
> you note above, there is no such zeroing for a pre-existing written
> block within EOF. From that, it seems reasonable to me that in principle
> the filesystem could check for these conditions in the higher layer and
> further restrict usage of the exclusive lock.
Yuck, no.
> That said, we'd probably
> have to rework the above code to acquire the shared lock first,
> read/check the extent state for the start/end blocks of the I/O and then
> cycle out to the exclusive lock in the cases where it is required.
Yes, that requires mapping lookups above the iomap layer, then
repeating them again in the iomap layer and having to juggle the
locking to ensure the iomap code gets the same result. This also
requires taking the ILOCK to decide how to lock the IOLOCK, which
inverts all the IO path locking in nasty, bug prone ways. We tried
very hard to get rid of all those layering violations and lock
dances in the IO path with iomap - the old DIO code was a never
ending source of locking bugs because we had to play locking games
like this....
>
> It's
> not immediately clear to me if we'd still want to set the wait flag in
> those unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio
> code a bit better can chime in further on all of this..
As I explained quickly on #xfs over xmas/ny time - because it seems
like 4 or 5 people all asked for this at the same time - I suspect
we can do this with IOMAP_NOWAIT directives for unaligned IO under
shared locking.
That is, if the unaligned dio is going to require allocation or
sub-block zeroing or would otherwise block, the filesystem mapping
code returns EAGAIN to iomap_apply(), which propagates back to the
filesystem submitting the DIO. The filesystem can then take the
exclusive lock and resubmit the unaligned DIO, this time without the
nowait semantics.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2021-01-08 20:46 UTC | newest]
Thread overview: 4+ messages
2020-12-28 15:57 Disk aligned (but not block aligned) DIO write woes Avi Kivity
2021-01-04 15:06 ` Brian Foster
2021-01-04 15:16 ` Avi Kivity
2021-01-08 20:45 ` Dave Chinner