* Disk aligned (but not block aligned) DIO write woes
From: Avi Kivity @ 2020-12-28 15:57 UTC (permalink / raw)
  To: linux-xfs

I observe that XFS takes an exclusive lock for DIO writes that are not 
block aligned:


xfs_file_dio_aio_write()
{
        ...

        /*
         * Don't take the exclusive iolock here unless the I/O is unaligned to
         * the file system block size.  We don't need to consider the EOF
         * extension case here because xfs_file_aio_write_checks() will relock
         * the inode as necessary for EOF zeroing cases and fill out the new
         * inode size as appropriate.
         */
        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                unaligned_io = 1;

                /*
                 * We can't properly handle unaligned direct I/O to reflink
                 * files yet, as we can't unshare a partial block.
                 */
                if (xfs_is_cow_inode(ip)) {
                        trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
                        return -ENOTBLK;
                }
                iolock = XFS_IOLOCK_EXCL;
        } else {
                iolock = XFS_IOLOCK_SHARED;
        }
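
For example, with 4k filesystem blocks m_blockmask is 0xfff, so a 512-byte 
write at offset 512 is sector-aligned but trips the first check above and 
takes the exclusive iolock.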


I also see that such writes cause io_submit() to block, even when they hit 
a written extent (and, by implication, are not size-changing) and therefore 
require no metadata write. This is probably due to the "|| unaligned_io" in


         ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
                            &xfs_dio_write_ops,
                            is_sync_kiocb(iocb) || unaligned_io);
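
For what it's worth, here is a minimal userspace sketch of the behavior 
(my assumptions: libaio, a 512-byte-sector device, 4k filesystem blocks, 
and a pre-written "testfile" so the target extent is already allocated):

        /* Sector-aligned but not block-aligned DIO write: XFS flags it
         * as unaligned_io, so io_submit() blocks until the write
         * completes instead of returning immediately.
         * Build with: gcc repro.c -o repro -laio */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <libaio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
                io_context_t ctx = 0;
                struct iocb cb, *cbs[1] = { &cb };
                struct io_event ev;
                void *buf;
                int fd;

                if (io_setup(1, &ctx) < 0)
                        return 1;
                /* O_DIRECT needs a sector-aligned buffer and offset. */
                if (posix_memalign(&buf, 512, 512))
                        return 1;
                memset(buf, 'x', 512);

                fd = open("testfile", O_RDWR | O_DIRECT);
                if (fd < 0)
                        return 1;

                /* Offset 512 is sector-aligned but not 4k-aligned, so
                 * this hits the unaligned_io path shown above. */
                io_prep_pwrite(&cb, fd, buf, 512, 512);
                if (io_submit(ctx, 1, cbs) != 1)        /* blocks here */
                        return 1;
                io_getevents(ctx, 1, 1, &ev, NULL);
                io_destroy(ctx);
                return 0;
        }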


Can this be relaxed to allow writes to written extents to proceed in 
parallel? I explain the motivation below.


My thinking (from a position of blissful ignorance) is that if the extent 
is already written, then no metadata changes or block zeroing are needed. 
If we can detect that these favorable conditions exist (perhaps with the 
extra constraint that the mapping already be cached), then we can handle 
this particular case asynchronously.
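
To make that concrete, something along these lines (purely a sketch, not 
real XFS code; xfs_dio_range_is_written() is an invented helper standing 
in for whatever extent lookup is appropriate):

        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                unaligned_io = 1;
                /*
                 * Hypothetical: if the whole range sits inside a single
                 * written, non-shared extent, a sub-block overwrite needs
                 * no zeroing and no metadata update, so a shared iolock
                 * (and async submission) might be safe.
                 */
                if (xfs_dio_range_is_written(ip, iocb->ki_pos, count))
                        iolock = XFS_IOLOCK_SHARED;
                else
                        iolock = XFS_IOLOCK_EXCL;
        }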


My motivation is a database commit log. NVMe drives can serve small writes 
with ridiculously low latency - around 20 microseconds. Let's say a 
commitlog entry is around 100 bytes; we fill a 4k block with 41 entries. 
To achieve that in 20 microseconds requires 2 million records/sec. Even if 
we add artificial delay and commit every 1ms, filling this 4k block 
requires 41,000 commits/sec. If the entry write rate is lower, we are 
forced to pad the rest of the block, which increases write amplification 
and impacts other activities using the disk (such as reads).
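
Spelling out the arithmetic:

        4096 bytes/block / ~100 bytes/entry  ~= 41 entries/block
        41 entries / 20 microseconds         ~= 2 million entries/sec
        41 entries / 1 millisecond            = 41,000 entries/sec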


41,000 commits/sec may not sound like much, but in a thread-per-core 
design (where each core commits independently) this translates to millions 
of commits per second for the entire machine. If the real throughput is 
below that, we are forced either to increase the latency to collect more 
writes into a full block, or to tolerate the increased write amplification.


