Re: Question about non asynchronous aio calls.

From: Avi Kivity <avi@scylladb.com>
To: Eric Sandeen <sandeen@sandeen.net>, Brian Foster <bfoster@redhat.com>
Cc: xfs@oss.sgi.com
Subject: Re: Question about non asynchronous aio calls.
Date: Wed, 7 Oct 2015 21:13:06 +0300
Message-ID: <561560B2.1080902@scylladb.com>
In-Reply-To: <56153685.3040401@sandeen.net>

On 07/10/15 18:13, Eric Sandeen wrote:
>
> On 10/7/15 10:08 AM, Brian Foster wrote:
>> On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
>>>
>>> On 10/7/15 9:18 AM, Gleb Natapov wrote:
>>>> Hello XFS developers,
>>>>
>>>> We are working on scylladb[1] database which is written using seastar[2]
>>>> - highly asynchronous C++ framework. The code uses aio heavily: no
>>>> synchronous operation is allowed at all by the framework otherwise
>>>> performance drops drastically. We noticed that the only mainstream FS
>>>> in Linux that takes aio seriously is XFS. So let me start by thanking
>>>> you guys for the great work! But unfortunately we also noticed that
>>>> sometimes io_submit() is executed synchronously even on XFS.
>>>>
>>>> Looking at the code I see two cases when this is happening: unaligned
>>>> IO and write past EOF. It looks like we hit both. For the first one we
>>>> make a special effort to never issue unaligned IO and we use XFS_IOC_DIOINFO
>>>> to figure out what alignment should be, but it does not help. Looking at the
>>>> code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
>>>> is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
>>>> filesystem block size not values that DIOINFO returns. Is it intentional? How
>>>> should our code know what it should align buffers to?
>>>          /* "unaligned" here means not aligned to a filesystem block */
>>>          if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>>>                  unaligned_io = 1;
>>>
>>> It should be aligned to the filesystem block size.
>>>
>> I'm not sure exactly what kinds of races are opened if the above locking
>> were absent, but I'd guess it's related to the buffer/block state
>> management, block zeroing and whatnot that is buried in the depths of
>> the generic dio code.
> Yep:
>
> commit eda77982729b7170bdc9e8855f0682edf322d277
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Jan 11 10:22:40 2011 +1100
>
>      xfs: serialise unaligned direct IOs
>      
>      When two concurrent unaligned, non-overlapping direct IOs are issued
>      to the same block, the direct IO layer will race to zero the block.
>      The result is that one of the concurrent IOs will overwrite data
>      written by the other IO with zeros. This is demonstrated by the
>      xfsqa test 240.
>      
>      To avoid this problem, serialise all unaligned direct IOs to an
>      inode with a big hammer. We need a big hammer approach as we need to
>      serialise AIO as well, so we can't just block writes on locks.
>      Hence, the big hammer is calling xfs_ioend_wait() while holding out
>      other unaligned direct IOs from starting.
>      
>      We don't bother trying to serialise aligned vs unaligned IOs as
>      they are overlapping IO and the result of concurrent overlapping IOs
>      is undefined - the result of either IO is a valid result so we let
>      them race. Hence we only penalise unaligned IO, which already has a
>      major overhead compared to aligned IO so this isn't a major problem.
>      
>      Signed-off-by: Dave Chinner <dchinner@redhat.com>
>      Reviewed-by: Alex Elder <aelder@sgi.com>
>      Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> I fixed something similar in ext4 at the time, FWIW.

Makes sense.

Is there a way to relax this for reads?  It's pretty easy to saturate 
the disk read bandwidth with 4K reads, and there shouldn't be a race 
there, at least for reads targeting already-written blocks.  For us at 
least small reads would be sufficient.
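
For reference, here is a minimal user-space sketch of the block-alignment
test quoted above, so an application can check a request before handing it
to io_submit(). The helper names io_is_block_aligned() and fs_block_size()
are hypothetical, and the sketch assumes that f_bsize as reported by
fstatfs() is the filesystem block size the kernel compares against; that
assumption should be verified on the target kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/vfs.h>

/*
 * Hypothetical helper: true when both ends of [pos, pos + count) fall on
 * a filesystem block boundary. Mirrors the quoted kernel test
 *   (pos & m_blockmask) || ((pos + count) & m_blockmask)
 * blksz must be a non-zero power of two.
 */
bool io_is_block_aligned(uint64_t pos, uint64_t count, uint64_t blksz)
{
    assert(blksz != 0 && (blksz & (blksz - 1)) == 0);
    uint64_t mask = blksz - 1;
    return (pos & mask) == 0 && ((pos + count) & mask) == 0;
}

/*
 * Hypothetical helper: query the block size for an open file descriptor.
 * Assumption: f_bsize is the filesystem block size. Returns 0 on failure.
 */
uint64_t fs_block_size(int fd)
{
    struct statfs sfs;

    if (fstatfs(fd, &sfs) != 0)
        return 0;
    return (uint64_t)sfs.f_bsize;
}
```

With a 4096-byte block size, a 4096-byte request at offset 0 passes the
check, while a 4096-byte request at offset 512 or a 512-byte request at
offset 4096 does not, because one end lands inside a block.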
