From: Avi Kivity <avi@scylladb.com>
To: Eric Sandeen <sandeen@sandeen.net>, Brian Foster <bfoster@redhat.com>
Cc: xfs@oss.sgi.com
Subject: Re: Question about non asynchronous aio calls.
Date: Wed, 7 Oct 2015 21:13:06 +0300
Message-ID: <561560B2.1080902@scylladb.com>
In-Reply-To: <56153685.3040401@sandeen.net>

On 07/10/15 18:13, Eric Sandeen wrote:
>
> On 10/7/15 10:08 AM, Brian Foster wrote:
>> On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
>>>
>>> On 10/7/15 9:18 AM, Gleb Natapov wrote:
>>>> Hello XFS developers,
>>>>
>>>> We are working on scylladb[1] database which is written using seastar[2]
>>>> - highly asynchronous C++ framework. The code uses aio heavily: no
>>>> synchronous operation is allowed at all by the framework otherwise
>>>> performance drops drastically. We noticed that the only mainstream FS
>>>> in Linux that takes aio seriously is XFS. So let me start by thanking
>>>> you guys for the great work! But unfortunately we also noticed that
>>>> sometimes io_submit() is executed synchronously even on XFS.
>>>>
>>>> Looking at the code I see two cases when this is happening: unaligned
>>>> IO and write past EOF. It looks like we hit both. For the first one we
>>>> make special effort to never issue unaligned IO and we use XFS_IOC_DIOINFO
>>>> to figure out what alignment should be, but it does not help. Looking at the
>>>> code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
>>>> is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
>>>> filesystem block size not values that DIOINFO returns. Is it intentional? How
>>>> should our code know what it should align buffers to?
>>>          /* "unaligned" here means not aligned to a filesystem block */
>>>          if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>>>                  unaligned_io = 1;
>>>
>>> It should be aligned to the filesystem block size.
>>>
>> I'm not sure exactly what kinds of races are opened if the above locking
>> were absent, but I'd guess it's related to the buffer/block state
>> management, block zeroing and whatnot that is buried in the depths of
>> the generic dio code.
> Yep:
>
> commit eda77982729b7170bdc9e8855f0682edf322d277
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Jan 11 10:22:40 2011 +1100
>
>      xfs: serialise unaligned direct IOs
>      
>      When two concurrent unaligned, non-overlapping direct IOs are issued
>      to the same block, the direct Io layer will race to zero the block.
>      The result is that one of the concurrent IOs will overwrite data
>      written by the other IO with zeros. This is demonstrated by the
>      xfsqa test 240.
>      
>      To avoid this problem, serialise all unaligned direct IOs to an
>      inode with a big hammer. We need a big hammer approach as we need to
>      serialise AIO as well, so we can't just block writes on locks.
>      Hence, the big hammer is calling xfs_ioend_wait() while holding out
>      other unaligned direct IOs from starting.
>      
>      We don't bother trying to serialised aligned vs unaligned IOs as
>      they are overlapping IO and the result of concurrent overlapping IOs
>      is undefined - the result of either IO is a valid result so we let
>      them race. Hence we only penalise unaligned IO, which already has a
>      major overhead compared to aligned IO so this isn't a major problem.
>      
>      Signed-off-by: Dave Chinner <dchinner@redhat.com>
>      Reviewed-by: Alex Elder <aelder@sgi.com>
>      Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> I fixed something similar in ext4 at the time, FWIW.

Makes sense.

Is there a way to relax this for reads?  It's pretty easy to saturate 
the disk read bandwidth with 4K reads, and there shouldn't be a race 
there, at least for reads targeting already-written blocks.  For us at 
least small reads would be sufficient.
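
To make the alignment rule quoted above concrete, here is a minimal
userspace sketch of the same test. It is not the kernel code itself:
the name is_block_aligned() and the blocksize parameter are made up for
illustration, with blocksize standing in for sb_blocksize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace sketch of the quoted check. "blocksize" is a stand-in for
 * the filesystem block size (sb_blocksize); how the caller obtains it
 * is deliberately left open here.
 */
bool is_block_aligned(uint64_t pos, uint64_t count, uint64_t blocksize)
{
    /* The mask trick only works for power-of-two block sizes. */
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

    uint64_t blockmask = blocksize - 1;   /* mirrors m_blockmask */

    /* Both the start and the end of the I/O must sit on a block boundary. */
    return ((pos & blockmask) == 0) && (((pos + count) & blockmask) == 0);
}
```

Assuming a 4096-byte filesystem block, an I/O at offset 512 or of length
512 fails this test even if it satisfies a smaller direct-I/O alignment,
which would be the mismatch Gleb describes.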

