From: Avi Kivity <avi@scylladb.com>
To: Eric Sandeen <sandeen@sandeen.net>, Brian Foster <bfoster@redhat.com>
Cc: xfs@oss.sgi.com
Subject: Re: Question about non asynchronous aio calls.
Date: Wed, 7 Oct 2015 21:13:06 +0300
Message-ID: <561560B2.1080902@scylladb.com>
In-Reply-To: <56153685.3040401@sandeen.net>

On 07/10/15 18:13, Eric Sandeen wrote:
>
> On 10/7/15 10:08 AM, Brian Foster wrote:
>> On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
>>>
>>> On 10/7/15 9:18 AM, Gleb Natapov wrote:
>>>> Hello XFS developers,
>>>>
>>>> We are working on scylladb[1] database which is written using seastar[2]
>>>> - highly asynchronous C++ framework. The code uses aio heavily: no
>>>> synchronous operation is allowed at all by the framework otherwise
>>>> performance drops drastically. We noticed that the only mainstream FS
>>>> in Linux that takes aio seriously is XFS. So let me start by thanking
>>>> you guys for the great work! But unfortunately we also noticed that
>>>> sometimes io_submit() is executed synchronously even on XFS.
>>>>
>>>> Looking at the code I see two cases when this is happening: unaligned
>>>> IO and write past EOF. It looks like we hit both. For the first one we
>>>> make special effort to never issue unaligned IO and we use XFS_IOC_DIOINFO
>>>> to figure out what alignment should be, but it does not help. Looking at the
>>>> code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
>>>> is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
>>>> filesystem block size not values that DIOINFO returns. Is it intentional? How
>>>> should our code know what it should align buffers to?
>>> /* "unaligned" here means not aligned to a filesystem block */
>>> if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>>>         unaligned_io = 1;
>>>
>>> It should be aligned to the filesystem block size.
>>>
>> I'm not sure exactly what kinds of races are opened if the above locking
>> were absent, but I'd guess it's related to the buffer/block state
>> management, block zeroing and whatnot that is buried in the depths of
>> the generic dio code.
> Yep:
>
> commit eda77982729b7170bdc9e8855f0682edf322d277
> Author: Dave Chinner <dchinner@redhat.com>
> Date: Tue Jan 11 10:22:40 2011 +1100
>
> xfs: serialise unaligned direct IOs
>
> When two concurrent unaligned, non-overlapping direct IOs are issued
> to the same block, the direct Io layer will race to zero the block.
> The result is that one of the concurrent IOs will overwrite data
> written by the other IO with zeros. This is demonstrated by the
> xfsqa test 240.
>
> To avoid this problem, serialise all unaligned direct IOs to an
> inode with a big hammer. We need a big hammer approach as we need to
> serialise AIO as well, so we can't just block writes on locks.
> Hence, the big hammer is calling xfs_ioend_wait() while holding out
> other unaligned direct IOs from starting.
>
> We don't bother trying to serialised aligned vs unaligned IOs as
> they are overlapping IO and the result of concurrent overlapping IOs
> is undefined - the result of either IO is a valid result so we let
> them race. Hence we only penalise unaligned IO, which already has a
> major overhead compared to aligned IO so this isn't a major problem.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Alex Elder <aelder@sgi.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> I fixed something similar in ext4 at the time, FWIW.

Makes sense.

Is there a way to relax this for reads? It's pretty easy to saturate
the disk read bandwidth with 4K reads, and there shouldn't be a race
there, at least for reads targeting already-written blocks. For us, at
least, small reads would be sufficient.
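
The alignment rule quoted above can be mirrored in userspace before
submitting an IO. This is only a minimal sketch: io_is_block_aligned()
and fs_block_size() are hypothetical helper names, and it assumes that
f_bsize from fstatvfs() reports the filesystem block size that
m_blockmask is derived from, which should be verified on the target
system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Userspace mirror of the quoted kernel check:
 *   (pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask)
 * Returns true when both the start and the end of the IO fall on a
 * filesystem block boundary. blocksize must be a power of two.
 */
static bool io_is_block_aligned(uint64_t pos, uint64_t count,
                                uint64_t blocksize)
{
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

    uint64_t blockmask = blocksize - 1;

    return (pos & blockmask) == 0 && ((pos + count) & blockmask) == 0;
}

/*
 * Hypothetical helper: query the block size of the filesystem behind fd.
 * Assumption: f_bsize matches the filesystem block size (sb_blocksize).
 * Returns -1 on error.
 */
static long fs_block_size(int fd)
{
    struct statvfs st;

    if (fstatvfs(fd, &st) != 0)
        return -1;
    return (long)st.f_bsize;
}
```

With a 4096-byte block size, an IO at offset 512 or of length 512
fails this check even though it satisfies a 512-byte alignment.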
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs