From: Avi Kivity <avi@scylladb.com>
To: Dave Chinner <david@fromorbit.com>, Gleb Natapov <gleb@scylladb.com>
Cc: Brian Foster <bfoster@redhat.com>,
Eric Sandeen <sandeen@sandeen.net>,
xfs@oss.sgi.com
Subject: Re: Question about non asynchronous aio calls.
Date: Mon, 12 Oct 2015 15:37:04 +0300 [thread overview]
Message-ID: <561BA970.8080504@scylladb.com> (raw)
In-Reply-To: <20151008114622.GV27164@dastard>
On 10/08/2015 02:46 PM, Dave Chinner wrote:
> On Thu, Oct 08, 2015 at 11:23:07AM +0300, Gleb Natapov wrote:
>> On Thu, Oct 08, 2015 at 08:21:58AM +0300, Avi Kivity wrote:
>>>>>> I fixed something similar in ext4 at the time, FWIW.
>>>>> Makes sense.
>>>>>
>>>>> Is there a way to relax this for reads?
>>>> The above mostly only applies to writes. Reads don't modify data so
>>>> racing unaligned reads against other reads won't give unexpected
>>>> results and so aren't serialised.
>>>>
>>>> i.e. serialisation will only occur when:
>>>> - unaligned write IO will serialise until sub-block zeroing
>>>> is complete.
>>>> - write IO extending EOF will serialise until post-EOF
>>>> zeroing is complete
>>>
>>> By "complete" here, do you mean that a call to truncate() returned, or that
>>> its results reached the disk an unknown time later?
>>>
> No, I'm talking purely about DIO here. If you do a write that
> starts beyond the existing EOF, there is a region between the
> current EOF and the offset the write starts at. i.e.
>
> 0              EOF            offset     new EOF
> +dddddddddddddd+..............+nnnnnnnnnnn+
>
> It is the region between EOF and offset that we must ensure is made
> up of either holes, unwritten extents or fully zeroed blocks before
> allowing the write to proceed. If we have to zero allocated blocks,
> then we have to ensure that completes before the write can start.
> This means that when we update the EOF on completion of the write,
> we don't expose stale data in blocks that were between EOF and
> offset...
Thanks. We found, experimentally, that io_submit(write_at_eof) followed
by (without waiting) io_submit(write_at_what_would_be_the_new_eof)
occasionally blocks.
So I guess we have to employ a train algorithm here and keep at most one
aio in flight for append loads (which are very common for us).
>
>> I think Brian already answered that one with:
>>
>> There are no such pitfalls as far as I'm aware. The entire AIO
>> submission synchronization sequence triggers off an in-memory i_size
>> check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
>> the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
>> that point the new size should be visible to subsequent AIO writers.
> Different situation as truncate serialises all IO. Extending the file
> via truncate also runs the same "EOF zeroing" that the DIO code runs
> above, for the same reasons.
Does that mean that truncate() will wait for inflight aios, or that new
aios will wait for the truncate() to complete, or both?
>
>>>> - truncate/extent manipulation syscall is run
>>> Actually, we do call fallocate() ahead of io_submit() (in a worker thread,
>>> in non-overlapping ranges) to optimize file layout and also in the belief
>>> that it would reduce the amount of blocking io_submit() does.
> fallocate serialises all IO submission - including reads. Unlike
> truncate, however, it doesn't drain the queue of IO for
> preallocation so the impact on AIO is somewhat limited.
>
> Ideally you want to limit fallocate calls to large chunks at a time.
> If you have a 1:1 mapping of fallocate calls to write calls, then
> you're likely making things worse for the AIO submission path
> because you'll block reads as well as writes. Doing the allocation
> in the write submission path will not block reads, and only writes
> that are attempting to do concurrent allocations to the same file
> will serialise...
We have a 1:8 ratio (128K:1M), but that's just random numbers we guessed.
Again, not only for reduced xfs metadata, but also to reduce the amount
of write amplification done by the FTL. We have a concurrent append
workload on many files, and files are reclaimed out of order, so larger
extents mean less fragmentation for the FTL later on.
>
> If you want to limit fragmentation without adding any overhead on
> XFS for non-sparse files (which sounds like your case), then the
> best thing to use in XFS is the per-inode extent size hints. You set
> it on the file when first creating it (or the parent directory so
> all children inherit it at create), and then the allocator will
> round out allocations to the size hint alignment and size, including
> beyond EOF so appending writes can take advantage of it....
We'll try that out. That's fsxattr::fsx_extsize?
What about small files that are eventually closed, do I need to do
anything to reclaim the preallocated space?
>
>>> A final point is discoverability. There is no way to discover safe
>>> alignment for reads and writes, and which operations block io_submit(),
>>> except by asking here, which cannot be done at runtime. Interfaces that
>>> provide a way to query these attributes are very important to us.
>> As Brian pointed out, statfs() can be used to get f_bsize, which is
>> defined as the "optimal transfer block size".
> Well, that's what posix calls it. It's not really the optimal IO
> size, though, it's just the IO size that avoids page cache RMW
> cycles. For direct IO, larger tends to be better, and IO aligned to
> the underlying geometry of the storage is even better. See, for
> example, the "largeio" mount option, which will make XFS report the
> stripe width in f_bsize rather than the PAGE_SIZE of the machine....
>
Well, random reads will still be faster with 512-byte alignment, yes?
And for random writes, you can't just make those I/Os larger; you'll
overwrite something.
So I read "optimal" here to mean "smallest I/O size that doesn't incur a
penalty; but if you really need more data, making it larger will help".