linux-fsdevel.vger.kernel.org archive mirror
From: Kent Overstreet <koverstreet@google.com>
To: Jeff Moyer <jmoyer@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tytso@google.com, tj@kernel.org,
	Dave Kleikamp <dave.kleikamp@oracle.com>,
	Zach Brown <zab@zabbo.net>,
	Dmitry Monakhov <dmonakhov@openvz.org>,
	"Maxim V. Patlasov" <mpatlasov@parallels.com>,
	michael.mesnier@intel.com, jeffrey.d.skirvin@intel.com,
	pjt@google.com
Subject: Re: [RFC, PATCH] Extensible AIO interface
Date: Thu, 4 Oct 2012 12:37:59 -0700
Message-ID: <20121004193759.GZ26488@google.com>
In-Reply-To: <x49sj9vtlip.fsf@segfault.boston.devel.redhat.com>

On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote:
> Kent Overstreet <koverstreet@google.com> writes:
> 
> > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> >> Kent Overstreet <koverstreet@google.com> writes:
> >> 
> >> > So, I and other people keep running into things where we really need to
> >> > add an interface to pass some auxiliary... stuff along with a pread() or
> >> > pwrite().
> >> >
> >> > A few examples:
> >> >
> >> > * IO scheduler hints. Some userspace program wants to, per IO, specify
> >> > either priorities or a cgroup - by specifying a cgroup you can have a
> >> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> >> > quotas.
> >> 
> >> You can do this today by splitting I/O between processes and placing
> >> those processes in different cgroups.  For io priority, there is
> >> ioprio_set, which incurs an extra system call, but can be used.  Not
> >> elegant, but possible.
> >
> > Yes - those are things I'm trying to replace. Doing it that way is a
> > real pain, both because it's a lousy interface for this and because it
> > impacts performance (and ioprio_set doesn't really work well with aio
> > either).
> 
> ioprio_set works fine with aio, since the I/O is issued in the caller's
> context.  Perhaps you're thinking of writeback I/O?

Until you want to issue different IOs with different priorities...

> >> > * Cache hints. For bcache and other things, userspace may want to specify
> >> > "this data should be cached", "this data should bypass the cache", etc.
> >> 
> >> Please explain how you will differentiate this from posix_fadvise.
> >
> > Oh sorry, I think about SSD caching so much I forget to say that's what
> > I'm talking about. posix_fadvise is for the page cache, we want
> > something different for an SSD cache (IMO it'd be really ugly to use it
> > for both, and posix_fadvise() can't really specify everything we'd
> > want to for an SSD cache).
> 
> DESCRIPTION
>        Programs can use posix_fadvise() to announce an intention to
>        access file data in a specific pattern in the future, thus
>        allowing the kernel to perform appropriate optimizations.
> 
> That description seems broad enough to include disk caches as well.  You
> haven't exactly stated what's missing.

It _could_ work for SSD caches, but that doesn't mean you'd want it to -
it doesn't have any way of specifying which cache you want the hint to
apply to, and there are certainly circumstances under which you
_wouldn't_ want it to apply to both.

Making it apply to SSD caches would also silently change existing
behavior, and, as I mentioned, it can't express everything we'd want
for an SSD cache anyway.

> >> > Hence, AIO attributes.
> >> 
> >> *No.*  Start with the non-AIO case first.
> >
> > Why? It is orthogonal to AIO (and I should make that clearer), but to do
> > it for sync IO we'd need new syscalls that take an extra argument so IMO
> > it's a bit easier to start with AIO.
> >
> > Might be worth implementing the sync interface sooner rather than later
> > just to discover any potential issues, I suppose.
> 
> Looking back to preadv and pwritev, it was wrong to only add them to
> libaio (and that later got corrected).  I'd just like to see things
> start out with the sync interfaces, since you'll get more eyes on the
> code (not everyone cares about aio) and that helps to work out any
> interface issues.

I agree we don't want to leave out sync versions, but honestly this
stuff is more useful with AIO and that's the easier place to start.

> > It's not possible in general - consider stacking block devices, and
> > attrs that are supported only by specific block drivers. I.e. if you've
> > got lvm on top of bcache or bcache on top of md, we can pass the attr
> > down with the IO but we can't determine ahead of time, in general, where
> > the IO is going to go.
> 
> If the io stack is static (meaning you setup a device once, then open it
> and do io to it, and it doesn't change while you're doing io), how would
> you not know where the IO is going to go?

With something like dm, md or bcache you've got multiple underlying
devices, and which of them a given IO ends up on is not, in general,
something you can predict ahead of time.

> > But that probably isn't true for most attrs so it probably would be a
> > good idea to have an interface for querying what's supported, and even
> > for device specific ones you could query what a device supports.
> 
> OK.
> 
> >> > One could imagine sticking the return in the attribute itself, but I
> >> > don't want to do this. For some things (checksums), the attribute will
> >> > contain a pointer to a buffer - that's fine. But I don't want the
> >> > attributes themselves to be writeable.
> >> 
> >> One could imagine that attributes don't return anything, because, well,
> >> they're properties of something else, and properties don't return
> >> anything.
> >
> > With a strict definition of attribute, yeah. One of the real use cases
> > we have for this is per-IO timings, for aio - right now we've got an
> > interface for the kernel to tell userspace how long a syscall took
> > (don't think it's upstream yet - Paul's been behind that stuff), but it
> > only really makes sense with synchronous syscalls.
> 
> Something beyond recording the time spent in the kernel?  Paul who?  I
> agree the per io timing for aio may be coarse-grained today (you can
> time the difference between io_submit returning and the event being
> returned by io_getevents, but that says nothing of when the io was
> issued to the block layer).  I'm curious to know exactly what
> granularity you want here, and what an application would do with that
> information.  You can currently access a whole lot of detail of the io
> path through blktrace, but that is not easily done from within an
> application.

Paul Turner, the scheduler guy.

I believe it's both syscall time and IO time (i.e. what you'd get from
blktrace). It's basically used for visibility in filesystem-type stuff,
for monitoring latency - RPC latency isn't enough, you really need to
know why things are slow, and that could be as simple as a disk going
bad.


Thread overview: 21+ messages
2012-10-01 22:23 [RFC, PATCH] Extensible AIO interface Kent Overstreet
2012-10-01 23:12 ` Zach Brown
2012-10-01 23:22   ` Kent Overstreet
2012-10-01 23:44     ` Zach Brown
2012-10-02  0:22       ` Kent Overstreet
2012-10-02 17:43         ` Zach Brown
2012-10-02 21:41           ` Kent Overstreet
2012-10-03  1:41             ` Tejun Heo
2012-10-03  3:00               ` Kent Overstreet
2012-10-03 21:58                 ` Tejun Heo
2012-10-04 19:50                   ` Kent Overstreet
2012-10-02  0:47       ` Kent Overstreet
2012-10-02 22:34     ` Martin K. Petersen
2012-10-02 17:41 ` Jeff Moyer
2012-10-03  0:20   ` Kent Overstreet
2012-10-03  1:28     ` Dave Chinner
2012-10-03  2:41       ` Kent Overstreet
2012-10-04  1:04         ` Dave Chinner
2012-10-03 19:15     ` Jeff Moyer
2012-10-04 19:37       ` Kent Overstreet [this message]
2012-10-02 19:34 ` Andi Kleen
