Re: [RFC, PATCH] Extensible AIO interface

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kent Overstreet <koverstreet@google.com>
To: Jeff Moyer <jmoyer@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tytso@google.com, tj@kernel.org,
	Dave Kleikamp <dave.kleikamp@oracle.com>,
	Zach Brown <zab@zabbo.net>,
	Dmitry Monakhov <dmonakhov@openvz.org>,
	"Maxim V. Patlasov" <mpatlasov@parallels.com>,
	michael.mesnier@intel.com, jeffrey.d.skirvin@intel.com,
	pjt@google.com
Subject: Re: [RFC, PATCH] Extensible AIO interface
Date: Thu, 4 Oct 2012 12:37:59 -0700	[thread overview]
Message-ID: <20121004193759.GZ26488@google.com> (raw)
In-Reply-To: <x49sj9vtlip.fsf@segfault.boston.devel.redhat.com>

On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote:
> Kent Overstreet <koverstreet@google.com> writes:
> 
> > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> >> Kent Overstreet <koverstreet@google.com> writes:
> >> 
> >> > So, I and other people keep running into things where we really need to
> >> > add an interface to pass some auxiliary... stuff along with a pread() or
> >> > pwrite().
> >> >
> >> > A few examples:
> >> >
> >> > * IO scheduler hints. Some userspace program wants to, per IO, specify
> >> > either priorities or a cgroup - by specifying a cgroup you can have a
> >> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> >> > quotas.
> >> 
> >> You can do this today by splitting I/O between processes and placing
> >> those processes in different cgroups.  For io priority, there is
> >> ioprio_set, which incurs an extra system call, but can be used.  Not
> >> elegant, but possible.
> >
> > Yes - those are things I'm trying to replace. Doing it that way is a
> > real pain, both as it's a lousy interface for this and it does impact
> > performance (ioprio_set doesn't really work too well with aio, too).
> 
> ioprio_set works fine with aio, since the I/O is issued in the caller's
> context.  Perhaps you're thinking of writeback I/O?

Until you want to issue different IOs with different priorities...

> >> > * Cache hints. For bcache and other things, userspace may want to specify
> >> > "this data should be cached", "this data should bypass the cache", etc.
> >> 
> >> Please explain how you will differentiate this from posix_fadvise.
> >
> > Oh sorry, I think about SSD caching so much I forget to say that's what
> > I'm talking about. posix_fadvise is for the page cache, we want
> > something different for an SSD cache (IMO it'd be really ugly to use it
> > for both, and posix_fadvise() can't really specifify everything we'd
> > want to for an SSD cache).
> 
> DESCRIPTION
>        Programs can use posix_fadvise() to announce an intention to
>        access file data in a specific pattern in the future, thus
>        allowing the kernel to perform appropriate optimizations.
> 
> That description seems broad enough to include disk caches as well.  You
> haven't exactly stated what's missing.

It _could_ work for SSD caches, but that doesn't mean you'd want it to -
it doesn't have any way of specifying which cache you want the hint to
apply to, and there are certainly circumstances under which you
_wouldn't_ want it to apply to both.

And making it apply to SSD caches would be silently changing the
behavior, and also like I mentioned it's not sufficient for SSD caches.

> >> > Hence, AIO attributes.
> >> 
> >> *No.*  Start with the non-AIO case first.
> >
> > Why? It is orthogonal to AIO (and I should make that clearer), but to do
> > it for sync IO we'd need new syscalls that take an extra argument so IMO
> > it's a bit easier to start with AIO.
> >
> > Might be worth implementing the sync interface sooner rather than later
> > just to discover any potential issues, I suppose.
> 
> Looking back to preadv and pwritev, it was wrong to only add them to
> libaio (and that later got corrected).  I'd just like to see things
> start out with the sync interfaces, since you'll get more eyes on the
> code (not everyone cares about aio) and that helps to work out any
> interface issues.

I agree we don't want to leave out sync versions, but honestly this
stuff is more useful with AIO and that's the easier place to start.

> > It's not possible in general - consider stacking block devices, and
> > attrs that are supported only by specific block drivers. I.e. if you've
> > got lvm on top of bcache or bcache on top of md, we can pass the attr
> > down with the IO but we can't determine ahead of time, in general, where
> > the IO is going to go.
> 
> If the io stack is static (meaning you setup a device once, then open it
> and do io to it, and it doesn't change while you're doing io), how would
> you not know where the IO is going to go?

With something like dm, md or bcache - you've got multiple underlying
devices, and of those underlying devices which one the IO goes to is not
something you can in general predict ahead of time.

> > But that probably isn't true for most attrs so it probably would be a
> > good idea to have an interface for querying what's supported, and even
> > for device specific ones you could query what a device supports.
> 
> OK.
> 
> >> > One could imagine sticking the return in the attribute itself, but I
> >> > don't want to do this. For some things (checksums), the attribute will
> >> > contain a pointer to a buffer - that's fine. But I don't want the
> >> > attributes themselves to be writeable.
> >> 
> >> One could imagine that attributes don't return anything, because, well,
> >> they're properties of something else, and properties don't return
> >> anything.
> >
> > With a strict definition of attribute, yeah. One of the real uses cases
> > we have for this is per IO timings, for aio - right now we've got an
> > interface for the kernel to tell userspace how long a syscall took
> > (don't think it's upstream yet - Paul's been behind that stuff), but it
> > only really makes sense with synchronous syscalls.
> 
> Something beyond recording the time spent in the kernel?  Paul who?  I
> agree the per io timing for aio may be coarse-grained today (you can
> time the difference between io_submit returning and the event being
> returned by io_getevents, but that says nothing of when the io was
> issued to the block layer).  I'm curious to know exactly what
> granularity you want here, and what an application would do with that
> information.  You can currently access a whole lot of detail of the io
> path through blktrace, but that is not easily done from within an
> application.

Paul Turner, scheduler guy. 

Believe it's both syscall time and IO time (i.e. what you'd get from
blktrace). It's basically used for visibility in filesystem type stuff,
for monitoring latency - rpc latency isn't enough, you really need to
know why things are slow and that could be as simple as a disk going
bad.

next prev parent reply	other threads:[~2012-10-04 19:37 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-10-01 22:23 [RFC, PATCH] Extensible AIO interface Kent Overstreet
2012-10-01 23:12 ` Zach Brown
2012-10-01 23:22   ` Kent Overstreet
2012-10-01 23:44     ` Zach Brown
2012-10-02  0:22       ` Kent Overstreet
2012-10-02 17:43         ` Zach Brown
2012-10-02 21:41           ` Kent Overstreet
2012-10-03  1:41             ` Tejun Heo
2012-10-03  3:00               ` Kent Overstreet
2012-10-03 21:58                 ` Tejun Heo
2012-10-04 19:50                   ` Kent Overstreet
2012-10-02  0:47       ` Kent Overstreet
2012-10-02 22:34     ` Martin K. Petersen
2012-10-02 17:41 ` Jeff Moyer
2012-10-03  0:20   ` Kent Overstreet
2012-10-03  1:28     ` Dave Chinner
2012-10-03  2:41       ` Kent Overstreet
2012-10-04  1:04         ` Dave Chinner
2012-10-03 19:15     ` Jeff Moyer
2012-10-04 19:37       ` Kent Overstreet [this message]
2012-10-02 19:34 ` Andi Kleen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121004193759.GZ26488@google.com \
    --to=koverstreet@google.com \
    --cc=dave.kleikamp@oracle.com \
    --cc=dmonakhov@openvz.org \
    --cc=jeffrey.d.skirvin@intel.com \
    --cc=jmoyer@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=michael.mesnier@intel.com \
    --cc=mpatlasov@parallels.com \
    --cc=pjt@google.com \
    --cc=tj@kernel.org \
    --cc=tytso@google.com \
    --cc=zab@zabbo.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.