* [PATCH] direct-io: allow file systems to do their own waiting for io V2
From: Josef Bacik @ 2012-12-11 14:53 UTC
To: hch, linux-fsdevel, viro, jmoyer, zab
Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
waiting. The thing is, we have a handy way of waiting for IO that we can
delay until the very last second, so we can do all of the O_SYNC work and
then wait for a bunch of IO to complete at once. So introduce a flag to
allow the generic direct IO code to forgo waiting and leave that up to the
file system.
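
(For illustration only, not part of the patch: a file system that sets the
flag might drive an O_DIRECT|O_SYNC write roughly like this. The
__blockdev_direct_IO() arguments follow the current signature, but
fs_get_block, fs_dio_end_io, fs_submit_io, fs_sync_log and
fs_wait_ordered_range are made-up placeholders for the file system's own
hooks and O_SYNC machinery.)

static ssize_t fs_odirect_osync_write(int rw, struct kiocb *iocb,
				      const struct iovec *iov, loff_t offset,
				      unsigned long nr_segs)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	ssize_t ret;

	/* submit the data IO, but don't block in the generic dio code */
	ret = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
				   iov, offset, nr_segs, fs_get_block,
				   fs_dio_end_io, fs_submit_io,
				   DIO_OWN_WAITING);
	if (ret <= 0)
		return ret;

	/* do the O_SYNC metadata/log work while the data IO is in flight */
	fs_sync_log(inode);

	/* ret is dio->result, i.e. how much data IO we should wait for */
	fs_wait_ordered_range(inode, offset, ret);
	return ret;
}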
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
---
V1->V2: fix stupid rw == WRITE bug.
fs/direct-io.c | 36 +++++++++++++++++++++++++++++-------
include/linux/fs.h | 3 +++
2 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..4e1cdb4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -299,19 +299,35 @@ static void dio_bio_end_aio(struct bio *bio, int error)
* handler.
*
* During I/O bi_private points at the dio. After I/O, bi_private is used to
- * implement a singly-linked list of completed BIOs, at dio->bio_list.
+ * implement a singly-linked list of completed BIOs, at dio->bio_list, but only
+ * if the file system isn't doing its own waiting.
*/
static void dio_bio_end_io(struct bio *bio, int error)
{
struct dio *dio = bio->bi_private;
unsigned long flags;
+ unsigned long remaining;
+ bool own_waiting = ((dio->rw & WRITE) &&
+ (dio->flags & DIO_OWN_WAITING));
+
+ if (own_waiting)
+ dio_bio_complete(dio, bio);
spin_lock_irqsave(&dio->bio_lock, flags);
- bio->bi_private = dio->bio_list;
- dio->bio_list = bio;
- if (--dio->refcount == 1 && dio->waiter)
+ if (!own_waiting) {
+ bio->bi_private = dio->bio_list;
+ dio->bio_list = bio;
+ }
+ remaining = --dio->refcount;
+ if (remaining == 1 && dio->waiter)
wake_up_process(dio->waiter);
spin_unlock_irqrestore(&dio->bio_lock, flags);
+
+ if (remaining == 0) {
+ BUG_ON(!(dio->flags & DIO_OWN_WAITING));
+ dio_complete(dio, dio->iocb->ki_pos, 0, false);
+ kmem_cache_free(dio_cache, dio);
+ }
}
/**
@@ -1266,14 +1282,20 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
((rw == READ) || (dio->result == sdio.size)))
retval = -EIOCBQUEUED;
- if (retval != -EIOCBQUEUED)
+ if (retval != -EIOCBQUEUED &&
+ (rw == READ || !(flags & DIO_OWN_WAITING)))
dio_await_completion(dio);
if (drop_refcount(dio) == 0) {
retval = dio_complete(dio, offset, retval, false);
kmem_cache_free(dio_cache, dio);
- } else
- BUG_ON(retval != -EIOCBQUEUED);
+ } else {
+ BUG_ON(retval != -EIOCBQUEUED && !(flags & DIO_OWN_WAITING));
+
+ /* Need to return how much data we should be waiting for */
+ if (!retval && flags & DIO_OWN_WAITING)
+ retval = dio->result;
+ }
out:
return retval;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..c7944d1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2433,6 +2433,9 @@ enum {
/* filesystem does not support filling holes */
DIO_SKIP_HOLES = 0x02,
+
+ /* filesystem will do its own waiting, thank you! */
+ DIO_OWN_WAITING = 0x04,
};
void dio_end_io(struct bio *bio, int error);
--
1.7.7.6
* Re: [PATCH] direct-io: allow file systems to do their own waiting for io V2
From: Christoph Hellwig @ 2012-12-19 14:38 UTC
To: Josef Bacik; +Cc: hch, linux-fsdevel, viro, jmoyer, zab
I have to say I still hate the flag magic in here. Spent some time to
look over things to be a bit more constructive in getting what you
guys want in a nicer way:
> static void dio_bio_end_io(struct bio *bio, int error)
> {
> struct dio *dio = bio->bi_private;
> unsigned long flags;
> + unsigned long remaining;
> + bool own_waiting = ((dio->rw & WRITE) &&
> + (dio->flags & DIO_OWN_WAITING));
> +
> + if (own_waiting)
> + dio_bio_complete(dio, bio);
>
> spin_lock_irqsave(&dio->bio_lock, flags);
> + if (!own_waiting) {
> + bio->bi_private = dio->bio_list;
> + dio->bio_list = bio;
> + }
> + remaining = --dio->refcount;
> + if (remaining == 1 && dio->waiter)
> wake_up_process(dio->waiter);
> spin_unlock_irqrestore(&dio->bio_lock, flags);
> +
> + if (remaining == 0) {
> + BUG_ON(!(dio->flags & DIO_OWN_WAITING));
> + dio_complete(dio, dio->iocb->ki_pos, 0, false);
> + kmem_cache_free(dio_cache, dio);
> + }
This own_waiting case is now identical to dio_bio_end_aio except for the
inverted is_async argument of dio_complete.
So even if we allow for the flag I think we should test it in dio_end_io
and use common code for the case where we don't use the linked list of
bios to complete. In that case you could also just call the current
aio version from btrfs as it already calls dio_end_io directly and
remove the flag given that dio_await_completion would become a no-op.
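
(Sketching one possible reading of that suggestion, with the flag kept here
purely for illustration: route anything that doesn't want the bio_list
through the existing aio completion path in dio_end_io(), so
dio_bio_end_io() keeps only the linked-list case. The inverted is_async
argument to dio_complete() would still need to be sorted out.)

void dio_end_io(struct bio *bio, int error)
{
	struct dio *dio = bio->bi_private;

	/* anything that doesn't want the bio_list completes like aio */
	if (dio->is_async || (dio->flags & DIO_OWN_WAITING))
		dio_bio_end_aio(bio, error);
	else
		dio_bio_end_io(bio, error);
}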
That being said I would much, much prefer to consolidate code here
rather than adding more special cases.
What I would really like to understand is what the point of the bio_list
batching is to start with, given that it also requires nasty workarounds
like dio_bio_reap() to work around the amount of memory it might have to
use.
The only thing I could think of is to allow ->end_io callbacks from user
context, but that is a bigger problem as we can't do that for AIO. I'd
much prefer a unified approach with my generic user context callbacks
from a few weeks ago to actually simplify this code. (and yeah, it's
probably up to me to demonstrate at least a prototype of this)
* Re: [PATCH] direct-io: allow file systems to do their own waiting for io V2
From: Chris Mason @ 2012-12-19 15:55 UTC
To: Christoph Hellwig
Cc: Josef Bacik, linux-fsdevel@vger.kernel.org,
viro@ZenIV.linux.org.uk, jmoyer@redhat.com, zab@redhat.com
On Wed, Dec 19, 2012 at 07:38:41AM -0700, Christoph Hellwig wrote:
> I have to say I still hate the flag magic in here. Spent some time to
> look over things to be a bit more constructive in getting what you
> guys want in a nicer way:
>
> > static void dio_bio_end_io(struct bio *bio, int error)
> > {
> > struct dio *dio = bio->bi_private;
> > unsigned long flags;
> > + unsigned long remaining;
> > + bool own_waiting = ((dio->rw & WRITE) &&
> > + (dio->flags & DIO_OWN_WAITING));
> > +
> > + if (own_waiting)
> > + dio_bio_complete(dio, bio);
> >
> > spin_lock_irqsave(&dio->bio_lock, flags);
> > + if (!own_waiting) {
> > + bio->bi_private = dio->bio_list;
> > + dio->bio_list = bio;
> > + }
> > + remaining = --dio->refcount;
> > + if (remaining == 1 && dio->waiter)
> > wake_up_process(dio->waiter);
> > spin_unlock_irqrestore(&dio->bio_lock, flags);
> > +
> > + if (remaining == 0) {
> > + BUG_ON(!(dio->flags & DIO_OWN_WAITING));
> > + dio_complete(dio, dio->iocb->ki_pos, 0, false);
> > + kmem_cache_free(dio_cache, dio);
> > + }
>
> This own_waiting case is now identical to dio_bio_end_aio except for
> the inverted is_async argument of dio_complete.
>
> So even if we allow for the flag I think we should test it in dio_end_io
> and use common code for the case where we don't use the linked list of
> bios to complete. In that case you could also just call the current
> aio version from btrfs as it already calls dio_end_io directly and
> remove the flag given that dio_await_completion would become a no-op.
>
> That being said I would much, much prefer to consolidate code here
> rather than adding more special cases.
>
> What I would really like to understand is what the point of the
> bio_list batching is to start with, given that it also requires nasty
> workarounds like dio_bio_reap() to work around the amount of memory it
> might have to use.
Just to clarify a little, we didn't send this with my last pull request.
I mentioned before how we want to reduce the number of waits in the DIO
chain, especially for btrfs, which has to do metadata updates along with
data IO for O_DIRECT | O_SYNC. If the FS has control over the waiting,
we can turn three waits (data, log-metadata, super) into two
(data + log-metadata, super).
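
(Placeholder sketch of what that collapse looks like; every helper below is
a made-up name, it only shows where the waits go.)

/* O_DIRECT|O_SYNC today: three separate waits */
static void odirect_osync_today(struct inode *inode, loff_t off, size_t len)
{
	submit_data_io(inode, off, len);
	wait_for_data(inode, off, len);			/* wait 1 */
	write_log_metadata(inode);
	wait_for_log_metadata(inode);			/* wait 2 */
	write_super(inode->i_sb);
	wait_for_super(inode->i_sb);			/* wait 3 */
}

/* with the file system owning the wait: two waits */
static void odirect_osync_own_wait(struct inode *inode, loff_t off, size_t len)
{
	submit_data_io(inode, off, len);		/* no wait yet */
	write_log_metadata(inode);
	wait_for_data_and_log(inode, off, len);		/* wait 1 */
	write_super(inode->i_sb);
	wait_for_super(inode->i_sb);			/* wait 2 */
}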
That's nice, but the flash vendors are coming out with APIs for atomic
IOs. They basically want a full set of IO all at once, instead of the
model where you get a token, do some IO, and commit the token.
So, this code allows us to create that batch of atomic IO. I'm hoping
for an API where we hand a list of bios over to the block layer and
it is completed as a single unit (data + log-metadata + super).
The truth is that btrfs doesn't really need atomic IO; we just need
ordered IO (do the super last, please), and if that ends up useful in
general the fusionio cards may provide it. <insert barrier discussion
here, hopefully having learned from the past>
The atomic vs ordered difference is important because cards may be able
to do a larger set of IO in an ordered fashion than they can atomically.
Of course, I'm hoping everyone is able to make use of whatever is
included. There's nothing btrfs specific here.
>
> The only thing I could think of is to allow ->end_io callbacks from user
> context, but that is a bigger problem as we can't do that for AIO. I'd
> much prefer a unified approach with my generic user context callbacks
> from a few weeks ago to actually simplify this code. (and yeah, it's
> probably up to me to demonstrate at least a prototype of this)
end_io callbacks from user context are definitely interesting, but
that's not the kind of performance tuning we're targeting right now.
-chris