linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] direct-io: allow file systems to do their own waiting for io
@ 2012-12-03 13:37 Josef Bacik
  2012-12-03 15:41 ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2012-12-03 13:37 UTC (permalink / raw)
  To: linux-fsdevel, viro, jmoyer, zab

Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
waiting.  The thing is we have a handy way of waiting for IO that we can
delay to the very last second so we do all of the O_SYNC work and then wait
for a bunch of IO to complete.  So introduce a flag to allow the generic
direct io stuff to forgo waiting and leave that up to the file system.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
---
 fs/direct-io.c     |   36 +++++++++++++++++++++++++++++-------
 include/linux/fs.h |    3 +++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..ae31c183 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -299,19 +299,35 @@ static void dio_bio_end_aio(struct bio *bio, int error)
  * handler.
  *
  * During I/O bi_private points at the dio.  After I/O, bi_private is used to
- * implement a singly-linked list of completed BIOs, at dio->bio_list.
+ * implement a singly-linked list of completed BIOs, at dio->bio_list, but only
+ * if the file system isn't doing its own waiting.
  */
 static void dio_bio_end_io(struct bio *bio, int error)
 {
 	struct dio *dio = bio->bi_private;
 	unsigned long flags;
+	unsigned long remaining;
+	bool own_waiting = (dio->rw == WRITE &&
+			    (dio->flags & DIO_OWN_WAITING));
+
+	if (own_waiting)
+		dio_bio_complete(dio, bio);
 
 	spin_lock_irqsave(&dio->bio_lock, flags);
-	bio->bi_private = dio->bio_list;
-	dio->bio_list = bio;
-	if (--dio->refcount == 1 && dio->waiter)
+	if (!own_waiting) {
+		bio->bi_private = dio->bio_list;
+		dio->bio_list = bio;
+	}
+	remaining = --dio->refcount;
+	if (remaining == 1 && dio->waiter)
 		wake_up_process(dio->waiter);
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
+
+	if (remaining == 0) {
+		BUG_ON(!(dio->flags & DIO_OWN_WAITING));
+		dio_complete(dio, dio->iocb->ki_pos, 0, false);
+		kmem_cache_free(dio_cache, dio);
+	}
 }
 
 /**
@@ -1266,14 +1282,20 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	    ((rw == READ) || (dio->result == sdio.size)))
 		retval = -EIOCBQUEUED;
 
-	if (retval != -EIOCBQUEUED)
+	if (retval != -EIOCBQUEUED &&
+	    (rw == READ || !(flags & DIO_OWN_WAITING)))
 		dio_await_completion(dio);
 
 	if (drop_refcount(dio) == 0) {
 		retval = dio_complete(dio, offset, retval, false);
 		kmem_cache_free(dio_cache, dio);
-	} else
-		BUG_ON(retval != -EIOCBQUEUED);
+	} else {
+		BUG_ON(retval != -EIOCBQUEUED && !(flags & DIO_OWN_WAITING));
+
+		/* Need to return how much data we should be waiting for */
+		if (!retval && flags & DIO_OWN_WAITING)
+			retval = dio->result;
+	}
 
 out:
 	return retval;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..c7944d1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2433,6 +2433,9 @@ enum {
 
 	/* filesystem does not support filling holes */
 	DIO_SKIP_HOLES	= 0x02,
+
+	/* filesystem will do it's own waiting thank you! */
+	DIO_OWN_WAITING = 0x04,
 };
 
 void dio_end_io(struct bio *bio, int error);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] direct-io: allow file systems to do their own waiting for io
  2012-12-03 13:37 [PATCH] direct-io: allow file systems to do their own waiting for io Josef Bacik
@ 2012-12-03 15:41 ` Christoph Hellwig
  2012-12-03 16:14   ` Josef Bacik
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2012-12-03 15:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, viro, jmoyer, zab

On Mon, Dec 03, 2012 at 08:37:20AM -0500, Josef Bacik wrote:
> Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
> waiting.  The thing is we have a handy way of waiting for IO that we can
> delay to the very last second so we do all of the O_SYNC work and then wait
> for a bunch of IO to complete.  So introduce a flag to allow the generic
> direct io stuff to forgo waiting and leave that up to the file system.
> Thanks,

I don't really like passing another flag for this, if we we are going to
do something like this it should be in a way where:

 - the actualy waiting code is a helper that btrfs would also use
 - the main dio code is structured in a way that we have a lower level
   entry point that skips the waiting, and a higher level one that also
   calls it.

That beeing said I'm not imaginative enough to see how you're actually
going to use it.  Posting the btrfs side would help with that.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] direct-io: allow file systems to do their own waiting for io
  2012-12-03 15:41 ` Christoph Hellwig
@ 2012-12-03 16:14   ` Josef Bacik
  2012-12-08 12:17     ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2012-12-03 16:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Josef Bacik, linux-fsdevel@vger.kernel.org,
	viro@ZenIV.linux.org.uk, jmoyer@redhat.com, zab@redhat.com

On Mon, Dec 03, 2012 at 08:41:25AM -0700, Christoph Hellwig wrote:
> On Mon, Dec 03, 2012 at 08:37:20AM -0500, Josef Bacik wrote:
> > Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
> > waiting.  The thing is we have a handy way of waiting for IO that we can
> > delay to the very last second so we do all of the O_SYNC work and then wait
> > for a bunch of IO to complete.  So introduce a flag to allow the generic
> > direct io stuff to forgo waiting and leave that up to the file system.
> > Thanks,
> 
> I don't really like passing another flag for this, if we we are going to
> do something like this it should be in a way where:
> 
>  - the actualy waiting code is a helper that btrfs would also use
>  - the main dio code is structured in a way that we have a lower level
>    entry point that skips the waiting, and a higher level one that also
>    calls it.
> 
> That beeing said I'm not imaginative enough to see how you're actually
> going to use it.  Posting the btrfs side would help with that.
> 

Hrm so I can do that, but it may not make much sense.  Here are the two patches
that are relevant (older versions but they get the idea across)

http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=78b40072c556d82fac5e58793a3178887ac057ec
http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=b7728f1b19eeb2041e3d4da22fd3d5a5c11abd3c

Basically what happens with btrfs now in O_SYNC/fsync() with either O_DIRECT or
not is this

write()
fsync()/O_SYNC
	start and wait on all io to complete
	log changed metadata into special tree
	write and wait on our new log
	sync super which points at our new log

What I'm trying to accomplish is this

write()
fsync()/O_SYNC
	start io
	log changed metadata into special tree
	write log and then wait on log and data
	sync super

this gives us a pretty great performance boost since we just have to wait the
one time (well two if you include the super).  But in the O_DIRECT case it
always waits for writes to be completed before it returns to the file system.
In normal O_DIRECT we want to do that, which is all the first patch does, waits
for the IO like we normally would.  But for fsync()/O_SYNC we want to forego the
waiting until the last possible second, so we start io, gather up the ordered
extents (what we use to track pending IO), and then when we're ready wait to
make sure those ordered extents have completed.  We already have our own helpers
and such to keep track of when IO finishes for a given range, so all we really
need is a flag to tell O_DIRECT not to do what it normally does since we will
take care of it.  I'm open to other ways to do this, but I'd rather not go to
all the trouble to create new helpers and such that btrfs will just never need
to use.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] direct-io: allow file systems to do their own waiting for io
  2012-12-03 16:14   ` Josef Bacik
@ 2012-12-08 12:17     ` Christoph Hellwig
  2012-12-08 12:35       ` Chris Mason
  2012-12-11 10:00       ` Liu Bo
  0 siblings, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2012-12-08 12:17 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org,
	viro@ZenIV.linux.org.uk, jmoyer@redhat.com, zab@redhat.com

On Mon, Dec 03, 2012 at 11:14:03AM -0500, Josef Bacik wrote:
> On Mon, Dec 03, 2012 at 08:41:25AM -0700, Christoph Hellwig wrote:
> > On Mon, Dec 03, 2012 at 08:37:20AM -0500, Josef Bacik wrote:
> > > Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
> > > waiting.  The thing is we have a handy way of waiting for IO that we can
> > > delay to the very last second so we do all of the O_SYNC work and then wait
> > > for a bunch of IO to complete.  So introduce a flag to allow the generic
> > > direct io stuff to forgo waiting and leave that up to the file system.
> > > Thanks,
> > 
> > I don't really like passing another flag for this, if we we are going to
> > do something like this it should be in a way where:
> > 
> >  - the actualy waiting code is a helper that btrfs would also use
> >  - the main dio code is structured in a way that we have a lower level
> >    entry point that skips the waiting, and a higher level one that also
> >    calls it.
> > 
> > That beeing said I'm not imaginative enough to see how you're actually
> > going to use it.  Posting the btrfs side would help with that.
> > 
> 
> Hrm so I can do that, but it may not make much sense.  Here are the two patches
> that are relevant (older versions but they get the idea across)
> 
> http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=78b40072c556d82fac5e58793a3178887ac057ec
> http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=b7728f1b19eeb2041e3d4da22fd3d5a5c11abd3c

I've looked over the patches but I still don't know what's going on,
sorry for having to poke a bit deeper by mail.

> 
> Basically what happens with btrfs now in O_SYNC/fsync() with either O_DIRECT or
> not is this
> 
> write()
> fsync()/O_SYNC
> 	start and wait on all io to complete
> 	log changed metadata into special tree
> 	write and wait on our new log
> 	sync super which points at our new log
> 
> What I'm trying to accomplish is this
> 
> write()
> fsync()/O_SYNC
> 	start io
> 	log changed metadata into special tree
> 	write log and then wait on log and data

How is going to be safe?  You must only update the metadata once the
data has made it to disk, that is the actual disk I/O for the metadata
must only start once the disk I/O for the data has finished.  For
exactly that scenario the direct I/O code supports the end_io callback
to notify the filesystem efficiently.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] direct-io: allow file systems to do their own waiting for io
  2012-12-08 12:17     ` Christoph Hellwig
@ 2012-12-08 12:35       ` Chris Mason
  2012-12-14 13:44         ` Chris Mason
  2012-12-11 10:00       ` Liu Bo
  1 sibling, 1 reply; 7+ messages in thread
From: Chris Mason @ 2012-12-08 12:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Josef Bacik, linux-fsdevel@vger.kernel.org,
	viro@ZenIV.linux.org.uk, jmoyer@redhat.com, zab@redhat.com

On Sat, Dec 08, 2012 at 05:17:31AM -0700, Christoph Hellwig wrote:
> On Mon, Dec 03, 2012 at 11:14:03AM -0500, Josef Bacik wrote:
> > On Mon, Dec 03, 2012 at 08:41:25AM -0700, Christoph Hellwig wrote:
> > > On Mon, Dec 03, 2012 at 08:37:20AM -0500, Josef Bacik wrote:
> > > > Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
> > > > waiting.  The thing is we have a handy way of waiting for IO that we can
> > > > delay to the very last second so we do all of the O_SYNC work and then wait
> > > > for a bunch of IO to complete.  So introduce a flag to allow the generic
> > > > direct io stuff to forgo waiting and leave that up to the file system.
> > > > Thanks,
> > > 
> > > I don't really like passing another flag for this, if we we are going to
> > > do something like this it should be in a way where:
> > > 
> > >  - the actualy waiting code is a helper that btrfs would also use
> > >  - the main dio code is structured in a way that we have a lower level
> > >    entry point that skips the waiting, and a higher level one that also
> > >    calls it.
> > > 
> > > That beeing said I'm not imaginative enough to see how you're actually
> > > going to use it.  Posting the btrfs side would help with that.
> > > 
> > 
> > Hrm so I can do that, but it may not make much sense.  Here are the two patches
> > that are relevant (older versions but they get the idea across)
> > 
> > http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=78b40072c556d82fac5e58793a3178887ac057ec
> > http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=b7728f1b19eeb2041e3d4da22fd3d5a5c11abd3c
> 
> I've looked over the patches but I still don't know what's going on,
> sorry for having to poke a bit deeper by mail.
> 
> > 
> > Basically what happens with btrfs now in O_SYNC/fsync() with either O_DIRECT or
> > not is this
> > 
> > write()
> > fsync()/O_SYNC
> > 	start and wait on all io to complete
> > 	log changed metadata into special tree
> > 	write and wait on our new log
> > 	sync super which points at our new log
> > 
> > What I'm trying to accomplish is this
> > 
> > write()
> > fsync()/O_SYNC
> > 	start io
> > 	log changed metadata into special tree
> > 	write log and then wait on log and data
> 
> How is going to be safe?  You must only update the metadata once the
> data has made it to disk, that is the actual disk I/O for the metadata
> must only start once the disk I/O for the data has finished.  For
> exactly that scenario the direct I/O code supports the end_io callback
> to notify the filesystem efficiently.

Thanks for reading through things.  The current model without the patch
looks like this:

[ write data, wait for data ] [ write various tree blocks, wait ]
[ write the super, wait ]

One data block, 3 waits.  But thanks to cow, the super commits the
metadata, so we could do this:

[ write the data ] [ write various tree blocks ] [ wait on all of it ]
[ write the super, wait ]

That's down to two waits.  If we start using atomic writes on flash, we can
do it all as a single IO.

-chris

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] direct-io: allow file systems to do their own waiting for io
  2012-12-08 12:17     ` Christoph Hellwig
  2012-12-08 12:35       ` Chris Mason
@ 2012-12-11 10:00       ` Liu Bo
  1 sibling, 0 replies; 7+ messages in thread
From: Liu Bo @ 2012-12-11 10:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Josef Bacik, linux-fsdevel@vger.kernel.org,
	viro@ZenIV.linux.org.uk, jmoyer@redhat.com, zab@redhat.com

On Sat, Dec 08, 2012 at 07:17:31AM -0500, Christoph Hellwig wrote:
> On Mon, Dec 03, 2012 at 11:14:03AM -0500, Josef Bacik wrote:
> > On Mon, Dec 03, 2012 at 08:41:25AM -0700, Christoph Hellwig wrote:
> > > On Mon, Dec 03, 2012 at 08:37:20AM -0500, Josef Bacik wrote:
> > > > Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
> > > > waiting.  The thing is we have a handy way of waiting for IO that we can
> > > > delay to the very last second so we do all of the O_SYNC work and then wait
> > > > for a bunch of IO to complete.  So introduce a flag to allow the generic
> > > > direct io stuff to forgo waiting and leave that up to the file system.
> > > > Thanks,
> > > 
> > > I don't really like passing another flag for this, if we we are going to
> > > do something like this it should be in a way where:
> > > 
> > >  - the actualy waiting code is a helper that btrfs would also use
> > >  - the main dio code is structured in a way that we have a lower level
> > >    entry point that skips the waiting, and a higher level one that also
> > >    calls it.
> > > 
> > > That beeing said I'm not imaginative enough to see how you're actually
> > > going to use it.  Posting the btrfs side would help with that.
> > > 
> > 
> > Hrm so I can do that, but it may not make much sense.  Here are the two patches
> > that are relevant (older versions but they get the idea across)
> > 
> > http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=78b40072c556d82fac5e58793a3178887ac057ec
> > http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=b7728f1b19eeb2041e3d4da22fd3d5a5c11abd3c
> 
> I've looked over the patches but I still don't know what's going on,
> sorry for having to poke a bit deeper by mail.
> 
> > 
> > Basically what happens with btrfs now in O_SYNC/fsync() with either O_DIRECT or
> > not is this
> > 
> > write()
> > fsync()/O_SYNC
> > 	start and wait on all io to complete
> > 	log changed metadata into special tree
> > 	write and wait on our new log
> > 	sync super which points at our new log
> > 
> > What I'm trying to accomplish is this
> > 
> > write()
> > fsync()/O_SYNC
> > 	start io
> > 	log changed metadata into special tree
> > 	write log and then wait on log and data
> 
> How is going to be safe?  You must only update the metadata once the
> data has made it to disk, that is the actual disk I/O for the metadata
> must only start once the disk I/O for the data has finished.  For
> exactly that scenario the direct I/O code supports the end_io callback
> to notify the filesystem efficiently.

Actually Btrfs's ordered mode is like:

Updating metadata by translating pending io into metadata in btree
after the actual disk I/O finishes its work.

So in fsync we can just make use of these pending io info instead in advance
and delay the waiting process.

thanks,
liubo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] direct-io: allow file systems to do their own waiting for io
  2012-12-08 12:35       ` Chris Mason
@ 2012-12-14 13:44         ` Chris Mason
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Mason @ 2012-12-14 13:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Josef Bacik, linux-fsdevel@vger.kernel.org,
	viro@ZenIV.linux.org.uk, jmoyer@redhat.com, zab@redhat.com

On Sat, Dec 08, 2012 at 05:35:41AM -0700, Chris Mason wrote:
> On Sat, Dec 08, 2012 at 05:17:31AM -0700, Christoph Hellwig wrote:
> > On Mon, Dec 03, 2012 at 11:14:03AM -0500, Josef Bacik wrote:
> > > On Mon, Dec 03, 2012 at 08:41:25AM -0700, Christoph Hellwig wrote:
> > > > On Mon, Dec 03, 2012 at 08:37:20AM -0500, Josef Bacik wrote:
> > > > > Btrfs is terrible with O_DIRECT|O_SYNC, mostly because of the constant
> > > > > waiting.  The thing is we have a handy way of waiting for IO that we can
> > > > > delay to the very last second so we do all of the O_SYNC work and then wait
> > > > > for a bunch of IO to complete.  So introduce a flag to allow the generic
> > > > > direct io stuff to forgo waiting and leave that up to the file system.
> > > > > Thanks,
> > > > 
> > > > I don't really like passing another flag for this, if we we are going to
> > > > do something like this it should be in a way where:
> > > > 
> > > >  - the actualy waiting code is a helper that btrfs would also use
> > > >  - the main dio code is structured in a way that we have a lower level
> > > >    entry point that skips the waiting, and a higher level one that also
> > > >    calls it.
> > > > 
> > > > That beeing said I'm not imaginative enough to see how you're actually
> > > > going to use it.  Posting the btrfs side would help with that.
> > > > 
> > > 
> > > Hrm so I can do that, but it may not make much sense.  Here are the two patches
> > > that are relevant (older versions but they get the idea across)
> > > 
> > > http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=78b40072c556d82fac5e58793a3178887ac057ec
> > > http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=b7728f1b19eeb2041e3d4da22fd3d5a5c11abd3c
> > 
> > I've looked over the patches but I still don't know what's going on,
> > sorry for having to poke a bit deeper by mail.
> > 
> > > 
> > > Basically what happens with btrfs now in O_SYNC/fsync() with either O_DIRECT or
> > > not is this
> > > 
> > > write()
> > > fsync()/O_SYNC
> > > 	start and wait on all io to complete
> > > 	log changed metadata into special tree
> > > 	write and wait on our new log
> > > 	sync super which points at our new log
> > > 
> > > What I'm trying to accomplish is this
> > > 
> > > write()
> > > fsync()/O_SYNC
> > > 	start io
> > > 	log changed metadata into special tree
> > > 	write log and then wait on log and data
> > 
> > How is going to be safe?  You must only update the metadata once the
> > data has made it to disk, that is the actual disk I/O for the metadata
> > must only start once the disk I/O for the data has finished.  For
> > exactly that scenario the direct I/O code supports the end_io callback
> > to notify the filesystem efficiently.
> 
> Thanks for reading through things.  The current model without the patch
> looks like this:
> 
> [ write data, wait for data ] [ write various tree blocks, wait ]
> [ write the super, wait ]
> 
> One data block, 3 waits.  But thanks to cow, the super commits the
> metadata, so we could do this:
> 
> [ write the data ] [ write various tree blocks ] [ wait on all of it ]
> [ write the super, wait ]
> 
> That's down to two waits.  If we start using atomic writes on flash, we can
> do it all as a single IO.

So I have this (Josef's v2) in a branch here.  I'm happy to wait a kernel release if
we'd like to hash it out.  But if it makes sense I'll send in with my
pull request.

-chris


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-12-14 13:44 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-03 13:37 [PATCH] direct-io: allow file systems to do their own waiting for io Josef Bacik
2012-12-03 15:41 ` Christoph Hellwig
2012-12-03 16:14   ` Josef Bacik
2012-12-08 12:17     ` Christoph Hellwig
2012-12-08 12:35       ` Chris Mason
2012-12-14 13:44         ` Chris Mason
2012-12-11 10:00       ` Liu Bo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).