* [RFC] loop: issue aio with pages
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This patch series adds a kernel interface to fs/aio.c so that kernel code can
issue concurrent asynchronous IO to file systems. It adds an aio command and
file system methods which specify IO memory with pages instead of userspace
addresses.
This series was written to reduce the overhead that loop currently imposes by
performing synchronous buffered file system IO from a kernel thread. These
patches turn loop into a lightweight layer that translates bios into iocbs.
knfsd might also want to use these interfaces to perform file IO.
Initial results are encouraging. Previously, running loop over a block device
would cut random concurrent read performance in half. With this series,
reading from loop performs the same as reading from the underlying device.
But the series is still a prototype. Only blockdev implements the new
methods, and only for O_DIRECT. It'll take more work to get pages passed into
file system methods, through generic code, and down to fs/direct-io.c.
I'm sending this now so that we can decide if we want to pursue this and the
direction it should head in before I get too lost updating file systems to know
how to work with pages.
Thoughts?
I think my current preference is to stick with the strategy of adding a new
file op method instead of trying to get fancy by changing the types of
existing methods to take some abstract memory specifier. We'll see if it's
feasible to rework the generic code around iovecs and pages by converting a
few file systems.
- z
* [PATCH 1/8] gadgetfs: use schedule_work() instead of EIOCBRETRY
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
gadgetfs was the sole mainline user of the retry mechanism offered by fs/aio.c.
It would set the iocb's ki_retry to a tiny function that was executed once an
async read completed. The function would simply copy data that was DMAed into
a kernel buffer out to the userspace buffer for the read.
This use was not worth the complexity and volume of code needed to support
retrying iocbs in the aio core. Many other file systems already do this kind
of post-processing work in their own threads today, and the work that gadgetfs
does pales in comparison to the work that btrfs and xfs do.
This patch moves this trivial copying into a work struct that is processed by
keventd. It uses the {un,}use_mm() calls which were recently promoted out of
aio into a proper API.
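Reduced to a minimal sketch (struct my_req and its fields are hypothetical
stand-ins for the gadgetfs types in the diff below), the pattern looks like
this:

struct my_req {
	struct work_struct work;
	struct mm_struct *mm;		/* submitter's mm, saved at submit */
	struct kiocb *iocb;
	char __user *user_buf;
	void *kernel_buf;
	ssize_t len;
};

/* runs in keventd: adopt the submitter's mm so copy_to_user() works */
static void copy_back_work(struct work_struct *work)
{
	struct my_req *req = container_of(work, struct my_req, work);
	mm_segment_t oldfs = get_fs();

	set_fs(USER_DS);
	use_mm(req->mm);

	if (copy_to_user(req->user_buf, req->kernel_buf, req->len))
		req->len = -EFAULT;

	unuse_mm(req->mm);
	set_fs(oldfs);

	aio_complete(req->iocb, req->len, 0);
}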
This paves the way for removing the dangerous and slow retry functionality from
the aio core.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
drivers/usb/gadget/inode.c | 41 +++++++++++++++++++++++++++++++++--------
1 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index bf0f652..694efc8 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -34,6 +34,7 @@
#include <linux/slab.h>
#include <linux/poll.h>
#include <linux/smp_lock.h>
+#include <linux/mmu_context.h>
#include <linux/device.h>
#include <linux/moduleparam.h>
@@ -522,6 +523,7 @@ struct kiocb_priv {
const struct iovec *iv;
unsigned long nr_segs;
unsigned actual;
+ struct work_struct work;
};
static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
@@ -545,14 +547,31 @@ static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
return value;
}
-static ssize_t ep_aio_read_retry(struct kiocb *iocb)
+/*
+ * This runs in keventd. It assumes the mm context of the aio context that
+ * submitted the read and copies the data from the allocated kernel result
+ * buffer back to the userspace buffer.
+ *
+ * This used to be done by setting ki_retry to this function once the aio read
+ * was submitted. The caller would assume the aio context's mm and call this
+ * function to perform the copy. As ki_retry, this function could only return
+ * len, which would then be given to aio_complete(). To maintain this
+ * behaviour we call aio_complete() with a 0 third argument. This could now
+ * return req->status to userspace as the third argument, but I opted to
+ * maintain the previously existing behaviour.
+ */
+static void ep_aio_read_copy(struct work_struct *work)
{
- struct kiocb_priv *priv = iocb->private;
+ struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
+ struct kiocb *iocb = priv->req->context;
+ mm_segment_t oldfs = get_fs();
ssize_t len, total;
void *to_copy;
int i;
- /* we "retry" to get the right mm context for this: */
+ /* get the right mm context for this */
+ set_fs(USER_DS);
+ use_mm(iocb->ki_ctx->mm);
/* copy stuff into user buffers */
total = priv->actual;
@@ -573,9 +592,12 @@ static ssize_t ep_aio_read_retry(struct kiocb *iocb)
if (total == 0)
break;
}
+
+ unuse_mm(iocb->ki_ctx->mm);
+ set_fs(oldfs);
kfree(priv->buf);
kfree(priv);
- return len;
+ aio_complete(iocb, len, 0);
}
static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
@@ -601,14 +623,18 @@ static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
aio_complete(iocb, req->actual ? req->actual : req->status,
req->status);
} else {
- /* retry() won't report both; so we hide some faults */
+ /*
+ * ep_aio_read_copy() doesn't report both; so we hide some
+ * faults.
+ */
if (unlikely(0 != req->status))
DBG(epdata->dev, "%s fault %d len %d\n",
ep->name, req->status, req->actual);
priv->buf = req->buf;
priv->actual = req->actual;
- kick_iocb(iocb);
+ INIT_WORK(&priv->work, ep_aio_read_copy);
+ schedule_work(&priv->work);
}
spin_unlock(&epdata->dev->lock);
@@ -679,7 +705,7 @@ fail:
kfree(priv);
put_ep(epdata);
} else
- value = (iv ? -EIOCBRETRY : -EIOCBQUEUED);
+ value = -EIOCBQUEUED;
return value;
}
@@ -697,7 +723,6 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(!buf))
return -ENOMEM;
- iocb->ki_retry = ep_aio_read_retry;
return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
}
--
1.6.2.5
* [PATCH 2/8] aio: disable retry
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This patch ensures that aio ops won't be retried. Code that uses EIOCBRETRY
will fail to build and modules that reference kick_iocb() won't load.
This simplifies an upcoming patch by ensuring that it doesn't have to support
aio ops that retry. Further patches could remove much more of fs/aio.c if we
chose to remove retry-based aio support.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
fs/aio.c | 82 +-----------------------------------------------
fs/ocfs2/dlmglue.c | 2 +-
fs/read_write.c | 34 ++------------------
include/linux/aio.h | 11 ------
include/linux/errno.h | 1 -
5 files changed, 6 insertions(+), 124 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 02a2c93..2406981 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -697,37 +697,12 @@ static ssize_t aio_run_iocb(struct kiocb *iocb)
*/
ret = retry(iocb);
- if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
+ if (ret != -EIOCBQUEUED) {
BUG_ON(!list_empty(&iocb->ki_wait.task_list));
aio_complete(iocb, ret, 0);
}
out:
spin_lock_irq(&ctx->ctx_lock);
-
- if (-EIOCBRETRY == ret) {
- /*
- * OK, now that we are done with this iteration
- * and know that there is more left to go,
- * this is where we let go so that a subsequent
- * "kick" can start the next iteration
- */
-
- /* will make __queue_kicked_iocb succeed from here on */
- INIT_LIST_HEAD(&iocb->ki_run_list);
- /* we must queue the next iteration ourselves, if it
- * has already been kicked */
- if (kiocbIsKicked(iocb)) {
- __queue_kicked_iocb(iocb);
-
- /*
- * __queue_kicked_iocb will always return 1 here, because
- * iocb->ki_run_list is empty at this point so it should
- * be safe to unconditionally queue the context into the
- * work queue.
- */
- aio_queue_work(ctx);
- }
- }
return ret;
}
@@ -840,56 +815,6 @@ static void aio_kick_handler(struct work_struct *work)
queue_delayed_work(aio_wq, &ctx->wq, 0);
}
-
-/*
- * Called by kick_iocb to queue the kiocb for retry
- * and if required activate the aio work queue to process
- * it
- */
-static void try_queue_kicked_iocb(struct kiocb *iocb)
-{
- struct kioctx *ctx = iocb->ki_ctx;
- unsigned long flags;
- int run = 0;
-
- /* We're supposed to be the only path putting the iocb back on the run
- * list. If we find that the iocb is *back* on a wait queue already
- * than retry has happened before we could queue the iocb. This also
- * means that the retry could have completed and freed our iocb, no
- * good. */
- BUG_ON((!list_empty(&iocb->ki_wait.task_list)));
-
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- /* set this inside the lock so that we can't race with aio_run_iocb()
- * testing it and putting the iocb on the run list under the lock */
- if (!kiocbTryKick(iocb))
- run = __queue_kicked_iocb(iocb);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- if (run)
- aio_queue_work(ctx);
-}
-
-/*
- * kick_iocb:
- * Called typically from a wait queue callback context
- * (aio_wake_function) to trigger a retry of the iocb.
- * The retry is usually executed by aio workqueue
- * threads (See aio_kick_handler).
- */
-void kick_iocb(struct kiocb *iocb)
-{
- /* sync iocbs are easy: they can only ever be executing from a
- * single context. */
- if (is_sync_kiocb(iocb)) {
- kiocbSetKicked(iocb);
- wake_up_process(iocb->ki_obj.tsk);
- return;
- }
-
- try_queue_kicked_iocb(iocb);
-}
-EXPORT_SYMBOL(kick_iocb);
-
/* aio_complete
* Called when the io request on the given iocb is complete.
* Returns true if this is the last user of the request. The
@@ -1352,7 +1277,7 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
/* If we managed to write some out we return that, rather than
* the eventual error. */
if (opcode == IOCB_CMD_PWRITEV
- && ret < 0 && ret != -EIOCBQUEUED && ret != -EIOCBRETRY
+ && ret < 0 && ret != -EIOCBQUEUED
&& iocb->ki_nbytes - iocb->ki_left)
ret = iocb->ki_nbytes - iocb->ki_left;
@@ -1524,10 +1449,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb)
static int aio_wake_function(wait_queue_t *wait, unsigned mode,
int sync, void *key)
{
- struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
-
list_del_init(&wait->task_list);
- kick_iocb(iocb);
return 1;
}
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 0d38d67..fb7bb85 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2282,7 +2282,7 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
status = __ocfs2_cluster_lock(osb, lockres, level, dlm_flags,
arg_flags, subclass, _RET_IP_);
if (status < 0) {
- if (status != -EAGAIN && status != -EIOCBRETRY)
+ if (status != -EAGAIN)
mlog_errno(status);
goto bail;
}
diff --git a/fs/read_write.c b/fs/read_write.c
index 3ac2898..a6d1b45 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -239,16 +239,6 @@ int rw_verify_area(int read_write, struct file *file, loff_t *ppos, size_t count
return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
}
-static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
-{
- set_current_state(TASK_UNINTERRUPTIBLE);
- if (!kiocbIsKicked(iocb))
- schedule();
- else
- kiocbClearKicked(iocb);
- __set_current_state(TASK_RUNNING);
-}
-
ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
struct iovec iov = { .iov_base = buf, .iov_len = len };
@@ -259,13 +249,7 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp
kiocb.ki_pos = *ppos;
kiocb.ki_left = len;
- for (;;) {
- ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
- if (ret != -EIOCBRETRY)
- break;
- wait_on_retry_sync_kiocb(&kiocb);
- }
-
+ ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
@@ -314,13 +298,7 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof
kiocb.ki_pos = *ppos;
kiocb.ki_left = len;
- for (;;) {
- ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
- if (ret != -EIOCBRETRY)
- break;
- wait_on_retry_sync_kiocb(&kiocb);
- }
-
+ ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
@@ -494,13 +472,7 @@ ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov,
kiocb.ki_left = len;
kiocb.ki_nbytes = len;
- for (;;) {
- ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos);
- if (ret != -EIOCBRETRY)
- break;
- wait_on_retry_sync_kiocb(&kiocb);
- }
-
+ ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos);
if (ret == -EIOCBQUEUED)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index aea219d..4f88ec2 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -74,15 +74,6 @@ struct kioctx;
* not ask the method again -- ki_retry must ensure forward progress.
* aio_complete() must be called once and only once in the future, multiple
* calls may result in undefined behaviour.
- *
- * If ki_retry returns -EIOCBRETRY it has made a promise that kick_iocb()
- * will be called on the kiocb pointer in the future. This may happen
- * through generic helpers that associate kiocb->ki_wait with a wait
- * queue head that ki_retry uses via current->io_wait. It can also happen
- * with custom tracking and manual calls to kick_iocb(), though that is
- * discouraged. In either case, kick_iocb() must be called once and only
- * once. ki_retry must ensure forward progress, the AIO core will wait
- * indefinitely for kick_iocb() to be called.
*/
struct kiocb {
struct list_head ki_run_list;
@@ -210,14 +201,12 @@ extern unsigned aio_max_size;
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
extern int aio_put_req(struct kiocb *iocb);
-extern void kick_iocb(struct kiocb *iocb);
extern int aio_complete(struct kiocb *iocb, long res, long res2);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
static inline int aio_put_req(struct kiocb *iocb) { return 0; }
-static inline void kick_iocb(struct kiocb *iocb) { }
static inline int aio_complete(struct kiocb *iocb, long res, long res2) { return 0; }
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
diff --git a/include/linux/errno.h b/include/linux/errno.h
index 4668583..63a1f15 100644
--- a/include/linux/errno.h
+++ b/include/linux/errno.h
@@ -27,7 +27,6 @@
#define EBADTYPE 527 /* Type not supported by server */
#define EJUKEBOX 528 /* Request initiated, but will not complete before timeout */
#define EIOCBQUEUED 529 /* iocb queued, will get completion event */
-#define EIOCBRETRY 530 /* iocb queued, will trigger a retry */
#endif
--
1.6.2.5
* [PATCH 3/8] aio: add an interface to submit aio from the kernel
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This adds a simple interface that lets other parts of the kernel submit aio
iocbs. Callers provide a function which is called as the IO completes.
To reduce overhead these iocbs aren't tracked by the aio core: they can't be
canceled, callers must limit the number in flight themselves, and previous
patches in this series removed retry-based aio.
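For illustration, a caller drives the interface roughly like this (a sketch
only; struct my_io, my_complete() and my_submit_read() are made up, and the
memory argument has to match what the file's aio method expects):

struct my_io {
	struct completion done;
	long result;
};

static void my_complete(u64 user_data, long res)
{
	struct my_io *io = (struct my_io *)(unsigned long)user_data;

	/* res is the byte count on success or a negative errno; this can
	 * run from any context, possibly before submit returns */
	io->result = res;
	complete(&io->done);
}

static int my_submit_read(struct file *filp, void *buf, size_t len,
			  loff_t off, struct my_io *io)
{
	struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

	if (!iocb)
		return -ENOMEM;

	aio_kernel_init_rw(iocb, filp, IOCB_CMD_PREAD, buf, len, off);
	aio_kernel_init_callback(iocb, my_complete, (u64)(unsigned long)io);

	/* on failure the iocb has already been freed for us */
	return aio_kernel_submit(iocb);
}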
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
fs/aio.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/aio.h | 11 ++++++
2 files changed, 97 insertions(+), 0 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 2406981..7a150c2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -843,6 +843,10 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
iocb->ki_users = 0;
wake_up_process(iocb->ki_obj.tsk);
return 1;
+ } else if (is_kernel_kiocb(iocb)) {
+ iocb->ki_obj.complete(iocb->ki_user_data, res);
+ aio_kernel_free(iocb);
+ return 0;
}
info = &ctx->ring_info;
@@ -1706,3 +1710,85 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
asmlinkage_protect(5, ret, ctx_id, min_nr, nr, events, timeout);
return ret;
}
+
+/*
+ * This allocates an iocb that will be used to submit and track completion of
+ * an IO that is issued from kernel space. We don't have a context, we don't
+ * limit the number pending, and we can't be canceled. The caller is expected
+ * to call the appropriate aio_kernel_init_() functions and then call
+ * aio_kernel_submit(). From that point forward progress is guaranteed by the
+ * file system aio method. Eventually the caller's completion callback will be
+ * called.
+ */
+struct kiocb *aio_kernel_alloc(gfp_t gfp)
+{
+ struct kiocb *iocb = kmem_cache_zalloc(kiocb_cachep, gfp);
+ if (iocb)
+ iocb->ki_key = KIOCB_KERNEL_KEY;
+ return iocb;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_alloc);
+
+void aio_kernel_free(struct kiocb *iocb)
+{
+ if (iocb)
+ kmem_cache_free(kiocb_cachep, iocb);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_free);
+
+/*
+ * ptr and count can be a buff and bytes or an iov and segs.
+ */
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+ unsigned short op, void *ptr, size_t nr, loff_t off)
+{
+ iocb->ki_filp = filp;
+ iocb->ki_opcode = op;
+ iocb->ki_buf = (char __user *)(unsigned long)ptr;
+ iocb->ki_left = nr;
+ iocb->ki_nbytes = nr;
+ iocb->ki_pos = off;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
+
+void aio_kernel_init_callback(struct kiocb *iocb,
+ void (*complete)(u64 user_data, long res),
+ u64 user_data)
+{
+ iocb->ki_obj.complete = complete;
+ iocb->ki_user_data = user_data;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_callback);
+
+/*
+ * The iocb is our responsibility once this is called. The caller must not
+ * reference it. This comes from aio_setup_iocb() modifying the iocb.
+ *
+ * Callers must be prepared for their iocb completion callback to be called the
+ * moment they enter this function. The completion callback may be called from
+ * any context.
+ *
+ * Returns: 0: the iocb completion callback will be called with the op result
+ * negative errno: the operation was not submitted and the iocb was freed
+ */
+int aio_kernel_submit(struct kiocb *iocb)
+{
+ int ret;
+
+ BUG_ON(!is_kernel_kiocb(iocb));
+ BUG_ON(!iocb->ki_obj.complete);
+ BUG_ON(!iocb->ki_filp);
+
+ ret = aio_setup_iocb(iocb);
+ if (ret) {
+ aio_kernel_free(iocb);
+ return ret;
+ }
+
+ ret = iocb->ki_retry(iocb);
+ if (ret != -EIOCBQUEUED)
+ aio_complete(iocb, ret, 0);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_submit);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 4f88ec2..95ef1ea 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -24,6 +24,7 @@ struct kioctx;
#define KIOCB_C_COMPLETE 0x02
#define KIOCB_SYNC_KEY (~0U)
+#define KIOCB_KERNEL_KEY (~1U)
/* ki_flags bits */
/*
@@ -90,6 +91,7 @@ struct kiocb {
union {
void __user *user;
struct task_struct *tsk;
+ void (*complete)(u64 user_data, long res);
} ki_obj;
__u64 ki_user_data; /* user's data for completion */
@@ -118,6 +120,7 @@ struct kiocb {
};
#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
+#define is_kernel_kiocb(iocb) ((iocb)->ki_key == KIOCB_KERNEL_KEY)
#define init_sync_kiocb(x, filp) \
do { \
struct task_struct *tsk = current; \
@@ -204,6 +207,14 @@ extern int aio_put_req(struct kiocb *iocb);
extern int aio_complete(struct kiocb *iocb, long res, long res2);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
+struct kiocb *aio_kernel_alloc(gfp_t gfp);
+void aio_kernel_free(struct kiocb *iocb);
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+ unsigned short op, void *ptr, size_t nr, loff_t off);
+void aio_kernel_init_callback(struct kiocb *iocb,
+ void (*complete)(u64 user_data, long res),
+ u64 user_data);
+int aio_kernel_submit(struct kiocb *iocb);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
static inline int aio_put_req(struct kiocb *iocb) { return 0; }
--
1.6.2.5
* [PATCH 4/8] aio: add aio_read_pages and aio_write_pages
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This adds read and write file operations which specify memory with an array of
page pointers, offsets, and lengths. AIO commands are added which call these
methods. Fields are added to the iocb to specify the pages which are passed to
the methods.
This is intended to be used by callers in the kernel.
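A kernel caller describes the memory with an array of bio_vecs and uses the
new opcodes, along these lines (a sketch; my_read_pages() is made up and the
completion callback is as in the hypothetical example from the previous
patch):

static int my_read_pages(struct file *filp, struct bio_vec *bvec,
			 unsigned long bvec_len, loff_t off, u64 cookie,
			 void (*complete)(u64, long))
{
	struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

	if (!iocb)
		return -ENOMEM;

	/* for the _pages opcodes ptr/nr land in ki_bvec/ki_bvec_len */
	aio_kernel_init_rw(iocb, filp, IOCB_CMD_PREADP, bvec, bvec_len, off);
	aio_kernel_init_callback(iocb, complete, cookie);

	return aio_kernel_submit(iocb);
}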
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
fs/aio.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/aio.h | 3 ++
include/linux/aio_abi.h | 2 +
include/linux/fs.h | 3 ++
4 files changed, 60 insertions(+), 1 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 7a150c2..fa85cb3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1308,6 +1308,30 @@ static ssize_t aio_fsync(struct kiocb *iocb)
return ret;
}
+static ssize_t aio_pread_pages(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->aio_read_pages)
+ ret = file->f_op->aio_read_pages(iocb, iocb->ki_bvec,
+ iocb->ki_bvec_len,
+ iocb->ki_pos);
+ return ret;
+}
+
+static ssize_t aio_pwrite_pages(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->aio_write_pages)
+ ret = file->f_op->aio_write_pages(iocb, iocb->ki_bvec,
+ iocb->ki_bvec_len,
+ iocb->ki_pos);
+ return ret;
+}
+
static ssize_t aio_setup_vectored_rw(int type, struct kiocb *kiocb)
{
ssize_t ret;
@@ -1414,6 +1438,28 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb)
if (file->f_op->aio_write)
kiocb->ki_retry = aio_rw_vect_retry;
break;
+ case IOCB_CMD_PREADP:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_READ)))
+ break;
+ ret = security_file_permission(file, MAY_READ);
+ if (unlikely(ret))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->aio_read_pages)
+ kiocb->ki_retry = aio_pread_pages;
+ break;
+ case IOCB_CMD_PWRITEP:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
+ break;
+ ret = security_file_permission(file, MAY_WRITE);
+ if (unlikely(ret))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->aio_write_pages)
+ kiocb->ki_retry = aio_pwrite_pages;
+ break;
case IOCB_CMD_FDSYNC:
ret = -EINVAL;
if (file->f_op->aio_fsync)
@@ -1737,7 +1783,8 @@ void aio_kernel_free(struct kiocb *iocb)
EXPORT_SYMBOL_GPL(aio_kernel_free);
/*
- * ptr and count can be a buff and bytes or an iov and segs.
+ * @ptr: can be a pointer to bytes, an iovec array, or bio_vec array
+ * @nr: can be the bytes at the pointer, the nr of iovecs or bio_vecs
*/
void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
unsigned short op, void *ptr, size_t nr, loff_t off)
@@ -1748,6 +1795,10 @@ void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
iocb->ki_left = nr;
iocb->ki_nbytes = nr;
iocb->ki_pos = off;
+ /* ignored unless the pages variants are used */
+ iocb->ki_bvec = ptr;
+ iocb->ki_bvec_len = nr;
+
}
EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 95ef1ea..4ad1010 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -117,6 +117,9 @@ struct kiocb {
* this is the underlying eventfd context to deliver events to.
*/
struct eventfd_ctx *ki_eventfd;
+
+ struct bio_vec *ki_bvec;
+ unsigned long ki_bvec_len;
};
#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
diff --git a/include/linux/aio_abi.h b/include/linux/aio_abi.h
index 2c87316..5b67ed6 100644
--- a/include/linux/aio_abi.h
+++ b/include/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
+ IOCB_CMD_PREADP = 9,
+ IOCB_CMD_PWRITEP = 10,
};
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2620a8c..2ba15f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1479,6 +1479,7 @@ struct block_device_operations;
* read, write, poll, fsync, readv, writev, unlocked_ioctl and compat_ioctl
* can be called without the big kernel lock held in all filesystems.
*/
+struct bio_vec; /* XXX :( */
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
@@ -1486,6 +1487,8 @@ struct file_operations {
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*aio_read_pages) (struct kiocb *, const struct bio_vec *, unsigned long, loff_t);
+ ssize_t (*aio_write_pages) (struct kiocb *, const struct bio_vec *, unsigned long, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
--
1.6.2.5
* [PATCH 5/8] dio: refactor __blockdev_direct_IO()
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This patch moves code around so that __blockdev_direct_IO() becomes a function
which only iterates over its iovec argument while calling helper functions.
This will let us add functions in the future which call the same helper
functions while iterating over other ways of specifying memory for the IO.
direct_io_worker() was only called once from __blockdev_direct_IO() but took
the iovec argument. Instead of trying to pass down other ways of specifying
memory, we move its body up into __blockdev_direct_IO() before pulling the
initialization and teardown steps of this resulting huge function off into
helpers. What's left is a small function that mostly concerns itself with
calling helpers on the iovec.
The dio_unaligned() helper is a little strange, but it reflects the flow of the
original code. If we were to clean it up we'd run the risk of introducing
bugs.
Comments have not been updated to reflect the code change yet.
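After the refactoring the flow reads roughly like this (a pseudocode
condensation of the real function in the diff below, error handling elided):

ssize_t __blockdev_direct_IO(...)
{
	/* bail if the offset or any iovec element is misaligned */
	if (dio_unaligned(offset, &blkbits, bdev))
		return -EINVAL;

	dio = dio_prepare(rw, iocb, inode, blkbits, offset, end,
			  get_block, end_io, dio_lock_type);
	ret = dio_lock_and_flush(rw, inode, dio_lock_type, offset, end);

	for (seg = 0; seg < nr_segs; seg++) {
		/* load this segment's page fetching state into *dio */
		ret = do_direct_IO(dio);
		if (ret)
			break;
	}

	return dio_post_submission(dio, offset, end, ret);
}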
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
fs/direct-io.c | 313 +++++++++++++++++++++++++++++--------------------------
1 files changed, 165 insertions(+), 148 deletions(-)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8b10b87..0a2ba8e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -925,21 +925,19 @@ out:
return ret;
}
-/*
- * Releases both i_mutex and i_alloc_sem
- */
-static ssize_t
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
- const struct iovec *iov, loff_t offset, unsigned long nr_segs,
- unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
- struct dio *dio)
+static struct dio *dio_prepare(int rw, struct kiocb *iocb, struct inode *inode,
+ unsigned blkbits, loff_t offset, loff_t end,
+ get_block_t get_block, dio_iodone_t end_io,
+ int dio_lock_type)
{
- unsigned long user_addr;
- unsigned long flags;
- int seg;
- ssize_t ret = 0;
- ssize_t ret2;
- size_t bytes;
+ struct dio *dio;
+
+ dio = kzalloc(sizeof(*dio), GFP_KERNEL);
+ if (!dio)
+ return NULL;
+
+ if (rw & WRITE)
+ rw = WRITE_ODIRECT;
dio->inode = inode;
dio->rw = rw;
@@ -947,6 +945,13 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
dio->blkfactor = inode->i_blkbits - blkbits;
dio->block_in_file = offset >> blkbits;
+ /*
+ * In case of non-aligned buffers, we may need 2 more
+ * pages since we need to zero out first and last block.
+ */
+ if (unlikely(dio->blkfactor))
+ dio->pages_in_io = 2;
+
dio->get_block = get_block;
dio->end_io = end_io;
dio->final_block_in_bio = -1;
@@ -958,54 +963,88 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
spin_lock_init(&dio->bio_lock);
dio->refcount = 1;
+ dio->lock_type = dio_lock_type;
+
/*
- * In case of non-aligned buffers, we may need 2 more
- * pages since we need to zero out first and last block.
+ * For file extending writes updating i_size before data
+ * writeouts complete can expose uninitialized blocks. So
+ * even for AIO, we need to wait for i/o to complete before
+ * returning in this case.
*/
- if (unlikely(dio->blkfactor))
- dio->pages_in_io = 2;
+ dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
+ (end > i_size_read(inode)));
- for (seg = 0; seg < nr_segs; seg++) {
- user_addr = (unsigned long)iov[seg].iov_base;
- dio->pages_in_io +=
- ((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE
- - user_addr/PAGE_SIZE);
+ return dio;
+}
+
+/*
+ * dio will work in inode blocksize until the time that it sees IO that isn't
+ * aligned to the inode blocksize. From that point on all IO works in
+ * the block device's block size.
+ */
+int dio_unaligned(unsigned long val, unsigned *blkbits,
+ struct block_device *bdev)
+{
+ unsigned mask = (1 << *blkbits) - 1;
+
+ if (val & mask) {
+ if (bdev)
+ *blkbits = blksize_bits(bdev_logical_block_size(bdev));
+ mask = (1 << *blkbits) - 1;
+ return !!(val & mask);
}
- for (seg = 0; seg < nr_segs; seg++) {
- user_addr = (unsigned long)iov[seg].iov_base;
- dio->size += bytes = iov[seg].iov_len;
+ return 0;
+}
- /* Index into the first page of the first block */
- dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
- dio->final_block_in_request = dio->block_in_file +
- (bytes >> blkbits);
- /* Page fetching state */
- dio->head = 0;
- dio->tail = 0;
- dio->curr_page = 0;
+/*
+ * For block device access DIO_NO_LOCKING is used,
+ * neither readers nor writers do any locking at all
+ * For regular files using DIO_LOCKING,
+ * readers need to grab i_mutex and i_alloc_sem
+ * writers need to grab i_alloc_sem only (i_mutex is already held)
+ * For regular files using DIO_OWN_LOCKING,
+ * neither readers nor writers take any locks here
+ */
+static int dio_lock_and_flush(int rw, struct inode *inode, int dio_lock_type,
+ loff_t offset, loff_t end)
+{
+ int ret;
- dio->total_pages = 0;
- if (user_addr & (PAGE_SIZE-1)) {
- dio->total_pages++;
- bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1));
- }
- dio->total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
- dio->curr_user_address = user_addr;
-
- ret = do_direct_IO(dio);
+ if (dio_lock_type == DIO_NO_LOCKING)
+ return 0;
- dio->result += iov[seg].iov_len -
- ((dio->final_block_in_request - dio->block_in_file) <<
- blkbits);
+ /* watch out for a 0 len io from a tricksy fs */
+ if (rw == READ && end > offset) {
+ if (dio_lock_type != DIO_OWN_LOCKING)
+ mutex_lock(&inode->i_mutex);
+ ret = filemap_write_and_wait_range(inode->i_mapping, offset,
+ end - 1);
if (ret) {
- dio_cleanup(dio);
- break;
+ if (dio_lock_type != DIO_OWN_LOCKING)
+ mutex_unlock(&inode->i_mutex);
+ return ret;
}
- } /* end iovec loop */
- if (ret == -ENOTBLK && (rw & WRITE)) {
+ if (dio_lock_type == DIO_OWN_LOCKING)
+ mutex_unlock(&inode->i_mutex);
+ }
+
+ if (dio_lock_type == DIO_LOCKING)
+ /* lockdep: not the owner will release it */
+ down_read_non_owner(&inode->i_alloc_sem);
+
+ return 0;
+}
+
+static ssize_t dio_post_submission(struct dio *dio, loff_t offset, loff_t end,
+ int ret)
+{
+ unsigned long flags;
+ int ret2;
+
+ if (ret == -ENOTBLK && (dio->rw & WRITE)) {
/*
* The remaining part of the request will be
* be handled by buffered I/O when we return
@@ -1029,7 +1068,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
dio_bio_submit(dio);
/* All IO is now issued, send it on its way */
- blk_run_address_space(inode->i_mapping);
+ blk_run_address_space(dio->inode->i_mapping);
/*
* It is possible that, we return short IO due to end of file.
@@ -1042,7 +1081,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
* we can let i_mutex go now that its achieved its purpose
* of protecting us from looking up uninitialized blocks.
*/
- if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
+ if ((dio->rw == READ) && (dio->lock_type == DIO_LOCKING))
mutex_unlock(&dio->inode->i_mutex);
/*
@@ -1054,7 +1093,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
*/
BUG_ON(ret == -EIOCBQUEUED);
if (dio->is_async && ret == 0 && dio->result &&
- ((rw & READ) || (dio->result == dio->size)))
+ ((dio->rw & READ) || (dio->result == dio->size)))
ret = -EIOCBQUEUED;
if (ret != -EIOCBQUEUED)
@@ -1081,6 +1120,22 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
} else
BUG_ON(ret != -EIOCBQUEUED);
+ /*
+ * In case of error extending write may have instantiated a few
+ * blocks outside i_size. Trim these off again for DIO_LOCKING.
+ * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this by
+ * their own means.
+ */
+ if (unlikely(ret < 0 && (dio->rw & WRITE))) {
+ loff_t isize = i_size_read(dio->inode);
+
+ if (end > isize && dio->lock_type == DIO_LOCKING)
+ vmtruncate(dio->inode, isize);
+ }
+
+ if (dio->lock_type == DIO_OWN_LOCKING && dio->rw == READ)
+ mutex_lock(&dio->inode->i_mutex);
+
return ret;
}
@@ -1112,122 +1167,84 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
int dio_lock_type)
{
int seg;
- size_t size;
- unsigned long addr;
+ size_t bytes;
+ unsigned long user_addr;
unsigned blkbits = inode->i_blkbits;
- unsigned bdev_blkbits = 0;
- unsigned blocksize_mask = (1 << blkbits) - 1;
- ssize_t retval = -EINVAL;
+ ssize_t ret;
loff_t end = offset;
struct dio *dio;
- int release_i_mutex = 0;
- int acquire_i_mutex = 0;
-
- if (rw & WRITE)
- rw = WRITE_ODIRECT;
-
- if (bdev)
- bdev_blkbits = blksize_bits(bdev_logical_block_size(bdev));
- if (offset & blocksize_mask) {
- if (bdev)
- blkbits = bdev_blkbits;
- blocksize_mask = (1 << blkbits) - 1;
- if (offset & blocksize_mask)
- goto out;
+ if (dio_unaligned(offset, &blkbits, bdev)) {
+ ret = -EINVAL;
+ goto out;
}
/* Check the memory alignment. Blocks cannot straddle pages */
for (seg = 0; seg < nr_segs; seg++) {
- addr = (unsigned long)iov[seg].iov_base;
- size = iov[seg].iov_len;
- end += size;
- if ((addr & blocksize_mask) || (size & blocksize_mask)) {
- if (bdev)
- blkbits = bdev_blkbits;
- blocksize_mask = (1 << blkbits) - 1;
- if ((addr & blocksize_mask) || (size & blocksize_mask))
- goto out;
+ user_addr = (unsigned long)iov[seg].iov_base;
+ bytes = iov[seg].iov_len;
+ end += bytes;
+ if (dio_unaligned(user_addr|bytes, &blkbits, bdev)) {
+ ret = -EINVAL;
+ goto out;
}
}
- dio = kzalloc(sizeof(*dio), GFP_KERNEL);
- retval = -ENOMEM;
- if (!dio)
+ dio = dio_prepare(rw, iocb, inode, blkbits, offset, end, get_block,
+ end_io, dio_lock_type);
+ if (!dio) {
+ ret = -ENOMEM;
goto out;
+ }
- /*
- * For block device access DIO_NO_LOCKING is used,
- * neither readers nor writers do any locking at all
- * For regular files using DIO_LOCKING,
- * readers need to grab i_mutex and i_alloc_sem
- * writers need to grab i_alloc_sem only (i_mutex is already held)
- * For regular files using DIO_OWN_LOCKING,
- * neither readers nor writers take any locks here
- */
- dio->lock_type = dio_lock_type;
- if (dio_lock_type != DIO_NO_LOCKING) {
- /* watch out for a 0 len io from a tricksy fs */
- if (rw == READ && end > offset) {
- struct address_space *mapping;
-
- mapping = iocb->ki_filp->f_mapping;
- if (dio_lock_type != DIO_OWN_LOCKING) {
- mutex_lock(&inode->i_mutex);
- release_i_mutex = 1;
- }
-
- retval = filemap_write_and_wait_range(mapping, offset,
- end - 1);
- if (retval) {
- kfree(dio);
- goto out;
- }
-
- if (dio_lock_type == DIO_OWN_LOCKING) {
- mutex_unlock(&inode->i_mutex);
- acquire_i_mutex = 1;
- }
- }
+ ret = dio_lock_and_flush(rw, inode, dio_lock_type, offset, end);
+ if (ret) {
+ kfree(dio);
+ goto out;
+ }
- if (dio_lock_type == DIO_LOCKING)
- /* lockdep: not the owner will release it */
- down_read_non_owner(&inode->i_alloc_sem);
+ for (seg = 0; seg < nr_segs; seg++) {
+ user_addr = (unsigned long)iov[seg].iov_base;
+ dio->pages_in_io +=
+ ((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE
+ - user_addr/PAGE_SIZE);
}
- /*
- * For file extending writes updating i_size before data
- * writeouts complete can expose uninitialized blocks. So
- * even for AIO, we need to wait for i/o to complete before
- * returning in this case.
- */
- dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
- (end > i_size_read(inode)));
+ for (seg = 0; seg < nr_segs; seg++) {
+ user_addr = (unsigned long)iov[seg].iov_base;
+ dio->size += bytes = iov[seg].iov_len;
- retval = direct_io_worker(rw, iocb, inode, iov, offset,
- nr_segs, blkbits, get_block, end_io, dio);
+ /* Index into the first page of the first block */
+ dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
+ dio->final_block_in_request = dio->block_in_file +
+ (bytes >> blkbits);
+ /* Page fetching state */
+ dio->head = 0;
+ dio->tail = 0;
+ dio->curr_page = 0;
- /*
- * In case of error extending write may have instantiated a few
- * blocks outside i_size. Trim these off again for DIO_LOCKING.
- * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this by
- * it's own meaner.
- */
- if (unlikely(retval < 0 && (rw & WRITE))) {
- loff_t isize = i_size_read(inode);
+ dio->total_pages = 0;
+ if (user_addr & (PAGE_SIZE-1)) {
+ dio->total_pages++;
+ bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1));
+ }
+ dio->total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+ dio->curr_user_address = user_addr;
+
+ ret = do_direct_IO(dio);
- if (end > isize && dio_lock_type == DIO_LOCKING)
- vmtruncate(inode, isize);
- }
+ dio->result += iov[seg].iov_len -
+ ((dio->final_block_in_request - dio->block_in_file) <<
+ blkbits);
- if (rw == READ && dio_lock_type == DIO_LOCKING)
- release_i_mutex = 0;
+ if (ret) {
+ dio_cleanup(dio);
+ break;
+ }
+ }
+ ret = dio_post_submission(dio, offset, end, ret);
out:
- if (release_i_mutex)
- mutex_unlock(&inode->i_mutex);
- else if (acquire_i_mutex)
- mutex_lock(&inode->i_mutex);
- return retval;
+ return ret;
}
EXPORT_SYMBOL(__blockdev_direct_IO);
--
1.6.2.5
* [PATCH 6/8] dio: add an entry point which takes pages
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This adds a high-level entry point into the direct-io code which calls the
helpers on memory specified by pages instead of iovecs.
curr_user_address is used to decide if we should be dirtying the memory pages.
In our case, we don't want to.
The trick here is to initialize the dio state so that do_direct_IO() consumes
the pages we provide and never tries to map user pages. This is done by making
sure that final_block_in_request covers all the pages we provide.
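Spelled out, the state handed to do_direct_IO() for each bio_vec looks like
this (a hypothetical helper that just restates the per-bvec setup in the loop
below; with the page queue pre-filled and total_pages == 1, dio_get_page()
never has to call dio_refill_pages(), so user pages are never mapped):

static void dio_load_bvec(struct dio *dio, const struct bio_vec *bv,
			  unsigned blkbits)
{
	dio->first_block_in_page = bv->bv_offset >> blkbits;
	dio->final_block_in_request = dio->block_in_file +
				      (bv->bv_len >> blkbits);
	page_cache_get(bv->bv_page);	/* dio_bio_complete() drops it */
	dio->pages[0] = bv->bv_page;	/* the queue already holds the page */
	dio->head = 0;
	dio->tail = 1;
	dio->total_pages = 1;
	dio->curr_page = 0;
	dio->curr_user_address = 0;	/* marks the memory as non-user */
}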
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
fs/direct-io.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/fs.h | 4 ++
2 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0a2ba8e..3551c4a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -109,7 +109,7 @@ struct dio {
*/
int curr_page; /* changes */
int total_pages; /* doesn't change */
- unsigned long curr_user_address;/* changes */
+ unsigned long curr_user_address;/* changes, indicates user pages */
/*
* Page queue. These variables belong to dio_refill_pages() and
@@ -337,7 +337,7 @@ static void dio_bio_submit(struct dio *dio)
dio->refcount++;
spin_unlock_irqrestore(&dio->bio_lock, flags);
- if (dio->is_async && dio->rw == READ)
+ if (dio->is_async && dio->rw == READ && dio->curr_user_address)
bio_set_pages_dirty(bio);
submit_bio(dio->rw, bio);
@@ -403,13 +403,14 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
if (!uptodate)
dio->io_error = -EIO;
- if (dio->is_async && dio->rw == READ) {
+ if (dio->is_async && dio->rw == READ && dio->curr_user_address) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
for (page_no = 0; page_no < bio->bi_vcnt; page_no++) {
struct page *page = bvec[page_no].bv_page;
- if (dio->rw == READ && !PageCompound(page))
+ if (dio->rw == READ && !PageCompound(page) &&
+ dio->curr_user_address)
set_page_dirty_lock(page);
page_cache_release(page);
}
@@ -1248,3 +1249,80 @@ out:
return ret;
}
EXPORT_SYMBOL(__blockdev_direct_IO);
+
+ssize_t
+__blockdev_direct_IO_pages(int rw, struct kiocb *iocb, struct inode *inode,
+ struct block_device *bdev, const struct bio_vec *bvec,
+ unsigned long bvec_len, loff_t offset, get_block_t get_block,
+ dio_iodone_t end_io, int dio_lock_type)
+{
+ unsigned blkbits = inode->i_blkbits;
+ ssize_t ret;
+ loff_t end = offset;
+ struct dio *dio;
+ unsigned long i;
+
+ if (dio_unaligned(offset, &blkbits, bdev)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Check the memory alignment. Blocks cannot straddle pages */
+ for (i = 0; i < bvec_len; i++) {
+ end += bvec[i].bv_len;
+ if (dio_unaligned(bvec[i].bv_len | bvec[i].bv_offset,
+ &blkbits, bdev)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ dio = dio_prepare(rw, iocb, inode, blkbits, offset, end, get_block,
+ end_io, dio_lock_type);
+ if (!dio) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = dio_lock_and_flush(rw, inode, dio_lock_type, offset, end);
+ if (ret) {
+ kfree(dio);
+ goto out;
+ }
+
+ dio->pages_in_io = bvec_len;
+
+ for (i = 0; i < bvec_len; i++) {
+ dio->size += bvec[i].bv_len;
+
+ /* Index into the first page of the first block */
+ dio->first_block_in_page = bvec[i].bv_offset >> blkbits;
+ dio->final_block_in_request = dio->block_in_file +
+ (bvec[i].bv_len >> blkbits);
+ /* Page fetching state */
+ dio->curr_page = 0;
+ page_cache_get(bvec[i].bv_page);
+ dio->pages[0] = bvec[i].bv_page;
+ dio->head = 0;
+ dio->tail = 1;
+
+ dio->total_pages = 1;
+ dio->curr_user_address = 0;
+
+ ret = do_direct_IO(dio);
+
+ dio->result += bvec[i].bv_len -
+ ((dio->final_block_in_request - dio->block_in_file) <<
+ blkbits);
+
+ if (ret) {
+ dio_cleanup(dio);
+ break;
+ }
+ }
+
+ ret = dio_post_submission(dio, offset, end, ret);
+out:
+ return ret;
+}
+EXPORT_SYMBOL(__blockdev_direct_IO_pages);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba15f0..01c0b71 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2266,6 +2266,10 @@ ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
int lock_type);
+ssize_t __blockdev_direct_IO_pages(int rw, struct kiocb *iocb,
+ struct inode *inode, struct block_device *bdev,
+ const struct bio_vec *bvec, unsigned long bvec_len, loff_t offset,
+ get_block_t get_block, dio_iodone_t end_io, int dio_lock_type);
enum {
DIO_LOCKING = 1, /* need locking between buffered and direct access */
--
1.6.2.5
* [PATCH 7/8] block: provide aio_read_pages and aio_write_pages
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This adds functions which perform direct aio read and write on the specified
pages. This is just a hack to prototype the surrounding code without having to
start with file systems. It simply copies the generic paths that direct IO to
block devices needs, and it calls __blockdev_direct_IO_pages() directly
instead of changing the ->direct_IO() prototype.
To do this properly we're going to have to weave support for specifying memory
with pages into the generic paths. We could have one entry point which takes
an abstract memory argument, with conditionals in the path for working on
iovecs or pages. Or we could have entry points for each memory argument that
call shared helpers; the signatures would look roughly as sketched below.
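The two alternatives amount to a choice of signatures, roughly (hypothetical
sketches, neither is implemented here; struct dio_memory is a made-up name):

/* one entry point taking an abstract memory argument */
ssize_t generic_file_direct_IO(int rw, struct kiocb *iocb,
			       struct dio_memory *mem, loff_t pos);

/* or parallel entry points per memory type that call shared helpers */
ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
			      unsigned long nr_segs, loff_t pos);
ssize_t generic_file_aio_read_pages(struct kiocb *iocb,
				    const struct bio_vec *bvec,
				    unsigned long bvec_len, loff_t pos);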
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
fs/block_dev.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 113 insertions(+), 0 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9cf4b92..1a07574 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1436,6 +1436,117 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
}
EXPORT_SYMBOL_GPL(blkdev_aio_write);
+/* XXX did I miss an existing helper? _length() like iov_length().. */
+static ssize_t bvec_length(const struct bio_vec *bvec, unsigned long bvec_len)
+{
+ ssize_t bytes = 0;
+ BUG_ON(bvec_len == 0);
+ while (bvec_len--)
+ bytes += (bvec++)->bv_len;
+ return bytes;
+}
+
+/*
+ * A quick prototype for testing. This has copied the direct reading path of
+ * raw_fops.aio_read which doesn't use iovecs.
+ */
+ssize_t blkdev_aio_read_pages(struct kiocb *iocb, const struct bio_vec *bvec,
+ unsigned long bvec_len, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ ssize_t ret;
+ size_t count;
+
+ BUG_ON(iocb->ki_pos != pos);
+
+ if (!(file->f_flags & O_DIRECT))
+ return -EINVAL;
+
+ if (pos >= i_size_read(inode))
+ return 0;
+
+ count = bvec_length(bvec, bvec_len);
+ if (!count)
+ return 0;
+
+ ret = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
+ if (ret)
+ return ret;
+
+ ret = __blockdev_direct_IO_pages(READ, iocb, inode, I_BDEV(inode),
+ bvec, bvec_len, pos,
+ blkdev_get_blocks, NULL,
+ DIO_NO_LOCKING);
+ if (ret)
+ file_accessed(file);
+ return ret;
+}
+
+/*
+ * A quick prototype for testing. This has copied the direct writing path of
+ * raw_fops.aio_write which doesn't use iovecs.
+ */
+ssize_t blkdev_aio_write_pages(struct kiocb *iocb, const struct bio_vec *bvec,
+ unsigned long bvec_len, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ ssize_t ret;
+ size_t count;
+
+ BUG_ON(iocb->ki_pos != pos);
+
+ if (!(file->f_flags & O_DIRECT))
+ return -EINVAL;
+
+ vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+
+ count = bvec_length(bvec, bvec_len);
+
+ ret = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
+ if (ret)
+ goto out;
+
+ if (!count)
+ return 0;
+
+ ret = file_remove_suid(file);
+ if (ret)
+ goto out;
+
+ file_update_time(file);
+
+ ret = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
+ if (ret)
+ goto out;
+
+ if (mapping->nrpages) {
+ ret = invalidate_inode_pages2_range(mapping,
+ pos >> PAGE_CACHE_SHIFT,
+ (pos + count - 1) >> PAGE_CACHE_SHIFT);
+ /* XXX this'll return -EBUSY when a page can't be validated
+ * as we're not falling back to buffered */
+ if (ret)
+ goto out;
+ }
+
+ ret = __blockdev_direct_IO_pages(WRITE, iocb, inode, I_BDEV(inode),
+ bvec, bvec_len, pos, blkdev_get_blocks,
+ NULL, DIO_NO_LOCKING);
+ if (ret > 0 || ret == -EIOCBQUEUED) {
+ ssize_t err;
+
+ err = generic_write_sync(file, pos, ret > 0 ? ret : count);
+ if (err < 0 && ret > 0)
+ ret = err;
+ }
+out:
+ return ret;
+}
+
/*
* Try to release a page associated with block device when the system
* is under memory pressure.
@@ -1477,6 +1588,8 @@ const struct file_operations def_blk_fops = {
#endif
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .aio_read_pages = blkdev_aio_read_pages,
+ .aio_write_pages = blkdev_aio_write_pages,
};
int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
--
1.6.2.5
* [PATCH 8/8] loop: use aio to perform io on the underlying file
From: Zach Brown @ 2009-10-22 20:25 UTC
To: linux-fsdevel
This uses the new kernel aio interface to process loopback IO by submitting
concurrent direct aio. Previously loop's IO was serialized by synchronous
processing in a thread. It specifies IO memory by directly referencing the
pages in the incoming bios rather than kmapping them.
This patch lets us start testing, but we should fix the following before it's
merged:
* We might want to fall back to non-vectored ops if the file doesn't support
vectored ops.
* Talk to Jens about cleverly translating bio flags (like barriers?) down
to the bios that fs/direct-io.c is building based on our pages?
* More carefully discover support for and enable O_DIRECT.
* Call into fs/aio.c to discover if aio is supported.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
drivers/block/loop.c | 62 +++++++++++++++++++++++++++++++++++++++++++------
include/linux/loop.h | 1 +
2 files changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index edda9ea..1180808 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -267,6 +267,41 @@ fail:
goto out;
}
+void lo_rw_aio_complete(u64 data, long res)
+{
+ struct bio *bio = (struct bio *)data;
+
+ if (res > 0)
+ res = 0;
+ else if (res < 0)
+ res = -EIO;
+
+ bio_endio(bio, res);
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct bio *bio, loff_t pos)
+{
+ struct file *file = lo->lo_backing_file;
+ struct kiocb *iocb;
+ unsigned short op;
+
+ iocb = aio_kernel_alloc(mapping_gfp_mask(file->f_mapping));
+ if (!iocb)
+ return -ENOMEM;
+
+ if (bio_rw(bio) & WRITE)
+ op = IOCB_CMD_PWRITEP;
+ else
+ op = IOCB_CMD_PREADP;
+
+ /* XXX can we just reference the bvec array like this? */
+ aio_kernel_init_rw(iocb, file, op, bio_iovec_idx(bio, bio->bi_idx),
+ bio_segments(bio), pos);
+ aio_kernel_init_callback(iocb, lo_rw_aio_complete, (u64)bio);
+
+ return aio_kernel_submit(iocb);
+}
+
/**
* __do_lo_send_write - helper for writing data to a loop device
*
@@ -467,13 +502,22 @@ lo_receive(struct loop_device *lo, struct bio *bio, int bsize, loff_t pos)
return ret;
}
-static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
+static void do_bio_filebacked(struct loop_device *lo, struct bio *bio)
{
loff_t pos;
int ret;
pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;
+ if (lo->lo_flags & LO_FLAGS_USE_AIO && lo->transfer == transfer_none) {
+ ret = lo_rw_aio(lo, bio, pos);
+ if (ret) {
+ ret = -EIO;
+ goto out;
+ }
+ return;
+ }
+
if (bio_rw(bio) == WRITE) {
bool barrier = bio_rw_flagged(bio, BIO_RW_BARRIER);
struct file *file = lo->lo_backing_file;
@@ -502,7 +546,7 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
ret = lo_receive(lo, bio, lo->lo_blocksize, pos);
out:
- return ret;
+ bio_endio(bio, ret);
}
/*
@@ -570,10 +614,8 @@ static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio)
if (unlikely(!bio->bi_bdev)) {
do_loop_switch(lo, bio->bi_private);
bio_put(bio);
- } else {
- int ret = do_bio_filebacked(lo, bio);
- bio_endio(bio, ret);
- }
+ } else
+ do_bio_filebacked(lo, bio);
}
/*
@@ -784,9 +826,13 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
if (S_ISREG(inode->i_mode) || S_ISBLK(inode->i_mode)) {
const struct address_space_operations *aops = mapping->a_ops;
- if (aops->write_begin)
+ if (file->f_op->aio_write_pages && file->f_op->aio_read_pages){
+ /* XXX dangerous hack */
+ file->f_flags |= O_DIRECT;
+ lo_flags |= LO_FLAGS_USE_AIO;
+ } else if (aops->write_begin)
lo_flags |= LO_FLAGS_USE_AOPS;
- if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
+ else if (!file->f_op->write)
lo_flags |= LO_FLAGS_READ_ONLY;
lo_blocksize = S_ISBLK(inode->i_mode) ?
diff --git a/include/linux/loop.h b/include/linux/loop.h
index 66c194e..97a955c 100644
--- a/include/linux/loop.h
+++ b/include/linux/loop.h
@@ -76,6 +76,7 @@ enum {
LO_FLAGS_READ_ONLY = 1,
LO_FLAGS_USE_AOPS = 2,
LO_FLAGS_AUTOCLEAR = 4,
+ LO_FLAGS_USE_AIO = 8,
};
#include <asm/posix_types.h> /* for __kernel_old_dev_t */
--
1.6.2.5
* Re: [RFC] loop: issue aio with pages
From: Christoph Hellwig @ 2009-10-25 7:36 UTC
To: Zach Brown; +Cc: linux-fsdevel
Thanks, from the high-level POV this looks much better than any of the
previous attempts at page-based kernel I/O. Just wondering, shouldn't
the first two patches be a separate series that can be merged even
before the rest is ready?
* Re: [PATCH 2/8] aio: disable retry
From: Christoph Hellwig @ 2009-10-25 7:37 UTC
To: Zach Brown; +Cc: linux-fsdevel
On Thu, Oct 22, 2009 at 01:25:51PM -0700, Zach Brown wrote:
> This patch ensures that aio ops won't be retried. Code that uses EIOCBRETRY
> will fail to build and modules that reference kick_iocb() won't load.
>
> This simplifies an upcoming patch by ensuring that it doesn't have to support
> aio ops that retry. Further patches could remove much more of fs/aio.c if we
> chose to remove retry-based aio support.
Is there any good reason to keep it?
* Re: [PATCH 1/8] gadgetfs: use schedule_work() instead of EIOCBRETRY
From: Jeff Moyer @ 2009-10-26 15:57 UTC
To: Zach Brown; +Cc: linux-fsdevel
Zach Brown <zach.brown@oracle.com> writes:
> gadgetfs was the sole mainline user of the retry mechanism offered by fs/aio.c.
> It would set the iocb's ki_retry to a tiny function that is executed once an
> async read completed. The function would simply copy data that was dmaed into
> a kernel buffer out to the userspace buffer for the read.
>
> This use was not worth the complexity and volume of code needed to support
> retrying iocbs in the aio core. Many other file systems already do this kind
> of post processing work in their own threads today. This work that gadgetfs
> does pales in comparison to the work that btrfs and xfs do.
>
> This patch moves this trivial copying into a work struct that is processed by
> keventd. It uses the {un,}use_mm() calls which were recently promoted out of
> aio into a proper API.
>
> This paves the way for removing the dangerous and slow retry functionality from
> the aio core.
>
> Signed-off-by: Zach Brown <zach.brown@oracle.com>
Looks ok to me.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
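For context, the shape of the change described above is roughly the
following condensed sketch; the struct and field names are
illustrative, not quoted from the patch:

    #include <linux/workqueue.h>
    #include <linux/mmu_context.h>

    /* priv->mm is saved (and referenced) when the aio read is
     * submitted, so the worker can reach the submitter's buffer */
    static void ep_user_copy_worker(struct work_struct *work)
    {
            struct kiocb_priv *priv = container_of(work,
                                            struct kiocb_priv, work);
            ssize_t ret = priv->actual;

            use_mm(priv->mm);       /* adopt the submitter's address space */
            if (copy_to_user(priv->ubuf, priv->kbuf, priv->actual))
                    ret = -EFAULT;
            unuse_mm(priv->mm);

            aio_complete(priv->iocb, ret, 0);       /* finish the aio read */
    }

    /* and the USB completion handler, instead of returning
     * EIOCBRETRY, just queues the work: */
    INIT_WORK(&priv->work, ep_user_copy_worker);
    schedule_work(&priv->work);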
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/8] aio: disable retry
2009-10-22 20:25 ` [PATCH 2/8] aio: disable retry Zach Brown
2009-10-22 20:25 ` [PATCH 3/8] aio: add an interface to submit aio from the kernel Zach Brown
2009-10-25 7:37 ` [PATCH 2/8] aio: disable retry Christoph Hellwig
@ 2009-10-26 16:00 ` Jeff Moyer
2 siblings, 0 replies; 24+ messages in thread
From: Jeff Moyer @ 2009-10-26 16:00 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel
Zach Brown <zach.brown@oracle.com> writes:
> This patch ensures that aio ops won't be retried. Code that uses EIOCBRETRY
> will fail to build and modules that reference kick_iocb() won't load.
>
> This simplifies an upcoming patch by ensuring that it doesn't have to support
> aio ops that retry. Further patches could remove much more of fs/aio.c if we
> chose to remove retry-based aio support.
Frankly, the retry-based AIO never made it, and no one has been
promoting it for, what, 5 years or more? Let's just get rid of it.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/8] aio: add an interface to submit aio from the kernel
2009-10-22 20:25 ` [PATCH 3/8] aio: add an interface to submit aio from the kernel Zach Brown
2009-10-22 20:25 ` [PATCH 4/8] aio: add aio_read_pages and aio_write_pages Zach Brown
@ 2009-10-26 16:10 ` Jeff Moyer
2009-10-26 22:21 ` Zach Brown
1 sibling, 1 reply; 24+ messages in thread
From: Jeff Moyer @ 2009-10-26 16:10 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel
Zach Brown <zach.brown@oracle.com> writes:
> This adds a simple interface that lets other parts of the kernel submit aio
> iocbs. Callers provide a function which is called as the IO completes.
>
> These iocbs aren't tracked to reduce overhead: they can't be canceled, callers
> limit the number in flight, and previous patches in this series removed
> retry-based aio.
>
> Signed-off-by: Zach Brown <zach.brown@oracle.com>
> +void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
> + unsigned short op, void *ptr, size_t nr, loff_t off)
> +{
> + iocb->ki_filp = filp;
> + iocb->ki_opcode = op;
> + iocb->ki_buf = (char __user *)(unsigned long)ptr;
> + iocb->ki_left = nr;
> + iocb->ki_nbytes = nr;
> + iocb->ki_pos = off;
> +}
> +EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
Why isn't this just a static inline in a header, like the
io_prep_pread/pwrite methods in libaio.h? Not a big deal, just curious.
> +int aio_kernel_submit(struct kiocb *iocb)
Why are you limiting this to just 1 iocb at a time? Is it because the
overhead of a function call is pretty small (as compared to the
user/kernel context switch for the system call)? I guess we can add a
mechanism to submit multiple iocbs if it becomes necessary.
Cheers,
Jeff
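To make the interface concrete, a hypothetical kernel caller might look
like this; aio_kernel_alloc() and the callback hookup are assumptions,
since only aio_kernel_init_rw() and aio_kernel_submit() appear in the
excerpt above:

    /* called when the IO finishes; res is a byte count or -errno */
    static void my_io_done(struct kiocb *iocb, long res)
    {
            /* e.g. complete the bio that spawned this iocb */
    }

    static int submit_one_read(struct file *filp, void *buf, size_t len,
                               loff_t off)
    {
            struct kiocb *iocb = aio_kernel_alloc(GFP_KERNEL);  /* assumed */

            if (!iocb)
                    return -ENOMEM;
            aio_kernel_init_rw(iocb, filp, IOCB_CMD_PREAD, buf, len, off);
            aio_kernel_init_callback(iocb, my_io_done);         /* assumed */
            return aio_kernel_submit(iocb);
    }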
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/8] aio: add aio_read_pages and aio_write_pages
2009-10-22 20:25 ` [PATCH 4/8] aio: add aio_read_pages and aio_write_pages Zach Brown
2009-10-22 20:25 ` [PATCH 5/8] dio: refactor __blockdev_direct_IO() Zach Brown
@ 2009-10-26 16:17 ` Jeff Moyer
2009-10-26 17:08 ` Jeff Moyer
1 sibling, 1 reply; 24+ messages in thread
From: Jeff Moyer @ 2009-10-26 16:17 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel
Zach Brown <zach.brown@oracle.com> writes:
> This adds read and write file operations which specify memory with an array of
> page pointers, offsets, and lengths. AIO commands are added which call these
> methods. Fields are added to the iocb to specify the pages which are passed to
> the methods.
>
> This is intended to be used by callers in the kernel.
>
> Signed-off-by: Zach Brown <zach.brown@oracle.com>
> diff --git a/include/linux/aio_abi.h b/include/linux/aio_abi.h
> index 2c87316..5b67ed6 100644
> --- a/include/linux/aio_abi.h
> +++ b/include/linux/aio_abi.h
> @@ -44,6 +44,8 @@ enum {
> IOCB_CMD_NOOP = 6,
> IOCB_CMD_PREADV = 7,
> IOCB_CMD_PWRITEV = 8,
> + IOCB_CMD_PREADP = 9,
> + IOCB_CMD_PWRITEP = 10,
I question the merits of adding commands to the aio_abi that cannot be
called from userspace. However, I don't know how else you would keep
things consistent. I guess we live with it.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cheers,
Jeff
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/8] aio: add aio_read_pages and aio_write_pages
2009-10-26 16:17 ` [PATCH 4/8] aio: add aio_read_pages and aio_write_pages Jeff Moyer
@ 2009-10-26 17:08 ` Jeff Moyer
2009-10-26 22:22 ` Zach Brown
0 siblings, 1 reply; 24+ messages in thread
From: Jeff Moyer @ 2009-10-26 17:08 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel
Jeff Moyer <jmoyer@redhat.com> writes:
> Zach Brown <zach.brown@oracle.com> writes:
>
>> This adds read and write file operations which specify memory with an array of
>> page pointers, offsets, and lengths. AIO commands are added which call these
>> methods. Fields are added to the iocb to specify the pages which are passed to
>> the methods.
>>
>> This is intended to be used by callers in the kernel.
>>
>> Signed-off-by: Zach Brown <zach.brown@oracle.com>
>> diff --git a/include/linux/aio_abi.h b/include/linux/aio_abi.h
>> index 2c87316..5b67ed6 100644
>> --- a/include/linux/aio_abi.h
>> +++ b/include/linux/aio_abi.h
>> @@ -44,6 +44,8 @@ enum {
>> IOCB_CMD_NOOP = 6,
>> IOCB_CMD_PREADV = 7,
>> IOCB_CMD_PWRITEV = 8,
>> + IOCB_CMD_PREADP = 9,
>> + IOCB_CMD_PWRITEP = 10,
>
> I question the merits of adding commands to the aio_abi that cannot be
> called from userspace. However, I don't know how else you would keep
> things consistent. I guess we live with it.
Actually, I didn't see code in the patch to prevent userspace from
issuing such commands. Shouldn't we check for that and spit an error?
It sounds like a security risk if we don't.
Cheers,
Jeff
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] loop: issue aio with pages
2009-10-25 7:36 ` [RFC] loop: issue aio with pages Christoph Hellwig
@ 2009-10-26 22:13 ` Zach Brown
0 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2009-10-26 22:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel
Christoph Hellwig wrote:
> Thanks, from the high-level POV this looks much better than any of the
> previous attempts at page-based kernel I/O.
Great, I'll keep at it.
> Just wondering, shouldn't
> the first two patches be a separate series that can be merged even
> before the rest is ready?
Yeah, I think so. The second patch should be expanded a bit to really
pull out all the retrying code. I can work on that series independently.
Those first patches were just in this series to save the patch that
introduced the aio api for kernel callers from having to worry about
calling aio providers that would try to queue iocbs for retrying.
- z
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/8] aio: disable retry
2009-10-25 7:37 ` [PATCH 2/8] aio: disable retry Christoph Hellwig
@ 2009-10-26 22:15 ` Zach Brown
0 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2009-10-26 22:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel
> Is there any good reason to keep it?
Not that I know of.
And I think that we'll be pleased with the simplicity of fs/aio.c once
all the retrying code is gone. We'll see.
- z
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/8] aio: add an interface to submit aio from the kernel
2009-10-26 16:10 ` [PATCH 3/8] aio: add an interface to submit aio from the kernel Jeff Moyer
@ 2009-10-26 22:21 ` Zach Brown
0 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2009-10-26 22:21 UTC (permalink / raw)
To: Jeff Moyer; +Cc: linux-fsdevel
> Why isn't this just a static inline in a header, like the
> io_prep_pread/pwrite methods in libaio.h? Not a big deal, just curious.
Honestly, I didn't give it much thought -- I just threw it together this
way. I'd be happy either way, though I guess I tend to prefer to avoid
inlining.
>> +int aio_kernel_submit(struct kiocb *iocb)
>
> Why are you limiting this to just 1 iocb at a time?
Well, the current user will only be issuing one at a time. We can worry
about adding that complexity when someone can justify it :).
- z
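The header-only variant Jeff mentions would simply carry the same body
as the quoted export, marked static inline in linux/aio.h:

    static inline void aio_kernel_init_rw(struct kiocb *iocb,
                                          struct file *filp,
                                          unsigned short op, void *ptr,
                                          size_t nr, loff_t off)
    {
            iocb->ki_filp = filp;
            iocb->ki_opcode = op;
            iocb->ki_buf = (char __user *)(unsigned long)ptr;
            iocb->ki_left = nr;
            iocb->ki_nbytes = nr;
            iocb->ki_pos = off;
    }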
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/8] aio: add aio_read_pages and aio_write_pages
2009-10-26 17:08 ` Jeff Moyer
@ 2009-10-26 22:22 ` Zach Brown
0 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2009-10-26 22:22 UTC (permalink / raw)
To: Jeff Moyer; +Cc: linux-fsdevel
> I question the merits of adding commands to the aio_abi that cannot be
> called from userspace. However, I don't know how else you would keep
> things consistent. I guess we live with it.
I didn't like it either. I can give this some more thought.
> Actually, I didn't see code in the patch to prevent userspace from
> issuing such commands. Shouldn't we check for that and spit an error?
> It sounds like a security risk if we don't.
See, that's why we keep you around! Yes, that's one of the things we
have to fix before it can be merged.
- z
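The fix being agreed to here could be as small as an opcode check on
the submission path; a sketch, with its placement (io_submit_one() or
aio_setup_iocb()) assumed:

    static int aio_opcode_allowed_from_user(u16 opcode)
    {
            switch (opcode) {
            case IOCB_CMD_PREADP:
            case IOCB_CMD_PWRITEP:
                    /* page-based commands carry kernel page pointers
                     * and must never be reachable from sys_io_submit() */
                    return 0;
            default:
                    return 1;
            }
    }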
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/8] dio: refactor __blockdev_direct_IO()
2009-10-22 20:25 ` [PATCH 5/8] dio: refactor __blockdev_direct_IO() Zach Brown
2009-10-22 20:25 ` [PATCH 6/8] dio: add an entry point which takes pages Zach Brown
@ 2009-10-27 15:39 ` Jeff Moyer
1 sibling, 0 replies; 24+ messages in thread
From: Jeff Moyer @ 2009-10-27 15:39 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel
Zach Brown <zach.brown@oracle.com> writes:
> This patch moves code around so that __blockdev_direct_IO() becomes a function
> which only iterates over its iovec argument while calling helper functions.
> This will let us add functions in the future which call the same helper
> functions while iterating over other ways of specifying memory for the IO.
>
> direct_io_worker() was only called once from __blockdev_direct_IO() but took
> the iovec argument. Instead of trying to pass down other ways of specifying
> memory, we move its body up into __blockdev_direct_IO() before pulling the
> initialization and teardown steps of this resulting huge function off into
> helpers. What's left is a small function that mostly concerns itself with
> calling helpers on the iovec.
>
> The dio_unaligned() helper is a little strange, but it reflects the flow of the
> original code. If we were to clean it up we'd run the risk of introducing
> bugs.
>
> Comments have not been updated to reflect the code change yet.
Wow, that's a dense patch. It looks right to me, and I like the
cleanup.
Cheers,
Jeff
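A structural sketch of what the description amounts to; the helper
names are invented here, only dio_unaligned() is the patch's own:

    static ssize_t blockdev_direct_IO_shape(struct dio *dio,
                                            const struct iovec *iov,
                                            unsigned long nr_segs)
    {
            unsigned long seg;
            ssize_t ret;

            ret = dio_init(dio);            /* pulled-out initialization */
            for (seg = 0; seg < nr_segs && ret >= 0; seg++) {
                    if (dio_unaligned(dio, &iov[seg]))
                            ret = -EINVAL;  /* mirrors the old flow */
                    else
                            ret = dio_iov_segment(dio, &iov[seg]);
            }
            return dio_finish(dio, ret);    /* pulled-out teardown */
    }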
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 6/8] dio: add an entry point which takes pages
2009-10-22 20:25 ` [PATCH 6/8] dio: add an entry point which takes pages Zach Brown
2009-10-22 20:25 ` [PATCH 7/8] block: provide aio_read_pages and aio_write_pages Zach Brown
@ 2009-10-27 15:49 ` Jeff Moyer
2009-10-27 17:50 ` Zach Brown
1 sibling, 1 reply; 24+ messages in thread
From: Jeff Moyer @ 2009-10-27 15:49 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel
Zach Brown <zach.brown@oracle.com> writes:
> This adds a high level entry point into the direct-io code which calls helpers
> on memory specified by pages instead of iovecs.
>
> curr_user_address is used to decide if we should be dirtying the memory pages.
> In our case, we don't want to.
>
> The trick here is to initialize the dio state so that do_direct_IO() consumes
> the pages we provide and never tries to map user pages. This is done by making
> sure that final_block_in_request covers all the pages we provide.
Looks sane, in general.
> - if (dio->is_async && dio->rw == READ)
> + if (dio->is_async && dio->rw == READ && dio->curr_user_address)
Any chance we can factor this check out into a macro or some such thing?
It's repeated several times.
Cheers,
Jeff
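The factored-out check might be as simple as this; the predicate name
is invented:

    /* only async reads into user-mapped memory need their pages
     * dirtied; kernel-supplied pages (NULL curr_user_address) are
     * left alone */
    static inline int dio_should_dirty_pages(struct dio *dio)
    {
            return dio->is_async && dio->rw == READ &&
                   dio->curr_user_address;
    }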
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 8/8] loop: use aio to perform io on the underlying file
2009-10-22 20:25 ` [PATCH 8/8] loop: use aio to perform io on the underlying file Zach Brown
@ 2009-10-27 16:01 ` Jeff Moyer
0 siblings, 0 replies; 24+ messages in thread
From: Jeff Moyer @ 2009-10-27 16:01 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel
Zach Brown <zach.brown@oracle.com> writes:
> This uses the new kernel aio interface to process loopback IO by submitting
> concurrent direct aio. Previously loop's IO was serialized by synchronous
> processing in a thread. It specifies io memory by directly referencing the
> pages in the incoming bios rather than kmapping them.
Wow, that's pretty straightforward. Nice job.
Cheers,
Jeff
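In outline, the bio-to-iocb translation the description implies looks
like this; the page/offset/length arrays and how they attach to the
iocb are assumptions of the sketch:

    struct bio_vec *bvec;
    loff_t pos = (loff_t)bio->bi_sector << 9;
    int i;

    bio_for_each_segment(bvec, bio, i) {
            pages[i]   = bvec->bv_page;     /* reference the bio's pages directly */
            offsets[i] = bvec->bv_offset;   /* no kmap(), no bounce copy */
            lengths[i] = bvec->bv_len;
    }

    aio_kernel_init_rw(iocb, lo->lo_backing_file,
                       bio_data_dir(bio) == WRITE ? IOCB_CMD_PWRITEP
                                                  : IOCB_CMD_PREADP,
                       pages, bio->bi_vcnt, pos);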
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 6/8] dio: add an entry point which takes pages
2009-10-27 15:49 ` [PATCH 6/8] dio: add an entry point which takes pages Jeff Moyer
@ 2009-10-27 17:50 ` Zach Brown
0 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2009-10-27 17:50 UTC (permalink / raw)
To: Jeff Moyer; +Cc: linux-fsdevel
>> + if (dio->is_async && dio->rw == READ && dio->curr_user_address)
>
> Any chance we can factor this check out into a macro or some such thing?
> It's repeated several times.
Yeah. I'm also nervous about overloading curr_user_address for this.
It was expedient, but it feels unnecessarily fragile.
- z
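One less fragile alternative, sketched with an invented bitfield:
record the memory source explicitly instead of inferring it from
curr_user_address:

    /* in struct dio, set only by the page-based entry point: */
    unsigned int kernel_pages:1;

    /* the repeated test (or the predicate sketched earlier) becomes: */
    if (dio->is_async && dio->rw == READ && !dio->kernel_pages)
            dio_bio_dirty_pages(dio);       /* illustrative call site */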
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, newest message: 2009-10-27 17:52 UTC
Thread overview: 24+ messages
2009-10-22 20:25 [RFC] loop: issue aio with pages Zach Brown
2009-10-22 20:25 ` [PATCH 1/8] gadgetfs: use schedule_work() instead of EIOCBRETRY Zach Brown
2009-10-22 20:25 ` [PATCH 2/8] aio: disable retry Zach Brown
2009-10-22 20:25 ` [PATCH 3/8] aio: add an interface to submit aio from the kernel Zach Brown
2009-10-22 20:25 ` [PATCH 4/8] aio: add aio_read_pages and aio_write_pages Zach Brown
2009-10-22 20:25 ` [PATCH 5/8] dio: refactor __blockdev_direct_IO() Zach Brown
2009-10-22 20:25 ` [PATCH 6/8] dio: add an entry point which takes pages Zach Brown
2009-10-22 20:25 ` [PATCH 7/8] block: provide aio_read_pages and aio_write_pages Zach Brown
2009-10-22 20:25 ` [PATCH 8/8] loop: use aio to perform io on the underlying file Zach Brown
2009-10-27 16:01 ` Jeff Moyer
2009-10-27 15:49 ` [PATCH 6/8] dio: add an entry point which takes pages Jeff Moyer
2009-10-27 17:50 ` Zach Brown
2009-10-27 15:39 ` [PATCH 5/8] dio: refactor __blockdev_direct_IO() Jeff Moyer
2009-10-26 16:17 ` [PATCH 4/8] aio: add aio_read_pages and aio_write_pages Jeff Moyer
2009-10-26 17:08 ` Jeff Moyer
2009-10-26 22:22 ` Zach Brown
2009-10-26 16:10 ` [PATCH 3/8] aio: add an interface to submit aio from the kernel Jeff Moyer
2009-10-26 22:21 ` Zach Brown
2009-10-25 7:37 ` [PATCH 2/8] aio: disable retry Christoph Hellwig
2009-10-26 22:15 ` Zach Brown
2009-10-26 16:00 ` Jeff Moyer
2009-10-26 15:57 ` [PATCH 1/8] gadgetfs: use schedule_work() instead of EIOCBRETRY Jeff Moyer
2009-10-25 7:36 ` [RFC] loop: issue aio with pages Christoph Hellwig
2009-10-26 22:13 ` Zach Brown