linux-fsdevel.vger.kernel.org archive mirror
* [RFC] fs io with struct page instead of iovecs
@ 2007-11-07  1:43 Zach Brown
  2007-11-07  1:43 ` [PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write Zach Brown
  2007-11-07 16:50 ` [RFC] fs io with struct page instead of iovecs Badari Pulavarty
  0 siblings, 2 replies; 8+ messages in thread
From: Zach Brown @ 2007-11-07  1:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Christoph Hellwig, David Chinner

At the FS meeting at LCE there was some talk of doing O_DIRECT writes from the
kernel with pages instead of with iovecs.  This patch series explores one
direction we could head in to achieve this.

We obviously can't just translate user iovecs (which might represent more
memory than the machine has) to pinned pages and pass page struct pointers all
the way down the stack.

And we don't want to copy a lot of the generic FS machinery into a parallel
path which works with pages instead of iovecs.

So this patch series heads in the direction of abstracting out the memory
argument to the read and write calls.  It's still based on segments, but we
hide from the callers in the rw stack whether the segments are iovecs or
arrays of page pointers.

I didn't go too nuts with the syntactic sugar.  This is just intended to show
the basic mechanics.  We can obviously pretty it up if we think this is a sane
thing to be doing.

The series builds but has never been run.

What do people (that's you, Christoph) think?  I'm flexible.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write
  2007-11-07  1:43 [RFC] fs io with struct page instead of iovecs Zach Brown
@ 2007-11-07  1:43 ` Zach Brown
  2007-11-07  1:43   ` [PATCH 2/4] dio: use rwmem to work with r/w memory arguments Zach Brown
  2007-11-07 16:50 ` [RFC] fs io with struct page instead of iovecs Badari Pulavarty
  1 sibling, 1 reply; 8+ messages in thread
From: Zach Brown @ 2007-11-07  1:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Christoph Hellwig, David Chinner

This adds a structure and interface to represent the segments of memory
which are acting as the source or destination for a read or write operation.

Callers would fill this structure and then pass it down the rw path.

The intent is to let stages in the rw path make specific calls against this
API and structure instead of working with, say, struct iovec natively.

The main goal is to enable kernel calls into the rw path which
specify memory with page/offset/len tuples.

Another potential benefit of this is the reduction in iterations over iovecs at
various points in the kernel.  Each iov_length(iov) call, for example, could be
translated into a read of rwm->total_bytes.  O_DIRECT's check of memory
alignment becomes a single test against rwm->boundary_bits.

I imagine this might integrate well with the iov_iter interface, though I
haven't examined that in any depth.
---
 fs/Makefile           |    2 +-
 fs/rwmem.c            |   92 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rwmem.h |   29 +++++++++++++++
 3 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 fs/rwmem.c
 create mode 100644 include/linux/rwmem.h

diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..c342365 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o
+		stack.o rwmem.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/rwmem.c b/fs/rwmem.c
new file mode 100644
index 0000000..0433ba4
--- /dev/null
+++ b/fs/rwmem.c
@@ -0,0 +1,92 @@
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/uio.h>
+#include <linux/rwmem.h>
+
+static inline unsigned long pages_spanned(unsigned long addr,
+					  unsigned long bytes)
+{
+	return ((addr + bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) -
+		(addr >> PAGE_SHIFT);
+}
+
+void rwmem_iovec_init(struct rwmem *rwm)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	struct iovec *iov;
+	unsigned long i;
+
+	rwm->total_bytes = 0;
+	rwm->nr_pages = 0;
+	rwm->boundary_bits = 0;
+
+	for (i = 0; i < rwm->nr_segs; i++) {
+		iov = &rwi->iov[i];
+
+		rwm->total_bytes += iov->iov_len;
+		rwm->nr_pages += pages_spanned((unsigned long)iov->iov_base,
+						    iov->iov_len);
+		rwm->boundary_bits |= (unsigned long)iov->iov_base |
+				      (unsigned long)iov->iov_len;
+	}
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_iovec_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return (unsigned long)rwi->iov[i].iov_base & ~PAGE_MASK;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_iovec_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwi->iov[i].iov_len;
+}
+
+int rwmem_iovec_get_seg_pages(struct rwmem *rwm, unsigned long i,
+			      unsigned long *cursor, struct page **pages,
+			      unsigned long max_pages, int write)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	struct iovec *iov;
+	int ret;
+
+	BUG_ON(i >= rwm->nr_segs);
+	iov = &rwi->iov[i];
+
+	if (*cursor == 0)
+		*cursor = (unsigned long)iov->iov_base;
+
+	max_pages = min(pages_spanned(*cursor, iov->iov_len -
+				      (*cursor - (unsigned long)iov->iov_base)),
+			max_pages);
+
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, *cursor, max_pages, write,
+			     0, pages, NULL);
+	up_read(&current->mm->mmap_sem);
+
+	if (ret > 0) { 
+		*cursor += ret * PAGE_SIZE;
+		if (*cursor >= (unsigned long)iov->iov_base + iov->iov_len)
+			*cursor = ~0;
+	}
+
+	return ret;
+}
+
+struct rwmem_ops rwmem_iovec_ops = {
+	.init			= rwmem_iovec_init,
+	.seg_page_offset	= rwmem_iovec_seg_page_offset,
+	.seg_bytes		= rwmem_iovec_seg_bytes,
+	.get_seg_pages		= rwmem_iovec_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
new file mode 100644
index 0000000..666f9f4
--- /dev/null
+++ b/include/linux/rwmem.h
@@ -0,0 +1,29 @@
+#ifndef _LINUX_RWMEM_H
+#define _LINUX_RWMEM_H
+
+struct rwmem_ops;
+
+struct rwmem {
+	struct rwmem_ops	*ops;
+	size_t			total_bytes;
+	unsigned long		boundary_bits;
+	unsigned long		nr_pages;
+	unsigned short		nr_segs;
+};
+
+struct rwmem_ops {
+	void (*init)(struct rwmem *rwm);
+	unsigned long (*seg_page_offset)(struct rwmem *rwm, unsigned long i);
+	unsigned long (*seg_bytes)(struct rwmem *rwm, unsigned long i);
+	int (*get_seg_pages)(struct rwmem *rwm, unsigned long i,
+			     unsigned long *cursor, struct page **pages,
+			     unsigned long max_pages, int write);
+};
+
+struct rwmem_iovec {
+	struct rwmem		rwmem;
+	const struct iovec	*iov;
+};
+extern struct rwmem_ops rwmem_iovec_ops;
+
+#endif
-- 
1.5.2.2



* [PATCH 2/4] dio: use rwmem to work with r/w memory arguments
  2007-11-07  1:43 ` [PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write Zach Brown
@ 2007-11-07  1:43   ` Zach Brown
  2007-11-07  1:43     ` [PATCH 3/4] add rwmem type backed by pages Zach Brown
  0 siblings, 1 reply; 8+ messages in thread
From: Zach Brown @ 2007-11-07  1:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Christoph Hellwig, David Chinner

This switches dio to work with the rwmem api to get memory pages for the IO
instead of working with iovecs directly.

It reads some static, universal properties of the set of memory segments that
make up the buffer argument directly from the rwm struct.

It calls the rwmem helper functions where it needs to work with the underlying
data structures.
---
 fs/direct-io.c |  123 +++++++++++++++++++++++++-------------------------------
 1 files changed, 55 insertions(+), 68 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index acf0da1..0d5ed41 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -34,7 +34,7 @@
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
 #include <linux/rwsem.h>
-#include <linux/uio.h>
+#include <linux/rwmem.h>
 #include <asm/atomic.h>
 
 /*
@@ -105,11 +105,12 @@ struct dio {
 	sector_t cur_page_block;	/* Where it starts */
 
 	/*
-	 * Page fetching state. These variables belong to dio_refill_pages().
+	 * Page fetching state. direct_io_worker() sets these for
+	 * dio_refill_pages() who modifies them as it fetches.
 	 */
-	int curr_page;			/* changes */
-	int total_pages;		/* doesn't change */
-	unsigned long curr_user_address;/* changes */
+	struct rwmem *rwm;
+	unsigned long cur_seg;
+	unsigned long cur_seg_cursor;
 
 	/*
 	 * Page queue.  These variables belong to dio_refill_pages() and
@@ -146,21 +147,11 @@ static inline unsigned dio_pages_present(struct dio *dio)
  */
 static int dio_refill_pages(struct dio *dio)
 {
+	struct rwmem *rwm = dio->rwm;
 	int ret;
-	int nr_pages;
-
-	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(
-		current,			/* Task for fault acounting */
-		current->mm,			/* whose pages? */
-		dio->curr_user_address,		/* Where from? */
-		nr_pages,			/* How many pages? */
-		dio->rw == READ,		/* Write to memory? */
-		0,				/* force (?) */
-		&dio->pages[0],
-		NULL);				/* vmas */
-	up_read(&current->mm->mmap_sem);
+
+	ret = rwm->ops->get_seg_pages(rwm, dio->cur_seg, &dio->cur_seg_cursor,
+				      dio->pages, DIO_PAGES, dio->rw == READ);
 
 	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
 		struct page *page = ZERO_PAGE(0);
@@ -180,8 +171,6 @@ static int dio_refill_pages(struct dio *dio)
 	}
 
 	if (ret >= 0) {
-		dio->curr_user_address += ret * PAGE_SIZE;
-		dio->curr_page += ret;
 		dio->head = 0;
 		dio->tail = ret;
 		ret = 0;
@@ -938,11 +927,9 @@ out:
  */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
-	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
-	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-	struct dio *dio)
+	struct rwmem *rwm, loff_t offset, unsigned blkbits,
+	get_block_t get_block, dio_iodone_t end_io, struct dio *dio)
 {
-	unsigned long user_addr; 
 	unsigned long flags;
 	int seg;
 	ssize_t ret = 0;
@@ -966,44 +953,33 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
+	dio->rwm = rwm;
+
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
 	 * pages since we need to zero out first and last block.
 	 */
+	dio->pages_in_io = rwm->nr_pages;
 	if (unlikely(dio->blkfactor))
-		dio->pages_in_io = 2;
-
-	for (seg = 0; seg < nr_segs; seg++) {
-		user_addr = (unsigned long)iov[seg].iov_base;
-		dio->pages_in_io +=
-			((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE
-				- user_addr/PAGE_SIZE);
-	}
+		dio->pages_in_io += 2;
 
-	for (seg = 0; seg < nr_segs; seg++) {
-		user_addr = (unsigned long)iov[seg].iov_base;
-		dio->size += bytes = iov[seg].iov_len;
+	for (seg = 0; seg < rwm->nr_segs; seg++) {
+		dio->size += bytes = rwm->ops->seg_bytes(rwm, seg);
 
 		/* Index into the first page of the first block */
-		dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
+		dio->first_block_in_page = 
+			rwm->ops->seg_page_offset(rwm, seg) >> blkbits;
 		dio->final_block_in_request = dio->block_in_file +
 						(bytes >> blkbits);
 		/* Page fetching state */
+		dio->cur_seg = seg;
+		dio->cur_seg_cursor = 0;
 		dio->head = 0;
 		dio->tail = 0;
-		dio->curr_page = 0;
 
-		dio->total_pages = 0;
-		if (user_addr & (PAGE_SIZE-1)) {
-			dio->total_pages++;
-			bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1));
-		}
-		dio->total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
-		dio->curr_user_address = user_addr;
-	
 		ret = do_direct_IO(dio);
 
-		dio->result += iov[seg].iov_len -
+		dio->result += bytes -
 			((dio->final_block_in_request - dio->block_in_file) <<
 					blkbits);
 
@@ -1113,15 +1089,11 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
  *
  * Additional i_alloc_sem locking requirements described inline below.
  */
-ssize_t
-__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
-	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
-	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
-	int dio_lock_type)
+static ssize_t
+blockdev_direct_IO_rwmem(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, struct rwmem *rwm, loff_t offset, 
+	get_block_t get_block, dio_iodone_t end_io, int dio_lock_type)
 {
-	int seg;
-	size_t size;
-	unsigned long addr;
 	unsigned blkbits = inode->i_blkbits;
 	unsigned bdev_blkbits = 0;
 	unsigned blocksize_mask = (1 << blkbits) - 1;
@@ -1146,17 +1118,12 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	}
 
 	/* Check the memory alignment.  Blocks cannot straddle pages */
-	for (seg = 0; seg < nr_segs; seg++) {
-		addr = (unsigned long)iov[seg].iov_base;
-		size = iov[seg].iov_len;
-		end += size;
-		if ((addr & blocksize_mask) || (size & blocksize_mask))  {
-			if (bdev)
-				 blkbits = bdev_blkbits;
-			blocksize_mask = (1 << blkbits) - 1;
-			if ((addr & blocksize_mask) || (size & blocksize_mask))  
-				goto out;
-		}
+	if (rwm->boundary_bits & blocksize_mask) {
+		if (bdev)
+			 blkbits = bdev_blkbits;
+		blocksize_mask = (1 << blkbits) - 1;
+		if (rwm->boundary_bits & blocksize_mask)
+			goto out;
 	}
 
 	dio = kzalloc(sizeof(*dio), GFP_KERNEL);
@@ -1212,8 +1179,8 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));
 
-	retval = direct_io_worker(rw, iocb, inode, iov, offset,
-				nr_segs, blkbits, get_block, end_io, dio);
+	retval = direct_io_worker(rw, iocb, inode, rwm, offset, blkbits,
+				  get_block, end_io, dio);
 
 	if (rw == READ && dio_lock_type == DIO_LOCKING)
 		release_i_mutex = 0;
@@ -1225,4 +1192,24 @@ out:
 		mutex_lock(&inode->i_mutex);
 	return retval;
 }
+
+ssize_t
+__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
+	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
+	int dio_lock_type)
+{
+	struct rwmem_iovec rwi = {
+		.rwmem.ops = &rwmem_iovec_ops,
+		.rwmem.nr_segs = nr_segs,
+		.iov = iov,
+	};
+	struct rwmem *rwm = &rwi.rwmem;
+
+	rwm->ops->init(rwm);
+
+	return blockdev_direct_IO_rwmem(rw, iocb, inode, bdev, rwm, offset,
+					get_block, end_io, dio_lock_type);
+}
+
 EXPORT_SYMBOL(__blockdev_direct_IO);
-- 
1.5.2.2



* [PATCH 3/4] add rwmem type backed by pages
  2007-11-07  1:43   ` [PATCH 2/4] dio: use rwmem to work with r/w memory arguments Zach Brown
@ 2007-11-07  1:43     ` Zach Brown
  2007-11-07  1:43       ` [PATCH 4/4] add dio interface for page/offset/len tuples Zach Brown
  0 siblings, 1 reply; 8+ messages in thread
From: Zach Brown @ 2007-11-07  1:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Christoph Hellwig, David Chinner

This lets callers specify a region of memory to read from or write to with
an array of page/offset/len tuples.  There have been specific requests to
do this from servers which want to do O_DIRECT from the kernel.  (knfsd?)

This could also be used by places which currently hold a kmap() and call
fop->write.  ecryptfs_write_lower_page_segment() is one such caller.
---
 fs/rwmem.c            |   66 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rwmem.h |   16 ++++++++++++
 2 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/fs/rwmem.c b/fs/rwmem.c
index 0433ba4..c87e8a4 100644
--- a/fs/rwmem.c
+++ b/fs/rwmem.c
@@ -90,3 +90,69 @@ struct rwmem_ops rwmem_iovec_ops = {
 	.seg_bytes		= rwmem_iovec_seg_bytes,
 	.get_seg_pages		= rwmem_iovec_get_seg_pages,
 };
+
+void rwmem_pages_init(struct rwmem *rwm)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	struct pgol *pgol;
+	unsigned long i;
+
+	rwm->total_bytes = 0;
+	rwm->nr_pages = rwm->nr_segs;
+	rwm->boundary_bits = 0;
+
+	for (i = 0; i < rwm->nr_segs; i++) {
+		pgol = &rwp->pgol[i];
+
+		rwm->total_bytes += pgol->len;
+		rwm->boundary_bits |= pgol->offset | pgol->len;
+	}
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_pages_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwp->pgol[i].offset;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_pages_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwp->pgol[i].len;
+}
+
+/*
+ * For now each page is its own seg.
+ */
+int rwmem_pages_get_seg_pages(struct rwmem *rwm, unsigned long i,
+			      unsigned long *cursor, struct page **pages,
+			      unsigned long max_pages, int write)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	int ret = 0;
+
+	BUG_ON(i >= rwm->nr_segs);
+	BUG_ON(*cursor != 0);
+
+	if (max_pages) {
+		pages[0] = rwp->pgol[i].page;
+		get_page(pages[0]);
+		ret = 1;
+	}
+	return ret;
+}
+
+struct rwmem_ops rwmem_pages_ops = {
+	.init			= rwmem_pages_init,
+	.seg_page_offset	= rwmem_pages_seg_page_offset,
+	.seg_bytes		= rwmem_pages_seg_bytes,
+	.get_seg_pages		= rwmem_pages_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
index 666f9f4..47019f0 100644
--- a/include/linux/rwmem.h
+++ b/include/linux/rwmem.h
@@ -26,4 +26,20 @@ struct rwmem_iovec {
 };
 extern struct rwmem_ops rwmem_iovec_ops;
 
+/*
+ * How many times do we need this in subsystems before we make a universal
+ * struct?  (bio_vec, skb_frag_struct, pipe_buffer)
+ */
+struct pgol {
+	struct page *page;
+	unsigned int offset;
+	unsigned int len;
+};
+
+struct rwmem_pages {
+	struct rwmem	rwmem;
+	struct pgol	*pgol;
+};
+extern struct rwmem_ops rwmem_pages_ops;
+
 #endif
-- 
1.5.2.2



* [PATCH 4/4] add dio interface for page/offset/len tuples
  2007-11-07  1:43     ` [PATCH 3/4] add rwmem type backed by pages Zach Brown
@ 2007-11-07  1:43       ` Zach Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Zach Brown @ 2007-11-07  1:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Christoph Hellwig, David Chinner

This is what it might look like to feed pgol into some part of the fs stack
instead of iovecs.  I imagine we'd want to do it at a much higher level, perhaps
something like vfs_write_pages().
---
 fs/direct-io.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0d5ed41..e86bcbc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1213,3 +1213,24 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 }
 
 EXPORT_SYMBOL(__blockdev_direct_IO);
+
+ssize_t
+__blockdev_direct_IO_pages(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, struct pgol *pgol, loff_t offset, 
+	unsigned long nr_pages, get_block_t get_block, dio_iodone_t end_io,
+	int dio_lock_type)
+{
+	struct rwmem_pages rwp = {
+		.rwmem.ops = &rwmem_pages_ops,
+		.rwmem.nr_segs = nr_pages,
+		.pgol = pgol,
+	};
+	struct rwmem *rwm = &rwp.rwmem;
+
+	rwm->ops->init(rwm);
+
+	return blockdev_direct_IO_rwmem(rw, iocb, inode, bdev, rwm, offset,
+					get_block, end_io, dio_lock_type);
+}
+
+EXPORT_SYMBOL(__blockdev_direct_IO_pages);
-- 
1.5.2.2



* Re: [RFC] fs io with struct page instead of iovecs
  2007-11-07  1:43 [RFC] fs io with struct page instead of iovecs Zach Brown
  2007-11-07  1:43 ` [PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write Zach Brown
@ 2007-11-07 16:50 ` Badari Pulavarty
  2007-11-07 17:02   ` Zach Brown
  1 sibling, 1 reply; 8+ messages in thread
From: Badari Pulavarty @ 2007-11-07 16:50 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-fsdevel, Christoph Hellwig, David Chinner

On Tue, 2007-11-06 at 17:43 -0800, Zach Brown wrote:
> At the FS meeting at LCE there was some talk of doing O_DIRECT writes from the
> kernel with pages instead of with iovecs.  T

Why?  What's the use case?

Thanks,
Badari





* Re: [RFC] fs io with struct page instead of iovecs
  2007-11-07 16:50 ` [RFC] fs io with struct page instead of iovecs Badari Pulavarty
@ 2007-11-07 17:02   ` Zach Brown
  2007-11-07 20:44     ` David Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Zach Brown @ 2007-11-07 17:02 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, Christoph Hellwig, David Chinner

Badari Pulavarty wrote:
> On Tue, 2007-11-06 at 17:43 -0800, Zach Brown wrote:
>> At the FS meeting at LCE there was some talk of doing O_DIRECT writes from the
>> kernel with pages instead of with iovecs.  T
> 
> Why?  What's the use case?

Well, I think there's a few:

There are existing callers which hold a kmap() across ->write, which
isn't great.  ecryptfs() does this.  That's mentioned in the patch
series.  Arguably loopback should be using this instead of copying some
fs paths and trying to call aop methods directly.

I seem to remember Christoph and David having stories of knfsd folks in
SGI wanting to do O_DIRECT writes from knfsd?  (If not, *I* kind of want
to, after rolling some patches to align net rx descriptors :)).

Lustre shows us that there is a point at which you can't saturate your
network and storage if your cpu is copying all the data.  I'll be the
first to admit that the community might not feel a pressing need to
address this for in-kernel file system writers, but the observation remains.

- z


* Re: [RFC] fs io with struct page instead of iovecs
  2007-11-07 17:02   ` Zach Brown
@ 2007-11-07 20:44     ` David Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: David Chinner @ 2007-11-07 20:44 UTC (permalink / raw)
  To: Zach Brown
  Cc: Badari Pulavarty, linux-fsdevel, Christoph Hellwig, David Chinner

On Wed, Nov 07, 2007 at 09:02:05AM -0800, Zach Brown wrote:
> Badari Pulavarty wrote:
> > On Tue, 2007-11-06 at 17:43 -0800, Zach Brown wrote:
> >> At the FS meeting at LCE there was some talk of doing O_DIRECT writes from the
> >> kernel with pages instead of with iovecs.  T
> > 
> > Why?  What's the use case?
> 
> Well, I think there's a few:
> 
> There are existing callers which hold a kmap() across ->write, which
> isn't great.  ecryptfs() does this.  That's mentioned in the patch
> series.  Arguably loopback should be using this instead of copying some
> fs paths and trying to call aop methods directly.
> 
> I seem to remember Christoph and David having stories of knfsd folks in
> SGI wanting to do O_DIRECT writes from knfsd?  (If not, *I* kind of want
> to, after rolling some patches to align net rx descriptors :)).

The main reason is to remove the serialised writer problem when multiple
clients are writing to the one file. With XFS and direct I/O, we can have
multiple concurrent writers to the one file and have it scale rather than be
limited to what a single cpu holding the i_mutex can do....

> Lustre shows us that there is a point at which you can't saturate your
> network and storage if your cpu is copying all the data.

Buy more CPUs ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

