* [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support
@ 2006-03-08 0:19 Badari Pulavarty
2006-03-08 0:22 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
` (3 more replies)
0 siblings, 4 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-08 0:19 UTC (permalink / raw)
To: Zach Brown, christoph; +Cc: lkml, linux-fsdevel, pbadari
Hi,
This series of changes collapses all the vectored IO support
into a single file-operation method, using aio_read/aio_write.
This work was originally suggested & started by Christoph Hellwig,
when Zach Brown tried to add vectored support for AIO.
Christoph & Zach, comments/suggestions? If you are happy with the
work, can you add your Sign-off or Ack? I addressed all the
known issues; please review.
Here is the summary:
[PATCH 1/3] Vectorize aio_read/aio_write methods
[PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write
instead.
[PATCH 3/3] Zach's core aio changes to support vectored AIO.
NOTE: This is not ready for -mm or mainline consumption yet,
since I am still doing basic testing.
Comments?
Thanks,
Badari
* [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-03-08  0:19 [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support Badari Pulavarty
@ 2006-03-08  0:22 ` Badari Pulavarty
  2006-03-08 12:44   ` christoph
  2006-03-08  0:23 ` [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead Badari Pulavarty
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-08 0:22 UTC (permalink / raw)
To: Zach Brown; +Cc: christoph, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 166 bytes --]

This patch vectorizes the aio_read() and aio_write() methods to prepare for
collapsing all the vectored operations into one interface -
aio_read()/aio_write().

[-- Attachment #2: aiovector.patch --]
[-- Type: text/x-patch, Size: 36291 bytes --]

This patch vectorizes the aio_read() and aio_write() methods to prepare for
collapsing all the vectored operations into one interface -
aio_read()/aio_write().
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

Index: linux-2.6.16-rc5/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/filesystems/Locking	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/Documentation/filesystems/Locking	2006-02-27 08:33:22.000000000 -0800
@@ -355,10 +355,9 @@ The last two are called only from check_
 prototypes:
 	loff_t (*llseek) (struct file *, loff_t, int);
 	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
-	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
-	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t,
-			loff_t);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	int (*ioctl) (struct inode *, struct file *, unsigned int,
Index: linux-2.6.16-rc5/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/filesystems/vfs.txt	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/Documentation/filesystems/vfs.txt	2006-02-27 08:33:22.000000000 -0800
@@ -526,9 +526,9 @@ This describes how the VFS can manipulat
 struct file_operations {
 	loff_t (*llseek) (struct file *, loff_t, int);
 	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
-	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
-	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
Index: linux-2.6.16-rc5/drivers/char/raw.c
===================================================================
--- linux-2.6.16-rc5.orig/drivers/char/raw.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/drivers/char/raw.c	2006-03-07 13:52:28.000000000 -0800
@@ -249,23 +249,11 @@ static ssize_t raw_file_write(struct fil
 	return generic_file_write_nolock(file, &local_iov, 1, ppos);
 }
 
-static ssize_t raw_file_aio_write(struct kiocb *iocb, const char __user *buf,
-					size_t count, loff_t pos)
-{
-	struct iovec local_iov = {
-		.iov_base = (char __user *)buf,
-		.iov_len = count
-	};
-
-	return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
-}
-
-
 static struct file_operations raw_fops = {
 	.read	=	generic_file_read,
 	.aio_read =	generic_file_aio_read,
 	.write	=	raw_file_write,
-	.aio_write =	raw_file_aio_write,
+	.aio_write =	generic_file_aio_write_nolock,
 	.open	=	raw_open,
 	.release=	raw_release,
 	.ioctl	=	raw_ioctl,
Index: linux-2.6.16-rc5/fs/aio.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/aio.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/aio.c	2006-03-07 13:44:09.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/aio_abi.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/uio.h>
 
 #define DEBUG 0
 
@@ -1316,8 +1317,11 @@ static ssize_t aio_pread(struct kiocb *i
 	ssize_t ret = 0;
 
 	do {
-		ret = file->f_op->aio_read(iocb, iocb->ki_buf,
-			iocb->ki_left, iocb->ki_pos);
+		iocb->ki_inline_vec.iov_base = iocb->ki_buf;
+		iocb->ki_inline_vec.iov_len = iocb->ki_left;
+
+		ret = file->f_op->aio_read(iocb, &iocb->ki_inline_vec,
+						1, iocb->ki_pos);
 		/*
 		 * Can't just depend on iocb->ki_left to determine
 		 * whether we are done. This may have been a short read.
@@ -1350,8 +1354,11 @@ static ssize_t aio_pwrite(struct kiocb *
 	ssize_t ret = 0;
 
 	do {
-		ret = file->f_op->aio_write(iocb, iocb->ki_buf,
-			iocb->ki_left, iocb->ki_pos);
+		iocb->ki_inline_vec.iov_base = iocb->ki_buf;
+		iocb->ki_inline_vec.iov_len = iocb->ki_left;
+
+		ret = file->f_op->aio_write(iocb, &iocb->ki_inline_vec,
+						1, iocb->ki_pos);
 		if (ret > 0) {
 			iocb->ki_buf += ret;
 			iocb->ki_left -= ret;
Index: linux-2.6.16-rc5/fs/block_dev.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/block_dev.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/block_dev.c	2006-03-07 13:52:28.000000000 -0800
@@ -769,14 +769,6 @@ static ssize_t blkdev_file_write(struct 
 	return generic_file_write_nolock(file, &local_iov, 1, ppos);
 }
 
-static ssize_t blkdev_file_aio_write(struct kiocb *iocb, const char __user *buf,
-				   size_t count, loff_t pos)
-{
-	struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count };
-
-	return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
-}
-
 static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
 	return blkdev_ioctl(file->f_mapping->host, file, cmd, arg);
@@ -799,7 +791,7 @@ struct file_operations def_blk_fops = {
 	.read		= generic_file_read,
 	.write		= blkdev_file_write,
 	.aio_read	= generic_file_aio_read,
-	.aio_write	= blkdev_file_aio_write,
+	.aio_write	= generic_file_aio_write_nolock,
 	.mmap		= generic_file_mmap,
 	.fsync		= block_fsync,
 	.unlocked_ioctl	= block_ioctl,
Index: linux-2.6.16-rc5/fs/cifs/cifsfs.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/cifs/cifsfs.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/cifs/cifsfs.c	2006-03-07 13:52:28.000000000 -0800
@@ -501,13 +501,13 @@ static ssize_t cifs_file_writev(struct f
 	return written;
 }
 
-static ssize_t cifs_file_aio_write(struct kiocb *iocb, const char __user *buf,
-				   size_t count, loff_t pos)
+static ssize_t
+cifs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+		    unsigned long nr_segs, loff_t pos)
 {
 	struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
 	ssize_t written;
 
-	written = generic_file_aio_write(iocb, buf, count, pos);
+	written = generic_file_aio_write(iocb, iov, nr_segs, pos);
 	if (!CIFS_I(inode)->clientCanCacheAll)
 		filemap_fdatawrite(inode->i_mapping);
 	return written;
Index: linux-2.6.16-rc5/fs/ext3/file.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/ext3/file.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/ext3/file.c	2006-03-07 13:52:28.000000000 -0800
@@ -48,14 +48,15 @@ static int ext3_release_file (struct ino
 }
 
 static ssize_t
-ext3_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos)
+ext3_file_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_dentry->d_inode;
 	ssize_t ret;
 	int err;
 
-	ret = generic_file_aio_write(iocb, buf, count, pos);
+	ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
 
 	/*
 	 * Skip flushing if there was an error, or if nothing was written.
Index: linux-2.6.16-rc5/fs/nfs/file.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/nfs/file.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/nfs/file.c	2006-02-27 08:33:22.000000000 -0800
@@ -40,8 +40,10 @@ static int nfs_file_release(struct inode
 static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin);
 static int nfs_file_mmap(struct file *, struct vm_area_struct *);
 static ssize_t nfs_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
-static ssize_t nfs_file_read(struct kiocb *, char __user *, size_t, loff_t);
-static ssize_t nfs_file_write(struct kiocb *, const char __user *, size_t, loff_t);
+static ssize_t nfs_file_read(struct kiocb *, const struct iovec *,
+			unsigned long, loff_t);
+static ssize_t nfs_file_write(struct kiocb *, const struct iovec *,
+			unsigned long, loff_t);
 static int nfs_file_flush(struct file *);
 static int nfs_fsync(struct file *, struct dentry *dentry, int datasync);
 static int nfs_check_flags(int flags);
@@ -52,8 +54,8 @@ struct file_operations nfs_file_operatio
 	.llseek		= nfs_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-	.aio_read		= nfs_file_read,
-	.aio_write	= nfs_file_write,
+	.aio_read	= nfs_file_read,
+	.aio_write	= nfs_file_write,
 	.mmap		= nfs_file_mmap,
 	.open		= nfs_file_open,
 	.flush		= nfs_file_flush,
@@ -213,7 +215,8 @@ nfs_file_flush(struct file *file)
 }
 
 static ssize_t
-nfs_file_read(struct kiocb *iocb, char __user * buf, size_t count, loff_t pos)
+nfs_file_read(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct dentry * dentry = iocb->ki_filp->f_dentry;
 	struct inode * inode = dentry->d_inode;
@@ -221,16 +224,15 @@ nfs_file_read(struct kiocb *iocb, char _
 
 #ifdef CONFIG_NFS_DIRECTIO
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_read(iocb, buf, count, pos);
+		return nfs_file_direct_read(iocb, iov, nr_segs, pos);
 #endif
 
-	dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n",
-		dentry->d_parent->d_name.name, dentry->d_name.name,
-		(unsigned long) count, (unsigned long) pos);
+	dfprintk(VFS, "nfs: read(%s/%s)\n",
+		dentry->d_parent->d_name.name, dentry->d_name.name);
 
 	result = nfs_revalidate_file(inode, iocb->ki_filp);
 	if (!result)
-		result = generic_file_aio_read(iocb, buf, count, pos);
+		result = generic_file_aio_read(iocb, iov, nr_segs, pos);
 	return result;
 }
 
@@ -333,7 +335,8 @@ struct address_space_operations nfs_file
  * Write to a file (through the page cache).
  */
 static ssize_t
-nfs_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos)
+nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct dentry * dentry = iocb->ki_filp->f_dentry;
 	struct inode * inode = dentry->d_inode;
@@ -341,12 +344,12 @@ nfs_file_write(struct kiocb *iocb, const
 
 #ifdef CONFIG_NFS_DIRECTIO
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_write(iocb, buf, count, pos);
+		return nfs_file_direct_write(iocb, iov, nr_segs, pos);
 #endif
 
-	dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n",
+	dfprintk(VFS, "nfs: write(%s/%s(%ld))\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
-		inode->i_ino, (unsigned long) count, (unsigned long) pos);
+		inode->i_ino);
 
 	result = -EBUSY;
 	if (IS_SWAPFILE(inode))
@@ -361,11 +364,7 @@ nfs_file_write(struct kiocb *iocb, const
 	}
 	nfs_revalidate_mapping(inode, iocb->ki_filp->f_mapping);
 
-	result = count;
-	if (!count)
-		goto out;
-
-	result = generic_file_aio_write(iocb, buf, count, pos);
+	result = generic_file_aio_write(iocb, iov, nr_segs, pos);
 out:
 	return result;
Index: linux-2.6.16-rc5/fs/read_write.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/read_write.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/read_write.c	2006-03-07 13:52:28.000000000 -0800
@@ -227,14 +227,19 @@ static void wait_on_retry_sync_kiocb(str
 
 ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
+	struct iovec iov = { .iov_base = buf, .iov_len = len };
 	struct kiocb kiocb;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
-	while (-EIOCBRETRY ==
-	       (ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos)))
+
+	for (;;) {
+		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
+		if (ret != -EIOCBRETRY)
+			break;
 		wait_on_retry_sync_kiocb(&kiocb);
+	}
 
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
@@ -279,14 +284,19 @@ EXPORT_SYMBOL(vfs_read);
 
 ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
 {
+	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
 	struct kiocb kiocb;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
-	while (-EIOCBRETRY ==
-	       (ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos)))
+
+	for (;;) {
+		ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
+		if (ret != -EIOCBRETRY)
+			break;
 		wait_on_retry_sync_kiocb(&kiocb);
+	}
 
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
Index: linux-2.6.16-rc5/fs/reiserfs/file.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/reiserfs/file.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/reiserfs/file.c	2006-02-27 08:33:22.000000000 -0800
@@ -1560,12 +1560,6 @@ static ssize_t reiserfs_file_write(struc
 	return res;
 }
 
-static ssize_t reiserfs_aio_write(struct kiocb *iocb, const char __user * buf,
-				  size_t count, loff_t pos)
-{
-	return generic_file_aio_write(iocb, buf, count, pos);
-}
-
 struct file_operations reiserfs_file_operations = {
 	.read = generic_file_read,
 	.write = reiserfs_file_write,
@@ -1575,7 +1569,7 @@ struct file_operations reiserfs_file_ope
 	.fsync = reiserfs_sync_file,
 	.sendfile = generic_file_sendfile,
 	.aio_read = generic_file_aio_read,
-	.aio_write = reiserfs_aio_write,
+	.aio_write = generic_file_aio_write,
 };
 
 struct inode_operations reiserfs_file_inode_operations = {
Index: linux-2.6.16-rc5/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/xfs/linux-2.6/xfs_file.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/xfs/linux-2.6/xfs_file.c	2006-03-07 13:52:28.000000000 -0800
@@ -51,12 +51,11 @@ static struct vm_operations_struct linvf
 STATIC inline ssize_t
 __linvfs_read(
 	struct kiocb		*iocb,
-	char			__user *buf,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	int			ioflags,
-	size_t			count,
 	loff_t			pos)
 {
-	struct iovec		iov = {buf, count};
 	struct file		*file = iocb->ki_filp;
 	vnode_t			*vp = LINVFS_GET_VP(file->f_dentry->d_inode);
 	ssize_t			rval;
@@ -65,7 +64,7 @@ __linvfs_read(
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= IO_ISDIRECT;
 
-	VOP_READ(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval);
+	VOP_READ(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval);
 	return rval;
 }
 
@@ -73,33 +72,32 @@ __linvfs_read(
 STATIC ssize_t
 linvfs_aio_read(
 	struct kiocb		*iocb,
-	char			__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_read(iocb, buf, IO_ISAIO, count, pos);
+	return __linvfs_read(iocb, iov, nr_segs, IO_ISAIO, pos);
 }
 
 STATIC ssize_t
 linvfs_aio_read_invis(
 	struct kiocb		*iocb,
-	char			__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_read(iocb, buf, IO_ISAIO|IO_INVIS, count, pos);
+	return __linvfs_read(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos);
 }
 
 STATIC inline ssize_t
 __linvfs_write(
-	struct kiocb	*iocb,
-	const char	__user *buf,
-	int		ioflags,
-	size_t		count,
-	loff_t		pos)
+	struct kiocb		*iocb,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
+	int			ioflags,
+	loff_t			pos)
 {
-	struct iovec	iov = {(void __user *)buf, count};
 	struct file	*file = iocb->ki_filp;
 	struct inode	*inode = file->f_mapping->host;
 	vnode_t		*vp = LINVFS_GET_VP(inode);
@@ -109,7 +107,7 @@ __linvfs_write(
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= IO_ISDIRECT;
 
-	VOP_WRITE(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval);
+	VOP_WRITE(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval);
 	return rval;
 }
 
@@ -117,21 +115,21 @@ __linvfs_write(
 STATIC ssize_t
 linvfs_aio_write(
 	struct kiocb		*iocb,
-	const char		__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_write(iocb, buf, IO_ISAIO, count, pos);
+	return __linvfs_write(iocb, iov, nr_segs, IO_ISAIO, pos);
 }
 
 STATIC ssize_t
 linvfs_aio_write_invis(
 	struct kiocb		*iocb,
-	const char		__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_write(iocb, buf, IO_ISAIO|IO_INVIS, count, pos);
+	return __linvfs_write(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos);
 }
Index: linux-2.6.16-rc5/include/linux/fs.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/fs.h	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/include/linux/fs.h	2006-03-07 13:52:28.000000000 -0800
@@ -999,9 +999,9 @@ struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
 	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
-	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
-	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
@@ -1561,11 +1561,11 @@ extern int file_send_actor(read_descript
 extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *);
-extern ssize_t generic_file_aio_read(struct kiocb *, char __user *, size_t, loff_t);
+extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *);
-extern ssize_t generic_file_aio_write(struct kiocb *, const char __user *, size_t, loff_t);
+extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
Index: linux-2.6.16-rc5/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5.orig/include/net/sock.h	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/include/net/sock.h	2006-02-27 08:33:22.000000000 -0800
@@ -650,7 +650,6 @@ struct sock_iocb {
 	struct sock		*sk;
 	struct scm_cookie	*scm;
 	struct msghdr		*msg, async_msg;
-	struct iovec		async_iov;
 	struct kiocb		*kiocb;
 };
Index: linux-2.6.16-rc5/mm/filemap.c
===================================================================
--- linux-2.6.16-rc5.orig/mm/filemap.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/mm/filemap.c	2006-03-07 13:52:28.000000000 -0800
@@ -1065,14 +1065,12 @@ out:
 EXPORT_SYMBOL(__generic_file_aio_read);
 
 ssize_t
-generic_file_aio_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos)
+generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
-	struct iovec local_iov = { .iov_base = buf, .iov_len = count };
-
 	BUG_ON(iocb->ki_pos != pos);
-	return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos);
+	return __generic_file_aio_read(iocb, iov, nr_segs, &iocb->ki_pos);
 }
-
 EXPORT_SYMBOL(generic_file_aio_read);
 
 ssize_t
@@ -2132,22 +2130,21 @@ out:
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
-EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
-ssize_t
-generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
+		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	ssize_t ret;
-	loff_t pos = *ppos;
 
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, ppos);
+	BUG_ON(iocb->ki_pos != pos);
+
+	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err;
+		ssize_t err;
 
 		err = sync_page_range_nolock(inode, mapping, pos, ret);
 		if (err < 0)
@@ -2155,6 +2152,7 @@ generic_file_aio_write_nolock(struct kio
 	}
 	return ret;
 }
+EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
 static ssize_t
 __generic_file_write_nolock(struct file *file, const struct iovec *iov,
@@ -2164,9 +2162,11 @@ __generic_file_write_nolock(struct file 
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *ppos;
 	ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
-	if (ret == -EIOCBQUEUED)
+	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
+	*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -2178,28 +2178,27 @@ generic_file_write_nolock(struct file *f
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, file);
-	ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
+	kiocb.ki_pos = *ppos;
+	ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, *ppos);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
+	*ppos = kiocb.ki_pos;
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_write_nolock);
 
-ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf,
-			       size_t count, loff_t pos)
+ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	ssize_t ret;
-	struct iovec local_iov = { .iov_base = (void __user *)buf,
-					.iov_len = count };
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1,
-						&iocb->ki_pos);
+	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
Index: linux-2.6.16-rc5/net/socket.c
===================================================================
--- linux-2.6.16-rc5.orig/net/socket.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/net/socket.c	2006-03-07 13:53:40.000000000 -0800
@@ -98,10 +98,10 @@
 #include <linux/netfilter.h>
 
 static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
-static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf,
-			 size_t size, loff_t pos);
-static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf,
-			  size_t size, loff_t pos);
+static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			 unsigned long nr_segs, loff_t pos);
+static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
+			  unsigned long nr_segs, loff_t pos);
 static int sock_mmap(struct file *file, struct vm_area_struct * vma);
 
 static int sock_close(struct inode *inode, struct file *file);
@@ -656,7 +656,7 @@ static ssize_t sock_sendpage(struct file
 }
 
 static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb,
-		char __user *ubuf, size_t size, struct sock_iocb *siocb)
+					 struct sock_iocb *siocb)
 {
 	if (!is_sync_kiocb(iocb)) {
 		siocb = kmalloc(sizeof(*siocb), GFP_KERNEL);
@@ -666,15 +666,13 @@ static struct sock_iocb *alloc_sock_iocb
 	}
 
 	siocb->kiocb = iocb;
-	siocb->async_iov.iov_base = ubuf;
-	siocb->async_iov.iov_len = size;
-
 	iocb->private = siocb;
 	return siocb;
 }
 
 static ssize_t do_sock_read(struct msghdr *msg, struct kiocb *iocb,
-		struct file *file, struct iovec *iov, unsigned long nr_segs)
+		struct file *file, const struct iovec *iov,
+		unsigned long nr_segs)
 {
 	struct socket *sock = file->private_data;
 	size_t size = 0;
@@ -705,31 +703,33 @@ static ssize_t sock_readv(struct file *f
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 
-	ret = do_sock_read(&msg, &iocb, file, (struct iovec *)iov, nr_segs);
+	ret = do_sock_read(&msg, &iocb, file, iov, nr_segs);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&iocb);
 	return ret;
 }
 
-static ssize_t sock_aio_read(struct kiocb *iocb, char __user *ubuf,
-			 size_t count, loff_t pos)
+static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			 unsigned long nr_segs, loff_t pos)
 {
 	struct sock_iocb siocb, *x;
 
 	if (pos != 0)
 		return -ESPIPE;
-	if (count == 0)		/* Match SYS5 behaviour */
+
+	if (iocb->ki_left == 0)	/* Match SYS5 behaviour */
 		return 0;
 
-	x = alloc_sock_iocb(iocb, ubuf, count, &siocb);
+	x = alloc_sock_iocb(iocb, &siocb);
 	if (!x)
 		return -ENOMEM;
-	return do_sock_read(&x->async_msg, iocb, iocb->ki_filp,
-			&x->async_iov, 1);
+	return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs);
 }
 
 static ssize_t do_sock_write(struct msghdr *msg, struct kiocb *iocb,
-		struct file *file, struct iovec *iov, unsigned long nr_segs)
+		struct file *file, const struct iovec *iov,
+		unsigned long nr_segs)
 {
 	struct socket *sock = file->private_data;
 	size_t size = 0;
@@ -762,28 +762,28 @@ static ssize_t sock_writev(struct file *
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 
-	ret = do_sock_write(&msg, &iocb, file, (struct iovec *)iov, nr_segs);
+	ret = do_sock_write(&msg, &iocb, file, iov, nr_segs);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&iocb);
 	return ret;
 }
 
-static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *ubuf,
-			  size_t count, loff_t pos)
+static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
+			  unsigned long nr_segs, loff_t pos)
 {
 	struct sock_iocb siocb, *x;
 
 	if (pos != 0)
 		return -ESPIPE;
-	if (count == 0)		/* Match SYS5 behaviour */
+
+	if (iocb->ki_left == 0)	/* Match SYS5 behaviour */
 		return 0;
 
-	x = alloc_sock_iocb(iocb, (void __user *)ubuf, count, &siocb);
+	x = alloc_sock_iocb(iocb, &siocb);
 	if (!x)
 		return -ENOMEM;
 
-	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp,
-			&x->async_iov, 1);
+	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs);
 }
Index: linux-2.6.16-rc5/fs/nfs/direct.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/nfs/direct.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/fs/nfs/direct.c	2006-02-27 08:38:28.000000000 -0800
@@ -626,6 +626,32 @@ nfs_direct_IO(int rw, struct kiocb *iocb
 	return result;
 }
 
+static ssize_t
+check_access_ok(int type, const struct iovec *iov, unsigned long nr_segs)
+{
+	ssize_t tot_len = 0;
+	ssize_t ret = -EINVAL;
+	int seg;
+
+	for (seg = 0; seg < nr_segs; seg++) {
+		void __user *buf = iov[seg].iov_base;
+		ssize_t len = (ssize_t)iov[seg].iov_len;
+
+		if (len < 0)	/* size_t not fitting an ssize_t .. */
+			goto out;
+		if (unlikely(!access_ok(type, buf, len))) {
+			ret = -EFAULT;
+			goto out;
+		}
+		tot_len += len;
+		if ((ssize_t)tot_len < 0) /* maths overflow on the ssize_t */
+			goto out;
+	}
+	ret = tot_len;
+out:
+	return ret;
+}
+
 /**
  * nfs_file_direct_read - file direct read operation for NFS files
  * @iocb: target I/O control block
@@ -648,7 +674,8 @@ nfs_direct_IO(int rw, struct kiocb *iocb
  * cache.
  */
 ssize_t
-nfs_file_direct_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos)
+nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
 {
 	ssize_t retval = -EINVAL;
 	loff_t *ppos = &iocb->ki_pos;
@@ -657,32 +684,24 @@ nfs_file_direct_read(struct kiocb *iocb,
 		(struct nfs_open_context *) file->private_data;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct iovec iov = {
-		.iov_base = buf,
-		.iov_len = count,
-	};
 
-	dprintk("nfs: direct read(%s/%s, %lu@%Ld)\n",
+	dprintk("nfs: direct read(%s/%s, @%Ld)\n",
 		file->f_dentry->d_parent->d_name.name,
 		file->f_dentry->d_name.name,
-		(unsigned long) count, (long long) pos);
+		(long long) pos);
 
 	if (!is_sync_kiocb(iocb))
 		goto out;
-	if (count < 0)
-		goto out;
-	retval = -EFAULT;
-	if (!access_ok(VERIFY_WRITE, iov.iov_base, iov.iov_len))
-		goto out;
-	retval = 0;
-	if (!count)
+
+	retval = check_access_ok(VERIFY_WRITE, iov, nr_segs);
+	if (retval <= 0)
 		goto out;
 
 	retval = nfs_sync_mapping(mapping);
 	if (retval)
 		goto out;
 
-	retval = nfs_direct_read(inode, ctx, &iov, pos, 1);
+	retval = nfs_direct_read(inode, ctx, iov, pos, nr_segs);
 	if (retval > 0)
 		*ppos = pos + retval;
 
@@ -716,7 +735,8 @@ out:
  * is no atomic O_APPEND write facility in the NFS protocol.
  */
 ssize_t
-nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos)
+nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
 {
 	ssize_t retval;
 	struct file *file = iocb->ki_filp;
@@ -724,40 +744,32 @@ nfs_file_direct_write(struct kiocb *iocb
 		(struct nfs_open_context *) file->private_data;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct iovec iov = {
-		.iov_base = (char __user *)buf,
-	};
+	ssize_t count;
 
-	dfprintk(VFS, "nfs: direct write(%s/%s, %lu@%Ld)\n",
+	dfprintk(VFS, "nfs: direct write(%s/%s, @%Ld)\n",
 		file->f_dentry->d_parent->d_name.name,
 		file->f_dentry->d_name.name,
-		(unsigned long) count, (long long) pos);
+		(long long) pos);
 
 	retval = -EINVAL;
 	if (!is_sync_kiocb(iocb))
 		goto out;
 
-	retval = generic_write_checks(file, &pos, &count, 0);
-	if (retval)
+	retval = check_access_ok(VERIFY_READ, iov, nr_segs);
+	if (retval <= 0)
 		goto out;
 
-	retval = -EINVAL;
-	if ((ssize_t) count < 0)
-		goto out;
-	retval = 0;
-	if (!count)
-		goto out;
-	iov.iov_len = count,
-
-	retval = -EFAULT;
-	if (!access_ok(VERIFY_READ, iov.iov_base, iov.iov_len))
+	/* FIXME: how to adjust iovec if count gets adjusted ? */
+	count = retval;
+	retval = generic_write_checks(file, &pos, &count, 0);
+	if (retval)
 		goto out;
 
 	retval = nfs_sync_mapping(mapping);
 	if (retval)
 		goto out;
 
-	retval = nfs_direct_write(inode, ctx, &iov, pos, 1);
+	retval = nfs_direct_write(inode, ctx, iov, pos, nr_segs);
 	if (mapping->nrpages)
 		invalidate_inode_pages2(mapping);
 	if (retval > 0)
Index: linux-2.6.16-rc5/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/nfs_fs.h	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/include/linux/nfs_fs.h	2006-02-27 08:33:22.000000000 -0800
@@ -369,10 +369,10 @@ extern int nfs3_removexattr (struct dent
  */
 extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *,
			loff_t, unsigned long);
-extern ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf,
-			size_t count, loff_t pos);
-extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf,
-			size_t count, loff_t pos);
+extern ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *,
+			unsigned long nr_segs, loff_t pos);
+extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *,
+			unsigned long nr_segs, loff_t pos);
 
 /*
  * linux/fs/nfs/dir.c
Index: linux-2.6.16-rc5/drivers/usb/gadget/inode.c
===================================================================
--- linux-2.6.16-rc5.orig/drivers/usb/gadget/inode.c	2006-02-26 21:09:35.000000000 -0800
+++ linux-2.6.16-rc5/drivers/usb/gadget/inode.c	2006-03-07 13:26:46.000000000 -0800
@@ -529,7 +529,8 @@ struct kiocb_priv {
 	struct usb_request	*req;
 	struct ep_data		*epdata;
 	void			*buf;
-	char __user		*ubuf;
+	struct iovec		*iv;
+	unsigned long		count;
 	unsigned		actual;
 };
 
@@ -557,18 +558,32 @@ static int ep_aio_cancel(struct kiocb *i
 static ssize_t ep_aio_read_retry(struct kiocb *iocb)
 {
 	struct kiocb_priv	*priv = iocb->private;
-	ssize_t			status = priv->actual;
+	ssize_t			len, total;
 
 	/* we "retry" to get the right mm context for this: */
-	status = copy_to_user(priv->ubuf, priv->buf, priv->actual);
-	if (unlikely(0 != status))
-		status = -EFAULT;
-	else
-		status = priv->actual;
+
+	/* copy stuff into user buffers */
+	total = priv->actual;
+	len = 0;
+	for (i=0; i < priv->count; i++) {
+		ssize_t this = min(priv->iv[i].iov_len, (size_t)total);
+
+		if (copy_to_user(priv->iv[i].iov_buf, priv->buf, this))
+			break;
+
+		total -= this;
+		len += this;
+		if (total <= 0)
+			break;
+	}
+
+	if (unlikely(len != 0))
+		len = -EFAULT;
+
 	kfree(priv->buf);
 	kfree(priv);
 	aio_put_req(iocb);
-	return status;
+	return len;
 }
 
 static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
@@ -616,7 +631,8 @@ ep_aio_rwtail(
 	char		*buf,
 	size_t		len,
 	struct ep_data	*epdata,
-	char __user	*ubuf
+	const struct iovec *iv,
+	unsigned long	count
 )
 {
 	struct kiocb_priv	*priv = (void *) &iocb->private;
@@ -631,7 +647,8 @@ fail:
 		return value;
 	}
 	iocb->private = priv;
-	priv->ubuf = ubuf;
+	priv->iovec = iv;
+	priv->count = count;
 
 	value = get_ready_ep(iocb->ki_filp->f_flags, epdata);
 	if (unlikely(value < 0)) {
@@ -676,36 +693,52 @@ fail:
 }
 
 static ssize_t
-ep_aio_read(struct kiocb *iocb, char __user *ubuf, size_t len, loff_t o)
+ep_aio_read(struct kiocb *iocb, const struct iovec *iv,
+	unsigned long count, loff_t o)
 {
 	struct ep_data		*epdata = iocb->ki_filp->private_data;
 	char			*buf;
+	size_t			len;
+	int			i = 0;
+	ssize_t			ret;
 
 	if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN))
 		return -EINVAL;
-	buf = kmalloc(len, GFP_KERNEL);
+
+	buf = kmalloc(iocb->ki_left, GFP_KERNEL);
 	if (unlikely(!buf))
 		return -ENOMEM;
+
 	iocb->ki_retry = ep_aio_read_retry;
-	return ep_aio_rwtail(iocb, buf, len, epdata, ubuf);
+	return ep_aio_rwtail(iocb, buf, len, epdata, iv, count);
 }
 
 static ssize_t
-ep_aio_write(struct kiocb *iocb, const char __user *ubuf, size_t len, loff_t o)
+ep_aio_write(struct kiocb *iocb, const struct iovec *iv,
+	unsigned long count, loff_t o)
 {
 	struct ep_data		*epdata = iocb->ki_filp->private_data;
 	char			*buf;
+	size_t			len = 
0; + int i = 0; + ssize_t ret; if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN))) return -EINVAL; - buf = kmalloc(len, GFP_KERNEL); + + buf = kmalloc(iocb->ki_left, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; - if (unlikely(copy_from_user(buf, ubuf, len) != 0)) { - kfree(buf); - return -EFAULT; + + for (i=0; i < count; i++) { + if (unlikely(copy_from_user(&buf[len], iv[i]->iov_base, + iv[i]->iov_len) != 0)) { + kfree(buf); + return -EFAULT; + } + len += iv[i]->iov_len; } - return ep_aio_rwtail(iocb, buf, len, epdata, NULL); + return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0); } /*----------------------------------------------------------------------*/ Index: linux-2.6.16-rc5/include/linux/aio.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/aio.h 2006-03-07 08:37:05.000000000 -0800 +++ linux-2.6.16-rc5/include/linux/aio.h 2006-03-07 13:44:09.000000000 -0800 @@ -4,6 +4,7 @@ #include <linux/list.h> #include <linux/workqueue.h> #include <linux/aio_abi.h> +#include <linux/uio.h> #include <asm/atomic.h> @@ -112,6 +113,7 @@ struct kiocb { long ki_retried; /* just for testing */ long ki_kicked; /* just for testing */ long ki_queued; /* just for testing */ + struct iovec ki_inline_vec; /* inline vector */ struct list_head ki_list; /* the aio core uses this * for cancellation */ ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-03-08  0:22 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
@ 2006-03-08 12:44   ` christoph
  0 siblings, 0 replies; 58+ messages in thread
From: christoph @ 2006-03-08 12:44 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Zach Brown, christoph, lkml, linux-fsdevel

On Tue, Mar 07, 2006 at 04:22:10PM -0800, Badari Pulavarty wrote:
> This patch vectorizes aio_read() and aio_write() methods to prepare
> for colapsing all the vectored operations into one interface -
> which is aio_read()/aio_write().

ok

^ permalink raw reply	[flat|nested] 58+ messages in thread
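PATCH 2/3 below factors the open-coded "do it by hand, with file-ops" segment loop out of do_readv_writev() into a shared do_loop_readv_writev() helper: when a driver supplies only a single-buffer read/write method, the VFS emulates the vectored call by invoking that method once per segment, stopping on the first error or short transfer. A userspace sketch of that emulation loop, against a memory-backed stand-in for f_op->read (names hypothetical):

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Per-segment I/O callback; a stand-in for f_op->read / f_op->write. */
typedef ssize_t (*seg_fn_t)(void *cookie, void *base, size_t len, off_t *pos);

/*
 * Emulate a vectored transfer with a single-segment function, as
 * do_loop_readv_writev() does in the patch: accumulate bytes across
 * segments; on error return it only if nothing was transferred yet;
 * stop early on a short transfer.
 */
static ssize_t loop_readv_writev(void *cookie, const struct iovec *iov,
				 unsigned long nr_segs, off_t *pos,
				 seg_fn_t fn)
{
	ssize_t ret = 0;

	while (nr_segs > 0) {
		ssize_t nr = fn(cookie, iov->iov_base, iov->iov_len, pos);

		if (nr < 0) {
			if (!ret)
				ret = nr;
			break;
		}
		ret += nr;
		if ((size_t)nr != iov->iov_len)
			break;		/* short transfer: stop here */
		iov++;
		nr_segs--;
	}
	return ret;
}

/* Demo backend: "read" from an in-memory buffer at *pos. */
struct membuf { const char *data; size_t size; };

static ssize_t membuf_read(void *cookie, void *base, size_t len, off_t *pos)
{
	struct membuf *m = cookie;
	size_t avail = (*pos < (off_t)m->size) ? m->size - *pos : 0;
	size_t n = len < avail ? len : avail;

	memcpy(base, m->data + *pos, n);
	*pos += n;
	return n;
}
```

The real helper additionally advances file->f_pos through the loff_t pointer exactly as the pre-patch inline loop did, so converting drivers to aio_read/aio_write does not change the semantics seen by readv()/writev() callers.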
* [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead 2006-03-08 0:19 [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support Badari Pulavarty 2006-03-08 0:22 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty @ 2006-03-08 0:23 ` Badari Pulavarty 2006-03-08 12:45 ` christoph 2006-03-08 0:24 ` [PATCH 3/3] Zach's core aio changes to support vectored AIO Badari Pulavarty 2006-03-08 12:47 ` [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support christoph 3 siblings, 1 reply; 58+ messages in thread From: Badari Pulavarty @ 2006-03-08 0:23 UTC (permalink / raw) To: Zach Brown; +Cc: christoph, lkml, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 105 bytes --] This patch removes readv() and writev() methods and replaces them with aio_read()/aio_write() methods. [-- Attachment #2: remove-readv-writev.patch --] [-- Type: text/x-patch, Size: 34934 bytes --] This patch removes readv() and writev() methods and replaces them with aio_read()/aio_write() methods. 
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Index: linux-2.6.16-rc5/drivers/char/raw.c =================================================================== --- linux-2.6.16-rc5.orig/drivers/char/raw.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/drivers/char/raw.c 2006-03-07 13:36:05.000000000 -0800 @@ -257,8 +257,6 @@ static struct file_operations raw_fops = .open = raw_open, .release= raw_release, .ioctl = raw_ioctl, - .readv = generic_file_readv, - .writev = generic_file_writev, .owner = THIS_MODULE, }; Index: linux-2.6.16-rc5/drivers/net/tun.c =================================================================== --- linux-2.6.16-rc5.orig/drivers/net/tun.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/drivers/net/tun.c 2006-03-07 13:36:05.000000000 -0800 @@ -286,11 +286,10 @@ static inline size_t iov_total(const str return len; } -/* Writev */ -static ssize_t tun_chr_writev(struct file * file, const struct iovec *iv, - unsigned long count, loff_t *pos) +static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t pos) { - struct tun_struct *tun = file->private_data; + struct tun_struct *tun = iocb->ki_filp->private_data; if (!tun) return -EBADFD; @@ -300,14 +299,6 @@ static ssize_t tun_chr_writev(struct fil return tun_get_user(tun, (struct iovec *) iv, iov_total(iv, count)); } -/* Write */ -static ssize_t tun_chr_write(struct file * file, const char __user * buf, - size_t count, loff_t *pos) -{ - struct iovec iv = { (void __user *) buf, count }; - return tun_chr_writev(file, &iv, 1, pos); -} - /* Put packet to the user space buffer */ static __inline__ ssize_t tun_put_user(struct tun_struct *tun, struct sk_buff *skb, @@ -341,10 +332,10 @@ static __inline__ ssize_t tun_put_user(s return total; } -/* Readv */ -static ssize_t tun_chr_readv(struct file *file, const struct iovec *iv, - unsigned long count, loff_t *pos) +static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec 
*iv, + unsigned long count, loff_t pos) { + struct file *file = iocb->ki_filp; struct tun_struct *tun = file->private_data; DECLARE_WAITQUEUE(wait, current); struct sk_buff *skb; @@ -424,14 +415,6 @@ static ssize_t tun_chr_readv(struct file return ret; } -/* Read */ -static ssize_t tun_chr_read(struct file * file, char __user * buf, - size_t count, loff_t *pos) -{ - struct iovec iv = { buf, count }; - return tun_chr_readv(file, &iv, 1, pos); -} - static void tun_setup(struct net_device *dev) { struct tun_struct *tun = netdev_priv(dev); @@ -759,10 +742,8 @@ static int tun_chr_close(struct inode *i static struct file_operations tun_fops = { .owner = THIS_MODULE, .llseek = no_llseek, - .read = tun_chr_read, - .readv = tun_chr_readv, - .write = tun_chr_write, - .writev = tun_chr_writev, + .aio_read = tun_chr_aio_read, + .aio_write = tun_chr_aio_write, .poll = tun_chr_poll, .ioctl = tun_chr_ioctl, .open = tun_chr_open, Index: linux-2.6.16-rc5/fs/bad_inode.c =================================================================== --- linux-2.6.16-rc5.orig/fs/bad_inode.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/bad_inode.c 2006-03-07 13:36:05.000000000 -0800 @@ -40,8 +40,6 @@ static struct file_operations bad_file_o .aio_fsync = EIO_ERROR, .fasync = EIO_ERROR, .lock = EIO_ERROR, - .readv = EIO_ERROR, - .writev = EIO_ERROR, .sendfile = EIO_ERROR, .sendpage = EIO_ERROR, .get_unmapped_area = EIO_ERROR, Index: linux-2.6.16-rc5/fs/block_dev.c =================================================================== --- linux-2.6.16-rc5.orig/fs/block_dev.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/block_dev.c 2006-03-07 13:36:05.000000000 -0800 @@ -798,8 +798,6 @@ struct file_operations def_blk_fops = { #ifdef CONFIG_COMPAT .compat_ioctl = compat_blkdev_ioctl, #endif - .readv = generic_file_readv, - .writev = generic_file_write_nolock, .sendfile = generic_file_sendfile, }; Index: linux-2.6.16-rc5/fs/cifs/cifsfs.c 
=================================================================== --- linux-2.6.16-rc5.orig/fs/cifs/cifsfs.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/cifs/cifsfs.c 2006-03-07 13:36:05.000000000 -0800 @@ -489,18 +489,6 @@ cifs_get_sb(struct file_system_type *fs_ return sb; } -static ssize_t cifs_file_writev(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct inode *inode = file->f_dentry->d_inode; - ssize_t written; - - written = generic_file_writev(file, iov, nr_segs, ppos); - if (!CIFS_I(inode)->clientCanCacheAll) - filemap_fdatawrite(inode->i_mapping); - return written; -} - static ssize_t cifs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { @@ -586,8 +574,6 @@ struct inode_operations cifs_symlink_ino struct file_operations cifs_file_ops = { .read = do_sync_read, .write = do_sync_write, - .readv = generic_file_readv, - .writev = cifs_file_writev, .aio_read = generic_file_aio_read, .aio_write = cifs_file_aio_write, .open = cifs_open, @@ -629,8 +615,6 @@ struct file_operations cifs_file_direct_ struct file_operations cifs_file_nobrl_ops = { .read = do_sync_read, .write = do_sync_write, - .readv = generic_file_readv, - .writev = cifs_file_writev, .aio_read = generic_file_aio_read, .aio_write = cifs_file_aio_write, .open = cifs_open, Index: linux-2.6.16-rc5/fs/compat.c =================================================================== --- linux-2.6.16-rc5.orig/fs/compat.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/compat.c 2006-03-07 13:36:05.000000000 -0800 @@ -55,6 +55,8 @@ extern void sigset_from_compat(sigset_t *set, compat_sigset_t *compat); +#include "read_write.h" + /* * Not all architectures have sys_utime, so implement this in terms * of sys_utimes. 
@@ -1137,9 +1139,6 @@ static ssize_t compat_do_readv_writev(in const struct compat_iovec __user *uvector, unsigned long nr_segs, loff_t *pos) { - typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *); - typedef ssize_t (*iov_fn_t)(struct file *, const struct iovec *, unsigned long, loff_t *); - compat_ssize_t tot_len; struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov=iovstack, *vector; @@ -1218,39 +1217,17 @@ static ssize_t compat_do_readv_writev(in fnv = NULL; if (type == READ) { fn = file->f_op->read; - fnv = file->f_op->readv; + fnv = file->f_op->aio_read; } else { fn = (io_fn_t)file->f_op->write; - fnv = file->f_op->writev; - } - if (fnv) { - ret = fnv(file, iov, nr_segs, pos); - goto out; + fnv = file->f_op->aio_write; } - /* Do it by hand, with file-ops */ - ret = 0; - vector = iov; - while (nr_segs > 0) { - void __user * base; - size_t len; - ssize_t nr; - - base = vector->iov_base; - len = vector->iov_len; - vector++; - nr_segs--; - - nr = fn(file, base, len, pos); + if (fnv) + ret = do_sync_readv_writev(file, iov, nr_segs, pos, fnv); + else + ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn); - if (nr < 0) { - if (!ret) ret = nr; - break; - } - ret += nr; - if (nr != len) - break; - } out: if (iov != iovstack) kfree(iov); @@ -1278,7 +1255,7 @@ compat_sys_readv(unsigned long fd, const goto out; ret = -EINVAL; - if (!file->f_op || (!file->f_op->readv && !file->f_op->read)) + if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read)) goto out; ret = compat_do_readv_writev(READ, file, vec, vlen, &file->f_pos); @@ -1301,7 +1278,7 @@ compat_sys_writev(unsigned long fd, cons goto out; ret = -EINVAL; - if (!file->f_op || (!file->f_op->writev && !file->f_op->write)) + if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write)) goto out; ret = compat_do_readv_writev(WRITE, file, vec, vlen, &file->f_pos); Index: linux-2.6.16-rc5/fs/ext2/file.c =================================================================== --- 
linux-2.6.16-rc5.orig/fs/ext2/file.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/ext2/file.c 2006-03-07 13:36:05.000000000 -0800 @@ -50,8 +50,6 @@ struct file_operations ext2_file_operati .open = generic_file_open, .release = ext2_release_file, .fsync = ext2_sync_file, - .readv = generic_file_readv, - .writev = generic_file_writev, .sendfile = generic_file_sendfile, }; Index: linux-2.6.16-rc5/fs/ext3/file.c =================================================================== --- linux-2.6.16-rc5.orig/fs/ext3/file.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/ext3/file.c 2006-03-07 13:36:05.000000000 -0800 @@ -112,8 +112,6 @@ struct file_operations ext3_file_operati .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = ext3_file_write, - .readv = generic_file_readv, - .writev = generic_file_writev, .ioctl = ext3_ioctl, .mmap = generic_file_mmap, .open = generic_file_open, Index: linux-2.6.16-rc5/fs/fat/file.c =================================================================== --- linux-2.6.16-rc5.orig/fs/fat/file.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/fat/file.c 2006-03-07 13:36:05.000000000 -0800 @@ -116,8 +116,6 @@ struct file_operations fat_file_operatio .llseek = generic_file_llseek, .read = do_sync_read, .write = do_sync_write, - .readv = generic_file_readv, - .writev = generic_file_writev, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, Index: linux-2.6.16-rc5/fs/fuse/dev.c =================================================================== --- linux-2.6.16-rc5.orig/fs/fuse/dev.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/fuse/dev.c 2006-03-07 13:36:05.000000000 -0800 @@ -602,8 +602,8 @@ static void request_wait(struct fuse_con * request_end(). Otherwise add it to the processing list, and set * the 'sent' flag. 
*/ -static ssize_t fuse_dev_readv(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *off) +static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { int err; struct fuse_conn *fc; @@ -614,7 +614,7 @@ static ssize_t fuse_dev_readv(struct fil restart: spin_lock(&fuse_lock); - fc = file->private_data; + fc = iocb->ki_filp->private_data; err = -EPERM; if (!fc) goto err_unlock; @@ -672,15 +672,6 @@ static ssize_t fuse_dev_readv(struct fil return err; } -static ssize_t fuse_dev_read(struct file *file, char __user *buf, - size_t nbytes, loff_t *off) -{ - struct iovec iov; - iov.iov_len = nbytes; - iov.iov_base = buf; - return fuse_dev_readv(file, &iov, 1, off); -} - /* Look up request on processing list by unique ID */ static struct fuse_req *request_find(struct fuse_conn *fc, u64 unique) { @@ -725,15 +716,15 @@ static int copy_out_args(struct fuse_cop * it from the list and copy the rest of the buffer to the request. 
* The request is finished by calling request_end() */ -static ssize_t fuse_dev_writev(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *off) +static ssize_t fuse_dev_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { int err; unsigned nbytes = iov_length(iov, nr_segs); struct fuse_req *req; struct fuse_out_header oh; struct fuse_copy_state cs; - struct fuse_conn *fc = fuse_get_conn(file); + struct fuse_conn *fc = fuse_get_conn(iocb->ki_filp); if (!fc) return -ENODEV; @@ -793,15 +784,6 @@ static ssize_t fuse_dev_writev(struct fi return err; } -static ssize_t fuse_dev_write(struct file *file, const char __user *buf, - size_t nbytes, loff_t *off) -{ - struct iovec iov; - iov.iov_len = nbytes; - iov.iov_base = (char __user *) buf; - return fuse_dev_writev(file, &iov, 1, off); -} - static unsigned fuse_dev_poll(struct file *file, poll_table *wait) { struct fuse_conn *fc = fuse_get_conn(file); @@ -925,10 +907,8 @@ static int fuse_dev_release(struct inode struct file_operations fuse_dev_operations = { .owner = THIS_MODULE, .llseek = no_llseek, - .read = fuse_dev_read, - .readv = fuse_dev_readv, - .write = fuse_dev_write, - .writev = fuse_dev_writev, + .aio_read = fuse_dev_read, + .aio_write = fuse_dev_write, .poll = fuse_dev_poll, .release = fuse_dev_release, }; Index: linux-2.6.16-rc5/fs/hostfs/hostfs_kern.c =================================================================== --- linux-2.6.16-rc5.orig/fs/hostfs/hostfs_kern.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/hostfs/hostfs_kern.c 2006-03-07 13:36:05.000000000 -0800 @@ -390,8 +390,6 @@ static struct file_operations hostfs_fil .sendfile = generic_file_sendfile, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, - .readv = generic_file_readv, - .writev = generic_file_writev, .write = generic_file_write, .mmap = generic_file_mmap, .open = hostfs_file_open, Index: linux-2.6.16-rc5/fs/jfs/file.c 
=================================================================== --- linux-2.6.16-rc5.orig/fs/jfs/file.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/jfs/file.c 2006-03-07 13:36:05.000000000 -0800 @@ -108,8 +108,6 @@ struct file_operations jfs_file_operatio .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, - .readv = generic_file_readv, - .writev = generic_file_writev, .sendfile = generic_file_sendfile, .fsync = jfs_fsync, .release = jfs_release, Index: linux-2.6.16-rc5/fs/ntfs/file.c =================================================================== --- linux-2.6.16-rc5.orig/fs/ntfs/file.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/ntfs/file.c 2006-03-07 13:36:05.000000000 -0800 @@ -2308,11 +2308,9 @@ struct file_operations ntfs_file_ops = { .llseek = generic_file_llseek, /* Seek inside file. */ .read = generic_file_read, /* Read from file. */ .aio_read = generic_file_aio_read, /* Async read from file. */ - .readv = generic_file_readv, /* Read from file. */ #ifdef NTFS_RW .write = ntfs_file_write, /* Write to file. */ .aio_write = ntfs_file_aio_write, /* Async write to file. */ - .writev = ntfs_file_writev, /* Write to file. */ /*.release = ,*/ /* Last file is closed. 
See fs/ext2/file.c:: ext2_release_file() for Index: linux-2.6.16-rc5/fs/pipe.c =================================================================== --- linux-2.6.16-rc5.orig/fs/pipe.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/pipe.c 2006-03-07 13:36:05.000000000 -0800 @@ -119,9 +119,10 @@ static struct pipe_buf_operations anon_p }; static ssize_t -pipe_readv(struct file *filp, const struct iovec *_iov, - unsigned long nr_segs, loff_t *ppos) +pipe_read(struct kiocb *iocb, const struct iovec *_iov, + unsigned long nr_segs, loff_t pos) { + struct file *filp = iocb->ki_filp; struct inode *inode = filp->f_dentry->d_inode; struct pipe_inode_info *info; int do_wakeup; @@ -212,16 +213,10 @@ pipe_readv(struct file *filp, const stru } static ssize_t -pipe_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos) -{ - struct iovec iov = { .iov_base = buf, .iov_len = count }; - return pipe_readv(filp, &iov, 1, ppos); -} - -static ssize_t -pipe_writev(struct file *filp, const struct iovec *_iov, - unsigned long nr_segs, loff_t *ppos) +pipe_write(struct kiocb *iocb, const struct iovec *_iov, + unsigned long nr_segs, loff_t pos) { + struct file *filp = iocb->ki_filp; struct inode *inode = filp->f_dentry->d_inode; struct pipe_inode_info *info; ssize_t ret; @@ -352,14 +347,6 @@ out: } static ssize_t -pipe_write(struct file *filp, const char __user *buf, - size_t count, loff_t *ppos) -{ - struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count }; - return pipe_writev(filp, &iov, 1, ppos); -} - -static ssize_t bad_pipe_r(struct file *filp, char __user *buf, size_t count, loff_t *ppos) { return -EBADF; @@ -570,8 +557,7 @@ pipe_rdwr_open(struct inode *inode, stru */ struct file_operations read_fifo_fops = { .llseek = no_llseek, - .read = pipe_read, - .readv = pipe_readv, + .aio_read = pipe_read, .write = bad_pipe_w, .poll = pipe_poll, .ioctl = pipe_ioctl, @@ -583,8 +569,7 @@ struct file_operations read_fifo_fops = struct file_operations 
write_fifo_fops = { .llseek = no_llseek, .read = bad_pipe_r, - .write = pipe_write, - .writev = pipe_writev, + .aio_write = pipe_write, .poll = pipe_poll, .ioctl = pipe_ioctl, .open = pipe_write_open, @@ -594,10 +579,8 @@ struct file_operations write_fifo_fops = struct file_operations rdwr_fifo_fops = { .llseek = no_llseek, - .read = pipe_read, - .readv = pipe_readv, - .write = pipe_write, - .writev = pipe_writev, + .aio_read = pipe_read, + .aio_write = pipe_write, .poll = pipe_poll, .ioctl = pipe_ioctl, .open = pipe_rdwr_open, @@ -607,8 +590,7 @@ struct file_operations rdwr_fifo_fops = struct file_operations read_pipe_fops = { .llseek = no_llseek, - .read = pipe_read, - .readv = pipe_readv, + .aio_read = pipe_read, .write = bad_pipe_w, .poll = pipe_poll, .ioctl = pipe_ioctl, @@ -620,8 +602,7 @@ struct file_operations read_pipe_fops = struct file_operations write_pipe_fops = { .llseek = no_llseek, .read = bad_pipe_r, - .write = pipe_write, - .writev = pipe_writev, + .aio_write = pipe_write, .poll = pipe_poll, .ioctl = pipe_ioctl, .open = pipe_write_open, @@ -631,10 +612,8 @@ struct file_operations write_pipe_fops = struct file_operations rdwr_pipe_fops = { .llseek = no_llseek, - .read = pipe_read, - .readv = pipe_readv, - .write = pipe_write, - .writev = pipe_writev, + .aio_read = pipe_read, + .aio_write = pipe_write, .poll = pipe_poll, .ioctl = pipe_ioctl, .open = pipe_rdwr_open, Index: linux-2.6.16-rc5/fs/read_write.c =================================================================== --- linux-2.6.16-rc5.orig/fs/read_write.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/read_write.c 2006-03-07 13:51:38.000000000 -0800 @@ -448,6 +448,68 @@ unsigned long iov_shorten(struct iovec * EXPORT_SYMBOL(iov_shorten); +typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *); +typedef ssize_t (*iov_fn_t)(struct kiocb *, const struct iovec *, + unsigned long, loff_t); + + +ssize_t do_sync_readv_writev(struct file *filp, const struct iovec 
*iov, + unsigned long nr_segs, size_t len, loff_t *ppos, iov_fn_t fn) +{ + struct kiocb kiocb; + ssize_t ret; + + init_sync_kiocb(&kiocb, filp); + kiocb.ki_pos = *ppos; + kiocb.ki_left = len; + kiocb.ki_nbytes = len; + + for (;;) { + ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos); + if (ret != -EIOCBRETRY) + break; + wait_on_retry_sync_kiocb(&kiocb); + } + + if (ret == -EIOCBQUEUED) + ret = wait_on_sync_kiocb(&kiocb); + *ppos = kiocb.ki_pos; + return ret; +} + +/* Do it by hand, with file-ops */ +ssize_t do_loop_readv_writev(struct file *filp, struct iovec *iov, + unsigned long nr_segs, loff_t *ppos, io_fn_t fn) +{ + struct iovec *vector = iov; + ssize_t ret = 0; + + + while (nr_segs > 0) { + void __user * base; + size_t len; + ssize_t nr; + + base = vector->iov_base; + len = vector->iov_len; + vector++; + nr_segs--; + + nr = fn(filp, base, len, ppos); + + if (nr < 0) { + if (!ret) + ret = nr; + break; + } + ret += nr; + if (nr != len) + break; + } + + return ret; +} + /* A write operation does a read from user space and vice versa */ #define vrfy_dir(type) ((type) == READ ? 
VERIFY_WRITE : VERIFY_READ) @@ -455,12 +517,9 @@ static ssize_t do_readv_writev(int type, const struct iovec __user * uvector, unsigned long nr_segs, loff_t *pos) { - typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *); - typedef ssize_t (*iov_fn_t)(struct file *, const struct iovec *, unsigned long, loff_t *); - size_t tot_len; struct iovec iovstack[UIO_FASTIOV]; - struct iovec *iov=iovstack, *vector; + struct iovec *iov = iovstack; ssize_t ret; int seg; io_fn_t fn; @@ -530,39 +589,17 @@ static ssize_t do_readv_writev(int type, fnv = NULL; if (type == READ) { fn = file->f_op->read; - fnv = file->f_op->readv; + fnv = file->f_op->aio_read; } else { fn = (io_fn_t)file->f_op->write; - fnv = file->f_op->writev; - } - if (fnv) { - ret = fnv(file, iov, nr_segs, pos); - goto out; + fnv = file->f_op->aio_write; } - /* Do it by hand, with file-ops */ - ret = 0; - vector = iov; - while (nr_segs > 0) { - void __user * base; - size_t len; - ssize_t nr; - - base = vector->iov_base; - len = vector->iov_len; - vector++; - nr_segs--; - - nr = fn(file, base, len, pos); + if (fnv) + ret = do_sync_readv_writev(file, iov, nr_segs, tot_len, pos, fnv); + else + ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn); - if (nr < 0) { - if (!ret) ret = nr; - break; - } - ret += nr; - if (nr != len) - break; - } out: if (iov != iovstack) kfree(iov); @@ -583,7 +620,7 @@ ssize_t vfs_readv(struct file *file, con { if (!(file->f_mode & FMODE_READ)) return -EBADF; - if (!file->f_op || (!file->f_op->readv && !file->f_op->read)) + if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read)) return -EINVAL; return do_readv_writev(READ, file, vec, vlen, pos); @@ -596,7 +633,7 @@ ssize_t vfs_writev(struct file *file, co { if (!(file->f_mode & FMODE_WRITE)) return -EBADF; - if (!file->f_op || (!file->f_op->writev && !file->f_op->write)) + if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write)) return -EINVAL; return do_readv_writev(WRITE, file, vec, vlen, pos); 
Index: linux-2.6.16-rc5/fs/read_write.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc5/fs/read_write.h 2006-03-07 13:36:05.000000000 -0800 @@ -0,0 +1,14 @@ +/* + * This file is only for sharing some helpers from read_write.c with compat.c. + * Don't use anywhere else. + */ + + +typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *); +typedef ssize_t (*iov_fn_t)(struct kiocb *, const struct iovec *, + unsigned long, loff_t); + +ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov, + unsigned long nr_segs, loff_t *ppos, iov_fn_t fn); +ssize_t do_loop_readv_writev(struct file *filp, struct iovec *iov, + unsigned long nr_segs, loff_t *ppos, io_fn_t fn); Index: linux-2.6.16-rc5/fs/xfs/linux-2.6/xfs_file.c =================================================================== --- linux-2.6.16-rc5.orig/fs/xfs/linux-2.6/xfs_file.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/fs/xfs/linux-2.6/xfs_file.c 2006-03-07 13:36:05.000000000 -0800 @@ -133,96 +133,6 @@ linvfs_aio_write_invis( } -STATIC inline ssize_t -__linvfs_readv( - struct file *file, - const struct iovec *iov, - int ioflags, - unsigned long nr_segs, - loff_t *ppos) -{ - struct inode *inode = file->f_mapping->host; - vnode_t *vp = LINVFS_GET_VP(inode); - struct kiocb kiocb; - ssize_t rval; - - init_sync_kiocb(&kiocb, file); - kiocb.ki_pos = *ppos; - - if (unlikely(file->f_flags & O_DIRECT)) - ioflags |= IO_ISDIRECT; - VOP_READ(vp, &kiocb, iov, nr_segs, &kiocb.ki_pos, ioflags, NULL, rval); - - *ppos = kiocb.ki_pos; - return rval; -} - -STATIC ssize_t -linvfs_readv( - struct file *file, - const struct iovec *iov, - unsigned long nr_segs, - loff_t *ppos) -{ - return __linvfs_readv(file, iov, 0, nr_segs, ppos); -} - -STATIC ssize_t -linvfs_readv_invis( - struct file *file, - const struct iovec *iov, - unsigned long nr_segs, - loff_t *ppos) -{ - return __linvfs_readv(file, iov, 
IO_INVIS, nr_segs, ppos); -} - - -STATIC inline ssize_t -__linvfs_writev( - struct file *file, - const struct iovec *iov, - int ioflags, - unsigned long nr_segs, - loff_t *ppos) -{ - struct inode *inode = file->f_mapping->host; - vnode_t *vp = LINVFS_GET_VP(inode); - struct kiocb kiocb; - ssize_t rval; - - init_sync_kiocb(&kiocb, file); - kiocb.ki_pos = *ppos; - if (unlikely(file->f_flags & O_DIRECT)) - ioflags |= IO_ISDIRECT; - - VOP_WRITE(vp, &kiocb, iov, nr_segs, &kiocb.ki_pos, ioflags, NULL, rval); - - *ppos = kiocb.ki_pos; - return rval; -} - - -STATIC ssize_t -linvfs_writev( - struct file *file, - const struct iovec *iov, - unsigned long nr_segs, - loff_t *ppos) -{ - return __linvfs_writev(file, iov, 0, nr_segs, ppos); -} - -STATIC ssize_t -linvfs_writev_invis( - struct file *file, - const struct iovec *iov, - unsigned long nr_segs, - loff_t *ppos) -{ - return __linvfs_writev(file, iov, IO_INVIS, nr_segs, ppos); -} - STATIC ssize_t linvfs_sendfile( struct file *filp, @@ -529,8 +439,6 @@ struct file_operations linvfs_file_opera .llseek = generic_file_llseek, .read = do_sync_read, .write = do_sync_write, - .readv = linvfs_readv, - .writev = linvfs_writev, .aio_read = linvfs_aio_read, .aio_write = linvfs_aio_write, .sendfile = linvfs_sendfile, @@ -551,8 +459,6 @@ struct file_operations linvfs_invis_file .llseek = generic_file_llseek, .read = do_sync_read, .write = do_sync_write, - .readv = linvfs_readv_invis, - .writev = linvfs_writev_invis, .aio_read = linvfs_aio_read_invis, .aio_write = linvfs_aio_write_invis, .sendfile = linvfs_sendfile, Index: linux-2.6.16-rc5/include/linux/fs.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/fs.h 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/include/linux/fs.h 2006-03-07 13:44:09.000000000 -0800 @@ -1015,8 +1015,6 @@ struct file_operations { int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file 
*, int, struct file_lock *); - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); @@ -1580,10 +1578,6 @@ extern void do_generic_mapping_read(stru loff_t *, read_descriptor_t *, read_actor_t); extern void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); -extern ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos); -ssize_t generic_file_writev(struct file *filp, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos); extern loff_t no_llseek(struct file *file, loff_t offset, int origin); extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); Index: linux-2.6.16-rc5/mm/filemap.c =================================================================== --- linux-2.6.16-rc5.orig/mm/filemap.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/mm/filemap.c 2006-03-07 13:36:05.000000000 -0800 @@ -2236,42 +2236,6 @@ ssize_t generic_file_write(struct file * } EXPORT_SYMBOL(generic_file_write); -ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct kiocb kiocb; - ssize_t ret; - - init_sync_kiocb(&kiocb, filp); - ret = __generic_file_aio_read(&kiocb, iov, nr_segs, ppos); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&kiocb); - return ret; -} -EXPORT_SYMBOL(generic_file_readv); - -ssize_t generic_file_writev(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct address_space *mapping = 
file->f_mapping; - struct inode *inode = mapping->host; - ssize_t ret; - - mutex_lock(&inode->i_mutex); - ret = __generic_file_write_nolock(file, iov, nr_segs, ppos); - mutex_unlock(&inode->i_mutex); - - if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { - int err; - - err = sync_page_range(inode, mapping, *ppos - ret, ret); - if (err < 0) - ret = err; - } - return ret; -} -EXPORT_SYMBOL(generic_file_writev); - /* * Called under i_mutex for writes to S_ISREG files. Returns -EIO if something * went wrong during pagecache shootdown. Index: linux-2.6.16-rc5/net/socket.c =================================================================== --- linux-2.6.16-rc5.orig/net/socket.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/net/socket.c 2006-03-07 13:36:05.000000000 -0800 @@ -110,10 +110,6 @@ static unsigned int sock_poll(struct fil static long sock_ioctl(struct file *file, unsigned int cmd, unsigned long arg); static int sock_fasync(int fd, struct file *filp, int on); -static ssize_t sock_readv(struct file *file, const struct iovec *vector, - unsigned long count, loff_t *ppos); -static ssize_t sock_writev(struct file *file, const struct iovec *vector, - unsigned long count, loff_t *ppos); static ssize_t sock_sendpage(struct file *file, struct page *page, int offset, size_t size, loff_t *ppos, int more); @@ -134,8 +130,6 @@ static struct file_operations socket_fil .open = sock_no_open, /* special open code to disallow open via /proc */ .release = sock_close, .fasync = sock_fasync, - .readv = sock_readv, - .writev = sock_writev, .sendpage = sock_sendpage }; @@ -692,23 +686,6 @@ static ssize_t do_sock_read(struct msghd return __sock_recvmsg(iocb, sock, msg, size, msg->msg_flags); } -static ssize_t sock_readv(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct kiocb iocb; - struct sock_iocb siocb; - struct msghdr msg; - int ret; - - init_sync_kiocb(&iocb, NULL); - iocb.private = &siocb; - - ret = 
do_sock_read(&msg, &iocb, file, iov, nr_segs); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&iocb); - return ret; -} - static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { @@ -751,23 +728,6 @@ static ssize_t do_sock_write(struct msgh return __sock_sendmsg(iocb, sock, msg, size); } -static ssize_t sock_writev(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct msghdr msg; - struct kiocb iocb; - struct sock_iocb siocb; - int ret; - - init_sync_kiocb(&iocb, NULL); - iocb.private = &siocb; - - ret = do_sock_write(&msg, &iocb, file, iov, nr_segs); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&iocb); - return ret; -} - static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { Index: linux-2.6.16-rc5/sound/core/pcm_native.c =================================================================== --- linux-2.6.16-rc5.orig/sound/core/pcm_native.c 2006-03-07 09:13:39.000000000 -0800 +++ linux-2.6.16-rc5/sound/core/pcm_native.c 2006-03-07 13:36:05.000000000 -0800 @@ -2824,8 +2824,8 @@ static ssize_t snd_pcm_write(struct file return result; } -static ssize_t snd_pcm_readv(struct file *file, const struct iovec *_vector, - unsigned long count, loff_t * offset) +static ssize_t snd_pcm_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct snd_pcm_file *pcm_file; @@ -2836,22 +2836,22 @@ static ssize_t snd_pcm_readv(struct file void __user **bufs; snd_pcm_uframes_t frames; - pcm_file = file->private_data; + pcm_file = iocb->ki_filp->private_data; substream = pcm_file->substream; snd_assert(substream != NULL, return -ENXIO); runtime = substream->runtime; if (runtime->status->state == SNDRV_PCM_STATE_OPEN) return -EBADFD; - if (count > 1024 || count != runtime->channels) + if (nr_segs > 1024 || nr_segs != runtime->channels) return -EINVAL; - if (!frame_aligned(runtime, 
_vector->iov_len)) + if (!frame_aligned(runtime, iov->iov_len)) return -EINVAL; - frames = bytes_to_samples(runtime, _vector->iov_len); - bufs = kmalloc(sizeof(void *) * count, GFP_KERNEL); + frames = bytes_to_samples(runtime, iov->iov_len); + bufs = kmalloc(sizeof(void *) * nr_segs, GFP_KERNEL); if (bufs == NULL) return -ENOMEM; - for (i = 0; i < count; ++i) - bufs[i] = _vector[i].iov_base; + for (i = 0; i < nr_segs; ++i) + bufs[i] = iov[i].iov_base; result = snd_pcm_lib_readv(substream, bufs, frames); if (result > 0) result = frames_to_bytes(runtime, result); @@ -2859,8 +2859,8 @@ static ssize_t snd_pcm_readv(struct file return result; } -static ssize_t snd_pcm_writev(struct file *file, const struct iovec *_vector, - unsigned long count, loff_t * offset) +static ssize_t snd_pcm_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct snd_pcm_file *pcm_file; struct snd_pcm_substream *substream; @@ -2870,7 +2870,7 @@ static ssize_t snd_pcm_writev(struct fil void __user **bufs; snd_pcm_uframes_t frames; - pcm_file = file->private_data; + pcm_file = iocb->ki_filp->private_data; substream = pcm_file->substream; snd_assert(substream != NULL, result = -ENXIO; goto end); runtime = substream->runtime; @@ -2878,17 +2878,17 @@ static ssize_t snd_pcm_writev(struct fil result = -EBADFD; goto end; } - if (count > 128 || count != runtime->channels || - !frame_aligned(runtime, _vector->iov_len)) { + if (nr_segs > 128 || nr_segs != runtime->channels || + !frame_aligned(runtime, iov->iov_len)) { result = -EINVAL; goto end; } - frames = bytes_to_samples(runtime, _vector->iov_len); - bufs = kmalloc(sizeof(void *) * count, GFP_KERNEL); + frames = bytes_to_samples(runtime, iov->iov_len); + bufs = kmalloc(sizeof(void *) * nr_segs, GFP_KERNEL); if (bufs == NULL) return -ENOMEM; - for (i = 0; i < count; ++i) - bufs[i] = _vector[i].iov_base; + for (i = 0; i < nr_segs; ++i) + bufs[i] = iov[i].iov_base; result = snd_pcm_lib_writev(substream, bufs, 
frames); if (result > 0) result = frames_to_bytes(runtime, result); @@ -3394,7 +3394,7 @@ struct file_operations snd_pcm_f_ops[2] { .owner = THIS_MODULE, .write = snd_pcm_write, - .writev = snd_pcm_writev, + .aio_write = snd_pcm_aio_write, .open = snd_pcm_playback_open, .release = snd_pcm_release, .poll = snd_pcm_playback_poll, @@ -3406,7 +3406,7 @@ struct file_operations snd_pcm_f_ops[2] { .owner = THIS_MODULE, .read = snd_pcm_read, - .readv = snd_pcm_readv, + .aio_read = snd_pcm_aio_read, .open = snd_pcm_capture_open, .release = snd_pcm_release, .poll = snd_pcm_capture_poll, ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead
  2006-03-08  0:23 ` [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead Badari Pulavarty
@ 2006-03-08 12:45   ` christoph
  2006-03-08 16:26     ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: christoph @ 2006-03-08 12:45 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Zach Brown, christoph, lkml, linux-fsdevel

On Tue, Mar 07, 2006 at 04:23:02PM -0800, Badari Pulavarty wrote:
> This patch removes readv() and writev() methods and replaces
> them with aio_read()/aio_write() methods.

you have the io_fn_t/io_fnv_t typedefs both in read_write.c and
read_write.h - they really should be in the latter only.

else ok.

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead
  2006-03-08 12:45   ` christoph
@ 2006-03-08 16:26     ` Badari Pulavarty
  0 siblings, 0 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-08 16:26 UTC (permalink / raw)
  To: christoph; +Cc: Zach Brown, lkml, linux-fsdevel

On Wed, 2006-03-08 at 13:45 +0100, christoph wrote:
> On Tue, Mar 07, 2006 at 04:23:02PM -0800, Badari Pulavarty wrote:
> > This patch removes readv() and writev() methods and replaces
> > them with aio_read()/aio_write() methods.
>
> you have the io_fn_t/io_fnv_t typedefs both in read_write.c and
> read_write.h - they really should be in the latter only.

Taken care of.

I am not sure if you noticed or not .. iocb->ki_left holds the
amount of IO that needs to be done, so we can use it instead of
looping through iovecs to calculate length. All we need to do is
set it correctly in the sync case.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* [PATCH 3/3] Zach's core aio changes to support vectored AIO
  2006-03-08  0:19 [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support Badari Pulavarty
  2006-03-08  0:22 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
  2006-03-08  0:23 ` [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead Badari Pulavarty
@ 2006-03-08  0:24 ` Badari Pulavarty
  2006-03-08  3:37   ` Benjamin LaHaise
  2006-03-08 12:47 ` [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support christoph
  3 siblings, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-08 0:24 UTC (permalink / raw)
  To: Zach Brown; +Cc: christoph, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 156 bytes --]

This work is initially done by Zach Brown to add support for
vectored aio. These are the core changes for AIO to support
IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.

[-- Attachment #2: aiocore-changes.patch --]
[-- Type: text/x-patch, Size: 13206 bytes --]

This work is initially done by Zach Brown to add support for
vectored aio. These are the core changes for AIO to support
IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Index: linux-2.6.16-rc5/fs/aio.c =================================================================== --- linux-2.6.16-rc5.orig/fs/aio.c 2006-03-07 13:44:09.000000000 -0800 +++ linux-2.6.16-rc5/fs/aio.c 2006-03-07 13:59:17.000000000 -0800 @@ -416,6 +416,7 @@ static struct kiocb fastcall *__aio_get_ req->ki_retry = NULL; req->ki_dtor = NULL; req->private = NULL; + req->ki_iovec = NULL; INIT_LIST_HEAD(&req->ki_run_list); /* Check if the completion queue has enough free space to @@ -461,6 +462,8 @@ static inline void really_put_req(struct if (req->ki_dtor) req->ki_dtor(req); + if (req->ki_iovec != &req->ki_inline_vec) + kfree(req->ki_iovec); kmem_cache_free(kiocb_cachep, req); ctx->reqs_active--; @@ -1302,69 +1305,63 @@ asmlinkage long sys_io_destroy(aio_conte return -EINVAL; } -/* - * aio_p{read,write} are the default ki_retry methods for - * IO_CMD_P{READ,WRITE}. They maintains kiocb retry state around potentially - * multiple calls to f_op->aio_read(). They loop around partial progress - * instead of returning -EIOCBRETRY because they don't have the means to call - * kick_iocb(). - */ -static ssize_t aio_pread(struct kiocb *iocb) +static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret) { - struct file *file = iocb->ki_filp; - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - ssize_t ret = 0; + struct iovec *iov = &iocb->ki_iovec[iocb->ki_cur_seg]; - do { - iocb->ki_inline_vec.iov_base = iocb->ki_buf; - iocb->ki_inline_vec.iov_len = iocb->ki_left; + BUG_ON(ret <= 0); - ret = file->f_op->aio_read(iocb, &iocb->ki_inline_vec, - 1, iocb->ki_pos); - /* - * Can't just depend on iocb->ki_left to determine - * whether we are done. This may have been a short read. 
- */ - if (ret > 0) { - iocb->ki_buf += ret; - iocb->ki_left -= ret; + while (iocb->ki_cur_seg < iocb->ki_nr_segs && ret > 0) { + ssize_t this = min(iov->iov_len, (size_t)ret); + iov->iov_base += this; + iov->iov_len -= this; + iocb->ki_left -= this; + ret -= this; + if (iov->iov_len == 0) { + iocb->ki_cur_seg++; + iov++; } + } - /* - * For pipes and sockets we return once we have some data; for - * regular files we retry till we complete the entire read or - * find that we can't read any more data (e.g short reads). - */ - } while (ret > 0 && iocb->ki_left > 0 && - !S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode)); - - /* This means we must have transferred all that we could */ - /* No need to retry anymore */ - if ((ret == 0) || (iocb->ki_left == 0)) - ret = iocb->ki_nbytes - iocb->ki_left; - - return ret; + /* the caller should not have done more io than what fit in + * the remaining iovecs */ + BUG_ON(ret > 0 && iocb->ki_left == 0); } -/* see aio_pread() */ -static ssize_t aio_pwrite(struct kiocb *iocb) +static ssize_t aio_rw_vect_retry(struct kiocb *iocb) { struct file *file = iocb->ki_filp; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + ssize_t (*rw_op)(struct kiocb *, const struct iovec *, + unsigned long, loff_t); ssize_t ret = 0; + unsigned short opcode; + + if ((iocb->ki_opcode == IOCB_CMD_PREADV) || + (iocb->ki_opcode == IOCB_CMD_PREAD)) { + rw_op = file->f_op->aio_read; + opcode = IOCB_CMD_PREADV; + } else { + rw_op = file->f_op->aio_write; + opcode = IOCB_CMD_PWRITEV; + } do { - iocb->ki_inline_vec.iov_base = iocb->ki_buf; - iocb->ki_inline_vec.iov_len = iocb->ki_left; + ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg], + iocb->ki_nr_segs - iocb->ki_cur_seg, + iocb->ki_pos); + if (ret > 0) + aio_advance_iovec(iocb, ret); - ret = file->f_op->aio_write(iocb, &iocb->ki_inline_vec, - 1, iocb->ki_pos); - if (ret > 0) { - iocb->ki_buf += ret; - iocb->ki_left -= ret; - } - } while (ret > 0 && iocb->ki_left > 
0); + /* retry all partial writes. retry partial reads as long as its a + * regular file. */ + } while (ret > 0 && iocb->ki_left > 0 && + (opcode == IOCB_CMD_PWRITEV || + (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode)))); + /* This means we must have transferred all that we could */ + /* No need to retry anymore */ if ((ret == 0) || (iocb->ki_left == 0)) ret = iocb->ki_nbytes - iocb->ki_left; @@ -1391,6 +1388,38 @@ static ssize_t aio_fsync(struct kiocb *i return ret; } +static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb) +{ + ssize_t ret; + + ret = rw_copy_check_uvector((struct iovec __user *)kiocb->ki_buf, + kiocb->ki_nbytes, 1, + &kiocb->ki_inline_vec, &kiocb->ki_iovec); + if (ret < 0) + goto out; + + kiocb->ki_nr_segs = kiocb->ki_nbytes; + kiocb->ki_cur_seg = 0; + /* ki_nbytes/left now reflect bytes instead of segs */ + kiocb->ki_nbytes = ret; + kiocb->ki_left = ret; + + ret = 0; +out: + return ret; +} + +static ssize_t aio_setup_single_vector(struct kiocb *kiocb) +{ + kiocb->ki_iovec = &kiocb->ki_inline_vec; + kiocb->ki_iovec->iov_base = kiocb->ki_buf; + kiocb->ki_iovec->iov_len = kiocb->ki_left; + kiocb->ki_nr_segs = 1; + kiocb->ki_cur_seg = 0; + kiocb->ki_nbytes = kiocb->ki_left; + return 0; +} + /* * aio_setup_iocb: * Performs the initial checks and aio retry method @@ -1413,9 +1442,12 @@ static ssize_t aio_setup_iocb(struct kio ret = security_file_permission(file, MAY_READ); if (unlikely(ret)) break; + ret = aio_setup_single_vector(kiocb); + if (ret) + break; ret = -EINVAL; if (file->f_op->aio_read) - kiocb->ki_retry = aio_pread; + kiocb->ki_retry = aio_rw_vect_retry; break; case IOCB_CMD_PWRITE: ret = -EBADF; @@ -1428,9 +1460,34 @@ static ssize_t aio_setup_iocb(struct kio ret = security_file_permission(file, MAY_WRITE); if (unlikely(ret)) break; + ret = aio_setup_single_vector(kiocb); + if (ret) + break; + ret = -EINVAL; + if (file->f_op->aio_write) + kiocb->ki_retry = aio_rw_vect_retry; + break; + case IOCB_CMD_PREADV: + ret = -EBADF; + if 
(unlikely(!(file->f_mode & FMODE_READ))) + break; + ret = aio_setup_vectored_rw(kiocb); + if (ret) + break; + ret = -EINVAL; + if (file->f_op->aio_read) + kiocb->ki_retry = aio_rw_vect_retry; + break; + case IOCB_CMD_PWRITEV: + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + break; + ret = aio_setup_vectored_rw(kiocb); + if (ret) + break; ret = -EINVAL; if (file->f_op->aio_write) - kiocb->ki_retry = aio_pwrite; + kiocb->ki_retry = aio_rw_vect_retry; break; case IOCB_CMD_FDSYNC: ret = -EINVAL; Index: linux-2.6.16-rc5/include/linux/aio.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/aio.h 2006-03-07 13:44:09.000000000 -0800 +++ linux-2.6.16-rc5/include/linux/aio.h 2006-03-07 13:59:17.000000000 -0800 @@ -7,6 +7,7 @@ #include <linux/uio.h> #include <asm/atomic.h> +#include <linux/uio.h> #define AIO_MAXSEGS 4 #define AIO_KIOGRP_NR_ATOMIC 8 @@ -114,6 +115,9 @@ struct kiocb { long ki_kicked; /* just for testing */ long ki_queued; /* just for testing */ struct iovec ki_inline_vec; /* inline vector */ + struct iovec *ki_iovec; + unsigned long ki_nr_segs; + unsigned long ki_cur_seg; struct list_head ki_list; /* the aio core uses this * for cancellation */ Index: linux-2.6.16-rc5/include/linux/aio_abi.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/aio_abi.h 2006-03-07 13:44:09.000000000 -0800 +++ linux-2.6.16-rc5/include/linux/aio_abi.h 2006-03-07 13:59:17.000000000 -0800 @@ -41,6 +41,8 @@ enum { * IOCB_CMD_POLL = 5, */ IOCB_CMD_NOOP = 6, + IOCB_CMD_PREADV = 7, + IOCB_CMD_PWRITEV = 8, }; /* read() from /dev/aio returns these structures. 
*/ Index: linux-2.6.16-rc5/fs/read_write.c =================================================================== --- linux-2.6.16-rc5.orig/fs/read_write.c 2006-03-07 13:59:14.000000000 -0800 +++ linux-2.6.16-rc5/fs/read_write.c 2006-03-07 13:59:17.000000000 -0800 @@ -513,72 +513,103 @@ ssize_t do_loop_readv_writev(struct file /* A write operation does a read from user space and vice versa */ #define vrfy_dir(type) ((type) == READ ? VERIFY_WRITE : VERIFY_READ) +ssize_t rw_copy_check_uvector(const struct iovec __user * uvector, + unsigned long nr_segs, unsigned long fast_segs, + struct iovec *fast_pointer, + struct iovec **ret_pointer) + { + unsigned long seg; + ssize_t ret; + struct iovec *iov = fast_pointer; + + /* + * SuS says "The readv() function *may* fail if the iovcnt argument + * was less than or equal to 0, or greater than {IOV_MAX}. Linux has + * traditionally returned zero for zero segments, so... + */ + if (nr_segs == 0) { + ret = 0; + goto out; + } + + /* + * First get the "struct iovec" from user memory and + * verify all the pointers + */ + if ((nr_segs > UIO_MAXIOV) || (nr_segs <= 0)) { + ret = -EINVAL; + goto out; + } + if (nr_segs > fast_segs) { + iov = kmalloc(nr_segs*sizeof(struct iovec), GFP_KERNEL); + if (iov == NULL) { + ret = -ENOMEM; + goto out; + } + } + if (copy_from_user(iov, uvector, nr_segs*sizeof(*uvector))) { + ret = -EFAULT; + goto out; + } + + /* + * According to the Single Unix Specification we should return EINVAL + * if an element length is < 0 when cast to ssize_t or if the + * total length would overflow the ssize_t return value of the + * system call. 
+ */ + ret = 0; + for (seg = 0; seg < nr_segs; seg++) { + void __user *buf = iov[seg].iov_base; + ssize_t len = (ssize_t)iov[seg].iov_len; + + /* see if we we're about to use an invalid len or if + * it's about to overflow ssize_t */ + if (len < 0 || (ret + len < ret)) { + ret = -EINVAL; + goto out; + } + if (unlikely(!access_ok(vrfy_dir(type), buf, len))) { + ret = -EFAULT; + goto out; + } + + ret += len; + } +out: + *ret_pointer = iov; + return ret; +} + +/* A write operation does a read from user space and vice versa */ +#define vrfy_dir(type) ((type) == READ ? VERIFY_WRITE : VERIFY_READ) + static ssize_t do_readv_writev(int type, struct file *file, const struct iovec __user * uvector, unsigned long nr_segs, loff_t *pos) { - size_t tot_len; + ssize_t tot_len; struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov = iovstack; ssize_t ret; - int seg; io_fn_t fn; iov_fn_t fnv; - /* - * SuS says "The readv() function *may* fail if the iovcnt argument - * was less than or equal to 0, or greater than {IOV_MAX}. Linux has - * traditionally returned zero for zero segments, so... - */ - ret = 0; - if (nr_segs == 0) - goto out; - - /* - * First get the "struct iovec" from user memory and - * verify all the pointers - */ - ret = -EINVAL; - if ((nr_segs > UIO_MAXIOV) || (nr_segs <= 0)) - goto out; - if (!file->f_op) + if (!file->f_op) { + ret = -EINVAL; goto out; - if (nr_segs > UIO_FASTIOV) { - ret = -ENOMEM; - iov = kmalloc(nr_segs*sizeof(struct iovec), GFP_KERNEL); - if (!iov) - goto out; } - ret = -EFAULT; - if (copy_from_user(iov, uvector, nr_segs*sizeof(*uvector))) + + ret = rw_copy_check_uvector(uvector, nr_segs, ARRAY_SIZE(iovstack), + iovstack, &iov); + + if (ret < 0) goto out; - /* - * Single unix specification: - * We should -EINVAL if an element length is not >= 0 and fitting an - * ssize_t. 
The total length is fitting an ssize_t - * - * Be careful here because iov_len is a size_t not an ssize_t - */ - tot_len = 0; - ret = -EINVAL; - for (seg = 0; seg < nr_segs; seg++) { - void __user *buf = iov[seg].iov_base; - ssize_t len = (ssize_t)iov[seg].iov_len; - - if (len < 0) /* size_t not fitting an ssize_t .. */ - goto out; - if (unlikely(!access_ok(vrfy_dir(type), buf, len))) - goto Efault; - tot_len += len; - if ((ssize_t)tot_len < 0) /* maths overflow on the ssize_t */ - goto out; - } - if (tot_len == 0) { - ret = 0; + if (ret == 0) goto out; - } + tot_len = ret; ret = rw_verify_area(type, file, pos, tot_len); if (ret < 0) goto out; @@ -610,9 +641,6 @@ out: fsnotify_modify(file->f_dentry); } return ret; -Efault: - ret = -EFAULT; - goto out; } ssize_t vfs_readv(struct file *file, const struct iovec __user *vec, Index: linux-2.6.16-rc5/include/linux/fs.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/fs.h 2006-03-07 13:59:14.000000000 -0800 +++ linux-2.6.16-rc5/include/linux/fs.h 2006-03-07 13:59:17.000000000 -0800 @@ -1050,6 +1050,11 @@ struct inode_operations { struct seq_file; +ssize_t rw_copy_check_uvector(const struct iovec __user * uvector, + unsigned long nr_segs, unsigned long fast_segs, + struct iovec *fast_pointer, + struct iovec **ret_pointer); + extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t vfs_readv(struct file *, const struct iovec __user *, ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 3/3] Zach's core aio changes to support vectored AIO
  2006-03-08  0:24 ` [PATCH 3/3] Zach's core aio changes to support vectored AIO Badari Pulavarty
@ 2006-03-08  3:37   ` Benjamin LaHaise
  2006-03-08 16:34     ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: Benjamin LaHaise @ 2006-03-08 3:37 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Zach Brown, christoph, lkml, linux-fsdevel

On Tue, Mar 07, 2006 at 04:24:19PM -0800, Badari Pulavarty wrote:
> This work is initially done by Zach Brown to add support for
> vectored aio. These are the core changes for AIO to support
> IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.

Please, please, please send patches inline so they can be quoted. In
any case, there's a bug in the PREADV/WRITEV code in that it doesn't
check the selinux security bits for the file.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 3/3] Zach's core aio changes to support vectored AIO
  2006-03-08  3:37   ` Benjamin LaHaise
@ 2006-03-08 16:34     ` Badari Pulavarty
  0 siblings, 0 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-08 16:34 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Zach Brown, christoph, lkml, linux-fsdevel

On Tue, 2006-03-07 at 22:37 -0500, Benjamin LaHaise wrote:
> On Tue, Mar 07, 2006 at 04:24:19PM -0800, Badari Pulavarty wrote:
> > This work is initially done by Zach Brown to add support for
> > vectored aio. These are the core changes for AIO to support
> > IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.
>
> Please, please, please send patches inline so they can be quoted. In
> any case, there's a bug in the PREADV/WRITEV code in that it doesn't
> check the selinux security bits for the file.

Thanks for the review. Here is the latest version with the selinux
security check. Ben, could you review this a little more closely as
I am depending on your AIO expertise ?

Thanks,
Badari

This work is initially done by Zach Brown to add support for
vectored aio. These are the core changes for AIO to support
IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.

I made a few extra changes beyond Zach's work. They are:

- took out aio_pread/aio_pwrite and made them a special case of the
  vectored support
- added a single inlined vector to save on kmalloc() for a simple
  aio_read/aio_write
- kiocb->ki_left always indicates the amount of IO that needs to be
  done. Made sure that this gets set in the sync case also, so that
  we don't need to loop over iovecs to figure out IO size all the
  time.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> fs/aio.c | 165 ++++++++++++++++++++++++++++++++ +--------------- fs/read_write.c | 134 +++++++++++++++++++++++--------------- include/linux/aio.h | 4 + include/linux/aio_abi.h | 2 include/linux/fs.h | 5 + 5 files changed, 206 insertions(+), 104 deletions(-) Index: linux-2.6.16-rc5/fs/aio.c =================================================================== --- linux-2.6.16-rc5.orig/fs/aio.c 2006-03-08 08:03:02.000000000 -0800 +++ linux-2.6.16-rc5/fs/aio.c 2006-03-08 08:12:47.000000000 -0800 @@ -416,6 +416,7 @@ static struct kiocb fastcall *__aio_get_ req->ki_retry = NULL; req->ki_dtor = NULL; req->private = NULL; + req->ki_iovec = NULL; INIT_LIST_HEAD(&req->ki_run_list); /* Check if the completion queue has enough free space to @@ -461,6 +462,8 @@ static inline void really_put_req(struct if (req->ki_dtor) req->ki_dtor(req); + if (req->ki_iovec != &req->ki_inline_vec) + kfree(req->ki_iovec); kmem_cache_free(kiocb_cachep, req); ctx->reqs_active--; @@ -1302,69 +1305,63 @@ asmlinkage long sys_io_destroy(aio_conte return -EINVAL; } -/* - * aio_p{read,write} are the default ki_retry methods for - * IO_CMD_P{READ,WRITE}. They maintains kiocb retry state around potentially - * multiple calls to f_op->aio_read(). They loop around partial progress - * instead of returning -EIOCBRETRY because they don't have the means to call - * kick_iocb(). 
- */ -static ssize_t aio_pread(struct kiocb *iocb) +static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret) { - struct file *file = iocb->ki_filp; - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - ssize_t ret = 0; + struct iovec *iov = &iocb->ki_iovec[iocb->ki_cur_seg]; - do { - iocb->ki_inline_vec.iov_base = iocb->ki_buf; - iocb->ki_inline_vec.iov_len = iocb->ki_left; + BUG_ON(ret <= 0); - ret = file->f_op->aio_read(iocb, &iocb->ki_inline_vec, - 1, iocb->ki_pos); - /* - * Can't just depend on iocb->ki_left to determine - * whether we are done. This may have been a short read. - */ - if (ret > 0) { - iocb->ki_buf += ret; - iocb->ki_left -= ret; + while (iocb->ki_cur_seg < iocb->ki_nr_segs && ret > 0) { + ssize_t this = min(iov->iov_len, (size_t)ret); + iov->iov_base += this; + iov->iov_len -= this; + iocb->ki_left -= this; + ret -= this; + if (iov->iov_len == 0) { + iocb->ki_cur_seg++; + iov++; } + } - /* - * For pipes and sockets we return once we have some data; for - * regular files we retry till we complete the entire read or - * find that we can't read any more data (e.g short reads). 
- */ - } while (ret > 0 && iocb->ki_left > 0 && - !S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode)); - - /* This means we must have transferred all that we could */ - /* No need to retry anymore */ - if ((ret == 0) || (iocb->ki_left == 0)) - ret = iocb->ki_nbytes - iocb->ki_left; - - return ret; + /* the caller should not have done more io than what fit in + * the remaining iovecs */ + BUG_ON(ret > 0 && iocb->ki_left == 0); } -/* see aio_pread() */ -static ssize_t aio_pwrite(struct kiocb *iocb) +static ssize_t aio_rw_vect_retry(struct kiocb *iocb) { struct file *file = iocb->ki_filp; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + ssize_t (*rw_op)(struct kiocb *, const struct iovec *, + unsigned long, loff_t); ssize_t ret = 0; + unsigned short opcode; + + if ((iocb->ki_opcode == IOCB_CMD_PREADV) || + (iocb->ki_opcode == IOCB_CMD_PREAD)) { + rw_op = file->f_op->aio_read; + opcode = IOCB_CMD_PREADV; + } else { + rw_op = file->f_op->aio_write; + opcode = IOCB_CMD_PWRITEV; + } do { - iocb->ki_inline_vec.iov_base = iocb->ki_buf; - iocb->ki_inline_vec.iov_len = iocb->ki_left; + ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg], + iocb->ki_nr_segs - iocb->ki_cur_seg, + iocb->ki_pos); + if (ret > 0) + aio_advance_iovec(iocb, ret); - ret = file->f_op->aio_write(iocb, &iocb->ki_inline_vec, - 1, iocb->ki_pos); - if (ret > 0) { - iocb->ki_buf += ret; - iocb->ki_left -= ret; - } - } while (ret > 0 && iocb->ki_left > 0); + /* retry all partial writes. retry partial reads as long as its a + * regular file. 
 */
+	} while (ret > 0 && iocb->ki_left > 0 &&
+		 (opcode == IOCB_CMD_PWRITEV ||
+		  (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode))));
+
+	/* This means we must have transferred all that we could */
+	/* No need to retry anymore */
 	if ((ret == 0) || (iocb->ki_left == 0))
 		ret = iocb->ki_nbytes - iocb->ki_left;
@@ -1391,6 +1388,38 @@ static ssize_t aio_fsync(struct kiocb *i
 	return ret;
 }
 
+static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb)
+{
+	ssize_t ret;
+
+	ret = rw_copy_check_uvector((struct iovec __user *)kiocb->ki_buf,
+				    kiocb->ki_nbytes, 1,
+				    &kiocb->ki_inline_vec, &kiocb->ki_iovec);
+	if (ret < 0)
+		goto out;
+
+	kiocb->ki_nr_segs = kiocb->ki_nbytes;
+	kiocb->ki_cur_seg = 0;
+	/* ki_nbytes/left now reflect bytes instead of segs */
+	kiocb->ki_nbytes = ret;
+	kiocb->ki_left = ret;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
+{
+	kiocb->ki_iovec = &kiocb->ki_inline_vec;
+	kiocb->ki_iovec->iov_base = kiocb->ki_buf;
+	kiocb->ki_iovec->iov_len = kiocb->ki_left;
+	kiocb->ki_nr_segs = 1;
+	kiocb->ki_cur_seg = 0;
+	kiocb->ki_nbytes = kiocb->ki_left;
+	return 0;
+}
+
 /*
  * aio_setup_iocb:
  *	Performs the initial checks and aio retry method
@@ -1413,9 +1442,12 @@ static ssize_t aio_setup_iocb(struct kio
 		ret = security_file_permission(file, MAY_READ);
 		if (unlikely(ret))
 			break;
+		ret = aio_setup_single_vector(kiocb);
+		if (ret)
+			break;
 		ret = -EINVAL;
 		if (file->f_op->aio_read)
-			kiocb->ki_retry = aio_pread;
+			kiocb->ki_retry = aio_rw_vect_retry;
 		break;
 	case IOCB_CMD_PWRITE:
 		ret = -EBADF;
@@ -1428,9 +1460,40 @@ static ssize_t aio_setup_iocb(struct kio
 		ret = security_file_permission(file, MAY_WRITE);
 		if (unlikely(ret))
 			break;
+		ret = aio_setup_single_vector(kiocb);
+		if (ret)
+			break;
+		ret = -EINVAL;
+		if (file->f_op->aio_write)
+			kiocb->ki_retry = aio_rw_vect_retry;
+		break;
+	case IOCB_CMD_PREADV:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_READ)))
+			break;
+		ret = security_file_permission(file, MAY_READ);
+		if (unlikely(ret))
+			break;
+		ret = aio_setup_vectored_rw(kiocb);
+		if (ret)
+			break;
+		ret = -EINVAL;
+		if (file->f_op->aio_read)
+			kiocb->ki_retry = aio_rw_vect_retry;
+		break;
+	case IOCB_CMD_PWRITEV:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_WRITE)))
+			break;
+		ret = security_file_permission(file, MAY_WRITE);
+		if (unlikely(ret))
+			break;
+		ret = aio_setup_vectored_rw(kiocb);
+		if (ret)
+			break;
 		ret = -EINVAL;
 		if (file->f_op->aio_write)
-			kiocb->ki_retry = aio_pwrite;
+			kiocb->ki_retry = aio_rw_vect_retry;
 		break;
 	case IOCB_CMD_FDSYNC:
 		ret = -EINVAL;
Index: linux-2.6.16-rc5/include/linux/aio.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/aio.h	2006-03-08 08:03:02.000000000 -0800
+++ linux-2.6.16-rc5/include/linux/aio.h	2006-03-08 08:12:47.000000000 -0800
@@ -7,6 +7,7 @@
 #include <linux/uio.h>
 #include <asm/atomic.h>
+#include <linux/uio.h>
 
 #define AIO_MAXSEGS		4
 #define AIO_KIOGRP_NR_ATOMIC	8
@@ -114,6 +115,9 @@ struct kiocb {
 	long			ki_kicked;	/* just for testing */
 	long			ki_queued;	/* just for testing */
 	struct iovec		ki_inline_vec;	/* inline vector */
+	struct iovec		*ki_iovec;
+	unsigned long		ki_nr_segs;
+	unsigned long		ki_cur_seg;
 
 	struct list_head	ki_list;	/* the aio core uses this
 						 * for cancellation */
Index: linux-2.6.16-rc5/include/linux/aio_abi.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/aio_abi.h	2006-03-08 08:03:02.000000000 -0800
+++ linux-2.6.16-rc5/include/linux/aio_abi.h	2006-03-08 08:12:47.000000000 -0800
@@ -41,6 +41,8 @@ enum {
 	 * IOCB_CMD_POLL = 5,
 	 */
 	IOCB_CMD_NOOP = 6,
+	IOCB_CMD_PREADV = 7,
+	IOCB_CMD_PWRITEV = 8,
 };
 
 /* read() from /dev/aio returns these structures.
 */
Index: linux-2.6.16-rc5/fs/read_write.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/read_write.c	2006-03-08 08:12:37.000000000 -0800
+++ linux-2.6.16-rc5/fs/read_write.c	2006-03-08 08:12:47.000000000 -0800
@@ -509,72 +509,103 @@ ssize_t do_loop_readv_writev(struct file
 /* A write operation does a read from user space and vice versa */
 #define vrfy_dir(type) ((type) == READ ? VERIFY_WRITE : VERIFY_READ)
 
+ssize_t rw_copy_check_uvector(const struct iovec __user * uvector,
+			      unsigned long nr_segs, unsigned long fast_segs,
+			      struct iovec *fast_pointer,
+			      struct iovec **ret_pointer)
+{
+	unsigned long seg;
+	ssize_t ret;
+	struct iovec *iov = fast_pointer;
+
+	/*
+	 * SuS says "The readv() function *may* fail if the iovcnt argument
+	 * was less than or equal to 0, or greater than {IOV_MAX}. Linux has
+	 * traditionally returned zero for zero segments, so...
+	 */
+	if (nr_segs == 0) {
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * First get the "struct iovec" from user memory and
+	 * verify all the pointers
+	 */
+	if ((nr_segs > UIO_MAXIOV) || (nr_segs <= 0)) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (nr_segs > fast_segs) {
+		iov = kmalloc(nr_segs*sizeof(struct iovec), GFP_KERNEL);
+		if (iov == NULL) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+	if (copy_from_user(iov, uvector, nr_segs*sizeof(*uvector))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/*
+	 * According to the Single Unix Specification we should return EINVAL
+	 * if an element length is < 0 when cast to ssize_t or if the
+	 * total length would overflow the ssize_t return value of the
+	 * system call.
+	 */
+	ret = 0;
+	for (seg = 0; seg < nr_segs; seg++) {
+		void __user *buf = iov[seg].iov_base;
+		ssize_t len = (ssize_t)iov[seg].iov_len;
+
+		/* see if we're about to use an invalid len or if
+		 * it's about to overflow ssize_t */
+		if (len < 0 || (ret + len < ret)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		if (unlikely(!access_ok(vrfy_dir(type), buf, len))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		ret += len;
+	}
+out:
+	*ret_pointer = iov;
+	return ret;
+}
+
+/* A write operation does a read from user space and vice versa */
+#define vrfy_dir(type) ((type) == READ ? VERIFY_WRITE : VERIFY_READ)
+
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
 			       unsigned long nr_segs, loff_t *pos)
 {
-	size_t tot_len;
+	ssize_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	ssize_t ret;
-	int seg;
 	io_fn_t fn;
 	iov_fn_t fnv;
 
-	/*
-	 * SuS says "The readv() function *may* fail if the iovcnt argument
-	 * was less than or equal to 0, or greater than {IOV_MAX}. Linux has
-	 * traditionally returned zero for zero segments, so...
-	 */
-	ret = 0;
-	if (nr_segs == 0)
-		goto out;
-
-	/*
-	 * First get the "struct iovec" from user memory and
-	 * verify all the pointers
-	 */
-	ret = -EINVAL;
-	if ((nr_segs > UIO_MAXIOV) || (nr_segs <= 0))
-		goto out;
-	if (!file->f_op)
+	if (!file->f_op) {
+		ret = -EINVAL;
 		goto out;
-	if (nr_segs > UIO_FASTIOV) {
-		ret = -ENOMEM;
-		iov = kmalloc(nr_segs*sizeof(struct iovec), GFP_KERNEL);
-		if (!iov)
-			goto out;
 	}
-	ret = -EFAULT;
-	if (copy_from_user(iov, uvector, nr_segs*sizeof(*uvector)))
+
+	ret = rw_copy_check_uvector(uvector, nr_segs, ARRAY_SIZE(iovstack),
+				    iovstack, &iov);
+
+	if (ret < 0)
 		goto out;
 
-	/*
-	 * Single unix specification:
-	 * We should -EINVAL if an element length is not >= 0 and fitting an
-	 * ssize_t. The total length is fitting an ssize_t
-	 *
-	 * Be careful here because iov_len is a size_t not an ssize_t
-	 */
-	tot_len = 0;
-	ret = -EINVAL;
-	for (seg = 0; seg < nr_segs; seg++) {
-		void __user *buf = iov[seg].iov_base;
-		ssize_t len = (ssize_t)iov[seg].iov_len;
-
-		if (len < 0)	/* size_t not fitting an ssize_t .. */
-			goto out;
-		if (unlikely(!access_ok(vrfy_dir(type), buf, len)))
-			goto Efault;
-		tot_len += len;
-		if ((ssize_t)tot_len < 0) /* maths overflow on the ssize_t */
-			goto out;
-	}
-	if (tot_len == 0) {
-		ret = 0;
+	if (ret == 0)
 		goto out;
-	}
+
+	tot_len = ret;
 	ret = rw_verify_area(type, file, pos, tot_len);
 	if (ret < 0)
 		goto out;
@@ -606,9 +637,6 @@ out:
 		fsnotify_modify(file->f_dentry);
 	}
 	return ret;
-Efault:
-	ret = -EFAULT;
-	goto out;
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
Index: linux-2.6.16-rc5/include/linux/fs.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/fs.h	2006-03-08 08:10:55.000000000 -0800
+++ linux-2.6.16-rc5/include/linux/fs.h	2006-03-08 08:12:47.000000000 -0800
@@ -1050,6 +1050,11 @@ struct inode_operations {
 
 struct seq_file;
 
+ssize_t rw_copy_check_uvector(const struct iovec __user * uvector,
+			      unsigned long nr_segs, unsigned long fast_segs,
+			      struct iovec *fast_pointer,
+			      struct iovec **ret_pointer);
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support
  2006-03-08  0:19 [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support Badari Pulavarty
                   ` (2 preceding siblings ...)
  2006-03-08  0:24 ` [PATCH 3/3] Zach's core aio changes to support vectored AIO Badari Pulavarty
@ 2006-03-08 12:47 ` christoph
  2006-03-08 16:24   ` Badari Pulavarty
  2006-03-09 16:17   ` ext3_ordered_writepage() questions Badari Pulavarty
  3 siblings, 2 replies; 58+ messages in thread
From: christoph @ 2006-03-08 12:47 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Zach Brown, christoph, lkml, linux-fsdevel

On Tue, Mar 07, 2006 at 04:19:59PM -0800, Badari Pulavarty wrote:
> Hi,
>
> These series of changes collapses all the vectored IO support
> into single file-operation method using aio_read/aio_write.
>
> This work was originally suggested & started by Christoph Hellwig,
> when Zach Brown tried to add vectored support for AIO.
>
> Christoph & Zach, comments/suggestions ? If you are happy with the
> work, can you add your Sign-off or Ack ? I addressed all the
> known issues, please review.

the first two patches are fine with me, they're basically my patches
with the bugs fixed and the missing conversions done, so they must be
good ;-)

can't really comment on the third one because I don't understand the
aio internals well enough.

Once this goes to -mm we should add a third patch to kill
generic_file_read/generic_file_write and convert all filesystems to the
aio/vectored variant and use do_sync_read/do_sync_write for
.read/.write. The major syscalls use the aio_ variant directly anyway;
this is only needed for some special cases like the ELF loader.
Removing generic_file_read/generic_file_write will finally cut
filemap.c back to a sane size.

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support
  2006-03-08 12:47 ` [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support christoph
@ 2006-03-08 16:24   ` Badari Pulavarty
  2006-03-09 16:17   ` ext3_ordered_writepage() questions Badari Pulavarty
  1 sibling, 0 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-08 16:24 UTC (permalink / raw)
  To: christoph; +Cc: Zach Brown, lkml, linux-fsdevel, bcrl

On Wed, 2006-03-08 at 13:47 +0100, christoph wrote:
> On Tue, Mar 07, 2006 at 04:19:59PM -0800, Badari Pulavarty wrote:
> > Hi,
> >
> > These series of changes collapses all the vectored IO support
> > into single file-operation method using aio_read/aio_write.
> >
> > This work was originally suggested & started by Christoph Hellwig,
> > when Zach Brown tried to add vectored support for AIO.
> >
> > Christoph & Zach, comments/suggestions ? If you are happy with the
> > work, can you add your Sign-off or Ack ? I addressed all the
> > known issues, please review.
>
> the first two patches are fine with me, they're basically my patches
> with the bugs fixed and the missing conversions done, so they must be
> good ;-)

:)

I rewrote the usb/gadget/inode.c ep_aio_* support 3 times. I am still
not sure if I got it right. Can you review them ? I have no way to test
them.

> can't really comment on the third one because I don't understand the
> aio internals good enough.

Zach okayed the changes. I am depending on BenL's expertise to review
them a little more closely, before pushing to -mm.

> Onced this goes to -mm we should add a third patch to kill
> generic_file_read/generic_file_write and convert all filesystems to the
> aio/vectored variant and use do_sync_read/do_sync_write for
> .read/.write. The major syscalls use the aio_ variant directly anyway,
> this is only needed for some special cases like the ELF loader.
> Removing generic_file_read/generic_file_write will finally cut filemap.c
> back to a sane size.

Yep.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* ext3_ordered_writepage() questions
  2006-03-08 12:47 ` [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support christoph
  2006-03-08 16:24   ` Badari Pulavarty
@ 2006-03-09 16:17   ` Badari Pulavarty
  2006-03-09 23:35     ` Andrew Morton
  1 sibling, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-09 16:17 UTC (permalink / raw)
  To: akpm; +Cc: sct, lkml, linux-fsdevel, jack

Hi,

I am trying to clean up the ext3_ordered and ext3_writeback_writepage()
routines. I am confused about what ext3_ordered_writepage() is currently
doing ? I hope you can help me understand it a little better.

1) Why do we do journal_start() all the time ?

I was hoping to skip journal_start()/stop() if the blocks are already
allocated. Since we allocate blocks in prepare_write() for most cases
(non-mapped writes), I was hoping to avoid the whole journal stuff in
writepage() if the blocks are already there. (We can check the buffers
attached to the page and find out whether they are mapped or not.)

2) Why do we call journal_dirty_data_fn() on the buffers ?

We already issued IO on all those buffers in block_write_full_page().
Why do we need to add them to the transaction ? I understand we need to
do this for the block allocation case. But do we need it for the
non-allocation case also ?

Can we skip the whole journal_start, journal_dirty_data, journal_stop
sequence for non-allocation cases ? I have coded it up to do so, but I
am confused about what I am missing here. Please let me know.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-09 16:17 ` ext3_ordered_writepage() questions Badari Pulavarty
@ 2006-03-09 23:35   ` Andrew Morton
  2006-03-10  0:36     ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2006-03-09 23:35 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: sct, linux-kernel, linux-fsdevel, jack

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Hi,
>
> I am trying to cleanup ext3_ordered and ext3_writeback_writepage() routines.
> I am confused on what ext3_ordered_writepage() is currently doing ? I hope
> you can help me understand it little better.
>
> 1) Why do we do journal_start() all the time ?

Because we're lame.

> 2) Why do we call journal_dirty_data_fn() on the buffers ? We already
> issued IO on all those buffers() in block_full_write_page(). Why do we
> need to add them to transaction ? I understand we need to do this for
> block allocation case. But do we need it for non-allocation case also ?

Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
file-data buffers prior to journalling the metadata. If we didn't run
journal_dirty_data_fn() against those buffers then they'd still be under
I/O after commit had completed. Consequently a crash+recover would
occasionally allow a read() to read uninitialised data blocks - those
blocks for which we had a) started the I/O, b) journalled the metadata
which refers to that block and c) not yet completed the I/O when the
crash happened.

Now, if the write was known to be an overwrite then we know that the
block isn't uninitialised, so we could perhaps avoid writing that block
in the next commit - just let pdflush handle it. We'd need to work out
whether a particular block has initialised data on-disk under it when we
dirty it, then track that all the way through to writepage.

It's all possible, and adds a significant semantic change to
ordered-mode. If that change offered significant performance benefits
then yeah, we could scratch our heads over it.

But I think if you're looking for CPU consumption reductions, you'd be
much better off attacking prepare_write() and commit_write(), rather
than writepage().

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-09 23:35 ` Andrew Morton
@ 2006-03-10  0:36   ` Badari Pulavarty
  2006-03-16 18:09     ` Theodore Ts'o
  0 siblings, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-10  0:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: sct, linux-kernel, linux-fsdevel, jack

Andrew Morton wrote:

>Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
>>Hi,
>>
>>I am trying to cleanup ext3_ordered and ext3_writeback_writepage() routines.
>>I am confused on what ext3_ordered_writepage() is currently doing ? I hope
>>you can help me understand it little better.
>>
>>1) Why do we do journal_start() all the time ?
>
>Because we're lame.
>
>>2) Why do we call journal_dirty_data_fn() on the buffers ? We already
>>issued IO on all those buffers() in block_full_write_page(). Why do we
>>need to add them to transaction ? I understand we need to do this for
>>block allocation case. But do we need it for non-allocation case also ?
>
>Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
>file-data buffers prior to journalling the metadata. If we didn't run
>journal_dirty_data_fn() against those buffers then they'd still be under
>I/O after commit had completed.

In the non-block-allocation case, what metadata are we journaling in
writepage() ? Block allocation happened in prepare_write() and
commit_write() journaled the transaction. All the metadata updates
should be done there. What JBD commit are you referring to here ?

>But I think if you're looking for CPU consumption reductions, you'd be much
>better off attacking prepare_write() and commit_write(), rather than
>writepage().

Yes. You are right. I never realized that we call
prepare_write()/commit_write() for each write. I was under the
impression that we do it only on the first instantiation of the page.
I will take a closer look at it.

The reasons for looking at writepage() are:

- want to support writepages() for ext3. Last time when I tried, I ran
  into a page->lock and journal_start() deadlock. That's why I want to
  understand the journalling better and clean it up while looking at it.

- eventually, I want to add delayed allocation to make use of multiblock
  allocation. Right now, we can't make use of multiblock allocation for
  buffered mode :(

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-10  0:36 ` Badari Pulavarty
@ 2006-03-16 18:09   ` Theodore Ts'o
  2006-03-16 18:22     ` Badari Pulavarty
  2006-03-17 15:32     ` Jamie Lokier
  0 siblings, 2 replies; 58+ messages in thread
From: Theodore Ts'o @ 2006-03-16 18:09 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Andrew Morton, sct, linux-kernel, linux-fsdevel, jack

> >Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
> >file-data buffers prior to journalling the metadata. If we didn't run
> >journal_dirty_data_fn() against those buffers then they'd still be under
> >I/O after commit had completed.
>
> In non-block allocation case, what metadata are we journaling in
> writepage() ?
> block allocation happend in prepare_write() and commit_write() journaled the
> transaction. All the meta data updates should be done there. What JBD
> commit are you refering to here ?

Basically, this boils down to: what is our definition of ordered-mode?

If the goal is to make sure we avoid the security exposure of
allocating a block and then crashing before we write the data block,
potentially exposing previously written data that might belong to
another user, then what Badari is suggesting would avoid this
particular problem.

However, if what we are doing is overwriting our own data with an
updated, more recent version of the data block, do we guarantee that
any ordering semantics apply? For example, what if we write a data
block, and then follow it up with some kind of metadata update (say we
touch atime, or add an extended attribute). Do we guarantee that if
the metadata update is committed, the data block will have made it to
disk as well? Today that is the way things work, but is that
guarantee part of the contract of ordered-mode?

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-16 18:09 ` Theodore Ts'o
@ 2006-03-16 18:22   ` Badari Pulavarty
  2006-03-16 21:04     ` Theodore Ts'o
  2006-03-17 15:32     ` Jamie Lokier
  1 sibling, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-16 18:22 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andrew Morton, sct, lkml, linux-fsdevel, jack

On Thu, 2006-03-16 at 13:09 -0500, Theodore Ts'o wrote:
> > >Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
> > >file-data buffers prior to journalling the metadata. If we didn't run
> > >journal_dirty_data_fn() against those buffers then they'd still be under
> > >I/O after commit had completed.
> >
> > In non-block allocation case, what metadata are we journaling in
> > writepage() ?
> > block allocation happend in prepare_write() and commit_write() journaled the
> > transaction. All the meta data updates should be done there. What JBD
> > commit are you refering to here ?
>
> Basically, this boils down to what is our definition of ordered-mode?
>
> If the goal is to make sure we avoid the security exposure of
> allocating a block and then crashing before we write the data block,
> potentially exposing previously written data that might be belong to
> another user, then what Badari is suggesting would avoid this
> particular problem.

Yes. If block allocation is needed, my patch is basically a no-op; we
go through the regular code.

> However, if what we are doing is overwriting our own data with more an
> updated, more recent version of the data block, do we guarantee that
> any ordering semantics apply? For example, what if we write a data
> block, and then follow it up with some kind of metadata update (say we
> touch atime, or add an extended attribute). Do we guarantee that if
> the metadata update is committed, that the data block will have made
> it to disk as well?

I don't see how we do this today. Yes, metadata updates are journalled,
but I don't see how the current practice of adding buffers through the
journal_dirty_data(bh) call can guarantee that these buffers get added
to the metadata-update transaction.

> Today that is the way things work, but is that
> guarantee part of the contract of ordered-mode?

BTW, thanks Ted for putting this in human-readable terms :)

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-16 18:22 ` Badari Pulavarty
@ 2006-03-16 21:04   ` Theodore Ts'o
  2006-03-16 21:57     ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2006-03-16 21:04 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Andrew Morton, sct, lkml, linux-fsdevel, jack

On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > However, if what we are doing is overwriting our own data with more an
> > updated, more recent version of the data block, do we guarantee that
> > any ordering semantics apply? For example, what if we write a data
> > block, and then follow it up with some kind of metadata update (say we
> > touch atime, or add an extended attribute). Do we guarantee that if
> > the metadata update is committed, that the data block will have made
> > it to disk as well?
>
> I don't see how we do this today. Yes. Metadata updates are jounalled,
> but I don't see how current adding buffers through journal_dirty_data
> (bh) call can guarantee that these buffers get added to metadata-update
> transaction ?

Even though there aren't any updates to any metadata blocks that take
place between the journal_start() and journal_stop() calls, if
journal_dirty_data() is called (for example in ordered_writepage),
those buffers will be associated with the currently open transaction,
so they will be guaranteed to be written before the transaction is
allowed to commit.

Remember, journal_start and journal_stop do not delineate a full
ext3/jbd transaction, but rather an operation, where a large number of
operations are bundled together to form a transaction. When you call
journal_start, and request a certain number of credits (the number of
buffers that you maximally intend to dirty), that opens up an
operation. If the operation turns out not to dirty any metadata
blocks at the time of journal_stop(), all of the credits that were
reserved by journal_start() are returned to the currently open
transaction. However, any data blocks which are marked via
journal_dirty_data() are still going to be associated with the
currently open transaction, and they will still be forced out before
the transaction is allowed to commit.

Does that make sense?

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-16 21:04 ` Theodore Ts'o
@ 2006-03-16 21:57   ` Badari Pulavarty
  2006-03-16 22:05     ` Jan Kara
  0 siblings, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-16 21:57 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andrew Morton, sct, lkml, linux-fsdevel, jack

On Thu, 2006-03-16 at 16:04 -0500, Theodore Ts'o wrote:
> On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > > However, if what we are doing is overwriting our own data with more an
> > > updated, more recent version of the data block, do we guarantee that
> > > any ordering semantics apply? For example, what if we write a data
> > > block, and then follow it up with some kind of metadata update (say we
> > > touch atime, or add an extended attribute). Do we guarantee that if
> > > the metadata update is committed, that the data block will have made
> > > it to disk as well?
> >
> > I don't see how we do this today. Yes. Metadata updates are jounalled,
> > but I don't see how current adding buffers through journal_dirty_data
> > (bh) call can guarantee that these buffers get added to metadata-update
> > transaction ?
>
> Even though there aren't any updates to any metadata blocks that take
> place between the journal_start() and journal_stop() calls, if
> journal_dirty_data() is called (for example in ordered_writepage),
> those buffers will be associated with the currently open transaction,
> so they will be guaranteed to be written before the transaction is
> allowed to commit.
>
> Remember, journal_start and journal_stop do not delineate a full
> ext3/jbd transaction, but rather an operation, where a large number of
> operations are bundled together to form a transaction. When you call
> journal_start, and request a certain number of credits (number of
> buffers that you maximally intend to dirty), that opens up an
> operation. If the operation turns out not to dirty any metadata
> blocks at the time of journal_stop(), all of the credits that were
> reserved by jouranl_start() are returned to the currently open
> transaction. However, any data blocks which are marked via
> journal_dirty_data() are still going to be associated with the
> currently open transaction, and they will still be forced out before
> the transaction is allowed to commit.
>
> Does that make sense?

Makes perfect sense, except that it doesn't match what I see through
"debugfs" - logdump :(

I wrote a testcase to re-write the same blocks again & again - there is
absolutely nothing showing up in the log. Which implied to me that all
the journal_dirty_data calls we did on those buffers did nothing -
since there is no current transaction. What am I missing ?

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-16 21:57 ` Badari Pulavarty
@ 2006-03-16 22:05   ` Jan Kara
  2006-03-16 23:45     ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2006-03-16 22:05 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Theodore Ts'o, Andrew Morton, sct, lkml, linux-fsdevel

> On Thu, 2006-03-16 at 16:04 -0500, Theodore Ts'o wrote:
> > On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > > > However, if what we are doing is overwriting our own data with more an
> > > > updated, more recent version of the data block, do we guarantee that
> > > > any ordering semantics apply? For example, what if we write a data
> > > > block, and then follow it up with some kind of metadata update (say we
> > > > touch atime, or add an extended attribute). Do we guarantee that if
> > > > the metadata update is committed, that the data block will have made
> > > > it to disk as well?
> > >
> > > I don't see how we do this today. Yes. Metadata updates are jounalled,
> > > but I don't see how current adding buffers through journal_dirty_data
> > > (bh) call can guarantee that these buffers get added to metadata-update
> > > transaction ?
> >
> > Even though there aren't any updates to any metadata blocks that take
> > place between the journal_start() and journal_stop() calls, if
> > journal_dirty_data() is called (for example in ordered_writepage),
> > those buffers will be associated with the currently open transaction,
> > so they will be guaranteed to be written before the transaction is
> > allowed to commit.
> >
> > Remember, journal_start and journal_stop do not delineate a full
> > ext3/jbd transaction, but rather an operation, where a large number of
> > operations are bundled together to form a transaction. When you call
> > journal_start, and request a certain number of credits (number of
> > buffers that you maximally intend to dirty), that opens up an
> > operation. If the operation turns out not to dirty any metadata
> > blocks at the time of journal_stop(), all of the credits that were
> > reserved by jouranl_start() are returned to the currently open
> > transaction. However, any data blocks which are marked via
> > journal_dirty_data() are still going to be associated with the
> > currently open transaction, and they will still be forced out before
> > the transaction is allowed to commit.
> >
> > Does that make sense?
>
> Makes perfect sense, except that it doesn't match what I see through
> "debugfs" - logdump :(
>
> I wrote a testcase to re-write same blocks again & again - there is
> absolutely nothing showed up in log. Which implied to me that, all
> the jorunal_dirty_data we did on all those buffers, did nothing -
> since there is no current transaction. What am I missing ?

The data buffers are not journaled. The buffers are just attached to the
transaction and when the transaction is committed, they are written
directly to their final location. This ensures the ordering but no data
goes via the log... I guess you should see empty transactions in the log
which are eventually committed when they become too old.

								Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-16 22:05 ` Jan Kara
@ 2006-03-16 23:45   ` Badari Pulavarty
  2006-03-17  0:44     ` Theodore Ts'o
  2006-03-17  0:54     ` Andreas Dilger
  1 sibling, 2 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-03-16 23:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, Andrew Morton, sct, lkml, linux-fsdevel

On Thu, 2006-03-16 at 23:05 +0100, Jan Kara wrote:
> > On Thu, 2006-03-16 at 16:04 -0500, Theodore Ts'o wrote:
> > > On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > > > > However, if what we are doing is overwriting our own data with more an
> > > > > updated, more recent version of the data block, do we guarantee that
> > > > > any ordering semantics apply? For example, what if we write a data
> > > > > block, and then follow it up with some kind of metadata update (say we
> > > > > touch atime, or add an extended attribute). Do we guarantee that if
> > > > > the metadata update is committed, that the data block will have made
> > > > > it to disk as well?
> > > >
> > > > I don't see how we do this today. Yes. Metadata updates are jounalled,
> > > > but I don't see how current adding buffers through journal_dirty_data
> > > > (bh) call can guarantee that these buffers get added to metadata-update
> > > > transaction ?
> > >
> > > Even though there aren't any updates to any metadata blocks that take
> > > place between the journal_start() and journal_stop() calls, if
> > > journal_dirty_data() is called (for example in ordered_writepage),
> > > those buffers will be associated with the currently open transaction,
> > > so they will be guaranteed to be written before the transaction is
> > > allowed to commit.
> > >
> > > Remember, journal_start and journal_stop do not delineate a full
> > > ext3/jbd transaction, but rather an operation, where a large number of
> > > operations are bundled together to form a transaction. When you call
> > > journal_start, and request a certain number of credits (number of
> > > buffers that you maximally intend to dirty), that opens up an
> > > operation. If the operation turns out not to dirty any metadata
> > > blocks at the time of journal_stop(), all of the credits that were
> > > reserved by jouranl_start() are returned to the currently open
> > > transaction. However, any data blocks which are marked via
> > > journal_dirty_data() are still going to be associated with the
> > > currently open transaction, and they will still be forced out before
> > > the transaction is allowed to commit.
> > >
> > > Does that make sense?
> >
> > Makes perfect sense, except that it doesn't match what I see through
> > "debugfs" - logdump :(
> >
> > I wrote a testcase to re-write same blocks again & again - there is
> > absolutely nothing showed up in log. Which implied to me that, all
> > the jorunal_dirty_data we did on all those buffers, did nothing -
> > since there is no current transaction. What am I missing ?
>
> The data buffers are not journaled. The buffers are just attached to the
> transaction and when the transaction is committed, they are written
> directly to their final location. This ensures the ordering but no data
> goes via the log... I guess you should see empty transactions in the log
> which are eventually commited when they become too old.

Yep. I wasn't expecting to see buffers in the transaction/log. I was
expecting to see some "dummy" transaction to which these buffers are
attached to provide ordering (even though we are not doing metadata
updates). In fact, I was expecting to see the "ctime" update in the
transaction.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions
  2006-03-16 23:45 ` Badari Pulavarty
@ 2006-03-17  0:44   ` Theodore Ts'o
  0 siblings, 0 replies; 58+ messages in thread
From: Theodore Ts'o @ 2006-03-17  0:44 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Jan Kara, Andrew Morton, sct, lkml, linux-fsdevel

On Thu, Mar 16, 2006 at 03:45:21PM -0800, Badari Pulavarty wrote:
> Yep. I wasn't expecting to see buffers in the transaction/log. I was
> expecting to see some "dummy" transaction - which these buffers are
> attached to provide ordering. (even though we are not doing metadata
> updates). In fact, I was expecting to see "ctime" update in the
> transaction.

What you're missing is that journal_start() and journal_stop() don't
create a transaction. They delimit an operation, yes, but multiple
operations are grouped together to form a transaction. Transactions
are only closed after the commit_interval expires, or if the journal
runs out of space. So you're not going to see a dummy transaction
which the buffers are attached to; instead, all of the various
operations happening within the commit_interval are grouped into a
single transaction.

This is all explained in Stephen's paper; see page #4 in:

http://ftp.kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz

- Ted

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-16 23:45 ` Badari Pulavarty 2006-03-17 0:44 ` Theodore Ts'o @ 2006-03-17 0:54 ` Andreas Dilger 2006-03-17 17:05 ` Stephen C. Tweedie 1 sibling, 1 reply; 58+ messages in thread From: Andreas Dilger @ 2006-03-17 0:54 UTC (permalink / raw) To: Badari Pulavarty Cc: Jan Kara, Theodore Ts'o, Andrew Morton, sct, lkml, linux-fsdevel On Mar 16, 2006 15:45 -0800, Badari Pulavarty wrote: > On Thu, 2006-03-16 at 23:05 +0100, Jan Kara wrote: > > The data buffers are not journaled. The buffers are just attached to the > > transaction and when the transaction is committed, they are written > > directly to their final location. This ensures the ordering but no data > > goes via the log... I guess you should see empty transactions in the log > > which are eventually commited when they become too old. > > Yep. I wasn't expecting to see buffers in the transaction/log. I was > expecting to see some "dummy" transaction - which these buffers are > attached to provide ordering. (even though we are not doing metadata > updates). In fact, I was expecting to see "ctime" update in the > transaction. The U. Wisconsin group that was doing journal-guided fast RAID resync actually ended up putting dummy transactions into the logs for this with the block numbers even if there were no metadata changes. That way the journal could tell the MD RAID layer what blocks might need resyncing instead of having to scan the whole block device for inconsistencies. That code hasn't been merged, or even posted anywhere yet AFAICS, though I'd be very interested in seeing it. It changes MD RAID recovery time from O(device size) to O(journal size), and that is a huge deal when you have an 8TB filesystem. As for the ctime update, I'm not sure what happened to that, though ext3 would currently only update the inode at most once a second. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 0:54 ` Andreas Dilger @ 2006-03-17 17:05 ` Stephen C. Tweedie 2006-03-17 21:32 ` Badari Pulavarty 0 siblings, 1 reply; 58+ messages in thread From: Stephen C. Tweedie @ 2006-03-17 17:05 UTC (permalink / raw) To: Andreas Dilger Cc: Badari Pulavarty, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel, Stephen Tweedie Hi, On Thu, 2006-03-16 at 17:54 -0700, Andreas Dilger wrote: > That way the journal could tell the MD RAID layer what blocks might > need resyncing instead of having to scan the whole block device for > inconsistencies. The current md layer supports write-intent bitmaps to deal with this. --Stephen ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 17:05 ` Stephen C. Tweedie @ 2006-03-17 21:32 ` Badari Pulavarty 2006-03-17 22:22 ` Stephen C. Tweedie 2006-03-18 3:02 ` Suparna Bhattacharya 0 siblings, 2 replies; 58+ messages in thread From: Badari Pulavarty @ 2006-03-17 21:32 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Andreas Dilger, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel Hi Stephen, Now that we got your attention, I am wondering what's your opinion on this ? I have a patch which eliminates adding buffers to the journal if we are doing just a re-write of the disk block. In theory, it should be fine - but it does change the current behavior for ordered mode writes. I guess the current code adds the buffers to the journal so that any metadata updates to any file in the filesystem happen in the journal - which guarantees our buffers are flushed out before that transaction completes. My patch *breaks* that guarantee. But it provides a significant improvement for the re-write case. My micro benchmark shows: 2.6.16-rc6 2.6.16-rc6+patch real 0m6.606s 0m3.705s user 0m0.124s 0m0.108s sys 0m6.456s 0m3.600s In the real world, does this ordering guarantee matter ? Waiting for your advice. Thanks, Badari Make use of PageMappedToDisk(page) to find out if we need to do block allocation, and skip the calls to it if not needed. When we are not doing block allocation, also avoid the calls to journal start and adding buffers to the transaction. 
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Index: linux-2.6.16-rc6/fs/buffer.c =================================================================== --- linux-2.6.16-rc6.orig/fs/buffer.c 2006-03-11 14:12:55.000000000 -0800 +++ linux-2.6.16-rc6/fs/buffer.c 2006-03-16 08:22:37.000000000 -0800 @@ -2029,6 +2029,7 @@ static int __block_commit_write(struct i int partial = 0; unsigned blocksize; struct buffer_head *bh, *head; + int fullymapped = 1; blocksize = 1 << inode->i_blkbits; @@ -2043,6 +2044,8 @@ static int __block_commit_write(struct i set_buffer_uptodate(bh); mark_buffer_dirty(bh); } + if (!buffer_mapped(bh)) + fullymapped = 0; } /* @@ -2053,6 +2056,9 @@ static int __block_commit_write(struct i */ if (!partial) SetPageUptodate(page); + + if (fullymapped) + SetPageMappedToDisk(page); return 0; } Index: linux-2.6.16-rc6/fs/ext3/inode.c =================================================================== --- linux-2.6.16-rc6.orig/fs/ext3/inode.c 2006-03-11 14:12:55.000000000 -0800 +++ linux-2.6.16-rc6/fs/ext3/inode.c 2006-03-15 13:30:04.000000000 -0800 @@ -999,6 +999,12 @@ static int ext3_prepare_write(struct fil handle_t *handle; int retries = 0; + /* + * If the page is already mapped to disk and we are not + * journalling the data - there is nothing to do. + */ + if (PageMappedToDisk(page) && !ext3_should_journal_data(inode)) + return 0; retry: handle = ext3_journal_start(inode, needed_blocks); if (IS_ERR(handle)) { @@ -1059,8 +1065,14 @@ static int ext3_ordered_commit_write(str struct inode *inode = page->mapping->host; int ret = 0, ret2; - ret = walk_page_buffers(handle, page_buffers(page), - from, to, NULL, ext3_journal_dirty_data); + /* + * If the page is already mapped to disk, we won't have + * a handle - which means no metadata updates are needed. + * So, no need to add buffers to the transaction. 
+ */ + if (handle) + ret = walk_page_buffers(handle, page_buffers(page), + from, to, NULL, ext3_journal_dirty_data); if (ret == 0) { /* @@ -1075,9 +1087,11 @@ static int ext3_ordered_commit_write(str EXT3_I(inode)->i_disksize = new_i_size; ret = generic_commit_write(file, page, from, to); } - ret2 = ext3_journal_stop(handle); - if (!ret) - ret = ret2; + if (handle) { + ret2 = ext3_journal_stop(handle); + if (!ret) + ret = ret2; + } return ret; } @@ -1098,9 +1112,11 @@ static int ext3_writeback_commit_write(s else ret = generic_commit_write(file, page, from, to); - ret2 = ext3_journal_stop(handle); - if (!ret) - ret = ret2; + if (handle) { + ret2 = ext3_journal_stop(handle); + if (!ret) + ret = ret2; + } return ret; } @@ -1278,6 +1294,14 @@ static int ext3_ordered_writepage(struct if (ext3_journal_current_handle()) goto out_fail; + /* + * If the page is mapped to disk, just do the IO + */ + if (PageMappedToDisk(page)) { + ret = block_write_full_page(page, ext3_get_block, wbc); + goto out; + } + handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); if (IS_ERR(handle)) { @@ -1318,6 +1342,7 @@ static int ext3_ordered_writepage(struct err = ext3_journal_stop(handle); if (!ret) ret = err; +out: return ret; out_fail: @@ -1337,10 +1362,13 @@ static int ext3_writeback_writepage(stru if (ext3_journal_current_handle()) goto out_fail; - handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); - if (IS_ERR(handle)) { - ret = PTR_ERR(handle); - goto out_fail; + if (!PageMappedToDisk(page)) { + handle = ext3_journal_start(inode, + ext3_writepage_trans_blocks(inode)); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + goto out_fail; + } } if (test_opt(inode->i_sb, NOBH)) @@ -1348,9 +1376,11 @@ static int ext3_writeback_writepage(stru else ret = block_write_full_page(page, ext3_get_block, wbc); - err = ext3_journal_stop(handle); - if (!ret) - ret = err; + if (handle) { + err = ext3_journal_stop(handle); + if (!ret) + ret = err; + } return ret; 
out_fail: ^ permalink raw reply [flat|nested] 58+ messages in thread
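[Editorial note] As a reference point, the rewrite micro-benchmark described above presumably looks something like the following sketch. This is a guess at the shape of the test, not Badari's actual code; the file path and 4KiB block size are arbitrary choices.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Overwrite the same 4KiB block of a file n times. After the first
 * pass the block is already mapped to disk, so with the patch above
 * ext3 can skip journal_start()/journal_dirty_data() on every
 * subsequent write. Returns 0 on success, -1 on error.
 */
int rewrite_block(const char *path, int n)
{
    char buf[4096];
    int i, fd;

    fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    memset(buf, 'x', sizeof(buf));
    for (i = 0; i < n; i++) {
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}
```

Timing such a loop with and without the patch is the kind of measurement that would produce the real/user/sys numbers quoted above.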
* Re: ext3_ordered_writepage() questions 2006-03-17 21:32 ` Badari Pulavarty @ 2006-03-17 22:22 ` Stephen C. Tweedie 2006-03-17 22:38 ` Badari Pulavarty 2006-03-18 2:57 ` Suparna Bhattacharya 1 sibling, 2 replies; 58+ messages in thread From: Stephen C. Tweedie @ 2006-03-17 22:22 UTC (permalink / raw) To: Badari Pulavarty Cc: Andreas Dilger, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel, Stephen Tweedie Hi, On Fri, 2006-03-17 at 13:32 -0800, Badari Pulavarty wrote: > I have a patch which eliminates adding buffers to the journal, if > we are doing just re-write of the disk block. ... > 2.6.16-rc6 2.6.16-rc6+patch > real 0m6.606s 0m3.705s OK, that's a really significant win! What exactly was the test case for this, and does that performance edge persist for a longer-running test? > In real world, does this ordering guarantee matter ? Not that I am aware of. Even with the ordering guarantee, there is still no guarantee of the order in which the writes hit disk within that transaction, which makes it hard to depend on it. I recall that some versions of fsync depended on ordered mode flushing dirty data on transaction commit, but I don't think the current ext3_sync_file() will have any problems there. Other than that, the only thing I can think of that had definite dependencies in this area was InterMezzo, and that's no longer in the tree. Even then, I'm not 100% certain that InterMezzo had a dependency for overwrites (it was certainly strongly dependent on the ordering semantics for allocates.) It is theoretically possible to write applications that depend on that ordering, but they would be necessarily non-portable anyway. I think relaxing it is fine, especially for a 100% (wow) performance gain. There is one other perspective to be aware of, though: the current behaviour means that by default ext3 generally starts flushing pending writeback data within 5 seconds of a write. 
Without that, we may end up accumulating a lot more dirty data in memory, shifting the task of write throttling from the filesystem to the VM. That's not a problem per se, just a change of behaviour to keep in mind, as it could expose different corner cases in the performance of write-intensive workloads. --Stephen ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 22:22 ` Stephen C. Tweedie @ 2006-03-17 22:38 ` Badari Pulavarty 2006-03-17 23:23 ` Mingming Cao 2006-03-18 2:57 ` Suparna Bhattacharya 1 sibling, 1 reply; 58+ messages in thread From: Badari Pulavarty @ 2006-03-17 22:38 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Andreas Dilger, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel On Fri, 2006-03-17 at 17:22 -0500, Stephen C. Tweedie wrote: > Hi, > > On Fri, 2006-03-17 at 13:32 -0800, Badari Pulavarty wrote: > > > I have a patch which eliminates adding buffers to the journal, if > > we are doing just re-write of the disk block. ... > > > 2.6.16-rc6 2.6.16-rc6+patch > > real 0m6.606s 0m3.705s > > OK, that's a really significant win! What exactly was the test case for > this, and does that performance edge persist for a longer-running test? Well, it's a micro benchmark to test the prepare_write/commit_write code. It over-writes the same blocks again & again, thousands of times. I am doing general filesystem tests to see the overall benefits also. > > > In real world, does this ordering guarantee matter ? > > Not that I am aware of. Even with the ordering guarantee, there is > still no guarantee of the order in which the writes hit disk within that > transaction, which makes it hard to depend on it. > > I recall that some versions of fsync depended on ordered mode flushing > dirty data on transaction commit, but I don't think the current > ext3_sync_file() will have any problems there. > > Other than that, the only thing I can think of that had definite > dependencies in this area was InterMezzo, and that's no longer in the > tree. Even then, I'm not 100% certain that InterMezzo had a dependency > for overwrites (it was certainly strongly dependent on the ordering > semantics for allocates.) > > It is theoretically possible to write applications that depend on that > ordering, but they would be necessarily non-portable anyway. 
I think > relaxing it is fine, especially for a 100% (wow) performance gain. > > There is one other perspective to be aware of, though: the current > behaviour means that by default ext3 generally starts flushing pending > writeback data within 5 seconds of a write. Without that, we may end up > accumulating a lot more dirty data in memory, shifting the task of write > throttling from the filesystem to the VM. Hmm.. You got a point there. > > That's not a problem per se, just a change of behaviour to keep in mind, > as it could expose different corner cases in the performance of > write-intensive workloads. > > --Stephen > > ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 22:38 ` Badari Pulavarty @ 2006-03-17 23:23 ` Mingming Cao 2006-03-20 17:05 ` Stephen C. Tweedie 0 siblings, 1 reply; 58+ messages in thread From: Mingming Cao @ 2006-03-17 23:23 UTC (permalink / raw) To: Badari Pulavarty Cc: Stephen C. Tweedie, Andreas Dilger, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel On Fri, 2006-03-17 at 14:38 -0800, Badari Pulavarty wrote: > On Fri, 2006-03-17 at 17:22 -0500, Stephen C. Tweedie wrote: > > There is one other perspective to be aware of, though: the current > > behaviour means that by default ext3 generally starts flushing pending > > writeback data within 5 seconds of a write. Without that, we may end up > > accumulating a lot more dirty data in memory, shifting the task of write > > throttling from the filesystem to the VM. > > Hmm.. You got a point there. > > > > > That's not a problem per se, just a change of behaviour to keep in mind, > > > as it could expose different corner cases in the performance of > > > write-intensive workloads. > > > > --Stephen > > > > > Current data=writeback mode already behaves like this, so the VM subsystem should already have been exercised to a certain extent, shouldn't it? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 23:23 ` Mingming Cao @ 2006-03-20 17:05 ` Stephen C. Tweedie 0 siblings, 0 replies; 58+ messages in thread From: Stephen C. Tweedie @ 2006-03-20 17:05 UTC (permalink / raw) To: cmm Cc: Badari Pulavarty, Andreas Dilger, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel, Stephen Tweedie Hi, On Fri, 2006-03-17 at 15:23 -0800, Mingming Cao wrote: > > > There is one other perspective to be aware of, though: the current > > > behaviour means that by default ext3 generally starts flushing pending > > > writeback data within 5 seconds of a write. Without that, we may end up > > > accumulating a lot more dirty data in memory, shifting the task of write > > > throttling from the filesystem to the VM. > Current data=writeback mode already behaves like this, so the VM > subsystem should be tested for a certain extent, isn't? Yes, but there are repeated reports that for many workloads, data=writeback is actually slower than data=ordered. So there are probably some interactions like this which may be hurting us already. --Stephen ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 22:22 ` Stephen C. Tweedie 2006-03-17 22:38 ` Badari Pulavarty @ 2006-03-18 2:57 ` Suparna Bhattacharya 1 sibling, 0 replies; 58+ messages in thread From: Suparna Bhattacharya @ 2006-03-18 2:57 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Badari Pulavarty, Andreas Dilger, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel On Fri, Mar 17, 2006 at 05:22:13PM -0500, Stephen C. Tweedie wrote: > Hi, > > On Fri, 2006-03-17 at 13:32 -0800, Badari Pulavarty wrote: > > > I have a patch which eliminates adding buffers to the journal, if > > we are doing just re-write of the disk block. ... > > > 2.6.16-rc6 2.6.16-rc6+patch > > real 0m6.606s 0m3.705s > > OK, that's a really significant win! What exactly was the test case for > this, and does that performance edge persist for a longer-running test? > > > In real world, does this ordering guarantee matter ? > > Not that I am aware of. Even with the ordering guarantee, there is > still no guarantee of the order in which the writes hit disk within that > transaction, which makes it hard to depend on it. > > I recall that some versions of fsync depended on ordered mode flushing > dirty data on transaction commit, but I don't think the current > ext3_sync_file() will have any problems there. > > Other than that, the only thing I can think of that had definite > dependencies in this are was InterMezzo, and that's no longer in the > tree. Even then, I'm not 100% certain that InterMezzo had a dependency > for overwrites (it was certainly strongly dependent on the ordering > semantics for allocates.) Besides we seem to have already broken the guarantee in async DIO writes for the overwrite case. Regards Suparna > > It is theoretically possible to write applications that depend on that > ordering, but they would be necessarily non-portable anyway. I think > relaxing it is fine, especially for a 100% (wow) performance gain. 
> > There is one other perspective to be aware of, though: the current > behaviour means that by default ext3 generally starts flushing pending > writeback data within 5 seconds of a write. Without that, we may end up > accumulating a lot more dirty data in memory, shifting the task of write > throttling from the filesystem to the VM. > > That's not a problem per se, just a change of behaviour to keep in mind, > as it could expose different corner cases in the performance of > write-intensive workloads. > > --Stephen -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 21:32 ` Badari Pulavarty 2006-03-17 22:22 ` Stephen C. Tweedie @ 2006-03-18 3:02 ` Suparna Bhattacharya 1 sibling, 0 replies; 58+ messages in thread From: Suparna Bhattacharya @ 2006-03-18 3:02 UTC (permalink / raw) To: Badari Pulavarty Cc: Stephen C. Tweedie, Andreas Dilger, Jan Kara, Theodore Ts'o, Andrew Morton, lkml, linux-fsdevel On Fri, Mar 17, 2006 at 01:32:21PM -0800, Badari Pulavarty wrote: > Hi Stephen, > > Now that we got your attention, I am wondering whats your opinion on > this ? > > I have a patch which eliminates adding buffers to the journal, if > we are doing just re-write of the disk block. In theory, it should > be fine - but it does change the current behavior for order mode > writes. I guess, current code adds the buffers to the journal, so > any metadata updates to any file in the filesystem happen in the > journal - guarantees our buffers to be flushed out before that > transaction completes. > > My patch *breaks* that guarantee. But provides significant improvement > for re-write case. My micro benchmark shows: > > 2.6.16-rc6 2.6.16-rc6+patch > real 0m6.606s 0m3.705s > user 0m0.124s 0m0.108s > sys 0m6.456s 0m3.600s > Just curious, how does this compare to the writeback case ? Essentially this change amounts to getting close to writeback mode performance for the overwrites of existing files, isn't it ? > > In real world, does this ordering guarantee matter ? Waiting for your > advise. > > Thanks, > Badari > > Make use of PageMappedToDisk(page) to find out if we need to > block allocation and skip the calls to it, if not needed. > When we are not doing block allocation, also avoid calls > to journal start and adding buffers to transaction. 
-- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-16 18:09 ` Theodore Ts'o 2006-03-16 18:22 ` Badari Pulavarty @ 2006-03-17 15:32 ` Jamie Lokier 2006-03-17 21:50 ` Stephen C. Tweedie 1 sibling, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2006-03-17 15:32 UTC (permalink / raw) To: Theodore Ts'o, Badari Pulavarty, Andrew Morton, sct, linux-kernel, linux-fsdevel, jack Theodore Ts'o wrote: > > >Yup. Ordered-mode JBD commit needs to write and wait upon all dirty > > >file-data buffers prior to journalling the metadata. If we didn't run > > >journal_dirty_data_fn() against those buffers then they'd still be under > > >I/O after commit had completed. > > > > > In the non-block allocation case, what metadata are we journaling in > > writepage() ? block allocation happened in prepare_write() and > > commit_write() journaled the transaction. All the metadata > > updates should be done there. What JBD commit are you referring to > > here ? > > Basically, this boils down to what is our definition of ordered-mode? > > If the goal is to make sure we avoid the security exposure of > allocating a block and then crashing before we write the data block, > potentially exposing previously written data that might belong to > another user, then what Badari is suggesting would avoid this > particular problem. > > However, if what we are doing is overwriting our own data with an > updated, more recent version of the data block, do we guarantee that > any ordering semantics apply? For example, what if we write a data > block, and then follow it up with some kind of metadata update (say we > touch atime, or add an extended attribute). Do we guarantee that if > the metadata update is committed, that the data block will have made > it to disk as well? Today that is the way things work, but is that > guarantee part of the contract of ordered-mode? That's the wrong way around for uses which check mtimes to revalidate information about a file's contents. 
Local search engines like Beagle, and also anything where "make" is involved, and "rsync" come to mind. Then the mtime update should be committed _before_ the data begins to be written, not after, so that after a crash, programs will know those files may not contain the expected data. A notable example is "rsync". After a power cycle, you may want to synchronise some files from another machine. Ideally, running "rsync" to copy from the other machine would do the trick. But if data is committed to files on the power-cycled machine, and mtime updates for those writes did not get committed, when "rsync" is later run it will assume those files are already correct and not update them. The result is that the data is not copied properly in the way the user expects. With "rsync" this problem can be avoided using the --checksum option, but that's not something a person is likely to think of needing when they assume ext3 provides reasonable consistency guarantees, so that it's safe to pull the plug. -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
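[Editorial note] Jamie's scenario hinges on rsync's quick-check heuristic: a destination file whose size and mtime match the source is assumed to be up to date. A minimal model of that check follows; the struct and function names are illustrative, not rsync source.

```c
/* Metadata the quick check compares; illustrative names only. */
struct file_meta {
    long long size;
    long long mtime;
};

/*
 * rsync-style quick check: treat the file as unchanged when size and
 * mtime both match. If a crash leaves newly written data blocks on
 * disk without the corresponding mtime update, this check wrongly
 * reports "unchanged" -- exactly the failure mode described above.
 */
int quick_check_unchanged(struct file_meta src, struct file_meta dst)
{
    return src.size == dst.size && src.mtime == dst.mtime;
}
```

If a crash persists new data blocks but not the mtime update, source and destination metadata still match, so the changed file is silently skipped unless --checksum is used.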
* Re: ext3_ordered_writepage() questions 2006-03-17 15:32 ` Jamie Lokier @ 2006-03-17 21:50 ` Stephen C. Tweedie 2006-03-17 22:11 ` Theodore Ts'o 2006-03-17 22:23 ` Jamie Lokier 0 siblings, 2 replies; 58+ messages in thread From: Stephen C. Tweedie @ 2006-03-17 21:50 UTC (permalink / raw) To: Jamie Lokier Cc: Theodore Ts'o, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack, Stephen Tweedie Hi, On Fri, 2006-03-17 at 15:32 +0000, Jamie Lokier wrote: > That's the wrong way around for uses which check mtimes to revalidate > information about a file's contents. It's actually the right way for newly-allocated data: the blocks being written early are invisible until the mtime update, because the mtime update is an atomic part of the transaction which links the blocks into the inode. > Local search engines like Beagle, and also anything where "make" is > involved, and "rsync" come to mind. Make and rsync (when writing, that is) are not usually updating in place, so they do in fact want the current ordered mode. It's *only* for updating existing data blocks that there's any justification for writing mtime first. That's the question here. There's a significant cost in forcing the mtime to go first: it means that the VM cannot perform any data writeback for data written by a transaction until the transaction has first been committed. That's the last thing you want to be happening under VM pressure, as you may not in fact be able to close the transaction without first allocating more memory. --Stephen ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 21:50 ` Stephen C. Tweedie @ 2006-03-17 22:11 ` Theodore Ts'o 2006-03-17 22:44 ` Jamie Lokier 2006-03-17 22:23 ` Jamie Lokier 1 sibling, 2 replies; 58+ messages in thread From: Theodore Ts'o @ 2006-03-17 22:11 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Jamie Lokier, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack On Fri, Mar 17, 2006 at 04:50:21PM -0500, Stephen C. Tweedie wrote: > > It's *only* for updating existing data blocks that there's any > justification for writing mtime first. That's the question here. > > There's a significant cost in forcing the mtime to go first: it means > that the VM cannot perform any data writeback for data written by a > transaction until the transaction has first been committed. That's the > last thing you want to be happening under VM pressure, as you may not in > fact be able to close the transaction without first allocating more > memory. Actually, we're not even able to force the mtime to happen first in this case. In ordered mode, we still force the data blocks *first*, and only later do we force the mtime update out. With Badari's proposed change, we completely decouple when the data blocks get written out from the mtime update; it could happen before, or after, at the OS's convenience. If the application cares about the precise ordering of data blocks being written out with respect to the mtime field, I'd respectfully suggest that the application use data journalling mode --- and note that most applications which update existing data blocks, especially relational databases, either don't care about mtime, have their own data recovery subsystems, or both. - Ted ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 22:11 ` Theodore Ts'o @ 2006-03-17 22:44 ` Jamie Lokier 2006-03-18 23:40 ` Theodore Ts'o 0 siblings, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2006-03-17 22:44 UTC (permalink / raw) To: Theodore Ts'o, Stephen C. Tweedie, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack Theodore Ts'o wrote: > If the application cares about the precise ordering of data blocks > being written out with respect to the mtime field, I'd respectfully > suggest that the application use data journalling mode --- and note > that most applications which update existing data blocks, especially > relational databases, either don't care about mtime, have their own > data recovering subsystems, or both. I think if you're thinking this only affects "applications" or individual programs (like databases), then you didn't think about the example I gave. Scenario: - Person has two computers, A and B. Maybe a desktop and laptop. Maybe office and home machines. - Sometimes they do work on A, sometimes they do work on B. Things like editing pictures or spreadsheets or whatever. - They use "rsync" to copy their working directory from A to B, or B to A, when they move between computers. - They're working on A one day, and there's a power cut. - Power comes back. - They decide to start again on A, using "rsync" to copy from B to A to get a good set of files. - "rsync" is believed to mirror directories from one place to another without problems. It's always worked for them before. (Heck, until this thread came up, I assumed it would always work). - ext3 is generally trusted, so no fsck or anything else special is thought to be required after a power cut. - So after running "rsync", they believe it's safe to work on A. This assumption is invalid, because of ext3's data vs. mtime write ordering when they were working on A before the power cut. But the user doesn't expect this. 
It's far from obvious (except to a very thoughtful techie) that rsync, which always works normally and even tidies up mistakes normally, won't give correct results this time. - So they carry on working, with corrupted data. Maybe they won't notice for a long time, and the corruption stays in their work project. No individual program or mount option is at fault in the above scenario. The combination together creates a fault, but only after a power cut. The usage is fine in normal use and for all other typical errors which affect files. Technically, using data=journal, or --checksum with rsync, would be fine. But nobody _expects_ to have to do that. It's a surprise. And they both imply a big performance overhead, so nobody is ever advised to do that just to be safe for "ordinary" day to day work. -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 22:44 ` Jamie Lokier @ 2006-03-18 23:40 ` Theodore Ts'o 2006-03-19 2:36 ` Jamie Lokier 0 siblings, 1 reply; 58+ messages in thread
From: Theodore Ts'o @ 2006-03-18 23:40 UTC (permalink / raw)
To: Jamie Lokier
Cc: Stephen C. Tweedie, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack

On Fri, Mar 17, 2006 at 10:44:39PM +0000, Jamie Lokier wrote:
> - Sometimes they do work on A, sometimes they do work on B.
>   Things like editing pictures or spreadsheets or whatever.
>
> - They use "rsync" to copy their working directory from A to B, or
>   B to A, when they move between computers.
>
> - They're working on A one day, and there's a power cut.

... and this is not a problem, because rsync works by building the file to be copied via a name such as .filename.NoC10k. If there is a power cut, there will be a stale dot file on A that might take up some extra disk space, but it won't cause a problem with a loss of synchronization between B and A.

Hence, in your scenario, nothing would go wrong. And since rsync builds up a new file each time, and only deletes the old file and moves the new file to replace the old file when it is successful, in ordered data mode all of the data blocks will be forced to disk before the metadata operations for the close and rename are allowed to be committed. Hence, it's perfectly safe, even with Badari's proposed change.

I would also note that even without the checksum test, and *even* if it didn't use the dotfile with the uniquifier, rsync always did check to see if file sizes matched, and since the file sizes would be different, it would have caught it that way.

- Ted

P.S.
There is a potential problem with mtimes causing confusing results with make, but it has nothing to do with ext3 journalling modes, and everything to do with the fact that most programs, including the compiler and linker, do not write their output files using the rsync technique of using a temporary filename, and then renaming the file to its final location once the compiler or linker operation is complete. So it doesn't matter what filesystem you use: if you are writing out some gargantuan .o file, and the system crashes before the .o file is written out, the fact that it exists and is newer than the source files will lead make(1) to conclude that the file is up to date, and not try to rebuild it.

This has been true for the three decades or so that make has been around, yet I don't see people running around in histrionics about this "horrible problem". If people did care, the right way to fix it would be to make the C compiler use the temp filename and rename trick, or change the default .c.o rule in Makefiles to do the same thing. But in reality, it doesn't seem to bother most developers, so no one has really bothered.

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-18 23:40 ` Theodore Ts'o @ 2006-03-19 2:36 ` Jamie Lokier 2006-03-19 5:28 ` Chris Adams ` (2 more replies) 0 siblings, 3 replies; 58+ messages in thread
From: Jamie Lokier @ 2006-03-19 2:36 UTC (permalink / raw)
To: Theodore Ts'o, Stephen C. Tweedie, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack

Theodore Ts'o wrote:
> On Fri, Mar 17, 2006 at 10:44:39PM +0000, Jamie Lokier wrote:
> > - Sometimes they do work on A, sometimes they do work on B.
> >   Things like editing pictures or spreadsheets or whatever.
> >
> > - They use "rsync" to copy their working directory from A to B, or
> >   B to A, when they move between computers.
> >
> > - They're working on A one day, and there's a power cut.
>
> ... and this is not a problem, because rsync works by building
> the file to be copied via a name such as .filename.NoC10k.  If there
> is a power cut, there will be a stale dot file on A that might take up
> some extra disk space, but it won't cause a problem with a loss of
> synchronization between B and A.

Eh? Yikes, I'm obviously not writing clearly enough, because that's not what I'm talking about at all.

The power cut doesn't interrupt rsync; it interrupts other programs (unspecified ones), or strikes just after they've written data but before it has reached the disk. It occurs in step 3 above: after "working on A", i.e. using OpenOffice, Emacs, Gnumeric, whatever, etc. _Those_ are the programs which save files shortly before the power cut in that scenario.

rsync is relevant only *after* the power cut, because it checks mtimes to see if files are modified. The method by which rsync writes files isn't relevant to this scenario.

> Hence, in your scenario, nothing would go wrong.
And since rsync > builds up a new file each time, and only deletes the old file and > moves the new file to replace the old file when it is successful, in > ordered data mode all of the data blocks will be forced to disk before > the metadata operations for the close and rename are allowed to be > committed. Hence, it's perfectly safe, even with Badari's proposed > change. Not relevant; rsync is run after the power cut; how it writes files is irrelevant. How it detects changed files is relevant. It's other programs (OpenOffice, etc.) which are being used just before the power cut. If the programs which run just before the power cut do the above (writing then renaming), then they're fine, but many programs don't do that. Now, to be fair, most programs don't overwrite data blocks in place either. They usually open files with O_TRUNC to write with new contents. How does that work out with/without Badari's patch? Is that safe in the same way as creating new files and appending to them is? Or does O_TRUNC mean that the old data blocks might be released and assigned to other files, before this file's metadata is committed, opening a security hole where reading this file after a restart might read blocks belonging to another, unrelated file? > P.S. There is a potential problem with mtimes causing confusing > results with make, but it has nothing to do with ext3 journalling > modes, and everything to do with the fact that most programs, > including the compiler and linker, do not write their output files > using the rsync technique of using a temporary filename, and then > renaming the file to its final location once the compiler or linker > operation is complete. So it doesn't matter what filesystem you use, > if you are writing out some gargantuan .o file, and the system > crashes before the .o file is written out, the fact that it exists > and is newer than the source files will lead make(1) to conclude > that the file is up to date, and not try to rebuild it.
This has been > true for the three decades or so that make has been around, yet I don't > see people running around in histrionics about this "horrible > problem". If people did care, the right way to fix it would be to > make the C compiler use the temp filename and rename trick, or change > the default .c.o rule in Makefiles to do the same thing. But in > reality, it doesn't seem to bother most developers, so no one has > really bothered. Again, I know about that problem. I'm referring to a _different_ problem with make, one that is less well known and has less easily detected effects. It's this: you edit a source file with your favourite editor, and save it. 3 seconds later, there's a power cut. The next day, power comes back and you've forgotten that you edited this file. You type "make", and because the _source_ file's new data was committed, but the mtime wasn't, "make" doesn't rebuild anything. The result is output files which are perfectly valid, but inconsistent with source files. No truncated output files (which tend to be easily detected because they don't link or run). This has nothing to do with partially written output files, and more importantly, you can't fix it by cleverness in the Makefile. It's insidious because whole builds may appear to be fine for a long time (something that doesn't occur with partially written output files - those trigger further errors when used - which is maybe why nobody much worries about those). Bugs in behaviour may not be seen from viewing source code, and if you don't know a power cut was involved, it may not be obvious to think it could be due to source-object mismatch. Similar effects occur with automatic byte-code compilations like Python to .pyc files, and web sites where a templating system caches generated output depending on mtimes of source files. However, all of the above examples really depend on what happens with O_TRUNC, because in practice all editors etc.
that are likely to be used, use that if they don't do write-then-rename. So what does happen with O_TRUNC? Does that commit the size and mtime change before (or at the same time as) freeing the data blocks? Or can the data block freeing be committed first? If the former, O_TRUNC is as good as writing to a new file: it's fine. If the latter, it's like writing data in-place, and can have the problems I've described. -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-19 2:36 ` Jamie Lokier @ 2006-03-19 5:28 ` Chris Adams 2006-03-20 2:18 ` Theodore Ts'o 2006-03-20 16:26 ` Stephen C. Tweedie 2 siblings, 0 replies; 58+ messages in thread From: Chris Adams @ 2006-03-19 5:28 UTC (permalink / raw) To: linux-kernel Once upon a time, Jamie Lokier <jamie@shareable.org> said: >rsync is relevant only *after* the power cut, because it checks mtimes >to see if files are modified. The method by which rsync writes files >isn't relevant to this scenario. To simplify: substitute "rsync" with "backup program". Any backup software that maintains some type of index of what has been backed up (for incremental type backups) or even just backs up files modified since a particular date (e.g. "dump") can miss files modified shortly before a crash/power cut/unexpected shutdown. The data may get modified but since the mtime may not get updated, nothing can tell that the data has been modified. rsync is actually a special case, in that you could always force it to compare contents between two copies. Most backup software doesn't do that (especially tape backups). -- Chris Adams <cmadams@hiwaay.net> Systems and Network Administrator - HiWAAY Internet Services I don't speak for anybody but myself - that's enough trouble. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-19 2:36 ` Jamie Lokier 2006-03-19 5:28 ` Chris Adams @ 2006-03-20 2:18 ` Theodore Ts'o 2006-03-20 16:26 ` Stephen C. Tweedie 2 siblings, 0 replies; 58+ messages in thread
From: Theodore Ts'o @ 2006-03-20 2:18 UTC (permalink / raw)
To: Jamie Lokier
Cc: Stephen C. Tweedie, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack

On Sun, Mar 19, 2006 at 02:36:10AM +0000, Jamie Lokier wrote:
> It's other programs (OpenOffice, etc.) which are being used just
> before the power cut. If the programs which run just before the power
> cut do the above (writing then renaming), then they're fine, but many
> programs don't do that.
>
> Now, to be fair, most programs don't overwrite data blocks in place either.
>
> They usually open files with O_TRUNC to write with new contents. How
> does that work out with/without Badari's patch? Is that safe in the
> same way as creating new files and appending to them is?

The competently written ones don't open files with O_TRUNC; they will create a temp file, write to the temp file, and then rename the temp file to the original file once it is fully written, just like rsync does. This is important, given that many developers (especially kernel developers) like to use hard link farms to minimize space, and if you just O_TRUNC the existing file, then the change will be visible in all of the directories. If instead the editor (or openoffice, etc.) writes to a temp file and then renames, then it breaks the hard link with COW semantics, which is what you want --- and in practice, everyone using (or was using) bk, git, and mercurial used hard-linked directories and it works just fine.

But yes, using O_TRUNC works just fine with and without Badari's patch, because allocating new data blocks to an old file that is truncated is exactly the same as appending new data blocks to a new file.
> Or does O_TRUNC mean that the old data blocks might be released and
> assigned to other files, before this file's metadata is committed,
> opening a security hole where reading this file after a restart might
> read blocks belonging to another, unrelated file?

No, not in journal=ordered mode, since the data blocks are forced to disk before the metadata is committed. Opening with O_TRUNC is a metadata operation, and allocating new blocks after O_TRUNC is also a metadata operation, and in data=ordered mode, blocks are written out before the metadata is forced out.

> It's this: you edit a source file with your favourite editor, and save
> it. 3 seconds later, there's a power cut. The next day, power comes
> back and you've forgotten that you edited this file.

Again, *my* favorite editor saves the file to a newly created file, #filename, and once it is done writing the new file, renames filename to filename~, and finally renames #filename to filename. This means that we don't have to worry about your power cut scenario, and it also means that hard-link farms have the proper COW semantics.

> However, all of the above examples really depend on what happens with
> O_TRUNC, because in practice all editors etc. that are likely to be
> used, use that if they don't do write-then-rename.

O_TRUNC is a bad idea, because it means if the editor crashes, or the computer crashes, or the fileserver crashes, the original file is *G*O*N*E*. So only silly/stupidly written editors use O_TRUNC; if yours does, abandon it in favor of another editor, or one day you'll be really sorry. It's much, much safer to do write-then-rename.

- Ted

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-19 2:36 ` Jamie Lokier 2006-03-19 5:28 ` Chris Adams 2006-03-20 2:18 ` Theodore Ts'o @ 2006-03-20 16:26 ` Stephen C. Tweedie 2 siblings, 0 replies; 58+ messages in thread From: Stephen C. Tweedie @ 2006-03-20 16:26 UTC (permalink / raw) To: Jamie Lokier Cc: Theodore Ts'o, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack, Stephen Tweedie Hi, On Sun, 2006-03-19 at 02:36 +0000, Jamie Lokier wrote: > Now, to be fair, most programs don't overwrite data blocks in place either. Which is the point we're trying to make: "make" is almost always being used to create or fully replace whole files, not to update existing data inside a file, for example. > They usually open files with O_TRUNC to write with new contents. How > does that work out with/without Badari's patch? Is that safe in the > same way as creating new files and appending to them is? Yes, absolutely. We have to be extremely careful about ordering when it comes to truncate, because we cannot allow the discarded data blocks to be reused until the truncate has committed (otherwise a crash which rolled back the truncate would potentially expose corruption in those data blocks.) That's all done in the allocate logic, not in the writeback code, so it is unaffected by the writeback patches. So the O_TRUNC is still fully safe; and the allocation of new blocks after that is simply a special case of extend, so it is also unaffected by the patch. It is *only* the recovery semantics of update-in-place which are affected. > It's this: you edit a source file with your favourite editor, and save > it. 3 seconds later, there's a power cut. The next day, power comes > back and you've forgotten that you edited this file. 
If your editor is really opening the existing file and modifying the contents in place, then you have got a fundamentally unsolvable problem because the crash you worry about might happen while the editor is still writing and the file is internally inconsistent. That's not something I think is the filesystem's responsibility to fix! --Stephen ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: ext3_ordered_writepage() questions 2006-03-17 21:50 ` Stephen C. Tweedie 2006-03-17 22:11 ` Theodore Ts'o @ 2006-03-17 22:23 ` Jamie Lokier 1 sibling, 0 replies; 58+ messages in thread From: Jamie Lokier @ 2006-03-17 22:23 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Theodore Ts'o, Badari Pulavarty, Andrew Morton, linux-kernel, linux-fsdevel, jack Stephen C. Tweedie wrote: > > That's the wrong way around for uses which check mtimes to revalidate > > information about a file's contents. > > It's actually the right way for newly-allocated data: the blocks being > written early are invisible until the mtime update, because the mtime > update is an atomic part of the transaction which links the blocks into > the inode. Yes, I agree. It's right for that. > > Local search engines like Beagle, and also anything where "make" is > > involved, and "rsync" come to mind. > > Make and rsync (when writing, that is) are not usually updating in > place, so they do in fact want the current ordered mode. I'm referring to make and rsync _after_ a recovery, when _reading_ to decide whether file data is up to date. The writing in that scenario is by other programs. Those are the times when the current ordering gives surprising results, to the person who hasn't thought about this ordering, such as rsync not synchronising a directory properly because it assumes (incorrectly) a file's mtime is indicative of the last time data was written to the file. I agree that when writing data to the end of a new file, the data must be committed before the metadata. The weird distinction is really because the order ought to be, if they can't all be atomic: commit mtime, then data, then size. But we always commit size and mtime together. > It's *only* for updating existing data blocks that there's any > justification for writing mtime first. That's the question here. 
> > There's a significant cost in forcing the mtime to go first: it means > that the VM cannot perform any data writeback for data written by a > transaction until the transaction has first been committed. That's the > last thing you want to be happening under VM pressure, as you may not in > fact be able to close the transaction without first allocating more > memory. While I agree that it's not good for VM pressure, fooling programs that rely on mtimes to decide if a file's content has changed is a correctness issue for some applications. I picked the example of copying a directory using rsync (or any other program which compares mtimes) and not getting expected results as one that's easily understood, that people actually do, and where they may already be getting surprises that may not be noticed immediately. Maybe the answer is to make the writeback order for in-place writes a mount option and/or a file attribute? -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops. @ 2006-05-02 15:07 Badari Pulavarty 2006-05-02 15:08 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty 2006-05-09 18:03 ` [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops Badari Pulavarty 0 siblings, 2 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-05-02 15:07 UTC (permalink / raw)
To: lkml; +Cc: akpm, Zach Brown, christoph, Benjamin LaHaise, pbadari, cel

Hi,

This series of patches collapses all the vectored IO support into a single set of file-operation methods, aio_read/aio_write. This work was originally suggested & started by Christoph Hellwig, when Zach Brown tried to add vectored support for AIO.

Here is the summary:

[PATCH 1/3] Vectorize aio_read/aio_write methods
[PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead.
[PATCH 3/3] Zach's core aio changes to support vectored AIO.

BTW, Chuck Lever is actually re-arranging the NFS DIO and AIO code to fit into this model. I ran various tests, including LTP, on this series.

Andrew, can you include these in the -mm tree?

Thanks,
Badari

^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-05-02 15:07 [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops Badari Pulavarty @ 2006-05-02 15:08 ` Badari Pulavarty 2006-05-02 15:20 ` Chuck Lever 2006-05-09 18:03 ` [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops Badari Pulavarty 1 sibling, 1 reply; 58+ messages in thread From: Badari Pulavarty @ 2006-05-02 15:08 UTC (permalink / raw) To: lkml; +Cc: akpm, Zach Brown, christoph, Benjamin LaHaise, cel This patch vectorizes aio_read() and aio_write() methods to prepare for collapsing all the vectored & aio operations into one interface - which is aio_read()/aio_write(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Documentation/filesystems/Locking | 5 +- Documentation/filesystems/vfs.txt | 4 +- drivers/char/raw.c | 14 ------- drivers/usb/gadget/inode.c | 71 +++++++++++++++++++++++++++----------- fs/aio.c | 15 +++++--- fs/block_dev.c | 10 ----- fs/cifs/cifsfs.c | 6 +-- fs/ext3/file.c | 5 +- fs/read_write.c | 20 ++++++++-- fs/reiserfs/file.c | 8 ---- fs/xfs/linux-2.6/xfs_file.c | 44 +++++++++++------------ include/linux/aio.h | 2 + include/linux/fs.h | 10 ++--- include/net/sock.h | 1 mm/filemap.c | 39 ++++++++++---------- net/socket.c | 48 ++++++++++++------------- 16 files changed, 163 insertions(+), 139 deletions(-) Index: linux-2.6.17-rc3/Documentation/filesystems/Locking =================================================================== --- linux-2.6.17-rc3.orig/Documentation/filesystems/Locking 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/Documentation/filesystems/Locking 2006-05-02 07:53:58.000000000 -0700 @@ -355,10 +355,9 @@ The last two are called only from check_ prototypes: loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t 
(*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, - loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, Index: linux-2.6.17-rc3/Documentation/filesystems/vfs.txt =================================================================== --- linux-2.6.17-rc3.orig/Documentation/filesystems/vfs.txt 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/Documentation/filesystems/vfs.txt 2006-05-02 07:53:58.000000000 -0700 @@ -699,9 +699,9 @@ This describes how the VFS can manipulat struct file_operations { loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); Index: linux-2.6.17-rc3/drivers/char/raw.c =================================================================== --- linux-2.6.17-rc3.orig/drivers/char/raw.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/drivers/char/raw.c 2006-05-02 07:53:58.000000000 -0700 @@ -250,23 +250,11 @@ static ssize_t raw_file_write(struct fil return generic_file_write_nolock(file, &local_iov, 1, ppos); } -static ssize_t 
raw_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) -{ - struct iovec local_iov = { - .iov_base = (char __user *)buf, - .iov_len = count - }; - - return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); -} - - static struct file_operations raw_fops = { .read = generic_file_read, .aio_read = generic_file_aio_read, .write = raw_file_write, - .aio_write = raw_file_aio_write, + .aio_write = generic_file_aio_write_nolock, .open = raw_open, .release= raw_release, .ioctl = raw_ioctl, Index: linux-2.6.17-rc3/fs/aio.c =================================================================== --- linux-2.6.17-rc3.orig/fs/aio.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/aio.c 2006-05-02 07:53:58.000000000 -0700 @@ -15,6 +15,7 @@ #include <linux/aio_abi.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/uio.h> #define DEBUG 0 @@ -1315,8 +1316,11 @@ static ssize_t aio_pread(struct kiocb *i ssize_t ret = 0; do { - ret = file->f_op->aio_read(iocb, iocb->ki_buf, - iocb->ki_left, iocb->ki_pos); + iocb->ki_inline_vec.iov_base = iocb->ki_buf; + iocb->ki_inline_vec.iov_len = iocb->ki_left; + + ret = file->f_op->aio_read(iocb, &iocb->ki_inline_vec, + 1, iocb->ki_pos); /* * Can't just depend on iocb->ki_left to determine * whether we are done. This may have been a short read. 
@@ -1349,8 +1353,11 @@ static ssize_t aio_pwrite(struct kiocb * ssize_t ret = 0; do { - ret = file->f_op->aio_write(iocb, iocb->ki_buf, - iocb->ki_left, iocb->ki_pos); + iocb->ki_inline_vec.iov_base = iocb->ki_buf; + iocb->ki_inline_vec.iov_len = iocb->ki_left; + + ret = file->f_op->aio_write(iocb, &iocb->ki_inline_vec, + 1, iocb->ki_pos); if (ret > 0) { iocb->ki_buf += ret; iocb->ki_left -= ret; Index: linux-2.6.17-rc3/fs/block_dev.c =================================================================== --- linux-2.6.17-rc3.orig/fs/block_dev.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/block_dev.c 2006-05-02 07:53:58.000000000 -0700 @@ -1064,14 +1064,6 @@ static ssize_t blkdev_file_write(struct return generic_file_write_nolock(file, &local_iov, 1, ppos); } -static ssize_t blkdev_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) -{ - struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count }; - - return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); -} - static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) { return blkdev_ioctl(file->f_mapping->host, file, cmd, arg); @@ -1094,7 +1086,7 @@ const struct file_operations def_blk_fop .read = generic_file_read, .write = blkdev_file_write, .aio_read = generic_file_aio_read, - .aio_write = blkdev_file_aio_write, + .aio_write = generic_file_aio_write_nolock, .mmap = generic_file_mmap, .fsync = block_fsync, .unlocked_ioctl = block_ioctl, Index: linux-2.6.17-rc3/fs/cifs/cifsfs.c =================================================================== --- linux-2.6.17-rc3.orig/fs/cifs/cifsfs.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/cifs/cifsfs.c 2006-05-02 07:53:58.000000000 -0700 @@ -496,13 +496,13 @@ static ssize_t cifs_file_writev(struct f return written; } -static ssize_t cifs_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) +static ssize_t 
cifs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct inode *inode = iocb->ki_filp->f_dentry->d_inode; ssize_t written; - written = generic_file_aio_write(iocb, buf, count, pos); + written = generic_file_aio_write(iocb, iov, nr_segs, pos); if (!CIFS_I(inode)->clientCanCacheAll) filemap_fdatawrite(inode->i_mapping); return written; Index: linux-2.6.17-rc3/fs/ext3/file.c =================================================================== --- linux-2.6.17-rc3.orig/fs/ext3/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/ext3/file.c 2006-05-02 07:53:58.000000000 -0700 @@ -48,14 +48,15 @@ static int ext3_release_file (struct ino } static ssize_t -ext3_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +ext3_file_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_dentry->d_inode; ssize_t ret; int err; - ret = generic_file_aio_write(iocb, buf, count, pos); + ret = generic_file_aio_write(iocb, iov, nr_segs, pos); /* * Skip flushing if there was an error, or if nothing was written. 
Index: linux-2.6.17-rc3/fs/read_write.c
===================================================================
--- linux-2.6.17-rc3.orig/fs/read_write.c	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/fs/read_write.c	2006-05-02 07:53:58.000000000 -0700
@@ -227,14 +227,20 @@ static void wait_on_retry_sync_kiocb(str
 
 ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
+	struct iovec iov = { .iov_base = buf, .iov_len = len };
 	struct kiocb kiocb;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
-	while (-EIOCBRETRY ==
-	       (ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos)))
+	kiocb.ki_left = len;
+
+	for (;;) {
+		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
+		if (ret != -EIOCBRETRY)
+			break;
 		wait_on_retry_sync_kiocb(&kiocb);
+	}
 
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
@@ -279,14 +285,20 @@ EXPORT_SYMBOL(vfs_read);
 
 ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
 {
+	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
 	struct kiocb kiocb;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
-	while (-EIOCBRETRY ==
-	       (ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos)))
+	kiocb.ki_left = len;
+
+	for (;;) {
+		ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
+		if (ret != -EIOCBRETRY)
+			break;
 		wait_on_retry_sync_kiocb(&kiocb);
+	}
 
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
Index: linux-2.6.17-rc3/fs/reiserfs/file.c
===================================================================
--- linux-2.6.17-rc3.orig/fs/reiserfs/file.c	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/fs/reiserfs/file.c	2006-05-02 07:53:58.000000000 -0700
@@ -1560,12 +1560,6 @@ static ssize_t reiserfs_file_write(struc
 	return res;
 }
 
-static ssize_t reiserfs_aio_write(struct kiocb *iocb, const char __user * buf,
-				  size_t count, loff_t pos)
-{
-	return generic_file_aio_write(iocb, buf, count, pos);
-}
-
 const struct file_operations reiserfs_file_operations = {
 	.read = generic_file_read,
 	.write = reiserfs_file_write,
@@ -1575,7 +1569,7 @@ const struct file_operations reiserfs_fi
 	.fsync = reiserfs_sync_file,
 	.sendfile = generic_file_sendfile,
 	.aio_read = generic_file_aio_read,
-	.aio_write = reiserfs_aio_write,
+	.aio_write = generic_file_aio_write,
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
 };
Index: linux-2.6.17-rc3/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.17-rc3.orig/fs/xfs/linux-2.6/xfs_file.c	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/fs/xfs/linux-2.6/xfs_file.c	2006-05-02 07:53:58.000000000 -0700
@@ -51,12 +51,11 @@ static struct vm_operations_struct xfs_d
 STATIC inline ssize_t
 __xfs_file_read(
 	struct kiocb		*iocb,
-	char			__user *buf,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	int			ioflags,
-	size_t			count,
 	loff_t			pos)
 {
-	struct iovec		iov = {buf, count};
 	struct file		*file = iocb->ki_filp;
 	vnode_t			*vp = vn_from_inode(file->f_dentry->d_inode);
 	ssize_t			rval;
@@ -65,39 +64,38 @@ __xfs_file_read(
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= IO_ISDIRECT;
-	VOP_READ(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval);
+	VOP_READ(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval);
 	return rval;
 }
 
 STATIC ssize_t
 xfs_file_aio_read(
 	struct kiocb		*iocb,
-	char			__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __xfs_file_read(iocb, buf, IO_ISAIO, count, pos);
+	return __xfs_file_read(iocb, iov, nr_segs, IO_ISAIO, pos);
 }
 
 STATIC ssize_t
 xfs_file_aio_read_invis(
 	struct kiocb		*iocb,
-	char			__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __xfs_file_read(iocb, buf, IO_ISAIO|IO_INVIS, count, pos);
+	return __xfs_file_read(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos);
 }
 
 STATIC inline ssize_t
 __xfs_file_write(
-	struct kiocb	*iocb,
-	const char	__user *buf,
-	int		ioflags,
-	size_t		count,
-	loff_t		pos)
+	struct kiocb		*iocb,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
+	int			ioflags,
+	loff_t			pos)
 {
-	struct iovec	iov = {(void __user *)buf, count};
 	struct file	*file = iocb->ki_filp;
 	struct inode	*inode = file->f_mapping->host;
 	vnode_t		*vp = vn_from_inode(inode);
@@ -107,28 +105,28 @@ __xfs_file_write(
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= IO_ISDIRECT;
 
-	VOP_WRITE(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval);
+	VOP_WRITE(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval);
 	return rval;
 }
 
 STATIC ssize_t
 xfs_file_aio_write(
 	struct kiocb		*iocb,
-	const char		__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __xfs_file_write(iocb, buf, IO_ISAIO, count, pos);
+	return __xfs_file_write(iocb, iov, nr_segs, IO_ISAIO, pos);
 }
 
 STATIC ssize_t
 xfs_file_aio_write_invis(
 	struct kiocb		*iocb,
-	const char		__user *buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __xfs_file_write(iocb, buf, IO_ISAIO|IO_INVIS, count, pos);
+	return __xfs_file_write(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos);
 }
 
 STATIC inline ssize_t
Index: linux-2.6.17-rc3/include/linux/fs.h
===================================================================
--- linux-2.6.17-rc3.orig/include/linux/fs.h	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/include/linux/fs.h	2006-05-02 07:53:58.000000000 -0700
@@ -1015,9 +1015,9 @@ struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
 	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
-	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
-	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
@@ -1594,11 +1594,11 @@ extern int file_send_actor(read_descript
 extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *);
-extern ssize_t generic_file_aio_read(struct kiocb *, char __user *, size_t, loff_t);
+extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *);
-extern ssize_t generic_file_aio_write(struct kiocb *, const char __user *, size_t, loff_t);
+extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
-				unsigned long, loff_t *);
+				unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
 				unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
Index: linux-2.6.17-rc3/include/net/sock.h
===================================================================
--- linux-2.6.17-rc3.orig/include/net/sock.h	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/include/net/sock.h	2006-05-02 07:53:58.000000000 -0700
@@ -659,7 +659,6 @@ struct sock_iocb {
 	struct sock		*sk;
 	struct scm_cookie	*scm;
 	struct msghdr		*msg, async_msg;
-	struct iovec		async_iov;
 	struct kiocb		*kiocb;
 };
 
Index: linux-2.6.17-rc3/mm/filemap.c
===================================================================
--- linux-2.6.17-rc3.orig/mm/filemap.c	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/mm/filemap.c	2006-05-02 07:53:58.000000000 -0700
@@ -1096,14 +1096,12 @@ out:
 EXPORT_SYMBOL(__generic_file_aio_read);
 
 ssize_t
-generic_file_aio_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos)
+generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
-	struct iovec local_iov = { .iov_base = buf, .iov_len = count };
-
 	BUG_ON(iocb->ki_pos != pos);
-	return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos);
+	return __generic_file_aio_read(iocb, iov, nr_segs, &iocb->ki_pos);
 }
-
 EXPORT_SYMBOL(generic_file_aio_read);
 
 ssize_t
@@ -2163,22 +2161,21 @@ out:
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
-EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
-ssize_t
-generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
+		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	ssize_t ret;
-	loff_t pos = *ppos;
 
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, ppos);
+	BUG_ON(iocb->ki_pos != pos);
+
+	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err;
+		ssize_t err;
 
 		err = sync_page_range_nolock(inode, mapping, pos, ret);
 		if (err < 0)
@@ -2186,6 +2183,7 @@ generic_file_aio_write_nolock(struct kio
 	}
 	return ret;
 }
+EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
 static ssize_t
 __generic_file_write_nolock(struct file *file, const struct iovec *iov,
@@ -2195,9 +2193,11 @@ __generic_file_write_nolock(struct file
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *ppos;
 	ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
-	if (ret == -EIOCBQUEUED)
+	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
+	*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -2209,28 +2209,27 @@ generic_file_write_nolock(struct file *f
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, file);
-	ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
+	kiocb.ki_pos = *ppos;
+	ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, *ppos);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
+	*ppos = kiocb.ki_pos;
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_write_nolock);
 
-ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf,
-			       size_t count, loff_t pos)
+ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	ssize_t ret;
-	struct iovec local_iov = { .iov_base = (void __user *)buf,
-					.iov_len = count };
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
Index: linux-2.6.17-rc3/net/socket.c
===================================================================
--- linux-2.6.17-rc3.orig/net/socket.c	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/net/socket.c	2006-05-02 07:53:58.000000000 -0700
@@ -96,10 +96,10 @@
 #include <linux/netfilter.h>
 
 static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
-static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf,
-			 size_t size, loff_t pos);
-static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf,
-			  size_t size, loff_t pos);
+static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			 unsigned long nr_segs, loff_t pos);
+static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
+			  unsigned long nr_segs, loff_t pos);
 static int sock_mmap(struct file *file, struct vm_area_struct * vma);
 
 static int sock_close(struct inode *inode, struct file *file);
@@ -700,7 +700,7 @@ static ssize_t sock_sendpage(struct file
 }
 
 static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb,
-		char __user *ubuf, size_t size, struct sock_iocb *siocb)
+					 struct sock_iocb *siocb)
 {
 	if (!is_sync_kiocb(iocb)) {
 		siocb = kmalloc(sizeof(*siocb), GFP_KERNEL);
@@ -710,15 +710,13 @@ static struct sock_iocb *alloc_sock_iocb
 	}
 
 	siocb->kiocb = iocb;
-	siocb->async_iov.iov_base = ubuf;
-	siocb->async_iov.iov_len = size;
-
 	iocb->private = siocb;
 	return siocb;
 }
 
 static ssize_t do_sock_read(struct msghdr *msg, struct kiocb *iocb,
-		struct file *file, struct iovec *iov, unsigned long nr_segs)
+		struct file *file, const struct iovec *iov,
+		unsigned long nr_segs)
 {
 	struct socket *sock = file->private_data;
 	size_t size = 0;
@@ -749,31 +747,33 @@ static ssize_t sock_readv(struct file *f
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 
-	ret = do_sock_read(&msg, &iocb, file, (struct iovec *)iov, nr_segs);
+	ret = do_sock_read(&msg, &iocb, file, iov, nr_segs);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&iocb);
 	return ret;
 }
 
-static ssize_t sock_aio_read(struct kiocb *iocb, char __user *ubuf,
-			 size_t count, loff_t pos)
+static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct sock_iocb siocb, *x;
 
 	if (pos != 0)
 		return -ESPIPE;
-	if (count == 0)		/* Match SYS5 behaviour */
+
+	if (iocb->ki_left == 0)	/* Match SYS5 behaviour */
 		return 0;
 
-	x = alloc_sock_iocb(iocb, ubuf, count, &siocb);
+	x = alloc_sock_iocb(iocb, &siocb);
 	if (!x)
 		return -ENOMEM;
-	return do_sock_read(&x->async_msg, iocb, iocb->ki_filp,
-			&x->async_iov, 1);
+	return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs);
 }
 
 static ssize_t do_sock_write(struct msghdr *msg, struct kiocb *iocb,
-		struct file *file, struct iovec *iov, unsigned long nr_segs)
+		struct file *file, const struct iovec *iov,
+		unsigned long nr_segs)
 {
 	struct socket *sock = file->private_data;
 	size_t size = 0;
@@ -806,28 +806,28 @@ static ssize_t sock_writev(struct file *
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 
-	ret = do_sock_write(&msg, &iocb, file, (struct iovec *)iov, nr_segs);
+	ret = do_sock_write(&msg, &iocb, file, iov, nr_segs);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&iocb);
 	return ret;
}
 
-static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *ubuf,
-			  size_t count, loff_t pos)
+static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct sock_iocb siocb, *x;
 
 	if (pos != 0)
 		return -ESPIPE;
-	if (count == 0)		/* Match SYS5 behaviour */
+
+	if (iocb->ki_left == 0)	/* Match SYS5 behaviour */
 		return 0;
 
-	x = alloc_sock_iocb(iocb, (void __user *)ubuf, count, &siocb);
+	x = alloc_sock_iocb(iocb, &siocb);
 	if (!x)
 		return -ENOMEM;
 
-	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp,
-			&x->async_iov, 1);
+	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs);
 }
 
Index: linux-2.6.17-rc3/drivers/usb/gadget/inode.c
===================================================================
--- linux-2.6.17-rc3.orig/drivers/usb/gadget/inode.c	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/drivers/usb/gadget/inode.c	2006-05-02 07:53:58.000000000 -0700
@@ -528,7 +528,8 @@ struct kiocb_priv {
 	struct usb_request	*req;
 	struct ep_data		*epdata;
 	void			*buf;
-	char __user		*ubuf;
+	const struct iovec	*iv;
+	unsigned long		count;
 	unsigned		actual;
 };
 
@@ -556,18 +557,32 @@ static int ep_aio_cancel(struct kiocb *i
 static ssize_t ep_aio_read_retry(struct kiocb *iocb)
 {
 	struct kiocb_priv	*priv = iocb->private;
-	ssize_t			status = priv->actual;
+	ssize_t			len, total;
+	unsigned long		i;
 
 	/* we "retry" to get the right mm context for this: */
-	status = copy_to_user(priv->ubuf, priv->buf, priv->actual);
-	if (unlikely(0 != status))
-		status = -EFAULT;
-	else
-		status = priv->actual;
+
+	/* copy stuff into user buffers */
+	total = priv->actual;
+	len = 0;
+	for (i = 0; i < priv->count; i++) {
+		ssize_t this = min(priv->iv[i].iov_len, (size_t)total);
+
+		if (copy_to_user(priv->iv[i].iov_base, priv->buf + len, this)) {
+			len = -EFAULT;
+			break;
+		}
+		total -= this;
+		len += this;
+		if (total <= 0)
+			break;
+	}
+
 	kfree(priv->buf);
 	kfree(priv);
 	aio_put_req(iocb);
-	return status;
+	return len;
 }
 
 static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
@@ -615,7 +630,8 @@ ep_aio_rwtail(
 	char		*buf,
 	size_t		len,
 	struct ep_data	*epdata,
-	char __user	*ubuf
+	const struct iovec *iv,
+	unsigned long	count
 )
 {
 	struct kiocb_priv	*priv = (void *) &iocb->private;
@@ -630,7 +646,8 @@ fail:
 		return value;
 	}
 	iocb->private = priv;
-	priv->ubuf = ubuf;
+	priv->iv = iv;
+	priv->count = count;
 
 	value = get_ready_ep(iocb->ki_filp->f_flags, epdata);
 	if (unlikely(value < 0)) {
@@ -675,36 +692,52 @@ fail:
 }
 
 static ssize_t
-ep_aio_read(struct kiocb *iocb, char __user *ubuf, size_t len, loff_t o)
+ep_aio_read(struct kiocb *iocb, const struct iovec *iv,
+		unsigned long count, loff_t o)
 {
 	struct ep_data		*epdata = iocb->ki_filp->private_data;
 	char			*buf;
 
 	if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN))
 		return -EINVAL;
-	buf = kmalloc(len, GFP_KERNEL);
+
+	buf = kmalloc(iocb->ki_left, GFP_KERNEL);
 	if (unlikely(!buf))
 		return -ENOMEM;
+
 	iocb->ki_retry = ep_aio_read_retry;
-	return ep_aio_rwtail(iocb, buf, len, epdata, ubuf);
+	return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iv, count);
 }
 
 static ssize_t
-ep_aio_write(struct kiocb *iocb, const char __user *ubuf, size_t len, loff_t o)
+ep_aio_write(struct kiocb *iocb, const struct iovec *iv,
+		unsigned long count, loff_t o)
 {
 	struct ep_data		*epdata = iocb->ki_filp->private_data;
 	char			*buf;
+	size_t			len = 0;
+	unsigned long		i;
 
 	if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN)))
 		return -EINVAL;
-	buf = kmalloc(len, GFP_KERNEL);
+
+	buf = kmalloc(iocb->ki_left, GFP_KERNEL);
 	if (unlikely(!buf))
 		return -ENOMEM;
-	if (unlikely(copy_from_user(buf, ubuf, len) != 0)) {
-		kfree(buf);
-		return -EFAULT;
+
+	for (i = 0; i < count; i++) {
+		if (unlikely(copy_from_user(&buf[len], iv[i].iov_base,
+					iv[i].iov_len) != 0)) {
+			kfree(buf);
+			return -EFAULT;
+		}
+		len += iv[i].iov_len;
 	}
-	return ep_aio_rwtail(iocb, buf, len, epdata, NULL);
+	return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0);
 }
 
 /*----------------------------------------------------------------------*/
Index: linux-2.6.17-rc3/include/linux/aio.h
===================================================================
--- linux-2.6.17-rc3.orig/include/linux/aio.h	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/include/linux/aio.h	2006-05-02 07:53:58.000000000 -0700
@@ -4,6 +4,7 @@
 #include <linux/list.h>
 #include <linux/workqueue.h>
 #include <linux/aio_abi.h>
+#include <linux/uio.h>
 
 #include <asm/atomic.h>
 
@@ -112,6 +113,7 @@ struct kiocb {
 	long			ki_retried; 	/* just for testing */
 	long			ki_kicked; 	/* just for testing */
 	long			ki_queued; 	/* just for testing */
+	struct iovec		ki_inline_vec;	/* inline vector */
 
 	struct list_head	ki_list;	/* the aio core uses this
 						 * for cancellation */

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-02 15:08 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
@ 2006-05-02 15:20   ` Chuck Lever
  2006-05-02 15:35     ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: Chuck Lever @ 2006-05-02 15:20 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: lkml, akpm, Zach Brown, christoph, Benjamin LaHaise, Trond Myklebust

[-- Attachment #1: Type: text/plain, Size: 30440 bytes --]

If you apply this one, then the NFS client no longer builds.

I think you might need to stub out vectored direct I/O support for the
NFS client temporarily with something like the attached patch.

Badari Pulavarty wrote:
> This patch vectorizes aio_read() and aio_write() methods to prepare
> for collapsing all the vectored & aio operations into one interface -
> which is aio_read()/aio_write().
>
>
> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
>  Documentation/filesystems/Locking |    5 +-
>  Documentation/filesystems/vfs.txt |    4 +-
>  drivers/char/raw.c                |   14 -------
>  drivers/usb/gadget/inode.c        |   71 +++++++++++++++++++++++++-----------
>  fs/aio.c                          |   15 +++++---
>  fs/block_dev.c                    |   10 -----
>  fs/cifs/cifsfs.c                  |    6 +--
>  fs/ext3/file.c                    |    5 +-
>  fs/read_write.c                   |   20 ++++++++--
>  fs/reiserfs/file.c                |    8 ----
>  fs/xfs/linux-2.6/xfs_file.c       |   44 +++++++++++-----------
>  include/linux/aio.h               |    2 +
>  include/linux/fs.h                |   10 ++---
>  include/net/sock.h                |    1
>  mm/filemap.c                      |   39 ++++++++++----------
>  net/socket.c                      |   48 ++++++++++++-------------
>  16 files changed, 163 insertions(+), 139 deletions(-)
>
> Index: linux-2.6.17-rc3/Documentation/filesystems/Locking
> ===================================================================
> --- linux-2.6.17-rc3.orig/Documentation/filesystems/Locking	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/Documentation/filesystems/Locking	2006-05-02 07:53:58.000000000 -0700
> @@ -355,10 +355,9 @@ The last two are called only from check_
>  prototypes:
>  	loff_t (*llseek) (struct file *, loff_t, int);
>  	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
> -	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
>  	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
> -	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t,
> -			loff_t);
> +	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
> +	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
>  	int (*readdir) (struct file *, void *, filldir_t);
>  	unsigned int (*poll) (struct file *, struct poll_table_struct *);
>  	int (*ioctl) (struct inode *, struct file *, unsigned int,
> Index: linux-2.6.17-rc3/Documentation/filesystems/vfs.txt
> ===================================================================
> --- linux-2.6.17-rc3.orig/Documentation/filesystems/vfs.txt	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/Documentation/filesystems/vfs.txt	2006-05-02 07:53:58.000000000 -0700
> @@ -699,9 +699,9 @@ This describes how the VFS can manipulat
>  struct file_operations {
>  	loff_t (*llseek) (struct file *, loff_t, int);
>  	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
> -	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
>  	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
> -	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
> +	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
> +	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
>  	int (*readdir) (struct file *, void *, filldir_t);
>  	unsigned int (*poll) (struct file *, struct poll_table_struct *);
>  	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
> Index: linux-2.6.17-rc3/drivers/char/raw.c
> ===================================================================
> --- linux-2.6.17-rc3.orig/drivers/char/raw.c	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/drivers/char/raw.c	2006-05-02 07:53:58.000000000 -0700
> @@ -250,23 +250,11 @@ static ssize_t raw_file_write(struct fil
>  	return generic_file_write_nolock(file, &local_iov, 1, ppos);
>  }
>
> -static ssize_t raw_file_aio_write(struct kiocb *iocb, const char __user *buf,
> -				  size_t count, loff_t pos)
> -{
> -	struct iovec local_iov = {
> -		.iov_base = (char __user *)buf,
> -		.iov_len = count
> -	};
> -
> -	return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
> -}
> -
> -
>  static struct file_operations raw_fops = {
>  	.read	=	generic_file_read,
>  	.aio_read =	generic_file_aio_read,
>  	.write	=	raw_file_write,
> -	.aio_write =	raw_file_aio_write,
> +	.aio_write =	generic_file_aio_write_nolock,
>  	.open	=	raw_open,
>  	.release=	raw_release,
>  	.ioctl	=	raw_ioctl,
> Index: linux-2.6.17-rc3/fs/aio.c
> ===================================================================
> --- linux-2.6.17-rc3.orig/fs/aio.c	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/fs/aio.c	2006-05-02 07:53:58.000000000 -0700
> @@ -15,6 +15,7 @@
>  #include <linux/aio_abi.h>
>  #include <linux/module.h>
>  #include <linux/syscalls.h>
> +#include <linux/uio.h>
>
>  #define DEBUG 0
>
> @@ -1315,8 +1316,11 @@ static ssize_t aio_pread(struct kiocb *i
>  	ssize_t ret = 0;
>
>  	do {
> -		ret = file->f_op->aio_read(iocb, iocb->ki_buf,
> -			iocb->ki_left, iocb->ki_pos);
> +		iocb->ki_inline_vec.iov_base = iocb->ki_buf;
> +		iocb->ki_inline_vec.iov_len = iocb->ki_left;
> +
> +		ret = file->f_op->aio_read(iocb, &iocb->ki_inline_vec,
> +						1, iocb->ki_pos);
>  		/*
>  		 * Can't just depend on iocb->ki_left to determine
>  		 * whether we are done. This may have been a short read.
> @@ -1349,8 +1353,11 @@ static ssize_t aio_pwrite(struct kiocb *
>  	ssize_t ret = 0;
>
>  	do {
> -		ret = file->f_op->aio_write(iocb, iocb->ki_buf,
> -			iocb->ki_left, iocb->ki_pos);
> +		iocb->ki_inline_vec.iov_base = iocb->ki_buf;
> +		iocb->ki_inline_vec.iov_len = iocb->ki_left;
> +
> +		ret = file->f_op->aio_write(iocb, &iocb->ki_inline_vec,
> +						1, iocb->ki_pos);
>  		if (ret > 0) {
>  			iocb->ki_buf += ret;
>  			iocb->ki_left -= ret;
> Index: linux-2.6.17-rc3/fs/block_dev.c
> ===================================================================
> --- linux-2.6.17-rc3.orig/fs/block_dev.c	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/fs/block_dev.c	2006-05-02 07:53:58.000000000 -0700
> @@ -1064,14 +1064,6 @@ static ssize_t blkdev_file_write(struct
>  	return generic_file_write_nolock(file, &local_iov, 1, ppos);
>  }
>
> -static ssize_t blkdev_file_aio_write(struct kiocb *iocb, const char __user *buf,
> -				   size_t count, loff_t pos)
> -{
> -	struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count };
> -
> -	return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
> -}
> -
>  static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
>  {
>  	return blkdev_ioctl(file->f_mapping->host, file, cmd, arg);
> @@ -1094,7 +1086,7 @@ const struct file_operations def_blk_fop
>  	.read		= generic_file_read,
>  	.write		= blkdev_file_write,
>  	.aio_read	= generic_file_aio_read,
> -	.aio_write	= blkdev_file_aio_write,
> +	.aio_write	= generic_file_aio_write_nolock,
>  	.mmap		= generic_file_mmap,
>  	.fsync		= block_fsync,
>  	.unlocked_ioctl	= block_ioctl,
> Index: linux-2.6.17-rc3/fs/cifs/cifsfs.c
> ===================================================================
> --- linux-2.6.17-rc3.orig/fs/cifs/cifsfs.c	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/fs/cifs/cifsfs.c	2006-05-02 07:53:58.000000000 -0700
> @@ -496,13 +496,13 @@ static ssize_t cifs_file_writev(struct f
>  	return written;
>  }
>
> -static ssize_t cifs_file_aio_write(struct kiocb *iocb, const char __user *buf,
> -				   size_t count, loff_t pos)
> +static ssize_t cifs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> +				   unsigned long nr_segs, loff_t pos)
>  {
>  	struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
>  	ssize_t written;
>
> -	written = generic_file_aio_write(iocb, buf, count, pos);
> +	written = generic_file_aio_write(iocb, iov, nr_segs, pos);
>  	if (!CIFS_I(inode)->clientCanCacheAll)
>  		filemap_fdatawrite(inode->i_mapping);
>  	return written;
> Index: linux-2.6.17-rc3/fs/ext3/file.c
> ===================================================================
> --- linux-2.6.17-rc3.orig/fs/ext3/file.c	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/fs/ext3/file.c	2006-05-02 07:53:58.000000000 -0700
> @@ -48,14 +48,15 @@ static int ext3_release_file (struct ino
>  }
>
>  static ssize_t
> -ext3_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos)
> +ext3_file_write(struct kiocb *iocb, const struct iovec *iov,
> +		unsigned long nr_segs, loff_t pos)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_dentry->d_inode;
>  	ssize_t ret;
>  	int err;
>
> -	ret = generic_file_aio_write(iocb, buf, count, pos);
> +	ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
>
>  	/*
>  	 * Skip flushing if there was an error, or if nothing was written.
generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > + unsigned long nr_segs, loff_t pos) > { > struct file *file = iocb->ki_filp; > struct address_space *mapping = file->f_mapping; > struct inode *inode = mapping->host; > ssize_t ret; > - struct iovec local_iov = { .iov_base = (void __user *)buf, > - .iov_len = count }; > > BUG_ON(iocb->ki_pos != pos); > > mutex_lock(&inode->i_mutex); > - ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1, > - &iocb->ki_pos); > + ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); > mutex_unlock(&inode->i_mutex); > > if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { > Index: linux-2.6.17-rc3/net/socket.c > =================================================================== > --- linux-2.6.17-rc3.orig/net/socket.c 2006-04-26 19:19:25.000000000 -0700 > +++ linux-2.6.17-rc3/net/socket.c 2006-05-02 07:53:58.000000000 -0700 > @@ -96,10 +96,10 @@ > #include <linux/netfilter.h> > > static int sock_no_open(struct inode *irrelevant, struct file *dontcare); > -static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf, > - size_t size, loff_t pos); > -static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf, > - size_t size, loff_t pos); > +static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, > + unsigned long nr_segs, loff_t pos); > +static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, > + unsigned long nr_segs, loff_t pos); > static int sock_mmap(struct file *file, struct vm_area_struct * vma); > > static int sock_close(struct inode *inode, struct file *file); > @@ -700,7 +700,7 @@ static ssize_t sock_sendpage(struct file > } > > static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb, > - char __user *ubuf, size_t size, struct sock_iocb *siocb) > + struct sock_iocb *siocb) > { > if (!is_sync_kiocb(iocb)) { > siocb = kmalloc(sizeof(*siocb), GFP_KERNEL); > @@ -710,15 +710,13 @@ static struct sock_iocb 
*alloc_sock_iocb > } > > siocb->kiocb = iocb; > - siocb->async_iov.iov_base = ubuf; > - siocb->async_iov.iov_len = size; > - > iocb->private = siocb; > return siocb; > } > > static ssize_t do_sock_read(struct msghdr *msg, struct kiocb *iocb, > - struct file *file, struct iovec *iov, unsigned long nr_segs) > + struct file *file, const struct iovec *iov, > + unsigned long nr_segs) > { > struct socket *sock = file->private_data; > size_t size = 0; > @@ -749,31 +747,33 @@ static ssize_t sock_readv(struct file *f > init_sync_kiocb(&iocb, NULL); > iocb.private = &siocb; > > - ret = do_sock_read(&msg, &iocb, file, (struct iovec *)iov, nr_segs); > + ret = do_sock_read(&msg, &iocb, file, iov, nr_segs); > if (-EIOCBQUEUED == ret) > ret = wait_on_sync_kiocb(&iocb); > return ret; > } > > -static ssize_t sock_aio_read(struct kiocb *iocb, char __user *ubuf, > - size_t count, loff_t pos) > +static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, > + unsigned long nr_segs, loff_t pos) > { > struct sock_iocb siocb, *x; > > if (pos != 0) > return -ESPIPE; > - if (count == 0) /* Match SYS5 behaviour */ > + > + if (iocb->ki_left == 0) /* Match SYS5 behaviour */ > return 0; > > - x = alloc_sock_iocb(iocb, ubuf, count, &siocb); > + > + x = alloc_sock_iocb(iocb, &siocb); > if (!x) > return -ENOMEM; > - return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, > - &x->async_iov, 1); > + return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs); > } > > static ssize_t do_sock_write(struct msghdr *msg, struct kiocb *iocb, > - struct file *file, struct iovec *iov, unsigned long nr_segs) > + struct file *file, const struct iovec *iov, > + unsigned long nr_segs) > { > struct socket *sock = file->private_data; > size_t size = 0; > @@ -806,28 +806,28 @@ static ssize_t sock_writev(struct file * > init_sync_kiocb(&iocb, NULL); > iocb.private = &siocb; > > - ret = do_sock_write(&msg, &iocb, file, (struct iovec *)iov, nr_segs); > + ret = do_sock_write(&msg, &iocb, file, 
iov, nr_segs);
>  	if (-EIOCBQUEUED == ret)
>  		ret = wait_on_sync_kiocb(&iocb);
>  	return ret;
>  }
> 
> -static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *ubuf,
> -			  size_t count, loff_t pos)
> +static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
> +			  unsigned long nr_segs, loff_t pos)
>  {
>  	struct sock_iocb siocb, *x;
> 
>  	if (pos != 0)
>  		return -ESPIPE;
> -	if (count == 0)		/* Match SYS5 behaviour */
> +
> +	if (iocb->ki_left == 0)	/* Match SYS5 behaviour */
>  		return 0;
> 
> -	x = alloc_sock_iocb(iocb, (void __user *)ubuf, count, &siocb);
> +	x = alloc_sock_iocb(iocb, &siocb);
>  	if (!x)
>  		return -ENOMEM;
> 
> -	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp,
> -			&x->async_iov, 1);
> +	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs);
>  }
> 
> 
> Index: linux-2.6.17-rc3/drivers/usb/gadget/inode.c
> ===================================================================
> --- linux-2.6.17-rc3.orig/drivers/usb/gadget/inode.c	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/drivers/usb/gadget/inode.c	2006-05-02 07:53:58.000000000 -0700
> @@ -528,7 +528,8 @@ struct kiocb_priv {
>  	struct usb_request	*req;
>  	struct ep_data		*epdata;
>  	void			*buf;
> -	char __user		*ubuf;
> +	struct iovec		*iv;
> +	unsigned long		count;
>  	unsigned		actual;
>  };
> 
> @@ -556,18 +557,32 @@ static int ep_aio_cancel(struct kiocb *i
>  static ssize_t ep_aio_read_retry(struct kiocb *iocb)
>  {
>  	struct kiocb_priv	*priv = iocb->private;
> -	ssize_t			status = priv->actual;
> +	ssize_t			len, total;
> +	unsigned long		i;
> 
>  	/* we "retry" to get the right mm context for this: */
> -	status = copy_to_user(priv->ubuf, priv->buf, priv->actual);
> -	if (unlikely(0 != status))
> -		status = -EFAULT;
> -	else
> -		status = priv->actual;
> +
> +	/* copy stuff into user buffers */
> +	total = priv->actual;
> +	len = 0;
> +	for (i = 0; i < priv->count; i++) {
> +		ssize_t this = min(priv->iv[i].iov_len, (size_t)total);
> +
> +		if (copy_to_user(priv->iv[i].iov_base, priv->buf + len, this))
> +			break;
> +
> +		total -= this;
> +		len += this;
> +		if (total <= 0)
> +			break;
> +	}
> +
> +	if (unlikely(len == 0))
> +		len = -EFAULT;
> +
>  	kfree(priv->buf);
>  	kfree(priv);
>  	aio_put_req(iocb);
> -	return status;
> +	return len;
>  }
> 
>  static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
> @@ -615,7 +630,8 @@ ep_aio_rwtail(
>  	char		*buf,
>  	size_t		len,
>  	struct ep_data	*epdata,
> -	char __user	*ubuf
> +	const struct iovec *iv,
> +	unsigned long	count
>  )
>  {
>  	struct kiocb_priv	*priv = (void *) &iocb->private;
> @@ -630,7 +646,8 @@ fail:
>  		return value;
>  	}
>  	iocb->private = priv;
> -	priv->ubuf = ubuf;
> +	priv->iv = (struct iovec *) iv;
> +	priv->count = count;
> 
>  	value = get_ready_ep(iocb->ki_filp->f_flags, epdata);
>  	if (unlikely(value < 0)) {
> @@ -675,36 +692,52 @@ fail:
>  }
> 
>  static ssize_t
> -ep_aio_read(struct kiocb *iocb, char __user *ubuf, size_t len, loff_t o)
> +ep_aio_read(struct kiocb *iocb, const struct iovec *iv,
> +		unsigned long count, loff_t o)
>  {
>  	struct ep_data		*epdata = iocb->ki_filp->private_data;
>  	char			*buf;
> +	size_t			len = iocb->ki_left;
> 
>  	if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN))
>  		return -EINVAL;
> -	buf = kmalloc(len, GFP_KERNEL);
> +
> +	buf = kmalloc(len, GFP_KERNEL);
>  	if (unlikely(!buf))
>  		return -ENOMEM;
> +
>  	iocb->ki_retry = ep_aio_read_retry;
> -	return ep_aio_rwtail(iocb, buf, len, epdata, ubuf);
> +	return ep_aio_rwtail(iocb, buf, len, epdata, iv, count);
>  }
> 
>  static ssize_t
> -ep_aio_write(struct kiocb *iocb, const char __user *ubuf, size_t len, loff_t o)
> +ep_aio_write(struct kiocb *iocb, const struct iovec *iv,
> +		unsigned long count, loff_t o)
>  {
>  	struct ep_data		*epdata = iocb->ki_filp->private_data;
>  	char			*buf;
> +	size_t			len = 0;
> +	unsigned long		i;
> 
>  	if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN)))
>  		return -EINVAL;
> -	buf = kmalloc(len, GFP_KERNEL);
> +
> +	buf = kmalloc(iocb->ki_left, GFP_KERNEL);
>  	if (unlikely(!buf))
>  		return -ENOMEM;
> -	if (unlikely(copy_from_user(buf, ubuf, len) != 0)) {
> -		kfree(buf);
> -		return -EFAULT;
> +
> +	for (i = 0; i < count; i++) {
> +		if (unlikely(copy_from_user(&buf[len], iv[i].iov_base,
> +				iv[i].iov_len) != 0)) {
> +			kfree(buf);
> +			return -EFAULT;
> +		}
> +		len += iv[i].iov_len;
>  	}
> -	return ep_aio_rwtail(iocb, buf, len, epdata, NULL);
> +	return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0);
>  }
> 
>  /*----------------------------------------------------------------------*/
> Index: linux-2.6.17-rc3/include/linux/aio.h
> ===================================================================
> --- linux-2.6.17-rc3.orig/include/linux/aio.h	2006-04-26 19:19:25.000000000 -0700
> +++ linux-2.6.17-rc3/include/linux/aio.h	2006-05-02 07:53:58.000000000 -0700
> @@ -4,6 +4,7 @@
>  #include <linux/list.h>
>  #include <linux/workqueue.h>
>  #include <linux/aio_abi.h>
> +#include <linux/uio.h>
> 
>  #include <asm/atomic.h>
> 
> @@ -112,6 +113,7 @@ struct kiocb {
>  	long			ki_retried;	/* just for testing */
>  	long			ki_kicked;	/* just for testing */
>  	long			ki_queued;	/* just for testing */
> +	struct iovec		ki_inline_vec;	/* inline vector */
> 
>  	struct list_head	ki_list;	/* the aio core uses this
>  					 * for cancellation */

-- 
corporate: <cel at netapp dot com>
personal:  <chucklever at bigfoot dot com>

[-- Attachment #2: 04-nfs-vector-io.diff --]
[-- Type: text/plain, Size: 8232 bytes --]

nfs: update nfs_file_read and nfs_file_write to the new vectored API

From: Chuck Lever <cel@netapp.com>

Migrate the NFS client's read and write file operations to use the new
vectored I/O API.  Note that the direct I/O path supports only standard
non-vectored I/O for now.

Also fix some tab damage in the definition of nfs_file_operations, and
update dprintk's to reflect the true size of loff_t.

Test plan:
Standard read- and write-intensive workloads.
Signed-off-by: Chuck Lever <cel@netapp.com> --- fs/nfs/direct.c | 24 ++++++++++++++++++------ fs/nfs/file.c | 43 ++++++++++++++++++++++++------------------- include/linux/nfs_fs.h | 8 ++++---- 3 files changed, 46 insertions(+), 29 deletions(-) diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 3c72b0c..e5707b3 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -745,8 +745,8 @@ static ssize_t nfs_direct_write(struct k /** * nfs_file_direct_read - file direct read operation for NFS files * @iocb: target I/O control block - * @buf: user's buffer into which to read data - * @count: number of bytes to read + * @iov: vector of user buffers into which to read data + * @nr_segs: size of iov vector * @pos: byte offset in file where reading starts * * We use this function for direct reads instead of calling @@ -763,19 +763,25 @@ static ssize_t nfs_direct_write(struct k * client must read the updated atime from the server back into its * cache. */ -ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos) +ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t retval = -EINVAL; int page_count; struct page **pages; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; + /* XXX: temporary */ + const char __user *buf = iov[0].iov_base; + size_t count = iov[0].iov_len; dprintk("nfs: direct read(%s/%s, %lu@%Ld)\n", file->f_dentry->d_parent->d_name.name, file->f_dentry->d_name.name, (unsigned long) count, (long long) pos); + if (nr_segs != 1) + return -EINVAL; + if (count < 0) goto out; retval = -EFAULT; @@ -807,8 +813,8 @@ out: /** * nfs_file_direct_write - file direct write operation for NFS files * @iocb: target I/O control block - * @buf: user's buffer from which to write data - * @count: number of bytes to write + * @iov: vector of user buffers from which to write data + * @nr_segs: size of iov vector * @pos: byte offset in file where writing 
starts * * We use this function for direct writes instead of calling @@ -829,19 +835,25 @@ out: * Note that O_APPEND is not supported for NFS direct writes, as there * is no atomic O_APPEND write facility in the NFS protocol. */ -ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t retval; int page_count; struct page **pages; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; + /* XXX: temporary */ + const char __user *buf = iov[0].iov_base; + size_t count = iov[0].iov_len; dfprintk(VFS, "nfs: direct write(%s/%s, %lu@%Ld)\n", file->f_dentry->d_parent->d_name.name, file->f_dentry->d_name.name, (unsigned long) count, (long long) pos); + if (nr_segs != 1) + return -EINVAL; + retval = generic_write_checks(file, &pos, &count, 0); if (retval) goto out; diff --git a/fs/nfs/file.c b/fs/nfs/file.c index fade02c..4fea6aa 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -41,8 +41,8 @@ static int nfs_file_release(struct inode static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin); static int nfs_file_mmap(struct file *, struct vm_area_struct *); static ssize_t nfs_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *); -static ssize_t nfs_file_read(struct kiocb *, char __user *, size_t, loff_t); -static ssize_t nfs_file_write(struct kiocb *, const char __user *, size_t, loff_t); +static ssize_t nfs_file_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); +static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); static int nfs_file_flush(struct file *); static int nfs_fsync(struct file *, struct dentry *dentry, int datasync); static int nfs_check_flags(int flags); @@ -53,8 +53,8 @@ const struct file_operations nfs_file_op .llseek = nfs_file_llseek, .read = 
do_sync_read, .write = do_sync_write, - .aio_read = nfs_file_read, - .aio_write = nfs_file_write, + .aio_read = nfs_file_read, + .aio_write = nfs_file_write, .mmap = nfs_file_mmap, .open = nfs_file_open, .flush = nfs_file_flush, @@ -212,26 +212,30 @@ nfs_file_flush(struct file *file) return status; } -static ssize_t -nfs_file_read(struct kiocb *iocb, char __user * buf, size_t count, loff_t pos) +static ssize_t nfs_file_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct dentry * dentry = iocb->ki_filp->f_dentry; struct inode * inode = dentry->d_inode; ssize_t result; + unsigned long seg; + size_t count = 0; + + for (seg = 0; seg < nr_segs; seg++) + count += iov[seg].iov_len; #ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_read(iocb, buf, count, pos); + return nfs_file_direct_read(iocb, iov, nr_segs, pos); #endif - dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n", + dfprintk(VFS, "nfs: read(%s/%s, %lu@%Ld)\n", dentry->d_parent->d_name.name, dentry->d_name.name, - (unsigned long) count, (unsigned long) pos); + (unsigned long) count, (long long) pos); result = nfs_revalidate_file(inode, iocb->ki_filp); nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, count); if (!result) - result = generic_file_aio_read(iocb, buf, count, pos); + result = generic_file_aio_read(iocb, iov, nr_segs, pos); return result; } @@ -343,24 +347,25 @@ struct address_space_operations nfs_file #endif }; -/* - * Write to a file (through the page cache). 
- */ -static ssize_t -nfs_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct dentry * dentry = iocb->ki_filp->f_dentry; struct inode * inode = dentry->d_inode; ssize_t result; + unsigned long seg; + size_t count = 0; + + for (seg = 0; seg < nr_segs; seg++) + count += iov[seg].iov_len; #ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_write(iocb, buf, count, pos); + return nfs_file_direct_write(iocb, iov, nr_segs, pos); #endif - dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n", + dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%Ld)\n", dentry->d_parent->d_name.name, dentry->d_name.name, - inode->i_ino, (unsigned long) count, (unsigned long) pos); + inode->i_ino, (unsigned long) count, (long long) pos); result = -EBUSY; if (IS_SWAPFILE(inode)) @@ -380,7 +385,7 @@ nfs_file_write(struct kiocb *iocb, const goto out; nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count); - result = generic_file_aio_write(iocb, buf, count, pos); + result = generic_file_aio_write(iocb, iov, nr_segs, pos); out: return result; diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index c71227d..f590a87 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -359,10 +359,10 @@ extern int nfs3_removexattr (struct dent */ extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t, unsigned long); -extern ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf, - size_t count, loff_t pos); -extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos); +extern ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); +extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); /* * linux/fs/nfs/dir.c ^ 
permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-02 15:20             ` Chuck Lever
@ 2006-05-02 15:35               ` Badari Pulavarty
  0 siblings, 0 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-05-02 15:35 UTC (permalink / raw)
  To: cel; +Cc: lkml, akpm, Zach Brown, christoph, Benjamin LaHaise, Trond Myklebust

On Tue, 2006-05-02 at 11:20 -0400, Chuck Lever wrote:
> If you apply this one, then the NFS client no longer builds.
> 
> I think you might need to stub out vectored direct I/O support for the
> NFS client temporarily with something like the attached patch.
> 

Yuck. I meant to send this one (with your temporary fix) - which is the
one I was testing earlier.

Thanks,
Badari

This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

 Documentation/filesystems/Locking |    5 +-
 Documentation/filesystems/vfs.txt |    4 +-
 drivers/char/raw.c                |   14 ------
 drivers/usb/gadget/inode.c        |   71 +++++++++++++++++++++++++----------
 fs/aio.c                          |   15 +++++--
 fs/block_dev.c                    |   10 ----
 fs/cifs/cifsfs.c                  |    6 +-
 fs/ext3/file.c                    |    5 +-
 fs/nfs/direct.c                   |   24 +++++++---
 fs/nfs/file.c                     |   43 ++++++++++-----------
 fs/read_write.c                   |   20 ++++++--
 fs/reiserfs/file.c                |    8 ---
 fs/xfs/linux-2.6/xfs_file.c       |   44 +++++++++++-------
 include/linux/aio.h               |    2 +
 include/linux/fs.h                |   10 +--
 include/linux/nfs_fs.h            |    8 +-
 include/net/sock.h                |    1 
 mm/filemap.c                      |   39 ++++++++++--------
 net/socket.c                      |   48 ++++++++++++-------
 19 files changed, 209 insertions(+), 168 deletions(-)

Index: linux-2.6.17-rc3/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.17-rc3.orig/Documentation/filesystems/Locking	2006-04-26 19:19:25.000000000 -0700
+++ linux-2.6.17-rc3/Documentation/filesystems/Locking	2006-05-02 07:53:58.000000000 -0700
@@
-355,10 +355,9 @@ The last two are called only from check_ prototypes: loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, - loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, Index: linux-2.6.17-rc3/Documentation/filesystems/vfs.txt =================================================================== --- linux-2.6.17-rc3.orig/Documentation/filesystems/vfs.txt 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/Documentation/filesystems/vfs.txt 2006-05-02 07:53:58.000000000 -0700 @@ -699,9 +699,9 @@ This describes how the VFS can manipulat struct file_operations { loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); Index: linux-2.6.17-rc3/drivers/char/raw.c =================================================================== --- 
linux-2.6.17-rc3.orig/drivers/char/raw.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/drivers/char/raw.c 2006-05-02 08:28:50.000000000 -0700 @@ -250,23 +250,11 @@ static ssize_t raw_file_write(struct fil return generic_file_write_nolock(file, &local_iov, 1, ppos); } -static ssize_t raw_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) -{ - struct iovec local_iov = { - .iov_base = (char __user *)buf, - .iov_len = count - }; - - return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); -} - - static struct file_operations raw_fops = { .read = generic_file_read, .aio_read = generic_file_aio_read, .write = raw_file_write, - .aio_write = raw_file_aio_write, + .aio_write = generic_file_aio_write_nolock, .open = raw_open, .release= raw_release, .ioctl = raw_ioctl, Index: linux-2.6.17-rc3/fs/aio.c =================================================================== --- linux-2.6.17-rc3.orig/fs/aio.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/aio.c 2006-05-02 08:28:47.000000000 -0700 @@ -15,6 +15,7 @@ #include <linux/aio_abi.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/uio.h> #define DEBUG 0 @@ -1315,8 +1316,11 @@ static ssize_t aio_pread(struct kiocb *i ssize_t ret = 0; do { - ret = file->f_op->aio_read(iocb, iocb->ki_buf, - iocb->ki_left, iocb->ki_pos); + iocb->ki_inline_vec.iov_base = iocb->ki_buf; + iocb->ki_inline_vec.iov_len = iocb->ki_left; + + ret = file->f_op->aio_read(iocb, &iocb->ki_inline_vec, + 1, iocb->ki_pos); /* * Can't just depend on iocb->ki_left to determine * whether we are done. This may have been a short read. 
@@ -1349,8 +1353,11 @@ static ssize_t aio_pwrite(struct kiocb * ssize_t ret = 0; do { - ret = file->f_op->aio_write(iocb, iocb->ki_buf, - iocb->ki_left, iocb->ki_pos); + iocb->ki_inline_vec.iov_base = iocb->ki_buf; + iocb->ki_inline_vec.iov_len = iocb->ki_left; + + ret = file->f_op->aio_write(iocb, &iocb->ki_inline_vec, + 1, iocb->ki_pos); if (ret > 0) { iocb->ki_buf += ret; iocb->ki_left -= ret; Index: linux-2.6.17-rc3/fs/block_dev.c =================================================================== --- linux-2.6.17-rc3.orig/fs/block_dev.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/block_dev.c 2006-05-02 08:28:50.000000000 -0700 @@ -1064,14 +1064,6 @@ static ssize_t blkdev_file_write(struct return generic_file_write_nolock(file, &local_iov, 1, ppos); } -static ssize_t blkdev_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) -{ - struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count }; - - return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); -} - static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) { return blkdev_ioctl(file->f_mapping->host, file, cmd, arg); @@ -1094,7 +1086,7 @@ const struct file_operations def_blk_fop .read = generic_file_read, .write = blkdev_file_write, .aio_read = generic_file_aio_read, - .aio_write = blkdev_file_aio_write, + .aio_write = generic_file_aio_write_nolock, .mmap = generic_file_mmap, .fsync = block_fsync, .unlocked_ioctl = block_ioctl, Index: linux-2.6.17-rc3/fs/cifs/cifsfs.c =================================================================== --- linux-2.6.17-rc3.orig/fs/cifs/cifsfs.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/cifs/cifsfs.c 2006-05-02 08:28:50.000000000 -0700 @@ -496,13 +496,13 @@ static ssize_t cifs_file_writev(struct f return written; } -static ssize_t cifs_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) +static ssize_t 
cifs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct inode *inode = iocb->ki_filp->f_dentry->d_inode; ssize_t written; - written = generic_file_aio_write(iocb, buf, count, pos); + written = generic_file_aio_write(iocb, iov, nr_segs, pos); if (!CIFS_I(inode)->clientCanCacheAll) filemap_fdatawrite(inode->i_mapping); return written; Index: linux-2.6.17-rc3/fs/ext3/file.c =================================================================== --- linux-2.6.17-rc3.orig/fs/ext3/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/ext3/file.c 2006-05-02 08:28:50.000000000 -0700 @@ -48,14 +48,15 @@ static int ext3_release_file (struct ino } static ssize_t -ext3_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +ext3_file_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_dentry->d_inode; ssize_t ret; int err; - ret = generic_file_aio_write(iocb, buf, count, pos); + ret = generic_file_aio_write(iocb, iov, nr_segs, pos); /* * Skip flushing if there was an error, or if nothing was written. 
Index: linux-2.6.17-rc3/fs/read_write.c =================================================================== --- linux-2.6.17-rc3.orig/fs/read_write.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/read_write.c 2006-05-02 08:28:50.000000000 -0700 @@ -227,14 +227,20 @@ static void wait_on_retry_sync_kiocb(str ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos) { + struct iovec iov = { .iov_base = buf, .iov_len = len }; struct kiocb kiocb; ssize_t ret; init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - while (-EIOCBRETRY == - (ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos))) + kiocb.ki_left = len; + + for (;;) { + ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); + if (ret != -EIOCBRETRY) + break; wait_on_retry_sync_kiocb(&kiocb); + } if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); @@ -279,14 +285,20 @@ EXPORT_SYMBOL(vfs_read); ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos) { + struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len }; struct kiocb kiocb; ssize_t ret; init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - while (-EIOCBRETRY == - (ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos))) + kiocb.ki_left = len; + + for (;;) { + ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); + if (ret != -EIOCBRETRY) + break; wait_on_retry_sync_kiocb(&kiocb); + } if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); Index: linux-2.6.17-rc3/fs/reiserfs/file.c =================================================================== --- linux-2.6.17-rc3.orig/fs/reiserfs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/reiserfs/file.c 2006-05-02 07:53:58.000000000 -0700 @@ -1560,12 +1560,6 @@ static ssize_t reiserfs_file_write(struc return res; } -static ssize_t reiserfs_aio_write(struct kiocb *iocb, const char __user * buf, - size_t count, loff_t pos) -{ - return generic_file_aio_write(iocb, buf, count, 
pos); -} - const struct file_operations reiserfs_file_operations = { .read = generic_file_read, .write = reiserfs_file_write, @@ -1575,7 +1569,7 @@ const struct file_operations reiserfs_fi .fsync = reiserfs_sync_file, .sendfile = generic_file_sendfile, .aio_read = generic_file_aio_read, - .aio_write = reiserfs_aio_write, + .aio_write = generic_file_aio_write, .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, }; Index: linux-2.6.17-rc3/fs/xfs/linux-2.6/xfs_file.c =================================================================== --- linux-2.6.17-rc3.orig/fs/xfs/linux-2.6/xfs_file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/xfs/linux-2.6/xfs_file.c 2006-05-02 08:28:50.000000000 -0700 @@ -51,12 +51,11 @@ static struct vm_operations_struct xfs_d STATIC inline ssize_t __xfs_file_read( struct kiocb *iocb, - char __user *buf, + const struct iovec *iov, + unsigned long nr_segs, int ioflags, - size_t count, loff_t pos) { - struct iovec iov = {buf, count}; struct file *file = iocb->ki_filp; vnode_t *vp = vn_from_inode(file->f_dentry->d_inode); ssize_t rval; @@ -65,39 +64,38 @@ __xfs_file_read( if (unlikely(file->f_flags & O_DIRECT)) ioflags |= IO_ISDIRECT; - VOP_READ(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval); + VOP_READ(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval); return rval; } STATIC ssize_t xfs_file_aio_read( struct kiocb *iocb, - char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_read(iocb, buf, IO_ISAIO, count, pos); + return __xfs_file_read(iocb, iov, nr_segs, IO_ISAIO, pos); } STATIC ssize_t xfs_file_aio_read_invis( struct kiocb *iocb, - char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_read(iocb, buf, IO_ISAIO|IO_INVIS, count, pos); + return __xfs_file_read(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos); } STATIC inline ssize_t __xfs_file_write( - 
struct kiocb *iocb, - const char __user *buf, - int ioflags, - size_t count, - loff_t pos) + struct kiocb *iocb, + const struct iovec *iov, + unsigned long nr_segs, + int ioflags, + loff_t pos) { - struct iovec iov = {(void __user *)buf, count}; struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; vnode_t *vp = vn_from_inode(inode); @@ -107,28 +105,28 @@ __xfs_file_write( if (unlikely(file->f_flags & O_DIRECT)) ioflags |= IO_ISDIRECT; - VOP_WRITE(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval); + VOP_WRITE(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval); return rval; } STATIC ssize_t xfs_file_aio_write( struct kiocb *iocb, - const char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_write(iocb, buf, IO_ISAIO, count, pos); + return __xfs_file_write(iocb, iov, nr_segs, IO_ISAIO, pos); } STATIC ssize_t xfs_file_aio_write_invis( struct kiocb *iocb, - const char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_write(iocb, buf, IO_ISAIO|IO_INVIS, count, pos); + return __xfs_file_write(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos); } STATIC inline ssize_t Index: linux-2.6.17-rc3/include/linux/fs.h =================================================================== --- linux-2.6.17-rc3.orig/include/linux/fs.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/include/linux/fs.h 2006-05-02 08:28:50.000000000 -0700 @@ -1015,9 +1015,9 @@ struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); 
+ ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); @@ -1594,11 +1594,11 @@ extern int file_send_actor(read_descript extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *); int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *); -extern ssize_t generic_file_aio_read(struct kiocb *, char __user *, size_t, loff_t); +extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t); extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *); -extern ssize_t generic_file_aio_write(struct kiocb *, const char __user *, size_t, loff_t); +extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t); extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *, - unsigned long, loff_t *); + unsigned long, loff_t); extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *, unsigned long *, loff_t, loff_t *, size_t, size_t); extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *, Index: linux-2.6.17-rc3/include/net/sock.h =================================================================== --- linux-2.6.17-rc3.orig/include/net/sock.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/include/net/sock.h 2006-05-02 07:53:58.000000000 -0700 @@ -659,7 +659,6 @@ struct sock_iocb { struct sock *sk; struct scm_cookie *scm; struct msghdr *msg, async_msg; - struct iovec async_iov; struct kiocb *kiocb; }; Index: linux-2.6.17-rc3/mm/filemap.c =================================================================== --- 
linux-2.6.17-rc3.orig/mm/filemap.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/mm/filemap.c 2006-05-02 08:28:50.000000000 -0700 @@ -1096,14 +1096,12 @@ out: EXPORT_SYMBOL(__generic_file_aio_read); ssize_t -generic_file_aio_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos) +generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - struct iovec local_iov = { .iov_base = buf, .iov_len = count }; - BUG_ON(iocb->ki_pos != pos); - return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos); + return __generic_file_aio_read(iocb, iov, nr_segs, &iocb->ki_pos); } - EXPORT_SYMBOL(generic_file_aio_read); ssize_t @@ -2163,22 +2161,21 @@ out: current->backing_dev_info = NULL; return written ? written : err; } -EXPORT_SYMBOL(generic_file_aio_write_nolock); -ssize_t -generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) +ssize_t generic_file_aio_write_nolock(struct kiocb *iocb, + const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; ssize_t ret; - loff_t pos = *ppos; - ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, ppos); + BUG_ON(iocb->ki_pos != pos); + + ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { - int err; + ssize_t err; err = sync_page_range_nolock(inode, mapping, pos, ret); if (err < 0) @@ -2186,6 +2183,7 @@ generic_file_aio_write_nolock(struct kio } return ret; } +EXPORT_SYMBOL(generic_file_aio_write_nolock); static ssize_t __generic_file_write_nolock(struct file *file, const struct iovec *iov, @@ -2195,9 +2193,11 @@ __generic_file_write_nolock(struct file ssize_t ret; init_sync_kiocb(&kiocb, file); + kiocb.ki_pos = *ppos; ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos); - if (ret 
== -EIOCBQUEUED) + if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); + *ppos = kiocb.ki_pos; return ret; } @@ -2209,28 +2209,27 @@ generic_file_write_nolock(struct file *f ssize_t ret; init_sync_kiocb(&kiocb, file); - ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos); + kiocb.ki_pos = *ppos; + ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, *ppos); if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); + *ppos = kiocb.ki_pos; return ret; } EXPORT_SYMBOL(generic_file_write_nolock); -ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) +ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; ssize_t ret; - struct iovec local_iov = { .iov_base = (void __user *)buf, - .iov_len = count }; BUG_ON(iocb->ki_pos != pos); mutex_lock(&inode->i_mutex); - ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1, - &iocb->ki_pos); + ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); mutex_unlock(&inode->i_mutex); if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { Index: linux-2.6.17-rc3/net/socket.c =================================================================== --- linux-2.6.17-rc3.orig/net/socket.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/net/socket.c 2006-05-02 08:28:50.000000000 -0700 @@ -96,10 +96,10 @@ #include <linux/netfilter.h> static int sock_no_open(struct inode *irrelevant, struct file *dontcare); -static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf, - size_t size, loff_t pos); -static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf, - size_t size, loff_t pos); +static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); +static ssize_t sock_aio_write(struct kiocb 
*iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); static int sock_mmap(struct file *file, struct vm_area_struct * vma); static int sock_close(struct inode *inode, struct file *file); @@ -700,7 +700,7 @@ static ssize_t sock_sendpage(struct file } static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb, - char __user *ubuf, size_t size, struct sock_iocb *siocb) + struct sock_iocb *siocb) { if (!is_sync_kiocb(iocb)) { siocb = kmalloc(sizeof(*siocb), GFP_KERNEL); @@ -710,15 +710,13 @@ static struct sock_iocb *alloc_sock_iocb } siocb->kiocb = iocb; - siocb->async_iov.iov_base = ubuf; - siocb->async_iov.iov_len = size; - iocb->private = siocb; return siocb; } static ssize_t do_sock_read(struct msghdr *msg, struct kiocb *iocb, - struct file *file, struct iovec *iov, unsigned long nr_segs) + struct file *file, const struct iovec *iov, + unsigned long nr_segs) { struct socket *sock = file->private_data; size_t size = 0; @@ -749,31 +747,33 @@ static ssize_t sock_readv(struct file *f init_sync_kiocb(&iocb, NULL); iocb.private = &siocb; - ret = do_sock_read(&msg, &iocb, file, (struct iovec *)iov, nr_segs); + ret = do_sock_read(&msg, &iocb, file, iov, nr_segs); if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&iocb); return ret; } -static ssize_t sock_aio_read(struct kiocb *iocb, char __user *ubuf, - size_t count, loff_t pos) +static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct sock_iocb siocb, *x; if (pos != 0) return -ESPIPE; - if (count == 0) /* Match SYS5 behaviour */ + + if (iocb->ki_left == 0) /* Match SYS5 behaviour */ return 0; - x = alloc_sock_iocb(iocb, ubuf, count, &siocb); + + x = alloc_sock_iocb(iocb, &siocb); if (!x) return -ENOMEM; - return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, - &x->async_iov, 1); + return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs); } static ssize_t do_sock_write(struct msghdr *msg, struct kiocb *iocb, - struct file *file, 
struct iovec *iov, unsigned long nr_segs) + struct file *file, const struct iovec *iov, + unsigned long nr_segs) { struct socket *sock = file->private_data; size_t size = 0; @@ -806,28 +806,28 @@ static ssize_t sock_writev(struct file * init_sync_kiocb(&iocb, NULL); iocb.private = &siocb; - ret = do_sock_write(&msg, &iocb, file, (struct iovec *)iov, nr_segs); + ret = do_sock_write(&msg, &iocb, file, iov, nr_segs); if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&iocb); return ret; } -static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *ubuf, - size_t count, loff_t pos) +static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct sock_iocb siocb, *x; if (pos != 0) return -ESPIPE; - if (count == 0) /* Match SYS5 behaviour */ + + if (iocb->ki_left == 0) /* Match SYS5 behaviour */ return 0; - x = alloc_sock_iocb(iocb, (void __user *)ubuf, count, &siocb); + x = alloc_sock_iocb(iocb, &siocb); if (!x) return -ENOMEM; - return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, - &x->async_iov, 1); + return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs); } Index: linux-2.6.17-rc3/drivers/usb/gadget/inode.c =================================================================== --- linux-2.6.17-rc3.orig/drivers/usb/gadget/inode.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/drivers/usb/gadget/inode.c 2006-05-02 07:53:58.000000000 -0700 @@ -528,7 +528,8 @@ struct kiocb_priv { struct usb_request *req; struct ep_data *epdata; void *buf; - char __user *ubuf; + struct iovec *iv; + unsigned long count; unsigned actual; }; @@ -556,18 +557,32 @@ static int ep_aio_cancel(struct kiocb *i static ssize_t ep_aio_read_retry(struct kiocb *iocb) { struct kiocb_priv *priv = iocb->private; - ssize_t status = priv->actual; + ssize_t len, total; /* we "retry" to get the right mm context for this: */ - status = copy_to_user(priv->ubuf, priv->buf, priv->actual); - if (unlikely(0 != status)) - 
status = -EFAULT; - else - status = priv->actual; + + /* copy stuff into user buffers */ + total = priv->actual; + len = 0; + for (; priv->count > 0; priv->count--, priv->iv++) { + ssize_t this = min(priv->iv->iov_len, (size_t)total); + + if (copy_to_user(priv->iv->iov_base, priv->buf + len, this)) + break; + + total -= this; + len += this; + if (total <= 0) + break; + } + + if (unlikely(total > 0)) + len = -EFAULT; + kfree(priv->buf); kfree(priv); aio_put_req(iocb); - return status; + return len; } static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req) @@ -615,7 +630,8 @@ ep_aio_rwtail( char *buf, size_t len, struct ep_data *epdata, - char __user *ubuf + const struct iovec *iv, + unsigned long count ) { struct kiocb_priv *priv = (void *) &iocb->private; @@ -630,7 +646,8 @@ fail: return value; } iocb->private = priv; - priv->ubuf = ubuf; + priv->iv = (struct iovec *)iv; + priv->count = count; value = get_ready_ep(iocb->ki_filp->f_flags, epdata); if (unlikely(value < 0)) { @@ -675,36 +692,52 @@ fail: } static ssize_t -ep_aio_read(struct kiocb *iocb, char __user *ubuf, size_t len, loff_t o) +ep_aio_read(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t o) { struct ep_data *epdata = iocb->ki_filp->private_data; char *buf; + size_t len = iocb->ki_left; if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN)) return -EINVAL; - buf = kmalloc(len, GFP_KERNEL); + + buf = kmalloc(iocb->ki_left, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; + iocb->ki_retry = ep_aio_read_retry; - return ep_aio_rwtail(iocb, buf, len, epdata, ubuf); + return ep_aio_rwtail(iocb, buf, len, epdata, iv, count); } static ssize_t -ep_aio_write(struct kiocb *iocb, const char __user *ubuf, size_t len, loff_t o) +ep_aio_write(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t o) { struct ep_data *epdata = iocb->ki_filp->private_data; char *buf; + size_t len = 0; + int i; if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN))) return
-EINVAL; - buf = kmalloc(len, GFP_KERNEL); + + buf = kmalloc(iocb->ki_left, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; - if (unlikely(copy_from_user(buf, ubuf, len) != 0)) { - kfree(buf); - return -EFAULT; + + for (i = 0; i < count; i++) { + if (unlikely(copy_from_user(&buf[len], iv[i].iov_base, + iv[i].iov_len) != 0)) { + kfree(buf); + return -EFAULT; + } + len += iv[i].iov_len; } - return ep_aio_rwtail(iocb, buf, len, epdata, NULL); + return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0); } /*----------------------------------------------------------------------*/ Index: linux-2.6.17-rc3/include/linux/aio.h =================================================================== --- linux-2.6.17-rc3.orig/include/linux/aio.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/include/linux/aio.h 2006-05-02 08:28:47.000000000 -0700 @@ -4,6 +4,7 @@ #include <linux/list.h> #include <linux/workqueue.h> #include <linux/aio_abi.h> +#include <linux/uio.h> #include <asm/atomic.h> @@ -112,6 +113,7 @@ struct kiocb { long ki_retried; /* just for testing */ long ki_kicked; /* just for testing */ long ki_queued; /* just for testing */ + struct iovec ki_inline_vec; /* inline vector */ struct list_head ki_list; /* the aio core uses this * for cancellation */ Index: linux-2.6.17-rc3/fs/nfs/direct.c =================================================================== --- linux-2.6.17-rc3.orig/fs/nfs/direct.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/nfs/direct.c 2006-05-02 08:31:58.000000000 -0700 @@ -745,8 +745,8 @@ static ssize_t nfs_direct_write(struct k /** * nfs_file_direct_read - file direct read operation for NFS files * @iocb: target I/O control block - * @buf: user's buffer into which to read data - * @count: number of bytes to read + * @iov: vector of user buffers into which to read data + * @nr_segs: size of iov vector * @pos: byte offset in file where reading starts * * We use this function for direct reads instead of calling @@ -763,19 +763,25
@@ static ssize_t nfs_direct_write(struct k * client must read the updated atime from the server back into its * cache. */ -ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos) +ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t retval = -EINVAL; int page_count; struct page **pages; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; + /* XXX: temporary */ + const char __user *buf = iov[0].iov_base; + size_t count = iov[0].iov_len; dprintk("nfs: direct read(%s/%s, %lu@%Ld)\n", file->f_dentry->d_parent->d_name.name, file->f_dentry->d_name.name, (unsigned long) count, (long long) pos); + if (nr_segs != 1) + return -EINVAL; + if (count < 0) goto out; retval = -EFAULT; @@ -807,8 +813,8 @@ out: /** * nfs_file_direct_write - file direct write operation for NFS files * @iocb: target I/O control block - * @buf: user's buffer from which to write data - * @count: number of bytes to write + * @iov: vector of user buffers from which to write data + * @nr_segs: size of iov vector * @pos: byte offset in file where writing starts * * We use this function for direct writes instead of calling @@ -829,19 +835,25 @@ out: * Note that O_APPEND is not supported for NFS direct writes, as there * is no atomic O_APPEND write facility in the NFS protocol. 
*/ -ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t retval; int page_count; struct page **pages; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; + /* XXX: temporary */ + const char __user *buf = iov[0].iov_base; + size_t count = iov[0].iov_len; dfprintk(VFS, "nfs: direct write(%s/%s, %lu@%Ld)\n", file->f_dentry->d_parent->d_name.name, file->f_dentry->d_name.name, (unsigned long) count, (long long) pos); + if (nr_segs != 1) + return -EINVAL; + retval = generic_write_checks(file, &pos, &count, 0); if (retval) goto out; Index: linux-2.6.17-rc3/fs/nfs/file.c =================================================================== --- linux-2.6.17-rc3.orig/fs/nfs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/fs/nfs/file.c 2006-05-02 08:31:58.000000000 -0700 @@ -41,8 +41,8 @@ static int nfs_file_release(struct inode static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin); static int nfs_file_mmap(struct file *, struct vm_area_struct *); static ssize_t nfs_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *); -static ssize_t nfs_file_read(struct kiocb *, char __user *, size_t, loff_t); -static ssize_t nfs_file_write(struct kiocb *, const char __user *, size_t, loff_t); +static ssize_t nfs_file_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); +static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); static int nfs_file_flush(struct file *); static int nfs_fsync(struct file *, struct dentry *dentry, int datasync); static int nfs_check_flags(int flags); @@ -53,8 +53,8 @@ const struct file_operations nfs_file_op .llseek = nfs_file_llseek, .read = do_sync_read, .write = do_sync_write, - .aio_read = nfs_file_read, - .aio_write = 
nfs_file_write, + .aio_read = nfs_file_read, + .aio_write = nfs_file_write, .mmap = nfs_file_mmap, .open = nfs_file_open, .flush = nfs_file_flush, @@ -212,26 +212,30 @@ nfs_file_flush(struct file *file) return status; } -static ssize_t -nfs_file_read(struct kiocb *iocb, char __user * buf, size_t count, loff_t pos) +static ssize_t nfs_file_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct dentry * dentry = iocb->ki_filp->f_dentry; struct inode * inode = dentry->d_inode; ssize_t result; + unsigned long seg; + size_t count = 0; + + for (seg = 0; seg < nr_segs; seg++) + count += iov[seg].iov_len; #ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_read(iocb, buf, count, pos); + return nfs_file_direct_read(iocb, iov, nr_segs, pos); #endif - dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n", + dfprintk(VFS, "nfs: read(%s/%s, %lu@%Ld)\n", dentry->d_parent->d_name.name, dentry->d_name.name, - (unsigned long) count, (unsigned long) pos); + (unsigned long) count, (long long) pos); result = nfs_revalidate_file(inode, iocb->ki_filp); nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, count); if (!result) - result = generic_file_aio_read(iocb, buf, count, pos); + result = generic_file_aio_read(iocb, iov, nr_segs, pos); return result; } @@ -343,24 +347,25 @@ struct address_space_operations nfs_file #endif }; -/* - * Write to a file (through the page cache). 
- */ -static ssize_t -nfs_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct dentry * dentry = iocb->ki_filp->f_dentry; struct inode * inode = dentry->d_inode; ssize_t result; + unsigned long seg; + size_t count = 0; + + for (seg = 0; seg < nr_segs; seg++) + count += iov[seg].iov_len; #ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_write(iocb, buf, count, pos); + return nfs_file_direct_write(iocb, iov, nr_segs, pos); #endif - dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n", + dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%Ld)\n", dentry->d_parent->d_name.name, dentry->d_name.name, - inode->i_ino, (unsigned long) count, (unsigned long) pos); + inode->i_ino, (unsigned long) count, (long long) pos); result = -EBUSY; if (IS_SWAPFILE(inode)) @@ -380,7 +385,7 @@ nfs_file_write(struct kiocb *iocb, const goto out; nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count); - result = generic_file_aio_write(iocb, buf, count, pos); + result = generic_file_aio_write(iocb, iov, nr_segs, pos); out: return result; Index: linux-2.6.17-rc3/include/linux/nfs_fs.h =================================================================== --- linux-2.6.17-rc3.orig/include/linux/nfs_fs.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3/include/linux/nfs_fs.h 2006-05-02 08:31:58.000000000 -0700 @@ -359,10 +359,10 @@ extern int nfs3_removexattr (struct dent */ extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t, unsigned long); -extern ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf, - size_t count, loff_t pos); -extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos); +extern ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); +extern ssize_t 
nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); /* * linux/fs/nfs/dir.c ^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops. 2006-05-02 15:07 [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops Badari Pulavarty 2006-05-02 15:08 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty @ 2006-05-09 18:03 ` Badari Pulavarty 2006-05-09 18:07 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty 1 sibling, 1 reply; 58+ messages in thread From: Badari Pulavarty @ 2006-05-09 18:03 UTC (permalink / raw) To: lkml, akpm; +Cc: christoph, Benjamin LaHaise, cel, pbadari Hi, This series of patches collapses all the vectored IO support into a single set of file-operation methods, aio_read/aio_write. This work was originally suggested & started by Christoph Hellwig, when Zach Brown tried to add vectored support for AIO. Here is the summary: [PATCH 1/3] Vectorize aio_read/aio_write methods [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead. [PATCH 3/3] Zach's core aio changes to support vectored AIO. BTW, Chuck Lever is currently re-arranging the NFS DIO/AIO code to fit into this model. Thanks to Chuck Lever and Shaggy for tracking down the latest set of issues :) I ran various tests, including LTP, on this series. Andrew, can you include these in the -mm tree? Thanks, Badari ^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-05-09 18:03 ` [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops Badari Pulavarty @ 2006-05-09 18:07 ` Badari Pulavarty 2006-05-09 19:01 ` Andrew Morton 0 siblings, 1 reply; 58+ messages in thread From: Badari Pulavarty @ 2006-05-09 18:07 UTC (permalink / raw) To: lkml; +Cc: akpm, christoph, Benjamin LaHaise, cel This patch vectorizes aio_read() and aio_write() methods to prepare for collapsing all aio & vectored operations into one interface - which is aio_read()/aio_write(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Documentation/filesystems/Locking | 5 +- Documentation/filesystems/vfs.txt | 4 +- drivers/char/raw.c | 14 ------- drivers/usb/gadget/inode.c | 71 +++++++++++++++++++++++++++----------- fs/aio.c | 15 +++++--- fs/block_dev.c | 10 ----- fs/cifs/cifsfs.c | 6 +-- fs/ext3/file.c | 5 +- fs/nfs/direct.c | 24 +++++++++--- fs/nfs/file.c | 43 ++++++++++++----------- fs/ntfs/file.c | 8 +--- fs/ocfs2/file.c | 28 ++++++-------- fs/read_write.c | 20 ++++++++-- fs/reiserfs/file.c | 8 ---- fs/xfs/linux-2.6/xfs_file.c | 44 +++++++++++------------ include/linux/aio.h | 2 + include/linux/fs.h | 10 ++--- include/linux/nfs_fs.h | 8 ++-- include/net/sock.h | 1 mm/filemap.c | 38 +++++++++----------- net/socket.c | 48 ++++++++++++------------- 21 files changed, 224 insertions(+), 188 deletions(-) Index: linux-2.6.17-rc3.save/Documentation/filesystems/Locking =================================================================== --- linux-2.6.17-rc3.save.orig/Documentation/filesystems/Locking 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/Documentation/filesystems/Locking 2006-05-02 07:53:58.000000000 -0700 @@ -355,10 +355,9 @@ The last two are called only from check_ prototypes: loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct 
kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, - loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, Index: linux-2.6.17-rc3.save/Documentation/filesystems/vfs.txt =================================================================== --- linux-2.6.17-rc3.save.orig/Documentation/filesystems/vfs.txt 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/Documentation/filesystems/vfs.txt 2006-05-02 07:53:58.000000000 -0700 @@ -699,9 +699,9 @@ This describes how the VFS can manipulat struct file_operations { loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); Index: linux-2.6.17-rc3.save/drivers/char/raw.c =================================================================== --- linux-2.6.17-rc3.save.orig/drivers/char/raw.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/drivers/char/raw.c 2006-05-09 10:58:58.000000000 -0700 @@ -250,23 +250,11 @@ static ssize_t raw_file_write(struct fil 
return generic_file_write_nolock(file, &local_iov, 1, ppos); } -static ssize_t raw_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) -{ - struct iovec local_iov = { - .iov_base = (char __user *)buf, - .iov_len = count - }; - - return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); -} - - static struct file_operations raw_fops = { .read = generic_file_read, .aio_read = generic_file_aio_read, .write = raw_file_write, - .aio_write = raw_file_aio_write, + .aio_write = generic_file_aio_write_nolock, .open = raw_open, .release= raw_release, .ioctl = raw_ioctl, Index: linux-2.6.17-rc3.save/fs/aio.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/aio.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/aio.c 2006-05-09 10:58:53.000000000 -0700 @@ -15,6 +15,7 @@ #include <linux/aio_abi.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/uio.h> #define DEBUG 0 @@ -1315,8 +1316,11 @@ static ssize_t aio_pread(struct kiocb *i ssize_t ret = 0; do { - ret = file->f_op->aio_read(iocb, iocb->ki_buf, - iocb->ki_left, iocb->ki_pos); + iocb->ki_inline_vec.iov_base = iocb->ki_buf; + iocb->ki_inline_vec.iov_len = iocb->ki_left; + + ret = file->f_op->aio_read(iocb, &iocb->ki_inline_vec, + 1, iocb->ki_pos); /* * Can't just depend on iocb->ki_left to determine * whether we are done. This may have been a short read. 
@@ -1349,8 +1353,11 @@ static ssize_t aio_pwrite(struct kiocb * ssize_t ret = 0; do { - ret = file->f_op->aio_write(iocb, iocb->ki_buf, - iocb->ki_left, iocb->ki_pos); + iocb->ki_inline_vec.iov_base = iocb->ki_buf; + iocb->ki_inline_vec.iov_len = iocb->ki_left; + + ret = file->f_op->aio_write(iocb, &iocb->ki_inline_vec, + 1, iocb->ki_pos); if (ret > 0) { iocb->ki_buf += ret; iocb->ki_left -= ret; Index: linux-2.6.17-rc3.save/fs/block_dev.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/block_dev.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/block_dev.c 2006-05-09 10:58:58.000000000 -0700 @@ -1064,14 +1064,6 @@ static ssize_t blkdev_file_write(struct return generic_file_write_nolock(file, &local_iov, 1, ppos); } -static ssize_t blkdev_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) -{ - struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count }; - - return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); -} - static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) { return blkdev_ioctl(file->f_mapping->host, file, cmd, arg); @@ -1094,7 +1086,7 @@ const struct file_operations def_blk_fop .read = generic_file_read, .write = blkdev_file_write, .aio_read = generic_file_aio_read, - .aio_write = blkdev_file_aio_write, + .aio_write = generic_file_aio_write_nolock, .mmap = generic_file_mmap, .fsync = block_fsync, .unlocked_ioctl = block_ioctl, Index: linux-2.6.17-rc3.save/fs/cifs/cifsfs.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/cifs/cifsfs.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/cifs/cifsfs.c 2006-05-09 10:58:58.000000000 -0700 @@ -496,13 +496,13 @@ static ssize_t cifs_file_writev(struct f return written; } -static ssize_t cifs_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) 
+static ssize_t cifs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct inode *inode = iocb->ki_filp->f_dentry->d_inode; ssize_t written; - written = generic_file_aio_write(iocb, buf, count, pos); + written = generic_file_aio_write(iocb, iov, nr_segs, pos); if (!CIFS_I(inode)->clientCanCacheAll) filemap_fdatawrite(inode->i_mapping); return written; Index: linux-2.6.17-rc3.save/fs/ext3/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ext3/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ext3/file.c 2006-05-09 10:58:58.000000000 -0700 @@ -48,14 +48,15 @@ static int ext3_release_file (struct ino } static ssize_t -ext3_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +ext3_file_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_dentry->d_inode; ssize_t ret; int err; - ret = generic_file_aio_write(iocb, buf, count, pos); + ret = generic_file_aio_write(iocb, iov, nr_segs, pos); /* * Skip flushing if there was an error, or if nothing was written. 
Index: linux-2.6.17-rc3.save/fs/read_write.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/read_write.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/read_write.c 2006-05-09 10:58:58.000000000 -0700 @@ -227,14 +227,20 @@ static void wait_on_retry_sync_kiocb(str ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos) { + struct iovec iov = { .iov_base = buf, .iov_len = len }; struct kiocb kiocb; ssize_t ret; init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - while (-EIOCBRETRY == - (ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos))) + kiocb.ki_left = len; + + for (;;) { + ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); + if (ret != -EIOCBRETRY) + break; wait_on_retry_sync_kiocb(&kiocb); + } if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); @@ -279,14 +285,20 @@ EXPORT_SYMBOL(vfs_read); ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos) { + struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len }; struct kiocb kiocb; ssize_t ret; init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - while (-EIOCBRETRY == - (ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos))) + kiocb.ki_left = len; + + for (;;) { + ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); + if (ret != -EIOCBRETRY) + break; wait_on_retry_sync_kiocb(&kiocb); + } if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); Index: linux-2.6.17-rc3.save/fs/reiserfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/reiserfs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/reiserfs/file.c 2006-05-02 07:53:58.000000000 -0700 @@ -1560,12 +1560,6 @@ static ssize_t reiserfs_file_write(struc return res; } -static ssize_t reiserfs_aio_write(struct kiocb *iocb, const char __user * buf, - size_t count, loff_t pos) -{ - return 
generic_file_aio_write(iocb, buf, count, pos); -} - const struct file_operations reiserfs_file_operations = { .read = generic_file_read, .write = reiserfs_file_write, @@ -1575,7 +1569,7 @@ const struct file_operations reiserfs_fi .fsync = reiserfs_sync_file, .sendfile = generic_file_sendfile, .aio_read = generic_file_aio_read, - .aio_write = reiserfs_aio_write, + .aio_write = generic_file_aio_write, .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, }; Index: linux-2.6.17-rc3.save/fs/xfs/linux-2.6/xfs_file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/xfs/linux-2.6/xfs_file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/xfs/linux-2.6/xfs_file.c 2006-05-09 10:58:58.000000000 -0700 @@ -51,12 +51,11 @@ static struct vm_operations_struct xfs_d STATIC inline ssize_t __xfs_file_read( struct kiocb *iocb, - char __user *buf, + const struct iovec *iov, + unsigned long nr_segs, int ioflags, - size_t count, loff_t pos) { - struct iovec iov = {buf, count}; struct file *file = iocb->ki_filp; vnode_t *vp = vn_from_inode(file->f_dentry->d_inode); ssize_t rval; @@ -65,39 +64,38 @@ __xfs_file_read( if (unlikely(file->f_flags & O_DIRECT)) ioflags |= IO_ISDIRECT; - VOP_READ(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval); + VOP_READ(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval); return rval; } STATIC ssize_t xfs_file_aio_read( struct kiocb *iocb, - char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_read(iocb, buf, IO_ISAIO, count, pos); + return __xfs_file_read(iocb, iov, nr_segs, IO_ISAIO, pos); } STATIC ssize_t xfs_file_aio_read_invis( struct kiocb *iocb, - char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_read(iocb, buf, IO_ISAIO|IO_INVIS, count, pos); + return __xfs_file_read(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, 
pos); } STATIC inline ssize_t __xfs_file_write( - struct kiocb *iocb, - const char __user *buf, - int ioflags, - size_t count, - loff_t pos) + struct kiocb *iocb, + const struct iovec *iov, + unsigned long nr_segs, + int ioflags, + loff_t pos) { - struct iovec iov = {(void __user *)buf, count}; struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; vnode_t *vp = vn_from_inode(inode); @@ -107,28 +105,28 @@ __xfs_file_write( if (unlikely(file->f_flags & O_DIRECT)) ioflags |= IO_ISDIRECT; - VOP_WRITE(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval); + VOP_WRITE(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval); return rval; } STATIC ssize_t xfs_file_aio_write( struct kiocb *iocb, - const char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_write(iocb, buf, IO_ISAIO, count, pos); + return __xfs_file_write(iocb, iov, nr_segs, IO_ISAIO, pos); } STATIC ssize_t xfs_file_aio_write_invis( struct kiocb *iocb, - const char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - return __xfs_file_write(iocb, buf, IO_ISAIO|IO_INVIS, count, pos); + return __xfs_file_write(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos); } STATIC inline ssize_t Index: linux-2.6.17-rc3.save/include/linux/fs.h =================================================================== --- linux-2.6.17-rc3.save.orig/include/linux/fs.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/include/linux/fs.h 2006-05-09 10:58:58.000000000 -0700 @@ -1015,9 +1015,9 @@ struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); + ssize_t 
(*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); @@ -1594,11 +1594,11 @@ extern int file_send_actor(read_descript extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *); int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *); -extern ssize_t generic_file_aio_read(struct kiocb *, char __user *, size_t, loff_t); +extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t); extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *); -extern ssize_t generic_file_aio_write(struct kiocb *, const char __user *, size_t, loff_t); +extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t); extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *, - unsigned long, loff_t *); + unsigned long, loff_t); extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *, unsigned long *, loff_t, loff_t *, size_t, size_t); extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *, Index: linux-2.6.17-rc3.save/include/net/sock.h =================================================================== --- linux-2.6.17-rc3.save.orig/include/net/sock.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/include/net/sock.h 2006-05-02 07:53:58.000000000 -0700 @@ -659,7 +659,6 @@ struct sock_iocb { struct sock *sk; struct scm_cookie *scm; struct msghdr *msg, async_msg; - struct iovec async_iov; struct kiocb *kiocb; }; Index: linux-2.6.17-rc3.save/mm/filemap.c 
=================================================================== --- linux-2.6.17-rc3.save.orig/mm/filemap.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/mm/filemap.c 2006-05-09 10:58:58.000000000 -0700 @@ -1096,14 +1096,12 @@ out: EXPORT_SYMBOL(__generic_file_aio_read); ssize_t -generic_file_aio_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos) +generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - struct iovec local_iov = { .iov_base = buf, .iov_len = count }; - BUG_ON(iocb->ki_pos != pos); - return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos); + return __generic_file_aio_read(iocb, iov, nr_segs, &iocb->ki_pos); } - EXPORT_SYMBOL(generic_file_aio_read); ssize_t @@ -2163,22 +2161,21 @@ out: current->backing_dev_info = NULL; return written ? written : err; } -EXPORT_SYMBOL(generic_file_aio_write_nolock); -ssize_t -generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) +ssize_t generic_file_aio_write_nolock(struct kiocb *iocb, + const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; ssize_t ret; - loff_t pos = *ppos; - ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, ppos); + BUG_ON(iocb->ki_pos != pos); + + ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { - int err; + ssize_t err; err = sync_page_range_nolock(inode, mapping, pos, ret); if (err < 0) @@ -2186,6 +2183,7 @@ generic_file_aio_write_nolock(struct kio } return ret; } +EXPORT_SYMBOL(generic_file_aio_write_nolock); static ssize_t __generic_file_write_nolock(struct file *file, const struct iovec *iov, @@ -2195,8 +2193,9 @@ __generic_file_write_nolock(struct file ssize_t ret; init_sync_kiocb(&kiocb, file); + kiocb.ki_pos = 
*ppos; ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos); - if (ret == -EIOCBQUEUED) + if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); return ret; } @@ -2209,28 +2208,27 @@ generic_file_write_nolock(struct file *f ssize_t ret; init_sync_kiocb(&kiocb, file); - ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos); + kiocb.ki_pos = *ppos; + ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, *ppos); if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); + *ppos = kiocb.ki_pos; return ret; } EXPORT_SYMBOL(generic_file_write_nolock); -ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) +ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; ssize_t ret; - struct iovec local_iov = { .iov_base = (void __user *)buf, - .iov_len = count }; BUG_ON(iocb->ki_pos != pos); mutex_lock(&inode->i_mutex); - ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1, - &iocb->ki_pos); + ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); mutex_unlock(&inode->i_mutex); if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { Index: linux-2.6.17-rc3.save/net/socket.c =================================================================== --- linux-2.6.17-rc3.save.orig/net/socket.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/net/socket.c 2006-05-09 10:58:58.000000000 -0700 @@ -96,10 +96,10 @@ #include <linux/netfilter.h> static int sock_no_open(struct inode *irrelevant, struct file *dontcare); -static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf, - size_t size, loff_t pos); -static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf, - size_t size, loff_t pos); +static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned 
long nr_segs, loff_t pos); +static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); static int sock_mmap(struct file *file, struct vm_area_struct * vma); static int sock_close(struct inode *inode, struct file *file); @@ -700,7 +700,7 @@ static ssize_t sock_sendpage(struct file } static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb, - char __user *ubuf, size_t size, struct sock_iocb *siocb) + struct sock_iocb *siocb) { if (!is_sync_kiocb(iocb)) { siocb = kmalloc(sizeof(*siocb), GFP_KERNEL); @@ -710,15 +710,13 @@ static struct sock_iocb *alloc_sock_iocb } siocb->kiocb = iocb; - siocb->async_iov.iov_base = ubuf; - siocb->async_iov.iov_len = size; - iocb->private = siocb; return siocb; } static ssize_t do_sock_read(struct msghdr *msg, struct kiocb *iocb, - struct file *file, struct iovec *iov, unsigned long nr_segs) + struct file *file, const struct iovec *iov, + unsigned long nr_segs) { struct socket *sock = file->private_data; size_t size = 0; @@ -749,31 +747,33 @@ static ssize_t sock_readv(struct file *f init_sync_kiocb(&iocb, NULL); iocb.private = &siocb; - ret = do_sock_read(&msg, &iocb, file, (struct iovec *)iov, nr_segs); + ret = do_sock_read(&msg, &iocb, file, iov, nr_segs); if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&iocb); return ret; } -static ssize_t sock_aio_read(struct kiocb *iocb, char __user *ubuf, - size_t count, loff_t pos) +static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct sock_iocb siocb, *x; if (pos != 0) return -ESPIPE; - if (count == 0) /* Match SYS5 behaviour */ + + if (iocb->ki_left == 0) /* Match SYS5 behaviour */ return 0; - x = alloc_sock_iocb(iocb, ubuf, count, &siocb); + + x = alloc_sock_iocb(iocb, &siocb); if (!x) return -ENOMEM; - return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, - &x->async_iov, 1); + return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs); } static ssize_t 
do_sock_write(struct msghdr *msg, struct kiocb *iocb, - struct file *file, struct iovec *iov, unsigned long nr_segs) + struct file *file, const struct iovec *iov, + unsigned long nr_segs) { struct socket *sock = file->private_data; size_t size = 0; @@ -806,28 +806,28 @@ static ssize_t sock_writev(struct file * init_sync_kiocb(&iocb, NULL); iocb.private = &siocb; - ret = do_sock_write(&msg, &iocb, file, (struct iovec *)iov, nr_segs); + ret = do_sock_write(&msg, &iocb, file, iov, nr_segs); if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&iocb); return ret; } -static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *ubuf, - size_t count, loff_t pos) +static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct sock_iocb siocb, *x; if (pos != 0) return -ESPIPE; - if (count == 0) /* Match SYS5 behaviour */ + + if (iocb->ki_left == 0) /* Match SYS5 behaviour */ return 0; - x = alloc_sock_iocb(iocb, (void __user *)ubuf, count, &siocb); + x = alloc_sock_iocb(iocb, &siocb); if (!x) return -ENOMEM; - return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, - &x->async_iov, 1); + return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs); } Index: linux-2.6.17-rc3.save/drivers/usb/gadget/inode.c =================================================================== --- linux-2.6.17-rc3.save.orig/drivers/usb/gadget/inode.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/drivers/usb/gadget/inode.c 2006-05-02 07:53:58.000000000 -0700 @@ -528,7 +528,8 @@ struct kiocb_priv { struct usb_request *req; struct ep_data *epdata; void *buf; - char __user *ubuf; + struct iovec *iv; + unsigned long count; unsigned actual; }; @@ -556,18 +557,32 @@ static int ep_aio_cancel(struct kiocb *i static ssize_t ep_aio_read_retry(struct kiocb *iocb) { struct kiocb_priv *priv = iocb->private; - ssize_t status = priv->actual; + ssize_t len, total; /* we "retry" to get the right mm context for this: */ - 
status = copy_to_user(priv->ubuf, priv->buf, priv->actual); - if (unlikely(0 != status)) - status = -EFAULT; - else - status = priv->actual; + + /* copy stuff into user buffers */ + total = priv->actual; + len = 0; + for (i=0; i < priv->count; i++) { + ssize_t this = min(priv->iv[i].iov_len, (size_t)total); + + if (copy_to_user(priv->iv[i].iov_buf, priv->buf, this)) + break; + + total -= this; + len += this; + if (total <= 0) + break; + } + + if (unlikely(len != 0)) + len = -EFAULT; + kfree(priv->buf); kfree(priv); aio_put_req(iocb); - return status; + return len; } static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req) @@ -615,7 +630,8 @@ ep_aio_rwtail( char *buf, size_t len, struct ep_data *epdata, - char __user *ubuf + const struct iovec *iv, + unsigned long count ) { struct kiocb_priv *priv = (void *) &iocb->private; @@ -630,7 +646,8 @@ fail: return value; } iocb->private = priv; - priv->ubuf = ubuf; + priv->iovec = iv; + priv->count = count; value = get_ready_ep(iocb->ki_filp->f_flags, epdata); if (unlikely(value < 0)) { @@ -675,36 +692,52 @@ fail: } static ssize_t -ep_aio_read(struct kiocb *iocb, char __user *ubuf, size_t len, loff_t o) +ep_aio_read(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t o) { struct ep_data *epdata = iocb->ki_filp->private_data; char *buf; + size_t len; + int i = 0; + ssize_t ret; if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN)) return -EINVAL; - buf = kmalloc(len, GFP_KERNEL); + + buf = kmalloc(iocb->ki_left, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; + iocb->ki_retry = ep_aio_read_retry; - return ep_aio_rwtail(iocb, buf, len, epdata, ubuf); + return ep_aio_rwtail(iocb, buf, len, epdata, iv, count); } static ssize_t -ep_aio_write(struct kiocb *iocb, const char __user *ubuf, size_t len, loff_t o) +ep_aio_write(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t o) { struct ep_data *epdata = iocb->ki_filp->private_data; char *buf; + size_t len = 0; + int i 
= 0; + ssize_t ret; if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN))) return -EINVAL; - buf = kmalloc(len, GFP_KERNEL); + + buf = kmalloc(iocb->ki_left, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; - if (unlikely(copy_from_user(buf, ubuf, len) != 0)) { - kfree(buf); - return -EFAULT; + + for (i=0; i < count; i++) { + if (unlikely(copy_from_user(&buf[len], iv[i]->iov_base, + iv[i]->iov_len) != 0)) { + kfree(buf); + return -EFAULT; + } + len += iv[i]->iov_len; } - return ep_aio_rwtail(iocb, buf, len, epdata, NULL); + return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0); } /*----------------------------------------------------------------------*/ Index: linux-2.6.17-rc3.save/include/linux/aio.h =================================================================== --- linux-2.6.17-rc3.save.orig/include/linux/aio.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/include/linux/aio.h 2006-05-09 10:58:53.000000000 -0700 @@ -4,6 +4,7 @@ #include <linux/list.h> #include <linux/workqueue.h> #include <linux/aio_abi.h> +#include <linux/uio.h> #include <asm/atomic.h> @@ -112,6 +113,7 @@ struct kiocb { long ki_retried; /* just for testing */ long ki_kicked; /* just for testing */ long ki_queued; /* just for testing */ + struct iovec ki_inline_vec; /* inline vector */ struct list_head ki_list; /* the aio core uses this * for cancellation */ Index: linux-2.6.17-rc3.save/fs/nfs/direct.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/nfs/direct.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/nfs/direct.c 2006-05-02 08:31:58.000000000 -0700 @@ -745,8 +745,8 @@ static ssize_t nfs_direct_write(struct k /** * nfs_file_direct_read - file direct read operation for NFS files * @iocb: target I/O control block - * @buf: user's buffer into which to read data - * @count: number of bytes to read + * @iov: vector of user buffers into which to read data + * @nr_segs: size of iov vector * @pos: 
byte offset in file where reading starts * * We use this function for direct reads instead of calling @@ -763,19 +763,25 @@ static ssize_t nfs_direct_write(struct k * client must read the updated atime from the server back into its * cache. */ -ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos) +ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t retval = -EINVAL; int page_count; struct page **pages; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; + /* XXX: temporary */ + const char __user *buf = iov[0].iov_base; + size_t count = iov[0].iov_len; dprintk("nfs: direct read(%s/%s, %lu@%Ld)\n", file->f_dentry->d_parent->d_name.name, file->f_dentry->d_name.name, (unsigned long) count, (long long) pos); + if (nr_segs != 1) + return -EINVAL; + if (count < 0) goto out; retval = -EFAULT; @@ -807,8 +813,8 @@ out: /** * nfs_file_direct_write - file direct write operation for NFS files * @iocb: target I/O control block - * @buf: user's buffer from which to write data - * @count: number of bytes to write + * @iov: vector of user buffers from which to write data + * @nr_segs: size of iov vector * @pos: byte offset in file where writing starts * * We use this function for direct writes instead of calling @@ -829,19 +835,25 @@ out: * Note that O_APPEND is not supported for NFS direct writes, as there * is no atomic O_APPEND write facility in the NFS protocol. 
*/ -ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t retval; int page_count; struct page **pages; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; + /* XXX: temporary */ + const char __user *buf = iov[0].iov_base; + size_t count = iov[0].iov_len; dfprintk(VFS, "nfs: direct write(%s/%s, %lu@%Ld)\n", file->f_dentry->d_parent->d_name.name, file->f_dentry->d_name.name, (unsigned long) count, (long long) pos); + if (nr_segs != 1) + return -EINVAL; + retval = generic_write_checks(file, &pos, &count, 0); if (retval) goto out; Index: linux-2.6.17-rc3.save/fs/nfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/nfs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/nfs/file.c 2006-05-02 08:31:58.000000000 -0700 @@ -41,8 +41,8 @@ static int nfs_file_release(struct inode static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin); static int nfs_file_mmap(struct file *, struct vm_area_struct *); static ssize_t nfs_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *); -static ssize_t nfs_file_read(struct kiocb *, char __user *, size_t, loff_t); -static ssize_t nfs_file_write(struct kiocb *, const char __user *, size_t, loff_t); +static ssize_t nfs_file_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); +static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); static int nfs_file_flush(struct file *); static int nfs_fsync(struct file *, struct dentry *dentry, int datasync); static int nfs_check_flags(int flags); @@ -53,8 +53,8 @@ const struct file_operations nfs_file_op .llseek = nfs_file_llseek, .read = do_sync_read, .write = do_sync_write, - .aio_read = nfs_file_read, - 
.aio_write = nfs_file_write, + .aio_read = nfs_file_read, + .aio_write = nfs_file_write, .mmap = nfs_file_mmap, .open = nfs_file_open, .flush = nfs_file_flush, @@ -212,26 +212,30 @@ nfs_file_flush(struct file *file) return status; } -static ssize_t -nfs_file_read(struct kiocb *iocb, char __user * buf, size_t count, loff_t pos) +static ssize_t nfs_file_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct dentry * dentry = iocb->ki_filp->f_dentry; struct inode * inode = dentry->d_inode; ssize_t result; + unsigned long seg; + size_t count = 0; + + for (seg = 0; seg < nr_segs; seg++) + count += iov[seg].iov_len; #ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_read(iocb, buf, count, pos); + return nfs_file_direct_read(iocb, iov, nr_segs, pos); #endif - dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n", + dfprintk(VFS, "nfs: read(%s/%s, %lu@%Ld)\n", dentry->d_parent->d_name.name, dentry->d_name.name, - (unsigned long) count, (unsigned long) pos); + (unsigned long) count, (long long) pos); result = nfs_revalidate_file(inode, iocb->ki_filp); nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, count); if (!result) - result = generic_file_aio_read(iocb, buf, count, pos); + result = generic_file_aio_read(iocb, iov, nr_segs, pos); return result; } @@ -343,24 +347,25 @@ struct address_space_operations nfs_file #endif }; -/* - * Write to a file (through the page cache). 
- */ -static ssize_t -nfs_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos) +static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { struct dentry * dentry = iocb->ki_filp->f_dentry; struct inode * inode = dentry->d_inode; ssize_t result; + unsigned long seg; + size_t count = 0; + + for (seg = 0; seg < nr_segs; seg++) + count += iov[seg].iov_len; #ifdef CONFIG_NFS_DIRECTIO if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_write(iocb, buf, count, pos); + return nfs_file_direct_write(iocb, iov, nr_segs, pos); #endif - dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n", + dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%Ld)\n", dentry->d_parent->d_name.name, dentry->d_name.name, - inode->i_ino, (unsigned long) count, (unsigned long) pos); + inode->i_ino, (unsigned long) count, (long long) pos); result = -EBUSY; if (IS_SWAPFILE(inode)) @@ -380,7 +385,7 @@ nfs_file_write(struct kiocb *iocb, const goto out; nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count); - result = generic_file_aio_write(iocb, buf, count, pos); + result = generic_file_aio_write(iocb, iov, nr_segs, pos); out: return result; Index: linux-2.6.17-rc3.save/include/linux/nfs_fs.h =================================================================== --- linux-2.6.17-rc3.save.orig/include/linux/nfs_fs.h 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/include/linux/nfs_fs.h 2006-05-02 08:31:58.000000000 -0700 @@ -359,10 +359,10 @@ extern int nfs3_removexattr (struct dent */ extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t, unsigned long); -extern ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf, - size_t count, loff_t pos); -extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos); +extern ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); 
+extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos); /* * linux/fs/nfs/dir.c Index: linux-2.6.17-rc3.save/fs/ocfs2/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ocfs2/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ocfs2/file.c 2006-05-05 13:36:49.000000000 -0700 @@ -929,25 +929,23 @@ static inline int ocfs2_write_should_rem } static ssize_t ocfs2_file_aio_write(struct kiocb *iocb, - const char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - struct iovec local_iov = { .iov_base = (void __user *)buf, - .iov_len = count }; int ret, rw_level = -1, meta_level = -1, have_alloc_sem = 0; u32 clusters; struct file *filp = iocb->ki_filp; struct inode *inode = filp->f_dentry->d_inode; loff_t newsize, saved_pos; - mlog_entry("(0x%p, 0x%p, %u, '%.*s')\n", filp, buf, - (unsigned int)count, + mlog_entry("(0x%p, %u, '%.*s')\n", filp, + (unsigned int)nr_segs, filp->f_dentry->d_name.len, filp->f_dentry->d_name.name); /* happy write of zero bytes */ - if (count == 0) + if (iocb->ki_left == 0) return 0; if (!inode) { @@ -1016,7 +1014,7 @@ static ssize_t ocfs2_file_aio_write(stru } else { saved_pos = iocb->ki_pos; } - newsize = count + saved_pos; + newsize = iocb->ki_left + saved_pos; mlog(0, "pos=%lld newsize=%lld cursize=%lld\n", (long long) saved_pos, (long long) newsize, @@ -1059,7 +1057,7 @@ static ssize_t ocfs2_file_aio_write(stru /* Fill any holes which would've been created by this * write. If we're O_APPEND, this will wind up * (correctly) being a noop. 
*/ - ret = ocfs2_zero_extend(inode, (u64) newsize - count); + ret = ocfs2_zero_extend(inode, (u64) newsize - iocb->ki_left); if (ret < 0) { mlog_errno(ret); goto out; @@ -1075,7 +1073,7 @@ static ssize_t ocfs2_file_aio_write(stru /* communicate with ocfs2_dio_end_io */ ocfs2_iocb_set_rw_locked(iocb); - ret = generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); + ret = generic_file_aio_write_nolock(iocb, iov, nr_segs, iocb->ki_pos); /* buffered aio wouldn't have proper lock coverage today */ BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT)); @@ -1109,16 +1107,16 @@ out: } static ssize_t ocfs2_file_aio_read(struct kiocb *iocb, - char __user *buf, - size_t count, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { int ret = 0, rw_level = -1, have_alloc_sem = 0; struct file *filp = iocb->ki_filp; struct inode *inode = filp->f_dentry->d_inode; - mlog_entry("(0x%p, 0x%p, %u, '%.*s')\n", filp, buf, - (unsigned int)count, + mlog_entry("(0x%p, %u, '%.*s')\n", filp, + (unsigned int)nr_segs, filp->f_dentry->d_name.len, filp->f_dentry->d_name.name); @@ -1146,7 +1144,7 @@ static ssize_t ocfs2_file_aio_read(struc ocfs2_iocb_set_rw_locked(iocb); } - ret = generic_file_aio_read(iocb, buf, count, iocb->ki_pos); + ret = generic_file_aio_read(iocb, iov, nr_segs, iocb->ki_pos); if (ret == -EINVAL) mlog(ML_ERROR, "generic_file_aio_read returned -EINVAL\n"); Index: linux-2.6.17-rc3.save/fs/ntfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ntfs/file.c 2006-05-02 08:28:50.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ntfs/file.c 2006-05-09 10:58:58.000000000 -0700 @@ -2174,20 +2174,18 @@ out: /** * ntfs_file_aio_write - */ -static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const char __user *buf, - size_t count, loff_t pos) +static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct file *file = iocb->ki_filp; struct 
address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; ssize_t ret; - struct iovec local_iov = { .iov_base = (void __user *)buf, - .iov_len = count }; BUG_ON(iocb->ki_pos != pos); mutex_lock(&inode->i_mutex); - ret = ntfs_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); + ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); mutex_unlock(&inode->i_mutex); if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { int err = sync_page_range(inode, mapping, pos, ret); ^ permalink raw reply [flat|nested] 58+ messages in thread
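[Editorial note: the reworked do_sync_read() loop in fs/read_write.c above can be sketched as a self-contained userspace model. The fake_* names are hypothetical, the -EIOCBRETRY round trip is simulated with a counter, and the comment marks where wait_on_retry_sync_kiocb() would sleep in the kernel.]

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

#define EIOCBRETRY 529  /* kernel-internal value, reused here for illustration */

/* Minimal model of the kiocb position state the sync path cares about. */
struct fake_kiocb { off_t ki_pos; size_t ki_left; };

/* Hypothetical vectored reader that asks to be retried once before
 * copying data out, mimicking one -EIOCBRETRY round trip. */
static int retries_left = 1;
static ssize_t fake_aio_read(struct fake_kiocb *kiocb,
                             const struct iovec *iov, unsigned long nr_segs,
                             const char *src)
{
    (void)nr_segs;
    if (retries_left-- > 0)
        return -EIOCBRETRY;
    size_t n = iov[0].iov_len;
    memcpy(iov[0].iov_base, src + kiocb->ki_pos, n);
    kiocb->ki_pos += (off_t)n;
    return (ssize_t)n;
}

/* Mirrors the patched do_sync_read(): wrap (buf, len) in an iovec,
 * seed ki_pos/ki_left, and loop for as long as the method asks to
 * be retried. */
static ssize_t fake_sync_read(const char *src, char *buf, size_t len,
                              off_t *ppos)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fake_kiocb kiocb = { .ki_pos = *ppos, .ki_left = len };
    ssize_t ret;

    for (;;) {
        ret = fake_aio_read(&kiocb, &iov, 1, src);
        if (ret != -EIOCBRETRY)
            break;
        /* wait_on_retry_sync_kiocb() would sleep here */
    }
    if (ret > 0)
        *ppos = kiocb.ki_pos;
    return ret;
}
```

Note how ki_left is seeded from len: the patch relies on this so that callees such as sock_aio_read() can recover the total size without a count argument.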
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-05-09 18:07 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty @ 2006-05-09 19:01 ` Andrew Morton 2006-05-09 19:03 ` Christoph Hellwig 2006-05-09 23:53 ` Badari Pulavarty 0 siblings, 2 replies; 58+ messages in thread From: Andrew Morton @ 2006-05-09 19:01 UTC (permalink / raw) To: Badari Pulavarty; +Cc: linux-kernel, hch, bcrl, cel Badari Pulavarty <pbadari@us.ibm.com> wrote: > > static ssize_t ep_aio_read_retry(struct kiocb *iocb) > { > struct kiocb_priv *priv = iocb->private; > - ssize_t status = priv->actual; > + ssize_t len, total; > > /* we "retry" to get the right mm context for this: */ > - status = copy_to_user(priv->ubuf, priv->buf, priv->actual); > - if (unlikely(0 != status)) > - status = -EFAULT; > - else > - status = priv->actual; > + > + /* copy stuff into user buffers */ > + total = priv->actual; > + len = 0; > + for (i=0; i < priv->count; i++) { for (i = 0 > + ssize_t this = min(priv->iv[i].iov_len, (size_t)total); min_t(). Strange mixture of size_t and ssize_t there. > + if (copy_to_user(priv->iv[i].iov_buf, priv->buf, this)) > + break; > + > + total -= this; > + len += this; > + if (total <= 0) > + break; > + } > + > + if (unlikely(len != 0)) > + len = -EFAULT; This looks wrong. I think you meant (total != 0). Together these three patches shrink the kernel by 113 lines. I don't know what the effect is on text size, but that's a pretty modest saving, at a pretty high risk level. What else do we get in return for this risk? ^ permalink raw reply [flat|nested] 58+ messages in thread
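[Editorial note: folding Andrew's two comments into the gadget retry loop might look like the following userspace sketch — memcpy stands in for copy_to_user(), -1 for -EFAULT, and the success check is on the bytes remaining rather than on len. It also advances the source offset per segment, which the posted loop appears to omit; names are illustrative, not the final patch.]

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* min_t() as in the kernel: compare after casting both sides. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Scatter a linear 'actual'-byte buffer out into an iovec array.
 * Returns bytes copied, or -1 (standing in for -EFAULT) if the full
 * amount could not be delivered. Per the review, the failure test is
 * on the amount remaining ('total'), not on 'len'. */
static ssize_t scatter_to_iov(const char *buf, size_t actual,
                              const struct iovec *iv, unsigned long count)
{
    ssize_t total = (ssize_t)actual;
    ssize_t len = 0;

    for (unsigned long i = 0; i < count && total > 0; i++) {
        size_t this = min_t(size_t, iv[i].iov_len, (size_t)total);

        /* copy_to_user() in the kernel; memcpy in this model,
         * reading from buf + len so each segment gets fresh data */
        memcpy(iv[i].iov_base, buf + len, this);
        total -= (ssize_t)this;
        len += (ssize_t)this;
    }
    return total != 0 ? -1 : len;
}
```

A two-segment destination that exactly covers the buffer succeeds; truncating the vector leaves total non-zero and reports the fault.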
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-09 19:01 ` Andrew Morton
@ 2006-05-09 19:03   ` Christoph Hellwig
  2006-05-09 19:13     ` Andrew Morton
  2006-05-09 20:07     ` Badari Pulavarty
  2006-05-09 23:53   ` Badari Pulavarty
  1 sibling, 2 replies; 58+ messages in thread
From: Christoph Hellwig @ 2006-05-09 19:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: Badari Pulavarty, linux-kernel, hch, bcrl, cel

On Tue, May 09, 2006 at 12:01:05PM -0700, Andrew Morton wrote:
> Together these three patches shrink the kernel by 113 lines.  I don't know
> what the effect is on text size, but that's a pretty modest saving, at a
> pretty high risk level.
>
> What else do we get in return for this risk?

there's another patch on top, which I didn't bother to redo until this is
accepted, that kills a lot more code.  After that filesystems only have
to implement one method each for all kinds of read/write calls.  That
allows us both to make mm/filemap.c far less complex and actually
understandable, and to help any filesystem that uses more complex
read/write variants than direct filemap.c calls.  In addition to these
simplifications we also get a feature (async vectored I/O) for free.

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-09 19:03 ` Christoph Hellwig
@ 2006-05-09 19:13   ` Andrew Morton
  2006-05-09 19:20     ` Christoph Hellwig
  2006-05-10 20:50     ` Badari Pulavarty
  2006-05-09 20:07   ` Badari Pulavarty
  1 sibling, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2006-05-09 19:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: pbadari, linux-kernel, hch, bcrl, cel

Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, May 09, 2006 at 12:01:05PM -0700, Andrew Morton wrote:
> > Together these three patches shrink the kernel by 113 lines.  I don't know
> > what the effect is on text size, but that's a pretty modest saving, at a
> > pretty high risk level.
> >
> > What else do we get in return for this risk?
>
> there's another patch on top, which I didn't bother to redo until this is
> accepted, that kills a lot more code.  After that filesystems only have
> to implement one method each for all kinds of read/write calls.  That
> allows us both to make mm/filemap.c far less complex and actually
> understandable, and to help any filesystem that uses more complex
> read/write variants than direct filemap.c calls.  In addition to these
> simplifications we also get a feature (async vectored I/O) for free.

Fair enough, thanks.  Simplifying filemap.c would be a win.

I'll crunch on these three patches in the normal fashion.  It'll be good if
we can get the followup patch done within the next week or two so we can
get it all tested at the same time.  Although from your description it
doesn't sound like it'll be completely trivial...

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-09 19:13 ` Andrew Morton
@ 2006-05-09 19:20   ` Christoph Hellwig
  2006-05-09 23:57     ` Badari Pulavarty
  2006-05-10 16:01     ` Badari Pulavarty
  2006-05-10 20:50   ` Badari Pulavarty
  1 sibling, 2 replies; 58+ messages in thread
From: Christoph Hellwig @ 2006-05-09 19:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Hellwig, pbadari, linux-kernel, bcrl, cel

On Tue, May 09, 2006 at 12:13:05PM -0700, Andrew Morton wrote:
> > there's another patch on top, which I didn't bother to redo until this is
> > accepted, that kills a lot more code.  After that filesystems only have
> > to implement one method each for all kinds of read/write calls.  That
> > allows us both to make mm/filemap.c far less complex and actually
> > understandable, and to help any filesystem that uses more complex
> > read/write variants than direct filemap.c calls.  In addition to these
> > simplifications we also get a feature (async vectored I/O) for free.
>
> Fair enough, thanks.  Simplifying filemap.c would be a win.
>
> I'll crunch on these three patches in the normal fashion.  It'll be good if
> we can get the followup patch done within the next week or two so we can
> get it all tested at the same time.  Although from your description it
> doesn't sound like it'll be completely trivial...

That patch is lots of trivial and boring work.  If anyone wants to beat
me to it:

 - in any filesystem that implements the generic_file_aio_{read,write}
   methods directly, remove the old entries and apply this change to the
   file_operations vectors:

	-	.read		= generic_file_read,
	-	.write		= generic_file_write,
	+	.read		= do_sync_read,
	+	.write		= do_sync_write,

   Note that this does _not_ cause additional indirection for normal
   sys_read/sys_write calls because they call .aio_read/.aio_write
   directly.
   It's only needed because we have various places in the tree that like
   to call .read/.write directly

 - in the filesystems that implement more or less trivial wrappers around
   generic_file_read/generic_file_write, convert those wrappers to the
   aio_read/aio_write prototypes so they can set .read/.write as above

 - after that generic_file_read/generic_file_write/generic_file_write_nolock
   should have no callers left and the code for read/write in mm/filemap.c
   can be collapsed into very few functions.  What's left should be
   something like:

    - generic_file_aio_read (__generic_file_aio_read and
      generic_file_aio_read merged into one)
    - __generic_file_aio_write (basically the current
      __generic_file_aio_write_nolock)
    - generic_file_aio_write_nolock
    - generic_file_aio_write (small wrappers around __generic_file_aio_write)

^ permalink raw reply	[flat|nested] 58+ messages in thread
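The do_sync_read() side of this recipe is easy to picture with a toy userspace model. This is plain C, not kernel code: the `demo_` structs, the EIOCBQUEUED value, and the sample method are all stand-ins, and the real helper propagates the position through the kiocb rather than a bare pointer. The shape is the point: build a synchronous kiocb, call the file's aio_read, and wait only if the method reports that it queued the request.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define DEMO_EIOCBQUEUED 529  /* stand-in for the kernel's EIOCBQUEUED */

struct demo_file;

struct demo_kiocb {
	struct demo_file *ki_filp;
	long long ki_pos;
	ssize_t queued_result;   /* result delivered "on completion" */
};

struct demo_fops {
	ssize_t (*aio_read)(struct demo_kiocb *, char *, size_t, long long);
};

struct demo_file {
	const struct demo_fops *f_op;
	const char *data;
	size_t size;
};

/* stand-in for wait_on_sync_kiocb(): just collect the deferred result */
static ssize_t demo_wait(struct demo_kiocb *iocb)
{
	return iocb->queued_result;
}

/* a sample method that pretends to complete asynchronously */
static ssize_t demo_aio_read(struct demo_kiocb *iocb, char *buf, size_t len,
			     long long pos)
{
	const struct demo_file *filp = iocb->ki_filp;
	size_t n = (size_t)pos < filp->size ? filp->size - (size_t)pos : 0;

	if (n > len)
		n = len;
	memcpy(buf, filp->data + pos, n);
	iocb->queued_result = (ssize_t)n;
	return -DEMO_EIOCBQUEUED;   /* "queued"; the caller must wait */
}

/* the do_sync_read() shape: sync kiocb, call aio_read, wait if queued */
static ssize_t demo_sync_read(struct demo_file *filp, char *buf, size_t len,
			      long long *ppos)
{
	struct demo_kiocb kiocb = { .ki_filp = filp, .ki_pos = *ppos };
	ssize_t ret = filp->f_op->aio_read(&kiocb, buf, len, *ppos);

	if (ret == -DEMO_EIOCBQUEUED)
		ret = demo_wait(&kiocb);
	if (ret > 0)
		*ppos += ret;
	return ret;
}
```

Because the wrapper only blocks on the queued-result path, a filesystem that sets `.read = do_sync_read` pays nothing extra when its aio_read completes synchronously.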
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-09 19:20 ` Christoph Hellwig
@ 2006-05-09 23:57   ` Badari Pulavarty
  2006-05-10  8:00     ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2006-05-09 23:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andrew Morton, lkml, Benjamin LaHaise, cel

On Tue, 2006-05-09 at 21:20 +0200, Christoph Hellwig wrote:
> On Tue, May 09, 2006 at 12:13:05PM -0700, Andrew Morton wrote:
> > > there's another patch on top, which I didn't bother to redo until this is
> > > accepted, that kills a lot more code.  After that filesystems only have
> > > to implement one method each for all kinds of read/write calls.  That
> > > allows us both to make mm/filemap.c far less complex and actually
> > > understandable, and to help any filesystem that uses more complex
> > > read/write variants than direct filemap.c calls.  In addition to these
> > > simplifications we also get a feature (async vectored I/O) for free.
> >
> > Fair enough, thanks.  Simplifying filemap.c would be a win.
> >
> > I'll crunch on these three patches in the normal fashion.  It'll be good if
> > we can get the followup patch done within the next week or two so we can
> > get it all tested at the same time.  Although from your description it
> > doesn't sound like it'll be completely trivial...
>
> That patch is lots of trivial and boring work.  If anyone wants to beat
> me to it:

Well, I am not sure if you meant *exactly* this..

So far, I have this.  I really don't like the idea of
adding .aio_read/.aio_write methods for the filesystems that currently
don't have one (so we can force their .read/.write to do_sync_*()).

Is there a way to fix the callers of the .read/.write() methods to use
something like do_sync_read/write instead, so that we can take out
.read/.write completely?

Anyway, here it is, compiled but untested..  I think I can clean up
more in filemap.c (after reading through your suggestions).
Please let me know, if I am on wrong path ... Thanks, Badari Patch to remove generic_file_read() and generic_file_write() as we seem to have too many interfaces. Make .read/.write methods for filesystems to use do_sync_read() and do_sync_write() which makes use of aio_read/aio_write(). I really don't like keeping .read()/.write() methods since sys_read/sys_write() can make use of async methods - but this is for those who call .read/.write() directly. drivers/char/raw.c | 4 +-- fs/adfs/file.c | 6 +++-- fs/affs/file.c | 6 +++-- fs/bfs/file.c | 6 +++-- fs/block_dev.c | 2 - fs/ext2/file.c | 4 +-- fs/fuse/file.c | 6 +++-- fs/hfs/inode.c | 6 +++-- fs/hfsplus/inode.c | 6 +++-- fs/hostfs/hostfs_kern.c | 4 +-- fs/hpfs/file.c | 6 +++-- fs/jffs/inode-v23.c | 6 +++-- fs/jffs2/file.c | 6 +++-- fs/jfs/file.c | 4 +-- fs/minix/file.c | 6 +++-- fs/ntfs/file.c | 2 - fs/qnx4/file.c | 6 +++-- fs/ramfs/file-mmu.c | 6 +++-- fs/ramfs/file-nommu.c | 6 +++-- fs/read_write.c | 3 +- include/linux/fs.h | 2 - mm/filemap.c | 55 ------------------------------------------------ 22 files changed, 64 insertions(+), 94 deletions(-) Index: linux-2.6.17-rc3.save/drivers/char/raw.c =================================================================== --- linux-2.6.17-rc3.save.orig/drivers/char/raw.c 2006-05-09 14:11:51.000000000 -0700 +++ linux-2.6.17-rc3.save/drivers/char/raw.c 2006-05-09 14:15:28.000000000 -0700 @@ -251,9 +251,9 @@ static ssize_t raw_file_write(struct fil } static struct file_operations raw_fops = { - .read = generic_file_read, + .read = do_sync_read, .aio_read = generic_file_aio_read, - .write = raw_file_write, + .write = do_sync_write, .aio_write = generic_file_aio_write_nolock, .open = raw_open, .release= raw_release, Index: linux-2.6.17-rc3.save/fs/adfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/adfs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/adfs/file.c 2006-05-09 14:31:50.000000000 
-0700 @@ -27,10 +27,12 @@ const struct file_operations adfs_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .mmap = generic_file_mmap, .fsync = file_fsync, - .write = generic_file_write, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .sendfile = generic_file_sendfile, }; Index: linux-2.6.17-rc3.save/fs/affs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/affs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/affs/file.c 2006-05-09 14:35:22.000000000 -0700 @@ -27,8 +27,10 @@ static int affs_file_release(struct inod const struct file_operations affs_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .open = affs_file_open, .release = affs_file_release, Index: linux-2.6.17-rc3.save/fs/bfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/bfs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/bfs/file.c 2006-05-09 14:36:49.000000000 -0700 @@ -19,8 +19,10 @@ const struct file_operations bfs_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, }; Index: linux-2.6.17-rc3.save/fs/block_dev.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/block_dev.c 2006-05-09 14:11:51.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/block_dev.c 2006-05-09 14:39:54.000000000 -0700 @@ -1083,7 +1083,7 @@ const struct 
file_operations def_blk_fop .open = blkdev_open, .release = blkdev_close, .llseek = block_llseek, - .read = generic_file_read, + .read = do_sync_read, .write = blkdev_file_write, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write_nolock, Index: linux-2.6.17-rc3.save/fs/ext2/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ext2/file.c 2006-05-09 14:11:51.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ext2/file.c 2006-05-09 14:41:14.000000000 -0700 @@ -41,8 +41,8 @@ static int ext2_release_file (struct ino */ const struct file_operations ext2_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, .ioctl = ext2_ioctl, Index: linux-2.6.17-rc3.save/fs/fuse/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/fuse/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/fuse/file.c 2006-05-09 14:44:43.000000000 -0700 @@ -621,8 +621,10 @@ static int fuse_set_page_dirty(struct pa static const struct file_operations fuse_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = fuse_file_mmap, .open = fuse_open, .flush = fuse_flush, Index: linux-2.6.17-rc3.save/fs/hfs/inode.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hfs/inode.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hfs/inode.c 2006-05-09 14:46:37.000000000 -0700 @@ -603,8 +603,10 @@ int hfs_inode_setattr(struct dentry *den static const struct file_operations hfs_file_operations = { .llseek = generic_file_llseek, - .read = 
generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, .fsync = file_fsync, Index: linux-2.6.17-rc3.save/fs/hfsplus/inode.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hfsplus/inode.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hfsplus/inode.c 2006-05-09 15:05:44.000000000 -0700 @@ -282,8 +282,10 @@ static struct inode_operations hfsplus_f static const struct file_operations hfsplus_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, .fsync = file_fsync, Index: linux-2.6.17-rc3.save/fs/hostfs/hostfs_kern.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hostfs/hostfs_kern.c 2006-05-09 14:11:51.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hostfs/hostfs_kern.c 2006-05-09 15:06:37.000000000 -0700 @@ -386,11 +386,11 @@ int hostfs_fsync(struct file *file, stru static const struct file_operations hostfs_file_fops = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, .sendfile = generic_file_sendfile, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, - .write = generic_file_write, + .write = do_sync_write, .mmap = generic_file_mmap, .open = hostfs_file_open, .release = NULL, Index: linux-2.6.17-rc3.save/fs/hpfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hpfs/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hpfs/file.c 2006-05-09 15:08:53.000000000 -0700 @@ -113,7 +113,7 @@ 
static ssize_t hpfs_file_write(struct fi { ssize_t retval; - retval = generic_file_write(file, buf, count, ppos); + retval = do_sync_write(file, buf, count, ppos); if (retval > 0) hpfs_i(file->f_dentry->d_inode)->i_dirty = 1; return retval; @@ -122,8 +122,10 @@ static ssize_t hpfs_file_write(struct fi const struct file_operations hpfs_file_ops = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .write = hpfs_file_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .release = hpfs_file_release, .fsync = hpfs_file_fsync, Index: linux-2.6.17-rc3.save/fs/jffs/inode-v23.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/jffs/inode-v23.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/jffs/inode-v23.c 2006-05-09 15:10:34.000000000 -0700 @@ -1633,8 +1633,10 @@ static const struct file_operations jffs { .open = generic_file_open, .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .ioctl = jffs_ioctl, .mmap = generic_file_readonly_mmap, .fsync = jffs_fsync, Index: linux-2.6.17-rc3.save/fs/jffs2/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/jffs2/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/jffs2/file.c 2006-05-09 15:11:58.000000000 -0700 @@ -42,8 +42,10 @@ const struct file_operations jffs2_file_ { .llseek = generic_file_llseek, .open = generic_file_open, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .ioctl = jffs2_ioctl, .mmap = generic_file_readonly_mmap, .fsync = jffs2_fsync, Index: linux-2.6.17-rc3.save/fs/jfs/file.c 
=================================================================== --- linux-2.6.17-rc3.save.orig/fs/jfs/file.c 2006-05-09 14:11:51.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/jfs/file.c 2006-05-09 15:12:41.000000000 -0700 @@ -103,8 +103,8 @@ struct inode_operations jfs_file_inode_o const struct file_operations jfs_file_operations = { .open = jfs_open, .llseek = generic_file_llseek, - .write = generic_file_write, - .read = generic_file_read, + .write = do_sync_write, + .read = do_sync_read, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, Index: linux-2.6.17-rc3.save/fs/minix/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/minix/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/minix/file.c 2006-05-09 15:15:06.000000000 -0700 @@ -17,8 +17,10 @@ int minix_sync_file(struct file *, struc const struct file_operations minix_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .fsync = minix_sync_file, .sendfile = generic_file_sendfile, Index: linux-2.6.17-rc3.save/fs/ntfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ntfs/file.c 2006-05-09 14:11:51.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ntfs/file.c 2006-05-09 15:50:43.000000000 -0700 @@ -2294,7 +2294,7 @@ static int ntfs_file_fsync(struct file * const struct file_operations ntfs_file_ops = { .llseek = generic_file_llseek, /* Seek inside file. */ - .read = generic_file_read, /* Read from file. */ + .read = do_sync_read, /* Read from file. */ .aio_read = generic_file_aio_read, /* Async read from file. */ #ifdef NTFS_RW .write = ntfs_file_write, /* Write to file. 
*/ Index: linux-2.6.17-rc3.save/fs/qnx4/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/qnx4/file.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/qnx4/file.c 2006-05-09 15:18:10.000000000 -0700 @@ -22,11 +22,13 @@ const struct file_operations qnx4_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, #ifdef CONFIG_QNX4FS_RW - .write = generic_file_write, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .fsync = qnx4_sync_file, #endif }; Index: linux-2.6.17-rc3.save/fs/ramfs/file-mmu.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ramfs/file-mmu.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ramfs/file-mmu.c 2006-05-09 15:19:34.000000000 -0700 @@ -33,8 +33,10 @@ struct address_space_operations ramfs_ao }; const struct file_operations ramfs_file_operations = { - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .fsync = simple_sync_file, .sendfile = generic_file_sendfile, Index: linux-2.6.17-rc3.save/fs/ramfs/file-nommu.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ramfs/file-nommu.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ramfs/file-nommu.c 2006-05-09 15:20:37.000000000 -0700 @@ -36,8 +36,10 @@ struct address_space_operations ramfs_ao const struct file_operations ramfs_file_operations = { .mmap = ramfs_nommu_mmap, .get_unmapped_area = ramfs_nommu_get_unmapped_area, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + 
.aio_write = generic_file_aio_write, .fsync = simple_sync_file, .sendfile = generic_file_sendfile, .llseek = generic_file_llseek, Index: linux-2.6.17-rc3.save/fs/read_write.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/read_write.c 2006-05-09 14:11:53.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/read_write.c 2006-05-09 15:21:53.000000000 -0700 @@ -22,7 +22,8 @@ const struct file_operations generic_ro_fops = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .mmap = generic_file_readonly_mmap, .sendfile = generic_file_sendfile, }; Index: linux-2.6.17-rc3.save/include/linux/fs.h =================================================================== --- linux-2.6.17-rc3.save.orig/include/linux/fs.h 2006-05-09 14:11:53.000000000 -0700 +++ linux-2.6.17-rc3.save/include/linux/fs.h 2006-05-09 15:41:52.000000000 -0700 @@ -1594,9 +1594,7 @@ extern int generic_file_mmap(struct file extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); extern int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); -extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *); int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); -extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t); extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *); extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t); Index: linux-2.6.17-rc3.save/mm/filemap.c =================================================================== --- 
linux-2.6.17-rc3.save.orig/mm/filemap.c 2006-05-09 14:11:51.000000000 -0700 +++ linux-2.6.17-rc3.save/mm/filemap.c 2006-05-09 15:41:20.000000000 -0700 @@ -1104,22 +1104,6 @@ generic_file_aio_read(struct kiocb *iocb } EXPORT_SYMBOL(generic_file_aio_read); -ssize_t -generic_file_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos) -{ - struct iovec local_iov = { .iov_base = buf, .iov_len = count }; - struct kiocb kiocb; - ssize_t ret; - - init_sync_kiocb(&kiocb, filp); - ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&kiocb); - return ret; -} - -EXPORT_SYMBOL(generic_file_read); - int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size) { ssize_t written; @@ -2185,21 +2169,6 @@ ssize_t generic_file_aio_write_nolock(st } EXPORT_SYMBOL(generic_file_aio_write_nolock); -static ssize_t -__generic_file_write_nolock(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct kiocb kiocb; - ssize_t ret; - - init_sync_kiocb(&kiocb, file); - kiocb.ki_pos = *ppos; - ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&kiocb); - return ret; -} - ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos) @@ -2242,30 +2211,6 @@ ssize_t generic_file_aio_write(struct ki } EXPORT_SYMBOL(generic_file_aio_write); -ssize_t generic_file_write(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - ssize_t ret; - struct iovec local_iov = { .iov_base = (void __user *)buf, - .iov_len = count }; - - mutex_lock(&inode->i_mutex); - ret = __generic_file_write_nolock(file, &local_iov, 1, ppos); - mutex_unlock(&inode->i_mutex); - - if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { - 
ssize_t err; - - err = sync_page_range(inode, mapping, *ppos - ret, ret); - if (err < 0) - ret = err; - } - return ret; -} -EXPORT_SYMBOL(generic_file_write); - /* * Called under i_mutex for writes to S_ISREG files. Returns -EIO if something * went wrong during pagecache shootdown. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-09 23:57 ` Badari Pulavarty
@ 2006-05-10  8:00   ` Christoph Hellwig
  2006-05-10 15:01     ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2006-05-10 8:00 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Christoph Hellwig, Andrew Morton, lkml, Benjamin LaHaise, cel

On Tue, May 09, 2006 at 04:57:42PM -0700, Badari Pulavarty wrote:
> > > we can get the followup patch done within the next week or two so we can
> > > get it all tested at the same time.  Although from your description it
> > > doesn't sound like it'll be completely trivial...
> >
> > That patch is lots of trivial and boring work.  If anyone wants to beat
> > me to it:
>
> Well, I am not sure if you meant *exactly* this..
>
> So far, I have this.  I really don't like the idea of
> adding .aio_read/.aio_write methods for the filesystems that currently
> don't have one (so we can force their .read/.write to do_sync_*()).

Why don't you like this idea?  It helps to sort out callers into two
categories.  The following is something I wrote up to put into
Documentation/ somewhere once these patches are in.

-------- snip --------

There are two ways to implement read/write for filesystems and drivers:

The simple way is to implement the read and write methods.  Normal
synchronous, single-buffer requests are handed directly to the driver in
this case.  Vectored requests are emulated using a loop in the higher
level code.  AIO requests are silently performed synchronously.
This method is normally used for character drivers and synthetic
filesystems.

The advanced method is to implement the aio_read and aio_write methods.
These allow the request to be done asynchronously and submit multiple
I/O vectors in parallel.  A page cache based filesystem gets this
functionality for free by using the routines from filemap.c - in fact
there is no easy way to use the generic page cache code without
implementing aio_read and aio_write.
The other big user of this interface is sockets.  Very few character
drivers need this complexity.

-------- snip --------

> Is there a way to fix the callers of the .read/.write() methods to use
> something like do_sync_read/write instead, so that we can take out
> .read/.write completely?

The only way to fix this is to add some kernel_read/kernel_write helpers
that factor out the "use aio_read/aio_write if present, then wait for
I/O completion" logic from vfs_read/vfs_write.  I started on that but it
got very messy.

> Anyway, here it is, compiled but untested..  I think I can clean up
> more in filemap.c (after reading through your suggestions).  Please
> let me know, if I am on the wrong path ...

Currently I don't have time to actually apply the patchkit and look at
the result, so I'll defer further comments.  Besides maybe not doing all
possible cleanups (e.g. I still see generic_file_write_nolock) this
patch looks very good.

^ permalink raw reply	[flat|nested] 58+ messages in thread
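What such a helper has to factor out is essentially the dispatch rule vfs_read ends up with: prefer .read when the driver provides one, otherwise drive .aio_read synchronously, and fail if neither exists. A toy sketch of that rule follows; the `demo_` names and plain function pointers are stand-ins for the kernel's file_operations, and the errno value is hand-spelled, so this is an illustration of the dispatch only.

```c
#include <stddef.h>
#include <sys/types.h>

#define DEMO_EINVAL 22  /* stand-in for the kernel's EINVAL */

typedef ssize_t (*read_fn)(void);

/* stand-in for struct file_operations: only the two read entries */
struct demo_fops {
	read_fn read;
	read_fn aio_read;
};

static ssize_t plain_read(void)    { return 1; } /* old-style .read */
static ssize_t aio_only_read(void) { return 2; } /* aio-capable method */

static ssize_t vfs_read_dispatch(const struct demo_fops *ops)
{
	if (ops->read)
		return ops->read();     /* driver has its own .read */
	if (ops->aio_read)
		return ops->aio_read(); /* kernel: sync kiocb + wait here */
	return -DEMO_EINVAL;            /* no read method at all */
}
```

Once every page-cache filesystem sets `.read = do_sync_read`, the first branch always fires for them, which is why the series adds no indirection for plain sys_read callers.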
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-05-10  8:00 ` Christoph Hellwig
@ 2006-05-10 15:01   ` Badari Pulavarty
  0 siblings, 0 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-05-10 15:01 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andrew Morton, lkml, Benjamin LaHaise, cel

On Wed, 2006-05-10 at 10:00 +0200, Christoph Hellwig wrote:
> On Tue, May 09, 2006 at 04:57:42PM -0700, Badari Pulavarty wrote:
> > > > we can get the followup patch done within the next week or two so we can
> > > > get it all tested at the same time.  Although from your description it
> > > > doesn't sound like it'll be completely trivial...
> > >
> > > That patch is lots of trivial and boring work.  If anyone wants to beat
> > > me to it:
> >
> > Well, I am not sure if you meant *exactly* this..
> >
> > So far, I have this.  I really don't like the idea of
> > adding .aio_read/.aio_write methods for the filesystems that currently
> > don't have one (so we can force their .read/.write to do_sync_*()).
>
> Why don't you like this idea?

A few reasons:

1) I added .aio_read/.aio_write methods for all the filesystems that
don't currently have them, just to make their .read/.write use
do_sync_*().

2) It's just not possible for filesystems to provide ONLY the
.aio_read/.aio_write() interfaces.  They have to have .read/.write()
also, to handle direct callers :(

3) sys_read/sys_write() will now have an extra indirection:

	sys_read() -> vfs_read() -> do_sync_read() -> .aio_read()

whereas the current code does:

	sys_read() -> vfs_read() -> .read()

We now have an extra do_sync_read() call in the path, but that may be
okay.

> -------- snip --------
>
> There are two ways to implement read/write for filesystems and drivers:
>
> The simple way is to implement the read and write methods.  Normal
> synchronous, single-buffer requests are handed directly to the driver in
> this case.  Vectored requests are emulated using a loop in the higher
> level code.  AIO requests are silently performed synchronously.
> This method is normally used for character drivers and synthetic
> filesystems.
>
> The advanced method is to implement the aio_read and aio_write methods.
> These allow the request to be done asynchronously and submit multiple
> I/O vectors in parallel.  A page cache based filesystem gets this
> functionality for free by using the routines from filemap.c - in fact
> there is no easy way to use the generic page cache code without
> implementing aio_read and aio_write.  The other big user of this
> interface is sockets.  Very few character drivers need this complexity.
>
> -------- snip --------
>
> > Is there a way to fix the callers of the .read/.write() methods to use
> > something like do_sync_read/write instead, so that we can take out
> > .read/.write completely?
>
> The only way to fix this is to add some kernel_read/kernel_write helpers
> that factor out the "use aio_read/aio_write if present, then wait for
> I/O completion" logic from vfs_read/vfs_write.  I started on that but it
> got very messy.

Okay.  I will take your word for it - I won't bother trying for now :)

> > Anyway, here it is, compiled but untested..  I think I can clean up
> > more in filemap.c (after reading through your suggestions).  Please
> > let me know, if I am on the wrong path ...
>
> Currently I don't have time to actually apply the patchkit and look at
> the result, so I'll defer further comments.  Besides maybe not doing all
> possible cleanups (e.g. I still see generic_file_write_nolock) this
> patch looks very good.

I need to take a closer look at generic_file_write_nolock() since I
couldn't eliminate it easily in my first dumb pass.  I will also look at
the cleanups you suggested.  Thanks.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-05-09 19:20 ` Christoph Hellwig 2006-05-09 23:57 ` Badari Pulavarty @ 2006-05-10 16:01 ` Badari Pulavarty 1 sibling, 0 replies; 58+ messages in thread From: Badari Pulavarty @ 2006-05-10 16:01 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, lkml, Benjamin LaHaise, cel I am starting to like this :) Here is what I have so far (this patch applies on top of the other set). We will NOW have only up following: generic_file_aio_read() - read handler generic_file_aio_write() - write handler generic_file_aio_write_nolock() - no lock write handler __generic_file_aio_write_nolock() - internal worker routine (not exported) Thanks, Badari Get rid of everything other than following generic read/write interfaces: generic_file_aio_read() - read handler generic_file_aio_write() - write handler generic_file_aio_write_nolock() - no lock write handler __generic_file_aio_write_nolock() - internal worker routine (not exported) Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> drivers/char/raw.c | 15 +------ fs/adfs/file.c | 6 ++- fs/affs/file.c | 6 ++- fs/bfs/file.c | 6 ++- fs/block_dev.c | 12 +----- fs/ext2/file.c | 4 +- fs/fuse/file.c | 6 ++- fs/hfs/inode.c | 6 ++- fs/hfsplus/inode.c | 6 ++- fs/hostfs/hostfs_kern.c | 4 +- fs/hpfs/file.c | 6 ++- fs/jffs/inode-v23.c | 6 ++- fs/jffs2/file.c | 6 ++- fs/jfs/file.c | 4 +- fs/minix/file.c | 6 ++- fs/ntfs/file.c | 2 - fs/qnx4/file.c | 6 ++- fs/ramfs/file-mmu.c | 6 ++- fs/ramfs/file-nommu.c | 6 ++- fs/read_write.c | 3 + fs/xfs/linux-2.6/xfs_lrw.c | 4 +- include/linux/fs.h | 5 -- mm/filemap.c | 88 ++------------------------------------------- 23 files changed, 72 insertions(+), 147 deletions(-) Index: linux-2.6.17-rc3.save/drivers/char/raw.c =================================================================== --- linux-2.6.17-rc3.save.orig/drivers/char/raw.c 2006-05-10 08:23:47.000000000 -0700 +++ linux-2.6.17-rc3.save/drivers/char/raw.c 2006-05-10 08:29:35.000000000 
-0700 @@ -239,21 +239,10 @@ out: return err; } -static ssize_t raw_file_write(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - struct iovec local_iov = { - .iov_base = (char __user *)buf, - .iov_len = count - }; - - return generic_file_write_nolock(file, &local_iov, 1, ppos); -} - static struct file_operations raw_fops = { - .read = generic_file_read, + .read = do_sync_read, .aio_read = generic_file_aio_read, - .write = raw_file_write, + .write = do_sync_write, .aio_write = generic_file_aio_write_nolock, .open = raw_open, .release= raw_release, Index: linux-2.6.17-rc3.save/fs/adfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/adfs/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/adfs/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -27,10 +27,12 @@ const struct file_operations adfs_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .mmap = generic_file_mmap, .fsync = file_fsync, - .write = generic_file_write, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .sendfile = generic_file_sendfile, }; Index: linux-2.6.17-rc3.save/fs/affs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/affs/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/affs/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -27,8 +27,10 @@ static int affs_file_release(struct inod const struct file_operations affs_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .open = affs_file_open, .release = affs_file_release, Index: linux-2.6.17-rc3.save/fs/bfs/file.c 
=================================================================== --- linux-2.6.17-rc3.save.orig/fs/bfs/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/bfs/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -19,8 +19,10 @@ const struct file_operations bfs_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, }; Index: linux-2.6.17-rc3.save/fs/block_dev.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/block_dev.c 2006-05-10 08:23:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/block_dev.c 2006-05-10 08:29:35.000000000 -0700 @@ -1056,14 +1056,6 @@ static int blkdev_close(struct inode * i return blkdev_put(bdev); } -static ssize_t blkdev_file_write(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count }; - - return generic_file_write_nolock(file, &local_iov, 1, ppos); -} - static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) { return blkdev_ioctl(file->f_mapping->host, file, cmd, arg); @@ -1083,8 +1075,8 @@ const struct file_operations def_blk_fop .open = blkdev_open, .release = blkdev_close, .llseek = block_llseek, - .read = generic_file_read, - .write = blkdev_file_write, + .read = do_sync_read, + .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write_nolock, .mmap = generic_file_mmap, Index: linux-2.6.17-rc3.save/fs/ext2/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ext2/file.c 2006-05-10 08:23:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ext2/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -41,8 +41,8 @@ static int 
ext2_release_file (struct ino */ const struct file_operations ext2_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, .ioctl = ext2_ioctl, Index: linux-2.6.17-rc3.save/fs/fuse/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/fuse/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/fuse/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -621,8 +621,10 @@ static int fuse_set_page_dirty(struct pa static const struct file_operations fuse_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = fuse_file_mmap, .open = fuse_open, .flush = fuse_flush, Index: linux-2.6.17-rc3.save/fs/hfs/inode.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hfs/inode.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hfs/inode.c 2006-05-10 08:29:35.000000000 -0700 @@ -603,8 +603,10 @@ int hfs_inode_setattr(struct dentry *den static const struct file_operations hfs_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, .fsync = file_fsync, Index: linux-2.6.17-rc3.save/fs/hfsplus/inode.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hfsplus/inode.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hfsplus/inode.c 2006-05-10 08:29:35.000000000 -0700 @@ -282,8 +282,10 @@ static 
struct inode_operations hfsplus_f static const struct file_operations hfsplus_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, .fsync = file_fsync, Index: linux-2.6.17-rc3.save/fs/hostfs/hostfs_kern.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hostfs/hostfs_kern.c 2006-05-10 08:23:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hostfs/hostfs_kern.c 2006-05-10 08:29:35.000000000 -0700 @@ -386,11 +386,11 @@ int hostfs_fsync(struct file *file, stru static const struct file_operations hostfs_file_fops = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, .sendfile = generic_file_sendfile, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, - .write = generic_file_write, + .write = do_sync_write, .mmap = generic_file_mmap, .open = hostfs_file_open, .release = NULL, Index: linux-2.6.17-rc3.save/fs/hpfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/hpfs/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/hpfs/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -113,7 +113,7 @@ static ssize_t hpfs_file_write(struct fi { ssize_t retval; - retval = generic_file_write(file, buf, count, ppos); + retval = do_sync_write(file, buf, count, ppos); if (retval > 0) hpfs_i(file->f_dentry->d_inode)->i_dirty = 1; return retval; @@ -122,8 +122,10 @@ static ssize_t hpfs_file_write(struct fi const struct file_operations hpfs_file_ops = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .write = hpfs_file_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .release = 
hpfs_file_release, .fsync = hpfs_file_fsync, Index: linux-2.6.17-rc3.save/fs/jffs/inode-v23.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/jffs/inode-v23.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/jffs/inode-v23.c 2006-05-10 08:29:35.000000000 -0700 @@ -1633,8 +1633,10 @@ static const struct file_operations jffs { .open = generic_file_open, .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .ioctl = jffs_ioctl, .mmap = generic_file_readonly_mmap, .fsync = jffs_fsync, Index: linux-2.6.17-rc3.save/fs/jffs2/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/jffs2/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/jffs2/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -42,8 +42,10 @@ const struct file_operations jffs2_file_ { .llseek = generic_file_llseek, .open = generic_file_open, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .ioctl = jffs2_ioctl, .mmap = generic_file_readonly_mmap, .fsync = jffs2_fsync, Index: linux-2.6.17-rc3.save/fs/jfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/jfs/file.c 2006-05-10 08:23:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/jfs/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -103,8 +103,8 @@ struct inode_operations jfs_file_inode_o const struct file_operations jfs_file_operations = { .open = jfs_open, .llseek = generic_file_llseek, - .write = generic_file_write, - .read = generic_file_read, + .write = do_sync_write, + .read = do_sync_read, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, .mmap = 
generic_file_mmap, Index: linux-2.6.17-rc3.save/fs/minix/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/minix/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/minix/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -17,8 +17,10 @@ int minix_sync_file(struct file *, struc const struct file_operations minix_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .fsync = minix_sync_file, .sendfile = generic_file_sendfile, Index: linux-2.6.17-rc3.save/fs/ntfs/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ntfs/file.c 2006-05-10 08:23:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ntfs/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -2294,7 +2294,7 @@ static int ntfs_file_fsync(struct file * const struct file_operations ntfs_file_ops = { .llseek = generic_file_llseek, /* Seek inside file. */ - .read = generic_file_read, /* Read from file. */ + .read = do_sync_read, /* Read from file. */ .aio_read = generic_file_aio_read, /* Async read from file. */ #ifdef NTFS_RW .write = ntfs_file_write, /* Write to file. 
*/ Index: linux-2.6.17-rc3.save/fs/qnx4/file.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/qnx4/file.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/qnx4/file.c 2006-05-10 08:29:35.000000000 -0700 @@ -22,11 +22,13 @@ const struct file_operations qnx4_file_operations = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .mmap = generic_file_mmap, .sendfile = generic_file_sendfile, #ifdef CONFIG_QNX4FS_RW - .write = generic_file_write, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .fsync = qnx4_sync_file, #endif }; Index: linux-2.6.17-rc3.save/fs/ramfs/file-mmu.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ramfs/file-mmu.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ramfs/file-mmu.c 2006-05-10 08:29:35.000000000 -0700 @@ -33,8 +33,10 @@ struct address_space_operations ramfs_ao }; const struct file_operations ramfs_file_operations = { - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .fsync = simple_sync_file, .sendfile = generic_file_sendfile, Index: linux-2.6.17-rc3.save/fs/ramfs/file-nommu.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/ramfs/file-nommu.c 2006-05-10 08:21:47.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/ramfs/file-nommu.c 2006-05-10 08:29:35.000000000 -0700 @@ -36,8 +36,10 @@ struct address_space_operations ramfs_ao const struct file_operations ramfs_file_operations = { .mmap = ramfs_nommu_mmap, .get_unmapped_area = ramfs_nommu_get_unmapped_area, - .read = generic_file_read, - .write = generic_file_write, + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + 
.aio_write = generic_file_aio_write, .fsync = simple_sync_file, .sendfile = generic_file_sendfile, .llseek = generic_file_llseek, Index: linux-2.6.17-rc3.save/fs/read_write.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/read_write.c 2006-05-10 08:29:26.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/read_write.c 2006-05-10 08:29:35.000000000 -0700 @@ -22,7 +22,8 @@ const struct file_operations generic_ro_fops = { .llseek = generic_file_llseek, - .read = generic_file_read, + .read = do_sync_read, + .aio_read = generic_file_aio_read, .mmap = generic_file_readonly_mmap, .sendfile = generic_file_sendfile, }; Index: linux-2.6.17-rc3.save/include/linux/fs.h =================================================================== --- linux-2.6.17-rc3.save.orig/include/linux/fs.h 2006-05-10 08:29:26.000000000 -0700 +++ linux-2.6.17-rc3.save/include/linux/fs.h 2006-05-10 09:00:37.000000000 -0700 @@ -1594,11 +1594,8 @@ extern int generic_file_mmap(struct file extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); extern int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); -extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *); int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); -extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t); -extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *); extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t); extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *, unsigned long, loff_t); @@ -1608,8 
+1605,6 @@ extern ssize_t generic_file_buffered_wri unsigned long, loff_t, loff_t *, size_t, ssize_t); extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos); extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos); -ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos); extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *); extern void do_generic_mapping_read(struct address_space *mapping, struct file_ra_state *, struct file *, Index: linux-2.6.17-rc3.save/mm/filemap.c =================================================================== --- linux-2.6.17-rc3.save.orig/mm/filemap.c 2006-05-10 08:23:47.000000000 -0700 +++ linux-2.6.17-rc3.save/mm/filemap.c 2006-05-10 08:44:01.000000000 -0700 @@ -1018,13 +1018,14 @@ success: * that can use the page cache directly. */ ssize_t -__generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) +generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { struct file *filp = iocb->ki_filp; ssize_t retval; unsigned long seg; size_t count; + loff_t *ppos = &iocb->ki_pos; count = 0; for (seg = 0; seg < nr_segs; seg++) { @@ -1048,7 +1049,7 @@ __generic_file_aio_read(struct kiocb *io /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ if (filp->f_flags & O_DIRECT) { - loff_t pos = *ppos, size; + loff_t size; struct address_space *mapping; struct inode *inode; @@ -1093,33 +1094,8 @@ out: return retval; } -EXPORT_SYMBOL(__generic_file_aio_read); - -ssize_t -generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t pos) -{ - BUG_ON(iocb->ki_pos != pos); - return __generic_file_aio_read(iocb, iov, nr_segs, &iocb->ki_pos); -} EXPORT_SYMBOL(generic_file_aio_read); -ssize_t -generic_file_read(struct file *filp, char __user 
*buf, size_t count, loff_t *ppos) -{ - struct iovec local_iov = { .iov_base = buf, .iov_len = count }; - struct kiocb kiocb; - ssize_t ret; - - init_sync_kiocb(&kiocb, filp); - ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&kiocb); - return ret; -} - -EXPORT_SYMBOL(generic_file_read); - int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size) { ssize_t written; @@ -2185,38 +2161,6 @@ ssize_t generic_file_aio_write_nolock(st } EXPORT_SYMBOL(generic_file_aio_write_nolock); -static ssize_t -__generic_file_write_nolock(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct kiocb kiocb; - ssize_t ret; - - init_sync_kiocb(&kiocb, file); - kiocb.ki_pos = *ppos; - ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&kiocb); - return ret; -} - -ssize_t -generic_file_write_nolock(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t *ppos) -{ - struct kiocb kiocb; - ssize_t ret; - - init_sync_kiocb(&kiocb, file); - kiocb.ki_pos = *ppos; - ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, *ppos); - if (-EIOCBQUEUED == ret) - ret = wait_on_sync_kiocb(&kiocb); - *ppos = kiocb.ki_pos; - return ret; -} -EXPORT_SYMBOL(generic_file_write_nolock); - ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { @@ -2242,30 +2186,6 @@ ssize_t generic_file_aio_write(struct ki } EXPORT_SYMBOL(generic_file_aio_write); -ssize_t generic_file_write(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - ssize_t ret; - struct iovec local_iov = { .iov_base = (void __user *)buf, - .iov_len = count }; - - mutex_lock(&inode->i_mutex); - ret = __generic_file_write_nolock(file, 
&local_iov, 1, ppos); - mutex_unlock(&inode->i_mutex); - - if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { - ssize_t err; - - err = sync_page_range(inode, mapping, *ppos - ret, ret); - if (err < 0) - ret = err; - } - return ret; -} -EXPORT_SYMBOL(generic_file_write); - /* * Called under i_mutex for writes to S_ISREG files. Returns -EIO if something * went wrong during pagecache shootdown. Index: linux-2.6.17-rc3.save/fs/xfs/linux-2.6/xfs_lrw.c =================================================================== --- linux-2.6.17-rc3.save.orig/fs/xfs/linux-2.6/xfs_lrw.c 2006-04-26 19:19:25.000000000 -0700 +++ linux-2.6.17-rc3.save/fs/xfs/linux-2.6/xfs_lrw.c 2006-05-10 08:45:52.000000000 -0700 @@ -276,7 +276,9 @@ xfs_read( xfs_rw_enter_trace(XFS_READ_ENTER, &ip->i_iocore, (void *)iovp, segs, *offset, ioflags); - ret = __generic_file_aio_read(iocb, iovp, segs, offset); + + iocb->ki_pos = *offset; + ret = generic_file_aio_read(iocb, iovp, segs, *offset); if (ret == -EIOCBQUEUED && !(ioflags & IO_ISAIO)) ret = wait_on_sync_kiocb(iocb); if (ret > 0) ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-05-09 19:13 ` Andrew Morton 2006-05-09 19:20 ` Christoph Hellwig @ 2006-05-10 20:50 ` Badari Pulavarty 1 sibling, 0 replies; 58+ messages in thread From: Badari Pulavarty @ 2006-05-10 20:50 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Hellwig, linux-kernel, bcrl, cel Andrew, If you haven't picked these patches up into -mm yet, can you hold off till tomorrow ? I have an updated version with a few minor fixes, and I am almost ready with the filemap.c cleanups. I am currently testing those and haven't found any blockers. Thanks, Badari ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-05-09 19:03 ` Christoph Hellwig 2006-05-09 19:13 ` Andrew Morton @ 2006-05-09 20:07 ` Badari Pulavarty 1 sibling, 0 replies; 58+ messages in thread From: Badari Pulavarty @ 2006-05-09 20:07 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, bcrl, cel Christoph Hellwig wrote: >On Tue, May 09, 2006 at 12:01:05PM -0700, Andrew Morton wrote: > >>Together these three patches shrink the kernel by 113 lines. I don't know >>what the effect is on text size, but that's a pretty modest saving, at a >>pretty high risk level. >> >>What else do we get in return for this risk? >> > >there's another patch on top, which I didn't bother to redo until this is >accepted, that kills a lot more code. After that, filesystems only have >to implement one method each for all kinds of read/write calls. That >allows us to make mm/filemap.c far less complex and actually >understandable, and it also helps any filesystem that uses more complex >read/write variants than direct filemap.c calls. In addition to these >simplifications we also get a feature (async vectored I/O) for free. > Yep. I am currently killing the read/write methods for all filesystems and also getting rid of generic_file_read() and generic_file_write(). Thanks, Badari ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-05-09 19:01 ` Andrew Morton 2006-05-09 19:03 ` Christoph Hellwig @ 2006-05-09 23:53 ` Badari Pulavarty 1 sibling, 0 replies; 58+ messages in thread From: Badari Pulavarty @ 2006-05-09 23:53 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml, christoph, Benjamin LaHaise, cel On Tue, 2006-05-09 at 12:01 -0700, Andrew Morton wrote: > Badari Pulavarty <pbadari@us.ibm.com> wrote: > > > > static ssize_t ep_aio_read_retry(struct kiocb *iocb) > > { > > struct kiocb_priv *priv = iocb->private; > > - ssize_t status = priv->actual; > > + ssize_t len, total; > > > > /* we "retry" to get the right mm context for this: */ > > - status = copy_to_user(priv->ubuf, priv->buf, priv->actual); > > - if (unlikely(0 != status)) > > - status = -EFAULT; > > - else > > - status = priv->actual; > > + > > + /* copy stuff into user buffers */ > > + total = priv->actual; > > + len = 0; > > + for (i=0; i < priv->count; i++) { > > for (i = 0 > > > + ssize_t this = min(priv->iv[i].iov_len, (size_t)total); > > min_t(). > > Strange mixture of size_t and ssize_t there. Borrowed it from somewhere :( I will clean it up. > > > + if (copy_to_user(priv->iv[i].iov_buf, priv->buf, this)) > > + break; > > + > > + total -= this; > > + len += this; > > + if (total <= 0) > > + break; > > + } > > + > > + if (unlikely(len != 0)) > > + len = -EFAULT; > > This looks wrong. I think you meant (total != 0). Yes. It should be "total". Thanks, Badari ^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 0/3] VFS changes to collapse all the vectored and AIO support @ 2006-02-02 16:12 Badari Pulavarty 2006-02-02 16:14 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty 0 siblings, 1 reply; 58+ messages in thread From: Badari Pulavarty @ 2006-02-02 16:12 UTC (permalink / raw) To: christoph, Benjamin LaHaise, Zach Brown; +Cc: lkml, linux-fsdevel, pbadari Hi, This work was originally suggested & started by Christoph Hellwig, when Zach Brown tried to add vectored support for AIO. This series of changes collapses all the vectored IO support into a single file-operation method using aio_read/aio_write. Christoph & Zach, comments/suggestions ? If you are happy with the work, can you add your Sign-off or Ack ? Here is the summary: [PATCH 1/3] Vectorize aio_read/aio_write methods [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead. [PATCH 3/3] Zach's core aio changes to support vectored AIO. To Do/Issues: 1) Since aio_read/aio_write are vectorized now, we need to modify the nfs AIO+DIO and usb/gadget code to handle vectors. Is it needed ? For now, it handles only a single vector. Christoph, should I loop over all the vectors ? 2) The AIO changes need careful review & could be cleaned up further. Zach, can you take a look at those ? 3) Ben's suggestion of a kernel iovec to hold precomputed information (like total iolen) instead of computing it every time. Thanks, Badari ^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 1/3] Vectorize aio_read/aio_write methods 2006-02-02 16:12 [PATCH 0/3] VFS changes to collapse all the vectored and AIO support Badari Pulavarty @ 2006-02-02 16:14 ` Badari Pulavarty 2006-02-04 13:28 ` Christoph Hellwig 0 siblings, 1 reply; 58+ messages in thread From: Badari Pulavarty @ 2006-02-02 16:14 UTC (permalink / raw) To: christoph; +Cc: Benjamin LaHaise, Zach Brown, lkml, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 181 bytes --] This patch vectorizes aio_read() and aio_write() methods to prepare for colapsing all the vectored operations into one interface - which is aio_read()/aio_write(). Thanks, Badari [-- Attachment #2: aiovector.patch --] [-- Type: text/x-patch, Size: 31621 bytes --] This patch vectorizes aio_read() and aio_write() methods to prepare for colapsing all the vectored operations into one interface - which is aio_read()/aio_write(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Index: linux-2.6.16-rc1.quilt/Documentation/filesystems/Locking =================================================================== --- linux-2.6.16-rc1.quilt.orig/Documentation/filesystems/Locking 2006-02-01 16:36:55.000000000 -0800 +++ linux-2.6.16-rc1.quilt/Documentation/filesystems/Locking 2006-02-01 16:38:14.000000000 -0800 @@ -355,10 +355,9 @@ prototypes: loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, - loff_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, Index: 
Index: linux-2.6.16-rc1.quilt/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.16-rc1.quilt.orig/Documentation/filesystems/vfs.txt	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/Documentation/filesystems/vfs.txt	2006-02-01 16:38:14.000000000 -0800
@@ -526,9 +526,9 @@
 struct file_operations {
 	loff_t (*llseek) (struct file *, loff_t, int);
 	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
-	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
-	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
Index: linux-2.6.16-rc1.quilt/drivers/char/raw.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/drivers/char/raw.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/drivers/char/raw.c	2006-02-01 16:38:14.000000000 -0800
@@ -249,23 +249,11 @@
 	return generic_file_write_nolock(file, &local_iov, 1, ppos);
 }
 
-static ssize_t raw_file_aio_write(struct kiocb *iocb, const char __user *buf,
-				  size_t count, loff_t pos)
-{
-	struct iovec local_iov = {
-		.iov_base = (char __user *)buf,
-		.iov_len = count
-	};
-
-	return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
-}
-
-
 static struct file_operations raw_fops = {
 	.read	=	generic_file_read,
 	.aio_read =	generic_file_aio_read,
 	.write	=	raw_file_write,
-	.aio_write =	raw_file_aio_write,
+	.aio_write =	generic_file_aio_write_nolock,
 	.open	=	raw_open,
 	.release=	raw_release,
 	.ioctl	=	raw_ioctl,
Index: linux-2.6.16-rc1.quilt/fs/aio.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/aio.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/aio.c	2006-02-01 16:38:14.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/aio_abi.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/uio.h>
 
 #define DEBUG 0
 
@@ -1316,8 +1317,12 @@
 	ssize_t ret = 0;
 
 	do {
-		ret = file->f_op->aio_read(iocb, iocb->ki_buf,
-			iocb->ki_left, iocb->ki_pos);
+		struct iovec iov = {
+			.iov_base = iocb->ki_buf,
+			.iov_len = iocb->ki_left
+		};
+
+		ret = file->f_op->aio_read(iocb, &iov, 1, iocb->ki_pos);
 		/*
 		 * Can't just depend on iocb->ki_left to determine
 		 * whether we are done. This may have been a short read.
@@ -1350,8 +1355,12 @@
 	ssize_t ret = 0;
 
 	do {
-		ret = file->f_op->aio_write(iocb, iocb->ki_buf,
-			iocb->ki_left, iocb->ki_pos);
+		struct iovec iov = {
+			.iov_base = iocb->ki_buf,
+			.iov_len = iocb->ki_left
+		};
+
+		ret = file->f_op->aio_write(iocb, &iov, 1, iocb->ki_pos);
 		if (ret > 0) {
 			iocb->ki_buf += ret;
 			iocb->ki_left -= ret;
Index: linux-2.6.16-rc1.quilt/fs/block_dev.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/block_dev.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/block_dev.c	2006-02-01 16:38:14.000000000 -0800
@@ -769,14 +769,6 @@
 	return generic_file_write_nolock(file, &local_iov, 1, ppos);
 }
 
-static ssize_t blkdev_file_aio_write(struct kiocb *iocb, const char __user *buf,
-				   size_t count, loff_t pos)
-{
-	struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count };
-
-	return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
-}
-
 static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
 	return blkdev_ioctl(file->f_mapping->host, file, cmd, arg);
@@ -799,7 +791,7 @@
 	.read		= generic_file_read,
 	.write		= blkdev_file_write,
 	.aio_read	= generic_file_aio_read,
-	.aio_write	= blkdev_file_aio_write,
+	.aio_write	= generic_file_aio_write_nolock,
 	.mmap		= generic_file_mmap,
 	.fsync		= block_fsync,
 	.unlocked_ioctl	= block_ioctl,
Index: linux-2.6.16-rc1.quilt/fs/cifs/cifsfs.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/cifs/cifsfs.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/cifs/cifsfs.c	2006-02-01 16:38:14.000000000 -0800
@@ -501,13 +501,13 @@
 	return written;
 }
 
-static ssize_t cifs_file_aio_write(struct kiocb *iocb, const char __user *buf,
-				   size_t count, loff_t pos)
+static ssize_t cifs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				   unsigned long nr_segs, loff_t pos)
 {
 	struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
 	ssize_t written;
 
-	written = generic_file_aio_write(iocb, buf, count, pos);
+	written = generic_file_aio_write(iocb, iov, nr_segs, pos);
 	if (!CIFS_I(inode)->clientCanCacheAll)
 		filemap_fdatawrite(inode->i_mapping);
 	return written;
Index: linux-2.6.16-rc1.quilt/fs/ext3/file.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/ext3/file.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/ext3/file.c	2006-02-01 16:38:14.000000000 -0800
@@ -48,14 +48,15 @@
 }
 
 static ssize_t
-ext3_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos)
+ext3_file_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_dentry->d_inode;
 	ssize_t ret;
 	int err;
 
-	ret = generic_file_aio_write(iocb, buf, count, pos);
+	ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
 
 	/*
 	 * Skip flushing if there was an error, or if nothing was written.
Index: linux-2.6.16-rc1.quilt/fs/nfs/file.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/nfs/file.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/nfs/file.c	2006-02-01 16:38:14.000000000 -0800
@@ -40,8 +40,10 @@
 static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin);
 static int  nfs_file_mmap(struct file *, struct vm_area_struct *);
 static ssize_t nfs_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
-static ssize_t nfs_file_read(struct kiocb *, char __user *, size_t, loff_t);
-static ssize_t nfs_file_write(struct kiocb *, const char __user *, size_t, loff_t);
+static ssize_t nfs_file_read(struct kiocb *, const struct iovec *,
+			unsigned long, loff_t);
+static ssize_t nfs_file_write(struct kiocb *, const struct iovec *,
+			unsigned long, loff_t);
 static int  nfs_file_flush(struct file *);
 static int  nfs_fsync(struct file *, struct dentry *dentry, int datasync);
 static int nfs_check_flags(int flags);
@@ -52,8 +54,8 @@
 	.llseek		= nfs_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-	.aio_read		= nfs_file_read,
-	.aio_write		= nfs_file_write,
+	.aio_read	= nfs_file_read,
+	.aio_write	= nfs_file_write,
 	.mmap		= nfs_file_mmap,
 	.open		= nfs_file_open,
 	.flush		= nfs_file_flush,
@@ -213,7 +215,8 @@
 }
 
 static ssize_t
-nfs_file_read(struct kiocb *iocb, char __user * buf, size_t count, loff_t pos)
+nfs_file_read(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct dentry * dentry = iocb->ki_filp->f_dentry;
 	struct inode * inode = dentry->d_inode;
@@ -221,16 +224,15 @@
 
 #ifdef CONFIG_NFS_DIRECTIO
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_read(iocb, buf, count, pos);
+		return nfs_file_direct_read(iocb, iov, nr_segs, pos);
 #endif
 
-	dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n",
-		dentry->d_parent->d_name.name, dentry->d_name.name,
-		(unsigned long) count, (unsigned long) pos);
+	dfprintk(VFS, "nfs: read(%s/%s)\n",
+		dentry->d_parent->d_name.name, dentry->d_name.name);
 
 	result = nfs_revalidate_file(inode, iocb->ki_filp);
 	if (!result)
-		result = generic_file_aio_read(iocb, buf, count, pos);
+		result = generic_file_aio_read(iocb, iov, nr_segs, pos);
 	return result;
 }
 
@@ -333,7 +335,8 @@
  * Write to a file (through the page cache).
  */
 static ssize_t
-nfs_file_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos)
+nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct dentry * dentry = iocb->ki_filp->f_dentry;
 	struct inode * inode = dentry->d_inode;
@@ -341,12 +344,12 @@
 
 #ifdef CONFIG_NFS_DIRECTIO
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_write(iocb, buf, count, pos);
+		return nfs_file_direct_write(iocb, iov, nr_segs, pos);
 #endif
 
-	dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n",
+	dfprintk(VFS, "nfs: write(%s/%s(%ld))\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
-		inode->i_ino, (unsigned long) count, (unsigned long) pos);
+		inode->i_ino);
 
 	result = -EBUSY;
 	if (IS_SWAPFILE(inode))
@@ -361,11 +364,7 @@
 	}
 	nfs_revalidate_mapping(inode, iocb->ki_filp->f_mapping);
 
-	result = count;
-	if (!count)
-		goto out;
-
-	result = generic_file_aio_write(iocb, buf, count, pos);
+	result = generic_file_aio_write(iocb, iov, nr_segs, pos);
 out:
 	return result;
Index: linux-2.6.16-rc1.quilt/fs/read_write.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/read_write.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/read_write.c	2006-02-01 16:38:14.000000000 -0800
@@ -227,16 +227,21 @@
 ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
+	struct iovec iov = { .iov_base = buf, .iov_len = len };
 	struct kiocb kiocb;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
-	while (-EIOCBRETRY ==
-	       (ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos)))
+
+	for (;;) {
+		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
+		if (ret != -EIOCBRETRY)
+			break;
 		wait_on_retry_sync_kiocb(&kiocb);
+	}
 
-	if (-EIOCBQUEUED == ret)
+	if (ret == -EIOCBQUEUED)
 		ret = wait_on_sync_kiocb(&kiocb);
 	*ppos = kiocb.ki_pos;
 	return ret;
@@ -279,14 +284,19 @@
 ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
 {
+	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
 	struct kiocb kiocb;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
-	while (-EIOCBRETRY ==
-	       (ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos)))
+
+	for (;;) {
+		ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
+		if (ret != -EIOCBRETRY)
+			break;
 		wait_on_retry_sync_kiocb(&kiocb);
+	}
 
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
Index: linux-2.6.16-rc1.quilt/fs/reiserfs/file.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/reiserfs/file.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/reiserfs/file.c	2006-02-01 16:38:14.000000000 -0800
@@ -1541,12 +1541,6 @@
 	return res;
 }
 
-static ssize_t reiserfs_aio_write(struct kiocb *iocb, const char __user * buf,
-				  size_t count, loff_t pos)
-{
-	return generic_file_aio_write(iocb, buf, count, pos);
-}
-
 struct file_operations reiserfs_file_operations = {
 	.read	= generic_file_read,
 	.write	= reiserfs_file_write,
@@ -1556,7 +1550,7 @@
 	.fsync	= reiserfs_sync_file,
 	.sendfile = generic_file_sendfile,
 	.aio_read = generic_file_aio_read,
-	.aio_write = reiserfs_aio_write,
+	.aio_write = generic_file_aio_write,
 };
 
 struct inode_operations reiserfs_file_inode_operations = {
Index: linux-2.6.16-rc1.quilt/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/xfs/linux-2.6/xfs_file.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/xfs/linux-2.6/xfs_file.c	2006-02-01 16:38:14.000000000 -0800
@@ -51,12 +51,11 @@
 STATIC inline ssize_t
 __linvfs_read(
 	struct kiocb		*iocb,
-	char __user		*buf,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	int			ioflags,
-	size_t			count,
 	loff_t			pos)
 {
-	struct iovec		iov = {buf, count};
 	struct file		*file = iocb->ki_filp;
 	vnode_t			*vp = LINVFS_GET_VP(file->f_dentry->d_inode);
 	ssize_t			rval;
@@ -65,7 +64,7 @@
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= IO_ISDIRECT;
 
-	VOP_READ(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval);
+	VOP_READ(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval);
 	return rval;
 }
 
@@ -73,33 +72,32 @@
 STATIC ssize_t
 linvfs_aio_read(
 	struct kiocb		*iocb,
-	char __user		*buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_read(iocb, buf, IO_ISAIO, count, pos);
+	return __linvfs_read(iocb, iov, nr_segs, IO_ISAIO, pos);
 }
 
 STATIC ssize_t
 linvfs_aio_read_invis(
 	struct kiocb		*iocb,
-	char __user		*buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_read(iocb, buf, IO_ISAIO|IO_INVIS, count, pos);
+	return __linvfs_read(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos);
 }
 
 STATIC inline ssize_t
 __linvfs_write(
-	struct kiocb	*iocb,
-	const char __user *buf,
-	int		ioflags,
-	size_t		count,
-	loff_t		pos)
+	struct kiocb		*iocb,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
+	int			ioflags,
+	loff_t			pos)
 {
-	struct iovec	iov = {(void __user *)buf, count};
 	struct file	*file = iocb->ki_filp;
 	struct inode	*inode = file->f_mapping->host;
 	vnode_t		*vp = LINVFS_GET_VP(inode);
@@ -109,7 +107,7 @@
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= IO_ISDIRECT;
 
-	VOP_WRITE(vp, iocb, &iov, 1, &iocb->ki_pos, ioflags, NULL, rval);
+	VOP_WRITE(vp, iocb, iov, nr_segs, &iocb->ki_pos, ioflags, NULL, rval);
 	return rval;
 }
 
@@ -117,21 +115,21 @@
 STATIC ssize_t
 linvfs_aio_write(
 	struct kiocb		*iocb,
-	const char __user	*buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_write(iocb, buf, IO_ISAIO, count, pos);
+	return __linvfs_write(iocb, iov, nr_segs, IO_ISAIO, pos);
 }
 
 STATIC ssize_t
 linvfs_aio_write_invis(
 	struct kiocb		*iocb,
-	const char __user	*buf,
-	size_t			count,
+	const struct iovec	*iov,
+	unsigned long		nr_segs,
 	loff_t			pos)
 {
-	return __linvfs_write(iocb, buf, IO_ISAIO|IO_INVIS, count, pos);
+	return __linvfs_write(iocb, iov, nr_segs, IO_ISAIO|IO_INVIS, pos);
 }
Index: linux-2.6.16-rc1.quilt/include/linux/fs.h
===================================================================
--- linux-2.6.16-rc1.quilt.orig/include/linux/fs.h	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/include/linux/fs.h	2006-02-01 16:38:14.000000000 -0800
@@ -997,9 +997,9 @@
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
 	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
-	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
-	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
@@ -1558,11 +1558,11 @@
 extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *);
-extern ssize_t generic_file_aio_read(struct kiocb *, char __user *, size_t, loff_t);
+extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *);
-extern ssize_t generic_file_aio_write(struct kiocb *, const char __user *, size_t, loff_t);
+extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
Index: linux-2.6.16-rc1.quilt/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1.quilt.orig/include/net/sock.h	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/include/net/sock.h	2006-02-01 16:38:14.000000000 -0800
@@ -650,7 +650,6 @@
 	struct sock		*sk;
 	struct scm_cookie	*scm;
 	struct msghdr		*msg, async_msg;
-	struct iovec		async_iov;
 	struct kiocb		*kiocb;
 };
Index: linux-2.6.16-rc1.quilt/mm/filemap.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/mm/filemap.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/mm/filemap.c	2006-02-01 16:38:14.000000000 -0800
@@ -1064,14 +1064,12 @@
 EXPORT_SYMBOL(__generic_file_aio_read);
 
 ssize_t
-generic_file_aio_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos)
+generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
-	struct iovec local_iov = { .iov_base = buf, .iov_len = count };
-
 	BUG_ON(iocb->ki_pos != pos);
-	return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos);
+	return __generic_file_aio_read(iocb, iov, nr_segs, &iocb->ki_pos);
 }
-
 EXPORT_SYMBOL(generic_file_aio_read);
 
 ssize_t
@@ -2131,22 +2129,21 @@
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
-EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
-ssize_t
-generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-		unsigned long nr_segs, loff_t *ppos)
+ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
+		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	ssize_t ret;
-	loff_t pos = *ppos;
 
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, ppos);
+	BUG_ON(iocb->ki_pos != pos);
+
+	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err;
+		ssize_t err;
 
 		err = sync_page_range_nolock(inode, mapping, pos, ret);
 		if (err < 0)
@@ -2154,6 +2151,7 @@
 	}
 	return ret;
 }
+EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
 static ssize_t
 __generic_file_write_nolock(struct file *file, const struct iovec *iov,
@@ -2163,9 +2161,11 @@
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *ppos;
 	ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
-	if (ret == -EIOCBQUEUED)
+	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
+	*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -2177,28 +2177,27 @@
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, file);
-	ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
+	kiocb.ki_pos = *ppos;
+	ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, *ppos);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
+	*ppos = kiocb.ki_pos;
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_write_nolock);
 
-ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf,
-			       size_t count, loff_t pos)
+ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	ssize_t ret;
-	struct iovec local_iov = { .iov_base = (void __user *)buf,
-					.iov_len = count };
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1,
-						&iocb->ki_pos);
+	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
Index: linux-2.6.16-rc1.quilt/net/socket.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/net/socket.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/net/socket.c	2006-02-01 16:38:14.000000000 -0800
@@ -98,10 +98,10 @@
 #include <linux/netfilter.h>
 
 static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
-static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf,
-			 size_t size, loff_t pos);
-static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf,
-			  size_t size, loff_t pos);
+static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			 unsigned long nr_segs, loff_t pos);
+static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
+			  unsigned long nr_segs, loff_t pos);
 static int sock_mmap(struct file *file, struct vm_area_struct * vma);
 
 static int sock_close(struct inode *inode, struct file *file);
@@ -635,11 +635,6 @@
 	return result;
 }
 
-static void sock_aio_dtor(struct kiocb *iocb)
-{
-	kfree(iocb->private);
-}
-
 static ssize_t sock_sendpage(struct file *file, struct page *page, int offset,
 			     size_t size, loff_t *ppos, int more)
 {
@@ -655,8 +650,13 @@
 	return sock->ops->sendpage(sock, page, offset, size, flags);
 }
 
+static void sock_aio_dtor(struct kiocb *iocb)
+{
+	kfree(iocb->private);
+}
+
 static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb,
-		char __user *ubuf, size_t size, struct sock_iocb *siocb)
+					 struct sock_iocb *siocb)
 {
 	if (!is_sync_kiocb(iocb)) {
 		siocb = kmalloc(sizeof(*siocb), GFP_KERNEL);
@@ -666,15 +666,13 @@
 	}
 
 	siocb->kiocb = iocb;
-	siocb->async_iov.iov_base = ubuf;
-	siocb->async_iov.iov_len = size;
-
 	iocb->private = siocb;
 	return siocb;
 }
 
 static ssize_t do_sock_read(struct msghdr *msg, struct kiocb *iocb,
-		struct file *file, struct iovec *iov, unsigned long nr_segs)
+		struct file *file, const struct iovec *iov,
+		unsigned long nr_segs)
 {
 	struct socket *sock = file->private_data;
 	size_t size = 0;
@@ -705,31 +703,33 @@
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 
-	ret = do_sock_read(&msg, &iocb, file, (struct iovec *)iov, nr_segs);
+	ret = do_sock_read(&msg, &iocb, file, iov, nr_segs);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&iocb);
 	return ret;
 }
 
-static ssize_t sock_aio_read(struct kiocb *iocb, char __user *ubuf,
-			 size_t count, loff_t pos)
+static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			 unsigned long nr_segs, loff_t pos)
 {
 	struct sock_iocb siocb, *x;
 
 	if (pos != 0)
 		return -ESPIPE;
+#if 0
 	if (count == 0)		/* Match SYS5 behaviour */
 		return 0;
+#endif
 
-	x = alloc_sock_iocb(iocb, ubuf, count, &siocb);
+	x = alloc_sock_iocb(iocb, &siocb);
 	if (!x)
 		return -ENOMEM;
-	return do_sock_read(&x->async_msg, iocb, iocb->ki_filp,
-			&x->async_iov, 1);
+	return do_sock_read(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs);
 }
 
 static ssize_t do_sock_write(struct msghdr *msg, struct kiocb *iocb,
-		struct file *file, struct iovec *iov, unsigned long nr_segs)
+		struct file *file, const struct iovec *iov,
+		unsigned long nr_segs)
 {
 	struct socket *sock = file->private_data;
 	size_t size = 0;
@@ -762,28 +762,29 @@
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
 
-	ret = do_sock_write(&msg, &iocb, file, (struct iovec *)iov, nr_segs);
+	ret = do_sock_write(&msg, &iocb, file, iov, nr_segs);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&iocb);
 	return ret;
 }
 
-static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *ubuf,
-			  size_t count, loff_t pos)
+static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
+			  unsigned long nr_segs, loff_t pos)
 {
 	struct sock_iocb siocb, *x;
 
 	if (pos != 0)
 		return -ESPIPE;
+#if 0
 	if (count == 0)		/* Match SYS5 behaviour */
 		return 0;
+#endif
 
-	x = alloc_sock_iocb(iocb, (void __user *)ubuf, count, &siocb);
+	x = alloc_sock_iocb(iocb, &siocb);
 	if (!x)
 		return -ENOMEM;
-	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp,
-			&x->async_iov, 1);
+	return do_sock_write(&x->async_msg, iocb, iocb->ki_filp, iov, nr_segs);
 }
Index: linux-2.6.16-rc1.quilt/fs/nfs/direct.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/fs/nfs/direct.c	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/fs/nfs/direct.c	2006-02-01 16:38:14.000000000 -0800
@@ -648,7 +648,8 @@
  * cache.
  */
 ssize_t
-nfs_file_direct_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos)
+nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	ssize_t retval = -EINVAL;
 	loff_t *ppos = &iocb->ki_pos;
@@ -657,10 +658,11 @@
 		(struct nfs_open_context *) file->private_data;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct iovec iov = {
-		.iov_base = buf,
-		.iov_len = count,
-	};
+	ssize_t count;
+
+	/* FIXME: Can we have multiple vectors here ? */
+	BUG_ON(nr_segs != 1);
+	count = iov->iov_len;
 
 	dprintk("nfs: direct read(%s/%s, %lu@%Ld)\n",
 		file->f_dentry->d_parent->d_name.name,
@@ -672,7 +674,7 @@
 	if (count < 0)
 		goto out;
 	retval = -EFAULT;
-	if (!access_ok(VERIFY_WRITE, iov.iov_base, iov.iov_len))
+	if (!access_ok(VERIFY_WRITE, iov->iov_base, iov->iov_len))
 		goto out;
 	retval = 0;
 	if (!count)
@@ -682,7 +684,7 @@
 	if (retval)
 		goto out;
 
-	retval = nfs_direct_read(inode, ctx, &iov, pos, 1);
+	retval = nfs_direct_read(inode, ctx, iov, pos, 1);
 	if (retval > 0)
 		*ppos = pos + retval;
 
@@ -716,7 +718,8 @@
  * is no atomic O_APPEND write facility in the NFS protocol.
  */
 ssize_t
-nfs_file_direct_write(struct kiocb *iocb, const char __user *buf, size_t count, loff_t pos)
+nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos)
 {
 	ssize_t retval;
 	struct file *file = iocb->ki_filp;
@@ -724,9 +727,11 @@
 		(struct nfs_open_context *) file->private_data;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct iovec iov = {
-		.iov_base = (char __user *)buf,
-	};
+	ssize_t count;
+
+	/* FIXME: Can we have multiple vectors here ? */
+	BUG_ON(nr_segs != 1);
+	count = iov->iov_len;
 
 	dfprintk(VFS, "nfs: direct write(%s/%s, %lu@%Ld)\n",
 		file->f_dentry->d_parent->d_name.name,
@@ -747,17 +752,16 @@
 	retval = 0;
 	if (!count)
 		goto out;
-	iov.iov_len = count,
 
 	retval = -EFAULT;
-	if (!access_ok(VERIFY_READ, iov.iov_base, iov.iov_len))
+	if (!access_ok(VERIFY_READ, iov->iov_base, iov->iov_len))
 		goto out;
 
 	retval = nfs_sync_mapping(mapping);
 	if (retval)
 		goto out;
 
-	retval = nfs_direct_write(inode, ctx, &iov, pos, 1);
+	retval = nfs_direct_write(inode, ctx, iov, pos, 1);
 	if (mapping->nrpages)
 		invalidate_inode_pages2(mapping);
 	if (retval > 0)
Index: linux-2.6.16-rc1.quilt/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.16-rc1.quilt.orig/include/linux/nfs_fs.h	2006-02-01 16:36:55.000000000 -0800
+++ linux-2.6.16-rc1.quilt/include/linux/nfs_fs.h	2006-02-01 16:38:14.000000000 -0800
@@ -369,10 +369,10 @@
  */
 extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *,
 			loff_t, unsigned long);
-extern ssize_t nfs_file_direct_read(struct kiocb *iocb, char __user *buf,
-			size_t count, loff_t pos);
-extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const char __user *buf,
-			size_t count, loff_t pos);
+extern ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *,
+			unsigned long nr_segs, loff_t pos);
+extern ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *,
+			unsigned long nr_segs, loff_t pos);
 
 /*
  * linux/fs/nfs/dir.c
Index: linux-2.6.16-rc1.quilt/drivers/usb/gadget/inode.c
===================================================================
--- linux-2.6.16-rc1.quilt.orig/drivers/usb/gadget/inode.c	2006-02-01 16:16:50.000000000 -0800
+++ linux-2.6.16-rc1.quilt/drivers/usb/gadget/inode.c	2006-02-01 16:39:06.000000000 -0800
@@ -675,32 +675,46 @@
 }
 
 static ssize_t
-ep_aio_read(struct kiocb *iocb, char __user *ubuf, size_t len, loff_t o)
+ep_aio_read(struct kiocb *iocb, const struct iovec *iv,
+		unsigned long count, loff_t o)
 {
 	struct ep_data		*epdata = iocb->ki_filp->private_data;
 	char			*buf;
+	size_t			len;
 
 	if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN))
 		return -EINVAL;
+
+	/* FIXME: Can we really get a vector here ? If so, handle it */
+	BUG_ON(count != 1);
+	len = iv->iov_len;
+
 	buf = kmalloc(len, GFP_KERNEL);
 	if (unlikely(!buf))
 		return -ENOMEM;
 
 	iocb->ki_retry = ep_aio_read_retry;
-	return ep_aio_rwtail(iocb, buf, len, epdata, ubuf);
+	return ep_aio_rwtail(iocb, buf, len, epdata, iv->iov_base);
 }
 
 static ssize_t
-ep_aio_write(struct kiocb *iocb, const char __user *ubuf, size_t len, loff_t o)
+ep_aio_write(struct kiocb *iocb, const struct iovec *iv,
+		unsigned long count, loff_t o)
 {
 	struct ep_data		*epdata = iocb->ki_filp->private_data;
 	char			*buf;
+	size_t			len;
 
 	if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN)))
 		return -EINVAL;
+
+	/* FIXME: Can we really get a vector here ? If so, handle it */
+	BUG_ON(count != 1);
+	len = iv->iov_len;
+
 	buf = kmalloc(len, GFP_KERNEL);
 	if (unlikely(!buf))
 		return -ENOMEM;
-	if (unlikely(copy_from_user(buf, ubuf, len) != 0)) {
+	if (unlikely(copy_from_user(buf, iv->iov_base, len) != 0)) {
 		kfree(buf);
 		return -EFAULT;
 	}

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-02-02 16:14 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
@ 2006-02-04 13:28 ` Christoph Hellwig
  2006-02-04 22:10   ` Badari Pulavarty
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2006-02-04 13:28 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Benjamin LaHaise, Zach Brown, lkml, linux-fsdevel

	do {
-		ret = file->f_op->aio_read(iocb, iocb->ki_buf,
-			iocb->ki_left, iocb->ki_pos);
+		struct iovec iov = {
+			.iov_base = iocb->ki_buf,
+			.iov_len = iocb->ki_left
+		};
+
+		ret = file->f_op->aio_read(iocb, &iov, 1, iocb->ki_pos);

this still has the lifetime problems Ben pointed out.  aio might still
be outstanding when this thread has returned to userspace, so we need to
dynamically allocate the iovec and free it later.  (or make it part
of the iocb?)

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [PATCH 1/3] Vectorize aio_read/aio_write methods
  2006-02-04 13:28 ` Christoph Hellwig
@ 2006-02-04 22:10   ` Badari Pulavarty
  0 siblings, 0 replies; 58+ messages in thread
From: Badari Pulavarty @ 2006-02-04 22:10 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Benjamin LaHaise, Zach Brown, lkml, linux-fsdevel

Christoph Hellwig wrote:
> 	do {
> -		ret = file->f_op->aio_read(iocb, iocb->ki_buf,
> -			iocb->ki_left, iocb->ki_pos);
> +		struct iovec iov = {
> +			.iov_base = iocb->ki_buf,
> +			.iov_len = iocb->ki_left
> +		};
> +
> +		ret = file->f_op->aio_read(iocb, &iov, 1, iocb->ki_pos);
>
> this still has the lifetime problems Ben pointed out.  aio might still
> be outstanding when this thread has returned to userspace, so we need to
> dynamically allocate the iovec and free it later.  (or make it part
> of the iocb?)

I left that intentionally alone. I was planning to make it a special
case of Zach's vectored IO handling code.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 58+ messages in thread
end of thread, other threads: [~2006-05-10 20:50 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-08  0:19 [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support Badari Pulavarty
2006-03-08  0:22 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
2006-03-08 12:44 ` christoph
2006-03-08  0:23 ` [PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write instead Badari Pulavarty
2006-03-08 12:45 ` christoph
2006-03-08 16:26 ` Badari Pulavarty
2006-03-08  0:24 ` [PATCH 3/3] Zach's core aio changes to support vectored AIO Badari Pulavarty
2006-03-08  3:37 ` Benjamin LaHaise
2006-03-08 16:34 ` Badari Pulavarty
2006-03-08 12:47 ` [RFC PATCH 0/3] VFS changes to collapse all the vectored and AIO support christoph
2006-03-08 16:24 ` Badari Pulavarty
2006-03-09 16:17 ` ext3_ordered_writepage() questions Badari Pulavarty
2006-03-09 23:35 ` Andrew Morton
2006-03-10  0:36 ` Badari Pulavarty
2006-03-16 18:09 ` Theodore Ts'o
2006-03-16 18:22 ` Badari Pulavarty
2006-03-16 21:04 ` Theodore Ts'o
2006-03-16 21:57 ` Badari Pulavarty
2006-03-16 22:05 ` Jan Kara
2006-03-16 23:45 ` Badari Pulavarty
2006-03-17  0:44 ` Theodore Ts'o
2006-03-17  0:54 ` Andreas Dilger
2006-03-17 17:05 ` Stephen C. Tweedie
2006-03-17 21:32 ` Badari Pulavarty
2006-03-17 22:22 ` Stephen C. Tweedie
2006-03-17 22:38 ` Badari Pulavarty
2006-03-17 23:23 ` Mingming Cao
2006-03-20 17:05 ` Stephen C. Tweedie
2006-03-18  2:57 ` Suparna Bhattacharya
2006-03-18  3:02 ` Suparna Bhattacharya
2006-03-17 15:32 ` Jamie Lokier
2006-03-17 21:50 ` Stephen C. Tweedie
2006-03-17 22:11 ` Theodore Ts'o
2006-03-17 22:44 ` Jamie Lokier
2006-03-18 23:40 ` Theodore Ts'o
2006-03-19  2:36 ` Jamie Lokier
2006-03-19  5:28 ` Chris Adams
2006-03-20  2:18 ` Theodore Ts'o
2006-03-20 16:26 ` Stephen C. Tweedie
2006-03-17 22:23 ` Jamie Lokier
-- strict thread matches above, loose matches on Subject: below --
2006-05-02 15:07 [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops Badari Pulavarty
2006-05-02 15:08 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
2006-05-02 15:20 ` Chuck Lever
2006-05-02 15:35 ` Badari Pulavarty
2006-05-09 18:03 ` [PATCH 0/3] VFS changes to collapse AIO and vectored IO into single (set of) fileops Badari Pulavarty
2006-05-09 18:07 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
2006-05-09 19:01 ` Andrew Morton
2006-05-09 19:03 ` Christoph Hellwig
2006-05-09 19:13 ` Andrew Morton
2006-05-09 19:20 ` Christoph Hellwig
2006-05-09 23:57 ` Badari Pulavarty
2006-05-10  8:00 ` Christoph Hellwig
2006-05-10 15:01 ` Badari Pulavarty
2006-05-10 16:01 ` Badari Pulavarty
2006-05-10 20:50 ` Badari Pulavarty
2006-05-09 20:07 ` Badari Pulavarty
2006-05-09 23:53 ` Badari Pulavarty
2006-02-02 16:12 [PATCH 0/3] VFS changes to collapse all the vectored and AIO support Badari Pulavarty
2006-02-02 16:14 ` [PATCH 1/3] Vectorize aio_read/aio_write methods Badari Pulavarty
2006-02-04 13:28 ` Christoph Hellwig
2006-02-04 22:10 ` Badari Pulavarty