* [PATCH] Add block device specific splice write method
From: Dmitri Monakhov @ 2008-10-19 14:00 UTC
To: linux-kernel; +Cc: linux-fsdevel, Dmitri Monakhov

Block device write procedure is different from regular file:
 - Actual write performed without i_mutex.
 - It has no metadata, so generic_osync_inode(OSYNC_METADATA) can not livelock.
 - We do not have to worry about S_ISUID/S_ISGID bits.

Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
---
 fs/block_dev.c     |    2 +-
 fs/splice.c        |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 3 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7ce823c..9aa63b5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1251,7 +1251,7 @@ const struct file_operations def_blk_fops = {
 	.compat_ioctl	= compat_blkdev_ioctl,
 #endif
 	.splice_read	= generic_file_splice_read,
-	.splice_write	= generic_file_splice_write,
+	.splice_write	= blkdev_splice_write,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/fs/splice.c b/fs/splice.c
index a1e701c..f0ba76c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -884,6 +884,54 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
 
 EXPORT_SYMBOL(generic_splice_sendpage);
 
+/**
+ * blkdev_splice_write - splice data from a pipe to a block device
+ * @pipe:	pipe info
+ * @out:	file to write to
+ * @ppos:	position in @out
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Description:
+ *    Will either move or copy pages (determined by @flags options) from
+ *    the given pipe inode to the given block device.
+ *    Note: blockdev's i_mutex is not held on entry and it is never taken.
+ */
+ssize_t
+blkdev_splice_write(struct pipe_inode_info *pipe, struct file *out,
+			  loff_t *ppos, size_t len, unsigned int flags)
+{
+	struct address_space *mapping = out->f_mapping;
+	struct inode *inode = mapping->host;
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = *ppos,
+		.u.file = out,
+	};
+	ssize_t ret;
+	unsigned long nr_pages;
+	mutex_lock(&pipe->inode->i_mutex);
+	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
+	mutex_unlock(&pipe->inode->i_mutex);
+	if (ret <= 0)
+		return ret;
+
+	*ppos += ret;
+	nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+	if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
+		int err;
+		err = sync_page_range_nolock(inode, mapping, *ppos, ret);
+		if (err)
+			ret = err;
+	}
+	balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
+	return ret;
+}
+
+EXPORT_SYMBOL(blkdev_splice_write);
+
 /*
  * Attempt to initiate a splice from pipe to file.
  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 194b607..8543b21 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1960,6 +1960,8 @@ extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
 		struct file *out, loff_t *, size_t len, unsigned int flags);
+extern ssize_t blkdev_splice_write(struct pipe_inode_info *pipe,
+		struct file *out, loff_t *, size_t len, unsigned int flags);
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 		size_t len, unsigned int flags);
 
-- 
1.5.4.3
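The bookkeeping that follows a successful splice in the patch above (advance the file position, round the byte count up to whole pages) can be modelled in user space. The following is only a sketch under stated assumptions: `SKETCH_PAGE_SHIFT`, `struct written_range` and `account_write()` are hypothetical names that do not exist in the kernel, and a 4096-byte page is assumed. The sketch records the start offset before moving the position, because the bytes just written occupy the range that begins at the old position.

```c
#include <stddef.h>

/* Assumed page size for illustration; the patch uses PAGE_CACHE_SIZE. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical record of what one successful splice wrote. */
struct written_range {
	long long start;        /* offset of the first byte written */
	size_t len;             /* number of bytes written */
	unsigned long nr_pages; /* len rounded up to whole pages */
};

/*
 * Advance *ppos by ret bytes and describe the range that was written.
 * The start offset is captured before *ppos moves, since the data
 * occupies [old *ppos, old *ppos + ret).
 */
static struct written_range account_write(long long *ppos, size_t ret)
{
	struct written_range r;

	r.start = *ppos;
	r.len = ret;
	/* Same rounding as (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT. */
	r.nr_pages = (ret + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	*ppos += (long long)ret;
	return r;
}
```

A caller that needs to flush the written data would pass `r.start` and `r.len` to its sync routine, and `r.nr_pages` to whatever throttling it performs.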
* Re: [PATCH] Add block device specific splice write method
From: Jens Axboe @ 2008-10-20 17:49 UTC
To: Dmitri Monakhov; +Cc: linux-kernel, linux-fsdevel

On Sun, Oct 19 2008, Dmitri Monakhov wrote:
> Block device write procedure is different from regular file:
> - Actual write performed without i_mutex.
> - It has no metadata, so generic_osync_inode(OSYNC_METADATA) can not livelock.
> - We do not have to worry about S_ISUID/S_ISGID bits.

I already did an O_DIRECT part of block device splicing [1], I'll fold
this into the splice branch and double check with some testing.

[1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5

-- 
Jens Axboe
* Re: [PATCH] Add block device specific splice write method
From: Jens Axboe @ 2008-10-20 18:11 UTC
To: Dmitri Monakhov; +Cc: linux-kernel, linux-fsdevel

On Mon, Oct 20 2008, Jens Axboe wrote:
> On Sun, Oct 19 2008, Dmitri Monakhov wrote:
> > Block device write procedure is different from regular file:
> > - Actual write performed without i_mutex.
> > - It has no metadata, so generic_osync_inode(OSYNC_METADATA) can not livelock.
> > - We do not have to worry about S_ISUID/S_ISGID bits.
>
> I already did an O_DIRECT part of block device splicing [1], I'll fold
> this into the splice branch and double check with some testing.
>
> [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5

The below is what I merged. Note that I changed the naming and made the
function look a lot more like the other splice helpers, so it's more
apparent how it differs. Let me know if I can add your Signed-off-by to
this one (preferably after you test it as well :-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4d154dc..083198a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1288,7 +1288,7 @@ new_bio:
  * Splice to file opened with O_DIRECT. Bypass caching completely and
  * just go direct-to-bio
  */
-static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
+static ssize_t __block_splice_direct_write(struct pipe_inode_info *pipe,
 				    struct file *out, loff_t *ppos, size_t len,
 				    unsigned int flags)
 {
@@ -1318,6 +1318,9 @@ static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
 	if (bsd.bio)
 		submit_bio(WRITE, bsd.bio);
 
+	if (ret > 0)
+		*ppos += ret;
+
 	return ret;
 }
 
@@ -1327,12 +1330,11 @@ static ssize_t block_splice_write(struct pipe_inode_info *pipe,
 {
 	ssize_t ret;
 
-	if (out->f_flags & O_DIRECT) {
-		ret = __block_splice_write(pipe, out, ppos, len, flags);
-		if (ret > 0)
-			*ppos += ret;
-	} else
-		ret = generic_file_splice_write(pipe, out, ppos, len, flags);
+	if (out->f_flags & O_DIRECT)
+		ret = __block_splice_direct_write(pipe, out, ppos, len, flags);
+	else
+		ret = generic_file_splice_write_file_nolock(pipe, out, ppos,
+								len, flags);
 
 	return ret;
 }
diff --git a/fs/splice.c b/fs/splice.c
index 4108264..eb1e1ac 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -788,6 +788,59 @@ ssize_t splice_from_pipe(struct pipe_inode_info *pipe, struct file *out,
 }
 
 /**
+ * generic_file_splice_write_file_nolock - splice data from a pipe to a file
+ * @pipe:	pipe info
+ * @out:	file to write to
+ * @ppos:	position in @out
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Description:
+ *    Will either move or copy pages (determined by @flags options) from
+ *    the given pipe inode to the given block device.
+ *    Note: this is like @generic_file_splice_write, except that we
+ *    don't bother locking the output file. Useful for splicing directly
+ *    to a block device.
+ */
+ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
+					      struct file *out, loff_t *ppos,
+					      size_t len, unsigned int flags)
+{
+	struct address_space *mapping = out->f_mapping;
+	struct inode *inode = mapping->host;
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = *ppos,
+		.u.file = out,
+	};
+	ssize_t ret;
+
+	mutex_lock(&pipe->inode->i_mutex);
+	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
+	mutex_unlock(&pipe->inode->i_mutex);
+
+	if (ret > 0) {
+		unsigned long nr_pages;
+
+		*ppos += ret;
+		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
+			int er;
+
+			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
+			if (er)
+				ret = er;
+		}
+		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
+
+/**
  * generic_file_splice_write_nolock - generic_file_splice_write without mutexes
  * @pipe:	pipe info
  * @out:	file to write to
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a6a625b..5c9b880 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1957,6 +1957,8 @@ extern ssize_t generic_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
+extern ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *,
+		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
 		struct file *out, loff_t *, size_t len, unsigned int flags);
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,

-- 
Jens Axboe
* Re: [PATCH] Add block device specific splice write method
From: Dmitri Monakhov @ 2008-10-20 18:42 UTC
To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Mon, Oct 20 2008, Jens Axboe wrote:
>> On Sun, Oct 19 2008, Dmitri Monakhov wrote:
>> > Block device write procedure is different from regular file:
>> > - Actual write performed without i_mutex.
>> > - It has no metadata, so generic_osync_inode(OSYNC_METADATA) can not livelock.
>> > - We do not have to worry about S_ISUID/S_ISGID bits.
>>
>> I already did an O_DIRECT part of block device splicing [1], I'll fold
>> this into the splice branch and double check with some testing.
>>
>> [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5
>
> The below is what I merged. Note that I changed the naming and made the
> function look a lot more like the other splice helpers, so it's more
> apparent how it differs. Let me know if I can add your Signed-off-by to
Of course, yes.
> this one (preferably after you test it as well :-)
I'm currently testing this.
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4d154dc..083198a 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1288,7 +1288,7 @@ new_bio:
>   * Splice to file opened with O_DIRECT. Bypass caching completely and
>   * just go direct-to-bio
>   */
> -static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
> +static ssize_t __block_splice_direct_write(struct pipe_inode_info *pipe,
>  				    struct file *out, loff_t *ppos, size_t len,
>  				    unsigned int flags)
>  {
> @@ -1318,6 +1318,9 @@ static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
>  	if (bsd.bio)
>  		submit_bio(WRITE, bsd.bio);
>  
> +	if (ret > 0)
> +		*ppos += ret;
> +
>  	return ret;
>  }
>  
> @@ -1327,12 +1330,11 @@ static ssize_t block_splice_write(struct pipe_inode_info *pipe,
>  {
>  	ssize_t ret;
>  
> -	if (out->f_flags & O_DIRECT) {
> -		ret = __block_splice_write(pipe, out, ppos, len, flags);
> -		if (ret > 0)
> -			*ppos += ret;
> -	} else
> -		ret = generic_file_splice_write(pipe, out, ppos, len, flags);
> +	if (out->f_flags & O_DIRECT)
> +		ret = __block_splice_direct_write(pipe, out, ppos, len, flags);
> +	else
> +		ret = generic_file_splice_write_file_nolock(pipe, out, ppos,
> +								len, flags);
>  
>  	return ret;
>  }
> diff --git a/fs/splice.c b/fs/splice.c
> index 4108264..eb1e1ac 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -788,6 +788,59 @@ ssize_t splice_from_pipe(struct pipe_inode_info *pipe, struct file *out,
>  }
>  
>  /**
> + * generic_file_splice_write_file_nolock - splice data from a pipe to a file
> + * @pipe:	pipe info
> + * @out:	file to write to
> + * @ppos:	position in @out
> + * @len:	number of bytes to splice
> + * @flags:	splice modifier flags
> + *
> + * Description:
> + *    Will either move or copy pages (determined by @flags options) from
> + *    the given pipe inode to the given block device.
> + *    Note: this is like @generic_file_splice_write, except that we
> + *    don't bother locking the output file. Useful for splicing directly
> + *    to a block device.
> + */
> +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
> +					      struct file *out, loff_t *ppos,
> +					      size_t len, unsigned int flags)
> +{
> +	struct address_space *mapping = out->f_mapping;
> +	struct inode *inode = mapping->host;
> +	struct splice_desc sd = {
> +		.total_len = len,
> +		.flags = flags,
> +		.pos = *ppos,
> +		.u.file = out,
> +	};
> +	ssize_t ret;
> +
> +	mutex_lock(&pipe->inode->i_mutex);
> +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
> +	mutex_unlock(&pipe->inode->i_mutex);
> +
> +	if (ret > 0) {
> +		unsigned long nr_pages;
> +
> +		*ppos += ret;
> +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> +
> +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> +			int er;
> +
> +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
> +			if (er)
> +				ret = er;
> +		}
> +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
> +
> +/**
>   * generic_file_splice_write_nolock - generic_file_splice_write without mutexes
>   * @pipe:	pipe info
>   * @out:	file to write to
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a6a625b..5c9b880 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1957,6 +1957,8 @@ extern ssize_t generic_file_splice_write(struct pipe_inode_info *,
>  		struct file *, loff_t *, size_t, unsigned int);
>  extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *,
>  		struct file *, loff_t *, size_t, unsigned int);
> +extern ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *,
> +		struct file *, loff_t *, size_t, unsigned int);
>  extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
>  		struct file *out, loff_t *, size_t len, unsigned int flags);
>  extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
* Re: [PATCH] Add block device specific splice write method
From: Andrew Morton @ 2008-10-23 5:39 UTC
To: Jens Axboe; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Mon, 20 Oct 2008 20:11:56 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
> +					      struct file *out, loff_t *ppos,
> +					      size_t len, unsigned int flags)
> +{
> +	struct address_space *mapping = out->f_mapping;
> +	struct inode *inode = mapping->host;
> +	struct splice_desc sd = {
> +		.total_len = len,
> +		.flags = flags,
> +		.pos = *ppos,
> +		.u.file = out,
> +	};
> +	ssize_t ret;
> +
> +	mutex_lock(&pipe->inode->i_mutex);
> +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
> +	mutex_unlock(&pipe->inode->i_mutex);
> +
> +	if (ret > 0) {
> +		unsigned long nr_pages;
> +
> +		*ppos += ret;
> +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> +
> +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> +			int er;
> +
> +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
> +			if (er)
> +				ret = er;
> +		}
> +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);

I don't think the balance_dirty_pages() is needed if we just did the
sync_page_range().

But really it'd be better if the throttling happened down in
pipe_to_file(), on a per-page basis.  As it stands we can dirty an
arbitrary number of pagecache pages without throttling.  I think?
* Re: [PATCH] Add block device specific splice write method
From: Jens Axboe @ 2008-10-23 6:29 UTC
To: Andrew Morton; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Wed, Oct 22 2008, Andrew Morton wrote:
> On Mon, 20 Oct 2008 20:11:56 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
> > +					      struct file *out, loff_t *ppos,
> > +					      size_t len, unsigned int flags)
> > +{
> > +	struct address_space *mapping = out->f_mapping;
> > +	struct inode *inode = mapping->host;
> > +	struct splice_desc sd = {
> > +		.total_len = len,
> > +		.flags = flags,
> > +		.pos = *ppos,
> > +		.u.file = out,
> > +	};
> > +	ssize_t ret;
> > +
> > +	mutex_lock(&pipe->inode->i_mutex);
> > +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
> > +	mutex_unlock(&pipe->inode->i_mutex);
> > +
> > +	if (ret > 0) {
> > +		unsigned long nr_pages;
> > +
> > +		*ppos += ret;
> > +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> > +
> > +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > +			int er;
> > +
> > +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
> > +			if (er)
> > +				ret = er;
> > +		}
> > +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
> > +	}
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
>
> I don't think the balance_dirty_pages() is needed if we just did the
> sync_page_range().

Good point, I think we can get rid of that.

> But really it'd be better if the throttling happened down in
> pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> arbitrary number of pagecache pages without throttling.  I think?

That's pretty exactly why it isn't done in the actor, to avoid doing it
per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
is better.

Back in the splice early days, the balance_dirty_pages() actually showed
up in profiles when it was done on a per-page basis. So I'm reluctant to
change it :-)

-- 
Jens Axboe
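The trade-off discussed in the message above (throttle once per page inside the actor, or once per splice call for a batch of at most PIPE_BUFFERS pages) can be illustrated by counting how often the throttling code is entered under each placement. This is a user-space sketch only: `struct throttle_stats`, `throttle()` and the two `splice_throttle_*` helpers are hypothetical names, and `throttle()` merely counts calls instead of doing any real work.

```c
#include <stddef.h>

#define SKETCH_PIPE_BUFFERS 16 /* pipe capacity in pages, as stated in the thread */

/* Hypothetical counters standing in for calls into the throttling code. */
struct throttle_stats {
	unsigned long calls;         /* times the throttle hook was entered */
	unsigned long pages_charged; /* total pages reported to it */
};

/* Stand-in for the throttle hook: it only records what it was told. */
static void throttle(struct throttle_stats *st, unsigned long nr_pages)
{
	st->calls++;
	st->pages_charged += nr_pages;
}

/* Per-page placement: the actor throttles after every page it dirties. */
static void splice_throttle_per_page(struct throttle_stats *st,
				     unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		throttle(st, 1);
}

/* Per-call placement: one throttle call covering the whole batch. */
static void splice_throttle_per_call(struct throttle_stats *st,
				     unsigned long nr_pages)
{
	if (nr_pages > 0)
		throttle(st, nr_pages);
}
```

Both placements charge the same number of pages; they differ only in how many times the hook is entered, which is the cost the message above refers to.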
* Re: [PATCH] Add block device specific splice write method
From: Andrew Morton @ 2008-10-23 6:41 UTC
To: Jens Axboe; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> > But really it'd be better if the throttling happened down in
> > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > arbitrary number of pagecache pages without throttling.  I think?
>
> That's pretty exactly why it isn't done in the actor, to avoid doing it
> per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> is better.
>
> Back in the splice early days, the balance_dirty_pages() actually showed
> up in profiles when it was done on a per-page basis. So I'm reluctant to
> change it :-)

That's why (the misnamed) balance_dirty_pages_ratelimited() exists?
* Re: [PATCH] Add block device specific splice write method
From: Jens Axboe @ 2008-10-23 6:51 UTC
To: Andrew Morton; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Wed, Oct 22 2008, Andrew Morton wrote:
> On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > > But really it'd be better if the throttling happened down in
> > > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > > arbitrary number of pagecache pages without throttling.  I think?
> >
> > That's pretty exactly why it isn't done in the actor, to avoid doing it
> > per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> > is better.
> >
> > Back in the splice early days, the balance_dirty_pages() actually showed
> > up in profiles when it was done on a per-page basis. So I'm reluctant to
> > change it :-)
>
> That's why (the misnamed) balance_dirty_pages_ratelimited() exists?

I think that is what was used, but the details are a little hazy at this
point. So I can't say for sure.

In this case it's moot anyway, since we can kill it.

-- 
Jens Axboe
* Re: [PATCH] Add block device specific splice write method
From: Andrew Morton @ 2008-10-23 7:03 UTC
To: Jens Axboe; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Thu, 23 Oct 2008 08:51:13 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Wed, Oct 22 2008, Andrew Morton wrote:
> > On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> >
> > > > But really it'd be better if the throttling happened down in
> > > > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > > > arbitrary number of pagecache pages without throttling.  I think?
> > >
> > > That's pretty exactly why it isn't done in the actor, to avoid doing it
> > > per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> > > is better.
> > >
> > > Back in the splice early days, the balance_dirty_pages() actually showed
> > > up in profiles when it was done on a per-page basis. So I'm reluctant to
> > > change it :-)
> >
> > That's why (the misnamed) balance_dirty_pages_ratelimited() exists?
>
> I think that is what was used, but the details are a little hazy at this
> point. So I can't say for sure.

All that function does is to bump a per-cpu variable and
once-per-thousand or so it does the balance.  If it was causing
problems in the splice application we want to know, because write()
uses it!

> In this case it's moot anyway, since we can kill it.

Nope, we can only remove it if the fd is O_SYNC||is_sync().
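The ratelimiting pattern described in the message above (bump a counter on every call, and only occasionally take the expensive path) can be sketched in user space as follows. This is a model, not the kernel's implementation: `struct ratelimit` and `ratelimited_balance()` are hypothetical names, a plain struct stands in for the per-CPU variable, and the threshold is a parameter because the thread itself gives differing figures for it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU counter described above. */
struct ratelimit {
	unsigned long dirtied;   /* pages dirtied since the last balance */
	unsigned long threshold; /* pages allowed between balances (assumed) */
	unsigned long balances;  /* how many times the expensive path ran */
};

/*
 * Charge nr_pages to the counter. Returns true when the threshold was
 * reached and the (here simulated) expensive balance step ran.
 */
static bool ratelimited_balance(struct ratelimit *rl, unsigned long nr_pages)
{
	rl->dirtied += nr_pages;
	if (rl->dirtied < rl->threshold)
		return false; /* cheap path: only the counter moved */

	rl->dirtied = 0; /* expensive path: reset the counter and balance */
	rl->balances++;
	return true;
}
```

With this shape, a caller that charges one page per call and a caller that charges sixteen pages in one call reach the expensive path after the same number of dirtied pages; only the number of cheap counter updates differs.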
* Re: [PATCH] Add block device specific splice write method
From: Jens Axboe @ 2008-10-23 7:16 UTC
To: Andrew Morton; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Thu, Oct 23 2008, Andrew Morton wrote:
> On Thu, 23 Oct 2008 08:51:13 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > On Wed, Oct 22 2008, Andrew Morton wrote:
> > > On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > >
> > > > > But really it'd be better if the throttling happened down in
> > > > > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > > > > arbitrary number of pagecache pages without throttling.  I think?
> > > >
> > > > That's pretty exactly why it isn't done in the actor, to avoid doing it
> > > > per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> > > > is better.
> > > >
> > > > Back in the splice early days, the balance_dirty_pages() actually showed
> > > > up in profiles when it was done on a per-page basis. So I'm reluctant to
> > > > change it :-)
> > >
> > > That's why (the misnamed) balance_dirty_pages_ratelimited() exists?
> >
> > I think that is what was used, but the details are a little hazy at this
> > point. So I can't say for sure.
>
> All that function does is to bump a per-cpu variable and
> once-per-thousand or so it does the balance.  If it was causing
> problems in the splice application we want to know, because write()
> uses it!

Once per 8 or 32. If we haven't exceeded the dirty limit, calling it in
the actor or at the end should not make a difference for splice, since
we should be going into balance_dirty_pages() at most once.

Perhaps it was different some years ago, or perhaps the micro benchmarks
were screwed. Or perhaps my memory is shot, can't say for sure :)

> > In this case it's moot anyway, since we can kill it.
>
> Nope, we can only remove it if the fd is O_SYNC||is_sync().

Right, I forgot this is still the buffered path.

-- 
Jens Axboe
* Re: [PATCH] Add block device specific splice write method
From: Dmitri Monakhov @ 2008-10-23 8:41 UTC
To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, linux-fsdevel

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 20 Oct 2008 20:11:56 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
>
>> +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
>> +					      struct file *out, loff_t *ppos,
>> +					      size_t len, unsigned int flags)
>> +{
>> +	struct address_space *mapping = out->f_mapping;
>> +	struct inode *inode = mapping->host;
>> +	struct splice_desc sd = {
>> +		.total_len = len,
>> +		.flags = flags,
>> +		.pos = *ppos,
>> +		.u.file = out,
>> +	};
>> +	ssize_t ret;
>> +
>> +	mutex_lock(&pipe->inode->i_mutex);
>> +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
>> +	mutex_unlock(&pipe->inode->i_mutex);
>> +
>> +	if (ret > 0) {
>> +		unsigned long nr_pages;
>> +
>> +		*ppos += ret;
>> +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> +
>> +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
>> +			int er;
>> +
>> +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
>> +			if (er)
>> +				ret = er;
>> +		}
>> +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
>> +	}
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
>
> I don't think the balance_dirty_pages() is needed if we just did the
> sync_page_range().
I think so too, but I did it this way because all the other writers do it.
>
> But really it'd be better if the throttling happened down in
> pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> arbitrary number of pagecache pages without throttling.  I think?
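The point the participants converge on in this subthread, namely that throttling is redundant only when the written range was just synced and is still needed on the plain buffered path, can be written down as a small decision function. This is an illustrative sketch and not code from any of the patches: `enum post_write_action` and `post_write_action()` are hypothetical names, and `is_sync` stands for the `(out->f_flags & O_SYNC) || IS_SYNC(inode)` test in the quoted function.

```c
#include <stdbool.h>

/* Hypothetical labels for what a writer does after the data is copied. */
enum post_write_action {
	ACTION_NONE,       /* nothing was written, or the write failed */
	ACTION_SYNC_RANGE, /* sync file or inode: flush the written range */
	ACTION_THROTTLE    /* buffered write: apply dirty-page throttling */
};

/*
 * Pick the follow-up step for a splice that returned ret bytes.
 * is_sync models the O_SYNC / IS_SYNC(inode) check.
 */
static enum post_write_action post_write_action(long ret, bool is_sync)
{
	if (ret <= 0)
		return ACTION_NONE;
	/* A synced range no longer holds dirty pages, so throttling it is moot. */
	return is_sync ? ACTION_SYNC_RANGE : ACTION_THROTTLE;
}
```

The function posted in the thread instead throttles unconditionally after the optional sync; the sketch only captures the alternative the reviewers describe.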
* Re: [PATCH] Add block device specific splice write method
From: Dmitri Monakhov @ 2008-10-20 18:29 UTC
To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Sun, Oct 19 2008, Dmitri Monakhov wrote:
>> Block device write procedure is different from regular file:
>> - Actual write performed without i_mutex.
>> - It has no metadata, so generic_osync_inode(OSYNC_METADATA) can not livelock.
>> - We do not have to worry about S_ISUID/S_ISGID bits.
>
> I already did an O_DIRECT part of block device splicing [1], I'll fold
> this into the splice branch and double check with some testing.
>
> [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5
OK, I missed this branch :(. Your approach is really cool.
But the current patch does not seem completely ready:
O_DIRECT case:
  - the sync case is missing; someone may want to use it with O_DIRECT|O_SYNC
  - I'm not sure why it is necessary to always hold bd_inode->i_mutex
    inside __splice_from_pipe(.., pipe_to_disk)
!O_DIRECT case:
  - it still uses generic_file_splice_write

So I'll rebase onto your patch and:
  - add the necessary fixes for the direct case.
  - redo my patch on top of yours for buffered writes.

What do you think?
* Re: [PATCH] Add block device specific splice write method
From: Jens Axboe @ 2008-10-20 18:33 UTC
To: Dmitri Monakhov; +Cc: linux-kernel, linux-fsdevel

On Mon, Oct 20 2008, Dmitri Monakhov wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
>
> > On Sun, Oct 19 2008, Dmitri Monakhov wrote:
> >> Block device write procedure is different from regular file:
> >> - Actual write performed without i_mutex.
> >> - It has no metadata, so generic_osync_inode(OSYNC_METADATA) can not livelock.
> >> - We do not have to worry about S_ISUID/S_ISGID bits.
> >
> > I already did an O_DIRECT part of block device splicing [1], I'll fold
> > this into the splice branch and double check with some testing.
> >
> > [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5
> OK, I missed this branch :(. Your approach is really cool.
> But the current patch does not seem completely ready:

Not surprising, it's still pretty fresh. The core of it works, which was
the first objective :-)

> O_DIRECT case:
>   - the sync case is missing; someone may want to use it with O_DIRECT|O_SYNC

Good point, I'll update that to wait on in-progress bios.

>   - I'm not sure why it is necessary to always hold bd_inode->i_mutex
>     inside __splice_from_pipe(.., pipe_to_disk)

It is not, I'll drop that too.

> !O_DIRECT case:
>   - it still uses generic_file_splice_write

Well, the patch adds O_DIRECT support, so that's not really a missing
piece!

> So I'll rebase onto your patch and:
>   - add the necessary fixes for the direct case.
>   - redo my patch on top of yours for buffered writes.
>
> What do you think?

Please just send a patch for the missing bits on top of the current
splice branch, that includes the patch I sent which is a rebased version
of yours.

-- 
Jens Axboe
Thread overview: 13+ messages
2008-10-19 14:00 [PATCH] Add block device specific splice write method Dmitri Monakhov
2008-10-20 17:49 ` Jens Axboe
2008-10-20 18:11   ` Jens Axboe
2008-10-20 18:42     ` Dmitri Monakhov
2008-10-23  5:39     ` Andrew Morton
2008-10-23  6:29       ` Jens Axboe
2008-10-23  6:41         ` Andrew Morton
2008-10-23  6:51           ` Jens Axboe
2008-10-23  7:03             ` Andrew Morton
2008-10-23  7:16               ` Jens Axboe
2008-10-23  8:41       ` Dmitri Monakhov
2008-10-20 18:29   ` Dmitri Monakhov
2008-10-20 18:33     ` Jens Axboe