* disk IO directly from PCI memory to block device sectors
@ 2008-09-26 7:29 marty
2008-09-26 8:03 ` Jens Axboe
2008-09-26 8:46 ` Alan Cox
0 siblings, 2 replies; 12+ messages in thread
From: marty @ 2008-09-26 7:29 UTC (permalink / raw)
To: linux-kernel; +Cc: martin.leisner
We have a large RAM area on a PCI board (think of a custom framebuffer-type
application). We're using 2.6.20.
We have the PCI RAM mapped into kernel space, and know the physical addresses.
We have a raw partition on the block device which we reserve for this.
We want to be able to stick the contents of a selected portion of PCI RAM
onto a block device (disk). Past incarnations modified the disk driver and
developed a special API: the custom driver constructed scatter/gather lists
and fed them to the driver, bypassing the elevator algorithm so they executed
as the "next request".
What I'm looking for is a more generic, driver-independent way of sticking
the contents of PCI RAM onto a disk.
Is offset + length of each bio_vec < pagesize?
What's the best way to do this? (Much of my data is already in physically
contiguous memory, and mapped into virtual memory.)
Any good examples to look at?
marty
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: disk IO directly from PCI memory to block device sectors
  2008-09-26  7:29 disk IO directly from PCI memory to block device sectors marty
@ 2008-09-26  8:03 ` Jens Axboe
  2008-09-26  8:46 ` Alan Cox
  1 sibling, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2008-09-26 8:03 UTC (permalink / raw)
To: marty; +Cc: linux-kernel, martin.leisner

On Fri, Sep 26 2008, marty wrote:
> We have a large RAM area on a PCI board (think of a custom
> framebuffer-type application). We're using 2.6.20.
>
> We have the PCI RAM mapped into kernel space, and know the physical
> addresses.
>
> We have a raw partition on the block device which we reserve for this.
>
> We want to be able to stick the contents of a selected portion of PCI
> RAM onto a block device (disk). Past incarnations modified the disk
> driver and developed a special API: the custom driver constructed
> scatter/gather lists and fed them to the driver, bypassing the
> elevator algorithm so they executed as the "next request".
>
> What I'm looking for is a more generic, driver-independent way of
> sticking the contents of PCI RAM onto a disk.
>
> Is offset + length of each bio_vec < pagesize?

Yes, each vec is no more than a page.

> What's the best way to do this? (Much of my data is already in
> physically contiguous memory, and mapped into virtual memory.)

You don't need it mapped into virtual memory. Whether the data is
contig or not does not matter, the block layer will handle it either
way.

> Any good examples to look at?

Apart from where you get your memory from, you can easily use the
generic infrastructure for this. Something ala:

void my_end_io_function(struct bio *bio, int err)
{
	/*
	 * whatever you need to do here, once you get this call IO is
	 * done for that bio. put bio at the end to free it again.
	 */
	...
	bio_put(bio);
}

static void write_my_data(struct block_device *bdev, sector_t sector,
			  unsigned int bytes)
{
	struct request_queue *q;
	struct bio *bio = NULL;
	struct page *page;
	unsigned int offset, length;

	q = bdev_get_queue(bdev);
	offset = first_page_offset;

	while (bytes) {
		if (!bio) {
			unsigned int npages = (bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;

			bio = bio_alloc(GFP_KERNEL, npages);
			bio->bi_sector = sector;
			bio->bi_bdev = bdev;
			bio->bi_end_io = my_end_io_function; /* called on io end */
			bio->bi_private = some_data; /* if my_end_io_function wants that */
		}

		page = some_func_to_return_you_a_page_in_the_pci_mem(sector);
		length = bytes;
		if (length > PAGE_SIZE)
			length = PAGE_SIZE;

		/*
		 * if this fails, we can't map more at this offset. send
		 * what we have, force a new bio alloc at the top of the
		 * loop, and retry this page
		 */
		if (!bio_add_page(bio, page, length, offset)) {
			submit_bio(WRITE, bio);
			bio = NULL;
			continue;
		}

		bytes -= length;
		sector += length >> 9;
		offset = 0;
	}

	/* send off the final, partially filled bio */
	if (bio)
		submit_bio(WRITE, bio);
}

totally untested, just typed into this email. So probably full of
typos, but you should get the idea.

-- 
Jens Axboe
* Re: disk IO directly from PCI memory to block device sectors
  2008-09-26  7:29 disk IO directly from PCI memory to block device sectors marty
  2008-09-26  8:03 ` Jens Axboe
@ 2008-09-26  8:46 ` Alan Cox
  2008-09-26  9:11 ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Alan Cox @ 2008-09-26 8:46 UTC (permalink / raw)
To: marty; +Cc: linux-kernel, martin.leisner

> What I'm looking for is a more generic, driver-independent way of
> sticking the contents of PCI RAM onto a disk.

Ermm, seriously: why not have a userspace task with the PCI RAM mmapped
and just use write() like normal sane people do?

Alan
* Re: disk IO directly from PCI memory to block device sectors
  2008-09-26  8:46 ` Alan Cox
@ 2008-09-26  9:11 ` Jens Axboe
  2008-09-26 10:06 ` Alan Cox
  2008-09-26 15:51 ` Leisner, Martin
  0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2008-09-26 9:11 UTC (permalink / raw)
To: Alan Cox; +Cc: marty, linux-kernel, martin.leisner

On Fri, Sep 26 2008, Alan Cox wrote:
> > What I'm looking for is a more generic, driver-independent way of
> > sticking the contents of PCI RAM onto a disk.
>
> Ermm, seriously: why not have a userspace task with the PCI RAM mmapped
> and just use write() like normal sane people do?

To avoid the fault and copy, I would assume.

-- 
Jens Axboe
* Re: disk IO directly from PCI memory to block device sectors
  2008-09-26  9:11 ` Jens Axboe
@ 2008-09-26 10:06 ` Alan Cox
  2008-09-26 10:19 ` Jens Axboe
  2008-09-26 15:51 ` Leisner, Martin
  1 sibling, 1 reply; 12+ messages in thread
From: Alan Cox @ 2008-09-26 10:06 UTC (permalink / raw)
To: Jens Axboe; +Cc: marty, linux-kernel, martin.leisner

On Fri, 26 Sep 2008 11:11:35 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Sep 26 2008, Alan Cox wrote:
> > > What I'm looking for is a more generic, driver-independent way of
> > > sticking the contents of PCI RAM onto a disk.
> >
> > Ermm, seriously: why not have a userspace task with the PCI RAM mmapped
> > and just use write() like normal sane people do?
>
> To avoid the fault and copy, I would assume.

It's a write to a raw partition, so with O_DIRECT you won't have to
copy, and MAP_POPULATE will premap the object so that even the first
write occurs without faulting overhead.

Alan
* Re: disk IO directly from PCI memory to block device sectors
  2008-09-26 10:06 ` Alan Cox
@ 2008-09-26 10:19 ` Jens Axboe
  2008-09-26 11:34 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2008-09-26 10:19 UTC (permalink / raw)
To: Alan Cox; +Cc: marty, linux-kernel, martin.leisner

On Fri, Sep 26 2008, Alan Cox wrote:
> On Fri, 26 Sep 2008 11:11:35 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > On Fri, Sep 26 2008, Alan Cox wrote:
> > > > What I'm looking for is a more generic, driver-independent way of
> > > > sticking the contents of PCI RAM onto a disk.
> > >
> > > Ermm, seriously: why not have a userspace task with the PCI RAM mmapped
> > > and just use write() like normal sane people do?
> >
> > To avoid the fault and copy, I would assume.
>
> It's a write to a raw partition, so with O_DIRECT you won't have to
> copy, and MAP_POPULATE will premap the object so that even the first
> write occurs without faulting overhead.

You are still going through get_user_pages() for each write. As I would
imagine the writes would generally be large, the hit would not be too
bad (but it's still there). Depending on the hardware, it may or may
not be a big deal. But the path from device to disk is definitely a lot
bigger and more complex with the mmap/write approach.

Another alternative would be using splice - if the PCI device exposed a
char device node, you could support ->splice_read() there which would
just fill the pages into the pipe buffer. Then change the block device
fops ->splice_write() to go direct to the block device through a bio
instead of using the page cache based generic_file_splice_write(). Such
a change would actually make sense to do, if the block device has been
opened with O_DIRECT. And it would get you about the same performance
as doing it in-kernel; the only extra overhead would be two syscalls
per 64k (well, probably only one extra syscall, since you probably need
an ioctl/syscall to initiate the in-kernel activity as well). So just
about as free as you could get.

-- 
Jens Axboe
* Re: disk IO directly from PCI memory to block device sectors
  2008-09-26 10:19 ` Jens Axboe
@ 2008-09-26 11:34 ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2008-09-26 11:34 UTC (permalink / raw)
To: Alan Cox; +Cc: marty, linux-kernel, martin.leisner

On Fri, Sep 26 2008, Jens Axboe wrote:
> Another alternative would be using splice - if the pci device exposed a
> char device node, you could support ->splice_read() there which would
> just fill the pages into the pipe buffer. Then change the block device
> fops ->splice_write() to go direct to the block device through a bio
> instead of using the page cache based generic_file_splice_write(). Such
> a change would actually make sense to do, if the block device has been
> opened with O_DIRECT. And it would get you about the same performance as
> doing it in-kernel, the only extra overhead would be two syscalls per
> 64k (well probably only one extra syscall, since you probably need an
> ioctl/syscall to initiate the in-kernel activity as well). So just about
> as free as you could get.

Something like this, totally untested but should get the point across.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 57e2786..fd06032 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -24,6 +24,7 @@
 #include <linux/uio.h>
 #include <linux/namei.h>
 #include <linux/log2.h>
+#include <linux/splice.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -1224,6 +1225,77 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(file->f_mapping->host, file, cmd, arg);
 }
 
+static void block_splice_end_io(struct bio *bio, int err)
+{
+	bio_put(bio);
+}
+
+static int pipe_to_disk(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			struct splice_desc *sd)
+{
+	struct block_device *bdev = I_BDEV(sd->u.file->f_mapping->host);
+	struct bio *bio;
+	int ret, bs;
+
+	bs = queue_hardsect_size(bdev_get_queue(bdev));
+	if (sd->pos & (bs - 1))
+		return -EINVAL;
+
+	ret = buf->ops->confirm(pipe, buf);
+	if (unlikely(ret))
+		return ret;
+
+	bio = bio_alloc(GFP_KERNEL, 1);
+	bio->bi_sector = sd->pos / bs;
+	bio->bi_bdev = bdev;
+	bio->bi_end_io = block_splice_end_io;
+
+	bio_add_page(bio, buf->page, buf->len, buf->offset);
+
+	submit_bio(WRITE, bio);
+	return buf->len;
+}
+
+/*
+ * Splice to file opened with O_DIRECT. Bypass caching completely and
+ * just go direct-to-bio
+ */
+static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
+				    struct file *out, loff_t *ppos, size_t len,
+				    unsigned int flags)
+{
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = *ppos,
+		.u.file = out,
+	};
+	struct inode *inode = out->f_mapping->host;
+	ssize_t ret;
+
+	if (unlikely(*ppos & 511))
+		return -EINVAL;
+
+	inode_double_lock(inode, pipe->inode);
+	ret = __splice_from_pipe(pipe, &sd, pipe_to_disk);
+	inode_double_unlock(inode, pipe->inode);
+
+	if (ret > 0)
+		*ppos += ret;
+
+	return ret;
+}
+
+static ssize_t block_splice_write(struct pipe_inode_info *pipe,
+				  struct file *out, loff_t *ppos, size_t len,
+				  unsigned int flags)
+{
+	if (out->f_flags & O_DIRECT)
+		return __block_splice_write(pipe, out, ppos, len, flags);
+
+	return generic_file_splice_write(pipe, out, ppos, len, flags);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
@@ -1249,7 +1321,7 @@ const struct file_operations def_blk_fops = {
 	.compat_ioctl	= compat_blkdev_ioctl,
 #endif
 	.splice_read	= generic_file_splice_read,
-	.splice_write	= generic_file_splice_write,
+	.splice_write	= block_splice_write,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)

-- 
Jens Axboe
* RE: disk IO directly from PCI memory to block device sectors
  2008-09-26  9:11 ` Jens Axboe
  2008-09-26 10:06 ` Alan Cox
@ 2008-09-26 15:51 ` Leisner, Martin
  2008-09-29 13:02 ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Leisner, Martin @ 2008-09-26 15:51 UTC (permalink / raw)
To: Jens Axboe, Alan Cox; +Cc: marty, linux-kernel

> -----Original Message-----
> From: Jens Axboe [mailto:jens.axboe@oracle.com]
> Sent: Friday, September 26, 2008 5:12 AM
> To: Alan Cox
> Cc: marty; linux-kernel@vger.kernel.org; Leisner, Martin
> Subject: Re: disk IO directly from PCI memory to block device sectors
>
> On Fri, Sep 26 2008, Alan Cox wrote:
> > > What I'm looking for is a more generic, driver-independent way of
> > > sticking the contents of PCI RAM onto a disk.
> >
> > Ermm, seriously: why not have a userspace task with the PCI RAM mmapped
> > and just use write() like normal sane people do?
>
> To avoid the fault and copy, I would assume.
>
> --
> Jens Axboe

Also:
a) to deal with interrupts from the hardware
b) using legacy code/design/architecture

The splice approaches look very interesting...thanks...

marty
* Re: disk IO directly from PCI memory to block device sectors
  2008-09-26 15:51 ` Leisner, Martin
@ 2008-09-29 13:02 ` Jens Axboe
  2008-10-01 19:05 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2008-09-29 13:02 UTC (permalink / raw)
To: Leisner, Martin; +Cc: Alan Cox, marty, linux-kernel

On Fri, Sep 26 2008, Leisner, Martin wrote:
> Also:
> a) to deal with interrupts from the hardware
> b) using legacy code/design/architecture
>
> The splice approaches look very interesting...thanks...

Just for kicks, I did the read part of the fast bdev interface as well.
As with the write, it's totally untested (apart from compiled). Just in
case anyone is curious... I plan to do a bit of testing on this this
week.

IMHO, this interface totally rocks. It's really async like splice was
intended, and it's fast too. I may have to look into some generic IO
mechanism to unify them all, O_DIRECT/page cache/splice. Famous last
words, I'm sure.
diff --git a/fs/block_dev.c b/fs/block_dev.c index aff5421..f8df781 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -24,6 +24,7 @@ #include <linux/uio.h> #include <linux/namei.h> #include <linux/log2.h> +#include <linux/splice.h> #include <asm/uaccess.h> #include "internal.h" @@ -1155,6 +1156,264 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) return blkdev_ioctl(file->f_mapping->host, file, cmd, arg); } +static void block_splice_write_end_io(struct bio *bio, int err) +{ + bio_put(bio); +} + +static int pipe_to_disk(struct pipe_inode_info *pipe, struct pipe_buffer *buf, + struct splice_desc *sd) +{ + struct block_device *bdev = I_BDEV(sd->u.file->f_mapping->host); + struct bio *bio; + int ret, bs; + + bs = queue_hardsect_size(bdev_get_queue(bdev)); + if (sd->pos & (bs - 1)) + return -EINVAL; + + ret = buf->ops->confirm(pipe, buf); + if (unlikely(ret)) + return ret; + + bio = bio_alloc(GFP_KERNEL, 1); + bio->bi_sector = sd->pos / bs; + bio->bi_bdev = bdev; + bio->bi_end_io = block_splice_write_end_io; + + bio_add_page(bio, buf->page, buf->len, buf->offset); + + submit_bio(WRITE, bio); + return buf->len; +} + +/* + * Splice to file opened with O_DIRECT. 
Bypass caching completely and + * just go direct-to-bio + */ +static ssize_t __block_splice_write(struct pipe_inode_info *pipe, + struct file *out, loff_t *ppos, size_t len, + unsigned int flags) +{ + struct splice_desc sd = { + .total_len = len, + .flags = flags, + .pos = *ppos, + .u.file = out, + }; + struct inode *inode = out->f_mapping->host; + ssize_t ret; + + if (unlikely(*ppos & 511)) + return -EINVAL; + + inode_double_lock(inode, pipe->inode); + ret = __splice_from_pipe(pipe, &sd, pipe_to_disk); + inode_double_unlock(inode, pipe->inode); + + if (ret > 0) + *ppos += ret; + + return ret; +} + +static ssize_t block_splice_write(struct pipe_inode_info *pipe, + struct file *out, loff_t *ppos, size_t len, + unsigned int flags) +{ + if (out->f_flags & O_DIRECT) + return __block_splice_write(pipe, out, ppos, len, flags); + + return generic_file_splice_write(pipe, out, ppos, len, flags); +} + +static void block_pipe_buf_release(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + struct bio *bio; + + bio = (struct bio *) buf->private; + if (bio) + bio_put(bio); + + __free_pages(buf->page, 0); +} + +/* + * Wait for IO to be done on the bio that this buf belongs to + */ +static int block_pipe_buf_confirm(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + struct bio *bio = (struct bio *) buf->private; + struct completion *comp = bio->bi_private; + + wait_for_completion(comp); + return 0; +} + +static const struct pipe_buf_operations block_pipe_buf_ops = { + .can_merge = 0, + .map = generic_pipe_buf_map, + .unmap = generic_pipe_buf_unmap, + .confirm = block_pipe_buf_confirm, + .release = block_pipe_buf_release, + .steal = generic_pipe_buf_steal, + .get = generic_pipe_buf_get, +}; + +static void block_release_page(struct splice_pipe_desc *spd, unsigned int i) +{ + struct bio *bio; + + bio = (struct bio *) spd->partial[i].private; + if (bio) + bio_put(bio); + + __free_pages(spd->pages[i], 0); +} + +/* + * READ end io handling completes the bio, so that 
we can wakeup + * anyone waiting in ->confirm(). + */ +static void block_splice_read_end_io(struct bio *bio, int err) +{ + struct completion *comp = bio->bi_private; + + complete(comp); + bio_put(bio); +} + +static void block_splice_bio_destructor(struct bio *bio) +{ + kfree(bio->bi_private); + bio_free(bio, fs_bio_set); +} + +/* + * Bypass the page cache and allocate pages for IO directly + */ +static ssize_t __block_splice_read(struct pipe_inode_info *pipe, + struct file *in, loff_t *ppos, size_t len, + unsigned int flags) +{ + struct page *pages[PIPE_BUFFERS]; + struct partial_page partial[PIPE_BUFFERS]; + struct splice_pipe_desc spd = { + .pages = pages, + .partial = partial, + .flags = flags, + .ops = &block_pipe_buf_ops, + .spd_release = block_release_page, + }; + struct inode *inode = in->f_mapping->host; + struct block_device *bdev = I_BDEV(inode); + struct bio *bio; + sector_t sector; + loff_t isize, left; + int bs, err; + + /* + * First to alignment and length sanity checks + */ + bs = queue_hardsect_size(bdev_get_queue(bdev)); + if (*ppos & (bs - 1)) + return -EINVAL; + + isize = i_size_read(inode); + if (unlikely(*ppos >= isize)) + return 0; + + left = isize - *ppos; + if (unlikely(left < len)) + len = left; + + err = 0; + spd.nr_pages = 0; + sector = *ppos / bs; + bio = NULL; + while (len) { + struct completion *comp; + unsigned int this_len; + struct page *page; + + this_len = len; + if (this_len > PAGE_SIZE) + this_len = PAGE_SIZE; + + page = alloc_page(GFP_KERNEL); + if (!page) { + err = -ENOMEM; + break; + } + + if (!bio) { +alloc_new_bio: + comp = kmalloc(sizeof(*comp), GFP_KERNEL); + if (!comp) { + err = -ENOMEM; + break; + } + + init_completion(comp); + + bio = bio_alloc(GFP_KERNEL, (len + PAGE_SIZE - 1) / PAGE_SIZE); + bio->bi_sector = sector; + bio->bi_bdev = bdev; + bio->bi_private = comp; + bio->bi_end_io = block_splice_read_end_io; + + /* + * Not too nice... 
+ */ + bio->bi_destructor = block_splice_bio_destructor; + } + + /* + * if we fail adding page, then submit this bio and get + * a new one + */ + if (bio_add_page(bio, page, this_len, 0) != this_len) { + submit_bio(READ, bio); + bio = NULL; + goto alloc_new_bio; + } + + /* + * The pipe buffer needs to hang on to the bio, so that we + * can reuse it in the ->confirm() part of the pipe ops + */ + bio_get(bio); + + sector += (this_len / bs); + len -= this_len; + partial[spd.nr_pages].offset = 0; + partial[spd.nr_pages].len = this_len; + partial[spd.nr_pages].private = (unsigned long) bio; + pages[spd.nr_pages] = page; + spd.nr_pages++; + } + + if (bio) + submit_bio(READ, bio); + + if (spd.nr_pages) + return splice_to_pipe(pipe, &spd); + + return err; +} + +static ssize_t block_splice_read(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + if (in->f_flags & O_DIRECT) + return __block_splice_read(pipe, in, ppos, len, flags); + + return generic_file_splice_read(in, ppos, pipe, len, flags); +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .writepage = blkdev_writepage, @@ -1179,8 +1438,8 @@ const struct file_operations def_blk_fops = { #ifdef CONFIG_COMPAT .compat_ioctl = compat_blkdev_ioctl, #endif - .splice_read = generic_file_splice_read, - .splice_write = generic_file_splice_write, + .splice_read = block_splice_read, + .splice_write = block_splice_write, }; int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg) -- Jens Axboe ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: disk IO directly from PCI memory to block device sectors
  2008-09-29 13:02 ` Jens Axboe
@ 2008-10-01 19:05 ` Jens Axboe
  2008-10-02 16:15 ` Leon Woestenberg
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2008-10-01 19:05 UTC (permalink / raw)
To: Leisner, Martin; +Cc: Alan Cox, marty, linux-kernel

On Mon, Sep 29 2008, Jens Axboe wrote:
> Just for kicks, I did the read part of the fast bdev interface as well.
> As with the write, it's totally untested (apart from compiled). Just in
> case anyone is curious... I plan to do a bit of testing on this this
> week.
>
> IMHO, this interface totally rocks. It's really async like splice was
> intended, and it's fast too. I may have to look into some generic IO
> mechanism to unify them all, O_DIRECT/page cache/splice. Famous last
> words, I'm sure.

Alright, so this one actually works :-)

Apart from fixing the bugs in it, it's also more clever in using the bio
for the write part.
It'll reuse the same bio in the splice actor until it's full, only then submitting it and allocating a new one. The read part works the same way. diff --git a/fs/block_dev.c b/fs/block_dev.c index aff5421..1e807a3 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -24,6 +24,7 @@ #include <linux/uio.h> #include <linux/namei.h> #include <linux/log2.h> +#include <linux/splice.h> #include <asm/uaccess.h> #include "internal.h" @@ -1155,6 +1156,346 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) return blkdev_ioctl(file->f_mapping->host, file, cmd, arg); } +static void block_splice_write_end_io(struct bio *bio, int err) +{ + bio_put(bio); +} + +/* + * No need going above PIPE_BUFFERS, as we cannot fill that anyway + */ +static inline unsigned len_to_max_pages(unsigned int len) +{ + unsigned pages = (len + PAGE_SIZE - 1) / PAGE_SIZE; + + return min_t(unsigned, pages, PIPE_BUFFERS); +} + +/* + * A bit of state data, to allow us to make larger bios + */ +struct block_splice_data { + struct file *file; + struct bio *bio; +}; + +static int pipe_to_disk(struct pipe_inode_info *pipe, struct pipe_buffer *buf, + struct splice_desc *sd) +{ + struct block_splice_data *bsd = sd->u.data; + struct block_device *bdev = I_BDEV(bsd->file->f_mapping->host); + unsigned int mask; + struct bio *bio; + int ret; + + mask = queue_hardsect_size(bdev_get_queue(bdev)) - 1; + if ((sd->pos & mask) || (buf->len & mask) || (buf->offset & mask)) + return -EINVAL; + + ret = buf->ops->confirm(pipe, buf); + if (unlikely(ret)) + return ret; + + bio = bsd->bio; + if (!bio) { +new_bio: + bio = bio_alloc(GFP_KERNEL, len_to_max_pages(sd->total_len)); + bio->bi_sector = sd->pos; + do_div(bio->bi_sector, mask + 1); + bio->bi_bdev = bdev; + bio->bi_end_io = block_splice_write_end_io; + bsd->bio = bio; + } + + if (bio_add_page(bio, buf->page, buf->len, buf->offset) != buf->len) { + submit_bio(WRITE, bio); + bsd->bio = NULL; + goto new_bio; + } + + return buf->len; +} + +/* + * 
Splice to file opened with O_DIRECT. Bypass caching completely and + * just go direct-to-bio + */ +static ssize_t __block_splice_write(struct pipe_inode_info *pipe, + struct file *out, loff_t *ppos, size_t len, + unsigned int flags) +{ + struct block_splice_data bsd; + struct splice_desc sd = { + .total_len = len, + .flags = flags, + .pos = *ppos, + .u.data = &bsd, + }; + struct inode *inode = out->f_mapping->host; + ssize_t ret; + + if (unlikely(*ppos & 511)) + return -EINVAL; + + bsd.file = out; + bsd.bio = NULL; + + inode_double_lock(inode, pipe->inode); + ret = __splice_from_pipe(pipe, &sd, pipe_to_disk); + inode_double_unlock(inode, pipe->inode); + + /* + * submit a potential in-progress bio + */ + if (bsd.bio) + submit_bio(WRITE, bsd.bio); + + return ret; +} + +static ssize_t block_splice_write(struct pipe_inode_info *pipe, + struct file *out, loff_t *ppos, size_t len, + unsigned int flags) +{ + ssize_t ret; + + if (out->f_flags & O_DIRECT) { + ret = __block_splice_write(pipe, out, ppos, len, flags); + if (ret > 0) + *ppos += ret; + } else + ret = generic_file_splice_write(pipe, out, ppos, len, flags); + + return ret; +} + +/* + * Free the pipe page and put the pipe_buffer reference to the bio + */ +static void block_drop_buf_ref(struct page *page, unsigned long data) +{ + struct bio *bio; + + bio = (struct bio *) data; + if (bio) { + struct completion *comp; + + comp = bio->bi_private; + if (comp) + wait_for_completion(comp); + + bio_put(bio); + } + + __free_page(page); +} + +static void block_pipe_buf_release(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + block_drop_buf_ref(buf->page, buf->private); +} + +/* + * Wait for IO to be done on the bio that this buf belongs to + */ +static int block_pipe_buf_confirm(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + struct bio *bio = (struct bio *) buf->private; + struct completion *comp = bio->bi_private; + + wait_for_completion(comp); + return 0; +} + +static void 
block_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) +{ + struct bio *bio; + + bio = (struct bio *) buf->private; + if (bio) + bio_get(bio); + + get_page(buf->page); +} + +static const struct pipe_buf_operations block_pipe_buf_ops = { + .can_merge = 0, + .map = generic_pipe_buf_map, + .unmap = generic_pipe_buf_unmap, + .confirm = block_pipe_buf_confirm, + .release = block_pipe_buf_release, + .steal = generic_pipe_buf_steal, + .get = block_pipe_buf_get, +}; + +/* + * Free the pipe page and put the pipe_buffer reference to the bio + */ +static void block_release_page(struct splice_pipe_desc *spd, unsigned int i) +{ + block_drop_buf_ref(spd->pages[i], spd->partial[i].private); +} + +/* + * READ end io handling completes the bio, so that we can wakeup + * anyone waiting in ->confirm(). + */ +static void block_splice_read_end_io(struct bio *bio, int err) +{ + struct completion *comp = bio->bi_private; + + /* + * IO done, so complete to wake up potential waiters. Put our + * allocation reference to the bio + */ + complete_all(comp); + bio_put(bio); +} + +/* + * Overload the default destructor, so we can safely free our completion too + */ +static void block_splice_bio_destructor(struct bio *bio) +{ + kfree(bio->bi_private); + bio_free(bio, fs_bio_set); +} + +/* + * Bypass the page cache and allocate pages for IO directly + */ +static ssize_t __block_splice_read(struct pipe_inode_info *pipe, + struct file *in, loff_t *ppos, size_t len, + unsigned int flags) +{ + struct page *pages[PIPE_BUFFERS]; + struct partial_page partial[PIPE_BUFFERS]; + struct splice_pipe_desc spd = { + .pages = pages, + .partial = partial, + .nr_pages = 0, + .flags = flags, + .ops = &block_pipe_buf_ops, + .spd_release = block_release_page, + }; + struct inode *inode = in->f_mapping->host; + struct block_device *bdev = I_BDEV(inode); + struct bio *bio; + sector_t sector; + loff_t isize, left; + int bs, err; + + /* + * First to alignment and length sanity checks + */ + bs = 
queue_hardsect_size(bdev_get_queue(bdev)); + if (*ppos & (bs - 1)) + return -EINVAL; + + isize = i_size_read(inode); + if (unlikely(*ppos >= isize)) + return 0; + + left = isize - *ppos; + if (unlikely(left < len)) + len = left; + + err = 0; + sector = *ppos; + do_div(sector, bs); + bio = NULL; + while (len && spd.nr_pages < PIPE_BUFFERS) { + unsigned int this_len = min_t(unsigned int, len, PAGE_SIZE); + struct completion *comp; + struct page *page; + + page = alloc_page(GFP_KERNEL); + if (!page) { + err = -ENOMEM; + break; + } + + if (!bio) { +alloc_new_bio: + comp = kmalloc(sizeof(*comp), GFP_KERNEL); + if (!comp) { + __free_page(page); + err = -ENOMEM; + break; + } + + init_completion(comp); + + bio = bio_alloc(GFP_KERNEL, len_to_max_pages(len)); + bio->bi_sector = sector; + bio->bi_bdev = bdev; + bio->bi_private = comp; + bio->bi_end_io = block_splice_read_end_io; + + /* + * Not too nice... + */ + bio->bi_destructor = block_splice_bio_destructor; + } + + /* + * if we fail adding the page, then submit this bio and go + * fetch a new one + */ + if (bio_add_page(bio, page, this_len, 0) != this_len) { + submit_bio(READ, bio); + bio = NULL; + goto alloc_new_bio; + } + + /* + * The pipe buffer needs to hang on to the bio, so that we + * can reuse it in the ->confirm() part of the pipe ops + */ + bio_get(bio); + + sector += (this_len / bs); + len -= this_len; + partial[spd.nr_pages].offset = 0; + partial[spd.nr_pages].len = this_len; + partial[spd.nr_pages].private = (unsigned long) bio; + pages[spd.nr_pages] = page; + spd.nr_pages++; + } + + /* + * submit the current bio, if any + */ + if (bio) + submit_bio(READ, bio); + + /* + * if we succeeded in adding some pages, fill them into the pipe + */ + if (spd.nr_pages) + return splice_to_pipe(pipe, &spd); + + return err; +} + +static ssize_t block_splice_read(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + ssize_t ret; + + if (in->f_flags & O_DIRECT) { + ret = 
__block_splice_read(pipe, in, ppos, len, flags); + if (ret > 0) + *ppos += ret; + } else + ret = generic_file_splice_read(in, ppos, pipe, len, flags); + + return ret; +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .writepage = blkdev_writepage, @@ -1179,8 +1520,8 @@ const struct file_operations def_blk_fops = { #ifdef CONFIG_COMPAT .compat_ioctl = compat_blkdev_ioctl, #endif - .splice_read = generic_file_splice_read, - .splice_write = generic_file_splice_write, + .splice_read = block_splice_read, + .splice_write = block_splice_write, }; int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg) -- Jens Axboe ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: disk IO directly from PCI memory to block device sectors
  2008-10-01 19:05             ` Jens Axboe
@ 2008-10-02 16:15               ` Leon Woestenberg
  2008-10-02 16:32                 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread

From: Leon Woestenberg @ 2008-10-02 16:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Leisner, Martin, Alan Cox, marty, linux-kernel

Hello Jens,

On Wed, Oct 1, 2008 at 9:05 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Mon, Sep 29 2008, Jens Axboe wrote:
>> On Fri, Sep 26 2008, Leisner, Martin wrote:
>> IMHO, this interface totally rocks. It's really async like splice was
>
> Alright, so this one actually works :-)
> Apart from fixing the bugs in it, it's also more clever in using the bio
> for the write part. It'll reuse the same bio in the splice actor until
> it's full, only then submitting it and allocating a new one. The read
> part works the same way.
>
I have been following this thread trying to grasp a very nifty use case
(high speed acquisition and storage of data) of splice.

I think it would make a perfect example of splice functionality.

What would the user space part look like to exercise this interface?

And whoever writes Linux Device Drivers 4th edition or one of the
kernel books: make sure this topic is in :-)

Regards,
-- 
Leon

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: disk IO directly from PCI memory to block device sectors
  2008-10-02 16:15               ` Leon Woestenberg
@ 2008-10-02 16:32                 ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread

From: Jens Axboe @ 2008-10-02 16:32 UTC (permalink / raw)
  To: Leon Woestenberg; +Cc: Leisner, Martin, Alan Cox, marty, linux-kernel

On Thu, Oct 02 2008, Leon Woestenberg wrote:
> Hello Jens,
>
> On Wed, Oct 1, 2008 at 9:05 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Mon, Sep 29 2008, Jens Axboe wrote:
> >> On Fri, Sep 26 2008, Leisner, Martin wrote:
> >> IMHO, this interface totally rocks. It's really async like splice was
> >
> > Alright, so this one actually works :-)
> > Apart from fixing the bugs in it, it's also more clever in using the bio
> > for the write part. It'll reuse the same bio in the splice actor until
> > it's full, only then submitting it and allocating a new one. The read
> > part works the same way.
> >
> I have been following this thread trying to grasp a very nifty use
> case (high speed acquisition and storage of data) of splice.
>
> I think it would make a perfect example of splice functionality.
>
> What would the user space part look like to exercise this interface?

Download:

http://brick.kernel.dk/snaps/splice-git-latest.tar.gz

which has lots of little examples for splice. You would want to do
something ala

# splice-in /dev/my-pci-device | splice-out /dev/sda

in one app of course, but take a look at the examples and get a feel
for the interface...

BTW, in my splice branch I have this queued as well. Not going anywhere
for now, but should get updated and tested every now and then.

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/splice

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-10-02 16:32 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-26  7:29 disk IO directly from PCI memory to block device sectors marty
2008-09-26  8:03 ` Jens Axboe
2008-09-26  8:46 ` Alan Cox
2008-09-26  9:11   ` Jens Axboe
2008-09-26 10:06     ` Alan Cox
2008-09-26 10:19       ` Jens Axboe
2008-09-26 11:34         ` Jens Axboe
2008-09-26 15:51           ` Leisner, Martin
2008-09-29 13:02             ` Jens Axboe
2008-10-01 19:05               ` Jens Axboe
2008-10-02 16:15                 ` Leon Woestenberg
2008-10-02 16:32                   ` Jens Axboe