* zero-copy recv ?
@ 2008-06-19 14:11 Peter T. Breuer
2008-06-19 15:02 ` Frederik Deweerdt
0 siblings, 1 reply; 5+ messages in thread
From: Peter T. Breuer @ 2008-06-19 14:11 UTC (permalink / raw)
To: linux kernel
G'day all
I've been mmapping the request's bio buffers received on my block device
driver for userspace to use directly as tcp send/receive buffers.
However ...
1) this works fantastically for (zero-copy) tcp send (i.e.
from the address in user space that my mmap trick provides for the
request buffers),
2) tcp recv hangs.
What's going on? I'd be grateful for any clues as to how to fix this as
it's tcp zero-copy on recv when it goes OK!
What does tcp socket recv need exactly by way of an mmapped buffer?
Is there some set of flags that needs to be set on the pages that make
up the mmap?
Recv() hangs somewhere inside the tcp recv call inside kernel paths that
I cannot trace. I see "recv(5, ...)" via strace. The data is sent out
on the wire from the other side and apparently comes in, but the socket
tcp recv never progresses.
If, OTOH, I have previously written the device at that point, then
reading the device causes a request to appear at the device driver
with pages carrying the flags
40001826
(referenced|error|lru|private|writeback)
and all works fantastically in the sense that recv called with those
pages mmapped into userspace as the recv buffer works just like it
should.
When the device has not been written at that point previously, then
I see pages appearing in the request bio buffers with pretty
random-looking flags, such as
40020801
(locked|private|readahead)
and recv does its hang trick.
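For anyone wanting to check the two decodes above, here is a small user-space sketch that turns a raw flags word into names. The bit positions in the table are an assumption: they match the two decodes quoted in this message and the page-flags layout of 2.6.2x-era kernels as far as I know, but they differ between kernel versions, so check include/linux/page-flags.h for the tree in use. The names `flag_names` and `decode_page_flags` are made up for the illustration. Bits not in the table (for example bit 30, set in both values) are simply left undecoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: bit positions for page->flags on a 2.6.2x-era kernel.
 * They are consistent with the two decodes quoted in the message,
 * but the layout is kernel-version specific. */
struct flag_name {
    unsigned bit;
    const char *name;
};

static const struct flag_name flag_names[] = {
    { 0, "locked" },   { 1, "error" },      { 2, "referenced" },
    { 3, "uptodate" }, { 4, "dirty" },      { 5, "lru" },
    { 6, "active" },   { 7, "slab" },       { 11, "private" },
    { 12, "writeback" }, { 17, "readahead" },
};

/* Write a '|'-separated list of the known flag names set in `flags`
 * into `out`, in ascending bit order. Unknown bits are ignored. */
static void decode_page_flags(unsigned long flags, char *out, size_t outlen)
{
    size_t i;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (i = 0; i < sizeof flag_names / sizeof flag_names[0]; i++) {
        if (!(flags & (1UL << flag_names[i].bit)))
            continue;
        if (out[0] != '\0')
            strncat(out, "|", outlen - strlen(out) - 1);
        strncat(out, flag_names[i].name, outlen - strlen(out) - 1);
    }
}
```

Called with 0x40001826 and 0x40020801 and a buffer of 128 bytes or so, this yields the same sets of names as listed above, ordered by bit number rather than in the order given in the message.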
Does anyone have any insight as to what is going on?
Thanks
Peter
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: zero-copy recv ?
2008-06-19 14:11 zero-copy recv ? Peter T. Breuer
@ 2008-06-19 15:02 ` Frederik Deweerdt
2008-06-19 17:01 ` Peter T. Breuer
0 siblings, 1 reply; 5+ messages in thread
From: Frederik Deweerdt @ 2008-06-19 15:02 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux kernel

On Thu, Jun 19, 2008 at 04:11:14PM +0200, Peter T. Breuer wrote:
> G'day all
>
> I've been mmapping the request's bio buffers received on my block device
> driver for userspace to use directly as tcp send/receive buffers.

Is the code available somewhere by any chance?

Regards,
Frederik

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: zero-copy recv ?
2008-06-19 15:02 ` Frederik Deweerdt
@ 2008-06-19 17:01 ` Peter T. Breuer
0 siblings, 0 replies; 5+ messages in thread
From: Peter T. Breuer @ 2008-06-19 17:01 UTC (permalink / raw)
To: Frederik Deweerdt; +Cc: linux-kernel

In article <20080619150248.GA13463@slug> you wrote:
> On Thu, Jun 19, 2008 at 04:11:14PM +0200, Peter T. Breuer wrote:
>> G'day all
>>
>> I've been mmapping the request's bio buffers received on my block device
>> driver for userspace to use directly as tcp send/receive buffers.
> Is the code available somewhere by any chance?

Sure. It's just the unstable snapshot release of my enbd device driver.

ftp://oboe.it.uc3m.es/pub/Programs/enbd-2.4.36.tgz

would be the current image. It gets updated nightly. I imagine there are
mirrors in several places.

If you would like me to describe the code, I'll say that originally I was
using the "nopage" method as described in Rubini, and nowadays I do a tiny
bit more than only that.

Generally speaking, on receiving a read/write request, the driver just
queues it and notifies a user space daemon. The daemon gets told about the
pending request, and then mmaps the corresponding area of the device.

The driver just replies "yes" to the mmap call, and waits for an actual
page fault from an access before doing any real work (though nowadays, as
I mentioned, I'm prefaulting in the mmap call itself, so I can walk the
pages and decide if they're OK before letting the mmap return OK, and then
also benefit from reduced latency).

We supplied a nopage method to the vma struct at mmap time, and that gets
called at the page fault, or whenever I prefault. In the nopage method we
go hunt down the individual buffer in the kernel request that corresponds
to the notional device offset of the missing mmapped page. The advantage
of using nopage is that I can count on only having to deal with one page
at a time, thus making things conceptually simpler.
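For orientation, the daemon side of this (mmap the area of the device that the request covers, then recv from the tcp socket straight into the mapping) might look roughly like the following in user space. This is a sketch under assumptions, not the enbd daemon code: `recv_into_region` is a hypothetical name, the path would be the block device node in practice, `offset` must be page-aligned, and the loop simply keeps calling recv(2) until `len` bytes have arrived.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of the file or device at `path`,
 * starting at page-aligned `offset`, and receive exactly `len` bytes from
 * `sock` directly into the mapping. Returns 0 on success, -1 on failure. */
static int recv_into_region(const char *path, off_t offset, size_t len, int sock)
{
    size_t done = 0;
    int rc = 0;
    char *buf;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (buf == MAP_FAILED) {
        close(fd);
        return -1;
    }
    while (done < len) {
        /* the mapped pages themselves are the receive buffer */
        ssize_t n = recv(sock, buf + done, len - done, 0);
        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        rc = -1;    /* error, or peer closed before `len` bytes arrived */
        break;
    }
    if (munmap(buf, len) != 0)
        rc = -1;
    close(fd);
    return rc;
}
```

Against an ordinary file this behaves as expected, which is useful as a baseline when comparing with the behaviour seen on the driver's mapping.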
The only thing the mmap call does is load the private field of the vma
with a pointer to the device, and do various tests, then prefaults all the
vma struct pages in (at the mo).

int
enbd_mmap(struct file *file, struct vm_area_struct * vma)
{
    unsigned long long vma_offset_in_disk
        = ((unsigned long long)vma->vm_pgoff) << PAGE_SHIFT;
    unsigned long vma_len = vma->vm_end - vma->vm_start;
    // used in pre-faulting in pages
    unsigned long addr, len;
    // device parameters
    int dev;
    struct enbd_device *lo;
    int nbd;
    int part;
    int islot;

    // setup device parameters from @file arg
    // ...

    // various sanity tests cause -EINVAL return if failed
    // ...

    if (vma_offset_in_disk >= __pa(high_memory) || (file->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;   // don't core dump this area
    vma->vm_flags |= VM_RESERVED; // don't swap out this area
    vma->vm_flags |= VM_MAYREAD;  // for good luck. Not definitive
    vma->vm_flags |= VM_MAYWRITE;

    // our vm_ops has the nopage method
    vma->vm_ops = &enbd_vm_ops;

    // begin pre-fault in the pages
    addr = vma->vm_start;
    len = vma_len;
    while (len > 0) {
        struct page * page = enbd_vma_nopage(vma, addr, NULL);
        if (page == NOPAGE_SIGBUS || page == NOPAGE_OOM) {
            // got too few pages
            return -EINVAL;
        }
        if (vm_insert_page(vma, addr, page)) {
            // reverse an extra get_page we did in nopage
            put_page(page);
            return -EINVAL;
        }
        // reverse an extra get_page we did in nopage
        put_page(page);
        len -= PAGE_SIZE;
        addr += PAGE_SIZE;
    }
    // end pre-fault in pages

    enbd_vma_open(vma);
    return 0;
}

and the nopage method ... it goes searchabout on the local device queue
for the request with the page that is wanted as one of its buffers, and
then grabs the page reference from the buffer and returns it (after
incrementing the use count with get_page under lock).
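The search comes down to some shift arithmetic between byte offsets, 512-byte sectors and pages. Here is a user-space model of just that arithmetic, mirroring the shifts in the driver's nopage method. Everything in it is a stand-in: `struct model_req`, the `MODEL_*` constants (4 KiB pages are assumed) and the function names are hypothetical, and the segment list is a plain array of lengths rather than bios and bio_vecs.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SHIFT     9                        /* 512-byte sectors */
#define MODEL_PAGE_SHIFT 12                       /* ASSUMPTION: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Hypothetical stand-in for the two struct request fields the lookup uses. */
struct model_req {
    unsigned long long sector;      /* first sector of the request */
    unsigned long long nr_sectors;  /* request length in sectors */
};

/* Device byte offset of the page at `addr` in a mapping that starts at
 * `vm_start` and was mapped at page offset `vm_pgoff`. */
static unsigned long long
page_offset_in_disk(unsigned long addr, unsigned long vm_start,
                    unsigned long vm_pgoff)
{
    return ((unsigned long long)vm_pgoff << MODEL_PAGE_SHIFT)
         + (addr - vm_start);
}

/* Nonzero if the whole page at device byte offset `off` lies inside `req`. */
static int req_covers_page(const struct model_req *req, unsigned long long off)
{
    return req->sector <= (off >> SECTOR_SHIFT)
        && req->sector + req->nr_sectors
               >= ((off + MODEL_PAGE_SIZE) >> SECTOR_SHIFT);
}

/* Walk the segment lengths (in bytes) of a contiguous request and return
 * the index of the first segment starting on the wanted page, or -1. */
static int find_segment(const struct model_req *req, const unsigned *seg_len,
                        size_t nsegs, unsigned long long off)
{
    unsigned long long offset_in_req = 0;
    size_t i;

    for (i = 0; i < nsegs; i++) {
        unsigned long long sector
            = req->sector + (offset_in_req >> SECTOR_SHIFT);
        if ((sector >> (MODEL_PAGE_SHIFT - SECTOR_SHIFT))
                == (off >> MODEL_PAGE_SHIFT))
            return (int)i;
        offset_in_req += seg_len[i];
    }
    return -1;
}
```

For example, a request at sector 16 of 16 sectors with two 4096-byte segments covers device pages 2 and 3; a fault on the second page of a mapping made at page offset 2 resolves to segment index 1. Note that the containment test only matches when the request spans the whole page.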

static struct page *
enbd_vma_nopage(struct vm_area_struct * vma, unsigned long addr, int *type)
{
    struct page *page;
    // get stored device params out of vma private data
    struct enbd_slot * const slot = vma->vm_private_data;
    const int islot = slot->i;
    const int part = islot + 1;
    struct enbd_device * const lo = slot->lo;
    // for scanning requests
    struct request *xreq, *req = NULL;
    struct bio *bio;
    struct buffer_head *bh;
    // offset data
    const unsigned long page_offset_in_vma = addr - vma->vm_start;
    const unsigned long long vma_offset_in_disk
        = ((unsigned long long)vma->vm_pgoff) << PAGE_SHIFT;
    const unsigned long long page_offset_in_disk
        = page_offset_in_vma + vma_offset_in_disk;

    // look under local lock on our queue for matching request
    spin_lock(&slot->lock);
    list_for_each_entry_reverse (xreq, &slot->queue, queuelist) {
        if (xreq->sector <= (page_offset_in_disk >> 9)
            && xreq->sector + xreq->nr_sectors
                   >= ((page_offset_in_disk + PAGE_SIZE) >> 9)) {
            req = xreq; // found the request
            break;
        }
    }
    if (!req) {
        spin_unlock(&slot->lock);
        goto try_searching_general_memory_pages;
    }

    // still under local device queue lock
    page = NULL;
    __rq_for_each_bio(bio, req) {
        int i;
        struct bio_vec * bvec;
        // set the offset in req since bios may be noncontiguous
        int offset_in_req = (bio->bi_sector - req->sector) << 9;

        bio_for_each_segment(bvec, bio, i) {
            const unsigned current_segment_size // <= PAGE_SIZE
                = bvec->bv_len;
            // PTB are we on the same page of the device?
            if (((req->sector + (offset_in_req >> 9)) >> (PAGE_SHIFT - 9))
                    == (page_offset_in_disk >> PAGE_SHIFT)) {
                struct page *old_page = page;
                page = bvec->bv_page;
                if (page != old_page) {
                    // increment page use count
                    get_page(page);
                }
                spin_unlock(&slot->lock);
                goto got_page;
            }
            offset_in_req += current_segment_size;
        }
    }
    spin_unlock(&slot->lock);
    // not possible
    goto nopage;

try_searching_general_memory_pages:
    // This one does not sleep
    bh = __find_get_block(lo->bdev,
            (sector_t) (page_offset_in_disk >> lo->logblksize), PAGE_SIZE);
    if (bh) {
        page = bh->b_page;
        // increment page use count. Decremented by unmap.
        get_page(page);
        put_bh(bh);
        goto got_page;
    }
    // dropthru

nopage:
    if (type)
        *type = VM_FAULT_MAJOR;
    return NOPAGE_SIGBUS;

got_page:
    if (type)
        *type = VM_FAULT_MINOR;
    return page;
}

Peter

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: zero-copy recv ?
@ 2008-06-19 22:35 Peter T. Breuer
2008-06-20 8:44 ` Peter T. Breuer
0 siblings, 1 reply; 5+ messages in thread
From: Peter T. Breuer @ 2008-06-19 22:35 UTC (permalink / raw)
To: linux kernel
References: <200806191411.m5JEBE56008942@betty.it.uc3m.es>
Gah .. I wrote an answer to this and it seems to have got lost. If you
saw the answer, please forward to me!
> On Thu, Jun 19, 2008 at 04:11:14PM +0200, Peter T. Breuer wrote:
> > I've been mmapping the request's bio buffers received on my block device
> > driver for userspace to use directly as tcp send/receive buffers.
> Is the code available somewhere by any chance?
It's just the daily development snapshot for the enbd driver:
ftp://oboe.it.uc3m.es/pub/Programs/enbd-2.4.36.tgz
and mirrors.
I wrote a quick summary of the relevant code in my lost answer :(.
Please ask me to repeat if it is really AWOL.
Peter
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: zero-copy recv ?
2008-06-19 22:35 Peter T. Breuer
@ 2008-06-20 8:44 ` Peter T. Breuer
0 siblings, 0 replies; 5+ messages in thread
From: Peter T. Breuer @ 2008-06-20 8:44 UTC (permalink / raw)

Peter T. Breuer <ptb@inv.it.uc3m.es> wrote:
> References: <200806191411.m5JEBE56008942@betty.it.uc3m.es>
> ftp://oboe.it.uc3m.es/pub/Programs/enbd-2.4.36.tgz
> I wrote a quick summary of the relevant code in my lost answer :(.
> Please ask me to repeat if it is really AWOL.

The code uses the "nopage" technique from Rubini. That is, the mmap call
simply replies "yes" without doing any work, but loads the vma struct with
its own nopage method. The nopage method gets called when the mmapped
region is actually accessed. That will be at once.

What is happening in the larger picture is that the block device driver
has received a r/w request, has notified a user daemon, and the user
daemon is responding by attempting to mmap the region on the device
corresponding to the r/w request it's just been informed about. The
intention is that it will then recv/send on a tcp socket with the data
directly to/from the mmapped address as recv/send buffer. This works fine
for send, but *recv* *hangs* (oww! why?).

The nopage method simply goes and searches in the request bio buffers for
any page it is told is needed. It's guaranteed to find it, because it's
been asked to do this as part of an mmap attempt on exactly the device
area corresponding to the r/w request that's currently sitting on its
queue, packed with nice juicy buffers.

Here is the mmap, simplified

int
enbd_mmap(struct file *file, struct vm_area_struct * vma)
{
    unsigned long long vma_offset_in_disk
        = ((unsigned long long)vma->vm_pgoff) << PAGE_SHIFT;
    unsigned long vma_len = vma->vm_end - vma->vm_start;
    // ...

    // device data to be stored in vma private field
    vma->vm_private_data = slot;

    // set VMA flags
    if (vma_offset_in_disk >= __pa(high_memory) || (file->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_flags |= VM_MAYREAD; // for good luck
    vma->vm_flags |= VM_MAYWRITE;

    vma->vm_ops = &enbd_vm_ops; // vm_ops contains my nopage method
    enbd_vma_open(vma);         // accounting
    return 0;
}

and here's the simplified nopage method

static struct page *
enbd_vma_nopage(struct vm_area_struct * vma, unsigned long addr, int *type)
{
    struct page *page = NULL;
    // device data retrieved from vma private field
    struct enbd_slot * const slot = vma->vm_private_data;
    // ...
    // used in scanning requests on local queue
    struct request *xreq, *req = NULL;
    struct bio *bio;
    // offset data
    const unsigned long page_offset_in_vma = addr - vma->vm_start;
    const unsigned long long vma_offset_in_disk
        = ((unsigned long long)vma->vm_pgoff) << PAGE_SHIFT;
    const unsigned long long page_offset_in_disk
        = page_offset_in_vma + vma_offset_in_disk;
    const long vma_len = vma->vm_end - vma->vm_start;
    const unsigned long long page_end_in_disk
        = page_offset_in_disk + PAGE_SIZE;
    const unsigned long long page_index
        = page_offset_in_disk >> PAGE_SHIFT;

    // begin seeking a matching req on local device queue under lock
    spin_lock(&slot->lock);
    list_for_each_entry_reverse (xreq, &slot->queue, queuelist) {
        unsigned long long xreq_end_sector
            = xreq->sector + xreq->nr_sectors;
        if (xreq->sector <= (page_offset_in_disk >> 9)
            && xreq_end_sector >= (page_end_in_disk >> 9)) {
            // PTB found the request with the wanted buffer
            req = xreq;
            break;
        }
    }
    // end seeking a matching req on local queue, still under lock

    if (!req) {
        spin_unlock(&slot->lock);
        goto got_no_page;
    }

    // can't release lock yet. Look inside the req for buffer page
    __rq_for_each_bio(bio, req) {
        int i;
        struct bio_vec * bvec;
        // set the offset in req since bios may be noncontiguous
        int current_offset_in_req = (bio->bi_sector - req->sector) << 9;

        bio_for_each_segment(bvec, bio, i) {
            const unsigned current_segment_size // <= PAGE_SIZE
                = bvec->bv_len;
            const unsigned long long current_sector
                = req->sector + (current_offset_in_req >> 9);
            const unsigned long long current_page
                = current_sector >> (PAGE_SHIFT - 9);
            // are we on the same page?
            if (current_page == page_index) {
                page = bvec->bv_page;
                // increment page use count for mmap
                get_page(page);
                spin_unlock(&slot->lock);
                goto got_page;
            }
            current_offset_in_req += current_segment_size;
        }
    }
    spin_unlock(&slot->lock);
    goto got_no_page;

got_no_page:
    if (type)
        *type = VM_FAULT_MAJOR;
    return NOPAGE_SIGBUS;

got_page:
    if (type)
        *type = VM_FAULT_MINOR;
    return page;
}

I've tried prefaulting in the mmap pages at mmap time, but not been
successful. vm_insert_page won't touch the pages for insertion in the vma
because it thinks they're anonymous.

I can run nopage on each page all the same, without doing the vma
insertion, and that looks as though it is initially helpful, but a random
looking oops happens a little later, probably because of bad refcount
management. It does remove the recv hang, though, so the hang might be
that the recv has to bring the buffer it is receiving into into existence
first, and that takes one through memory.

I'd like to know how to prefault in the intended mmap pages properly.
vm_insert_page won't let me do it, using the page addresses found in the
i/o request, because it thinks they're anonymous.

Help?

Peter

^ permalink raw reply [flat|nested] 5+ messages in thread