* Re: blktap: Sync with XCP, dropping zero-copy.
       [not found] <20101116215621.59FC2CF782@homiemail-mx7.g.dreamhost.com>
@ 2010-11-17 16:36 ` Andres Lagar-Cavilla
  2010-11-17 17:52   ` Jeremy Fitzhardinge
  2010-11-17 23:42   ` Daniel Stodden
  0 siblings, 2 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2010-11-17 16:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Jeremy Fitzhardinge, Daniel Stodden

I'll throw an idea there and you educate me why it's lame.

Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.

Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.

The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:

1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.

2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.

3. For blkfront reads, call swap when stuffing the response back into the ring.

4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from another domain. It's your page all along, dom0.

5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle).

6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. these hypercalls won't happen as often, or at as fine a granularity, as skbuff's demanded for GNTTABOP_transfer.

7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy. From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.

Problems at first glance:

1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
3. Managing the pool of backend reserved pages may be a problem?

So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear other problems people may find with this approach.

Andres

> Message: 3
> Date: Tue, 16 Nov 2010 13:28:51 -0800
> From: Daniel Stodden <daniel.stodden@citrix.com>
> Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
> To: Jeremy Fitzhardinge <jeremy@goop.org>
> Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>
> Message-ID: <1289942932.11102.802.camel@agari.van.xensource.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote:
>> On 11/16/2010 01:13 AM, Daniel Stodden wrote:
>>> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
>>>> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>>>>>> Surely this can be dealt with by replacing the mapped granted page with
>>>>>> a local copy if the refcount is elevated?
>>>>> Yeah. We briefly discussed this when the problem started to pop up
>>>>> (again).
>>>>>
>>>>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
>>>>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
>>>>> does the job.
>>>> Hm, I'd be a bit concerned that that might cause problems if used
>>>> generically.
>>> Yeah. It wasn't a problem because all the network backends are on TCP,
>>> where one can be rather sure that the dups are going to be properly
>>> dropped.
>>>
>>> Does this hold everywhere ..? -- As mentioned below, the problem is
>>> rather in AIO/DIO than being Xen-specific, so you can see the same
>>> behavior on bare metal kernels too. A userspace app seeing an AIO
>>> complete and then reusing that buffer elsewhere will occasionally
>>> resend garbage over the network.
>>
>> Yeah, that sounds like a generic security problem. I presume the
>> protocol will just discard the excess retransmit data, but it might mean
>> a usermode program ends up transmitting secrets it never intended to...
>>
>>> There are some important parts which would go missing. Such as
>>> ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
>>> gntmap 352 pages simultaneously isn't so good, so there still needs to
>>> be some bridge arbitrating them. I'd rather keep that in kernel space,
>>> okay to cram stuff like that into gntdev? It'd be much more
>>> straightforward than IPC.
>>
>> What's the problem? If you do nothing then it will appear to the kernel
>> as a bunch of processes doing memory allocations, and they'll get
>> blocked/rate-limited accordingly if memory is getting short.
>
> The problem is that just letting the page allocator work through
> allocations isn't going to scale anywhere.
>
> The worst case memory requested under load is <number-of-disks> * (32 *
> 11 pages). As a (conservative) rule of thumb, N will be 200 or rather
> better.
>
> The number of I/O actually in-flight at any point, in contrast, is
> derived from the queue/sg sizes of the physical device. For a simple
> disk, that's about a ring or two.
>
>> There's plenty of existing mechanisms to control that sort of thing
>> (cgroups, etc) without adding anything new to the kernel. Or are you
>> talking about something other than simple memory pressure?
>>
>> And there's plenty of existing IPC mechanisms if you want them to
>> explicitly coordinate with each other, but I'd tend to think that's
>> premature unless you have something specific in mind.
>>
>>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
>>> Can't find it now, what happened? Without, there's presently still no
>>> zero-copy.
>>
>> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
>> new-ish) mmu notifier infrastructure which is intended to allow a device
>> to sync an external MMU with usermode mappings. We're not using it in
>> precisely that way, but it allows us to wrangle grant mappings before
>> the generic code tries to do normal pte ops on them.
>
> The mmu notifiers were for safe teardown only. They are not sufficient
> for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
> need to back those VMAs with page structs. Or bounce again (gulp, just
> mentioning it). As with the blktap2 patches, note there is no difference
> in the dom0 memory bill, it takes page frames.
>
> This is pretty much exactly the pooling stuff in current drivers/blktap.
> The interface could look as follows ([] designates users).
>
> * [toolstack]
>   Calling some ctls to create/destroy pools of frames.
>   (Blktap currently does this in sysfs.)
>
> * [toolstack]
>   Optionally resize them, according to the physical queue
>   depth [estimate] of the underlying storage.
>
> * [tapdisk]
>   A backend instance, when starting up, opens a gntdev, then
>   uses a ctl to bind its gntdev handle to a frame pool.
>
> * [tapdisk]
>   The .mmap call now will allocate frames to back the VMA.
>   This operation can fail/block under congestion. Neither
>   is desirable, so we need a .poll.
>
> * [tapdisk]
>   To integrate grant mappings with a single-threaded event loop,
>   use .poll. The handle fires as soon as a request can be mapped.
>
>   Under congestion, the .poll code will queue waiting disks and wake
>   them round-robin, once VMAs are released.
>
> (A [tapdisk] doesn't mean to dismiss a potential [qemu].)
>
> Still confident we want that? (Seriously asking). A lot of the code to
> do so has been written for blktap, it wouldn't be hard to bend into a
> gntdev extension.
>
>>> Once the issues were solved, it'd be kinda nice. Simplifies stuff like
>>> memshr for blktap, which depends on getting hold of original grefs.
>>>
>>> We'd presumably still need the tapdev nodes, for qemu, etc. But those
>>> can stay non-xen aware then.
>>>
>>>>>> The only caveat is the stray unmapping problem, but I think gntdev can
>>>>>> be modified to deal with that pretty easily.
>>>>> Not easier than anything else in kernel space, but when dealing only
>>>>> with the refcounts, that's as good a place as anywhere else, yes.
>>>> I think the refcount test is pretty straightforward - if the refcount is
>>>> 1, then we're the sole owner of the page and we don't need to worry
>>>> about any other users. If it's > 1, then somebody else has it, and we
>>>> need to make sure it no longer refers to a granted page (which is just a
>>>> matter of doing a set_pte_atomic() to remap from present to present).
>>> [set_pte_atomic over grant ptes doesn't work, or does it?]
>>
>> No, I forgot about grant ptes' magic properties. But there is the hypercall.
>
> Yup.
>
>>>> Then we'd have a set of frames whose lifetimes are being determined by
>>>> some other subsystem. We can either maintain a list of them and poll
>>>> waiting for them to become free, or just release them and let them be
>>>> managed by the normal kernel lifetime rules (which requires that the
>>>> memory attached to them be completely normal, of course).
>>> The latter sounds like a good alternative to polling. So an
>>> unmap_and_replace, and giving up ownership thereafter. Next run of the
>>> dispatcher thread can just refill the foreign pfn range via
>>> alloc_empty_pages(), to rebalance.
>>
>> Do we actually need a "foreign page range"? Won't any pfn do? If we
>> start with a specific range of foreign pfns and then start freeing those
>> pfns back to the kernel, we won't have one for long...
>
> I guess we've been meaning the same thing here, unless I'm
> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> default to order 0 entries indeed. Sorry, you're right, that's not a
> 'range'. With a pending re-xmit, the backend can find a couple (or all)
> of the request frames have count>1. It can flip and abandon those as
> normal memory. But it will need those lost memory slots back, straight
> away or next time it's running out of frames. As order-0 allocations.
>
> Foreign memory is deliberately short. Blkback still defaults to 2 rings
> worth of address space, iirc, globally.
> That's what that mempool sysfs
> stuff in the later blktap2 patches aimed at -- making the size
> configurable where queue length matters, and isolate throughput between
> physical backends, where the toolstack wants to care.
>
> Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread
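[Editorial sketch, not part of the thread.] The frame-pool interface Daniel outlines in the quoted message above (a shared pool, mappings that can fail under congestion, and a .poll that wakes waiting disks in turn) can be illustrated with a small userspace toy model. Everything here is an assumption made for illustration: the names `FramePool`, `try_map`, `release` and `worst_case_pages` are hypothetical and do not correspond to any blktap or gntdev interface, and a FIFO queue stands in for the "round-robin" wakeup. Only the figures 32 * 11 pages per disk and N = 200 come from the message.

```python
from collections import deque

RING_SIZE = 32      # requests per ring, from the "32 * 11 pages" figure
SEGS_PER_REQ = 11   # pages (segments) per request, same figure


def worst_case_pages(n_disks):
    """Pages requested if every disk maps a full ring at the same time."""
    return n_disks * RING_SIZE * SEGS_PER_REQ


class FramePool:
    """Toy model of a shared frame pool with fair wakeups (hypothetical)."""

    def __init__(self, size):
        self.free = size
        self.waiters = deque()  # disks waiting for frames, oldest first

    def try_map(self, disk, nframes):
        """Grant frames if available and nobody is queued ahead of `disk`.

        Returns True on success. Otherwise the disk is queued (once) and
        should retry after release() names it as the next disk to wake.
        """
        at_head = not self.waiters or self.waiters[0] == disk
        if at_head and nframes <= self.free:
            if self.waiters:
                self.waiters.popleft()
            self.free -= nframes
            return True
        if disk not in self.waiters:
            self.waiters.append(disk)
        return False

    def release(self, nframes):
        """Return frames to the pool and report which disk to wake, if any."""
        self.free += nframes
        return self.waiters[0] if self.waiters else None


if __name__ == "__main__":
    pages = worst_case_pages(200)
    # 4 KiB pages are an assumption; the message only counts pages.
    print(pages, "pages,", pages * 4096 // (1024 * 1024), "MiB")
```

With 4 KiB pages (an assumption), the worst case for 200 disks comes to 70400 pages, which is why a pool sized from the physical queue depth plus an arbiter looks attractive compared with letting every tapdisk allocate for a full ring.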
* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 16:36 ` blktap: Sync with XCP, dropping zero-copy Andres Lagar-Cavilla
@ 2010-11-17 17:52   ` Jeremy Fitzhardinge
  2010-11-17 19:47     ` Andres Lagar-Cavilla
  2010-11-17 23:42   ` Daniel Stodden
  1 sibling, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-17 17:52 UTC (permalink / raw)
  To: Andres Lagar-Cavilla; +Cc: xen-devel, Daniel Stodden

On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote:
> I'll throw an idea there and you educate me why it's lame.
>
> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.
>
> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
>
> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:
>
> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.
>
> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.

Would GNTTABOP_swap also require the domU to have already unmapped the
page from its own pagetables? Presumably it would fail if it didn't,
otherwise you'd end up with a domU mapping the same mfn as a
dom0-private page.

> 3. For blkfront reads, call swap when stuffing the response back into the ring.
>
> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from another domain. It's your page all along, dom0.
>
> 5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle).
>
> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. these hypercalls won't happen as often, or at as fine a granularity, as skbuff's demanded for GNTTABOP_transfer.
>
> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy. From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.

I think that will be the common case - the kernel will always attempt to
write dirty pagecache pages to make clean ones, and it will still want
them around to access. So it can't really give up the page altogether;
if it hands it over to dom0, it needs to make a local copy first.

> Problems at first glance:
> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
> 3. Managing the pool of backend reserved pages may be a problem?
>
> So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear other problems people may find with this approach.

It's not clear to me that it's any improvement over just directly copying
the data up front.

    J

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 17:52   ` Jeremy Fitzhardinge
@ 2010-11-17 19:47     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2010-11-17 19:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel, Daniel Stodden

So, swapping mfns for write requests is a definite no-no. One could still live with copying write buffers and swapping read buffers by the end of the request. That still yields some benefit.

As for kernel mappings, I thought a solution would be to provide the hypervisor with both pte pointers. After all, pte pointers are already provided for mapping grants in user-space. But that's a little too much to handle for the current interface.

Thanks for the feedback
Andres

On Nov 17, 2010, at 12:52 PM, Jeremy Fitzhardinge wrote:

> On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote:
>> I'll throw an idea there and you educate me why it's lame.
>>
>> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.
>>
>> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
>>
>> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:
>>
>> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.
>>
>> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.
>
> Would GNTTABOP_swap also require the domU to have already unmapped the
> page from its own pagetables? Presumably it would fail if it didn't,
> otherwise you'd end up with a domU mapping the same mfn as a
> dom0-private page.
>
>> 3. For blkfront reads, call swap when stuffing the response back into the ring.
>>
>> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from another domain. It's your page all along, dom0.
>>
>> 5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle).
>>
>> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. these hypercalls won't happen as often, or at as fine a granularity, as skbuff's demanded for GNTTABOP_transfer.
>>
>> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy. From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.
>
> I think that will be the common case - the kernel will always attempt to
> write dirty pagecache pages to make clean ones, and it will still want
> them around to access. So it can't really give up the page altogether;
> if it hands it over to dom0, it needs to make a local copy first.
>
>> Problems at first glance:
>> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
>> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
>> 3. Managing the pool of backend reserved pages may be a problem?
>>
>> So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear other problems people may find with this approach.
>
> It's not clear to me that it's any improvement over just directly copying
> the data up front.
>
>    J

^ permalink raw reply	[flat|nested] 18+ messages in thread
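[Editorial sketch, not part of the thread.] The variant Andres settles on in the message above (copy write buffers, swap read buffers when the request completes) can be modelled in a few lines. This is a toy under loud assumptions: GNTTABOP_swap is only a proposal in this thread, not an existing hypercall, and `Domain`, `hypothetical_gnttabop_swap`, `read_request` and `write_request` are made-up names. Machine memory is a dict from mfn to bytes and each domain's p2m is a dict from pfn to mfn. TLB and kernel-mapping fixups, which the thread identifies as the hard part, are not modelled at all.

```python
class Domain:
    """A domain reduced to its pfn -> mfn table (toy p2m)."""

    def __init__(self, name, p2m):
        self.name = name
        self.p2m = dict(p2m)


def hypothetical_gnttabop_swap(dom_a, pfn_a, dom_b, pfn_b):
    """Exchange the machine frames behind two pfns.

    Stands in for the proposed (non-existent) GNTTABOP_swap. Pfns stay
    put on both sides; only the backing mfns change hands.
    """
    dom_a.p2m[pfn_a], dom_b.p2m[pfn_b] = dom_b.p2m[pfn_b], dom_a.p2m[pfn_a]


def read_request(machine, dom0, pool_pfn, domU, buf_pfn, disk_data):
    """blkfront read: do the I/O on a regular dom0 frame, then swap it in."""
    machine[dom0.p2m[pool_pfn]] = disk_data
    hypothetical_gnttabop_swap(dom0, pool_pfn, domU, buf_pfn)
    # The new mfn would be reported to domU in the ring response.
    return domU.p2m[buf_pfn]


def write_request(machine, dom0, pool_pfn, domU, buf_pfn):
    """blkfront write: copy into a dom0 frame; domU keeps its own page."""
    machine[dom0.p2m[pool_pfn]] = bytes(machine[domU.p2m[buf_pfn]])
    return machine[dom0.p2m[pool_pfn]]


if __name__ == "__main__":
    machine = {100: b"dom0 pool frame", 200: b"domU buffer"}
    dom0 = Domain("dom0", {1: 100})
    domU = Domain("domU", {7: 200})
    new_mfn = read_request(machine, dom0, 1, domU, 7, b"sector data")
    print(new_mfn, machine[domU.p2m[7]], dom0.p2m[1])
```

The model only shows the bookkeeping: after a read, domU's buffer pfn is backed by the frame dom0 filled, and dom0's pool pfn inherits domU's old frame, so the pool size is conserved. Writes leave both p2m tables untouched, which matches Jeremy's point that the guest still wants its dirty page afterwards.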
* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 16:36 ` blktap: Sync with XCP, dropping zero-copy Andres Lagar-Cavilla
  2010-11-17 17:52   ` Jeremy Fitzhardinge
@ 2010-11-17 23:42   ` Daniel Stodden
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Stodden @ 2010-11-17 23:42 UTC (permalink / raw)
  To: Andres Lagar-Cavilla; +Cc: Fitzhardinge, xen-devel@lists.xensource.com, Jeremy

On Wed, 2010-11-17 at 11:36 -0500, Andres Lagar-Cavilla wrote:
> I'll throw an idea there and you educate me why it's lame.
>
> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.
>
> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
>
> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:
>
> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.
>
> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.
>
> 3. For blkfront reads, call swap when stuffing the response back into the ring.
>
> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from another domain. It's your page all along, dom0.
>
> 5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle).
>
> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. these hypercalls won't happen as often, or at as fine a granularity, as skbuff's demanded for GNTTABOP_transfer.
>
> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy. From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.
>
> Problems at first glance:
> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
> 3. Managing the pool of backend reserved pages may be a problem?

I guess GNT_transfer for network I/O died because of the double-ended
TLB fallout?

Still liked the general direction, nice shot.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread
* blktap: Sync with XCP, dropping zero-copy. @ 2010-11-12 23:31 Daniel Stodden 2010-11-13 0:50 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 18+ messages in thread From: Daniel Stodden @ 2010-11-12 23:31 UTC (permalink / raw) To: Xen; +Cc: Jeremy Fitzhardinge Hi all. This is the better half of what XCP developments and testing brought for blktap. It's fairly a big change in how I/O buffers are managed. Prior to this series, we had zero-copy I/O down to userspace. Unfortunately, blktap2 always had to jump through a couple of extra loops to do so. Present state of that is that we dropped that, so all tapdev I/O is bounced to/from a bunch of normal pages. Essentially replacing the old VMA management with a couple insert/zap VM calls. One issue was that the kernel can't cope with recursive I/O. Submitting an iovec on a tapdev, passing it to userspace and then reissuing the same vector via AIO apparently doesn't fit well with the lock protocol applied to those pages. This is the main reason why blktap had to deal a lot with grant refs. About as much as blkback already does before passing requests on. What happens there is that it's aliasing those granted pages under a different PFN, thereby in a separate page struct. Not pretty, but it worked, so it's not the reason why we chose to drop that at some point. The more prevalent problem was network storage, especially anything involving TCP. That includes VHD on both NFS and iSCSI. The problem with those is that retransmits (by the transport) and I/O op completion (on the application layer) are never synchronized. With sufficiently bad timing and bit of jitter on the network, it's perfectly common for the kernel to complete an AIO request with a late ack on the input queue just when retransmission timer is about to fire underneath. The completion will unmap the granted frame, crashing any uncanceled retransmission on an empty page frame. There are different ways to deal with that. 
Page destructors might be one way, but as far as I heard they are not particularly popular upstream. Issuing the block I/O on dom0 memory is straightforward and avoids the hassle. One could argue that retransmits after DIO completion are still a potential privacy problem (I did), but it's not Xen's problem after all. If zero-copy becomes more attractive again, the plan would be to rather use grantdev in userspace, such as a filter driver for tapdisk instead. Until then, there's presumably a notable difference in L2 cache footprint. Then again, there's also a whole number of cycles not spent in redundant hypercalls now, to deal with the pseudophysical map. There are also benefits or non-issues.
- This blktap is rather xen-independent. Certainly depends on the common ring macros, but lacking grant stuff it compiles on bare metal Linux with no CONFIG_XEN. Not consummated here, because that's going to move the source tree out of drivers/xen. But I'd like to post a new branch proposing to do so.
- Blktap's size in dom0 didn't really change. Frames (now pages) were always pooled. We used to balloon memory to claim space for redundant grant mappings. Now we reserve, by default, the same volume in normal memory.
- The previous code would run all I/O on a single pool. Typically two rings' worth of requests. Sufficient for a whole lot of systems, especially with single storage backends, but not so nice when I/O on a number of otherwise independent filers or volumes collides. Pools are refcounted kobjects in sysfs. Toolstacks using the new code can thereby choose to eliminate bottlenecks by grouping taps on different buffer pools. Pools can also be resized, to accommodate greater queue depths. [Note that blkback still has the same issue, so guests won't take advantage of that before that's resolved as well.]
- XCP started to make some use of stacking tapdevs. Think pointing the image chain of a bunch of "leaf" taps to a shared parent node.
That works fairly well, but definitely takes independent resource pools to avoid deadlock by parent starvation then. Please pull upstream/xen/dom0/backend/blktap2 from git://xenbits.xensource.com/people/dstodden/linux.git .. and/or upstream/xen/next for a merge. I also pulled in the pending warning fix from Teck Choon Giam. Cheers, Daniel ^ permalink raw reply [flat|nested] 18+ messages in thread
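The point about independent pools and parent starvation in the announcement above can be illustrated with a small model. This is a sketch under assumptions, not the blktap code: `BufferPool` and its methods are hypothetical, and the real pools are kobjects managed through sysfs. The idea shown is that a leaf tap's request only completes once the parent tap has served it, so if the leaves exhaust a shared pool the parent can never get a page.

```python
# Toy model of refcounted, resizable buffer pools. BufferPool is a
# hypothetical name; it is not the interface exposed by blktap.

class BufferPool:
    def __init__(self, name, size):
        self.name = name
        self.size = size      # pages reserved for this pool
        self.in_use = 0       # pages currently backing requests
        self.refcount = 0     # taps attached to this pool

    def attach(self):
        self.refcount += 1
        return self

    def detach(self):
        self.refcount -= 1
        return self.refcount == 0   # True: pool may be torn down

    def alloc(self, npages):
        """Claim pages for a request; False means the request must wait."""
        if self.in_use + npages > self.size:
            return False
        self.in_use += npages
        return True

    def free(self, npages):
        self.in_use -= npages

    def resize(self, size):
        """Grow or shrink the pool, e.g. to allow a greater queue depth."""
        if size < self.in_use:
            raise ValueError("cannot shrink below pages in use")
        self.size = size
```

With one shared pool of 4 pages, leaf requests holding all 4 leave `alloc(1)` for the parent failing forever. With a separate parent pool the same allocation succeeds and the leaf requests can complete.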
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-12 23:31 Daniel Stodden @ 2010-11-13 0:50 ` Jeremy Fitzhardinge 2010-11-13 3:56 ` Daniel Stodden [not found] ` <1289620544.11102.373.camel@agari.van.xensource.com> 0 siblings, 2 replies; 18+ messages in thread From: Jeremy Fitzhardinge @ 2010-11-13 0:50 UTC (permalink / raw) To: Daniel Stodden; +Cc: Xen On 11/12/2010 03:31 PM, Daniel Stodden wrote: > It's fairly a big change in how I/O buffers are managed. Prior to this > series, we had zero-copy I/O down to userspace. Unfortunately, blktap2 > always had to jump through a couple of extra loops to do so. Present > state of that is that we dropped that, so all tapdev I/O is bounced > to/from a bunch of normal pages. Essentially replacing the old VMA > management with a couple insert/zap VM calls. Do you have any performance results comparing the two approaches? > One issue was that the kernel can't cope with recursive > I/O. Submitting an iovec on a tapdev, passing it to userspace and then > reissuing the same vector via AIO apparently doesn't fit well with the > lock protocol applied to those pages. This is the main reason why > blktap had to deal a lot with grant refs. About as much as blkback > already does before passing requests on. What happens there is that > it's aliasing those granted pages under a different PFN, thereby in a > separate page struct. Not pretty, but it worked, so it's not the > reason why we chose to drop that at some point. > > The more prevalent problem was network storage, especially anything > involving TCP. That includes VHD on both NFS and iSCSI. The problem > with those is that retransmits (by the transport) and I/O op > completion (on the application layer) are never synchronized. With > sufficiently bad timing and bit of jitter on the network, it's > perfectly common for the kernel to complete an AIO request with a late > ack on the input queue just when retransmission timer is about to fire > underneath. 
The completion will unmap the granted frame, crashing any > uncanceled retransmission on an empty page frame. There are different > ways to deal with that. Page destructors might be one way, but as far > as I heard they are not particularly popular upstream. Issuing the > block I/O on dom0 memory is straightforward and avoids the hassle. One > could go argue that retransmits after DIO completion are still a > potential privacy problem (I did), but it's not Xen's problem after > all. Surely this can be dealt with by replacing the mapped granted page with a local copy if the refcount is elevated? Then that can catch any stray residual references while we can still return the granted page to its owner. And obviously, not reuse that pfn for grants until the refcount is zero... > If zero-copy becomes more attractive again, the plan would be to > rather use grantdev in userspace, such as a filter driver for tapdisk > instead. Until then, there's presumably a notable difference in L2 > cache footprint. Then again, there's also a whole number of cycles not > spent in redundant hypercalls now, to deal with the pseudophysical > map. Frankly, I think the idea of putting blkback+tapdisk entirely in usermode is all upside with no (obvious) downsides. It: 1. avoids having to upstream anything 2. avoids having to upstream anything 3. avoids having to upstream anything 4. gets us back zero-copy (if that's important) 5. makes the IO path nice and straightforward 6. seems to address all the other problems you mentioned The only caveat is the stray unmapping problem, but I think gntdev can be modified to deal with that pretty easily. qemu has usermode blkback support already, and an actively improving block-IO infrastructure, so one approach might be to consider putting (parts of) tapdisk into qemu - and makes it pretty natural to reuse it with non-Xen guests via virtio-block, emulated devices, etc. 
But I'm not sold on that; having a standalone tapdisk w/ blkback makes sense to me as well. On the other hand, I don't think we're going to be able to get away with putting netback in usermode, so we still need to deal with that - but I think an all-copying version will be fine to get started with at least. > There are also benefits or non-issues. > > - This blktap is rather xen-independent. Certainly depends on the > common ring macros, but lacking grant stuff it compiles on bare > metal Linux with no CONFIG_XEN. Not consummated here, because > that's going to move the source tree out of drivers/xen. But I'd > like to post a new branch proposing to do so. > > - Blktaps size in dom0 didn't really change. Frames (now pages) were > always pooled. We used to balloon memory to claim space for > redundant grant mappings. Now we reserve, by default, the same > volume in normal memory. > > - The previous code would runs all I/O on a single pool. Typically > two rings worth of requests. Sufficient for a whole lot of systems, > especially with single storage backends, but not so nice when I/O > on a number of otherwise independent filers or volumes collides. > > Pools are refcounted kobjects in sysfs. Toolstacks using the new > code can thereby choose to elimitate bottlenecks by grouping taps > on different buffer pools. Pools can also be resized, to accomodate > greater queue depths. [Note that blkback still has the same issue, > so guests won't take advantage of that before that's resolved as > well.] > > - XCP started to make some use of stacking tapdevs. Think pointing > the image chain of a bunch of "leaf" taps to a shared parent > node. That works fairly well, but definitely takes independent > resource pools to avoid deadlock by parent starvation then. > > Please pull upstream/xen/dom0/backend/blktap2 from > git://xenbits.xensource.com/people/dstodden/linux.git OK, I've pulled it, but I haven't had a chance to test it yet. 
Thanks, J ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-13 0:50 ` Jeremy Fitzhardinge @ 2010-11-13 3:56 ` Daniel Stodden [not found] ` <1289620544.11102.373.camel@agari.van.xensource.com> 1 sibling, 0 replies; 18+ messages in thread From: Daniel Stodden @ 2010-11-13 3:56 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: xen-devel@lists.xensource.com Hey. On Fri, 2010-11-12 at 19:50 -0500, Jeremy Fitzhardinge wrote: > On 11/12/2010 03:31 PM, Daniel Stodden wrote: > > It's fairly a big change in how I/O buffers are managed. Prior to this > > series, we had zero-copy I/O down to userspace. Unfortunately, blktap2 > > always had to jump through a couple of extra loops to do so. Present > > state of that is that we dropped that, so all tapdev I/O is bounced > > to/from a bunch of normal pages. Essentially replacing the old VMA > > management with a couple insert/zap VM calls. > > Do you have any performance results comparing the two approaches? No. One could probably go try large ramdisks or an AIO backend in tmpfs. All the storage I'm concerned with here terminates either on the NIC or a local spindle. That's hard to thwart in cache bandwidth. > > One issue was that the kernel can't cope with recursive > > I/O. Submitting an iovec on a tapdev, passing it to userspace and then > > reissuing the same vector via AIO apparently doesn't fit well with the > > lock protocol applied to those pages. This is the main reason why > > blktap had to deal a lot with grant refs. About as much as blkback > > already does before passing requests on. What happens there is that > > it's aliasing those granted pages under a different PFN, thereby in a > > separate page struct. Not pretty, but it worked, so it's not the > > reason why we chose to drop that at some point. > > > > The more prevalent problem was network storage, especially anything > > involving TCP. That includes VHD on both NFS and iSCSI. 
The problem > > with those is that retransmits (by the transport) and I/O op > > completion (on the application layer) are never synchronized. With > > sufficiently bad timing and bit of jitter on the network, it's > > perfectly common for the kernel to complete an AIO request with a late > > ack on the input queue just when retransmission timer is about to fire > > underneath. The completion will unmap the granted frame, crashing any > > uncanceled retransmission on an empty page frame. There are different > > ways to deal with that. Page destructors might be one way, but as far > > as I heard they are not particularly popular upstream. Issuing the > > block I/O on dom0 memory is straightforward and avoids the hassle. One > > could go argue that retransmits after DIO completion are still a > > potential privacy problem (I did), but it's not Xen's problem after > > all. > > Surely this can be dealt with by replacing the mapped granted page with > a local copy if the refcount is elevated? Yeah. We briefly discussed this when the problem started to pop up (again). I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily does the job. On UP that'd be just a matter of disabling interrupts for a while. I dropped it after it became clear that XS was moving to SMP, where one would end up with a full barrier to orchestrate the TLB flushes everywhere. Now, the skb runs prone to crash all run in softirq context, I wouldn't exactly expect a huge performance win from syncing on that kind of thing across all nodes, compared to local memcpy. Nor would I want storage stuff to touch locks shared with TCP, that's just not our business. Correct me if I'm mistaken. I'd like to see maybe stuff like node affinity on NUMA getting a bit more work. I think the patch presently just fills the queue node in, but that didn't see much testing, and one would have to correlate that. 
> Then that can catch any stray > residual references while we can still return the granted page to its > owner. And obviously, not reuse that pfn for grants until the refcount > is zero... > > If zero-copy becomes more attractive again, the plan would be to > > rather use grantdev in userspace, such as a filter driver for tapdisk > > instead. Until then, there's presumably a notable difference in L2 > > cache footprint. Then again, there's also a whole number of cycles not > > spent in redundant hypercalls now, to deal with the pseudophysical > > map. > > Frankly, I think the idea of putting blkback+tapdisk entirely in > usermode is all upside with no (obvious) downsides. It: > > 1. avoids having to upstream anything > 2. avoids having to upstream anything > 3. avoids having to upstream anything > > 4. gets us back zero-copy (if that's important) (No, unfortunately. DIO on a granted frame under blktap would be as vulnerable as DIO on a granted frame under a userland blkback, userland won't buy us anthing as far as the zcopy side of things is concerned). > 5. makes the IO path nice and straightforward > 6. seems to address all the other problems you mentioned I'm not at all against a userland blkback. Just wouldn't go as far as considering this a silver bullet. The main thing I'm scared of is ending up hacking cheesy stuff into the user ABI to take advantage of things immediately available to FSes on the bio layer, but harder (or at least slower) to get made available to userland. DISCARD support is one currently hot example, do you see that in AIO somewhere? Ok, probably a good thing for everybody anyway, so maybe patching that is useful work. But it's extra work right now and probably no more fun to maintain than blktap is. The second issue I see is the XCP side of things. XenServer got a lot of benefit out of blktap2, and particularly because of the tapdevs. 
It promotes a fairly rigorous split between a blkback VBD, controlled by the agent, and tapdevs, controlled by XS's storage manager. That doesn't prevent blkback from going into userspace, but it had better not share a process with some libblktap, which in turn would better not be controlled under the same xenstore path. So for XCP it'd be AIO on tapdevs for the time being, and with that whatever the syscall interface lets you do. > The only caveat is the stray unmapping problem, but I think gntdev can > be modified to deal with that pretty easily. Not easier than anything else in kernel space, but when dealing only with the refcounts, that's as good a place as anywhere else, yes. > qemu has usermode blkback support already, and an actively improving > block-IO infrastructure, so one approach might be to consider putting > (parts of) tapdisk into qemu - and makes it pretty natural to reuse it > with non-Xen guests via virtio-block, emulated devices, etc. But I'm > not sold on that; having a standalone tapdisk w/ blkback makes sense to > me as well. > > On the other hand, I don't think we're going to be able to get away with > putting netback in usermode, so we still need to deal with that - but I > think an all-copying version will be fine to get started with at least. > > Please pull upstream/xen/dom0/backend/blktap2 from > > git://xenbits.xensource.com/people/dstodden/linux.git > > OK, I've pulled it, but I haven't had a chance to test it yet. Thanks. Daniel ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: blktap: Sync with XCP, dropping zero-copy. [not found] ` <1289620544.11102.373.camel@agari.van.xensource.com> @ 2010-11-15 18:27 ` Jeremy Fitzhardinge 2010-11-16 9:13 ` Daniel Stodden 0 siblings, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2010-11-15 18:27 UTC (permalink / raw) To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com On 11/12/2010 07:55 PM, Daniel Stodden wrote: >> Surely this can be dealt with by replacing the mapped granted page with >> a local copy if the refcount is elevated? > Yeah. We briefly discussed this when the problem started to pop up > (again). > > I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with > a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily > does the job. Hm, I'd be a bit concerned that that might cause problems if used generically. If the page is being used RO, then replacing with a copy shouldn't be a problem, but getting a consistent snapshot of a mutable page is obviously going to be a problem. > On UP that'd be just a matter of disabling interrupts for > a while. Disable for what purpose? You mean to do an exchange of the mapping, or something else? > I dropped it after it became clear that XS was moving to SMP, where one > would end up with a full barrier to orchestrate the TLB flushes > everywhere. Now, the skb runs prone to crash all run in softirq context, > I wouldn't exactly expect a huge performance win from syncing on that > kind of thing across all nodes, compared to local memcpy. Nor would I > want storage stuff to touch locks shared with TCP, that's just not our > business. Correct me if I'm mistaken. I don't follow what your high-level concern is here. If we update the pte to unmap the granted page, then its up to Xen to arrange for any TLB flushes to make sure the page is no longer accessible to the domain. 
We don't need to do anything explicit, and its independent of whether we're simply unmapping the page or replacing the mapping with another one (ie, any TLB flushing necessary is already going on). I was also under the impression that this is a relatively rare event rather than something that's going to be necessary for every page, so the overhead should be minimal. >>> If zero-copy becomes more attractive again, the plan would be to >>> rather use grantdev in userspace, such as a filter driver for tapdisk >>> instead. Until then, there's presumably a notable difference in L2 >>> cache footprint. Then again, there's also a whole number of cycles not >>> spent in redundant hypercalls now, to deal with the pseudophysical >>> map. >> Frankly, I think the idea of putting blkback+tapdisk entirely in >> usermode is all upside with no (obvious) downsides. It: >> >> 1. avoids having to upstream anything >> 2. avoids having to upstream anything >> 3. avoids having to upstream anything >> >> 4. gets us back zero-copy (if that's important) > (No, unfortunately. DIO on a granted frame under blktap would be as > vulnerable as DIO on a granted frame under a userland blkback, userland > won't buy us anthing as far as the zcopy side of things is concerned). Why's that? Are you talking about the stray page reference vulnerability, or something else? You're right that it doesn't really help with stray references because its still kernel code which ends up doing the dereference rather than usermode. >> 5. makes the IO path nice and straightforward >> 6. seems to address all the other problems you mentioned > I'm not at all against a userland blkback. Just wouldn't go as far as > considering this a silver bullet. > > The main thing I'm scared of is ending up hacking cheesy stuff into the > user ABI to take advantage of things immediately available to FSes on > the bio layer, but harder (or at least slower) to get made available to > userland. 
But that hasn't been a problem for tapdisk so far. Given that it is using DIO then any completed write has at least been submitted to a storage device, so then its just a question of how to make sure that any buffers are fully flushed, which would be fdatasync() I guess. Besides, our requirements are hardly unique; if there's some clear need for a new API, then the course of action is to work with broader Linux community to work out what it should be and how it should be implemented. There's no need for "hacking cheesy stuff" - there's been enough of that already. > DISCARD support is one currently hot example, do you see that in AIO > somewhere? Ok, probably a good thing for everybody anyway, so maybe > patching that is useful work. But it's extra work right now and probably > no more fun to maintain than blktap is. fallocate() is being extended to allow hole-punching in files. I don't know what work is being done to do discard on random block devices, but that's clearly a generically useful thing to have. > The second issue I see is the XCP side of things. XenServer got a lot of > benefit out of blktap2, and particularly because of the tapdevs. It > promotes a fairly rigorous split between a blkback VBD, controlled by > the agent, and tapdevs, controlled by XS's storage manager. > > That doesn't prevent blkback to go into userspace, but it better won't > share a process with some libblktap, which in turn would better not be > controlled under the same xenstore path. Could you elaborate on this? What was the benefit? >> The only caveat is the stray unmapping problem, but I think gntdev can >> be modified to deal with that pretty easily. > Not easier than anything else in kernel space, but when dealing only > with the refcounts, that's as as good a place as anwhere else, yes. I think the refcount test is pretty straightforward - if the refcount is 1, then we're the sole owner of the page and we don't need to worry about any other users. 
If it's > 1, then somebody else has it, and we need to make sure it no longer refers to a granted page (which is just a matter of doing a set_pte_atomic() to remap from present to present). Then we'd have a set of frames whose lifetimes are being determined by some other subsystem. We can either maintain a list of them and poll waiting for them to become free, or just release them and let them be managed by the normal kernel lifetime rules (which requires that the memory attached to them be completely normal, of course). J ^ permalink raw reply [flat|nested] 18+ messages in thread
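The refcount test described in the message above can be written down as a small decision routine. This is a toy model, not gntdev code: `Slot`, `unmap_slot` and the string states are hypothetical, and the real work (PTE updates, TLB flushes, grant unmapping) is only named in comments. The `zero_fill` switch stands in for the earlier suggestion in the thread that a zero page could replace the copy.

```python
# Toy model of "replace the mapping if the refcount is elevated".
# Slot and unmap_slot are hypothetical names, not kernel interfaces.
from dataclasses import dataclass


@dataclass
class Slot:
    refcount: int   # references to the page, ours included
    backing: str    # "grant", "local" or "empty"
    data: bytes


def unmap_slot(slot, reusable, zero_fill=False):
    """Return the granted frame to its owner and decide the slot's fate."""
    if slot.refcount == 1:
        # Sole owner: plain unmap, the pfn can take another grant later.
        slot.backing, slot.data = "empty", b""
        reusable.append(slot)
        return "reusable"
    # Stray reference (e.g. a pending retransmit): remap present-to-present
    # onto normal memory so late readers never hit an empty frame.
    slot.data = b"\0" * len(slot.data) if zero_fill else bytes(slot.data)
    slot.backing = "local"
    slot.refcount -= 1   # drop our reference; normal lifetime rules apply
    return "handed_off"
```

A slot that was handed off is not put back on the reusable list, which corresponds to the second option in the message: release the frame and let the rest of the kernel free it when the last reference goes away.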
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-15 18:27 ` Jeremy Fitzhardinge @ 2010-11-16 9:13 ` Daniel Stodden 2010-11-16 17:56 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 18+ messages in thread From: Daniel Stodden @ 2010-11-16 9:13 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: > On 11/12/2010 07:55 PM, Daniel Stodden wrote: > >> Surely this can be dealt with by replacing the mapped granted page with > >> a local copy if the refcount is elevated? > > Yeah. We briefly discussed this when the problem started to pop up > > (again). > > > > I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with > > a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily > > does the job. > > Hm, I'd be a bit concerned that that might cause problems if used > generically. Yeah. It wasn't a problem because all the network backends are on TCP, where one can be rather sure that the dups are going to be properly dropped. Does this hold everywhere ..? -- As mentioned below, the problem is rather in AIO/DIO than being Xen-specific, so you can see the same behavior on bare metal kernels too. A userspace app seeing an AIO complete and then reusing that buffer elsewhere will occassionally resend garbage over the network. How often that would happen in practice probably depends on the popularity of DIO on NFS for normal apps. Not too many, so yeah, very generically one would maybe want to memcpy(). > If the page is being used RO, then replacing with a copy > shouldn't be a problem, but getting a consistent snapshot of a mutable > page is obviously going to be a problem. I think it's safe to assume the problem is limited to writes, so the buffers aren't really mutable. > > On UP that'd be just a matter of disabling interrupts for > > a while. > > Disable for what purpose? You mean to do an exchange of the mapping, or > something else? 
[I was referring to the prospect of doing a gref unmap/map sequence in dom0. Since we do have a swap operation in Xen, just asking flipping the PTE entry and flushing across all vcpus sounds sufficient to me, indeed.] > > I dropped it after it became clear that XS was moving to SMP, where one > > would end up with a full barrier to orchestrate the TLB flushes > > everywhere. Now, the skb runs prone to crash all run in softirq context, > > I wouldn't exactly expect a huge performance win from syncing on that > > kind of thing across all nodes, compared to local memcpy. Nor would I > > want storage stuff to touch locks shared with TCP, that's just not our > > business. Correct me if I'm mistaken. > > I don't follow what your high-level concern is here. If we update the > pte to unmap the granted page, then its up to Xen to arrange for any TLB > flushes to make sure the page is no longer accessible to the domain. We > don't need to do anything explicit, and its independent of whether we're > simply unmapping the page or replacing the mapping with another one (ie, > any TLB flushing necessary is already going on). > > I was also under the impression that this is a relatively rare event > rather than something that's going to be necessary for every page, so > the overhead should be minimal. > > >>> If zero-copy becomes more attractive again, the plan would be to > >>> rather use grantdev in userspace, such as a filter driver for tapdisk > >>> instead. Until then, there's presumably a notable difference in L2 > >>> cache footprint. Then again, there's also a whole number of cycles not > >>> spent in redundant hypercalls now, to deal with the pseudophysical > >>> map. > >> Frankly, I think the idea of putting blkback+tapdisk entirely in > >> usermode is all upside with no (obvious) downsides. It: > >> > >> 1. avoids having to upstream anything > >> 2. avoids having to upstream anything > >> 3. avoids having to upstream anything > >> > >> 4. 
gets us back zero-copy (if that's important) > > (No, unfortunately. DIO on a granted frame under blktap would be as > > vulnerable as DIO on a granted frame under a userland blkback, userland > > won't buy us anthing as far as the zcopy side of things is concerned). > > Why's that? Are you talking about the stray page reference > vulnerability, or something else? You're right that it doesn't really > help with stray references because its still kernel code which ends up > doing the dereference rather than usermode. [Still was talking about the stray reference issues]. > >> 5. makes the IO path nice and straightforward > >> 6. seems to address all the other problems you mentioned > > I'm not at all against a userland blkback. Just wouldn't go as far as > > considering this a silver bullet. > > > > The main thing I'm scared of is ending up hacking cheesy stuff into the > > user ABI to take advantage of things immediately available to FSes on > > the bio layer, but harder (or at least slower) to get made available to > > userland. > > But that hasn't been a problem for tapdisk so far. Given that it is > using DIO then any completed write has at least been submitted to a > storage device, so then its just a question of how to make sure that any > buffers are fully flushed, which would be fdatasync() I guess. > Besides, our requirements are hardly unique; if there's some clear need > for a new API, then the course of action is to work with broader Linux > community to work out what it should be and how it should be > implemented. There's no need for "hacking cheesy stuff" - there's been > enough of that already. > > DISCARD support is one currently hot example, do you see that in AIO > > somewhere? Ok, probably a good thing for everybody anyway, so maybe > > patching that is useful work. But it's extra work right now and probably > > no more fun to maintain than blktap is. > > fallocate() is being extended to allow hole-punching in files. 
I don't > know what work is being done to do discard on random block devices, but > that's clearly a generically useful thing to have. Hole punching for userland being in the works is actually great news. Just saw the patchset on ext4-devel flying by, great. > > The second issue I see is the XCP side of things. XenServer got a lot of > > benefit out of blktap2, and particularly because of the tapdevs. It > > promotes a fairly rigorous split between a blkback VBD, controlled by > > the agent, and tapdevs, controlled by XS's storage manager. > > > > That doesn't prevent blkback to go into userspace, but it better won't > > share a process with some libblktap, which in turn would better not be > > controlled under the same xenstore path. > > > Could you elaborate on this? What was the benefit? It's been mainly a matter of who controls what. Blktap1 was basically a VBD, controlled by the agent. Blktap2 is a VDI represented as a block device. Leaving management of that to XCP's storage manager, which just hands that device node over to Xapi, simplified many things. Before, the agent had to understand a lot about the type of storage, then talk to the right backend accordingly. Worse, in order to have storage management control a couple datapath features, you'd basically have to talk to Xapi, which would talk through xenstore to blktap, which was a bit tedious. :) Merging the VDI side of things with VBDs again just didn't sound so attractive at first sight. But I've been thinking a little longer about the issues in the meantime. I think it's actually not so hard to do that without collateral damage. Let's say we create an extension to tapdisk which speaks blkback's datapath in userland. We'd basically put one of those tapdisks on every storage node, independent of the image type, such as a bare LUN or a VHD. We add a couple additional IPC calls to make it directly connect/disconnect to/from (ring-ref,event-channel) pairs.
Means it doesn't even need to talk xenstore, the control plane could all be left to some single daemon, which knows how to instruct the right tapdev (via libblktapctl) by looking at the physical-device node. I guess getting the control stuff out of the kernel is always a good idea. There are some important parts which would go missing. Such as ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to gntmap 352 pages simultaneously isn't so good, so there still needs to be some bridge arbitrating them. I'd rather keep that in kernel space, okay to cram stuff like that into gntdev? It'd be much more straightforward than IPC. Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. Can't find it now, what happened? Without, there's presently still no zero-copy. Once the issues were solved, it'd be kinda nice. Simplifies stuff like memshr for blktap, which depends on getting hold of original grefs. We'd presumably still need the tapdev nodes, for qemu, etc. But those can stay non-xen aware then. > >> The only caveat is the stray unmapping problem, but I think gntdev can > >> be modified to deal with that pretty easily. > > Not easier than anything else in kernel space, but when dealing only > > with the refcounts, that's as as good a place as anwhere else, yes. > > I think the refcount test is pretty straightforward - if the refcount is > 1, then we're the sole owner of the page and we don't need to worry > about any other users. If its > 1, then somebody else has it, and we > need to make sure it no longer refers to a granted page (which is just a > matter of doing a set_pte_atomic() to remap from present to present). [set_pte_atomic over grant ptes doesn't work, or does it?] > Then we'd have a set of frames whose lifetimes are being determined by > some other subsystem. 
We can either maintain a list of them and poll > waiting for them to become free, or just release them and let them be > managed by the normal kernel lifetime rules (which requires that the > memory attached to them be completely normal, of course). The latter sounds like a good alternative to polling. So an unmap_and_replace, and giving up ownership thereafter. Next run of the dispatcher thread can just refill the foreign pfn range via alloc_empty_pages(), to rebalance. Daniel ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-16 9:13 ` Daniel Stodden @ 2010-11-16 17:56 ` Jeremy Fitzhardinge 2010-11-16 21:28 ` Daniel Stodden 0 siblings, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2010-11-16 17:56 UTC (permalink / raw) To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com On 11/16/2010 01:13 AM, Daniel Stodden wrote: > On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: >> On 11/12/2010 07:55 PM, Daniel Stodden wrote: >>>> Surely this can be dealt with by replacing the mapped granted page with >>>> a local copy if the refcount is elevated? >>> Yeah. We briefly discussed this when the problem started to pop up >>> (again). >>> >>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with >>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily >>> does the job. >> Hm, I'd be a bit concerned that that might cause problems if used >> generically. > Yeah. It wasn't a problem because all the network backends are on TCP, > where one can be rather sure that the dups are going to be properly > dropped. > > Does this hold everywhere ..? -- As mentioned below, the problem is > rather in AIO/DIO than being Xen-specific, so you can see the same > behavior on bare metal kernels too. A userspace app seeing an AIO > complete and then reusing that buffer elsewhere will occassionally > resend garbage over the network. Yeah, that sounds like a generic security problem. I presume the protocol will just discard the excess retransmit data, but it might mean a usermode program ends up transmitting secrets it never intended to... > There are some important parts which would go missing. Such as > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to > gntmap 352 pages simultaneously isn't so good, so there still needs to > be some bridge arbitrating them. I'd rather keep that in kernel space, > okay to cram stuff like that into gntdev? 
It'd be much more > straightforward than IPC. What's the problem? If you do nothing then it will appear to the kernel as a bunch of processes doing memory allocations, and they'll get blocked/rate-limited accordingly if memory is getting short. There are plenty of existing mechanisms to control that sort of thing (cgroups, etc) without adding anything new to the kernel. Or are you talking about something other than simple memory pressure? And there are plenty of existing IPC mechanisms if you want them to explicitly coordinate with each other, but I'd tend to think that's premature unless you have something specific in mind. > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > Can't find it now, what happened? Without, there's presently still no > zero-copy. gntdev doesn't need VM_FOREIGN any more - it uses the (relatively new-ish) mmu notifier infrastructure which is intended to allow a device to sync an external MMU with usermode mappings. We're not using it in precisely that way, but it allows us to wrangle grant mappings before the generic code tries to do normal pte ops on them. > Once the issues were solved, it'd be kinda nice. Simplifies stuff like > memshr for blktap, which depends on getting hold of original grefs. > > We'd presumably still need the tapdev nodes, for qemu, etc. But those > can stay non-xen aware then. > >>>> The only caveat is the stray unmapping problem, but I think gntdev can >>>> be modified to deal with that pretty easily. >>> Not easier than anything else in kernel space, but when dealing only >>> with the refcounts, that's as as good a place as anwhere else, yes. >> I think the refcount test is pretty straightforward - if the refcount is >> 1, then we're the sole owner of the page and we don't need to worry >> about any other users.
If its > 1, then somebody else has it, and we >> need to make sure it no longer refers to a granted page (which is just a >> matter of doing a set_pte_atomic() to remap from present to present). > [set_pte_atomic over grant ptes doesn't work, or does it?] No, I forgot about grant ptes' magic properties. But there is the hypercall. >> Then we'd have a set of frames whose lifetimes are being determined by >> some other subsystem. We can either maintain a list of them and poll >> waiting for them to become free, or just release them and let them be >> managed by the normal kernel lifetime rules (which requires that the >> memory attached to them be completely normal, of course). > The latter sounds like a good alternative to polling. So an > unmap_and_replace, and giving up ownership thereafter. Next run of the > dispatcher thread can can just refill the foreign pfn range via > alloc_empty_pages(), to rebalance. Do we actually need a "foreign page range"? Won't any pfn do? If we start with a specific range of foreign pfns and then start freeing those pfns back to the kernel, we won't have one for long... J ^ permalink raw reply [flat|nested] 18+ messages in thread
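The refcount test and the unmap-and-replace fallback discussed in the message above can be sketched as a toy model. This is only an illustration in Python, not kernel code: `Frame`, `release_request_frames` and the `alloc_page` callback are hypothetical names, and the real operations (the grant-table hypercall, the order-0 refill) are reduced to flag changes.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical stand-in for a backend page slot."""
    pfn: int
    refcount: int = 1      # 1 means the backend is the sole holder
    granted: bool = True   # currently backed by a granted (foreign) frame


def release_request_frames(frames, alloc_page):
    """Model the end of a request: returns (kept, abandoned).

    refcount == 1: plain unmap, the slot stays with the backend.
    refcount  > 1: someone else (e.g. a pending retransmit) still holds
                   the page, so the grant mapping is replaced by normal
                   memory, the page is given up, and the slot is refilled
                   with a fresh page from alloc_page().
    """
    kept, abandoned = [], []
    for frame in frames:
        frame.granted = False          # either way the grant goes away
        if frame.refcount > 1:
            abandoned.append(frame)    # lifetime now follows normal kernel rules
            kept.append(alloc_page())  # rebalance the pool
        else:
            kept.append(frame)
    return kept, abandoned
```

In this model the pool size stays constant: every abandoned frame is matched by one freshly allocated replacement, which mirrors the "refill to rebalance" step mentioned in the thread.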
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-16 17:56 ` Jeremy Fitzhardinge @ 2010-11-16 21:28 ` Daniel Stodden 2010-11-17 18:00 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 18+ messages in thread From: Daniel Stodden @ 2010-11-16 21:28 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote: > On 11/16/2010 01:13 AM, Daniel Stodden wrote: > > On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: > >> On 11/12/2010 07:55 PM, Daniel Stodden wrote: > >>>> Surely this can be dealt with by replacing the mapped granted page with > >>>> a local copy if the refcount is elevated? > >>> Yeah. We briefly discussed this when the problem started to pop up > >>> (again). > >>> > >>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with > >>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily > >>> does the job. > >> Hm, I'd be a bit concerned that that might cause problems if used > >> generically. > > Yeah. It wasn't a problem because all the network backends are on TCP, > > where one can be rather sure that the dups are going to be properly > > dropped. > > > > Does this hold everywhere ..? -- As mentioned below, the problem is > > rather in AIO/DIO than being Xen-specific, so you can see the same > > behavior on bare metal kernels too. A userspace app seeing an AIO > > complete and then reusing that buffer elsewhere will occassionally > > resend garbage over the network. > > Yeah, that sounds like a generic security problem. I presume the > protocol will just discard the excess retransmit data, but it might mean > a usermode program ends up transmitting secrets it never intended to... > > > There are some important parts which would go missing. 
Such as > > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to > > gntmap 352 pages simultaneously isn't so good, so there still needs to > > be some bridge arbitrating them. I'd rather keep that in kernel space, > > okay to cram stuff like that into gntdev? It'd be much more > > straightforward than IPC. > > What's the problem? If you do nothing then it will appear to the kernel > as a bunch of processes doing memory allocations, and they'll get > blocked/rate-limited accordingly if memory is getting short. The problem is that just letting the page allocator work through allocations isn't going to scale anywhere. The worst case memory requested under load is <number-of-disks> * (32 * 11 pages). As a (conservative) rule of thumb, N will be 200 or rather better. The number of I/O actually in-flight at any point, in contrast, is derived from the queue/sg sizes of the physical device. For a simple disk, that's about a ring or two. > There's > plenty of existing mechanisms to control that sort of thing (cgroups, > etc) without adding anything new to the kernel. Or are you talking > about something other than simple memory pressure? > > And there's plenty of existing IPC mechanisms if you want them to > explicitly coordinate with each other, but I'd tend to thing that's > premature unless you have something specific in mind. > > > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > > Can't find it now, what happened? Without, there's presently still no > > zero-copy. > > gntdev doesn't need VM_FOREIGN any more - it uses the (relatively > new-ish) mmu notifier infrastructure which is intended to allow a device > to sync an external MMU with usermode mappings. We're not using it in > precisely that way, but it allows us to wrangle grant mappings before > the generic code tries to do normal pte ops on them. The mmu notifiers were for safe teardown only. They are not sufficient for DIO, which wants gup() to work. 
If you want zcopy on gntdev, we'll need to back those VMAs with page structs. Or bounce again (gulp, just mentioning it). As with the blktap2 patches, note there is no difference in the dom0 memory bill, it takes page frames.

This is pretty much exactly the pooling stuff in current drivers/blktap. The interface could look as follows ([] designates users).

 * [toolstack] Calling some ctls to create/destroy ctls pools of frames. (Blktap currently does this in sysfs.)

 * [toolstack] Optionally resize them, according to the physical queue depth [estimate] of the underlying storage.

 * [tapdisk] A backend instance, when starting up, opens a gntdev, then uses a ctl to bind its gntdev handle to a frame pool.

 * [tapdisk] The .mmap call now will allocate frames to back the VMA. This operation can fail/block under congestion. Neither is desirable, so we need a .poll.

 * [tapdisk] To integrate grant mappings with a single-threaded event loop, use .poll. The handle fires as soon as a request can be mapped. Under congestion, the .poll code will queue waiting disks and wake them round-robin, once VMAs are released.

(A [tapdisk] doesn't mean to dismiss a potential [qemu].)

Still confident we want that? (Seriously asking). A lot of the code to do so has been written for blktap, it wouldn't be hard to bend into a gntdev extension.

> > Once the issues were solved, it'd be kinda nice. Simplifies stuff like
> > memshr for blktap, which depends on getting hold of original grefs.
> >
> > We'd presumably still need the tapdev nodes, for qemu, etc. But those
> > can stay non-xen aware then.
> >
> >>>> The only caveat is the stray unmapping problem, but I think gntdev can
> >>>> be modified to deal with that pretty easily.
> >>> Not easier than anything else in kernel space, but when dealing only
> >>> with the refcounts, that's as as good a place as anwhere else, yes.
> >> I think the refcount test is pretty straightforward - if the refcount is > >> 1, then we're the sole owner of the page and we don't need to worry > >> about any other users. If its > 1, then somebody else has it, and we > >> need to make sure it no longer refers to a granted page (which is just a > >> matter of doing a set_pte_atomic() to remap from present to present). > > [set_pte_atomic over grant ptes doesn't work, or does it?] > > No, I forgot about grant ptes magic properties. But there is the hypercall. Yup. > >> Then we'd have a set of frames whose lifetimes are being determined by > >> some other subsystem. We can either maintain a list of them and poll > >> waiting for them to become free, or just release them and let them be > >> managed by the normal kernel lifetime rules (which requires that the > >> memory attached to them be completely normal, of course). > > The latter sounds like a good alternative to polling. So an > > unmap_and_replace, and giving up ownership thereafter. Next run of the > > dispatcher thread can can just refill the foreign pfn range via > > alloc_empty_pages(), to rebalance. > > Do we actually need a "foreign page range"? Won't any pfn do? If we > start with a specific range of foreign pfns and then start freeing those > pfns back to the kernel, we won't have one for long... I guess we've been meaning the same thing here, unless I'm misunderstanding you. Any pfn does, and the balloon pagevec allocations default to order 0 entries indeed. Sorry, you're right, that's not a 'range'. With a pending re-xmit, the backend can find a couple (or all) of the request frames have count>1. It can flip and abandon those as normal memory. But it will need those lost memory slots back, straight away or next time it's running out of frames. As order-0 allocations. Foreign memory is deliberately short. Blkback still defaults to 2 rings worth of address space, iirc, globally. 
That's what that mempool sysfs stuff in the later blktap2 patches aimed at -- making the size configurable where queue length matters, and isolate throughput between physical backends, where the toolstack wants to care. Daniel ^ permalink raw reply [flat|nested] 18+ messages in thread
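The frame-pool arbitration proposed in the message above (non-blocking mapping, with waiting disks queued and woken in order once frames are released) might look roughly like the following toy model. It is a sketch under assumptions, written in Python rather than as a gntdev extension: `FramePool`, `map_request` and `release` are hypothetical names, and a wake-up is modeled by returning the list of disks whose requests became mappable.

```python
from collections import deque


class FramePool:
    """Hypothetical shared pool of frames backing grant mappings."""

    def __init__(self, size):
        self.size = size
        self.free = size
        self.waiting = deque()  # (disk, nr_frames) in arrival order

    def map_request(self, disk, nr_frames):
        """Try to reserve frames without blocking.

        Returns True if the frames were reserved. Otherwise the disk is
        queued and False is returned; the caller would go back to its
        event loop and wait for the handle to fire.
        """
        if nr_frames > self.size:
            raise ValueError("request larger than the whole pool")
        if not self.waiting and nr_frames <= self.free:
            self.free -= nr_frames
            return True
        self.waiting.append((disk, nr_frames))
        return False

    def release(self, nr_frames):
        """Return frames to the pool and wake waiters in queue order.

        Frames are handed to each woken disk directly, so a woken disk
        can map its request immediately.
        """
        self.free += nr_frames
        woken = []
        while self.waiting and self.waiting[0][1] <= self.free:
            disk, wanted = self.waiting.popleft()
            self.free -= wanted
            woken.append(disk)
        return woken
```

Serving the queue strictly from the head keeps a large request from being starved by smaller ones behind it; if every disk re-queues at the tail after being served, the effect is the round-robin wake-up described in the thread.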
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-16 21:28 ` Daniel Stodden @ 2010-11-17 18:00 ` Jeremy Fitzhardinge 2010-11-17 20:21 ` Daniel Stodden 0 siblings, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2010-11-17 18:00 UTC (permalink / raw) To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com On 11/16/2010 01:28 PM, Daniel Stodden wrote: >> What's the problem? If you do nothing then it will appear to the kernel >> as a bunch of processes doing memory allocations, and they'll get >> blocked/rate-limited accordingly if memory is getting short. > The problem is that just letting the page allocator work through > allocations isn't going to scale anywhere. > > The worst case memory requested under load is <number-of-disks> * (32 * > 11 pages). As a (conservative) rule of thumb, N will be 200 or rather > better. Under what circumstances would you end up needing to allocate that many pages? > The number of I/O actually in-flight at any point, in contrast, is > derived from the queue/sg sizes of the physical device. For a simple > disk, that's about a ring or two. Wouldn't that be the worst case? >> There's >> plenty of existing mechanisms to control that sort of thing (cgroups, >> etc) without adding anything new to the kernel. Or are you talking >> about something other than simple memory pressure? >> >> And there's plenty of existing IPC mechanisms if you want them to >> explicitly coordinate with each other, but I'd tend to thing that's >> premature unless you have something specific in mind. >> >>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. >>> Can't find it now, what happened? Without, there's presently still no >>> zero-copy. >> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively >> new-ish) mmu notifier infrastructure which is intended to allow a device >> to sync an external MMU with usermode mappings. 
We're not using it in >> precisely that way, but it allows us to wrangle grant mappings before >> the generic code tries to do normal pte ops on them. > The mmu notifiers were for safe teardown only. They are not sufficient > for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll > need to back those VMAs with page structs. The pages will have struct page, because they're normal kernel pages which happen to be backed by mapped granted pages. Are you talking about the #ifdef CONFIG_XEN code in the middle of __get_user_pages()? Isn't that just there to cope with the nested-IO-on-the-same-page problem that the current blktap architecture provokes? If there's only a single IO on each page - the one initiated by usermode - then it shouldn't be necessary, right? > Or bounce again (gulp, just > mentioning it). As with the blktap2 patches, note there is no difference > in the dom0 memory bill, it takes page frames. (And perhaps actual pages to substitute for the granted pages.) > I guess we've been meaning the same thing here, unless I'm > misunderstanding you. Any pfn does, and the balloon pagevec allocations > default to order 0 entries indeed. Sorry, you're right, that's not a > 'range'. With a pending re-xmit, the backend can find a couple (or all) > of the request frames have count>1. It can flip and abandon those as > normal memory. But it will need those lost memory slots back, straight > away or next time it's running out of frames. As order-0 allocations. Right. GFP_KERNEL order 0 allocations are pretty reliable; they only fail if the system is under extreme memory pressure. And it has the nice property that if those allocations block or fail it rate limits IO ingress from domains rather than being crushed by memory pressure at the backend (ie, the problem with trying to allocate memory in the writeout path). 
Also the cgroup mechanism looks like an extremely powerful way to control the allocations for a process or group of processes to stop them from dominating the whole machine. J ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-17 18:00 ` Jeremy Fitzhardinge @ 2010-11-17 20:21 ` Daniel Stodden 2010-11-17 21:02 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 18+ messages in thread From: Daniel Stodden @ 2010-11-17 20:21 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com On Wed, 2010-11-17 at 13:00 -0500, Jeremy Fitzhardinge wrote: > On 11/16/2010 01:28 PM, Daniel Stodden wrote: > >> What's the problem? If you do nothing then it will appear to the kernel > >> as a bunch of processes doing memory allocations, and they'll get > >> blocked/rate-limited accordingly if memory is getting short. > > The problem is that just letting the page allocator work through > > allocations isn't going to scale anywhere. > > > > The worst case memory requested under load is <number-of-disks> * (32 * > > 11 pages). As a (conservative) rule of thumb, N will be 200 or rather > > better. > > Under what circumstances would you end up needing to allocate that many > pages? I don't. Independently running tapdisks would do, on behalf of guests queuing I/O. That's why one wouldn't just let them run and allocate their own memory. The memory space set aside for I/O should be a shared resource. > > The number of I/O actually in-flight at any point, in contrast, is > > derived from the queue/sg sizes of the physical device. For a simple > > disk, that's about a ring or two. > > Wouldn't that be the worst case? Yes. It's quite small. 2 or 3 megs per physical backend are usually sufficient. > >> There's > >> plenty of existing mechanisms to control that sort of thing (cgroups, > >> etc) without adding anything new to the kernel. Or are you talking > >> about something other than simple memory pressure? > >> > >> And there's plenty of existing IPC mechanisms if you want them to > >> explicitly coordinate with each other, but I'd tend to thing that's > >> premature unless you have something specific in mind.
> >> > >>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > >>> Can't find it now, what happened? Without, there's presently still no > >>> zero-copy. > >> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively > >> new-ish) mmu notifier infrastructure which is intended to allow a device > >> to sync an external MMU with usermode mappings. We're not using it in > >> precisely that way, but it allows us to wrangle grant mappings before > >> the generic code tries to do normal pte ops on them. > > The mmu notifiers were for safe teardown only. They are not sufficient > > for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll > > need to back those VMAs with page structs. > > The pages will have struct page, because they're normal kernel pages > which happen to be backed by mapped granted pages. And, like all granted frames, not owning them implies they are not resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup() without the VM_FOREIGN hack. Correct me if I'm mistaken. I used to be quicker looking up stuff on arch-xen kernels, but I think fundamental constants of the Xen universe didn't change since last time. > Are you talking > about the #ifdef CONFIG_XEN code in the middle of __get_user_pages()? > Isn't that just there to cope with the nested-IO-on-the-same-page > problem that the current blktap architecture provokes? If there's only > a single IO on each page - the one initiated by usermode - then it > shouldn't be necessary, right? No. Jake brought the aliasing in specifically to get blktap2 working with zero-copy. VM_FOREIGN is much older. Only blktap2 does the recursive thing, because it's a blkdev above some physical dev. Blktap1 went from the guest ring straight down to userland. As would be the case with a gntdev-based blkback. > > Or bounce again (gulp, just > > mentioning it). As with the blktap2 patches, note there is no difference > > in the dom0 memory bill, it takes page frames. 
> > (And perhaps actual pages to substitute for the granted pages.) Well yes, that's right. Still fine as long as there's some relatively small constant bound on the worst case. O(n) for large systems would go in the hundreds of megs. Given that the *reasonable* amount of memory used simultaneously is pretty small in any case, even going through the memory allocator can be skipped. [ Part of the reason why blktap *never* frees those pages, apart from being slightly greedy, are deadlock hazards when writing those nodes in dom0 through the pagecache, as dom0 might. You need memory pools on the datapath to guarantee progress under pressure. That got pretty ugly after 2.6.27, btw. ] In any case, let's skip trying what happens if a thundering herd of several hundred userspace disks tries gfp()ing their grant slots out of dom0 without arbitration. > > I guess we've been meaning the same thing here, unless I'm > > misunderstanding you. Any pfn does, and the balloon pagevec allocations > > default to order 0 entries indeed. Sorry, you're right, that's not a > > 'range'. With a pending re-xmit, the backend can find a couple (or all) > > of the request frames have count>1. It can flip and abandon those as > > normal memory. But it will need those lost memory slots back, straight > > away or next time it's running out of frames. As order-0 allocations. > > Right. GFP_KERNEL order 0 allocations are pretty reliable; they only > fail if the system is under extreme memory pressure. And it has the > nice property that if those allocations block or fail it rate limits IO > ingress from domains rather than being crushed by memory pressure at the > backend (ie, the problem with trying to allocate memory in the writeout > path). > > Also the cgroup mechanism looks like an extremely powerful way to > control the allocations for a process or group of processes to stop them > from dominating the whole machine. Ah.
In case it can be put to work to bind processes allocating pagecache entries for dirtying to some boundary, I'd be really interested. I think I came across it once but didn't take the time to read the docs thoroughly. Can it? Daniel ^ permalink raw reply [flat|nested] 18+ messages in thread
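The memory figures quoted in this exchange can be checked with a few lines of arithmetic. The sketch below takes the 32 requests per ring and 11 segments per request from the thread; the 4 KiB page size is an assumption (typical for x86), and the function names are made up for the example.

```python
PAGE_SIZE = 4096            # assumption: 4 KiB pages
REQUESTS_PER_RING = 32      # from the thread: 32 * 11 pages per disk
SEGMENTS_PER_REQUEST = 11


def pages_per_ring():
    """Pages needed to map one completely full ring."""
    return REQUESTS_PER_RING * SEGMENTS_PER_REQUEST


def worst_case_pages(nr_disks):
    """Pages needed if every disk maps a full ring at the same time."""
    return nr_disks * pages_per_ring()


def to_mib(nr_pages):
    """Convert a page count to MiB under the assumed page size."""
    return nr_pages * PAGE_SIZE / 2**20


print(pages_per_ring())                # 352 pages per ring
print(to_mib(2 * pages_per_ring()))    # 2.75 MiB for two rings
print(to_mib(worst_case_pages(200)))   # 275.0 MiB for 200 disks
```

Under these assumptions, two rings come to 2.75 MiB, in line with the "2 or 3 megs per physical backend" estimate, while 200 unarbitrated disks come to 275 MiB, which is the "hundreds of megs" case.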
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-17 20:21 ` Daniel Stodden @ 2010-11-17 21:02 ` Jeremy Fitzhardinge 2010-11-17 21:57 ` Daniel Stodden 0 siblings, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2010-11-17 21:02 UTC (permalink / raw) To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com On 11/17/2010 12:21 PM, Daniel Stodden wrote: > And, like all granted frames, not owning them implies they are not > resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup() > without the VM_FOREIGN hack. Hm, I see. Well, I wonder if using _PAGE_SPECIAL would help (it is put on usermode ptes which don't have a backing struct page). After all, there's no fundamental reason why it would need a pfn; the mfn in the pte is what's actually needed to ultimately generate a DMA descriptor. > Correct me if I'm mistaken. I used to be quicker looking up stuff on > arch-xen kernels, but I think fundamental constants of the Xen universe > didn't change since last time. No, but Linux has. > [ > Part of the reason why blktap *never* frees those pages, apart from > being slightly greedy, are deadlock hazards when writing those nodes in > dom0 through the pagecache, as dom0 might. You need memory pools on the > datapath to guarantee progress under pressure. That got pretty ugly > after 2.6.27, btw. > ] That's what mempools are intended to solve. > In any case, let's skip trying what happens if a thundering herd of > several hundred userspace disks tries gfp()ing their grant slots out of > dom0 without without arbitration. I'm not against arbitration, but I don't think that's something that should be implemented as part of a Xen driver. >>> I guess we've been meaning the same thing here, unless I'm >>> misunderstanding you. Any pfn does, and the balloon pagevec allocations >>> default to order 0 entries indeed. Sorry, you're right, that's not a >>> 'range'. 
With a pending re-xmit, the backend can find a couple (or all) >>> of the request frames have count>1. It can flip and abandon those as >>> normal memory. But it will need those lost memory slots back, straight >>> away or next time it's running out of frames. As order-0 allocations. >> Right. GFP_KERNEL order 0 allocations are pretty reliable; they only >> fail if the system is under extreme memory pressure. And it has the >> nice property that if those allocations block or fail it rate limits IO >> ingress from domains rather than being crushed by memory pressure at the >> backend (ie, the problem with trying to allocate memory in the writeout >> path). >> >> Also the cgroup mechanism looks like an extremely powerful way to >> control the allocations for a process or group of processes to stop them >> from dominating the whole machine. > Ah. In case it can be put to work to bind processes allocating pagecache > entries for dirtying to some boundary, I'd be really interested. I think > I came across it once but didn't take the time to read the docs > thoroughly. Can it? I'm not sure about dirtyness - it seems like something that should be within its remit, even if it doesn't currently have it. The cgroup mechanism is extremely powerful, now that I look at it. You can do everything from setting block IO priorities and QoS parameters to CPU limits. J ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: blktap: Sync with XCP, dropping zero-copy. 2010-11-17 21:02 ` Jeremy Fitzhardinge @ 2010-11-17 21:57 ` Daniel Stodden 2010-11-17 22:14 ` Jeremy Fitzhardinge 2010-11-17 23:32 ` Daniel Stodden 0 siblings, 2 replies; 18+ messages in thread From: Daniel Stodden @ 2010-11-17 21:57 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote: > On 11/17/2010 12:21 PM, Daniel Stodden wrote: > > And, like all granted frames, not owning them implies they are not > > resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup() > > without the VM_FOREIGN hack. > > Hm, I see. Well, I wonder if using _PAGE_SPECIAL would help (it is put > on usermode ptes which don't have a backing struct page). After all, > there's no fundamental reason why it would need a pfn; the mfn in the > pte is what's actually needed to ultimately generate a DMA descriptor. The kernel needs the page structs at least for locking and refcounting. There's also some trickier stuff in there. Like redirtying disk-backed user memory after read completion, in case it's been laundered. (So that an AIO on unpinned user memory doesn't subsequently get flashed back when cycling through swap, if I understood that thing correctly.) Doesn't apply for blktap (it's all reserved pages). All I mean is: I wouldn't exactly see some innocent little dio hack or so shape up in there. Kernel allowing to DMA into a bare pfnmap -- From the platform POV, I'd agree. E.g. there's a concept of devices DMA-ing into arbitrary I/O memory space, not host memory, on some bus architectures. PCI would come to my mind (the old shared medium stuff, unsure about those newfangled P-t-P topologies). But not in Linux, so I presently don't see anybody upstream bothering to make block-I/O request addressing more forgiving than it is.
PAGE_SPECIAL -- to the kernel, that means the opposite: page structs which aren't backed by 'real' memory, so gup(), for example, is told to fail (how nasty). In contrast, VM_FOREIGN is non-memory backed by page structs. > > Correct me if I'm mistaken. I used to be quicker looking up stuff on > > arch-xen kernels, but I think fundamental constants of the Xen universe > > didn't change since last time. > > No, but Linux has. Not in that respect. There's certainly a way to get VM_FOREIGN out of the mainline code. It would involve an unlikely() branch in .pte_val(=xen_pte_val) to fall back into a private local m2p hash lookup. Assuming that kind of thing gets nowhere inlined. Not nice, but still more upstreamable than VM_FOREIGN. > > [ > > Part of the reason why blktap *never* frees those pages, apart from > > being slightly greedy, are deadlock hazards when writing those nodes in > > dom0 through the pagecache, as dom0 might. You need memory pools on the > > datapath to guarantee progress under pressure. That got pretty ugly > > after 2.6.27, btw. > > ] > > That's what mempools are intended to solve. That's why the blktap frame pool is now a mempool, indeed. > > In any case, let's skip trying what happens if a thundering herd of > > several hundred userspace disks tries gfp()ing their grant slots out of > > dom0 without without arbitration. > > I'm not against arbitration, but I don't think that's something that > should be implemented as part of a Xen driver. Uhm, maybe I'm misunderstanding you, isn't the whole thing a Xen driver? What do you suggest? > >>> I guess we've been meaning the same thing here, unless I'm > >>> misunderstanding you. Any pfn does, and the balloon pagevec allocations > >>> default to order 0 entries indeed. Sorry, you're right, that's not a > >>> 'range'. With a pending re-xmit, the backend can find a couple (or all) > >>> of the request frames have count>1. It can flip and abandon those as > >>> normal memory. 
But it will need those lost memory slots back, straight > >>> away or next time it's running out of frames. As order-0 allocations. > >> Right. GFP_KERNEL order 0 allocations are pretty reliable; they only > >> fail if the system is under extreme memory pressure. And it has the > >> nice property that if those allocations block or fail it rate limits IO > >> ingress from domains rather than being crushed by memory pressure at the > >> backend (ie, the problem with trying to allocate memory in the writeout > >> path). > >> > >> Also the cgroup mechanism looks like an extremely powerful way to > >> control the allocations for a process or group of processes to stop them > >> from dominating the whole machine. > > Ah. In case it can be put to work to bind processes allocating pagecache > > entries for dirtying to some boundary, I'd be really interested. I think > > I came across it once but didn't take the time to read the docs > > thoroughly. Can it? > > I'm not sure about dirtyness - it seems like something that should be > within its remit, even if it doesn't currently have it. > > The cgroup mechanism is extremely powerful, now that I look at it. You > can do everything from setting block IO priorities and QoS parameters to > CPU limits. Thanks. I'll keep it under my pillow then. Daniel ^ permalink raw reply [flat|nested] 18+ messages in thread
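The mempool idea referred to in this exchange (a preallocated reserve that guarantees forward progress when the normal allocator fails) can be illustrated with a small model. This is a Python sketch of the general technique, not the kernel's mempool API: `MemPool`, `get` and `put` are hypothetical names, and where a real pool could sleep until an element is returned, the model simply returns None.

```python
class MemPool:
    """Toy reserve pool: try the normal allocator first, then the reserve."""

    def __init__(self, min_nr, alloc, free):
        self.min_nr = min_nr
        self._alloc = alloc   # returns an object, or None on failure
        self._free = free     # releases an object back to the allocator
        self.reserve = []
        for _ in range(min_nr):
            obj = alloc()
            if obj is None:
                raise MemoryError("could not preallocate the reserve")
            self.reserve.append(obj)

    def get(self):
        """Allocate normally if possible, else dip into the reserve."""
        obj = self._alloc()
        if obj is not None:
            return obj
        if self.reserve:
            return self.reserve.pop()
        return None  # a real pool could wait here for a put()

    def put(self, obj):
        """Refill the reserve first; only surplus objects are freed."""
        if len(self.reserve) < self.min_nr:
            self.reserve.append(obj)
        else:
            self._free(obj)
```

The point of the design is that as long as elements taken from the reserve are eventually returned, at least `min_nr` requests can always be in flight, regardless of how much pressure the underlying allocator is under.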
* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 21:57 ` Daniel Stodden
@ 2010-11-17 22:14 ` Jeremy Fitzhardinge
  [not found] ` <1290035201.11102.1577.camel@agari.van.xensource.com>
  2010-11-17 23:32 ` Daniel Stodden
  1 sibling, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-17 22:14 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/17/2010 01:57 PM, Daniel Stodden wrote:
> On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote:
>> On 11/17/2010 12:21 PM, Daniel Stodden wrote:
>>> And, like all granted frames, not owning them implies they are not
>>> resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup()
>>> without the VM_FOREIGN hack.
>> Hm, I see. Well, I wonder if using _PAGE_SPECIAL would help (it is put
>> on usermode ptes which don't have a backing struct page). After all,
>> there's no fundamental reason why it would need a pfn; the mfn in the
>> pte is what's actually needed to ultimately generate a DMA descriptor.
> The kernel needs the page structs at least for locking and refcounting.

Yeah.

> There's also some trickier stuff in there. Like redirtying disk-backed
> user memory after read completion, in case it's been laundered. (So that
> an AIO on unpinned user memory doesn't subsequently get flashed back
> when cycling through swap, if I understood that thing correctly.)
>
> Doesn't apply for blktap (it's all reserved pages). All I mean is: I
> wouldn't exactly see some innocent little dio hack or so shape up in
> there.
>
> Kernel allowing to DMA into a bare pfnmap -- From the platform POV, I'd
> agree. E.g. there's a concept of devices DMA-ing into arbitrary I/O
> memory space, not host memory, on some bus architectures. PCI would come
> to my mind (the old shared medium stuff, unsure about those newfangled
> P-t-P topologies). But not in Linux, so I presently don't see anybody
> upstream bothering to make block-I/O request addressing more forgiving
> than it is.
>
> PAGE_SPECIAL -- to the kernel, that means the opposite: page structs
> which aren't backed by 'real' memory, so gup(), for example, is told to
> fail (how nasty).

It's pfns with no corresponding struct page - it's the pte level
equivalent of VM_PFNMAP at the VMA level.

But you're right that we can't do without struct pages.

So we're back to needing a way of mapping from a random mfn to a pfn so
we can find the corresponding struct page. I would be tempted to put a
layer over m2p to allow local m2p mappings to override the global m2p
table.

> In contrast, VM_FOREIGN is non-memory backed by page
> structs.

Yep.

    J

^ permalink raw reply [flat|nested] 18+ messages in thread
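The "layer over m2p" idea from the message above can be pictured with a
small userspace toy model. This is only a sketch under stated
assumptions, not kernel code: the class name OverlayM2P, its methods,
and the INVALID_PFN marker are all hypothetical, and plain Python dicts
stand in for both the global m2p table and the local override index
(the thread leaves open whether that index would be a hash, a radix
tree, or an rbtree).

```python
INVALID_PFN = -1  # hypothetical "no translation" marker for this sketch


class OverlayM2P:
    """Toy model: local mfn->pfn overrides layered over a global m2p."""

    def __init__(self, global_m2p):
        # Dict standing in for the global machine-to-physical table.
        self.global_m2p = global_m2p
        # Local overrides, e.g. granted mfns backed by dom0 page structs.
        self.local = {}

    def add_local(self, mfn, pfn):
        # Register a local translation when a foreign frame is mapped.
        self.local[mfn] = pfn

    def remove_local(self, mfn):
        # Drop the override when the frame is unmapped again.
        self.local.pop(mfn, None)

    def mfn_to_pfn(self, mfn):
        # Local entries win; anything undefined falls through to the
        # global table, and finally to the invalid marker.
        if mfn in self.local:
            return self.local[mfn]
        return self.global_m2p.get(mfn, INVALID_PFN)


m2p = OverlayM2P({0x1000: 0x10})
m2p.add_local(0x9ABC, 0x42)       # a granted frame, locally resolvable
own = m2p.mfn_to_pfn(0x1000)      # resolved by the global table
granted = m2p.mfn_to_pfn(0x9ABC)  # resolved by the local override
```

The point of the layering is that the common case (dom0's own frames)
never consults the override index for anything but a miss, while the
few granted frames become resolvable without a VM_FOREIGN-style hack.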
[parent not found: <1290035201.11102.1577.camel@agari.van.xensource.com>]
[parent not found: <4CE46A03.3010104@goop.org>]
[parent not found: <1290040898.11102.1709.camel@agari.van.xensource.com>]
* Re: blktap: Sync with XCP, dropping zero-copy.
  [not found] ` <1290040898.11102.1709.camel@agari.van.xensource.com>
@ 2010-11-18 2:29 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-18 2:29 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/17/2010 04:41 PM, Daniel Stodden wrote:
> On Wed, 2010-11-17 at 18:49 -0500, Jeremy Fitzhardinge wrote:
>> On 11/17/2010 03:06 PM, Daniel Stodden wrote:
>>>> So we're back to needing a way of mapping from a random mfn to a pfn so
>>>> we can find the corresponding struct page. I would be tempted to put a
>>>> layer over m2p to allow local m2p mappings to override the global m2p table.
>>> I think a local m2p lookup on a slow path is a superior option, iff you
>>> do think it's doable. Without e.g. risking to bloat some inline stuff, I
>>> mean.
>>>
>>> Where do you think in the call stack down into pvops code would the
>>> lookup go? Asking because I'd expect the kernel to potentially learn
>>> more tricks with that.
>> I don't think m2p is all that performance critical. p2m is used way more.
>>
>> p2m is already a radix tree;
> Yes, but pfns are dense plug holes, aren't they?

Yes.

>> I think m2p could be done somewhat
>> similarly, where undefined entries fall through to the global m2p. The
>> main problem is probably making sure the memory allocation for new m2p
>> entries can be done in a blocking context, so we don't have to rely on
>> GFP_ATOMIC.
> Whatever the index is going to be, all the backends I'm aware of run
> their rings on a kthread. Sounds to me like GFP_WAIT followed by an
> rwlock is perfectly sufficient. Only the reversal commonly ends up in
> interrupt context.
>
>> That particular m2p lookup would be in xen_pte_val(), but I think that's
>> the callsite for pretty much all m2p lookups.
>>
>>> A mfn->gref mapping would obsolete blkback-pagemap. Well, iff the
>>> kernel-blktap zerocopy stuff wants to be resurrected.
>>>
>>> It would also be a cheap way to implement current blktap to do
>>> virt->gref lookups for tapdisks. Some tapdisk filters want this, present
>>> major example is the memshr patch, and it's sort of nicer than a ring
>>> message hack.
>>>
>>> Wouldn't storing the handle allow unmapping grant ptes on the normal
>>> user PT teardown path? I think we always was this .zap_pte vma-operation
>>> in blktap, iirc. MMU notifier replacement? Maybe not a good one.
>> I think mmu notifiers are fine; this is exactly what they're for after
>> all. Very few address spaces have granted pages mapped into them, so
>> keeping the normal pte operations as fast as possible and using more
>> expensive notifiers for the afflicted mms seems like the right tradeoff.
>>
>> Hm. Before Gerd highlighted mmu notifiers as the right way to deal with
>> granted pages in gntdev, I had the idea of allocating a shadow page for
>> each pte page and storing grant refs there, where the shadow is hung off
>> the pte page's struct page, where set_pte could use it to do the special
>> thing if needed.
>>
>> I wonder if we could do something similar here, where we store the pfn
>> for a given pte in the shadow? But how would it get used? There's no
>> "read pte" pvop, and pte_val gets the pte value, not its address, so
>> there'd be no way to find the shadow. So I guess that doesn't work.
> I'm not sure about the radix variant. All the backends do order-0
> allocations, as discussed above. Depending on the driver pair behavior,
> the mfn ranges can get arbitrarily sparse. The real M2P makes completely
> different assumptions in density and size, or not?

Well, there's two use-cases for the local m2p idea. One is for granted
pages, which are going to be all over the place, but the other is for
hardware mfns, which are more likely to be densely clustered.

A radix tree for grant mfns is likely to be pretty inefficient - but the
worst case is one radix page per mfn, which isn't too bad, since we're
not expecting that many granted mfns to be around. But perhaps a hash or
rbtree would be a better fit. Or we could insist on making the mfns
contiguous.

> Well, maybe one could shadow (cow, actually) just that? Saves the index.
> Likely a dumb idea. :)

I guess Xen won't let us map over the m2p, but maybe we could alias it.
But that's going to waste lots of address space in a 32b dom0.

> What numbers of grant refs do we run? I remember times when the tables
> were bumped up massively, for stuff like pvfb, a long time ago. I guess
> rest remained rather conservative.

We only really need to worry about mfns which are actually going to be
mapped into userspace and guped. We could even propose a (new?) mmu
notifier to prepare for gup so that it can be deferred as late as
possible.

> The shadow idea sounds pretty cool, mainly because the vaddr spaces are
> often contiguous. At least for userland pages.
>
> Thing which would bother me is that settling on a single shadow page
> already limits map entries to sizeof(pte), so ideas like additionally
> mapping to grefs/handles/flag already go out of the window. Or grow
> bigger leaf, but on average that's also more space overhead.

At least we can always fit a pointer into sizeof(pte_t), so there's some
scope for having more storage if needed. But I don't see how it can help
for gup...

    J

^ permalink raw reply [flat|nested] 18+ messages in thread
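The "worst case is one radix page per mfn" remark can be made concrete
with a little arithmetic. The sketch below counts how many leaf pages a
set of mfns would touch in a radix-style index. It rests on assumptions
that are not from the thread: a leaf is taken to be one 4 KiB page of
8-byte entries (512 slots), interior nodes are ignored, and the helper
name radix_leaf_pages is hypothetical.

```python
ENTRIES_PER_LEAF = 4096 // 8  # assumed: 4 KiB leaf page, 8-byte entries


def radix_leaf_pages(mfns, entries_per_leaf=ENTRIES_PER_LEAF):
    """Count distinct leaf pages needed to index the given mfns."""
    # Two mfns share a leaf only if they fall into the same aligned
    # block of entries_per_leaf consecutive frame numbers.
    return len({mfn // entries_per_leaf for mfn in mfns})


# Densely clustered mfns (the "hardware mfns" case): 1024 frames in a row.
dense = radix_leaf_pages(range(1024))

# Scattered grant mfns: 1024 frames, each in a different 512-frame block.
sparse = radix_leaf_pages(range(0, 1024 * ENTRIES_PER_LEAF, ENTRIES_PER_LEAF))
```

Under these assumptions the dense set needs 2 leaf pages while the
scattered set needs 1024, one per mfn, which is the worst case named
above. A hash or rbtree keeps the cost proportional to the number of
entries regardless of how the mfns are spread.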
* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 21:57 ` Daniel Stodden
  2010-11-17 22:14 ` Jeremy Fitzhardinge
@ 2010-11-17 23:32 ` Daniel Stodden
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Stodden @ 2010-11-17 23:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com

On Wed, 2010-11-17 at 16:57 -0500, Daniel Stodden wrote:

> > > In any case, let's skip trying what happens if a thundering herd of
> > > several hundred userspace disks tries gfp()ing their grant slots out of
> > > dom0 without arbitration.
> >
> > I'm not against arbitration, but I don't think that's something that
> > should be implemented as part of a Xen driver.
>
> Uhm, maybe I'm misunderstanding you, isn't the whole thing a Xen driver?
> What do you suggest?

Just misread you, sorry. You mean arbitration via IPC.

Somewhat, but ugly. Just counting pages with cmpxchg in some shm segment
isn't a big deal, agreed. But userspace fixing the counter after
detecting some process crash would already start to complicate that.
Next, someone has to do the ballooning, and you need gntmap to
understand the VMA type anyway. From there on, going the rest of the way
and letting the kernel driver round-robin the frame pool altogether is
much smaller and cleaner.

Daniel

^ permalink raw reply [flat|nested] 18+ messages in thread
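To illustrate what "let the kernel driver round-robin the frame pool"
could mean, here is a minimal userspace model of a central arbiter. It
is a sketch under assumptions, not blktap's implementation: the class
FramePool, its method names, and the one-frame-per-turn policy are all
hypothetical. It also shows why a single owner of the pool makes crash
cleanup simple, since the arbiter itself knows what each client holds.

```python
from collections import OrderedDict


class FramePool:
    """Toy arbiter handing out a fixed set of frames round-robin."""

    def __init__(self, total_frames):
        self.free = total_frames
        self.held = {}                # owner -> frames currently held
        self.waiting = OrderedDict()  # owner -> frames still wanted

    def request(self, owner, count):
        # Queue the demand; frames are granted as they become available.
        self.waiting[owner] = self.waiting.get(owner, 0) + count
        self._distribute()

    def release(self, owner, count):
        # Return frames to the pool, never more than the owner holds.
        count = min(count, self.held.get(owner, 0))
        self.held[owner] = self.held.get(owner, 0) - count
        self.free += count
        self._distribute()

    def owner_died(self, owner):
        # Crash cleanup: forget pending demand, reclaim held frames.
        self.waiting.pop(owner, None)
        self.free += self.held.pop(owner, 0)
        self._distribute()

    def _distribute(self):
        # Grant one frame per turn, then move the owner to the tail of
        # the queue so that no single client can drain the pool.
        while self.free > 0 and self.waiting:
            owner, wanted = self.waiting.popitem(last=False)
            self.held[owner] = self.held.get(owner, 0) + 1
            self.free -= 1
            if wanted > 1:
                self.waiting[owner] = wanted - 1


pool = FramePool(4)
pool.request("tapdisk-a", 3)  # a takes 3 frames, 1 left
pool.request("tapdisk-b", 3)  # b gets the last frame, still wants 2
pool.request("tapdisk-a", 2)  # a queues behind b
pool.release("tapdisk-a", 2)  # 2 frames freed, split between b and a
```

With a shared-memory counter each process would have to police itself
and someone would have to repair the count after a crash; here the
equivalent of owner_died() would simply run when the client goes away.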
end of thread, other threads:[~2010-11-18 2:29 UTC | newest]
Thread overview: 18+ messages
[not found] <20101116215621.59FC2CF782@homiemail-mx7.g.dreamhost.com>
2010-11-17 16:36 ` blktap: Sync with XCP, dropping zero-copy Andres Lagar-Cavilla
2010-11-17 17:52 ` Jeremy Fitzhardinge
2010-11-17 19:47 ` Andres Lagar-Cavilla
2010-11-17 23:42 ` Daniel Stodden
2010-11-12 23:31 Daniel Stodden
2010-11-13 0:50 ` Jeremy Fitzhardinge
2010-11-13 3:56 ` Daniel Stodden
[not found] ` <1289620544.11102.373.camel@agari.van.xensource.com>
2010-11-15 18:27 ` Jeremy Fitzhardinge
2010-11-16 9:13 ` Daniel Stodden
2010-11-16 17:56 ` Jeremy Fitzhardinge
2010-11-16 21:28 ` Daniel Stodden
2010-11-17 18:00 ` Jeremy Fitzhardinge
2010-11-17 20:21 ` Daniel Stodden
2010-11-17 21:02 ` Jeremy Fitzhardinge
2010-11-17 21:57 ` Daniel Stodden
2010-11-17 22:14 ` Jeremy Fitzhardinge
[not found] ` <1290035201.11102.1577.camel@agari.van.xensource.com>
[not found] ` <4CE46A03.3010104@goop.org>
[not found] ` <1290040898.11102.1709.camel@agari.van.xensource.com>
2010-11-18 2:29 ` Jeremy Fitzhardinge
2010-11-17 23:32 ` Daniel Stodden