From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <515F37EE.1000702@linux.vnet.ibm.com>
Date: Fri, 05 Apr 2013 16:45:34 -0400
From: "Michael R. Hines"
MIME-Version: 1.0
References: <20130318212646.GB20406@redhat.com>
 <5147A209.80202@linux.vnet.ibm.com> <20130319081939.GC11259@redhat.com>
 <51487F68.2060305@linux.vnet.ibm.com> <20130319151606.GA13649@redhat.com>
 <51488521.4010909@linux.vnet.ibm.com> <20130319153658.GA14317@redhat.com>
 <51489BC3.3030504@linux.vnet.ibm.com> <51489D05.2000400@redhat.com>
 <5148A52E.6020208@linux.vnet.ibm.com> <20130321061159.GA28328@redhat.com>
In-Reply-To: <20130321061159.GA28328@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose
 documentation of the RDMA transport
To: "Michael S. Tsirkin"
Cc: aliguori@us.ibm.com, qemu-devel@nongnu.org, owasserm@redhat.com,
 abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, Paolo Bonzini

On 03/21/2013 02:11 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>> I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)
>>
>> So, infiniband is not smart enough to know how to avoid pinning a
>> zero page, I guess.
>>
>> - Michael
>>
>> On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
>>> On 19/03/2013 18:09, Michael R. Hines wrote:
>>>> Allowing QEMU to swap due to a cgroup limit during migration is a
>>>> viable overcommit option?
>>>>
>>>> I'm trying to keep an open mind, but that would kill the migration
>>>> time.....
>>> Would it swap? Doesn't the kernel back all zero pages with a single
>>> copy-on-write page? If that still counts towards cgroup limits, it
>>> would be a bug.
>>>
>>> Old kernels do not have a shared zero hugepage, and that includes some
>>> distro kernels. Perhaps that's the problem.
>>>
>>> Paolo
>>>
> It really shouldn't break COW if you don't request LOCAL_WRITE.
> I think it's a kernel bug, and apparently it has been there since the
> first version of the code: the get_user_pages() parameters are swapped.
>
> I'll send a patch. If it's applied, you should also change your
> code from
>
> +                     IBV_ACCESS_LOCAL_WRITE |
> +                     IBV_ACCESS_REMOTE_WRITE |
> +                     IBV_ACCESS_REMOTE_READ);
>
> to
>
> +                     IBV_ACCESS_REMOTE_READ);
>
> on the send side.
> Then, each time we detect that a page has changed, we must make sure
> to unregister and re-register it. Or, if you want to be very smart,
> check that the PFN didn't change and re-register only if it did.
>
> This will make overcommit work.
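The shared zero page and the COW break discussed above are easy to
observe from userspace. Here is a minimal sketch (mine, not code from
the patch) that watches a page's PFN through Linux's /proc/self/pagemap
interface; note that newer kernels hide PFNs from unprivileged readers:

    /* pagemap entries are 64 bits: bit 63 = present, bits 0-54 = PFN
     * (see Documentation/vm/pagemap.txt). Two untouched anonymous pages
     * should report the same PFN (the shared zero page) until written. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    static uint64_t pfn_of(int fd, void *addr)
    {
        uint64_t entry;
        long psize = sysconf(_SC_PAGESIZE);
        off_t off = ((uintptr_t)addr / psize) * sizeof(entry);

        if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
            return 0;
        if (!(entry & (1ULL << 63)))       /* page not present */
            return 0;
        return entry & ((1ULL << 55) - 1); /* PFN lives in bits 0-54 */
    }

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        char *p = mmap(NULL, 2 * psize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (fd < 0 || p == MAP_FAILED)
            return 1;

        volatile char c = p[0] + p[psize]; /* read faults map the zero page */
        (void)c;
        printf("after read:  pfn %llx / %llx (same = shared zero page)\n",
               (unsigned long long)pfn_of(fd, p),
               (unsigned long long)pfn_of(fd, p + psize));

        p[0] = 1;                          /* write fault: COW, private frame */
        printf("after write: pfn %llx / %llx (first one changed)\n",
               (unsigned long long)pfn_of(fd, p),
               (unsigned long long)pfn_of(fd, p + psize));
        return 0;
    }

If pinning a region for RDMA with write access behaves like the write
fault above, every zero page gets its own frame, which would be
consistent with the cgroup numbers below.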
Unfortunately, RDMA + cgroups still kills QEMU: I removed the *_WRITE
flags and did a test like this:

1. Start QEMU with 2GB of RAM configured

       $ cd /sys/fs/cgroup/memory/libvirt/qemu
       $ echo "-1" > memory.memsw.limit_in_bytes
       $ echo "-1" > memory.limit_in_bytes
       $ echo $(pidof qemu-system-x86_64) > tasks
       $ echo 512M > memory.limit_in_bytes       # maximum RSS
       $ echo 3G > memory.memsw.limit_in_bytes   # maximum RSS + swap, extra 1G to be safe

2. Start the RDMA migration
3. The RSS limit of 512M is reached
4. Swap starts filling up
5. The kernel kills QEMU
6. dmesg:

[ 2981.657135] Task in /libvirt/qemu killed as a result of limit of /libvirt/qemu
[ 2981.657140] memory: usage 524288kB, limit 524288kB, failcnt 18031
[ 2981.657143] memory+swap: usage 525460kB, limit 3145728kB, failcnt 0
[ 2981.657146] Mem-Info:
[ 2981.657148] Node 0 DMA per-cpu:
[ 2981.657152] CPU 0: hi: 0, btch: 1 usd: 0
[ 2981.657155] CPU 1: hi: 0, btch: 1 usd: 0
[ 2981.657157] CPU 2: hi: 0, btch: 1 usd: 0
[ 2981.657160] CPU 3: hi: 0, btch: 1 usd: 0
[ 2981.657163] CPU 4: hi: 0, btch: 1 usd: 0
[ 2981.657165] CPU 5: hi: 0, btch: 1 usd: 0
[ 2981.657167] CPU 6: hi: 0, btch: 1 usd: 0
[ 2981.657170] CPU 7: hi: 0, btch: 1 usd: 0
[ 2981.657172] Node 0 DMA32 per-cpu:
[ 2981.657176] CPU 0: hi: 186, btch: 31 usd: 160
[ 2981.657178] CPU 1: hi: 186, btch: 31 usd: 22
[ 2981.657181] CPU 2: hi: 186, btch: 31 usd: 179
[ 2981.657184] CPU 3: hi: 186, btch: 31 usd: 6
[ 2981.657186] CPU 4: hi: 186, btch: 31 usd: 21
[ 2981.657189] CPU 5: hi: 186, btch: 31 usd: 15
[ 2981.657191] CPU 6: hi: 186, btch: 31 usd: 19
[ 2981.657194] CPU 7: hi: 186, btch: 31 usd: 22
[ 2981.657196] Node 0 Normal per-cpu:
[ 2981.657200] CPU 0: hi: 186, btch: 31 usd: 44
[ 2981.657202] CPU 1: hi: 186, btch: 31 usd: 58
[ 2981.657205] CPU 2: hi: 186, btch: 31 usd: 156
[ 2981.657207] CPU 3: hi: 186, btch: 31 usd: 107
[ 2981.657210] CPU 4: hi: 186, btch: 31 usd: 44
[ 2981.657213] CPU 5: hi: 186, btch: 31 usd: 70
[ 2981.657215] CPU 6: hi: 186, btch: 31 usd: 76
[ 2981.657218] CPU 7: hi: 186, btch: 31 usd: 173
[ 2981.657223] active_anon:181703 inactive_anon:68856 isolated_anon:0
[ 2981.657224] active_file:66881 inactive_file:141056 isolated_file:0
[ 2981.657225] unevictable:2174 dirty:6 writeback:0 unstable:0
[ 2981.657226] free:4058168 slab_reclaimable:5152 slab_unreclaimable:10785
[ 2981.657227] mapped:7709 shmem:192 pagetables:1913 bounce:0
[ 2981.657230] Node 0 DMA free:15896kB min:56kB low:68kB high:84kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2981.657242] lowmem_reserve[]: 0 1966 18126 18126
[ 2981.657249] Node 0 DMA32 free:1990652kB min:7324kB low:9152kB high:10984kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2013280kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2981.657260] lowmem_reserve[]: 0 0 16160 16160
[ 2981.657268] Node 0 Normal free:14226124kB min:60200kB low:75248kB high:90300kB active_anon:726812kB inactive_anon:275424kB active_file:267524kB inactive_file:564224kB unevictable:8696kB isolated(anon):0kB isolated(file):0kB present:16547840kB mlocked:6652kB dirty:24kB writeback:0kB mapped:30832kB shmem:768kB slab_reclaimable:20608kB slab_unreclaimable:43140kB kernel_stack:1784kB pagetables:7652kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2981.657281] lowmem_reserve[]: 0 0 0 0
[ 2981.657289] Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15896kB
[ 2981.657307] Node 0 DMA32: 17*4kB 9*8kB 7*16kB 4*32kB 8*64kB 5*128kB 6*256kB 4*512kB 3*1024kB 6*2048kB 481*4096kB = 1990652kB
[ 2981.657325] Node 0 Normal: 2*4kB 1*8kB 991*16kB 893*32kB 271*64kB 50*128kB 50*256kB 12*512kB 5*1024kB 1*2048kB 3450*4096kB = 14225504kB
[ 2981.657343] 277718 total pagecache pages
[ 2981.657345] 68816 pages in swap cache
[ 2981.657348] Swap cache stats: add 656848, delete 588032, find 19850/22338
[ 2981.657350] Free swap  = 15288376kB
[ 2981.657353] Total swap = 15564796kB
[ 2981.706982] 4718576 pages RAM
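For reference, the "re-register only if the PFN changed" idea would look
something like this on the send side. This is only a sketch, not code
from the patch: chunk_refresh() and read_pfn() are made-up names, and
read_pfn() would do a pagemap lookup like the one sketched earlier
(tracking just the first page of a chunk as a simplification):

    #include <infiniband/verbs.h>
    #include <stddef.h>
    #include <stdint.h>

    struct chunk {
        void          *addr;
        size_t         len;
        uint64_t       pfn;   /* PFN recorded at registration time */
        struct ibv_mr *mr;
    };

    /* hypothetical helper: pagemap lookup, see the earlier sketch */
    extern uint64_t read_pfn(void *addr);

    /* Re-create the MR only when the backing frame moved (e.g. after a
     * COW fault); otherwise the old registration is still valid. */
    static int chunk_refresh(struct ibv_pd *pd, struct chunk *c)
    {
        uint64_t now = read_pfn(c->addr);

        if (c->mr && now == c->pfn)
            return 0;

        if (c->mr && ibv_dereg_mr(c->mr))
            return -1;

        /* send side: no LOCAL_WRITE, so pinning should not break COW
         * once the kernel fix is in */
        c->mr = ibv_reg_mr(pd, c->addr, c->len, IBV_ACCESS_REMOTE_READ);
        if (!c->mr)
            return -1;

        c->pfn = now;
        return 0;
    }

As the dmesg above shows, though, removing the *_WRITE flags alone was
not enough to keep QEMU under the cgroup limit.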