From mboxrd@z Thu Jan  1 00:00:00 1970
From: Nomen Nescio
Subject: Re: remus trouble
Date: Wed, 7 Jul 2010 14:12:09 +0200
Message-ID: <20100707121209.GO9918@puscii.nl>
References: <20100706113654.GN9918@puscii.nl> <20100706180212.GE13388@kremvax.cs.ubc.ca>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100706180212.GE13388@kremvax.cs.ubc.ca>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Hey Brendan & all,

> > I ran into some problems trying remus on xen4.0.1rc4 with the 2.6.31.13
> > dom0 (checkout from yesterday):
>
> What's your domU kernel? pvops support was recently added to dom0, but
> still doesn't work for domU.

Ah, that explains a few things; however, similar behaviour occurs with hvm.
Remus starts and spits out the following output:

qemu logdirty mode: enable
 1: sent 267046, skipped 218, delta 8962ms, dom0 68%, target 0%, sent 976Mb/s, dirtied 1Mb/s 290 pages
 2: sent 290, skipped 0, delta 12ms, dom0 66%, target 0%, sent 791Mb/s, dirtied 43Mb/s 16 pages
 3: sent 16, skipped 0, Start last iteration
PROF: suspending at 1278503125.101352
issuing HVM suspend hypercall
suspend hypercall returned 0
pausing QEMU
SUSPEND shinfo 000fffff
delta 11ms, dom0 18%, target 0%, sent 47Mb/s, dirtied 47Mb/s 16 pages
 4: sent 16, skipped 0, delta 5ms, dom0 20%, target 0%, sent 104Mb/s, dirtied 104Mb/s 16 pages
Total pages sent= 267368 (0.25x)
(of which 0 were fixups)
All memory is saved
PROF: resumed at 1278503125.111614
resuming QEMU
Sending 6017 bytes of QEMU state
PROF: flushed memory at 1278503125.112014

and then it seems to become inactive.
The ps tree looks like this:

root      4756  0.4  0.1  82740 11040 pts/0  SLl+ 13:45  0:03 /usr/bin/python /usr/bin/remus --no-net remus1 backup

According to strace, it's stuck reading fd 6, which is a FIFO:
/var/run/tap/remus_nas1_9000.msg

The domU comes up in blocked state on the backup machine and seems to run
fine there. However, xm list on the primary shows no state whatsoever:

Domain-0                 0 10208    12 r-----  468.6
remus1                   1  1024     1 ------   41.8

And after a ctrl-c, remus segfaults:

remus[4756]: segfault at 0 ip 00007f3f49cc7376 sp 00007fffec999fd8 error 4 in libc-2.11.1.so[7f3f49ba1000+178000]

> Are these in dom0 or the primary domU? Looks a bit like dom0, but I
> haven't seen these before.

Those were in dom0. This time dmesg shows output after destroying the domU
on the primary:

[ 1920.059226] INFO: task xenwatch:55 blocked for more than 120 seconds.
[ 1920.059262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.059315] xenwatch      D 0000000000000000     0    55      2 0x00000000
[ 1920.059363]  ffff8802e2e656c0 0000000000000246 0000000000011200 0000000000000000
[ 1920.059439]  ffff8802e2e65720 0000000000000000 ffff8802d55d20c0 00000001001586b3
[ 1920.059520]  ffff8802e2e683b0 000000000000f668 00000000000153c0 ffff8802e2e683b0
[ 1920.059592] Call Trace:
[ 1920.059626]  [] io_schedule+0x2d/0x40
[ 1920.059661]  [] get_request_wait+0xe9/0x1c0
[ 1920.059695]  [] ? autoremove_wake_function+0x0/0x40
[ 1920.059732]  [] ? elv_merge+0x37/0x200
[ 1920.059765]  [] __make_request+0xa1/0x470
[ 1920.059800]  [] ? xen_restore_fl_direct_end+0x0/0x1
[ 1920.059837]  [] ? retint_restore_args+0x5/0x6
[ 1920.059874]  [] generic_make_request+0x17c/0x4a0
[ 1920.059909]  [] ? mempool_alloc+0x56/0x140
[ 1920.059946]  [] ? xen_force_evtchn_callback+0xd/0x10
[ 1920.059979]  [] submit_bio+0x78/0xf0
[ 1920.060013]  [] submit_bh+0xf9/0x140
[ 1920.060046]  [] __block_write_full_page+0x1e0/0x3a0
[ 1920.060080]  [] ? end_buffer_async_write+0x0/0x1f0
[ 1920.060116]  [] ? blkdev_get_block+0x0/0x70
[ 1920.060151]  [] ? blkdev_get_block+0x0/0x70
[ 1920.060186]  [] ? end_buffer_async_write+0x0/0x1f0
[ 1920.060222]  [] block_write_full_page_endio+0xe1/0x120
[ 1920.060259]  [] ? check_events+0x12/0x20
[ 1920.060294]  [] block_write_full_page+0x15/0x20
[ 1920.060330]  [] blkdev_writepage+0x18/0x20
[ 1920.060365]  [] __writepage+0x17/0x40
[ 1920.060399]  [] write_cache_pages+0x227/0x4d0
[ 1920.060434]  [] ? __writepage+0x0/0x40
[ 1920.060469]  [] ? xen_restore_fl_direct_end+0x0/0x1
[ 1920.060504]  [] generic_writepages+0x24/0x30
[ 1920.060539]  [] do_writepages+0x2d/0x50
[ 1920.060576]  [] __filemap_fdatawrite_range+0x5b/0x60
[ 1920.060613]  [] filemap_fdatawrite+0x1f/0x30
[ 1920.060646]  [] filemap_write_and_wait+0x35/0x50
[ 1920.060681]  [] __sync_blockdev+0x24/0x50
[ 1920.060716]  [] sync_blockdev+0x13/0x20
[ 1920.060748]  [] __blkdev_put+0xa8/0x1a0
[ 1920.060784]  [] blkdev_put+0x10/0x20
[ 1920.060819]  [] vbd_free+0x2a/0x40
[ 1920.060851]  [] blkback_remove+0x59/0x90
[ 1920.060885]  [] xenbus_dev_remove+0x50/0x70
[ 1920.060921]  [] __device_release_driver+0x58/0xb0
[ 1920.060956]  [] device_release_driver+0x2d/0x40
[ 1920.060991]  [] bus_remove_device+0x9a/0xc0
[ 1920.061027]  [] device_del+0x127/0x1d0
[ 1920.061061]  [] device_unregister+0x16/0x30
[ 1920.061095]  [] frontend_changed+0x90/0x2a0
[ 1920.061131]  [] xenbus_otherend_changed+0xb2/0xc0
[ 1920.061167]  [] ? _spin_unlock_irqrestore+0x37/0x60
[ 1920.061209]  [] frontend_changed+0x10/0x20
[ 1920.061243]  [] xenwatch_thread+0xb4/0x190
[ 1920.061281]  [] ? autoremove_wake_function+0x0/0x40
[ 1920.061314]  [] ? xenwatch_thread+0x0/0x190
[ 1920.061349]  [] kthread+0xa6/0xb0
[ 1920.061383]  [] child_rip+0xa/0x20
[ 1920.061415]  [] ? int_ret_from_sys_call+0x7/0x1b
[ 1920.061451]  [] ? retint_restore_args+0x5/0x6
[ 1920.061485]  [] ? child_rip+0x0/0x20

Any idea what's going wrong?

Thanks!

Cheers,
NN
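P.S. In case it helps anyone reproduce the fd-6 diagnosis above: a blocked
read can be tied back to its backing file through /proc. Below is a minimal
stand-alone sketch using a throwaway FIFO and a `cat` reader as stand-ins
(the real case would be PID 4756, fd 6, /var/run/tap/remus_nas1_9000.msg —
those names are from this report; everything else here is illustrative):

```shell
#!/bin/sh
# Sketch: resolve which file a process's blocked read() is stuck on, via /proc.
# /tmp/demo_fifo and the 'cat' reader are stand-ins for the hung remus process.
rm -f /tmp/demo_fifo
mkfifo /tmp/demo_fifo
cat < /tmp/demo_fifo &        # reader blocks on the FIFO, like remus on fd 6
READER=$!
exec 3> /tmp/demo_fifo        # open a writer so the reader's open() completes
sleep 1                       # let the reader reach its blocking read()
readlink "/proc/$READER/fd/0"          # -> /tmp/demo_fifo
stat -L -c '%F' "/proc/$READER/fd/0"   # -> fifo
exec 3>&-                     # close the writer; reader sees EOF and exits
wait "$READER" 2>/dev/null
rm -f /tmp/demo_fifo
```

On the real process, `ls -l /proc/4756/fd` lists every open descriptor with
its target, and `strace -p 4756` (as used above) shows the blocking syscall.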