From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36992) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZfOIa-00071L-FT for qemu-devel@nongnu.org; Fri, 25 Sep 2015 04:24:09 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZfOIW-000371-Ft for qemu-devel@nongnu.org; Fri, 25 Sep 2015 04:24:08 -0400 Received: from [59.151.112.132] (port=31883 helo=heian.cn.fujitsu.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZfOIV-00036b-Ny for qemu-devel@nongnu.org; Fri, 25 Sep 2015 04:24:04 -0400 References: <1443099218-14857-1-git-send-email-den@openvz.org> <5604A1B7.9050208@cn.fujitsu.com> <5604FFE5.8090400@openvz.org> From: Wen Congyang Message-ID: <56050489.9010306@cn.fujitsu.com> Date: Fri, 25 Sep 2015 16:23:37 +0800 MIME-Version: 1.0 In-Reply-To: <5604FFE5.8090400@openvz.org> Content-Type: text/plain; charset="windows-1252" Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH 1/1] migration: fix deadlock List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Denis V. Lunev" Cc: Amit Shah , Igor Redko , qemu-devel@nongnu.org, Juan Quintela On 09/25/2015 04:03 PM, Denis V. Lunev wrote: > On 09/25/2015 04:21 AM, Wen Congyang wrote: >> On 09/24/2015 08:53 PM, Denis V. Lunev wrote: >>> From: Igor Redko >>> >>> Release qemu global mutex before call synchronize_rcu(). >>> synchronize_rcu() waiting for all readers to finish their critical >>> sections. There is at least one critical section in which we try >>> to get QGM (critical section is in address_space_rw() and >>> prepare_mmio_access() is trying to aquire QGM). >>> >>> Both functions (migration_end() and migration_bitmap_extend()) >>> are called from main thread which is holding QGM. >>> >>> Thus there is a race condition that ends up with deadlock: >>> main thread working thread >>> Lock QGA | >>> | Call KVM_EXIT_IO handler >>> | | >>> | Open rcu reader's critical section >>> Migration cleanup bh | >>> | | >>> synchronize_rcu() is | >>> waiting for readers | >>> | prepare_mmio_access() is waiting for QGM >>> \ / >>> deadlock >>> >>> The patch just releases QGM before calling synchronize_rcu(). >>> >>> Signed-off-by: Igor Redko >>> Reviewed-by: Anna Melekhova >>> Signed-off-by: Denis V. Lunev >>> CC: Juan Quintela >>> CC: Amit Shah >>> --- >>> migration/ram.c | 6 ++++++ >>> 1 file changed, 6 insertions(+) >>> >>> diff --git a/migration/ram.c b/migration/ram.c >>> index 7f007e6..d01febc 100644 >>> --- a/migration/ram.c >>> +++ b/migration/ram.c >>> @@ -1028,12 +1028,16 @@ static void migration_end(void) >>> { >>> /* caller have hold iothread lock or is in a bh, so there is >>> * no writing race against this migration_bitmap >>> + * but rcu used not only for migration_bitmap, so we should >>> + * release QGM or we get in deadlock. >>> */ >>> unsigned long *bitmap = migration_bitmap; >>> atomic_rcu_set(&migration_bitmap, NULL); >>> if (bitmap) { >>> memory_global_dirty_log_stop(); >>> + qemu_mutex_unlock_iothread(); >>> synchronize_rcu(); >>> + qemu_mutex_lock_iothread(); >> migration_end() can called in two cases: >> 1. migration_completed >> 2. migration is cancelled >> >> In case 1, you should not unlock iothread, otherwise, the vm's state may be changed >> unexpectedly. > > sorry, but there is now very good choice here. We should either > unlock or not call synchronize_rcu which is also an option. > > In the other case the rework should be much more sufficient. I don't reproduce this bug. But according to your description, the bug only exists in case 2. Is it right? > > Den > >>> g_free(bitmap); >>> } >>> @@ -1085,7 +1089,9 @@ void migration_bitmap_extend(ram_addr_t old, ram_addr_t new) >>> atomic_rcu_set(&migration_bitmap, bitmap); >>> qemu_mutex_unlock(&migration_bitmap_mutex); >>> migration_dirty_pages += new - old; >>> + qemu_mutex_unlock_iothread(); >>> synchronize_rcu(); >>> + qemu_mutex_lock_iothread(); >> Hmm, I think it is OK to unlock iothread here >> >>> g_free(old_bitmap); >>> } >>> } >>> > > . >