Date: Fri, 30 Oct 2020 09:53:50 -0400
From: Peter Xu <peterx@redhat.com>
To: Peter Maydell, "Dr. David Alan Gilbert"
Cc: Juan Quintela, "Dr. David Alan Gilbert", QEMU Developers <qemu-devel@nongnu.org>
Subject: Re: recent flakiness (intermittent hangs) of migration-test
Message-ID: <20201030135350.GA588069@xz-x1>
References: <20201029174124.GD3335@work-vm> <20201029193433.GE3335@work-vm>
 <20201029202810.GD455015@xz-x1>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="HlL+5n6rz5pIUxbD"

--HlL+5n6rz5pIUxbD
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > Peter, is it possible that you enable QTEST_LOG=1 in your future
> > migration-test testcase and try to capture the stderr?  With the help of
> > commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> > 2020-10-26), the test should be able to dump quite some helpful
> > information to further identify the issue.
>
> Here's the result of running just the migration test with QTEST_LOG=1:
> https://people.linaro.org/~peter.maydell/migration.log
> It's 300MB because when the test hangs one of the processes is apparently
> in a polling state and continues to send status queries.
>
> My impression is that the test is OK on an unloaded machine but more
> likely to fail if the box is doing other things at the same time.
> Alternatively it might be a 'parallel make check' bug.

Thanks for collecting that, Peter.  I'm copy-pasting the important
information out here (with some moves and indents to make things even
clearer):

  ...
{"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"} {"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}} {"return": {}, "id": "recover-cmd"} {"execute": "query-migrate"} ... {"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}} qemu-system-x86_64: ram_save_queue_pages no previous block qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. {"return": {}} {"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}} ... The problem is probably an misuse on last_rb on destination node. When looking at it, I also found a race. So I guess I should fix both... Peter, would it be easy to try apply the two patches I attached to see whether the test hang would be resolved? Dave, feel free to give early comments too on the two fixes before I post them on the list. Thanks! -- Peter Xu --HlL+5n6rz5pIUxbD Content-Type: text/plain; charset=utf-8 Content-Disposition: attachment; filename="0001-migration-Unify-reset-of-last_rb-on-destination-node.patch" >From b9af559df89ae9c2eed506ebdd811356ea799f63 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 30 Oct 2020 09:37:44 -0400 Subject: [PATCH 1/2] migration: Unify reset of last_rb on destination node when recover When postcopy recover happens, we need to reset last_rb after each return of postcopy_pause_fault_thread() because that means we just got the postcopy migration continued. Unify this reset to the place right before we want to kick the fault thread again, when we get the command MIG_CMD_POSTCOPY_RESUME from source. This is actually more than that - because the main thread on destination will now be able to call migrate_send_rp_req_pages_pending() too, so the fault thread is not the only user of last_rb now. Move the reset earlier will allow the first call to migrate_send_rp_req_pages_pending() to use the reset value even if called from the main thread. (NOTE: this is not a real fix to 0c26781c09 mentioned below, however it is just a mark that when picking up 0c26781c09 we'd better have this one too; the real fix will come later) Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery") Signed-off-by: Peter Xu --- migration/postcopy-ram.c | 2 -- migration/savevm.c | 6 ++++++ 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index d3bb3a744b..d99842eb1b 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -903,7 +903,6 @@ static void *postcopy_ram_fault_thread(void *opaque) * the channel is rebuilt. */ if (postcopy_pause_fault_thread(mis)) { - mis->last_rb = NULL; /* Continue to read the userfaultfd */ } else { error_report("%s: paused but don't allow to continue", @@ -985,7 +984,6 @@ retry: /* May be network failure, try to wait for recovery */ if (ret == -EIO && postcopy_pause_fault_thread(mis)) { /* We got reconnected somehow, try to continue */ - mis->last_rb = NULL; goto retry; } else { /* This is a unavoidable fault */ diff --git a/migration/savevm.c b/migration/savevm.c index 21ccba9fb3..e8834991ec 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -2061,6 +2061,12 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis) return 0; } + /* + * Reset the last_rb before we resend any page req to source again, since + * the source should have it reset already. 

Peter, would it be easy for you to apply the two patches I attached and see
whether the test hang is resolved?  Dave, feel free to give early comments on
the two fixes too, before I post them to the list.

Thanks!

-- 
Peter Xu

--HlL+5n6rz5pIUxbD
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="0001-migration-Unify-reset-of-last_rb-on-destination-node.patch"

From b9af559df89ae9c2eed506ebdd811356ea799f63 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 Oct 2020 09:37:44 -0400
Subject: [PATCH 1/2] migration: Unify reset of last_rb on destination node
 when recover

When postcopy recovery happens, we need to reset last_rb after each return
of postcopy_pause_fault_thread(), because that means the postcopy migration
has just been resumed.  Unify this reset into the place right before we
kick the fault thread again, i.e. when we receive the MIG_CMD_POSTCOPY_RESUME
command from the source.

This is actually more than a cleanup: the main thread on the destination can
now call migrate_send_rp_req_pages_pending() too, so the fault thread is no
longer the only user of last_rb.  Moving the reset earlier lets the first
call to migrate_send_rp_req_pages_pending() see the reset value even when it
is made from the main thread.

(NOTE: this is not the real fix for 0c26781c09 mentioned below; it is just a
marker that whoever picks up 0c26781c09 had better take this patch too.  The
real fix comes in the next patch.)

Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 2 --
 migration/savevm.c       | 6 ++++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d3bb3a744b..d99842eb1b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -903,7 +903,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
              * the channel is rebuilt.
              */
             if (postcopy_pause_fault_thread(mis)) {
-                mis->last_rb = NULL;
                 /* Continue to read the userfaultfd */
             } else {
                 error_report("%s: paused but don't allow to continue",
@@ -985,7 +984,6 @@ retry:
             /* May be network failure, try to wait for recovery */
             if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
                 /* We got reconnected somehow, try to continue */
-                mis->last_rb = NULL;
                 goto retry;
             } else {
                 /* This is a unavoidable fault */
diff --git a/migration/savevm.c b/migration/savevm.c
index 21ccba9fb3..e8834991ec 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2061,6 +2061,12 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
         return 0;
     }
 
+    /*
+     * Reset the last_rb before we resend any page req to source again, since
+     * the source should have it reset already.
+     */
+    mis->last_rb = NULL;
+
     /*
      * This means source VM is ready to resume the postcopy migration.
      * It's time to switch state and release the fault thread to
-- 
2.26.2

--HlL+5n6rz5pIUxbD
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="0002-migration-Postpone-the-kick-of-the-fault-thread-afte.patch"

From 2a28241de49d27d7b0968f1d1b137aea153fc3b4 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 Oct 2020 09:44:32 -0400
Subject: [PATCH 2/2] migration: Postpone the kick of the fault thread after
 recover

The new migrate_send_rp_req_pages_pending() call should greatly improve
destination responsiveness, because it resyncs the faulted addresses right
after a postcopy recovery.  However, it is also the first place where a page
request is initiated from the main thread, and one thing was overlooked:
migrate_send_rp_message_req_pages() is not designed to be thread-safe.  So
if we wake the fault thread before the main thread has finished syncing all
the faulted pages, the two threads can race.

Postpone the wake-up until after the sync of the faulted addresses.

Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index e8834991ec..5f937a2762 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2069,12 +2069,9 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
 
     /*
      * This means source VM is ready to resume the postcopy migration.
-     * It's time to switch state and release the fault thread to
-     * continue service page faults.
      */
     migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
                       MIGRATION_STATUS_POSTCOPY_ACTIVE);
-    qemu_sem_post(&mis->postcopy_pause_sem_fault);
 
     trace_loadvm_postcopy_handle_resume();
 
@@ -2095,6 +2092,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
      */
     migrate_send_rp_req_pages_pending(mis);
 
+    /*
+     * It's time to switch state and release the fault thread to continue
+     * service page faults.  Note that this should be explicitly after the
+     * above call to migrate_send_rp_req_pages_pending().  In short:
+     * migrate_send_rp_message_req_pages() is not thread safe, yet.
+     */
+    qemu_sem_post(&mis->postcopy_pause_sem_fault);
+
     return 0;
 }
 
-- 
2.26.2

--HlL+5n6rz5pIUxbD--