Date: Fri, 26 May 2023 17:54:20 +0200
Subject: Re: [PATCH v2] migration: hold the BQL during setup
From: Fiona Ebner
To: quintela@redhat.com
Cc: qemu-devel@nongnu.org, peterx@redhat.com, leobras@redhat.com,
    eblake@redhat.com, vsementsov@yandex-team.ru, jsnow@redhat.com,
    stefanha@redhat.com, fam@euphon.net, qemu-block@nongnu.org,
    pbonzini@redhat.com, t.lamprecht@proxmox.com
References: <20230525164726.45176-1-f.ebner@proxmox.com>
    <87sfbj1jq7.fsf@secure.mitica>
    <74785ad6-ea07-6b11-61ea-fd796daf21ad@proxmox.com>
In-Reply-To: <74785ad6-ea07-6b11-61ea-fd796daf21ad@proxmox.com>

On 26.05.23 at 15:47, Fiona Ebner wrote:
> On 26.05.23 at 12:16, Juan Quintela wrote:
>> Nak
>>
>> Sometimes it works, and sometimes it hangs.
>
> Sorry, I originally only ran the tests for x86_64 (native for me). I now
> ran into the hang too, with qtest-aarch64/migration-test and
> qtest-i386/migration-test.
>
>> Can you take a look?
>
> Will do!
>

So I took a look at multifd_send_state->params[$i] and noticed that the
IOChannel c is still NULL and running is still false, while the name,
page_size, etc. have already been initialized. So it seems that
socket_send_channel_create() did not manage to execute
multifd_new_send_channel_async() yet, judging from the following in
multifd_save_setup():

>         p->page_size = qemu_target_page_size();
>         p->page_count = page_count;
>
>         if (migrate_zero_copy_send()) {
>             p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
>         } else {
>             p->write_flags = 0;
>         }
>
>         socket_send_channel_create(multifd_new_send_channel_async, p);
>     }

I guess the execution of multifd_new_send_channel_async() after
socket_send_channel_create() somehow depends on the main thread doing
something? But the main thread is waiting on the BQL.

Should I introduce an unlocked section around multifd_send_sync_main()
in ram_save_setup() and hope for the best?
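
To make the suspected dependency concrete, here is a minimal,
self-contained sketch using plain pthreads. It is not QEMU code: big_lock
stands in for the BQL, main_loop() for the main thread that would have to
run the channel-creation callback, and setup_demo() for the setup path
that waits for the channel. All of these names are made up for
illustration, and the 200 ms timeout exists only so the demo returns
instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins, not QEMU symbols. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;  /* plays the BQL */
static pthread_mutex_t chan_lock = PTHREAD_MUTEX_INITIALIZER; /* guards channel_ready */
static pthread_cond_t chan_cond = PTHREAD_COND_INITIALIZER;
static bool channel_ready;

/* Plays the main thread: it needs big_lock before it can run the
 * callback that marks the channel as created. */
static void *main_loop(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    pthread_mutex_lock(&chan_lock);
    channel_ready = true;
    pthread_cond_signal(&chan_cond);
    pthread_mutex_unlock(&chan_lock);
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Wait up to timeout_ms for the channel; returns true if it is ready. */
static bool wait_for_channel(int timeout_ms)
{
    struct timespec deadline;
    int rc = 0;
    bool ready;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&chan_lock);
    while (!channel_ready && rc != ETIMEDOUT) {
        rc = pthread_cond_timedwait(&chan_cond, &chan_lock, &deadline);
    }
    ready = channel_ready;
    pthread_mutex_unlock(&chan_lock);
    return ready;
}

/* Plays the setup path, which runs with big_lock held. Returns true if
 * the channel came up within the timeout. */
bool setup_demo(bool unlock_around_wait)
{
    pthread_t tid;
    bool ready;
    int rc;

    channel_ready = false;
    pthread_mutex_lock(&big_lock);
    rc = pthread_create(&tid, NULL, main_loop, NULL);
    assert(rc == 0);

    if (unlock_around_wait) {
        /* Unlocked section: main_loop() can take big_lock and proceed. */
        pthread_mutex_unlock(&big_lock);
        ready = wait_for_channel(200);
        pthread_mutex_lock(&big_lock);
    } else {
        /* main_loop() is stuck on big_lock, so this wait times out. */
        ready = wait_for_channel(200);
    }

    pthread_mutex_unlock(&big_lock);
    pthread_join(tid, NULL);
    return ready;
}
```

In this model, setup_demo(false) times out and returns false, while
setup_demo(true) returns true. Whether the real hang has exactly this
shape is still only a guess.
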
The tests ran two times without errors for me afterwards. Is there an
easy way to run only the problematic test cases again?

There's still the risk that something else needs an unlocked section. In
that regard, v1 of the patch is safer, because it doesn't change which
sections are inside or outside the BQL for migration, just for snapshot.

Best Regards,
Fiona