From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Date: Wed, 18 May 2022 14:28:41 +0200
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
To: Stefan Hajnoczi
Cc: qemu-block@nongnu.org, Kevin Wolf, Hanna Reitz, John Snow, Paolo Bonzini, Vladimir Sementsov-Ogievskiy, Fam Zheng, qemu-devel@nongnu.org

On 17/05/2022 at 12:59, Stefan Hajnoczi wrote:
> On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
>> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
>>> This is a new attempt to replace the need to take the AioContext lock to
>>> protect graph modifications. In particular, we aim to remove (or better,
>>> substitute) the AioContext lock around bdrv_replace_child_noperm, since
>>> this function changes a BlockDriverState's ->parents and ->children lists.
>>>
>>> In the previous version, we decided to discard using subtree_drains to
>>> protect the nodes, for different reasons: for those unfamiliar with it,
>>> please see
>>> https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/
>>
>> I reread the thread and it's unclear to me why drain is the wrong
>> mechanism for protecting graph modifications. We theorized a lot, but
>> ultimately, is this new mechanism sufficiently different from
>> bdrv_drained_begin()/end() to make it worth implementing?
>>
>> Instead of invoking .drained_begin() callbacks to stop further I/O,
>> we're now queuing coroutines (without the backpressure information that
>> whoever is spawning I/O needs so they can stop). The writer still waits
>> for in-flight I/O to finish, including I/O not associated with the bdrv
>> graph we wish to modify (because the rdlock is per-AioContext and
>> unrelated to a specific graph). Is this really more lightweight than
>> drain?
>>
>> If I understand correctly, the original goal was to avoid the need to
>> hold the AioContext lock across bdrv_replace_child_noperm(). I would
>> focus on that and use drain for now.
>>
>> Maybe I've missed an important point about why the new mechanism is
>> needed?
>
> Ping?

label: // read till the end to see why I wrote this here

I was hoping someone from the "No" party would answer your question,
because, as you know, we reached this same conclusion together.
We thought about dropping the drain for various reasons. The main one (at
least as far as I understood) is that we are not sure whether something
can still happen in between drain_begin/end, and it is a little confusing
to use the same mechanism both to block I/O and to protect the graph.

We then thought about implementing a rwlock. A rdlock would make clear
what we are protecting and who is using the lock. I sent a draft rwlock
implementation in this thread, but it also led to additional problems.
The main problem was that this new lock would introduce nested event
loops, which together with such locking would just create deadlocks.

If readers run in coroutines and writers do not (because graph operations
do not run in coroutines), we get a lot of deadlocks: when a writer has
to take the lock, it must wait for all other readers to finish, and it
does so by internally calling AIO_WAIT_WHILE, creating a nested event
loop. We don't know what could execute while polling for events; for
example, another writer could be resumed.

Ideally, we want writers in coroutines too.

Additionally, many readers run in what we call "mixed" functions: usually
generated automatically with the generated_co_wrapper tag, they make a
function (usually a bdrv callback) always run in a coroutine, creating
one if necessary. For example, bdrv_flush() makes sure that
bs->bdrv_co_flush() always runs in a coroutine. Such mixed functions are
used in other callbacks too, making it really difficult to tell whether
we are in a coroutine or not and, most importantly, making rwlock usage
very difficult.

This led us to step back once more and try to convert all
BlockDriverState callbacks into coroutines. That would greatly simplify
rwlock usage, because we could make the rwlock coroutine-friendly:
without any AIO_WAIT_WHILE, a writer could wait for readers by just
yielding and queuing itself in coroutine queues (there is a sketch of
what I mean further below).

The first step was then to convert all callbacks into coroutines, using
generated_coroutine_wrapper (g_c_w). A typical g_c_w is implemented this
way (much simplified):

    if (qemu_in_coroutine()) {
        /* already in coroutine context: run the callback directly */
        callback(opaque);
    } else {
        /* otherwise create a coroutine for it and poll until it is done */
        co = qemu_coroutine_create(callback, opaque);
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, coroutine_in_progress);
    }

Once all callbacks are implemented using g_c_w, we can start splitting
the two branches of the if, so that a coroutine is only created when we
are outside a bdrv callback.

However, we immediately found a problem when starting to convert the
first callbacks: the AioContext lock is taken around some non-coroutine
callbacks! For example, bs->bdrv_open() is always called with the
AioContext lock taken. In addition, callbacks like bdrv_open are
graph-modifying functions, which is probably why we take the AioContext
lock there, and they do not like to run in coroutines.

Anyway, the real problem comes when we create a coroutine in such a
place, where the AioContext lock is taken and we have a graph-modifying
function. bdrv_coroutine_enter() calls aio_co_enter(), which in turn
first checks whether the coroutine is entering a context different from
the current one (not the case for open) and whether we are already in a
coroutine (for sure not). It therefore resorts to the following calls:

    aio_context_acquire(ctx);
    qemu_aio_coroutine_enter(ctx, co);
    aio_context_release(ctx);

This is clearly a problem, because we are taking the lock twice: once
from the original caller of the callback, and once here because of the
coroutine. This creates a lot of deadlock situations.
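To make the double acquisition concrete, here is a compressed sketch of
the call chain (QEMU-style pseudo-C, not runnable outside the tree; the
caller() function and the bdrv_open_entry() coroutine entry point are
made up for illustration, and error handling is omitted):

    /* Hypothetical coroutine entry point wrapping a graph-modifying callback. */
    static void coroutine_fn bdrv_open_entry(void *opaque)
    {
        /* ... eventually calls bs->bdrv_open() ... */
    }

    void caller(BlockDriverState *bs, AioContext *ctx)
    {
        aio_context_acquire(ctx);   /* first acquisition, high in the stack */

        Coroutine *co = qemu_coroutine_create(bdrv_open_entry, bs);
        bdrv_coroutine_enter(bs, co);
        /*
         * bdrv_coroutine_enter() -> aio_co_enter(ctx, co), which sees that
         * we are neither crossing AioContexts nor already in a coroutine,
         * and so ends up doing:
         *
         *     aio_context_acquire(ctx);            <-- second acquisition
         *     qemu_aio_coroutine_enter(ctx, co);
         *     aio_context_release(ctx);
         */

        aio_context_release(ctx);
    }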
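And to make "coroutine-friendly rwlock" concrete, here is a minimal
sketch of the idea I mentioned above (the GraphRwlock name and functions
are made up for illustration; it only uses QEMU's existing CoMutex and
CoQueue primitives, so a writer simply yields instead of polling with
AIO_WAIT_WHILE):

    typedef struct GraphRwlock {
        CoMutex mutex;   /* protects the fields below */
        CoQueue queue;   /* coroutines waiting for the lock */
        int readers;     /* number of active readers */
        bool writer;     /* true while a writer holds the lock */
    } GraphRwlock;

    static void graph_rwlock_init(GraphRwlock *l)
    {
        qemu_co_mutex_init(&l->mutex);
        qemu_co_queue_init(&l->queue);
        l->readers = 0;
        l->writer = false;
    }

    static void coroutine_fn graph_rdlock(GraphRwlock *l)
    {
        qemu_co_mutex_lock(&l->mutex);
        while (l->writer) {
            /* yield and queue ourselves; no nested event loop */
            qemu_co_queue_wait(&l->queue, &l->mutex);
        }
        l->readers++;
        qemu_co_mutex_unlock(&l->mutex);
    }

    static void coroutine_fn graph_rdunlock(GraphRwlock *l)
    {
        qemu_co_mutex_lock(&l->mutex);
        if (--l->readers == 0) {
            qemu_co_queue_restart_all(&l->queue);
        }
        qemu_co_mutex_unlock(&l->mutex);
    }

    static void coroutine_fn graph_wrlock(GraphRwlock *l)
    {
        qemu_co_mutex_lock(&l->mutex);
        while (l->writer || l->readers > 0) {
            qemu_co_queue_wait(&l->queue, &l->mutex);
        }
        l->writer = true;
        qemu_co_mutex_unlock(&l->mutex);
    }

    static void coroutine_fn graph_wrunlock(GraphRwlock *l)
    {
        qemu_co_mutex_lock(&l->mutex);
        l->writer = false;
        qemu_co_queue_restart_all(&l->queue);
        qemu_co_mutex_unlock(&l->mutex);
    }

Of course this only works once both readers and writers run in
coroutines, which is exactly why the callback conversion matters.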
Coming back to the AioContext lock: all callers of bdrv_open() always
take it. Often it is taken very high in the call stack, but it is always
taken. Getting rid of the lock around qemu_aio_coroutine_enter() is
difficult too, because coroutines expect to have the lock taken: for
example, if we want to drain from a coroutine, bdrv_co_yield_to_drain()
releases the lock for us.

The solution is to get rid of the AioContext lock that was originally
taken in the caller. But that raises the question: can we actually get
rid of the AioContext lock? What is that lock protecting, being so high
in the call stack? What is the alternative for its usage? We would be
getting rid of it and replacing it with nothing: dropping the AioContext
lock to allow coroutine callbacks would leave graph operations
unprotected.

We might suppose that many callbacks are called under drain and in
GLOBAL_STATE, which should be enough, but from our experimentation in
the previous series we saw that currently not everything is under drain,
leaving some operations unprotected (remember assert_graph_writable
being temporarily disabled, since drain coverage for
bdrv_replace_child_noperm was not 100%?). Therefore we would need to add
more drains. But isn't drain what we decided to drop at the beginning?
Why isn't drain good enough?

goto label;

Thank you,
Emanuele

>
> Stefan
>