Message-ID: <0aeeef05-37ec-51a4-820a-633c29776609@redhat.com>
Date: Mon, 23 May 2022 15:54:39 +0200
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
To: Stefan Hajnoczi
Cc: Kevin Wolf, Paolo Bonzini, qemu-block@nongnu.org, Hanna Reitz, John Snow, Vladimir Sementsov-Ogievskiy, Fam Zheng, qemu-devel@nongnu.org
References: <20220426085114.199647-1-eesposit@redhat.com> <6fc3e40e-7682-b9dc-f789-3ca95e0430db@redhat.com>
From: Emanuele Giuseppe Esposito

On 23/05/2022 at 15:15, Stefan Hajnoczi wrote:
> On Mon, May 23, 2022 at 10:48:48AM +0200, Emanuele Giuseppe Esposito wrote:
>>
>>
>> On 22/05/2022 at 17:06, Stefan Hajnoczi wrote:
>>> On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
>>>> On 18.05.2022 at 14:43, Paolo Bonzini wrote:
>>>>> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
>>>>>> For example, all callers of bdrv_open() always take the AioContext lock.
>>>>>> Often it is taken very high in the call stack, but it's always taken.
>>>>>
>>>>> I think it's actually not a problem of who takes the AioContext lock or
>>>>> where; the requirements are contradictory:
>>>>>
>>>>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
>>>>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
>>>>>
>>>>> * to call these functions with the lock taken, the code has to run in the
>>>>> BDS's home iothread. Attempts to do otherwise result in deadlocks (the
>>>>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, which cannot
>>>>> happen without releasing the AioContext lock)
>>>>>
>>>>> * running the code in the BDS's home iothread is not possible for
>>>>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
>>>>> thread, but that cannot be guaranteed in general)
>>>>>
>>>>>> We might suppose that many callbacks are called under drain and in
>>>>>> GLOBAL_STATE, which should be enough, but from our experimentation in
>>>>>> the previous series we saw that currently not everything is under drain,
>>>>>> leaving some operations unprotected (remember assert_graph_writable
>>>>>> temporarily disabled, since drain coverage for bdrv_replace_child_noperm
>>>>>> was not 100%?).
>>>>>> Therefore we need to add more drains. But isn't drain what we decided to
>>>>>> drop at the beginning?
>>>>>> Why isn't drain good?
>>>>>
>>>>> To sum up the patch ordering deadlock that we have right now:
>>>>>
>>>>> * in some cases, graph manipulations are protected by the AioContext lock
>>>>>
>>>>> * eliminating the AioContext lock is needed to move callbacks to coroutine
>>>>> contexts (see above for the deadlock scenario)
>>>>>
>>>>> * moving callbacks to coroutine context is needed by the graph rwlock
>>>>> implementation
>>>>>
>>>>> On one hand, we cannot protect the graph across manipulations with a graph
>>>>> rwlock without removing the AioContext lock; on the other hand, the
>>>>> AioContext lock is what _right now_ protects the graph.
>>>>>
>>>>> So I'd rather go back to Emanuele's draining approach. It may not be
>>>>> beautiful, but it allows progress. Once that is in place, we can remove the
>>>>> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
>>>>> now) and reevaluate our next steps.
>>>>
>>>> If we want to use drain for locking, we need to make sure that drain
>>>> actually does the job correctly. I see two major problems with it:
>>>>
>>>> The first one is that drain only covers I/O paths, but we need to
>>>> protect against _anything_ touching block nodes. This might mean a
>>>> massive audit and making sure that everything in QEMU that could
>>>> possibly touch a block node is integrated with drain.
>>>>
>>>> I think Emanuele has argued before that because writes to the graph only
>>>> happen in the main thread and we believe that currently only I/O
>>>> requests are processed in iothreads, this is safe and we don't actually
>>>> need to audit everything.
>>>>
>>>> This is true as long as the assumption holds (how do we ensure that
>>>> nobody ever introduces non-I/O code touching a block node in an
>>>> iothread?) and as long as the graph writer never yields or polls. I
>>>> think the latter condition is violated today; a random example is that
>>>> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
>>>> cooperation from all relevant places, the effectively locked code
>>>> section ends right there, even if the drained section continues. Even if
>>>> we can fix this, verifying that the conditions are met everywhere seems
>>>> far from trivial.
>>>>
>>>> And that's exactly my second major concern: even if we manage to
>>>> correctly implement things with drain, I don't see a way to meaningfully
>>>> review it. I just don't know how to verify with some confidence that
>>>> it's actually correct and covering everything that needs to be covered.
>>>>
>>>> Drain is not really a lock. But if you use it as one, the best it can
>>>> provide is advisory locking (callers, inside and outside the block
>>>> layer, need to explicitly support drain instead of having the lock
>>>> applied somewhere in the block layer functions). And even if all
>>>> relevant pieces actually make use of it, it still has an awkward
>>>> interface for locking:
>>>>
>>>> /* Similar to rdlock(), but doesn't wait for writers to finish. It is
>>>>  * the caller's responsibility to make sure that there are no writers. */
>>>> bdrv_inc_in_flight()
>>>>
>>>> /* Similar to wrlock(). Waits for readers to finish. New readers are not
>>>>  * prevented from starting after it returns. Third parties are politely
>>>>  * asked not to touch the block node while it is drained. */
>>>> bdrv_drained_begin()
>>>>
>>>> (I think the unlock counterparts actually behave as expected from a real
>>>> lock.)
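
To make the "awkward interface" point above concrete, this is roughly what
the two sides look like if drain is pressed into service as an advisory
graph lock; just a sketch built from the functions named above, not a
proposal:

    /* "Writer": drain waits for in-flight requests to finish, but readers
     * that do not participate in drain are only politely asked to stay
     * away while the node is drained. */
    bdrv_drained_begin(bs);
    /* ... modify the graph, e.g. detach or reparent a child ... */
    bdrv_drained_end(bs);

    /* "Reader": the in-flight counter does not wait for a writer; the
     * caller must already know that no graph modification is running. */
    bdrv_inc_in_flight(bs);
    /* ... traverse the graph, submit I/O ... */
    bdrv_dec_in_flight(bs);
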
>>>>
>>>> Having an actual rdlock() (that waits for writers), and using static
>>>> analysis to verify that all relevant places use it (so that wrlock()
>>>> doesn't rely on politely asking third parties to leave the node alone),
>>>> gave me some confidence that we could verify the result.
>>>>
>>>> I'm not sure at all how to achieve the same with the drain interface. In
>>>> theory, it's possible. But it complicates the conditions so much that
>>>> I'm pretty sure the review wouldn't only be very time consuming, but
>>>> that I would make mistakes during the review, rendering it useless.
>>>>
>>>> Maybe throwing some more static analysis at the code can help, not sure.
>>>> It's going to be a bit more complex than with the other approach,
>>>> though.
>>>
>>> Hi,
>>> Sorry for the long email. I've included three topics that may help us discuss
>>> draining and AioContext removal further.
>>>
>>> The shortcomings of drain
>>> -------------------------
>>> I wanted to explore the logical argument that making graph modifications within
>>> a drained section is correct:
>>> - Graph modifications and BB/BDS lookup are Global State (GS).
>>> - Graph traversal from a BB/BDS pointer is IO.
>>> - Only one thread executes GS code at a time.
>>> - IO is quiesced within a drained section.
>>> - Therefore a drained section in GS code suspends graph traversal, other graph
>>>   modifications, and BB/BDS lookup.
>>> - Therefore it is safe to modify the graph from a GS drained section.
>>>
>>> However, I hit on a problem that I think Emanuele and Paolo have already
>>> pointed out: draining is GS & IO. This might have worked under the 1 IOThread
>>> model but it does not make sense for multi-queue. It is possible to submit I/O
>>> requests in drained sections. How can multiple threads be in drained sections
>>> simultaneously and possibly submit further I/O requests in their drained
>>> sections? Those sections wouldn't be "drained" in any useful sense of the word.
>>>
>>> It is necessary to adjust how recursive drained sections work:
>>> bdrv_drained_begin() must not return while there are deeper nested drained
>>> sections.
>>>
>>> This is allowed:
>>>
>>>   Monitor command            Block job
>>>   ---------------            ---------
>>>   > bdrv_drained_begin()
>>>   .                          > bdrv_drained_begin()
>>>   .                          < bdrv_drained_begin()
>>>   .                          .
>>>   .                          .
>>>   .                          > bdrv_drained_end()
>>>   .                          < bdrv_drained_end()
>>>   < bdrv_drained_begin()
>>>   .
>>>   .
>>>   > bdrv_drained_end()
>>>   < bdrv_drained_end()
>>>
>>> This is not allowed:
>>>
>>>   Monitor command            Block job
>>>   ---------------            ---------
>>>   > bdrv_drained_begin()
>>>   .                          > bdrv_drained_begin()
>>>   .                          < bdrv_drained_begin()
>>>   .                          .
>>>   .                          .
>>>   < bdrv_drained_begin()     .
>>>   .                          .
>>>   .                          > bdrv_drained_end()
>>>   .                          < bdrv_drained_end()
>>>   > bdrv_drained_end()
>>>   < bdrv_drained_end()
>>>
>>> This restriction creates an ordering between bdrv_drained_begin() callers. In
>>> this example the block job must not depend on the monitor command completing
>>> first; otherwise there would be a deadlock, just like when two threads wait for
>>> each other while holding a mutex.
>>>
>>> For sanity I think draining should be GS-only. IO code should not invoke
>>> bdrv_drained_begin(), to avoid ordering problems when multiple threads drain
>>> simultaneously and have dependencies on each other.
>>>
>>> block/mirror.c needs to be modified because it currently drains from IO when
>>> mirroring is about to end.
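
As a side note on the nesting rule above, one way to picture it (purely a
sketch; the nested_drains counter is hypothetical and not an existing
BlockDriverState field) is an outer bdrv_drained_begin() that keeps polling
until every drained section opened underneath it has ended again:

    void bdrv_drained_begin(BlockDriverState *bs)
    {
        /* ... existing behaviour: stop external events, wait for
         * in-flight requests to complete ... */

        /* New: do not return while a deeper nested drained section is
         * still active, so sections strictly nest as in the first
         * diagram above. 'nested_drains' is an invented counter that
         * inner sections would increment/decrement. */
        AIO_WAIT_WHILE(bdrv_get_aio_context(bs),
                       qatomic_read(&bs->nested_drains) > 0);
    }
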
>>
>> Yes, mirror seems to be the only clear place where draining is performed
>> from IO, namely in mirror_run. It's also a little bit weird because only
>> drained_begin is invoked from IO, while drained_end runs in the main loop.
>>
>> If I understand correctly, the reason for that is that we want to
>> prevent bdrv_get_dirty_count from being modified (i.e. that additional
>> modifications arrive) after we have just finished mirroring.
>>
>> Not sure how to deal with this, though.
>
> I don't know the code so I don't have a concrete solution in mind.
>
>>>
>>> With this change to draining I think the logical argument for correctness with
>>> graph modifications holds.
>>>
>>> Enforcing GS/IO separation at compile time
>>> ------------------------------------------
>>> Right now GS/IO asserts check assumptions at runtime. The next step may be
>>> using the type system to separate GS and IO APIs so it's impossible for IO code
>>> to accidentally call GS code, for example.
>>>
>>>   typedef struct {
>>>       BlockDriverState *bs;
>>>   } BlockDriverStateIOPtr;
>>>
>>>   typedef struct {
>>>       BlockDriverState *bs;
>>>   } BlockDriverStateGSPtr;
>>>
>>> Then APIs can be protected as follows:
>>>
>>>   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
>>>
>>> A function that only has a BlockDriverStateIOPtr cannot call
>>> bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
>>> the first argument must be a BlockDriverStateGSPtr.
>>
>> What about functions that do not need a BlockDriverState *, and for
>> example use a BdrvChild? I would assume that we need to replicate the
>> same scheme for that too.
>>
>> And what about GS & IO functions? Not only drain, but also all the
>> generated_co_wrapper functions should be GS & IO.
>
> What is the meaning of GS & IO in a multi-queue world? I guess it's
> effectively IO except it's well-behaved if called with the BQL held?

Good question. I think so. But they will still need to be protected, as
they will probably be traversing the graph while being either GS or IO.

>
>>
>>>
>>> Possible steps for AioContext removal
>>> -------------------------------------
>>> I also wanted to share my assumptions about multi-queue and AioContext removal.
>>> Please let me know if anything seems wrong or questionable:
>>>
>>> - IO code can execute in any thread that has an AioContext.
>>> - Multiple threads may execute IO code at the same time.
>>> - GS code only executes under the BQL.
>>>
>>> For AioContext removal this means:
>>>
>>> - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
>>>   a special "home" AioContext.
>>
>> Makes sense.
>>
>>> - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
>>>   run a coroutine in the BDS's AioContext.
>>
>> Why is there no need?
>
> The coroutine can execute in the current thread since the BDS must be
> thread-safe.
>
> If bdrv_coroutine_enter() is used in cases like block jobs to mean "run
> in the same AioContext as the job coroutine", then the code should
> probably be changed to explicitly use the job AioContext instead of the
> more vague BDS AioContext.

Ok, makes sense.

>
>>> - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
>>>   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
>>>   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
>>>   virtio-blk device).
>>
>> Ok.
>>
>>> - AIO_WAIT_WHILE() simplifies to
>>>
>>>   while ((cond)) {
>>>       aio_poll(qemu_get_current_aio_context(), true);
>>>       ...
>>>   }
>>>
>>> and the distinction between home AioContext and non-home context is
>>> eliminated. AioContext unlocking is dropped.
>>
>> I guess this implies no coroutine runs in the BDS's AioContext.
>
> I'm not sure what you mean, but coroutines accessing the BDS can run in
> any AioContext with multi-queue. If they have any locking requirements
> amongst each other then that should be explicit instead of just throwing
> them all into the BDS AioContext.

Ok.

>
>>>
>>> Does this make sense? I haven't seen these things in recent patch series.
>>
>> You haven't seen them because there's no way to do it if we don't even
>> agree on what to replace the AioContext lock with.
>> Getting to this point would imply first having something (drain,
>> rwlock, whatever) in place alongside the AioContext lock, and then
>> gradually removing it. At least, that's how I understood it.
>
> What, besides graph modification and the things I listed in this email,
> still needs to be untangled from AioContext?

Nothing else (I hope), but as Kevin also said, this requires a lot of
preliminary work before getting to this state.

Emanuele

>
> Stefan
>