From: Akihiko Odaki
Date: Tue, 29 Oct 2024 16:42:30 +0900
Subject: Re: [PATCH v4 02/15] hw/display/apple-gfx: Introduce ParavirtualizedGraphics.Framework support
To: Phil Dennis-Jordan
Cc: qemu-devel@nongnu.org,
 agraf@csgraf.de, peter.maydell@linaro.org, pbonzini@redhat.com, rad@semihalf.com,
 quic_llindhol@quicinc.com, marcin.juszkiewicz@linaro.org, stefanha@redhat.com,
 mst@redhat.com, slp@redhat.com, richard.henderson@linaro.org, eduardo@habkost.net,
 marcel.apfelbaum@gmail.com, gaosong@loongson.cn, jiaxun.yang@flygoat.com,
 chenhuacai@kernel.org, kwolf@redhat.com, hreitz@redhat.com, philmd@linaro.org,
 shorne@gmail.com, palmer@dabbelt.com, alistair.francis@wdc.com, bmeng.cn@gmail.com,
 liwei1518@gmail.com, dbarboza@ventanamicro.com, zhiwei_liu@linux.alibaba.com,
 jcmvbkbc@gmail.com, marcandre.lureau@redhat.com, berrange@redhat.com,
 qemu-arm@nongnu.org, qemu-block@nongnu.org, qemu-riscv@nongnu.org, Alexander Graf
References: <20241024102813.9855-1-phil@philjordan.eu>
 <20241024102813.9855-3-phil@philjordan.eu>
 <9e310d5e-ab73-47b9-b9ed-5a16d4db3fb9@daynix.com>
 <6a989d04-6416-4bd9-98ac-e1230a1095a9@daynix.com>
 <27ced2db-472d-47ae-9047-9efc0b589a1e@daynix.com>

On 2024/10/29 6:06, Phil Dennis-Jordan wrote:
> On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki wrote:
>> On 2024/10/28 23:13, Phil Dennis-Jordan wrote:
>>> On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki wrote:
>>>> On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
>>>>> On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan wrote:
>>>>>>>> Hmm. I think if we were to use that, we would need to create a new
>>>>>>>> QemuEvent for every job and destroy it afterward, which seems
>>>>>>>> expensive. We can't rule out multiple concurrent jobs being
>>>>>>>> submitted, and the QemuEvent system only supports a single producer
>>>>>>>> as far as I can tell.
>>>>>>>>
>>>>>>>> You can probably sort of hack around it with just one QemuEvent by
>>>>>>>> putting the qemu_event_wait into a loop and turning the job.done
>>>>>>>> flag into an atomic (because it would now need to be checked outside
>>>>>>>> the lock) but this all seems unnecessarily complicated considering
>>>>>>>> the QemuEvent uses the same mechanism QemuCond/QemuMutex internally
>>>>>>>> on macOS (the only platform relevant here), except we can use it as
>>>>>>>> intended with QemuCond/QemuMutex rather than having to work against
>>>>>>>> the abstraction.
>>>>>>>
>>>>>>> I don't think it's going to be used concurrently. It would be
>>>>>>> difficult to reason even for the framework if it performs memory
>>>>>>> unmapping/mapping/reading operations concurrently.
>>>>>>
>>>>>> I've just performed a very quick test by wrapping the job
>>>>>> submission/wait in the 2 mapMemory callbacks and the 1 readMemory
>>>>>> callback with atomic counters and logging whenever a counter went
>>>>>> above 1.
>>>>>>
>>>>>>   * Overall, concurrent callbacks across all types were common (many
>>>>>>     per second when the VM is busy). It's not exactly a "thundering
>>>>>>     herd" (I never saw >2) but it's probably not a bad idea to use a
>>>>>>     separate condition variable for each job type. (task map, surface
>>>>>>     map, memory read)
>>>>>>   * While I did not observe any concurrent memory mapping operations
>>>>>>     *within* a type of memory map (2 task mappings or 2 surface
>>>>>>     mappings) I did see very occasional concurrent memory *read*
>>>>>>     callbacks. These would, as far as I can tell, not be safe with
>>>>>>     QemuEvents, unless we placed the event inside the job struct and
>>>>>>     init/destroyed it on every callback (which seems like excessive
>>>>>>     overhead).
>>>>>
>>>>> I think we can tolerate that overhead. init/destroy essentially sets
>>>>> the fields in the data structure and I estimate its total size is
>>>>> about 100 bytes. It is probably better than waking an irrelevant
>>>>> thread up. I also hope that keeps the code simple; it's not worthwhile
>>>>> adding code to optimize this.
>>>>
>>>> At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy}
>>>> don't make any syscalls, so yeah it's probably an acceptable overhead.
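For concreteness, the per-job, on-demand pattern being weighed here looks roughly like the sketch below. The struct and helper names (MapJob, schedule_job_on_main_loop, and so on) are made up for illustration and are not taken from the patch series; only the QemuEvent calls are the real API.

```c
#include "qemu/osdep.h"
#include "qemu/thread.h"

/* Illustrative only -- not code from the series. */
typedef struct MapJob {
    /* ...job parameters and results... */
    QemuEvent done_event;        /* lives only for the duration of one job */
} MapJob;

/* Hypothetical stand-in for however the job is handed to the main loop. */
void schedule_job_on_main_loop(MapJob *job);

/* PVG callback thread: create the event, wait, destroy it again. */
static void submit_job_and_wait(MapJob *job)
{
    qemu_event_init(&job->done_event, false);   /* starts unsignalled */
    schedule_job_on_main_loop(job);
    qemu_event_wait(&job->done_event);          /* block until completion */
    qemu_event_destroy(&job->done_event);       /* the step that turns out to
                                                 * be unsafe, see below */
}

/* Main-loop/BQL side, once the work is done: */
static void finish_job(MapJob *job)
{
    qemu_event_set(&job->done_event);           /* wake the waiting callback */
}
```

This is exactly the shape the next message reports deadlocking with: the final qemu_event_destroy() can race with the tail end of qemu_event_set().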
>>> I've just experimented with QemuEvents created on-demand and ran into
>>> some weird deadlocks, which then made me sit down and think about it
>>> some more. I've come to the conclusion that creating (and crucially,
>>> destroying) QemuEvents on demand in this way is not safe.
>>>
>>> Specifically, you must not call qemu_event_destroy() - which
>>> transitively destroys the mutex and condition variable - unless you can
>>> guarantee that the qemu_event_set() call on that event object has
>>> completed.
>>>
>>> In qemu_event_set, the event object's value is atomically set to EV_SET.
>>> If the previous value was EV_BUSY, qemu_futex_wake() is called. All of
>>> this is outside any mutex, however, so apart from memory coherence
>>> (there are barriers) this can race with the waiting thread.
>>> qemu_event_wait() reads the event's value. If EV_FREE, it's atomically
>>> set to EV_BUSY. Then the mutex is locked, the value is checked again,
>>> and if it's still EV_BUSY, it waits for the condition variable,
>>> otherwise the mutex is immediately unlocked again. If the trigger
>>> thread's qemu_event_set() flip to EV_SET occurs between the waiting
>>> thread's two atomic reads of the value, the waiting thread will never
>>> wait for the condition variable, but the trigger thread WILL try to
>>> acquire the mutex and signal the condition variable in
>>> qemu_futex_wake(), by which time the waiting thread may have advanced
>>> outside of qemu_event_wait().
>>
>> Sorry if I'm making a mistake again, but the waiting thread won't set to
>> EV_BUSY unless the value is EV_FREE on the second read, so the trigger
>> thread will not call qemu_futex_wake() if it manages to set to EV_SET
>> before the second read, will it?
>
> This sequence of events will cause the problem:
>
> WAITER (in qemu_event_wait):
>     value = qatomic_load_acquire(&ev->value);
>     -> EV_FREE
>
> TRIGGER (in qemu_event_set):
>     qatomic_read(&ev->value) != EV_SET
>     -> EV_FREE (condition is false)
>
> WAITER:
>     qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
>     -> cmpxchg returns EV_FREE, condition false.
>     ev->value = EV_BUSY.
>
> TRIGGER:
>     int old = qatomic_xchg(&ev->value, EV_SET);
>     smp_mb__after_rmw();
>     if (old == EV_BUSY) {
>     -> old = EV_BUSY, condition true.
>     ev->value = EV_SET
>
> WAITER (in qemu_futex_wait(ev, EV_BUSY)):
>     pthread_mutex_lock(&ev->lock);
>     if (ev->value == val) {
>     -> false, because value is EV_SET
>
> WAITER:
>     pthread_mutex_unlock(&ev->lock);
>     …
>     qemu_event_destroy(&job->done_event);
>
> TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
>     pthread_mutex_lock(&ev->lock);
>     -> hangs, because mutex has been destroyed

Thanks for the clarification. This is very insightful.
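To make the interleaving above easier to follow, here is a compact paraphrase of the two code paths involved, reconstructed from the calls quoted in the trace: the non-futex path used on macOS, where qemu_futex_wait()/qemu_futex_wake() are emulated with the event's own mutex and condition variable. It is a sketch, not the verbatim QEMU source.

```c
/* Paraphrased from the calls quoted in the trace above -- not verbatim
 * QEMU source.  EV_FREE/EV_BUSY/EV_SET and ev->lock/ev->cond are the
 * names referenced in the quoted code. */

void qemu_event_set(QemuEvent *ev)
{
    if (qatomic_read(&ev->value) != EV_SET) {
        int old = qatomic_xchg(&ev->value, EV_SET);
        smp_mb__after_rmw();
        if (old == EV_BUSY) {
            /* qemu_futex_wake(ev, INT_MAX), emulated on macOS: */
            pthread_mutex_lock(&ev->lock);   /* may execute after the waiter
                                              * has already returned and
                                              * destroyed the event */
            pthread_cond_broadcast(&ev->cond);
            pthread_mutex_unlock(&ev->lock);
        }
    }
}

void qemu_event_wait(QemuEvent *ev)
{
    unsigned value = qatomic_load_acquire(&ev->value);

    if (value != EV_SET) {
        if (value == EV_FREE &&
            qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
            /* The trigger flipped the value to EV_SET before our cmpxchg;
             * it saw EV_FREE there, so it will not call qemu_futex_wake(). */
            return;
        }
        /* qemu_futex_wait(ev, EV_BUSY), emulated on macOS.  If the trigger's
         * xchg ran after our cmpxchg, it saw EV_BUSY and is heading for
         * qemu_futex_wake(); we may observe EV_SET here, skip the cond_wait
         * and return -- and possibly destroy the event -- before the trigger
         * has even acquired ev->lock. */
        pthread_mutex_lock(&ev->lock);
        if (ev->value == EV_BUSY) {
            pthread_cond_wait(&ev->cond, &ev->lock);
        }
        pthread_mutex_unlock(&ev->lock);
    }
}
```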
>>> This is all fine usually, BUT if you destroy the QemuEvent immediately
>>> after the qemu_event_wait() call, qemu_futex_wake() may try to lock a
>>> mutex that has been destroyed, or signal a condition variable which has
>>> been destroyed. I don't see a reasonable way of making this safe other
>>> than using long-lived mutexes and condition variables. And anyway, we
>>> have much, MUCH bigger contention/performance issues coming from almost
>>> everything being covered by the BQL. (If waking these callbacks can even
>>> be considered an issue: I haven't seen it show up in profiling, whereas
>>> BQL contention very much does.)
>>>
>>> I'll submit v5 of this patch set with separate condition variables for
>>> each job type. This should make the occurrence of waking the wrong
>>> thread quite rare, while reasoning about correctness is pretty
>>> straightforward. I think that's good enough.
>>
>> What about using QemuSemaphore then? It does not seem to have the same
>> problem as QemuEvent.
>
> Nowhere else in the code base uses short-lived semaphores, and while I
> can't immediately see a risk (the mutex is held during both post and
> wait) there might be some non-obvious problem with the approach.
> Internally, the semaphores use condition variables. The solution using
> condition variables directly already works, is safe, relatively easy to
> reason about, and does not cause any performance issues. There is a tiny
> inefficiency about waking up a thread unnecessarily in the rare case when
> two callbacks of the same kind occur concurrently. In practice, it's
> irrelevant. Thanks to the awkward mismatch of the PVGraphics.framework's
> libdispatch-based approach and QEMU's BQL/AIO/BH approach, we are already
> sending messages to other threads very frequently. This isn't ideal, but
> it is not fixable without drastically reducing the need to acquire the
> BQL across QEMU.

I found several usages of ephemeral semaphores:
- h_random() in hw/ppc/spapr_rng.c
- colo_process_checkpoint() in migration/colo.c
- postcopy_thread_create() in migration/postcopy-ram.c

I'm sure short-lived semaphores will keep working (otherwise they would
break migration in strange ways).

> I do not think it is worth spending even more time trying to fix this
> part of the code, which isn't broken in the first place.

I'm sorry to have dragged you into this mess; I didn't really expect it. I
thought that combining a shared condition-variable/mutex pair with
job-specific bools was unnecessarily complex, and that having one
synchronization primitive per job would be simpler and would just work.
However, there was a pitfall with QemuEvent, as you demonstrated, and when
I now grep for QemuEvent and QemuSemaphore I find that all such ephemeral
usage is written with QemuSemaphore instead of QemuEvent.

I think the critical problem here is that it is not documented that
qemu_event_destroy() cannot be called immediately after qemu_event_wait().
We would not have run into this situation if that had been written down. I
will write a patch to add such a documentation comment.
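A possible shape for such a comment, next to the existing declaration in include/qemu/thread.h; the wording below is only a sketch of the constraint described above, not the actual patch:

```c
/* include/qemu/thread.h -- illustrative wording only, not the actual patch */

/**
 * qemu_event_destroy:
 *
 * Free the resources held by @ev.
 *
 * Note that qemu_event_set() may still be executing inside the event
 * (waking up waiters) even after a woken qemu_event_wait() has already
 * returned.  The caller must therefore guarantee that no thread can still
 * be inside qemu_event_set() on @ev before destroying it; returning from
 * qemu_event_wait() alone is not a sufficient guarantee.
 */
void qemu_event_destroy(QemuEvent *ev);
```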