From: Akihiko Odaki <akihiko.odaki@daynix.com>
To: Phil Dennis-Jordan <phil@philjordan.eu>
Cc: qemu-devel@nongnu.org, agraf@csgraf.de, peter.maydell@linaro.org,
pbonzini@redhat.com, rad@semihalf.com, quic_llindhol@quicinc.com,
marcin.juszkiewicz@linaro.org, stefanha@redhat.com,
mst@redhat.com, slp@redhat.com, richard.henderson@linaro.org,
eduardo@habkost.net, marcel.apfelbaum@gmail.com,
gaosong@loongson.cn, jiaxun.yang@flygoat.com,
chenhuacai@kernel.org, kwolf@redhat.com, hreitz@redhat.com,
philmd@linaro.org, shorne@gmail.com, palmer@dabbelt.com,
alistair.francis@wdc.com, bmeng.cn@gmail.com,
liwei1518@gmail.com, dbarboza@ventanamicro.com,
zhiwei_liu@linux.alibaba.com, jcmvbkbc@gmail.com,
marcandre.lureau@redhat.com, berrange@redhat.com,
qemu-arm@nongnu.org, qemu-block@nongnu.org,
qemu-riscv@nongnu.org, Alexander Graf <graf@amazon.com>
Subject: Re: [PATCH v4 02/15] hw/display/apple-gfx: Introduce ParavirtualizedGraphics.Framework support
Date: Tue, 29 Oct 2024 16:42:30 +0900
Message-ID: <a8b4472c-8741-4f80-87e5-b406c56d1acd@daynix.com>
In-Reply-To: <CAAibmn1Th6w6JJcxgD7HA62qo-Lk0yV-Cg4XK9OYEtDvvQbhrg@mail.gmail.com>
On 2024/10/29 6:06, Phil Dennis-Jordan wrote:
>
>
> On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2024/10/28 23:13, Phil Dennis-Jordan wrote:
> >
> >
> > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >
> > On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
> > >
> > >
> > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <phil@philjordan.eu> wrote:
> > >
> > >
> > > > >
> > > > > Hmm. I think if we were to use that, we
> would
> > need to
> > > create a new
> > > > > QemuEvent for every job and destroy it
> afterward,
> > > which seems
> > > > expensive.
> > > > > We can't rule out multiple concurrent
> jobs being
> > > submitted, and the
> > > > > QemuEvent system only supports a single
> producer as
> > > far as I can
> > > > tell.
> > > > >
> > > > > You can probably sort of hack around it with
> > just one
> > > QemuEvent by
> > > > > putting the qemu_event_wait into a loop
> and turning
> > > the job.done
> > > > flag
> > > > > into an atomic (because it would now
> need to be
> > > checked outside the
> > > > > lock) but this all seems unnecessarily
> complicated
> > > considering the
> > > > > QemuEvent uses the same mechanism QemuCond/
> > QemuMutex
> > > internally
> > > > on macOS
> > > > > (the only platform relevant here), except we
> > can use it as
> > > > intended with
> > > > > QemuCond/QemuMutex rather than having to
> work
> > against the
> > > > abstraction.
> > > >
> > > > I don't think it's going to be used concurrently. It would be
> > > > difficult to reason even for the framework if it performs memory
> > > > unmapping/mapping/reading operations concurrently.
> > > >
> > > >
> > > > I've just performed a very quick test by wrapping the job
> > > > submission/wait in the 2 mapMemory callbacks and the 1 readMemory
> > > > callback with atomic counters and logging whenever a counter went
> > > > above 1.
> > > >
> > > > * Overall, concurrent callbacks across all types were common (many
> > > >   per second when the VM is busy). It's not exactly a "thundering
> > > >   herd" (I never saw >2) but it's probably not a bad idea to use a
> > > >   separate condition variable for each job type. (task map, surface
> > > >   map, memory read)
> > > > * While I did not observe any concurrent memory mapping operations
> > > >   *within* a type of memory map (2 task mappings or 2 surface
> > > >   mappings) I did see very occasional concurrent memory *read*
> > > >   callbacks. These would, as far as I can tell, not be safe with
> > > >   QemuEvents, unless we placed the event inside the job struct and
> > > >   init/destroyed it on every callback (which seems like excessive
> > > >   overhead).
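
The check described above can be sketched roughly as follows (hypothetical code, not the actual test; the callback wrapper name is made up):

  /* Hypothetical sketch of the concurrency check described above. */
  #include "qemu/osdep.h"
  #include "qemu/atomic.h"

  static int mem_read_in_flight;

  /* Stand-in for the wrapper around the PVG readMemory callback. */
  static void on_memory_read(void)
  {
      if (qatomic_fetch_inc(&mem_read_in_flight) > 0) {
          fprintf(stderr, "apple-gfx: concurrent readMemory callbacks\n");
      }
      /* ... submit the job to a BH and wait for completion ... */
      qatomic_dec(&mem_read_in_flight);
  }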
> > >
> > > I think we can tolerate that overhead. init/destroy essentially sets
> > > the fields in the data structure and I estimate its total size is
> > > about 100 bytes. It is probably better than waking an irrelevant
> > > thread up. I also hope that keeps the code simple; it's not worthwhile
> > > adding code to optimize this.
> > >
> > >
> > > At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy}
> > > don't make any syscalls, so yeah it's probably an acceptable overhead.
> > >
> > >
> > > I've just experimented with QemuEvents created on-demand and ran into
> > > some weird deadlocks, which then made me sit down and think about it
> > > some more. I've come to the conclusion that creating (and crucially,
> > > destroying) QemuEvents on demand in this way is not safe.
> > >
> > > Specifically, you must not call qemu_event_destroy() - which
> > > transitively destroys the mutex and condition variable - unless you
> > > can guarantee that the qemu_event_set() call on that event object has
> > > completed.
> > >
> > > In qemu_event_set, the event object's value is atomically set to
> > > EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is
> > > called. All of this is outside any mutex, however, so apart from
> > > memory coherence (there are barriers) this can race with the waiting
> > > thread. qemu_event_wait() reads the event's value. If EV_FREE, it's
> > > atomically set to EV_BUSY. Then the mutex is locked, the value is
> > > checked again, and if it's still EV_BUSY, it waits for the condition
> > > variable, otherwise the mutex is immediately unlocked again. If the
> > > trigger thread's qemu_event_set() flip to EV_SET occurs between the
> > > waiting thread's two atomic reads of the value, the waiting thread
> > > will never wait for the condition variable, but the trigger thread
> > > WILL try to acquire the mutex and signal the condition variable in
> > > qemu_futex_wake(), by which time the waiting thread may have advanced
> > > outside of qemu_event_wait().
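
To make the failure mode concrete, here is a minimal sketch of the create-wait-destroy-per-job pattern being discussed (hypothetical names; this is not the actual apple-gfx code):

  #include "qemu/osdep.h"
  #include "qemu/thread.h"

  typedef struct Job {
      QemuEvent done_event;
      /* ... job parameters and results ... */
  } Job;

  void schedule_job(Job *job); /* hypothetical: hands the job to a BH */

  /* Waiter side (e.g. the PVG callback thread): */
  static void run_job_and_wait(Job *job)
  {
      qemu_event_init(&job->done_event, false);
      schedule_job(job);
      qemu_event_wait(&job->done_event);    /* may return while qemu_event_set()
                                             * is still executing on the trigger
                                             * thread */
      qemu_event_destroy(&job->done_event); /* may tear down the mutex/condvar
                                             * that qemu_futex_wake() is about
                                             * to touch */
  }

  /* Trigger side (e.g. the BH running under the BQL): */
  static void job_done(Job *job)
  {
      qemu_event_set(&job->done_event);
  }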
> >
> > Sorry if I'm making a mistake again, but the waiting thread won't set
> > to EV_BUSY unless the value is EV_FREE on the second read, so the
> > trigger thread will not call qemu_futex_wake() if it manages to set to
> > EV_SET before the second read, will it?
> >
> >
> > This sequence of events will cause the problem:
> >
> > WAITER (in qemu_event_wait):
> > value = qatomic_load_acquire(&ev->value);
> > -> EV_FREE
> >
> > TRIGGER (in qemu_event_set):
> > qatomic_read(&ev->value) != EV_SET
> > -> EV_FREE (condition is false)
> >
> > WAITER:
> > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
> > -> cmpxchg returns EV_FREE, condition false.
> > ev->value = EV_BUSY.
> >
> > TRIGGER:
> > int old = qatomic_xchg(&ev->value, EV_SET);
> > smp_mb__after_rmw();
> > if (old == EV_BUSY) {
> > -> old = EV_BUSY, condition true.
> > ev->value = EV_SET
> >
> > WAITER (in qemu_futex_wait(ev, EV_BUSY)):
> > pthread_mutex_lock(&ev->lock);
> > if (ev->value == val) {
> > -> false, because value is EV_SET
> >
> > WAITER:
> > pthread_mutex_unlock(&ev->lock);
> > …
> > qemu_event_destroy(&job->done_event);
> >
> > TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
> > pthread_mutex_lock(&ev->lock);
> > -> hangs, because mutex has been destroyed
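
For reference, on hosts without futex support (such as macOS), qemu_futex_wake() and qemu_futex_wait() are emulated with the event's own mutex and condition variable. Paraphrased and simplified (not verbatim QEMU source), the logic is roughly:

  static void qemu_futex_wake(QemuEvent *ev, int n)
  {
      pthread_mutex_lock(&ev->lock);   /* <-- hangs if the waiter has already
                                        *     destroyed the event */
      if (n == 1) {
          pthread_cond_signal(&ev->cond);
      } else {
          pthread_cond_broadcast(&ev->cond);
      }
      pthread_mutex_unlock(&ev->lock);
  }

  static void qemu_futex_wait(QemuEvent *ev, unsigned val)
  {
      pthread_mutex_lock(&ev->lock);
      if (ev->value == val) {
          pthread_cond_wait(&ev->cond, &ev->lock);
      }
      pthread_mutex_unlock(&ev->lock);
  }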

Thanks for clarification. This is very insightful.

> > >
> > > This is all fine usually, BUT if you destroy the QemuEvent immediately
> > > after the qemu_event_wait() call, qemu_futex_wake() may try to lock a
> > > mutex that has been destroyed, or signal a condition variable which has
> > > been destroyed. I don't see a reasonable way of making this safe other
> > > than using long-lived mutexes and condition variables. And anyway, we
> > > have much, MUCH bigger contention/performance issues coming from almost
> > > everything being covered by the BQL. (If waking these callbacks can
> > > even be considered an issue: I haven't seen it show up in profiling,
> > > whereas BQL contention very much does.)
> > >
> > > I'll submit v5 of this patch set with separate condition variables for
> > > each job type. This should make the occurrence of waking the wrong
> > > thread quite rare, while reasoning about correctness is pretty
> > > straightforward. I think that's good enough.
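
A minimal sketch of that scheme, a shared mutex plus one condition variable and one "done" flag per job type, might look like this (hypothetical names; not the actual v5 patch):

  #include "qemu/osdep.h"
  #include "qemu/thread.h"

  typedef struct AppleGFXJobs {      /* hypothetical container */
      QemuMutex lock;
      QemuCond  task_map_cond;
      QemuCond  surface_map_cond;
      QemuCond  mem_read_cond;
  } AppleGFXJobs;

  typedef struct MemReadJob {
      bool done;
      /* ... request and result fields ... */
  } MemReadJob;

  /* PVG callback thread: wait for the BH to finish this job. */
  static void wait_mem_read(AppleGFXJobs *jobs, MemReadJob *job)
  {
      qemu_mutex_lock(&jobs->lock);
      while (!job->done) {
          qemu_cond_wait(&jobs->mem_read_cond, &jobs->lock);
      }
      qemu_mutex_unlock(&jobs->lock);
  }

  /* BH running under the BQL: mark the job done, wake only mem-read waiters. */
  static void complete_mem_read(AppleGFXJobs *jobs, MemReadJob *job)
  {
      qemu_mutex_lock(&jobs->lock);
      job->done = true;
      qemu_cond_broadcast(&jobs->mem_read_cond);
      qemu_mutex_unlock(&jobs->lock);
  }

Both the mutex and the condition variables are long-lived, so there is no destroy-after-wake race; the cost is the occasional spurious wake-up when two jobs of the same type complete close together.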
>
> What about using QemuSemaphore then? It does not seem to have the same
> problem as QemuEvent.
>
>
> Nothing else in the code base uses short-lived semaphores, and while I
> can't immediately see a risk (the mutex is held during both post and
> wait) there might be some non-obvious problem with the approach.
> Internally, the semaphores use condition variables. The solution using
> condition variables directly already works, is safe, relatively easy to
> reason about, and does not cause any performance issues. There is a tiny
> inefficiency about waking up a thread unnecessarily in the rare case
> when two callbacks of the same kind occur concurrently. In practice,
> it's irrelevant. Thanks to the awkward mismatch of the
> PVGraphics.framework's libdispatch based approach and Qemu's BQL/AIO/BH
> approach, we are already sending messages to other threads very
> frequently. This isn't ideal, but not fixable without drastically
> reducing the need to acquire the BQL across Qemu.
I found several usages of ephemeral semaphores:
h_random() in hw/ppc/spapr_rng.c
colo_process_checkpoint() in migration/colo.c
postcopy_thread_create() in migration/postcopy-ram.c
So I'm confident short-lived semaphores keep working (otherwise migration
would break in strange ways).
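
The shape those call sites share is roughly the following; this is a generic sketch, not a copy of any of the three functions:

  #include "qemu/osdep.h"
  #include "qemu/thread.h"

  static void *worker(void *opaque)
  {
      QemuSemaphore *sem = opaque;
      /* ... do the one-off work ... */
      qemu_sem_post(sem);     /* signal completion */
      return NULL;
  }

  static void run_once(void)
  {
      QemuSemaphore sem;
      QemuThread thread;

      qemu_sem_init(&sem, 0);
      qemu_thread_create(&thread, "worker", worker, &sem, QEMU_THREAD_JOINABLE);
      qemu_sem_wait(&sem);
      /* Per the discussion above, destroying right after the wait is fine for
       * QemuSemaphore because post and wait both hold its internal mutex. */
      qemu_sem_destroy(&sem);
      qemu_thread_join(&thread);
  }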
>
> I do not think it is worth spending even more time trying to fix this
> part of the code which isn't broken in the first place.
I'm sorry for dragging you into this mess, which I didn't really expect. I
thought that combining a shared condition variable/mutex pair with
job-specific bools was unnecessarily complex, and that having one
synchronization primitive per job would be simpler and would just work.
However, there was a pitfall with QemuEvent, as you demonstrated, and now
that I grep for QemuEvent and QemuSemaphore, I find that all such ephemeral
usage is written with QemuSemaphore instead of QemuEvent. I think the
critical problem here is that it is not documented that
qemu_event_destroy() cannot be called immediately after qemu_event_wait().
We would not have run into this situation if that were written down. I will
write a patch to add such a documentation comment.
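
Such a comment might look roughly like this (the wording is only a suggestion, not the actual patch):

  /*
   * qemu_event_destroy: free the resources associated with a QemuEvent.
   *
   * May only be called once no other thread can still be inside
   * qemu_event_set() or qemu_event_wait() on this event.  In particular,
   * a waiter returning from qemu_event_wait() does NOT guarantee that the
   * corresponding qemu_event_set() has finished, so the event must not be
   * destroyed immediately afterwards; use a long-lived event, or a
   * QemuSemaphore, for that kind of handshake.
   */
  void qemu_event_destroy(QemuEvent *ev);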