From: Akihiko Odaki <akihiko.odaki@daynix.com>
To: Phil Dennis-Jordan <phil@philjordan.eu>
Cc: qemu-devel@nongnu.org, agraf@csgraf.de, peter.maydell@linaro.org,
	pbonzini@redhat.com, rad@semihalf.com, quic_llindhol@quicinc.com,
	marcin.juszkiewicz@linaro.org, stefanha@redhat.com,
	mst@redhat.com, slp@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, marcel.apfelbaum@gmail.com,
	gaosong@loongson.cn, jiaxun.yang@flygoat.com,
	chenhuacai@kernel.org, kwolf@redhat.com, hreitz@redhat.com,
	philmd@linaro.org, shorne@gmail.com, palmer@dabbelt.com,
	alistair.francis@wdc.com, bmeng.cn@gmail.com,
	liwei1518@gmail.com, dbarboza@ventanamicro.com,
	zhiwei_liu@linux.alibaba.com, jcmvbkbc@gmail.com,
	marcandre.lureau@redhat.com, berrange@redhat.com,
	qemu-arm@nongnu.org, qemu-block@nongnu.org,
	qemu-riscv@nongnu.org, Alexander Graf <graf@amazon.com>
Subject: Re: [PATCH v4 02/15] hw/display/apple-gfx: Introduce ParavirtualizedGraphics.Framework support
Date: Tue, 29 Oct 2024 16:42:30 +0900
Message-ID: <a8b4472c-8741-4f80-87e5-b406c56d1acd@daynix.com>
In-Reply-To: <CAAibmn1Th6w6JJcxgD7HA62qo-Lk0yV-Cg4XK9OYEtDvvQbhrg@mail.gmail.com>

On 2024/10/29 6:06, Phil Dennis-Jordan wrote:
> 
> On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> 
> > On 2024/10/28 23:13, Phil Dennis-Jordan wrote:
> > 
> > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> > > 
> > > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
> > > > 
> > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <phil@philjordan.eu> wrote:
> > > > > 
> > > > > > > > > > Hmm. I think if we were to use that, we would need to
> > > > > > > > > > create a new QemuEvent for every job and destroy it
> > > > > > > > > > afterward, which seems expensive. We can't rule out
> > > > > > > > > > multiple concurrent jobs being submitted, and the QemuEvent
> > > > > > > > > > system only supports a single producer as far as I can
> > > > > > > > > > tell.
> > > > > > > > > > 
> > > > > > > > > > You can probably sort of hack around it with just one
> > > > > > > > > > QemuEvent by putting the qemu_event_wait into a loop and
> > > > > > > > > > turning the job.done flag into an atomic (because it would
> > > > > > > > > > now need to be checked outside the lock), but this all
> > > > > > > > > > seems unnecessarily complicated considering that QemuEvent
> > > > > > > > > > internally uses the same mechanism as QemuCond/QemuMutex on
> > > > > > > > > > macOS (the only platform relevant here), except that with
> > > > > > > > > > QemuCond/QemuMutex we can use it as intended rather than
> > > > > > > > > > having to work against the abstraction.
> > > > > > > > > 
> > > > > > > > > I don't think it's going to be used concurrently. It would
> > > > > > > > > be difficult to reason about even for the framework if it
> > > > > > > > > performed memory unmapping/mapping/reading operations
> > > > > > > > > concurrently.
> > > > > > > > 
> > > > > > > > I've just performed a very quick test by wrapping the job
> > > > > > > > submission/wait in the 2 mapMemory callbacks and the 1
> > > > > > > > readMemory callback with atomic counters and logging whenever
> > > > > > > > a counter went above 1.
> > > > > > > > 
> > > > > > > >  * Overall, concurrent callbacks across all types were common
> > > > > > > >    (many per second when the VM is busy). It's not exactly a
> > > > > > > >    "thundering herd" (I never saw >2), but it's probably not a
> > > > > > > >    bad idea to use a separate condition variable for each job
> > > > > > > >    type (task map, surface map, memory read).
> > > > > > > >  * While I did not observe any concurrent memory mapping
> > > > > > > >    operations *within* a type of memory map (2 task mappings
> > > > > > > >    or 2 surface mappings), I did see very occasional
> > > > > > > >    concurrent memory *read* callbacks. These would, as far as
> > > > > > > >    I can tell, not be safe with QemuEvents, unless we placed
> > > > > > > >    the event inside the job struct and init/destroyed it on
> > > > > > > >    every callback (which seems like excessive overhead).
> > > > > > > 
> > > > > > > I think we can tolerate that overhead. init/destroy essentially
> > > > > > > just sets the fields in the data structure, and I estimate its
> > > > > > > total size is about 100 bytes. It is probably better than waking
> > > > > > > up an irrelevant thread. I also hope that keeps the code simple;
> > > > > > > it's not worthwhile adding code to optimize this.
> > > > > > 
> > > > > > At least pthread_cond_{init,destroy} and
> > > > > > pthread_mutex_{init,destroy} don't make any syscalls, so yeah,
> > > > > > it's probably an acceptable overhead.
> > > > > 
> > > > > I've just experimented with QemuEvents created on-demand and ran
> > > > > into some weird deadlocks, which then made me sit down and think
> > > > > about it some more. I've come to the conclusion that creating (and
> > > > > crucially, destroying) QemuEvents on demand in this way is not safe.
> > > > > 
> > > > > Specifically, you must not call qemu_event_destroy() - which
> > > > > transitively destroys the mutex and condition variable - unless you
> > > > > can guarantee that the qemu_event_set() call on that event object
> > > > > has completed.
> > > > > 
> > > > > In qemu_event_set, the event object's value is atomically set to
> > > > > EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is
> > > > > called. All of this happens outside any mutex, however, so apart
> > > > > from memory coherence (there are barriers) it can race with the
> > > > > waiting thread. qemu_event_wait() reads the event's value. If it is
> > > > > EV_FREE, the value is atomically set to EV_BUSY. Then the mutex is
> > > > > locked, the value is checked again, and if it's still EV_BUSY, the
> > > > > thread waits for the condition variable; otherwise the mutex is
> > > > > immediately unlocked again. If the trigger thread's qemu_event_set()
> > > > > flip to EV_SET occurs between the waiting thread's two atomic reads
> > > > > of the value, the waiting thread will never wait for the condition
> > > > > variable, but the trigger thread WILL try to acquire the mutex and
> > > > > signal the condition variable in qemu_futex_wake(), by which time
> > > > > the waiting thread may have advanced outside of qemu_event_wait().
> > > > 
> > > > Sorry if I'm making a mistake again, but the waiting thread won't set
> > > > the value to EV_BUSY unless it is EV_FREE on the second read, so the
> > > > trigger thread will not call qemu_futex_wake() if it manages to set
> > > > it to EV_SET before the second read, will it?
> > > 
> > > This sequence of events will cause the problem:
> > > 
> > > WAITER (in qemu_event_wait):
> > >     value = qatomic_load_acquire(&ev->value);
> > >     -> EV_FREE
> > > 
> > > TRIGGER (in qemu_event_set):
> > >     qatomic_read(&ev->value) != EV_SET
> > >     -> EV_FREE, condition is false
> > > 
> > > WAITER:
> > >     qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
> > >     -> cmpxchg returns EV_FREE, condition is false;
> > >        ev->value is now EV_BUSY
> > > 
> > > TRIGGER:
> > >     int old = qatomic_xchg(&ev->value, EV_SET);
> > >     smp_mb__after_rmw();
> > >     if (old == EV_BUSY) {
> > >     -> old == EV_BUSY, condition is true;
> > >        ev->value is now EV_SET
> > > 
> > > WAITER (in qemu_futex_wait(ev, EV_BUSY)):
> > >     pthread_mutex_lock(&ev->lock);
> > >     if (ev->value == val) {
> > >     -> false, because value is EV_SET
> > > 
> > > WAITER:
> > >     pthread_mutex_unlock(&ev->lock);
> > >     …
> > >     qemu_event_destroy(&job->done_event);
> > > 
> > > TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
> > >     pthread_mutex_lock(&ev->lock);
> > >     -> hangs, because the mutex has been destroyed
> > 
> > Thanks for the clarification. This is very insightful.
> 
> > > > > This is all fine usually, BUT if you destroy the QemuEvent
> > > > > immediately after the qemu_event_wait() call, qemu_futex_wake() may
> > > > > try to lock a mutex that has been destroyed, or signal a condition
> > > > > variable which has been destroyed. I don't see a reasonable way of
> > > > > making this safe other than using long-lived mutexes and condition
> > > > > variables. And anyway, we have much, MUCH bigger
> > > > > contention/performance issues coming from almost everything being
> > > > > covered by the BQL. (If waking these callbacks can even be
> > > > > considered an issue: I haven't seen it show up in profiling,
> > > > > whereas BQL contention very much does.)
> > > > > 
> > > > > I'll submit v5 of this patch set with separate condition variables
> > > > > for each job type. This should make the occurrence of waking the
> > > > > wrong thread quite rare, while reasoning about correctness is
> > > > > pretty straightforward. I think that's good enough.
> 
> > What about using QemuSemaphore then? It does not seem to have the same
> > problem as QemuEvent.
> 
> Nothing else in the code base uses short-lived semaphores, and while I
> can't immediately see a risk (the mutex is held during both post and
> wait), there might be some non-obvious problem with the approach.
> Internally, the semaphores use condition variables. The solution using
> condition variables directly already works, is safe, relatively easy to
> reason about, and does not cause any performance issues. There is a tiny
> inefficiency in waking up a thread unnecessarily in the rare case when
> two callbacks of the same kind occur concurrently. In practice, it's
> irrelevant. Thanks to the awkward mismatch between the
> PVGraphics.framework's libdispatch-based approach and QEMU's BQL/AIO/BH
> approach, we are already sending messages to other threads very
> frequently. This isn't ideal, but it's not fixable without drastically
> reducing the need to acquire the BQL across QEMU.

I found several usages of ephemeral semaphores:

  h_random() in hw/ppc/spapr_rng.c
  colo_process_checkpoint() in migration/colo.c
  postcopy_thread_create() in migration/postcopy-ram.c

I'm sure short-lived semaphores will keep working; otherwise migration
would break in strange ways.
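
For reference, the pattern in postcopy_thread_create() is roughly the
following (a simplified sketch, not the actual code; everything except
the qemu_sem_*() and qemu_thread_create() calls is made up for
illustration):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    typedef struct CreateArgs {
        QemuSemaphore ready; /* lives only for the duration of the call */
    } CreateArgs;

    static void *worker_fn(void *opaque)
    {
        CreateArgs *args = opaque;
        /* ... per-thread setup ... */
        qemu_sem_post(&args->ready); /* wake the creating thread */
        /* args must not be touched after this point */
        return NULL;
    }

    static void create_and_wait(QemuThread *thread)
    {
        CreateArgs args;

        qemu_sem_init(&args.ready, 0);
        qemu_thread_create(thread, "worker", worker_fn, &args,
                           QEMU_THREAD_JOINABLE);
        qemu_sem_wait(&args.ready);
        /* Destroying right after the wait is fine for a semaphore:
         * as you noted, the internal mutex is held during both post
         * and wait, so the poster cannot still be touching the
         * semaphore once the wait has returned. */
        qemu_sem_destroy(&args.ready);
    }

This destroy-after-wait step is exactly what is not guaranteed to be
safe with QemuEvent.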

> 
> I do not think it is worth spending even more time trying to fix this
> part of the code, which isn't broken in the first place.

I'm sorry for dragging you into this mess, which I didn't really expect.
I thought combining a shared pair of condition variable and mutex with
job-specific bools was unnecessarily complex, and that having one
synchronization primitive per job would be simpler and would just work.

However, there was a pitfall with QemuEvent, as you demonstrated. Now,
grepping for QemuEvent and QemuSemaphore, I find that all such ephemeral
usage is written with QemuSemaphore instead of QemuEvent. I think the
critical problem here is that it is not documented that
qemu_event_destroy() cannot be called immediately after
qemu_event_wait(). We would not have run into this situation if that had
been written down. I will write a patch to add such a documentation
comment.
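
As a first sketch, it could read roughly like this (the final wording
will be refined in the patch; I believe the natural place is next to the
qemu_event_*() declarations in include/qemu/thread.h):

    /*
     * Sketch only; exact wording to be decided in the actual patch.
     *
     * qemu_event_destroy() may only be called once it is guaranteed
     * that no concurrent qemu_event_set() is still in progress on the
     * event. Note that qemu_event_wait() returning does NOT provide
     * that guarantee: on the condition-variable-based implementations
     * the setter may still be inside qemu_futex_wake(), touching the
     * event's internal mutex and condition variable. Users of
     * ephemeral events must therefore ensure by other means that the
     * setter has finished before destroying the event.
     */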

