* Best practice for issuing blocking calls in response to an event
@ 2025-03-20 16:34 Miles Glenn
From: Miles Glenn @ 2025-03-20 16:34 UTC
To: qemu-devel; +Cc: stefanha
Hello,
I am attempting to simulate a system with multiple CPU
architectures. To do this, I start a separate QEMU process for each
CPU architecture that is needed. I'm also developing some QEMU code
that transports MMIO transactions across the process boundaries
using sockets.
The design takes MMIO request messages off of a socket, services the
request by calling address_space_ldq_be(), then sends a response
message (containing the requested data) over the same
socket. Currently, this is all done inside the socket IOReadHandler
callback function.
This works as long as the targeted register exists in the same QEMU
process that received the request. However, if the register exists in
another QEMU process, then the call to address_space_ldq_be() results
in another socket message being sent to that QEMU process, requesting
the data, and then waiting (blocking) for the response message
containing the data. In other words, the call ends up blocking inside
the event handler. Even though the QEMU process containing the target
register is able to receive the request and send the response, the
originator of the request cannot receive that response until it
eventually times out and stops blocking. Once it times out, it does
receive the response, but by then it is too late.
Here's a summary of the stack up to where the code blocks:

IOReadHandler callback
  calls address_space_ldq_be()
    resolves to mmio read op of a remote device
      sends request over socket and waits (blocks) for response
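
For reference, here is a rough sketch of what the handler does today
(the message structs and the parse/send helpers are hypothetical
placeholders, not the actual code):

    typedef struct { uint64_t addr; } MMIORequest;    /* hypothetical */
    typedef struct { uint64_t data; } MMIOResponse;   /* hypothetical */

    static void mmio_sock_read_handler(void *opaque, const uint8_t *buf,
                                       int size)
    {
        MMIORequest req;
        MMIOResponse rsp;

        parse_request(buf, size, &req);   /* decode the request message */

        /* Blocks when the register lives in a remote QEMU process:
         * the mmio read op sends its own socket request and waits for
         * the reply before returning. */
        rsp.data = address_space_ldq_be(&address_space_memory, req.addr,
                                        MEMTXATTRS_UNSPECIFIED, NULL);

        send_response(opaque, &rsp);      /* reply on the same socket */
    }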
So, I'm looking for a way to handle the work of calling
address_space_ldq_be(), which might block when attempting to read a
register of a remote device, without blocking inside the IOReadHandler
callback context.
I've done a lot of searching and reading on the web and in the QEMU
code, but it's still not really clear to me how this should be done in
QEMU. I've seen a lot about using coroutines to handle cases like
this. Is that what I should be using here?
Thanks,
Glenn Miles
* Re: Best practice for issuing blocking calls in response to an event

From: Stefan Hajnoczi @ 2025-03-20 20:09 UTC
To: milesg; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Thu, Mar 20, 2025 at 12:34 PM Miles Glenn <milesg@linux.ibm.com> wrote:
>
> I am attempting to simulate a system with multiple CPU
> architectures. To do this, I start a separate QEMU process for each
> CPU architecture that is needed. I'm also developing some QEMU code
> that transports MMIO transactions across the process boundaries
> using sockets.

I have CCed Phil. He has been working on heterogeneous target emulation
and might be interested.

> The design takes MMIO request messages off of a socket, services the
> request by calling address_space_ldq_be(), then sends a response
> message (containing the requested data) over the same
> socket. Currently, this is all done inside the socket IOReadHandler
> callback function.

At a high level this is similar to the vfio-user feature, where a PCI
device is emulated in a separate process. That also involves sending
messages describing QEMU's MemoryRegion accesses. See the "remote"
machine type in QEMU to look at the code.

> [...]
>
> So, I'm looking for a way to handle the work of calling
> address_space_ldq_be(), which might block when attempting to read a
> register of a remote device, without blocking inside the IOReadHandler
> callback context.
>
> I've seen a lot about using coroutines to handle cases like this. Is
> that what I should be using here?

The fundamental problem is that address_space_ldq_be() is synchronous,
so there is no way to return to the caller until the response has been
received.

vfio-user didn't solve this problem. It simply blocks until the
response is received, but it does drop the Big QEMU Lock during this
time so that other vCPU threads can run. For example, see
hw/remote/proxy.c:send_bar_access_msg() and
mpqemu_msg_send_and_await_reply().

QEMU supports nested event loops, but they come with their own set of
gotchas. The way a nested event loop might help here is to send the
request and then call aio_poll() to receive the response in another
IOReadHandler. This way other event loop processing can take place
while waiting in address_space_ldq_be().
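
As a rough sketch of that pattern (RemoteDev, its fields, and the send
helper are hypothetical stand-ins, not an existing QEMU API):

    /* Sketch: mmio read op of the proxy for a remote device, waiting
     * in a nested event loop instead of blocking on the socket. */
    static uint64_t remote_reg_read(RemoteDev *dev, hwaddr addr)
    {
        send_read_request(dev, addr);   /* hypothetical helper */
        dev->reply_pending = true;

        /* Nested event loop: IOReadHandlers keep running here, so both
         * the awaited response and any new incoming requests can be
         * processed while we wait. */
        while (dev->reply_pending) {
            aio_poll(qemu_get_aio_context(), true);
        }

        /* The socket's read handler stored the data and cleared
         * reply_pending when the response message arrived. */
        return dev->reply_data;
    }

The fact that handlers for new incoming requests can still run inside
the aio_poll() loop is also what matters for the deadlock issue below.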
The second problem is that this approach, where QEMU processes send
requests to each other, needs to be implemented carefully to avoid
deadlocks. For example, devices that do DMA could load/store memory
belonging to another device handled by another QEMU process. Once there
is an A -> B -> A situation it could deadlock.

Both vfio-user and vhost-user have similar issues with their
bi-directional communication, where a device emulation process can send
a message to QEMU while processing a message from QEMU. Deadlock can be
avoided if the code is structured so that QEMU is able to receive new
requests while it is waiting for a response.

Stefan
* Re: Best practice for issuing blocking calls in response to an event

From: Miles Glenn @ 2025-03-21 15:17 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Thu, 2025-03-20 at 16:09 -0400, Stefan Hajnoczi wrote:
> [...]
>
> The fundamental problem is that address_space_ldq_be() is synchronous,
> so there is no way to return to the caller until the response has been
> received.
>
> [...]
>
> Stefan

Stefan, thank you for the quick response and great information!

I'm not sure if this is the best way, but I was able to get things
working today using the coroutine approach.

Now, the aforementioned stack looks like this:

IOReadHandler callback receives request
  enters coroutine
    calls address_space_ldq_be()
      resolves to mmio read op of a remote device
        sends request over socket
        detects coroutine context and
        calls qemu_coroutine_yield() instead of blocking
  returns to callback

<time passes>

IOReadHandler callback receives response
  re-enters coroutine
    mmio read op returns data received in response message
    address_space_ldq_be() returns
  coroutine completes and returns to callback
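
In sketch form, the two halves look roughly like this (RemoteDev, its
fields, and the parse/dispatch helpers are made-up names standing in
for my actual code):

    /* Sketch: the mmio read op detects coroutine context and yields
     * instead of blocking on the socket. */
    static uint64_t remote_reg_read(RemoteDev *dev, hwaddr addr)
    {
        send_read_request(dev, addr);    /* hypothetical helper */
        if (qemu_in_coroutine()) {
            dev->waiting_co = qemu_coroutine_self();
            qemu_coroutine_yield();      /* control returns to callback */
        }
        return dev->reply_data;          /* filled in before re-entry */
    }

    /* Sketch: socket IOReadHandler dispatch. A new request starts a
     * coroutine; a response re-enters the coroutine that yielded. */
    static void mmio_sock_read_handler(void *opaque, const uint8_t *buf,
                                       int size)
    {
        RemoteDev *dev = opaque;

        if (is_response(buf, size)) {               /* hypothetical */
            dev->reply_data = parse_response(buf, size);
            qemu_coroutine_enter(dev->waiting_co);  /* resumes after yield */
        } else {
            Coroutine *co = qemu_coroutine_create(handle_request_co, dev);
            qemu_coroutine_enter(co);    /* runs until first yield or done */
        }
    }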
While this works, I couldn't help but notice that the coroutine concept
seems to be a form of cooperative multithreading. Is there some
advantage to using coroutines over doing the work in another thread?
Does QEMU offer an interface that allows a callback to queue up work
that can be handled by another thread or a pool of threads?

Thanks,

Glenn Miles
* Re: Best practice for issuing blocking calls in response to an event

From: Stefan Hajnoczi @ 2025-03-24 18:35 UTC
To: milesg; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Fri, Mar 21, 2025 at 11:17 AM Miles Glenn <milesg@linux.ibm.com> wrote:
> [...]
>
> While this works, I couldn't help but notice that the coroutine concept
> seems to be a form of cooperative multithreading. Is there some
> advantage to using coroutines over doing the work in another thread?
> Does QEMU offer an interface that allows a callback to queue up work
> that can be handled by another thread or a pool of threads?

Coroutines make it easier to write concurrent code in an event loop.
The alternative is to write asynchronous callback functions, which is
tedious for sequences with multiple steps that need to wait for I/O.
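
To illustrate the difference, compare the two styles (every function
name below is made up for illustration):

    /* Callback style: each step that waits for I/O becomes its own
     * function, with state threaded through an opaque pointer. */
    static void step2_done(void *opaque) { finish(opaque); }
    static void step1_done(void *opaque) { start_step2(opaque, step2_done); }
    static void start_sequence(void *opaque) { start_step1(opaque, step1_done); }

    /* Coroutine style: the same sequence reads top to bottom; each
     * wait_for_*() call yields and is re-entered when its event fires. */
    static void coroutine_fn sequence_co(void *opaque)
    {
        wait_for_step1(opaque);
        wait_for_step2(opaque);
        finish(opaque);
    }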
For example, see > > hw/remote/proxy.c:send_bar_access_msg() and > > mpqemu_msg_send_and_await_reply(). > > > > QEMU supports nested event loops, but they come with their own set of > > gotchas. The way a nested event loop might help here is to send the > > request and then call aio_poll() to receive the response in another > > IOReadHandler. This way other event loop processing can take place > > while waiting in address_space_ldq_be(). > > > > The second problem is that this approach where QEMU processes send > > requests to each other needs to be implemented carefully to avoid > > deadlocks. For example, devices that do DMA could load/store memory > > belonging to another device handled by another QEMU. Once there is an > > A -> B -> A situation it could deadlock. > > > > Both vfio-user and vhost-user have similar issues with their > > bi-directional communication where a device emulation process can send > > a message to QEMU while processing a message from QEMU. Deadlock can > > be avoided if the code is structured so that QEMU is able to receive > > new requests during the time when it is waiting for a response. > > > > Stefan > > Stefan, Thank you for the quick response and great information! > > I'm not sure if this is the best way, but I was able to get things > working today using the coroutine approach. > > Now, the aforementioned stack looks like this: > > IOReadHandler callback receives request > enters coroutine > calls address_space_ldq_be() > resolves to mmio read op of a remote device > sends request > over socket > detects coroutine context and > calls qemu_coroutine_yield() instead of blocking > returns to callback > > <time passes> > > IOReadHandler callback receives response > re-enters coroutine > mmio read op returns data received in response message > address_space_ldq_be() returns > coroutine completes and returns to callback > > While this works, I couldn't help but notice that the coroutine concept > seems to be like a form of multithreading. Is there some advantage to > using coroutines over doing the work in another thread? Does QEMU > offer an interface that allows for a callback to queue up work that can > be handled by another thread or a pool of threads? Coroutines make it easier to write concurrent code in an event loop. The alternative is to write asynchronous callback functions, which is tedious for sequences with multiple steps that need to wait for I/O. Coroutines do not offer parallelism, so they are not replacement for multi-threading. QEMU is mostly event-driven rather than multi-threaded. Usually only computation in QEMU that really needs its own CPU runs in its own thread (vCPUs, compression, blocking syscalls when there is no alternative, etc). There are advantages to using coroutines: less synchronization is necessary than with threads (you can be sure no other coroutine will run in the same thread while your code is running) and this eliminates most thread-safety issues. Also, event loops are seen as more scalable than threads (lots of historical resources, for example http://www.kegel.com/c10k.html). One QEMU-specific advantage of coroutines: coroutine code has access to all of QEMU's APIs that require the event loop whereas threads need to take extra steps to interact with the rest of QEMU. Stefan ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Best practice for issuing blocking calls in response to an event

From: Miles Glenn @ 2025-03-25 15:08 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Mon, 2025-03-24 at 14:35 -0400, Stefan Hajnoczi wrote:
> [...]
>
> One QEMU-specific advantage of coroutines: coroutine code has access
> to all of QEMU's APIs that require the event loop, whereas threads
> need to take extra steps to interact with the rest of QEMU.
>
> Stefan

Thanks for the explanation, Stefan!

Glenn