* Best practice for issuing blocking calls in response to an event
@ 2025-03-20 16:34 Miles Glenn
From: Miles Glenn @ 2025-03-20 16:34 UTC
To: qemu-devel; +Cc: stefanha
Hello,
I am attempting to simulate a system with multiple CPU
architectures. To do this, I start a separate QEMU process for each
CPU architecture that is needed. I'm also developing some QEMU code
that transports MMIO transactions across the process boundaries
using sockets.
The design takes MMIO request messages off of a socket, services the
request by calling address_space_ldq_be(), then sends a response
message (containing the requested data) over the same
socket. Currently, this is all done inside the socket IOReadHandler
callback function.
This works as long as the targeted register exists in the same QEMU
process that received the request. However, if the register exists in
another QEMU process, then the call to address_space_ldq_be() results
in another socket message being sent to that QEMU process, requesting
the data, and then waiting (blocking) for the response message
containing the data. In other words, the call ends up blocking inside
the event handler. Even though the QEMU process containing the target
register is able to receive the request and send the response, the
originator of the request cannot receive that response until it
eventually times out and stops blocking. Once it times out, it does
receive the response, but by then it is too late.
Here's a summary of the stack up to where the code blocks:

IOReadHandler callback
  calls address_space_ldq_be()
    resolves to mmio read op of a remote device
      sends request over socket and waits (blocks) for response
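
For reference, here is a rough sketch of what the handler does today
(the message structs and the parse/send helpers are hypothetical
placeholders, not the actual code):

    typedef struct { uint64_t addr; } MMIORequest;    /* hypothetical */
    typedef struct { uint64_t data; } MMIOResponse;   /* hypothetical */

    static void mmio_sock_read_handler(void *opaque, const uint8_t *buf,
                                       int size)
    {
        MMIORequest req;
        MMIOResponse rsp;

        parse_request(buf, size, &req);   /* decode the request message */

        /* Blocks when the register lives in a remote QEMU process:
         * the mmio read op sends its own socket request and waits for
         * the reply before returning. */
        rsp.data = address_space_ldq_be(&address_space_memory, req.addr,
                                        MEMTXATTRS_UNSPECIFIED, NULL);

        send_response(opaque, &rsp);      /* reply on the same socket */
    }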
So, I'm looking for a way to handle the work of calling
address_space_ldq_be(), which might block when attempting to read a
register of a remote device, without blocking inside the IOReadHandler
callback context.
I've done a lot of searching and reading on the web and in the QEMU
code, but it's still not really clear to me how this should be done in
QEMU. I've seen a lot about using coroutines to handle cases like
this. Is that what I should be using here?
Thanks,
Glenn Miles
* Re: Best practice for issuing blocking calls in response to an event

From: Stefan Hajnoczi @ 2025-03-20 20:09 UTC
To: milesg; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Thu, Mar 20, 2025 at 12:34 PM Miles Glenn <milesg@linux.ibm.com> wrote:
>
> I am attempting to simulate a system with multiple CPU
> architectures. To do this, I start a separate QEMU process for each
> CPU architecture that is needed. I'm also developing some QEMU code
> that transports MMIO transactions across the process boundaries
> using sockets.

I have CCed Phil. He has been working on heterogeneous target emulation
and might be interested.

> The design takes MMIO request messages off of a socket, services the
> request by calling address_space_ldq_be(), then sends a response
> message (containing the requested data) over the same
> socket. Currently, this is all done inside the socket IOReadHandler
> callback function.

At a high level this is similar to the vfio-user feature, where a PCI
device is emulated in a separate process. That also involves sending
messages describing QEMU's MemoryRegion accesses. See the "remote"
machine type in QEMU to look at the code.

> [...]
>
> So, I'm looking for a way to handle the work of calling
> address_space_ldq_be(), which might block when attempting to read a
> register of a remote device, without blocking inside the IOReadHandler
> callback context.
>
> I've seen a lot about using coroutines to handle cases like this. Is
> that what I should be using here?

The fundamental problem is that address_space_ldq_be() is synchronous,
so there is no way to return to the caller until the response has been
received.

vfio-user didn't solve this problem. It simply blocks until the
response is received, but it does drop the Big QEMU Lock during this
time so that other vCPU threads can run. For example, see
hw/remote/proxy.c:send_bar_access_msg() and
mpqemu_msg_send_and_await_reply().

QEMU supports nested event loops, but they come with their own set of
gotchas. The way a nested event loop might help here is to send the
request and then call aio_poll() to receive the response in another
IOReadHandler. This way other event loop processing can take place
while waiting in address_space_ldq_be().
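
As a rough sketch of that pattern (RemoteDev, its fields, and the send
helper are hypothetical stand-ins, not an existing QEMU API):

    /* Sketch: mmio read op of the proxy for a remote device, waiting
     * in a nested event loop instead of blocking on the socket. */
    static uint64_t remote_reg_read(RemoteDev *dev, hwaddr addr)
    {
        send_read_request(dev, addr);   /* hypothetical helper */
        dev->reply_pending = true;

        /* Nested event loop: IOReadHandlers keep running here, so both
         * the awaited response and any new incoming requests can be
         * processed while we wait. */
        while (dev->reply_pending) {
            aio_poll(qemu_get_aio_context(), true);
        }

        /* The socket's read handler stored the data and cleared
         * reply_pending when the response message arrived. */
        return dev->reply_data;
    }

The fact that handlers for new incoming requests can still run inside
the aio_poll() loop is also what matters for the deadlock issue below.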
The second problem is that this approach, where QEMU processes send
requests to each other, needs to be implemented carefully to avoid
deadlocks. For example, devices that do DMA could load/store memory
belonging to another device handled by another QEMU process. Once there
is an A -> B -> A situation it could deadlock.

Both vfio-user and vhost-user have similar issues with their
bi-directional communication, where a device emulation process can send
a message to QEMU while processing a message from QEMU. Deadlock can be
avoided if the code is structured so that QEMU is able to receive new
requests while it is waiting for a response.

Stefan
* Re: Best practice for issuing blocking calls in response to an event

From: Miles Glenn @ 2025-03-21 15:17 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Thu, 2025-03-20 at 16:09 -0400, Stefan Hajnoczi wrote:
> [...]
>
> The fundamental problem is that address_space_ldq_be() is synchronous,
> so there is no way to return to the caller until the response has been
> received.
>
> [...]
>
> Stefan

Stefan, thank you for the quick response and great information!

I'm not sure if this is the best way, but I was able to get things
working today using the coroutine approach.

Now, the aforementioned stack looks like this:

IOReadHandler callback receives request
  enters coroutine
    calls address_space_ldq_be()
      resolves to mmio read op of a remote device
        sends request over socket
        detects coroutine context and
        calls qemu_coroutine_yield() instead of blocking
  returns to callback

<time passes>

IOReadHandler callback receives response
  re-enters coroutine
    mmio read op returns data received in response message
    address_space_ldq_be() returns
  coroutine completes and returns to callback
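
In sketch form, the two halves look roughly like this (RemoteDev, its
fields, and the parse/dispatch helpers are made-up names standing in
for my actual code):

    /* Sketch: the mmio read op detects coroutine context and yields
     * instead of blocking on the socket. */
    static uint64_t remote_reg_read(RemoteDev *dev, hwaddr addr)
    {
        send_read_request(dev, addr);    /* hypothetical helper */
        if (qemu_in_coroutine()) {
            dev->waiting_co = qemu_coroutine_self();
            qemu_coroutine_yield();      /* control returns to callback */
        }
        return dev->reply_data;          /* filled in before re-entry */
    }

    /* Sketch: socket IOReadHandler dispatch. A new request starts a
     * coroutine; a response re-enters the coroutine that yielded. */
    static void mmio_sock_read_handler(void *opaque, const uint8_t *buf,
                                       int size)
    {
        RemoteDev *dev = opaque;

        if (is_response(buf, size)) {               /* hypothetical */
            dev->reply_data = parse_response(buf, size);
            qemu_coroutine_enter(dev->waiting_co);  /* resumes after yield */
        } else {
            Coroutine *co = qemu_coroutine_create(handle_request_co, dev);
            qemu_coroutine_enter(co);    /* runs until first yield or done */
        }
    }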
While this works, I couldn't help but notice that the coroutine concept
seems to be a form of cooperative multithreading. Is there some
advantage to using coroutines over doing the work in another thread?
Does QEMU offer an interface that allows a callback to queue up work
that can be handled by another thread or a pool of threads?

Thanks,

Glenn Miles
* Re: Best practice for issuing blocking calls in response to an event

From: Stefan Hajnoczi @ 2025-03-24 18:35 UTC
To: milesg; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Fri, Mar 21, 2025 at 11:17 AM Miles Glenn <milesg@linux.ibm.com> wrote:
> [...]
>
> While this works, I couldn't help but notice that the coroutine concept
> seems to be a form of cooperative multithreading. Is there some
> advantage to using coroutines over doing the work in another thread?
> Does QEMU offer an interface that allows a callback to queue up work
> that can be handled by another thread or a pool of threads?

Coroutines make it easier to write concurrent code in an event loop.
The alternative is to write asynchronous callback functions, which is
tedious for sequences with multiple steps that need to wait for I/O.
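
To illustrate the difference, compare the two styles (every function
name below is made up for illustration):

    /* Callback style: each step that waits for I/O becomes its own
     * function, with state threaded through an opaque pointer. */
    static void step2_done(void *opaque) { finish(opaque); }
    static void step1_done(void *opaque) { start_step2(opaque, step2_done); }
    static void start_sequence(void *opaque) { start_step1(opaque, step1_done); }

    /* Coroutine style: the same sequence reads top to bottom; each
     * wait_for_*() call yields and is re-entered when its event fires. */
    static void coroutine_fn sequence_co(void *opaque)
    {
        wait_for_step1(opaque);
        wait_for_step2(opaque);
        finish(opaque);
    }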
For example, see > > hw/remote/proxy.c:send_bar_access_msg() and > > mpqemu_msg_send_and_await_reply(). > > > > QEMU supports nested event loops, but they come with their own set of > > gotchas. The way a nested event loop might help here is to send the > > request and then call aio_poll() to receive the response in another > > IOReadHandler. This way other event loop processing can take place > > while waiting in address_space_ldq_be(). > > > > The second problem is that this approach where QEMU processes send > > requests to each other needs to be implemented carefully to avoid > > deadlocks. For example, devices that do DMA could load/store memory > > belonging to another device handled by another QEMU. Once there is an > > A -> B -> A situation it could deadlock. > > > > Both vfio-user and vhost-user have similar issues with their > > bi-directional communication where a device emulation process can send > > a message to QEMU while processing a message from QEMU. Deadlock can > > be avoided if the code is structured so that QEMU is able to receive > > new requests during the time when it is waiting for a response. > > > > Stefan > > Stefan, Thank you for the quick response and great information! > > I'm not sure if this is the best way, but I was able to get things > working today using the coroutine approach. > > Now, the aforementioned stack looks like this: > > IOReadHandler callback receives request > enters coroutine > calls address_space_ldq_be() > resolves to mmio read op of a remote device > sends request > over socket > detects coroutine context and > calls qemu_coroutine_yield() instead of blocking > returns to callback > > <time passes> > > IOReadHandler callback receives response > re-enters coroutine > mmio read op returns data received in response message > address_space_ldq_be() returns > coroutine completes and returns to callback > > While this works, I couldn't help but notice that the coroutine concept > seems to be like a form of multithreading. Is there some advantage to > using coroutines over doing the work in another thread? Does QEMU > offer an interface that allows for a callback to queue up work that can > be handled by another thread or a pool of threads? Coroutines make it easier to write concurrent code in an event loop. The alternative is to write asynchronous callback functions, which is tedious for sequences with multiple steps that need to wait for I/O. Coroutines do not offer parallelism, so they are not replacement for multi-threading. QEMU is mostly event-driven rather than multi-threaded. Usually only computation in QEMU that really needs its own CPU runs in its own thread (vCPUs, compression, blocking syscalls when there is no alternative, etc). There are advantages to using coroutines: less synchronization is necessary than with threads (you can be sure no other coroutine will run in the same thread while your code is running) and this eliminates most thread-safety issues. Also, event loops are seen as more scalable than threads (lots of historical resources, for example http://www.kegel.com/c10k.html). One QEMU-specific advantage of coroutines: coroutine code has access to all of QEMU's APIs that require the event loop whereas threads need to take extra steps to interact with the rest of QEMU. Stefan ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Best practice for issuing blocking calls in response to an event

From: Miles Glenn @ 2025-03-25 15:08 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, Philippe Mathieu-Daudé

On Mon, 2025-03-24 at 14:35 -0400, Stefan Hajnoczi wrote:
> [...]
>
> One QEMU-specific advantage of coroutines: coroutine code has access
> to all of QEMU's APIs that require the event loop, whereas threads
> need to take extra steps to interact with the rest of QEMU.
>
> Stefan

Thanks for the explanation, Stefan!

Glenn