On Tue, Apr 07, 2026 at 09:02:26PM +0200, Alexander Mikhalitsyn wrote:
> On Tue, 7 Apr 2026 at 17:48, Stefan Hajnoczi wrote:
> >
> > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > +    for (i = 0; i < n->num_queues; i++) {
> > > +        NvmeRequest *req;
> > > +        NvmeSQueue *sq = n->sq[i];
> > > +
> > > +        if (!sq)
> > > +            continue;
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > +
> > > +wait_out_reqs:
> > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > +                cpu_relax();
> > > +                goto wait_out_reqs;
> > > +            }
> > > +        }
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > +    }
> >
>
> Hi Stefan,
>
> > Emulated storage controllers usually do not drain requests themselves.
> > They rely on core migration code (e.g. migration_completion_precopy())
> > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > does NVMe busy wait for requests here?
>
> I rely on core migration code to stop vCPUs and drain requests, *but*
> the challenge here is that the concept of an "in-flight" request in
> NVMe is not that simple, and we have a few different types of
> in-flight requests:
> - the request was written to the SQ (sq->head != sq->tail) -> I don't
>   even consider this as in-flight, because we just stop SQ processing
>   and these requests don't require any special handling during
>   migration
> - the request was taken from the SQ by nvme_process_sq() and now lives
>   in sq->out_req_list - this means that we have also initialized
>   req->aiocb and submitted the I/O to the QEMU block layer. After the
>   request is processed, the completion callback is called (for
>   read/write requests it is nvme_rw_complete_cb()), then
>   nvme_enqueue_req_completion() removes the NvmeRequest from
>   sq->out_req_list and puts it into cq->req_list.
>   I expect that by the time we enter nvme_ctrl_pre_save(),
>   bdrv_drain_all_begin/end() have been called, all AIO has finished,
>   and sq->out_req_list is empty (except AERs). *But* to be on the safe
>   side I also added a busy loop on sq->out_req_list.
>
> So, I tend to agree that this busy wait is probably not required, but
> I believe that we still need to verify that sq->out_req_list is in
> fact empty. Because if we messed up, then it's better to crash on an
> assert() than to have silent data corruption.
>
> Then after that I have a loop over cq->req_list, and this time it is
> absolutely required because we need to write all NvmeRequest results
> to the CQ and free the NvmeRequest structures, because I didn't want
> to deal with NvmeRequest serialization.

I don't see how the busy-wait approach can work, since
migration_completion_precopy() holds the Big QEMU Lock
(bql_lock()/bql_unlock()) while .pre_save() is called. The main loop
thread's event loop will not be able to make progress while .pre_save()
is busy waiting.

Based on what you and Klaus have said, how about:

1. assert(QTAILQ_EMPTY(&sq->out_req_list))

2. Call nvme_post_cqes() from .pre_save() to write CQEs back to RAM.
   Whether or not cq->bh is cancelled too doesn't really matter.

3. Full CQs need to be handled. It will be necessary to serialize the
   remaining cq->req_list entries. Failing live migration for this
   reason is not acceptable from a user perspective.
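Roughly something along these lines for points 1 and 2 - an untested
sketch only, which assumes the n->sq[]/n->cq[] layout from your patch
and the existing nvme_post_cqes() completion helper, and leaves the
leftover cq->req_list handling from point 3 as a comment:

static int nvme_ctrl_pre_save(void *opaque)
{
    NvmeCtrl *n = opaque;
    int i;

    for (i = 0; i < n->num_queues; i++) {
        NvmeSQueue *sq = n->sq[i];
        NvmeRequest *req;

        if (!sq) {
            continue;
        }

        /*
         * The migration core has already drained the block layer, so
         * only AERs may legitimately remain outstanding.  Crash loudly
         * instead of migrating inconsistent state.
         */
        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
            assert(req->cmd.opcode == NVME_ADM_CMD_ASYNC_EV_REQ);
        }
    }

    for (i = 0; i < n->num_queues; i++) {
        NvmeCQueue *cq = n->cq[i];

        if (!cq) {
            continue;
        }

        /* Write any pending CQEs back to guest RAM (no busy waiting). */
        nvme_post_cqes(cq);

        /*
         * Anything still left on cq->req_list here means the CQ was
         * full; those entries would need to be serialized and replayed
         * on the destination (point 3), which is not shown here.
         */
    }

    return 0;
}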
(virtio-scsi and virtio-blk have a mechanism for serializing failed I/O
requests that need to be restarted after migration. There might be a
nicer way to do this with vmstate nowadays, but they are at least two
examples of how to serialize a storage controller's internal request
state.)

Stefan