On Tue, Apr 07, 2026 at 09:02:26PM +0200, Alexander Mikhalitsyn wrote:
> On Tue, 7 Apr 2026 at 17:48, Stefan Hajnoczi wrote:
> >
> > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > +    /* wait when all in-flight IO requests (except NVME_ADM_CMD_ASYNC_EV_REQ) are processed */
> > > +    for (i = 0; i < n->num_queues; i++) {
> > > +        NvmeRequest *req;
> > > +        NvmeSQueue *sq = n->sq[i];
> > > +
> > > +        if (!sq)
> > > +            continue;
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > +
> > > +wait_out_reqs:
> > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > +                cpu_relax();
> > > +                goto wait_out_reqs;
> > > +            }
> > > +        }
> > > +
> > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > +    }
> >
>
> Hi Stefan,
>
> > Emulated storage controllers usually do not drain requests themselves.
> > They rely on core migration code (e.g. migration_completion_precopy())
> > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > does NVMe busy wait for requests here?
>
> I rely on core migration code to stop vCPUs and drain requests, *but*
> the challenge here is that the concept of an "in-flight" request in
> NVMe is not that simple, and we have a few different types of
> in-flight requests:
> - the request was written to the SQ (sq->head != sq->tail) -> I don't
>   even consider this as in-flight, because we just stop SQ processing
>   and these requests don't require any special handling during
>   migration
> - the request was taken from the SQ by nvme_process_sq() and now lives
>   in sq->out_req_list - this means that we have also initialized
>   req->aiocb and submitted the I/O to the QEMU block layer. After the
>   request is processed, the completion callback is called (for
>   read/write requests it is nvme_rw_complete_cb()), then
>   nvme_enqueue_req_completion() removes the NvmeRequest from
>   sq->out_req_list and puts it into cq->req_list.
>   I expect that by the time we enter nvme_ctrl_pre_save(),
>   bdrv_drain_all_begin/end() have been called, all AIO has finished,
>   and sq->out_req_list is empty (except AERs). *But* to be on the safe
>   side I also added a busy loop on sq->out_req_list.
>
> So, I tend to agree that this busy wait is probably not required, but
> I believe that we still need to verify that sq->out_req_list is in
> fact empty. Because if we messed up, then it's better to crash on an
> assert() than to have silent data corruption.
>
> Then after that I have a loop over cq->req_list, and this time it is
> absolutely required because we need to write all NvmeRequest results
> to the CQ and free the NvmeRequest structures, because I didn't want
> to deal with NvmeRequest serialization.

I don't see how the busy-wait approach can work, since
migration_completion_precopy() holds the Big QEMU Lock
(bql_lock()/bql_unlock()) while .pre_save() is called. The main loop
thread's event loop will not be able to make progress while .pre_save()
is busy waiting.

Based on what you and Klaus have said, how about:

1. assert(QTAILQ_EMPTY(&sq->out_req_list))

2. Call nvme_post_cqes() from .pre_save() to write CQEs back to RAM.
   Whether or not cq->bh is cancelled too doesn't really matter.

3. Full CQs need to be handled. It will be necessary to serialize the
   remaining cq->req_list entries. Failing live migration for this
   reason is not acceptable from a user perspective.
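Roughly something along these lines for points 1 and 2 - an untested
sketch only, which assumes the n->sq[]/n->cq[] layout from your patch
and the existing nvme_post_cqes() completion helper, and leaves the
leftover cq->req_list handling from point 3 as a comment:

static int nvme_ctrl_pre_save(void *opaque)
{
    NvmeCtrl *n = opaque;
    int i;

    for (i = 0; i < n->num_queues; i++) {
        NvmeSQueue *sq = n->sq[i];
        NvmeRequest *req;

        if (!sq) {
            continue;
        }

        /*
         * The migration core has already drained the block layer, so
         * only AERs may legitimately remain outstanding.  Crash loudly
         * instead of migrating inconsistent state.
         */
        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
            assert(req->cmd.opcode == NVME_ADM_CMD_ASYNC_EV_REQ);
        }
    }

    for (i = 0; i < n->num_queues; i++) {
        NvmeCQueue *cq = n->cq[i];

        if (!cq) {
            continue;
        }

        /* Write any pending CQEs back to guest RAM (no busy waiting). */
        nvme_post_cqes(cq);

        /*
         * Anything still left on cq->req_list here means the CQ was
         * full; those entries would need to be serialized and replayed
         * on the destination (point 3), which is not shown here.
         */
    }

    return 0;
}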
(virtio-scsi and virtio-blk have a mechanism for serializing failed I/O
requests that need to be restarted after migration. There might be a
nicer way to do this with vmstate nowadays, but they are at least two
examples of how to serialize a storage controller's internal request
state.)

Stefan