On Wed, Apr 08, 2026 at 09:55:10PM +0200, Alexander Mikhalitsyn wrote:
> On Wed, Apr 8, 2026 at 20:27, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Apr 07, 2026 at 09:02:26PM +0200, Alexander Mikhalitsyn wrote:
> > > On Tue, Apr 7, 2026 at 17:48, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Tue, Mar 17, 2026 at 11:27:07AM +0100, Alexander Mikhalitsyn wrote:
> > > > > +    /* wait until all in-flight I/O requests (except NVME_ADM_CMD_ASYNC_EV_REQ) have been processed */
> > > > > +    for (i = 0; i < n->num_queues; i++) {
> > > > > +        NvmeRequest *req;
> > > > > +        NvmeSQueue *sq = n->sq[i];
> > > > > +
> > > > > +        if (!sq)
> > > > > +            continue;
> > > > > +
> > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait(n, i, sq->head, sq->tail, sq->size);
> > > > > +
> > > > > +wait_out_reqs:
> > > > > +        QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
> > > > > +            if (req->cmd.opcode != NVME_ADM_CMD_ASYNC_EV_REQ) {
> > > > > +                cpu_relax();
> > > > > +                goto wait_out_reqs;
> > > > > +            }
> > > > > +        }
> > > > > +
> > > > > +        trace_pci_nvme_pre_save_sq_out_req_drain_wait_end(n, i, sq->head, sq->tail);
> > > > > +    }
> > > >
> > >
> > > Hi Stefan,
> > >
> > > > Emulated storage controllers usually do not drain requests themselves.
> > > > They rely on core migration code (e.g. migration_completion_precopy())
> > > > to stop vCPUs and call bdrv_drain_all_begin/end() to quiesce I/O. Why
> > > > does NVMe busy wait for requests here?
> > >
> > > I rely on core migration code to stop vCPUs and drain requests, *but*
> > > the challenge is that the concept of an "in-flight" request in NVMe
> > > is not that simple. There are a few different kinds:
> > > - The request was written to the SQ (sq->head != sq->tail). I don't
> > >   even consider this in-flight, because we just stop SQ processing
> > >   and these requests don't need any special handling during migration.
> > > - The request was taken from the SQ by nvme_process_sq() and now lives
> > >   in sq->out_req_list. This means we have also initialized req->aiocb
> > >   and submitted the I/O to the QEMU block layer. Once the request is
> > >   processed, the completion callback is called (nvme_rw_complete_cb()
> > >   for read/write requests), then nvme_enqueue_req_completion() removes
> > >   the NvmeRequest from sq->out_req_list and puts it into cq->req_list.
> > >   I expect that by the time we enter nvme_ctrl_pre_save(),
> > >   bdrv_drain_all_begin/end() have been called, all AIO has finished,
> > >   and sq->out_req_list is empty (except for AERs). *But* to be on the
> > >   safe side I also added a busy loop on sq->out_req_list.
> > >
> > > So I tend to agree that this busy wait is probably not required, but
> > > I believe we still need to verify that sq->out_req_list is in fact
> > > empty. If we messed up, it's better to crash on an assert() than to
> > > have silent data corruption.
> > >
> > > After that I have a loop over cq->req_list, and this one is absolutely
> > > required: we need to write all NvmeRequest results to the CQ and free
> > > the NvmeRequest structures, because I didn't want to deal with
> > > NvmeRequest serialization.
> >
> > I don't see how the busy wait approach can work since
> > migration_completion_precopy() holds the Big QEMU Lock
> > (bql_lock()/bql_unlock()) while .pre_save() is called. The main loop
> > thread's event loop will not be able to make progress while .pre_save()
> > is busy waiting.
> 
> Yes, my bad. Good catch! You are absolutely right.
> 
> This is an especially stupid mistake on my part, given that I *knew*
> we hold the BQL in this context. The first version of this patchset
> *did* take it into account:
> https://lore.kernel.org/qemu-devel/20260217152517.271422-5-alexander@mihalicyn.com/
> 
> See the comment before qemu_bh_cancel(n->admin_cq.bh).
> 
> Thanks, Stefan! ;)

No worries. The good news is that we don't need to worry about race
conditions due to device state changing while .pre_save() runs :).
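
For illustration, the busy wait could probably become a one-shot check
whose result feeds an assert() in .pre_save(). The following is only a
standalone sketch with simplified stand-in types: Req, SQueue,
sq_only_aers_outstanding() and ctrl_is_drained() are hypothetical names,
not the hw/nvme definitions, and the plain linked list stands in for the
QTAILQ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the QEMU types; not the real definitions. */
enum { ADM_CMD_ASYNC_EV_REQ = 0x0c }; /* Asynchronous Event Request opcode */

typedef struct Req {
    uint8_t opcode;
    struct Req *next;   /* stand-in for the QTAILQ linkage */
} Req;

typedef struct SQueue {
    Req *out_req_list;  /* submitted to the block layer, not yet completed */
} SQueue;

/* Single pass, no waiting: true if only AERs remain outstanding. */
bool sq_only_aers_outstanding(const SQueue *sq)
{
    for (const Req *req = sq->out_req_list; req; req = req->next) {
        if (req->opcode != ADM_CMD_ASYNC_EV_REQ) {
            return false;
        }
    }
    return true;
}

/* Hypothetical pre-save check over all queues; NULL slots are skipped. */
bool ctrl_is_drained(SQueue *const *sqs, size_t num_queues)
{
    for (size_t i = 0; i < num_queues; i++) {
        if (sqs[i] && !sq_only_aers_outstanding(sqs[i])) {
            return false;
        }
    }
    return true;
}
```

Since the BQL is held, the lists cannot change underneath the check, so
a single pass is enough and a failure points at a real bug rather than
a request that is still completing.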

Stefan