From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 16 Mar 2026 15:24:52 -0400
From: Peter Xu
To: Yishai Hadas
Cc: Alex Williamson, jgg@nvidia.com, kvm@vger.kernel.org, kevin.tian@intel.com,
 joao.m.martins@oracle.com, leonro@nvidia.com, maorg@nvidia.com,
 avihaih@nvidia.com, clg@redhat.com, liulongfang@huawei.com,
 giovanni.cabiddu@intel.com, kwankhede@nvidia.com
Subject: Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to
 VFIO_MIG_GET_PRECOPY_INFO
References: <20260310164006.4020-1-yishaih@nvidia.com>
 <20260310164006.4020-7-yishaih@nvidia.com>
 <20260312130817.69ff3e60@shazbot.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Sun, Mar 15, 2026 at 04:19:18PM +0200, Yishai Hadas wrote:
> On 12/03/2026 22:16, Peter Xu wrote:
> > On Thu, Mar 12, 2026 at 01:08:17PM -0600, Alex Williamson wrote:
> > > Hey Peter,
> > 
> > Hey, Alex,
> > 
> > > On Thu, 12 Mar 2026 13:37:04 -0400
> > > Peter Xu wrote:
> > > 
> > > > Hi, Yishai,
> > > > 
> > > > Please feel free to treat my comments as pure questions only.
> > > > 
> > > > On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
> > > > > When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
> > > > > driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
> > > > > to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
> > > > > value.
> > > > 
> > > > Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint
> > > > that can be deduced by the userspace too, if it remembers the last
> > > > fetched value of initial_bytes?
> > > 
> > > I'll try to answer some of these.  PRECOPY_INFO is already just a hint.
> > > We essentially define initial_bytes as the "please copy this before
> > > migration to avoid high latency setup" and dirty_bytes is "I also have
> > > this much dirty state I could give to you now".  We've defined
> > > initial_bytes as monotonically decreasing, so a user could deduce that
> > > they've passed the intended high latency setup threshold, while
> > > dirty_bytes is purely volatile.
> > 
> > I see.. That might be another problem though for switchover decisions.
> > 
> > Currently, QEMU relies on dirty reporting to decide when to switch over.
> > 
> > What it does is ask all the modules how much dirty data is left, then
> > src QEMU does a sum and divides that sum by the estimated bandwidth to
> > guess the downtime.
> > 
> > When the estimated downtime is small enough to satisfy the user-specified
> > downtime, QEMU src will switch over.  This didn't take switchover_ack for
> > VFIO into account, but it's a separate concept.
> > 
> > The above was based on the fact that the reported values are "total
> > data", not "what you can collect"..
> > 
> > Is there a possible way to provide a total amount?  It can even be a
> > maximum total amount just to cap the downtime.
> 
> The total amount is already reported today via the
> VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl and QEMU accounts that in the
> switchover decision.

Ok, I somehow got the impression that initial+dirty should be the total
previously.  It's likely because I was referring to this piece of code in
QEMU:

static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    if (!vfio_device_state_is_precopy(vbasedev)) {
        return;
    }

    *must_precopy +=
        migration->precopy_init_size + migration->precopy_dirty_size;

    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
                                      *can_postcopy,
                                      migration->precopy_init_size,
                                      migration->precopy_dirty_size);
}

After you said so, I found that indeed the exact() version is fetching the
stop-size:

static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                     uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;
    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;

    /*
     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
     * reported so downtime limit won't be violated.
     */
    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
    *must_precopy += stop_copy_size;

    if (vfio_device_state_is_precopy(vbasedev)) {
        vfio_query_precopy_size(migration);
    }

    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
                                   stop_copy_size, migration->precopy_init_size,
                                   migration->precopy_dirty_size);
}

Do you know why the estimate version doesn't report a cached stop_size
instead?  Reporting different things will also confuse QEMU in its
estimate() and exact() hooks.  They should report the same thing except
that estimate() can use a fast path with a cached value.
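
To illustrate what I mean, a rough sketch only (not a patch;
precopy_stop_size below is a hypothetical cached field that the exact()
path would refresh whenever it issues the query):

static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    /*
     * Account the stop-copy baseline like exact() does, but from a cached
     * value instead of an ioctl on every call.
     */
    *must_precopy += migration->precopy_stop_size;

    if (!vfio_device_state_is_precopy(vbasedev)) {
        return;
    }

    *must_precopy +=
        migration->precopy_init_size + migration->precopy_dirty_size;
}

That way both hooks would account for the stop-copy size consistently, and
only exact() pays for the extra ioctl.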
"total > > data", the larger the downtime mistake can happen. > > > > > > > > The trouble comes, for example, if the device has undergone a > > > reconfiguration during migration, which may effectively negate the > > > initial_bytes and switchover-ack. > > > > Ah so it's about that, thanks. IMHO it might be great if Yishai could > > mention the source of growing initial_bytes somewhere in the commit log, or > > even when documenting the new feature bit. > > Sure, we can add as part of V2 the below chunk when documenting the new > feature. > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index 90e51e84539d..bb4a2df0550d 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -1268,6 +1268,8 @@ enum vfio_device_mig_state { > * value and decrease as migration data is read from the device. > * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates > * that new initial data is present on the stream. > + * The new initial data may result, for example, from device > reconfiguration > + * during migration that requires additional initialization data. This is helpful at least to me, thanks. > > > > > > > > > > A user deducing they've sent enough device data to cover initial_bytes > > > is essentially what we have now because our protocol doesn't allow the > > > driver to reset initial_bytes. The driver may choose to send that > > > reconfiguration information in dirty_bytes bytes, but we don't > > > currently have any way to indicate to the user that data remaining > > > there is of higher importance for startup on the target than any other > > > dirtying of device state. > > > > > > Hopefully the user/VMM is already polling the interface for dirty > > > bytes, where with the opt-in for the protocol change here, allows the > > > driver to split out the priority bytes versus the background dirtying. > > > > It definitely sounds a bit weird when some initial_* data can actually > > > > change, because it's not "initial_" anymore. > > > > > > It's just a priority scheme. In the case I've outlined above it might > > > be more aptly named setup_bytes or critical_bytes as you've used, but > > > another driver might just use it for detecting migration compatibility. > > > Naming is hard. > > > > Yep. :) initial_bytes is still fine at least to me. I wonder if we could > > still update the document of this field, then it'll be good enough. > > As Alex mentioned, initial_bytes can be used for various purposes. > > So, I would keep the existing description in the uAPI. > > In the context of the new feature, the uAPI commit message refers to > initial_bytes as 'critical data', to explain the motivation behind the > feature. Together with the extra chunk in the uAPI suggested above, I > believe this clarifies the intended usage. > > Makes sense ? As long as Alex is happy with it, I'm OK either way. > > > > > > > Another question is, if initial_bytes reached zero, could it be boosted > > > > again to be non-zero? > > > > > > Under the new protocol, yes, and the REINIT flag would be set indicate > > > it had been reset. Under the old protocol, no. > > > > I don't see what stops it from happening, if the "we get some fresh new > > > > critical data" seem to be able to happen anytime.. but if so, I wonder if > > > > it's a problem to QEMU: when initial_bytes reported to 0 at least _once_ it > > > > means it's possible src QEMU decides to switchover. 
Then looks like it > > > > beats the purpose of "don't switchover until we flush the critical data" > > > > whole idea. > > > > > > The definition of the protocol in the header stop it from happening. > > > We can't know that there isn't some userspace that follows the > > > deduction protocol rather than polling. We don't know there isn't some > > > userspace that segfaults if initial_bytes doesn't follow the published > > > protocol. Therefore opt-in where we have a mechanism to expose a new > > > initial_bytes session without it becoming a purely volatile value. > > > > Here, IMHO the problem is QEMU still needs to know when a switchover can > > happen. > > > > After a new QEMU probing this new driver feature bit and enable it, now > > initial_bytes can be incremented when REINIT flag set. This is fine on its > > own. But then, src QEMU still needs to decide when it can switch over. > > > > It seems to me the only way to do it (with/without the new feature bit > > enabled), is to relying on initial_bytes being zero. When it's zero, it > > means all possible "critical data" has been moved, then src QEMU can > > kickoff that "switchover" message. > > > > After that, IIUC we need to be prepared to trigger switchover anytime. > > > > With the new REINIT, it means we can still observe REINIT event after src > > QEMU making that decision. Would that be a problem? > > > > Nowadays, when looking at vfio code, what happens is src QEMU after seeing > > initial_bytes==0 send one VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to dest QEMU, > > later dst QEMU will ack that by sending back MIG_RP_MSG_SWITCHOVER_ACK. > > Then switchover can happen anytime by the downtime calculation above. > > > > Maybe there should be solution in the userspace to fix it, but we'll need > > to figure it out. Likely, we need one way or another to revoke the > > switchover message, so ultimately we need to stop VM, query the last time, > > seeing initial_bytes==0, then it can proceed with switchover. If it sees > > initial_bytes nonzero again, it will need to restart the VM and revoke the > > previous message somehow. > > The counterpart QEMU series that we pointed to, handles that in similar way > to what you described. > > The switchover-ack mechanism is modified to be revoke-able and a final query > to check initial_bytes == 0 is added after vCPUs are stopped. I only roughly skimmed the series and overlooked the QEMU branch link. I read it, indeed it should do most of above, except one possible issue I found, that QEMU shouldn't fail the migration when REINIT happened after exact() but before vm_stop(); instead IIUC it should fallback to iterations and try to move over the initial_bytes and retry a switchover. Thanks, > > Thanks, > Yishai > > > > > > > Is there a way the HW can report and confidentally say no further critical > > > > data will be generated? > > > > > > So long as there's a guest userspace running that can reconfigure the > > > device, no. But if you stop the vCPUs and test PRECOPY_INFO, it should > > > be reliable. > > > > This is definitely an important piece of info. I recall Zhiyi used to tell > > me there's no way to really stop a VFIO device from generating dirty data. > > Happy to know it seems there seems to still be a way. And now I suspect > > what Zhiyi observed was exactly seeing dirty_bytes growing even after VM > > stopped. If that counter means "how much you can read" it all makes more > > sense (even though it may suffer from the issue I mentioned above). 
> > > > > 
> > > > > The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
> > > > > caller that new initial data is available in the migration stream.
> > > > > 
> > > > > If the firmware reports a new initial-data chunk, any previously dirty
> > > > > bytes in memory are treated as initial bytes, since the caller must read
> > > > > both sets before reaching the end of the initial-data region.
> > > > 
> > > > This is unfortunate.  I believe it's a limitation because of the
> > > > current single-fd streaming protocol, so HW can only append things
> > > > because it's kind of a pipeline.
> > > > 
> > > > One thing to mention is, I recall VFIO migration suffers from a major
> > > > bottleneck on read() of the VFIO FD, which means this whole streaming
> > > > design is also causing other perf issues.
> > > > 
> > > > Have you or anyone thought about making it not a stream anymore?
> > > > Take RAM blocks as an example: they are pagesize-accessible, and with
> > > > that we can do a lot more, e.g. we don't need to stream pages in
> > > > order, we can send pages in whatever order.  Meanwhile, we can send
> > > > pages concurrently because they're not serialized either.
> > > > 
> > > > I wonder if VFIO FDs can provide something like that too.  As a start
> > > > it doesn't need to be as fine-grained; maybe at least instead of
> > > > using one stream it can provide two streams, one for initial_bytes
> > > > (or, I really think this should be called "critical data" or
> > > > something similar, if it represents that rather than "some initial
> > > > states"), another one for dirty.  Then at least when you attach new
> > > > critical data you don't need to flush the dirty queue too.
> > > > 
> > > > To extend it a bit more, we could also make e.g. the dirty queue into
> > > > multiple FDs, so that userspace can read() in multiple threads,
> > > > speeding up the switchover phase.
> > > > 
> > > > I have a vague memory that there are sometimes kernel big locks
> > > > blocking it, but from an interfacing POV it always sounds better to
> > > > avoid using one fd to stream everything.
> > > 
> > > I'll leave it to others to brainstorm improvements, but I'll note that
> > > flushing dirty_bytes is a driver policy, another driver could consider
> > > unread dirty bytes as invalidated by new initial_bytes and reset
> > > counters.
> > > 
> > > It's not clear to me that there's a generic algorithm to use for
> > > handling device state as addressable blocks rather than serialized into
> > > a data stream.  Multiple streams of different priorities seem feasible,
> > > but now we're talking about a v3 migration protocol.  Thanks,
> > 
> > Yep, definitely not a request to invent v3 yet, but just to brainstorm it.
> > It doesn't need to be all-things addressable; being index-able (e.g. via
> > more than one object) would also be nice even through one fd, then it can
> > also be threadified somehow.
> > 
> > It seems the HW designer needs to understand how the hypervisor works on
> > collecting this HW data, so it does look like a hard problem when it's
> > all across the stack from the silicon layer..
> > 
> > I just had a feeling that v3 (or more) will come at some point when we
> > want to finally resolve the VFIO downtime problems..
> > 
> > Thanks,

-- 
Peter Xu