From: Peter Xu <peterx@redhat.com>
To: "Liu, Yuan1" <yuan1.liu@intel.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
"farosas@suse.de" <farosas@suse.de>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"hao.xiang@bytedance.com" <hao.xiang@bytedance.com>,
"bryan.zhang@bytedance.com" <bryan.zhang@bytedance.com>,
"Zou, Nanhai" <nanhai.zou@intel.com>
Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
Date: Fri, 22 Mar 2024 12:40:32 -0400
Message-ID: <Zf20gJbSavpp93_b@x1n>
In-Reply-To: <PH7PR11MB5941B5EB0C21FBFB6C8FFA16A3312@PH7PR11MB5941.namprd11.prod.outlook.com>
On Fri, Mar 22, 2024 at 02:47:02PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Liu, Yuan1
> > Sent: Friday, March 22, 2024 10:07 AM
> > To: Peter Xu <peterx@redhat.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: RE: [PATCH v5 5/7] migration/multifd: implement initialization of
> > qpl compression
> >
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, March 21, 2024 11:28 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de;
> > > qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> > > initialization of qpl compression
> > >
> > > On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > > > > -----Original Message-----
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, March 21, 2024 4:32 AM
> > > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de;
> > > > > qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> > > > > initialization of qpl compression
> > > > >
> > > > > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > > > > Let me explain here: during the decompression operation of IAA,
> > > > > > the decompressed data can be output directly to the virtual
> > > > > > address of the guest memory by the IAA hardware. This avoids
> > > > > > copying the decompressed data to guest memory with the CPU.
> > > > >
> > > > > I see.
> > > > >
> > > > > > Without -mem-prealloc, none of the guest memory is populated,
> > > > > > and the IAA hardware needs to trigger an I/O page fault first
> > > > > > and then output the decompressed data to the guest memory
> > > > > > region. Besides that, CPU page faults will also trigger IOTLB
> > > > > > flush operations when IAA devices use SVM.
> > > > >
> > > > > Oh, so the IAA hardware can already use CPU pgtables? Nice..
> > > > >
> > > > > Why is an IOTLB flush needed? AFAIU we're only installing new
> > > > > pages; the request can come either from a CPU access or a DMA. In
> > > > > neither case should an old page be torn down. Isn't an IOTLB
> > > > > flush only needed if a tear-down happens?
> > > >
> > > > As far as I know, the IAA hardware uses SVM technology to use the
> > > > CPU's page table for address translation (IOMMU scalable mode
> > > > directly accesses the CPU page table). Therefore, when the CPU page
> > > > table changes, the device's invalidation operation needs to be
> > > > triggered to update the IOMMU and the device's caches.
> > > >
> > > > My current kernel version is mainline 6.2. The issue I see is as
> > > > follows:
> > > >
> > > >   handle_mm_fault
> > > >    |
> > > >    -- wp_page_copy
> > >
> > > This is the CoW path. Not usual at all..
> > >
> > > I assume this issue should only be present on the destination. The
> > > guest pages should be the destination of such DMAs, which means these
> > > should be write faults, and as we see here it is, otherwise it
> > > wouldn't trigger a CoW.
> > >
> > > However, it's not clear to me why a pre-installed zero page existed.
> > > It means someone read the guest pages first.
> > >
> > > It might be interesting to know _why_ someone reads the guest pages,
> > > even if we know they're all zeros. If we can avoid such reads, it'll
> > > be a hole rather than a prefaulted read on a zero page; then
> > > invalidations are not needed, and I expect that should fix the IOTLB
> > > storm issue.
> >
> > The received pages will first be read for the zero-page check. Although
> > these pages are zero pages and the IAA hardware will not access them,
> > the CoW happens and causes the subsequent IOTLB flush operation. As far
> > as I know, the IOMMU can quickly detect whether an address range has
> > been used by the device and avoid invalidating addresses not used by
> > the device, but this has not yet been resolved in Linux kernel 6.2. I
> > will check the latest status for this.
>
> I checked the Linux mainline 6.8 code; there are no big changes for this.
> In version 6.8, if the process needs to flush the MMU TLB, an I/O TLB
> flush will also be triggered when the process has SVM devices. I haven't
> found any code that checks whether pages have the EA (Extended-Accessed)
> bit set before submitting invalidation operations; this is the same as in
> version 6.2.
>
> VT-d 3.6.2
> If the Extended-Accessed-Flag-Enable (EAFE) is 1 in a scalable-mode PASID-table
> entry that references a first-stage paging-structure entry used by the remapping
> hardware, it atomically sets the EA field in that entry. Whenever EA field is
> atomically set, the A field is also set in the same atomic operation. For software
> usages where the first-stage paging structures are shared across heterogeneous agents
> (e.g., CPUs and accelerator devices such as GPUs), the EA flag may be used by software
> to identify pages accessed by non-CPU agent(s) (as opposed to the A flag which indicates
> access by any agent sharing the paging structures).
This seems to be a pretty new hardware feature. I didn't check in depth,
but what you said makes sense.
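
Just to illustrate the idea (a toy sketch in plain C, not real kernel or
QEMU code; all names and the page-table representation are made up): with
a per-page EA bit recording device access, the invalidation path could
skip the device-TLB flush for ranges the accelerator never touched:

    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES 1024

    /* Toy PTE: only the two VT-d flags discussed above. */
    struct toy_pte {
        bool a;    /* Accessed: some agent touched the page    */
        bool ea;   /* Extended-Accessed: a non-CPU agent did   */
    };

    static struct toy_pte page_table[NPAGES];

    static void flush_dev_iotlb(unsigned long first, unsigned long last)
    {
        printf("device-TLB flush for pages %lu..%lu\n", first, last);
    }

    /* Only pay for the device flush when some page in the range was
     * actually accessed by the device, i.e. has EA set. */
    static void invalidate_range(unsigned long first, unsigned long last)
    {
        for (unsigned long i = first; i <= last; i++) {
            if (page_table[i].ea) {
                flush_dev_iotlb(first, last);
                return;
            }
        }
        /* EA clear everywhere: flushing the CPU TLB alone suffices. */
    }

    int main(void)
    {
        page_table[7].a = page_table[7].ea = true; /* device wrote pg 7 */
        invalidate_range(0, 15);   /* flushes: range contains page 7   */
        invalidate_range(16, 31);  /* skipped: device never touched it */
        return 0;
    }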
>
> > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > {
> >     for (int i = 0; i < p->zero_num; i++) {
> >         void *page = p->host + p->zero[i];
> >         if (!buffer_is_zero(page, p->page_size)) {
> >             memset(page, 0, p->page_size);
> >         }
> >     }
> > }
It may not matter much (I also see your comments below), but just to
mention another solution to avoid this read: we can maintain
RAMBlock->receivedmap for precopy (especially multifd; afaiu multifd
doesn't yet update this bitmap, even though normal precopy does), then
here, instead of scanning every time, maybe we can do:
    /*
     * If it's the 1st time receiving it, no need to clear it as it must be
     * all zeros now.
     */
    if (bitmap_test(rb->receivedmap, page_offset)) {
        memset(page, 0, ...);
    } else {
        bitmap_set(rb->receivedmap, page_offset);
    }
And we would also always set the bit when the page is not zero.
My rationale is that a page is unlikely to be zero once it has been sent
one or more times, while OTOH the first time we receive it, it must be a
zero page, so there's no need to scan it in the first round.
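
To make it concrete, here is a self-contained sketch of that flow (plain
C with stand-in bitmap helpers; the real code would use QEMU's bitmap API
on RAMBlock->receivedmap, and recv_zero_page() is an illustrative name):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* Stand-ins for QEMU's test_bit()/set_bit() helpers. */
    static bool bitmap_test(const unsigned long *map, uint64_t nr)
    {
        return map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG) & 1;
    }

    static void bitmap_set_bit(unsigned long *map, uint64_t nr)
    {
        map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
    }

    /*
     * Zero-page handling on the receive side: only touch guest memory
     * for pages received before (they may hold stale non-zero data).
     * A first-time page is still an unpopulated hole, i.e. already
     * zero, so we skip both the read (buffer_is_zero) and the write
     * (memset), and no fault or invalidation is triggered.
     */
    static void recv_zero_page(unsigned long *receivedmap, void *page,
                               uint64_t page_nr, size_t page_size)
    {
        if (bitmap_test(receivedmap, page_nr)) {
            memset(page, 0, page_size);
        } else {
            bitmap_set_bit(receivedmap, page_nr);
        }
    }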
> >
> >
> > > It'll still be good if we can fix this first so as not to make qpl
> > > special in this regard; the hope is that the migration submodule
> > > shouldn't rely on any pre-configuration (-mem-prealloc) of guest
> > > memory behavior to work properly.
> >
> > Even if the IOTLB problem can be avoided, the I/O page fault problem
> > remains (normal pages are loaded by the IAA device, and resolving
> > normal page faults through the IOMMU does not perform well).
Do you have a rough estimate of how slow that could be? It would be good
to mention some details in the doc file too, in that case.
> >
> > We could let the decompressed data of the IAA device be output to
> > pre-populated memory instead of directly to the guest address, but then
> > each multifd thread needs two memory copies: one from the network to
> > the (pre-populated) IAA input memory, and another from the
> > (pre-populated) IAA output memory to the guest address, which may
> > become a performance bottleneck at the destination during the live
> > migration process.
> >
> > So I think it is still necessary to use the -mem-prealloc option.
Right, that complexity may not be necessary; in that case, maybe such a
suggestion is fine.
Thanks,
> >
> > > >     |
> > > >     -- mmu_notifier_invalidate_range
> > > >      |
> > > >      -- intel_invalidate_range
> > > >       |
> > > >       -- qi_flush_piotlb
> > > >       -- qi_flush_dev_iotlb_pasid
> > >
> > > --
> > > Peter Xu
>
--
Peter Xu