Re: RFC: Split EPT huge pages in advance of dirty logging

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Peter Xu <peterx@redhat.com>
To: "Zhoujian (jay)" <jianjay.zhou@huawei.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"quintela@redhat.com" <quintela@redhat.com>,
	"Liujinsong (Paul)" <liu.jinsong@huawei.com>,
	"linfeng (M)" <linfeng23@huawei.com>,
	"wangxin (U)" <wangxinxin.wang@huawei.com>,
	"Huangweidong (C)" <weidong.huang@huawei.com>
Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
Date: Tue, 18 Feb 2020 12:43:11 -0500	[thread overview]
Message-ID: <20200218174311.GE1408806@xz-x1> (raw)
In-Reply-To: <B2D15215269B544CADD246097EACE7474BAF9AB6@DGGEMM528-MBX.china.huawei.com>

On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> Hi all,
> 
> We found that the guest will be soft-lockup occasionally when live migrating a 60 vCPU,
> 512GiB huge page and memory sensitive VM. The reason is clear, almost all of the vCPUs
> are waiting for the KVM MMU spin-lock to create 4K SPTEs when the huge pages are
> write protected. This phenomenon is also described in this patch set:
> https://patchwork.kernel.org/cover/11163459/
> which aims to handle page faults in parallel more efficiently.
> 
> Our idea is to use the migration thread to touch all of the guest memory in the
> granularity of 4K before enabling dirty logging. To be more specific, we split all the
> PDPE_LEVEL SPTEs into DIRECTORY_LEVEL SPTEs as the first step, and then split all
> the DIRECTORY_LEVEL SPTEs into PAGE_TABLE_LEVEL SPTEs as the following step.

IIUC, QEMU will prefer to use huge pages for all the anonymous
ramblocks (please refer to ram_block_add):

        qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);

Another alternative I can think of is to add an extra parameter to
QEMU to explicitly disable huge pages (so that can even be
MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that should also
drag down the performance for the whole lifecycle of the VM.  A 3rd
option is to make a QMP command to dynamically turn huge pages on/off
for ramblocks globally.  Haven't thought deep into any of them, but
seems doable.

Thanks,

-- 
Peter Xu

WARNING: multiple messages have this Message-ID (diff)

From: Peter Xu <peterx@redhat.com>
To: "Zhoujian (jay)" <jianjay.zhou@huawei.com>
Cc: "Liujinsong \(Paul\)" <liu.jinsong@huawei.com>,
	"linfeng \(M\)" <linfeng23@huawei.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"quintela@redhat.com" <quintela@redhat.com>,
	"wangxin \(U\)" <wangxinxin.wang@huawei.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"Huangweidong \(C\)" <weidong.huang@huawei.com>
Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
Date: Tue, 18 Feb 2020 12:43:11 -0500	[thread overview]
Message-ID: <20200218174311.GE1408806@xz-x1> (raw)
In-Reply-To: <B2D15215269B544CADD246097EACE7474BAF9AB6@DGGEMM528-MBX.china.huawei.com>

On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> Hi all,
> 
> We found that the guest will be soft-lockup occasionally when live migrating a 60 vCPU,
> 512GiB huge page and memory sensitive VM. The reason is clear, almost all of the vCPUs
> are waiting for the KVM MMU spin-lock to create 4K SPTEs when the huge pages are
> write protected. This phenomenon is also described in this patch set:
> https://patchwork.kernel.org/cover/11163459/
> which aims to handle page faults in parallel more efficiently.
> 
> Our idea is to use the migration thread to touch all of the guest memory in the
> granularity of 4K before enabling dirty logging. To be more specific, we split all the
> PDPE_LEVEL SPTEs into DIRECTORY_LEVEL SPTEs as the first step, and then split all
> the DIRECTORY_LEVEL SPTEs into PAGE_TABLE_LEVEL SPTEs as the following step.

IIUC, QEMU will prefer to use huge pages for all the anonymous
ramblocks (please refer to ram_block_add):

        qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);

Another alternative I can think of is to add an extra parameter to
QEMU to explicitly disable huge pages (so that can even be
MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).  However that should also
drag down the performance for the whole lifecycle of the VM.  A 3rd
option is to make a QMP command to dynamically turn huge pages on/off
for ramblocks globally.  Haven't thought deep into any of them, but
seems doable.

Thanks,

-- 
Peter Xu

next prev parent reply	other threads:[~2020-02-18 17:43 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-18 13:13 RFC: Split EPT huge pages in advance of dirty logging Zhoujian (jay)
2020-02-18 13:13 ` Zhoujian (jay)
2020-02-18 17:43 ` Peter Xu [this message]
2020-02-18 17:43   ` Peter Xu
2020-02-19 13:19   ` Zhoujian (jay)
2020-02-19 13:19     ` Zhoujian (jay)
2020-02-19 17:19     ` Peter Xu
2020-02-19 17:19       ` Peter Xu
2020-02-20 13:52       ` Zhoujian (jay)
2020-02-20 13:52         ` Zhoujian (jay)
2020-02-20 17:32         ` Ben Gardon
2020-02-20 17:34         ` Ben Gardon
2020-02-20 18:17           ` Peter Xu
2020-02-20 18:17             ` Peter Xu
2020-02-21  6:51             ` Zhoujian (jay)
2020-02-21  6:51               ` Zhoujian (jay)
2020-02-21 22:08           ` Junaid Shahid
2020-02-22  0:19             ` Peter Feiner
2020-02-24  1:07               ` Zhoujian (jay)
2020-02-24  1:07                 ` Zhoujian (jay)
2020-03-02 13:38               ` Zhoujian (jay)
2020-03-02 13:38                 ` Zhoujian (jay)
2020-03-02 16:28                 ` Peter Feiner
2020-03-03  4:29                   ` Zhoujian (jay)
2020-03-03  4:29                     ` Zhoujian (jay)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200218174311.GE1408806@xz-x1 \
    --to=peterx@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=jianjay.zhou@huawei.com \
    --cc=kvm@vger.kernel.org \
    --cc=linfeng23@huawei.com \
    --cc=liu.jinsong@huawei.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=wangxinxin.wang@huawei.com \
    --cc=weidong.huang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.