From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marcelo Tosatti
Subject: Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread
Date: Wed, 9 Oct 2013 22:47:10 -0300
Message-ID: <20131010014710.GA2198@amt.cnet>
References: <1378376958-27252-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com> <1378376958-27252-13-git-send-email-xiaoguangrong@linux.vnet.ibm.com> <20131008012355.GA3588@amt.cnet> <20131009015627.GA4816@amt.cnet> <525533DB.1060104@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Xiao Guangrong, gleb@redhat.com, avi.kivity@gmail.com, pbonzini@redhat.com, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
To: Xiao Guangrong
Return-path:
Content-Disposition: inline
In-Reply-To: <525533DB.1060104@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On Wed, Oct 09, 2013 at 06:45:47PM +0800, Xiao Guangrong wrote:
> On 10/09/2013 09:56 AM, Marcelo Tosatti wrote:
> > On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
> >>
> >> Hi Marcelo,
> >>
> >> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti wrote:
> >>
> >>>>
> >>>> +	if (kvm->arch.rcu_free_shadow_page) {
> >>>> +		kvm_mmu_isolate_pages(invalid_list);
> >>>> +		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> >>>> +		list_del_init(invalid_list);
> >>>> +		call_rcu(&sp->rcu, free_pages_rcu);
> >>>> +		return;
> >>>> +	}
> >>>
> >>> This is unbounded (there was a similar problem with early fast page fault
> >>> implementations):
> >>>
> >>> From RCU/checklist.txt:
> >>>
> >>> "	An especially important property of the synchronize_rcu()
> >>>	primitive is that it automatically self-limits: if grace periods
> >>>	are delayed for whatever reason, then the synchronize_rcu()
> >>>	primitive will correspondingly delay updates.
In contrast,
> >>>	code using call_rcu() should explicitly limit update rate in
> >>>	cases where grace periods are delayed, as failing to do so can
> >>>	result in excessive realtime latencies or even OOM conditions.
> >>> "
> >>
> >> I understand what you are worrying about… Hmm, can it be avoided by
> >> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
> >> there is only a slight chance that a page needs to be freed by call_rcu.
> >
> > The point that must be addressed is that you cannot allow an unlimited
> > number of sp's to be freed via call_rcu between two grace periods.
> >
> > So something like:
> >
> > - For every 17MB worth of shadow pages.
> > - Guarantee a grace period has passed.
>
> Hmm, the 'qhimark' in rcu is 10000, which means rcu allows call_rcu
> to pend 10000 times in a grace period without slowdown. Can we really
> reach this number while rcu_free_shadow_page is set? Anyway, if it can,
> we can use rcu tech-break to avoid it, can't we?

Yes.

> > If you control kvm->arch.rcu_free_shadow_page, you could periodically
> > verify how many MBs worth of shadow pages are in the queue for RCU
> > freeing and force a grace period after a certain number.
>
> I have no idea how to force a grace period for a cpu which is running
> on the rcu read side. IIUC, only dyntick-idle and offline CPUs can be
> forced, see rcu_gp_fqs().

synchronize_rcu.

> >>> Moreover, freeing pages differently depending on some state should
> >>> be avoided.
> >>>
> >>> Alternatives:
> >>>
> >>> - Disable interrupts at write protect sites.
> >>
> >> The write protection can be triggered by a KVM ioctl that is not in the VCPU
> >> context; if we do this, we also need to send an IPI to the KVM thread when doing
> >> a TLB flush.
> >
> > Yes. However for the case being measured, simultaneous page freeing by vcpus
> > should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).
>
> I agree, but write protection will cost lots of time; we need to either:
> 1) do write protection with irqs disabled, which is not good for device latency,
> or
> 2) do pieces of the work, then enable irqs and disable them again to
>    continue the work. Enabling and disabling irqs many times is not cheap on x86.
>
> no?

It's getting cheaper (see the performance optimization guide).

> >> And we can not do much work while interrupts are disabled due to
> >> interrupt latency.
> >>
> >>> - Rate limit the number of pages freed via call_rcu
> >>> per grace period.
> >>
> >> Seems complex. :(
> >>
> >>> - Some better alternative.
> >>
> >> Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
> >> and encodes the page level into the spte (since we need to check if the spte
> >> is the last spte). How about this?
> >
> > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > regards to limitation? (maybe it is).
>
> In my experience, freeing shadow pages and allocating shadow pages are balanced;
> we can check it by (make -j12 on a guest with 4 vcpus and):
>
> # echo > trace
> [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> 10816
> [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> 10656
> [root@eric-desktop tracing]# cat set_event
> kvmmmu:kvm_mmu_get_page
> kvmmmu:kvm_mmu_prepare_zap_page
>
> alloc VS. free = 10816 : 10656
>
> So mostly all allocating and freeing is done in the slab's cache,
> and the slab frees shadow pages very slowly; there is no rcu issue.

A more detailed test case would be:

- cpu0-vcpu0 releasing pages as fast as possible
- cpu1 executing get_dirty_log

Think of a very large guest.

> >> I planned to do it after this patchset is merged; if you like it and if you think
> >> that "using kvm->arch.rcu_free_shadow_page in a small window" can not avoid
> >> the issue, i am happy to do it in the next version.
:)
> >
> > Unfortunately the window can be large (as it depends on the size of the
> > memslot), so it would be best if this problem can be addressed before
> > merging. What is your idea for reducing the rcu_free_shadow_page=1 window?
> >
>
> I mean something like:
>
> for (i = 0; i <= page_nr; i++) {
>	rcu_free_shadow_page = 1;
>	write_protect(&rmap[i]);
>	rcu_free_shadow_page = 0;
> }
>
> we write-protect fewer entries at a time rather than a whole memslot.
>
> BTW, i think rcu_need_break() would be useful for this case, so that the
> rcu read side does not delay synchronize_rcu() too much.

Yes, that works; you'd have to synchronize_sched eventually, depending on
the count of pending pages to be freed via RCU (so that rate limiting is
performed).

Or, you could bound the problematic callsites, those that can free a
large number of shadow pages (mmu_shrink, kvm_mmu_change_mmu_pages,
...), and limit there:

if (spin_needbreak(mmu_lock) ||
    (sp_pending_rcu_free > sp_pending_rcu_free_max)) {
	if (sp_pending...)
		synchronize_rcu();
}

And any other code path that can possibly free pages, outside locks of
course (even those that can free 1 page). In a similar fashion as to how
the per-vcpu mmu object slabs are grown.