From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marcelo Tosatti
Subject: Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread
Date: Wed, 9 Oct 2013 22:47:10 -0300
Message-ID: <20131010014710.GA2198@amt.cnet>
References: <1378376958-27252-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com> <1378376958-27252-13-git-send-email-xiaoguangrong@linux.vnet.ibm.com> <20131008012355.GA3588@amt.cnet> <20131009015627.GA4816@amt.cnet> <525533DB.1060104@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Xiao Guangrong, gleb@redhat.com, avi.kivity@gmail.com, pbonzini@redhat.com, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
To: Xiao Guangrong
Return-path:
Content-Disposition: inline
In-Reply-To: <525533DB.1060104@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On Wed, Oct 09, 2013 at 06:45:47PM +0800, Xiao Guangrong wrote:
> On 10/09/2013 09:56 AM, Marcelo Tosatti wrote:
> > On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
> >>
> >> Hi Marcelo,
> >>
> >> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti wrote:
> >>
> >>>>
> >>>> +	if (kvm->arch.rcu_free_shadow_page) {
> >>>> +		kvm_mmu_isolate_pages(invalid_list);
> >>>> +		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> >>>> +		list_del_init(invalid_list);
> >>>> +		call_rcu(&sp->rcu, free_pages_rcu);
> >>>> +		return;
> >>>> +	}
> >>>
> >>> This is unbounded (there was a similar problem with early fast page fault
> >>> implementations):
> >>>
> >>> From RCU/checklist.txt:
> >>>
> >>> "	An especially important property of the synchronize_rcu()
> >>>	primitive is that it automatically self-limits: if grace periods
> >>>	are delayed for whatever reason, then the synchronize_rcu()
> >>>	primitive will correspondingly delay updates.
In contrast,
> >>>	code using call_rcu() should explicitly limit update rate in
> >>>	cases where grace periods are delayed, as failing to do so can
> >>>	result in excessive realtime latencies or even OOM conditions.
> >>> "
> >>
> >> I understand what you are worrying about… Hmm, can it be avoided by
> >> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
> >> there is only a slight chance that a page needs to be freed by call_rcu.
> >
> > The point that must be addressed is that you cannot allow an unlimited
> > number of sp's to be freed via call_rcu between two grace periods.
> >
> > So something like:
> >
> > - For every 17MB worth of shadow pages.
> > - Guarantee a grace period has passed.
>
> Hmm, the 'qhimark' in rcu is 10000, which means rcu allows call_rcu
> to pend 10000 times in a grace period without slowdown. Can we really
> reach this number while rcu_free_shadow_page is set? Anyway, if it can,
> we can use rcu tech-break to avoid it, can't we?

Yes.

> > If you control kvm->arch.rcu_free_shadow_page, you could periodically
> > verify how many MBs worth of shadow pages are in the queue for RCU
> > freeing and force a grace period after a certain number.
>
> I have no idea how to force a grace period for a cpu which is running
> on the rcu read side. IIUC, only dyntick-idle and offline CPUs can be
> forced, see rcu_gp_fqs().

synchronize_rcu.

> >>> Moreover, freeing pages differently depending on some state should
> >>> be avoided.
> >>>
> >>> Alternatives:
> >>>
> >>> - Disable interrupts at write protect sites.
> >>
> >> The write protection can be triggered by a KVM ioctl that is not in the VCPU
> >> context; if we do this, we also need to send an IPI to the KVM thread when doing
> >> a TLB flush.
> >
> > Yes. However for the case being measured, simultaneous page freeing by vcpus
> > should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).
>
> I agree, but write protection will cost lots of time; we need to either:
> 1) do write protection with irqs disabled, which is not good for device latency,
> or
> 2) do pieces of the work, then enable irqs and disable them again to
>    continue the work. Enabling and disabling irqs many times is not cheap on x86.
>
> no?

It's getting cheaper (see the performance optimization guide).

> >> And we can not do much work while interrupts are disabled due to
> >> interrupt latency.
> >>
> >>> - Rate limit the number of pages freed via call_rcu
> >>> per grace period.
> >>
> >> Seems complex. :(
> >>
> >>> - Some better alternative.
> >>
> >> Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
> >> and encodes the page level into the spte (since we need to check if the spte
> >> is the last spte). How about this?
> >
> > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > regards to limitation? (maybe it is).
>
> In my experience, freeing shadow pages and allocating shadow pages are balanced;
> we can check it by (make -j12 on a guest with 4 vcpus and):
>
> # echo > trace
> [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> 10816
> [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> 10656
> [root@eric-desktop tracing]# cat set_event
> kvmmmu:kvm_mmu_get_page
> kvmmmu:kvm_mmu_prepare_zap_page
>
> alloc VS. free = 10816 : 10656
>
> So mostly all allocating and freeing is done in the slab's cache,
> and the slab frees shadow pages very slowly; there is no rcu issue.

A more detailed test case would be:

- cpu0-vcpu0 releasing pages as fast as possible
- cpu1 executing get_dirty_log

Think of a very large guest.

> >> I planned to do it after this patchset is merged; if you like it and if you think
> >> that "using kvm->arch.rcu_free_shadow_page in a small window" can not avoid
> >> the issue, i am happy to do it in the next version.
:)
> >
> > Unfortunately the window can be large (as it depends on the size of the
> > memslot), so it would be best if this problem can be addressed before
> > merging. What is your idea for reducing the rcu_free_shadow_page=1 window?
> >
>
> I mean something like:
>
> for (i = 0; i <= page_nr; i++) {
>	rcu_free_shadow_page = 1;
>	write_protect(&rmap[i]);
>	rcu_free_shadow_page = 0;
> }
>
> we write-protect fewer entries at a time rather than a whole memslot.
>
> BTW, i think rcu_need_break() would be useful for this case, so that the
> rcu read side does not delay synchronize_rcu() too much.

Yes, that works; you'd have to synchronize_sched eventually, depending on
the count of pending pages to be freed via RCU (so that rate limiting is
performed).

Or, you could bound the problematic callsites, those that can free a
large number of shadow pages (mmu_shrink, kvm_mmu_change_mmu_pages,
...), and limit there:

if (spin_needbreak(mmu_lock) ||
    (sp_pending_rcu_free > sp_pending_rcu_free_max)) {
	if (sp_pending...)
		synchronize_rcu();
}

And any other code path that can possibly free pages, outside locks of
course (even those that can free 1 page). In a similar fashion as to how
the per-vcpu mmu object slabs are grown.