From: Gleb Natapov <gleb@redhat.com>
To: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: avi.kivity@gmail.com, mtosatti@redhat.com, pbonzini@redhat.com,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
Date: Wed, 28 Aug 2013 16:36:36 +0300 [thread overview]
Message-ID: <20130828133635.GU22899@redhat.com> (raw)
In-Reply-To: <521DE9E8.2040908@linux.vnet.ibm.com>
On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
> >> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> >>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> >>>>> Or what if desc is moved to another rmap, but then it
> >>>>> is moved back to initial rmap (but another place in the desc list) so
> >>>>> the check here will not catch that we need to restart walking?
> >>>>
> >>>> It is okay. We always add the new desc to the head, then we will walk
> >>>> all the entires under this case.
> >>>>
> >>> Which races another question: What if desc is added in front of the list
> >>> behind the point where lockless walker currently is?
> >>
> >> That case is new spte is being added into the rmap. We need not to care the
> >> new sptes since it will set the dirty-bitmap then they can be write-protected
> >> next time.
> >>
> > OK.
> >
> >>>
> >>>> Right?
> >>> Not sure. While lockless walker works on a desc rmap can be completely
> >>> destroyed and recreated again. It can be any order.
> >>
> >> I think the thing is very similar as include/linux/rculist_nulls.h
> > include/linux/rculist_nulls.h is for implementing hash tables, so they
> > may not care about add/del/lookup race for instance, but may be we are
> > (you are saying above that we are not), so similarity does not prove
> > correctness for our case.
>
> We do not care the "add" and "del" too when lookup the rmap. Under the "add"
> case, it is okay, the reason i have explained above. Under the "del" case,
> the spte becomes unpresent and flush all tlbs immediately, so it is also okay.
>
> I always use a stupid way to check the correctness, that is enumerating
> all cases we may meet, in this patch, we may meet these cases:
>
> 1) kvm deletes the desc before we are current on
> that descs have been checked, do not need to care it.
>
> 2) kvm deletes the desc after we are currently on
> Since we always add/del the head desc, we can sure the current desc has been
> deleted, then we will meet case 3).
>
> 3) kvm deletes the desc that we are currently on
> 3.a): the desc stays in slab cache (do not be reused).
> all spte entires are empty, then the fn() will skip the nonprsent spte,
> and desc->more is
> 3.a.1) still pointing to next-desc, then we will continue the lookup
> 3.a.2) or it is the "nulls list", that means we reach the last one,
> then finish the walk.
>
> 3.b): the desc is alloc-ed from slab cache and it's being initialized.
> we will see "desc->more == NULL" then restart the walking. It's okay.
>
> 3.c): the desc is added to rmap or pte_list again.
> 3.c.1): the desc is added to the current rmap again.
> the new desc always acts as the head desc, then we will walk
> all entries, some entries are double checked and not entry
> can be missed. It is okay.
>
> 3.c.2): the desc is added to another rmap or pte_list
> since kvm_set_memory_region() and get_dirty are serial by slots-lock.
> so the "nulls" can not be reused during lookup. Then we we will
> meet the different "nulls" at the end of walking that will cause
> rewalk.
>
> I know check the algorithm like this is really silly, do you have other idea?
>
Not silly, but easy to miss cases. For instance 3.c.3 can be:
the desc is added to another rmap then we move to another desc on the
wrong rmap, this other desc is also deleted and reinserted into
original rmap. Seams like justification from 3.c.1 applies to that to
though.
> > BTW I do not see
> > rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>
> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
> p = v to assign a pointer to a pointer. But in our case, we need:
> *pte_list = (unsigned long)desc | 1;
>From Documentation/RCU/whatisRCU.txt:
The updater uses this function to assign a new value to an RCU-protected pointer.
This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
The fact that the value is not correct pointer should not matter.
>
> So i add the smp_wmb() by myself:
> /*
> * Esure the old spte has been updated into desc, so
> * that the another side can not get the desc from pte_list
> * but miss the old spte.
> */
> smp_wmb();
>
> *pte_list = (unsigned long)desc | 1;
>
> But i missed it when inserting a empty desc, in that case, we need the barrier
> too since we should make desc->more visible before assign it to pte_list to
> avoid the lookup side seeing the invalid "nulls".
>
> I also use own code instead of rcu_dereference():
> pte_list_walk_lockless():
> pte_list_value = ACCESS_ONCE(*pte_list);
> if (!pte_list_value)
> return;
>
> if (!(pte_list_value & 1))
> return fn((u64 *)pte_list_value);
>
> /*
> * fetch pte_list before read sptes in the desc, see the comments
> * in pte_list_add().
> *
> * There is the data dependence since the desc is got from pte_list.
> */
> smp_read_barrier_depends();
>
> That part can be replaced by rcu_dereference().
>
Yes please, also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for
kind of scary bugs we can get here.
> > incorrect usage of RCU. I think any access to slab pointers will need to
> > use those.
>
> Remove desc is not necessary i think since we do not mind to see the old
> info. (hlist_nulls_del_rcu() does not use rcu_dereference() too)
>
May be a bug. I also noticed that rculist_nulls uses rcu_dereference()
to access ->next, but it does not use rcu_assign_pointer() pointer to
assign it.
--
Gleb.
next prev parent reply other threads:[~2013-08-28 13:36 UTC|newest]
Thread overview: 69+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-07-30 13:01 [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-07-30 13:01 ` [PATCH 01/12] KVM: MMU: remove unused parameter Xiao Guangrong
2013-08-29 7:22 ` Gleb Natapov
2013-07-30 13:02 ` [PATCH 02/12] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
2013-07-30 13:02 ` [PATCH 03/12] KVM: MMU: lazily drop large spte Xiao Guangrong
2013-08-02 14:55 ` Marcelo Tosatti
2013-08-02 15:42 ` Xiao Guangrong
2013-08-02 20:27 ` Marcelo Tosatti
2013-08-02 22:56 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable Xiao Guangrong
2013-07-30 13:26 ` Paolo Bonzini
2013-07-31 7:25 ` Xiao Guangrong
2013-08-07 1:48 ` Marcelo Tosatti
2013-08-07 4:06 ` Xiao Guangrong
2013-08-08 15:06 ` Marcelo Tosatti
2013-08-08 16:26 ` Xiao Guangrong
2013-11-20 0:29 ` Marcelo Tosatti
2013-11-20 0:35 ` Marcelo Tosatti
2013-11-20 14:20 ` Xiao Guangrong
2013-11-20 19:47 ` Marcelo Tosatti
2013-11-21 4:26 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page Xiao Guangrong
2013-07-30 13:27 ` Paolo Bonzini
2013-07-31 7:33 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
2013-08-28 7:23 ` Gleb Natapov
2013-08-28 7:50 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
2013-08-28 8:12 ` Gleb Natapov
2013-08-28 8:37 ` Xiao Guangrong
2013-08-28 8:58 ` Gleb Natapov
2013-08-28 9:19 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 08/12] KVM: MMU: introduce nulls desc Xiao Guangrong
2013-08-28 8:40 ` Gleb Natapov
2013-08-28 8:54 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker Xiao Guangrong
2013-08-28 9:20 ` Gleb Natapov
2013-08-28 9:33 ` Xiao Guangrong
2013-08-28 9:46 ` Gleb Natapov
2013-08-28 10:13 ` Xiao Guangrong
2013-08-28 10:49 ` Gleb Natapov
2013-08-28 12:15 ` Xiao Guangrong
2013-08-28 13:36 ` Gleb Natapov [this message]
2013-08-29 6:50 ` Xiao Guangrong
2013-08-29 9:08 ` Gleb Natapov
2013-08-29 9:31 ` Xiao Guangrong
2013-08-29 9:51 ` Gleb Natapov
2013-08-29 11:26 ` Xiao Guangrong
2013-08-30 11:38 ` Gleb Natapov
2013-09-02 7:02 ` Xiao Guangrong
2013-08-29 9:31 ` Gleb Natapov
2013-08-29 11:33 ` Xiao Guangrong
2013-08-29 12:02 ` Xiao Guangrong
2013-08-30 11:44 ` Gleb Natapov
2013-09-02 8:50 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread Xiao Guangrong
2013-08-07 13:09 ` Takuya Yoshikawa
2013-08-07 13:19 ` Xiao Guangrong
2013-08-29 9:10 ` Gleb Natapov
2013-08-29 9:25 ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 11/12] KVM: MMU: locklessly write-protect the page Xiao Guangrong
2013-07-30 13:02 ` [PATCH 12/12] KVM: MMU: clean up spte_write_protect Xiao Guangrong
2013-07-30 13:11 ` [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-08-03 5:09 ` Takuya Yoshikawa
2013-08-04 14:15 ` Xiao Guangrong
2013-08-29 7:16 ` Gleb Natapov
2013-08-06 13:16 ` Xiao Guangrong
2013-08-08 17:38 ` Paolo Bonzini
2013-08-09 4:51 ` Xiao Guangrong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130828133635.GU22899@redhat.com \
--to=gleb@redhat.com \
--cc=avi.kivity@gmail.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mtosatti@redhat.com \
--cc=pbonzini@redhat.com \
--cc=xiaoguangrong@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).