linux-kernel.vger.kernel.org archive mirror
From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: Gleb Natapov <gleb@redhat.com>
Cc: avi.kivity@gmail.com, mtosatti@redhat.com, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
Date: Thu, 29 Aug 2013 14:50:51 +0800	[thread overview]
Message-ID: <521EEF4B.4040107@linux.vnet.ibm.com> (raw)
In-Reply-To: <20130828133635.GU22899@redhat.com>

On 08/28/2013 09:36 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
>>>> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
>>>>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>>>>>> Or what if desc is moved to another rmap, but then it
>>>>>>> is moved back to initial rmap (but another place in the desc list) so
>>>>>>> the check here will not catch that we need to restart walking?
>>>>>>
>>>>>> It is okay. We always add the new desc to the head, then we will walk
>>>>>> all the entries in this case.
>>>>>>
>>>>> Which raises another question: what if a desc is added at the front of the
>>>>> list, behind the point where the lockless walker currently is?
>>>>
>>>> That case means a new spte is being added into the rmap. We need not care
>>>> about the new sptes, since they will set the dirty-bitmap and can be
>>>> write-protected next time.
>>>>
>>> OK.
>>>
>>>>>
>>>>>> Right?
>>>>> Not sure. While the lockless walker works on a desc, the rmap can be
>>>>> completely destroyed and recreated again, in any order.
>>>>
>>>> I think the thing is very similar to include/linux/rculist_nulls.h
>>> include/linux/rculist_nulls.h is for implementing hash tables, so they
>>> may not care about the add/del/lookup race, for instance, but maybe we do
>>> (you are saying above that we do not), so similarity does not prove
>>> correctness for our case.
>>
>> We do not care about "add" and "del" either when looking up the rmap. The
>> "add" case is okay for the reason I explained above. In the "del" case, the
>> spte becomes non-present and all TLBs are flushed immediately, so it is also okay.
>>
>> I always use a stupid way to check correctness: enumerating all the cases
>> we may meet. In this patch, we may meet these cases:
>>
>> 1) kvm deletes a desc before the one we are currently on
>>    that desc has already been checked; we do not need to care about it.
>>
>> 2) kvm deletes a desc after the one we are currently on
>>    Since we always add/del the head desc, we can be sure the current desc has
>>    also been deleted, so we will meet case 3).
>>
>> 3) kvm deletes the desc that we are currently on
>>    3.a): the desc stays in the slab cache (it is not reused).
>>          all spte entries are empty, so fn() will skip the non-present sptes,
>>          and desc->more is
>>          3.a.1) still pointing to the next desc, so we continue the lookup,
>>          3.a.2) or it is the "nulls list", which means we reached the last
>>                 one, so we finish the walk.
>>
>>    3.b): the desc is allocated from the slab cache and is being initialized.
>>          we will see "desc->more == NULL" and restart the walk. It's okay.
>>
>>    3.c): the desc is added to a rmap or pte_list again.
>>          3.c.1): the desc is added to the current rmap again.
>>                  the new desc always acts as the head desc, so we will walk
>>                  all entries; some entries are checked twice but no entry
>>                  can be missed. It is okay.
>>
>>          3.c.2): the desc is added to another rmap or pte_list.
>>                  since kvm_set_memory_region() and get_dirty are serialized
>>                  by slots-lock, the "nulls" can not be reused during the
>>                  lookup. We will then meet a different "nulls" at the end of
>>                  the walk, which causes a rewalk.
>>
>> I know checking the algorithm like this is really silly; do you have a better idea?
>>
> Not silly, but easy to miss cases. For instance, 3.c.3 can be:
>  the desc is added to another rmap, then we move to another desc on the
>  wrong rmap, and this other desc is also deleted and reinserted into the
>  original rmap. Seems like the justification from 3.c.1 applies to that
>  too, though.
> 
>>> BTW I do not see
>>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>>
>> IIUC, we can not directly use rcu_assign_pointer(): it is something like
>> p = v, assigning a pointer to a pointer. But in our case, we need:
>>    *pte_list = (unsigned long)desc | 1;
> From Documentation/RCU/whatisRCU.txt:
> 
> The updater uses this function to assign a new value to an RCU-protected pointer.
> 
> This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
> The fact that the value is not correct pointer should not matter.
> 

Okay. Will change that code to:

+
+#define rcu_assign_head_desc(pte_list_p, value)        \
+       rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
                desc->sptes[1] = spte;
                desc_mark_nulls(pte_list, desc);

-               /*
-                * Esure the old spte has been updated into desc, so
-                * that the another side can not get the desc from pte_list
-                * but miss the old spte.
-                */
-               smp_wmb();
-
-               *pte_list = (unsigned long)desc | 1;
+               rcu_assign_head_desc(pte_list, (unsigned long)desc | 1);

>>
>> So i add the smp_wmb() by myself:
>> 		/*
>> 		 * Esure the old spte has been updated into desc, so
>> 		 * that the another side can not get the desc from pte_list
>> 		 * but miss the old spte.
>> 		 */
>> 		smp_wmb();
>>
>> 		*pte_list = (unsigned long)desc | 1;
>>
>> But I missed it when inserting an empty desc; in that case, we need the
>> barrier too, since we should make desc->more visible before assigning the
>> desc to pte_list, to avoid the lookup side seeing an invalid "nulls".
>>
>> I also use my own code instead of rcu_dereference():
>> pte_list_walk_lockless():
>> 	pte_list_value = ACCESS_ONCE(*pte_list);
>> 	if (!pte_list_value)
>> 		return;
>>
>> 	if (!(pte_list_value & 1))
>> 		return fn((u64 *)pte_list_value);
>>
>> 	/*
>> 	 * fetch pte_list before reading the sptes in the desc; see the
>> 	 * comments in pte_list_add().
>> 	 *
>> 	 * There is a data dependency, since the desc is obtained from pte_list.
>> 	 */
>> 	smp_read_barrier_depends();
>>
>> That part can be replaced by rcu_dereference().
>>
> Yes please, also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for
> kind of scary bugs we can get here.

Right, it is likely triggerable in our case; will fix it.

> 
>>> incorrect usage of RCU. I think any access to slab pointers will need to
>>> use those.
>>
>> Removing a desc does not need it, I think, since we do not mind seeing the
>> old info. (hlist_nulls_del_rcu() does not use rcu_dereference() either.)
>>
> May be a bug. I also noticed that rculist_nulls uses rcu_dereference()

But list_del_rcu() does not use rcu_assign_pointer() either.

> to access ->next, but it does not use rcu_assign_pointer() pointer to
> assign it.

You mean the rcu_dereference() used in hlist_nulls_for_each_entry_rcu()? I think
it's there because we should validate the prefetched data before entry->next is
accessed; it is paired with the barrier in rcu_assign_pointer() when a new entry
is added to the list. rcu_assign_pointer() makes the other fields in the entry
visible before linking the entry into the list. Otherwise, the lookup could
reach the entry but see invalid fields.

After more thinking, I still think rcu_assign_pointer() is unneeded when an entry
is removed. The remove API does not care about the order between unlinking the
entry and the changes to its fields. It is the caller's responsibility:
- in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc. to
  ensure all lookups have exited, so that later changes to the entry are
  invisible to the lookups.

- in the case of rculist_nulls, it seems a refcounter is used to guarantee the
  order (see the example in Documentation/RCU/rculist_nulls.txt).

- in our case, we allow the lookup to see the deleted desc even while it is in
  the slab cache, or being initialized, or re-added.

Your thoughts?

