All of lore.kernel.org
 help / color / mirror / Atom feed
From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: Gleb Natapov <gleb@redhat.com>
Cc: avi.kivity@gmail.com, mtosatti@redhat.com, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
Date: Thu, 29 Aug 2013 14:50:51 +0800	[thread overview]
Message-ID: <521EEF4B.4040107@linux.vnet.ibm.com> (raw)
In-Reply-To: <20130828133635.GU22899@redhat.com>

On 08/28/2013 09:36 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
>>>> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
>>>>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>>>>>> Or what if desc is moved to another rmap, but then it
>>>>>>> is moved back to initial rmap (but another place in the desc list) so
>>>>>>> the check here will not catch that we need to restart walking?
>>>>>>
>>>>>> It is okay. We always add the new desc to the head, then we will walk
>>>>>> all the entires under this case.
>>>>>>
>>>>> Which races another question: What if desc is added in front of the list
>>>>> behind the point where lockless walker currently is?
>>>>
>>>> That case is new spte is being added into the rmap. We need not to care the
>>>> new sptes since it will set the dirty-bitmap then they can be write-protected
>>>> next time.
>>>>
>>> OK.
>>>
>>>>>
>>>>>> Right?
>>>>> Not sure. While lockless walker works on a desc rmap can be completely
>>>>> destroyed and recreated again. It can be any order.
>>>>
>>>> I think the thing is very similar as include/linux/rculist_nulls.h
>>> include/linux/rculist_nulls.h is for implementing hash tables, so they
>>> may not care about add/del/lookup race for instance, but may be we are
>>> (you are saying above that we are not), so similarity does not prove
>>> correctness for our case. 
>>
>> We do not care the "add" and "del" too when lookup the rmap. Under the "add"
>> case, it is okay, the reason i have explained above. Under the "del" case,
>> the spte becomes unpresent and flush all tlbs immediately, so it is also okay.
>>
>> I always use a stupid way to check the correctness, that is enumerating
>> all cases we may meet, in this patch, we may meet these cases:
>>
>> 1) kvm deletes the desc before we are current on
>>    that descs have been checked, do not need to care it.
>>
>> 2) kvm deletes the desc after we are currently on
>>    Since we always add/del the head desc, we can sure the current desc has been
>>    deleted, then we will meet case 3).
>>
>> 3) kvm deletes the desc that we are currently on
>>    3.a): the desc stays in slab cache (do not be reused).
>>          all spte entires are empty, then the fn() will skip the nonprsent spte,
>>          and desc->more is
>>          3.a.1) still pointing to next-desc, then we will continue the lookup
>>          3.a.2) or it is the "nulls list", that means we reach the last one,
>>                 then finish the walk.
>>
>>    3.b): the desc is alloc-ed from slab cache and it's being initialized.
>>          we will see "desc->more == NULL" then restart the walking. It's okay.
>>
>>    3.c): the desc is added to rmap or pte_list again.
>>          3.c.1): the desc is added to the current rmap again.
>> 		 the new desc always acts as the head desc, then we will walk
>>                  all entries, some entries are double checked and not entry
>>                  can be missed. It is okay.
>>
>>          3.c.2): the desc is added to another rmap or pte_list
>>                  since kvm_set_memory_region() and get_dirty are serial by slots-lock.
>>                  so the "nulls" can not be reused during lookup. Then we we will
>>                  meet the different "nulls" at the end of walking that will cause
>>                  rewalk.
>>
>> I know check the algorithm like this is really silly, do you have other idea?
>>
> Not silly, but easy to miss cases. For instance 3.c.3 can be:
>  the desc is added to another rmap then we move to another desc on the
>  wrong rmap, this other desc is also deleted and reinserted into
>  original rmap. Seams like justification from 3.c.1 applies to that to
>  though.
> 
>>> BTW I do not see
>>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>>
>> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
>> p = v to assign a pointer to a pointer. But in our case, we need:
>>    *pte_list = (unsigned long)desc | 1;
>>From Documentation/RCU/whatisRCU.txt:
> 
> The updater uses this function to assign a new value to an RCU-protected pointer.
> 
> This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
> The fact that the value is not correct pointer should not matter.
> 

Okay. Will change that code to:

+
+#define rcu_assign_head_desc(pte_list_p, value)        \
+       rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
                desc->sptes[1] = spte;
                desc_mark_nulls(pte_list, desc);

-               /*
-                * Esure the old spte has been updated into desc, so
-                * that the another side can not get the desc from pte_list
-                * but miss the old spte.
-                */
-               smp_wmb();
-
-               *pte_list = (unsigned long)desc | 1;
+               rcu_assign_head_desc(pte_list, (unsigned long)desc | 1);

>>
>> So i add the smp_wmb() by myself:
>> 		/*
>> 		 * Esure the old spte has been updated into desc, so
>> 		 * that the another side can not get the desc from pte_list
>> 		 * but miss the old spte.
>> 		 */
>> 		smp_wmb();
>>
>> 		*pte_list = (unsigned long)desc | 1;
>>
>> But i missed it when inserting a empty desc, in that case, we need the barrier
>> too since we should make desc->more visible before assign it to pte_list to
>> avoid the lookup side seeing the invalid "nulls".
>>
>> I also use own code instead of rcu_dereference():
>> pte_list_walk_lockless():
>> 	pte_list_value = ACCESS_ONCE(*pte_list);
>> 	if (!pte_list_value)
>> 		return;
>>
>> 	if (!(pte_list_value & 1))
>> 		return fn((u64 *)pte_list_value);
>>
>> 	/*
>> 	 * fetch pte_list before read sptes in the desc, see the comments
>> 	 * in pte_list_add().
>> 	 *
>> 	 * There is the data dependence since the desc is got from pte_list.
>> 	 */
>> 	smp_read_barrier_depends();
>>
>> That part can be replaced by rcu_dereference().
>>
> Yes please, also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for
> kind of scary bugs we can get here.

Right, it is likely trigger-able in our case, will fix it.

> 
>>> incorrect usage of RCU. I think any access to slab pointers will need to
>>> use those.
>>
>> Remove desc is not necessary i think since we do not mind to see the old
>> info. (hlist_nulls_del_rcu() does not use rcu_dereference() too)
>>
> May be a bug. I also noticed that rculist_nulls uses rcu_dereference()

But list_del_rcu() does not use rcu_assign_pointer() too.

> to access ->next, but it does not use rcu_assign_pointer() pointer to
> assign it.

You mean rcu_dereference() is used in hlist_nulls_for_each_entry_rcu()? I think
it's because we should validate the prefetched data before entry->next is
accessed, it is paired with the barrier in rcu_assign_pointer() when add a
new entry into the list. rcu_assign_pointer() make other fields in the entry
be visible before linking entry to the list. Otherwise, the lookup can access
that entry but get the invalid fields.

After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
is removed. The remove-API does not care the order between unlink the entry and
the changes to its fields. It is the caller's responsibility:
- in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
  enforce all lookups exit and the later change on that entry is invisible to the
  lookups.

- In the case of rculist_nulls, it seems refcounter is used to guarantee the order
  (see the example from Documentation/RCU/rculist_nulls.txt).

- In our case, we allow the lookup to see the deleted desc even if it is in slab cache
  or its is initialized or it is re-added.

Your thought?

  reply	other threads:[~2013-08-29  6:50 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-30 13:01 [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-07-30 13:01 ` [PATCH 01/12] KVM: MMU: remove unused parameter Xiao Guangrong
2013-08-29  7:22   ` Gleb Natapov
2013-07-30 13:02 ` [PATCH 02/12] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
2013-07-30 13:02 ` [PATCH 03/12] KVM: MMU: lazily drop large spte Xiao Guangrong
2013-08-02 14:55   ` Marcelo Tosatti
2013-08-02 15:42     ` Xiao Guangrong
2013-08-02 20:27       ` Marcelo Tosatti
2013-08-02 22:56         ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable Xiao Guangrong
2013-07-30 13:26   ` Paolo Bonzini
2013-07-31  7:25     ` Xiao Guangrong
2013-08-07  1:48   ` Marcelo Tosatti
2013-08-07  4:06     ` Xiao Guangrong
2013-08-08 15:06       ` Marcelo Tosatti
2013-08-08 16:26         ` Xiao Guangrong
2013-11-20  0:29       ` Marcelo Tosatti
2013-11-20  0:35         ` Marcelo Tosatti
2013-11-20 14:20         ` Xiao Guangrong
2013-11-20 19:47           ` Marcelo Tosatti
2013-11-21  4:26             ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page Xiao Guangrong
2013-07-30 13:27   ` Paolo Bonzini
2013-07-31  7:33     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
2013-08-28  7:23   ` Gleb Natapov
2013-08-28  7:50     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
2013-08-28  8:12   ` Gleb Natapov
2013-08-28  8:37     ` Xiao Guangrong
2013-08-28  8:58       ` Gleb Natapov
2013-08-28  9:19         ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 08/12] KVM: MMU: introduce nulls desc Xiao Guangrong
2013-08-28  8:40   ` Gleb Natapov
2013-08-28  8:54     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker Xiao Guangrong
2013-08-28  9:20   ` Gleb Natapov
2013-08-28  9:33     ` Xiao Guangrong
2013-08-28  9:46       ` Gleb Natapov
2013-08-28 10:13         ` Xiao Guangrong
2013-08-28 10:49           ` Gleb Natapov
2013-08-28 12:15             ` Xiao Guangrong
2013-08-28 13:36               ` Gleb Natapov
2013-08-29  6:50                 ` Xiao Guangrong [this message]
2013-08-29  9:08                   ` Gleb Natapov
2013-08-29  9:31                     ` Xiao Guangrong
2013-08-29  9:51                       ` Gleb Natapov
2013-08-29 11:26                         ` Xiao Guangrong
2013-08-30 11:38                           ` Gleb Natapov
2013-09-02  7:02                             ` Xiao Guangrong
2013-08-29  9:31                   ` Gleb Natapov
2013-08-29 11:33                     ` Xiao Guangrong
2013-08-29 12:02                       ` Xiao Guangrong
2013-08-30 11:44                         ` Gleb Natapov
2013-09-02  8:50                           ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread Xiao Guangrong
2013-08-07 13:09   ` Takuya Yoshikawa
2013-08-07 13:19     ` Xiao Guangrong
2013-08-29  9:10   ` Gleb Natapov
2013-08-29  9:25     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 11/12] KVM: MMU: locklessly write-protect the page Xiao Guangrong
2013-07-30 13:02 ` [PATCH 12/12] KVM: MMU: clean up spte_write_protect Xiao Guangrong
2013-07-30 13:11 ` [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-08-03  5:09 ` Takuya Yoshikawa
2013-08-04 14:15   ` Xiao Guangrong
2013-08-29  7:16   ` Gleb Natapov
2013-08-06 13:16 ` Xiao Guangrong
2013-08-08 17:38   ` Paolo Bonzini
2013-08-09  4:51     ` Xiao Guangrong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=521EEF4B.4040107@linux.vnet.ibm.com \
    --to=xiaoguangrong@linux.vnet.ibm.com \
    --cc=avi.kivity@gmail.com \
    --cc=gleb@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.