linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: Gleb Natapov <gleb@redhat.com>
Cc: avi.kivity@gmail.com, mtosatti@redhat.com, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list
Date: Wed, 28 Aug 2013 17:19:23 +0800	[thread overview]
Message-ID: <521DC09B.2070000@linux.vnet.ibm.com> (raw)
In-Reply-To: <20130828085841.GP22899@redhat.com>

On 08/28/2013 04:58 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 04:12 PM, Gleb Natapov wrote:
>>
>>>> +
>>>> +	rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
>>>> +	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> +
>>>> +	/* No empty position in the desc. */
>>>> +	if (desc->sptes[PTE_LIST_EXT - 1]) {
>>>> +		struct pte_list_desc *new_desc;
>>>> +		new_desc = mmu_alloc_pte_list_desc(vcpu);
>>>> +		new_desc->more = desc;
>>>> +		desc = new_desc;
>>>> +		*pte_list = (unsigned long)desc | 1;
>>>>  	}
>>>> -	return count;
>>>> +
>>>> +	free_pos = find_first_free(desc);
>>>> +	desc->sptes[free_pos] = spte;
>>>> +	return count_spte_number(desc);
>>> Should it be count_spte_number(desc) - 1? The function should returns
>>> the number of pte entries before the spte was added.
>>
>> Yes. We have handled it count_spte_number(), we count the number like this:
>>
>> 	return first_free + desc_num * PTE_LIST_EXT;
>>
>> The first_free is indexed from 0.
>>
> Suppose when pte_list_add() is called there is one full desc, so the
> number that should be returned is PTE_LIST_EXT, correct? But since
> before calling count_spte_number() one more desc will be added and
> desc->sptes[0] will be set in it the first_free in count_spte_number
> will be 1 and PTE_LIST_EXT + 1 will be returned.

Oh, yes, you are right. Will fix it in the next version, thanks for you
pointing it out.

> 
>> Maybe it is clearer to let count_spte_number() return the real number.
>>
>>>
>>>>  }
>>>>  
>>>>  static void
>>>> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
>>>> -			   int i, struct pte_list_desc *prev_desc)
>>>> +pte_list_desc_remove_entry(unsigned long *pte_list,
>>>> +			   struct pte_list_desc *desc, int i)
>>>>  {
>>>> -	int j;
>>>> +	struct pte_list_desc *first_desc;
>>>> +	int last_used;
>>>> +
>>>> +	first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> +	last_used = find_last_used(first_desc);
>>>>  
>>>> -	for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
>>>> -		;
>>>> -	desc->sptes[i] = desc->sptes[j];
>>>> -	desc->sptes[j] = NULL;
>>>> -	if (j != 0)
>>>> +	/*
>>>> +	 * Move the entry from the first desc to this position we want
>>>> +	 * to remove.
>>>> +	 */
>>>> +	desc->sptes[i] = first_desc->sptes[last_used];
>>>> +	first_desc->sptes[last_used] = NULL;
>>>> +
>>> What if desc == first_desc and i < last_used. You still move spte
>>> backwards so lockless walk may have already examined entry at i and
>>> will miss spte that was moved there from last_used position, no?
>>
>> Right. I noticed it too and fixed in the v2 which is being tested.
>> I fixed it by bottom-up walk desc, like this:
>>
>> pte_list_walk_lockless():
>>
>> 	desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>> 	while (!desc_is_a_nulls(desc)) {
>> 		/*
>> 		 * We should do bottom-up walk since we always use the
>> 		 * bottom entry to replace the deleted entry if only
>> 		 * one desc is used in the rmap when a spte is removed.
>> 		 * Otherwise the moved entry will be missed.
>> 		 */
> I would call it top-down walk since we are walking from big indices to
> smaller once.

Okay, will fix the comments.

> 
>> 		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
>> 			fn(desc->sptes[i]);
>>
>> 		desc = ACCESS_ONCE(desc->more);
>>
>> 		/* It is being initialized. */
>> 		if (unlikely(!desc))
>> 			goto restart;
>> 	}
>>
>> How about this?
>>
> Tricky, very very tricky :)
> 
>>>
>>>> +	/* No valid entry in this desc, we can free this desc now. */
>>>> +	if (!first_desc->sptes[0]) {
>>>> +		struct pte_list_desc *next_desc = first_desc->more;
>>>> +
>>>> +		/*
>>>> +		 * Only one entry existing but still use a desc to store it?
>>>> +		 */
>>>> +		WARN_ON(!next_desc);
>>>> +
>>>> +		mmu_free_pte_list_desc(first_desc);
>>>> +		first_desc = next_desc;
>>>> +		*pte_list = (unsigned long)first_desc | 1ul;
>>>>  		return;
>>>> -	if (!prev_desc && !desc->more)
>>>> -		*pte_list = (unsigned long)desc->sptes[0];
>>>> -	else
>>>> -		if (prev_desc)
>>>> -			prev_desc->more = desc->more;
>>>> -		else
>>>> -			*pte_list = (unsigned long)desc->more | 1;
>>>> -	mmu_free_pte_list_desc(desc);
>>>> +	}
>>>> +
>>>> +	WARN_ON(!first_desc->sptes[0]);
>>>> +
>>>> +	/*
>>>> +	 * Only one entry in this desc, move the entry to the head
>>>> +	 * then the desc can be freed.
>>>> +	 */
>>>> +	if (!first_desc->sptes[1] && !first_desc->more) {
>>>> +		*pte_list = (unsigned long)first_desc->sptes[0];
>>>> +		mmu_free_pte_list_desc(first_desc);
>>>> +	}
>>>>  }
>>>>  
>>>>  static void pte_list_remove(u64 *spte, unsigned long *pte_list)
>>>>  {
>>>>  	struct pte_list_desc *desc;
>>>> -	struct pte_list_desc *prev_desc;
>>>>  	int i;
>>>>  
>>>>  	if (!*pte_list) {
>>>> -		printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>>>> -		BUG();
>>>> -	} else if (!(*pte_list & 1)) {
>>>> +		WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>>> Why change BUG() to WARN() here and below?
>>
>> WARN(1, "xxx") can replace two lines in the origin code. And personally,
>> i prefer WARN() to BUG() since sometimes BUG() can stop my box and i need to
>> get the full log by using kdump.
>>
>> If you object it, i will change it back in the next version. :)
>>
> For debugging WARN() is doubtlessly better, but outside of development
> you do not want to allow kernel to run after serious MMU corruption is
> detected. It may be exploitable further, we do not know, so the safe
> choice is to stop the kernel.

Okay, will keep BUG() in the next version.




  reply	other threads:[~2013-08-28  9:19 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-30 13:01 [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-07-30 13:01 ` [PATCH 01/12] KVM: MMU: remove unused parameter Xiao Guangrong
2013-08-29  7:22   ` Gleb Natapov
2013-07-30 13:02 ` [PATCH 02/12] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
2013-07-30 13:02 ` [PATCH 03/12] KVM: MMU: lazily drop large spte Xiao Guangrong
2013-08-02 14:55   ` Marcelo Tosatti
2013-08-02 15:42     ` Xiao Guangrong
2013-08-02 20:27       ` Marcelo Tosatti
2013-08-02 22:56         ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable Xiao Guangrong
2013-07-30 13:26   ` Paolo Bonzini
2013-07-31  7:25     ` Xiao Guangrong
2013-08-07  1:48   ` Marcelo Tosatti
2013-08-07  4:06     ` Xiao Guangrong
2013-08-08 15:06       ` Marcelo Tosatti
2013-08-08 16:26         ` Xiao Guangrong
2013-11-20  0:29       ` Marcelo Tosatti
2013-11-20  0:35         ` Marcelo Tosatti
2013-11-20 14:20         ` Xiao Guangrong
2013-11-20 19:47           ` Marcelo Tosatti
2013-11-21  4:26             ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page Xiao Guangrong
2013-07-30 13:27   ` Paolo Bonzini
2013-07-31  7:33     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
2013-08-28  7:23   ` Gleb Natapov
2013-08-28  7:50     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
2013-08-28  8:12   ` Gleb Natapov
2013-08-28  8:37     ` Xiao Guangrong
2013-08-28  8:58       ` Gleb Natapov
2013-08-28  9:19         ` Xiao Guangrong [this message]
2013-07-30 13:02 ` [PATCH 08/12] KVM: MMU: introduce nulls desc Xiao Guangrong
2013-08-28  8:40   ` Gleb Natapov
2013-08-28  8:54     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker Xiao Guangrong
2013-08-28  9:20   ` Gleb Natapov
2013-08-28  9:33     ` Xiao Guangrong
2013-08-28  9:46       ` Gleb Natapov
2013-08-28 10:13         ` Xiao Guangrong
2013-08-28 10:49           ` Gleb Natapov
2013-08-28 12:15             ` Xiao Guangrong
2013-08-28 13:36               ` Gleb Natapov
2013-08-29  6:50                 ` Xiao Guangrong
2013-08-29  9:08                   ` Gleb Natapov
2013-08-29  9:31                     ` Xiao Guangrong
2013-08-29  9:51                       ` Gleb Natapov
2013-08-29 11:26                         ` Xiao Guangrong
2013-08-30 11:38                           ` Gleb Natapov
2013-09-02  7:02                             ` Xiao Guangrong
2013-08-29  9:31                   ` Gleb Natapov
2013-08-29 11:33                     ` Xiao Guangrong
2013-08-29 12:02                       ` Xiao Guangrong
2013-08-30 11:44                         ` Gleb Natapov
2013-09-02  8:50                           ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread Xiao Guangrong
2013-08-07 13:09   ` Takuya Yoshikawa
2013-08-07 13:19     ` Xiao Guangrong
2013-08-29  9:10   ` Gleb Natapov
2013-08-29  9:25     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 11/12] KVM: MMU: locklessly write-protect the page Xiao Guangrong
2013-07-30 13:02 ` [PATCH 12/12] KVM: MMU: clean up spte_write_protect Xiao Guangrong
2013-07-30 13:11 ` [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-08-03  5:09 ` Takuya Yoshikawa
2013-08-04 14:15   ` Xiao Guangrong
2013-08-29  7:16   ` Gleb Natapov
2013-08-06 13:16 ` Xiao Guangrong
2013-08-08 17:38   ` Paolo Bonzini
2013-08-09  4:51     ` Xiao Guangrong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=521DC09B.2070000@linux.vnet.ibm.com \
    --to=xiaoguangrong@linux.vnet.ibm.com \
    --cc=avi.kivity@gmail.com \
    --cc=gleb@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).