From: Avi Kivity <avi@redhat.com>
To: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: mtosatti@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, peterz@infradead.org,
	paulmck@linux.vnet.ibm.com
Subject: Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
Date: Wed, 29 Feb 2012 11:25:14 +0200	[thread overview]
Message-ID: <4F4DEEFA.6000506@redhat.com> (raw)
In-Reply-To: <20120229164934.745d08f5.yoshikawa.takuya@oss.ntt.co.jp>

On 02/29/2012 09:49 AM, Takuya Yoshikawa wrote:
> Avi Kivity <avi@redhat.com> wrote:
>
> > > The key part of the implementation is the use of xchg() operation for
> > > clearing dirty bits atomically.  Since this allows us to update only
> > > BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> > > until every dirty bit is cleared again for the next call.
> > 
> > What about using cmpxchg16b?  That should reduce locked ops by a factor
> > of 2 (but note it needs 16 bytes alignment).
>
> I tried cmpxchg16b first: the implementation could not be naturally
> combined with the for loop over the unsigned long array.
>
> 	Extra "if not zero", alignment check and ... it was ugly
> 	and I guessed it would be slow.
>
> Taking into account that cmpxchg16b needs more cycles than other
> atomic operations, I think this should be tried carefully with more
> measurement later.
>
> How about concentrating on xchg now?

Okay.  But note you don't need the alignment check; simply allocate the
array aligned, and a multiple of 16 bytes, in the first place.
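The xchg-based clearing being discussed could look roughly like the following userspace sketch. Everything here is illustrative: collect_dirty(), the snapshot array, and __atomic_exchange_n() (standing in for the kernel's xchg()) are assumptions, not the actual patch code.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Atomically fetch and clear one word of the dirty bitmap at a time,
 * skipping words that are already zero to avoid needless locked
 * operations.  __atomic_exchange_n() stands in for the kernel's xchg().
 */
static size_t collect_dirty(unsigned long *bitmap, size_t nlongs,
                            unsigned long *snapshot)
{
    size_t i, ndirty = 0;

    for (i = 0; i < nlongs; i++) {
        snapshot[i] = 0;
        if (!bitmap[i])
            continue;
        snapshot[i] = __atomic_exchange_n(&bitmap[i], 0UL,
                                          __ATOMIC_SEQ_CST);
        ndirty += (size_t)__builtin_popcountl(snapshot[i]);
    }
    return ndirty;
}
```

Between the read and the exchange another vCPU may set more bits, but they are either captured by the exchange or left in place for the next call, which is exactly the "iterate until every dirty bit is cleared" behavior described in the patch.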

>
> The implementation is simple and gives us enough improvement for now.
> At least, I want to see whether xchg-based implementation works well
> for one release.
>
> 	GET_DIRTY_LOG can be easily tuned to one particular case and
> 	it is really hard to check whether the implementation works well
> 	for every important case.  I really want feedback from users
> 	before adding non-obvious optimization.
>
> In addition, we should care about the new API.  It has not been decided
> what kind of range can be requested.  I think restricting the range to
> be long-size aligned is natural.  Do you have any plan?

Not really.  But the current changes are all internal and don't affect
the user API.

>
> > > Another point to note is that we do not use for_each_set_bit() to check
> > > which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
> > > simply use __ffs() and __fls() and pass the range in between the two
> > > positions found by them to kvm_mmu_write_protect_pt_range().
> > 
> > This seems artificial.
>
> OK, then I want to pass the bits (unsigned long) as a mask.
>
> Non-NPT machines may gain some.
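For reference, the __ffs()/__fls() range approach quoted above takes just the span between the lowest and highest set bit of a word, at the cost of possibly covering clean pages in between. A userspace sketch, with GCC builtins standing in for the kernel's __ffs()/__fls() (dirty_range() is a made-up name):

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Compute the inclusive [first, last] bit range of one bitmap word,
 * the way __ffs()/__fls() were used to bound the call to
 * kvm_mmu_write_protect_pt_range().  Undefined for mask == 0, just
 * like the kernel helpers.
 */
static void dirty_range(unsigned long mask,
                        unsigned int *first, unsigned int *last)
{
    *first = (unsigned int)__builtin_ctzl(mask);                 /* like __ffs() */
    *last  = (unsigned int)(BITS_PER_LONG - 1 -
                            (unsigned long)__builtin_clzl(mask)); /* like __fls() */
}
```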
>
> > > Even though the passed range may include clean pages, it is much faster
> > > than repeatedly calling find_next_bit() due to the locality of dirty pages.
> > 
> > Perhaps this is due to the implementation of find_next_bit()?  would
> > using bsf improve things?
>
> I need to explain what I did in the past.
>
> Before srcu-less work, I had already noticed the slowness of
> for_each_set_bit() and replaced it with simple for loop like now: the
> improvement was significant.
>
> Yes, find_next_bit() is for generic use and not at all good when there
> are many consecutive bits set: it cannot assume anything, so it needs to
> check a lot of cases - whereas we have a long-aligned bitmap and "bits"
> is already known to be non-zero after the first check of the for loop.
>
> Of course, doing 64 function calls alone should be avoided in our case.
> I also do not want to call kvm_mmu_* for each bit.
>
> So, above, I proposed just passing "bits" to kvm_mmu_*: we can check
> each bit i in a register before using rmap[i] if needed.
>
> __ffs is really fast compared to other APIs.

You could do something like this:

    for (i = 0; i < bitmap_size_in_longs; ++i) {
        mask = bitmap[i];
        while (mask) {
            j = __ffs(mask);    /* __ffs(0) is undefined, so test first */
            handle i * BITS_PER_LONG + j;
            mask &= mask - 1;   /* clear the lowest set bit */
        }
    }

This gives you the speed of __ffs() but without working on zero bits.
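In user space the same trick can be exercised with __builtin_ctzl() standing in for __ffs(). This sketch (for_each_dirty() and the out array are illustrative names) collects the set-bit positions of each bitmap word:

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Iterate set bits word by word, clearing the lowest set bit each
 * round with mask &= mask - 1, as in the loop sketched above.
 * __builtin_ctzl() plays the role of the kernel's __ffs().
 */
static size_t for_each_dirty(const unsigned long *bitmap, size_t nlongs,
                             size_t *out)
{
    size_t i, n = 0;

    for (i = 0; i < nlongs; i++) {
        unsigned long mask = bitmap[i];

        while (mask) {
            size_t j = (size_t)__builtin_ctzl(mask);

            out[n++] = i * BITS_PER_LONG + j;   /* "handle" this page */
            mask &= mask - 1;                   /* clear lowest set bit */
        }
    }
    return n;
}
```

Because the mask is tested before each __ffs() equivalent, zero words cost one load and one branch, and the undefined __ffs(0) case never arises.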

>
> One note is that we will lose in cases like bits = 0xffffffff..
>
> 2271171.4    12064.9       138.6      16K     -45
> 3375866.2    14743.3       103.0      32K     -55
> 4408395.6    10720.0        67.2      64K     -51
> 5915336.2    26538.1        45.1     128K     -44
> 8497356.4    16441.0        32.4     256K     -29
>
> So the last one will become worse.  For other 4 patterns I am not sure.
>
> I thought that we should tune to the last case for gaining a lot from
> the locality of WSS (working set size).  What do you think about this point?
>
> > > -	} else {
> > > -		r = -EFAULT;
> > > -		if (clear_user(log->dirty_bitmap, n))
> > > -			goto out;
> > > +		kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
> > 
> > If indeed the problem is find_next_bit(), then we could have
> > kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
> > a parameter.  This would allow covering just this function with the
> > spinlock, not the xchg loop.
>
> We may see partial display updates if we do not hold the mmu_lock during
> xchg loop: it is possible that pages near the end of the framebuffer alone
> get updated sometimes - I noticed this problem when I fixed the TLB flush
> issue.

I don't understand why this happens.

>
> Not a big problem but still maybe-noticeable change, so I think we should
> do it separately with some comments if needed.

Well, if it's noticeable on the framebuffer it's also noticeable with live
migration.  We could do it later, but we need to really understand it first.

> In addition, we do not want to scan the dirty bitmap twice.  Using the
> bits value soon after it is read into a register seems to be the fastest.

Probably.

> BTW, I also want to decide the design of the new API at this chance.

Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
everything we can out of the current APIs.

-- 
error compiling committee.c: too many arguments to function
