* [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 13:44 UTC
To: qemu-devel@nongnu.org; +Cc: Orit Wasserman, Corentin Chary, Paolo Bonzini
Hi,
For a while now I have had a few VMs that are very hard to migrate because of heavy memory I/O. I found that finding the next dirty bit
seemed to be one of the culprits (apart from the locking, which Paolo is working on removing).
I have the following proposal, which seems to help a lot in my case. I just wanted to get some feedback.
I applied the same unrolling idea as in buffer_is_zero().
Peter
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);
unsigned long tmp;
+ unsigned long d0,d1,d2,d3;
if (offset >= size) {
return size;
}
size -= result;
- offset %= BITS_PER_LONG;
+ offset &= (BITS_PER_LONG-1);
if (offset) {
tmp = *(p++);
tmp &= (~0UL << offset);
@@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
result += BITS_PER_LONG;
}
while (size & ~(BITS_PER_LONG-1)) {
+ while (!(size & (4*BITS_PER_LONG-1))) {
+ d0 = *p;
+ d1 = *(p+1);
+ d2 = *(p+2);
+ d3 = *(p+3);
+ if (d0 || d1 || d2 || d3) {
+ break;
+ }
+ p+=4;
+ result += 4*BITS_PER_LONG;
+ size -= 4*BITS_PER_LONG;
+ }
if ((tmp = *(p++))) {
goto found_middle;
}
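For context, a rough sketch of how a caller might walk a dirty bitmap with this function (illustrative only: the bitmap contents are made up, and it assumes linking against util/bitops.c):

    #include <stdio.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* prototype as in util/bitops.c; link against the patched file */
    unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
                                unsigned long offset);

    int main(void)
    {
        /* hypothetical 256-bit dirty bitmap with two bits set */
        unsigned long bitmap[256 / BITS_PER_LONG] = { 0 };
        unsigned long nbits = 256, bit;

        bitmap[1] |= 1UL << 5;  /* bit 69 on a 64-bit host */
        bitmap[3] |= 1UL;       /* bit 192 on a 64-bit host */

        for (bit = find_next_bit(bitmap, nbits, 0);
             bit < nbits;
             bit = find_next_bit(bitmap, nbits, bit + 1)) {
            printf("dirty bit at %lu\n", bit);
        }
        return 0;
    }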
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Maydell @ 2013-03-11 14:04 UTC
To: Peter Lieven
Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
Paolo Bonzini
On 11 March 2013 13:44, Peter Lieven <pl@dlhnet.de> wrote:
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
> unsigned long size,
> const unsigned long *p = addr + BITOP_WORD(offset);
> unsigned long result = offset & ~(BITS_PER_LONG-1);
> unsigned long tmp;
> + unsigned long d0,d1,d2,d3;
>
> if (offset >= size) {
> return size;
> }
> size -= result;
> - offset %= BITS_PER_LONG;
> + offset &= (BITS_PER_LONG-1);
This change at least is unnecessary -- I just checked, and gcc
is already smart enough to turn the % operation into a logical
and. The generated object files are identical.
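As a quick way to check this kind of claim, one can compile the two forms separately and diff the assembly; a minimal sketch, assuming gcc with -O2 (for an unsigned operand and a power-of-two divisor the expressions are equivalent, so the same single 'and' instruction comes out either way):

    /* mod_vs_and.c -- build once with -DUSE_MOD and once without,
     * e.g. gcc -O2 -S mod_vs_and.c, then compare the output */
    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    unsigned long wrap(unsigned long offset)
    {
    #ifdef USE_MOD
        return offset % BITS_PER_LONG;          /* original form */
    #else
        return offset & (BITS_PER_LONG - 1);    /* patched form */
    #endif
    }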
-- PMM
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Paolo Bonzini @ 2013-03-11 14:14 UTC
To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
On 11/03/2013 14:44, Peter Lieven wrote:
> Hi,
>
> For a while now I have had a few VMs that are very hard to migrate
> because of heavy memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from the locking, which Paolo
> is working on removing).
>
> I have the following proposal, which seems to help a lot in my case.
> I just wanted to get some feedback.
> I applied the same unrolling idea as in buffer_is_zero().
>
> Peter
>
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
> const unsigned long *p = addr + BITOP_WORD(offset);
> unsigned long result = offset & ~(BITS_PER_LONG-1);
> unsigned long tmp;
> + unsigned long d0,d1,d2,d3;
>
> if (offset >= size) {
> return size;
> }
> size -= result;
> - offset %= BITS_PER_LONG;
> + offset &= (BITS_PER_LONG-1);
> if (offset) {
> tmp = *(p++);
> tmp &= (~0UL << offset);
> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
> result += BITS_PER_LONG;
> }
> while (size & ~(BITS_PER_LONG-1)) {
> + while (!(size & (4*BITS_PER_LONG-1))) {
This really means
if (!(size & (4*BITS_PER_LONG-1))) {
while (1) {
...
}
}
because the subtraction will not change the result of the "while" loop
condition.
What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
in turn means "while (size >= 4*BITS_PER_LONG)".
Please change both while loops to use a ">=" condition, it's easier to read.
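To see the failure mode in isolation, here is a standalone sketch with hypothetical values (not code from the patch):

    #include <stdio.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    int main(void)
    {
        /* a size that is a multiple of 4*BITS_PER_LONG, all words zero */
        unsigned long size = 8 * BITS_PER_LONG;

        /* the posted condition: true iff size % (4*BITS_PER_LONG) == 0 */
        while (!(size & (4 * BITS_PER_LONG - 1))) {
            printf("size = %lu, condition still true\n", size);
            if (size == 0) {
                /* without this guard the loop would read past the array */
                break;
            }
            size -= 4 * BITS_PER_LONG;   /* does not change the modulus */
        }
        return 0;
    }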
Paolo
> + d0 = *p;
> + d1 = *(p+1);
> + d2 = *(p+2);
> + d3 = *(p+3);
> + if (d0 || d1 || d2 || d3) {
> + break;
> + }
> + p+=4;
> + result += 4*BITS_PER_LONG;
> + size -= 4*BITS_PER_LONG;
> + }
> if ((tmp = *(p++))) {
> goto found_middle;
> }
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 14:22 UTC
To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 14:44, Peter Lieven wrote:
>> Hi,
>>
>> For a while now I have had a few VMs that are very hard to migrate
>> because of heavy memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from the locking, which Paolo
>> is working on removing).
>>
>> I have the following proposal, which seems to help a lot in my case.
>> I just wanted to get some feedback.
>> I applied the same unrolling idea as in buffer_is_zero().
>>
>> Peter
>>
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>> const unsigned long *p = addr + BITOP_WORD(offset);
>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>> unsigned long tmp;
>> + unsigned long d0,d1,d2,d3;
>>
>> if (offset >= size) {
>> return size;
>> }
>> size -= result;
>> - offset %= BITS_PER_LONG;
>> + offset &= (BITS_PER_LONG-1);
>> if (offset) {
>> tmp = *(p++);
>> tmp &= (~0UL << offset);
>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>> result += BITS_PER_LONG;
>> }
>> while (size & ~(BITS_PER_LONG-1)) {
>> + while (!(size & (4*BITS_PER_LONG-1))) {
>
> This really means
>
> if (!(size & (4*BITS_PER_LONG-1))) {
> while (1) {
> ...
> }
> }
>
> because the subtraction will not change the result of the "while" loop
> condition.
Are you sure? The above is working nicely for me (wondering why ;-))
I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
propose. If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible
by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.
>
> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
> in turn means "while (size >= 4*BITS_PER_LONG)".
>
> Please change both while loops to use a ">=" condition, it's easier to read.
Good idea, it's easier to understand.
Does anyone have evidence on whether unrolling 4 longs is optimal on today's processors?
I just chose 4 longs because that is what buffer_is_zero() uses.
Peter
>
> Paolo
>
>> + d0 = *p;
>> + d1 = *(p+1);
>> + d2 = *(p+2);
>> + d3 = *(p+3);
>> + if (d0 || d1 || d2 || d3) {
>> + break;
>> + }
>> + p+=4;
>> + result += 4*BITS_PER_LONG;
>> + size -= 4*BITS_PER_LONG;
>> + }
>> if ((tmp = *(p++))) {
>> goto found_middle;
>> }
>
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 14:29 UTC
To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
On 11.03.2013 at 15:22, Peter Lieven <pl@dlhnet.de> wrote:
>
> On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>>
>>> For a while now I have had a few VMs that are very hard to migrate
>>> because of heavy memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from the locking, which Paolo
>>> is working on removing).
>>>
>>> I have the following proposal, which seems to help a lot in my case.
>>> I just wanted to get some feedback.
>>> I applied the same unrolling idea as in buffer_is_zero().
>>>
>>> Peter
>>>
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>> const unsigned long *p = addr + BITOP_WORD(offset);
>>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>>> unsigned long tmp;
>>> + unsigned long d0,d1,d2,d3;
>>>
>>> if (offset >= size) {
>>> return size;
>>> }
>>> size -= result;
>>> - offset %= BITS_PER_LONG;
>>> + offset &= (BITS_PER_LONG-1);
>>> if (offset) {
>>> tmp = *(p++);
>>> tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>> result += BITS_PER_LONG;
>>> }
>>> while (size & ~(BITS_PER_LONG-1)) {
>>> + while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>> if (!(size & (4*BITS_PER_LONG-1))) {
>> while (1) {
>> ...
>> }
>> }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
>
> Are you sure? The above is working nicely for me (wondering why ;-))
> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible
> by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.
>
>>
>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>>
>> Please change both while loops to use a ">=" condition, it's easier to read.
Thinking again, in case a bit is found, this might lead to unnecessary iterations
in the while loop if the bit is in d1, d2 or d3.
>
> Good idea, it's easier to understand.
>
> Does anyone have evidence on whether unrolling 4 longs is optimal on today's processors?
> I just chose 4 longs because that is what buffer_is_zero() uses.
>
> Peter
>
>>
>> Paolo
>>
>>> + d0 = *p;
>>> + d1 = *(p+1);
>>> + d2 = *(p+2);
>>> + d3 = *(p+3);
>>> + if (d0 || d1 || d2 || d3) {
>>> + break;
>>> + }
>>> + p+=4;
>>> + result += 4*BITS_PER_LONG;
>>> + size -= 4*BITS_PER_LONG;
>>> + }
>>> if ((tmp = *(p++))) {
>>> goto found_middle;
>>> }
>>
>
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Paolo Bonzini @ 2013-03-11 14:35 UTC
To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
On 11/03/2013 15:22, Peter Lieven wrote:
>
> On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>>
>>> For a while now I have had a few VMs that are very hard to migrate
>>> because of heavy memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from the locking, which Paolo
>>> is working on removing).
>>>
>>> I have the following proposal, which seems to help a lot in my case.
>>> I just wanted to get some feedback.
>>> I applied the same unrolling idea as in buffer_is_zero().
>>>
>>> Peter
>>>
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>> const unsigned long *p = addr + BITOP_WORD(offset);
>>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>>> unsigned long tmp;
>>> + unsigned long d0,d1,d2,d3;
>>>
>>> if (offset >= size) {
>>> return size;
>>> }
>>> size -= result;
>>> - offset %= BITS_PER_LONG;
>>> + offset &= (BITS_PER_LONG-1);
>>> if (offset) {
>>> tmp = *(p++);
>>> tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>> result += BITS_PER_LONG;
>>> }
>>> while (size & ~(BITS_PER_LONG-1)) {
>>> + while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>> if (!(size & (4*BITS_PER_LONG-1))) {
>> while (1) {
>> ...
>> }
>> }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
>
> Are you sure? The above is working nicely for me (wondering why ;-))
while (!(size & (4*BITS_PER_LONG-1))) =>
while (!(size % (4*BITS_PER_LONG))) =>
while ((size % (4*BITS_PER_LONG)) == 0)
Subtracting 4*BITS_PER_LONG doesn't change the modulus. (For example, with
BITS_PER_LONG == 64 and size == 512: 512, 256, and 0 are all multiples of
256, so the condition never becomes false.)
> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible
> by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.
In fact I'm not really sure why it works for you. :)
>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>>
>> Please change both while loops to use a ">=" condition, it's easier to read.
>
> Good idea, it's easier to understand.
>
>>> Please change both while loops to use a ">=" condition, it's easier to read.
>
> Thinking again, in case a bit is found, this might lead to unnecessary iterations
> in the while loop if the bit is in d1, d2 or d3.
How would that be different in your patch? But you can solve it by
making two >= loops, one checking for 4*BITS_PER_LONG and one checking
BITS_PER_LONG.
Paolo
>
> Peter
>
>>
>> Paolo
>>
>>> + d0 = *p;
>>> + d1 = *(p+1);
>>> + d2 = *(p+2);
>>> + d3 = *(p+3);
>>> + if (d0 || d1 || d2 || d3) {
>>> + break;
>>> + }
>>> + p+=4;
>>> + result += 4*BITS_PER_LONG;
>>> + size -= 4*BITS_PER_LONG;
>>> + }
>>> if ((tmp = *(p++))) {
>>> goto found_middle;
>>> }
>>
>
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 15:24 UTC
To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
> How would that be different in your patch? But you can solve it by
> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
> BITS_PER_LONG.
This is what I have now:
diff --git a/util/bitops.c b/util/bitops.c
index e72237a..b0dc93f 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);
unsigned long tmp;
+ unsigned long d0,d1,d2,d3;
if (offset >= size) {
return size;
}
size -= result;
- offset %= BITS_PER_LONG;
+ offset &= (BITS_PER_LONG-1);
if (offset) {
tmp = *(p++);
tmp &= (~0UL << offset);
@@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
size -= BITS_PER_LONG;
result += BITS_PER_LONG;
}
- while (size & ~(BITS_PER_LONG-1)) {
+ while (size >= 4*BITS_PER_LONG) {
+ d0 = *p;
+ d1 = *(p+1);
+ d2 = *(p+2);
+ d3 = *(p+3);
+ if (d0 || d1 || d2 || d3) {
+ break;
+ }
+ p+=4;
+ result += 4*BITS_PER_LONG;
+ size -= 4*BITS_PER_LONG;
+ }
+ while (size >= BITS_PER_LONG) {
if ((tmp = *(p++))) {
goto found_middle;
}
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Maydell @ 2013-03-11 15:25 UTC
To: Peter Lieven
Cc: Paolo Bonzini, qemu-devel@nongnu.org, Corentin Chary,
Orit Wasserman
On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> - offset %= BITS_PER_LONG;
> + offset &= (BITS_PER_LONG-1);
Still pointless.
-- PMM
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Paolo Bonzini @ 2013-03-11 15:29 UTC
To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
On 11/03/2013 16:24, Peter Lieven wrote:
>
>> How would that be different in your patch? But you can solve it by
>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>> BITS_PER_LONG.
>
> This is what I have now:
>
> diff --git a/util/bitops.c b/util/bitops.c
> index e72237a..b0dc93f 100644
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
> const unsigned long *p = addr + BITOP_WORD(offset);
> unsigned long result = offset & ~(BITS_PER_LONG-1);
> unsigned long tmp;
> + unsigned long d0,d1,d2,d3;
>
> if (offset >= size) {
> return size;
> }
> size -= result;
> - offset %= BITS_PER_LONG;
> + offset &= (BITS_PER_LONG-1);
> if (offset) {
> tmp = *(p++);
> tmp &= (~0UL << offset);
> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
> size -= BITS_PER_LONG;
> result += BITS_PER_LONG;
> }
> - while (size & ~(BITS_PER_LONG-1)) {
> + while (size >= 4*BITS_PER_LONG) {
> + d0 = *p;
> + d1 = *(p+1);
> + d2 = *(p+2);
> + d3 = *(p+3);
> + if (d0 || d1 || d2 || d3) {
> + break;
> + }
> + p+=4;
> + result += 4*BITS_PER_LONG;
> + size -= 4*BITS_PER_LONG;
> + }
> + while (size >= BITS_PER_LONG) {
> if ((tmp = *(p++))) {
> goto found_middle;
> }
>
Minus the %= vs. &=,
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Perhaps:
tmp = *p;
d1 = *(p+1);
d2 = *(p+2);
d3 = *(p+3);
if (tmp) {
goto found_middle;
}
if (d1 || d2 || d3) {
break;
}
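Folded into the unrolled loop, the suggestion would read roughly as follows (a sketch; this is the shape the final patch in this thread ends up with):

    while (size >= 4*BITS_PER_LONG) {
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p+1);
        d2 = *(p+2);
        d3 = *(p+3);
        if (tmp) {
            goto found_middle;   /* bit is in the first word */
        }
        if (d1 || d2 || d3) {
            break;               /* bit is in d1..d3; the per-word
                                    loop below will find it */
        }
        p += 4;
        result += 4*BITS_PER_LONG;
        size -= 4*BITS_PER_LONG;
    }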
Paolo
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 15:37 UTC
To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:24, Peter Lieven wrote:
>>
>>> How would that be different in your patch? But you can solve it by
>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>> BITS_PER_LONG.
>>
>> This is what I have now:
>>
>> diff --git a/util/bitops.c b/util/bitops.c
>> index e72237a..b0dc93f 100644
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>> const unsigned long *p = addr + BITOP_WORD(offset);
>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>> unsigned long tmp;
>> + unsigned long d0,d1,d2,d3;
>>
>> if (offset >= size) {
>> return size;
>> }
>> size -= result;
>> - offset %= BITS_PER_LONG;
>> + offset &= (BITS_PER_LONG-1);
>> if (offset) {
>> tmp = *(p++);
>> tmp &= (~0UL << offset);
>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>> size -= BITS_PER_LONG;
>> result += BITS_PER_LONG;
>> }
>> - while (size & ~(BITS_PER_LONG-1)) {
>> + while (size >= 4*BITS_PER_LONG) {
>> + d0 = *p;
>> + d1 = *(p+1);
>> + d2 = *(p+2);
>> + d3 = *(p+3);
>> + if (d0 || d1 || d2 || d3) {
>> + break;
>> + }
>> + p+=4;
>> + result += 4*BITS_PER_LONG;
>> + size -= 4*BITS_PER_LONG;
>> + }
>> + while (size >= BITS_PER_LONG) {
>> if ((tmp = *(p++))) {
>> goto found_middle;
>> }
>>
>
> Minus the %= vs. &=,
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Perhaps:
>
> tmp = *p;
> d1 = *(p+1);
> d2 = *(p+2);
> d3 = *(p+3);
> if (tmp) {
> goto found_middle;
> }
> if (d1 || d2 || d3) {
> break;
> }
I do not know what gcc internally makes of the d0 || d1 || d2 || d3.
I would guess it's something like one addition with carry and one test?
Your proposed change would introduce two tests (maybe)?
What about this, to be sure?
tmp = *p;
d1 = *(p+1);
d2 = *(p+2);
d3 = *(p+3);
if (tmp || d1 || d2 || d3) {
if (tmp) {
goto found_middle;
}
break;
}
Peter
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Maydell @ 2013-03-11 15:37 UTC
To: Peter Lieven
Cc: Paolo Bonzini, qemu-devel@nongnu.org, Corentin Chary,
Orit Wasserman
On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> + unsigned long d0,d1,d2,d3;
These commas should have spaces after them. Also, since
the variables are only used inside the scope of your
newly added while loop:
> - while (size & ~(BITS_PER_LONG-1)) {
> + while (size >= 4*BITS_PER_LONG) {
it would be better to declare them here.
> + d0 = *p;
> + d1 = *(p+1);
> + d2 = *(p+2);
> + d3 = *(p+3);
> + if (d0 || d1 || d2 || d3) {
> + break;
> + }
> + p+=4;
> + result += 4*BITS_PER_LONG;
> + size -= 4*BITS_PER_LONG;
> + }
> + while (size >= BITS_PER_LONG) {
> if ((tmp = *(p++))) {
> goto found_middle;
> }
thanks
-- PMM
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 15:41 UTC
To: Peter Maydell
Cc: Paolo Bonzini, qemu-devel@nongnu.org, Corentin Chary,
Orit Wasserman
On 11.03.2013 at 16:37, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
>> + unsigned long d0,d1,d2,d3;
>
> These commas should have spaces after them. Also, since
> the variables are only used inside the scope of your
> newly added while loop:
>
>> - while (size & ~(BITS_PER_LONG-1)) {
>> + while (size >= 4*BITS_PER_LONG) {
>
> it would be better to declare them here.
Can you verify that this does not make a difference in the generated object code?
In buffer_is_zero() it's outside the loop.
thanks
peter
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Paolo Bonzini @ 2013-03-11 15:42 UTC
To: Peter Lieven
Cc: Peter Maydell, qemu-devel@nongnu.org, Corentin Chary,
Orit Wasserman
On 11/03/2013 16:41, Peter Lieven wrote:
> Can you verify that this does not make a difference in the generated object code?
> In buffer_is_zero() it's outside the loop.
No, it doesn't.
Paolo
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 15:48 UTC
To: Paolo Bonzini
Cc: Peter Maydell, qemu-devel@nongnu.org, Corentin Chary,
Orit Wasserman
On 11.03.2013 at 16:42, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:41, Peter Lieven wrote:
>> Can you verify that this does not make a difference in the generated object code?
>> In buffer_is_zero() it's outside the loop.
>
> No, it doesn't.
OK, I will send the final patch tomorrow.
One last thought: would it make sense to update only `size` in the while loops
and compute the `result` at the end as `orgsize` - `size`?
Peter
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Paolo Bonzini @ 2013-03-11 15:58 UTC
To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary
On 11/03/2013 16:37, Peter Lieven wrote:
>
> On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>
>>>> How would that be different in your patch? But you can solve it by
>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>> BITS_PER_LONG.
>>>
>>> This is what I have now:
>>>
>>> diff --git a/util/bitops.c b/util/bitops.c
>>> index e72237a..b0dc93f 100644
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>> const unsigned long *p = addr + BITOP_WORD(offset);
>>> unsigned long result = offset & ~(BITS_PER_LONG-1);
>>> unsigned long tmp;
>>> + unsigned long d0,d1,d2,d3;
>>>
>>> if (offset >= size) {
>>> return size;
>>> }
>>> size -= result;
>>> - offset %= BITS_PER_LONG;
>>> + offset &= (BITS_PER_LONG-1);
>>> if (offset) {
>>> tmp = *(p++);
>>> tmp &= (~0UL << offset);
>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>> size -= BITS_PER_LONG;
>>> result += BITS_PER_LONG;
>>> }
>>> - while (size & ~(BITS_PER_LONG-1)) {
>>> + while (size >= 4*BITS_PER_LONG) {
>>> + d0 = *p;
>>> + d1 = *(p+1);
>>> + d2 = *(p+2);
>>> + d3 = *(p+3);
>>> + if (d0 || d1 || d2 || d3) {
>>> + break;
>>> + }
>>> + p+=4;
>>> + result += 4*BITS_PER_LONG;
>>> + size -= 4*BITS_PER_LONG;
>>> + }
>>> + while (size >= BITS_PER_LONG) {
>>> if ((tmp = *(p++))) {
>>> goto found_middle;
>>> }
>>>
>>
>> Minus the %= vs. &=,
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>> Perhaps:
>>
>> tmp = *p;
>> d1 = *(p+1);
>> d2 = *(p+2);
>> d3 = *(p+3);
>> if (tmp) {
>> goto found_middle;
>> }
>> if (d1 || d2 || d3) {
>> break;
>> }
>
> I do not know what gcc internally makes of the d0 || d1 || d2 || d3.
It depends on the target and how expensive branches are.
> I would guess it's something like one addition with carry and one test?
It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
followed by a compare-and-jump.
That is, either:
test %r8, %r8
jnz second_loop
test %r9, %r9
jnz second_loop
test %r10, %r10
jnz second_loop
test %r11, %r11
jnz second_loop
or
or %r9, %r8
or %r11, %r10
or %r8, %r10
jnz second_loop
Don't let the length of the code fool you. The processor knows how to
optimize all of these, and GCC knows too.
> Your proposed change would introduce two tests (maybe)?
Yes, but I expect them to be fairly well predicted.
> What about this, to be sure?
>
> tmp = *p;
> d1 = *(p+1);
> d2 = *(p+2);
> d3 = *(p+3);
> if (tmp || d1 || d2 || d3) {
> if (tmp) {
> goto found_middle;
I suspect that GCC would rewrite it into my version (definitely if it
produces 4 compare-and-jumps, but possibly even if it goes
for bitwise ORs; I haven't checked).
Regarding your other question ("would it make sense to update only
`size` in the while loops and compute the `result` at the end as
`orgsize` - `size`?"): again, the compiler knows better and might even
do this for you. It will likely drop the p increments and use p[result],
so if you make that change you may even get the same code, only this time
p is incremented and you get an extra subtraction at the end. :)
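For reference, the variant under discussion might look like this; the helper and the name `orgsize` are hypothetical, not from the patch:

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* hypothetical sketch, not the posted patch: track only 'size' and
     * recover the bit offset with a single subtraction at the end */
    static unsigned long scan_zero_words(const unsigned long *p,
                                         unsigned long size /* in bits */)
    {
        unsigned long orgsize = size;
        while (size >= BITS_PER_LONG) {
            if (*p++) {
                break;                 /* nonzero word found */
            }
            size -= BITS_PER_LONG;     /* only size is updated in the loop */
        }
        return orgsize - size;         /* what 'result' would have been */
    }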
Bottom line: don't try to outsmart an optimizing C compiler on
micro-optimization, unless you have benchmarked it and it shows there is
a problem.
Paolo
> }
> break;
> }
>
> Peter
>
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: ronnie sahlberg @ 2013-03-11 17:06 UTC
To: Paolo Bonzini
Cc: Orit Wasserman, Peter Lieven, qemu-devel@nongnu.org,
Corentin Chary
Even more efficient might be to do a bitwise instead of a logical OR:
> if (tmp | d1 | d2 | d3) {
That should remove 3 of the 4 conditional jumps
and become 3 bitwise ORs and one conditional jump.
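In a nutshell, the difference between the two forms (an illustrative sketch; both predicates return the same value):

    /* logical OR short-circuits, so the compiler may emit a separate
     * test-and-branch per operand */
    static int any_set_logical(unsigned long d0, unsigned long d1,
                               unsigned long d2, unsigned long d3)
    {
        return d0 || d1 || d2 || d3;
    }

    /* bitwise OR evaluates all operands unconditionally: three ORs
     * and a single test-and-branch */
    static int any_set_bitwise(unsigned long d0, unsigned long d1,
                               unsigned long d2, unsigned long d3)
    {
        return (d0 | d1 | d2 | d3) != 0;
    }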
On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> [...]
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Paolo Bonzini @ 2013-03-11 17:07 UTC
To: ronnie sahlberg
Cc: Orit Wasserman, Peter Lieven, qemu-devel@nongnu.org,
Corentin Chary
On 11/03/2013 18:06, ronnie sahlberg wrote:
> Even more efficient might be to do a bitwise instead of a logical OR:
>
>> if (tmp | d1 | d2 | d3) {
> That should remove 3 of the 4 conditional jumps
> and become 3 bitwise ORs and one conditional jump.
Without any serious profiling, please let the compiler do that.
Paolo
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-11 18:20 UTC
To: Paolo Bonzini
Cc: Orit Wasserman, qemu-devel@nongnu.org, ronnie sahlberg,
Corentin Chary
On 11.03.2013 18:07, Paolo Bonzini wrote:
> On 11/03/2013 18:06, ronnie sahlberg wrote:
>> Even more efficient might be to do a bitwise instead of a logical OR:
>>
>>> if (tmp | d1 | d2 | d3) {
>> That should remove 3 of the 4 conditional jumps
>> and become 3 bitwise ORs and one conditional jump.
>
> Without any serious profiling, please let the compiler do that.
Paolo is right. I ran some tests with gcc 4.6.3 on x86_64 (with -O3) and tried the
various ideas. None of them made a significant difference. Even unrolling to 8 unsigned
longs didn't change anything.
What I tried is running 2^20 iterations of find_next_bit(bitfield, 4194304, 0);
I chose the bitfield to be 4 MByte, which corresponds to a 16 GB VM. The bitfield was
completely zeroed, so find_next_bit had to run through the whole bitfield.
The original version took 1 minute and 10 seconds, whereas all the others took
approx. 37-38 seconds, which is almost a 100% boost ;-)
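A harness along these lines could look as follows (a reconstruction for illustration; the original benchmark code was not posted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* prototype as in util/bitops.c; link against the version under test */
    unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
                                unsigned long offset);

    int main(void)
    {
        const unsigned long nbits = 4194304;   /* size argument, in bits */
        unsigned long *bitmap = calloc(nbits / BITS_PER_LONG,
                                       sizeof(unsigned long));
        clock_t start;
        long i;

        if (!bitmap) {
            return 1;
        }
        start = clock();
        for (i = 0; i < (1L << 20); i++) {
            /* bitmap is all zero: worst case, the whole array is scanned */
            if (find_next_bit(bitmap, nbits, 0) != nbits) {
                abort();
            }
        }
        printf("elapsed: %.1f s\n",
               (double)(clock() - start) / CLOCKS_PER_SEC);
        free(bitmap);
        return 0;
    }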
So I think this is the final version:
while (size >= 4*BITS_PER_LONG) {
unsigned long d1, d2, d3;
tmp = *p;
d1 = *(p+1);
d2 = *(p+2);
d3 = *(p+3);
if (tmp) {
goto found_middle;
}
if (d1 || d2 || d3) {
break;
}
p += 4;
result += 4*BITS_PER_LONG;
size -= 4*BITS_PER_LONG;
}
Peter
* [Qemu-devel] [PATCH] bitops: unroll while loop in find_next_bit().
From: Peter Lieven @ 2013-03-12 7:32 UTC
To: qemu-devel@nongnu.org
Cc: peter.maydell, Paolo Bonzini, Corentin Chary, ronnie sahlberg,
Orit Wasserman
this patch adopts the loop unrolling idea of buffer_is_zero() to
speed up the skipping of large areas of zeros in find_next_bit().
this routine is used extensively to find dirty pages in
live migration.
testing only the find_next_bit performance on a zeroed bitfield,
the loop unrolling decreased execution time by approx. 50% on x86_64.
Signed-off-by: Peter Lieven <pl@kamp.de>
---
util/bitops.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/util/bitops.c b/util/bitops.c
index e72237a..227c38b 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -42,7 +42,23 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
size -= BITS_PER_LONG;
result += BITS_PER_LONG;
}
- while (size & ~(BITS_PER_LONG-1)) {
+ while (size >= 4*BITS_PER_LONG) {
+ unsigned long d1, d2, d3;
+ tmp = *p;
+ d1 = *(p+1);
+ d2 = *(p+2);
+ d3 = *(p+3);
+ if (tmp) {
+ goto found_middle;
+ }
+ if (d1 | d2 | d3) {
+ break;
+ }
+ p += 4;
+ result += 4*BITS_PER_LONG;
+ size -= 4*BITS_PER_LONG;
+ }
+ while (size >= BITS_PER_LONG) {
if ((tmp = *(p++))) {
goto found_middle;
}
--
1.7.9.5
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Stefan Hajnoczi @ 2013-03-12 8:35 UTC
To: Peter Lieven
Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
Paolo Bonzini
On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> For a while now I have had a few VMs that are very hard to migrate because of heavy memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from the locking, which Paolo is working on removing).
>
> I have the following proposal, which seems to help a lot in my case. I just wanted to get some feedback.
Hi Peter,
Do you have any performance numbers for this patch? I'm just curious
how big the win is.
Stefan
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Peter Lieven @ 2013-03-12 8:41 UTC
To: Stefan Hajnoczi
Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
Paolo Bonzini
On 12.03.2013 at 09:35, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
>> For a while now I have had a few VMs that are very hard to migrate because of heavy memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from the locking, which Paolo is working on removing).
>>
>> I have the following proposal, which seems to help a lot in my case. I just wanted to get some feedback.
>
> Hi Peter,
> Do you have any performance numbers for this patch? I'm just curious
> how big the win is.
Hi Stefan,
Please see my recent email to the list with the final patch.
The win is up to 100%. Worst case execution time (whole
array is zero) is halved on x86_64.
Peter
>
> Stefan
* Re: [Qemu-devel] [RFC] find_next_bit optimizations
From: Stefan Hajnoczi @ 2013-03-12 15:12 UTC
To: Peter Lieven
Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
Paolo Bonzini
On Tue, Mar 12, 2013 at 09:41:04AM +0100, Peter Lieven wrote:
>
> Am 12.03.2013 um 09:35 schrieb Stefan Hajnoczi <stefanha@gmail.com>:
>
> > On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> >> For a while now I have had a few VMs that are very hard to migrate because of heavy memory I/O. I found that finding the next dirty bit
> >> seemed to be one of the culprits (apart from the locking, which Paolo is working on removing).
> >>
> >> I have the following proposal, which seems to help a lot in my case. I just wanted to get some feedback.
> >
> > Hi Peter,
> > Do you have any performance numbers for this patch? I'm just curious
> > how big the win is.
>
> Hi Stefan,
>
> please see my recent email to the list with the final patch.
> The win is up to 100%. Worst case execution time (whole
> array is zero) is halved on x86_64.
Thanks!
Stefan