qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [RFC] find_next_bit optimizations
@ 2013-03-11 13:44 Peter Lieven
  2013-03-11 14:04 ` Peter Maydell
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 13:44 UTC (permalink / raw)
  To: qemu-devel@nongnu.org; +Cc: Orit Wasserman, Corentin Chary, Paolo Bonzini

Hi,

I have always had a few VMs that are very hard to migrate because of heavy memory I/O. I found that finding the next dirty bit
seemed to be one of the culprits (apart from the locking, which Paolo is working on removing).

I have the following proposal, which seems to help a lot in my case; I just wanted to get some feedback.
I applied the same unrolling idea as in buffer_is_zero().

Peter

--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
      const unsigned long *p = addr + BITOP_WORD(offset);
      unsigned long result = offset & ~(BITS_PER_LONG-1);
      unsigned long tmp;
+    unsigned long d0,d1,d2,d3;

      if (offset >= size) {
          return size;
      }
      size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
      if (offset) {
          tmp = *(p++);
          tmp &= (~0UL << offset);
@@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
          result += BITS_PER_LONG;
      }
      while (size & ~(BITS_PER_LONG-1)) {
+        while (!(size & (4*BITS_PER_LONG-1))) {
+            d0 = *p;
+            d1 = *(p+1);
+            d2 = *(p+2);
+            d3 = *(p+3);
+            if (d0 || d1 || d2 || d3) {
+                break;
+            }
+            p+=4;
+            result += 4*BITS_PER_LONG;
+            size -= 4*BITS_PER_LONG;
+        }
          if ((tmp = *(p++))) {
              goto found_middle;
          }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 13:44 [Qemu-devel] [RFC] find_next_bit optimizations Peter Lieven
@ 2013-03-11 14:04 ` Peter Maydell
  2013-03-11 14:14 ` Paolo Bonzini
  2013-03-12  8:35 ` Stefan Hajnoczi
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Maydell @ 2013-03-11 14:04 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
	Paolo Bonzini

On 11 March 2013 13:44, Peter Lieven <pl@dlhnet.de> wrote:

> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
> unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

This change at least is unnecessary -- I just checked, and gcc
is already smart enough to turn the % operation into a logical
and. The generated object files are identical.

-- PMM


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 13:44 [Qemu-devel] [RFC] find_next_bit optimizations Peter Lieven
  2013-03-11 14:04 ` Peter Maydell
@ 2013-03-11 14:14 ` Paolo Bonzini
  2013-03-11 14:22   ` Peter Lieven
  2013-03-12  8:35 ` Stefan Hajnoczi
  2 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 14:14 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary

On 11/03/2013 14:44, Peter Lieven wrote:
> Hi,
> 
> I ever since had a few VMs which are very hard to migrate because of a
> lot of memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from removing locking which
> Paolo is working on).
> 
> I have to following proposal which seems to help a lot in my case. Just
> wanted to have some feedback.
> I applied the same unrolling idea like in buffer_is_zero().
> 
> Peter
> 
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
> 
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>      if (offset) {
>          tmp = *(p++);
>          tmp &= (~0UL << offset);
> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>          result += BITS_PER_LONG;
>      }
>      while (size & ~(BITS_PER_LONG-1)) {
> +        while (!(size & (4*BITS_PER_LONG-1))) {

This really means

       if (!(size & (4*BITS_PER_LONG-1))) {
           while (1) {
               ...
           }
       }

because the subtraction will not change the result of the "while" loop
condition.

What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
in turn means "while (size >= 4*BITS_PER_LONG)".

Please change both while loops to use a ">=" condition, it's easier to read.

Paolo

> +            d0 = *p;
> +            d1 = *(p+1);
> +            d2 = *(p+2);
> +            d3 = *(p+3);
> +            if (d0 || d1 || d2 || d3) {
> +                break;
> +            }
> +            p+=4;
> +            result += 4*BITS_PER_LONG;
> +            size -= 4*BITS_PER_LONG;
> +        }
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 14:14 ` Paolo Bonzini
@ 2013-03-11 14:22   ` Peter Lieven
  2013-03-11 14:29     ` Peter Lieven
  2013-03-11 14:35     ` Paolo Bonzini
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 14:22 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary


On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 11/03/2013 14:44, Peter Lieven wrote:
>> Hi,
>> 
>> I ever since had a few VMs which are very hard to migrate because of a
>> lot of memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from removing locking which
>> Paolo is working on).
>> 
>> I have to following proposal which seems to help a lot in my case. Just
>> wanted to have some feedback.
>> I applied the same unrolling idea like in buffer_is_zero().
>> 
>> Peter
>> 
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>     unsigned long tmp;
>> +    unsigned long d0,d1,d2,d3;
>> 
>>     if (offset >= size) {
>>         return size;
>>     }
>>     size -= result;
>> -    offset %= BITS_PER_LONG;
>> +    offset &= (BITS_PER_LONG-1);
>>     if (offset) {
>>         tmp = *(p++);
>>         tmp &= (~0UL << offset);
>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>>         result += BITS_PER_LONG;
>>     }
>>     while (size & ~(BITS_PER_LONG-1)) {
>> +        while (!(size & (4*BITS_PER_LONG-1))) {
> 
> This really means
> 
>       if (!(size & (4*BITS_PER_LONG-1))) {
>           while (1) {
>               ...
>           }
>       }
> 
> because the subtraction will not change the result of the "while" loop
> condition.

Are you sure? The above is working nicely for me (wondering why ;-))
I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
propose. If size & (4*BITS_PER_LONG-1) is not zero, it's not divisible
by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.

> 
> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
> in turn means "while (size >= 4*BITS_PER_LONG).
> 
> Please change both while loops to use a ">=" condition, it's easier to read.

Good idea, it's easier to understand.

Does anyone have evidence on whether unrolling 4 longs is optimal on today's processors?
I just chose 4 longs because it was the same in buffer_is_zero().

Peter

> 
> Paolo
> 
>> +            d0 = *p;
>> +            d1 = *(p+1);
>> +            d2 = *(p+2);
>> +            d3 = *(p+3);
>> +            if (d0 || d1 || d2 || d3) {
>> +                break;
>> +            }
>> +            p+=4;
>> +            result += 4*BITS_PER_LONG;
>> +            size -= 4*BITS_PER_LONG;
>> +        }
>>         if ((tmp = *(p++))) {
>>             goto found_middle;
>>         }
> 


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 14:22   ` Peter Lieven
@ 2013-03-11 14:29     ` Peter Lieven
  2013-03-11 14:35     ` Paolo Bonzini
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 14:29 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary


On 11.03.2013 at 15:22, Peter Lieven <pl@dlhnet.de> wrote:

> 
> On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>> 
>>> I ever since had a few VMs which are very hard to migrate because of a
>>> lot of memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from removing locking which
>>> Paolo is working on).
>>> 
>>> I have to following proposal which seems to help a lot in my case. Just
>>> wanted to have some feedback.
>>> I applied the same unrolling idea like in buffer_is_zero().
>>> 
>>> Peter
>>> 
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>    const unsigned long *p = addr + BITOP_WORD(offset);
>>>    unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>    unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>> 
>>>    if (offset >= size) {
>>>        return size;
>>>    }
>>>    size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>    if (offset) {
>>>        tmp = *(p++);
>>>        tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>        result += BITS_PER_LONG;
>>>    }
>>>    while (size & ~(BITS_PER_LONG-1)) {
>>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>> 
>> This really means
>> 
>>      if (!(size & (4*BITS_PER_LONG-1))) {
>>          while (1) {
>>              ...
>>          }
>>      }
>> 
>> because the subtraction will not change the result of the "while" loop
>> condition.
> 
> Are you sure? The above is working nicely for me (wondering why ;-))
> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero its not dividable
> by 4*BITS_PER_LONG. But it see it might be a problem for size == 0.
> 
>> 
>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG).
>> 
>> Please change both while loops to use a ">=" condition, it's easier to read.

Thinking again: in case a bit is found, this might lead to unnecessary iterations
in the while loop if the bit is in d1, d2, or d3.


> 
> Good idea, its easier to understand.
> 
> Has anyone evidence if unrolling 4 longs is optimal on today processors?
> I just chooses 4 longs because it was the same in buffer_is_zero.
> 
> Peter
> 
>> 
>> Paolo
>> 
>>> +            d0 = *p;
>>> +            d1 = *(p+1);
>>> +            d2 = *(p+2);
>>> +            d3 = *(p+3);
>>> +            if (d0 || d1 || d2 || d3) {
>>> +                break;
>>> +            }
>>> +            p+=4;
>>> +            result += 4*BITS_PER_LONG;
>>> +            size -= 4*BITS_PER_LONG;
>>> +        }
>>>        if ((tmp = *(p++))) {
>>>            goto found_middle;
>>>        }
>> 
> 


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 14:22   ` Peter Lieven
  2013-03-11 14:29     ` Peter Lieven
@ 2013-03-11 14:35     ` Paolo Bonzini
  2013-03-11 15:24       ` Peter Lieven
  1 sibling, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 14:35 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary

On 11/03/2013 15:22, Peter Lieven wrote:
> 
> On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> Hi,
>>>
>>> I ever since had a few VMs which are very hard to migrate because of a
>>> lot of memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from removing locking which
>>> Paolo is working on).
>>>
>>> I have to following proposal which seems to help a lot in my case. Just
>>> wanted to have some feedback.
>>> I applied the same unrolling idea like in buffer_is_zero().
>>>
>>> Peter
>>>
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>     unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>     if (offset >= size) {
>>>         return size;
>>>     }
>>>     size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>     if (offset) {
>>>         tmp = *(p++);
>>>         tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>         result += BITS_PER_LONG;
>>>     }
>>>     while (size & ~(BITS_PER_LONG-1)) {
>>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>>       if (!(size & (4*BITS_PER_LONG-1))) {
>>           while (1) {
>>               ...
>>           }
>>       }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
> 
> Are you sure? The above is working nicely for me (wondering why ;-))

   while (!(size & (4*BITS_PER_LONG-1)))    =>
   while (!(size % (4*BITS_PER_LONG)))      =>
   while ((size % (4*BITS_PER_LONG)) == 0)

Subtracting 4*BITS_PER_LONG doesn't change the modulus.

> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero its not dividable
> by 4*BITS_PER_LONG. But it see it might be a problem for size == 0.

In fact I'm not really sure why it works for you. :)

>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG).
>>
>> Please change both while loops to use a ">=" condition, it's easier to read.
> 
> Good idea, its easier to understand.
> 
>>> Please change both while loops to use a ">=" condition, it's easier to read.
> 
> Thinking again, in case a bit is found, this might lead to unnecessary iterations
> in the while loop if the bit is in d1, d2 or d3.

How would that be different in your patch?  But you can solve it by
making two >= loops, one checking for 4*BITS_PER_LONG and one checking
BITS_PER_LONG.

Paolo

> 
> Peter
> 
>>
>> Paolo
>>
>>> +            d0 = *p;
>>> +            d1 = *(p+1);
>>> +            d2 = *(p+2);
>>> +            d3 = *(p+3);
>>> +            if (d0 || d1 || d2 || d3) {
>>> +                break;
>>> +            }
>>> +            p+=4;
>>> +            result += 4*BITS_PER_LONG;
>>> +            size -= 4*BITS_PER_LONG;
>>> +        }
>>>         if ((tmp = *(p++))) {
>>>             goto found_middle;
>>>         }
>>
> 


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 14:35     ` Paolo Bonzini
@ 2013-03-11 15:24       ` Peter Lieven
  2013-03-11 15:25         ` Peter Maydell
                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 15:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary


> How would that be different in your patch?  But you can solve it by
> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
> BITS_PER_LONG.

This is what I have now:

diff --git a/util/bitops.c b/util/bitops.c
index e72237a..b0dc93f 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
     const unsigned long *p = addr + BITOP_WORD(offset);
     unsigned long result = offset & ~(BITS_PER_LONG-1);
     unsigned long tmp;
+    unsigned long d0,d1,d2,d3;
 
     if (offset >= size) {
         return size;
     }
     size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
     if (offset) {
         tmp = *(p++);
         tmp &= (~0UL << offset);
@@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
-    while (size & ~(BITS_PER_LONG-1)) {
+    while (size >= 4*BITS_PER_LONG) {
+        d0 = *p;
+        d1 = *(p+1);
+        d2 = *(p+2);
+        d3 = *(p+3);
+        if (d0 || d1 || d2 || d3) {
+            break;
+        }
+        p+=4;
+        result += 4*BITS_PER_LONG;
+        size -= 4*BITS_PER_LONG;
+    }
+    while (size >= BITS_PER_LONG) {
         if ((tmp = *(p++))) {
             goto found_middle;
         }


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:24       ` Peter Lieven
@ 2013-03-11 15:25         ` Peter Maydell
  2013-03-11 15:29         ` Paolo Bonzini
  2013-03-11 15:37         ` [Qemu-devel] [RFC] find_next_bit optimizations Peter Maydell
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Maydell @ 2013-03-11 15:25 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, qemu-devel@nongnu.org, Corentin Chary,
	Orit Wasserman

On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

Still pointless.

-- PMM


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:24       ` Peter Lieven
  2013-03-11 15:25         ` Peter Maydell
@ 2013-03-11 15:29         ` Paolo Bonzini
  2013-03-11 15:37           ` Peter Lieven
  2013-03-11 15:37         ` [Qemu-devel] [RFC] find_next_bit optimizations Peter Maydell
  2 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 15:29 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary

On 11/03/2013 16:24, Peter Lieven wrote:
> 
>> How would that be different in your patch?  But you can solve it by
>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>> BITS_PER_LONG.
> 
> This is what I have now:
> 
> diff --git a/util/bitops.c b/util/bitops.c
> index e72237a..b0dc93f 100644
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>  
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>      if (offset) {
>          tmp = *(p++);
>          tmp &= (~0UL << offset);
> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>          size -= BITS_PER_LONG;
>          result += BITS_PER_LONG;
>      }
> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {
> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }
> 

Minus the %= vs. &=,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Perhaps:

        tmp = *p;
        d1 = *(p+1);
        d2 = *(p+2);
        d3 = *(p+3);
        if (tmp) {
            goto found_middle;
        }
        if (d1 || d2 || d3) {
            break;
        }

Paolo


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:29         ` Paolo Bonzini
@ 2013-03-11 15:37           ` Peter Lieven
  2013-03-11 15:58             ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 15:37 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary


On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 11/03/2013 16:24, Peter Lieven wrote:
>> 
>>> How would that be different in your patch?  But you can solve it by
>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>> BITS_PER_LONG.
>> 
>> This is what I have now:
>> 
>> diff --git a/util/bitops.c b/util/bitops.c
>> index e72237a..b0dc93f 100644
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>     unsigned long tmp;
>> +    unsigned long d0,d1,d2,d3;
>> 
>>     if (offset >= size) {
>>         return size;
>>     }
>>     size -= result;
>> -    offset %= BITS_PER_LONG;
>> +    offset &= (BITS_PER_LONG-1);
>>     if (offset) {
>>         tmp = *(p++);
>>         tmp &= (~0UL << offset);
>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>         size -= BITS_PER_LONG;
>>         result += BITS_PER_LONG;
>>     }
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
>> +        d0 = *p;
>> +        d1 = *(p+1);
>> +        d2 = *(p+2);
>> +        d3 = *(p+3);
>> +        if (d0 || d1 || d2 || d3) {
>> +            break;
>> +        }
>> +        p+=4;
>> +        result += 4*BITS_PER_LONG;
>> +        size -= 4*BITS_PER_LONG;
>> +    }
>> +    while (size >= BITS_PER_LONG) {
>>         if ((tmp = *(p++))) {
>>             goto found_middle;
>>         }
>> 
> 
> Minus the %= vs. &=,
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> Perhaps:
> 
>        tmp = *p;
>        d1 = *(p+1);
>        d2 = *(p+2);
>        d3 = *(p+3);
>        if (tmp) {
>            goto found_middle;
>        }
>        if (d1 || d2 || d3) {
>            break;
>        }

I do not know what gcc internally makes of the d0 || d1 || d2 || d3.
I would guess it's something like one addition with carry and one test?

Your proposed change would introduce two tests (maybe)?

What about this, to be sure?

       tmp = *p;
       d1 = *(p+1);
       d2 = *(p+2);
       d3 = *(p+3);
       if (tmp || d1 || d2 || d3) {
           if (tmp) {
               goto found_middle;
           }
           break;
       }

Peter


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:24       ` Peter Lieven
  2013-03-11 15:25         ` Peter Maydell
  2013-03-11 15:29         ` Paolo Bonzini
@ 2013-03-11 15:37         ` Peter Maydell
  2013-03-11 15:41           ` Peter Lieven
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2013-03-11 15:37 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, qemu-devel@nongnu.org, Corentin Chary,
	Orit Wasserman

On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> +    unsigned long d0,d1,d2,d3;

These commas should have spaces after them. Also, since
the variables are only used inside the scope of your
newly added while loop:

> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {

it would be better to declare them here.

> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }

thanks
-- PMM


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:37         ` [Qemu-devel] [RFC] find_next_bit optimizations Peter Maydell
@ 2013-03-11 15:41           ` Peter Lieven
  2013-03-11 15:42             ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 15:41 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Paolo Bonzini, qemu-devel@nongnu.org, Corentin Chary,
	Orit Wasserman


On 11.03.2013 at 16:37, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
>> +    unsigned long d0,d1,d2,d3;
> 
> These commas should have spaces after them. Also, since
> the variables are only used inside the scope of your
> newly added while loop:
> 
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
> 
> it would be better to declare them here.

Can you verify that this does not make a difference in the generated object code?
In buffer_is_zero() it's outside the loop.

thanks
peter


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:41           ` Peter Lieven
@ 2013-03-11 15:42             ` Paolo Bonzini
  2013-03-11 15:48               ` Peter Lieven
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 15:42 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Peter Maydell, qemu-devel@nongnu.org, Corentin Chary,
	Orit Wasserman

On 11/03/2013 16:41, Peter Lieven wrote:
> can you verify if this does not make difference in the generated object code?
> in buffer_is_zero() its outside the loop.

No, it doesn't.

Paolo


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:42             ` Paolo Bonzini
@ 2013-03-11 15:48               ` Peter Lieven
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 15:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Maydell, qemu-devel@nongnu.org, Corentin Chary,
	Orit Wasserman


On 11.03.2013 at 16:42, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 11/03/2013 16:41, Peter Lieven wrote:
>> can you verify if this does not make difference in the generated object code?
>> in buffer_is_zero() its outside the loop.
> 
> No, it doesn't.

OK, I will send the final patch tomorrow.

One last thought: would it make sense to update only `size` in the while loops
and compute the `result` at the end as `orgsize - size`?

Peter


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:37           ` Peter Lieven
@ 2013-03-11 15:58             ` Paolo Bonzini
  2013-03-11 17:06               ` ronnie sahlberg
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 15:58 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary

On 11/03/2013 16:37, Peter Lieven wrote:
> 
> On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>
>>>> How would that be different in your patch?  But you can solve it by
>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>> BITS_PER_LONG.
>>>
>>> This is what I have now:
>>>
>>> diff --git a/util/bitops.c b/util/bitops.c
>>> index e72237a..b0dc93f 100644
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>     unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>     if (offset >= size) {
>>>         return size;
>>>     }
>>>     size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>     if (offset) {
>>>         tmp = *(p++);
>>>         tmp &= (~0UL << offset);
>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>         size -= BITS_PER_LONG;
>>>         result += BITS_PER_LONG;
>>>     }
>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>> +    while (size >= 4*BITS_PER_LONG) {
>>> +        d0 = *p;
>>> +        d1 = *(p+1);
>>> +        d2 = *(p+2);
>>> +        d3 = *(p+3);
>>> +        if (d0 || d1 || d2 || d3) {
>>> +            break;
>>> +        }
>>> +        p+=4;
>>> +        result += 4*BITS_PER_LONG;
>>> +        size -= 4*BITS_PER_LONG;
>>> +    }
>>> +    while (size >= BITS_PER_LONG) {
>>>         if ((tmp = *(p++))) {
>>>             goto found_middle;
>>>         }
>>>
>>
>> Minus the %= vs. &=,
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>> Perhaps:
>>
>>        tmp = *p;
>>        d1 = *(p+1);
>>        d2 = *(p+2);
>>        d3 = *(p+3);
>>        if (tmp) {
>>            goto found_middle;
>>        }
>>        if (d1 || d2 || d3) {
>>            break;
>>        }
> 
> i do not know what gcc interally makes of the d0 || d1 || d2 || d3 ?

It depends on the target and how expensive branches are.

> i would guess its sth like one addition w/ carry and 1 test?

It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
followed by a compare-and-jump.

That is, either:

     test %r8, %r8
     jnz  second_loop
     test %r9, %r9
     jnz  second_loop
     test %r10, %r10
     jnz  second_loop
     test %r11, %r11
     jnz  second_loop

or

     or %r9, %r8
     or %r11, %r10
     or %r8, %r10
     jnz  second_loop

Don't let the length of the code fool you.  The processor knows how to
optimize all of these, and GCC knows too.

> your proposed change would introduce 2 tests (maybe)?

Yes, but I expect them to be fairly well predicted.

> what about this to be sure?
> 
>        tmp = *p;
>        d1 = *(p+1);
>        d2 = *(p+2);
>        d3 = *(p+3);
>        if (tmp || d1 || d2 || d3) {
>            if (tmp) {
>                goto found_middle;

I suspect that GCC would rewrite it to my version (definitely if it
produces 4 compare-and-jumps, but possibly even if it goes
for bitwise ORs; I haven't checked).

Regarding your other question ("one last thought. would it make sense to
update only `size`in the while loops and compute the `result` at the end
as `orgsize` - `size`?"), again the compiler knows better and might even
do this for you.  It will likely drop the p increases and use p[result],
so if you do that change you may even get the same code, only this time
p is increased and you get an extra subtraction at the end. :)

Bottom line: don't try to outsmart an optimizing C compiler on
micro-optimization, unless you have benchmarked it and it shows there is
a problem.

Paolo

>            }
>            break;
>        }
> 
> Peter
> 


* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 15:58             ` Paolo Bonzini
@ 2013-03-11 17:06               ` ronnie sahlberg
  2013-03-11 17:07                 ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: ronnie sahlberg @ 2013-03-11 17:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Orit Wasserman, Peter Lieven, qemu-devel@nongnu.org,
	Corentin Chary

Even more efficient might be to use a bitwise OR instead of a logical OR:

>        if (tmp | d1 | d2 | d3) {

that should remove 3 of the 4 conditional jumps
and should become 3 bitwise ors and one conditional jump


On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:37, Peter Lieven wrote:
>>
>> On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>>
>>>>> How would that be different in your patch?  But you can solve it by
>>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>>> BITS_PER_LONG.
>>>>
>>>> This is what I have now:
>>>>
>>>> diff --git a/util/bitops.c b/util/bitops.c
>>>> index e72237a..b0dc93f 100644
>>>> --- a/util/bitops.c
>>>> +++ b/util/bitops.c
>>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>>     unsigned long tmp;
>>>> +    unsigned long d0,d1,d2,d3;
>>>>
>>>>     if (offset >= size) {
>>>>         return size;
>>>>     }
>>>>     size -= result;
>>>> -    offset %= BITS_PER_LONG;
>>>> +    offset &= (BITS_PER_LONG-1);
>>>>     if (offset) {
>>>>         tmp = *(p++);
>>>>         tmp &= (~0UL << offset);
>>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>>         size -= BITS_PER_LONG;
>>>>         result += BITS_PER_LONG;
>>>>     }
>>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>>> +    while (size >= 4*BITS_PER_LONG) {
>>>> +        d0 = *p;
>>>> +        d1 = *(p+1);
>>>> +        d2 = *(p+2);
>>>> +        d3 = *(p+3);
>>>> +        if (d0 || d1 || d2 || d3) {
>>>> +            break;
>>>> +        }
>>>> +        p+=4;
>>>> +        result += 4*BITS_PER_LONG;
>>>> +        size -= 4*BITS_PER_LONG;
>>>> +    }
>>>> +    while (size >= BITS_PER_LONG) {
>>>>         if ((tmp = *(p++))) {
>>>>             goto found_middle;
>>>>         }
>>>>
>>>
>>> Minus the %= vs. &=,
>>>
>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>>
>>> Perhaps:
>>>
>>>        tmp = *p;
>>>        d1 = *(p+1);
>>>        d2 = *(p+2);
>>>        d3 = *(p+3);
>>>        if (tmp) {
>>>            goto found_middle;
>>>        }
>>>        if (d1 || d2 || d3) {
>>>            break;
>>>        }
>>
>> i do not know what gcc internally makes of the d0 || d1 || d2 || d3 ?
>
> It depends on the target and how expensive branches are.
>
>> i would guess its sth like one addition w/ carry and 1 test?
>
> It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
> followed by a compare-and-jump.
>
> That is, either:
>
>      test %r8, %r8
>      jz   second_loop
>      test %r9, %r9
>      jz   second_loop
>      test %r10, %r10
>      jz   second_loop
>      test %r11, %r11
>      jz   second_loop
>
> or
>
>      or %r9, %r8
>      or %r11, %r10
>      or %r8, %r10
>      jz   second_loop
>
> Don't let the length of the code fool you.  The processor knows how to
> optimize all of these, and GCC knows too.
>
>> your proposed change would introduce 2 tests (maybe)?
>
> Yes, but I expect them to be fairly well predicted.
>
>> what about this to be sure?
>>
>>        tmp = *p;
>>        d1 = *(p+1);
>>        d2 = *(p+2);
>>        d3 = *(p+3);
>>        if (tmp || d1 || d2 || d3) {
>>            if (tmp) {
>>                goto found_middle;
>
> I suspect that GCC would rewrite it to my version (definitely if it
> produces 4 compare-and-jumps; possibly it does it even if it goes
> for bitwise ORs, I haven't checked).
>
> Regarding your other question ("one last thought. would it make sense to
> update only `size` in the while loops and compute the `result` at the end
> as `orgsize` - `size`?"), again the compiler knows better and might even
> do this for you.  It will likely drop the p increases and use p[result],
> so if you do that change you may even get the same code, only this time
> p is increased and you get an extra subtraction at the end. :)
>
> Bottom line: don't try to outsmart an optimizing C compiler on
> micro-optimization, unless you have benchmarked it and it shows there is
> a problem.
>
> Paolo
>
>>            }
>>            break;
>>        }
>>
>> Peter
>>
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 17:06               ` ronnie sahlberg
@ 2013-03-11 17:07                 ` Paolo Bonzini
  2013-03-11 18:20                   ` Peter Lieven
  2013-03-12  7:32                   ` [Qemu-devel] [PATCH] bitops: unroll while loop in find_next_bit() Peter Lieven
  0 siblings, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 17:07 UTC (permalink / raw)
  To: ronnie sahlberg
  Cc: Orit Wasserman, Peter Lieven, qemu-devel@nongnu.org,
	Corentin Chary

On 11/03/2013 18:06, ronnie sahlberg wrote:
> Even more efficient might be to do bitwise instead of logical or
> 
>> >        if (tmp | d1 | d2 | d3) {
> that should remove 3 of the 4 conditional jumps
> and should become 3 bitwise ors and one conditional jump

Without any serious profiling, please let the compiler do that.

Paolo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 17:07                 ` Paolo Bonzini
@ 2013-03-11 18:20                   ` Peter Lieven
  2013-03-12  7:32                   ` [Qemu-devel] [PATCH] bitops: unroll while loop in find_next_bit() Peter Lieven
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Lieven @ 2013-03-11 18:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Orit Wasserman, qemu-devel@nongnu.org, ronnie sahlberg,
	Corentin Chary

On 11.03.2013 18:07, Paolo Bonzini wrote:
> On 11/03/2013 18:06, ronnie sahlberg wrote:
>> Even more efficient might be to do bitwise instead of logical or
>>
>>>>        if (tmp | d1 | d2 | d3) {
>> that should remove 3 of the 4 conditional jumps
>> and should become 3 bitwise ors and one conditional jump
> 
> Without any serious profiling, please let the compiler do that.

Paolo is right, I ran some tests with gcc 4.6.3 on x86_64 (with -O3) and tried the
various ideas. They all made no significant difference. Even unrolling to 8 unsigned
longs didn't change anything.

What I tried is running 2^20 iterations of find_next_bit(bitfield,4194304,0);
I chose the bitfield to cover 4194304 bits, which equals a 16GB VM. The bitfield
was completely zeroed, so find_next_bit had to run through the whole bitfield.

The original version took 1 minute and 10 seconds, whereas all others took
approx. 37-38 seconds, which is almost a 100% boost ;-)

So I think this here is the final version:

    while (size >= 4*BITS_PER_LONG) {
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p+1);
        d2 = *(p+2);
        d3 = *(p+3);
        if (tmp) {
            goto found_middle;
        }        
        if (d1 || d2 || d3) {
            break;
        }
        p += 4;
        result += 4*BITS_PER_LONG;
        size -= 4*BITS_PER_LONG;
    }

Peter

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Qemu-devel] [PATCH] bitops: unroll while loop in find_next_bit().
  2013-03-11 17:07                 ` Paolo Bonzini
  2013-03-11 18:20                   ` Peter Lieven
@ 2013-03-12  7:32                   ` Peter Lieven
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Lieven @ 2013-03-12  7:32 UTC (permalink / raw)
  To: qemu-devel@nongnu.org
  Cc: peter.maydell, Paolo Bonzini, Corentin Chary, ronnie sahlberg,
	Orit Wasserman

This patch adopts the loop unrolling idea of buffer_is_zero() to
speed up the skipping of large zeroed areas in find_next_bit().

This routine is used extensively to find dirty pages during
live migration.

Testing only the find_next_bit() performance on a zeroed bitfield,
the loop unrolling decreased execution time by approx. 50% on x86_64.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
  util/bitops.c |   18 +++++++++++++++++-
  1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/util/bitops.c b/util/bitops.c
index e72237a..227c38b 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -42,7 +42,23 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
          size -= BITS_PER_LONG;
          result += BITS_PER_LONG;
      }
-    while (size & ~(BITS_PER_LONG-1)) {
+    while (size >= 4*BITS_PER_LONG) {
+        unsigned long d1, d2, d3;
+        tmp = *p;
+        d1 = *(p+1);
+        d2 = *(p+2);
+        d3 = *(p+3);
+        if (tmp) {
+            goto found_middle;
+        }
+        if (d1 | d2 | d3) {
+            break;
+        }
+        p += 4;
+        result += 4*BITS_PER_LONG;
+        size -= 4*BITS_PER_LONG;
+    }
+    while (size >= BITS_PER_LONG) {
          if ((tmp = *(p++))) {
              goto found_middle;
          }
-- 
1.7.9.5
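For reference, a self-contained rendition of the patched function (the diff above applied, with the surrounding tail handling reproduced from the kernel-derived bitops implementation so it compiles standalone; `__builtin_ctzl` stands in for the ffs helper):

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)

/* Returns the index of the first set bit at or after `offset`,
 * or `size` if none is set in the first `size` bits of `addr`. */
static unsigned long find_next_bit(const unsigned long *addr,
                                   unsigned long size, unsigned long offset)
{
    const unsigned long *p = addr + BITOP_WORD(offset);
    unsigned long result = offset & ~(BITS_PER_LONG - 1);
    unsigned long tmp;

    if (offset >= size) {
        return size;
    }
    size -= result;
    offset &= (BITS_PER_LONG - 1);
    if (offset) {
        tmp = *(p++);
        tmp &= (~0UL << offset);        /* mask off bits below offset */
        if (size < BITS_PER_LONG) {
            goto found_first;
        }
        if (tmp) {
            goto found_middle;
        }
        size -= BITS_PER_LONG;
        result += BITS_PER_LONG;
    }
    while (size >= 4 * BITS_PER_LONG) { /* unrolled fast path */
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p + 1);
        d2 = *(p + 2);
        d3 = *(p + 3);
        if (tmp) {
            goto found_middle;
        }
        if (d1 | d2 | d3) {
            break;                      /* hit is in p[1..3]; rescan below */
        }
        p += 4;
        result += 4 * BITS_PER_LONG;
        size -= 4 * BITS_PER_LONG;
    }
    while (size >= BITS_PER_LONG) {     /* word-at-a-time tail */
        if ((tmp = *(p++))) {
            goto found_middle;
        }
        result += BITS_PER_LONG;
        size -= BITS_PER_LONG;
    }
    if (!size) {
        return result;
    }
    tmp = *p;

found_first:
    tmp &= (~0UL >> (BITS_PER_LONG - size));
    if (tmp == 0UL) {
        return result + size;           /* no bits set in range */
    }
found_middle:
    return result + __builtin_ctzl(tmp);
}
```

A quick correctness check: an empty bitmap returns `size`, and a single set bit is found from any earlier offset.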

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-11 13:44 [Qemu-devel] [RFC] find_next_bit optimizations Peter Lieven
  2013-03-11 14:04 ` Peter Maydell
  2013-03-11 14:14 ` Paolo Bonzini
@ 2013-03-12  8:35 ` Stefan Hajnoczi
  2013-03-12  8:41   ` Peter Lieven
  2 siblings, 1 reply; 22+ messages in thread
From: Stefan Hajnoczi @ 2013-03-12  8:35 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
	Paolo Bonzini

On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> I ever since had a few VMs which are very hard to migrate because of a lot of memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from removing locking which Paolo is working on).
> 
> I have the following proposal which seems to help a lot in my case. Just wanted to have some feedback.

Hi Peter,
Do you have any performance numbers for this patch?  I'm just curious
how big the win is.

Stefan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-12  8:35 ` Stefan Hajnoczi
@ 2013-03-12  8:41   ` Peter Lieven
  2013-03-12 15:12     ` Stefan Hajnoczi
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Lieven @ 2013-03-12  8:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
	Paolo Bonzini


On 12.03.2013 at 09:35, Stefan Hajnoczi <stefanha@gmail.com> wrote:

> On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
>> I ever since had a few VMs which are very hard to migrate because of a lot of memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from removing locking which Paolo is working on).
>> 
>> I have the following proposal which seems to help a lot in my case. Just wanted to have some feedback.
> 
> Hi Peter,
> Do you have any performance numbers for this patch?  I'm just curious
> how big the win is.

Hi Stefan,

please see my recent email to the list with the final patch.
The win is up to 100%. Worst case execution time (whole
array is zero) is halved on x86_64.

Peter

> 
> Stefan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC] find_next_bit optimizations
  2013-03-12  8:41   ` Peter Lieven
@ 2013-03-12 15:12     ` Stefan Hajnoczi
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2013-03-12 15:12 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Orit Wasserman, qemu-devel@nongnu.org, Corentin Chary,
	Paolo Bonzini

On Tue, Mar 12, 2013 at 09:41:04AM +0100, Peter Lieven wrote:
> 
> On 12.03.2013 at 09:35, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> > On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> >> I ever since had a few VMs which are very hard to migrate because of a lot of memory I/O. I found that finding the next dirty bit
> >> seemed to be one of the culprits (apart from removing locking which Paolo is working on).
> >> 
> >> I have the following proposal which seems to help a lot in my case. Just wanted to have some feedback.
> > 
> > Hi Peter,
> > Do you have any performance numbers for this patch?  I'm just curious
> > how big the win is.
> 
> Hi Stefan,
> 
> please see my recent email to the list with the final patch.
> The win is up to 100%. Worst case execution time (whole
> array is zero) is halved on x86_64.

Thanks!

Stefan

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2013-03-12 15:12 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-11 13:44 [Qemu-devel] [RFC] find_next_bit optimizations Peter Lieven
2013-03-11 14:04 ` Peter Maydell
2013-03-11 14:14 ` Paolo Bonzini
2013-03-11 14:22   ` Peter Lieven
2013-03-11 14:29     ` Peter Lieven
2013-03-11 14:35     ` Paolo Bonzini
2013-03-11 15:24       ` Peter Lieven
2013-03-11 15:25         ` Peter Maydell
2013-03-11 15:29         ` Paolo Bonzini
2013-03-11 15:37           ` Peter Lieven
2013-03-11 15:58             ` Paolo Bonzini
2013-03-11 17:06               ` ronnie sahlberg
2013-03-11 17:07                 ` Paolo Bonzini
2013-03-11 18:20                   ` Peter Lieven
2013-03-12  7:32                   ` [Qemu-devel] [PATCH] bitops: unroll while loop in find_next_bit() Peter Lieven
2013-03-11 15:37         ` [Qemu-devel] [RFC] find_next_bit optimizations Peter Maydell
2013-03-11 15:41           ` Peter Lieven
2013-03-11 15:42             ` Paolo Bonzini
2013-03-11 15:48               ` Peter Lieven
2013-03-12  8:35 ` Stefan Hajnoczi
2013-03-12  8:41   ` Peter Lieven
2013-03-12 15:12     ` Stefan Hajnoczi
