linux-mm.kvack.org archive mirror
* improving checksum cpu consumption in ksm
@ 2009-08-28 20:21 Izik Eidus
  2009-08-31 22:49 ` Hugh Dickins
  0 siblings, 1 reply; 9+ messages in thread
From: Izik Eidus @ 2009-08-28 20:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli; +Cc: linux-mm

Hi,

As you know, we use a checksum (jhash) to detect whether a page is 
changing too often; if it is, we do not insert it into the unstable 
tree (so the tree does not become too unstable in trivial cases).

This is highly needed for ksm in some workloads - I have seen a 
production virtualization server with about 74 gigabytes of ram where, 
the way ksm was running, it took ksm about 6 mins to finish one memory 
scanning loop. In such a case the hashing makes sure that we insert 
only pages that really have not changed (for about 6 mins), protecting 
our unstable tree; without it, if we inserted every page, we would end 
up with a really unstable tree that probably wouldn't find much.

(There is one case where we don't calculate the jhash - when we 
actually find two identical pages inside the stable tree.)

So now that we know why we want to keep this hash/checksum, we want to 
look at how we can reduce the cpu cycles it consumes - jhash is very 
expensive, and not only does it take cpu cycles, it also pollutes the 
cache by walking over the whole page...

As for now I see 2 trivial ways to solve it (all we need to know is 
which pages have changed):
1) use the dirty bit of the page table entries pointing to the page:
     We can walk over the page tables and keep clearing the dirty bit - 
we are using anonymous pages, so it shouldn't matter to anyone if we 
clear the dirty bit.
     With this we win 2 things: 1) no cpu cycles on the expensive 
jhash, and 2) no cache pollution.
     The disadvantage of this approach is the invalidation we need to 
perform on the specific tlb entries associated with the ptes whose 
dirty bit we clear:
     Is that worse than polluting the cache? Is it going to really hurt 
application performance? (Note we would only tlb_flush specific 
entries, not the entire tlb.)
     If this is going to hurt application performance we are better off 
not dealing with it, but what do you think?

     Taking this further, we can use 'unstable dirty bit tracking' - if 
we look at ksm workloads we can split the memory into three different 
kinds of pages:
     a) pages that are identical
     b) pages that are not identical and keep changing all the time
     c) pages that are not identical but don't change

     Given these three types of pages, assume ksm used the following 
way to track pages that are changing:

     Each time ksm finds a page whose page table entries are dirty:
       ksm cleans the dirty bits out of the ptes (without invalidating 
the tlb entries),
       and continues without inserting the page into the unstable tree.

     Each time ksm finds a page whose page table entries are clean:
       ksm calculates the jhash to know whether the page has changed -
       this is needed because we cleaned the dirty bit
       but did not flush the tlb entry pointing to the page,
       so we have to jhash to be sure whether the page has changed.

     Now looking at the three different kinds of pages:
     a) pages that are identical:
           would be found anyway when compared inside the stable tree
     b) pages that are not identical and keep changing all the time:
           most likely they will appear dirty in the pte, even though 
the tlb entry was not flushed by ksm;
           if they still aren't dirty, the jhash check will run on them 
to see whether the page has changed.
           This means that most of the time this optimization saves the 
jhash calculation for this kind of pages:
           because when we see them dirty, we don't need to calculate 
the jhash.
     c) pages that are not identical but don't change:
           these pages will always be clean, so we will calculate jhash 
on them as before.
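
(A userspace sketch of that decision logic - the names and the flat 
pte_dirty flag are illustrative, not kernel code:)

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum action { SKIP_PAGE, TRY_UNSTABLE_TREE };

struct tracked_page {
    bool pte_dirty;        /* OR of the dirty bits of all ptes mapping the page */
    uint32_t old_checksum; /* checksum remembered from a previous scan */
};

/* trivial stand-in for jhash over the page contents */
static uint32_t checksum(const unsigned char *buf, size_t len)
{
    uint32_t h = 0;

    for (size_t i = 0; i < len; i++)
        h = h * 31 + buf[i];
    return h;
}

static enum action scan_one_page(struct tracked_page *tp,
                                 const unsigned char *contents, size_t len)
{
    if (tp->pte_dirty) {
        /* written since the last scan: clean the dirty bits
         * (without invalidating the tlb entries) and skip it */
        tp->pte_dirty = false;
        return SKIP_PAGE;
    }
    /* the ptes look clean, but a stale tlb entry may have let a write
     * through without setting the bit, so fall back to the checksum */
    uint32_t sum = checksum(contents, len);
    if (sum != tp->old_checksum) {
        tp->old_checksum = sum;
        return SKIP_PAGE;
    }
    return TRY_UNSTABLE_TREE; /* unchanged for a whole cycle: worth inserting */
}
```

(Type b) pages mostly take the cheap first branch; type c) pages take 
the checksum branch, as before.)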
  

2) Nehalem cpus with sse 4.2 have a crc instruction - the good: it is 
going to be faster; the bad: only Nehalem and later cpus have it
     (Linux already has support for it)
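
(For reference, that instruction computes CRC-32C, the Castagnoli 
polynomial; a bitwise software sketch of the same checksum - slow, only 
to show what is being computed, not the kernel's crc32c code - would be:)

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78) - the
 * same checksum the sse4.2 crc32 instruction computes.  Real software
 * fallbacks use lookup tables; this form is only illustrative. */
static uint32_t crc32c(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return crc ^ 0xFFFFFFFFu;
}
```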

What do you think? Or am I thinking too much about the cpu cycles we 
are burning on the jhash?


Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: improving checksum cpu consumption in ksm
  2009-08-28 20:21 improving checksum cpu consumption in ksm Izik Eidus
@ 2009-08-31 22:49 ` Hugh Dickins
  2009-09-01 12:12   ` Izik Eidus
  2009-09-03 12:36   ` Izik Eidus
  0 siblings, 2 replies; 9+ messages in thread
From: Hugh Dickins @ 2009-08-31 22:49 UTC (permalink / raw)
  To: Izik Eidus; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

On Fri, 28 Aug 2009, Izik Eidus wrote:
> 
> As you know we are using checksum (jhash) to know if page is changing too
> often, and then if it does we are not inserting it to the unstable tree (so it
> wont get unstable too much at trivial cases)
> 
> This is highly needed for ksm in some workloads - I have seen production
> visualization server that had about 74 or so giga of ram, and the way ksm was
> running there it would take ksm about 6 mins to finish one memory loop, In
> such case the hashing make sure that we will insert pages that are really not
> changing (about 6 mins) and we are protecting our unstable tree, without it,
> if we will insert any page we will end up with an really really unstable tree
> that probably wont find much.
> 
> (There is a case where we don`t calculate jhash - where we actually find two
> identical pages inside the stable tree)
> 
> So after we know why we want to keep this hash/checksum, we want to look how
> we can reduce the cpu cycles that it consume - jhash is very expensive
> And not only that it take cpu cycles it also dirty the cache by walking over
> the whole page...

Not dirtying the cache, I think, but... my mind's gone blank on the
right word too... wasting it anyway.

I wonder what proportion of the cost is the memory accesses and what
proportion is the cost of the jhash algorithm.  My guesses count for
nothing: it's measurement you should be doing.

But the first thing to try (measure) would be Jozsef's patch, updating
jhash.h from the 1996 Jenkins lookup2 to the 2006 Jenkins lookup3,
which is supposed to be a considerable improvement from all angles.

See http://lkml.org/lkml/2009/2/12/65

I think that was the final version of it: I don't know what happened
after that (beyond Eric questioning "deadbeef", but that won't matter
in your testing) - perhaps it was unclear which maintainer should pick
it up, and it then fell through the cracks?

Or, going the other way, how damaging would it be to unstable tree
integrity to use a cheaper hashing algorithm?  There's no perfection
here, a page that's unchanged for one KSM cycle is no way guaranteed
to be unchanged for the next, that's merely an heuristic.

> 
> As for now i see 2 trivials ways how to solve it: (All we need to know is -
> what pages have been changed)
> 1) use the dirty bit of page tables pointing into the page:
>     We can walk over the page tables, and keep cleaning the dirty bit - we are
> using anonymous pages so it shouldnt matther anyone if we clean the dirty bit,

Well, if you clean the pte dirty bit, you must be sure to transfer that
to the page dirty bit, otherwise the anonymous page will be thrown away
as a zero page (or a clean copy of a swap page) when memory is needed.

And beware of s390, which dirties differently.

>     With this case we win 2 things - 1) no cpu cycles on the expensive jhash,
> and 2) no dirty of the cache
>     the dissadvange of this usage is the PAGE_INVALID we need to call of the
> specific tlbs entries associate with the ptes we are clearing the dirty bit:
>     Is it worst than dirty the cache?, is it going to really hurt applications
> performence? (note we are just tlb_flush specific entries, not the entire tlb)
>     If this going to hurt applications pefromence we are better not to deal
> with it, but what do you think about this?

I think three things.

One, that I've no idea, and it's measurement you need.

Two, please keep in mind that it may not be the TLB flush itself that
matters, but the IPI to do it on other CPUs: how much that costs
depends on how much ksmd is running at the same time as threads
of the apps making use of it.

Three, that if we go in this direction, might it be even better to make
the unstable tree stable? i.e. write protect its contents so that we're
sure a node cannot be poisoned with modified data.  It may be
immediately obvious to you that that's a terrible suggestion (all the
overhead of the write faults), but it's not quite obvious to me yet.

I expect a lot depends on the proportions of pages_shared, pages_sharing,
pages_unshared, pages_volatile.  But as I've said before, I've no
feeling yet (or ever?) for the behaviour of the unstable tree: for
example, I've no grasp of whether a large unstable tree is inherently
less stable (more likely to be seriously poisoned) than a small one,
or not.

> 
>     Taking this further more we can use 'unstable dirty bit tracking' - if we
> look on ksm work loads we can split the memory into three diffrent kind of
> pages:
>     a) pages that are identical
>     b) pages that are not identical and keep changing all the time
>     c) pages that are not identical but doesn't change
> 
>     So taking this three type of pages lets assume ksm was using the following
> way to track pages that are changing:
> 
>     Each time ksm find page that its page tables pointing to it are dirty,:
>       ksm will clean the dirty bits out of the ptes (without INVALID_PAGE
> them),
>       and will continue without inserting the page into the unstable tree.
> 
>     Each time ksm will find page that the page tables pointing to it are
> clean:
>       ksm will calucate jhash to know if the page was changed -
>       this is needed due to the fact that we cleaned the dirty bit,
>       but we didnt tlb_flush the tlb entry pointing to the page,
>       so we have to jhash to make sure if the page was changed.

Interesting.  At first I thought that sounded like a worst of all
worlds solution, but perhaps not: the proportions might make that
a very sensible approach.

But playing with the pte dirty bit without flushing TLB is a dangerous
game: you run the risk that MM will catch it at a moment when it looks
clean and free it.  We could, I suppose, change MM to assume that anon
pages are dirty in VM_MERGEABLE areas; but it feels too early for the
KSM tail to be wagging the MM dog in such a way.

> 
>     Now looking on the three diffrent kind of pages
>     a) pages that are identical:
>           would get find anyway when comparing them inside the stable tree
>     b) pages that are not identical and keep changing all the time:
>           Most of the chances that they will appear dirty on the pte, even
> thougth that the tlb entry was not flushed by ksm,
>           If they still wont be dirty, the jhash check will be run on them to
> know if the page was changed,
>           This meaning that most of the time this optimization will save the
> jhash calcualtion to this kind of pages:
>           beacuse when we will see them dirty, we wont need to calcuate the
> jhash.
>     c) pages that are not identical but doesn't change:
>           This kind of pages will always be clean, so we will clacuate jhash
> on them like before.
>  
> 
> 2) Nehalem cpus with sse 4.1 have crc instruction - the good - it going to be
> faster, the bad - only Nehlem and above cpus will have it
>     (Linux already have support for it)

Sounds perfectly sensible to use hardware CRC where it's available;
I expect other architectures have models which support CRC too.

Assuming the characteristics of the hardware CRC are superior to
or close enough to the characteristics of the jhash - I'm entirely
ignorant of CRCs and hashing matters, perhaps nobody would put a 
CRC into their hardware unless it was good enough (but good enough
for what purpose?).

> 
> What you think?, Or am i too much think about the cpu cycles we are burning
> with the jhash?

I do think KSM burns a lot of CPU; but whether it's the jhash or
whether it's all the other stuff (page table walking, radix tree
walking, memcmping) I've not looked.

Hugh



* Re: improving checksum cpu consumption in ksm
  2009-08-31 22:49 ` Hugh Dickins
@ 2009-09-01 12:12   ` Izik Eidus
  2009-09-03 12:36   ` Izik Eidus
  1 sibling, 0 replies; 9+ messages in thread
From: Izik Eidus @ 2009-09-01 12:12 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

Hugh Dickins wrote:
>
> But the first thing to try (measure) would be Jozsef's patch, updating
> jhash.h from the 1996 Jenkins lookup2 to the 2006 Jenkins lookup3,
> which is supposed to be a considerable improvement from all angles.
>
> See http://lkml.org/lkml/2009/2/12/65
>   

I will check this one.

>
> Three, that if we go in this direction, might it be even better to make
> the unstable tree stable? i.e. write protect its contents so that we're
> sure a node cannot be poisoned with modified data.  It may be
> immediately obvious to you that that's a terrible suggestion (all the
> overhead of the write faults), but it's not quite obvious to me yet.
>   

It was one of the early thoughts about how to write the new ksm -
check whether the page checksum has changed, and if it hasn't, 
write-protect the page and insert it into the tree.
The problem is: for every kvm guest that is idle, all of its memory 
would become read-only.
In such cases the performance of kvm guests using ksm would suffer.
(I believe it is better to burn more cpu cycles on the ksm side than to 
hurt its users.)

> I expect a lot depends on the proportions of pages_shared, pages_sharing,
> pages_unshared, pages_volatile.  But as I've said before, I've no
> feeling yet (or ever?) for the behaviour of the unstable tree: for
> example, I've no grasp of whether a large unstable tree is inherently
> less stable (more likely to be seriously poisoned) than a small one,
> or not.
>   

Yesterday I logged into an account that had 94 gigabytes of ram for ksm 
to scan. It seems to have performed very well there; the trick is this:

the bigger the memory, the longer it takes ksm to finish one memory 
scanning loop - on such systems we insert only pages that have not 
changed for about 8 minutes...

So from this side we are quite safe (and we have the stable tree 
exactly to cover the remaining weaknesses of the unstable tree; the 
only task of the unstable tree is to build the stable tree).
(The proportion is exactly the same for small systems and big systems: 
an unstable tree built from 1 giga of ram holds half as many pages that 
could get corrupted as one built from 2 giga, but each page in the 
2-giga tree has half the chance of getting corrupted, because its jhash 
content had to stay the same for twice the time.)

The worst case for a big unstable tree is a node near the root becoming 
invalid - the chance of hitting such a node is roughly 1 in 2^n - 1 - 
and this is exactly why we keep rebuilding the tree (to recover from 
such bad luck).

Actually, another nice thing to note: when a page changes there is only 
about a 50% chance it invalidates the tree, because in a sorted tree we 
only care whether "page 1" is bigger than "page 2";
so if "page 2" was bigger than "page 1", and "page 2" then changed but 
is still bigger than "page 1", the unstable tree is still valid...

As of now I don't think I have found a workload where the hash-table 
version found more pages.

>   
>>     Taking this further more we can use 'unstable dirty bit tracking' - if we
>> look on ksm work loads we can split the memory into three diffrent kind of
>> pages:
>>     a) pages that are identical
>>     b) pages that are not identical and keep changing all the time
>>     c) pages that are not identical but doesn't change
>>
>>     So taking this three type of pages lets assume ksm was using the following
>> way to track pages that are changing:
>>
>>     Each time ksm find page that its page tables pointing to it are dirty,:
>>       ksm will clean the dirty bits out of the ptes (without INVALID_PAGE
>> them),
>>       and will continue without inserting the page into the unstable tree.
>>
>>     Each time ksm will find page that the page tables pointing to it are
>> clean:
>>       ksm will calucate jhash to know if the page was changed -
>>       this is needed due to the fact that we cleaned the dirty bit,
>>       but we didnt tlb_flush the tlb entry pointing to the page,
>>       so we have to jhash to make sure if the page was changed.
>>     
>
> Interesting.  At first I thought that sounded like a worst of all
> worlds solution, but perhaps not: the proportions might make that
> a very sensible approach.
>
> But playing with the pte dirty bit without flushing TLB is a dangerous
> game: you run the risk that MM will catch it at a moment when it looks
> clean and free it.  We could, I suppose, change MM to assume that anon
> pages are dirty in VM_MERGEABLE areas; but it feels too early for the
> KSM tail to be wagging the MM dog in such a way.
>   

Agree.


>> ot flushed by ksm,
>>           If they still wont be dirty, the jhash check will be run on them to
>> know if the page was changed,
>>           This meaning that most of the time this optimization will save the
>> jhash calcualtion to this kind of pages:
>>           beacuse when we will see them dirty, we wont need to calcuate the
>> jhash.
>>     c) pages that are not identical but doesn't change:
>>           This kind of pages will always be clean, so we will clacuate jhash
>> on them like before.
>>  
>>
>> 2) Nehalem cpus with sse 4.1 have crc instruction - the good - it going to be
>> faster, the bad - only Nehlem and above cpus will have it
>>     (Linux already have support for it)
>>     
>
> Sounds perfectly sensible to use hardware CRC where it's available;
> I expect other architectures have models which support CRC too.
>
> Assuming the characteristics of the hardware CRC are superior to
> or close enough to the characteristics of the jhash - I'm entirely
> ignorant of CRCs and hashing matters, perhaps nobody would put a 
> CRC into their hardware unless it was good enough (but good enough
> for what purpose?).
>   

I did test it with CRC; a simple naive test showed no difference from 
the point of view of "how unstable the unstable tree is".
I will look into it after 2.6.32.

>   
>> What you think?, Or am i too much think about the cpu cycles we are burning
>> with the jhash?
>>     
>
> I do think KSM burns a lot of CPU; but whether it's the jhash or
> whether it's all the other stuff (page table walking, radix tree
> walking, memcmping) I've not looked.
>   

Agreed, jhash is just one of the problems; there is a way to make the 
trees' memcmping faster and less cpu intensive as well, but I will keep 
that for another mail (not in the near future...).

Thanks for the response, Hugh.

> Hugh
>   



* Re: improving checksum cpu consumption in ksm
  2009-08-31 22:49 ` Hugh Dickins
  2009-09-01 12:12   ` Izik Eidus
@ 2009-09-03 12:36   ` Izik Eidus
  2009-09-03 15:20     ` Hugh Dickins
  1 sibling, 1 reply; 9+ messages in thread
From: Izik Eidus @ 2009-09-03 12:36 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

Hugh Dickins wrote:
>
> But the first thing to try (measure) would be Jozsef's patch, updating
> jhash.h from the 1996 Jenkins lookup2 to the 2006 Jenkins lookup3,
> which is supposed to be a considerable improvement from all angles.
>
> See http://lkml.org/lkml/2009/2/12/65
>
>   

Hi,
I just did a small test of the new hash compared to the old.

Using the program below, I ran ksm (with nice -20) at
time_to_sleep_in_millisecs = 1
run = 1
pages_to_scan = 9999

(The program is designed just to pressure the hash calculations and the 
tree walking, and not to actually share any pages.)

Then I checked how many full scans ksm had completed (I just read 
/sys/kernel/mm/ksm/full_scans)

And I got the following results:
with the old jhash version ksm did 395 loops
with the new jhash version ksm did 455 loops
That is a 15% improvement for this case, where pages are static but not 
shareable...
(And it will help in any case where we have a page that we are not 
merging into the stable tree.)

I think it is nice...

(I used an AMD Phenom(tm) II X3 720 processor, but I probably didn't 
run the test long enough; I should rerun it and see whether the results 
are consistent.)

(Another thing: I had a conflict with something about 
JHASH_GOLDEN_RATIO; I just added that define back and the kernel 
compiled.)

Thanks.


#include <malloc.h>
#include <stdio.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

int main()
{
    unsigned char *p, *tmp;
    unsigned char *p_end;

    p = (unsigned char *) malloc(1024 * 1024 * 100 + 4096);
    if (!p) {
        printf("malloc failed\n");
        return 1;
    }

    p_end = p + 1024 * 1024 * 100;
    p = (unsigned char *)((unsigned long)p & ~4095);

    tmp = p;
    for(; tmp != p_end; ++tmp) {
        *tmp = (unsigned char)random();
    }

    if (madvise(p, 1024 * 1024 * 100, 12 /* MADV_MERGEABLE */) == -1) {
        perror("madvise");
    }

    sleep(60);

    return 0;
}



* Re: improving checksum cpu consumption in ksm
  2009-09-03 12:36   ` Izik Eidus
@ 2009-09-03 15:20     ` Hugh Dickins
  2009-09-03 15:42       ` Izik Eidus
                         ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Hugh Dickins @ 2009-09-03 15:20 UTC (permalink / raw)
  To: Izik Eidus; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

On Thu, 3 Sep 2009, Izik Eidus wrote:
> 
> Hi,
> I just did small test of the new hash compare to the old
> 
> using the program below, i ran ksm (with nice -20)
> at time_to_sleep_in_millisecs = 1

Better 0?

> run = 1
> pages_to_scan = 9999

Okay, the bigger the better.

> 
> (The program is designing just to  pressure the hash calcs and tree walking
> (and not to share any page really)
> 
> then i checked how many full_scans have ksm reached (i just checked
> /sys/kernel/mm/ksm/full_scans)
> 
> And i got the following results:
> with the old jhash version ksm did 395 loops
> with the new jhash version ksm did 455 loops

The first few loops will be settling down, need to subtract those.

> we got here 15% improvment for this case where we have pages that are static
> but are not shareable...
> (And it will help in any case we got page we are not merging in the stable
> tree)
> 
> I think it is nice...

Yes, that's nice, thank you for looking into it.

But please do some more along these lines, if you've time?
Presumably the improvement from Jenkins lookup2 to lookup3
is therefore more than 15%, but we cannot tell how much.

I think you need to do a run with a null version of jhash2(),
one just returning 0 or 0xffffffff (the first would settle down
a little quicker because oldchecksum 0 will match the first time;
but there should be no difference once you cut out settling time).

And a run with an almost-null version of jhash2(), one which does
also read the whole page sequentially into cache, so we can see
how much is the processing and how much is the memory access.

And also, while you're about it, a run with cmp_and_merge_page()
stubbed out, so we can see how much is just the page table walking
(and deduce from that how much is the radix tree walking and memcmping).

Hmm, and a run to see how much is radix tree walking,
by stubbing out the memcmping.
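
Stub versions along those lines might look like this (hypothetical
names, following jhash2()'s word-array interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Null hash: constant result, so a run measures everything but the hash. */
static uint32_t jhash2_null(const uint32_t *k, size_t length, uint32_t initval)
{
    (void)k; (void)length; (void)initval;
    return 0xffffffffu;
}

/* Almost-null hash: still loads every word of the page, so the memory
 * access cost remains while the mixing arithmetic is gone. */
static uint32_t jhash2_touch(const uint32_t *k, size_t length, uint32_t initval)
{
    uint32_t acc = initval;

    for (size_t i = 0; i < length; i++)
        acc ^= k[i];    /* one load per word, trivial combine */
    return acc;
}
```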

Sorry... if you (or someone else following) have the time!

> 
> (I used  AMD Phenom(tm) II X3 720 Processor, but probably i didnt run the test
> enougth, i should rerun it again and see if the results are consistent)

Right, other processors will differ some(unknown)what, so we shouldn't
take the numbers you find too seriously.  But at this moment I've no
idea of what proportion of time is spent on what: it should be helpful
to see what dominates.

> 
>    p = (unsigned char *) malloc(1024 * 1024 * 100 + 4096);
>    if (!p) {
>        printf("error\n");
>    }
> 
>    p_end = p + 1024 * 1024 * 100;
>    p = (unsigned char *)((unsigned long)p & ~4095);

Doesn't matter to your results, so long as it didn't crash;
but I think you meant to say

     p = (unsigned char *)(((unsigned long)p + 4095) & ~4095);
     p_end = p + 1024 * 1024 * 100;

Hugh



* Re: improving checksum cpu consumption in ksm
  2009-09-03 15:20     ` Hugh Dickins
@ 2009-09-03 15:42       ` Izik Eidus
  2009-09-04 22:29       ` Moussa Ba
  2009-09-12 16:33       ` Izik Eidus
  2 siblings, 0 replies; 9+ messages in thread
From: Izik Eidus @ 2009-09-03 15:42 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

Hugh Dickins wrote:
> Yes, that's nice, thank you for looking into it.
>
> But please do some more along these lines, if you've time?
> Presumably the improvement from Jenkins lookup2 to lookup3
> is therefore more than 15%, but we cannot tell how much.
>
> I think you need to do a run with a null version of jhash2(),
> one just returning 0 or 0xffffffff (the first would settle down
> a little quicker because oldchecksum 0 will match the first time;
> but there should be no difference once you cut out settling time).
>
> And a run with an almost-null version of jhash2(), one which does
> also read the whole page sequentially into cache, so we can see
> how much is the processing and how much is the memory access.
>
> And also, while you're about it, a run with cmp_and_merge_page()
> stubbed out, so we can see how much is just the page table walking
> (and deduce from that how much is the radix tree walking and memcmping).
>
> Hmm, and a run to see how much is radix tree walking,
> by stubbing out the memcmping.
>
> Sorry... if you (or someone else following) have the time!
>   

Good ideas. The tests I did were quick ones; I hope on Saturday (or 
maybe a few days after) to have more time to spend on this area, and I 
will try to measure how much time each part takes there.
So I will keep benchmarking it in terms of "how many loops it completes".

>
> Doesn't matter to your results, so long as it didn't crash;
> but I think you meant to say
>
>      p = (unsigned char *)(((unsigned long)p + 4095) & ~4095);
>      p_end = p + 1024 * 1024 * 100;
>   

Yes...


Thanks.



* Re: improving checksum cpu consumption in ksm
  2009-09-03 15:20     ` Hugh Dickins
  2009-09-03 15:42       ` Izik Eidus
@ 2009-09-04 22:29       ` Moussa Ba
  2009-09-05 11:33         ` Hugh Dickins
  2009-09-12 16:33       ` Izik Eidus
  2 siblings, 1 reply; 9+ messages in thread
From: Moussa Ba @ 2009-09-04 22:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Andrea Arcangeli, Jozsef Kadlecsik, linux-mm, jaredeh

Just to add to the discussion, we have also seen high cpu usage from
KSM.  In our case, however, it is more serious, as the system that KSM
is running on is battery powered, with a weaker processor.  With KSM
constantly running, the effect on battery life is significant.

I like the idea of dirty bit tracking, as it would obviate the need to
rehash once we know the page has not been dirtied.  We have been
working on a patch that adds dirty bit clearing from user space,
similar to the clear_refs entry under /proc/pid/.  In our case we
use this mechanism to measure page accesses and write frequency on
anonymous pages, file-backed pages, or both.  Could this potentially
pose a problem if KSM decides to use that mechanism for page state
tracking?

Moussa.

On Thu, Sep 3, 2009 at 8:20 AM, Hugh Dickins<hugh.dickins@tiscali.co.uk> wrote:
> On Thu, 3 Sep 2009, Izik Eidus wrote:
>>
>> Hi,
>> I just did small test of the new hash compare to the old
>>
>> using the program below, i ran ksm (with nice -20)
>> at time_to_sleep_in_millisecs = 1
>
> Better 0?
>
>> run = 1
>> pages_to_scan = 9999
>
> Okay, the bigger the better.
>
>>
>> (The program is designing just to  pressure the hash calcs and tree walking
>> (and not to share any page really)
>>
>> then i checked how many full_scans have ksm reached (i just checked
>> /sys/kernel/mm/ksm/full_scans)
>>
>> And i got the following results:
>> with the old jhash version ksm did 395 loops
>> with the new jhash version ksm did 455 loops
>
> The first few loops will be settling down, need to subtract those.
>
>> we got here 15% improvment for this case where we have pages that are static
>> but are not shareable...
>> (And it will help in any case we got page we are not merging in the stable
>> tree)
>>
>> I think it is nice...
>
> Yes, that's nice, thank you for looking into it.
>
> But please do some more along these lines, if you've time?
> Presumably the improvement from Jenkins lookup2 to lookup3
> is therefore more than 15%, but we cannot tell how much.
>
> I think you need to do a run with a null version of jhash2(),
> one just returning 0 or 0xffffffff (the first would settle down
> a little quicker because oldchecksum 0 will match the first time;
> but there should be no difference once you cut out settling time).
>
> And a run with an almost-null version of jhash2(), one which does
> also read the whole page sequentially into cache, so we can see
> how much is the processing and how much is the memory access.
>
> And also, while you're about it, a run with cmp_and_merge_page()
> stubbed out, so we can see how much is just the page table walking
> (and deduce from that how much is the radix tree walking and memcmping).
>
> Hmm, and a run to see how much is radix tree walking,
> by stubbing out the memcmping.
>
> Sorry... if you (or someone else following) have the time!
>
>>
>> (I used  AMD Phenom(tm) II X3 720 Processor, but probably i didnt run the test
>> enougth, i should rerun it again and see if the results are consistent)
>
> Right, other processors will differ some(unknown)what, so we shouldn't
> take the numbers you find too seriously.  But at this moment I've no
> idea of what proportion of time is spent on what: it should be helpful
> to see what dominates.
>
>>
>>    p = (unsigned char *) malloc(1024 * 1024 * 100 + 4096);
>>    if (!p) {
>>        printf("error\n");
>>    }
>>
>>    p_end = p + 1024 * 1024 * 100;
>>    p = (unsigned char *)((unsigned long)p & ~4095);
>
> Doesn't matter to your results, so long as it didn't crash;
> but I think you meant to say
>
>     p = (unsigned char *)(((unsigned long)p + 4095) & ~4095);
>     p_end = p + 1024 * 1024 * 100;
>
> Hugh
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: improving checksum cpu consumption in ksm
  2009-09-04 22:29       ` Moussa Ba
@ 2009-09-05 11:33         ` Hugh Dickins
  0 siblings, 0 replies; 9+ messages in thread
From: Hugh Dickins @ 2009-09-05 11:33 UTC (permalink / raw)
  To: Moussa Ba
  Cc: Izik Eidus, Andrea Arcangeli, Jozsef Kadlecsik, linux-mm, jaredeh

On Fri, 4 Sep 2009, Moussa Ba wrote:

> Just to add to the discussion, we have also seen high cpu usage for
> KSM.  In our case, however, it is more serious, as the system KSM is
> running on is battery-powered with a weaker processor.  With KSM
> constantly running, the effect on battery life is significant.

Sounds like it would be a good idea for us to throttle back ksmd
when the system is otherwise idle.  Though quite a bit of thought
should go into how we decide "idle" for that.

> 
> I like the idea of dirty bit tracking as it would obviate the need to
> rehash once we know the page has not been dirtied.  We have been
> working on a patch that adds dirty bit clearing from user space,
> similar to the clear_refs entry under /proc/pid/.  In our instance we
> use this mechanism to measure page accesses and write frequency on
> ANONYMOUS pages, file backed pages or both.  Could this potentially
> pose a problem if KSM decides to use that mechanism for page state
> tracking?

Yes, KSM's use of the bit would interfere with your statistics, and
your use of the bit would interfere with KSM's efficiency: better
not use them both together (or keep yours off MADV_MERGEABLE areas).

Hugh



* Re: improving checksum cpu consumption in ksm
  2009-09-03 15:20     ` Hugh Dickins
  2009-09-03 15:42       ` Izik Eidus
  2009-09-04 22:29       ` Moussa Ba
@ 2009-09-12 16:33       ` Izik Eidus
  2 siblings, 0 replies; 9+ messages in thread
From: Izik Eidus @ 2009-09-12 16:33 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, moussa ba, linux-mm

Hugh Dickins wrote:
> On Thu, 3 Sep 2009, Izik Eidus wrote:
>   
>
>
> Yes, that's nice, thank you for looking into it.
>
> But please do some more along these lines, if you've time?
> Presumably the improvement from Jenkins lookup2 to lookup3
> is therefore more than 15%, but we cannot tell how much.
>
> I think you need to do a run with a null version of jhash2(),
> one just returning 0 or 0xffffffff (the first would settle down
> a little quicker because oldchecksum 0 will match the first time;
> but there should be no difference once you cut out settling time).
>
> And a run with an almost-null version of jhash2(), one which does
> also read the whole page sequentially into cache, so we can see
> how much is the processing and how much is the memory access.
>
> And also, while you're about it, a run with cmp_and_merge_page()
> stubbed out, so we can see how much is just the page table walking
> (and deduce from that how much is the radix tree walking and memcmping).
>
> Hmm, and a run to see how much is radix tree walking,
> by stubbing out the memcmping.
>
> Sorry... if you (or someone else following) have the time!
>
>   
OK, so with exactly the same program as before (only the sleep() time
went from 60 to 120), and:

ksm nice = -20

echo 1 > /sys/kernel/mm/ksm/sleep_millisecs
echo 1 > /sys/kernel/mm/ksm/run
echo 0 > /sys/kernel/mm/ksm/max_kernel_pages
echo 99999 > /sys/kernel/mm/ksm/pages_to_scan
(I forgot to set sleep_millisecs to 0, but I think the results still
give a good picture)

Results:
Normal ksmd as found in mmotm:
1st run: 789 full scans
2nd run: 751 full scans
3rd run: 790 full scans

no-hash version (jhash2 just returns 0xffffffff):
1st run: 1873
2nd run: 1888
3rd run: 1874

no hash, but reading all the memory (checking memory access cost):
1st run: 1364
2nd run: 1363
3rd run: 1251

checksum version - just add the contents of each 64-bit word of the page
into a u64 variable, something like:
ret = 0;
for (; p != p_end; ++p)
    ret += *p;
1st run: 1362
2nd run: 1250
3rd run: 1250

no cmp_and_merge_page() version:
1st run: 10,000
2nd run: 10,000
3rd run: 15,017
(At this point I figured it was probably just spending its time
sleeping, so I tried it with sleep_millisecs = 0 and got:)
4th run: 29,987
5th run: 45,012 (I didn't really look at what happens in the kernel when
sleep_millisecs = 0, but these results seem irrelevant because it looks
like we are not burning cpu here)

memcmp_pages() returns 1 version (without jhash):
(This version should really pressure the unstable tree walking - it will
build a tree of 25,600 pages (the application allocates 100mb), meaning
the depth will be about 15, and each time it checks a page it will walk
all 15 levels of the unstable tree (because memcmp_pages() always
returns 1, so it will keep going right))

1st run: 1763
2nd run: 1765
3rd run: 1732



Looking at all these results, I tend to think the winner here is the
dirty bit from a performance perspective; but because we don't want to
deal with the problems of dirty bit tracking right now (especially the
unstable dirty bit), we probably need to look at how much jhash costs
us. Comparing the memory access times to the jhash times, jhash does
look expensive. Since we just need to track pages that are not changing,
I guess we could use some kind of checksum that is much less cpu
intensive than jhash...?

(Btw, starting with merging the new jhash would be a good idea as well...)

Thanks, and btw, did I miss any test that you might want?



end of thread, other threads:[~2009-09-12 16:25 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-28 20:21 improving checksum cpu consumption in ksm Izik Eidus
2009-08-31 22:49 ` Hugh Dickins
2009-09-01 12:12   ` Izik Eidus
2009-09-03 12:36   ` Izik Eidus
2009-09-03 15:20     ` Hugh Dickins
2009-09-03 15:42       ` Izik Eidus
2009-09-04 22:29       ` Moussa Ba
2009-09-05 11:33         ` Hugh Dickins
2009-09-12 16:33       ` Izik Eidus
