* improving checksum cpu consumption in ksm @ 2009-08-28 20:21 Izik Eidus
From: Izik Eidus @ 2009-08-28 20:21 UTC
To: Hugh Dickins, Andrea Arcangeli; +Cc: linux-mm

Hi,

As you know, we use a checksum (jhash) to detect whether a page is changing too often; if it is, we do not insert it into the unstable tree (so the tree does not become too unstable in the trivial cases).

This is badly needed by ksm in some workloads - I have seen a production virtualization server with about 74 GB of RAM where, the way ksm was running there, it took about 6 minutes to finish one pass over memory. In such a case the hashing makes sure that we only insert pages that really have not changed (for about 6 minutes), and so protects our unstable tree; without it, if we inserted every page, we would end up with a really, really unstable tree that probably would not find much.

(There is one case where we don't calculate the jhash - when we actually find two identical pages inside the stable tree.)

So, given why we want to keep this hash/checksum, we should look at how to reduce the cpu cycles it consumes - jhash is very expensive. And not only does it take cpu cycles, it also dirties the cache by walking over the whole page...

For now I see two trivial ways to solve it (all we need to know is: which pages have changed):

1) Use the dirty bit of the page table entries pointing at the page:
We can walk over the page tables and keep clearing the dirty bit - these are anonymous pages, so clearing the dirty bit should not matter to anyone. With this we win two things: 1) no cpu cycles on the expensive jhash, and 2) no dirtying of the cache.
The disadvantage is the tlb invalidation we need to issue for the specific tlb entries associated with the ptes whose dirty bit we are clearing: is that worse than dirtying the cache? Is it going to really hurt application performance? (Note that we would only flush specific tlb entries, not the entire tlb.) If this is going to hurt application performance we are better off not doing it - but what do you think about this?

Taking this further, we could use "unstable dirty bit tracking". If we look at ksm workloads, we can split memory into three different kinds of pages:
a) pages that are identical
b) pages that are not identical and keep changing all the time
c) pages that are not identical but do not change

Given these three types of pages, suppose ksm used the following way to track pages that are changing:

Each time ksm finds a page whose ptes are dirty:
    ksm clears the dirty bits in the ptes (without invalidating the tlb entries),
    and continues without inserting the page into the unstable tree.

Each time ksm finds a page whose ptes are clean:
    ksm calculates the jhash to know whether the page has changed -
    this is needed because we cleared the dirty bit but did not flush the
    tlb entry pointing at the page, so we have to jhash to be sure whether
    the page has changed.

Now looking at the three different kinds of pages:
a) pages that are identical:
    would be found anyway when comparing them inside the stable tree
b) pages that are not identical and keep changing all the time:
    most likely they will appear dirty in the pte, even though the tlb entry
    was not flushed by ksm; if they still are not dirty, the jhash check will
    be run on them to know whether the page has changed.
    This means that most of the time this optimization saves the jhash
    calculation for this kind of page: when we see them dirty, we do not
    need to calculate the jhash.
c) pages that are not identical but do not change:
    this kind of page will always be clean, so we will calculate the jhash
    on them as before.

2) Nehalem cpus with SSE 4.2 have a crc instruction - the good: it is going to be faster; the bad: only Nehalem and later cpus have it. (Linux already has support for it.)

What do you think? Or am I thinking too much about the cpu cycles we are burning on the jhash?

Thanks.
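A minimal sketch of the "unstable dirty bit tracking" decision described above. The helper ptes_test_and_clear_dirty_no_flush() is hypothetical (it would walk every pte mapping the page via the rmap and test-and-clear its dirty bit without any TLB flush); calc_checksum() and rmap_item->oldchecksum are assumed to behave like the existing ksm checksum path:

	/* Sketch only - not the actual mm/ksm.c code. */
	static int page_stable_enough_for_unstable_tree(struct rmap_item *rmap_item,
							struct page *page)
	{
		u32 checksum;

		if (ptes_test_and_clear_dirty_no_flush(page)) {
			/*
			 * Some pte was dirty: the page changed since the last
			 * scan, so skip the unstable tree this cycle.  No
			 * checksum is calculated, which is where the jhash
			 * cycles are saved.
			 */
			return 0;
		}

		/*
		 * All ptes were clean, but the TLB was never flushed, so a
		 * recent write may not have set the dirty bit again: fall
		 * back to the checksum to be sure.
		 */
		checksum = calc_checksum(page);
		if (checksum != rmap_item->oldchecksum) {
			rmap_item->oldchecksum = checksum;
			return 0;
		}
		return 1;	/* unchanged for a full cycle: try the unstable tree */
	}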
* Re: improving checksum cpu consumption in ksm
From: Hugh Dickins @ 2009-08-31 22:49 UTC
To: Izik Eidus; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

On Fri, 28 Aug 2009, Izik Eidus wrote:
>
> As you know, we use a checksum (jhash) to detect whether a page is changing
> too often; if it is, we do not insert it into the unstable tree (so the tree
> does not become too unstable in the trivial cases).
>
> This is badly needed by ksm in some workloads - I have seen a production
> virtualization server with about 74 GB of RAM where, the way ksm was running
> there, it took about 6 minutes to finish one pass over memory. In such a case
> the hashing makes sure that we only insert pages that really have not changed
> (for about 6 minutes), and so protects our unstable tree; without it, if we
> inserted every page, we would end up with a really, really unstable tree that
> probably would not find much.
>
> (There is one case where we don't calculate the jhash - when we actually
> find two identical pages inside the stable tree.)
>
> So, given why we want to keep this hash/checksum, we should look at how to
> reduce the cpu cycles it consumes - jhash is very expensive. And not only
> does it take cpu cycles, it also dirties the cache by walking over the
> whole page...

Not dirtying the cache, I think, but... my mind's gone blank on the right word too... wasting it anyway. I wonder what proportion of the cost is the memory accesses and what proportion is the cost of the jhash algorithm. My guesses count for nothing: it's measurement you should be doing.

But the first thing to try (measure) would be Jozsef's patch, updating jhash.h from the 1996 Jenkins lookup2 to the 2006 Jenkins lookup3, which is supposed to be a considerable improvement from all angles.

See http://lkml.org/lkml/2009/2/12/65

I think that was the final version of it: I don't know what happened after that (beyond Eric questioning "deadbeef", but that won't matter in your testing) - perhaps it was unclear which maintainer should pick it up, and it then fell through the cracks?

Or, going the other way, how damaging would it be to unstable tree integrity to use a cheaper hashing algorithm? There's no perfection here: a page that's unchanged for one KSM cycle is in no way guaranteed to be unchanged for the next; that's merely a heuristic.

> For now I see two trivial ways to solve it (all we need to know is: which
> pages have changed):
> 1) Use the dirty bit of the page table entries pointing at the page:
> We can walk over the page tables and keep clearing the dirty bit - these are
> anonymous pages, so clearing the dirty bit should not matter to anyone.

Well, if you clear the pte dirty bit, you must be sure to transfer that to the page dirty bit, otherwise the anonymous page will be thrown away as a zero page (or a clean copy of a swap page) when memory is needed. And beware of s390, which dirties differently.

> With this we win two things: 1) no cpu cycles on the expensive jhash,
> and 2) no dirtying of the cache.
> The disadvantage is the tlb invalidation we need to issue for the specific
> tlb entries associated with the ptes whose dirty bit we are clearing: is
> that worse than dirtying the cache? Is it going to really hurt application
> performance?
> (Note that we would only flush specific tlb entries, not the entire tlb.)
> If this is going to hurt application performance we are better off not
> doing it - but what do you think about this?

I think three things. One, that I've no idea, and it's measurement you need.

Two, please keep in mind that it may not be the TLB flush itself that matters, but the IPI to do it on other CPUs: how much that costs depends on how much ksmd is running at the same time as threads of the apps making use of it.

Three, that if we go in this direction, might it be even better to make the unstable tree stable? i.e. write protect its contents so that we're sure a node cannot be poisoned with modified data. It may be immediately obvious to you that that's a terrible suggestion (all the overhead of the write faults), but it's not quite obvious to me yet. I expect a lot depends on the proportions of pages_shared, pages_sharing, pages_unshared, pages_volatile. But as I've said before, I've no feeling yet (or ever?) for the behaviour of the unstable tree: for example, I've no grasp of whether a large unstable tree is inherently less stable (more likely to be seriously poisoned) than a small one, or not.

> Taking this further, we could use "unstable dirty bit tracking". If we look
> at ksm workloads, we can split memory into three different kinds of pages:
> a) pages that are identical
> b) pages that are not identical and keep changing all the time
> c) pages that are not identical but do not change
>
> Given these three types of pages, suppose ksm used the following way to
> track pages that are changing:
>
> Each time ksm finds a page whose ptes are dirty:
>     ksm clears the dirty bits in the ptes (without invalidating the tlb
>     entries), and continues without inserting the page into the unstable
>     tree.
>
> Each time ksm finds a page whose ptes are clean:
>     ksm calculates the jhash to know whether the page has changed - this is
>     needed because we cleared the dirty bit but did not flush the tlb entry
>     pointing at the page, so we have to jhash to be sure whether the page
>     has changed.

Interesting. At first I thought that sounded like a worst of all worlds solution, but perhaps not: the proportions might make that a very sensible approach.

But playing with the pte dirty bit without flushing TLB is a dangerous game: you run the risk that MM will catch it at a moment when it looks clean and free it. We could, I suppose, change MM to assume that anon pages are dirty in VM_MERGEABLE areas; but it feels too early for the KSM tail to be wagging the MM dog in such a way.

> Now looking at the three different kinds of pages:
> a) pages that are identical:
>     would be found anyway when comparing them inside the stable tree
> b) pages that are not identical and keep changing all the time:
>     most likely they will appear dirty in the pte, even though the tlb entry
>     was not flushed by ksm; if they still are not dirty, the jhash check will
>     be run on them to know whether the page has changed.
>     This means that most of the time this optimization saves the jhash
>     calculation for this kind of page: when we see them dirty, we do not
>     need to calculate the jhash.
> c) pages that are not identical but do not change:
>     this kind of page will always be clean, so we will calculate the jhash
>     on them as before.
> 2) Nehalem cpus with SSE 4.2 have a crc instruction - the good: it is going
> to be faster; the bad: only Nehalem and later cpus have it. (Linux already
> has support for it.)

Sounds perfectly sensible to use hardware CRC where it's available; I expect other architectures have models which support CRC too.

Assuming the characteristics of the hardware CRC are superior to or close enough to the characteristics of the jhash - I'm entirely ignorant of CRCs and hashing matters, perhaps nobody would put a CRC into their hardware unless it was good enough (but good enough for what purpose?).

> What do you think? Or am I thinking too much about the cpu cycles we are
> burning on the jhash?

I do think KSM burns a lot of CPU; but whether it's the jhash or whether it's all the other stuff (page table walking, radix tree walking, memcmping) I've not looked.

Hugh
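For reference, the Nehalem instruction in question is CRC32 (part of SSE 4.2). A rough userspace sketch of checksumming one 4KB page with it via the compiler intrinsic (build with -msse4.2 on a 64-bit target); this only illustrates using the instruction as a page checksum, it is not proposed ksm code - in-kernel the equivalent would go through the crc32c library/crypto interface:

	#include <nmmintrin.h>
	#include <stdint.h>
	#include <stddef.h>

	static uint32_t page_crc32c(const void *page)
	{
		const uint64_t *p = page;
		uint64_t crc = ~0ULL;	/* conventional initial value */
		size_t i;

		/* fold the page into the CRC eight bytes at a time */
		for (i = 0; i < 4096 / sizeof(*p); i++)
			crc = _mm_crc32_u64(crc, p[i]);
		return (uint32_t)~crc;
	}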
* Re: improving checksum cpu consumption in ksm
From: Izik Eidus @ 2009-09-01 12:12 UTC
To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

Hugh Dickins wrote:
>
> But the first thing to try (measure) would be Jozsef's patch, updating
> jhash.h from the 1996 Jenkins lookup2 to the 2006 Jenkins lookup3,
> which is supposed to be a considerable improvement from all angles.
>
> See http://lkml.org/lkml/2009/2/12/65

I will check this one.

> Three, that if we go in this direction, might it be even better to make
> the unstable tree stable? i.e. write protect its contents so that we're
> sure a node cannot be poisoned with modified data. It may be
> immediately obvious to you that that's a terrible suggestion (all the
> overhead of the write faults), but it's not quite obvious to me yet.

That was one of the early thoughts about how to write the new ksm: check whether the page checksum has changed, and if it hasn't, write protect the page and insert it into the tree. The problem is: for every kvm guest that goes idle, all of its memory would become read-only, and in such cases the performance of kvm guests would suffer when they use ksm. (I believe it is better to burn more cpu cycles on the ksm side than to hurt its users.)

> I expect a lot depends on the proportions of pages_shared, pages_sharing,
> pages_unshared, pages_volatile. But as I've said before, I've no
> feeling yet (or ever?) for the behaviour of the unstable tree: for
> example, I've no grasp of whether a large unstable tree is inherently
> less stable (more likely to be seriously poisoned) than a small one,
> or not.

Yesterday I logged into a machine that had 94 GB of RAM for ksm to scan. It seems ksm performed very well there; the trick is this: the bigger the memory, the more time it takes ksm to finish a memory scanning loop - on such systems we only insert pages that have not changed for about 8 minutes... So from this side we are quite safe (and we have the stable tree exactly to cover the remaining weaknesses of the unstable tree; the only task of the unstable tree is to build the stable tree).

(The proportion is exactly the same for small and big systems: an unstable tree covering 1 GB of RAM has half as many pages that could get corrupted as one covering 2 GB of RAM, but each page inside the 2 GB tree has half the chance of getting corrupted, because its jhash content had to stay the same for twice as long before insertion.)

The worst case for a big unstable tree is a node near the root becoming invalid - the chances of that are about 1 in 2^n - 1, and this is exactly why we keep rebuilding it (for such bad luck).

Actually another nice thing to note: when a page is changed, only about 50% of the time does it become invalid in the tree, because in a sorted tree we only care whether "page 1" is bigger than "page 2"; so if "page 2" was bigger than "page 1", and then "page 2" changed but is still bigger than "page 1", the unstable tree is still valid...

So far I don't think I have found a workload where the old hash table version found more pages than the trees do.
>> Taking this further, we could use "unstable dirty bit tracking". If we look
>> at ksm workloads, we can split memory into three different kinds of pages:
>> a) pages that are identical
>> b) pages that are not identical and keep changing all the time
>> c) pages that are not identical but do not change
>>
>> Given these three types of pages, suppose ksm used the following way to
>> track pages that are changing:
>>
>> Each time ksm finds a page whose ptes are dirty:
>>     ksm clears the dirty bits in the ptes (without invalidating the tlb
>>     entries), and continues without inserting the page into the unstable
>>     tree.
>>
>> Each time ksm finds a page whose ptes are clean:
>>     ksm calculates the jhash to know whether the page has changed - this is
>>     needed because we cleared the dirty bit but did not flush the tlb entry
>>     pointing at the page, so we have to jhash to be sure whether the page
>>     has changed.
>
> Interesting. At first I thought that sounded like a worst of all
> worlds solution, but perhaps not: the proportions might make that
> a very sensible approach.
>
> But playing with the pte dirty bit without flushing TLB is a dangerous
> game: you run the risk that MM will catch it at a moment when it looks
> clean and free it. We could, I suppose, change MM to assume that anon
> pages are dirty in VM_MERGEABLE areas; but it feels too early for the
> KSM tail to be wagging the MM dog in such a way.

Agree.

>> b) pages that are not identical and keep changing all the time:
>>     most likely they will appear dirty in the pte, even though the tlb
>>     entry was not flushed by ksm; if they still are not dirty, the jhash
>>     check will be run on them to know whether the page has changed.
>>     This means that most of the time this optimization saves the jhash
>>     calculation for this kind of page: when we see them dirty, we do not
>>     need to calculate the jhash.
>> c) pages that are not identical but do not change:
>>     this kind of page will always be clean, so we will calculate the jhash
>>     on them as before.
>>
>> 2) Nehalem cpus with SSE 4.2 have a crc instruction - the good: it is going
>> to be faster; the bad: only Nehalem and later cpus have it. (Linux already
>> has support for it.)
>
> Sounds perfectly sensible to use hardware CRC where it's available;
> I expect other architectures have models which support CRC too.
>
> Assuming the characteristics of the hardware CRC are superior to
> or close enough to the characteristics of the jhash - I'm entirely
> ignorant of CRCs and hashing matters, perhaps nobody would put a
> CRC into their hardware unless it was good enough (but good enough
> for what purpose?).

I did test it with CRC; a simple naive test showed no difference in how unstable the unstable tree is. I will look at it after 2.6.32.

>> What do you think? Or am I thinking too much about the cpu cycles we are
>> burning on the jhash?
>
> I do think KSM burns a lot of CPU; but whether it's the jhash or
> whether it's all the other stuff (page table walking, radix tree
> walking, memcmping) I've not looked.

The jhash is just one of the problems, agreed; there is a way to make the tree memcmping faster and less cpu intensive as well, but I will keep that for another mail (not in the near future...).

Thanks for the response Hugh.
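For context on the "page 1 bigger than page 2" ordering discussed above: the unstable tree is kept sorted by raw page contents, with memcmp() deciding which way to descend. A simplified sketch, close to but not literally the mm/ksm.c of that time; get_tree_page() is a made-up helper standing in for the rmap_item bookkeeping:

	static int memcmp_pages(struct page *page1, struct page *page2)
	{
		char *addr1, *addr2;
		int ret;

		addr1 = kmap_atomic(page1, KM_USER0);
		addr2 = kmap_atomic(page2, KM_USER1);
		ret = memcmp(addr1, addr2, PAGE_SIZE);
		kunmap_atomic(addr2, KM_USER1);
		kunmap_atomic(addr1, KM_USER0);
		return ret;
	}

	static struct page *unstable_tree_search(struct rb_root *root,
						 struct page *page)
	{
		struct rb_node *node = root->rb_node;

		while (node) {
			struct page *tree_page = get_tree_page(node); /* hypothetical */
			int ret = memcmp_pages(page, tree_page);

			if (ret < 0)
				node = node->rb_left;
			else if (ret > 0)
				node = node->rb_right;
			else
				return tree_page;	/* byte-identical page found */
		}
		return NULL;
	}

A page that changes after insertion only misleads a lookup when the change flips the result of one of these comparisons, which is why a partially changed page leaves the tree usable roughly half the time.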
* Re: improving checksum cpu consumption in ksm
From: Izik Eidus @ 2009-09-03 12:36 UTC
To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

Hugh Dickins wrote:
>
> But the first thing to try (measure) would be Jozsef's patch, updating
> jhash.h from the 1996 Jenkins lookup2 to the 2006 Jenkins lookup3,
> which is supposed to be a considerable improvement from all angles.
>
> See http://lkml.org/lkml/2009/2/12/65

Hi,
I just did a small test of the new hash compared to the old.

Using the program below, I ran ksm (with nice -20) at
    sleep_millisecs = 1
    run = 1
    pages_to_scan = 9999

(The program is designed just to pressure the hash calculations and tree walking, and not to actually share any pages.)

Then I checked how many full scans ksm had reached (I just read /sys/kernel/mm/ksm/full_scans), and I got the following results:
    with the old jhash version ksm did 395 loops
    with the new jhash version ksm did 455 loops

That is a 15% improvement for this case, where we have pages that are static but not shareable... (And it will help in any case where we have a page that we are not merging into the stable tree.)

I think it is nice...

(I used an AMD Phenom(tm) II X3 720 Processor, but I probably didn't run the test for long enough; I should rerun it and see whether the results are consistent.)

(Another thing: I had a conflict with the JHASH_GOLDEN_RATIO define; I just added that define back and the kernel compiled.)

Thanks.

#include <malloc.h>
#include <stdio.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

int main()
{
	unsigned char *p, *tmp;
	unsigned char *p_end;

	/* allocate ~100MB, align down to a page boundary, and fill it with
	 * random bytes so that no two pages are identical */
	p = (unsigned char *) malloc(1024 * 1024 * 100 + 4096);
	if (!p) {
		printf("error\n");
	}

	p_end = p + 1024 * 1024 * 100;
	p = (unsigned char *)((unsigned long)p & ~4095);

	tmp = p;
	for (; tmp != p_end; ++tmp) {
		*tmp = (unsigned char)random();
	}

	/* 12 == MADV_MERGEABLE: hand the area to ksm */
	if (madvise(p, 1024 * 1024 * 100, 12) == -1) {
		perror("madvise");
	}

	sleep(60);
	return 0;
}
* Re: improving checksum cpu consumption in ksm
From: Hugh Dickins @ 2009-09-03 15:20 UTC
To: Izik Eidus; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

On Thu, 3 Sep 2009, Izik Eidus wrote:
>
> Hi,
> I just did a small test of the new hash compared to the old.
>
> Using the program below, I ran ksm (with nice -20) at
>     sleep_millisecs = 1

Better 0?

>     run = 1
>     pages_to_scan = 9999

Okay, the bigger the better.

> (The program is designed just to pressure the hash calculations and tree
> walking, and not to actually share any pages.)
>
> Then I checked how many full scans ksm had reached (I just read
> /sys/kernel/mm/ksm/full_scans), and I got the following results:
>     with the old jhash version ksm did 395 loops
>     with the new jhash version ksm did 455 loops

The first few loops will be settling down, need to subtract those.

> That is a 15% improvement for this case, where we have pages that are static
> but not shareable... (And it will help in any case where we have a page that
> we are not merging into the stable tree.)
>
> I think it is nice...

Yes, that's nice, thank you for looking into it.

But please do some more along these lines, if you've time? Presumably the improvement from Jenkins lookup2 to lookup3 is therefore more than 15%, but we cannot tell how much.

I think you need to do a run with a null version of jhash2(), one just returning 0 or 0xffffffff (the first would settle down a little quicker because oldchecksum 0 will match the first time; but there should be no difference once you cut out settling time).

And a run with an almost-null version of jhash2(), one which does also read the whole page sequentially into cache, so we can see how much is the processing and how much is the memory access.

And also, while you're about it, a run with cmp_and_merge_page() stubbed out, so we can see how much is just the page table walking (and deduce from that how much is the radix tree walking and memcmping).

Hmm, and a run to see how much is radix tree walking, by stubbing out the memcmping.

Sorry... if you (or someone else following) have the time!

> (I used an AMD Phenom(tm) II X3 720 Processor, but I probably didn't run the
> test for long enough; I should rerun it and see whether the results are
> consistent.)

Right, other processors will differ some(unknown)what, so we shouldn't take the numbers you find too seriously. But at this moment I've no idea of what proportion of time is spent on what: it should be helpful to see what dominates.

> 	p = (unsigned char *) malloc(1024 * 1024 * 100 + 4096);
> 	if (!p) {
> 		printf("error\n");
> 	}
>
> 	p_end = p + 1024 * 1024 * 100;
> 	p = (unsigned char *)((unsigned long)p & ~4095);

Doesn't matter to your results, so long as it didn't crash; but I think you meant to say

	p = (unsigned char *)(((unsigned long)p + 4095) & ~4095);
	p_end = p + 1024 * 1024 * 100;

Hugh
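The stubbed runs Hugh asks for could look roughly like this - throwaway hacks replacing ksm's checksum helper (calc_checksum() in mm/ksm.c, which wraps jhash2()), one variant per kernel build; these are illustrations of the methodology, not patches:

	/* Variant 1: null hash - removes both the hashing and the page reads,
	 * leaving page table and tree walking as the remaining cost. */
	static u32 calc_checksum(struct page *page)
	{
		return 0xffffffff;
	}

	/* Variant 2: read the whole page but do no real hashing - whatever
	 * checksum cost remains is the memory access, not the jhash arithmetic. */
	static u32 calc_checksum(struct page *page)
	{
		u32 *addr = kmap_atomic(page, KM_USER0);
		u32 sum = 0;
		int i;

		for (i = 0; i < PAGE_SIZE / sizeof(u32); i++)
			sum += addr[i];
		kunmap_atomic(addr, KM_USER0);
		return sum;
	}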
* Re: improving checksum cpu consumption in ksm
From: Izik Eidus @ 2009-09-03 15:42 UTC
To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, linux-mm

Hugh Dickins wrote:
> Yes, that's nice, thank you for looking into it.
>
> But please do some more along these lines, if you've time?
> Presumably the improvement from Jenkins lookup2 to lookup3
> is therefore more than 15%, but we cannot tell how much.
>
> I think you need to do a run with a null version of jhash2(),
> one just returning 0 or 0xffffffff (the first would settle down
> a little quicker because oldchecksum 0 will match the first time;
> but there should be no difference once you cut out settling time).
>
> And a run with an almost-null version of jhash2(), one which does
> also read the whole page sequentially into cache, so we can see
> how much is the processing and how much is the memory access.
>
> And also, while you're about it, a run with cmp_and_merge_page()
> stubbed out, so we can see how much is just the page table walking
> (and deduce from that how much is the radix tree walking and memcmping).
>
> Hmm, and a run to see how much is radix tree walking,
> by stubbing out the memcmping.
>
> Sorry... if you (or someone else following) have the time!

Good ideas. The tests I did were quick ones; I hope on Saturday (or maybe a few days after that) I will have more time to spend on this area, and I will try to measure how much each part is taking. I will keep benchmarking it in terms of "how many loops it accomplishes".

> Doesn't matter to your results, so long as it didn't crash;
> but I think you meant to say
>
> 	p = (unsigned char *)(((unsigned long)p + 4095) & ~4095);
> 	p_end = p + 1024 * 1024 * 100;

Yes...

Thanks.
* Re: improving checksum cpu consumption in ksm
From: Moussa Ba @ 2009-09-04 22:29 UTC
To: Hugh Dickins; +Cc: Izik Eidus, Andrea Arcangeli, Jozsef Kadlecsik, linux-mm, jaredeh

Just to add to the discussion, we have also seen high cpu usage for KSM. In our case, however, it is more serious, as the system that KSM is running on is battery powered with a weaker processor. With KSM constantly running, the effect on battery life is significant.

I like the idea of dirty bit tracking, as it would obviate the need to rehash once we know the page has not been dirtied. We have been working on a patch that adds dirty bit clearing from user space, similar to the clear_refs entry under /proc/pid/. In our instance we use this mechanism to measure page accesses and write frequency on anonymous pages, file backed pages, or both. Could this potentially pose a problem if KSM decides to use that mechanism for page state tracking?

Moussa.
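For readers unfamiliar with the interface Moussa refers to: the existing /proc/<pid>/clear_refs file resets the referenced bits, after which the "Referenced:" counts in /proc/<pid>/smaps reflect only fresh accesses; the dirty-bit variant discussed above is their out-of-tree patch and is not shown. A rough sketch of driving the existing interface:

	#include <stdio.h>
	#include <unistd.h>

	/* Reset the referenced bits of every page mapped by <pid>; the
	 * "Referenced:" values in /proc/<pid>/smaps then measure new accesses. */
	static void clear_refs(pid_t pid)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
		f = fopen(path, "w");
		if (f) {
			fputs("1\n", f);
			fclose(f);
		}
	}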
* Re: improving checksum cpu consumption in ksm
From: Hugh Dickins @ 2009-09-05 11:33 UTC
To: Moussa Ba; +Cc: Izik Eidus, Andrea Arcangeli, Jozsef Kadlecsik, linux-mm, jaredeh

On Fri, 4 Sep 2009, Moussa Ba wrote:
> Just to add to the discussion, we have also seen high cpu usage for
> KSM. In our case, however, it is more serious, as the system that KSM
> is running on is battery powered with a weaker processor. With KSM
> constantly running, the effect on battery life is significant.

Sounds like it would be a good idea for us to throttle back ksmd when the system is otherwise idle. Though quite a bit of thought should go into how we decide "idle" for that.

> I like the idea of dirty bit tracking, as it would obviate the need to
> rehash once we know the page has not been dirtied. We have been
> working on a patch that adds dirty bit clearing from user space,
> similar to the clear_refs entry under /proc/pid/. In our instance we
> use this mechanism to measure page accesses and write frequency on
> anonymous pages, file backed pages, or both. Could this potentially
> pose a problem if KSM decides to use that mechanism for page state
> tracking?

Yes, KSM's use of the bit would interfere with your statistics, and your use of the bit would interfere with KSM's efficiency: better not use them both together (or keep yours off MADV_MERGEABLE areas).

Hugh
* Re: improving checksum cpu consumption in ksm
From: Izik Eidus @ 2009-09-12 16:33 UTC
To: Hugh Dickins; +Cc: Andrea Arcangeli, Jozsef Kadlecsik, moussa ba, linux-mm

Hugh Dickins wrote:
> On Thu, 3 Sep 2009, Izik Eidus wrote:
>
> Yes, that's nice, thank you for looking into it.
>
> But please do some more along these lines, if you've time?
> Presumably the improvement from Jenkins lookup2 to lookup3
> is therefore more than 15%, but we cannot tell how much.
>
> I think you need to do a run with a null version of jhash2(),
> one just returning 0 or 0xffffffff (the first would settle down
> a little quicker because oldchecksum 0 will match the first time;
> but there should be no difference once you cut out settling time).
>
> And a run with an almost-null version of jhash2(), one which does
> also read the whole page sequentially into cache, so we can see
> how much is the processing and how much is the memory access.
>
> And also, while you're about it, a run with cmp_and_merge_page()
> stubbed out, so we can see how much is just the page table walking
> (and deduce from that how much is the radix tree walking and memcmping).
>
> Hmm, and a run to see how much is radix tree walking,
> by stubbing out the memcmping.
>
> Sorry... if you (or someone else following) have the time!

Ok, so with exactly the same program as before (just with the sleep() time raised from 60 to 120 seconds), and:

ksm nice = -20
echo 1 > /sys/kernel/mm/ksm/sleep_millisecs
echo 1 > /sys/kernel/mm/ksm/run
echo 0 > /sys/kernel/mm/ksm/max_kernel_pages
echo 99999 > /sys/kernel/mm/ksm/pages_to_scan

(I forgot to set sleep_millisecs to 0, but I think the results still give a good picture.)

Results:

Normal ksmd as found in mmotm:
    1st run: 789 full scans
    2nd run: 751 full scans
    3rd run: 790 full scans

No-hash version (jhash2() just returns 0xffffffff):
    1st run: 1873
    2nd run: 1888
    3rd run: 1874

No hash, but read all the memory (checking the memory access cost):
    1st run: 1364
    2nd run: 1363
    3rd run: 1251

Trivial checksum version - just add up the content of the page, 64 bits at a time, something like:

	u64 ret = 0;
	for (; p != p_end; ++p)
		ret += *p;

    1st run: 1362
    2nd run: 1250
    3rd run: 1250

cmp_and_merge_page() stubbed out:
    1st run: 10,000
    2nd run: 10,000
    3rd run: 15,017

(At this point I figured it was probably just spending its time sleeping, so I tried it with sleep_millisecs = 0 and got:)
    4th run: 29,987
    5th run: 45,012
(I didn't really look at what happens in the kernel when sleep_millisecs = 0, but these results seem irrelevant because it looks like we are not burning cpu here.)

memcmp_pages() always returning 1, without jhash:
(This version should really stress the unstable tree walking - it builds a tree of 25,600 pages (the application allocates 100MB), so the depth will be about 15, and for each page checked it walks all 15 levels of the unstable tree, because memcmp_pages() always returns 1 and so it keeps going right.)
    1st run: 1763
    2nd run: 1765
    3rd run: 1732

Looking at all these results, I tend to think the winner here is the dirty bit from a performance perspective; but because we don't want to deal with the problems of dirty bit tracking right now (especially that unstable dirty bit), we probably need to look at how much the jhash costs us. Comparing the memory access times to the jhash times, it does look like jhash is expensive. Since we just need to track pages that are not changing, I guess we can use some kind of checksum that is much less cpu intensive than jhash...?

(Btw, starting with merging the new jhash would be a good idea as well...)

Thanks - and by the way, did I miss any test that you might want?