* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Andi Kleen @ 2009-05-11 13:22 UTC
To: Stefan Lankes; +Cc: linux-kernel, Lee.Schermerhorn, linux-numa

Stefan Lankes <lankes@lfbs.rwth-aachen.de> writes:
>
> [Patch 1/4]: Extend the system call madvise with a new parameter
> MADV_ACCESS_LWP (the same as used in Solaris). The specified memory area

Linux does NUMA memory policies in mbind(), not madvise(). Also, if
there is a new NUMA policy, it should go into the standard Linux NUMA
memory policy framework, not invent a new one. [I find it amazing that
you apparently did so much work without being familiar with the
existing Linux NUMA policies.]

Your patches seem to have a lot of overlap with Lee Schermerhorn's old
migrate-memory-on-cpu-migration patches. I don't know the status of
those.

> [Patch 4/4]: This part of the patch adds some counters to detect migration
> errors and publishes these counters via /proc/vmstat. Besides this, the
> Kconfig file is extended with the parameter CONFIG_AFFINITY_ON_NEXT_TOUCH.
>
> With this patch, the kernel reduces the overhead of page distribution via
> "affinity-on-next-touch" from 2518ms to 366ms compared to the user-level

The interesting part is less how much faster it is compared to a user
space implementation, but how much this migrate-on-touch approach helps
in general compared to the already existing policies. Some hard numbers
on that would be appreciated.

Note that for the OpenMP case, old kernels sometimes had trouble
because the threads tended not to be scheduled to their final target
CPU during the first time slice, so the memory was often first-touched
on the wrong node. Later kernels avoided that by moving the threads
earlier and more aggressively. This almost sounds like a workaround for
that (I hope it's more than that). If you present any benchmark, make
sure the kernel you're benchmarking against does not have this issue.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Brice Goglin @ 2009-05-11 13:32 UTC
To: Andi Kleen; +Cc: Stefan Lankes, linux-kernel, Lee.Schermerhorn, linux-numa

Andi Kleen wrote:
> Stefan Lankes <lankes@lfbs.rwth-aachen.de> writes:
>
>> [Patch 1/4]: Extend the system call madvise with a new parameter
>> MADV_ACCESS_LWP (the same as used in Solaris). The specified memory area
>
> Linux does NUMA memory policies in mbind(), not madvise().
> Also, if there is a new NUMA policy, it should go into the standard
> Linux NUMA memory policy framework, not invent a new one.

Marking a buffer as "migrate-on-next-touch" is very different from
setting a NUMA policy. Migrate-on-next-touch is a temporary flag that
is cleared on the next touch. And it's cleared per page, not per area.
So marking a VMA as "migrate-on-next-touch" doesn't make much sense,
since some pages could already have been migrated (and brought back to
their usual state) while others are still marked migrate-on-next-touch.

Brice
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-05-11 14:54 UTC
To: 'Andi Kleen'
Cc: linux-kernel, Lee.Schermerhorn, linux-numa, brice.goglin,
  'Terboven, Christian', anmey, 'Boris Bierbaum'

> From: Andi Kleen [mailto:andi@firstfloor.org]
>
> Stefan Lankes <lankes@lfbs.rwth-aachen.de> writes:
> >
> > [Patch 1/4]: Extend the system call madvise with a new parameter
> > MADV_ACCESS_LWP (the same as used in Solaris). The specified memory
> > area
>
> Linux does NUMA memory policies in mbind(), not madvise().
> Also, if there is a new NUMA policy, it should go into the standard
> Linux NUMA memory policy framework, not invent a new one.

By default, mbind only has an effect on new allocations. I think that
this is different from what we need for applications with dynamic
memory access patterns. The app gives the kernel a hint that the access
pattern has changed, and the kernel has to redistribute the pages which
are already allocated.

> > [Patch 4/4]: This part of the patch adds some counters to detect
> > migration errors and publishes these counters via /proc/vmstat.
> > Besides this, the Kconfig file is extended with the parameter
> > CONFIG_AFFINITY_ON_NEXT_TOUCH.
> >
> > With this patch, the kernel reduces the overhead of page distribution
> > via "affinity-on-next-touch" from 2518ms to 366ms compared to the
> > user-level
>
> The interesting part is less how much faster it is compared to a user
> space implementation, but how much this migrate-on-touch approach
> helps in general compared to the already existing policies. Some hard
> numbers on that would be appreciated.
>
> Note that for the OpenMP case, old kernels sometimes had trouble
> because the threads tended not to be scheduled to their final target
> CPU during the first time slice, so the memory was often first-touched
> on the wrong node. Later kernels avoided that by moving the threads
> earlier and more aggressively.

"Affinity-on-next-touch" is not a data distribution strategy for
applications with a static access pattern. When the access pattern
changes, the application can trigger the "affinity-on-next-touch"
mechanism, and afterwards the kernel redistributes the pages.

For instance, Norden's PDE solver using adaptive mesh refinement (AMR)
[1] is an application with a dynamic access pattern. We use this
example to evaluate the performance of our patch. We ran this solver
on our quad-socket, dual-core Opteron 875 (2.2GHz) system running
CentOS 5.2. The code was already optimized for NUMA architectures:
before the arrays are initialized, the threads are bound to one core
each. In our test case, the solver needs 5318s. With our kernel
extension, the solver needs 4489s. Currently, we are testing some
other apps.

Stefan

[1] Norden, M., Löf, H., Rantakokko, J., Holmgren, S.: Geographical
Locality and Dynamic Data Migration for OpenMP Implementations of
Adaptive PDE Solvers. In: Proceedings of the 2nd International Workshop
on OpenMP (IWOMP), Reims, France (June 2006) 382-393
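For concreteness, a minimal sketch of how the RFC's hint would be used,
in the spirit of the Solaris interface. MADV_ACCESS_LWP is defined only
by the proposed patches, so the fallback value below is a placeholder
assumption, not a real mainline kernel constant:

/*
 * Sketch of the next-touch hint proposed in this RFC.
 * MADV_ACCESS_LWP does not exist in mainline kernels; the value
 * below is a placeholder assumption for illustration only.
 */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_ACCESS_LWP
#define MADV_ACCESS_LWP 15   /* hypothetical value from the RFC patches */
#endif

int main(void)
{
    size_t len = 64 * 1024 * 1024;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    memset(buf, 0, len);   /* first touch: pages land near this thread */

    /* The access pattern changes; hint that each page should follow
     * whichever thread touches it next.  Worker threads then pull the
     * pages to their own nodes simply by touching them. */
    madvise(buf, len, MADV_ACCESS_LWP);
    return 0;
}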
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Andi Kleen @ 2009-05-11 16:37 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, Lee.Schermerhorn, linux-numa,
  brice.goglin, 'Terboven, Christian', anmey, 'Boris Bierbaum'

On Mon, May 11, 2009 at 04:54:40PM +0200, Stefan Lankes wrote:
> > From: Andi Kleen [mailto:andi@firstfloor.org]
> >
> > Linux does NUMA memory policies in mbind(), not madvise().
> > Also, if there is a new NUMA policy, it should go into the standard
> > Linux NUMA memory policy framework, not invent a new one.
>
> By default, mbind only has an effect on new allocations. I think that this

Nope, it affects existing pages too; it can even move pages
if you ask for it.

> is different from what we need for applications with dynamic memory access
> patterns. The app gives the kernel a hint that the access pattern has
> changed, and the kernel has to redistribute the pages which are already
> allocated.

MPOL_MF_MOVE

> For instance, Norden's PDE solver using adaptive mesh refinement (AMR) [1]
> is an application with a dynamic access pattern. We use this example to
> evaluate the performance of our patch. We ran this solver on our
> quad-socket, dual-core Opteron 875 (2.2GHz) system running CentOS 5.2.
> The code was already optimized for NUMA architectures: before the arrays
> are initialized, the threads are bound to one core each. In our test
> case, the solver needs 5318s. With our kernel extension, the solver
> needs 4489s.

Okay, those sound like good numbers.

> Currently, we are testing some other apps.

Please keep the list updated.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
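For reference, a minimal sketch of Andi's point using only mainline
interfaces: mbind() with MPOL_MF_MOVE re-binds a range and migrates the
pages that already populate it. Link with -lnuma; the node number and
buffer size are arbitrary examples:

/*
 * mbind() acting on already-populated memory: MPOL_MF_MOVE migrates
 * existing pages that violate the new policy (here: bind to node 1).
 * Error handling trimmed for brevity.
 */
#include <numaif.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 1024 * 1024;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(buf, 0, len);               /* pages are allocated somewhere */

    unsigned long nodemask = 1UL << 1; /* node 1 */
    /* Re-bind the range and move the pages that are already there. */
    mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
          MPOL_MF_MOVE);
    return 0;
}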
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-05-11 17:22 UTC
To: 'Andi Kleen'
Cc: linux-kernel, Lee.Schermerhorn, linux-numa, brice.goglin,
  'Terboven, Christian', anmey, Boris Bierbaum

> -----Original Message-----
> From: Andi Kleen [mailto:andi@firstfloor.org]
> Sent: Monday, May 11, 2009 6:37 PM
>
> > By default, mbind only has an effect on new allocations. I think
> > that this
>
> Nope, it affects existing pages too; it can even move pages
> if you ask for it.

I know about this possibility. I just thought that
"affinity-on-next-touch" fits better with madvise. Brice has already
given the technical reasons for preferring madvise.

> > In our test case, the solver needs 5318s. With our kernel extension,
> > the solver needs 4489s.
>
> Okay, those sound like good numbers.
>
> > Currently, we are testing some other apps.
>
> Please keep the list updated.

I will do that.

Stefan
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-06-11 18:45 UTC
To: 'Andi Kleen'
Cc: linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum, 'Brice Goglin'

> Your patches seem to have a lot of overlap with
> Lee Schermerhorn's old migrate-memory-on-cpu-migration patches.
> I don't know the status of those.

I analyzed Lee Schermerhorn's migrate-memory-on-cpu-migration patches
(http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that
Lee Schermerhorn adds similar functionality to the kernel. He calls the
"affinity-on-next-touch" functionality "migrate_on_fault", and his
patches use the normal NUMA memory policies. Therefore, his solution
fits the Linux kernel better. I tested his patches with our test
applications and got nearly the same performance results.

I only found patches for kernel 2.6.25-rc2-mm1. Is anyone developing
these patches further?

Stefan
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Andi Kleen @ 2009-06-12 10:32 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, Lee.Schermerhorn, linux-numa,
  Boris Bierbaum, 'Brice Goglin'

On Thu, Jun 11, 2009 at 08:45:29PM +0200, Stefan Lankes wrote:
> I analyzed Lee Schermerhorn's migrate-memory-on-cpu-migration patches
> (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that
> Lee Schermerhorn adds similar functionality to the kernel. He calls the
> "affinity-on-next-touch" functionality "migrate_on_fault", and his
> patches use the normal NUMA memory policies. Therefore, his solution
> fits the Linux kernel better. I tested his patches with our test
> applications and got nearly the same performance results.

That's great to know.

I didn't think he had a per-process setting though, did he?

> I only found patches for kernel 2.6.25-rc2-mm1. Is anyone developing
> these patches further?

Not to my knowledge. Maybe Lee will pick them up again now that there
are more use cases.

If he doesn't have time, maybe you could update them?

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-06-12 11:46 UTC
To: 'Andi Kleen'
Cc: linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum, 'Brice Goglin'

> > I tested his patches with our test applications and got nearly the
> > same performance results.
>
> That's great to know.
>
> I didn't think he had a per-process setting though, did he?

He enables support for migration-on-fault via cpusets (echo 1 >
/dev/cpuset/migrate_on_fault). Afterwards, every process can initiate
migration-on-fault via mbind(..., MPOL_MF_MOVE|MPOL_MF_LAZY), as
sketched below.

> > I only found patches for kernel 2.6.25-rc2-mm1. Is anyone developing
> > these patches further?
>
> Not to my knowledge. Maybe Lee will pick them up again now that there
> are more use cases.
>
> If he doesn't have time, maybe you could update them?

We are planning to work in this area. I think I could update these
patches; at the very least, I am able to support Lee.

Stefan
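A sketch of that call sequence, assuming a kernel carrying Lee's patch
series. MPOL_MF_LAZY is not a mainline flag at this point, so the
fallback definition below is an assumption, as is the use of
MPOL_DEFAULT (Lee's memtoy examples later in this thread pass a
preferred node instead):

/*
 * Lazy ("next-touch") migration with Lee's migrate-on-fault patches.
 * Prerequisite, from the patch series' per-cpuset switch:
 *     echo 1 > /dev/cpuset/migrate_on_fault
 * MPOL_MF_LAZY is defined by those patches, not by mainline kernels
 * of this era; the fallback value here is an assumption.
 */
#include <numaif.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY (1 << 3)   /* assumed value from the patch series */
#endif

/* Arm [addr, addr+len) for lazy migration: mbind() unmaps the pages
 * now, and each page migrates individually when it is next touched. */
static int mark_next_touch(void *addr, unsigned long len)
{
    /* Keep the existing policy; just request unmap + lazy migration.
     * Whether MPOL_DEFAULT is accepted here depends on the patches. */
    return mbind(addr, len, MPOL_DEFAULT, NULL, 0,
                 MPOL_MF_MOVE | MPOL_MF_LAZY);
}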
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Brice Goglin @ 2009-06-12 12:30 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum

Stefan Lankes wrote:
> He enables support for migration-on-fault via cpusets (echo 1 >
> /dev/cpuset/migrate_on_fault).
> Afterwards, every process can initiate migration-on-fault via
> mbind(..., MPOL_MF_MOVE|MPOL_MF_LAZY).

So mbind(MPOL_MF_LAZY) is taking care of changing page protection so as
to generate page faults on next touch (instead of your madvise)?
Is it migrating the whole memory area, or only single pages?

Then, what's happening with MPOL_MF_LAZY in the kernel? Is it actually
stored in the mempolicy? If so, couldn't another fault later cause
another migration? Or is MPOL_MF_LAZY filtered out of the policy once
the protection of all PTEs has been changed?

I don't see why we need a new mempolicy here. If we are migrating
single pages, migrate-on-next-touch looks like a page attribute to me.
There should be nothing to store in a mempolicy/VMA/whatever.

Brice
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-06-12 13:21 UTC
To: 'Brice Goglin'
Cc: 'Andi Kleen', linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum

> So mbind(MPOL_MF_LAZY) is taking care of changing page protection so as
> to generate page faults on next touch (instead of your madvise)?
> Is it migrating the whole memory area, or only single pages?

mbind removes the PTE references. Page migration occurs when a task
accesses one of these unmapped pages. Therefore, Lee's solution
migrates single pages, not the whole area. You can find further
information on slides 19-23 of
http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/197.pdf.

> Then, what's happening with MPOL_MF_LAZY in the kernel? Is it actually
> stored in the mempolicy? If so, couldn't another fault later cause
> another migration?
> Or is MPOL_MF_LAZY filtered out of the policy once the protection of
> all PTEs has been changed?
>
> I don't see why we need a new mempolicy here. If we are migrating
> single pages, migrate-on-next-touch looks like a page attribute to me.
> There should be nothing to store in a mempolicy/VMA/whatever.

MPOL_MF_LAZY is used as a flag and does not specify a new policy.
Therefore, MPOL_MF_LAZY isn't stored in a VMA. The flag is only used to
tell the mbind system call that it has to unmap these pages.

Stefan
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-06-12 13:48 UTC
To: 'Brice Goglin'
Cc: 'Andi Kleen', linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum

> MPOL_MF_LAZY is used as a flag and does not specify a new policy.
> Therefore, MPOL_MF_LAZY isn't stored in a VMA. The flag is only used to
> tell the mbind system call that it has to unmap these pages.

In one of my previous e-mails, I wrote that Lee uses the normal NUMA
memory policies. That statement was unclear: Lee's patches define no
new memory policy, they only add a new flag. Using mbind for this looks
smarter to me than using madvise.

Stefan
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-16 2:39 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

On Fri, 2009-06-12 at 13:46 +0200, Stefan Lankes wrote:
> He enables support for migration-on-fault via cpusets (echo 1 >
> /dev/cpuset/migrate_on_fault).
> Afterwards, every process can initiate migration-on-fault via
> mbind(..., MPOL_MF_MOVE|MPOL_MF_LAZY).

I should have read through the entire thread before responding to
Andi's mail.

> > If he doesn't have time, maybe you could update them?
>
> We are planning to work in this area. I think I could update these
> patches; at the very least, I am able to support Lee.

I would like to get these patches working with the latest mmotm to test
on some newer hardware where I think they will help more, and I would
welcome your support. However, I think we'll need to get Balbir and
Kamezawa-san involved to sort out the interaction with the memory
control group.

I can send you the more recent rebase that I've done. This is getting
pretty old now: 2.6.28-rc4-mmotm-081110-081117. I'll try to rebase to
the most recent mmotm [that boots on my platforms], at least so that we
can build and boot with migrate-on-fault disabled, within the next
couple of weeks.

Regards,
Lee
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-06-16 13:58 UTC
To: 'Lee Schermerhorn'
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

> I would like to get these patches working with the latest mmotm to test
> on some newer hardware where I think they will help more, and I would
> welcome your support. However, I think we'll need to get Balbir and
> Kamezawa-san involved to sort out the interaction with the memory
> control group.
>
> I can send you the more recent rebase that I've done. This is getting
> pretty old now: 2.6.28-rc4-mmotm-081110-081117. I'll try to rebase to
> the most recent mmotm [that boots on my platforms], at least so that we
> can build and boot with migrate-on-fault disabled, within the next
> couple of weeks.

Sounds good! Send me your latest version and I will try to reproduce
your problems. Afterwards, we can try to solve them.

Regards,
Stefan
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-16 14:59 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

On Tue, 2009-06-16 at 15:58 +0200, Stefan Lankes wrote:
> Sounds good! Send me your latest version and I will try to reproduce
> your problems. Afterwards, we can try to solve them.

Stefan:

I've placed the last rebased version in:

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/

As I recall, this version DOES bug out because of reference count
problems due to interaction with the memory controller.

As I said, I'll try to rebase to the latest mmotm real soon. I need to
test whether it will boot on my test systems first.

Lee
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: KAMEZAWA Hiroyuki @ 2009-06-17 1:22 UTC
To: Lee Schermerhorn
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
  Boris Bierbaum, 'Brice Goglin'

On Tue, 16 Jun 2009 10:59:55 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> I've placed the last rebased version in:
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/
>
> As I recall, this version DOES bug out because of reference count
> problems due to interaction with the memory controller.

Please report precisely if memcg has a bug. An example test case is
welcome.

Thanks,
-Kame
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-17 12:02 UTC
To: KAMEZAWA Hiroyuki
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
  Boris Bierbaum, 'Brice Goglin'

On Wed, 2009-06-17 at 10:22 +0900, KAMEZAWA Hiroyuki wrote:
> Please report precisely if memcg has a bug. An example test case is
> welcome.

Not a memcg bug. Just an implementation choice [two-phase migration
handling: start/end calls] that is problematic for "lazy" page
migration--i.e., "migration-on-touch" in the fault path.

I'd be interested in your opinion on the feasibility of transferring
the "charge" against the page--including the "try charge" from
do_swap_page()--in migrate_page_copy() along with other page state.

I'll try to rebase my lazy migration series to a recent mmotm [as time
permits] in the "near future" and gather more info on the problem I was
having.

Regards,
Lee
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-06-17 7:45 UTC
To: 'Lee Schermerhorn'
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

> I've placed the last rebased version in:
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/

OK! I will try to reproduce the problem.

Stefan
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-18 4:37 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum,
  'Brice Goglin', KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > I've placed the last rebased version in:
> >
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/
>
> OK! I will try to reproduce the problem.

Stefan:

Today I rebased the migrate-on-fault patches to 2.6.30-mmotm-090612...
[along with my shared policy series, atop which they sit in my tree].
The patches reside in:

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/

I did a quick test. I'm afraid the patches have suffered some "bit rot"
vis a vis mainline/mmotm over the past several months. Two possibly
related issues:

1) Lazy migration doesn't seem to work. It looks like
mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
pages, so, of course, migrate-on-fault won't work. I suspect the
reference count handling has changed since I last tried this. [Note:
one of the patch conflicts was in the MPOL_MF_LAZY addition to the
mbind flag definitions in mempolicy.h, and I may have botched the
resolution thereof.]

2) When the pages get freed on exit/unmap, they are still PageLocked(),
and free_pages_check()/bad_page() bugs out with bad page state.

Note: this is independent of memcg--i.e., it happens whether or not
memcg is configured.

To test this, I created a test cpuset with all nodes/mems/cpus and
enabled migrate_on_fault therein. I then ran an interactive "memtoy"
session there [shown below]. Memtoy is a program I use for ad hoc
testing of various mm features. You can find the latest version [almost
always] at:

http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

You'll need the numactl-devel package to build--an older one with the
V1 API, I think. I need to upgrade memtoy to the latest libnuma. The
same directory [Tools] contains a tarball of simple cpuset scripts to
make, query, modify, "enter" and run commands in cpusets. There may be
other versions of such scripts around. If you don't already have any,
feel free to grab them.

Since you've expressed interest in this [as has Kamezawa-san], I'll try
to pay some attention to debugging the patches in my copious spare
time. And I'd be very interested in anything you discover in your
investigations.

Regards,
Lee

Memtoy-0.19c [for latest MPOL_MF flag defs]; "!!!" lines are my
annotations:

memtoy pid:  4222
memtoy>mems
  mems allowed = 0-3
  mems policy  = 0-3
memtoy>cpus
  cpu affinity mask/ids: 0-7
memtoy>anon a1 8p
memtoy>map a1
memtoy>mbind a1 pref 1
memtoy>touch a1 w
memtoy: touched 8 pages in 0.000 secs
memtoy>where a1
  a 0x00007f51ae757000 0x000000008000 0x000000000000 rw- default a1
  page offset  +00 +01 +02 +03 +04 +05 +06 +07
     0:          1   1   1   1   1   1   1   1
memtoy>mbind a1 pref+move 2
memtoy: migration of a1 [8 pages] took 0.000 secs.
memtoy>where a1
  a 0x00007f51ae757000 0x000000008000 0x000000000000 rw- default a1
  page offset  +00 +01 +02 +03 +04 +05 +06 +07
     0:          2   2   2   2   2   2   2   2

!!! direct migration [still] works!  Try lazy:

memtoy>mbind a1 pref+move+lazy 3
memtoy: unmap of a1 [8 pages] took 0.000 secs.
memtoy>where a1

!!! The "where" command uses get_mempolicy() w/ MPOL_F_NODE|MPOL_F_ADDR
!!! flags to fetch page locations. It will call get_user_pages() and
!!! refault the pages. They should migrate to node 3, but:

  a 0x00007f51ae757000 0x000000008000 0x000000000000 rw- default a1
  page offset  +00 +01 +02 +03 +04 +05 +06 +07
     0:          2   2   2   2   2   2   2   2

!!! didn't move

memtoy>exit

On the console I see, for each of the 8 pages of segment a1:

BUG: Bad page state in process memtoy  pfn:67515f
page:ffffea001699ccc8 flags:0a0000000010001d count:0 mapcount:0
mapping:(null) index:7f51ae75e
Pid: 4222, comm: memtoy Not tainted 2.6.30-mmotm-090612-1220+spol+lpm #6
Call Trace:
 [<ffffffff810a787a>] bad_page+0xaa/0x130
 [<ffffffff810a8719>] free_hot_cold_page+0x199/0x1d0
 [<ffffffff810a8774>] __pagevec_free+0x24/0x30
 [<ffffffff810ac96a>] release_pages+0x1ca/0x210
 [<ffffffff810c8b7d>] free_pages_and_swap_cache+0x8d/0xb0
 [<ffffffff810c0505>] exit_mmap+0x145/0x160
 [<ffffffff81044177>] mmput+0x47/0xa0
 [<ffffffff81048854>] exit_mm+0xf4/0x130
 [<ffffffff81049c58>] do_exit+0x188/0x810
 [<ffffffff81337194>] ? do_page_fault+0x184/0x310
 [<ffffffff8104a31e>] do_group_exit+0x3e/0xa0
 [<ffffffff8104a392>] sys_exit_group+0x12/0x20
 [<ffffffff8100bd2b>] system_call_fastpath+0x16/0x1b

Page flags 0x10001d: locked, referenced, uptodate, dirty, swapbacked.
'locked' is the bad state.
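For readers reproducing this, here is a minimal sketch of the per-page
location query behind memtoy's "where" command, using only the mainline
get_mempolicy() interface. The helper name is made up for illustration;
link with -lnuma:

/*
 * Per-page location query, as memtoy's "where" does it:
 * MPOL_F_NODE|MPOL_F_ADDR makes get_mempolicy() return the node of
 * the page backing the given address (faulting it in if necessary,
 * which is exactly what triggers the lazy migration under test).
 */
#include <numaif.h>
#include <stdio.h>
#include <unistd.h>

static void show_page_nodes(char *base, unsigned long pages)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    for (unsigned long i = 0; i < pages; i++) {
        int node = -1;
        if (get_mempolicy(&node, NULL, 0, base + i * pagesize,
                          MPOL_F_NODE | MPOL_F_ADDR) == 0)
            printf("page %lu -> node %d\n", i, node);
    }
}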
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-18 19:04 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum,
  'Brice Goglin', KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> Today I rebased the migrate-on-fault patches to 2.6.30-mmotm-090612...
> [along with my shared policy series, atop which they sit in my tree].
> The patches reside in:
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/

I have updated the migrate-on-fault tarball in the above location to
fix part of the problems I was seeing. See below.

> I did a quick test. I'm afraid the patches have suffered some "bit rot"
> vis a vis mainline/mmotm over the past several months. Two possibly
> related issues:
>
> 1) Lazy migration doesn't seem to work. It looks like
> mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> pages, so, of course, migrate-on-fault won't work.
>
> 2) When the pages get freed on exit/unmap, they are still PageLocked(),
> and free_pages_check()/bad_page() bugs out with bad page state.
>
> Note: this is independent of memcg--i.e., it happens whether or not
> memcg is configured.

OK. Found time to look at this. It turns out I hadn't tested since
trylock_page() was introduced. I did a one-for-one replacement of the
old API [TestSetPageLocked()], not noticing that the sense of the
return value is inverted. Thus, I was bailing out of the
migrate_pages_unmap_only() loop with the page locked, thinking someone
else had locked it and would take care of it. Since the page wasn't
unmapped from the page table[s], of course it wouldn't migrate on
fault--it wouldn't even fault!

Fixed this.

Now: lazy migration works w/ or w/o memcg configured, but NOT with the
swap resource controller configured. I'll look at that as time permits.

Lee
* RE: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-19 15:26 UTC
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum,
  'Brice Goglin', KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
> OK. Found time to look at this. It turns out I hadn't tested since
> trylock_page() was introduced. I did a one-for-one replacement of the
> old API [TestSetPageLocked()], not noticing that the sense of the
> return value is inverted. Thus, I was bailing out of the
> migrate_pages_unmap_only() loop with the page locked, thinking someone
> else had locked it and would take care of it.
>
> Fixed this.
>
> Now: lazy migration works w/ or w/o memcg configured, but NOT with the
> swap resource controller configured. I'll look at that as time permits.

Update: I now can't reproduce the lazy migration failure with the swap
resource controller configured. Perhaps I had booted the wrong kernel
for the test reported above. Now the updated patch series mentioned
above seems to be working, with both the memory and swap resource
controllers configured, for simple memtoy-driven lazy migration.

Lee
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Balbir Singh @ 2009-06-19 15:41 UTC
To: Lee Schermerhorn
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
  Boris Bierbaum, 'Brice Goglin', KAMEZAWA Hiroyuki, KOSAKI Motohiro

* Lee Schermerhorn <Lee.Schermerhorn@hp.com> [2009-06-19 11:26:53]:
> Update: I now can't reproduce the lazy migration failure with the swap
> resource controller configured. Perhaps I had booted the wrong kernel
> for the test reported above. Now the updated patch series mentioned
> above seems to be working, with both the memory and swap resource
> controllers configured, for simple memtoy-driven lazy migration.

Excellent. I presume that you are using the latest mmotm or mainline.
We've had some swap cache leakage fixes go in, but those leaks are not
as serious (they can potentially cause OOM in a cgroup when they
occur).

--
	Balbir
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-19 15:59 UTC
To: balbir
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
  Boris Bierbaum, 'Brice Goglin', KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 2009-06-19 at 21:11 +0530, Balbir Singh wrote:
> Excellent. I presume that you are using the latest mmotm or mainline.
> We've had some swap cache leakage fixes go in, but those leaks are not
> as serious (they can potentially cause OOM in a cgroup when they
> occur).

Yes, I'm using the 12 Jun mmotm atop 2.6.30. I use the mmotm timestamp
in my kernel versions to show the base I'm using. E.g., see the URL
above.

Lee
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Stefan Lankes @ 2009-06-19 21:19 UTC
To: Lee Schermerhorn
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum,
  'Brice Goglin', KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
> Update: I now can't reproduce the lazy migration failure with the swap
> resource controller configured. Perhaps I had booted the wrong kernel
> for the test reported above. Now the updated patch series mentioned
> above seems to be working, with both the memory and swap resource
> controllers configured, for simple memtoy-driven lazy migration.

The current version of your patch works fine on my system. I tested the
patches with our test applications and got very good performance
results!

Stefan
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Brice Goglin @ 2009-06-22 12:34 UTC
To: Lee Schermerhorn
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
  Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
> Today I rebased the migrate-on-fault patches to 2.6.30-mmotm-090612...
> [along with my shared policy series, atop which they sit in my tree].
> The patches reside in:
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/

I gave this patchset a try and indeed it seems to work fine, thanks a
lot. But the migration performance isn't very good. I am seeing about
540MB/s when doing mbind+touch_all_pages on large buffers on a
quad-Barcelona machine. move_pages gets 640MB/s there, and my own
next-touch implementation was near 800MB/s in the past.

I wonder if there is a more general migration performance degradation
in the latest Linus git. move_pages performance was supposed to
increase by 15% (to more than 700MB/s) thanks to commit dfa33d45, but I
don't seem to see the improvement with git or mmotm. migrate_pages
performance also seems to have decreased, but that might predate
2.6.30. I need to find some time to git-bisect all this; otherwise it's
hard to compare the performance of your migrate-on-fault with other,
older implementations :)

When do you plan to actually submit all your patches for inclusion?

thanks,
Brice
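For context, a sketch of the kind of microbenchmark behind these MB/s
figures: first-touch a large buffer, move_pages() it to another node,
and divide bytes moved by elapsed time. The node numbers and buffer
size are assumptions for a machine with at least two NUMA nodes; error
handling is trimmed, and the program links with -lnuma:

/*
 * Rough move_pages() throughput measurement.  Pages are first-touched
 * near the running thread, then moved wholesale to another node.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    unsigned long npages = 64UL * 1024;          /* 256MB at 4K pages */
    long psz = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, npages * psz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void **pages = malloc(npages * sizeof(void *));
    int *nodes = malloc(npages * sizeof(int));
    int *status = malloc(npages * sizeof(int));

    memset(buf, 0, npages * psz);    /* first touch: pages land near us */
    for (unsigned long i = 0; i < npages; i++) {
        pages[i] = buf + i * psz;
        nodes[i] = 1;                /* assumed target node */
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) +
                  (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f MB/s\n", npages * psz / secs / 1e6);
    return 0;
}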
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Lee Schermerhorn @ 2009-06-22 14:24 UTC
To: Brice Goglin
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
  Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 14:34 +0200, Brice Goglin wrote:
> I gave this patchset a try and indeed it seems to work fine, thanks a
> lot. But the migration performance isn't very good. I am seeing about
> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> quad-Barcelona machine. move_pages gets 640MB/s there, and my own
> next-touch implementation was near 800MB/s in the past.

Interesting. Do you have any idea where the differences come from? Are
you comparing them on the same kernel versions? I don't know the
details of your implementation, but one possible area is the check for
"misplacement". When migrate-on-fault is enabled, I check all pages
with page_mapcount() == 0 for misplacement in the [swap page] fault
path. That, and other filtering to eliminate unnecessary migrations,
could cause extra overhead.

Aside: currently, my implementation can migrate a page only to find
that it will be replaced by a new page due to copy-on-write. I have it
on my list to check write access and whether we can reuse the swap page
and avoid the migration if we're going to COW later anyway. This could
improve performance for write accesses, if the snoop traffic doesn't
overshadow any such improvement.

> I wonder if there is a more general migration performance degradation
> in the latest Linus git. move_pages performance was supposed to
> increase by 15% (to more than 700MB/s) thanks to commit dfa33d45, but I
> don't seem to see the improvement with git or mmotm. migrate_pages
> performance also seems to have decreased, but that might predate
> 2.6.30. I need to find some time to git-bisect all this; otherwise it's
> hard to compare the performance of your migrate-on-fault with other,
> older implementations :)

Confession: I've not measured migration performance directly. Rather,
I've only observed how applications/benchmarks perform with
migrate-on-fault+automigration enabled. On the platforms available to
me back when I was actively working on this, I did see improvements in
real and user time due to improved locality, especially under heavy
load when interconnect bandwidth is at a premium. Of course, system
time increased because of the migration overheads.

> When do you plan to actually submit all your patches for inclusion?

I had/have no immediate plans. I held off on these series while other
mm features--reclaim scalability, memory control groups, ...--seemed
higher priority, and the churn in mm made it difficult to keep these
patches up to date.

Now that the patches seem to be working again, I plan to test them on
newer platforms with more "interesting" NUMA topologies. If they work
well there, then with your interest and cooperation, perhaps we can try
again with some variant or combination of our approaches.

Regards,
Lee
* Re: [RFC PATCH 0/4]: affinity-on-next-touch

From: Brice Goglin @ 2009-06-22 15:28 UTC
To: Lee Schermerhorn
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
  Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
> Interesting. Do you have any idea where the differences come from? Are
> you comparing them on the same kernel versions? I don't know the
> details of your implementation, but one possible area is the check for
> "misplacement". When migrate-on-fault is enabled, I check all pages
> with page_mapcount() == 0 for misplacement in the [swap page] fault
> path. That, and other filtering to eliminate unnecessary migrations,
> could cause extra overhead.

(I'll actually talk about this at the Linux Symposium.) I used 2.6.27
initially, with some 2.6.29 patches to fix the throughput of move_pages
for large buffers, so move_pages was getting about 600MB/s there. Then
my own (hacky) next-touch implementation was getting about 800MB/s. The
main difference from your code is that mine only modifies the current
process's PTEs, without touching the other processes if the page is
shared. So my code basically only supports private pages; it
duplicates/migrates them on next touch. I thought it was faster than
move_pages because I didn't support shared-page migration. But I found
out later that move_pages could be further improved, up to about
750MB/s (this will be in 2.6.31).

So now I'd expect both the next-touch migration and move_pages to have
similar migration throughput, about 750-800MB/s on my quad-Barcelona
machine. Right now I'm seeing less than that for both, so there might
be a problem deeper down. Actually, looking at COW performance when the
new page is allocated on a remote NUMA node, I also see the throughput
much lower in 2.6.29+ (about 720MB/s) than in 2.6.27 (about 850MB/s).
Maybe a regression in the low-level page copy routine?

Brice
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 15:28 ` Brice Goglin
@ 2009-06-22 16:55 ` Lee Schermerhorn
  2009-06-22 17:06 ` Brice Goglin
  0 siblings, 1 reply; 40+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 16:55 UTC (permalink / raw)
To: Brice Goglin
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 17:28 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> >> I gave this patchset a try and indeed it seems to work fine, thanks a
> >> lot. But the migration performance isn't very good. I am seeing about
> >> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> >> quad-barcelona machine. move_pages gets 640MB/s there. And my own
> >> next-touch implementation was near 800MB/s in the past.
> >
> > Interesting. Do you have any idea where the differences come from? Are
> > you comparing them on the same kernel versions? I don't know the
> > details of your implementation, but one possible area is the check for
> > "misplacement". When migrate-on-fault is enabled, I check all pages
> > with page_mapcount() == 0 for misplacement in the [swap page] fault
> > path. That, and other filtering to eliminate unnecessary migrations,
> > could cause extra overhead.
>
> (I'll actually talk about this at the Linux Symposium.) I used 2.6.27
> initially, with some 2.6.29 patches to fix the throughput of move_pages
> for large buffers. So move_pages was getting about 600MB/s there. Then
> my own (hacky) next-touch implementation was getting about 800MB/s. The
> main difference with your code is that mine only modifies the current
> process's PTEs, without touching other processes' page tables when the
> page is shared.

The primary difference should be at unmap time, right? In the fault
path, I only update the pte of the faulting task. That's why I require
the [anon] pages to be in the swap cache [or something similar]. I
don't want to be fixing up other tasks' page tables in the context of
the faulting task's fault handler. If, later, another task touches the
page, it will take a minor fault and find the [possibly migrated] page
in the cache. Hmmm, I guess all tasks WILL incur the minor fault if
they touch the page after the unmap. That could be part of the
difference if you compare on the same kernel version.

> So my code basically only supports private pages; it
> duplicates/migrates them on next touch. I thought it was faster than
> move_pages because I didn't support shared-page migration. But I later
> found out that move_pages could be improved further, to about 750MB/s
> (this will be in 2.6.31).
>
> So now, I'd expect both the next-touch migration and move_pages to have
> similar migration throughput, about 750-800MB/s on my quad-barcelona
> machine. Right now, I'm seeing less than that for both, so there might
> be a deeper problem.

Try booting with cgroup_disable=memory on the command line, if you have
the memory resource controller configured in. See what that does to
your measurements.

> Actually, looking at COW performance when the new
> page is allocated on a remote NUMA node, I also see throughput much
> lower in 2.6.29+ (about 720MB/s) than in 2.6.27 (about 850MB/s). Maybe
> a regression in the low-level page copy routine?

??? I would expect low-level page copying to be highly optimized per
arch, and also fairly stable. Based on recent experience, I'd more
likely suspect the mm housekeeping overheads--e.g., global and
per-memcg lru management, ...
We've seen a lot of new code in this area in the past few releases.

Lee
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 16:55 ` Lee Schermerhorn
@ 2009-06-22 17:06 ` Brice Goglin
  2009-06-22 17:59 ` Stefan Lankes
  0 siblings, 1 reply; 40+ messages in thread
From: Brice Goglin @ 2009-06-22 17:06 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Stefan Lankes, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
> The primary difference should be at unmap time, right? In the fault
> path, I only update the pte of the faulting task. That's why I require
> the [anon] pages to be in the swap cache [or something similar]. I
> don't want to be fixing up other tasks' page tables in the context of
> the faulting task's fault handler. If, later, another task touches the
> page, it will take a minor fault and find the [possibly migrated] page
> in the cache. Hmmm, I guess all tasks WILL incur the minor fault if
> they touch the page after the unmap. That could be part of the
> difference if you compare on the same kernel version.

Agreed.

> Try booting with cgroup_disable=memory on the command line, if you have
> the memory resource controller configured in. See what that does to
> your measurements.

It doesn't seem to help. I'll try to bisect and find where the
performance dropped.

> ??? I would expect low-level page copying to be highly optimized per
> arch, and also fairly stable.

I just did a quick copy_page benchmark and didn't see any performance
difference between 2.6.27 and mmotm.

thanks,
Brice
^ permalink raw reply [flat|nested] 40+ messages in thread
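For the user-space side of such a check, cross-node copy bandwidth can
be approximated by timing page-sized memcpy() calls between node-bound
buffers -- it exercises the same traffic pattern as the kernel's
copy_page(), though not that routine itself. A sketch with libnuma
(sizes arbitrary, error handling trimmed):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define NPAGES 65536

int main(void)
{
	long ps = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)NPAGES * ps;
	char *src, *dst;
	struct timeval tv1, tv2;
	double us;
	long i;

	if (numa_available() < 0)
		exit(1);
	src = numa_alloc_onnode(len, 0);
	dst = numa_alloc_onnode(len, 1);
	if (!src || !dst)
		exit(1);
	memset(src, 1, len);	/* fault in on node 0 */
	memset(dst, 0, len);	/* fault in on node 1 */

	gettimeofday(&tv1, NULL);
	for (i = 0; i < NPAGES; i++)	/* page-sized copies, like copy_page */
		memcpy(dst + i * ps, src + i * ps, ps);
	gettimeofday(&tv2, NULL);
	us = (tv2.tv_sec - tv1.tv_sec) * 1e6 + (tv2.tv_usec - tv1.tv_usec);
	printf("copy: %.0f MB/s\n", len / us);	/* bytes/us ~= MB/s */

	numa_free(src, len);
	numa_free(dst, len);
	return 0;
}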
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 17:06 ` Brice Goglin
@ 2009-06-22 17:59 ` Stefan Lankes
  2009-06-22 19:10 ` Brice Goglin
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Lankes @ 2009-06-22 17:59 UTC (permalink / raw)
To: Brice Goglin
Cc: Lee Schermerhorn, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Brice Goglin wrote:
> Lee Schermerhorn wrote:
>> The primary difference should be at unmap time, right? In the fault
>> path, I only update the pte of the faulting task. That's why I require
>> the [anon] pages to be in the swap cache [or something similar]. I
>> don't want to be fixing up other tasks' page tables in the context of
>> the faulting task's fault handler. If, later, another task touches the
>> page, it will take a minor fault and find the [possibly migrated] page
>> in the cache. Hmmm, I guess all tasks WILL incur the minor fault if
>> they touch the page after the unmap. That could be part of the
>> difference if you compare on the same kernel version.
>
> Agreed.
>
>> Try booting with cgroup_disable=memory on the command line, if you have
>> the memory resource controller configured in. See what that does to
>> your measurements.
>
> It doesn't seem to help. I'll try to bisect and find where the
> performance dropped.

I am not able to reproduce any performance drawback on my system.
Could you send me your low-level benchmark?

Stefan
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 17:59 ` Stefan Lankes
@ 2009-06-22 19:10 ` Brice Goglin
  2009-06-22 20:16 ` Stefan Lankes
  0 siblings, 1 reply; 40+ messages in thread
From: Brice Goglin @ 2009-06-22 19:10 UTC (permalink / raw)
To: Stefan Lankes
Cc: Lee Schermerhorn, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

[-- Attachment #1: Type: text/plain, Size: 495 bytes --]

Stefan Lankes wrote:
> I am not able to reproduce any performance drawback on my system.
> Could you send me your low-level benchmark?

It's attached. As you may see, it's fairly trivial. It just does several
iterations of mbind+touch_all_pages for different power-of-two buffer
sizes. Just replace mbind with madvise in the inner loop if you want to
try it with your affinity-on-next-touch.

Which kernels are you using when comparing your next-touch
implementation with Lee's patchset?

Brice

[-- Attachment #2: next-touch-mof-cost.c --]
[-- Type: text/x-csrc, Size: 2044 bytes --]

#define _GNU_SOURCE 1
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>
#include <numaif.h>
#include <errno.h>
#include <sched.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY (1<<3)
#endif

#define TOTALPAGES 262144

int nbpages, loop;
int pagesize;

int main(int argc, char **argv)
{
	void *buffer;
	int i, err;
	unsigned long nodemask;
	int maxnode;
	struct timeval tv1, tv2;
	unsigned long us;
	cpu_set_t cset;

	/* put the thread on node 0 */
	CPU_ZERO(&cset);
	CPU_SET(0, &cset);
	err = sched_setaffinity(0, sizeof(cset), &cset);
	if (err < 0) {
		perror("sched_setaffinity");
		exit(-1);
	}

	pagesize = getpagesize();
	maxnode = numa_max_node();

	fprintf(stdout, "# Nb_pages\tCost(ns)\n");

	for(nbpages=2 ; nbpages<=TOTALPAGES ; nbpages*=2) {
		int loops = TOTALPAGES/nbpages;
		if (loops > 128)
			loops = 128;

		buffer = mmap(NULL, TOTALPAGES*pagesize, PROT_READ|PROT_WRITE,
			      MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		if (buffer == MAP_FAILED) {
			perror("mmap");
			exit(-1);
		}

		/* bind to node 1 and prefault */
		nodemask = 1<<1;
		err = mbind(buffer, TOTALPAGES*pagesize, MPOL_BIND,
			    &nodemask, maxnode+2, MPOL_MF_MOVE);
		if (err < 0) {
			perror("mbind");
			exit(-1);
		}
		for(i=0 ; i<TOTALPAGES ; i++)
			*(int*)(buffer+i*pagesize) = 0;

		gettimeofday(&tv1, NULL);
		for(loop=0 ; loop<loops ; loop++) {
			/* mark subbuf as next-touch and touch it */
			void *subbuf = buffer + loop*nbpages*pagesize;
			err = mbind(subbuf, nbpages*pagesize, MPOL_PREFERRED,
				    NULL, 0, MPOL_MF_MOVE|MPOL_MF_LAZY);
			if (err < 0) {
				perror("mbind");
				exit(-1);
			}
			for(i=0;i<nbpages;i++)
				*(int*)(subbuf + i*pagesize) = 42;
		}
		gettimeofday(&tv2, NULL);
		us = (tv2.tv_sec - tv1.tv_sec) * 1000000 + (tv2.tv_usec - tv1.tv_usec);
		fprintf(stdout, "%d\t%ld\n", nbpages, us * 1000/loops);
		fflush(stdout);

		munmap(buffer, TOTALPAGES*pagesize);
	}

	return 0;
}

^ permalink raw reply [flat|nested] 40+ messages in thread
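For reference, the inner-loop variant Brice mentions for Stefan's
patchset would look like this (MADV_ACCESS_LWP comes from those
patches, not from mainline headers, so its value has to be taken from
the patched tree):

	/* mark subbuf as next-touch via the madvise-based patchset instead */
	err = madvise(subbuf, nbpages*pagesize, MADV_ACCESS_LWP);
	if (err < 0) {
		perror("madvise");
		exit(-1);
	}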
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 19:10 ` Brice Goglin
@ 2009-06-22 20:16 ` Stefan Lankes
  2009-06-22 20:34 ` Brice Goglin
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Lankes @ 2009-06-22 20:16 UTC (permalink / raw)
To: Brice Goglin
Cc: Lee Schermerhorn, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Brice Goglin wrote:
> Stefan Lankes wrote:
>> I am not able to reproduce any performance drawback on my system.
>> Could you send me your low-level benchmark?
>
> It's attached. As you may see, it's fairly trivial. It just does several
> iterations of mbind+touch_all_pages for different power-of-two buffer
> sizes. Just replace mbind with madvise in the inner loop if you want to
> try it with your affinity-on-next-touch.

I use MPOL_NOOP instead of MPOL_PREFERRED. On my system, MPOL_NOOP is
defined as 4 (-> include/linux/mempolicy.h). By the way, did you also
apply Lee's "shared policy" patches? These patches add MPOL_MF_SHARED,
which is specified as 3. Afterwards, you have to define MPOL_MF_LAZY
as 4.

I got the following performance results with MPOL_NOOP:

# Nb_pages	Cost(ns)
2	44539
4	44695
8	53937
16	61625
32	87757
64	135070
128	233812
256	428539
512	870476
1024	1695859
2048	3280695
4096	6450328
8192	12719187
16384	25377750
32768	50431375
65536	101970000
131072	216200500
262144	511706000

I got the following performance results with MPOL_PREFERRED:

# Nb_pages	Cost(ns)
2	50742
4	58656
8	79929
16	117171
32	195304
64	354851
128	744835
256	1354476
512	2759570
1024	5433304
2048	10173390
4096	20178453
8192	36452343
16384	71077375
32768	141738000
65536	281460250
131072	576971000
262144	1231694000

> Which kernels are you using when comparing your next-touch
> implementation with Lee's patchset?

The current mmotm tree.

Regards,
Stefan
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 20:16 ` Stefan Lankes
@ 2009-06-22 20:34 ` Brice Goglin
  0 siblings, 0 replies; 40+ messages in thread
From: Brice Goglin @ 2009-06-22 20:34 UTC (permalink / raw)
To: Stefan Lankes
Cc: Lee Schermerhorn, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Stefan Lankes wrote:
> By the way, did you also apply Lee's "shared policy" patches? These
> patches add MPOL_MF_SHARED, which is specified as 3. Afterwards, you
> have to define MPOL_MF_LAZY as 4.

Yes, I applied shared-policy-* since migrate-on-fault doesn't apply
without them :) But I have the following in include/linux/mempolicy.h
after applying all patches:

#define MPOL_MF_LAZY	(1<<3)	/* Modifies '_MOVE: lazy migrate on fault */
#define MPOL_F_SHARED	(1 << 0)	/* identify shared policies */

Where did you get your F_SHARED=3 and MF_LAZY=4?

> I got the following performance results with MPOL_NOOP:
>
> # Nb_pages	Cost(ns)
> 32768	50431375
> 65536	101970000
> 131072	216200500
> 262144	511706000

Is there any migration here? Don't you just have unmap and fault
without migration? In my test program, the initialization does
MPOL_BIND. So the following MPOL_NOOP should just do nothing, since the
pages are already correctly placed with regard to the previous
MPOL_BIND. I feel like 2us per page is too low for a migration, yet
also very high for just unmap and fault-in.

> I got the following performance results with MPOL_PREFERRED:
>
> # Nb_pages	Cost(ns)
> 32768	141738000

That's about 60% faster than on my machine (quad-barcelona 8347HE
1.9GHz). What machine are you running on?

Brice
^ permalink raw reply [flat|nested] 40+ messages in thread
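For scale (assuming 4kB pages): Stefan's MPOL_NOOP column works out to
50431375 ns / 32768 pages = roughly 1.5us per page, and 511706000 ns /
262144 pages = roughly 2.0us per page at the largest size -- over
2GB/s if a page copy were really happening, which is why it looks too
fast to be migration. The MPOL_PREFERRED column gives 141738000 ns /
32768 pages = roughly 4.3us per page, i.e. about 4096 B / 4.3us, or
950MB/s of effective migration throughput.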
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 12:34 ` Brice Goglin
  2009-06-22 14:24 ` Lee Schermerhorn
@ 2009-06-22 14:32 ` Stefan Lankes
  2009-06-22 14:56 ` Lee Schermerhorn
  1 sibling, 1 reply; 40+ messages in thread
From: Stefan Lankes @ 2009-06-22 14:32 UTC (permalink / raw)
To: Brice Goglin
Cc: Lee Schermerhorn, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Brice Goglin wrote:
> Lee Schermerhorn wrote:
>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>
>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>> [along with my shared policy series atop which they sit in my tree].
>> Patches reside in:
>>
>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>
> I gave this patchset a try and indeed it seems to work fine, thanks a
> lot. But the migration performance isn't very good. I am seeing about
> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> quad-barcelona machine. move_pages gets 640MB/s there. And my own
> next-touch implementation was near 800MB/s in the past.

I used a modified stream benchmark to evaluate the performance of Lee's
and my version of the next-touch implementation. In this low-level
benchmark, Lee's patch performs better than mine. I think that Brice
and I use the same technique to realize affinity-on-next-touch. Did you
use a different kernel version to evaluate the performance?

Stefan
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 14:32 ` Stefan Lankes
@ 2009-06-22 14:56 ` Lee Schermerhorn
  2009-06-22 15:42 ` Stefan Lankes
  0 siblings, 1 reply; 40+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 14:56 UTC (permalink / raw)
To: Stefan Lankes
Cc: Brice Goglin, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
> Brice Goglin wrote:
> > Lee Schermerhorn wrote:
> >> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >>
> >> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> >> [along with my shared policy series atop which they sit in my tree].
> >> Patches reside in:
> >>
> >> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >
> > I gave this patchset a try and indeed it seems to work fine, thanks a
> > lot. But the migration performance isn't very good. I am seeing about
> > 540MB/s when doing mbind+touch_all_pages on large buffers on a
> > quad-barcelona machine. move_pages gets 640MB/s there. And my own
> > next-touch implementation was near 800MB/s in the past.
>
> I used a modified stream benchmark to evaluate the performance of Lee's
> and my version of the next-touch implementation. In this low-level
> benchmark, Lee's patch performs better than mine. I think that Brice
> and I use the same technique to realize affinity-on-next-touch. Did you
> use a different kernel version to evaluate the performance?

Hi, Stefan:

I also used a [modified!] stream benchmark to test my patches. One of
the modifications was to dump the time it takes for one pass over the
data arrays to a specific file descriptor, if that descriptor was open
at start time--e.g., via something like "4>stream_times". Then, I
increased the number of iterations to something large so that I could
run other tests during the stream run. I plotted the "time per
iteration" vs iteration number and could see that after any transient
load, the stream benchmark returned to a good [not sure if maximal]
locality state. The time per iteration was comparable to
hand-affinitizing the threads. Without automigration or hand
affinitization, any transient load would scramble the location of the
threads relative to the data region they were operating on due to load
balancing. The more nodes you have, the less likely you'll end up in a
good state.

I was using a parallel kernel make [-j <2*nr_cpus>] as the load. In
addition to the stream returning to good locality, I noticed that the
kernel build completed much faster in the presence of the stream load
with automigration enabled. I reported these results in a presentation
at LCA'07. Slides and video [yuck! :)] are available on line at the
LCA'07 site.

Regards,
Lee
^ permalink raw reply [flat|nested] 40+ messages in thread
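The "4>stream_times" hook Lee describes can be implemented with a few
lines like these (a sketch, not Lee's actual code; it assumes
descriptor 4 is the agreed-upon channel and that dprintf() is
available, as in glibc):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

static int times_fd = -1;

/* at startup: use fd 4 only if the caller opened it, e.g. 4>stream_times */
static void times_init(void)
{
	if (fcntl(4, F_GETFD) != -1)
		times_fd = 4;
}

/* after each pass over the data arrays */
static void times_report(int iteration, double seconds)
{
	if (times_fd >= 0)
		dprintf(times_fd, "%d %.6f\n", iteration, seconds);
}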
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 14:56 ` Lee Schermerhorn
@ 2009-06-22 15:42 ` Stefan Lankes
  2009-06-22 16:38 ` Lee Schermerhorn
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Lankes @ 2009-06-22 15:42 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Brice Goglin, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
> On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
>> Brice Goglin wrote:
>>> Lee Schermerhorn wrote:
>>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>>>
>>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>>>> [along with my shared policy series atop which they sit in my tree].
>>>> Patches reside in:
>>>>
>>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>>>>
>>> I gave this patchset a try and indeed it seems to work fine, thanks a
>>> lot. But the migration performance isn't very good. I am seeing about
>>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
>>> quad-barcelona machine. move_pages gets 640MB/s there. And my own
>>> next-touch implementation was near 800MB/s in the past.
>> I used a modified stream benchmark to evaluate the performance of Lee's
>> and my version of the next-touch implementation. In this low-level
>> benchmark, Lee's patch performs better than mine. I think that Brice
>> and I use the same technique to realize affinity-on-next-touch. Did you
>> use a different kernel version to evaluate the performance?
>
> Hi, Stefan:
>
> I also used a [modified!] stream benchmark to test my patches. One of
> the modifications was to dump the time it takes for one pass over the
> data arrays to a specific file descriptor, if that descriptor was open
> at start time--e.g., via something like "4>stream_times". Then, I
> increased the number of iterations to something large so that I could
> run other tests during the stream run. I plotted the "time per
> iteration" vs iteration number and could see that after any transient
> load, the stream benchmark returned to a good [not sure if maximal]
> locality state. The time per iteration was comparable to
> hand-affinitizing the threads. Without automigration or hand
> affinitization, any transient load would scramble the location of the
> threads relative to the data region they were operating on due to load
> balancing. The more nodes you have, the less likely you'll end up in a
> good state.
>
> I was using a parallel kernel make [-j <2*nr_cpus>] as the load. In
> addition to the stream returning to good locality, I noticed that the
> kernel build completed much faster in the presence of the stream load
> with automigration enabled. I reported these results in a presentation
> at LCA'07. Slides and video [yuck! :)] are available on line at the
> LCA'07 site.

I think that you use migration-on-fault in the context of
automigration. Brice and I use
affinity-on-next-touch/migration-on-fault in another context. If the
access pattern of an application changes, we want to redistribute the
pages in a "nearly" ideal manner. Sometimes it is difficult to
determine the ideal page distribution. In such cases,
affinity-on-next-touch could be an attractive solution. In our test
applications, we add the system call at certain points to trigger
affinity-on-next-touch and redistribute the pages. Assuming that the
thread touching them next uses these pages very often, we improve the
performance of our test applications.
Regards, Stefan ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 15:42 ` Stefan Lankes
@ 2009-06-22 16:38 ` Lee Schermerhorn
  0 siblings, 0 replies; 40+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 16:38 UTC (permalink / raw)
To: Stefan Lankes
Cc: Brice Goglin, 'Andi Kleen', linux-kernel, linux-numa,
    Boris Bierbaum, KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 17:42 +0200, Stefan Lankes wrote:
> Lee Schermerhorn wrote:
> > On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
> >> Brice Goglin wrote:
> >>> Lee Schermerhorn wrote:
> >>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >>>>
> >>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> >>>> [along with my shared policy series atop which they sit in my tree].
> >>>> Patches reside in:
> >>>>
> >>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >>>>
> >>> I gave this patchset a try and indeed it seems to work fine, thanks a
> >>> lot. But the migration performance isn't very good. I am seeing about
> >>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> >>> quad-barcelona machine. move_pages gets 640MB/s there. And my own
> >>> next-touch implementation was near 800MB/s in the past.
> >> I used a modified stream benchmark to evaluate the performance of Lee's
> >> and my version of the next-touch implementation. In this low-level
> >> benchmark, Lee's patch performs better than mine. I think that Brice
> >> and I use the same technique to realize affinity-on-next-touch. Did you
> >> use a different kernel version to evaluate the performance?
> >
> > Hi, Stefan:
> >
> > I also used a [modified!] stream benchmark to test my patches. One of
> > the modifications was to dump the time it takes for one pass over the
> > data arrays to a specific file descriptor, if that descriptor was open
> > at start time--e.g., via something like "4>stream_times". Then, I
> > increased the number of iterations to something large so that I could
> > run other tests during the stream run. I plotted the "time per
> > iteration" vs iteration number and could see that after any transient
> > load, the stream benchmark returned to a good [not sure if maximal]
> > locality state. The time per iteration was comparable to
> > hand-affinitizing the threads. Without automigration or hand
> > affinitization, any transient load would scramble the location of the
> > threads relative to the data region they were operating on due to load
> > balancing. The more nodes you have, the less likely you'll end up in a
> > good state.
> >
> > I was using a parallel kernel make [-j <2*nr_cpus>] as the load. In
> > addition to the stream returning to good locality, I noticed that the
> > kernel build completed much faster in the presence of the stream load
> > with automigration enabled. I reported these results in a presentation
> > at LCA'07. Slides and video [yuck! :)] are available on line at the
> > LCA'07 site.
>
> I think that you use migration-on-fault in the context of
> automigration. Brice and I use
> affinity-on-next-touch/migration-on-fault in another context. If the
> access pattern of an application changes, we want to redistribute the
> pages in a "nearly" ideal manner. Sometimes it is difficult to
> determine the ideal page distribution. In such cases,
> affinity-on-next-touch could be an attractive solution.
> In our test applications, we add the system call at certain points to
> trigger affinity-on-next-touch and redistribute the pages. Assuming
> that the thread touching them next uses these pages very often, we
> improve the performance of our test applications.

I understand. That's one of the motivations for MPOL_MF_LAZY and the
MPOL_NOOP policy mode. It simply unmaps [removes pte refs from] the
pages, priming them for migration on next touch, if they are
"misplaced" relative to the task touching them. It's useful for testing
and that's my personal primary use case, but I did envision it for use
in applications that know they're entering a new computation phase with
different access patterns.

Lee
^ permalink raw reply [flat|nested] 40+ messages in thread
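In application code, that phase-change use looks roughly like the
sketch below, under the patchset's API as described in this thread
(MPOL_NOOP and MPOL_MF_LAZY come from Lee's patches, not mainline
headers; the helper name is illustrative):

#include <stdio.h>
#include <numaif.h>

/* Prime [buf, buf+len) for migrate-on-next-touch at a phase boundary:
 * pte references are dropped so each page migrates toward whichever
 * task touches it next; the region's policy itself is left unchanged. */
static void mark_region_next_touch(void *buf, unsigned long len)
{
	if (mbind(buf, len, MPOL_NOOP, NULL, 0,
	          MPOL_MF_MOVE | MPOL_MF_LAZY) < 0)
		perror("mbind(MPOL_NOOP, MPOL_MF_LAZY)");
}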
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-12 10:32 ` Andi Kleen
  2009-06-12 11:46 ` Stefan Lankes
@ 2009-06-16  2:25 ` Lee Schermerhorn
  2009-06-20  7:24 ` Brice Goglin
  1 sibling, 1 reply; 40+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 2:25 UTC (permalink / raw)
To: Andi Kleen
Cc: Stefan Lankes, linux-kernel, linux-numa, Boris Bierbaum,
    'Brice Goglin'

On Fri, 2009-06-12 at 12:32 +0200, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 08:45:29PM +0200, Stefan Lankes wrote:
> > > Your patches seem to have a lot of overlap with
> > > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > > I don't know the status of those.
> >
> > I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
> > (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that
> > Lee Schermerhorn adds similar functionality to the kernel. He called
> > the "affinity-on-next-touch" functionality "migrate_on_fault" and uses
> > the normal NUMA memory policies in his patches. Therefore, his
> > solution fits the Linux kernel better. I tested his patches with our
> > test applications and got nearly the same performance results.
>
> That's great to know.
>
> I didn't think he had a per-process setting though, did he?

Hi, Andi. My patches don't have per-process enablement. Rather, I chose
to use per-cpuset enablement. I view cpusets as sort of "numa control
groups" and thought this was an appropriate level at which to control
this sort of behavior--analogous to memory_spread_{page|slab}. That
probably needs to be discussed more widely, tho'.

> > I found only patches for the kernel 2.6.25-rc2-mm1. Is anyone
> > developing these patches further?
>
> Not to my knowledge. Maybe Lee will pick them up again now that there
> are more use cases.
>
> If he doesn't have time maybe you could update them?

As I mentioned earlier, I need to sort out the interaction with the
memory controller. It was changing too fast for me to keep up in the
time I could devote to it.

Lee
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-16  2:25 ` Lee Schermerhorn
@ 2009-06-20  7:24 ` Brice Goglin
  2009-06-22 13:49 ` Lee Schermerhorn
  0 siblings, 1 reply; 40+ messages in thread
From: Brice Goglin @ 2009-06-20 7:24 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Andi Kleen, Stefan Lankes, linux-kernel, linux-numa,
    Boris Bierbaum, 'Brice Goglin'

Lee Schermerhorn wrote:
> My patches don't have per-process enablement. Rather, I chose to use
> per-cpuset enablement. I view cpusets as sort of "numa control groups"
> and thought this was an appropriate level at which to control this
> sort of behavior--analogous to memory_spread_{page|slab}. That
> probably needs to be discussed more widely, tho'.

Could you explain why you actually want to enable/disable
migrate-on-fault on a cpuset (or process) basis? Why would an
administrator want to disable it? Aren't the existing cpuset memory
restriction abilities enough?

Brice
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-20  7:24 ` Brice Goglin
@ 2009-06-22 13:49 ` Lee Schermerhorn
  0 siblings, 0 replies; 40+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 13:49 UTC (permalink / raw)
To: Brice Goglin
Cc: Andi Kleen, Stefan Lankes, linux-kernel, linux-numa, Boris Bierbaum

On Sat, 2009-06-20 at 09:24 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> > My patches don't have per-process enablement. Rather, I chose to use
> > per-cpuset enablement. I view cpusets as sort of "numa control groups"
> > and thought this was an appropriate level at which to control this
> > sort of behavior--analogous to memory_spread_{page|slab}. That
> > probably needs to be discussed more widely, tho'.
>
> Could you explain why you actually want to enable/disable
> migrate-on-fault on a cpuset (or process) basis? Why would an
> administrator want to disable it? Aren't the existing cpuset memory
> restriction abilities enough?
>
> Brice

Hello, Brice:

There are a couple of aspects to this question, I think?

1) why enable/disable at all? why not always enabled?

When I try out some new behavior such as migrate on fault, I start with
the assumption [right or wrong] that not all users will want this
behavior. For migrate-on-fault, one probably won't run into it all that
often unless the MPOL_MF_LAZY flag is used to forcibly unmap regions.
However, with swap read-ahead, one could end up with anon pages in the
swap cache with no pte references, and could experience unexpected
migrations. I've learned that some folks really don't like surprises :).

Now, when you consider the "automigration" feature ["auto" here means
"self" more than "automatic"], I think it's more important to be able
to enable/disable it. I've not seen any performance degradation when
using it, but I feared that for some workloads, thrashing could cause
such degradation. Page migration isn't free.

Also, because Linux runs on such a wide range of platforms, I don't
want to burden smaller, embedded systems with the additional code, so I
also try to make the feature source-configurable. I know we worry about
the proliferation of config options, but it's easier to remove one
after the fact, I think, than to retrofit it.

2) Why a per-cpuset control?

I consider cpusets to be "numa control groups". They constrain
resources at numa-node [and related cpu] granularity, and control
numa-related behavior, such as migration when changing cpusets,
spreading page cache and slab pages over the nodes in the cpuset, ...
In fact, I think it would have been appropriate to call the cpuset
control group the "numa control group" when cgroups were introduced,
but it's too late for that now.

Finally, and not a reason to include the controls in the mainline, it's
REALLY useful during development. One can boot a test kernel and enable
the feature only in a test cpuset, limiting the damage of, e.g., a
reference-counting bug or such. It's also useful for measuring the
overhead of the patches absent any actual page migrations. However, if
this feature ever makes it to mainline, the community will have its say
on whether these controls should be included and how.

Hope this helps,
Lee
^ permalink raw reply [flat|nested] 40+ messages in thread
* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-11 18:45 ` Stefan Lankes
  2009-06-12 10:32 ` Andi Kleen
@ 2009-06-16  2:21 ` Lee Schermerhorn
  1 sibling, 0 replies; 40+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 2:21 UTC (permalink / raw)
To: Stefan Lankes
Cc: 'Andi Kleen', linux-kernel, linux-numa, Boris Bierbaum,
    'Brice Goglin'

On Thu, 2009-06-11 at 20:45 +0200, Stefan Lankes wrote:
> > Your patches seem to have a lot of overlap with
> > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > I don't know the status of those.
>
> I analyzed Lee Schermerhorn's migrate memory on cpu migration patches
> (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that
> Lee Schermerhorn adds similar functionality to the kernel. He called
> the "affinity-on-next-touch" functionality "migrate_on_fault" and uses
> the normal NUMA memory policies in his patches. Therefore, his
> solution fits the Linux kernel better. I tested his patches with our
> test applications and got nearly the same performance results.
>
> I found only patches for the kernel 2.6.25-rc2-mm1. Is anyone
> developing these patches further?

Sorry for the delay. I was offline for a long weekend.

Regarding the patches: I was rebasing them every few mmotm releases
until I ran into trouble with the memory controller's handling of page
migration conflicting with migrating in the fault path, and I haven't
had time to investigate a solution.

Here's the problem I have: when migrating a page with the memory
controller configured, the migration code
[mem_cgroup_prepare_migration()] tentatively charges the page against
the control group. Then, when migration completes, it calls
mem_cgroup_end_migration() to commit [or cancel?] the charge.

Migration on fault operates on an anon page in the page cache that has
zero pte references [page_mapcount(page) == 0] in do_swap_page().
do_swap_page() does a mem_cgroup_try_charge_swapin() that also
tentatively charges the page. I don't try to migrate the page unless
this succeeds. No sense in doing all that work if the cgroup can't
afford the page. But this ends up with nested "tentative charges"
against the page when I call down into the migration code via
migrate_misplaced_page(), and I was having problems getting the
reference counting correct. It would bug out under load.

What I want to do is see if the page migration code can "atomically
transfer" the page charge [including any tentative charge from
do_swap_page()] down in migrate_page_copy(), the way all other page
state is copied. Haven't had time to see whether this is feasible.

Regards,
Lee
^ permalink raw reply [flat|nested] 40+ messages in thread
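The "atomic transfer" Lee describes in the last paragraph would
conceptually hang off the existing page-state copy, along these lines
(purely a sketch: mem_cgroup_transfer_charge() is a hypothetical
helper, not an existing kernel function):

/* hypothetical variant of the migrate_page_copy() step */
static void migrate_page_copy_with_charge(struct page *newpage,
                                          struct page *page)
{
	migrate_page_copy(newpage, page);	/* flags, dirty state, etc. */

	/* hypothetical: move the memcg charge -- including any tentative
	 * charge taken in do_swap_page() -- from the old page to the new
	 * one, instead of charging the new page separately */
	mem_cgroup_transfer_charge(page, newpage);
}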
end of thread, other threads: [~2009-06-22 20:34 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <000c01c9d212$4c244720$e46cd560$@rwth-aachen.de>
2009-05-11 13:22 ` [RFC PATCH 0/4]: affinity-on-next-touch Andi Kleen
2009-05-11 13:32 ` Brice Goglin
2009-05-11 14:54 ` Stefan Lankes
2009-05-11 16:37 ` Andi Kleen
2009-05-11 17:22 ` Stefan Lankes
2009-06-11 18:45 ` Stefan Lankes
2009-06-12 10:32 ` Andi Kleen
2009-06-12 11:46 ` Stefan Lankes
2009-06-12 12:30 ` Brice Goglin
2009-06-12 13:21 ` Stefan Lankes
2009-06-12 13:48 ` Stefan Lankes
2009-06-16 2:39 ` Lee Schermerhorn
2009-06-16 13:58 ` Stefan Lankes
2009-06-16 14:59 ` Lee Schermerhorn
2009-06-17 1:22 ` KAMEZAWA Hiroyuki
2009-06-17 12:02 ` Lee Schermerhorn
2009-06-17 7:45 ` Stefan Lankes
2009-06-18 4:37 ` Lee Schermerhorn
2009-06-18 19:04 ` Lee Schermerhorn
2009-06-19 15:26 ` Lee Schermerhorn
2009-06-19 15:41 ` Balbir Singh
2009-06-19 15:59 ` Lee Schermerhorn
2009-06-19 21:19 ` Stefan Lankes
2009-06-22 12:34 ` Brice Goglin
2009-06-22 14:24 ` Lee Schermerhorn
2009-06-22 15:28 ` Brice Goglin
2009-06-22 16:55 ` Lee Schermerhorn
2009-06-22 17:06 ` Brice Goglin
2009-06-22 17:59 ` Stefan Lankes
2009-06-22 19:10 ` Brice Goglin
2009-06-22 20:16 ` Stefan Lankes
2009-06-22 20:34 ` Brice Goglin
2009-06-22 14:32 ` Stefan Lankes
2009-06-22 14:56 ` Lee Schermerhorn
2009-06-22 15:42 ` Stefan Lankes
2009-06-22 16:38 ` Lee Schermerhorn
2009-06-16 2:25 ` Lee Schermerhorn
2009-06-20 7:24 ` Brice Goglin
2009-06-22 13:49 ` Lee Schermerhorn
2009-06-16 2:21 ` Lee Schermerhorn