* ZVC: Increase threshold for larger processor configurations

From: Christoph Lameter @ 2006-06-28 17:41 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel

We detected a slight degree of contention with the new zoned VM counters
when over 128 processors simultaneously fault anonymous pages. Raising the
threshold to 64 fixed the contention issue. So we need to increase the
threshold depending on the number of processors in the system, and it may
be best to overcompensate a bit.

We keep the existing threshold of 32 for configurations with 64 or fewer
processors. In the range of 64 to 128 processors we go to a threshold of
64. Beyond that we go to 125 (we have to be able to increment one beyond
the threshold, and it is better to avoid 127 just in case).

(There is a further scalability improvement possible by using the
cacheline, once it has been acquired, to update all pending counters, but
I want to first test with a few hundred processors to see whether we need
that, and then we need to figure out whether there are bad effects for
smaller configurations.)

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-mm3/mm/vmstat.c
===================================================================
--- linux-2.6.17-mm3.orig/mm/vmstat.c	2006-06-27 20:24:37.455840645 -0700
+++ linux-2.6.17-mm3/mm/vmstat.c	2006-06-28 10:32:14.441818620 -0700
@@ -112,7 +112,21 @@ atomic_long_t vm_stat[NR_VM_ZONE_STAT_IT
 #ifdef CONFIG_SMP
 
+/*
+ * With higher processor counts we need a higher threshold to avoid contention.
+ */
+#if NR_CPUS <= 64
 #define STAT_THRESHOLD 32
+#elif NR_CPUS <= 128
+#define STAT_THRESHOLD 64
+#else
+/*
+ * Use the maximum usable threshold.
+ * We need to increment one beyond the threshold. To be safe
+ * also avoid 127.
+ */
+#define STAT_THRESHOLD 125
+#endif
 
 /*
  * Determine pointer to currently valid differential byte given a zone and
* Re: ZVC: Increase threshold for larger processor configurations

From: Andrew Morton @ 2006-06-28 22:49 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter <clameter@sgi.com> wrote:
>
> We detected a slight degree of contention with the new zoned VM counters
> when over 128 processors simultaneously fault anonymous pages. Raising the
> threshold to 64 fixed the contention issue.
>
> So we need to increase the threshold depending on the number of processors
> in the system. And it may be best to overcompensate a bit.
>
> We keep the existing threshold of 32 for configurations with 64 or fewer
> processors. In the range of 64 to 128 processors we go to a threshold of
> 64. Beyond that we go to 125 (we have to be able to increment one beyond
> the threshold, and it is better to avoid 127 just in case).
>
> (There is a further scalability improvement possible by using the
> cacheline, once it has been acquired, to update all pending counters, but
> I want to first test with a few hundred processors to see whether we need
> that, and then we need to figure out whether there are bad effects for
> smaller configurations.)
>
> ...
>
> +/*
> + * With higher processor counts we need a higher threshold to avoid contention.
> + */
> +#if NR_CPUS <= 64
>  #define STAT_THRESHOLD 32
> +#elif NR_CPUS <= 128
> +#define STAT_THRESHOLD 64
> +#else
> +/*
> + * Use the maximum usable threshold.
> + * We need to increment one beyond the threshold. To be safe
> + * also avoid 127.
> + */
> +#define STAT_THRESHOLD 125
> +#endif

As we become less and less dependent upon NR_CPUS, it becomes more and
more viable for people to ship kernels which are compiled with a very high
NR_CPUS value. And code such as the above becomes less and less effective.

An alternative would be to calculate stat_threshold at runtime, based on
the cpu_possible count (I guess). Or even:

	static inline int stat_threshold(void)
	{
	#if NR_CPUS <= 32
		return 32;
	#else
		return dynamically_calculated_stat_threshold;
	#endif
	}

Did you consider my earlier suggestion about these counters? That, over
the short term, they tend to count in only one direction? So we can do

	if (x > STAT_THRESHOLD) {
		zone_page_state_add(x + STAT_THRESHOLD/2, zone, item);
		x = -STAT_THRESHOLD/2;
	} else if (x < -STAT_THRESHOLD) {
		zone_page_state_add(x - STAT_THRESHOLD/2, zone, item);
		x = STAT_THRESHOLD;
	}

That'll give a decrease in contention while not consuming any extra
storage and while (I think) increasing accuracy.

Whatever way we go, let's think harder about this one, please. Anything
which uses NR_CPUS is a red flag.
* Re: ZVC: Increase threshold for larger processor configurations

From: Christoph Lameter @ 2006-06-29 18:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

On Wed, 28 Jun 2006, Andrew Morton wrote:

> An alternative would be to calculate stat_threshold at runtime, based on
> the cpu_possible count (I guess). Or even:
>
> 	static inline int stat_threshold(void)
> 	{
> 	#if NR_CPUS <= 32
> 		return 32;
> 	#else
> 		return dynamically_calculated_stat_threshold;
> 	#endif
> 	}

That's one more memory reference. Hmmm... We could place the threshold in
the same cacheline. That would also open up the possibility of dynamically
calculating the threshold.

> Did you consider my earlier suggestion about these counters? That, over
> the short term, they tend to count in only one direction? So we can do
>
> 	if (x > STAT_THRESHOLD) {
> 		zone_page_state_add(x + STAT_THRESHOLD/2, zone, item);
> 		x = -STAT_THRESHOLD/2;
> 	} else if (x < -STAT_THRESHOLD) {
> 		zone_page_state_add(x - STAT_THRESHOLD/2, zone, item);
> 		x = STAT_THRESHOLD;
> 	}
>
> That'll give a decrease in contention while not consuming any extra
> storage and while (I think) increasing accuracy.

Uhh... We are overcompensating, right? Pretty funky idea that is new to me
and that would require some thought.

This would basically increase the stepping by 50% if we are only going in
one direction. If we are doing a mixture of allocations and deallocations
(or pages being marked dirty/undirty, mapped/unmapped) then this may
potentially increase the number of updates and therefore the cacheline
contention.
* Re: ZVC: Increase threshold for larger processor configurations

From: Andrew Morton @ 2006-06-29 18:57 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel

On Thu, 29 Jun 2006 11:22:45 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 28 Jun 2006, Andrew Morton wrote:
>
> > An alternative would be to calculate stat_threshold at runtime, based on
> > the cpu_possible count (I guess). Or even:
> >
> > 	static inline int stat_threshold(void)
> > 	{
> > 	#if NR_CPUS <= 32
> > 		return 32;
> > 	#else
> > 		return dynamically_calculated_stat_threshold;
> > 	#endif
> > 	}
>
> That's one more memory reference. Hmmm... We could place the threshold in
> the same cacheline. That would also open up the possibility of dynamically
> calculating the threshold.

yup.

> > Did you consider my earlier suggestion about these counters? That, over
> > the short term, they tend to count in only one direction? So we can do
> >
> > 	if (x > STAT_THRESHOLD) {
> > 		zone_page_state_add(x + STAT_THRESHOLD/2, zone, item);
> > 		x = -STAT_THRESHOLD/2;
> > 	} else if (x < -STAT_THRESHOLD) {
> > 		zone_page_state_add(x - STAT_THRESHOLD/2, zone, item);
> > 		x = STAT_THRESHOLD;
> > 	}
> >
> > That'll give a decrease in contention while not consuming any extra
> > storage and while (I think) increasing accuracy.
>
> Uhh... We are overcompensating, right? Pretty funky idea that is new to me
> and that would require some thought.

See inbox ;)

> This would basically increase the stepping by 50% if we are only going in
> one direction.

yes.

> If we are doing a mixture of allocations and deallocations (or pages
> being marked dirty/undirty, mapped/unmapped) then this may potentially
> increase the number of updates and therefore the cacheline contention.

Yes. I'd handwavingly contend that this is a rare situation.

A lot of the counters only ever count in one direction! So we could even
skew them by an entire STAT_THRESHOLD.
* Re: ZVC: Increase threshold for larger processor configurations

From: Christoph Lameter @ 2006-06-30 17:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

On Thu, 29 Jun 2006, Andrew Morton wrote:

> A lot of the counters only ever count in one direction! So we could even
> skew them by an entire STAT_THRESHOLD.

I just tested both approaches separately. If I leave STAT_THRESHOLD at 32
and just add the STAT_THRESHOLD / 2 trick, we can scale cleanly up to 160p
in the synthetic test. I have a bad feeling, though, about 512p, 1024p and
4096p with the static approach plus STAT_THRESHOLD / 2, but it will take
some time to get tests done on those.

I think the best thing right now is to just take the dynamic threshold
patch. I expect that to do just fine at the higher numbers. If we have
more trouble at 1024p then maybe try STAT_THRESHOLD / 2 before going to
aggregate counter consolidation.
* Re: ZVC: Increase threshold for larger processor configurations

From: Christoph Lameter @ 2006-06-30 6:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

On Thu, 29 Jun 2006, Christoph Lameter wrote:

> That's one more memory reference. Hmmm... We could place the threshold in
> the same cacheline. That would also open up the possibility of dynamically
> calculating the threshold.

Like this?

ZVC: Dynamic counter threshold configuration

We add a threshold before the first counter and calculate the threshold
based on the size of the zone, assuming that larger machines have larger
zone sizes and therefore larger thresholds.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-mm4/mm/vmstat.c
===================================================================
--- linux-2.6.17-mm4.orig/mm/vmstat.c	2006-06-29 13:07:13.342483251 -0700
+++ linux-2.6.17-mm4/mm/vmstat.c	2006-06-29 13:17:52.978741713 -0700
@@ -112,37 +112,22 @@ atomic_long_t vm_stat[NR_VM_ZONE_STAT_IT
 EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
-
-#define STAT_THRESHOLD 32
-
-/*
- * Determine pointer to currently valid differential byte given a zone and
- * the item number.
- *
- * Preemption must be off
- */
-static inline s8 *diff_pointer(struct zone *zone, enum zone_stat_item item)
-{
-	return &zone_pcp(zone, smp_processor_id())->vm_stat_diff[item];
-}
-
 /*
  * For use when we know that interrupts are disabled.
  */
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	s8 *p;
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
-	p = diff_pointer(zone, item);
 	x = delta + *p;
 
-	if (unlikely(x > STAT_THRESHOLD || x < -STAT_THRESHOLD)) {
+	if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
 		zone_page_state_add(x, zone, item);
 		x = 0;
 	}
-
 	*p = x;
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
@@ -184,11 +169,12 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 static void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	s8 *p = diff_pointer(zone, item);
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
 
-	if (unlikely(*p > STAT_THRESHOLD)) {
+	if (unlikely(*p > pcp->stat_threshold)) {
 		zone_page_state_add(*p, zone, item);
 		*p = 0;
 	}
@@ -203,11 +189,12 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	struct zone *zone = page_zone(page);
-	s8 *p = diff_pointer(zone, item);
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
 
-	if (unlikely(*p < -STAT_THRESHOLD)) {
+	if (unlikely(*p < -pcp->stat_threshold)) {
 		zone_page_state_add(*p, zone, item);
 		*p = 0;
 	}
@@ -238,16 +225,12 @@ EXPORT_SYMBOL(inc_zone_page_state);
 void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
-	struct zone *zone;
-	s8 *p;
+	struct zone *zone = page_zone(page);
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
 
-	zone = page_zone(page);
 	local_irq_save(flags);
-	p = diff_pointer(zone, item);
-
-	(*p)--;
-
-	if (unlikely(*p < -STAT_THRESHOLD)) {
+	if (unlikely(*p < -pcp->stat_threshold)) {
 		zone_page_state_add(*p, zone, item);
 		*p = 0;
 	}
@@ -502,6 +485,7 @@ static int zoneinfo_show(struct seq_file
 		seq_printf(m,
 			   ")"
 			   "\n pagesets");
+		seq_printf(m, "\n stat_threshold: %i", zone_pcp(zone, 0)->stat_threshold);
 		for_each_online_cpu(i) {
 			struct per_cpu_pageset *pageset;
 			int j;
Index: linux-2.6.17-mm4/include/linux/mmzone.h
===================================================================
--- linux-2.6.17-mm4.orig/include/linux/mmzone.h	2006-06-29 13:07:12.486090884 -0700
+++ linux-2.6.17-mm4/include/linux/mmzone.h	2006-06-29 13:07:26.486201908 -0700
@@ -77,6 +77,7 @@ struct per_cpu_pages {
 struct per_cpu_pageset {
 	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
 #ifdef CONFIG_SMP
+	s8 stat_threshold;		/* maximum diff before update */
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
 } ____cacheline_aligned_in_smp;
Index: linux-2.6.17-mm4/mm/page_alloc.c
===================================================================
--- linux-2.6.17-mm4.orig/mm/page_alloc.c	2006-06-29 13:07:13.313188187 -0700
+++ linux-2.6.17-mm4/mm/page_alloc.c	2006-06-29 13:14:57.831351395 -0700
@@ -1918,6 +1918,8 @@ static int __cpuinit process_zones(int c
 		if (percpu_pagelist_fraction)
 			setup_pagelist_highmark(zone_pcp(zone, cpu),
 			    (zone->present_pages / percpu_pagelist_fraction));
+
+		vm_stat_setup(zone, cpu);
 	}
 
 	return 0;
@@ -2039,6 +2041,7 @@ static __meminit void zone_pcp_init(stru
 #else
 		setup_pageset(zone_pcp(zone,cpu), batch);
 #endif
+		vm_stat_setup(zone, cpu);
 	}
 	if (zone->present_pages)
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
Index: linux-2.6.17-mm4/include/linux/vmstat.h
===================================================================
--- linux-2.6.17-mm4.orig/include/linux/vmstat.h	2006-06-29 13:07:12.916728323 -0700
+++ linux-2.6.17-mm4/include/linux/vmstat.h	2006-06-29 13:18:20.594224260 -0700
@@ -174,6 +175,27 @@ extern void inc_zone_state(struct zone *
 void refresh_cpu_vm_stats(int);
 void refresh_vm_stats(void);
 
+static inline void vm_stat_setup(struct zone *zone, int cpu)
+{
+	int threshold;
+
+	/*
+	 * Increase the threshold by one for every 64 megabytes of memory.
+	 * The bigger the zone, the more leeway we can allow ourselves.
+	 *
+	 * The minimum threshold of 4 corresponds to 256 MB of memory.
+	 * The maximum of 125 reflects 8 gigabytes.
+	 */
+	threshold = zone->present_pages / (1024*1024*64 / PAGE_SIZE);
+
+	/*
+	 * Threshold must be between 4 and 125
+	 */
+	threshold = max(4, min(125, threshold));
+
+	zone_pcp(zone, cpu)->stat_threshold = threshold;
+}
+
 #else /* CONFIG_SMP */
 
 /*
@@ -210,6 +232,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_vm_stats(void) { }
+static inline void vm_stat_setup(struct zone *z, int cpu) { }
 
 #endif
 
 #endif /* _LINUX_VMSTAT_H */
* Re: ZVC: Increase threshold for larger processor configurations

From: Christoph Lameter @ 2006-06-30 6:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

On Thu, 29 Jun 2006, Christoph Lameter wrote:

> > Did you consider my earlier suggestion about these counters? That, over
> > the short term, they tend to count in only one direction? So we can do
>
> Uhh... We are overcompensating, right? Pretty funky idea that is new to me
> and that would require some thought.
>
> This would basically increase the stepping by 50% if we are only going in
> one direction.

A patch that does this:

ZVC: overcompensate while incrementing ZVC counters

Overcompensate by a balance factor when incrementing or decrementing ZVC
counters, anticipating a continual increase in the same direction.

Note that I have not been able to see any effect of this approach on an 8p
system where I tested it. I probably will have a chance to test it on
larger systems (160p) tomorrow.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-mm4/mm/vmstat.c
===================================================================
--- linux-2.6.17-mm4.orig/mm/vmstat.c	2006-06-29 13:35:16.959161608 -0700
+++ linux-2.6.17-mm4/mm/vmstat.c	2006-06-29 13:54:46.361438715 -0700
@@ -167,6 +167,9 @@ EXPORT_SYMBOL(mod_zone_page_state);
  * in between and therefore the atomicity vs. interrupt cannot be exploited
  * in a useful way here.
  */
+
+#define balance_factor(x) (x->stat_threshold / 2)
+
 static void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
@@ -175,8 +178,8 @@ static void __inc_zone_state(struct zone
 	(*p)++;
 
 	if (unlikely(*p > pcp->stat_threshold)) {
-		zone_page_state_add(*p, zone, item);
-		*p = 0;
+		zone_page_state_add(*p + balance_factor(pcp), zone, item);
+		*p = -balance_factor(pcp);
 	}
 }
 
@@ -195,8 +198,8 @@ void __dec_zone_page_state(struct page *
 	(*p)--;
 
 	if (unlikely(*p < -pcp->stat_threshold)) {
-		zone_page_state_add(*p, zone, item);
-		*p = 0;
+		zone_page_state_add(*p - balance_factor(pcp), zone, item);
+		*p = balance_factor(pcp);
 	}
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
* Re: ZVC: Increase threshold for larger processor configurations

From: Andrew Morton @ 2006-06-30 6:31 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel, Trond Myklebust

On Thu, 29 Jun 2006 23:15:55 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 Jun 2006, Christoph Lameter wrote:
>
> > > Did you consider my earlier suggestion about these counters? That,
> > > over the short term, they tend to count in only one direction? So we
> > > can do
> >
> > Uhh... We are overcompensating, right? Pretty funky idea that is new to
> > me and that would require some thought.
> >
> > This would basically increase the stepping by 50% if we are only going
> > in one direction.
>
> A patch that does this:
>
> ZVC: overcompensate while incrementing ZVC counters
>
> Overcompensate by a balance factor when incrementing or decrementing ZVC
> counters, anticipating a continual increase in the same direction.

Looks sensible. Please check that none of this is racy wrt memory hotplug
(process_zones->vm_stat_setup).

> Note that I have not been able to see any effect of this approach on an
> 8p system where I tested it.
> I probably will have a chance to test it on larger systems (160p)
> tomorrow.

OK. But let's not rush - it's only fine-tuning. I'm thinking we should get
what we have in -mm4 into -rc1 - I think it's stable enough, and we don't
want to be carrying all those changes splattered across the VM for the
next two months. Let's aim to get the well-measured fine-tuning in place
for -rc2, OK?

Are you aware of any to-do items remaining in the -mm4 patches? The NFS
changes need a review from Trond - hopefully he'll be able to find 5-10
minutes to do that sometime?
* Re: ZVC: Increase threshold for larger processor configurations

From: Christoph Lameter @ 2006-06-30 7:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Trond Myklebust

On Thu, 29 Jun 2006, Andrew Morton wrote:

> Are you aware of any to-do items remaining in the -mm4 patches? The NFS
> changes need a review from Trond - hopefully he'll be able to find 5-10
> minutes to do that sometime?

I am reasonably sure that these are okay now. I spent some quality hours
with NFS. Sigh. But another eye would be good.