* [RFC PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator
@ 2010-08-16  9:42 Mel Gorman
  2010-08-16  9:42 ` [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list Mel Gorman
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Mel Gorman @ 2010-08-16  9:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
      KOSAKI Motohiro, Mel Gorman

Internal IBM test teams beta testing distribution kernels have reported
problems on machines with a large number of CPUs whereby page allocator
failure messages show huge differences between the nr_free_pages vmstat
counter and what is available on the buddy lists. In an extreme example,
nr_free_pages was above the min watermark but zero pages were on the buddy
lists allowing the system to potentially deadlock. There is no reason why
the problems would not affect mainline so the following series mitigates
the problems in the page allocator related to per-cpu counter drift and
lists.

The first patch ensures that counters are updated after pages are added to
free lists.

The second patch notes that the counter drift between nr_free_pages and what
is on the per-cpu lists can be very high. When memory is low and kswapd is
awake, the per-cpu counters are checked as well as reading the value of
NR_FREE_PAGES. This will slow the page allocator when memory is low and
kswapd is awake but it will be much harder to breach the min watermark and
potentially livelock the system.

The third patch notes that after direct-reclaim an allocation can fail
because the necessary pages are on the per-cpu lists. After a
direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
a second attempt is made.

Performance tests did not show up anything interesting. A version of this
series that continually called vmstat_update() when memory was low was
tested internally and found to help the counter drift problem. I described
this during LSF/MM Summit and the potential for IPI storms was frowned
upon. An alternative fix is in patch two which uses for_each_online_cpu()
to read the vmstat deltas while memory is low and kswapd is awake. This
should be functionally similar.

Comments?

 include/linux/mmzone.h |    9 +++++++++
 mm/mmzone.c            |   27 +++++++++++++++++++++++++++
 mm/page_alloc.c        |   28 ++++++++++++++++++++++------
 mm/vmstat.c            |    5 ++++-
 4 files changed, 62 insertions(+), 7 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread
* [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Mel Gorman @ 2010-08-16  9:42 UTC
  To: linux-mm
  Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
      KOSAKI Motohiro, Mel Gorman

When allocating a page, the system uses NR_FREE_PAGES counters to determine
if watermarks would remain intact after the allocation was made. This
check is made without interrupts disabled or the zone lock held and so is
race-prone by nature. Unfortunately, when pages are being freed in batch,
the counters are updated before the pages are added on the list. During this
window, the counters are misleading as the pages do not exist yet. When
under significant pressure on systems with large numbers of CPUs, it's
possible for processes to make progress even though they should have been
stalled. This is particularly problematic if a number of the processes are
using GFP_ATOMIC as the min watermark can be accidentally breached and in
extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bd339e..c2407a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,12 +588,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
+	int freed = count;
 
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	while (count) {
 		struct page *page;
 		struct list_head *list;
@@ -621,6 +621,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
 		} while (--count && --batch_free && !list_empty(list));
 	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
 	spin_unlock(&zone->lock);
 }
 
@@ -631,8 +632,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	__free_one_page(page, zone, order, migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	spin_unlock(&zone->lock);
 }
 
-- 
1.7.1
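The effect of the reordering in free_pcppages_bulk() can be illustrated with
a small user-space model. This is only a sketch under stated assumptions, not
kernel code: struct toy_zone, toy_free_bulk() and the counter_first flag are
hypothetical names invented for the illustration, and locking and per-cpu
deltas are left out entirely.

```c
#include <assert.h>

/* Hypothetical stand-in for struct zone; only the two values that matter here. */
struct toy_zone {
	long nr_free_counter;	/* what a lockless watermark check would read */
	long pages_on_list;	/* pages really sitting on the free list */
};

/*
 * Free 'count' pages and return the largest amount by which the counter
 * overstated the list while the pages were being added, i.e. what a racing
 * reader could have seen. counter_first != 0 models the old ordering,
 * counter_first == 0 models the ordering after the patch.
 */
long toy_free_bulk(struct toy_zone *z, int count, int counter_first)
{
	long worst = 0;
	int i;

	/* The model assumes counter and list agree before the free starts. */
	assert(z->nr_free_counter == z->pages_on_list);

	if (counter_first)
		z->nr_free_counter += count;

	for (i = 0; i < count; i++) {
		long over = z->nr_free_counter - z->pages_on_list;

		if (over > worst)
			worst = over;
		z->pages_on_list++;	/* one more page is now really free */
	}

	if (!counter_first)
		z->nr_free_counter += count;

	return worst;
}
```

In this model the old ordering lets a reader see the counter ahead of the
list by up to the whole batch, while the new ordering only ever lets the
counter lag behind, which is the cautious direction for a watermark check.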
* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Rik van Riel @ 2010-08-16 14:04 UTC
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
      KOSAKI Motohiro

On 08/16/2010 05:42 AM, Mel Gorman wrote:
> When allocating a page, the system uses NR_FREE_PAGES counters to determine
> if watermarks would remain intact after the allocation was made. This
> check is made without interrupts disabled or the zone lock held and so is
> race-prone by nature. Unfortunately, when pages are being freed in batch,
> the counters are updated before the pages are added on the list. During this
> window, the counters are misleading as the pages do not exist yet. When
> under significant pressure on systems with large numbers of CPUs, it's
> possible for processes to make progress even though they should have been
> stalled. This is particularly problematic if a number of the processes are
> using GFP_ATOMIC as the min watermark can be accidentally breached and in
> extreme cases, the system can livelock.
>
> This patch updates the counters after the pages have been added to the
> list. This makes the allocator more cautious with respect to preserving
> the watermarks and mitigates livelock possibilities.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Johannes Weiner @ 2010-08-16 15:26 UTC
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki,
      KOSAKI Motohiro

On Mon, Aug 16, 2010 at 10:42:11AM +0100, Mel Gorman wrote:
> When allocating a page, the system uses NR_FREE_PAGES counters to determine
> if watermarks would remain intact after the allocation was made. This
> check is made without interrupts disabled or the zone lock held and so is
> race-prone by nature. Unfortunately, when pages are being freed in batch,
> the counters are updated before the pages are added on the list. During this
> window, the counters are misleading as the pages do not exist yet. When
> under significant pressure on systems with large numbers of CPUs, it's
> possible for processes to make progress even though they should have been
> stalled. This is particularly problematic if a number of the processes are
> using GFP_ATOMIC as the min watermark can be accidentally breached and in
> extreme cases, the system can livelock.
>
> This patch updates the counters after the pages have been added to the
> list. This makes the allocator more cautious with respect to preserving
> the watermarks and mitigates livelock possibilities.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Minchan Kim @ 2010-08-17  2:21 UTC
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
      KAMEZAWA Hiroyuki, KOSAKI Motohiro

Hi, Mel.

On Mon, Aug 16, 2010 at 6:42 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> When allocating a page, the system uses NR_FREE_PAGES counters to determine
> if watermarks would remain intact after the allocation was made. This
> check is made without interrupts disabled or the zone lock held and so is
> race-prone by nature. Unfortunately, when pages are being freed in batch,
> the counters are updated before the pages are added on the list. During this
> window, the counters are misleading as the pages do not exist yet. When
> under significant pressure on systems with large numbers of CPUs, it's
> possible for processes to make progress even though they should have been
> stalled. This is particularly problematic if a number of the processes are
> using GFP_ATOMIC as the min watermark can be accidentally breached and in
> extreme cases, the system can livelock.
>
> This patch updates the counters after the pages have been added to the
> list. This makes the allocator more cautious with respect to preserving
> the watermarks and mitigates livelock possibilities.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Page free path looks good by your patch.

Now allocation path decrease NR_FREE_PAGES _after_ it remove pages from buddy.
It can make that actually we don't have enough pages in buddy but
pretend to have enough pages.
It could make same situation with free path which is your concern.
So I think it can confuse watermark check in extreme case.

So don't we need to consider _allocation_ path with conservative?

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Mel Gorman @ 2010-08-17  9:59 UTC
  To: Minchan Kim
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
      KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 11:21:15AM +0900, Minchan Kim wrote:
> Hi, Mel.
>
> On Mon, Aug 16, 2010 at 6:42 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > When allocating a page, the system uses NR_FREE_PAGES counters to determine
> > if watermarks would remain intact after the allocation was made. This
> > check is made without interrupts disabled or the zone lock held and so is
> > race-prone by nature. Unfortunately, when pages are being freed in batch,
> > the counters are updated before the pages are added on the list. During this
> > window, the counters are misleading as the pages do not exist yet. When
> > under significant pressure on systems with large numbers of CPUs, it's
> > possible for processes to make progress even though they should have been
> > stalled. This is particularly problematic if a number of the processes are
> > using GFP_ATOMIC as the min watermark can be accidentally breached and in
> > extreme cases, the system can livelock.
> >
> > This patch updates the counters after the pages have been added to the
> > list. This makes the allocator more cautious with respect to preserving
> > the watermarks and mitigates livelock possibilities.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
>
> Page free path looks good by your patch.
>

Thanks

> Now allocation path decrease NR_FREE_PAGES _after_ it remove pages from buddy.
> It can make that actually we don't have enough pages in buddy but
> pretend to have enough pages.
> It could make same situation with free path which is your concern.
> So I think it can confuse watermark check in extreme case.
>
> So don't we need to consider _allocation_ path with conservative?
>

I considered it and it would be desirable. The downside was that the
paths became more complicated. Take rmqueue_bulk() for example. It could
start by modifying the counters but there then needs to be a recovery
path if all the requested pages were not allocated.

It'd be nice to see if these patches on their own were enough to
alleviate the worst of the per-cpu-counter drift before adding new
branches to the allocation path.

Does that make sense?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
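The extra complexity Mel describes for a conservative allocation path can be
pictured with a toy model. It is a sketch only: struct toy_zone and
toy_rmqueue_bulk() are hypothetical names made up here, nothing like this was
posted in the thread, and the real rmqueue_bulk() deals with orders,
migratetypes and locking that are ignored below.

```c
#include <assert.h>

/* Hypothetical stand-in for struct zone, reduced to two numbers. */
struct toy_zone {
	long nr_free_counter;	/* value a watermark check would read */
	long pages_on_list;	/* pages really available on the free list */
};

/*
 * Conservative bulk allocation: charge the counter for the whole request
 * up front, take what the list can actually provide, then refund the
 * shortfall. Returns the number of pages obtained.
 */
long toy_rmqueue_bulk(struct toy_zone *z, long want)
{
	long got;

	assert(want >= 0);

	/* Counter drops first, so a racing reader never sees it too high. */
	z->nr_free_counter -= want;

	got = want < z->pages_on_list ? want : z->pages_on_list;
	z->pages_on_list -= got;

	/* Recovery path: the additional branch mentioned in the reply. */
	if (got < want)
		z->nr_free_counter += want - got;

	return got;
}
```

The refund branch is what the plain "remove pages, then decrement" ordering
does not need, which is the trade-off being weighed in the discussion above.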
* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Minchan Kim @ 2010-08-17 14:25 UTC
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
      KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 10:59:18AM +0100, Mel Gorman wrote:
> On Tue, Aug 17, 2010 at 11:21:15AM +0900, Minchan Kim wrote:
> > Now allocation path decrease NR_FREE_PAGES _after_ it remove pages from buddy.
> > It can make that actually we don't have enough pages in buddy but
> > pretend to have enough pages.
> > It could make same situation with free path which is your concern.
> > So I think it can confuse watermark check in extreme case.
> >
> > So don't we need to consider _allocation_ path with conservative?
> >
>
> I considered it and it would be desirable. The downside was that the
> paths became more complicated. Take rmqueue_bulk() for example. It could
> start by modifying the counters but there then needs to be a recovery
> path if all the requested pages were not allocated.
>
> It'd be nice to see if these patches on their own were enough to
> alleviate the worst of the per-cpu-counter drift before adding new
> branches to the allocation path.
>
> Does that make sense?

No problem. It was a usecase of big machine.
I also hope we don't add unnecessary overhead in normal machine due to
unlikely problem.
Let's consider it by further step if it isn't enough.

Thanks, Mel.

>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: KAMEZAWA Hiroyuki @ 2010-08-18  2:21 UTC
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
      KOSAKI Motohiro

On Mon, 16 Aug 2010 10:42:11 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> When allocating a page, the system uses NR_FREE_PAGES counters to determine
> if watermarks would remain intact after the allocation was made. This
> check is made without interrupts disabled or the zone lock held and so is
> race-prone by nature. Unfortunately, when pages are being freed in batch,
> the counters are updated before the pages are added on the list. During this
> window, the counters are misleading as the pages do not exist yet. When
> under significant pressure on systems with large numbers of CPUs, it's
> possible for processes to make progress even though they should have been
> stalled. This is particularly problematic if a number of the processes are
> using GFP_ATOMIC as the min watermark can be accidentally breached and in
> extreme cases, the system can livelock.
>
> This patch updates the counters after the pages have been added to the
> list. This makes the allocator more cautious with respect to preserving
> the watermarks and mitigates livelock possibilities.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Mel Gorman @ 2010-08-16  9:42 UTC
  To: linux-mm
  Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
      KOSAKI Motohiro, Mel Gorman

Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high. If the system is under both load and low memory, it's possible
for watermarks to be breached. In extreme cases, the number of free pages
can drop to 0 leading to the possibility of system livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    9 +++++++++
 mm/mmzone.c            |   27 +++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |    5 ++++-
 4 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..1df3c43 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -456,6 +463,8 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+unsigned long zone_nr_free_pages(struct zone *zone);
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..89842ec 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * er-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (free < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2407a4..67a2ed0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2413,7 +2413,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..c95a159 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		zone->percpu_drift_mark = high_wmark_pages(zone) +
+					num_online_cpus() * threshold;
 	}
 }
 
@@ -813,7 +816,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n scanned %lu"
 		   "\n spanned %lu"
 		   "\n present %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Mel Gorman @ 2010-08-16  9:43 UTC
  To: linux-mm
  Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
      KOSAKI Motohiro

On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can be
> very high. If the system is under both load and low memory, it's possible
> for watermarks to be breached. In extreme cases, the number of free pages
> can drop to 0 leading to the possibility of system livelock.
>
> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

And the second I sent this, I realised I had sent a slightly old version
that missed a compile-fix :(

==== CUT HERE ====
mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high. If the system is under both load and low memory, it's possible
for watermarks to be breached. In extreme cases, the number of free pages
can drop to 0 leading to the possibility of system livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    9 +++++++++
 mm/mmzone.c            |   27 +++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |    5 ++++-
 4 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..1df3c43 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -456,6 +463,8 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+unsigned long zone_nr_free_pages(struct zone *zone);
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..056e374 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * er-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2407a4..67a2ed0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2413,7 +2413,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..c95a159 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		zone->percpu_drift_mark = high_wmark_pages(zone) +
+					num_online_cpus() * threshold;
 	}
 }
 
@@ -813,7 +816,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n scanned %lu"
 		   "\n spanned %lu"
 		   "\n present %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1
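The estimate taken by zone_nr_free_pages() and the way percpu_drift_mark is
derived can be tried out in a user-space model. What follows is a sketch
under stated assumptions rather than the kernel implementation: struct
toy_zone, TOY_NR_CPUS, toy_drift_mark() and toy_nr_free_pages() are
hypothetical names, the kswapd_awake flag stands in for the
!waitqueue_active() test in the patch, and plain long values replace the
kernel's counter types.

```c
#include <assert.h>

#define TOY_NR_CPUS 4	/* hypothetical fixed CPU count for the model */

/* Hypothetical stand-in for struct zone plus its per-cpu pagesets. */
struct toy_zone {
	long global_free;		/* zone-wide NR_FREE_PAGES value */
	long cpu_delta[TOY_NR_CPUS];	/* unflushed per-cpu deltas */
	long drift_mark;		/* models percpu_drift_mark */
	int kswapd_awake;		/* models !waitqueue_active(kswapd_wait) */
};

/*
 * Same shape as the calculation added to refresh_zone_stat_thresholds():
 * the high watermark plus the most that all per-cpu deltas could hide.
 */
long toy_drift_mark(long high_wmark, int ncpus, long threshold)
{
	assert(ncpus > 0 && threshold >= 0);
	return high_wmark + (long)ncpus * threshold;
}

/* Cheap read normally; fold in the per-cpu deltas only when memory is low. */
long toy_nr_free_pages(const struct toy_zone *z)
{
	long nr_free = z->global_free;
	int cpu;

	if (nr_free < z->drift_mark && z->kswapd_awake) {
		for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			nr_free += z->cpu_delta[cpu];
	}

	return nr_free;
}
```

With a high watermark of 80 pages, four CPUs and a threshold of 10, the drift
mark in this model is 120. A zone reporting 100 free pages globally while
each CPU holds an unflushed delta of -10 is then read as 60 pages once kswapd
is awake, instead of the optimistic 100.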
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Rik van Riel @ 2010-08-16 14:47 UTC
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
      KOSAKI Motohiro

On 08/16/2010 05:43 AM, Mel Gorman wrote:
> On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
>> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
>> it is cheaper than scanning a number of lists. To avoid synchronization
>> overhead, counter deltas are maintained on a per-cpu basis and drained both
>> periodically and when the delta is above a threshold. On large CPU systems,
>> the difference between the estimated and real value of NR_FREE_PAGES can be
>> very high. If the system is under both load and low memory, it's possible
>> for watermarks to be breached. In extreme cases, the number of free pages
>> can drop to 0 leading to the possibility of system livelock.
>>
>> This patch introduces zone_nr_free_pages() to take a slightly more accurate
>> estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
>> and may result in cache line bounces but is expected to be lighter than the
>> IPI calls necessary to continually drain the per-cpu counters while kswapd
>> is awake.
>>
>> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
>
> And the second I sent this, I realised I had sent a slightly old version
> that missed a compile-fix :(

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-16 9:43 ` Mel Gorman 2010-08-16 14:47 ` Rik van Riel @ 2010-08-16 16:06 ` Johannes Weiner 2010-08-17 2:26 ` Minchan Kim 2010-08-17 10:16 ` Mel Gorman 2010-08-19 15:46 ` Minchan Kim 2 siblings, 2 replies; 49+ messages in thread From: Johannes Weiner @ 2010-08-16 16:06 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro [npiggin@suse.de bounces, switched to yahoo address] On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote: > On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote: > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as > > it is cheaper than scanning a number of lists. To avoid synchronization > > overhead, counter deltas are maintained on a per-cpu basis and drained both > > periodically and when the delta is above a threshold. On large CPU systems, > > the difference between the estimated and real value of NR_FREE_PAGES can be > > very high. If the system is under both load and low memory, it's possible > > for watermarks to be breached. In extreme cases, the number of free pages > > can drop to 0 leading to the possibility of system livelock. > > > > This patch introduces zone_nr_free_pages() to take a slightly more accurate > > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect > > and may result in cache line bounces but is expected to be lighter than the > > IPI calls necessary to continually drain the per-cpu counters while kswapd > > is awake. 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > And the second I sent this, I realised I had sent a slightly old version > that missed a compile-fix :( > > ==== CUT HERE ==== > mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as > it is cheaper than scanning a number of lists. To avoid synchronization > overhead, counter deltas are maintained on a per-cpu basis and drained both > periodically and when the delta is above a threshold. On large CPU systems, > the difference between the estimated and real value of NR_FREE_PAGES can be > very high. If the system is under both load and low memory, it's possible > for watermarks to be breached. In extreme cases, the number of free pages > can drop to 0 leading to the possibility of system livelock. > > This patch introduces zone_nr_free_pages() to take a slightly more accurate > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect > and may result in cache line bounces but is expected to be lighter than the > IPI calls necessary to continually drain the per-cpu counters while kswapd > is awake. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> [...] > --- a/mm/mmzone.c > +++ b/mm/mmzone.c > @@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn, > return 1; > } > #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ > + > +/* Called when a more accurate view of NR_FREE_PAGES is needed */ > +unsigned long zone_nr_free_pages(struct zone *zone) > +{ > + unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES); > + > + /* > + * While kswapd is awake, it is considered the zone is under some > + * memory pressure. Under pressure, there is a risk that > + * er-cpu-counter-drift will allow the min watermark to be breached Missing `p'. > + * potentially causing a live-lock. 
While kswapd is awake and
> +        * free pages are low, get a better estimate for free pages
> +        */
> +       if (nr_free_pages < zone->percpu_drift_mark &&
> +                       !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> +               int cpu;
> +
> +               for_each_online_cpu(cpu) {
> +                       struct per_cpu_pageset *pset;
> +
> +                       pset = per_cpu_ptr(zone->pageset, cpu);
> +                       nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> +               }
> +       }
> +
> +       return nr_free_pages;
> +}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c2407a4..67a2ed0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  {
>         /* free_pages my go negative - that's OK */
>         long min = mark;
> -       long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
> +       long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
>         int o;
>
>         if (alloc_flags & ALLOC_HIGH)
> @@ -2413,7 +2413,7 @@ void show_free_areas(void)
>                         " all_unreclaimable? %s"
>                         "\n",
>                         zone->name,
> -                       K(zone_page_state(zone, NR_FREE_PAGES)),
> +                       K(zone_nr_free_pages(zone)),
>                         K(min_wmark_pages(zone)),
>                         K(low_wmark_pages(zone)),
>                         K(high_wmark_pages(zone)),
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7759941..c95a159 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
>                 for_each_online_cpu(cpu)
>                         per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>                                                         = threshold;
> +
> +               zone->percpu_drift_mark = high_wmark_pages(zone) +
> +                                       num_online_cpus() * threshold;
>         }
>  }

Hm, this one I don't quite get (might be the jetlag, though): we have
_at least_ NR_FREE_PAGES free pages, there may just be more lurking in
the pcp counters.

So shouldn't we only collect the pcp deltas in case the high watermark
is breached? Above this point, we should be fine or better, no?

        Hannes
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-16 16:06 ` Johannes Weiner @ 2010-08-17 2:26 ` Minchan Kim 2010-08-17 10:42 ` Mel Gorman 2010-08-17 10:16 ` Mel Gorman 1 sibling, 1 reply; 49+ messages in thread From: Minchan Kim @ 2010-08-17 2:26 UTC (permalink / raw) To: Johannes Weiner Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: > [npiggin@suse.de bounces, switched to yahoo address] > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote: <snip> >> + * potentially causing a live-lock. While kswapd is awake and >> + * free pages are low, get a better estimate for free pages >> + */ >> + if (nr_free_pages < zone->percpu_drift_mark && >> + !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) { >> + int cpu; >> + >> + for_each_online_cpu(cpu) { >> + struct per_cpu_pageset *pset; >> + >> + pset = per_cpu_ptr(zone->pageset, cpu); >> + nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES]; We need to consider CONFIG_SMP. >> + } >> + } >> + >> + return nr_free_pages; >> +} >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index c2407a4..67a2ed0 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, >> { >> /* free_pages my go negative - that's OK */ >> long min = mark; >> - long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1; >> + long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; >> int o; >> >> if (alloc_flags & ALLOC_HIGH) >> @@ -2413,7 +2413,7 @@ void show_free_areas(void) >> " all_unreclaimable? 
%s"
>>                        "\n",
>>                        zone->name,
>> -                      K(zone_page_state(zone, NR_FREE_PAGES)),
>> +                      K(zone_nr_free_pages(zone)),
>>                        K(min_wmark_pages(zone)),
>>                        K(low_wmark_pages(zone)),
>>                        K(high_wmark_pages(zone)),
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 7759941..c95a159 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
>>                for_each_online_cpu(cpu)
>>                        per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>>                                                        = threshold;
>> +
>> +              zone->percpu_drift_mark = high_wmark_pages(zone) +
>> +                                      num_online_cpus() * threshold;
>>        }
>>  }
>
> Hm, this one I don't quite get (might be the jetlag, though): we have
> _at least_ NR_FREE_PAGES free pages, there may just be more lurking in

We can't make sure it.
As I said previous mail, current allocation path decreases
NR_FREE_PAGES after it removes pages from buddy list.

> the pcp counters.
>
> So shouldn't we only collect the pcp deltas in case the high watermark
> is breached? Above this point, we should be fine or better, no?

If we don't consider allocation path, I agree on Hannes's opinion.
At least, we need to listen why Mel determine the threshold. :)

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-17 2:26 ` Minchan Kim @ 2010-08-17 10:42 ` Mel Gorman 2010-08-17 15:01 ` Minchan Kim 0 siblings, 1 reply; 49+ messages in thread From: Mel Gorman @ 2010-08-17 10:42 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Tue, Aug 17, 2010 at 11:26:05AM +0900, Minchan Kim wrote: > On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: > > [npiggin@suse.de bounces, switched to yahoo address] > > > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote: > > <snip> > > >> + * potentially causing a live-lock. While kswapd is awake and > >> + * free pages are low, get a better estimate for free pages > >> + */ > >> + if (nr_free_pages < zone->percpu_drift_mark && > >> + !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) { > >> + int cpu; > >> + > >> + for_each_online_cpu(cpu) { > >> + struct per_cpu_pageset *pset; > >> + > >> + pset = per_cpu_ptr(zone->pageset, cpu); > >> + nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES]; > > We need to consider CONFIG_SMP. > We do. #ifdef CONFIG_SMP unsigned long zone_nr_free_pages(struct zone *zone); #else #define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES) #endif /* CONFIG_SMP */ and a wrapping of CONFIG_SMP around the function in mmzone.c . 
> >> + } > >> + } > >> + > >> + return nr_free_pages; > >> +} > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >> index c2407a4..67a2ed0 100644 > >> --- a/mm/page_alloc.c > >> +++ b/mm/page_alloc.c > >> @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > >> { > >> /* free_pages my go negative - that's OK */ > >> long min = mark; > >> - long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1; > >> + long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; > >> int o; > >> > >> if (alloc_flags & ALLOC_HIGH) > >> @@ -2413,7 +2413,7 @@ void show_free_areas(void) > >> " all_unreclaimable? %s" > >> "\n", > >> zone->name, > >> - K(zone_page_state(zone, NR_FREE_PAGES)), > >> + K(zone_nr_free_pages(zone)), > >> K(min_wmark_pages(zone)), > >> K(low_wmark_pages(zone)), > >> K(high_wmark_pages(zone)), > >> diff --git a/mm/vmstat.c b/mm/vmstat.c > >> index 7759941..c95a159 100644 > >> --- a/mm/vmstat.c > >> +++ b/mm/vmstat.c > >> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void) > >> for_each_online_cpu(cpu) > >> per_cpu_ptr(zone->pageset, cpu)->stat_threshold > >> = threshold; > >> + > >> + zone->percpu_drift_mark = high_wmark_pages(zone) + > >> + num_online_cpus() * threshold; > >> } > >> } > > > > Hm, this one I don't quite get (might be the jetlag, though): we have > > _at least_ NR_FREE_PAGES free pages, there may just be more lurking in > > We can't make sure it. > As I said previous mail, current allocation path decreases > NR_FREE_PAGES after it removes pages from buddy list. > > > the pcp counters. > > > > So shouldn't we only collect the pcp deltas in case the high watermark > > is breached? Above this point, we should be fine or better, no? > > If we don't consider allocation path, I agree on Hannes's opinion. > At least, we need to listen why Mel determine the threshold. 
:)
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-17 10:42 ` Mel Gorman
@ 2010-08-17 15:01 ` Minchan Kim
  2010-08-17 15:05 ` Mel Gorman
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2010-08-17 15:01 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 11:42:46AM +0100, Mel Gorman wrote:
> On Tue, Aug 17, 2010 at 11:26:05AM +0900, Minchan Kim wrote:
> > On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > [npiggin@suse.de bounces, switched to yahoo address]
> > >
> > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> >
> > <snip>
> >
> > >> +        * potentially causing a live-lock. While kswapd is awake and
> > >> +        * free pages are low, get a better estimate for free pages
> > >> +        */
> > >> +       if (nr_free_pages < zone->percpu_drift_mark &&
> > >> +                       !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > >> +               int cpu;
> > >> +
> > >> +               for_each_online_cpu(cpu) {
> > >> +                       struct per_cpu_pageset *pset;
> > >> +
> > >> +                       pset = per_cpu_ptr(zone->pageset, cpu);
> > >> +                       nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> >
> > We need to consider CONFIG_SMP.
> >
>
> We do.
>
> #ifdef CONFIG_SMP
> unsigned long zone_nr_free_pages(struct zone *zone);
> #else
> #define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
> #endif /* CONFIG_SMP */
>
> and a wrapping of CONFIG_SMP around the function in mmzone.c .

I can't find it in this patch series.
Hmm.. :(

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-17 15:01 ` Minchan Kim @ 2010-08-17 15:05 ` Mel Gorman 0 siblings, 0 replies; 49+ messages in thread From: Mel Gorman @ 2010-08-17 15:05 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Wed, Aug 18, 2010 at 12:01:44AM +0900, Minchan Kim wrote: > On Tue, Aug 17, 2010 at 11:42:46AM +0100, Mel Gorman wrote: > > On Tue, Aug 17, 2010 at 11:26:05AM +0900, Minchan Kim wrote: > > > On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > [npiggin@suse.de bounces, switched to yahoo address] > > > > > > > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote: > > > > > > <snip> > > > > > > >> + * potentially causing a live-lock. While kswapd is awake and > > > >> + * free pages are low, get a better estimate for free pages > > > >> + */ > > > >> + if (nr_free_pages < zone->percpu_drift_mark && > > > >> + !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) { > > > >> + int cpu; > > > >> + > > > >> + for_each_online_cpu(cpu) { > > > >> + struct per_cpu_pageset *pset; > > > >> + > > > >> + pset = per_cpu_ptr(zone->pageset, cpu); > > > >> + nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES]; > > > > > > We need to consider CONFIG_SMP. > > > > > > > We do. > > > > #ifdef CONFIG_SMP > > unsigned long zone_nr_free_pages(struct zone *zone); > > #else > > #define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES) > > #endif /* CONFIG_SMP */ > > > > and a wrapping of CONFIG_SMP around the function in mmzone.c . > > I can't find it in this patch series. My bad. What I meant is "You're right, we do need to consider CONFIG_SMP, how about something like the following"; I've made such a change to my local tree but it was not part of the released series. 
Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-16 16:06 ` Johannes Weiner 2010-08-17 2:26 ` Minchan Kim @ 2010-08-17 10:16 ` Mel Gorman 2010-08-17 11:05 ` Johannes Weiner 2010-08-17 14:20 ` Minchan Kim 1 sibling, 2 replies; 49+ messages in thread From: Mel Gorman @ 2010-08-17 10:16 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Mon, Aug 16, 2010 at 06:06:23PM +0200, Johannes Weiner wrote: > [npiggin@suse.de bounces, switched to yahoo address] > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote: > > On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote: > > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as > > > it is cheaper than scanning a number of lists. To avoid synchronization > > > overhead, counter deltas are maintained on a per-cpu basis and drained both > > > periodically and when the delta is above a threshold. On large CPU systems, > > > the difference between the estimated and real value of NR_FREE_PAGES can be > > > very high. If the system is under both load and low memory, it's possible > > > for watermarks to be breached. In extreme cases, the number of free pages > > > can drop to 0 leading to the possibility of system livelock. > > > > > > This patch introduces zone_nr_free_pages() to take a slightly more accurate > > > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect > > > and may result in cache line bounces but is expected to be lighter than the > > > IPI calls necessary to continually drain the per-cpu counters while kswapd > > > is awake. 
> > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > And the second I sent this, I realised I had sent a slightly old version > > that missed a compile-fix :( > > > > ==== CUT HERE ==== > > mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake > > > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as > > it is cheaper than scanning a number of lists. To avoid synchronization > > overhead, counter deltas are maintained on a per-cpu basis and drained both > > periodically and when the delta is above a threshold. On large CPU systems, > > the difference between the estimated and real value of NR_FREE_PAGES can be > > very high. If the system is under both load and low memory, it's possible > > for watermarks to be breached. In extreme cases, the number of free pages > > can drop to 0 leading to the possibility of system livelock. > > > > This patch introduces zone_nr_free_pages() to take a slightly more accurate > > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect > > and may result in cache line bounces but is expected to be lighter than the > > IPI calls necessary to continually drain the per-cpu counters while kswapd > > is awake. > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > [...] > > > --- a/mm/mmzone.c > > +++ b/mm/mmzone.c > > @@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn, > > return 1; > > } > > #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ > > + > > +/* Called when a more accurate view of NR_FREE_PAGES is needed */ > > +unsigned long zone_nr_free_pages(struct zone *zone) > > +{ > > + unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES); > > + > > + /* > > + * While kswapd is awake, it is considered the zone is under some > > + * memory pressure. Under pressure, there is a risk that > > + * er-cpu-counter-drift will allow the min watermark to be breached > > Missing `p'. > D'oh. 
Fixed > > + * potentially causing a live-lock. While kswapd is awake and > > + * free pages are low, get a better estimate for free pages > > + */ > > + if (nr_free_pages < zone->percpu_drift_mark && > > + !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) { > > + int cpu; > > + > > + for_each_online_cpu(cpu) { > > + struct per_cpu_pageset *pset; > > + > > + pset = per_cpu_ptr(zone->pageset, cpu); > > + nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES]; > > + } > > + } > > + > > + return nr_free_pages; > > +} > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index c2407a4..67a2ed0 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > > { > > /* free_pages my go negative - that's OK */ > > long min = mark; > > - long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1; > > + long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; > > int o; > > > > if (alloc_flags & ALLOC_HIGH) > > @@ -2413,7 +2413,7 @@ void show_free_areas(void) > > " all_unreclaimable? %s" > > "\n", > > zone->name, > > - K(zone_page_state(zone, NR_FREE_PAGES)), > > + K(zone_nr_free_pages(zone)), > > K(min_wmark_pages(zone)), > > K(low_wmark_pages(zone)), > > K(high_wmark_pages(zone)), > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index 7759941..c95a159 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void) > > for_each_online_cpu(cpu) > > per_cpu_ptr(zone->pageset, cpu)->stat_threshold > > = threshold; > > + > > + zone->percpu_drift_mark = high_wmark_pages(zone) + > > + num_online_cpus() * threshold; > > } > > } > > Hm, this one I don't quite get (might be the jetlag, though): we have > _at least_ NR_FREE_PAGES free pages, there may just be more lurking in > the pcp counters. > Well, the drift can be either direction because drift can be due to pages being either freed or allocated. e.g. 
it could be something like

NR_FREE_PAGES   CPU 0   CPU 1   Actual Free
          128     -32     +64           160

Because CPU 0 was allocating pages while CPU 1 was freeing them but that
is not what is important here. At any given time, the NR_FREE_PAGES can be
wrong by as much as

        num_online_cpus * (threshold - 1)

As kswapd goes back to sleep when the high watermark is reached, it's important
that it has actually reached the watermark before sleeping. Similarly,
if an allocator is checking the low watermark, it needs an accurate count.
Hence a more careful accounting for NR_FREE_PAGES should happen when the
number of free pages is within

        high_watermark + (num_online_cpus * (threshold - 1))

Only checking when kswapd is awake still leaves a window between the low
and min watermark when we could breach the watermark but I'm expecting it
can only happen for at worst one allocation. After that, kswapd wakes
and the count becomes accurate again.

> So shouldn't we only collect the pcp deltas in case the high watermark
> is breached? Above this point, we should be fine or better, no?
>

Is that not what is happening in zone_nr_free_pages with this check?

        /*
         * While kswapd is awake, it is considered the zone is under some
         * memory pressure. Under pressure, there is a risk that
         * per-cpu-counter-drift will allow the min watermark to be breached
         * potentially causing a live-lock. While kswapd is awake and
         * free pages are low, get a better estimate for free pages
         */
        if (nr_free_pages < zone->percpu_drift_mark &&
                        !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {

Maybe I'm misunderstanding your question.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
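The drift arithmetic in the message above can be checked with a small user-space model. This is a sketch under stated assumptions, not the kernel code: `zone_model`, `max_drift()`, `drift_mark()` and `model_nr_free_pages()` are hypothetical names, and the `kswapd_awake` flag stands in for the `!waitqueue_active(&zone->zone_pgdat->kswapd_wait)` test in the patch. Note that the patch computes the mark with `num_online_cpus() * threshold` while the prose bound uses `(threshold - 1)`; the sketch exposes both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the zone fields the patch reads. */
struct zone_model {
	long nr_free_pages;		/* global NR_FREE_PAGES, possibly stale */
	long high_wmark;		/* high watermark, in pages */
	long percpu_drift_mark;		/* below this the global value is not trusted */
	const signed char *vm_stat_diff;/* one pending delta per online CPU */
	size_t nr_cpus;
	bool kswapd_awake;		/* models !waitqueue_active(kswapd_wait) */
};

/* Worst-case error of the global counter, as stated in the thread. */
long max_drift(size_t nr_cpus, long threshold)
{
	assert(threshold >= 1);
	return (long)nr_cpus * (threshold - 1);
}

/* Mark as computed by the patch: high watermark + nr_cpus * threshold. */
long drift_mark(long high_wmark, size_t nr_cpus, long threshold)
{
	return high_wmark + (long)nr_cpus * threshold;
}

/* Cheap read normally; sum the per-cpu deltas when low and kswapd is awake. */
long model_nr_free_pages(const struct zone_model *z)
{
	long nr_free = z->nr_free_pages;
	size_t cpu;

	if (nr_free < z->percpu_drift_mark && z->kswapd_awake) {
		for (cpu = 0; cpu < z->nr_cpus; cpu++)
			nr_free += z->vm_stat_diff[cpu];
	}
	return nr_free;
}
```

Feeding in the example from the message (global counter 128, deltas -32 and +64) gives 160 while kswapd is awake and the raw 128 otherwise.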
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-17 10:16 ` Mel Gorman @ 2010-08-17 11:05 ` Johannes Weiner 2010-08-17 14:20 ` Minchan Kim 1 sibling, 0 replies; 49+ messages in thread From: Johannes Weiner @ 2010-08-17 11:05 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Tue, Aug 17, 2010 at 11:16:55AM +0100, Mel Gorman wrote: > On Mon, Aug 16, 2010 at 06:06:23PM +0200, Johannes Weiner wrote: > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote: > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > index 7759941..c95a159 100644 > > > --- a/mm/vmstat.c > > > +++ b/mm/vmstat.c > > > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void) > > > for_each_online_cpu(cpu) > > > per_cpu_ptr(zone->pageset, cpu)->stat_threshold > > > = threshold; > > > + > > > + zone->percpu_drift_mark = high_wmark_pages(zone) + > > > + num_online_cpus() * threshold; > > > } > > > } > > > > Hm, this one I don't quite get (might be the jetlag, though): we have > > _at least_ NR_FREE_PAGES free pages, there may just be more lurking in > > the pcp counters. > > > > Well, the drift can be either direction because drift can be due to pages > being either freed or allocated. e.g. it could be something like > > NR_FREE_PAGES CPU 0 CPU 1 Actual Free > 128 -32 +64 160 > > Because CPU 0 was allocating pages while CPU 1 was freeing them but that > is not what is important here. At any given time, the NR_FREE_PAGES can be > wrong by as much as > > num_online_cpus * (threshold - 1) I somehow assumed the pcp cache could only be positive, but the vm_stat_diff can indeed hold negative values. > > So shouldn't we only collect the pcp deltas in case the high watermark > > is breached? Above this point, we should be fine or better, no? > > > > Is that not what is happening in zone_nr_free_pages with this check? 
>
>       /*
>        * While kswapd is awake, it is considered the zone is under some
>        * memory pressure. Under pressure, there is a risk that
>        * per-cpu-counter-drift will allow the min watermark to be breached
>        * potentially causing a live-lock. While kswapd is awake and
>        * free pages are low, get a better estimate for free pages
>        */
>       if (nr_free_pages < zone->percpu_drift_mark &&
>                       !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
>
> Maybe I'm misunderstanding your question.

This was just a conclusion based on my wrong assumption: if the pcp
diff could only be positive, it would be enough to go for accurate
counts at the point NR_FREE_PAGES breaches the watermark.

As it is, however, the error margin needs to be taken into account in
both directions, as you said, so your patch makes perfect sense.

Sorry for the noise! And

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-17 10:16 ` Mel Gorman 2010-08-17 11:05 ` Johannes Weiner @ 2010-08-17 14:20 ` Minchan Kim 2010-08-18 8:51 ` Mel Gorman 1 sibling, 1 reply; 49+ messages in thread From: Minchan Kim @ 2010-08-17 14:20 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Tue, Aug 17, 2010 at 11:16:55AM +0100, Mel Gorman wrote: > Well, the drift can be either direction because drift can be due to pages > being either freed or allocated. e.g. it could be something like > > NR_FREE_PAGES CPU 0 CPU 1 Actual Free > 128 -32 +64 160 > > Because CPU 0 was allocating pages while CPU 1 was freeing them but that > is not what is important here. At any given time, the NR_FREE_PAGES can be > wrong by as much as > > num_online_cpus * (threshold - 1) That's the answer I expected. As I mentioned previous mail, we need to consider allocation path. But you already have been considered it by partially in here. Yes. It looks good to me. :) Reviewed-by: Minchan Kim <minchan.kim@gmail.com> > > As kswapd goes back to sleep when the high watermark is reached, it's important > that it has actually reached the watermark before sleeping. Similarly, > if an allocator is checking the low watermark, it needs an accurate count. > Hence a more careful accounting for NR_FREE_PAGES should happen when the > number of free pages is within > > high_watermark + (num_online_cpus * (threshold - 1)) > > Only checking when kswapd is awake still leaves a window between the low > and min watermark when we could breach the watermark but I'm expecting it > can only happen for at worst one allocation. After that, kswapd wakes > and the count becomes accurate again. I can't understand the point. Now kswapd starts from below low wmark and stops until high wmark. 
So if VM has pages of below low wmark, it could always check by zone_nr_free_pages
regardless of min.

What's a window low and min wmark? Maybe I can miss your point.

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-17 14:20 ` Minchan Kim @ 2010-08-18 8:51 ` Mel Gorman 2010-08-18 14:57 ` Minchan Kim 0 siblings, 1 reply; 49+ messages in thread From: Mel Gorman @ 2010-08-18 8:51 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Tue, Aug 17, 2010 at 11:20:40PM +0900, Minchan Kim wrote: > On Tue, Aug 17, 2010 at 11:16:55AM +0100, Mel Gorman wrote: > > Well, the drift can be either direction because drift can be due to pages > > being either freed or allocated. e.g. it could be something like > > > > NR_FREE_PAGES CPU 0 CPU 1 Actual Free > > 128 -32 +64 160 > > > > Because CPU 0 was allocating pages while CPU 1 was freeing them but that > > is not what is important here. At any given time, the NR_FREE_PAGES can be > > wrong by as much as > > > > num_online_cpus * (threshold - 1) > > That's the answer I expected. > As I mentioned previous mail, we need to consider allocation path. > But you already have been considered it by partially in here. > Yes. It looks good to me. :) > > Reviewed-by: Minchan Kim <minchan.kim@gmail.com> > Thanks. > > > > As kswapd goes back to sleep when the high watermark is reached, it's important > > that it has actually reached the watermark before sleeping. Similarly, > > if an allocator is checking the low watermark, it needs an accurate count. > > Hence a more careful accounting for NR_FREE_PAGES should happen when the > > number of free pages is within > > > > high_watermark + (num_online_cpus * (threshold - 1)) > > > > Only checking when kswapd is awake still leaves a window between the low > > and min watermark when we could breach the watermark but I'm expecting it > > can only happen for at worst one allocation. After that, kswapd wakes > > and the count becomes accurate again. > > I can't understand the point. 
> Now kswapd starts from below low wmark and stops until high wmark.

Correct.

> So if VM has pages of below low wmark, it could always check by zone_nr_free_pages
> regardless of min.
>

The difficulty is that NR_FREE_PAGES is an estimate so for a time the VM may
not know it is below the low watermark. We can get a more accurate view but
it's costly so we want to avoid that cost whenever we can.

> What's a window low and min wmark? Maybe I can miss your point.
>

The window is due to the fact kswapd is not awake yet. The window is because
kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
system is really somewhere between the low and min watermark but we are not
taking the accurate measure until kswapd gets woken up. The first allocation
to notice we are below the low watermark (be it due to vmstat refreshing or
that NR_FREE_PAGES happens to report we are below the watermark regardless of
any drift) wakes kswapd and other callers then take an accurate count hence
"we could breach the watermark but I'm expecting it can only happen for at
worst one allocation".

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
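The arithmetic discussed in this exchange (the worst-case drift of num_online_cpus * (threshold - 1), the "high_watermark + drift" range in which a careful count is worthwhile, and the NR_FREE_PAGES/CPU 0/CPU 1 example) can be restated as a small standalone sketch. This is only an illustration under stated assumptions: the function names below are hypothetical, nothing here is taken from the patch itself, and the kernel's real code works on per-zone, per-cpu structures rather than plain arrays.

```c
#include <assert.h>

/*
 * Illustrative helpers only. The names are hypothetical and are not
 * kernel functions; they restate the arithmetic from the thread.
 */

/*
 * Worst-case error in the global counter: each online CPU may hold a
 * local delta of up to (threshold - 1) pages that has not yet been
 * folded into NR_FREE_PAGES.
 */
long max_counter_drift(long online_cpus, long threshold)
{
	assert(online_cpus >= 1 && threshold >= 1);
	return online_cpus * (threshold - 1);
}

/*
 * Nonzero when the cheap estimate is close enough to the high
 * watermark that the drift could hide a breach, so the per-cpu
 * deltas are worth summing.
 */
int needs_accurate_count(long nr_free_estimate, long high_wmark,
			 long online_cpus, long threshold)
{
	return nr_free_estimate <
	       high_wmark + max_counter_drift(online_cpus, threshold);
}

/*
 * Accurate count: the global estimate plus every CPU's pending delta.
 * A delta is negative when that CPU has been allocating pages and
 * positive when it has been freeing them.
 */
long accurate_free_pages(long nr_free_estimate, const long *cpu_delta,
			 int nr_cpus)
{
	long total = nr_free_estimate;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		total += cpu_delta[cpu];
	return total;
}
```

With the numbers from the example quoted earlier in the thread (estimate 128, CPU 0 at -32, CPU 1 at +64) the accurate count comes to 160. The cost of the accurate path grows with the number of CPUs, which is why the series only takes it when memory is low and kswapd is awake.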
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-18  8:51 ` Mel Gorman
@ 2010-08-18 14:57 ` Minchan Kim
  2010-08-19  8:06 ` Mel Gorman
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2010-08-18 14:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > What's a window low and min wmark? Maybe I can miss your point.
> >
>
> The window is due to the fact kswapd is not awake yet. The window is because
> kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> system is really somewhere between the low and min watermark but we are not
> taking the accurate measure until kswapd gets woken up. The first allocation
> to notice we are below the low watermark (be it due to vmstat refreshing or
> that NR_FREE_PAGES happens to report we are below the watermark regardless of
> any drift) wakes kswapd and other callers then take an accurate count hence
> "we could breach the watermark but I'm expecting it can only happen for at
> worst one allocation".

Right. I misunderstood your word.
One more question.

Could you explain live lock scenario?

I looked over the code. Although the VM pass zone_watermark_ok by luck,
It can't allocate the page from buddy and then might go OOM.
When do we meet live lock case?

I think the description in change log would be better to understand
this patch in future.

Thanks.

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-18 14:57 ` Minchan Kim
@ 2010-08-19  8:06 ` Mel Gorman
  2010-08-19 10:33 ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Mel Gorman @ 2010-08-19 8:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > > What's a window low and min wmark? Maybe I can miss your point.
> > >
> >
> > The window is due to the fact kswapd is not awake yet. The window is because
> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > system is really somewhere between the low and min watermark but we are not
> > taking the accurate measure until kswapd gets woken up. The first allocation
> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > any drift) wakes kswapd and other callers then take an accurate count hence
> > "we could breach the watermark but I'm expecting it can only happen for at
> > worst one allocation".
>
> Right. I misunderstood your word.
> One more question.
>
> Could you explain live lock scenario?
>

Let's say

NR_FREE_PAGES     = 256
Actual free pages = 8

The PCP lists get refilled in batch taking all 8 pages. Now there are
zero free pages. Reclaim kicks in but to reclaim any pages it needs to
clean something but all the pages are on a network-backed filesystem. To
clean them, it must transmit on the network so it tries to allocate some
buffers.

The livelock is that to free some memory, an allocation must succeed but
for an allocation to succeed, some memory must be freed. The system
might still remain alive if a process exits and does not need to
allocate memory while exiting but by and large, the system is in a
dangerous state.

> I looked over the code. Although the VM pass zone_watermark_ok by luck,
> It can't allocate the page from buddy and then might go OOM.
> When do we meet live lock case?
>
> I think the description in change log would be better to understand
> this patch in future.
>

Is the above description useful? If so, I can put it in the leader.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 8:06 ` Mel Gorman @ 2010-08-19 10:33 ` Minchan Kim 2010-08-19 10:38 ` Mel Gorman 0 siblings, 1 reply; 49+ messages in thread From: Minchan Kim @ 2010-08-19 10:33 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: >> > > What's a window low and min wmark? Maybe I can miss your point. >> > > >> > >> > The window is due to the fact kswapd is not awake yet. The window is because >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The >> > system is really somewhere between the low and min watermark but we are not >> > taking the accurate measure until kswapd gets woken up. The first allocation >> > to notice we are below the low watermark (be it due to vmstat refreshing or >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of >> > any drift) wakes kswapd and other callers then take an accurate count hence >> > "we could breach the watermark but I'm expecting it can only happen for at >> > worst one allocation". >> >> Right. I misunderstood your word. >> One more question. >> >> Could you explain live lock scenario? >> > > Lets say > > NR_FREE_PAGES = 256 > Actual free pages = 8 > > The PCP lists get refilled in patch taking all 8 pages. Now there are > zero free pages. Reclaim kicks in but to reclaim any pages it needs to > clean something but all the pages are on a network-backed filesystem. To > clean them, it must transmit on the network so it tries to allocate some > buffers. > > The livelock is that to free some memory, an allocation must succeed but > for an allocation to succeed, some memory must be freed. 
The system

Yes. I understood this as livelock but at last VM will kill victim
process then it can allocate free pages.
So I think it's not a livelock.

> might still remain alive if a process exits and does not need to
> allocate memory while exiting but by and large, the system is in a
> dangerous state.

Do you mean dangerous state of the system is livelock?
Maybe not.
I can't understand livelock in this context.
Anyway, I am okay with this patch except the livelock phrase. :)

Thanks, Mel.

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 10:33 ` Minchan Kim @ 2010-08-19 10:38 ` Mel Gorman 2010-08-19 14:01 ` Minchan Kim 0 siblings, 1 reply; 49+ messages in thread From: Mel Gorman @ 2010-08-19 10:38 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote: > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: > >> > > What's a window low and min wmark? Maybe I can miss your point. > >> > > > >> > > >> > The window is due to the fact kswapd is not awake yet. The window is because > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The > >> > system is really somewhere between the low and min watermark but we are not > >> > taking the accurate measure until kswapd gets woken up. The first allocation > >> > to notice we are below the low watermark (be it due to vmstat refreshing or > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of > >> > any drift) wakes kswapd and other callers then take an accurate count hence > >> > "we could breach the watermark but I'm expecting it can only happen for at > >> > worst one allocation". > >> > >> Right. I misunderstood your word. > >> One more question. > >> > >> Could you explain live lock scenario? > >> > > > > Lets say > > > > NR_FREE_PAGES = 256 > > Actual free pages = 8 > > > > The PCP lists get refilled in patch taking all 8 pages. Now there are > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to > > clean something but all the pages are on a network-backed filesystem. To > > clean them, it must transmit on the network so it tries to allocate some > > buffers. 
> >
> > The livelock is that to free some memory, an allocation must succeed but
> > for an allocation to succeed, some memory must be freed. The system
>
> Yes. I understood this as livelock but at last VM will kill victim
> process then it can allocate free pages.

And if the exit path for the OOM kill needs to allocate a page what
should it do?

> So I think it's not a livelock.
>
> > might still remain alive if a process exits and does not need to
> > allocate memory while exiting but by and large, the system is in a
> > dangerous state.
>
> Do you mean dangerous state of the system is livelock?
> Maybe not.
> I can't understand livelock in this context.
> Anyway, I am okay with this patch except livelock pharse. :)
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 10:38 ` Mel Gorman @ 2010-08-19 14:01 ` Minchan Kim 2010-08-19 14:09 ` Mel Gorman 0 siblings, 1 reply; 49+ messages in thread From: Minchan Kim @ 2010-08-19 14:01 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote: > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote: > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: > > >> > > What's a window low and min wmark? Maybe I can miss your point. > > >> > > > > >> > > > >> > The window is due to the fact kswapd is not awake yet. The window is because > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The > > >> > system is really somewhere between the low and min watermark but we are not > > >> > taking the accurate measure until kswapd gets woken up. The first allocation > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of > > >> > any drift) wakes kswapd and other callers then take an accurate count hence > > >> > "we could breach the watermark but I'm expecting it can only happen for at > > >> > worst one allocation". > > >> > > >> Right. I misunderstood your word. > > >> One more question. > > >> > > >> Could you explain live lock scenario? > > >> > > > > > > Lets say > > > > > > NR_FREE_PAGES = 256 > > > Actual free pages = 8 > > > > > > The PCP lists get refilled in patch taking all 8 pages. Now there are > > > zero free pages. 
Reclaim kicks in but to reclaim any pages it needs to
> > > clean something but all the pages are on a network-backed filesystem. To
> > > clean them, it must transmit on the network so it tries to allocate some
> > > buffers.
> > >
> > > The livelock is that to free some memory, an allocation must succeed but
> > > for an allocation to succeed, some memory must be freed. The system
> >
> > Yes. I understood this as livelock but at last VM will kill victim
> > process then it can allocate free pages.
>
> And if the exit path for the OOM kill needs to allocate a page what
> should it do?

Yeah. It might be livelock.
Then, let's rethink the problem.

The problem is as follows.

1. Process A tries to allocate a page
2. VM tries to reclaim pages for process A
3. VM reclaims some pages but they remain on the PCP lists so it can't allocate pages for A
4. VM tries to kill process B
5. The exit path needs new pages for exiting process B
6. Livelock happens (I am not sure, but we need at least a warning if it really happens)

If OOM kills process B successfully, there isn't a livelock problem.
So then how about this?

We need to retry the allocation of a new page after draining free pages just before OOM.
It doesn't add any overhead before going OOM and it's not frequent.

Can't this patch handle your problem?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1bb327a..113bea9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2045,6 +2045,15 @@ rebalance:
 	 * running out of options and have to consider going OOM
 	 */
 	if (!did_some_progress) {
+
+		/* There are some free pages on PCP */
+		drain_all_pages();
+		page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
+				high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+				preferred_zone, migratetype);
+		if (page)
+			goto got_pg;
+
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			if (oom_killer_disabled)
 				goto nopage;

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 14:01 ` Minchan Kim @ 2010-08-19 14:09 ` Mel Gorman 2010-08-19 14:34 ` Minchan Kim 0 siblings, 1 reply; 49+ messages in thread From: Mel Gorman @ 2010-08-19 14:09 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote: > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote: > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote: > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: > > > >> > > What's a window low and min wmark? Maybe I can miss your point. > > > >> > > > > > >> > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The > > > >> > system is really somewhere between the low and min watermark but we are not > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence > > > >> > "we could breach the watermark but I'm expecting it can only happen for at > > > >> > worst one allocation". > > > >> > > > >> Right. I misunderstood your word. > > > >> One more question. > > > >> > > > >> Could you explain live lock scenario? > > > >> > > > > > > > > Lets say > > > > > > > > NR_FREE_PAGES = 256 > > > > Actual free pages = 8 > > > > > > > > The PCP lists get refilled in patch taking all 8 pages. 
Now there are
> > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > > clean something but all the pages are on a network-backed filesystem. To
> > > > clean them, it must transmit on the network so it tries to allocate some
> > > > buffers.
> > > >
> > > > The livelock is that to free some memory, an allocation must succeed but
> > > > for an allocation to succeed, some memory must be freed. The system
> > >
> > > Yes. I understood this as livelock but at last VM will kill victim
> > > process then it can allocate free pages.
> >
> > And if the exit path for the OOM kill needs to allocate a page what
> > should it do?
>
> Yeah. It might be livelock.
> Then, let's rethink the problem.
>
> The problem is following as.
>
> 1. Process A try to allocate the page
> 2. VM try to reclaim the page for process A
> 3. VM reclaims some pages but it remains on PCP so can't allocate pages for A
> 4. VM try to kill process B
> 5. The exit path need new pages for exiting process B
> 6. Livelock happens(I am not sure but we need any warning if it really happens at least)
>

The problem this patch is concerned with is about the vmstat counters, not
the pages on the per-cpu lists. The issue being dealt with is that the page
allocator grants a page going below the min watermark because NR_FREE_PAGES
can be inaccurate. The patch aims to fix that by taking greater care
with NR_FREE_PAGES when memory is low.

> If OOM kills process B successfully, there ins't the livelock problem.
> So then How about this?
>
> We need to retry allocation of new page with draining free pages just before OOM.
> It doesn't have any overhead before going OOM and it's not frequent.
>

It's a different problem and it's what patch 3/3 of this series aims to
address.

> This patch can't handle your problem?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1bb327a..113bea9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2045,6 +2045,15 @@ rebalance:
>  	 * running out of options and have to consider going OOM
>  	 */
>  	if (!did_some_progress) {
> +
> +		/* Ther are some free pages on PCP */
> +		drain_all_pages();
> +		page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> +				high_zoneidx, alloc_flags &~ALLOCX_NO_WATERMARKS,
> +				preferred_zone, migratetype);
> +		if (page)
> +			goto got_pg;
> +
>  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>  			if (oom_killer_disabled)
>  				goto nopage;
>
>
>
> --
> Kind regards,
> Minchan Kim
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 14:09 ` Mel Gorman @ 2010-08-19 14:34 ` Minchan Kim 2010-08-19 15:07 ` Mel Gorman 0 siblings, 1 reply; 49+ messages in thread From: Minchan Kim @ 2010-08-19 14:34 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote: > On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote: > > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote: > > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote: > > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: > > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: > > > > >> > > What's a window low and min wmark? Maybe I can miss your point. > > > > >> > > > > > > >> > > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because > > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The > > > > >> > system is really somewhere between the low and min watermark but we are not > > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation > > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or > > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of > > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence > > > > >> > "we could breach the watermark but I'm expecting it can only happen for at > > > > >> > worst one allocation". > > > > >> > > > > >> Right. I misunderstood your word. > > > > >> One more question. > > > > >> > > > > >> Could you explain live lock scenario? 
> > > > >>
> > > > >
> > > > > Lets say
> > > > >
> > > > > NR_FREE_PAGES     = 256
> > > > > Actual free pages = 8
> > > > >
> > > > > The PCP lists get refilled in patch taking all 8 pages. Now there are
> > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > > > clean something but all the pages are on a network-backed filesystem. To
> > > > > clean them, it must transmit on the network so it tries to allocate some
> > > > > buffers.
> > > > >
> > > > > The livelock is that to free some memory, an allocation must succeed but
> > > > > for an allocation to succeed, some memory must be freed. The system
> > > >
> > > > Yes. I understood this as livelock but at last VM will kill victim
> > > > process then it can allocate free pages.
> > >
> > > And if the exit path for the OOM kill needs to allocate a page what
> > > should it do?
> >
> > Yeah. It might be livelock.
> > Then, let's rethink the problem.
> >
> > The problem is following as.
> >
> > 1. Process A try to allocate the page
> > 2. VM try to reclaim the page for process A
> > 3. VM reclaims some pages but it remains on PCP so can't allocate pages for A
> > 4. VM try to kill process B
> > 5. The exit path need new pages for exiting process B
> > 6. Livelock happens(I am not sure but we need any warning if it really happens at least)
> >
>
> The problem this patch is concerned with is about the vmstat counters, not
> the pages on the per-cpu lists. The issue being dealt with is that the page
> allocator grants a page going below the min watermark because NR_FREE_PAGES
> can be inaccurate. The patch aims to fix that but taking greater care
> with NR_FREE_PAGES when memory is low.

Your goal is to protect the reserved _min_ pages, right?
I thought your final goal was to prevent the livelock problem.
Hmm.. Sorry for the noise. :(

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 14:34 ` Minchan Kim @ 2010-08-19 15:07 ` Mel Gorman 2010-08-19 15:22 ` Minchan Kim 0 siblings, 1 reply; 49+ messages in thread From: Mel Gorman @ 2010-08-19 15:07 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Thu, Aug 19, 2010 at 11:34:39PM +0900, Minchan Kim wrote: > On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote: > > On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote: > > > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote: > > > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote: > > > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: > > > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: > > > > > >> > > What's a window low and min wmark? Maybe I can miss your point. > > > > > >> > > > > > > > >> > > > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because > > > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The > > > > > >> > system is really somewhere between the low and min watermark but we are not > > > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation > > > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or > > > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of > > > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence > > > > > >> > "we could breach the watermark but I'm expecting it can only happen for at > > > > > >> > worst one allocation". > > > > > >> > > > > > >> Right. I misunderstood your word. > > > > > >> One more question. 
> > > > > >> > > > > > >> Could you explain live lock scenario? > > > > > >> > > > > > > > > > > > > Lets say > > > > > > > > > > > > NR_FREE_PAGES = 256 > > > > > > Actual free pages = 8 > > > > > > > > > > > > The PCP lists get refilled in patch taking all 8 pages. Now there are > > > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to > > > > > > clean something but all the pages are on a network-backed filesystem. To > > > > > > clean them, it must transmit on the network so it tries to allocate some > > > > > > buffers. > > > > > > > > > > > > The livelock is that to free some memory, an allocation must succeed but > > > > > > for an allocation to succeed, some memory must be freed. The system > > > > > > > > > > Yes. I understood this as livelock but at last VM will kill victim > > > > > process then it can allocate free pages. > > > > > > > > And if the exit path for the OOM kill needs to allocate a page what > > > > should it do? > > > > > > Yeah. It might be livelock. > > > Then, let's rethink the problem. > > > > > > The problem is following as. > > > > > > 1. Process A try to allocate the page > > > 2. VM try to reclaim the page for process A > > > 3. VM reclaims some pages but it remains on PCP so can't allocate pages for A > > > 4. VM try to kill process B > > > 5. The exit path need new pages for exiting process B > > > 6. Livelock happens(I am not sure but we need any warning if it really happens at least) > > > > > > > The problem this patch is concerned with is about the vmstat counters, not > > the pages on the per-cpu lists. The issue being dealt with is that the page > > allocator grants a page going below the min watermark because NR_FREE_PAGES > > can be inaccurate. The patch aims to fix that but taking greater care > > with NR_FREE_PAGES when memory is low. > > Your goal is to protect _min_ pages which is reserved. Right? > I thought your final goal is to protect the livelock problem. > Hmm.. Sorry for the noise. 
:(
>

Emm, it's the same thing. If the min watermark is not properly
preserved, the system is in danger of being live-locked.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 15:07 ` Mel Gorman @ 2010-08-19 15:22 ` Minchan Kim 2010-08-19 15:40 ` Mel Gorman 0 siblings, 1 reply; 49+ messages in thread From: Minchan Kim @ 2010-08-19 15:22 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Thu, Aug 19, 2010 at 04:07:39PM +0100, Mel Gorman wrote: > On Thu, Aug 19, 2010 at 11:34:39PM +0900, Minchan Kim wrote: > > On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote: > > > On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote: > > > > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote: > > > > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote: > > > > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: > > > > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: > > > > > > >> > > What's a window low and min wmark? Maybe I can miss your point. > > > > > > >> > > > > > > > > >> > > > > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because > > > > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The > > > > > > >> > system is really somewhere between the low and min watermark but we are not > > > > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation > > > > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or > > > > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of > > > > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence > > > > > > >> > "we could breach the watermark but I'm expecting it can only happen for at > > > > > > >> > worst one allocation". > > > > > > >> > > > > > > >> Right. 
I misunderstood your word. > > > > > > >> One more question. > > > > > > >> > > > > > > >> Could you explain live lock scenario? > > > > > > >> > > > > > > > > > > > > > > Lets say > > > > > > > > > > > > > > NR_FREE_PAGES = 256 > > > > > > > Actual free pages = 8 > > > > > > > > > > > > > > The PCP lists get refilled in patch taking all 8 pages. Now there are > > > > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to > > > > > > > clean something but all the pages are on a network-backed filesystem. To > > > > > > > clean them, it must transmit on the network so it tries to allocate some > > > > > > > buffers. > > > > > > > > > > > > > > The livelock is that to free some memory, an allocation must succeed but > > > > > > > for an allocation to succeed, some memory must be freed. The system > > > > > > > > > > > > Yes. I understood this as livelock but at last VM will kill victim > > > > > > process then it can allocate free pages. > > > > > > > > > > And if the exit path for the OOM kill needs to allocate a page what > > > > > should it do? > > > > > > > > Yeah. It might be livelock. > > > > Then, let's rethink the problem. > > > > > > > > The problem is following as. > > > > > > > > 1. Process A try to allocate the page > > > > 2. VM try to reclaim the page for process A > > > > 3. VM reclaims some pages but it remains on PCP so can't allocate pages for A > > > > 4. VM try to kill process B > > > > 5. The exit path need new pages for exiting process B > > > > 6. Livelock happens(I am not sure but we need any warning if it really happens at least) > > > > > > > > > > The problem this patch is concerned with is about the vmstat counters, not > > > the pages on the per-cpu lists. The issue being dealt with is that the page > > > allocator grants a page going below the min watermark because NR_FREE_PAGES > > > can be inaccurate. The patch aims to fix that but taking greater care > > > with NR_FREE_PAGES when memory is low. 
> >
> > Your goal is to protect _min_ pages which is reserved. Right?
> > I thought your final goal is to protect the livelock problem.
> > Hmm.. Sorry for the noise. :(
> >
>
> Emm, it's the same thing. If the min watermark is not properly
> preserved, the system is in danger of being live-locked.

Totally right.
Maybe I am sleeping.

Let's add the following as a comment about livelock.

"If NR_FREE_PAGES is much higher than number of real free page in buddy,
the VM can allocate pages below min watermark(At worst, buddy is zero).
Although VM kills some victim for freeing memory, it can't do it if the
exit path requires new page since buddy have zero page. It can result in
livelock."

At least, it helps to not hurt you in future by me who is fool.

Thanks, Mel.

--
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 15:22 ` Minchan Kim @ 2010-08-19 15:40 ` Mel Gorman 2010-08-19 15:44 ` Minchan Kim 0 siblings, 1 reply; 49+ messages in thread From: Mel Gorman @ 2010-08-19 15:40 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Fri, Aug 20, 2010 at 12:22:33AM +0900, Minchan Kim wrote: > On Thu, Aug 19, 2010 at 04:07:39PM +0100, Mel Gorman wrote: > > On Thu, Aug 19, 2010 at 11:34:39PM +0900, Minchan Kim wrote: > > > On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote: > > > > On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote: > > > > > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote: > > > > > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote: > > > > > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote: > > > > > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote: > > > > > > > >> > > What's a window low and min wmark? Maybe I can miss your point. > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because > > > > > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The > > > > > > > >> > system is really somewhere between the low and min watermark but we are not > > > > > > > >> > taking the accurate measure until kswapd gets woken up. 
The first allocation > > > > > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or > > > > > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of > > > > > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence > > > > > > > >> > "we could breach the watermark but I'm expecting it can only happen for at > > > > > > > >> > worst one allocation". > > > > > > > >> > > > > > > > >> Right. I misunderstood your word. > > > > > > > >> One more question. > > > > > > > >> > > > > > > > >> Could you explain live lock scenario? > > > > > > > >> > > > > > > > > > > > > > > > > Lets say > > > > > > > > > > > > > > > > NR_FREE_PAGES = 256 > > > > > > > > Actual free pages = 8 > > > > > > > > > > > > > > > > The PCP lists get refilled in patch taking all 8 pages. Now there are > > > > > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to > > > > > > > > clean something but all the pages are on a network-backed filesystem. To > > > > > > > > clean them, it must transmit on the network so it tries to allocate some > > > > > > > > buffers. > > > > > > > > > > > > > > > > The livelock is that to free some memory, an allocation must succeed but > > > > > > > > for an allocation to succeed, some memory must be freed. The system > > > > > > > > > > > > > > Yes. I understood this as livelock but at last VM will kill victim > > > > > > > process then it can allocate free pages. > > > > > > > > > > > > And if the exit path for the OOM kill needs to allocate a page what > > > > > > should it do? > > > > > > > > > > Yeah. It might be livelock. > > > > > Then, let's rethink the problem. > > > > > > > > > > The problem is following as. > > > > > > > > > > 1. Process A try to allocate the page > > > > > 2. VM try to reclaim the page for process A > > > > > 3. VM reclaims some pages but it remains on PCP so can't allocate pages for A > > > > > 4. 
VM try to kill process B > > > > > 5. The exit path need new pages for exiting process B > > > > > 6. Livelock happens(I am not sure but we need any warning if it really happens at least) > > > > > > > > > > > > > The problem this patch is concerned with is about the vmstat counters, not > > > > the pages on the per-cpu lists. The issue being dealt with is that the page > > > > allocator grants a page going below the min watermark because NR_FREE_PAGES > > > > can be inaccurate. The patch aims to fix that but taking greater care > > > > with NR_FREE_PAGES when memory is low. > > > > > > Your goal is to protect _min_ pages which is reserved. Right? > > > I thought your final goal is to protect the livelock problem. > > > Hmm.. Sorry for the noise. :( > > > > > > > Emm, it's the same thing. If the min watermark is not properly > > preserved, the system is in danger of being live-locked. > > Totally right. > Maybe I am sleeping. > > Let's add follwing as comment about livelock. > Sure! > "If NR_FREE_PAGES is much higher than number of real free page in buddy, > the VM can allocate pages below min watermark(At worst, buddy is zero). > Although VM kills some victim for freeing memory, it can't do it if the > exit path requires new page since buddy have zero page. It can result in > livelock." > Thanks > At least, it help to not hurt you in future by me who is fool. > The patch leader now reads as Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than number of real free page in buddy, the VM can allocate pages below min watermark, at worst reducing the real number of pages to zero. 
Even if the OOM killer kills some victim for freeing memory, it
may not free memory if the exit path requires a new page resulting in livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Is that better?

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
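The counter drift described in the patch leader above can be sketched in user space. This is a minimal model and not the patch's code: `struct model_zone`, `MODEL_MAX_CPUS` and `model_zone_nr_free_pages()` are hypothetical names, and the exact condition for taking the slower path (kswapd awake and the cheap value below a drift mark) is an assumption drawn from the description in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_CPUS 64	/* hypothetical limit, for this model only */

/* Hypothetical user-space stand-in for a zone's free-page accounting. */
struct model_zone {
	long global_free;		/* models the global NR_FREE_PAGES value */
	long cpu_delta[MODEL_MAX_CPUS];	/* per-cpu deltas not yet folded in */
	size_t nr_cpus;			/* number of online cpus in the model */
	int kswapd_awake;		/* nonzero while background reclaim runs */
	long percpu_drift_mark;		/* below this, drift is treated as dangerous */
};

/*
 * Return an estimate of the free pages in the zone. The cheap global value
 * is returned unless kswapd is awake and that value is below the drift
 * mark; in that case the pending per-cpu deltas are summed in as well.
 */
static long model_zone_nr_free_pages(const struct model_zone *z)
{
	long nr = z->global_free;
	size_t cpu;

	assert(z->nr_cpus <= MODEL_MAX_CPUS);

	if (!z->kswapd_awake || nr >= z->percpu_drift_mark)
		return nr;

	for (cpu = 0; cpu < z->nr_cpus; cpu++)
		nr += z->cpu_delta[cpu];

	/* Deltas can transiently overshoot; never report a negative count. */
	return nr < 0 ? 0 : nr;
}
```

With the numbers used earlier in the thread (a global value of 256 while only 8 pages are really free), eight cpus each holding a pending delta of -31 reproduce the gap: the cheap read reports 256 and the summed read reports 8.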
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 15:40 ` Mel Gorman @ 2010-08-19 15:44 ` Minchan Kim 0 siblings, 0 replies; 49+ messages in thread From: Minchan Kim @ 2010-08-19 15:44 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 04:40:33PM +0100, Mel Gorman wrote:
> The patch leader now reads as
>
> Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
> cheaper than scanning a number of lists. To avoid synchronization overhead,
> counter deltas are maintained on a per-cpu basis and drained both periodically
> and when the delta is above a threshold. On large CPU systems, the difference
> between the estimated and real value of NR_FREE_PAGES can be very high.
> If NR_FREE_PAGES is much higher than number of real free page in buddy, the VM
> can allocate pages below min watermark, at worst reducing the real number of
> pages to zero. Even if the OOM killer kills some victim for freeing memory, it
> may not free memory if the exit path requires a new page resulting in livelock.
>
> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.
>
> Is that better?

Good!

--
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-16 9:43 ` Mel Gorman 2010-08-16 14:47 ` Rik van Riel 2010-08-16 16:06 ` Johannes Weiner @ 2010-08-19 15:46 ` Minchan Kim 2010-08-19 16:06 ` Mel Gorman 2 siblings, 1 reply; 49+ messages in thread From: Minchan Kim @ 2010-08-19 15:46 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can be
> > very high. If the system is under both load and low memory, it's possible
> > for watermarks to be breached. In extreme cases, the number of free pages
> > can drop to 0 leading to the possibility of system livelock.
> >
> > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> > and may result in cache line bounces but is expected to be lighter than the
> > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > is awake.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> And the second I sent this, I realised I had sent a slightly old version
> that missed a compile-fix :(
>
> ==== CUT HERE ====
> mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
>
> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can be
> very high. If the system is under both load and low memory, it's possible
> for watermarks to be breached. In extreme cases, the number of free pages
> can drop to 0 leading to the possibility of system livelock.

Mel. Could you consider normal(or small) system but has two core at least?
I mean we apply your rule according to the number of CPU and RAM size. (ie,
threshold value).
Now mobile system begin to have two core in system and above 1G RAM.
Such case, it has threshold 8.

It is unlikely to happen livelock.
Is it worth to have such overhead in such system?
What do you think?

--
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 15:46 ` Minchan Kim @ 2010-08-19 16:06 ` Mel Gorman 2010-08-19 16:45 ` Minchan Kim 0 siblings, 1 reply; 49+ messages in thread From: Mel Gorman @ 2010-08-19 16:06 UTC (permalink / raw) To: Minchan Kim Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Fri, Aug 20, 2010 at 12:46:38AM +0900, Minchan Kim wrote: > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote: > > On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote: > > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as > > > it is cheaper than scanning a number of lists. To avoid synchronization > > > overhead, counter deltas are maintained on a per-cpu basis and drained both > > > periodically and when the delta is above a threshold. On large CPU systems, > > > the difference between the estimated and real value of NR_FREE_PAGES can be > > > very high. If the system is under both load and low memory, it's possible > > > for watermarks to be breached. In extreme cases, the number of free pages > > > can drop to 0 leading to the possibility of system livelock. > > > > > > This patch introduces zone_nr_free_pages() to take a slightly more accurate > > > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect > > > and may result in cache line bounces but is expected to be lighter than the > > > IPI calls necessary to continually drain the per-cpu counters while kswapd > > > is awake. 
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > And the second I sent this, I realised I had sent a slightly old version
> > that missed a compile-fix :(
> >
> > ==== CUT HERE ====
> > mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> >
> > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can be
> > very high. If the system is under both load and low memory, it's possible
> > for watermarks to be breached. In extreme cases, the number of free pages
> > can drop to 0 leading to the possibility of system livelock.
>
> Mel. Could you consider normal(or small) system but has two core at least?

I did consider it but I was not keen on the idea of small systems behaving
very differently to large systems in this regard. I thought there was a
danger that a problem would be hidden by such a move.

> I mean we apply your rule according to the number of CPU and RAM size. (ie,
> threshold value).
> Now mobile system begin to have two core in system and above 1G RAM.
> Such case, it has threshold 8.
>
> It is unlikely to happen livelock.
> Is it worth to have such overhead in such system?
> What do you think?
>

Such overhead could be avoided if we made a check like the following in
refresh_zone_stat_thresholds()

		/*
		 * Only set percpu_drift_mark if there is a danger that
		 * NR_FREE_PAGES reports the low watermark is ok when in fact
		 * the min watermark could be breached by an allocation
		 */
		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
		max_drift = num_online_cpus() * threshold;
		if (max_drift > tolerate_drift)
			zone->percpu_drift_mark = high_wmark_pages(zone)
					+ max_drift;

Would this be preferable?

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
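The arithmetic behind the proposed check can be tried out in plain C. This is a sketch under stated assumptions: `struct model_wmarks` and `model_drift_mark()` are hypothetical names, the watermark values in the usage note are made up for illustration, and returning 0 to mean "mark not set" is an assumption about how an unset percpu_drift_mark would behave.

```c
#include <assert.h>

/* Hypothetical container for one zone's watermarks, in pages. */
struct model_wmarks {
	long min;
	long low;
	long high;
};

/*
 * Mirror of the proposed check: the worst-case drift is one full threshold
 * of unflushed delta on every online cpu. Only when that exceeds the gap
 * between the low and min watermarks is a drift mark needed.
 * Returns the drift mark, or 0 when the cheap counter is safe to trust.
 */
static long model_drift_mark(const struct model_wmarks *w,
			     long online_cpus, long threshold)
{
	long tolerate_drift = w->low - w->min;
	long max_drift = online_cpus * threshold;

	assert(online_cpus > 0 && threshold >= 0);

	if (max_drift > tolerate_drift)
		return w->high + max_drift;

	return 0;
}
```

Taking illustrative watermarks of min=1024, low=1280 and high=1536, the two-core case from the question above (threshold 8) has a maximum drift of 16 pages against a tolerance of 256, so no mark is set and the extra per-cpu reads are skipped. A hypothetical 64-cpu machine with threshold 125 has a maximum drift of 8000 pages, so the mark becomes 1536 + 8000 = 9536.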
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 16:06 ` Mel Gorman @ 2010-08-19 16:45 ` Minchan Kim 0 siblings, 0 replies; 49+ messages in thread From: Minchan Kim @ 2010-08-19 16:45 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 05:06:12PM +0100, Mel Gorman wrote:
> On Fri, Aug 20, 2010 at 12:46:38AM +0900, Minchan Kim wrote:
> Mel. Could you consider normal(or small) system but has two core at least?
>
> I did consider it but I was not keen on the idea of small systems behaving
> very differently to large systems in this regard. I thought there was a
> danger that a problem would be hidden by such a move.
>
> > I mean we apply your rule according to the number of CPU and RAM size. (ie,
> > threshold value).
> > Now mobile system begin to have two core in system and above 1G RAM.
> > Such case, it has threshold 8.
> >
> > It is unlikely to happen livelock.
> > Is it worth to have such overhead in such system?
> > What do you think?
> >
>
> Such overhead could be avoided if we made a check like the following in
> refresh_zone_stat_thresholds()
>
> 		/*
> 		 * Only set percpu_drift_mark if there is a danger that
> 		 * NR_FREE_PAGES reports the low watermark is ok when in fact
> 		 * the min watermark could be breached by an allocation
> 		 */
> 		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
> 		max_drift = num_online_cpus() * threshold;
> 		if (max_drift > tolerate_drift)
> 			zone->percpu_drift_mark = high_wmark_pages(zone)
> 					+ max_drift;
>
> Would this be preferable?

Yes. It looks good to me.

--
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-16 9:42 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman 2010-08-16 9:43 ` Mel Gorman @ 2010-08-18 2:59 ` KAMEZAWA Hiroyuki 2010-08-18 15:55 ` Christoph Lameter 1 sibling, 1 reply; 49+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-08-18 2:59 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KOSAKI Motohiro, cl@linux-foundation.org On Mon, 16 Aug 2010 10:42:12 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as > it is cheaper than scanning a number of lists. To avoid synchronization > overhead, counter deltas are maintained on a per-cpu basis and drained both > periodically and when the delta is above a threshold. On large CPU systems, > the difference between the estimated and real value of NR_FREE_PAGES can be > very high. If the system is under both load and low memory, it's possible > for watermarks to be breached. In extreme cases, the number of free pages > can drop to 0 leading to the possibility of system livelock. > > This patch introduces zone_nr_free_pages() to take a slightly more accurate > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect > and may result in cache line bounces but is expected to be lighter than the > IPI calls necessary to continually drain the per-cpu counters while kswapd > is awake. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> BTW, a nitpick. 
> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
>  		for_each_online_cpu(cpu)
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>  							= threshold;
> +
> +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> +					num_online_cpus() * threshold;
>  	}
>  }

This function is now called only at CPU_DEAD. IOW, not called at CPU_UP_PREPARE

It's done by this patch....but the reason is unclear to me.

==
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d1187ed21026fd512b87851d0ca26d9ae16f9059
==

Christoph ?

Thanks,
-Kame
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-18 2:59 ` KAMEZAWA Hiroyuki @ 2010-08-18 15:55 ` Christoph Lameter 2010-08-19 0:07 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 49+ messages in thread From: Christoph Lameter @ 2010-08-18 15:55 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KOSAKI Motohiro

On Wed, 18 Aug 2010, KAMEZAWA Hiroyuki wrote:

> BTW, a nitpick.
>
> > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> >  		for_each_online_cpu(cpu)
> >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> >  							= threshold;
> > +
> > +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> > +					num_online_cpus() * threshold;
> >  	}
> >  }
>
> This function is now called only at CPU_DEAD. IOW, not called at CPU_UP_PREPARE

calculate_threshold() does its calculation based on the number of online
cpus. Therefore the threshold may change if a cpu is brought down.
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-18 15:55 ` Christoph Lameter @ 2010-08-19 0:07 ` KAMEZAWA Hiroyuki 2010-08-19 19:00 ` Christoph Lameter 0 siblings, 1 reply; 49+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-08-19 0:07 UTC (permalink / raw) To: Christoph Lameter Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KOSAKI Motohiro

On Wed, 18 Aug 2010 10:55:53 -0500 (CDT) Christoph Lameter <cl@linux-foundation.org> wrote:

> On Wed, 18 Aug 2010, KAMEZAWA Hiroyuki wrote:
>
> > BTW, a nitpick.
> >
> > > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> > >  		for_each_online_cpu(cpu)
> > >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > >  							= threshold;
> > > +
> > > +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> > > +					num_online_cpus() * threshold;
> > >  	}
> > >  }
> >
> > This function is now called only at CPU_DEAD. IOW, not called at CPU_UP_PREPARE
>
> calculate_threshold() does its calculation based on the number of online
> cpus. Therefore the threshold may change if a cpu is brought down.
>
yes. but why not calculate at bringing up ?

Thanks,
-Kame
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 0:07 ` KAMEZAWA Hiroyuki @ 2010-08-19 19:00 ` Christoph Lameter 2010-08-19 23:49 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 49+ messages in thread From: Christoph Lameter @ 2010-08-19 19:00 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KOSAKI Motohiro

On Thu, 19 Aug 2010, KAMEZAWA Hiroyuki wrote:

> > > This function is now called only at CPU_DEAD. IOW, not called at CPU_UP_PREPARE
> >
> > calculate_threshold() does its calculation based on the number of online
> > cpus. Therefore the threshold may change if a cpu is brought down.
> >
> yes. but why not calculate at bringing up ?

True. Seems to have gone missing somehow.
* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake 2010-08-19 19:00 ` Christoph Lameter @ 2010-08-19 23:49 ` KAMEZAWA Hiroyuki 2010-08-20 0:22 ` [PATCH] vmstat : update zone stat threshold at onlining a cpu KAMEZAWA Hiroyuki 0 siblings, 1 reply; 49+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-08-19 23:49 UTC (permalink / raw) To: Christoph Lameter Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KOSAKI Motohiro

On Thu, 19 Aug 2010 14:00:44 -0500 (CDT) Christoph Lameter <cl@linux-foundation.org> wrote:

> On Thu, 19 Aug 2010, KAMEZAWA Hiroyuki wrote:
>
> > > > This function is now called only at CPU_DEAD. IOW, not called at CPU_UP_PREPARE
> > >
> > > calculate_threshold() does its calculation based on the number of online
> > > cpus. Therefore the threshold may change if a cpu is brought down.
> > >
> > yes. but why not calculate at bringing up ?
>
> True. Seems to have gone missing somehow.
>
ok, thank you for checking. I'll prepare a patch.

-Kame
* [PATCH] vmstat : update zone stat threshold at onlining a cpu 2010-08-19 23:49 ` KAMEZAWA Hiroyuki @ 2010-08-20 0:22 ` KAMEZAWA Hiroyuki 2010-08-20 14:54 ` Christoph Lameter 2010-08-23 7:18 ` Mel Gorman 0 siblings, 2 replies; 49+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-08-20 0:22 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Christoph Lameter, Mel Gorman, linux-mm, akpm@linux-foundation.org

refresh_zone_stat_thresholds() calculates parameter based on
the number of online cpus. It's called at cpu offlining but
needs to be called at onlining, too.

Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/vmstat.c |    1 +
 1 file changed, 1 insertion(+)

Index: mmotm-0811/mm/vmstat.c
===================================================================
--- mmotm-0811.orig/mm/vmstat.c
+++ mmotm-0811/mm/vmstat.c
@@ -998,6 +998,7 @@ static int __cpuinit vmstat_cpuup_callba
 	switch (action) {
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
+		refresh_zone_stat_thresholds();
 		start_cpu_timer(cpu);
 		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
* Re: [PATCH] vmstat : update zone stat threshold at onlining a cpu 2010-08-20 0:22 ` [PATCH] vmstat : update zone stat threshold at onlining a cpu KAMEZAWA Hiroyuki @ 2010-08-20 14:54 ` Christoph Lameter 2010-08-20 17:29 ` Andrew Morton 2010-08-23 7:18 ` Mel Gorman 1 sibling, 1 reply; 49+ messages in thread From: Christoph Lameter @ 2010-08-20 14:54 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: Mel Gorman, linux-mm, akpm@linux-foundation.org

On Fri, 20 Aug 2010, KAMEZAWA Hiroyuki wrote:

>  1 file changed, 1 insertion(+)
>
> Index: mmotm-0811/mm/vmstat.c
> ===================================================================
> --- mmotm-0811.orig/mm/vmstat.c
> +++ mmotm-0811/mm/vmstat.c
> @@ -998,6 +998,7 @@ static int __cpuinit vmstat_cpuup_callba
>  	switch (action) {
>  	case CPU_ONLINE:
>  	case CPU_ONLINE_FROZEN:
> +		refresh_zone_stat_thresholds();
>  		start_cpu_timer(cpu);
>  		node_set_state(cpu_to_node(cpu), N_CPU);
>  		break;

refresh_zone_stat_thresholds must be run *after* the number of online cpus
has been incremented. Does that occur before the callback?
* Re: [PATCH] vmstat : update zone stat threshold at onlining a cpu 2010-08-20 14:54 ` Christoph Lameter @ 2010-08-20 17:29 ` Andrew Morton 0 siblings, 0 replies; 49+ messages in thread From: Andrew Morton @ 2010-08-20 17:29 UTC (permalink / raw) To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, Mel Gorman, linux-mm

On Fri, 20 Aug 2010 09:54:56 -0500 (CDT) Christoph Lameter <cl@linux-foundation.org> wrote:

> On Fri, 20 Aug 2010, KAMEZAWA Hiroyuki wrote:
>
> >  1 file changed, 1 insertion(+)
> >
> > Index: mmotm-0811/mm/vmstat.c
> > ===================================================================
> > --- mmotm-0811.orig/mm/vmstat.c
> > +++ mmotm-0811/mm/vmstat.c
> > @@ -998,6 +998,7 @@ static int __cpuinit vmstat_cpuup_callba
> >  	switch (action) {
> >  	case CPU_ONLINE:
> >  	case CPU_ONLINE_FROZEN:
> > +		refresh_zone_stat_thresholds();
> >  		start_cpu_timer(cpu);
> >  		node_set_state(cpu_to_node(cpu), N_CPU);
> >  		break;
>
> refresh_zone_stat_thresholds must be run *after* the number of online cpus
> has been incremented. Does that occur before the callback?

It does. _cpu_up() calls __cpu_up() before calling cpu_notify().
* Re: [PATCH] vmstat : update zone stat threshold at onlining a cpu 2010-08-20 0:22 ` [PATCH] vmstat : update zone stat threshold at onlining a cpu KAMEZAWA Hiroyuki 2010-08-20 14:54 ` Christoph Lameter @ 2010-08-23 7:18 ` Mel Gorman 1 sibling, 0 replies; 49+ messages in thread From: Mel Gorman @ 2010-08-23 7:18 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: Christoph Lameter, linux-mm, akpm@linux-foundation.org

On Fri, Aug 20, 2010 at 09:22:51AM +0900, KAMEZAWA Hiroyuki wrote:
>
> refresh_zone_stat_thresholds() calculates parameter based on
> the number of online cpus. It's called at cpu offlining but
> needs to be called at onlining, too.
>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

Thanks

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 [RFC PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator Mel Gorman
  2010-08-16  9:42 ` [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list Mel Gorman
  2010-08-16  9:42 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
@ 2010-08-16  9:42 ` Mel Gorman
  2010-08-16 14:50   ` Rik van Riel
                     ` (3 more replies)
  2 siblings, 4 replies; 49+ messages in thread
From: Mel Gorman @ 2010-08-16 9:42 UTC
To: linux-mm
Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but
allocations are still failing that the system is already under heavy
pressure. In this case, it drains the per-cpu lists and tries the
allocation a second time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   19 +++++++++++++++++--
 1 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 67a2ed0..a8651a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1844,6 +1844,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;
 
 	cond_resched();
 
@@ -1865,11 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (order != 0)
 		drain_all_pages();
 
-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+	if (unlikely(!(*did_some_progress)))
+		return NULL;
+
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }
 
-- 
1.7.1

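Editorial sketch: the control flow the patch introduces (try the allocation, and on failure drain the per-cpu lists once and retry) can be modelled in userspace C as below. This is a toy model, not kernel code. `struct zone_sketch`, `alloc_from_buddy()`, `drain_all()` and `alloc_after_reclaim()` are hypothetical names invented for illustration; only the `drained` flag and the retry-once shape come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the model */

/* Toy zone: pages on the shared free lists plus pages parked per CPU. */
struct zone_sketch {
	unsigned long buddy_free;		/* visible to the allocator */
	unsigned long pcp[NR_CPUS_SKETCH];	/* held on per-CPU lists */
	unsigned int drains;			/* number of drain_all() calls */
};

/* Succeeds only if the shared free lists can satisfy the request. */
static bool alloc_from_buddy(struct zone_sketch *z, unsigned long nr)
{
	if (z->buddy_free < nr)
		return false;
	z->buddy_free -= nr;
	return true;
}

/* Move every per-CPU page back to the shared free lists. */
static void drain_all(struct zone_sketch *z)
{
	for (size_t cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		z->buddy_free += z->pcp[cpu];
		z->pcp[cpu] = 0;
	}
	z->drains++;
}

/* Same shape as the patched path: bail out without progress, retry once. */
static bool alloc_after_reclaim(struct zone_sketch *z, unsigned long nr,
				bool did_some_progress)
{
	bool drained = false;

	if (!did_some_progress)
		return false;

retry:
	if (alloc_from_buddy(z, nr))
		return true;

	/* Pages may be stuck on per-CPU lists: drain once and try again. */
	if (!drained) {
		drain_all(z);
		drained = true;
		goto retry;
	}

	return false;
}
```

In this model the drain is paid only on the failure path, and at most once per call, which is the property reviewers later in the thread rely on when weighing the IPI cost against going OOM.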
* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
@ 2010-08-16 14:50   ` Rik van Riel
  2010-08-17  2:57   ` Minchan Kim
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2010-08-16 14:50 UTC
To: Mel Gorman
Cc: linux-mm, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On 08/16/2010 05:42 AM, Mel Gorman wrote:
> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
>
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
  2010-08-16 14:50   ` Rik van Riel
@ 2010-08-17  2:57   ` Minchan Kim
  2010-08-18  3:02   ` KAMEZAWA Hiroyuki
  2010-08-19 14:47   ` Minchan Kim
  3 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2010-08-17 2:57 UTC
To: Mel Gorman
Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 16, 2010 at 6:42 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
>
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

IPI overhead would be good rather than going OOM or nopage.
In addition, here isn't a hot path and frequent case.

-- 
Kind regards,
Minchan Kim

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
  2010-08-16 14:50   ` Rik van Riel
  2010-08-17  2:57   ` Minchan Kim
@ 2010-08-18  3:02   ` KAMEZAWA Hiroyuki
  2010-08-19 14:47   ` Minchan Kim
  3 siblings, 0 replies; 49+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-18 3:02 UTC
To: Mel Gorman
Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KOSAKI Motohiro

On Mon, 16 Aug 2010 10:42:13 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
>
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
                     ` (2 preceding siblings ...)
  2010-08-18  3:02   ` KAMEZAWA Hiroyuki
@ 2010-08-19 14:47   ` Minchan Kim
  2010-08-19 15:10     ` Mel Gorman
  3 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2010-08-19 14:47 UTC
To: Mel Gorman
Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 16, 2010 at 10:42:13AM +0100, Mel Gorman wrote:
> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
>
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |   19 +++++++++++++++++--
>  1 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 67a2ed0..a8651a4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1844,6 +1844,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	struct page *page = NULL;
>  	struct reclaim_state reclaim_state;
>  	struct task_struct *p = current;
> +	bool drained = false;
>
>  	cond_resched();
>
> @@ -1865,11 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	if (order != 0)
>  		drain_all_pages();
>

Nitpick:

How about removing above condition and drain_all_pages?
If get_page_from_freelist fails, we do drain_all_pages at last.
It can remove double calling of drain_all_pagse in case of order > 0.
In addition, if the VM can't reclaim anythings, we don't need to drain
in case of order > 0.

> -	if (likely(*did_some_progress))
> -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> +	if (unlikely(!(*did_some_progress)))
> +		return NULL;
> +
> +retry:
> +	page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
>  					alloc_flags, preferred_zone,
>  					migratetype);
> +
> +	/*
> +	 * If an allocation failed after direct reclaim, it could be because
> +	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 */
> +	if (!page && !drained) {
> +		drain_all_pages();
> +		drained = true;
> +		goto retry;
> +	}
> +
>  	return page;
>  }
>
> -- 
> 1.7.1

-- 
Kind regards,
Minchan Kim

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-19 14:47   ` Minchan Kim
@ 2010-08-19 15:10     ` Mel Gorman
  0 siblings, 0 replies; 49+ messages in thread
From: Mel Gorman @ 2010-08-19 15:10 UTC
To: Minchan Kim
Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 11:47:03PM +0900, Minchan Kim wrote:
> On Mon, Aug 16, 2010 at 10:42:13AM +0100, Mel Gorman wrote:
> > When under significant memory pressure, a process enters direct reclaim
> > and immediately afterwards tries to allocate a page. If it fails and no
> > further progress is made, it's possible the system will go OOM. However,
> > on systems with large amounts of memory, it's possible that a significant
> > number of pages are on per-cpu lists and inaccessible to the calling
> > process. This leads to a process entering direct reclaim more often than
> > it should increasing the pressure on the system and compounding the problem.
> >
> > This patch notes that if direct reclaim is making progress but
> > allocations are still failing that the system is already under heavy
> > pressure. In this case, it drains the per-cpu lists and tries the
> > allocation a second time before continuing.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |   19 +++++++++++++++++--
> >  1 files changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 67a2ed0..a8651a4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1844,6 +1844,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  	struct page *page = NULL;
> >  	struct reclaim_state reclaim_state;
> >  	struct task_struct *p = current;
> > +	bool drained = false;
> >
> >  	cond_resched();
> >
> > @@ -1865,11 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  	if (order != 0)
> >  		drain_all_pages();
> >
>
> Nitpick:
>
> How about removing above condition and drain_all_pages?
> If get_page_from_freelist fails, we do drain_all_pages at last.
> It can remove double calling of drain_all_pagse in case of order > 0.
> In addition, if the VM can't reclaim anythings, we don't need to drain
> in case of order > 0.
>

That sounds reasonable. V2 of this series will delete the lines

if (order != 0)
	drain_all_pages()

>
> > -	if (likely(*did_some_progress))
> > -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > +	if (unlikely(!(*did_some_progress)))
> > +		return NULL;
> > +
> > +retry:
> > +	page = get_page_from_freelist(gfp_mask, nodemask, order,
> >  					zonelist, high_zoneidx,
> >  					alloc_flags, preferred_zone,
> >  					migratetype);
> > +
> > +	/*
> > +	 * If an allocation failed after direct reclaim, it could be because
> > +	 * pages are pinned on the per-cpu lists. Drain them and try again
> > +	 */
> > +	if (!page && !drained) {
> > +		drain_all_pages();
> > +		drained = true;
> > +		goto retry;
> > +	}
> > +
> >  	return page;
> >  }
> >
> > -- 
> > 1.7.1
> >

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
