* [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache
From: Johannes Weiner @ 2014-01-24 22:03 UTC
To: Andrew Morton
Cc: Tejun Heo, Rik van Riel, Mel Gorman, linux-mm, linux-fsdevel,
linux-kernel
Tejun reported stuttering and latency spikes on a system where random
tasks would enter direct reclaim and get stuck on dirty pages. Around
50% of memory was occupied by tmpfs backed by an SSD, and another disk
(rotating) was reading and writing at max speed to shrink a partition.
Analysis:
When calculating the amount of dirtyable memory, the VM considers all
free memory and all file and anon pages as the baseline to which the
dirty limits are applied. This implies that, given memory pressure
from dirtied cache, the VM would actually start swapping to make room.
But alas, this is not really the case: page reclaim tries very hard
not to swap as long as there is any used-once cache available. The
dirty limit may have been 10-15% of main memory, but the page cache
was less than half of main memory, which means that around a third of
the pages the reclaimers actually looked at were dirty. Kswapd stopped
making progress, and in turn allocators were forced into direct
reclaim, only to get stuck on dirty/writeback congestion.
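To make the arithmetic concrete (the numbers below are illustrative
assumptions, not measurements from Tejun's machine): on an 8G box with
half of memory in tmpfs and a dirty limit of 15%,

    dirtyable (old)  ~  free + file + anon  ~  8G
    dirty limit      =  15% of 8G           ~  1.2G
    scannable cache  <  4G

so up to 1.2G out of less than 4G of cache - roughly a third of what
reclaim scans - can be dirty at any given time.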
These two patches fix the dirtyable memory calculation to acknowledge
the fact that the VM does not really replace anon with dirty cache.
As such, anon memory can no longer be considered "dirtyable."
Longer term we probably want to look into reducing some of the bias
towards cache. The problematic workload in particular was not even
using any of the anon pages; one swap burst could have resolved it.
* [patch 1/2] mm: page-writeback: fix dirty_balance_reserve subtraction from dirtyable memory
From: Johannes Weiner @ 2014-01-24 22:03 UTC
To: Andrew Morton
Cc: Tejun Heo, Rik van Riel, Mel Gorman, linux-mm, linux-fsdevel,
linux-kernel
The dirty_balance_reserve is an approximation of the fraction of free
pages that the page allocator does not make available for page cache
allocations. As a result, it has to be taken into account when
calculating the amount of "dirtyable memory", the baseline to which
dirty_background_ratio and dirty_ratio are applied.
However, currently the reserve is subtracted from the sum of free and
reclaimable pages, which is nonsensical and leads to erroneous
results when the system is dominated by unreclaimable pages and the
dirty_balance_reserve is bigger than free+reclaimable. In that case,
at least the already allocated cache should be considered dirtyable.
Fix the calculation by subtracting the reserve from the amount of free
pages, then adding the reclaimable pages on top.
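As a sketch of the difference (the numbers are invented for
illustration): with 50M free, 60M of reclaimable cache, and a 150M
dirty_balance_reserve,

    old:  (50M free + 60M reclaimable) - min(110M, 150M reserve)  =   0M
    new:   50M free - min(50M, 150M reserve) + 60M reclaimable    =  60M

so the already allocated cache remains dirtyable even when the reserve
exceeds free+reclaimable.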
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page-writeback.c | 52 +++++++++++++++++++++++-----------------------------
1 file changed, 23 insertions(+), 29 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 63807583d8e8..79cf52b058a7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -191,6 +191,25 @@ static unsigned long writeout_period_time = 0;
* global dirtyable memory first.
*/
+/**
+ * zone_dirtyable_memory - number of dirtyable pages in a zone
+ * @zone: the zone
+ *
+ * Returns the zone's number of pages potentially available for dirty
+ * page cache. This is the base value for the per-zone dirty limits.
+ */
+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+ unsigned long nr_pages;
+
+ nr_pages = zone_page_state(zone, NR_FREE_PAGES);
+ nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
+
+ nr_pages += zone_reclaimable_pages(zone);
+
+ return nr_pages;
+}
+
static unsigned long highmem_dirtyable_memory(unsigned long total)
{
#ifdef CONFIG_HIGHMEM
@@ -201,8 +220,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
- x += zone_page_state(z, NR_FREE_PAGES) +
- zone_reclaimable_pages(z) - z->dirty_balance_reserve;
+ x += zone_dirtyable_memory(z);
}
/*
* Unreclaimable memory (kernel memory or anonymous memory
@@ -238,9 +256,11 @@ static unsigned long global_dirtyable_memory(void)
{
unsigned long x;
- x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+ x = global_page_state(NR_FREE_PAGES);
x -= min(x, dirty_balance_reserve);
+ x += global_reclaimable_pages();
+
if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
@@ -289,32 +309,6 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
}
/**
- * zone_dirtyable_memory - number of dirtyable pages in a zone
- * @zone: the zone
- *
- * Returns the zone's number of pages potentially available for dirty
- * page cache. This is the base value for the per-zone dirty limits.
- */
-static unsigned long zone_dirtyable_memory(struct zone *zone)
-{
- /*
- * The effective global number of dirtyable pages may exclude
- * highmem as a big-picture measure to keep the ratio between
- * dirty memory and lowmem reasonable.
- *
- * But this function is purely about the individual zone and a
- * highmem zone can hold its share of dirty pages, so we don't
- * care about vm_highmem_is_dirtyable here.
- */
- unsigned long nr_pages = zone_page_state(zone, NR_FREE_PAGES) +
- zone_reclaimable_pages(zone);
-
- /* don't allow this to underflow */
- nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
- return nr_pages;
-}
-
-/**
* zone_dirty_limit - maximum number of dirty pages allowed in a zone
* @zone: the zone
*
--
1.8.4.2
* [patch 2/2] mm: page-writeback: do not count anon pages as dirtyable memory
From: Johannes Weiner @ 2014-01-24 22:03 UTC
To: Andrew Morton
Cc: Tejun Heo, Rik van Riel, Mel Gorman, linux-mm, linux-fsdevel,
linux-kernel
The VM is currently heavily tuned to avoid swapping. Whether that is
good or bad is a separate discussion, but as long as the VM won't swap
to make room for dirty cache, we can not consider anonymous pages when
calculating the amount of dirtyable memory, the baseline to which
dirty_background_ratio and dirty_ratio are applied.
A simple workload that occupies a significant portion (40+%, depending
on memory layout, storage speeds etc.) of memory with anon/tmpfs pages
and uses the remainder for a streaming writer demonstrates this
problem. In that case, the actual cache pages are a small fraction of
what is considered dirtyable overall, which results in a relatively
large portion of the cache pages being dirtied. As kswapd starts
rotating these, random tasks enter direct reclaim and stall on IO.
Only consider free pages and file pages dirtyable.
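For a rough sense of the effect (figures assumed to match the
description above, not measured): with 8G of memory, 4G of it in
anon/tmpfs, ~0.5G free, ~3.5G of file cache, and dirty_ratio=20,

    dirty limit (old)  ~  20% of 8G             ~  1.6G
    dirty limit (new)  ~  20% of (0.5G + 3.5G)  ~  0.8G

The old limit allows almost half of the actual cache to be dirty; the
new one keeps dirty pages a much smaller share of what reclaim scans.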
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/vmstat.h | 2 --
mm/internal.h | 1 -
mm/page-writeback.c | 6 ++++--
mm/vmscan.c | 23 +----------------------
4 files changed, 5 insertions(+), 27 deletions(-)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e4b948080d20..a67b38415768 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -142,8 +142,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
return x;
}
-extern unsigned long global_reclaimable_pages(void);
-
#ifdef CONFIG_NUMA
/*
* Determine the per node value of a stat item. This function
diff --git a/mm/internal.h b/mm/internal.h
index 684f7aa9692a..8b6cfd63b5a5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -85,7 +85,6 @@ extern unsigned long highest_memmap_pfn;
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern bool zone_reclaimable(struct zone *zone);
/*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 79cf52b058a7..29e129478644 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -205,7 +205,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
nr_pages = zone_page_state(zone, NR_FREE_PAGES);
nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
- nr_pages += zone_reclaimable_pages(zone);
+ nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
+ nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
return nr_pages;
}
@@ -259,7 +260,8 @@ static unsigned long global_dirtyable_memory(void)
x = global_page_state(NR_FREE_PAGES);
x -= min(x, dirty_balance_reserve);
- x += global_reclaimable_pages();
+ x += global_page_state(NR_INACTIVE_FILE);
+ x += global_page_state(NR_ACTIVE_FILE);
if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d9cff6..05e6095159dc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -147,7 +147,7 @@ static bool global_reclaim(struct scan_control *sc)
}
#endif
-unsigned long zone_reclaimable_pages(struct zone *zone)
+static unsigned long zone_reclaimable_pages(struct zone *zone)
{
int nr;
@@ -3297,27 +3297,6 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
wake_up_interruptible(&pgdat->kswapd_wait);
}
-/*
- * The reclaimable count would be mostly accurate.
- * The less reclaimable pages may be
- * - mlocked pages, which will be moved to unevictable list when encountered
- * - mapped pages, which may require several travels to be reclaimed
- * - dirty pages, which is not "instantly" reclaimable
- */
-unsigned long global_reclaimable_pages(void)
-{
- int nr;
-
- nr = global_page_state(NR_ACTIVE_FILE) +
- global_page_state(NR_INACTIVE_FILE);
-
- if (get_nr_swap_pages() > 0)
- nr += global_page_state(NR_ACTIVE_ANON) +
- global_page_state(NR_INACTIVE_ANON);
-
- return nr;
-}
-
#ifdef CONFIG_HIBERNATION
/*
* Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
--
1.8.4.2
* Re: [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache
From: Tejun Heo @ 2014-01-24 22:21 UTC
To: Johannes Weiner
Cc: Andrew Morton, Rik van Riel, Mel Gorman, linux-mm, linux-fsdevel,
linux-kernel
Hello,
On Fri, Jan 24, 2014 at 05:03:02PM -0500, Johannes Weiner wrote:
> These two patches fix the dirtyable memory calculation to acknowledge
> the fact that the VM does not really replace anon with dirty cache.
> As such, anon memory can no longer be considered "dirtyable."
>
> Longer term we probably want to look into reducing some of the bias
> towards cache. The problematic workload in particular was not even
> using any of the anon pages; one swap burst could have resolved it.
For both patches,
Tested-by: Tejun Heo <tj@kernel.org>
I don't have much idea what's going on here, but the problem was
pretty ridiculous. It's an 8gig machine w/ one SSD and a 10k rpm hard
drive, and I could reliably reproduce constant stuttering every
several seconds for as long as buffered IO was going on on the hard
drive, either with tmpfs occupying somewhere above 4gig or with a test
program which allocates about the same amount of anon memory.
Although swap usage was zero, turning off swap made the problem go
away too.
The trigger conditions seem quite plausible - high anon memory usage
w/ heavy buffered IO and swap configured - and it's highly likely that
this is happening in the wild too. (This can happen when copying
large files to USB sticks too, right?)
So, if this is the right fix && can be determined not to cause
noticeable regressions, it probably is worthwhile to cc -stable.
Thanks a lot!
--
tejun
* Re: [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache
From: Andrew Morton @ 2014-01-24 22:30 UTC
To: Johannes Weiner
Cc: Tejun Heo, Rik van Riel, Mel Gorman, linux-mm, linux-fsdevel,
linux-kernel
On Fri, 24 Jan 2014 17:03:02 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> Tejun reported stuttering and latency spikes on a system where random
> tasks would enter direct reclaim and get stuck on dirty pages. Around
> 50% of memory was occupied by tmpfs backed by an SSD, and another disk
> (rotating) was reading and writing at max speed to shrink a partition.
Do you think this is serious enough to squeeze these into 3.14?
* Re: [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache
From: Johannes Weiner @ 2014-01-24 22:51 UTC
To: Andrew Morton
Cc: Tejun Heo, Rik van Riel, Mel Gorman, linux-mm, linux-fsdevel,
linux-kernel
On Fri, Jan 24, 2014 at 02:30:03PM -0800, Andrew Morton wrote:
> On Fri, 24 Jan 2014 17:03:02 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > Tejun reported stuttering and latency spikes on a system where random
> > tasks would enter direct reclaim and get stuck on dirty pages. Around
> > 50% of memory was occupied by tmpfs backed by an SSD, and another disk
> > (rotating) was reading and writing at max speed to shrink a partition.
>
> Do you think this is serious enough to squeeze these into 3.14?
We have been biasing towards cache reclaim at least as far back as the
LRU split and we always considered anon dirtyable, so it's not really
a *new* problem. And there is a chance of regressing write bandwidth
for certain workloads by effectively shrinking their dirty limit -
although that is easily fixed by changing dirty_ratio.
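(For illustration - the value below is only an example, the right
number is workload-dependent - such a workload could raise the limit
back through the standard procfs knob:

    echo 40 > /proc/sys/vm/dirty_ratio

against the default of 20.)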
On the other hand, the stuttering is pretty nasty (could reproduce it
locally too) and the workload is not exactly esoteric. Plus, I'm not
sure if waiting will increase the test exposure.
So 3.14 would work for me, unless Mel and Rik have concerns.
* Re: [patch 1/2] mm: page-writeback: fix dirty_balance_reserve subtraction from dirtyable memory
From: Rik van Riel @ 2014-01-24 23:05 UTC
To: Johannes Weiner, Andrew Morton
Cc: Tejun Heo, Mel Gorman, linux-mm, linux-fsdevel, linux-kernel
On 01/24/2014 05:03 PM, Johannes Weiner wrote:
> The dirty_balance_reserve is an approximation of the fraction of free
> pages that the page allocator does not make available for page cache
> allocations. As a result, it has to be taken into account when
> calculating the amount of "dirtyable memory", the baseline to which
> dirty_background_ratio and dirty_ratio are applied.
>
> However, currently the reserve is subtracted from the sum of free and
> reclaimable pages, which is nonsensical and leads to erroneous
> results when the system is dominated by unreclaimable pages and the
> dirty_balance_reserve is bigger than free+reclaimable. In that case,
> at least the already allocated cache should be considered dirtyable.
>
> Fix the calculation by subtracting the reserve from the amount of free
> pages, then adding the reclaimable pages on top.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [patch 2/2] mm: page-writeback: do not count anon pages as dirtyable memory
From: Rik van Riel @ 2014-01-24 23:25 UTC
To: Johannes Weiner, Andrew Morton
Cc: Tejun Heo, Mel Gorman, linux-mm, linux-fsdevel, linux-kernel
On 01/24/2014 05:03 PM, Johannes Weiner wrote:
> The VM is currently heavily tuned to avoid swapping. Whether that is
> good or bad is a separate discussion, but as long as the VM won't swap
> to make room for dirty cache, we can not consider anonymous pages when
> calculating the amount of dirtyable memory, the baseline to which
> dirty_background_ratio and dirty_ratio are applied.
>
> A simple workload that occupies a significant portion (40+%,
> depending on memory layout, storage speeds etc.) of memory with
> anon/tmpfs pages and uses the remainder for a streaming writer
> demonstrates this problem. In that case, the actual cache pages are
> a small fraction of what is considered dirtyable overall, which
> results in a relatively large portion of the cache pages being
> dirtied. As kswapd starts rotating these, random tasks enter direct
> reclaim and stall on IO.
>
> Only consider free pages and file pages dirtyable.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache
From: Rik van Riel @ 2014-01-24 23:26 UTC
To: Johannes Weiner, Andrew Morton
Cc: Tejun Heo, Mel Gorman, linux-mm, linux-fsdevel, linux-kernel
On 01/24/2014 05:51 PM, Johannes Weiner wrote:
> On Fri, Jan 24, 2014 at 02:30:03PM -0800, Andrew Morton wrote:
>> On Fri, 24 Jan 2014 17:03:02 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>>> Tejun reported stuttering and latency spikes on a system where random
>>> tasks would enter direct reclaim and get stuck on dirty pages. Around
>>> 50% of memory was occupied by tmpfs backed by an SSD, and another disk
>>> (rotating) was reading and writing at max speed to shrink a partition.
>>
>> Do you think this is serious enough to squeeze these into 3.14?
>
> We have been biasing towards cache reclaim at least as far back as the
> LRU split and we always considered anon dirtyable, so it's not really
> a *new* problem. And there is a chance of regressing write bandwidth
> for certain workloads by effectively shrinking their dirty limit -
> although that is easily fixed by changing dirty_ratio.
>
> On the other hand, the stuttering is pretty nasty (could reproduce it
> locally too) and the workload is not exactly esoteric. Plus, I'm not
> sure if waiting will increase the test exposure.
>
> So 3.14 would work for me, unless Mel and Rik have concerns.
3.14 would be fine, indeed.
On the other hand, if there are enough user reports of the stuttering
problem on older kernels, a -stable backport could be appropriate
too...
--
All rights reversed
* Re: [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache
From: Tejun Heo @ 2014-01-24 23:31 UTC
To: Johannes Weiner
Cc: Andrew Morton, Rik van Riel, Mel Gorman, linux-mm, linux-fsdevel,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2195 bytes --]
On Fri, Jan 24, 2014 at 05:21:44PM -0500, Tejun Heo wrote:
> The trigger conditions seem quite plausible - high anon memory usage
> w/ heavy buffered IO and swap configured - and it's highly likely that
> this is happening in the wild too. (this can happen with copying
> large files to usb sticks too, right?)
So, I just tested with the USB stick, and these two patches, while not
perfect, make a world of difference. The problem is really easy to
reproduce on my machine, which has 8gig of memory, with the two
attached test programs.

* run "test-membloat 4300" and wait for it to report completion.

* run "test-latency"

Mount a slow USB stick and copy a large (multi-gig) file to it.

test-latency tries to print out a dot every 10ms but will report a
log2 number if the latency becomes more than twice as high - i.e. 4
means it took 2^4 * 10ms to complete a loop which is supposed to take
slightly longer than 10ms (10ms sleep + 4 page faults). My USB stick
can only do a couple mbytes/s, and without these patches the machine
becomes basically useless. It's just not usable; it stutters more
than it runs until the whole file finishes copying.
Because I've been using tmpfs as a build target for a while, I've been
experiencing this occasionally and secretly growing bitter
disappointment towards the Linux kernel, which developed into
self-loathing to the point where I found booting into win8 consoling
after looking at my machine stuttering for 45mins while it was
repartitioning the hard drive to make room for SteamOS. Oh the irony.
I had to stay in fetal position for a while afterwards. It was a
crisis.
With the patches applied, for both the heavy harddrive IO and the
copy-large-file-to-slow-USB cases, the behavior is vastly improved.
It does stutter for a while once memory is filled up, but it
stabilizes somewhere above ten seconds and then stays responsive.
While it isn't perfect, it's not completely ridiculous like before.
So, lots of kudos to Johannes for *finally* fixing the issue, and I
strongly believe this is something we should consider for -stable,
even if it takes a considerable amount of effort to verify that it's
not too harmful for other workloads.
Thanks a lot.
--
tejun
[-- Attachment #2: test-latency.c --]
[-- Type: text/plain, Size: 1292 bytes --]
#include <stdio.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <time.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>
#define NR_ALPHAS ('z' - 'a' + 1)
int main(int argc, char **argv)
{
        struct timespec intv_ts = { }, ts;
        unsigned long long time0, time1;
        long long msecs = 10;
        const size_t map_size = 4096 * 4;

        if (argc > 1) {
                msecs = atoll(argv[1]);
                if (msecs <= 0) {
                        fprintf(stderr, "test-latency [interval-in-msecs]\n");
                        return 1;
                }
        }

        intv_ts.tv_sec = msecs / 1000;
        intv_ts.tv_nsec = (msecs % 1000) * 1000000;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        time1 = ts.tv_sec * 1000000000LLU + ts.tv_nsec;

        while (1) {
                void *map, *p;
                int idx;
                char c;

                nanosleep(&intv_ts, NULL);

                /* allocate and fault in a few fresh anon pages */
                map = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (map == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                for (p = map; p < map + map_size; p += 4096)
                        *(volatile unsigned long *)p = 0xdeadbeef;
                munmap(map, map_size);

                time0 = time1;
                clock_gettime(CLOCK_MONOTONIC, &ts);
                time1 = ts.tv_sec * 1000000000LLU + ts.tv_nsec;

                /* print '.' when on time, else the log2 of the slowdown */
                idx = (time1 - time0) / msecs / 1000000;
                idx = log2(idx);
                if (idx <= 1) {
                        c = '.';
                } else {
                        if (idx > 9)
                                idx = 9;
                        c = '0' + idx;
                }
                write(1, &c, 1);
        }
}
[-- Attachment #3: test-membloat.c --]
[-- Type: text/plain, Size: 1096 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int main(int argc, char **argv)
{
        struct timespec ts_100s = { .tv_sec = 100 };
        long mbytes, cnt;
        void *map, *p;
        int fd = -1;
        int flags;

        if (argc < 2 || (mbytes = atol(argv[1])) <= 0) {
                fprintf(stderr, "test-membloat SIZE_IN_MBYTES [FILENAME]\n");
                return 1;
        }

        if (argc >= 3) {
                /* with a filename, bloat memory with a shared file mapping */
                fd = open(argv[2], O_CREAT|O_TRUNC|O_RDWR, S_IRWXU);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (ftruncate(fd, mbytes << 20)) {
                        perror("ftruncate");
                        return 1;
                }
                flags = MAP_SHARED;
        } else {
                /* by default, bloat memory with anonymous pages */
                flags = MAP_ANONYMOUS | MAP_PRIVATE;
        }

        map = mmap(NULL, (size_t)mbytes << 20, PROT_READ | PROT_WRITE,
                   flags, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* touch every page so the memory is actually allocated */
        for (p = map, cnt = 0; p < map + (mbytes << 20); p += 4096) {
                *(volatile unsigned long *)p = 0xdeadbeef;
                cnt++;
        }

        printf("faulted in %ld mbytes, %ld pages\n", mbytes, cnt);

        /* hold the memory until killed */
        while (1)
                nanosleep(&ts_100s, NULL);
        return 0;
}
* Re: [patch 1/2] mm: page-writeback: fix dirty_balance_reserve subtraction from dirtyable memory
From: Michal Hocko @ 2014-01-28 13:48 UTC
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, Rik van Riel, Mel Gorman, linux-mm,
linux-fsdevel, linux-kernel
On Fri 24-01-14 17:03:03, Johannes Weiner wrote:
> The dirty_balance_reserve is an approximation of the fraction of free
> pages that the page allocator does not make available for page cache
> allocations. As a result, it has to be taken into account when
> calculating the amount of "dirtyable memory", the baseline to which
> dirty_background_ratio and dirty_ratio are applied.
>
> However, currently the reserve is subtracted from the sum of free and
> reclaimable pages, which is nonsensical and leads to erroneous
> results when the system is dominated by unreclaimable pages and the
> dirty_balance_reserve is bigger than free+reclaimable. In that case,
> at least the already allocated cache should be considered dirtyable.
>
> Fix the calculation by subtracting the reserve from the amount of free
> pages, then adding the reclaimable pages on top.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Makes sense
Reviewed-by: Michal Hocko <mhocko@suse.cz>
> ---
> mm/page-writeback.c | 52 +++++++++++++++++++++++-----------------------------
> 1 file changed, 23 insertions(+), 29 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 63807583d8e8..79cf52b058a7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -191,6 +191,25 @@ static unsigned long writeout_period_time = 0;
> * global dirtyable memory first.
> */
>
> +/**
> + * zone_dirtyable_memory - number of dirtyable pages in a zone
> + * @zone: the zone
> + *
> + * Returns the zone's number of pages potentially available for dirty
> + * page cache. This is the base value for the per-zone dirty limits.
> + */
> +static unsigned long zone_dirtyable_memory(struct zone *zone)
> +{
> + unsigned long nr_pages;
> +
> + nr_pages = zone_page_state(zone, NR_FREE_PAGES);
> + nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
> +
> + nr_pages += zone_reclaimable_pages(zone);
> +
> + return nr_pages;
> +}
> +
> static unsigned long highmem_dirtyable_memory(unsigned long total)
> {
> #ifdef CONFIG_HIGHMEM
> @@ -201,8 +220,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> struct zone *z =
> &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
>
> - x += zone_page_state(z, NR_FREE_PAGES) +
> - zone_reclaimable_pages(z) - z->dirty_balance_reserve;
> + x += zone_dirtyable_memory(z);
> }
> /*
> * Unreclaimable memory (kernel memory or anonymous memory
> @@ -238,9 +256,11 @@ static unsigned long global_dirtyable_memory(void)
> {
> unsigned long x;
>
> - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> + x = global_page_state(NR_FREE_PAGES);
> x -= min(x, dirty_balance_reserve);
>
> + x += global_reclaimable_pages();
> +
> if (!vm_highmem_is_dirtyable)
> x -= highmem_dirtyable_memory(x);
>
> @@ -289,32 +309,6 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> }
>
> /**
> - * zone_dirtyable_memory - number of dirtyable pages in a zone
> - * @zone: the zone
> - *
> - * Returns the zone's number of pages potentially available for dirty
> - * page cache. This is the base value for the per-zone dirty limits.
> - */
> -static unsigned long zone_dirtyable_memory(struct zone *zone)
> -{
> - /*
> - * The effective global number of dirtyable pages may exclude
> - * highmem as a big-picture measure to keep the ratio between
> - * dirty memory and lowmem reasonable.
> - *
> - * But this function is purely about the individual zone and a
> - * highmem zone can hold its share of dirty pages, so we don't
> - * care about vm_highmem_is_dirtyable here.
> - */
> - unsigned long nr_pages = zone_page_state(zone, NR_FREE_PAGES) +
> - zone_reclaimable_pages(zone);
> -
> - /* don't allow this to underflow */
> - nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
> - return nr_pages;
> -}
> -
> -/**
> * zone_dirty_limit - maximum number of dirty pages allowed in a zone
> * @zone: the zone
> *
> --
> 1.8.4.2
>
--
Michal Hocko
SUSE Labs
* Re: [patch 2/2] mm: page-writeback: do not count anon pages as dirtyable memory
From: Michal Hocko @ 2014-01-28 13:58 UTC
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, Rik van Riel, Mel Gorman, linux-mm,
linux-fsdevel, linux-kernel
On Fri 24-01-14 17:03:04, Johannes Weiner wrote:
> The VM is currently heavily tuned to avoid swapping. Whether that is
> good or bad is a separate discussion, but as long as the VM won't swap
> to make room for dirty cache, we can not consider anonymous pages when
> calculating the amount of dirtyable memory, the baseline to which
> dirty_background_ratio and dirty_ratio are applied.
>
> A simple workload that occupies a significant portion (40+%,
> depending on memory layout, storage speeds etc.) of memory with
> anon/tmpfs pages and uses the remainder for a streaming writer
> demonstrates this problem. In that case, the actual cache pages are
> a small fraction of what is considered dirtyable overall, which
> results in a relatively large portion of the cache pages being
> dirtied. As kswapd starts rotating these, random tasks enter direct
> reclaim and stall on IO.
>
> Only consider free pages and file pages dirtyable.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
> ---
> include/linux/vmstat.h | 2 --
> mm/internal.h | 1 -
> mm/page-writeback.c | 6 ++++--
> mm/vmscan.c | 23 +----------------------
> 4 files changed, 5 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index e4b948080d20..a67b38415768 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -142,8 +142,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> return x;
> }
>
> -extern unsigned long global_reclaimable_pages(void);
> -
> #ifdef CONFIG_NUMA
> /*
> * Determine the per node value of a stat item. This function
> diff --git a/mm/internal.h b/mm/internal.h
> index 684f7aa9692a..8b6cfd63b5a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -85,7 +85,6 @@ extern unsigned long highest_memmap_pfn;
> */
> extern int isolate_lru_page(struct page *page);
> extern void putback_lru_page(struct page *page);
> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
> extern bool zone_reclaimable(struct zone *zone);
>
> /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 79cf52b058a7..29e129478644 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -205,7 +205,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
> nr_pages = zone_page_state(zone, NR_FREE_PAGES);
> nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
>
> - nr_pages += zone_reclaimable_pages(zone);
> + nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
> + nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
>
> return nr_pages;
> }
> @@ -259,7 +260,8 @@ static unsigned long global_dirtyable_memory(void)
> x = global_page_state(NR_FREE_PAGES);
> x -= min(x, dirty_balance_reserve);
>
> - x += global_reclaimable_pages();
> + x += global_page_state(NR_INACTIVE_FILE);
> + x += global_page_state(NR_ACTIVE_FILE);
>
> if (!vm_highmem_is_dirtyable)
> x -= highmem_dirtyable_memory(x);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eea668d9cff6..05e6095159dc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -147,7 +147,7 @@ static bool global_reclaim(struct scan_control *sc)
> }
> #endif
>
> -unsigned long zone_reclaimable_pages(struct zone *zone)
> +static unsigned long zone_reclaimable_pages(struct zone *zone)
> {
> int nr;
>
> @@ -3297,27 +3297,6 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
> wake_up_interruptible(&pgdat->kswapd_wait);
> }
>
> -/*
> - * The reclaimable count would be mostly accurate.
> - * The less reclaimable pages may be
> - * - mlocked pages, which will be moved to unevictable list when encountered
> - * - mapped pages, which may require several travels to be reclaimed
> - * - dirty pages, which is not "instantly" reclaimable
> - */
> -unsigned long global_reclaimable_pages(void)
> -{
> - int nr;
> -
> - nr = global_page_state(NR_ACTIVE_FILE) +
> - global_page_state(NR_INACTIVE_FILE);
> -
> - if (get_nr_swap_pages() > 0)
> - nr += global_page_state(NR_ACTIVE_ANON) +
> - global_page_state(NR_INACTIVE_ANON);
> -
> - return nr;
> -}
> -
> #ifdef CONFIG_HIBERNATION
> /*
> * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
> --
> 1.8.4.2
>
--
Michal Hocko
SUSE Labs