* [PATCH] mm: vmscan: Do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL
From: Mel Gorman @ 2014-04-22 8:38 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm
throttle_direct_reclaim() is meant to trigger during swap-over-network
during which the min watermark is treated as a pfmemalloc reserve. It
throttles on the first node in the zonelist but this is flawed.
On a NUMA machine running a 32-bit kernel (I know) allocation requests
from CPUs on node 1 would detect no pfmemalloc reserves and the process
gets throttled. This patch adjusts throttling of direct reclaim to throttle
based on the first node in the zonelist that has a usable ZONE_NORMAL or
lower zone.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 33 +++++++++++++++++++++++++++------
1 file changed, 27 insertions(+), 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f56c8d..9c4918e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2507,10 +2507,17 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
+ if (!populated_zone(zone))
+ continue;
+
pfmemalloc_reserve += min_wmark_pages(zone);
free_pages += zone_page_state(zone, NR_FREE_PAGES);
}
+ /* If there are no reserves (unexpected config) then do not throttle */
+ if (!pfmemalloc_reserve)
+ return true;
+
wmark_ok = free_pages > pfmemalloc_reserve / 2;
/* kswapd must be awake if processes are being throttled */
@@ -2535,9 +2542,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
nodemask_t *nodemask)
{
+ struct zoneref *z;
struct zone *zone;
- int high_zoneidx = gfp_zone(gfp_mask);
- pg_data_t *pgdat;
+ pg_data_t *pgdat = NULL;
/*
* Kernel threads should not be throttled as they may be indirectly
@@ -2556,10 +2563,24 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
if (fatal_signal_pending(current))
goto out;
- /* Check if the pfmemalloc reserves are ok */
- first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
- pgdat = zone->zone_pgdat;
- if (pfmemalloc_watermark_ok(pgdat))
+ /*
+ * Check if the pfmemalloc reserves are ok by finding the first node
+ * with a usable ZONE_NORMAL or lower zone
+ */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ gfp_mask, nodemask) {
+ if (zone_idx(zone) > ZONE_NORMAL)
+ continue;
+
+ /* Throttle based on the first usable node */
+ pgdat = zone->zone_pgdat;
+ if (pfmemalloc_watermark_ok(pgdat))
+ goto out;
+ break;
+ }
+
+ /* If no zone was usable by the allocation flags then do not throttle */
+ if (!pgdat)
goto out;
/* Account for the throttling */
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] mm: vmscan: Do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL
From: Andrew Morton @ 2014-04-22 19:31 UTC
To: Mel Gorman; +Cc: linux-kernel, linux-mm
On Tue, 22 Apr 2014 09:38:52 +0100 Mel Gorman <mgorman@suse.de> wrote:
> throttle_direct_reclaim() is meant to trigger during swap-over-network
> during which the min watermark is treated as a pfmemalloc reserve. It
> throttles on the first node in the zonelist but this is flawed.
>
> On a NUMA machine running a 32-bit kernel (I know) allocation requests
> from CPUs on node 1 would detect no pfmemalloc reserves and the process
> gets throttled. This patch adjusts throttling of direct reclaim to throttle
> based on the first node in the zonelist that has a usable ZONE_NORMAL or
> lower zone.
I'm unable to determine from the above whether we should backport this
fix. Please don't forget to describe the end-user visible effects of
a bug when that isn't obvious.
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2507,10 +2507,17 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>
> for (i = 0; i <= ZONE_NORMAL; i++) {
> zone = &pgdat->node_zones[i];
> + if (!populated_zone(zone))
> + continue;
What's this? Performance tweak? Or does min_wmark_pages() return
non-zero for an unpopulated zone, which seems odd.
> pfmemalloc_reserve += min_wmark_pages(zone);
> free_pages += zone_page_state(zone, NR_FREE_PAGES);
> }
>
> + /* If there are no reserves (unexpected config) then do not throttle */
> + if (!pfmemalloc_reserve)
> + return true;
> +
> wmark_ok = free_pages > pfmemalloc_reserve / 2;
>
> /* kswapd must be awake if processes are being throttled */
> @@ -2535,9 +2542,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> nodemask_t *nodemask)
> {
> + struct zoneref *z;
> struct zone *zone;
> - int high_zoneidx = gfp_zone(gfp_mask);
> - pg_data_t *pgdat;
> + pg_data_t *pgdat = NULL;
>
> /*
> * Kernel threads should not be throttled as they may be indirectly
> @@ -2556,10 +2563,24 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> if (fatal_signal_pending(current))
> goto out;
>
> - /* Check if the pfmemalloc reserves are ok */
> - first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> - pgdat = zone->zone_pgdat;
> - if (pfmemalloc_watermark_ok(pgdat))
> + /*
> + * Check if the pfmemalloc reserves are ok by finding the first node
> + * with a usable ZONE_NORMAL or lower zone
> + */
That comment tells us what the code does but not why it does it.
- Why do we ignore zones > ZONE_NORMAL?
- Why do we throttle when there may be as-yet-unexamined nodes which
have reclaimable pages?
> + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> + gfp_mask, nodemask) {
Those two lines have spaces-instead-of-tabs.
> + if (zone_idx(zone) > ZONE_NORMAL)
> + continue;
> +
> + /* Throttle based on the first usable node */
> + pgdat = zone->zone_pgdat;
> + if (pfmemalloc_watermark_ok(pgdat))
> + goto out;
> + break;
> + }
> +
> + /* If no zone was usable by the allocation flags then do not throttle */
> + if (!pgdat)
> goto out;
* Re: [PATCH] mm: vmscan: Do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL
From: Mel Gorman @ 2014-04-23 13:52 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm
On Tue, Apr 22, 2014 at 12:31:49PM -0700, Andrew Morton wrote:
> On Tue, 22 Apr 2014 09:38:52 +0100 Mel Gorman <mgorman@suse.de> wrote:
>
> > throttle_direct_reclaim() is meant to trigger during swap-over-network
> > during which the min watermark is treated as a pfmemalloc reserve. It
> > throttles on the first node in the zonelist but this is flawed.
> >
> > On a NUMA machine running a 32-bit kernel (I know) allocation requests
> > from CPUs on node 1 would detect no pfmemalloc reserves and the process
> > gets throttled. This patch adjusts throttling of direct reclaim to throttle
> > based on the first node in the zonelist that has a usable ZONE_NORMAL or
> > lower zone.
>
> I'm unable to determine from the above whether we should backport this
> fix. Please don't forget to describe the end-user visible effects of
> a bug when that isn't obvious.
>
The user-visible impact is that a process running on a CPU whose local
memory node has no ZONE_NORMAL will stall for prolonged periods of time,
possibly indefinitely. This is due to throttle_direct_reclaim thinking the
pfmemalloc reserves are depleted when in fact they don't exist on that node.
Strictly speaking this is stable material. I should have flagged it as
such but hadn't as I was treating 32-bit kernels running on NUMA hardware
as being a poor choice.
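To illustrate the stall, here is a minimal user-space sketch of the
watermark check before and after the patch. It is not kernel code: the
model_* types and the watermark_ok_old()/watermark_ok_new() names are
hypothetical and only mirror the shape of pfmemalloc_watermark_ok(). On a
node with no populated zone at or below ZONE_NORMAL, the old check
evaluates 0 > 0 and reports the reserves as depleted.

```c
#include <stdbool.h>

/* Hypothetical model of a node's zones; not the kernel's definitions. */
enum { MODEL_ZONE_DMA, MODEL_ZONE_NORMAL, MODEL_ZONE_HIGHMEM, MODEL_NR_ZONES };

struct model_zone {
	unsigned long present_pages;	/* 0 means the zone is unpopulated */
	unsigned long min_wmark;	/* min watermark, in pages */
	unsigned long free_pages;	/* currently free pages */
};

struct model_node {
	struct model_zone zones[MODEL_NR_ZONES];
};

/* Pre-patch shape: with no lowmem zones this compares 0 > 0 and fails. */
bool watermark_ok_old(const struct model_node *node)
{
	unsigned long reserve = 0, free_pages = 0;

	for (int i = 0; i <= MODEL_ZONE_NORMAL; i++) {
		reserve += node->zones[i].min_wmark;
		free_pages += node->zones[i].free_pages;
	}
	return free_pages > reserve / 2;
}

/* Post-patch shape: skip unpopulated zones, never throttle on an empty reserve. */
bool watermark_ok_new(const struct model_node *node)
{
	unsigned long reserve = 0, free_pages = 0;

	for (int i = 0; i <= MODEL_ZONE_NORMAL; i++) {
		if (!node->zones[i].present_pages)
			continue;
		reserve += node->zones[i].min_wmark;
		free_pages += node->zones[i].free_pages;
	}
	if (!reserve)
		return true;	/* no reserves on this node: do not throttle */
	return free_pages > reserve / 2;
}
```

For a node where only the HIGHMEM slot is populated, watermark_ok_old()
returns false (so the caller would throttle) while watermark_ok_new()
returns true. For a node with a populated NORMAL zone the two agree.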
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2507,10 +2507,17 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >
> > for (i = 0; i <= ZONE_NORMAL; i++) {
> > zone = &pgdat->node_zones[i];
> > + if (!populated_zone(zone))
> > + continue;
>
> What's this? Performance tweak? Or does min_wmark_pages() return
> non-zero for an unpopulated zone, which seems odd.
>
Minor performance tweak. It's a force of habit to skip unpopulated zones
when doing a zone walk like this.
> > pfmemalloc_reserve += min_wmark_pages(zone);
> > free_pages += zone_page_state(zone, NR_FREE_PAGES);
> > }
> >
> > + /* If there are no reserves (unexpected config) then do not throttle */
> > + if (!pfmemalloc_reserve)
> > + return true;
> > +
> > wmark_ok = free_pages > pfmemalloc_reserve / 2;
> >
> > /* kswapd must be awake if processes are being throttled */
> > @@ -2535,9 +2542,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> > static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > nodemask_t *nodemask)
> > {
> > + struct zoneref *z;
> > struct zone *zone;
> > - int high_zoneidx = gfp_zone(gfp_mask);
> > - pg_data_t *pgdat;
> > + pg_data_t *pgdat = NULL;
> >
> > /*
> > * Kernel threads should not be throttled as they may be indirectly
> > @@ -2556,10 +2563,24 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > if (fatal_signal_pending(current))
> > goto out;
> >
> > - /* Check if the pfmemalloc reserves are ok */
> > - first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> > - pgdat = zone->zone_pgdat;
> > - if (pfmemalloc_watermark_ok(pgdat))
> > + /*
> > + * Check if the pfmemalloc reserves are ok by finding the first node
> > + * with a usable ZONE_NORMAL or lower zone
> > + */
>
> That comment tells us what the code does but not why it does it.
>
> - Why do we ignore zones > ZONE_NORMAL?
>
> - Why do we throttle when there may be as-yet-unexamined nodes which
> have reclaimable pages?
>
/*
* Check if the pfmemalloc reserves are ok by finding the first node
* with a usable ZONE_NORMAL or lower zone. The expectation is that
* GFP_KERNEL will be required for allocating network buffers when
* swapping over the network so ZONE_HIGHMEM is unusable.
*
* Throttling is based on the first usable node and throttled processes
* wait on a queue until kswapd makes progress and wakes them. There
* is an affinity then between processes waking up and where reclaim
* progress has been made assuming the process wakes on the same node.
* More importantly, processes running on remote nodes will not compete
* for remote pfmemalloc reserves and processes on different nodes
* should make reasonable progress.
*/
?
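As a rough user-space sketch of the node selection described in that
comment (again not kernel code; struct zl_entry and both function names
are hypothetical, and the zonelist is modelled as a flat array in
fallback order):

```c
#include <stddef.h>

/* Hypothetical zone indices and zonelist entry; not the kernel's types. */
enum { ZL_ZONE_DMA, ZL_ZONE_NORMAL, ZL_ZONE_HIGHMEM };

struct zl_entry {
	int node_id;	/* node that owns the zone */
	int zone_idx;	/* one of the ZL_ZONE_* values */
};

/* Pre-patch shape: take the node of the first zone the allocation may use. */
int throttle_node_old(const struct zl_entry *zl, size_t n, int high_zoneidx)
{
	for (size_t i = 0; i < n; i++) {
		if (zl[i].zone_idx <= high_zoneidx)
			return zl[i].node_id;
	}
	return -1;	/* empty or fully filtered zonelist */
}

/*
 * Post-patch shape: take the node of the first usable zone that is
 * ZONE_NORMAL or lower, or -1 if there is none (caller does not throttle).
 */
int throttle_node_new(const struct zl_entry *zl, size_t n, int high_zoneidx)
{
	for (size_t i = 0; i < n; i++) {
		if (zl[i].zone_idx > high_zoneidx)
			continue;	/* not usable by this allocation */
		if (zl[i].zone_idx > ZL_ZONE_NORMAL)
			continue;	/* highmem holds no pfmemalloc reserve */
		return zl[i].node_id;
	}
	return -1;
}
```

With a node 1 zonelist of (1, HIGHMEM), (0, HIGHMEM), (0, NORMAL),
(0, DMA) and a highmem-capable allocation, the old walk settles on node 1,
which has no reserves, while the new walk settles on node 0.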
>
> > + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > + gfp_mask, nodemask) {
>
> Those two lines have spaces-instead-of-tabs.
>
Sorry, that was careless.
--
Mel Gorman
SUSE Labs