From: Andrew Morton <akpm@linux-foundation.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Linux-MM <linux-mm@kvack.org>,
Linux-Netdev <netdev@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
David Miller <davem@davemloft.net>, Neil Brown <neilb@suse.de>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Mike Christie <michaelc@cs.wisc.edu>,
Eric B Munson <emunson@mgebm.net>
Subject: Re: [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
Date: Tue, 1 May 2012 15:24:37 -0700 [thread overview]
Message-ID: <20120501152437.194f0fc2.akpm@linux-foundation.org> (raw)
In-Reply-To: <1334578624-23257-16-git-send-email-mgorman@suse.de>
On Mon, 16 Apr 2012 13:17:02 +0100
Mel Gorman <mgorman@suse.de> wrote:
> If swap is backed by network storage such as NBD, there is a risk
> that a large number of reclaimers can hang the system by consuming
> all PF_MEMALLOC reserves. To avoid these hangs, the administrator
> must tune min_free_kbytes in advance which is a bit fragile.
>
> This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
> are in use. If the system is routinely getting throttled the system
> administrator can increase min_free_kbytes so degradation is smoother
> but the system will keep running.
>
>
> ...
>
> +static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> +{
> + struct zone *zone;
> + unsigned long pfmemalloc_reserve = 0;
> + unsigned long free_pages = 0;
> + int i;
> + bool wmark_ok;
> +
> + for (i = 0; i <= ZONE_NORMAL; i++) {
> + zone = &pgdat->node_zones[i];
> + pfmemalloc_reserve += min_wmark_pages(zone);
> + free_pages += zone_page_state(zone, NR_FREE_PAGES);
> + }
> +
> + wmark_ok = (free_pages > pfmemalloc_reserve / 2) ? true : false;
wmark_ok = free_pages > pfmemalloc_reserve / 2;
> +
> + /* kswapd must be awake if processes are being throttled */
> + if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> + pgdat->classzone_idx = min(pgdat->classzone_idx,
> + (enum zone_type)ZONE_NORMAL);
> + wake_up_interruptible(&pgdat->kswapd_wait);
> + }
> +
> + return wmark_ok;
> +}
> +
> +/*
> + * Throttle direct reclaimers if backing storage is backed by the network
> + * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> + * depleted. kswapd will continue to make progress and wake the processes
> + * when the low watermark is reached
> + */
> +static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> + nodemask_t *nodemask)
> +{
> + struct zone *zone;
> + int high_zoneidx = gfp_zone(gfp_mask);
> + pg_data_t *pgdat;
> +
> + /* Kernel threads such as kjournald should not be throttled */
The comment should explain "why", not "what". Particularly when the
"what" was bleedin obvious ;)
Also... why?
> + if (current->flags & PF_KTHREAD)
> + return;
> +
> + /* Check if the pfmemalloc reserves are ok */
> + first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> + pgdat = zone->zone_pgdat;
> + if (pfmemalloc_watermark_ok(pgdat))
> + return;
> +
> + /*
> + * If the caller cannot enter the filesystem, it's possible that it
> + * is processing a journal transaction. In this case, it is not safe
> + * to block on pfmemalloc_wait as kswapd could also be blocked waiting
> + * to start a transaction. Instead, throttle for up to a second before
> + * the reclaim must continue.
> + */
I suppose this applies to fs locks in general, not just to
journal_start()?
> + if (!(gfp_mask & __GFP_FS)) {
> + wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> + pfmemalloc_watermark_ok(pgdat), HZ);
> + return;
> + }
> +
> + /* Throttle until kswapd wakes the process */
> + wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> + pfmemalloc_watermark_ok(pgdat));
> +}
> +
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *nodemask)
> {
>
> ...
>
> @@ -2610,6 +2686,20 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> if (remaining)
> return true;
>
> + /*
> + * There is a potential race between when kswapd checks it watermarks
"its"
> + * and a process gets throttled. There is also a potential race if
> + * processes get throttled, kswapd wakes, a large process exits therby
> + * balancing the zones that causes kswapd to miss a wakeup. If kswapd
> + * is going to sleep, no process should be sleeping on pfmemalloc_wait
> + * so wake them now if necessary. If necessary, processes will wake
> + * kswapd and get throttled again
> + */
Yes, the possibility for missed wakeups here worried me. There's no
synchronization and it would be easy to leave holes.
It's good that there is no timeout on the throttling - a timeout would
cover up rare races most nastily.
> + if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> + wake_up(&pgdat->pfmemalloc_wait);
> + return true;
> + }
A bool-returning function called "sleeping_prematurely" should have no
side-effects. But it now performs wakeups. Wanna see if there is a
way of making this nicer?
> /* Check the watermark levels */
> for (i = 0; i <= classzone_idx; i++) {
> struct zone *zone = pgdat->node_zones + i;
> @@ -2871,6 +2961,12 @@ loop_again:
> }
>
> }
> +
> + /* Wake throttled direct reclaimers if low watermark is met */
s/"what"/"why"/ !
> + if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
> + pfmemalloc_watermark_ok(pgdat))
> + wake_up(&pgdat->pfmemalloc_wait);
> +
> if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
> break; /* kswapd: all done */
> /*
>
> ...
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Andrew Morton <akpm@linux-foundation.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Linux-MM <linux-mm@kvack.org>,
Linux-Netdev <netdev@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
David Miller <davem@davemloft.net>, Neil Brown <neilb@suse.de>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Mike Christie <michaelc@cs.wisc.edu>,
Eric B Munson <emunson@mgebm.net>
Subject: Re: [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
Date: Tue, 1 May 2012 15:24:37 -0700 [thread overview]
Message-ID: <20120501152437.194f0fc2.akpm@linux-foundation.org> (raw)
In-Reply-To: <1334578624-23257-16-git-send-email-mgorman@suse.de>
On Mon, 16 Apr 2012 13:17:02 +0100
Mel Gorman <mgorman@suse.de> wrote:
> If swap is backed by network storage such as NBD, there is a risk
> that a large number of reclaimers can hang the system by consuming
> all PF_MEMALLOC reserves. To avoid these hangs, the administrator
> must tune min_free_kbytes in advance which is a bit fragile.
>
> This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
> are in use. If the system is routinely getting throttled the system
> administrator can increase min_free_kbytes so degradation is smoother
> but the system will keep running.
>
>
> ...
>
> +static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> +{
> + struct zone *zone;
> + unsigned long pfmemalloc_reserve = 0;
> + unsigned long free_pages = 0;
> + int i;
> + bool wmark_ok;
> +
> + for (i = 0; i <= ZONE_NORMAL; i++) {
> + zone = &pgdat->node_zones[i];
> + pfmemalloc_reserve += min_wmark_pages(zone);
> + free_pages += zone_page_state(zone, NR_FREE_PAGES);
> + }
> +
> + wmark_ok = (free_pages > pfmemalloc_reserve / 2) ? true : false;
wmark_ok = free_pages > pfmemalloc_reserve / 2;
> +
> + /* kswapd must be awake if processes are being throttled */
> + if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> + pgdat->classzone_idx = min(pgdat->classzone_idx,
> + (enum zone_type)ZONE_NORMAL);
> + wake_up_interruptible(&pgdat->kswapd_wait);
> + }
> +
> + return wmark_ok;
> +}
> +
> +/*
> + * Throttle direct reclaimers if backing storage is backed by the network
> + * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> + * depleted. kswapd will continue to make progress and wake the processes
> + * when the low watermark is reached
> + */
> +static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> + nodemask_t *nodemask)
> +{
> + struct zone *zone;
> + int high_zoneidx = gfp_zone(gfp_mask);
> + pg_data_t *pgdat;
> +
> + /* Kernel threads such as kjournald should not be throttled */
The comment should explain "why", not "what". Particularly when the
"what" was bleedin obvious ;)
Also... why?
> + if (current->flags & PF_KTHREAD)
> + return;
> +
> + /* Check if the pfmemalloc reserves are ok */
> + first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> + pgdat = zone->zone_pgdat;
> + if (pfmemalloc_watermark_ok(pgdat))
> + return;
> +
> + /*
> + * If the caller cannot enter the filesystem, it's possible that it
> + * is processing a journal transaction. In this case, it is not safe
> + * to block on pfmemalloc_wait as kswapd could also be blocked waiting
> + * to start a transaction. Instead, throttle for up to a second before
> + * the reclaim must continue.
> + */
I suppose this applies to fs locks in general, not just to
journal_start()?
> + if (!(gfp_mask & __GFP_FS)) {
> + wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> + pfmemalloc_watermark_ok(pgdat), HZ);
> + return;
> + }
> +
> + /* Throttle until kswapd wakes the process */
> + wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> + pfmemalloc_watermark_ok(pgdat));
> +}
> +
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *nodemask)
> {
>
> ...
>
> @@ -2610,6 +2686,20 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> if (remaining)
> return true;
>
> + /*
> + * There is a potential race between when kswapd checks it watermarks
"its"
> + * and a process gets throttled. There is also a potential race if
> + * processes get throttled, kswapd wakes, a large process exits therby
> + * balancing the zones that causes kswapd to miss a wakeup. If kswapd
> + * is going to sleep, no process should be sleeping on pfmemalloc_wait
> + * so wake them now if necessary. If necessary, processes will wake
> + * kswapd and get throttled again
> + */
Yes, the possibility for missed wakeups here worried me. There's no
synchronization and it would be easy to leave holes.
It's good that there is no timeout on the throttling - a timeout would
cover up rare races most nastily.
> + if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> + wake_up(&pgdat->pfmemalloc_wait);
> + return true;
> + }
A bool-returning function called "sleeping_prematurely" should have no
side-effects. But it now performs wakeups. Wanna see if there is a
way of making this nicer?
> /* Check the watermark levels */
> for (i = 0; i <= classzone_idx; i++) {
> struct zone *zone = pgdat->node_zones + i;
> @@ -2871,6 +2961,12 @@ loop_again:
> }
>
> }
> +
> + /* Wake throttled direct reclaimers if low watermark is met */
s/"what"/"why"/ !
> + if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
> + pfmemalloc_watermark_ok(pgdat))
> + wake_up(&pgdat->pfmemalloc_wait);
> +
> if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
> break; /* kswapd: all done */
> /*
>
> ...
>
next prev parent reply other threads:[~2012-05-01 22:24 UTC|newest]
Thread overview: 63+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-04-16 12:16 [PATCH 00/16] Swap-over-NBD without deadlocking V9 Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 01/16] mm: Serialize access to min_free_kbytes Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-23 23:50 ` David Rientjes
2012-04-23 23:50 ` David Rientjes
2012-04-16 12:16 ` [PATCH 02/16] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-23 23:51 ` David Rientjes
2012-04-23 23:51 ` David Rientjes
2012-04-25 15:05 ` Mel Gorman
2012-04-25 15:05 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 03/16] mm: slub: Optimise the SLUB fast path to avoid pfmemalloc checks Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 04/16] mm: Introduce __GFP_MEMALLOC to allow access to emergency reserves Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 05/16] mm: allow PF_MEMALLOC from softirq context Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-05-01 22:08 ` Andrew Morton
2012-05-01 22:08 ` Andrew Morton
2012-05-02 16:24 ` Mel Gorman
2012-05-02 16:24 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 06/16] mm: Ignore mempolicies when using ALLOC_NO_WATERMARK Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 07/16] net: Introduce sk_allocation() to allow addition of GFP flags depending on the individual socket Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 08/16] netvm: Allow the use of __GFP_MEMALLOC by specific sockets Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 09/16] netvm: Allow skb allocation to use PFMEMALLOC reserves Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 10/16] netvm: Propagate page->pfmemalloc to skb Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 11/16] netvm: Propagate page->pfmemalloc from netdev_alloc_page " Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:16 ` [PATCH 12/16] netvm: Set PF_MEMALLOC as appropriate during SKB processing Mel Gorman
2012-04-16 12:16 ` Mel Gorman
2012-04-16 12:17 ` [PATCH 13/16] mm: Micro-optimise slab to avoid a function call Mel Gorman
2012-04-16 12:17 ` Mel Gorman
2012-04-16 12:17 ` [PATCH 14/16] nbd: Set SOCK_MEMALLOC for access to PFMEMALLOC reserves Mel Gorman
2012-04-16 12:17 ` Mel Gorman
2012-04-16 12:17 ` [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage Mel Gorman
2012-04-16 12:17 ` Mel Gorman
2012-05-01 22:24 ` Andrew Morton [this message]
2012-05-01 22:24 ` Andrew Morton
2012-05-02 16:24 ` Mel Gorman
2012-05-02 16:24 ` Mel Gorman
2012-04-16 12:17 ` [PATCH 16/16] mm: Account for the number of times direct reclaimers get throttled Mel Gorman
2012-04-16 12:17 ` Mel Gorman
2012-04-21 18:15 ` [PATCH 00/16] Swap-over-NBD without deadlocking V9 Eric B Munson
2012-05-01 22:28 ` Andrew Morton
2012-05-01 22:28 ` Andrew Morton
2012-05-03 15:00 ` Mel Gorman
2012-05-03 15:00 ` Mel Gorman
2012-05-03 17:06 ` David Miller
2012-05-03 17:06 ` David Miller
2012-05-04 10:16 ` Mel Gorman
2012-05-04 10:16 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2012-06-22 14:30 [PATCH 00/17] Swap-over-NBD without deadlocking V13 Mel Gorman
2012-06-22 14:30 ` [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage Mel Gorman
2012-06-22 14:30 ` Mel Gorman
2012-06-29 13:32 [PATCH 00/16] Swap-over-NBD without deadlocking V14 Mel Gorman
2012-06-29 13:32 ` [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage Mel Gorman
2012-06-29 13:32 ` Mel Gorman
2012-07-12 6:40 [PATCH 00/16] Swap-over-NBD without deadlocking V15 Mel Gorman
2012-07-12 6:40 ` [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage Mel Gorman
2012-07-12 6:40 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120501152437.194f0fc2.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=a.p.zijlstra@chello.nl \
--cc=davem@davemloft.net \
--cc=emunson@mgebm.net \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=michaelc@cs.wisc.edu \
--cc=neilb@suse.de \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.