Re: [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@techsingularity.net>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org, Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	David Rientjes <rientjes@google.com>,
	Michal Hocko <mhocko@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@redhat.com>
Subject: Re: [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely
Date: Fri, 28 Jul 2017 11:16:22 +0100	[thread overview]
Message-ID: <20170728101622.uoirf7ryvtoddq7b@techsingularity.net> (raw)
In-Reply-To: <20170727160701.9245-3-vbabka@suse.cz>

On Thu, Jul 27, 2017 at 06:06:57PM +0200, Vlastimil Babka wrote:
> This patch deals with a corner case found when testing kcompactd with a very
> simple testcase that first fragments memory (by creating a large shmem file and
> then punching hole in every even page) and then uses artificial order-9
> GFP_NOWAIT allocations in a loop. This is freshly after virtme-run boot in KVM
> and no other activity.
> 
> What happens is that kswapd always reclaims too little to get over
> compact_gap() in kswapd_shrink_node(), so it doesn't set sc->order to 0, thus
> "goto kswapd_try_sleep" in kswapd() doesn't happen. In the next iteration of
> kswapd() loop, alloc_order and reclaim_order is read again from
> pgdat->kswapd_order, which the previous iteration has reset to 0 and there was
> no other kswapd wakeup meanwhile (the workload inserts short sleeps between
> allocations). With the working order 0, node appears balanced and
> wakeup_kcompactd() does nothing.
> 

The risk with a change like this is that there is an introduction of
kswapd-stuck-at-100%-cpu reclaiming for high order pages. Consider for
example this part

> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
> +/*
> + * Return true if kswapd fully slept because pgdat was balanced and there was
> + * no premature wakeup.
> + */
> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
>  				unsigned int classzone_idx)
>  {
>  	long remaining = 0;
>  	DEFINE_WAIT(wait);
> +	bool ret = false;
>  
>  	if (freezing(current) || kthread_should_stop())
> -		return;
> +		return false;
>  
>  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  

...

> @@ -3493,23 +3491,32 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>  
> -	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
>  	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
>  		bool ret;
>  
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> +		alloc_order = reclaim_order = max(alloc_order, pgdat->kswapd_order);
>  		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
>  
>  kswapd_try_sleep:
> -		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> -					classzone_idx);
> -
> -		/* Read the new order and classzone_idx */
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> -		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		if (kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> +							classzone_idx)) {
> +
> +			/* Read the new order and classzone_idx */
> +			alloc_order = reclaim_order = pgdat->kswapd_order;
> +			classzone_idx = kswapd_classzone_idx(pgdat, 0);
> +			pgdat->kswapd_order = 0;
> +			pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		} else {
> +			/*
> +			 * We failed to sleep, so continue on the current order
> +			 * and classzone_idx, unless they increased.
> +			 */
> +			alloc_order = max(alloc_order, pgdat->kswapd_order);
> +			reclaim_order = max(reclaim_order, pgdat->kswapd_order) ;
> +			classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +		}
>  
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())

kswapd_try_to_sleep returns true only if it fully slept. Now, consider
a case where kswapd is woken for order-9, fails and there are streaming
allocators that are keeping kswapd awake between the low/high watermark.
Even though all subsequent wakeups are for potentially for order-0, the
false branch above keeps kswapd at order-9.

You should be very wary of keeping kswapd awake for high-order allocations
and somehow defer to either kcompactd or push it into direct reclaim.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2017-07-28 10:16 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
2017-07-27 16:06 ` [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep() Vlastimil Babka
2017-07-28  9:38   ` Mel Gorman
2017-07-27 16:06 ` [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely Vlastimil Babka
2017-07-28 10:16   ` Mel Gorman [this message]
2017-07-27 16:06 ` [PATCH 3/6] mm, kswapd: reset kswapd's order to 0 when it fails to reclaim enough Vlastimil Babka
2017-07-27 16:06 ` [PATCH 4/6] mm, kswapd: wake up kcompactd when kswapd had too many failures Vlastimil Babka
2017-07-28 10:41   ` Mel Gorman
2017-07-27 16:07 ` [RFC PATCH 5/6] mm, compaction: stop when number of free pages goes below watermark Vlastimil Babka
2017-07-28 10:43   ` Mel Gorman
2017-07-27 16:07 ` [RFC PATCH 6/6] mm: make kcompactd more proactive Vlastimil Babka
2017-07-28 10:58   ` Mel Gorman
2017-08-09 20:58 ` [RFC PATCH 0/6] proactive kcompactd David Rientjes
2017-08-21 14:10   ` Johannes Weiner
2017-08-21 21:40     ` Rik van Riel
2017-08-22 20:57     ` David Rientjes
2017-08-23  5:36     ` Joonsoo Kim
2017-08-23  8:12       ` Vlastimil Babka
2017-08-24  6:24         ` Joonsoo Kim
2017-08-24 11:30           ` Vlastimil Babka
2017-08-24 23:51             ` Joonsoo Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170728101622.uoirf7ryvtoddq7b@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=aarcange@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.