Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand


From: Mel Gorman <mgorman@techsingularity.net>
To: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.com>, Linux-MM <linux-mm@kvack.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rik van Riel <riel@redhat.com>, Vlastimil Babka <vbabka@suse.cz>,
	Pintu Kumar <pintu.k@samsung.com>,
	Xishi Qiu <qiuxishi@huawei.com>, Gioh Kim <gioh.kim@lge.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
Date: Fri, 31 Jul 2015 08:11:13 +0100
Message-ID: <20150731071113.GA5840@techsingularity.net>
In-Reply-To: <20150731055407.GA15912@js1304-P5Q-DELUXE>

On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
> Hello, Mel.
> 
> On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> > From: Mel Gorman <mgorman@suse.de>
> > 
> > High-order watermark checking exists for two reasons --  kswapd high-order
> > awareness and protection for high-order atomic requests. Historically we
> > depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> > pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> > that reserves pageblocks for high-order atomic allocations. This is expected
> > to be more reliable than MIGRATE_RESERVE was.
> 
> I have some concerns on this patch.
> 
> 1) This patch breaks intention of __GFP_WAIT.
> __GFP_WAIT is used when we want to succeed allocation even if we need
> to do some reclaim/compaction work. That implies importance of
> allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
> atomic allocation (~__GFP_WAIT) more successful than allocation with
> __GFP_WAIT in many situation. It breaks basic assumption of gfp flags
> and doesn't make any sense.
> 

Currently, allocation requests that do not specify __GFP_WAIT get the
ALLOC_HARDER flag, which allows them to dip further into watermark reserves.
It is already the case that there are corner cases where a high-order atomic
allocation can succeed when a non-atomic allocation would reclaim.
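
As a rough illustration of that dip, the sketch below models how the
effective minimum watermark shrinks with the allocation flags. It is a
simplified userspace model, not the kernel's watermark check: the flag
values, the helper names effective_min() and watermark_ok(), and the sample
numbers are assumptions for illustration, and the half and quarter
reductions should be verified against mm/page_alloc.c for the kernel in
question.

```c
#include <stdbool.h>

/* Illustrative flag values; the real definitions live in mm/internal.h. */
#define ALLOC_HARDER 0x10 /* set for requests that cannot sleep */
#define ALLOC_HIGH   0x20 /* set for __GFP_HIGH requests */

/* Return the watermark, in pages, that a request must stay above. */
static long effective_min(long mark, unsigned int alloc_flags)
{
	long min = mark;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;   /* may use half of the reserve */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;   /* may dip a further quarter */
	return min;
}

/* True if the request can proceed without reclaim in this model. */
static bool watermark_ok(long free_pages, long mark, unsigned int alloc_flags)
{
	return free_pages > effective_min(mark, alloc_flags);
}
```

With a watermark of 1000 pages and 800 pages free, a request without
ALLOC_HARDER fails the check and would go on to reclaim, while an
ALLOC_HARDER request passes because its effective minimum is 750.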

> 2) Who care about success of high-order atomic allocation with this
> reliability?

Historically, network configurations with large MTUs that could not use
scatter-gather. These days the network stack will also attempt atomic
order-3 allocations to reduce overhead. SLUB also attempts atomic high-order
allocations to reduce overhead. It's why MIGRATE_RESERVE exists at all, so
the intent of the patch is to preserve what MIGRATE_RESERVE was for but do
it better.

> In case of allocation without __GFP_WAIT, requestor prepare sufficient
> fallback method. They just want to success if it is easily successful.
> They don't want to succeed allocation with paying great cost that slow
> down general workload by this patch that can be accidentally reserve
> too much memory.
> 

Not necessarily true. In the historical case, the network request was atomic
because it was from IRQ context and could not sleep.

> > A MIGRATE_HIGHATOMIC pageblock is created when an allocation request steals
> > a pageblock but limits the total number to 10% of the zone.
> 
> When steals happens, pageblock already can be fragmented and we can't
> fully utilize this pageblock without allowing order-0 allocation. This
> is very waste.
> 

If the pageblock was stolen, it implies there was at least 1 usable page
of the correct order. As the pageblock is then reserved, any pages that
are freed in that block stay free for use by high-order atomic allocations.
Otherwise, the number of reserved pageblocks will increase again until the
10% limit is hit.
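
The cap can be pictured with the following minimal sketch. It is not the
code from the patch: struct zone_model, try_reserve_pageblock() and the
512-page pageblock size are hypothetical names and values chosen for
illustration, and the only behaviour taken from the changelog is that
reserved pageblocks are limited to 10% of the zone.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a zone for this sketch. */
struct zone_model {
	unsigned long managed_pages;     /* pages managed by the zone */
	unsigned long nr_reserved_pages; /* pages in reserved pageblocks */
};

#define PAGEBLOCK_NR_PAGES 512UL /* assumes 2M pageblocks of 4K pages */

/* Reserve one more pageblock unless that would exceed 10% of the zone. */
static bool try_reserve_pageblock(struct zone_model *zone)
{
	unsigned long limit = zone->managed_pages / 10;

	if (zone->nr_reserved_pages + PAGEBLOCK_NR_PAGES > limit)
		return false;

	zone->nr_reserved_pages += PAGEBLOCK_NR_PAGES;
	return true;
}
```

In this model a zone with 10240 managed pages accepts two reservations and
refuses the third, because a third pageblock would push the reserve past
1024 pages.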

> > The pageblocks are unreserved if an allocation fails after a direct
> > reclaim attempt.
> > 
> > The watermark checks account for the reserved pageblocks when the allocation
> > request is not a high-order atomic allocation.
> > 
> > The stutter benchmark was used to evaluate this but while it was running
> > there was a systemtap script that randomly allocated between 1 and 1G worth
> > of order-3 pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload
> > on a single-node machine there were 339574 allocation failures. With this
> > patch applied there were 28798 failures -- a 92% reduction. On a 4-node
> > machine, allocation failures went from 76917 to 0 failures.
> 
> There is some missing information to justify benchmark result.
> Especially, I'd like to know:
> 
> 1) Detailed system setup (CPU, MEMORY, etc...)

CPUs were 8 core Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with 8G of RAM.

> 2) Total number of attempt of GFP_ATOMIC allocation request
> 

Each attempt allocated a random amount between 1 and 1G, as already described.
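
For a sense of scale, the upper bound can be converted into a count of
order-3 requests. The sketch below assumes 4K base pages, and both
PAGE_SIZE_BYTES and order3_allocs_for() are names made up for this example.

```c
#define PAGE_SIZE_BYTES 4096UL /* assumption: 4K base pages */

/* Number of order-3 allocations needed to cover the given byte count. */
static unsigned long order3_allocs_for(unsigned long bytes)
{
	unsigned long block = PAGE_SIZE_BYTES << 3; /* 8 pages = 32K */

	return bytes / block;
}
```

Under that assumption a 1G burst corresponds to 32768 back-to-back order-3
GFP_ATOMIC requests.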

> I don't know how you modify stutter benchmark in mmtests but it
> looks like there is no delay when continually requesting GFP_ATOMIC
> allocation.
> 1G of order-3 allocation request without delay seems insane
> to me. Could you tell me how you modify that benchmark for this patch?
> 

The stutter benchmark was not modified. The watch-stress-highorder-atomic
monitor was run in parallel and that's what is doing the allocation. It's
true that up to 1G of order-3 allocations without delay would be insane
in a normal situation. The point was to show an extreme case where atomic
allocations were used and to test whether the reserves held up or not.


> > There are minor theoretical side-effects. If the system is intensively
> > making large numbers of long-lived high-order atomic allocations then
> > there will be a lot of reserved pageblocks. This may push some workloads
> > into reclaim until the number of reserved pageblocks is reduced again. This
> > problem was not observed in reclaim intensive workloads but such workloads
> > are also not atomic high-order intensive.
> 
> I don't think this is theoretical side-effects. It can happen easily.
> Recently, network subsystem makes some of their high-order allocation
> request ~__GFP_WAIT (fb05e7a89f50: net: don't wait for order-3 page
> allocation). And, I've submitted similar patch for slub today
> (mm/slub: don't wait for high-order page allocation). That
> makes system atomic high-order allocation request more and this side-effect
> can be possible in many situation.
> 

The key is long-lived allocations. The network subsystem frees theirs. I
was not able to trigger a situation in a variety of workloads where these
happened, which is why I classified it as theoretical.

-- 
Mel Gorman
SUSE Labs
