* [PATCH V2 0/3] mm: page_alloc: fixes for early oom kills
@ 2023-11-05 12:50 Charan Teja Kalla
2023-11-05 12:50 ` [PATCH V2 1/3] mm: page_alloc: unreserve highatomic page blocks before oom Charan Teja Kalla
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-05 12:50 UTC (permalink / raw)
To: akpm, mgorman, mhocko, david, vbabka, hannes, quic_pkondeti
Cc: linux-mm, linux-kernel, Charan Teja Kalla
An early OOM kill happened on a system without unreserving the highatomic
page blocks and without draining the pcp lists for an allocation request.
The state of the system where this issue was exposed is shown in the oom
kill logs:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
From the above log:
a) OOM occurred without unreserving the high atomic page blocks.
b) ~16MB of memory is reserved for high atomic reserves, against the
expectation of 1% reserves.
These two issues are addressed in the 1st and 2nd patches.
Another excerpt from the oom kill message:
Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
reserved_highatomic:0KB managed:49152kB free_pcp:460kB
In the above, the pcp lists too aren't drained before entering the
oom kill path. This is addressed in the 3rd patch.
Changes in V1:
o Unreserving the high atomic page blocks was attempted from
the oom kill path rather than in should_reclaim_retry().
o Discussed why a lot more than 1% of managed memory is reserved
for high atomic reserves.
o https://lore.kernel.org/linux-mm/1698669590-3193-1-git-send-email-quic_charante@quicinc.com/
Charan Teja Kalla (3):
mm: page_alloc: unreserve highatomic page blocks before oom
mm: page_alloc: correct high atomic reserve calculations
mm: page_alloc: drain pcp lists before oom kill
mm/page_alloc.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH V2 1/3] mm: page_alloc: unreserve highatomic page blocks before oom
2023-11-05 12:50 [PATCH V2 0/3] mm: page_alloc: fixes for early oom kills Charan Teja Kalla
@ 2023-11-05 12:50 ` Charan Teja Kalla
2023-11-09 10:29 ` Michal Hocko
2023-11-05 12:50 ` [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations Charan Teja Kalla
2023-11-05 12:50 ` [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill Charan Teja Kalla
2 siblings, 1 reply; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-05 12:50 UTC (permalink / raw)
To: akpm, mgorman, mhocko, david, vbabka, hannes, quic_pkondeti
Cc: linux-mm, linux-kernel, Charan Teja Kalla
__alloc_pages_direct_reclaim() is called from the slowpath allocation,
where high atomic reserves can be unreserved after there is progress in
reclaim and yet no suitable page is found. Later, should_reclaim_retry()
gets called from the slowpath allocation to decide if reclaim needs to
be retried before the OOM kill path is taken.
should_reclaim_retry() checks the available (reclaimable + free) memory
against the min wmark levels of a zone and returns:
a) true, if it is above the min wmark, so that the slowpath allocation
will do the reclaim retries.
b) false, thus the slowpath allocation takes the oom kill path.
should_reclaim_retry() can also unreserve the high atomic reserves,
**but only after all the reclaim retries are exhausted.**
In a case where there is almost no reclaimable memory and the free pages
consist mostly of high atomic reserves, but the allocation context can't
use those reserves, the available memory stays below the min wmark
levels, hence false is returned from should_reclaim_retry() and the
allocation request takes the OOM kill path. This can turn into an early
oom kill if the high atomic reserves are holding a lot of free memory
and unreserving them is never attempted.
(early)OOM is encountered on a VM with the below state:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
Per the above log, ~7MB of free memory exists in the high atomic
reserves but is not freed up before falling back to the oom kill path.
Fix it by trying to unreserve the high atomic reserves in
should_reclaim_retry() before __alloc_pages_direct_reclaim() can
fallback to oom kill path.
Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
mm/page_alloc.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f3..e07a38f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3809,14 +3809,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
else
(*no_progress_loops)++;
- /*
- * Make sure we converge to OOM if we cannot make any progress
- * several times in the row.
- */
- if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
- /* Before OOM, exhaust highatomic_reserve */
- return unreserve_highatomic_pageblock(ac, true);
- }
+ if (*no_progress_loops > MAX_RECLAIM_RETRIES)
+ goto out;
+
/*
* Keep reclaiming pages while there is a chance this will lead
@@ -3859,6 +3854,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
schedule_timeout_uninterruptible(1);
else
cond_resched();
+out:
+ /* Before OOM, exhaust highatomic_reserve */
+ if (!ret)
+ return unreserve_highatomic_pageblock(ac, true);
+
return ret;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations
2023-11-05 12:50 [PATCH V2 0/3] mm: page_alloc: fixes for early oom kills Charan Teja Kalla
2023-11-05 12:50 ` [PATCH V2 1/3] mm: page_alloc: unreserve highatomic page blocks before oom Charan Teja Kalla
@ 2023-11-05 12:50 ` Charan Teja Kalla
2023-11-16 9:59 ` Mel Gorman
2023-11-05 12:50 ` [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill Charan Teja Kalla
2 siblings, 1 reply; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-05 12:50 UTC (permalink / raw)
To: akpm, mgorman, mhocko, david, vbabka, hannes, quic_pkondeti
Cc: linux-mm, linux-kernel, Charan Teja Kalla
reserve_highatomic_pageblock() aims to reserve 1% of the managed
pages of a zone, which is used for high-order atomic allocations.
It uses the below calculation to reserve:
static void reserve_highatomic_pageblock(struct page *page, ....) {
.......
max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
if (zone->nr_reserved_highatomic >= max_managed)
goto out;
zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
out:
....
}
Since we always append 1% of the zone managed pages count to
pageblock_nr_pages, and nr_reserved_highatomic is incremented/decremented
in pageblock-sized units, the minimum reservation turns into 2 pageblocks.
A system (actually a VM running on the Linux kernel) was encountered
with the below zone configuration:
Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
reserved_highatomic:8192KB managed:49224kB
The existing calculation makes it reserve 8MB (with a pageblock
size of 4MB), i.e. 16% of the zone managed memory. Reserving such a high
amount of memory can easily exert memory pressure on the system and thus
may lead to unnecessary reclaims until the high atomic reserves are
unreserved.
Since high atomic reserves are managed in pageblock-size granules, as
MIGRATE_HIGHATOMIC is set for such a pageblock, fix the calculation so
that the minimum is one pageblock and the maximum is approximately 1% of
the zone managed pages.
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
mm/page_alloc.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e07a38f..b91c99e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1883,10 +1883,11 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
unsigned long max_managed, flags;
/*
- * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
+ * The number reserved as: minimum is 1 pageblock, maximum is
+ * roughly 1% of a zone.
* Check is race-prone but harmless.
*/
- max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
+ max_managed = ALIGN((zone_managed_pages(zone) / 100), pageblock_nr_pages);
if (zone->nr_reserved_highatomic >= max_managed)
return;
--
2.7.4
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-05 12:50 [PATCH V2 0/3] mm: page_alloc: fixes for early oom kills Charan Teja Kalla
2023-11-05 12:50 ` [PATCH V2 1/3] mm: page_alloc: unreserve highatomic page blocks before oom Charan Teja Kalla
2023-11-05 12:50 ` [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations Charan Teja Kalla
@ 2023-11-05 12:50 ` Charan Teja Kalla
2023-11-05 12:55 ` Charan Teja Kalla
2023-11-09 10:33 ` Michal Hocko
2 siblings, 2 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-05 12:50 UTC (permalink / raw)
To: akpm, mgorman, mhocko, david, vbabka, hannes, quic_pkondeti
Cc: linux-mm, linux-kernel, Charan Teja Kalla
pcp lists are drained from __alloc_pages_direct_reclaim() only if some
progress is made in the reclaim attempt.
struct page *__alloc_pages_direct_reclaim() {
.....
*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
if (unlikely(!(*did_some_progress)))
goto out;
retry:
page = get_page_from_freelist();
if (!page && !drained) {
drain_all_pages(NULL);
drained = true;
goto retry;
}
out:
}
After the above, the allocation attempt can fall back to
should_reclaim_retry() to decide on reclaim retries. If it too returns
false, the allocation request will simply fall back to the oom kill path
without even attempting to drain the pcp pages, which might help the
allocation attempt to succeed.
A VM running with ~50MB of memory showed the below stats during an OOM
kill:
Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
reserved_highatomic:0KB managed:49152kB free_pcp:460kB
Though in such a system state the OOM kill is imminent, the current kill
could have been delayed if the pcp lists were drained, as pcp + free is
even above the high watermark.
Fix this missing drain of the pcp lists in should_reclaim_retry(), along
with unreserving the high atomic page blocks, like it is done in
__alloc_pages_direct_reclaim().
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
mm/page_alloc.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b91c99e..8eee292 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3857,8 +3857,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
cond_resched();
out:
/* Before OOM, exhaust highatomic_reserve */
- if (!ret)
- return unreserve_highatomic_pageblock(ac, true);
+ if (!ret) {
+ ret = unreserve_highatomic_pageblock(ac, true);
+ drain_all_pages(NULL);
+ }
return ret;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-05 12:50 ` [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill Charan Teja Kalla
@ 2023-11-05 12:55 ` Charan Teja Kalla
2023-11-09 10:33 ` Michal Hocko
1 sibling, 0 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-05 12:55 UTC (permalink / raw)
To: akpm, mgorman, mhocko, david, vbabka, hannes, quic_pkondeti
Cc: linux-mm, linux-kernel
Sorry, this is supposed to be PATCH V2 in place of V3. Not sure if I
have to resend it as V2 again.
On 11/5/2023 6:20 PM, Charan Teja Kalla wrote:
> pcp lists are drained from __alloc_pages_direct_reclaim() only if some
> progress is made in the reclaim attempt.
>
> struct page *__alloc_pages_direct_reclaim() {
> .....
> *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
> if (unlikely(!(*did_some_progress)))
> goto out;
> retry:
> page = get_page_from_freelist();
> if (!page && !drained) {
> drain_all_pages(NULL);
> drained = true;
> goto retry;
> }
> out:
> }
>
> After the above, the allocation attempt can fall back to
> should_reclaim_retry() to decide on reclaim retries. If it too returns
> false, the allocation request will simply fall back to the oom kill path
> without even attempting to drain the pcp pages, which might help the
> allocation attempt to succeed.
>
> A VM running with ~50MB of memory showed the below stats during an OOM
> kill:
> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
>
> Though in such a system state the OOM kill is imminent, the current kill
> could have been delayed if the pcp lists were drained, as pcp + free is
> even above the high watermark.
>
> Fix this missing drain of the pcp lists in should_reclaim_retry(), along
> with unreserving the high atomic page blocks, like it is done in
> __alloc_pages_direct_reclaim().
>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
> mm/page_alloc.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b91c99e..8eee292 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3857,8 +3857,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> cond_resched();
> out:
> /* Before OOM, exhaust highatomic_reserve */
> - if (!ret)
> - return unreserve_highatomic_pageblock(ac, true);
> + if (!ret) {
> + ret = unreserve_highatomic_pageblock(ac, true);
> + drain_all_pages(NULL);
> + }
>
> return ret;
> }
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V2 1/3] mm: page_alloc: unreserve highatomic page blocks before oom
2023-11-05 12:50 ` [PATCH V2 1/3] mm: page_alloc: unreserve highatomic page blocks before oom Charan Teja Kalla
@ 2023-11-09 10:29 ` Michal Hocko
0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2023-11-09 10:29 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
On Sun 05-11-23 18:20:48, Charan Teja Kalla wrote:
> __alloc_pages_direct_reclaim() is called from the slowpath allocation,
> where high atomic reserves can be unreserved after there is progress in
> reclaim and yet no suitable page is found. Later, should_reclaim_retry()
> gets called from the slowpath allocation to decide if reclaim needs to
> be retried before the OOM kill path is taken.
>
> should_reclaim_retry() checks the available (reclaimable + free) memory
> against the min wmark levels of a zone and returns:
> a) true, if it is above the min wmark, so that the slowpath allocation
> will do the reclaim retries.
> b) false, thus the slowpath allocation takes the oom kill path.
>
> should_reclaim_retry() can also unreserve the high atomic reserves,
> **but only after all the reclaim retries are exhausted.**
>
> In a case where there is almost no reclaimable memory and the free pages
> consist mostly of high atomic reserves, but the allocation context can't
> use those reserves, the available memory stays below the min wmark
> levels, hence false is returned from should_reclaim_retry() and the
> allocation request takes the OOM kill path. This can turn into an early
> oom kill if the high atomic reserves are holding a lot of free memory
> and unreserving them is never attempted.
>
> (early)OOM is encountered on a VM with the below state:
> [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
> high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
> active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
> present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
> local_pcp:492kB free_cma:0kB
> [ 295.998656] lowmem_reserve[]: 0 32
> [ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
> 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 7752kB
>
> Per the above log, ~7MB of free memory exists in the high atomic
> reserves but is not freed up before falling back to the oom kill path.
>
> Fix it by trying to unreserve the high atomic reserves in
> should_reclaim_retry() before __alloc_pages_direct_reclaim() can
> fallback to oom kill path.
>
> Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
> Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/page_alloc.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 95546f3..e07a38f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3809,14 +3809,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> else
> (*no_progress_loops)++;
>
> - /*
> - * Make sure we converge to OOM if we cannot make any progress
> - * several times in the row.
> - */
> - if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> - /* Before OOM, exhaust highatomic_reserve */
> - return unreserve_highatomic_pageblock(ac, true);
> - }
> + if (*no_progress_loops > MAX_RECLAIM_RETRIES)
> + goto out;
> +
>
> /*
> * Keep reclaiming pages while there is a chance this will lead
> @@ -3859,6 +3854,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> schedule_timeout_uninterruptible(1);
> else
> cond_resched();
> +out:
> + /* Before OOM, exhaust highatomic_reserve */
> + if (!ret)
> + return unreserve_highatomic_pageblock(ac, true);
> +
> return ret;
> }
>
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-05 12:50 ` [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill Charan Teja Kalla
2023-11-05 12:55 ` Charan Teja Kalla
@ 2023-11-09 10:33 ` Michal Hocko
2023-11-10 16:36 ` Charan Teja Kalla
1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-09 10:33 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
On Sun 05-11-23 18:20:50, Charan Teja Kalla wrote:
> pcp lists are drained from __alloc_pages_direct_reclaim() only if some
> progress is made in the reclaim attempt.
>
> struct page *__alloc_pages_direct_reclaim() {
> .....
> *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
> if (unlikely(!(*did_some_progress)))
> goto out;
> retry:
> page = get_page_from_freelist();
> if (!page && !drained) {
> drain_all_pages(NULL);
> drained = true;
> goto retry;
> }
> out:
> }
>
> After the above, the allocation attempt can fall back to
> should_reclaim_retry() to decide on reclaim retries. If it too returns
> false, the allocation request will simply fall back to the oom kill path
> without even attempting to drain the pcp pages, which might help the
> allocation attempt to succeed.
>
> A VM running with ~50MB of memory showed the below stats during an OOM
> kill:
> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
>
> Though in such a system state the OOM kill is imminent, the current kill
> could have been delayed if the pcp lists were drained, as pcp + free is
> even above the high watermark.
TBH I am not sure this is really worth it. Does it really reduce the
risk of the OOM in any practical situation?
> Fix this missing drain of the pcp lists in should_reclaim_retry(), along
> with unreserving the high atomic page blocks, like it is done in
> __alloc_pages_direct_reclaim().
>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
> mm/page_alloc.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b91c99e..8eee292 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3857,8 +3857,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> cond_resched();
> out:
> /* Before OOM, exhaust highatomic_reserve */
> - if (!ret)
> - return unreserve_highatomic_pageblock(ac, true);
> + if (!ret) {
> + ret = unreserve_highatomic_pageblock(ac, true);
> + drain_all_pages(NULL);
> + }
>
> return ret;
> }
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-09 10:33 ` Michal Hocko
@ 2023-11-10 16:36 ` Charan Teja Kalla
2023-11-14 10:48 ` Michal Hocko
0 siblings, 1 reply; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-10 16:36 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
Thanks Michal!!
On 11/9/2023 4:03 PM, Michal Hocko wrote:
>> A VM running with ~50MB of memory showed the below stats during an OOM
>> kill:
>> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
>> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
>>
>> Though in such a system state the OOM kill is imminent, the current
>> kill could have been delayed if the pcp lists were drained, as pcp +
>> free is even above the high watermark.
> TBH I am not sure this is really worth it. Does it really reduce the
> risk of the OOM in any practical situation?
At least in my particular stress test case it just delayed the OOM, as I
can see that at the time of the OOM kill there are no free pcp pages. My
understanding of OOM is that it should be the last resort, used only
after doing enough reclaim retries. CMIW here.
This patch just aims to not miss the corner case where we hit the OOM
without draining the pcp lists. After draining, some systems may not
need the oom kill and some may still need it. My case is the latter, so
I am really not sure if we have ever encountered/noticed the former case.
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-10 16:36 ` Charan Teja Kalla
@ 2023-11-14 10:48 ` Michal Hocko
2023-11-14 16:36 ` Charan Teja Kalla
0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-14 10:48 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
On Fri 10-11-23 22:06:22, Charan Teja Kalla wrote:
> Thanks Michal!!
>
> On 11/9/2023 4:03 PM, Michal Hocko wrote:
> >> A VM running with ~50MB of memory showed the below stats during an OOM
> >> kill:
> >> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
> >> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
> >>
> >> Though in such a system state the OOM kill is imminent, the current
> >> kill could have been delayed if the pcp lists were drained, as pcp +
> >> free is even above the high watermark.
> > TBH I am not sure this is really worth it. Does it really reduce the
> > risk of the OOM in any practical situation?
>
> At least in my particular stress test case it just delayed the OOM, as I
> can see that at the time of the OOM kill there are no free pcp pages. My
> understanding of OOM is that it should be the last resort, used only
> after doing enough reclaim retries. CMIW here.
Yes, it is a last resort, but it is a heuristic as well. So the real
question is whether this makes any practical difference outside of
artificial workloads. I do not see anything particularly worrying about
draining the pcp cache, but it should be noted that this won't be 100%
either, as racing freeing of memory will end up on the pcp lists first.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-14 10:48 ` Michal Hocko
@ 2023-11-14 16:36 ` Charan Teja Kalla
2023-11-15 14:09 ` Michal Hocko
2024-01-25 16:36 ` Zach O'Keefe
0 siblings, 2 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-14 16:36 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
Thanks Michal!!
On 11/14/2023 4:18 PM, Michal Hocko wrote:
>> At least in my particular stress test case it just delayed the OOM, as I
>> can see that at the time of the OOM kill there are no free pcp pages. My
>> understanding of OOM is that it should be the last resort, used only
>> after doing enough reclaim retries. CMIW here.
> Yes, it is a last resort, but it is a heuristic as well. So the real
> question is whether this makes any practical difference outside of
> artificial workloads. I do not see anything particularly worrying about
> draining the pcp cache, but it should be noted that this won't be 100%
> either, as racing freeing of memory will end up on the pcp lists first.
Okay, I don't have any practical scenario where this helped me in
avoiding the OOM. Will come back if I ever encounter this issue in a
practical scenario.
Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
high atomic reserve calculations, that will help me.
Thanks.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-14 16:36 ` Charan Teja Kalla
@ 2023-11-15 14:09 ` Michal Hocko
2023-11-16 6:00 ` Charan Teja Kalla
2024-01-25 16:36 ` Zach O'Keefe
1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-15 14:09 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
On Tue 14-11-23 22:06:45, Charan Teja Kalla wrote:
> Thanks Michal!!
>
> On 11/14/2023 4:18 PM, Michal Hocko wrote:
> >> At least in my particular stress test case it just delayed the OOM, as I
> >> can see that at the time of the OOM kill there are no free pcp pages. My
> >> understanding of OOM is that it should be the last resort, used only
> >> after doing enough reclaim retries. CMIW here.
> > Yes, it is a last resort, but it is a heuristic as well. So the real
> > question is whether this makes any practical difference outside of
> > artificial workloads. I do not see anything particularly worrying about
> > draining the pcp cache, but it should be noted that this won't be 100%
> > either, as racing freeing of memory will end up on the pcp lists first.
>
> Okay, I don't have any practical scenario where this helped me in
> avoiding the OOM. Will come back if I ever encounter this issue in a
> practical scenario.
>
> Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
> high atomic reserve calculations, that will help me.
I do not have a strong opinion on that one to be honest. I am not even
sure that reserving a full page block (4MB) on small systems as
presented is really a good use of memory.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-15 14:09 ` Michal Hocko
@ 2023-11-16 6:00 ` Charan Teja Kalla
2023-11-16 12:55 ` Michal Hocko
0 siblings, 1 reply; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-16 6:00 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
Thanks Michal.
On 11/15/2023 7:39 PM, Michal Hocko wrote:
>> Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
>> high atomic reserve calculations, that will help me.
> I do not have a strong opinion on that one to be honest. I am not even
> sure that reserving a full page block (4MB) on small systems as
> presented is really a good use of memory.
Maybe another way to look at that patch is that the comment is really
not reflected in the code. It says, "Limit the number reserved to 1
pageblock or roughly 1% of a zone.", but the current code is making it 2
pageblocks. So, for a 4M block size, it is > 1%.
A second patch, which I will post, could avoid reserving the high atomic
page blocks on small systems -- but how to define the meaning of a small
system is not clear. Instead, I will let system administrators choose
this through either:
a) a command line param, high_atomic_reserves=off, on by default --
another knob, so admins may really not like this?
b) CONFIG_HIGH_ATOMIC_RESERVES, which if not defined, will not reserve.
Please lmk if you have any more suggestions here.
Also, I am thinking of requesting Andrew to pick the [PATCH V2 1/3]
patch and taking these discussions to a separate thread.
Thanks,
Charan
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations
2023-11-05 12:50 ` [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations Charan Teja Kalla
@ 2023-11-16 9:59 ` Mel Gorman
2023-11-16 12:52 ` Michal Hocko
0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2023-11-16 9:59 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: akpm, mhocko, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
On Sun, Nov 05, 2023 at 06:20:49PM +0530, Charan Teja Kalla wrote:
> reserve_highatomic_pageblock() aims to reserve 1% of the managed
> pages of a zone, which is used for high-order atomic allocations.
>
> It uses the below calculation to reserve:
> static void reserve_highatomic_pageblock(struct page *page, ....) {
>
> .......
> max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
>
> if (zone->nr_reserved_highatomic >= max_managed)
> goto out;
>
> zone->nr_reserved_highatomic += pageblock_nr_pages;
> set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
>
> out:
> ....
> }
>
> Since we always append 1% of the zone managed pages count to
> pageblock_nr_pages, and nr_reserved_highatomic is incremented/decremented
> in pageblock-sized units, the minimum reservation turns into 2 pageblocks.
>
> A system (actually a VM running on the Linux kernel) was encountered
> with the below zone configuration:
> Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
> reserved_highatomic:8192KB managed:49224kB
>
> The existing calculation makes it reserve 8MB (with a pageblock
> size of 4MB), i.e. 16% of the zone managed memory. Reserving such a high
> amount of memory can easily exert memory pressure on the system and thus
> may lead to unnecessary reclaims until the high atomic reserves are
> unreserved.
>
> Since high atomic reserves are managed in pageblock-size granules, as
> MIGRATE_HIGHATOMIC is set for such a pageblock, fix the calculation so
> that the minimum is one pageblock and the maximum is approximately 1% of
> the zone managed pages.
>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
This patch in isolation seems fine with the caveat that such a small
system may find the atomic reserves to be borderline useless.
Acked-by: Mel Gorman <mgorman@techsingularity.net>
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations
2023-11-16 9:59 ` Mel Gorman
@ 2023-11-16 12:52 ` Michal Hocko
2023-11-17 16:19 ` Mel Gorman
0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-16 12:52 UTC (permalink / raw)
To: Mel Gorman
Cc: Charan Teja Kalla, akpm, david, vbabka, hannes, quic_pkondeti,
linux-mm, linux-kernel
On Thu 16-11-23 09:59:01, Mel Gorman wrote:
[...]
> This patch in isolation seems fine with the caveat that such a small
> system may find the atomic reserves to be borderline useless.
Yes, exactly what I had in mind. Would it make sense to reserve the
pageblock only if it really is less than 1% of available memory?
--
Michal Hocko
SUSE Labs
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-16 6:00 ` Charan Teja Kalla
@ 2023-11-16 12:55 ` Michal Hocko
2023-11-17 5:43 ` Charan Teja Kalla
0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-16 12:55 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
On Thu 16-11-23 11:30:04, Charan Teja Kalla wrote:
> Thanks Michal.
>
> On 11/15/2023 7:39 PM, Michal Hocko wrote:
> >> Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
> >> high atomic reserve calculations, that would help me.
> > I do not have a strong opinion on that one to be honest. I am not even
> > sure that reserving a full page block (4MB) on small systems as
> > presented is really a good use of memory.
>
> Maybe another way to look at that patch is that the comment is not
> really reflected in the code. It says, "Limit the number reserved to 1
> pageblock or roughly 1% of a zone.", but the current code makes it 2
> pageblocks. So, for a 4M block size, it is > 1%.
>
> A second patch, that I will post, might avoid reserving high atomic
> page blocks on small systems -- but how to define "small systems" is
> not clear. Instead we could let system administrators choose this
> through either:
> a) a command line param, high_atomic_reserves=off, on by default --
> another knob, so admins may really not like it?
> b) CONFIG_HIGH_ATOMIC_RESERVES, which if not defined, will not reserve.
Please don't! I do not see any admin wanting to care about this at all.
It just takes a lot of understanding of internal MM stuff to make an
educated guess. This should really be auto-tuned. And as I responded in
the other reply, my take would be to reserve a pageblock only if it
doesn't consume more than 1% of memory, to preserve the existing
behavior yet not overconsume on small systems.
> Please let me know if you have any more suggestions here.
>
> Also, I am thinking to request Andrew to pick [PATCH V2 1/3] patch and
> take these discussions separately in a separate thread.
That makes sense as that is a clear bug fix.
--
Michal Hocko
SUSE Labs
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-16 12:55 ` Michal Hocko
@ 2023-11-17 5:43 ` Charan Teja Kalla
0 siblings, 0 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2023-11-17 5:43 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, mgorman, david, vbabka, hannes, quic_pkondeti, linux-mm,
linux-kernel
Thanks Michal!!
On 11/16/2023 6:25 PM, Michal Hocko wrote:
>> Maybe another way to look at that patch is that the comment is not
>> really reflected in the code. It says, "Limit the number reserved to 1
>> pageblock or roughly 1% of a zone.", but the current code makes it 2
>> pageblocks. So, for a 4M block size, it is > 1%.
>>
>> A second patch, that I will post, might avoid reserving high atomic
>> page blocks on small systems -- but how to define "small systems" is
>> not clear. Instead we could let system administrators choose this
>> through either:
>> a) a command line param, high_atomic_reserves=off, on by default --
>> another knob, so admins may really not like it?
>> b) CONFIG_HIGH_ATOMIC_RESERVES, which if not defined, will not reserve.
> Please don't! I do not see any admin wanting to care about this at all.
> It just takes a lot of understanding of internal MM stuff to make an
> educated guess. This should really be auto-tuned. And as I responded in
> the other reply, my take would be to reserve a pageblock only if it
> doesn't consume more than 1% of memory, to preserve the existing
> behavior yet not overconsume on small systems.
This idea of auto-tuning, by reserving a pageblock only if it doesn't
consume more than 1% of memory, seems cleaner to me. For a pageblock
size of 4MB, reservations would then kick in only on zones with at
least 400MB of RAM.
If that is fine, I can post a patch with your Suggested-by.
>
* Re: [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations
2023-11-16 12:52 ` Michal Hocko
@ 2023-11-17 16:19 ` Mel Gorman
0 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2023-11-17 16:19 UTC (permalink / raw)
To: Michal Hocko
Cc: Charan Teja Kalla, akpm, david, vbabka, hannes, quic_pkondeti,
linux-mm, linux-kernel
On Thu, Nov 16, 2023 at 01:52:26PM +0100, Michal Hocko wrote:
> On Thu 16-11-23 09:59:01, Mel Gorman wrote:
> [...]
> > This patch in isolation seems fine with the caveat that such a small
> > system may find the atomic reserves to be borderline useless.
>
> Yes, exactly what I had in mind. Would it make sense to reserve the
> pageblock only if it really is less than 1% of available memory?
I'd have no objection, with the caveat that we need to watch out for new
bugs related to high-order atomic failures. If the risk is noted in the
changelog and it is flagged as a revert candidate, then any bisection
should trivially find it.
--
Mel Gorman
SUSE Labs
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2023-11-14 16:36 ` Charan Teja Kalla
2023-11-15 14:09 ` Michal Hocko
@ 2024-01-25 16:36 ` Zach O'Keefe
2024-01-26 10:47 ` Charan Teja Kalla
1 sibling, 1 reply; 23+ messages in thread
From: Zach O'Keefe @ 2024-01-25 16:36 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: Michal Hocko, akpm, mgorman, david, vbabka, hannes, quic_pkondeti,
linux-mm, linux-kernel, Axel Rasmussen, Yosry Ahmed,
David Rientjes
Thanks for the patch, Charan, and thanks to Yosry for pointing me towards it.
I took a look at data from our fleet, and there are many cases on
high-cpu-count machines where we find multi-GiB worth of data sitting
on pcpu free lists at the time of system oom-kill, when free memory
for the relevant zones is below min watermarks, i.e. clear cases
where this patch could have prevented OOM.
This kind of issue scales with the number of cpus, so presumably this
patch will only become increasingly valuable to both datacenters and
desktops alike going forward. Can we revamp it as a standalone patch?
Thanks,
Zach
On Tue, Nov 14, 2023 at 8:37 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Thanks Michal!!
>
> On 11/14/2023 4:18 PM, Michal Hocko wrote:
> >> At least in my particular stress test case it just delayed the OOM, as I
> >> can see that at the time of the OOM kill there are no free pcp pages. My
> >> understanding of the OOM is that it should be the last resort and happen
> >> only after doing enough reclaim retries. CMIW here.
> > Yes it is a last resort but it is a heuristic as well. So the real
> > question is whether this makes any practical difference outside of
> > artificial workloads. I do not see anything particularly worrying about
> > draining the pcp cache but it should be noted that this won't be 100%
> > either, as racing freeing of memory will end up on pcp lists first.
>
> Okay, I don't have any practical scenario where this helped me in
> avoiding the OOM. Will come back if I ever encounter this issue in a
> practical scenario.
>
> Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
> high atomic reserve calculations, that would help me.
>
> Thanks.
>
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2024-01-25 16:36 ` Zach O'Keefe
@ 2024-01-26 10:47 ` Charan Teja Kalla
2024-01-26 10:57 ` Michal Hocko
0 siblings, 1 reply; 23+ messages in thread
From: Charan Teja Kalla @ 2024-01-26 10:47 UTC (permalink / raw)
To: Zach O'Keefe
Cc: Michal Hocko, akpm, mgorman, david, vbabka, hannes, quic_pkondeti,
linux-mm, linux-kernel, Axel Rasmussen, Yosry Ahmed,
David Rientjes
Hi Michal/Zach,
On 1/25/2024 10:06 PM, Zach O'Keefe wrote:
> Thanks for the patch, Charan, and thanks to Yosry for pointing me towards it.
>
> I took a look at data from our fleet, and there are many cases on
> high-cpu-count machines where we find multi-GiB worth of data sitting
> on pcpu free lists at the time of system oom-kill, when free memory
> > for the relevant zones is below min watermarks, i.e. clear cases
> where this patch could have prevented OOM.
>
> This kind of issue scales with the number of cpus, so presumably this
> patch will only become increasingly valuable to both datacenters and
> desktops alike going forward. Can we revamp it as a standalone patch?
>
Glad to see a real world use case for this. We too have observed OOMs
every now and then with a relatively significant PCP cache, but in all
such cases the OOM was imminent.
AFAICS, your use case looks like a premature OOM scenario despite a lot
of free memory sitting on the pcp lists, which is where this patch
should've helped.
@Michal: This usecase seems to be the practical scenario that you were
asking about below.
Regarding the other concern, racing freeing of memory ending up on pcp
lists first -- will that be such a big issue? This patch drains the
current pcp lists, which can avoid the oom altogether. If this racing
free is a major concern, should that be taken up as a separate
discussion?
Will revamp this as a separate patch if there are no more concerns here.
> Thanks,
> Zach
>
>
> On Tue, Nov 14, 2023 at 8:37 AM Charan Teja Kalla
> <quic_charante@quicinc.com> wrote:
>>
>> Thanks Michal!!
>>
>> On 11/14/2023 4:18 PM, Michal Hocko wrote:
>>>> At least in my particular stress test case it just delayed the OOM, as I
>>>> can see that at the time of the OOM kill there are no free pcp pages. My
>>>> understanding of the OOM is that it should be the last resort and happen
>>>> only after doing enough reclaim retries. CMIW here.
>>> Yes it is a last resort but it is a heuristic as well. So the real
>>> question is whether this makes any practical difference outside of
>>> artificial workloads. I do not see anything particularly worrying about
>>> draining the pcp cache but it should be noted that this won't be 100%
>>> either, as racing freeing of memory will end up on pcp lists first.
>>
>> Okay, I don't have any practical scenario where this helped me in
>> avoiding the OOM. Will come back if I ever encounter this issue in a
>> practical scenario.
>>
>> Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
>> high atomic reserve calculations, that would help me.
>>
>> Thanks.
>>
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2024-01-26 10:47 ` Charan Teja Kalla
@ 2024-01-26 10:57 ` Michal Hocko
2024-01-26 22:51 ` Zach O'Keefe
0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2024-01-26 10:57 UTC (permalink / raw)
To: Zach O'Keefe
Cc: Charan Teja Kalla, akpm, mgorman, david, vbabka, hannes,
quic_pkondeti, linux-mm, linux-kernel, Axel Rasmussen,
Yosry Ahmed, David Rientjes
On Fri 26-01-24 16:17:04, Charan Teja Kalla wrote:
> Hi Michal/Zach,
>
> On 1/25/2024 10:06 PM, Zach O'Keefe wrote:
> > Thanks for the patch, Charan, and thanks to Yosry for pointing me towards it.
> >
> > I took a look at data from our fleet, and there are many cases on
> > high-cpu-count machines where we find multi-GiB worth of data sitting
> > on pcpu free lists at the time of system oom-kill, when free memory
> > for the relevant zones is below min watermarks, i.e. clear cases
> > where this patch could have prevented OOM.
> >
> > This kind of issue scales with the number of cpus, so presumably this
> > patch will only become increasingly valuable to both datacenters and
> > desktops alike going forward. Can we revamp it as a standalone patch?
Do you have any example OOM reports? There were recent changes to scale
the pcp pages and it would be good to know whether they work reasonably
well even under memory pressure.
I am not objecting to the patch discussed here but it would be really
good to understand the underlying problem and the scale of it.
Thanks!
--
Michal Hocko
SUSE Labs
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2024-01-26 10:57 ` Michal Hocko
@ 2024-01-26 22:51 ` Zach O'Keefe
2024-01-29 15:04 ` Michal Hocko
0 siblings, 1 reply; 23+ messages in thread
From: Zach O'Keefe @ 2024-01-26 22:51 UTC (permalink / raw)
To: Michal Hocko
Cc: Charan Teja Kalla, akpm, mgorman, david, vbabka, hannes,
quic_pkondeti, linux-mm, linux-kernel, Axel Rasmussen,
Yosry Ahmed, David Rientjes
Hey Michal,
> Do you have any example OOM reports? [..]
Sure, here is one on a 1TiB, 128-physical core machine running a
5.10-based kernel (sorry, it reads pretty awkwardly when wrapped):
---8<---
mytask invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE),
order=0, oom_score_adj=0
<...>
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=sdc,mems_allowed=0-1,global_oom,task_memcg=/sdc,task=mytask,pid=835214,uid=0
Out of memory: Killed process 835214 (mytask) total-vm:787716604kB,
anon-rss:787536152kB, file-rss:64kB, shmem-rss:0kB, UID:0
pgtables:1541224kB oom_score_adj:0, hugetlb-usage:0kB
Mem-Info:
active_anon:320 inactive_anon:198083493 isolated_anon:0
active_file:128283 inactive_file:290086 isolated_file:0
unevictable:3525 dirty:15 writeback:0
slab_reclaimable:35505 slab_unreclaimable:272917
mapped:46414 shmem:822 pagetables:64085088
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:325793 free_pcp:263277 free_cma:0
Node 0 active_anon:1112kB inactive_anon:268172556kB
active_file:270992kB inactive_file:254612kB unevictable:12404kB
isolated(anon):0kB isolated(file):0kB mapped:147240kB dirty:52kB
writeback:0kB shmem:304kB shmem_thp:0kB shmem_pmdmapped:0kB
anon_thp:1310720kB writeback_tmp:0kB kernel_stack:32000kB
pagetables:255483108kB sec_pagetables:0kB all_unreclaimable? yes
Node 1 active_anon:168kB inactive_anon:524161416kB
active_file:242140kB inactive_file:905732kB unevictable:1696kB
isolated(anon):0kB isolated(file):0kB mapped:38416kB dirty:8kB
writeback:0kB shmem:2984kB shmem_thp:0kB shmem_pmdmapped:0kB
anon_thp:267732992kB writeback_tmp:0kB kernel_stack:8520kB
pagetables:857244kB sec_pagetables:0kB all_unreclaimable? yes
Node 0 Crash free:72kB min:108kB low:220kB high:332kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:111940kB
active_file:280kB inactive_file:316kB unevictable:0kB writepending:4kB
present:114284kB managed:114196kB mlocked:0kB bounce:0kB
free_pcp:1528kB local_pcp:24kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB
reserved_highatomic:0KB active_anon:8kB inactive_anon:19456kB
active_file:4kB inactive_file:224kB unevictable:0kB writepending:0kB
present:2643512kB managed:2643512kB mlocked:0kB bounce:0kB
free_pcp:8040kB local_pcp:244kB free_cma:0kB
lowmem_reserve[]: 0 0 16029 16029
Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB
reserved_highatomic:0KB active_anon:1104kB inactive_anon:268040520kB
active_file:270708kB inactive_file:254072kB unevictable:12404kB
writepending:48kB present:533969920kB managed:525510968kB
mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB
free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:723460kB min:755656kB low:1284080kB high:1812504kB
reserved_highatomic:0KB active_anon:168kB inactive_anon:524161416kB
active_file:242140kB inactive_file:905732kB unevictable:1696kB
writepending:8kB present:536866816kB managed:528427664kB
mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB
free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 Crash: 0*4kB 0*8kB 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 16kB
Node 0 DMA32: 80*4kB (UME) 74*8kB (UE) 23*16kB (UME) 21*32kB (UME)
40*64kB (UE) 35*128kB (UME) 3*256kB (UE) 9*512kB (UME) 13*1024kB (UM)
19*2048kB (UME) 0*4096kB = 66592kB
Node 0 Normal: 1999*4kB (UE) 259*8kB (UM) 465*16kB (UM) 114*32kB (UE)
54*64kB (UME) 14*128kB (U) 74*256kB (UME) 128*512kB (UE) 96*1024kB (U)
56*2048kB (U) 46*4096kB (U) = 512292kB
Node 1 Normal: 2280*4kB (UM) 12667*8kB (UM) 8859*16kB (UME) 5221*32kB
(UME) 1631*64kB (UME) 899*128kB (UM) 330*256kB (UME) 0*512kB 0*1024kB
0*2048kB 0*4096kB = 723208kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=1048576kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
420675 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 268435456kB
Total swap = 268435456kB
---8<---
Node 0/1 Normal free memory is below respective min watermarks, with
790040kB+253500kB ~= 1GiB of memory on pcp lists.
With this patch, the GFP_HIGHUSER_MOVABLE + unrestricted mems_allowed
allocation would have allowed us to access all that memory, very
likely avoiding the oom.
> [..] There were recent changes to scale
> the pcp pages and it would be good to know whether they work reasonably
> well even under memory pressure.
I'm not familiar with these changes, but a quick check of recent
activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
pageset"); is this what you are referring to?
Thanks, and have a great day,
Zach
>
> I am not objecting to the patch discussed here but it would be really
> good to understand the underlying problem and the scale of it.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2024-01-26 22:51 ` Zach O'Keefe
@ 2024-01-29 15:04 ` Michal Hocko
2024-02-06 23:15 ` Zach O'Keefe
0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2024-01-29 15:04 UTC (permalink / raw)
To: Zach O'Keefe
Cc: Charan Teja Kalla, akpm, mgorman, david, vbabka, hannes,
quic_pkondeti, linux-mm, linux-kernel, Axel Rasmussen,
Yosry Ahmed, David Rientjes
On Fri 26-01-24 14:51:26, Zach O'Keefe wrote:
[...]
> Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB
[...]
> free_pcp:8040kB local_pcp:244kB free_cma:0kB
> lowmem_reserve[]: 0 0 16029 16029
> Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB
[...]
> mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB
[...]
> mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB
[...]
> I'm not familiar with these changes, but a quick check of recent
> activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
> pageset") ; is this what you are referring to?
No, but looking at the above discrepancy between free_pcp and local_pcp
would point in that direction for sure. So this is worth checking.
vmstat is a periodic activity and it cannot really deal with bursts
of memory allocations, but it is quite possible that the patch above
will prevent the build-up before it grows that large.
I originally referred to different work though https://lore.kernel.org/all/20231016053002.756205-10-ying.huang@intel.com/T/#m9fdfabaee37db1320bbc678a69d1cdd8391640e0
merged as ca71fe1ad922 ("mm, pcp: avoid to drain PCP when process exit")
and the associated patches.
--
Michal Hocko
SUSE Labs
* Re: [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill
2024-01-29 15:04 ` Michal Hocko
@ 2024-02-06 23:15 ` Zach O'Keefe
0 siblings, 0 replies; 23+ messages in thread
From: Zach O'Keefe @ 2024-02-06 23:15 UTC (permalink / raw)
To: Michal Hocko
Cc: Charan Teja Kalla, akpm, mgorman, david, vbabka, hannes,
quic_pkondeti, linux-mm, linux-kernel, Axel Rasmussen,
Yosry Ahmed, David Rientjes
On Mon, Jan 29, 2024 at 7:04 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 26-01-24 14:51:26, Zach O'Keefe wrote:
> [...]
> > Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB
> [...]
> > free_pcp:8040kB local_pcp:244kB free_cma:0kB
> > lowmem_reserve[]: 0 0 16029 16029
> > Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB
> [...]
> > mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB
> [...]
> > mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB
> [...]
> > I'm not familiar with these changes, but a quick check of recent
> > activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
> > pageset"); is this what you are referring to?
>
> No, but looking at above discrepancy between free_pcp and local_pcp
> would point that direction for sure. So this is worth checking.
> vmstat is a periodic activity and it cannot really deal with bursts
> of memory allocations but it is quite possible that the patch above
> will prevent the build up before it grows that large.
>
> I originally referred to different work though https://lore.kernel.org/all/20231016053002.756205-10-ying.huang@intel.com/T/#m9fdfabaee37db1320bbc678a69d1cdd8391640e0
> merged as ca71fe1ad922 ("mm, pcp: avoid to drain PCP when process exit")
> and the associated patches.
Thanks for the response, Michal, and also thank you for the reference here.
It'll take me a bit to evaluate how these patches might have helped,
and whether draining the pcpu lists would have added anything on top.
I might not get to that quickly, but I just wanted to thank you for
your response, and to leave this discussion, for the moment, with the
ball in my court to return with findings.
Thanks,
Zach
> --
> Michal Hocko
> SUSE Labs
end of thread, other threads:[~2024-02-06 23:16 UTC | newest]
Thread overview: 23+ messages
2023-11-05 12:50 [PATCH V2 0/3] mm: page_alloc: fixes for early oom kills Charan Teja Kalla
2023-11-05 12:50 ` [PATCH V2 1/3] mm: page_alloc: unreserve highatomic page blocks before oom Charan Teja Kalla
2023-11-09 10:29 ` Michal Hocko
2023-11-05 12:50 ` [PATCH V2 2/3] mm: page_alloc: correct high atomic reserve calculations Charan Teja Kalla
2023-11-16 9:59 ` Mel Gorman
2023-11-16 12:52 ` Michal Hocko
2023-11-17 16:19 ` Mel Gorman
2023-11-05 12:50 ` [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill Charan Teja Kalla
2023-11-05 12:55 ` Charan Teja Kalla
2023-11-09 10:33 ` Michal Hocko
2023-11-10 16:36 ` Charan Teja Kalla
2023-11-14 10:48 ` Michal Hocko
2023-11-14 16:36 ` Charan Teja Kalla
2023-11-15 14:09 ` Michal Hocko
2023-11-16 6:00 ` Charan Teja Kalla
2023-11-16 12:55 ` Michal Hocko
2023-11-17 5:43 ` Charan Teja Kalla
2024-01-25 16:36 ` Zach O'Keefe
2024-01-26 10:47 ` Charan Teja Kalla
2024-01-26 10:57 ` Michal Hocko
2024-01-26 22:51 ` Zach O'Keefe
2024-01-29 15:04 ` Michal Hocko
2024-02-06 23:15 ` Zach O'Keefe