linux-mm.kvack.org archive mirror
* folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL
@ 2023-11-01  0:13 Christoph Lameter (Ampere)
  2023-11-01  8:08 ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter (Ampere) @ 2023-11-01  0:13 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

Hi Matthew,

There is a strange warning on bootup related to folios. Seen it a couple 
of times before. Why does this occur?



        Activating swap
[   20.765039] ------------[ cut here ]------------
[   20.772475] WARNING: CPU: 35 PID: 1152 at mm/page_alloc.c:2774 get_page_from_freelist+0x214/0x17f8
[   20.784194] Modules linked in:
[   20.784197] CPU: 35 PID: 1152 Comm: swapon Not tainted 6.6.0-base #109
[   20.784202] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0014/Mt.Jade Motherboard, BIOS 2.10.20230517 (SCP: 2.10.20230517) 2023/05/17
[   20.808072] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   20.808075] pc : get_page_from_freelist+0x214/0x17f8
[   20.808080] lr : get_page_from_freelist+0xaa8/0x17f8
[   20.808084] sp : ffff800094623510
[   20.808085] x29: ffff800094623510 x28: 0000000000000000 x27: ffff083f7fffe700
[   20.808089] x26: 0000000000000000 x25: 0000000000000004 x24: ffff083f7fffe700
[   20.808093] x23: ffff083f7fffe700 x22: 0000000000012dd8 x21: ffff8000828cf000
[   20.808096] x20: 0000000000000002 x19: 0000000000000801 x18: 0000000000000000
[   20.808100] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   20.808102] x14: 0000000000000000 x13: ffff07ff917f0c7b x12: 0000000000000005
[   20.808105] x11: 0000000000000000 x10: ffff083f7fffe8f0 x9 : 0000000000000000
[   20.808108] x8 : ffff083f7fffe8e0 x7 : ffff083f7fffe8d0 x6 : ffff083f7fffe700
[   20.878104] x5 : ffff083f7fffe910 x4 : fffffc1fff3e1708 x3 : ffff083f7fffe900
[   20.878107] x2 : ffff083f7fffe7d0 x1 : ffff800094623710 x0 : 0000000000008000
[   20.878110] Call trace:
[   20.878111]  get_page_from_freelist+0x214/0x17f8
[   20.878116]  __alloc_pages+0x17c/0xe08
[   20.878120]  __kmalloc_large_node+0xa0/0x170
[   20.878123]  __kmalloc_node+0x120/0x1d0
[   20.878125]  memcg_alloc_slab_cgroups+0x48/0xc0
[   20.878128]  memcg_slab_post_alloc_hook+0xa8/0x1c8
[   20.878132]  kmem_cache_alloc+0x18c/0x338
[   20.878135]  alloc_buffer_head+0x28/0xa0
[   20.878138]  folio_alloc_buffers+0xe8/0x1c0
[   20.878141]  folio_create_empty_buffers+0x2c/0x1e8
[   20.878143]  folio_create_buffers+0x58/0x80
[   20.878145]  block_read_full_folio+0x80/0x450
[   20.878148]  blkdev_read_folio+0x24/0x38
[   20.956921]  filemap_read_folio+0x60/0x138
[   20.956925]  do_read_cache_folio+0x180/0x298
[   20.965270]  read_cache_page+0x24/0x90
[   20.965273]  __arm64_sys_swapon+0x2e0/0x1208
[   20.965277]  invoke_syscall+0x78/0x108
[   20.965282]  el0_svc_common.constprop.0+0x48/0xf0
[   20.981702]  do_el0_svc+0x24/0x38
[   20.993773]  el0t_64_sync_handler+0x100/0x130
[   20.993776]  el0t_64_sync+0x190/0x198
[   20.993779] ---[ end trace 0000000000000000 ]---
[   20.999972] Adding 999420k swap on /dev/mapper/eng07sys--r113--vg-swap_1.  Priority:-2 extents:1 across:999420k SS

This is due to folio_alloc_buffers() setting GFP_NOFAIL:


struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long 
size,
                                         bool retry)
{
         struct buffer_head *bh, *head;
         gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
         long offset;
         struct mem_cgroup *memcg, *old_memcg;

         if (retry)
                 gfp |= __GFP_NOFAIL;





Then rmqueue() checks for order > 1 and GFP_NOFAIL


static inline
struct page *rmqueue(struct zone *preferred_zone,
                         struct zone *zone, unsigned int order,
                         gfp_t gfp_flags, unsigned int alloc_flags,
                         int migratetype)
{
         struct page *page;

         /*
          * We most definitely don't want callers attempting to
          * allocate greater than order-1 page units with __GFP_NOFAIL.
          */
         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL
  2023-11-01  0:13 folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL Christoph Lameter (Ampere)
@ 2023-11-01  8:08 ` Matthew Wilcox
  2023-11-07  2:57   ` cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL) Christoph Lameter
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-11-01  8:08 UTC (permalink / raw)
  To: Christoph Lameter (Ampere); +Cc: linux-mm

On Tue, Oct 31, 2023 at 05:13:57PM -0700, Christoph Lameter (Ampere) wrote:
> Hi Matthew,
> 
> There is a strange warning on bootup related to folios. Seen it a couple of
> times before. Why does this occur?

Filesystems generally can't cope with failing to allocate a bufferhead.
So the buffer head code sets __GFP_NOFAIL.  That's better than trying
to implement __GFP_NOFAIL semantics in the fs code, right?
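
Roughly, without __GFP_NOFAIL every caller would need to open-code its own
retry loop -- an illustrative sketch only, not actual fs code:

        struct buffer_head *bh;

        /* open-coded "nofail": keep retrying the allocation until it succeeds */
        do {
                bh = alloc_buffer_head(GFP_NOFS);
                if (!bh)
                        cond_resched();         /* back off before retrying */
        } while (!bh);

and every filesystem would have to get the backoff details right itself.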

> [   20.878110] Call trace:
> [   20.878111]  get_page_from_freelist+0x214/0x17f8
> [   20.878116]  __alloc_pages+0x17c/0xe08
> [   20.878120]  __kmalloc_large_node+0xa0/0x170
> [   20.878123]  __kmalloc_node+0x120/0x1d0
> [   20.878125]  memcg_alloc_slab_cgroups+0x48/0xc0

Oho.  It's not buffer's fault, specifically.  memcg is allocating
its own metadata for the slab.  I decree this Not My Fault.

> [   20.878128]  memcg_slab_post_alloc_hook+0xa8/0x1c8
> [   20.878132]  kmem_cache_alloc+0x18c/0x338
> [   20.878135]  alloc_buffer_head+0x28/0xa0
> [   20.878138]  folio_alloc_buffers+0xe8/0x1c0
> [   20.878141]  folio_create_empty_buffers+0x2c/0x1e8
> [   20.878143]  folio_create_buffers+0x58/0x80
> [   20.878145]  block_read_full_folio+0x80/0x450
> [   20.878148]  blkdev_read_folio+0x24/0x38
> [   20.956921]  filemap_read_folio+0x60/0x138
> [   20.956925]  do_read_cache_folio+0x180/0x298
> [   20.965270]  read_cache_page+0x24/0x90
> [   20.965273]  __arm64_sys_swapon+0x2e0/0x1208
> [   20.965277]  invoke_syscall+0x78/0x108
> [   20.965282]  el0_svc_common.constprop.0+0x48/0xf0
> [   20.981702]  do_el0_svc+0x24/0x38
> [   20.993773]  el0t_64_sync_handler+0x100/0x130
> [   20.993776]  el0t_64_sync+0x190/0x198
> [   20.993779] ---[ end trace 0000000000000000 ]---
> [   20.999972] Adding 999420k swap on /dev/mapper/eng07sys--r113--vg-swap_1.
> Priority:-2 extents:1 across:999420k SS
> 
> This is due to
> 
> 
> 
> folio_alloc_buffers() setting GFP_NOFAIL:
> 
> 
> struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long
> size,
>                                         bool retry)
> {
>         struct buffer_head *bh, *head;
>         gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
>         long offset;
>         struct mem_cgroup *memcg, *old_memcg;
> 
>         if (retry)
>                 gfp |= __GFP_NOFAIL;

This isn't new.  It was introduced by 640ab98fb362 in 2017.
It seems reasonable to be able to kmalloc(512, GFP_NOFAIL).  It's the
memcg code which is having problems here.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-01  8:08 ` Matthew Wilcox
@ 2023-11-07  2:57   ` Christoph Lameter
  2023-11-07 18:05     ` Roman Gushchin
  2023-11-07 19:24     ` Matthew Wilcox
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Lameter @ 2023-11-07  2:57 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Matthew Wilcox, linux-mm, cgroups

Right... Well, let's add the cgroup folks to this.

The code that simply uses the GFP_NOFAIL to allocate cgroup metadata 
using an order > 1:

int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
 				 gfp_t gfp, bool new_slab)
{
 	unsigned int objects = objs_per_slab(s, slab);
 	unsigned long memcg_data;
 	void *vec;

 	gfp &= ~OBJCGS_CLEAR_MASK;
 	vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
 			   slab_nid(slab));




On Wed, 1 Nov 2023, Matthew Wilcox wrote:

> On Tue, Oct 31, 2023 at 05:13:57PM -0700, Christoph Lameter (Ampere) wrote:
>> Hi Matthew,
>>
>> There is a strange warning on bootup related to folios. Seen it a couple of
>> times before. Why does this occur?
>
> Filesystems generally can't cope with failing to allocate a bufferhead.
> So the buffer head code sets __GFP_NOFAIL.  That's better than trying
> to implement __GFP_NOFAIL semantics in the fs code, right?
>
>> [   20.878110] Call trace:
>> [   20.878111]  get_page_from_freelist+0x214/0x17f8
>> [   20.878116]  __alloc_pages+0x17c/0xe08
>> [   20.878120]  __kmalloc_large_node+0xa0/0x170
>> [   20.878123]  __kmalloc_node+0x120/0x1d0
>> [   20.878125]  memcg_alloc_slab_cgroups+0x48/0xc0
>
> Oho.  It's not buffer's fault, specifically.  memcg is allocating
> its own metadata for the slab.  I decree this Not My Fault.
>
>> [   20.878128]  memcg_slab_post_alloc_hook+0xa8/0x1c8
>> [   20.878132]  kmem_cache_alloc+0x18c/0x338
>> [   20.878135]  alloc_buffer_head+0x28/0xa0
>> [   20.878138]  folio_alloc_buffers+0xe8/0x1c0
>> [   20.878141]  folio_create_empty_buffers+0x2c/0x1e8
>> [   20.878143]  folio_create_buffers+0x58/0x80
>> [   20.878145]  block_read_full_folio+0x80/0x450
>> [   20.878148]  blkdev_read_folio+0x24/0x38
>> [   20.956921]  filemap_read_folio+0x60/0x138
>> [   20.956925]  do_read_cache_folio+0x180/0x298
>> [   20.965270]  read_cache_page+0x24/0x90
>> [   20.965273]  __arm64_sys_swapon+0x2e0/0x1208
>> [   20.965277]  invoke_syscall+0x78/0x108
>> [   20.965282]  el0_svc_common.constprop.0+0x48/0xf0
>> [   20.981702]  do_el0_svc+0x24/0x38
>> [   20.993773]  el0t_64_sync_handler+0x100/0x130
>> [   20.993776]  el0t_64_sync+0x190/0x198
>> [   20.993779] ---[ end trace 0000000000000000 ]---
>> [   20.999972] Adding 999420k swap on /dev/mapper/eng07sys--r113--vg-swap_1.
>> Priority:-2 extents:1 across:999420k SS
>>
>> This is due to
>>
>>
>>
>> folio_alloc_buffers() setting GFP_NOFAIL:
>>
>>
>> struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long
>> size,
>>                                         bool retry)
>> {
>>         struct buffer_head *bh, *head;
>>         gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
>>         long offset;
>>         struct mem_cgroup *memcg, *old_memcg;
>>
>>         if (retry)
>>                 gfp |= __GFP_NOFAIL;
>
> This isn't new.  It was introduced by 640ab98fb362 in 2017.
> It seems reasonable to be able to kmalloc(512, GFP_NOFAIL).  It's the
> memcg code which is having problems here.
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-07  2:57   ` cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL) Christoph Lameter
@ 2023-11-07 18:05     ` Roman Gushchin
  2023-11-07 18:18       ` Shakeel Butt
  2023-11-08 10:33       ` Michal Hocko
  2023-11-07 19:24     ` Matthew Wilcox
  1 sibling, 2 replies; 15+ messages in thread
From: Roman Gushchin @ 2023-11-07 18:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matthew Wilcox, linux-mm, cgroups, Shakeel Butt, Michal Hocko

On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> Right... Well, let's add the cgroup folks to this.

Hello!

I think it's the best thing we can do now. Thoughts?

From 5ed3e88f4f052b6ce8dbec0545dfc80eb7534a1a Mon Sep 17 00:00:00 2001
From: Roman Gushchin <roman.gushchin@linux.dev>
Date: Tue, 7 Nov 2023 09:18:02 -0800
Subject: [PATCH] mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors

Objcg vectors attached to slab pages to store slab object ownership
information are allocated using gfp flags for the original slab
allocation. Depending on slab page order and the size of slab objects,
objcg vector can take several pages.

If the original allocation was done with the __GFP_NOFAIL flag, it
triggered a warning in the page allocation code. Indeed, order > 1
pages should not been allocated with the __GFP_NOFAIL flag.

Fix this by simple dropping the __GFP_NOFAIL flag when allocating
the objcg vector. It effectively allows to skip the accounting of a
single slab object under a heavy memory pressure.

An alternative would be to implement the mechanism to fallback to
order-0 allocations for accounting metadata, which is also not perfect
because it will increase performance penalty and memory footprint
of the kernel memory accounting under memory pressure.

Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Matthew Wilcox <willy@infradead.org>
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 774bd6e21e27..1c1061df9cd1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2936,7 +2936,8 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
  * Moreover, it should not come from DMA buffer and is not readily
  * reclaimable. So those GFP bits should be masked off.
  */
-#define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
+#define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | \
+				 __GFP_ACCOUNT | __GFP_NOFAIL)
 
 /*
  * mod_objcg_mlstate() may be called with irq enabled, so
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-07 18:05     ` Roman Gushchin
@ 2023-11-07 18:18       ` Shakeel Butt
  2023-11-08 10:33       ` Michal Hocko
  1 sibling, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2023-11-07 18:18 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Christoph Lameter, Matthew Wilcox, linux-mm, cgroups,
	Michal Hocko

+Andrew

On Tue, Nov 7, 2023 at 10:05 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> > Right... Well, let's add the cgroup folks to this.
>
> Hello!
>
> I think it's the best thing we can do now. Thoughts?
>
> From 5ed3e88f4f052b6ce8dbec0545dfc80eb7534a1a Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <roman.gushchin@linux.dev>
> Date: Tue, 7 Nov 2023 09:18:02 -0800
> Subject: [PATCH] mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
>
> Objcg vectors attached to slab pages to store slab object ownership
> information are allocated using gfp flags for the original slab
> allocation. Depending on slab page order and the size of slab objects,
> objcg vector can take several pages.
>
> If the original allocation was done with the __GFP_NOFAIL flag, it
> triggered a warning in the page allocation code. Indeed, order > 1
> pages should not been allocated with the __GFP_NOFAIL flag.
>
> Fix this by simple dropping the __GFP_NOFAIL flag when allocating

*simply

> the objcg vector. It effectively allows to skip the accounting of a
> single slab object under a heavy memory pressure.
>
> An alternative would be to implement the mechanism to fallback to
> order-0 allocations for accounting metadata, which is also not perfect
> because it will increase performance penalty and memory footprint
> of the kernel memory accounting under memory pressure.
>
> Reported-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox <willy@infradead.org>

I think we should CC stable too.

Acked-by: Shakeel Butt <shakeelb@google.com>


> ---
>  mm/memcontrol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 774bd6e21e27..1c1061df9cd1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2936,7 +2936,8 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>   * Moreover, it should not come from DMA buffer and is not readily
>   * reclaimable. So those GFP bits should be masked off.
>   */
> -#define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
> +#define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | \
> +                                __GFP_ACCOUNT | __GFP_NOFAIL)
>
>  /*
>   * mod_objcg_mlstate() may be called with irq enabled, so
> --
> 2.42.0
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-07  2:57   ` cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL) Christoph Lameter
  2023-11-07 18:05     ` Roman Gushchin
@ 2023-11-07 19:24     ` Matthew Wilcox
  2023-11-07 21:33       ` Roman Gushchin
  2023-11-10 13:38       ` Matthew Wilcox
  1 sibling, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2023-11-07 19:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Roman Gushchin, linux-mm, cgroups

On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> Right... Well, let's add the cgroup folks to this.
> 
> The code that simply uses the GFP_NOFAIL to allocate cgroup metadata using
> an order > 1:
> 
> int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> 				 gfp_t gfp, bool new_slab)
> {
> 	unsigned int objects = objs_per_slab(s, slab);
> 	unsigned long memcg_data;
> 	void *vec;
> 
> 	gfp &= ~OBJCGS_CLEAR_MASK;
> 	vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
> 			   slab_nid(slab));

But, but but, why does this incur an allocation larger than PAGE_SIZE?

sizeof(void *) is 8.  We have N objects allocated from the slab.  I
happen to know this is used for buffer_head, so:

buffer_head         1369   1560    104   39    1 : tunables    0    0    0 : slabdata     40     40      0

we get 39 objects per slab.  and we're only allocating one page per slab.
39 * 8 is only 312.

Maybe Christoph is playing with min_slab_order or something, so we're
getting 8 pages per slab.  That's still only 2496 bytes.  Why are we
calling into the large kmalloc path?  What's really going on here?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-07 19:24     ` Matthew Wilcox
@ 2023-11-07 21:33       ` Roman Gushchin
  2023-11-07 21:37         ` Matthew Wilcox
  2023-11-10 13:38       ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: Roman Gushchin @ 2023-11-07 21:33 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Lameter, linux-mm, cgroups

On Tue, Nov 07, 2023 at 07:24:08PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> > Right... Well, let's add the cgroup folks to this.
> > 
> > The code that simply uses the GFP_NOFAIL to allocate cgroup metadata using
> > an order > 1:
> > 
> > int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> > 				 gfp_t gfp, bool new_slab)
> > {
> > 	unsigned int objects = objs_per_slab(s, slab);
> > 	unsigned long memcg_data;
> > 	void *vec;
> > 
> > 	gfp &= ~OBJCGS_CLEAR_MASK;
> > 	vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
> > 			   slab_nid(slab));
> 
> But, but but, why does this incur an allocation larger than PAGE_SIZE?
> 
> sizeof(void *) is 8.  We have N objects allocated from the slab.  I
> happen to know this is used for buffer_head, so:
> 
> buffer_head         1369   1560    104   39    1 : tunables    0    0    0 : slabdata     40     40      0
> 
> we get 39 objects per slab.  and we're only allocating one page per slab.
> 39 * 8 is only 312.
> 
> Maybe Christoph is playing with min_slab_order or something, so we're
> getting 8 pages per slab.  That's still only 2496 bytes.  Why are we
> calling into the large kmalloc path?  What's really going on here?

Good question and I *guess* it's something related to Christoph's hardware
(64k pages or something like this) - otherwise we would see it sooner.

I'd like to have the answer too.

Thanks!


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-07 21:33       ` Roman Gushchin
@ 2023-11-07 21:37         ` Matthew Wilcox
  0 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2023-11-07 21:37 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Christoph Lameter, linux-mm, cgroups

On Tue, Nov 07, 2023 at 01:33:41PM -0800, Roman Gushchin wrote:
> On Tue, Nov 07, 2023 at 07:24:08PM +0000, Matthew Wilcox wrote:
> > On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> > > Right... Well, let's add the cgroup folks to this.
> > > 
> > > The code that simply uses the GFP_NOFAIL to allocate cgroup metadata using
> > > an order > 1:
> > > 
> > > int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> > > 				 gfp_t gfp, bool new_slab)
> > > {
> > > 	unsigned int objects = objs_per_slab(s, slab);
> > > 	unsigned long memcg_data;
> > > 	void *vec;
> > > 
> > > 	gfp &= ~OBJCGS_CLEAR_MASK;
> > > 	vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
> > > 			   slab_nid(slab));
> > 
> > But, but but, why does this incur an allocation larger than PAGE_SIZE?
> > 
> > sizeof(void *) is 8.  We have N objects allocated from the slab.  I
> > happen to know this is used for buffer_head, so:
> > 
> > buffer_head         1369   1560    104   39    1 : tunables    0    0    0 : slabdata     40     40      0
> > 
> > we get 39 objects per slab.  and we're only allocating one page per slab.
> > 39 * 8 is only 312.
> > 
> > Maybe Christoph is playing with min_slab_order or something, so we're
> > getting 8 pages per slab.  That's still only 2496 bytes.  Why are we
> > calling into the large kmalloc path?  What's really going on here?
> 
> Good question and I *guess* it's something related to Christoph's hardware
> (64k pages or something like this) - otherwise we would see it sooner.

I was wondering about that, and obviously it'd make N scale up.  But then,
we'd be able to fit more pointers in a page too.  At the end of the day,
8 < 104.  Even if we go to order-3, 64 < 104.  If Christoph is playing
with min_slab_order=4, we'd see it ... but that's a really big change,
and I don't think it would justify this patch, let alone cc'ing stable.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-07 18:05     ` Roman Gushchin
  2023-11-07 18:18       ` Shakeel Butt
@ 2023-11-08 10:33       ` Michal Hocko
  2023-11-09  6:37         ` Shakeel Butt
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2023-11-08 10:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christoph Lameter, Matthew Wilcox, linux-mm, cgroups,
	Shakeel Butt

On Tue 07-11-23 10:05:24, Roman Gushchin wrote:
> On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> > Right... Well, let's add the cgroup folks to this.
> 
> Hello!
> 
> I think it's the best thing we can do now. Thoughts?
> 
> >From 5ed3e88f4f052b6ce8dbec0545dfc80eb7534a1a Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <roman.gushchin@linux.dev>
> Date: Tue, 7 Nov 2023 09:18:02 -0800
> Subject: [PATCH] mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
> 
> Objcg vectors attached to slab pages to store slab object ownership
> information are allocated using gfp flags for the original slab
> allocation. Depending on slab page order and the size of slab objects,
> objcg vector can take several pages.
> 
> If the original allocation was done with the __GFP_NOFAIL flag, it
> triggered a warning in the page allocation code. Indeed, order > 1
> pages should not been allocated with the __GFP_NOFAIL flag.
> 
> Fix this by simple dropping the __GFP_NOFAIL flag when allocating
> the objcg vector. It effectively allows to skip the accounting of a
> single slab object under a heavy memory pressure.

It would be really good to describe what happens if the memcg metadata
allocation fails. AFAICS both callers of memcg_alloc_slab_cgroups -
memcg_slab_post_alloc_hook and account_slab - will simply skip the
accounting, which is rather curious but probably tolerable (does this
allow a runaway from memcg limits?). If that is intended then it should
be documented so that new users do not get it wrong. We do not want the
error to ever propagate down to the allocator caller, which doesn't
expect it.

Btw. if the large allocation is really necessary, which hasn't been
explained so far AFAIK, would vmalloc fallback be an option?
 
> An alternative would be to implement the mechanism to fallback to
> order-0 allocations for accounting metadata, which is also not perfect
> because it will increase performance penalty and memory footprint
> of the kernel memory accounting under memory pressure.
> 
> Reported-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>  mm/memcontrol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 774bd6e21e27..1c1061df9cd1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2936,7 +2936,8 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>   * Moreover, it should not come from DMA buffer and is not readily
>   * reclaimable. So those GFP bits should be masked off.
>   */
> -#define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
> +#define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | \
> +				 __GFP_ACCOUNT | __GFP_NOFAIL)
>  
>  /*
>   * mod_objcg_mlstate() may be called with irq enabled, so
> -- 
> 2.42.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-08 10:33       ` Michal Hocko
@ 2023-11-09  6:37         ` Shakeel Butt
  2023-11-09 17:36           ` Roman Gushchin
  0 siblings, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2023-11-09  6:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Christoph Lameter, Matthew Wilcox, linux-mm,
	cgroups

On Wed, Nov 8, 2023 at 2:33 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 07-11-23 10:05:24, Roman Gushchin wrote:
> > On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> > > Right... Well, let's add the cgroup folks to this.
> >
> > Hello!
> >
> > I think it's the best thing we can do now. Thoughts?
> >
> > >From 5ed3e88f4f052b6ce8dbec0545dfc80eb7534a1a Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <roman.gushchin@linux.dev>
> > Date: Tue, 7 Nov 2023 09:18:02 -0800
> > Subject: [PATCH] mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
> >
> > Objcg vectors attached to slab pages to store slab object ownership
> > information are allocated using gfp flags for the original slab
> > allocation. Depending on slab page order and the size of slab objects,
> > objcg vector can take several pages.
> >
> > If the original allocation was done with the __GFP_NOFAIL flag, it
> > triggered a warning in the page allocation code. Indeed, order > 1
> > pages should not been allocated with the __GFP_NOFAIL flag.
> >
> > Fix this by simple dropping the __GFP_NOFAIL flag when allocating
> > the objcg vector. It effectively allows to skip the accounting of a
> > single slab object under a heavy memory pressure.
>
> It would be really good to describe what happens if the memcg metadata
> allocation fails. AFAICS both callers of memcg_alloc_slab_cgroups -
> memcg_slab_post_alloc_hook and account_slab - will simply skip the
> accounting, which is rather curious but probably tolerable (does this
> allow a runaway from memcg limits?). If that is intended then it should
> be documented so that new users do not get it wrong. We do not want the
> error to ever propagate down to the allocator caller, which doesn't
> expect it.

The memcg metadata allocation failure is a situation kind of similar
to how we used to have per-memcg kmem caches for accounting slab
memory. The first allocation from a memcg triggers kmem cache creation
and lets the allocation pass through.

>
> Btw. if the large allocation is really necessary, which hasn't been
> explained so far AFAIK, would vmalloc fallback be an option?
>

For this specific scenario, large allocation is kind of unexpected,
like a large (multi-order) slab having tiny objects. Roman, do you
know the slab settings where this failure occurs?

Anyways, I think kvmalloc is a better option. Most of the time we
should have an order-0 allocation here, and for weird settings we fall
back to vmalloc.
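
Something along these lines (an untested sketch of the idea, not a
tested patch; the matching kfree() of the objcg vector on the free path
would have to become kvfree() as well):

        gfp &= ~OBJCGS_CLEAR_MASK;
        /* fall back to vmalloc when the vector doesn't fit a kmalloc bucket */
        vec = kvzalloc_node(array_size(objects, sizeof(struct obj_cgroup *)),
                            gfp, slab_nid(slab));
        if (!vec)
                return -ENOMEM;

One caveat: kvmalloc-style allocations only fall back to vmalloc for
GFP_KERNEL-compatible requests, so non-blocking slab allocations would
still behave like plain kmalloc here.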


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-09  6:37         ` Shakeel Butt
@ 2023-11-09 17:36           ` Roman Gushchin
  0 siblings, 0 replies; 15+ messages in thread
From: Roman Gushchin @ 2023-11-09 17:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Christoph Lameter, Matthew Wilcox, linux-mm,
	cgroups

On Wed, Nov 08, 2023 at 10:37:00PM -0800, Shakeel Butt wrote:
> On Wed, Nov 8, 2023 at 2:33 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 07-11-23 10:05:24, Roman Gushchin wrote:
> > > On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> > > > Right... Well, let's add the cgroup folks to this.
> > >
> > > Hello!
> > >
> > > I think it's the best thing we can do now. Thoughts?
> > >
> > > >From 5ed3e88f4f052b6ce8dbec0545dfc80eb7534a1a Mon Sep 17 00:00:00 2001
> > > From: Roman Gushchin <roman.gushchin@linux.dev>
> > > Date: Tue, 7 Nov 2023 09:18:02 -0800
> > > Subject: [PATCH] mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
> > >
> > > Objcg vectors attached to slab pages to store slab object ownership
> > > information are allocated using gfp flags for the original slab
> > > allocation. Depending on slab page order and the size of slab objects,
> > > objcg vector can take several pages.
> > >
> > > If the original allocation was done with the __GFP_NOFAIL flag, it
> > > triggered a warning in the page allocation code. Indeed, order > 1
> > > pages should not been allocated with the __GFP_NOFAIL flag.
> > >
> > > Fix this by simple dropping the __GFP_NOFAIL flag when allocating
> > > the objcg vector. It effectively allows to skip the accounting of a
> > > single slab object under a heavy memory pressure.
> >
> > It would be really good to describe what happens if the memcg metadata
> > allocation fails. AFAICS both callers of memcg_alloc_slab_cgroups -
> > memcg_slab_post_alloc_hook and account_slab - will simply skip the
> > accounting, which is rather curious but probably tolerable (does this
> > allow a runaway from memcg limits?). If that is intended then it should
> > be documented so that new users do not get it wrong. We do not want the
> > error to ever propagate down to the allocator caller, which doesn't
> > expect it.
> 
> The memcg metadata allocation failure is a situation kind of similar
> to how we used to have per-memcg kmem caches for accounting slab
> memory. The first allocation from a memcg triggers kmem cache creation
> and lets the allocation pass through.
> 
> >
> > Btw. if the large allocation is really necessary, which hasn't been
> > explained so far AFAIK, would vmalloc fallback be an option?
> >
> 
> For this specific scenario, large allocation is kind of unexpected,
> like a large (multi-order) slab having tiny objects. Roman, do you
> know the slab settings where this failure occurs?

No, I hope Christoph will shed some light here.

> Anyways, I think kvmalloc is a better option. Most of the time we
> should have an order-0 allocation here, and for weird settings we fall
> back to vmalloc.

I'm not sure about kvmalloc, because it's not fast.
I think the better option would be to force the slab allocator to fall back
to order-0 pages. Theoretically, we wouldn't even need to free and re-allocate
slab objects, but could break the slab folio into pages and release all but the
first page.

But I'd like to learn more about the use case before committing any time
into this effort.

Thanks!


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-07 19:24     ` Matthew Wilcox
  2023-11-07 21:33       ` Roman Gushchin
@ 2023-11-10 13:38       ` Matthew Wilcox
  2023-11-13 19:48         ` Christoph Lameter
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-11-10 13:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Roman Gushchin, linux-mm, cgroups

On Tue, Nov 07, 2023 at 07:24:08PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 06, 2023 at 06:57:05PM -0800, Christoph Lameter wrote:
> > Right... Well, let's add the cgroup folks to this.
> > 
> > The code that simply uses the GFP_NOFAIL to allocate cgroup metadata using
> > an order > 1:
> > 
> > int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> > 				 gfp_t gfp, bool new_slab)
> > {
> > 	unsigned int objects = objs_per_slab(s, slab);
> > 	unsigned long memcg_data;
> > 	void *vec;
> > 
> > 	gfp &= ~OBJCGS_CLEAR_MASK;
> > 	vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
> > 			   slab_nid(slab));
> 
> But, but but, why does this incur an allocation larger than PAGE_SIZE?
> 
> sizeof(void *) is 8.  We have N objects allocated from the slab.  I
> happen to know this is used for buffer_head, so:
> 
> buffer_head         1369   1560    104   39    1 : tunables    0    0    0 : slabdata     40     40      0
> 
> we get 39 objects per slab.  and we're only allocating one page per slab.
> 39 * 8 is only 312.
> 
> Maybe Christoph is playing with min_slab_order or something, so we're
> getting 8 pages per slab.  That's still only 2496 bytes.  Why are we
> calling into the large kmalloc path?  What's really going on here?

Christoph?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-10 13:38       ` Matthew Wilcox
@ 2023-11-13 19:48         ` Christoph Lameter
  2023-11-13 22:48           ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2023-11-13 19:48 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Roman Gushchin, linux-mm, cgroups

On Fri, 10 Nov 2023, Matthew Wilcox wrote:

>> Maybe Christoph is playing with min_slab_order or something, so we're
>> getting 8 pages per slab.  That's still only 2496 bytes.  Why are we
>> calling into the large kmalloc path?  What's really going on here?
>
> Christoph?

Sorry I thought I already answered that.

This was a boot with slub_min_order=5 that was inadvertently 
left in from a performance test.
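
(For reference, assuming 4KB base pages, slub_min_order=5 makes every slab
at least order-5, i.e. 128KB. For the 104-byte buffer_head cache that is
roughly 1260 objects per slab, so the objcg vector needs about 1260 * 8,
or ~10KB -- too big for the kmalloc slab caches, so it becomes an order-2
page allocation, hence the order > 1 warning.)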



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-13 19:48         ` Christoph Lameter
@ 2023-11-13 22:48           ` Matthew Wilcox
  2023-11-14 17:29             ` Roman Gushchin
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-11-13 22:48 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Roman Gushchin, linux-mm, cgroups

On Mon, Nov 13, 2023 at 11:48:57AM -0800, Christoph Lameter wrote:
> On Fri, 10 Nov 2023, Matthew Wilcox wrote:
> 
> > > Maybe Christoph is playing with min_slab_order or something, so we're
> > > getting 8 pages per slab.  That's still only 2496 bytes.  Why are we
> > > calling into the large kmalloc path?  What's really going on here?
> > 
> > Christoph?
> 
> Sorry I thought I already answered that.
> 
> This was a boot with slub_min_order=5 that was inadvertently left in from a
> performance test.

Ah.  So do you think we need to fix this?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL)
  2023-11-13 22:48           ` Matthew Wilcox
@ 2023-11-14 17:29             ` Roman Gushchin
  0 siblings, 0 replies; 15+ messages in thread
From: Roman Gushchin @ 2023-11-14 17:29 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Lameter, linux-mm, cgroups

On Mon, Nov 13, 2023 at 10:48:39PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 13, 2023 at 11:48:57AM -0800, Christoph Lameter wrote:
> > On Fri, 10 Nov 2023, Matthew Wilcox wrote:
> > 
> > > > Maybe Christoph is playing with min_slab_order or something, so we're
> > > > getting 8 pages per slab.  That's still only 2496 bytes.  Why are we
> > > > calling into the large kmalloc path?  What's really going on here?
> > > 
> > > Christoph?
> > 
> > Sorry I thought I already answered that.
> > 
> > This was a boot with slub_min_order=5 that was inadvertently left in from a
> > performance test.
> 
> Ah.  So do you think we need to fix this?

I'd leave the fix in, so that we don't have to look into this "problem" next time.
But I'm not inclined to work on a proper fix for the slab accounting in this
non-trivial setup. Maybe we should add a note to the documentation saying that
raising slub_min_order might affect slab accounting precision?


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-11-14 17:29 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-01  0:13 folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL Christoph Lameter (Ampere)
2023-11-01  8:08 ` Matthew Wilcox
2023-11-07  2:57   ` cgroups: warning for metadata allocation with GFP_NOFAIL (was Re: folio_alloc_buffers() doing allocations > order 1 with GFP_NOFAIL) Christoph Lameter
2023-11-07 18:05     ` Roman Gushchin
2023-11-07 18:18       ` Shakeel Butt
2023-11-08 10:33       ` Michal Hocko
2023-11-09  6:37         ` Shakeel Butt
2023-11-09 17:36           ` Roman Gushchin
2023-11-07 19:24     ` Matthew Wilcox
2023-11-07 21:33       ` Roman Gushchin
2023-11-07 21:37         ` Matthew Wilcox
2023-11-10 13:38       ` Matthew Wilcox
2023-11-13 19:48         ` Christoph Lameter
2023-11-13 22:48           ` Matthew Wilcox
2023-11-14 17:29             ` Roman Gushchin
