[PATCH] bpf: Try harder when allocating memory for maps

* [PATCH] bpf: Try harder when allocating memory for maps
@ 2019-03-08  8:08 Martynas Pumputis
  2019-03-08  8:44 ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Martynas Pumputis @ 2019-03-08  8:08 UTC (permalink / raw)
  To: bpf; +Cc: ast, daniel, mhocko, m

It has been observed that sometimes memory allocation for BPF maps
fails when there is no obvious memory pressure in a system.

E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
could not be created due to vmalloc being unable to allocate 75497472B,
when the system's memory consumption (in MB) was the following:

    Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727

Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by
__GFP_RETRY_MAYFAIL with more useful semantic") we can replace
__GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer
and will try harder to fulfil allocation requests.

The change has been tested with the workloads mentioned above and by
observing oom_kill value from /proc/vmstat.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
---
 kernel/bpf/syscall.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 62f6bced3a3c..eb5cefe44af3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)

 void *bpf_map_area_alloc(size_t size, int numa_node)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
+	/* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so
+	 * OOM killer doesn't trigger under memory pressure as we really
+	 * just want to fail instead.
 	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
+	const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO;
 	void *area;

 	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-- 
2.21.0
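
The figures in the commit message can be checked with a little arithmetic. The sketch below is a userspace illustration only, not kernel code: it assumes 4 KiB pages (an assumption, not stated in the thread) and uses PAGE_ALLOC_COSTLY_ORDER = 3 as in the kernel sources. The helper names are hypothetical. It shows that the failed request is 72 MiB, far above the size limit tested in bpf_map_area_alloc(), and that it works out to 144 bytes per element, more than key + value alone; the remainder is presumably per-element bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption (not from the thread): 4 KiB pages, as on typical x86-64. */
#define SKETCH_PAGE_SIZE 4096UL
/* PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel sources. */
#define SKETCH_COSTLY_ORDER 3

/* Value of the limit tested by the `size <= ...` check in the patch context. */
static inline size_t kmalloc_first_limit(void)
{
	return SKETCH_PAGE_SIZE << SKETCH_COSTLY_ORDER;
}

/* Returns 1 when a request is above that limit, 0 otherwise. */
static inline int above_kmalloc_limit(size_t size)
{
	return size > kmalloc_first_limit();
}

/* Bytes per element implied by a total allocation size and element count. */
static inline size_t bytes_per_elem(size_t total, size_t max_elems)
{
	assert(max_elems != 0);
	return total / max_elems;
}
```

With the numbers from the report, above_kmalloc_limit(75497472) is true and bytes_per_elem(75497472, 524288) is 144.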

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08  8:08 [PATCH] bpf: Try harder when allocating memory for maps Martynas Pumputis
@ 2019-03-08  8:44 ` Michal Hocko
  2019-03-08 10:33   ` Daniel Borkmann
  2019-03-08 11:14   ` Martynas Pumputis
  0 siblings, 2 replies; 11+ messages in thread
From: Michal Hocko @ 2019-03-08  8:44 UTC (permalink / raw)
  To: Martynas Pumputis; +Cc: bpf, ast, daniel

On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
> It has been observed that sometimes memory allocation for BPF maps
> fails when there is no obvious memory pressure in a system.
> 
> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
> could not be created due to vmalloc being unable to allocate 75497472B,
> when the system's memory consumption (in MB) was the following:
> 
>     Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727

Hmm 75MB is quite large and much larger than the slab/page allocator
can provide so this is not really a fragmentation issue. Vmalloc does
respect noretry but considering that there shouldn't be a large memory
pressure I wonder how NORETRY managed to fail the allocation. Do you
happen to have the allocation failure report?

Btw. is there any real reason to opencode and duplicate kvmalloc logic
here? In other words why not simply make bpf_map_area_alloc use
kvmalloc_node with GFP_KERNEL?
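
For readers unfamiliar with the kvmalloc pattern referred to here: the caller asks for memory once, and the helper first tries the physically contiguous (slab/page) allocator and falls back to vmalloc if that fails. In the kernel the suggestion would presumably boil down to a call like kvmalloc_node(size, GFP_KERNEL | __GFP_ZERO, numa_node). The block below is only a userspace model of that fallback shape; every name in it is hypothetical, calloc() stands in for both allocators, and the 32 KiB limit is an illustrative assumption rather than the kernel's actual cutoff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Which (modelled) allocator satisfied the request. */
enum alloc_path { PATH_NONE, PATH_SLAB, PATH_VMALLOC };

struct area {
	void *ptr;
	enum alloc_path path;
};

/* Hypothetical cutoff above which the "slab" model refuses the request. */
#define MODEL_SLAB_LIMIT (4096UL << 3)

/* Stand-in for kmalloc: only serves small, contiguous requests. */
static inline void *model_slab_alloc(size_t size)
{
	if (size > MODEL_SLAB_LIMIT)
		return NULL;
	return calloc(1, size);
}

/* Stand-in for vmalloc: serves any size the C library can provide. */
static inline void *model_vmalloc(size_t size)
{
	return calloc(1, size);
}

/* kvmalloc-style helper: try the contiguous path, then fall back. */
static inline struct area model_kvzalloc(size_t size)
{
	struct area a = { NULL, PATH_NONE };

	a.ptr = model_slab_alloc(size);
	if (a.ptr) {
		a.path = PATH_SLAB;
		return a;
	}

	a.ptr = model_vmalloc(size);
	if (a.ptr)
		a.path = PATH_VMALLOC;
	return a;
}

/* One free routine regardless of which path was taken, like kvfree(). */
static inline void model_kvfree(struct area *a)
{
	free(a->ptr);
	a->ptr = NULL;
	a->path = PATH_NONE;
}
```

The point of the design is that callers such as bpf_map_area_alloc() would not need their own size check and two allocation calls; the helper hides the choice of path.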


-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08  8:44 ` Michal Hocko
@ 2019-03-08 10:33   ` Daniel Borkmann
  2019-03-08 10:55     ` Michal Hocko
  2019-03-08 11:14   ` Martynas Pumputis
  1 sibling, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2019-03-08 10:33 UTC (permalink / raw)
  To: Michal Hocko, Martynas Pumputis; +Cc: bpf, ast

On 03/08/2019 09:44 AM, Michal Hocko wrote:
> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:

Martynas, for the patch, please also Cc netdev in the submission so
that it lands properly in patchwork. Setup where patches only Cc'ed
to bpf@vger.kernel.org would land in our delegate is not yet completed
by ozlabs folks, just fyi.

>> It has been observed that sometimes memory allocation for BPF maps
>> fails when there is no obvious memory pressure in a system.
>>
>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
>> could not be created due to vmalloc being unable to allocate 75497472B,
>> when the system's memory consumption (in MB) was the following:
>>
>>     Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
> 
> Hmm 75MB is quite large and much larger than the slab/page allocator
> can provide so this is not really a fragmentation issue. Vmalloc does

Agree.

> respect noretry but considering that there shouldn't be a large memory
> pressure I wonder how NORETRY managed to fail the allocation. Do you
> happen to have the allocation failure report?

I'll defer to Martynas here.

> Btw. is there any real reason to opencode and duplicate kvmalloc logic
> here? In other words why not simply make bpf_map_area_alloc use
> kvmalloc_node with GFP_KERNEL?

Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer
under pressure with map alloc"). I remember back then we had a discussion
that __GFP_NORETRY is not fully supported and should only be seen as a hint
in our case (since it's not propagated all the way through in vmalloc, if
I recall correctly). And looking at kvmalloc_node(), __GFP_NORETRY is only
really set in case of kmalloc attempts. Given these alloc requests for maps
can often be large in size, what we really want is something that ideally under
*no* circumstances oom killer would trigger as that is way too disruptive. So
instead, allocation should just fail and bpf loader or whatnot can deal with
it. Looks like __GFP_RETRY_MAYFAIL would be better suited wrt OOM for both
allocators and would allow to reuse kvmalloc though it would try much harder
than __GFP_NORETRY. Ideally something like GFP_KERNEL | __GFP_NOWARN |
__GFP_NOOOM | __GFP_ZERO would be nice to have, semantics of __GFP_RETRY_MAYFAIL
kind of gets closer to it from looking at dcda9b0471.


* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08 10:33   ` Daniel Borkmann
@ 2019-03-08 10:55     ` Michal Hocko
  2019-03-08 11:30       ` Daniel Borkmann
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2019-03-08 10:55 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Martynas Pumputis, bpf, ast

On Fri 08-03-19 11:33:00, Daniel Borkmann wrote:
> On 03/08/2019 09:44 AM, Michal Hocko wrote:
> > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
> 
> Martynas, for the patch, please also Cc netdev in the submission so
> that it lands properly in patchwork. Setup where patches only Cc'ed
> to bpf@vger.kernel.org would land in our delegate is not yet completed
> by ozlabs folks, just fyi.
> 
> >> It has been observed that sometimes memory allocation for BPF maps
> >> fails when there is no obvious memory pressure in a system.
> >>
> >> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
> >> could not be created due to vmalloc being unable to allocate 75497472B,
> >> when the system's memory consumption (in MB) was the following:
> >>
> >>     Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
> > 
> > Hmm 75MB is quite large and much larger than the slab/page allocator
> > can provide so this is not really a fragmentation issue. Vmalloc does
> 
> Agree.
> 
> > respect noretry but considering that there shouldn't be a large memory
> > pressure I wonder how NORETRY managed to fail the allocation. Do you
> > happen to have the allocation failure report?
> 
> I'll defer to Martynas here.
> 
> > Btw. is there any real reason to opencode and duplicate kvmalloc logic
> > here? In other words why not simply make bpf_map_area_alloc use
> > kvmalloc_node with GFP_KERNEL?
> 
> Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer
> under pressure with map alloc"). I remember back then we had a discussion
> that __GFP_NORETRY is not fully supported and should only be seen as a hint
> in our case (since it's not propagated all the way through in vmalloc, if
> I recall correctly).

Yes, that is still the case and there is no way to really have nooom
semantic for vmalloc. Even with your opencoded version btw.

> And looking at kvmalloc_node(), __GFP_NORETRY is only
> really set in case of kmalloc attempts. Given these alloc requests for maps
> can often be large in size, what we really want is something that ideally under
> *no* circumstances oom killer would trigger as that is way too disruptive.

That is not really possible. Even if you do not trigger the OOM killer
directly, some concurrent allocation might do that because your
particular one has eaten the remaining memory.

> So
> instead, allocation should just fail and bpf loader or whatnot can deal with
> it. Looks like __GFP_RETRY_MAYFAIL would be better suited wrt OOM for both
> allocators and would allow to reuse kvmalloc though it would try much harder
> than __GFP_NORETRY.

Yes.

> Ideally something like GFP_KERNEL | __GFP_NOWARN |
> __GFP_NOOOM | __GFP_ZERO would be nice to have, semantics of __GFP_RETRY_MAYFAIL
> kind of gets closer to it from looking at dcda9b0471.

A NOOOM semantic is simply impossible to make sensible, as explained
above.


-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08 10:55     ` Michal Hocko
@ 2019-03-08 11:30       ` Daniel Borkmann
  2019-03-08 12:00         ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2019-03-08 11:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Martynas Pumputis, bpf, ast

On 03/08/2019 11:55 AM, Michal Hocko wrote:
> On Fri 08-03-19 11:33:00, Daniel Borkmann wrote:
>> On 03/08/2019 09:44 AM, Michal Hocko wrote:
>>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
>>
>> Martynas, for the patch, please also Cc netdev in the submission so
>> that it lands properly in patchwork. Setup where patches only Cc'ed
>> to bpf@vger.kernel.org would land in our delegate is not yet completed
>> by ozlabs folks, just fyi.
>>
>>>> It has been observed that sometimes memory allocation for BPF maps
>>>> fails when there is no obvious memory pressure in a system.
>>>>
>>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
>>>> could not be created due to vmalloc being unable to allocate 75497472B,
>>>> when the system's memory consumption (in MB) was the following:
>>>>
>>>>     Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
>>>
>>> Hmm 75MB is quite large and much larger than the slab/page allocator
>>> can provide so this is not really a fragmentation issue. Vmalloc does
>>
>> Agree.
>>
>>> respect noretry but considering that there shouldn't be a large memory
>>> pressure I wonder how NORETRY managed to fail the allocation. Do you
>>> happen to have the allocation failure report?
>>
>> I'll defer to Martynas here.
>>
>>> Btw. is there any real reason to opencode and duplicate kvmalloc logic
>>> here? In other words why not simply make bpf_map_area_alloc use
>>> kvmalloc_node with GFP_KERNEL?
>>
>> Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer
>> under pressure with map alloc"). I remember back then we had a discussion
>> that __GFP_NORETRY is not fully supported and should only be seen as a hint
>> in our case (since it's not propagated all the way through in vmalloc, if
>> I recall correctly).
> 
> Yes, that is still the case and there is no way to really have nooom
> semantic for vmalloc. Even with your opencoded version btw.

Okay, so similar situation applies to __GFP_RETRY_MAYFAIL reclaim modifier
then if I understand you correctly? In dcda9b0471, it's mentioned "this
means that all the reclaim opportunities have been exhausted except the
most disruptive one (the OOM killer) and a user defined fallback behavior
is more sensible than keep retrying in the page allocator." In the api
comment in kvmalloc_node(), it says "__GFP_RETRY_MAYFAIL is supported,
and it should be used only if kmalloc is preferable to the vmalloc fallback,
due to visible performance drawbacks". But if you say above that for
vmalloc, there is no nooom semantic, then presumably __GFP_RETRY_MAYFAIL
should also be taken as a hint wrt OOM, not a guarantee for the given
allocation request (if kvmalloc_node selects the vmalloc based allocator),
is that correct?

>> And looking at kvmalloc_node(), __GFP_NORETRY is only
>> really set in case of kmalloc attempts. Given these alloc requests for maps
>> can often be large in size, what we really want is something that ideally under
>> *no* circumstances oom killer would trigger as that is way too disruptive.
> 
> That is not really possible. Even if you do not trigger the OOM killer
> directly, some concurrent allocation might do that because your
> particular one has eaten the remaining memory.

Yes, in that situation there's not much we can do with either modifier.

>> So
>> instead, allocation should just fail and bpf loader or whatnot can deal with
>> it. Looks like __GFP_RETRY_MAYFAIL would be better suited wrt OOM for both
>> allocators and would allow to reuse kvmalloc though it would try much harder
>> than __GFP_NORETRY.
> 
> Yes.
> 
>> Ideally something like GFP_KERNEL | __GFP_NOWARN |
>> __GFP_NOOOM | __GFP_ZERO would be nice to have, semantics of __GFP_RETRY_MAYFAIL
>> kind of gets closer to it from looking at dcda9b0471.
> 
> A NOOOM semantic is simply impossible to make sensible, as explained
> above.
> 

* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08 11:30       ` Daniel Borkmann
@ 2019-03-08 12:00         ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2019-03-08 12:00 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Martynas Pumputis, bpf, ast

On Fri 08-03-19 12:30:47, Daniel Borkmann wrote:
> On 03/08/2019 11:55 AM, Michal Hocko wrote:
> > On Fri 08-03-19 11:33:00, Daniel Borkmann wrote:
> >> On 03/08/2019 09:44 AM, Michal Hocko wrote:
> >>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
> >>
> >> Martynas, for the patch, please also Cc netdev in the submission so
> >> that it lands properly in patchwork. Setup where patches only Cc'ed
> >> to bpf@vger.kernel.org would land in our delegate is not yet completed
> >> by ozlabs folks, just fyi.
> >>
> >>>> It has been observed that sometimes memory allocation for BPF maps
> >>>> fails when there is no obvious memory pressure in a system.
> >>>>
> >>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
> >>>> could not be created due to vmalloc being unable to allocate 75497472B,
> >>>> when the system's memory consumption (in MB) was the following:
> >>>>
> >>>>     Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
> >>>
> >>> Hmm 75MB is quite large and much larger than the slab/page allocator
> >>> can provide so this is not really a fragmentation issue. Vmalloc does
> >>
> >> Agree.
> >>
> >>> respect noretry but considering that there shouldn't be a large memory
> >>> pressure I wonder how NORETRY managed to fail the allocation. Do you
> >>> happen to have the allocation failure report?
> >>
> >> I'll defer to Martynas here.
> >>
> >>> Btw. is there any real reason to opencode and duplicate kvmalloc logic
> >>> here? In other words why not simply make bpf_map_area_alloc use
> >>> kvmalloc_node with GFP_KERNEL?
> >>
> >> Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer
> >> under pressure with map alloc"). I remember back then we had a discussion
> >> that __GFP_NORETRY is not fully supported and should only be seen as a hint
> >> in our case (since it's not propagated all the way through in vmalloc, if
> >> I recall correctly).
> > 
> > Yes, that is still the case and there is no way to really have nooom
> > semantic for vmalloc. Even with your opencoded version btw.
> 
> Okay, so similar situation applies to __GFP_RETRY_MAYFAIL reclaim modifier
> then if I understand you correctly? In dcda9b0471, it's mentioned "this
> means that all the reclaim opportunities have been exhausted except the
> most disruptive one (the OOM killer) and a user defined fallback behavior
> is more sensible than keep retrying in the page allocator." In the api
> comment in kvmalloc_node(), it says "__GFP_RETRY_MAYFAIL is supported,
> and it should be used only if kmalloc is preferable to the vmalloc fallback,
> due to visible performance drawbacks". But if you say above that for
> vmalloc, there is no nooom semantic, then presumably __GFP_RETRY_MAYFAIL
> should also be taken as a hint wrt OOM, not a guarantee for the given
> allocation request (if kvmalloc_node selects the vmalloc based allocator),
> is that correct?

Yes, kvmalloc* doesn't have the full MAYFAIL/NORETRY semantic because
this is not possible to implement without reworking the whole page table
allocation code or making both behave like scoped NOFS/NOIO.

But I believe you shouldn't really have to care. Does the same problem
really happen with a plain kvmalloc(GFP_KERNEL)?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08  8:44 ` Michal Hocko
  2019-03-08 10:33   ` Daniel Borkmann
@ 2019-03-08 11:14   ` Martynas Pumputis
  2019-03-08 11:20     ` Michal Hocko
  1 sibling, 1 reply; 11+ messages in thread
From: Martynas Pumputis @ 2019-03-08 11:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: bpf, ast, daniel



On 3/8/19 9:44 AM, Michal Hocko wrote:
> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
>> It has been observed that sometimes memory allocation for BPF maps
>> fails when there is no obvious memory pressure in a system.
>>
>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
>> could not be created due to vmalloc being unable to allocate 75497472B,
>> when the system's memory consumption (in MB) was the following:
>>
>>      Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
> 
> Hmm 75MB is quite large and much larger than the slab/page allocator
> can provide so this is not really a fragmentation issue. Vmalloc does
> respect noretry but considering that there shouldn't be a large memory
> pressure I wonder how NORETRY managed to fail the allocation. Do you
> happen to have the allocation failure report?

I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has 
failed:
https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d

Let me know if more info is required, I can reproduce the failure. Thanks.

> 
> Btw. is there any real reason to opencode and duplicate kvmalloc logic
> here? In other words why not simply make bpf_map_area_alloc use
> kvmalloc_node with GFP_KERNEL?
> 

* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08 11:14   ` Martynas Pumputis
@ 2019-03-08 11:20     ` Michal Hocko
  2019-03-08 20:02       ` Martynas Pumputis
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2019-03-08 11:20 UTC (permalink / raw)
  To: Martynas Pumputis; +Cc: bpf, ast, daniel

On Fri 08-03-19 12:14:16, Martynas Pumputis wrote:
> 
> 
> On 3/8/19 9:44 AM, Michal Hocko wrote:
> > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
> > > It has been observed that sometimes memory allocation for BPF maps
> > > fails when there is no obvious memory pressure in a system.
> > > 
> > > E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
> > > could not be created due to vmalloc being unable to allocate 75497472B,
> > > when the system's memory consumption (in MB) was the following:
> > > 
> > >      Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
> > 
> > Hmm 75MB is quite large and much larger than the slab/page allocator
> > can provide so this is not really a fragmentation issue. Vmalloc does
> > respect noretry but considering that there shouldn't be a large memory
> > pressure I wonder how NORETRY managed to fail the allocation. Do you
> > happen to have the allocation failure report?
> 
> I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has
> failed:
> https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d

dmesg with the allocation failure report would be more helpful

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08 11:20     ` Michal Hocko
@ 2019-03-08 20:02       ` Martynas Pumputis
  2019-03-10  7:13         ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Martynas Pumputis @ 2019-03-08 20:02 UTC (permalink / raw)
  To: Michal Hocko; +Cc: bpf, ast, daniel



On 3/8/19 12:20 PM, Michal Hocko wrote:
> On Fri 08-03-19 12:14:16, Martynas Pumputis wrote:
>>
>>
>> On 3/8/19 9:44 AM, Michal Hocko wrote:
>>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
>>>> It has been observed that sometimes memory allocation for BPF maps
>>>> fails when there is no obvious memory pressure in a system.
>>>>
>>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
>>>> could not be created due to vmalloc being unable to allocate 75497472B,
>>>> when the system's memory consumption (in MB) was the following:
>>>>
>>>>       Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
>>>
>>> Hmm 75MB is quite large and much larger than the slab/page allocator
>>> can provide so this is not really a fragmentation issue. Vmalloc does
>>> respect noretry but considering that there shouldn't be a large memory
>>> pressure I wonder how NORETRY managed to fail the allocation. Do you
>>> happen to have the allocation failure report?
>>
>> I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has
>> failed:
>> https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d
> 
> dmesg with the allocation failure report would be more helpful

https://gist.github.com/brb/2d7ac323d2e14cb7a38bacba301fe3af

> 
> Thanks!
> 


* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-08 20:02       ` Martynas Pumputis
@ 2019-03-10  7:13         ` Michal Hocko
  2019-03-11 19:33           ` Martynas Pumputis
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2019-03-10  7:13 UTC (permalink / raw)
  To: Martynas Pumputis; +Cc: bpf, ast, daniel

On Fri 08-03-19 21:02:41, Martynas Pumputis wrote:
> 
> 
> On 3/8/19 12:20 PM, Michal Hocko wrote:
> > On Fri 08-03-19 12:14:16, Martynas Pumputis wrote:
> > > 
> > > 
> > > On 3/8/19 9:44 AM, Michal Hocko wrote:
> > > > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
> > > > > It has been observed that sometimes memory allocation for BPF maps
> > > > > fails when there is no obvious memory pressure in a system.
> > > > > 
> > > > > E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
> > > > > could not be created due to vmalloc being unable to allocate 75497472B,
> > > > > when the system's memory consumption (in MB) was the following:
> > > > > 
> > > > >       Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
> > > > 
> > > > Hmm 75MB is quite large and much larger than the slab/page allocator
> > > > can provide so this is not really a fragmentation issue. Vmalloc does
> > > > respect noretry but considering that there shouldn't be a large memory
> > > > pressure I wonder how NORETRY managed to fail the allocation. Do you
> > > > happen to have the allocation failure report?
> > > 
> > > I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has
> > > failed:
> > > https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d
> > 
> > dmesg with the allocation failure report would be more helpful
> 
> https://gist.github.com/brb/2d7ac323d2e14cb7a38bacba301fe3af

Thanks!

tc: vmalloc: allocation failure, allocated 15609856 of 62918656 bytes, mode:0x6090c0(GFP_KERNEL|__GFP_NORETRY|__GFP_ZERO), nodemask=(null),cpuset=b389e318420d891300ad9658f8e056b59972fda9547dd566245a922c34bb9e42,mems_allowed=0
[...]
Node 0 DMA free:15728kB min:268kB low:332kB high:396kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:12kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 3419 3866 3866 3866
Node 0 DMA32 free:105004kB min:104588kB low:119468kB high:134348kB active_anon:526128kB inactive_anon:612kB active_file:862524kB inactive_file:1552884kB unevictable:0kB writepending:0kB present:3653568kB managed:3563596kB mlocked:0kB kernel_stack:7592kB pagetables:6636kB bounce:0kB free_pcp:916kB local_pcp:736kB free_cma:0kB
lowmem_reserve[]: 0 0 446 446 446
Node 0 Normal free:22844kB min:24160kB low:26104kB high:28048kB active_anon:92340kB inactive_anon:228kB active_file:160072kB inactive_file:82480kB unevictable:0kB writepending:0kB present:524288kB managed:457544kB mlocked:0kB kernel_stack:2224kB pagetables:3776kB bounce:0kB free_pcp:996kB local_pcp:672kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0

Except for a strange cpuset value (which should be checked separately),
the allocation is restricted to node 0, which is pretty much out of
memory (below the min watermark once lowmem_reserve is accounted for).
There is still a lot of page cache to reclaim, so further reclaim is
quite likely to make further progress. There is still 45MB to go and
the page cache is at least 1.5G, so there is some buffer to allocate
from.
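
The figures above can be re-derived from the report with a small
userspace sketch. It is only an approximation of the allocator's
watermark check, under two assumptions that the report does not state:
pages are 4 kB, and a GFP_KERNEL request is checked against
lowmem_reserve[2] (the entry for the Normal zone). The struct and
function names are illustrative, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_KB 4L /* assumption: 4 kB pages */

/* One "Node 0 <zone>" line from the failure report (illustrative type). */
struct zone_report {
	const char *name;
	long free_kb;
	long min_kb;
	long reserve_pages; /* lowmem_reserve[2], assumed to apply to GFP_KERNEL */
};

/* Values copied from the report quoted above. */
const struct zone_report node0[] = {
	{ "DMA",     15728,    268, 3866 },
	{ "DMA32",  105004, 104588,  446 },
	{ "Normal",  22844,  24160,    0 },
};

/* Min watermark plus the reserve held back from this request, in kB. */
long min_plus_reserve_kb(const struct zone_report *z)
{
	assert(z->reserve_pages >= 0);
	return z->min_kb + z->reserve_pages * MODEL_PAGE_KB;
}

/* True when the zone cannot satisfy the request without reclaim. */
bool below_min_watermark(const struct zone_report *z)
{
	return z->free_kb <= min_plus_reserve_kb(z);
}

/* Bytes vmalloc still had to obtain when it gave up. */
long remaining_bytes(long requested, long allocated)
{
	return requested - allocated;
}
```

Under those assumptions all three zones come out below the threshold
(DMA: 15728 kB free against 15732 kB, DMA32: 105004 kB against
106372 kB, Normal: 22844 kB against 24160 kB), and
remaining_bytes(62918656, 15609856) is 47308800 bytes, roughly the 45MB
mentioned above.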

That being said, __GFP_NORETRY did indeed cause a premature failure.
Using kvmalloc(GFP_KERNEL|__GFP_RETRY_MAYFAIL) would likely help here
unless the page cache is really hard to reclaim. Please note that this
also implies that requests which can be satisfied from the slab
allocator will retry harder as well. I am not sure this is desirable
for those requests, but your original patch does the same. If you
wanted __GFP_RETRY_MAYFAIL behavior only for the vmalloc path, you
would need an open-coded version which adds the flag just to the
vmalloc fallback path.
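
The open-coded variant described above can be modelled in userspace as
follows. This is a sketch of the flag selection only, not kernel code
and not necessarily what the resubmitted patch does: the MODEL_* bit
values are invented for illustration, and the slab cut-off is assumed
to be PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER with 4 kB pages and a costly
order of 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for gfp bits; the values are NOT the kernel's. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_NOWARN        (1u << 0)
#define MODEL_GFP_ZERO          (1u << 1)
#define MODEL_GFP_NORETRY       (1u << 2)
#define MODEL_GFP_RETRY_MAYFAIL (1u << 3)

#define MODEL_PAGE_SIZE    4096u /* assumption: 4 kB pages */
#define MODEL_COSTLY_ORDER 3u    /* assumption: PAGE_ALLOC_COSTLY_ORDER */

enum model_path { MODEL_PATH_SLAB, MODEL_PATH_VMALLOC };

/* Small requests are first tried from the slab allocator. */
bool model_tries_slab_first(size_t size)
{
	return size <= ((size_t)MODEL_PAGE_SIZE << MODEL_COSTLY_ORDER);
}

/*
 * Slab attempts keep NORETRY so they fail fast; only the vmalloc
 * fallback gets RETRY_MAYFAIL, so it reclaims harder but still fails
 * instead of invoking the OOM killer.
 */
model_gfp_t model_flags_for(enum model_path path)
{
	const model_gfp_t common = MODEL_GFP_NOWARN | MODEL_GFP_ZERO;

	switch (path) {
	case MODEL_PATH_SLAB:
		return common | MODEL_GFP_NORETRY;
	case MODEL_PATH_VMALLOC:
		return common | MODEL_GFP_RETRY_MAYFAIL;
	}
	assert(!"unknown allocation path");
	return common;
}
```

With these assumptions the 62918656-byte request from the report never
takes the slab path, so only the vmalloc flags matter for it, while a
single-page request would still be attempted with NORETRY first.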
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] bpf: Try harder when allocating memory for maps
  2019-03-10  7:13         ` Michal Hocko
@ 2019-03-11 19:33           ` Martynas Pumputis
  0 siblings, 0 replies; 11+ messages in thread
From: Martynas Pumputis @ 2019-03-11 19:33 UTC (permalink / raw)
  To: Michal Hocko; +Cc: bpf, ast, daniel



On 3/10/19 8:13 AM, Michal Hocko wrote:
> On Fri 08-03-19 21:02:41, Martynas Pumputis wrote:
>>
>>
>> On 3/8/19 12:20 PM, Michal Hocko wrote:
>>> On Fri 08-03-19 12:14:16, Martynas Pumputis wrote:
>>>>
>>>>
>>>> On 3/8/19 9:44 AM, Michal Hocko wrote:
>>>>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
>>>>>> It has been observed that sometimes memory allocation for BPF maps
>>>>>> fails when there is no obvious memory pressure in a system.
>>>>>>
>>>>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
>>>>>> could not be created due to vmalloc being unable to allocate 75497472B,
>>>>>> when the system's memory consumption (in MB) was the following:
>>>>>>
>>>>>>        Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
>>>>>
>>>>> Hmm 75MB is quite large and much larger than the slab/page allocator
>>>>> can provide so this is not really a fragmentation issue. Vmalloc does
>>>>> respect noretry but considering that there shouldn't be a large memory
>>>>> pressure I wonder how NORETRY managed to fail the allocation. Do you
>>>>> happen to have the allocation failure report?
>>>>
>>>> I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has
>>>> failed:
>>>> https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d
>>>
>>> dmesg with the allocation failure report would be more helpful
>>
>> https://gist.github.com/brb/2d7ac323d2e14cb7a38bacba301fe3af
> 
> Thanks!
> 
> tc: vmalloc: allocation failure, allocated 15609856 of 62918656 bytes, mode:0x6090c0(GFP_KERNEL|__GFP_NORETRY|__GFP_ZERO), nodemask=(null),cpuset=b389e318420d891300ad9658f8e056b59972fda9547dd566245a922c34bb9e42,mems_allowed=0
> [...]
> Node 0 DMA free:15728kB min:268kB low:332kB high:396kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:12kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 3419 3866 3866 3866
> Node 0 DMA32 free:105004kB min:104588kB low:119468kB high:134348kB active_anon:526128kB inactive_anon:612kB active_file:862524kB inactive_file:1552884kB unevictable:0kB writepending:0kB present:3653568kB managed:3563596kB mlocked:0kB kernel_stack:7592kB pagetables:6636kB bounce:0kB free_pcp:916kB local_pcp:736kB free_cma:0kB
> lowmem_reserve[]: 0 0 446 446 446
> Node 0 Normal free:22844kB min:24160kB low:26104kB high:28048kB active_anon:92340kB inactive_anon:228kB active_file:160072kB inactive_file:82480kB unevictable:0kB writepending:0kB present:524288kB managed:457544kB mlocked:0kB kernel_stack:2224kB pagetables:3776kB bounce:0kB free_pcp:996kB local_pcp:672kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0 0
> 
> Except for a strange cpuset value (which should be checked separately),
> the allocation is restricted to node 0, which is pretty much out of
> memory (below the min watermark once lowmem_reserve is accounted for).
> There is still a lot of page cache to reclaim, so further reclaim is
> quite likely to make further progress. There is still 45MB to go and
> the page cache is at least 1.5G, so there is some buffer to allocate
> from.
> 
> That being said, __GFP_NORETRY did indeed cause a premature failure.
> Using kvmalloc(GFP_KERNEL|__GFP_RETRY_MAYFAIL) would likely help here
> unless the page cache is really hard to reclaim. Please note that this
> also implies that requests which can be satisfied from the slab
> allocator will retry harder as well. I am not sure this is desirable
> for those requests, but your original patch does the same. If you
> wanted __GFP_RETRY_MAYFAIL behavior only for the vmalloc path, you
> would need an open-coded version which adds the flag just to the
> vmalloc fallback path.

Thanks a lot for the analysis. I've re-submitted the patch.


Thread overview: 11+ messages
2019-03-08  8:08 [PATCH] bpf: Try harder when allocating memory for maps Martynas Pumputis
2019-03-08  8:44 ` Michal Hocko
2019-03-08 10:33   ` Daniel Borkmann
2019-03-08 10:55     ` Michal Hocko
2019-03-08 11:30       ` Daniel Borkmann
2019-03-08 12:00         ` Michal Hocko
2019-03-08 11:14   ` Martynas Pumputis
2019-03-08 11:20     ` Michal Hocko
2019-03-08 20:02       ` Martynas Pumputis
2019-03-10  7:13         ` Michal Hocko
2019-03-11 19:33           ` Martynas Pumputis
