* [PATCH] bpf: Try harder when allocating memory for maps
@ 2019-03-08 8:08 Martynas Pumputis
2019-03-08 8:44 ` Michal Hocko
0 siblings, 1 reply; 11+ messages in thread
From: Martynas Pumputis @ 2019-03-08 8:08 UTC (permalink / raw)
To: bpf; +Cc: ast, daniel, mhocko, m
It has been observed that sometimes memory allocation for BPF maps
fails when there is no obvious memory pressure in a system.
E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
could not be created due to due to vmalloc unable to allocate 75497472B,
when the system's memory consumption (in MB) was the following:
Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by
__GFP_RETRY_MAYFAIL with more useful semantic") we can replace
__GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer
and will try harder to fulfil allocation requests.
The change has been tested with the workloads mentioned above and by
observing oom_kill value from /proc/vmstat.
Signed-off-by: Martynas Pumputis <m@lambda.lt>
---
kernel/bpf/syscall.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 62f6bced3a3c..eb5cefe44af3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
void *bpf_map_area_alloc(size_t size, int numa_node)
{
- /* We definitely need __GFP_NORETRY, so OOM killer doesn't
- * trigger under memory pressure as we really just want to
- * fail instead.
+ /* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so
+ * OOM killer doesn't trigger under memory pressure as we really
+ * just want to fail instead.
*/
- const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
+ const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO;
void *area;
if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
--
2.21.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 8:08 [PATCH] bpf: Try harder when allocating memory for maps Martynas Pumputis @ 2019-03-08 8:44 ` Michal Hocko 2019-03-08 10:33 ` Daniel Borkmann 2019-03-08 11:14 ` Martynas Pumputis 0 siblings, 2 replies; 11+ messages in thread From: Michal Hocko @ 2019-03-08 8:44 UTC (permalink / raw) To: Martynas Pumputis; +Cc: bpf, ast, daniel On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: > It has been observed that sometimes memory allocation for BPF maps > fails when there is no obvious memory pressure in a system. > > E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) > could not be created due to due to vmalloc unable to allocate 75497472B, > when the system's memory consumption (in MB) was the following: > > Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 Hmm 75MB is quite large and much larger than the slab/page allocator cann provide so this is not really a fragmentation issue. Vmalloc does respect noretry but considering that there shouldn't be a large memory pressure I wonder how NORETRY managed to fail the allocation. Do you happen to have the allocation failure report? Btw. is there any real reason to opencode and duplicate kvmalloc logic here? In other words why not simply make bpf_map_area_alloc use kvmalloc_node with GFP_KERNEL? > Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by > __GFP_RETRY_MAYFAIL with more useful semantic") we can replace > __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer > and will try harder to fulfil allocation requests. > > The change has been tested with the workloads mentioned above and by > observing oom_kill value from /proc/vmstat. > > Signed-off-by: Martynas Pumputis <m@lambda.lt> > --- > kernel/bpf/syscall.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 62f6bced3a3c..eb5cefe44af3 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) > > void *bpf_map_area_alloc(size_t size, int numa_node) > { > - /* We definitely need __GFP_NORETRY, so OOM killer doesn't > - * trigger under memory pressure as we really just want to > - * fail instead. > + /* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so > + * OOM killer doesn't trigger under memory pressure as we really > + * just want to fail instead. > */ > - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; > + const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO; > void *area; > > if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { > -- > 2.21.0 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 8:44 ` Michal Hocko @ 2019-03-08 10:33 ` Daniel Borkmann 2019-03-08 10:55 ` Michal Hocko 2019-03-08 11:14 ` Martynas Pumputis 1 sibling, 1 reply; 11+ messages in thread From: Daniel Borkmann @ 2019-03-08 10:33 UTC (permalink / raw) To: Michal Hocko, Martynas Pumputis; +Cc: bpf, ast On 03/08/2019 09:44 AM, Michal Hocko wrote: > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: Martynas, for the patch, please also Cc netdev in the submission so that it lands properly in patchwork. Setup where patches only Cc'ed to bpf@vger.kernel.org would land in our delegate is not yet completed by ozlabs folks, just fyi. >> It has been observed that sometimes memory allocation for BPF maps >> fails when there is no obvious memory pressure in a system. >> >> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) >> could not be created due to due to vmalloc unable to allocate 75497472B, >> when the system's memory consumption (in MB) was the following: >> >> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 > > Hmm 75MB is quite large and much larger than the slab/page allocator > cann provide so this is not really a fragmentation issue. Vmalloc does Agree. > respect noretry but considering that there shouldn't be a large memory > pressure I wonder how NORETRY managed to fail the allocation. Do you > happen to have the allocation failure report? I'll defer to Martynas here. > Btw. is there any real reason to opencode and duplicate kvmalloc logic > here? In other words why not simply make bpf_map_area_alloc use > kvmalloc_node with GFP_KERNEL? Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc"). I remember back then we had a discussion that __GFP_NORETRY is not fully supported and should only be seen as a hint in our case (since it's not propagated all the way through in vmalloc, if I recall correctly). And looking at kvmalloc_node(), __GFP_NORETRY is only really set in case of kmalloc attempts. Given these alloc requests for maps can often be large in size, what we really want is something that ideally under *no* circumstances oom killer would trigger as that is way too disruptive. So instead, allocation should just fail and bpf loader or whatnot can deal with it. Looks like __GFP_RETRY_MAYFAIL would be better suited wrt OOM for both allocators and would allow to reuse kvmalloc though it would try much harder than __GFP_NORETRY. Ideally something like GFP_KERNEL | __GFP_NOWARN | __GFP_NOOOM | __GFP_ZERO would be nice to have, semantics of __GFP_RETRY_MAYFAIL kind of gets closer to it from looking at dcda9b0471. >> Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by >> __GFP_RETRY_MAYFAIL with more useful semantic") we can replace >> __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer >> and will try harder to fulfil allocation requests. >> >> The change has been tested with the workloads mentioned above and by >> observing oom_kill value from /proc/vmstat. >> >> Signed-off-by: Martynas Pumputis <m@lambda.lt> >> --- >> kernel/bpf/syscall.c | 8 ++++---- >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index 62f6bced3a3c..eb5cefe44af3 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) >> >> void *bpf_map_area_alloc(size_t size, int numa_node) >> { >> - /* We definitely need __GFP_NORETRY, so OOM killer doesn't >> - * trigger under memory pressure as we really just want to >> - * fail instead. >> + /* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so >> + * OOM killer doesn't trigger under memory pressure as we really >> + * just want to fail instead. >> */ >> - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; >> + const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO; >> void *area; >> >> if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { >> -- >> 2.21.0 >> > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 10:33 ` Daniel Borkmann @ 2019-03-08 10:55 ` Michal Hocko 2019-03-08 11:30 ` Daniel Borkmann 0 siblings, 1 reply; 11+ messages in thread From: Michal Hocko @ 2019-03-08 10:55 UTC (permalink / raw) To: Daniel Borkmann; +Cc: Martynas Pumputis, bpf, ast On Fri 08-03-19 11:33:00, Daniel Borkmann wrote: > On 03/08/2019 09:44 AM, Michal Hocko wrote: > > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: > > Martynas, for the patch, please also Cc netdev in the submission so > that it lands properly in patchwork. Setup where patches only Cc'ed > to bpf@vger.kernel.org would land in our delegate is not yet completed > by ozlabs folks, just fyi. > > >> It has been observed that sometimes memory allocation for BPF maps > >> fails when there is no obvious memory pressure in a system. > >> > >> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) > >> could not be created due to due to vmalloc unable to allocate 75497472B, > >> when the system's memory consumption (in MB) was the following: > >> > >> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 > > > > Hmm 75MB is quite large and much larger than the slab/page allocator > > cann provide so this is not really a fragmentation issue. Vmalloc does > > Agree. > > > respect noretry but considering that there shouldn't be a large memory > > pressure I wonder how NORETRY managed to fail the allocation. Do you > > happen to have the allocation failure report? > > I'll defer to Martynas here. > > > Btw. is there any real reason to opencode and duplicate kvmalloc logic > > here? In other words why not simply make bpf_map_area_alloc use > > kvmalloc_node with GFP_KERNEL? > > Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer > under pressure with map alloc"). I remember back then we had a discussion > that __GFP_NORETRY is not fully supported and should only be seen as a hint > in our case (since it's not propagated all the way through in vmalloc, if > I recall correctly). Yes, that is still the case and there is no way to really have nooom semantic for vmalloc. Even with your opencoded version btw. > And looking at kvmalloc_node(), __GFP_NORETRY is only > really set in case of kmalloc attempts. Given these alloc requests for maps > can often be large in size, what we really want is something that ideally under > *no* circumstances oom killer would trigger as that is way too disruptive. That is not really possible. Even if you do not trigger the OOM killer directly, som concurrent allocation might do that because your particular one has eaten the remaining memory. > So > instead, allocation should just fail and bpf loader or whatnot can deal with > it. Looks like __GFP_RETRY_MAYFAIL would be better suited wrt OOM for both > allocators and would allow to reuse kvmalloc though it would try much harder > than __GFP_NORETRY. Yes. > Ideally something like GFP_KERNEL | __GFP_NOWARN | > __GFP_NOOOM | __GFP_ZERO would be nice to have, semantics of __GFP_RETRY_MAYFAIL > kind of gets closer to it from looking at dcda9b0471. NOOOM semantic is simply impossible to make it sensible as explained above. > >> Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by > >> __GFP_RETRY_MAYFAIL with more useful semantic") we can replace > >> __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer > >> and will try harder to fulfil allocation requests. > >> > >> The change has been tested with the workloads mentioned above and by > >> observing oom_kill value from /proc/vmstat. > >> > >> Signed-off-by: Martynas Pumputis <m@lambda.lt> > >> --- > >> kernel/bpf/syscall.c | 8 ++++---- > >> 1 file changed, 4 insertions(+), 4 deletions(-) > >> > >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > >> index 62f6bced3a3c..eb5cefe44af3 100644 > >> --- a/kernel/bpf/syscall.c > >> +++ b/kernel/bpf/syscall.c > >> @@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) > >> > >> void *bpf_map_area_alloc(size_t size, int numa_node) > >> { > >> - /* We definitely need __GFP_NORETRY, so OOM killer doesn't > >> - * trigger under memory pressure as we really just want to > >> - * fail instead. > >> + /* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so > >> + * OOM killer doesn't trigger under memory pressure as we really > >> + * just want to fail instead. > >> */ > >> - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; > >> + const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO; > >> void *area; > >> > >> if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { > >> -- > >> 2.21.0 > >> > > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 10:55 ` Michal Hocko @ 2019-03-08 11:30 ` Daniel Borkmann 2019-03-08 12:00 ` Michal Hocko 0 siblings, 1 reply; 11+ messages in thread From: Daniel Borkmann @ 2019-03-08 11:30 UTC (permalink / raw) To: Michal Hocko; +Cc: Martynas Pumputis, bpf, ast On 03/08/2019 11:55 AM, Michal Hocko wrote: > On Fri 08-03-19 11:33:00, Daniel Borkmann wrote: >> On 03/08/2019 09:44 AM, Michal Hocko wrote: >>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: >> >> Martynas, for the patch, please also Cc netdev in the submission so >> that it lands properly in patchwork. Setup where patches only Cc'ed >> to bpf@vger.kernel.org would land in our delegate is not yet completed >> by ozlabs folks, just fyi. >> >>>> It has been observed that sometimes memory allocation for BPF maps >>>> fails when there is no obvious memory pressure in a system. >>>> >>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) >>>> could not be created due to due to vmalloc unable to allocate 75497472B, >>>> when the system's memory consumption (in MB) was the following: >>>> >>>> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 >>> >>> Hmm 75MB is quite large and much larger than the slab/page allocator >>> cann provide so this is not really a fragmentation issue. Vmalloc does >> >> Agree. >> >>> respect noretry but considering that there shouldn't be a large memory >>> pressure I wonder how NORETRY managed to fail the allocation. Do you >>> happen to have the allocation failure report? >> >> I'll defer to Martynas here. >> >>> Btw. is there any real reason to opencode and duplicate kvmalloc logic >>> here? In other words why not simply make bpf_map_area_alloc use >>> kvmalloc_node with GFP_KERNEL? >> >> Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer >> under pressure with map alloc"). I remember back then we had a discussion >> that __GFP_NORETRY is not fully supported and should only be seen as a hint >> in our case (since it's not propagated all the way through in vmalloc, if >> I recall correctly). > > Yes, that is still the case and there is no way to really have nooom > semantic for vmalloc. Even with your opencoded version btw. Okay, so similar situation applies to __GFP_RETRY_MAYFAIL reclaim modifier then if I understand you correctly? In dcda9b0471, its mentioned "this means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator." In the api comment in kvmalloc_node(), it says "__GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks". But if you say above that for vmalloc, there is no nooom semantic, then presumably __GFP_RETRY_MAYFAIL should also be taken as a hint wrt OOM, not a guarantee for the given allocation request (if kvmalloc_node selects the vmalloc based allocator), is that correct? >> And looking at kvmalloc_node(), __GFP_NORETRY is only >> really set in case of kmalloc attempts. Given these alloc requests for maps >> can often be large in size, what we really want is something that ideally under >> *no* circumstances oom killer would trigger as that is way too disruptive. > > That is not really possible. Even if you do not trigger the OOM killer > directly, som concurrent allocation might do that because your > particular one has eaten the remaining memory. Yes, in that situation there's not much we can do with either modifier. >> So >> instead, allocation should just fail and bpf loader or whatnot can deal with >> it. Looks like __GFP_RETRY_MAYFAIL would be better suited wrt OOM for both >> allocators and would allow to reuse kvmalloc though it would try much harder >> than __GFP_NORETRY. > > Yes. > >> Ideally something like GFP_KERNEL | __GFP_NOWARN | >> __GFP_NOOOM | __GFP_ZERO would be nice to have, semantics of __GFP_RETRY_MAYFAIL >> kind of gets closer to it from looking at dcda9b0471. > > NOOOM semantic is simply impossible to make it sensible as explained > above. > >>>> Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by >>>> __GFP_RETRY_MAYFAIL with more useful semantic") we can replace >>>> __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer >>>> and will try harder to fulfil allocation requests. >>>> >>>> The change has been tested with the workloads mentioned above and by >>>> observing oom_kill value from /proc/vmstat. >>>> >>>> Signed-off-by: Martynas Pumputis <m@lambda.lt> >>>> --- >>>> kernel/bpf/syscall.c | 8 ++++---- >>>> 1 file changed, 4 insertions(+), 4 deletions(-) >>>> >>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >>>> index 62f6bced3a3c..eb5cefe44af3 100644 >>>> --- a/kernel/bpf/syscall.c >>>> +++ b/kernel/bpf/syscall.c >>>> @@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) >>>> >>>> void *bpf_map_area_alloc(size_t size, int numa_node) >>>> { >>>> - /* We definitely need __GFP_NORETRY, so OOM killer doesn't >>>> - * trigger under memory pressure as we really just want to >>>> - * fail instead. >>>> + /* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so >>>> + * OOM killer doesn't trigger under memory pressure as we really >>>> + * just want to fail instead. >>>> */ >>>> - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; >>>> + const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO; >>>> void *area; >>>> >>>> if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { >>>> -- >>>> 2.21.0 >>>> >>> > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 11:30 ` Daniel Borkmann @ 2019-03-08 12:00 ` Michal Hocko 0 siblings, 0 replies; 11+ messages in thread From: Michal Hocko @ 2019-03-08 12:00 UTC (permalink / raw) To: Daniel Borkmann; +Cc: Martynas Pumputis, bpf, ast On Fri 08-03-19 12:30:47, Daniel Borkmann wrote: > On 03/08/2019 11:55 AM, Michal Hocko wrote: > > On Fri 08-03-19 11:33:00, Daniel Borkmann wrote: > >> On 03/08/2019 09:44 AM, Michal Hocko wrote: > >>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: > >> > >> Martynas, for the patch, please also Cc netdev in the submission so > >> that it lands properly in patchwork. Setup where patches only Cc'ed > >> to bpf@vger.kernel.org would land in our delegate is not yet completed > >> by ozlabs folks, just fyi. > >> > >>>> It has been observed that sometimes memory allocation for BPF maps > >>>> fails when there is no obvious memory pressure in a system. > >>>> > >>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) > >>>> could not be created due to due to vmalloc unable to allocate 75497472B, > >>>> when the system's memory consumption (in MB) was the following: > >>>> > >>>> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 > >>> > >>> Hmm 75MB is quite large and much larger than the slab/page allocator > >>> cann provide so this is not really a fragmentation issue. Vmalloc does > >> > >> Agree. > >> > >>> respect noretry but considering that there shouldn't be a large memory > >>> pressure I wonder how NORETRY managed to fail the allocation. Do you > >>> happen to have the allocation failure report? > >> > >> I'll defer to Martynas here. > >> > >>> Btw. is there any real reason to opencode and duplicate kvmalloc logic > >>> here? In other words why not simply make bpf_map_area_alloc use > >>> kvmalloc_node with GFP_KERNEL? > >> > >> Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer > >> under pressure with map alloc"). I remember back then we had a discussion > >> that __GFP_NORETRY is not fully supported and should only be seen as a hint > >> in our case (since it's not propagated all the way through in vmalloc, if > >> I recall correctly). > > > > Yes, that is still the case and there is no way to really have nooom > > semantic for vmalloc. Even with your opencoded version btw. > > Okay, so similar situation applies to __GFP_RETRY_MAYFAIL reclaim modifier > then if I understand you correctly? In dcda9b0471, its mentioned "this > means that all the reclaim opportunities have been exhausted except the > most disruptive one (the OOM killer) and a user defined fallback behavior > is more sensible than keep retrying in the page allocator." In the api > comment in kvmalloc_node(), it says "__GFP_RETRY_MAYFAIL is supported, > and it should be used only if kmalloc is preferable to the vmalloc fallback, > due to visible performance drawbacks". But if you say above that for > vmalloc, there is no nooom semantic, then presumably __GFP_RETRY_MAYFAIL > should also be taken as a hint wrt OOM, not a guarantee for the given > allocation request (if kvmalloc_node selects the vmalloc based allocator), > is that correct? Yes, kvmalloc* doesn't have the full MAYFAIL/NORETRY semantic because this is not possible to implement without reworking the whole page table allocation code or making both behave like scoped NOFS/NOIO. But I believe you shouldn't really have to care. Does the same problem really happen with a plain kvmalloc(GFP_KERNEL)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 8:44 ` Michal Hocko 2019-03-08 10:33 ` Daniel Borkmann @ 2019-03-08 11:14 ` Martynas Pumputis 2019-03-08 11:20 ` Michal Hocko 1 sibling, 1 reply; 11+ messages in thread From: Martynas Pumputis @ 2019-03-08 11:14 UTC (permalink / raw) To: Michal Hocko; +Cc: bpf, ast, daniel On 3/8/19 9:44 AM, Michal Hocko wrote: > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: >> It has been observed that sometimes memory allocation for BPF maps >> fails when there is no obvious memory pressure in a system. >> >> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) >> could not be created due to due to vmalloc unable to allocate 75497472B, >> when the system's memory consumption (in MB) was the following: >> >> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 > > Hmm 75MB is quite large and much larger than the slab/page allocator > cann provide so this is not really a fragmentation issue. Vmalloc does > respect noretry but considering that there shouldn't be a large memory > pressure I wonder how NORETRY managed to fail the allocation. Do you > happen to have the allocation failure report? I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has failed: https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d Let me know if more info is required, I can reproduce the failure. Thanks. > > Btw. is there any real reason to opencode and duplicate kvmalloc logic > here? In other words why not simply make bpf_map_area_alloc use > kvmalloc_node with GFP_KERNEL? > >> Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by >> __GFP_RETRY_MAYFAIL with more useful semantic") we can replace >> __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer >> and will try harder to fulfil allocation requests. >> >> The change has been tested with the workloads mentioned above and by >> observing oom_kill value from /proc/vmstat. >> >> Signed-off-by: Martynas Pumputis <m@lambda.lt> >> --- >> kernel/bpf/syscall.c | 8 ++++---- >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index 62f6bced3a3c..eb5cefe44af3 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) >> >> void *bpf_map_area_alloc(size_t size, int numa_node) >> { >> - /* We definitely need __GFP_NORETRY, so OOM killer doesn't >> - * trigger under memory pressure as we really just want to >> - * fail instead. >> + /* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so >> + * OOM killer doesn't trigger under memory pressure as we really >> + * just want to fail instead. >> */ >> - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; >> + const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO; >> void *area; >> >> if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { >> -- >> 2.21.0 >> > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 11:14 ` Martynas Pumputis @ 2019-03-08 11:20 ` Michal Hocko 2019-03-08 20:02 ` Martynas Pumputis 0 siblings, 1 reply; 11+ messages in thread From: Michal Hocko @ 2019-03-08 11:20 UTC (permalink / raw) To: Martynas Pumputis; +Cc: bpf, ast, daniel On Fri 08-03-19 12:14:16, Martynas Pumputis wrote: > > > On 3/8/19 9:44 AM, Michal Hocko wrote: > > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: > > > It has been observed that sometimes memory allocation for BPF maps > > > fails when there is no obvious memory pressure in a system. > > > > > > E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) > > > could not be created due to due to vmalloc unable to allocate 75497472B, > > > when the system's memory consumption (in MB) was the following: > > > > > > Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 > > > > Hmm 75MB is quite large and much larger than the slab/page allocator > > cann provide so this is not really a fragmentation issue. Vmalloc does > > respect noretry but considering that there shouldn't be a large memory > > pressure I wonder how NORETRY managed to fail the allocation. Do you > > happen to have the allocation failure report? > > I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has > failed: > https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d dmesg with the allocation failure report would be more helpful Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 11:20 ` Michal Hocko @ 2019-03-08 20:02 ` Martynas Pumputis 2019-03-10 7:13 ` Michal Hocko 0 siblings, 1 reply; 11+ messages in thread From: Martynas Pumputis @ 2019-03-08 20:02 UTC (permalink / raw) To: Michal Hocko; +Cc: bpf, ast, daniel On 3/8/19 12:20 PM, Michal Hocko wrote: > On Fri 08-03-19 12:14:16, Martynas Pumputis wrote: >> >> >> On 3/8/19 9:44 AM, Michal Hocko wrote: >>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: >>>> It has been observed that sometimes memory allocation for BPF maps >>>> fails when there is no obvious memory pressure in a system. >>>> >>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) >>>> could not be created due to due to vmalloc unable to allocate 75497472B, >>>> when the system's memory consumption (in MB) was the following: >>>> >>>> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 >>> >>> Hmm 75MB is quite large and much larger than the slab/page allocator >>> cann provide so this is not really a fragmentation issue. Vmalloc does >>> respect noretry but considering that there shouldn't be a large memory >>> pressure I wonder how NORETRY managed to fail the allocation. Do you >>> happen to have the allocation failure report? >> >> I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has >> failed: >> https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d > > dmesg with the allocation failure report would be more helpful https://gist.github.com/brb/2d7ac323d2e14cb7a38bacba301fe3af > > Thanks! > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-08 20:02 ` Martynas Pumputis @ 2019-03-10 7:13 ` Michal Hocko 2019-03-11 19:33 ` Martynas Pumputis 0 siblings, 1 reply; 11+ messages in thread From: Michal Hocko @ 2019-03-10 7:13 UTC (permalink / raw) To: Martynas Pumputis; +Cc: bpf, ast, daniel On Fri 08-03-19 21:02:41, Martynas Pumputis wrote: > > > On 3/8/19 12:20 PM, Michal Hocko wrote: > > On Fri 08-03-19 12:14:16, Martynas Pumputis wrote: > > > > > > > > > On 3/8/19 9:44 AM, Michal Hocko wrote: > > > > On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: > > > > > It has been observed that sometimes memory allocation for BPF maps > > > > > fails when there is no obvious memory pressure in a system. > > > > > > > > > > E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) > > > > > could not be created due to due to vmalloc unable to allocate 75497472B, > > > > > when the system's memory consumption (in MB) was the following: > > > > > > > > > > Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 > > > > > > > > Hmm 75MB is quite large and much larger than the slab/page allocator > > > > cann provide so this is not really a fragmentation issue. Vmalloc does > > > > respect noretry but considering that there shouldn't be a large memory > > > > pressure I wonder how NORETRY managed to fail the allocation. Do you > > > > happen to have the allocation failure report? > > > > > > I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has > > > failed: > > > https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d > > > > dmesg with the allocation failure report would be more helpful > > https://gist.github.com/brb/2d7ac323d2e14cb7a38bacba301fe3af Thanks! tc: vmalloc: allocation failure, allocated 15609856 of 62918656 bytes, mode:0x6090c0(GFP_KERNEL|__GFP_NORETRY|__GFP_ZERO), nodemask=(null),cpuset=b389e318420d891300ad9658f8e056b59972fda9547dd566245a922c34bb9e42,mems_allowed=0 [...] Node 0 DMA free:15728kB min:268kB low:332kB high:396kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:12kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 3419 3866 3866 3866 Node 0 DMA32 free:105004kB min:104588kB low:119468kB high:134348kB active_anon:526128kB inactive_anon:612kB active_file:862524kB inactive_file:1552884kB unevictable:0kB writepending:0kB present:3653568kB managed:3563596kB mlocked:0kB kernel_stack:7592kB pagetables:6636kB bounce:0kB free_pcp:916kB local_pcp:736kB free_cma:0kB lowmem_reserve[]: 0 0 446 446 446 Node 0 Normal free:22844kB min:24160kB low:26104kB high:28048kB active_anon:92340kB inactive_anon:228kB active_file:160072kB inactive_file:82480kB unevictable:0kB writepending:0kB present:524288kB managed:457544kB mlocked:0kB kernel_stack:2224kB pagetables:3776kB bounce:0kB free_pcp:996kB local_pcp:672kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0 Except for a srtange cpuset value (which should be checked separately), the allocation is restricted to node 0 which is pretty much out of memory (below min watermark - lowmem_reserve). There is still a lot of page cache to reclaim so a further reclaim is quite likely to make a further progress. There is still 45MB to go and at least page cache is 1.5G so there is some buffer to allocate from. That being said __GFP_NORETRY caused a pre-mature failure indeed. Using kvmalloc(GFP_KERNEL|__GFP_RETRY_MAYFAIL) would likely help here unless the pagecache is really hard to reclaim. Please note that this will also imply that requests which can be satisfied from the slab allocator will retry harder as well. Not sure this is desirable for these requests though but your original patch does the same so if you wanted to have __GFP_RETRY_MAYFAIL behavior only for the vmalloc path then you would need to have an opencoded version which adds the flag just to the vmalloc fallback path. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] bpf: Try harder when allocating memory for maps 2019-03-10 7:13 ` Michal Hocko @ 2019-03-11 19:33 ` Martynas Pumputis 0 siblings, 0 replies; 11+ messages in thread From: Martynas Pumputis @ 2019-03-11 19:33 UTC (permalink / raw) To: Michal Hocko; +Cc: bpf, ast, daniel On 3/10/19 8:13 AM, Michal Hocko wrote: > On Fri 08-03-19 21:02:41, Martynas Pumputis wrote: >> >> >> On 3/8/19 12:20 PM, Michal Hocko wrote: >>> On Fri 08-03-19 12:14:16, Martynas Pumputis wrote: >>>> >>>> >>>> On 3/8/19 9:44 AM, Michal Hocko wrote: >>>>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: >>>>>> It has been observed that sometimes memory allocation for BPF maps >>>>>> fails when there is no obvious memory pressure in a system. >>>>>> >>>>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) >>>>>> could not be created due to due to vmalloc unable to allocate 75497472B, >>>>>> when the system's memory consumption (in MB) was the following: >>>>>> >>>>>> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 >>>>> >>>>> Hmm 75MB is quite large and much larger than the slab/page allocator >>>>> cann provide so this is not really a fragmentation issue. Vmalloc does >>>>> respect noretry but considering that there shouldn't be a large memory >>>>> pressure I wonder how NORETRY managed to fail the allocation. Do you >>>>> happen to have the allocation failure report? >>>> >>>> I got /proc/{meminfo,vmstat,vmallocinfo} just after the allocation has >>>> failed: >>>> https://gist.github.com/brb/62092c1d83daa6527271b88f0352e32d >>> >>> dmesg with the allocation failure report would be more helpful >> >> https://gist.github.com/brb/2d7ac323d2e14cb7a38bacba301fe3af > > Thanks! > > tc: vmalloc: allocation failure, allocated 15609856 of 62918656 bytes, mode:0x6090c0(GFP_KERNEL|__GFP_NORETRY|__GFP_ZERO), nodemask=(null),cpuset=b389e318420d891300ad9658f8e056b59972fda9547dd566245a922c34bb9e42,mems_allowed=0 > [...] > Node 0 DMA free:15728kB min:268kB low:332kB high:396kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:12kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB > lowmem_reserve[]: 0 3419 3866 3866 3866 > Node 0 DMA32 free:105004kB min:104588kB low:119468kB high:134348kB active_anon:526128kB inactive_anon:612kB active_file:862524kB inactive_file:1552884kB unevictable:0kB writepending:0kB present:3653568kB managed:3563596kB mlocked:0kB kernel_stack:7592kB pagetables:6636kB bounce:0kB free_pcp:916kB local_pcp:736kB free_cma:0kB > lowmem_reserve[]: 0 0 446 446 446 > Node 0 Normal free:22844kB min:24160kB low:26104kB high:28048kB active_anon:92340kB inactive_anon:228kB active_file:160072kB inactive_file:82480kB unevictable:0kB writepending:0kB present:524288kB managed:457544kB mlocked:0kB kernel_stack:2224kB pagetables:3776kB bounce:0kB free_pcp:996kB local_pcp:672kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 0 > > Except for a srtange cpuset value (which should be checked separately), > the allocation is restricted to node 0 which is pretty much out of > memory (below min watermark - lowmem_reserve). There is still a lot of > page cache to reclaim so a further reclaim is quite likely to make a > further progress. There is still 45MB to go and at least page cache is > 1.5G so there is some buffer to allocate from. > > That being said __GFP_NORETRY caused a pre-mature failure indeed. Using > kvmalloc(GFP_KERNEL|__GFP_RETRY_MAYFAIL) would likely help here unless > the pagecache is really hard to reclaim. Please note that this will also > imply that requests which can be satisfied from the slab allocator will > retry harder as well. Not sure this is desirable for these requests > though but your original patch does the same so if you wanted to have > __GFP_RETRY_MAYFAIL behavior only for the vmalloc path then you would > need to have an opencoded version which adds the flag just to the > vmalloc fallback path. Thanks a lot for the analysis. I've re-submitted the patch. > ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-03-11 19:32 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-03-08 8:08 [PATCH] bpf: Try harder when allocating memory for maps Martynas Pumputis 2019-03-08 8:44 ` Michal Hocko 2019-03-08 10:33 ` Daniel Borkmann 2019-03-08 10:55 ` Michal Hocko 2019-03-08 11:30 ` Daniel Borkmann 2019-03-08 12:00 ` Michal Hocko 2019-03-08 11:14 ` Martynas Pumputis 2019-03-08 11:20 ` Michal Hocko 2019-03-08 20:02 ` Martynas Pumputis 2019-03-10 7:13 ` Michal Hocko 2019-03-11 19:33 ` Martynas Pumputis
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.