* [PATCH 0/3] netfilter/x_tables: go back to using vmalloc
@ 2025-09-22 19:48 Daniil Tatianin
2025-09-22 19:48 ` [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info Daniil Tatianin
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Daniil Tatianin @ 2025-09-22 19:48 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Daniil Tatianin, Jozsef Kadlecsik, Florian Westphal, Phil Sutter,
David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
linux-kernel, netdev
This series replaces most calls to kvmalloc whose size directly depends
on user input with vmalloc. This is actually how xt_table_info used to be
allocated whenever it ended up being too large, until 2017, when that logic
was replaced with a call to kvmalloc in the
commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
The commit that changed it did so because "xt_alloc_table_info()
basically opencodes kvmalloc()", which is not actually what the code was
doing. kvmalloc() does not go directly to vmalloc when the order the
caller is asking for is "expensive"; instead, it only uses vmalloc as a
fallback when the buddy allocator is unable to fulfill the request.
The difference between the two is huge when the system is under memory
pressure and has no free pages of a large order. Before the change to
kvmalloc we wouldn't even try the buddy allocator for large orders, but
now we force it to find a page of the required order, waking up
kswapd/kcompactd and dropping reclaimable memory for no reason at all,
just to satisfy a huge-order allocation that could easily live in
vmalloc'ed memory instead.
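The contrast can be sketched as a userspace model (the PAGE_SHIFT and
PAGE_ALLOC_COSTLY_ORDER values are typical x86-64 ones, and the helper
names are made up for illustration; this is not the actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12                 /* 4 KiB pages, typical on x86-64 */
#define PAGE_ALLOC_COSTLY_ORDER 3     /* same value the kernel's mm uses */

/* Smallest order whose block of 2^order pages covers sz bytes,
 * mirroring the kernel's get_order(). */
static int get_order(size_t sz)
{
	size_t pages = (sz + ((size_t)1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
	int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Pre-eacd86ca3b036 behavior: skip the buddy allocator entirely for
 * costly orders and go straight to vmalloc. */
static bool old_xt_alloc_uses_vmalloc(size_t sz)
{
	return get_order(sz) > PAGE_ALLOC_COSTLY_ORDER;
}

/* kvmalloc behavior: the buddy allocator is always tried first;
 * vmalloc is only the fallback when that attempt fails. */
static bool kvmalloc_tries_buddy_first(size_t sz)
{
	(void)sz;
	return true;
}
```

Under memory pressure, the second policy is what forces reclaim and
compaction for a physically contiguous block the caller never needed.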
Go back to always calling vmalloc, since this code doesn't benefit from
physically contiguous memory, and the allocation size is directly
dictated by the userspace-passed table buffer. That allows userspace to
torture the buddy allocator by carefully crafting a huge table that fits
right at the maximum available memory order on the system.
This series also touches the allocation of entry_offsets, which suffers
from the same issue.
Daniil Tatianin (3):
netfilter/x_tables: go back to using vmalloc for xt_table_info
netfilter/x_tables: introduce a helper for freeing entry offsets
netfilter/x_tables: allocate entry_offsets with vcalloc
include/linux/netfilter/x_tables.h | 1 +
net/ipv4/netfilter/arp_tables.c | 4 ++--
net/ipv4/netfilter/ip_tables.c | 4 ++--
net/ipv6/netfilter/ip6_tables.c | 4 ++--
net/netfilter/x_tables.c | 12 +++++++++---
5 files changed, 16 insertions(+), 9 deletions(-)
--
2.34.1
* [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info
2025-09-22 19:48 [PATCH 0/3] netfilter/x_tables: go back to using vmalloc Daniil Tatianin
@ 2025-09-22 19:48 ` Daniil Tatianin
2025-09-22 21:12 ` Eric Dumazet
2025-09-22 19:48 ` [PATCH 2/3] netfilter/x_tables: introduce a helper for freeing entry offsets Daniil Tatianin
2025-09-22 19:48 ` [PATCH 3/3] netfilter/x_tables: allocate entry_offsets with vcalloc Daniil Tatianin
2 siblings, 1 reply; 9+ messages in thread
From: Daniil Tatianin @ 2025-09-22 19:48 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Daniil Tatianin, Jozsef Kadlecsik, Florian Westphal, Phil Sutter,
David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
linux-kernel, netdev
This code previously always used vmalloc for anything above
PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
The commit that changed it did so because "xt_alloc_table_info()
basically opencodes kvmalloc()", which is not actually what the code was
doing. kvmalloc() does not go directly to vmalloc when the order the
caller is asking for is "expensive"; instead, it only uses vmalloc as a
fallback when the buddy allocator is unable to fulfill the request.
The difference between the two is huge when the system is under memory
pressure and has no free pages of a large order. Before the change to
kvmalloc we wouldn't even try the buddy allocator for large orders, but
now we force it to find a page of the required order, waking up
kswapd/kcompactd and dropping reclaimable memory for no reason at all,
just to satisfy a huge-order allocation that could easily live in
vmalloc'ed memory instead.
Go back to always calling vmalloc, since this code doesn't benefit from
physically contiguous memory, and the allocation size is directly
dictated by the userspace-passed table buffer. That allows userspace to
torture the buddy allocator by carefully crafting a huge table that fits
right at the maximum available memory order on the system.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
net/netfilter/x_tables.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 90b7630421c4..c98f4b05d79d 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1190,7 +1190,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
return NULL;
- info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
+ info = __vmalloc(sz, GFP_KERNEL_ACCOUNT);
if (!info)
return NULL;
@@ -1210,7 +1210,7 @@ void xt_free_table_info(struct xt_table_info *info)
kvfree(info->jumpstack);
}
- kvfree(info);
+ vfree(info);
}
EXPORT_SYMBOL(xt_free_table_info);
--
2.34.1
* [PATCH 2/3] netfilter/x_tables: introduce a helper for freeing entry offsets
2025-09-22 19:48 [PATCH 0/3] netfilter/x_tables: go back to using vmalloc Daniil Tatianin
2025-09-22 19:48 ` [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info Daniil Tatianin
@ 2025-09-22 19:48 ` Daniil Tatianin
2025-09-22 19:48 ` [PATCH 3/3] netfilter/x_tables: allocate entry_offsets with vcalloc Daniil Tatianin
2 siblings, 0 replies; 9+ messages in thread
From: Daniil Tatianin @ 2025-09-22 19:48 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Daniil Tatianin, Jozsef Kadlecsik, Florian Westphal, Phil Sutter,
David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
linux-kernel, netdev
Analogous to xt_free_table_info, add a helper so that users of
xt_alloc_entry_offsets don't have to assume how the array was allocated.
This also lets us cleanly change the internal allocation scheme in the
following commit.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
include/linux/netfilter/x_tables.h | 1 +
net/ipv4/netfilter/arp_tables.c | 4 ++--
net/ipv4/netfilter/ip_tables.c | 4 ++--
net/ipv6/netfilter/ip6_tables.c | 4 ++--
net/netfilter/x_tables.c | 6 ++++++
5 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 77c778d84d4c..f695230eb89c 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -274,6 +274,7 @@ int xt_check_entry_offsets(const void *base, const char *elems,
int xt_check_table_hooks(const struct xt_table_info *info, unsigned int valid_hooks);
unsigned int *xt_alloc_entry_offsets(unsigned int size);
+void xt_free_entry_offsets(unsigned int *offsets);
bool xt_find_jump_offset(const unsigned int *offsets,
unsigned int target, unsigned int size);
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 1cdd9c28ab2d..bc164c2e22b0 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -570,7 +570,7 @@ static int translate_table(struct net *net,
ret = -ELOOP;
goto out_free;
}
- kvfree(offsets);
+ xt_free_entry_offsets(offsets);
/* Finally, each sanity check must pass */
i = 0;
@@ -593,7 +593,7 @@ static int translate_table(struct net *net,
return ret;
out_free:
- kvfree(offsets);
+ xt_free_entry_offsets(offsets);
return ret;
}
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 23c8deff8095..1ffd871456e1 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -708,7 +708,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
ret = -ELOOP;
goto out_free;
}
- kvfree(offsets);
+ xt_free_entry_offsets(offsets);
/* Finally, each sanity check must pass */
i = 0;
@@ -731,7 +731,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
return ret;
out_free:
- kvfree(offsets);
+ xt_free_entry_offsets(offsets);
return ret;
}
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index d585ac3c1113..0f2999155bde 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -725,7 +725,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
ret = -ELOOP;
goto out_free;
}
- kvfree(offsets);
+ xt_free_entry_offsets(offsets);
/* Finally, each sanity check must pass */
i = 0;
@@ -748,7 +748,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
return ret;
out_free:
- kvfree(offsets);
+ xt_free_entry_offsets(offsets);
return ret;
}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index c98f4b05d79d..5ea95c56f3a0 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -970,6 +970,12 @@ unsigned int *xt_alloc_entry_offsets(unsigned int size)
}
EXPORT_SYMBOL(xt_alloc_entry_offsets);
+void xt_free_entry_offsets(unsigned int *offsets)
+{
+ kvfree(offsets);
+}
+EXPORT_SYMBOL(xt_free_entry_offsets);
+
/**
* xt_find_jump_offset - check if target is a valid jump offset
*
--
2.34.1
* [PATCH 3/3] netfilter/x_tables: allocate entry_offsets with vcalloc
2025-09-22 19:48 [PATCH 0/3] netfilter/x_tables: go back to using vmalloc Daniil Tatianin
2025-09-22 19:48 ` [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info Daniil Tatianin
2025-09-22 19:48 ` [PATCH 2/3] netfilter/x_tables: introduce a helper for freeing entry offsets Daniil Tatianin
@ 2025-09-22 19:48 ` Daniil Tatianin
2 siblings, 0 replies; 9+ messages in thread
From: Daniil Tatianin @ 2025-09-22 19:48 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Daniil Tatianin, Jozsef Kadlecsik, Florian Westphal, Phil Sutter,
David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
linux-kernel, netdev
This allocation does not benefit from contiguous physical memory, and
its size depends on userspace input.
There is no reason to stress the buddy allocator, and thus the entire
system, for an allocation that can exist in vmalloc'ed memory just fine.
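The allocation being changed can be modeled in userspace roughly as
follows (XT_MAX_TABLE_SIZE is the 512 MiB limit from x_tables.c; plain
calloc stands in for __vcalloc here, so this is only an illustrative
sketch, not the kernel code):

```c
#include <assert.h>
#include <stdlib.h>

#define XT_MAX_TABLE_SIZE (512 * 1024 * 1024) /* 512 MiB, as in x_tables.c */

/* Userspace model of xt_alloc_entry_offsets(): reject sizes whose offset
 * array would exceed XT_MAX_TABLE_SIZE, then do a zeroing, overflow-checked
 * allocation (plain calloc stands in for the kernel's __vcalloc). */
static unsigned int *alloc_entry_offsets(unsigned int size)
{
	if (size > XT_MAX_TABLE_SIZE / sizeof(unsigned int))
		return NULL;
	return calloc(size, sizeof(unsigned int));
}
```

The bound check is what keeps the multiplication from overflowing even
before the calloc-style allocator applies its own overflow checking.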
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
net/netfilter/x_tables.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 5ea95c56f3a0..06a86648b931 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -965,14 +965,14 @@ unsigned int *xt_alloc_entry_offsets(unsigned int size)
if (size > XT_MAX_TABLE_SIZE / sizeof(unsigned int))
return NULL;
- return kvcalloc(size, sizeof(unsigned int), GFP_KERNEL);
+ return __vcalloc(size, sizeof(unsigned int), GFP_KERNEL);
}
EXPORT_SYMBOL(xt_alloc_entry_offsets);
void xt_free_entry_offsets(unsigned int *offsets)
{
- kvfree(offsets);
+ vfree(offsets);
}
EXPORT_SYMBOL(xt_free_entry_offsets);
--
2.34.1
* Re: [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info
2025-09-22 19:48 ` [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info Daniil Tatianin
@ 2025-09-22 21:12 ` Eric Dumazet
2025-09-23 10:11 ` Daniil Tatianin
0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2025-09-22 21:12 UTC (permalink / raw)
To: Daniil Tatianin
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
Phil Sutter, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
linux-kernel, netdev
On Mon, Sep 22, 2025 at 12:48 PM Daniil Tatianin
<d-tatianin@yandex-team.ru> wrote:
>
> This code previously always used vmalloc for anything above
> PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
> commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
>
> The commit that changed it did so because "xt_alloc_table_info()
> basically opencodes kvmalloc()", which is not actually what it was
> doing. kvmalloc() does not attempt to go directly to vmalloc if the
> order the caller is trying to allocate is "expensive", instead it only
> uses vmalloc as a fallback in case the buddy allocator is not able to
> fulfill the request.
>
> The difference between the two is actually huge in case the system is
> under memory pressure and has no free pages of a large order. Before the
> change to kvmalloc we wouldn't even try going to the buddy allocator for
> large orders, but now we would force it to try to find a page of the
> required order by waking up kswapd/kcompactd and dropping reclaimable memory
> for no reason at all to satisfy our huge order allocation that could easily
> exist within vmalloc'ed memory instead.
This would hint at an issue with kvmalloc(); why not fix it, instead
of trying to fix all its users?
There was a time when PAGE_ALLOC_COSTLY_ORDER was used.
>
> Revert the change to always call vmalloc, since this code doesn't really
> benefit from contiguous physical memory, and the size it allocates is
> directly dictated by the userspace-passed table buffer thus allowing it to
> torture the buddy allocator by carefully crafting a huge table that fits
> right at the maximum available memory order on the system.
>
> Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
> ---
> net/netfilter/x_tables.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index 90b7630421c4..c98f4b05d79d 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1190,7 +1190,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
> return NULL;
>
> - info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
> + info = __vmalloc(sz, GFP_KERNEL_ACCOUNT);
> if (!info)
> return NULL;
>
> @@ -1210,7 +1210,7 @@ void xt_free_table_info(struct xt_table_info *info)
> kvfree(info->jumpstack);
> }
>
> - kvfree(info);
> + vfree(info);
> }
> EXPORT_SYMBOL(xt_free_table_info);
>
> --
> 2.34.1
>
* Re: [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info
2025-09-22 21:12 ` Eric Dumazet
@ 2025-09-23 10:11 ` Daniil Tatianin
2025-09-23 11:36 ` Florian Westphal
0 siblings, 1 reply; 9+ messages in thread
From: Daniil Tatianin @ 2025-09-23 10:11 UTC (permalink / raw)
To: Eric Dumazet
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
Phil Sutter, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
linux-kernel, netdev
On 9/23/25 12:12 AM, Eric Dumazet wrote:
> On Mon, Sep 22, 2025 at 12:48 PM Daniil Tatianin
> <d-tatianin@yandex-team.ru> wrote:
>> This code previously always used vmalloc for anything above
>> PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
>> commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
>>
>> The commit that changed it did so because "xt_alloc_table_info()
>> basically opencodes kvmalloc()", which is not actually what it was
>> doing. kvmalloc() does not attempt to go directly to vmalloc if the
>> order the caller is trying to allocate is "expensive", instead it only
>> uses vmalloc as a fallback in case the buddy allocator is not able to
>> fulfill the request.
>>
>> The difference between the two is actually huge in case the system is
>> under memory pressure and has no free pages of a large order. Before the
>> change to kvmalloc we wouldn't even try going to the buddy allocator for
>> large orders, but now we would force it to try to find a page of the
>> required order by waking up kswapd/kcompactd and dropping reclaimable memory
>> for no reason at all to satisfy our huge order allocation that could easily
>> exist within vmalloc'ed memory instead.
> This would hint at an issue with kvmalloc(), why not fixing it, instead
> of trying to fix all its users ?
Thanks for the quick reply! From my understanding, there are a lot of
callers of kvmalloc that do benefit from the physical memory being
contiguous, because it is then used for hardware DMA etc., so I'm not
sure that would be feasible.
>
> There was a time where PAGE_ALLOC_COSTLY_ORDER was used.
Out of curiosity, do you mean kvmalloc used to always fall back to
vmalloc for > COSTLY_ORDER?
If so, do you happen to know which commit changed that behavior? I
tried grepping the logs and looking at the git blame of slub.c, but I
guess it was changed too long ago, so I wasn't successful.
>
>
>
>> Revert the change to always call vmalloc, since this code doesn't really
>> benefit from contiguous physical memory, and the size it allocates is
>> directly dictated by the userspace-passed table buffer thus allowing it to
>> torture the buddy allocator by carefully crafting a huge table that fits
>> right at the maximum available memory order on the system.
>>
>> Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
>> ---
>> net/netfilter/x_tables.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
>> index 90b7630421c4..c98f4b05d79d 100644
>> --- a/net/netfilter/x_tables.c
>> +++ b/net/netfilter/x_tables.c
>> @@ -1190,7 +1190,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
>> if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
>> return NULL;
>>
>> - info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
>> + info = __vmalloc(sz, GFP_KERNEL_ACCOUNT);
>> if (!info)
>> return NULL;
>>
>> @@ -1210,7 +1210,7 @@ void xt_free_table_info(struct xt_table_info *info)
>> kvfree(info->jumpstack);
>> }
>>
>> - kvfree(info);
>> + vfree(info);
>> }
>> EXPORT_SYMBOL(xt_free_table_info);
>>
>> --
>> 2.34.1
>>
* Re: [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info
2025-09-23 10:11 ` Daniil Tatianin
@ 2025-09-23 11:36 ` Florian Westphal
2025-09-23 12:04 ` Daniil Tatianin
0 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2025-09-23 11:36 UTC (permalink / raw)
To: Daniil Tatianin
Cc: Eric Dumazet, Pablo Neira Ayuso, Jozsef Kadlecsik, Phil Sutter,
David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Simon Horman, netfilter-devel, coreteam, linux-kernel, netdev
Daniil Tatianin <d-tatianin@yandex-team.ru> wrote:
> > On Mon, Sep 22, 2025 at 12:48 PM Daniil Tatianin
> > <d-tatianin@yandex-team.ru> wrote:
> >> This code previously always used vmalloc for anything above
> >> PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
> >> commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
> >>
> >> The commit that changed it did so because "xt_alloc_table_info()
> >> basically opencodes kvmalloc()", which is not actually what it was
> >> doing. kvmalloc() does not attempt to go directly to vmalloc if the
> >> order the caller is trying to allocate is "expensive", instead it only
> >> uses vmalloc as a fallback in case the buddy allocator is not able to
> >> fulfill the request.
> >>
> >> The difference between the two is actually huge in case the system is
> >> under memory pressure and has no free pages of a large order. Before the
> >> change to kvmalloc we wouldn't even try going to the buddy allocator for
> >> large orders, but now we would force it to try to find a page of the
> >> required order by waking up kswapd/kcompactd and dropping reclaimable memory
> >> for no reason at all to satisfy our huge order allocation that could easily
> >> exist within vmalloc'ed memory instead.
> > This would hint at an issue with kvmalloc(), why not fixing it, instead
> > of trying to fix all its users ?
I agree with Eric. There is nothing special in xtables compared to
kvmalloc usage elsewhere in the stack. Why "fix" xtables and not e.g.
rhashtable?
Please work with mm hackers to improve the situation for your use case.
Maybe it's enough to raise __GFP_NORETRY in kmalloc_gfp_adjust() if the
size results in a >= PAGE_ALLOC_COSTLY_ORDER allocation.
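The suggested tweak could look roughly like the following userspace
sketch (the __GFP_NORETRY value is an illustrative stand-in rather than
the real gfp_types.h definition, and kmalloc_gfp_adjust() here is only a
model of the mm/util.c helper, not the actual code):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in value; the real definition lives in gfp_types.h. */
#define __GFP_NORETRY 0x1000u
#define PAGE_SHIFT 12
#define PAGE_ALLOC_COSTLY_ORDER 3

typedef unsigned int gfp_t;

/* Order of the buddy-allocator block needed to hold `size` bytes. */
static int size_to_order(size_t size)
{
	size_t pages = (size + ((size_t)1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
	int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Model of the proposed tweak: for costly orders, forbid the page
 * allocator from retrying, so kvmalloc gives up quickly and falls back
 * to vmalloc instead of churning kswapd/kcompactd. */
static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
{
	if (size_to_order(size) >= PAGE_ALLOC_COSTLY_ORDER)
		flags |= __GFP_NORETRY;
	return flags;
}
```

This would keep the kmalloc-first preference for cheap orders while
removing the reclaim pressure the series complains about, for every
kvmalloc caller at once rather than just xtables.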
> Thanks for the quick reply! From my understanding, there is a lot of
> callers of kvmalloc
> who do indeed benefit from the physical memory being contiguous, because
> it is then
> used for hardware DMA etc., so I'm not sure that would be feasible.
How can that work? kvmalloc won't make vmalloc-backed memory
physically contiguous.
* Re: [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info
2025-09-23 11:36 ` Florian Westphal
@ 2025-09-23 12:04 ` Daniil Tatianin
2025-09-23 12:22 ` Eric Dumazet
0 siblings, 1 reply; 9+ messages in thread
From: Daniil Tatianin @ 2025-09-23 12:04 UTC (permalink / raw)
To: Florian Westphal
Cc: Eric Dumazet, Pablo Neira Ayuso, Jozsef Kadlecsik, Phil Sutter,
David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Simon Horman, netfilter-devel, coreteam, linux-kernel, netdev
On 9/23/25 2:36 PM, Florian Westphal wrote:
> Daniil Tatianin <d-tatianin@yandex-team.ru> wrote:
>>> On Mon, Sep 22, 2025 at 12:48 PM Daniil Tatianin
>>> <d-tatianin@yandex-team.ru> wrote:
>>>> This code previously always used vmalloc for anything above
>>>> PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
>>>> commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
>>>>
>>>> The commit that changed it did so because "xt_alloc_table_info()
>>>> basically opencodes kvmalloc()", which is not actually what it was
>>>> doing. kvmalloc() does not attempt to go directly to vmalloc if the
>>>> order the caller is trying to allocate is "expensive", instead it only
>>>> uses vmalloc as a fallback in case the buddy allocator is not able to
>>>> fulfill the request.
>>>>
>>>> The difference between the two is actually huge in case the system is
>>>> under memory pressure and has no free pages of a large order. Before the
>>>> change to kvmalloc we wouldn't even try going to the buddy allocator for
>>>> large orders, but now we would force it to try to find a page of the
>>>> required order by waking up kswapd/kcompactd and dropping reclaimable memory
>>>> for no reason at all to satisfy our huge order allocation that could easily
>>>> exist within vmalloc'ed memory instead.
>>> This would hint at an issue with kvmalloc(), why not fixing it, instead
>>> of trying to fix all its users ?
> I agree with Eric. There is nothing special in xtables compared to
> kvmalloc usage elsewhere in the stack. Why "fix" xtables and not e.g.
> rhashtable?
>
> Please work with mm hackers to improve the situation for your use case.
>
> Maybe its enough to raise __GFP_NORETRY in kmalloc_gfp_adjust() if size
> results in >= PAGE_ALLOC_COSTLY_ORDER allocation.
Thanks for your reply! Perhaps this is the way to go, although it might
have much broader implications, since there are tons of other callers
to take into account.
I'm not sure whether rhashtable's size also directly depends on user
input; I was only aware of x_tables, since this is the case we ran into
specifically.
>
>> Thanks for the quick reply! From my understanding, there is a lot of
>> callers of kvmalloc
>> who do indeed benefit from the physical memory being contiguous, because
>> it is then
>> used for hardware DMA etc., so I'm not sure that would be feasible.
> How can that work? kvmalloc won't make vmalloc backed memory
> physically contiguous.
The allocated physical memory won't be contiguous only in the fallback
case (which should be rare); I assume the hardware operation may then
end up being more expensive, with larger scatter-gather lists etc. So
most of the time such code can take optimized paths for fully
contiguous memory. This is not the case for x_tables etc.
* Re: [PATCH 1/3] netfilter/x_tables: go back to using vmalloc for xt_table_info
2025-09-23 12:04 ` Daniil Tatianin
@ 2025-09-23 12:22 ` Eric Dumazet
0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2025-09-23 12:22 UTC (permalink / raw)
To: Daniil Tatianin
Cc: Florian Westphal, Pablo Neira Ayuso, Jozsef Kadlecsik,
Phil Sutter, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
linux-kernel, netdev
On Tue, Sep 23, 2025 at 5:04 AM Daniil Tatianin
<d-tatianin@yandex-team.ru> wrote:
>
> On 9/23/25 2:36 PM, Florian Westphal wrote:
>
> > Daniil Tatianin <d-tatianin@yandex-team.ru> wrote:
> >>> On Mon, Sep 22, 2025 at 12:48 PM Daniil Tatianin
> >>> <d-tatianin@yandex-team.ru> wrote:
> >>>> This code previously always used vmalloc for anything above
> >>>> PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
> >>>> commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
> >>>>
> >>>> The commit that changed it did so because "xt_alloc_table_info()
> >>>> basically opencodes kvmalloc()", which is not actually what it was
> >>>> doing. kvmalloc() does not attempt to go directly to vmalloc if the
> >>>> order the caller is trying to allocate is "expensive", instead it only
> >>>> uses vmalloc as a fallback in case the buddy allocator is not able to
> >>>> fulfill the request.
> >>>>
> >>>> The difference between the two is actually huge in case the system is
> >>>> under memory pressure and has no free pages of a large order. Before the
> >>>> change to kvmalloc we wouldn't even try going to the buddy allocator for
> >>>> large orders, but now we would force it to try to find a page of the
> >>>> required order by waking up kswapd/kcompactd and dropping reclaimable memory
> >>>> for no reason at all to satisfy our huge order allocation that could easily
> >>>> exist within vmalloc'ed memory instead.
> >>> This would hint at an issue with kvmalloc(), why not fixing it, instead
> >>> of trying to fix all its users ?
> > I agree with Eric. There is nothing special in xtables compared to
> > kvmalloc usage elsewhere in the stack. Why "fix" xtables and not e.g.
> > rhashtable?
> >
> > Please work with mm hackers to improve the situation for your use case.
> >
> > Maybe its enough to raise __GFP_NORETRY in kmalloc_gfp_adjust() if size
> > results in >= PAGE_ALLOC_COSTLY_ORDER allocation.
>
> Thanks for your reply! Perhaps this is the way to go, although this
> might have
> much broader implications since there are tons of other callers to take
> into account.
>
> I'm not sure whether rhashtable's size also directly depends on user
> input, I was only
> aware of x_table since this is the case we ran into specifically.
>
> >
> >> Thanks for the quick reply! From my understanding, there is a lot of
> >> callers of kvmalloc
> >> who do indeed benefit from the physical memory being contiguous, because
> >> it is then
> >> used for hardware DMA etc., so I'm not sure that would be feasible.
> > How can that work? kvmalloc won't make vmalloc backed memory
> > physically contiguous.
>
> The allocated physical memory won't be contiguous only for fallback
> cases (which should be rare),
> I assume in that case the hardware operation may end up being more
> expensive with larger scatter-gather
> lists etc. So most of the time such code can take optimized paths for
> fully contiguous memory. This is not
> the case for x_tables etc.
At least some years ago, we were seeing a performance difference.
x_tables data is often read sequentially; I do not know if modern CPU
hardware prefetchers use the TLB (virtual addresses), or if they can
span a 4K page boundary even when the memory is physically contiguous.
Some context:
commit 6c5ab6511f718c3fb19bcc3f78a90b0e0b601675
Author: Michal Hocko <mhocko@suse.com>
Date: Mon May 8 15:57:15 2017 -0700
mm: support __GFP_REPEAT in kvmalloc_node for >32kB
vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
vhost_vsock because it would really like to prefer kmalloc to the
vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
allocation to vmalloc") for more context. Michael Tsirkin has also
noted:
commit 23cc5a991c7a9fb7e6d6550e65cee4f4173111c5
Author: Michael S. Tsirkin <mst@redhat.com>
Date: Wed Jan 23 21:46:47 2013 +0100
vhost-net: extend device allocation to vmalloc