* Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
@ 2025-04-14 16:29 Jaroslav Pulchart
2025-04-14 17:15 ` [Intel-wired-lan] " Paul Menzel
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-04-14 16:29 UTC (permalink / raw)
To: Tony Nguyen, Kitszel, Przemyslaw
Cc: jdamato, intel-wired-lan, netdev, Igor Raits, Daniel Secik,
Zdenek Pesek
Hello,
While investigating increased memory usage after upgrading our
host/hypervisor servers from Linux kernel 6.12.y to 6.13.y, I observed
a regression in available memory per NUMA node. Our servers allocate
60GB of each NUMA node’s 64GB of RAM to HugePages for VMs, leaving 4GB
for the host OS.
After the upgrade, we noticed approximately 500MB less free RAM on
NUMA nodes 0 and 2 compared to 6.12.y, even with no VMs running (just
the host OS after reboot). These nodes host Intel 810-XXV NICs. Here's
a snapshot of the NUMA stats on vanilla 6.13.y:
NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
HPFreeGiB: 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60
MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
MemFree: 2793 3559 3150 3438 3616 3722 3520 3547 3547 3536 3506 3452 3440 3489 3607 3729
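For reference, the table above is a condensed view; roughly the same raw numbers can be read straight from sysfs (values are in kB and huge pages are reported as page counts, so this is only an approximate reproduction of the summary):

for n in /sys/devices/system/node/node*; do
    grep -E 'MemTotal|MemFree|HugePages_Free' "$n/meminfo"
done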
We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
"ice: Add support for persistent NAPI config".
We limit the number of channels on the NICs to match the number of local
NUMA cores, or fewer for unused interfaces (down from the excessive default
of 96), for example:
ethtool -L em1 combined 6 # active port; from 96
ethtool -L p3p2 combined 2 # unused port; from 96
This typically aligns memory use with local CPUs and keeps NUMA-local
memory usage within expected limits. However, starting with kernel
6.13.y and this commit, the high memory usage by the ICE driver
persists regardless of reduced channel configuration.
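For what it's worth, the reduced channel count can be double-checked with
ethtool's read-only query, so this is not simply a case of ethtool -L
silently failing:

ethtool -l em1    # compare "Pre-set maximums" vs "Current hardware settings"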
Reverting the commit restores expected memory availability on nodes 0
and 2. Below are stats from 6.13.y with the commit reverted:
NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
HPFreeGiB: 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60
MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
MemFree: 3208 3765 3668 3507 3811 3727 3812 3546 3676 3596 ...
This brings nodes 0 and 2 back to ~3.5GB free RAM, similar to kernel
6.12.y, and avoids swap pressure and memory exhaustion when running
services and VMs.
I also do not see any practical benefit in persisting the channel
memory allocation. After a fresh server reboot, channels are not
explicitly configured, and the system will not automatically resize
them back to a higher count unless manually set again. Therefore,
retaining the previous memory footprint appears unnecessary and
potentially harmful in memory-constrained environments.
Best regards,
Jaroslav Pulchart
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-14 16:29 Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad) Jaroslav Pulchart
@ 2025-04-14 17:15 ` Paul Menzel
2025-04-15 14:38 ` Przemek Kitszel
2025-07-04 16:55 ` Michal Kubiak
2 siblings, 0 replies; 46+ messages in thread
From: Paul Menzel @ 2025-04-14 17:15 UTC (permalink / raw)
To: Jaroslav Pulchart, Tony Nguyen, Przemyslaw Kitszel
Cc: jdamato, intel-wired-lan, netdev, Igor Raits, Daniel Secik,
Zdenek Pesek, regressions
#regzbot ^introduced: 492a044508ad13a490a24c66f311339bf891cb5f
Am 14.04.25 um 18:29 schrieb Jaroslav Pulchart:
> Hello,
>
> While investigating increased memory usage after upgrading our
> host/hypervisor servers from Linux kernel 6.12.y to 6.13.y, I observed
> a regression in available memory per NUMA node. Our servers allocate
> 60GB of each NUMA node’s 64GB of RAM to HugePages for VMs, leaving 4GB
> for the host OS.
>
> After the upgrade, we noticed approximately 500MB less free RAM on
> NUMA nodes 0 and 2 compared to 6.12.y, even with no VMs running (just
> the host OS after reboot). These nodes host Intel 810-XXV NICs. Here's
> a snapshot of the NUMA stats on vanilla 6.13.y:
>
> NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> HPFreeGiB: 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60
> MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
> MemFree: 2793 3559 3150 3438 3616 3722 3520 3547 3547 3536 3506 3452 3440 3489 3607 3729
>
> We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> "ice: Add support for persistent NAPI config".
>
> We limit the number of channels on the NICs to match local NUMA cores
> or less if unused interface (from ridiculous 96 default), for example:
> ethtool -L em1 combined 6 # active port; from 96
> ethtool -L p3p2 combined 2 # unused port; from 96
>
> This typically aligns memory use with local CPUs and keeps NUMA-local
> memory usage within expected limits. However, starting with kernel
> 6.13.y and this commit, the high memory usage by the ICE driver
> persists regardless of reduced channel configuration.
>
> Reverting the commit restores expected memory availability on nodes 0
> and 2. Below are stats from 6.13.y with the commit reverted:
> NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> HPFreeGiB: 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60
> MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
> MemFree: 3208 3765 3668 3507 3811 3727 3812 3546 3676 3596 ...
>
> This brings nodes 0 and 2 back to ~3.5GB free RAM, similar to kernel
> 6.12.y, and avoids swap pressure and memory exhaustion when running
> services and VMs.
>
> I also do not see any practical benefit in persisting the channel
> memory allocation. After a fresh server reboot, channels are not
> explicitly configured, and the system will not automatically resize
> them back to a higher count unless manually set again. Therefore,
> retaining the previous memory footprint appears unnecessary and
> potentially harmful in memory-constrained environments
>
> Best regards,
> Jaroslav Pulchart
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-14 16:29 Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad) Jaroslav Pulchart
2025-04-14 17:15 ` [Intel-wired-lan] " Paul Menzel
@ 2025-04-15 14:38 ` Przemek Kitszel
2025-04-16 0:53 ` Jakub Kicinski
2025-07-04 16:55 ` Michal Kubiak
2 siblings, 1 reply; 46+ messages in thread
From: Przemek Kitszel @ 2025-04-15 14:38 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: jdamato, intel-wired-lan, netdev, Tony Nguyen, Igor Raits,
Daniel Secik, Zdenek Pesek, Jakub Kicinski, Eric Dumazet,
Martin Karsten, Ahmed Zaki, Czapnik, Lukasz, Michal Swiatkowski
On 4/14/25 18:29, Jaroslav Pulchart wrote:
> Hello,
+CC to co-devs and reviewers of initial napi_config introduction
+CC Ahmed, who leverages napi_config for more stuff in 6.15
>
> While investigating increased memory usage after upgrading our
> host/hypervisor servers from Linux kernel 6.12.y to 6.13.y, I observed
> a regression in available memory per NUMA node. Our servers allocate
> 60GB of each NUMA node’s 64GB of RAM to HugePages for VMs, leaving 4GB
> for the host OS.
>
> After the upgrade, we noticed approximately 500MB less free RAM on
> NUMA nodes 0 and 2 compared to 6.12.y, even with no VMs running (just
> the host OS after reboot). These nodes host Intel 810-XXV NICs. Here's
> a snapshot of the NUMA stats on vanilla 6.13.y:
>
> NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> HPFreeGiB: 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60
> MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
> MemFree: 2793 3559 3150 3438 3616 3722 3520 3547 3547 3536 3506 3452 3440 3489 3607 3729
>
> We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> "ice: Add support for persistent NAPI config".
Thank you for the report and bisection; this commit is ice's opt-in to
using persistent napi_config.
I have checked the code, and there is nothing obvious in the touched parts
that would inflate memory consumption in the driver or core. I have not yet
looked into how much memory is eaten by the hash array of the now-kept
configs.
>
> We limit the number of channels on the NICs to match local NUMA cores
> or less if unused interface (from ridiculous 96 default), for example:
We will experiment with other defaults; it looks like the number of total
CPUs, instead of local NUMA cores, might be better here. And even if that
resolved the issue, I would still like a more direct fix for this.
> ethtool -L em1 combined 6 # active port; from 96
> ethtool -L p3p2 combined 2 # unused port; from 96
>
> This typically aligns memory use with local CPUs and keeps NUMA-local
> memory usage within expected limits. However, starting with kernel
> 6.13.y and this commit, the high memory usage by the ICE driver
> persists regardless of reduced channel configuration.
As a workaround, you could try a devlink reload (action driver_reinit);
that should flush all NAPI instances.
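Something along these lines (the PCI address below is only an example; pick
the handle that devlink dev show prints for your port):

devlink dev show
devlink dev reload pci/0000:3b:00.0 action driver_reinit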
We will try to reproduce the issue locally and work on a fix.
>
> Reverting the commit restores expected memory availability on nodes 0
> and 2. Below are stats from 6.13.y with the commit reverted:
> NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> HPFreeGiB: 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60
> MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
> MemFree: 3208 3765 3668 3507 3811 3727 3812 3546 3676 3596 ...
>
> This brings nodes 0 and 2 back to ~3.5GB free RAM, similar to kernel
> 6.12.y, and avoids swap pressure and memory exhaustion when running
> services and VMs.
>
> I also do not see any practical benefit in persisting the channel
> memory allocation. After a fresh server reboot, channels are not
> explicitly configured, and the system will not automatically resize
> them back to a higher count unless manually set again. Therefore,
> retaining the previous memory footprint appears unnecessary and
> potentially harmful in memory-constrained environments
In this particular case there is indeed no benefit; the feature was designed
to keep the config/stats for queues that were meaningfully used.
It is rather clunky anyway.
>
> Best regards,
> Jaroslav Pulchart
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-15 14:38 ` Przemek Kitszel
@ 2025-04-16 0:53 ` Jakub Kicinski
2025-04-16 7:13 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2025-04-16 0:53 UTC (permalink / raw)
To: Przemek Kitszel
Cc: Jaroslav Pulchart, jdamato, intel-wired-lan, netdev, Tony Nguyen,
Igor Raits, Daniel Secik, Zdenek Pesek, Eric Dumazet,
Martin Karsten, Ahmed Zaki, Czapnik, Lukasz, Michal Swiatkowski
On Tue, 15 Apr 2025 16:38:40 +0200 Przemek Kitszel wrote:
> > We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> > "ice: Add support for persistent NAPI config".
>
> thank you for the report and bisection,
> this commit is ice's opt-in into using persistent napi_config
>
> I have checked the code, and there is nothing obvious to inflate memory
> consumption in the driver/core in the touched parts. I have not yet
> looked into how much memory is eaten by the hash array of now-kept
> configs.
+1, it is also unclear to me how that commit makes any difference.
Jaroslav, when you say "traced" what do you mean?
CONFIG_MEM_ALLOC_PROFILING ?
The napi_config struct is just 24B. The queue struct (we allocate
napi_config for each queue) is 320B...
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-16 0:53 ` Jakub Kicinski
@ 2025-04-16 7:13 ` Jaroslav Pulchart
2025-04-16 13:48 ` Jakub Kicinski
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-04-16 7:13 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Przemek Kitszel, jdamato, intel-wired-lan, netdev, Tony Nguyen,
Igor Raits, Daniel Secik, Zdenek Pesek, Eric Dumazet,
Martin Karsten, Ahmed Zaki, Czapnik, Lukasz, Michal Swiatkowski
st 16. 4. 2025 v 2:54 odesílatel Jakub Kicinski <kuba@kernel.org> napsal:
>
> On Tue, 15 Apr 2025 16:38:40 +0200 Przemek Kitszel wrote:
> > > We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> > > "ice: Add support for persistent NAPI config".
> >
> > thank you for the report and bisection,
> > this commit is ice's opt-in into using persistent napi_config
> >
> > I have checked the code, and there is nothing obvious to inflate memory
> > consumption in the driver/core in the touched parts. I have not yet
> > looked into how much memory is eaten by the hash array of now-kept
> > configs.
>
> +1 also unclear to me how that commit makes any difference.
>
> Jaroslav, when you say "traced" what do you mean?
> CONFIG_MEM_ALLOC_PROFILING ?
>
> The napi_config struct is just 24B. The queue struct (we allocate
> napi_config for each queue) is 320B...
By "traced" I mean using the kernel and checking memory situation on
numa nodes with and without production load. Numa nodes, with X810
NIC, showing a quite less available memory with default queue length
(num of all cpus) and it needs to be lowered to 1-2 (for unused
interfaces) and up-to-count of numa node cores on used interfaces to
make the memory allocation reasonable and server avoiding "kswapd"...
See "MemFree" on numa 0 + 1 on different/smaller but utilized (running
VMs + using network) host server with 8 numa nodes (32GB RAM each, 28G
in Hugepase for VMs and 4GB for host os):
6.13.y vanilla (lot of kswapd0 in background):
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 0 0 0 0 0 0 0 0
MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
MemFree: 274 254 1327 1928 1949 2683 2624 2769
6.13.y + Revert (no memory issues at all):
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 0 0 0 0 0 0 0 0
MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
MemFree: 2213 2438 3402 3108 2846 2672 2592 3063
We need to lower the queue count on all X810 interfaces from the default
(64 in this case) to ensure we have memory available for host OS services.
ethtool -L em2 combined 1
ethtool -L p3p2 combined 1
ethtool -L em1 combined 6
ethtool -L p3p1 combined 6
This trick "does not work" without the revert.
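As a rough back-of-envelope of why the queue count matters so much (only an
estimate: it assumes 2048-descriptor Rx rings and one full 4 KiB page per
descriptor, while the driver may split pages and roughly halve this):

# 64 queues x 2048 Rx descriptors x 4 KiB per buffer page
echo $((64 * 2048 * 4 / 1024)) MiB    # -> 512 MiB of Rx buffers per port

which is in the same ballpark as the per-node difference described above.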
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-16 7:13 ` Jaroslav Pulchart
@ 2025-04-16 13:48 ` Jakub Kicinski
2025-04-16 16:03 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2025-04-16 13:48 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Przemek Kitszel, jdamato, intel-wired-lan, netdev, Tony Nguyen,
Igor Raits, Daniel Secik, Zdenek Pesek, Eric Dumazet,
Martin Karsten, Ahmed Zaki, Czapnik, Lukasz, Michal Swiatkowski
On Wed, 16 Apr 2025 09:13:23 +0200 Jaroslav Pulchart wrote:
> By "traced" I mean using the kernel and checking memory situation on
> numa nodes with and without production load. Numa nodes, with X810
> NIC, showing a quite less available memory with default queue length
> (num of all cpus) and it needs to be lowered to 1-2 (for unused
> interfaces) and up-to-count of numa node cores on used interfaces to
> make the memory allocation reasonable and server avoiding "kswapd"...
>
> See "MemFree" on numa 0 + 1 on different/smaller but utilized (running
> VMs + using network) host server with 8 numa nodes (32GB RAM each, 28G
> in Hugepase for VMs and 4GB for host os):
FWIW you can also try the tools/net/ynl/samples/page-pool
application, not sure if Intel NICs init page pools appropriately
but this will show you exactly how much memory is sitting on Rx rings
of the driver (and in net socket buffers).
> 6.13.y vanilla (lot of kswapd0 in background):
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 0 0 0 0 0 0 0 0
> > MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
> MemFree: 274 254 1327 1928 1949 2683 2624 2769
> 6.13.y + Revert (no memory issues at all):
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 0 0 0 0 0 0 0 0
> MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
> MemFree: 2213 2438 3402 3108 2846 2672 2592 3063
>
> We need to lower the queue on all X810 interfaces from default (64 in
> this case), to ensure we have memory available for host OS services.
> ethtool -L em2 combined 1
> ethtool -L p3p2 combined 1
> ethtool -L em1 combined 6
> ethtool -L p3p1 combined 6
> This trick "does not work" without the revert.
And you're reverting just and exactly 492a044508ad13 ?
The memory for persistent config is allocated in alloc_netdev_mqs()
unconditionally. I'm lost as to how this commit could make any
difference :(
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-16 13:48 ` Jakub Kicinski
@ 2025-04-16 16:03 ` Jaroslav Pulchart
2025-04-16 22:44 ` Jakub Kicinski
2025-04-16 22:57 ` Keller, Jacob E
0 siblings, 2 replies; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-04-16 16:03 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Przemek Kitszel, jdamato, intel-wired-lan, netdev, Tony Nguyen,
Igor Raits, Daniel Secik, Zdenek Pesek, Eric Dumazet,
Martin Karsten, Ahmed Zaki, Czapnik, Lukasz, Michal Swiatkowski
>
> On Wed, 16 Apr 2025 09:13:23 +0200 Jaroslav Pulchart wrote:
> > By "traced" I mean using the kernel and checking memory situation on
> > numa nodes with and without production load. Numa nodes, with X810
> > NIC, showing a quite less available memory with default queue length
> > (num of all cpus) and it needs to be lowered to 1-2 (for unused
> > interfaces) and up-to-count of numa node cores on used interfaces to
> > make the memory allocation reasonable and server avoiding "kswapd"...
> >
> > See "MemFree" on numa 0 + 1 on different/smaller but utilized (running
> > VMs + using network) host server with 8 numa nodes (32GB RAM each, 28G
> > in Hugepase for VMs and 4GB for host os):
>
> FWIW you can also try the tools/net/ynl/samples/page-pool
> application, not sure if Intel NICs init page pools appropriately
> but this will show you exactly how much memory is sitting on Rx rings
> of the driver (and in net socket buffers).
I'm not familiar with the page-pool tool; I tried to build it and run it,
and nothing is shown. Any hint/manual on how to use it?
>
> > 6.13.y vanilla (lot of kswapd0 in background):
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 0 0 0 0 0 0 0 0
> > MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
> > MemFree: 274 254 1327 1928 1949 2683 2624 2769
> > 6.13.y + Revert (no memory issues at all):
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 0 0 0 0 0 0 0 0
> > MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
> > MemFree: 2213 2438 3402 3108 2846 2672 2592 3063
> >
> > We need to lower the queue on all X810 interfaces from default (64 in
> > this case), to ensure we have memory available for host OS services.
> > ethtool -L em2 combined 1
> > ethtool -L p3p2 combined 1
> > ethtool -L em1 combined 6
> > ethtool -L p3p1 combined 6
> > This trick "does not work" without the revert.
>
> And you're reverting just and exactly 492a044508ad13 ?
> The memory for persistent config is allocated in alloc_netdev_mqs()
> unconditionally. I'm lost as to how this commit could make any
> difference :(
Yes, reverted the 492a044508ad13.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-16 16:03 ` Jaroslav Pulchart
@ 2025-04-16 22:44 ` Jakub Kicinski
2025-04-16 22:57 ` [Intel-wired-lan] " Keller, Jacob E
2025-04-16 22:57 ` Keller, Jacob E
1 sibling, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2025-04-16 22:44 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Przemek Kitszel, jdamato, intel-wired-lan, netdev, Tony Nguyen,
Igor Raits, Daniel Secik, Zdenek Pesek, Eric Dumazet,
Martin Karsten, Ahmed Zaki, Czapnik, Lukasz, Michal Swiatkowski
On Wed, 16 Apr 2025 18:03:52 +0200 Jaroslav Pulchart wrote:
> > FWIW you can also try the tools/net/ynl/samples/page-pool
> > application, not sure if Intel NICs init page pools appropriately
> > but this will show you exactly how much memory is sitting on Rx rings
> > of the driver (and in net socket buffers).
>
> I'm not familiar with the page-pool tool, I try to build it, run it
> and nothing is shown. Any hint/menual how to use it?
It's pretty dumb, you run it and it tells you how much memory is
allocated by Rx page pools. Commit message has an example:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=637567e4a3ef6f6a5ffa48781207d270265f7e68
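Roughly like this from a kernel source tree (exact build steps may differ
between trees, and the sample only reports pools the driver actually
registered):

make -C tools/net/ynl/                  # builds libynl plus the samples
./tools/net/ynl/samples/page-pool       # prints per-interface page-pool memory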
^ permalink raw reply [flat|nested] 46+ messages in thread
* RE: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-16 16:03 ` Jaroslav Pulchart
2025-04-16 22:44 ` Jakub Kicinski
@ 2025-04-16 22:57 ` Keller, Jacob E
2025-04-17 0:13 ` Jakub Kicinski
1 sibling, 1 reply; 46+ messages in thread
From: Keller, Jacob E @ 2025-04-16 22:57 UTC (permalink / raw)
To: Jaroslav Pulchart, Jakub Kicinski
Cc: Kitszel, Przemyslaw, Damato, Joe,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
Nguyen, Anthony L, Igor Raits, Daniel Secik, Zdenek Pesek,
Dumazet, Eric, Martin Karsten, Zaki, Ahmed, Czapnik, Lukasz,
Michal Swiatkowski
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Jaroslav
> Pulchart
> Sent: Wednesday, April 16, 2025 9:04 AM
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Damato, Joe
> <jdamato@fastly.com>; intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org;
> Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Igor Raits
> <igor@gooddata.com>; Daniel Secik <daniel.secik@gooddata.com>; Zdenek Pesek
> <zdenek.pesek@gooddata.com>; Dumazet, Eric <edumazet@google.com>; Martin
> Karsten <mkarsten@uwaterloo.ca>; Zaki, Ahmed <ahmed.zaki@intel.com>;
> Czapnik, Lukasz <lukasz.czapnik@intel.com>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>
> Subject: Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE
> driver after upgrade to 6.13.y (regression in commit 492a044508ad)
>
> >
> > On Wed, 16 Apr 2025 09:13:23 +0200 Jaroslav Pulchart wrote:
> > > By "traced" I mean using the kernel and checking memory situation on
> > > numa nodes with and without production load. Numa nodes, with X810
> > > NIC, showing a quite less available memory with default queue length
> > > (num of all cpus) and it needs to be lowered to 1-2 (for unused
> > > interfaces) and up-to-count of numa node cores on used interfaces to
> > > make the memory allocation reasonable and server avoiding "kswapd"...
> > >
> > > See "MemFree" on numa 0 + 1 on different/smaller but utilized (running
> > > VMs + using network) host server with 8 numa nodes (32GB RAM each, 28G
> > > in Hugepase for VMs and 4GB for host os):
> >
> > FWIW you can also try the tools/net/ynl/samples/page-pool
> > application, not sure if Intel NICs init page pools appropriately
> > but this will show you exactly how much memory is sitting on Rx rings
> > of the driver (and in net socket buffers).
>
> I'm not familiar with the page-pool tool, I try to build it, run it
> and nothing is shown. Any hint/menual how to use it?
>
> >
> > > 6.13.y vanilla (lot of kswapd0 in background):
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 0 0 0 0 0 0 0 0
> > > MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
> > > MemFree: 274 254 1327 1928 1949 2683 2624 2769
> > > 6.13.y + Revert (no memory issues at all):
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 0 0 0 0 0 0 0 0
> > > MemTotal: 32220 32701 32701 32686 32701 32701 32701 32696
> > > MemFree: 2213 2438 3402 3108 2846 2672 2592 3063
> > >
> > > We need to lower the queue on all X810 interfaces from default (64 in
> > > this case), to ensure we have memory available for host OS services.
> > > ethtool -L em2 combined 1
> > > ethtool -L p3p2 combined 1
> > > ethtool -L em1 combined 6
> > > ethtool -L p3p1 combined 6
> > > This trick "does not work" without the revert.
> >
> > And you're reverting just and exactly 492a044508ad13 ?
> > The memory for persistent config is allocated in alloc_netdev_mqs()
> > unconditionally. I'm lost as to how this commit could make any
> > difference :(
>
> Yes, reverted the 492a044508ad13.
Struct napi_config *is* 1056 bytes, or about 1 KB, and with this change we
allocate one per maximum queue, i.e. roughly 1 KB per CPU. On a 64-CPU
system that should be at most 64 KB... It seems unlikely that the
napi_config structure alone is the root cause of a memory shortage like this.
Perhaps netif_napi_restore_config is somehow causing us to end up with more
allocated memory? Or some interaction with our ethtool callback that reduces
the number of rings is not working properly?
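The per-kernel struct sizes are easy to double-check against the running
kernel, assuming a pahole build that can read BTF:

pahole -C napi_config /sys/kernel/btf/vmlinux     # size/layout of struct napi_config
pahole -C netdev_queue /sys/kernel/btf/vmlinux    # size/layout of the per-queue struct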
^ permalink raw reply [flat|nested] 46+ messages in thread
* RE: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-16 22:44 ` Jakub Kicinski
@ 2025-04-16 22:57 ` Keller, Jacob E
0 siblings, 0 replies; 46+ messages in thread
From: Keller, Jacob E @ 2025-04-16 22:57 UTC (permalink / raw)
To: Jakub Kicinski, Jaroslav Pulchart
Cc: Kitszel, Przemyslaw, Damato, Joe,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
Nguyen, Anthony L, Igor Raits, Daniel Secik, Zdenek Pesek,
Dumazet, Eric, Martin Karsten, Zaki, Ahmed, Czapnik, Lukasz,
Michal Swiatkowski
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Jakub
> Kicinski
> Sent: Wednesday, April 16, 2025 3:45 PM
> To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Cc: Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Damato, Joe
> <jdamato@fastly.com>; intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org;
> Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Igor Raits
> <igor@gooddata.com>; Daniel Secik <daniel.secik@gooddata.com>; Zdenek Pesek
> <zdenek.pesek@gooddata.com>; Dumazet, Eric <edumazet@google.com>; Martin
> Karsten <mkarsten@uwaterloo.ca>; Zaki, Ahmed <ahmed.zaki@intel.com>;
> Czapnik, Lukasz <lukasz.czapnik@intel.com>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>
> Subject: Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE
> driver after upgrade to 6.13.y (regression in commit 492a044508ad)
>
> On Wed, 16 Apr 2025 18:03:52 +0200 Jaroslav Pulchart wrote:
> > > FWIW you can also try the tools/net/ynl/samples/page-pool
> > > application, not sure if Intel NICs init page pools appropriately
> > > but this will show you exactly how much memory is sitting on Rx rings
> > > of the driver (and in net socket buffers).
> >
> > I'm not familiar with the page-pool tool, I try to build it, run it
> > and nothing is shown. Any hint/menual how to use it?
>
> It's pretty dumb, you run it and it tells you how much memory is
> allocated by Rx page pools. Commit message has an example:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i
> d=637567e4a3ef6f6a5ffa48781207d270265f7e68
Unfortunately, I don't think ice has migrated to page pool just yet ☹
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-16 22:57 ` Keller, Jacob E
@ 2025-04-17 0:13 ` Jakub Kicinski
2025-04-17 17:52 ` Keller, Jacob E
0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2025-04-17 0:13 UTC (permalink / raw)
To: Keller, Jacob E
Cc: Jaroslav Pulchart, Kitszel, Przemyslaw, Damato, Joe,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
Nguyen, Anthony L, Igor Raits, Daniel Secik, Zdenek Pesek,
Dumazet, Eric, Martin Karsten, Zaki, Ahmed, Czapnik, Lukasz,
Michal Swiatkowski
On Wed, 16 Apr 2025 22:57:10 +0000 Keller, Jacob E wrote:
> > > And you're reverting just and exactly 492a044508ad13 ?
> > > The memory for persistent config is allocated in alloc_netdev_mqs()
> > > unconditionally. I'm lost as to how this commit could make any
> > > difference :(
> >
> > Yes, reverted the 492a044508ad13.
>
> Struct napi_config *is* 1056 bytes
You're probably looking at 6.15-rcX kernels. Yes, the affinity mask
can be large depending on the kernel config. But the report is for 6.13,
AFAIU. In 6.13 and 6.14 napi_config was tiny.
^ permalink raw reply [flat|nested] 46+ messages in thread
* RE: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-17 0:13 ` Jakub Kicinski
@ 2025-04-17 17:52 ` Keller, Jacob E
2025-05-21 10:50 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Keller, Jacob E @ 2025-04-17 17:52 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jaroslav Pulchart, Kitszel, Przemyslaw, Damato, Joe,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
Nguyen, Anthony L, Igor Raits, Daniel Secik, Zdenek Pesek,
Dumazet, Eric, Martin Karsten, Zaki, Ahmed, Czapnik, Lukasz,
Michal Swiatkowski
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, April 16, 2025 5:13 PM
> To: Keller, Jacob E <jacob.e.keller@intel.com>
> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Damato, Joe <jdamato@fastly.com>; intel-wired-
> lan@lists.osuosl.org; netdev@vger.kernel.org; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; Igor Raits <igor@gooddata.com>; Daniel Secik
> <daniel.secik@gooddata.com>; Zdenek Pesek <zdenek.pesek@gooddata.com>;
> Dumazet, Eric <edumazet@google.com>; Martin Karsten
> <mkarsten@uwaterloo.ca>; Zaki, Ahmed <ahmed.zaki@intel.com>; Czapnik,
> Lukasz <lukasz.czapnik@intel.com>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>
> Subject: Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE
> driver after upgrade to 6.13.y (regression in commit 492a044508ad)
>
> On Wed, 16 Apr 2025 22:57:10 +0000 Keller, Jacob E wrote:
> > > > And you're reverting just and exactly 492a044508ad13 ?
> > > > The memory for persistent config is allocated in alloc_netdev_mqs()
> > > > unconditionally. I'm lost as to how this commit could make any
> > > > difference :(
> > >
> > > Yes, reverted the 492a044508ad13.
> >
> > Struct napi_config *is* 1056 bytes
>
> You're probably looking at 6.15-rcX kernels. Yes, the affinity mask
> can be large depending on the kernel config. But report is for 6.13,
> AFAIU. In 6.13 and 6.14 napi_config was tiny.
Regardless, it should still be ~64KB even in that case, which is a far cry from eating all available memory. Something else must be going on...
Thanks,
Jake
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-17 17:52 ` Keller, Jacob E
@ 2025-05-21 10:50 ` Jaroslav Pulchart
2025-06-04 8:42 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-05-21 10:50 UTC (permalink / raw)
To: Keller, Jacob E, Jakub Kicinski, Kitszel, Przemyslaw, Damato, Joe,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
Nguyen, Anthony L, Michal Swiatkowski, Czapnik, Lukasz,
Dumazet, Eric, Zaki, Ahmed, Martin Karsten
Cc: Igor Raits, Daniel Secik, Zdenek Pesek
čt 17. 4. 2025 v 19:52 odesílatel Keller, Jacob E
<jacob.e.keller@intel.com> napsal:
>
>
>
> > -----Original Message-----
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Wednesday, April 16, 2025 5:13 PM
> > To: Keller, Jacob E <jacob.e.keller@intel.com>
> > Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>; Kitszel, Przemyslaw
> > <przemyslaw.kitszel@intel.com>; Damato, Joe <jdamato@fastly.com>; intel-wired-
> > lan@lists.osuosl.org; netdev@vger.kernel.org; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; Igor Raits <igor@gooddata.com>; Daniel Secik
> > <daniel.secik@gooddata.com>; Zdenek Pesek <zdenek.pesek@gooddata.com>;
> > Dumazet, Eric <edumazet@google.com>; Martin Karsten
> > <mkarsten@uwaterloo.ca>; Zaki, Ahmed <ahmed.zaki@intel.com>; Czapnik,
> > Lukasz <lukasz.czapnik@intel.com>; Michal Swiatkowski
> > <michal.swiatkowski@linux.intel.com>
> > Subject: Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE
> > driver after upgrade to 6.13.y (regression in commit 492a044508ad)
> >
> > On Wed, 16 Apr 2025 22:57:10 +0000 Keller, Jacob E wrote:
> > > > > And you're reverting just and exactly 492a044508ad13 ?
> > > > > The memory for persistent config is allocated in alloc_netdev_mqs()
> > > > > unconditionally. I'm lost as to how this commit could make any
> > > > > difference :(
> > > >
> > > > Yes, reverted the 492a044508ad13.
> > >
> > > Struct napi_config *is* 1056 bytes
> >
> > You're probably looking at 6.15-rcX kernels. Yes, the affinity mask
> > can be large depending on the kernel config. But report is for 6.13,
> > AFAIU. In 6.13 and 6.14 napi_config was tiny.
>
> Regardless, it should still be ~64KB even in that case which is a far cry from eating all available memory. Something else must be going on....
>
> Thanks,
> Jake
Hello
Some observations: this "problem" still exists with the latest 6.14.y, and
there must be multiple issues. The free memory is slowly going down, from
3GB to 100MB over 10-20 days, on the home NUMA nodes of the Intel X810 NICs
(it looks like some memory leak related to networking).
So without the revert, kswapdX activity is observed almost immediately,
within 1-2 days; with the mentioned commit reverted, kswapdX starts to
consume resources later, after roughly 10-20 days. It is almost impossible
to use servers with Intel X810 cards (ice driver) with recent Linux kernels.
Were you able to reproduce the memory problems in your testbed?
Best,
Jaroslav
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-05-21 10:50 ` Jaroslav Pulchart
@ 2025-06-04 8:42 ` Jaroslav Pulchart
[not found] ` <CAK8fFZ5XTO9dGADuMSV0hJws-6cZE9equa3X6dfTBgDyzE1pEQ@mail.gmail.com>
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-06-04 8:42 UTC (permalink / raw)
To: Keller, Jacob E, Jakub Kicinski, Kitszel, Przemyslaw, Damato, Joe,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
Nguyen, Anthony L, Michal Swiatkowski, Czapnik, Lukasz,
Dumazet, Eric, Zaki, Ahmed, Martin Karsten
Cc: Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1: Type: text/plain, Size: 3080 bytes --]
>
> čt 17. 4. 2025 v 19:52 odesílatel Keller, Jacob E
> <jacob.e.keller@intel.com> napsal:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Sent: Wednesday, April 16, 2025 5:13 PM
> > > To: Keller, Jacob E <jacob.e.keller@intel.com>
> > > Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>; Kitszel, Przemyslaw
> > > <przemyslaw.kitszel@intel.com>; Damato, Joe <jdamato@fastly.com>; intel-wired-
> > > lan@lists.osuosl.org; netdev@vger.kernel.org; Nguyen, Anthony L
> > > <anthony.l.nguyen@intel.com>; Igor Raits <igor@gooddata.com>; Daniel Secik
> > > <daniel.secik@gooddata.com>; Zdenek Pesek <zdenek.pesek@gooddata.com>;
> > > Dumazet, Eric <edumazet@google.com>; Martin Karsten
> > > <mkarsten@uwaterloo.ca>; Zaki, Ahmed <ahmed.zaki@intel.com>; Czapnik,
> > > Lukasz <lukasz.czapnik@intel.com>; Michal Swiatkowski
> > > <michal.swiatkowski@linux.intel.com>
> > > Subject: Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE
> > > driver after upgrade to 6.13.y (regression in commit 492a044508ad)
> > >
> > > On Wed, 16 Apr 2025 22:57:10 +0000 Keller, Jacob E wrote:
> > > > > > And you're reverting just and exactly 492a044508ad13 ?
> > > > > > The memory for persistent config is allocated in alloc_netdev_mqs()
> > > > > > unconditionally. I'm lost as to how this commit could make any
> > > > > > difference :(
> > > > >
> > > > > Yes, reverted the 492a044508ad13.
> > > >
> > > > Struct napi_config *is* 1056 bytes
> > >
> > > You're probably looking at 6.15-rcX kernels. Yes, the affinity mask
> > > can be large depending on the kernel config. But report is for 6.13,
> > > AFAIU. In 6.13 and 6.14 napi_config was tiny.
> >
> > Regardless, it should still be ~64KB even in that case which is a far cry from eating all available memory. Something else must be going on....
> >
> > Thanks,
> > Jake
>
> Hello
>
> Some observation, this "problem" still exists with the latest 6.14.y
> and there must be multiple issues, the memory utilization is slowly
> going down, from 3GB to 100MB in 10-20days. at home NUMA nodes where
> intel x810 NIC are (looks like some memory leak related to
> networking).
>
> So without the revert the kawadX usage is observed asap like till
> 1-2d, with revert of mentioned commit kswadX starts to consume
> resources later like in ~10d-20d later. It is almost impossible to use
> servers with Intel X810 cards (ice driver) with recent linux kernels.
>
> Were you able to reproduce the memory problems in your testbed?
>
> Best,
> Jaroslav
Hello
I deployed Linux 6.15.0 to our servers 7 days ago and observed the memory
utilization behaviour of the NUMA home nodes of the Intel X810 NICs:
1/ there is no need to revert the commit as before,
2/ the memory is continuously consumed (like a memory leak),
see the attached "7d_memory_usage_per_numa_linux6.15.0.png" screenshot of
8 NUMA nodes (NUMA0 + NUMA1 are local to the X810 NICs). BTW: we do not see
this memory utilization pattern on servers using Broadcom NetXtreme-E NICs.
[-- Attachment #2: 7d_memory_usage_per_numa_linux6.15.0.png --]
[-- Type: image/png, Size: 430093 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
[not found] ` <CAK8fFZ5XTO9dGADuMSV0hJws-6cZE9equa3X6dfTBgDyzE1pEQ@mail.gmail.com>
@ 2025-06-25 14:03 ` Przemek Kitszel
[not found] ` <CAK8fFZ7LREBEdhXjBAKuaqktOz1VwsBTxcCpLBsa+dkMj4Pyyw@mail.gmail.com>
2025-06-25 14:53 ` Paul Menzel
1 sibling, 1 reply; 46+ messages in thread
From: Przemek Kitszel @ 2025-06-25 14:03 UTC (permalink / raw)
To: Jaroslav Pulchart, intel-wired-lan@lists.osuosl.org
Cc: Keller, Jacob E, Jakub Kicinski, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
On 6/25/25 14:17, Jaroslav Pulchart wrote:
> Hello
>
> We are still facing the memory issue with Intel 810 NICs (even on latest
> 6.15.y).
>
> Our current stabilization and solution is to move everything to a new
> INTEL-FREE server and get rid of last Intel sights there (after Intel's
> CPU vulnerabilities fuckups NICs are next step).
>
> Any help welcomed,
> Jaroslav P.
>
>
Thank you for urging us; I can understand the frustration.
We have identified some (unrelated) memory leaks and will soon ship fixes.
And, as no commit/version you have posted turned out to be a clear culprit,
there is a chance that our unrelated findings could help. In any case,
getting to zero kmemleak reports is good in itself; that is a good start.
I will also ask my VAL team to increase efforts in this area.
Przemek
>
> st 4. 6. 2025 v 10:42 odesílatel Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com <mailto:jaroslav.pulchart@gooddata.com>>
> napsal:
>
> >
> > čt 17. 4. 2025 v 19:52 odesílatel Keller, Jacob E
> > <jacob.e.keller@intel.com <mailto:jacob.e.keller@intel.com>> napsal:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > Sent: Wednesday, April 16, 2025 5:13 PM
> > > > To: Keller, Jacob E <jacob.e.keller@intel.com>
> > > > Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>; Kitszel, Przemyslaw
> > > > <przemyslaw.kitszel@intel.com>; Damato, Joe <jdamato@fastly.com>; intel-wired-
> > > > lan@lists.osuosl.org; netdev@vger.kernel.org; Nguyen, Anthony L
> > > > <anthony.l.nguyen@intel.com>; Igor Raits <igor@gooddata.com>; Daniel Secik
> > > > <daniel.secik@gooddata.com>; Zdenek Pesek <zdenek.pesek@gooddata.com>;
> > > > Dumazet, Eric <edumazet@google.com>; Martin Karsten
> > > > <mkarsten@uwaterloo.ca>; Zaki, Ahmed <ahmed.zaki@intel.com>; Czapnik,
> > > > Lukasz <lukasz.czapnik@intel.com>; Michal Swiatkowski
> > > > <michal.swiatkowski@linux.intel.com>
> > > > Subject: Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE
> > > > driver after upgrade to 6.13.y (regression in commit 492a044508ad)
> > > >
> > > > On Wed, 16 Apr 2025 22:57:10 +0000 Keller, Jacob E wrote:
> > > > > > > And you're reverting just and exactly 492a044508ad13 ?
> > > > > > > The memory for persistent config is allocated in
> alloc_netdev_mqs()
> > > > > > > unconditionally. I'm lost as to how this commit could
> make any
> > > > > > > difference :(
> > > > > >
> > > > > > Yes, reverted the 492a044508ad13.
> > > > >
> > > > > Struct napi_config *is* 1056 bytes
> > > >
> > > > You're probably looking at 6.15-rcX kernels. Yes, the
> affinity mask
> > > > can be large depending on the kernel config. But report is
> for 6.13,
> > > > AFAIU. In 6.13 and 6.14 napi_config was tiny.
> > >
> > > Regardless, it should still be ~64KB even in that case which is
> a far cry from eating all available memory. Something else must be
> going on....
> > >
> > > Thanks,
> > > Jake
> >
> > Hello
> >
> > Some observation, this "problem" still exists with the latest 6.14.y
> > and there must be multiple issues, the memory utilization is slowly
> > going down, from 3GB to 100MB in 10-20days. at home NUMA nodes where
> > intel x810 NIC are (looks like some memory leak related to
> > networking).
> >
> > So without the revert the kawadX usage is observed asap like till
> > 1-2d, with revert of mentioned commit kswadX starts to consume
> > resources later like in ~10d-20d later. It is almost impossible to use
> > servers with Intel X810 cards (ice driver) with recent linux kernels.
> >
> > Were you able to reproduce the memory problems in your testbed?
> >
> > Best,
> > Jaroslav
>
> Hello
>
> I deployed linux 6.15.0 to our servers 7d ago and observed the
> behaviour of memory utilization of NUMA home nodes of Intel X810
> 1/ there is no need to revert the commit as before,
> 2/ the memory is continuously consumed (like memory leak),
> see attached "7d_memory_usage_per_numa_linux6.15.0.png" screenshot 8x
> numa nodes, (NUMA0 + NUMA1 are local for X810 nics). BTW: We do not
> see this memory utilization pattern on server s using Broadcom
> Netxtreme-E NICs
>
>
>
> --
> Jaroslav Pulchart
> Sr. Principal SW Engineer
> GoodData
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
[not found] ` <CAK8fFZ5XTO9dGADuMSV0hJws-6cZE9equa3X6dfTBgDyzE1pEQ@mail.gmail.com>
2025-06-25 14:03 ` Przemek Kitszel
@ 2025-06-25 14:53 ` Paul Menzel
1 sibling, 0 replies; 46+ messages in thread
From: Paul Menzel @ 2025-06-25 14:53 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Jacob E Keller, Jakub Kicinski, Przemyslaw Kitszel, Joe Damato,
intel-wired-lan, netdev, Anthony L Nguyen, Michal Swiatkowski,
Lukasz Czapnik, Eric Dumazet, Ahmed Zaki, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek, regressions
Dear Jaroslav,
Am 25.06.25 um 14:17 schrieb Jaroslav Pulchart:
> We are still facing the memory issue with Intel 810 NICs (even on latest
> 6.15.y).
Commit 492a044508ad13 ("ice: Add support for persistent NAPI config") was
added in Linux v6.13-rc1. As no fix has been presented so far, and reverting
it fixes your issue, I strongly recommend sending a revert. No idea if it's
compiler dependent or what else could be the issue, but due to Linux's
no-regression policy this should be reverted as soon as possible.
Kind regards,
Paul
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
[not found] ` <CAK8fFZ7LREBEdhXjBAKuaqktOz1VwsBTxcCpLBsa+dkMj4Pyyw@mail.gmail.com>
@ 2025-06-25 20:25 ` Jakub Kicinski
2025-06-26 7:42 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2025-06-25 20:25 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Przemek Kitszel, intel-wired-lan@lists.osuosl.org,
Keller, Jacob E, Damato, Joe, netdev@vger.kernel.org,
Nguyen, Anthony L, Michal Swiatkowski, Czapnik, Lukasz,
Dumazet, Eric, Zaki, Ahmed, Martin Karsten, Igor Raits,
Daniel Secik, Zdenek Pesek
On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
> Great, please send me a link to the related patch set. I can apply them in
> our kernel build and try them ASAP!
Sorry if I'm repeating the question - have you tried
CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
is low enough to use it for production workloads.
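Roughly (names as I remember them from Documentation/mm/allocation-profiling.rst,
so treat this as a sketch rather than exact instructions):

# build with CONFIG_MEM_ALLOC_PROFILING=y, then either boot with
#   sysctl.vm.mem_profiling=1
# or enable it at runtime:
sysctl vm.mem_profiling=1
sort -g /proc/allocinfo | tail -n 15    # biggest allocation sites, by bytes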
> st 25. 6. 2025 v 16:03 odesílatel Przemek Kitszel <
> przemyslaw.kitszel@intel.com> napsal:
>
> > On 6/25/25 14:17, Jaroslav Pulchart wrote:
> > > Hello
> > >
> > > We are still facing the memory issue with Intel 810 NICs (even on latest
> > > 6.15.y).
> > >
> > > Our current stabilization and solution is to move everything to a new
> > > INTEL-FREE server and get rid of last Intel sights there (after Intel's
> > > CPU vulnerabilities fuckups NICs are next step).
> > >
> > > Any help welcomed,
> > > Jaroslav P.
> > >
> > >
> >
> > Thank you for urging us, I can understand the frustration.
> >
> > We have identified some (unrelated) memory leaks, will soon ship fixes.
> > And, as there were no clear issue with any commit/version you have
> > posted to be a culprit, there is a chance that our random findings could
> > help. Anyway going to zero kmemleak reports is good in itself, that is
> > a good start.
> >
> > Will ask my VAL too to increase efforts in this area too.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-25 20:25 ` Jakub Kicinski
@ 2025-06-26 7:42 ` Jaroslav Pulchart
2025-06-30 7:35 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-06-26 7:42 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Przemek Kitszel, intel-wired-lan@lists.osuosl.org,
Keller, Jacob E, Damato, Joe, netdev@vger.kernel.org,
Nguyen, Anthony L, Michal Swiatkowski, Czapnik, Lukasz,
Dumazet, Eric, Zaki, Ahmed, Martin Karsten, Igor Raits,
Daniel Secik, Zdenek Pesek
>
> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
> > Great, please send me a link to the related patch set. I can apply them in
> > our kernel build and try them ASAP!
>
> Sorry if I'm repeating the question - have you tried
> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
> is low enough to use it for production workloads.
I am trying it now; on a freshly booted server:
# sort -g /proc/allocinfo| tail -n 15
45409728 236509 fs/dcache.c:1681 func:__d_alloc
71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
85098496 4486 mm/slub.c:2452 func:alloc_slab_page
115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
191594496 46776 mm/memory.c:1056 func:folio_prealloc
360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
444076032 33790 mm/slub.c:2450 func:alloc_slab_page
530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
1022427136 249616 mm/memory.c:1054 func:folio_prealloc
1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice] func:ice_alloc_mapped_page
1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
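A crude way to watch whether any particular line keeps growing is to
snapshot it periodically; just a sketch, the interval and log path below are
arbitrary:

while sleep 3600; do
    date
    grep 'ice_txrx.c:681' /proc/allocinfo
done >> /var/tmp/allocinfo-ice.log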
>
> > st 25. 6. 2025 v 16:03 odesílatel Przemek Kitszel <
> > przemyslaw.kitszel@intel.com> napsal:
> >
> > > On 6/25/25 14:17, Jaroslav Pulchart wrote:
> > > > Hello
> > > >
> > > > We are still facing the memory issue with Intel 810 NICs (even on latest
> > > > 6.15.y).
> > > >
> > > > Our current stabilization and solution is to move everything to a new
> > > > INTEL-FREE server and get rid of last Intel sights there (after Intel's
> > > > CPU vulnerabilities fuckups NICs are next step).
> > > >
> > > > Any help welcomed,
> > > > Jaroslav P.
> > > >
> > > >
> > >
> > > Thank you for urging us, I can understand the frustration.
> > >
> > > We have identified some (unrelated) memory leaks, will soon ship fixes.
> > > And, as there were no clear issue with any commit/version you have
> > > posted to be a culprit, there is a chance that our random findings could
> > > help. Anyway going to zero kmemleak reports is good in itself, that is
> > > a good start.
> > >
> > > Will ask my VAL too to increase efforts in this area too.
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-26 7:42 ` Jaroslav Pulchart
@ 2025-06-30 7:35 ` Jaroslav Pulchart
2025-06-30 16:02 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-06-30 7:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Przemek Kitszel, intel-wired-lan@lists.osuosl.org,
Keller, Jacob E, Damato, Joe, netdev@vger.kernel.org,
Nguyen, Anthony L, Michal Swiatkowski, Czapnik, Lukasz,
Dumazet, Eric, Zaki, Ahmed, Martin Karsten, Igor Raits,
Daniel Secik, Zdenek Pesek
>
> >
> > On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
> > > Great, please send me a link to the related patch set. I can apply them in
> > > our kernel build and try them ASAP!
> >
> > Sorry if I'm repeating the question - have you tried
> > CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
> > is low enough to use it for production workloads.
>
> I try it now, the fresh booted server:
>
> # sort -g /proc/allocinfo| tail -n 15
> 45409728 236509 fs/dcache.c:1681 func:__d_alloc
> 71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
> 85098496 4486 mm/slub.c:2452 func:alloc_slab_page
> 115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> 141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
> 191594496 46776 mm/memory.c:1056 func:folio_prealloc
> 360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
> 444076032 33790 mm/slub.c:2450 func:alloc_slab_page
> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> 975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> 1022427136 249616 mm/memory.c:1054 func:folio_prealloc
> 1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice] func:ice_alloc_mapped_page
> 1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
>
The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
func:ice_alloc_mapped_page" is just growing...
# uptime ; sort -g /proc/allocinfo| tail -n 15
09:33:58 up 4 days, 6 min, 1 user, load average: 6.65, 8.18, 9.81
# sort -g /proc/allocinfo| tail -n 15
85216896 443838 fs/dcache.c:1681 func:__d_alloc
106156032 25917 mm/shmem.c:1854 func:shmem_alloc_folio
116850096 102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
143556608 6894 mm/slub.c:2452 func:alloc_slab_page
186793984 45604 mm/memory.c:1056 func:folio_prealloc
362807296 88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
598237184 51309 mm/slub.c:2450 func:alloc_slab_page
838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
929083392 226827 mm/filemap.c:1978 func:__filemap_get_folio
1034657792 252602 mm/memory.c:1054 func:folio_prealloc
1262485504 602 mm/khugepaged.c:1084 func:alloc_charge_folio
1335377920 325970 mm/readahead.c:186 func:ractl_alloc_folio
2544877568 315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice] func:ice_alloc_mapped_page
>
> >
> > > st 25. 6. 2025 v 16:03 odesílatel Przemek Kitszel <
> > > przemyslaw.kitszel@intel.com> napsal:
> > >
> > > > On 6/25/25 14:17, Jaroslav Pulchart wrote:
> > > > > Hello
> > > > >
> > > > > We are still facing the memory issue with Intel 810 NICs (even on latest
> > > > > 6.15.y).
> > > > >
> > > > > Our current stabilization and solution is to move everything to a new
> > > > > INTEL-FREE server and get rid of last Intel sights there (after Intel's
> > > > > CPU vulnerabilities fuckups NICs are next step).
> > > > >
> > > > > Any help welcomed,
> > > > > Jaroslav P.
> > > > >
> > > > >
> > > >
> > > > Thank you for urging us, I can understand the frustration.
> > > >
> > > > We have identified some (unrelated) memory leaks, will soon ship fixes.
> > > > And, as there were no clear issue with any commit/version you have
> > > > posted to be a culprit, there is a chance that our random findings could
> > > > help. Anyway going to zero kmemleak reports is good in itself, that is
> > > > a good start.
> > > >
> > > > Will ask my VAL too to increase efforts in this area too.
> >
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 7:35 ` Jaroslav Pulchart
@ 2025-06-30 16:02 ` Jacob Keller
2025-06-30 17:24 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-06-30 16:02 UTC (permalink / raw)
To: Jaroslav Pulchart, Jakub Kicinski
Cc: Przemek Kitszel, intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 3954 bytes --]
On 6/30/2025 12:35 AM, Jaroslav Pulchart wrote:
>>
>>>
>>> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
>>>> Great, please send me a link to the related patch set. I can apply them in
>>>> our kernel build and try them ASAP!
>>>
>>> Sorry if I'm repeating the question - have you tried
>>> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
>>> is low enough to use it for production workloads.
>>
>> I try it now, the fresh booted server:
>>
>> # sort -g /proc/allocinfo| tail -n 15
>> 45409728 236509 fs/dcache.c:1681 func:__d_alloc
>> 71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>> 71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
>> 85098496 4486 mm/slub.c:2452 func:alloc_slab_page
>> 115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>> 141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
>> 191594496 46776 mm/memory.c:1056 func:folio_prealloc
>> 360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
>> 444076032 33790 mm/slub.c:2450 func:alloc_slab_page
>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>> 975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>> 1022427136 249616 mm/memory.c:1054 func:folio_prealloc
>> 1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>> [ice] func:ice_alloc_mapped_page
>> 1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
>>
>
> The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
> func:ice_alloc_mapped_page" is just growing...
>
> # uptime ; sort -g /proc/allocinfo| tail -n 15
> 09:33:58 up 4 days, 6 min, 1 user, load average: 6.65, 8.18, 9.81
>
> # sort -g /proc/allocinfo| tail -n 15
> 85216896 443838 fs/dcache.c:1681 func:__d_alloc
> 106156032 25917 mm/shmem.c:1854 func:shmem_alloc_folio
> 116850096 102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> 143556608 6894 mm/slub.c:2452 func:alloc_slab_page
> 186793984 45604 mm/memory.c:1056 func:folio_prealloc
> 362807296 88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> 598237184 51309 mm/slub.c:2450 func:alloc_slab_page
> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> 929083392 226827 mm/filemap.c:1978 func:__filemap_get_folio
> 1034657792 252602 mm/memory.c:1054 func:folio_prealloc
> 1262485504 602 mm/khugepaged.c:1084 func:alloc_charge_folio
> 1335377920 325970 mm/readahead.c:186 func:ractl_alloc_folio
> 2544877568 315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> [ice] func:ice_alloc_mapped_page
>
ice_alloc_mapped_page is the function used to allocate the pages for the
Rx ring buffers.
There were a number of fixes for the hot path from Maciej which might be
related. Although those fixes were primarily for XDP, they do impact the
regular hot path as well.
These were fixes on top of work he did that landed in v6.13, so it
seems plausible they might be related. In particular, one of them mentions
a missing buffer put:
743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
It says the following:
> While at it, address an error path of ice_add_xdp_frag() - we were
> missing buffer putting from day 1 there.
>
It seems to me the issue must be somehow related to the buffer cleanup
logic for the Rx ring, since that's the only thing allocated by
ice_alloc_mapped_page.
It might be something fixed with the work Maciej did, but it seems very
weird that 492a044508ad ("ice: Add support for persistent NAPI config")
would affect that logic at all...
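As a side note, to confirm the growth really is coming from that one call
site, a periodic sample of just that allocinfo line should be enough
(assuming CONFIG_MEM_ALLOC_PROFILING stays enabled; the interval is
arbitrary):

# watch -n 600 'grep ice_alloc_mapped_page /proc/allocinfo'

That gives a growth rate per sample interval rather than only
point-in-time totals.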
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 16:02 ` Jacob Keller
@ 2025-06-30 17:24 ` Jaroslav Pulchart
2025-06-30 18:59 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-06-30 17:24 UTC (permalink / raw)
To: Jacob Keller
Cc: Jakub Kicinski, Przemek Kitszel, intel-wired-lan@lists.osuosl.org,
Damato, Joe, netdev@vger.kernel.org, Nguyen, Anthony L,
Michal Swiatkowski, Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed,
Martin Karsten, Igor Raits, Daniel Secik, Zdenek Pesek
>
>
>
> On 6/30/2025 12:35 AM, Jaroslav Pulchart wrote:
> >>
> >>>
> >>> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
> >>>> Great, please send me a link to the related patch set. I can apply them in
> >>>> our kernel build and try them ASAP!
> >>>
> >>> Sorry if I'm repeating the question - have you tried
> >>> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
> >>> is low enough to use it for production workloads.
> >>
> >> I try it now, the fresh booted server:
> >>
> >> # sort -g /proc/allocinfo| tail -n 15
> >> 45409728 236509 fs/dcache.c:1681 func:__d_alloc
> >> 71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> >> 71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
> >> 85098496 4486 mm/slub.c:2452 func:alloc_slab_page
> >> 115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> >> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> >> 141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
> >> 191594496 46776 mm/memory.c:1056 func:folio_prealloc
> >> 360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
> >> 444076032 33790 mm/slub.c:2450 func:alloc_slab_page
> >> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> >> 975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> >> 1022427136 249616 mm/memory.c:1054 func:folio_prealloc
> >> 1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> >> [ice] func:ice_alloc_mapped_page
> >> 1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
> >>
> >
> > The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
> > func:ice_alloc_mapped_page" is just growing...
> >
> > # uptime ; sort -g /proc/allocinfo| tail -n 15
> > 09:33:58 up 4 days, 6 min, 1 user, load average: 6.65, 8.18, 9.81
> >
> > # sort -g /proc/allocinfo| tail -n 15
> > 85216896 443838 fs/dcache.c:1681 func:__d_alloc
> > 106156032 25917 mm/shmem.c:1854 func:shmem_alloc_folio
> > 116850096 102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> > 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> > 143556608 6894 mm/slub.c:2452 func:alloc_slab_page
> > 186793984 45604 mm/memory.c:1056 func:folio_prealloc
> > 362807296 88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> > 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> > 598237184 51309 mm/slub.c:2450 func:alloc_slab_page
> > 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> > 929083392 226827 mm/filemap.c:1978 func:__filemap_get_folio
> > 1034657792 252602 mm/memory.c:1054 func:folio_prealloc
> > 1262485504 602 mm/khugepaged.c:1084 func:alloc_charge_folio
> > 1335377920 325970 mm/readahead.c:186 func:ractl_alloc_folio
> > 2544877568 315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> > [ice] func:ice_alloc_mapped_page
> >
> ice_alloc_mapped_page is the function used to allocate the pages for the
> Rx ring buffers.
>
> There were a number of fixes for the hot path from Maciej which might be
> related. Although those fixes were primarily for XDP they do impact the
> regular hot path as well.
>
> These were fixes on top of work he did which landed in v6.13, so it
> seems plausible they might be related. In particular one which mentions
> a missing buffer put:
>
> 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>
> It says the following:
> > While at it, address an error path of ice_add_xdp_frag() - we were
> > missing buffer putting from day 1 there.
> >
>
> It seems to me the issue must be somehow related to the buffer cleanup
> logic for the Rx ring, since thats the only thing allocated by
> ice_alloc_mapped_page.
>
> It might be something fixed with the work Maciej did.. but it seems very
> weird that 492a044508ad ("ice: Add support for persistent NAPI config")
> would affect that logic at all....
I believe there were/are at least two separate issues. Regarding
commit 492a044508ad (“ice: Add support for persistent NAPI config”):
* On 6.13.y and 6.14.y kernels, this change prevented us from lowering
the driver’s initial, large memory allocation immediately after server
power-up. A few hours (at most a few days) later, this inevitably led
to an out-of-memory condition.
* Reverting the commit in those series only delayed the OOM: it
allowed the queue size (and thus the memory footprint) to shrink on boot
just as it did in 6.12.y, but it didn’t eliminate the underlying 'leak'.
* In 6.15.y, however, that revert isn’t required (and isn’t even
applicable). The after-boot allocation can once again be tuned down
without patching. Still, we observe the same increase in memory use
over time, as shown in the /proc/allocinfo output.
Thus, commit 492a044508ad led us down a false trail, or at the very
least hastened the inevitable OOM.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 17:24 ` Jaroslav Pulchart
@ 2025-06-30 18:59 ` Jacob Keller
2025-06-30 20:01 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-06-30 18:59 UTC (permalink / raw)
To: Jaroslav Pulchart, Maciej Fijalkowski
Cc: Jakub Kicinski, Przemek Kitszel, intel-wired-lan@lists.osuosl.org,
Damato, Joe, netdev@vger.kernel.org, Nguyen, Anthony L,
Michal Swiatkowski, Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed,
Martin Karsten, Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 5575 bytes --]
On 6/30/2025 10:24 AM, Jaroslav Pulchart wrote:
>>
>>
>>
>> On 6/30/2025 12:35 AM, Jaroslav Pulchart wrote:
>>>>
>>>>>
>>>>> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
>>>>>> Great, please send me a link to the related patch set. I can apply them in
>>>>>> our kernel build and try them ASAP!
>>>>>
>>>>> Sorry if I'm repeating the question - have you tried
>>>>> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
>>>>> is low enough to use it for production workloads.
>>>>
>>>> I try it now, the fresh booted server:
>>>>
>>>> # sort -g /proc/allocinfo| tail -n 15
>>>> 45409728 236509 fs/dcache.c:1681 func:__d_alloc
>>>> 71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>>> 71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
>>>> 85098496 4486 mm/slub.c:2452 func:alloc_slab_page
>>>> 115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>>> 141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
>>>> 191594496 46776 mm/memory.c:1056 func:folio_prealloc
>>>> 360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
>>>> 444076032 33790 mm/slub.c:2450 func:alloc_slab_page
>>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>>>> 975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>>> 1022427136 249616 mm/memory.c:1054 func:folio_prealloc
>>>> 1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>>>> [ice] func:ice_alloc_mapped_page
>>>> 1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
>>>>
>>>
>>> The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
>>> func:ice_alloc_mapped_page" is just growing...
>>>
>>> # uptime ; sort -g /proc/allocinfo| tail -n 15
>>> 09:33:58 up 4 days, 6 min, 1 user, load average: 6.65, 8.18, 9.81
>>>
>>> # sort -g /proc/allocinfo| tail -n 15
>>> 85216896 443838 fs/dcache.c:1681 func:__d_alloc
>>> 106156032 25917 mm/shmem.c:1854 func:shmem_alloc_folio
>>> 116850096 102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>> 143556608 6894 mm/slub.c:2452 func:alloc_slab_page
>>> 186793984 45604 mm/memory.c:1056 func:folio_prealloc
>>> 362807296 88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>>> 598237184 51309 mm/slub.c:2450 func:alloc_slab_page
>>> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>> 929083392 226827 mm/filemap.c:1978 func:__filemap_get_folio
>>> 1034657792 252602 mm/memory.c:1054 func:folio_prealloc
>>> 1262485504 602 mm/khugepaged.c:1084 func:alloc_charge_folio
>>> 1335377920 325970 mm/readahead.c:186 func:ractl_alloc_folio
>>> 2544877568 315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>>> [ice] func:ice_alloc_mapped_page
>>>
>> ice_alloc_mapped_page is the function used to allocate the pages for the
>> Rx ring buffers.
>>
>> There were a number of fixes for the hot path from Maciej which might be
>> related. Although those fixes were primarily for XDP they do impact the
>> regular hot path as well.
>>
>> These were fixes on top of work he did which landed in v6.13, so it
>> seems plausible they might be related. In particular one which mentions
>> a missing buffer put:
>>
>> 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>>
>> It says the following:
>>> While at it, address an error path of ice_add_xdp_frag() - we were
>>> missing buffer putting from day 1 there.
>>>
>>
>> It seems to me the issue must be somehow related to the buffer cleanup
>> logic for the Rx ring, since thats the only thing allocated by
>> ice_alloc_mapped_page.
>>
>> It might be something fixed with the work Maciej did.. but it seems very
>> weird that 492a044508ad ("ice: Add support for persistent NAPI config")
>> would affect that logic at all....
>
> I believe there were/are at least two separate issues. Regarding
> commit 492a044508ad (“ice: Add support for persistent NAPI config”):
> * On 6.13.y and 6.14.y kernels, this change prevented us from lowering
> the driver’s initial, large memory allocation immediately after server
> power-up. A few hours (max few days) later, this inevitably led to an
> out-of-memory condition.
> * Reverting the commit in those series only delayed the OOM, it
> allowed the queue size (and thus memory footprint) to shrink on boot
> just as it did in 6.12.y but didn’t eliminate the underlying 'leak'.
> * In 6.15.y, however, that revert isn’t required (and isn’t even
> applicable). The after boot allocation can once again be tuned down
> without patching. Still, we observe the same increase in memory use
> over time, as shown in the 'allocmap' output.
> Thus, commit 492a044508ad led us down a false trail, or at the very
> least hastened the inevitable OOM.
That seems reasonable. I'm still surprised the specific commit leads to
any large increase in memory, since it should only be a few bytes per
NAPI. But there may be some related driver-specific issues.
Either way, we clearly need to isolate how we're leaking memory in the
hot path. I think it might be related to the fixes from Maciej, which are
pretty recent and so might not be in 6.13 or 6.14.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 18:59 ` Jacob Keller
@ 2025-06-30 20:01 ` Jaroslav Pulchart
2025-06-30 20:42 ` Jacob Keller
2025-06-30 21:56 ` Jacob Keller
0 siblings, 2 replies; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-06-30 20:01 UTC (permalink / raw)
To: Jacob Keller
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
>
>
>
> On 6/30/2025 10:24 AM, Jaroslav Pulchart wrote:
> >>
> >>
> >>
> >> On 6/30/2025 12:35 AM, Jaroslav Pulchart wrote:
> >>>>
> >>>>>
> >>>>> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
> >>>>>> Great, please send me a link to the related patch set. I can apply them in
> >>>>>> our kernel build and try them ASAP!
> >>>>>
> >>>>> Sorry if I'm repeating the question - have you tried
> >>>>> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
> >>>>> is low enough to use it for production workloads.
> >>>>
> >>>> I try it now, the fresh booted server:
> >>>>
> >>>> # sort -g /proc/allocinfo| tail -n 15
> >>>> 45409728 236509 fs/dcache.c:1681 func:__d_alloc
> >>>> 71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> >>>> 71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
> >>>> 85098496 4486 mm/slub.c:2452 func:alloc_slab_page
> >>>> 115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> >>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> >>>> 141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
> >>>> 191594496 46776 mm/memory.c:1056 func:folio_prealloc
> >>>> 360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
> >>>> 444076032 33790 mm/slub.c:2450 func:alloc_slab_page
> >>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> >>>> 975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> >>>> 1022427136 249616 mm/memory.c:1054 func:folio_prealloc
> >>>> 1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> >>>> [ice] func:ice_alloc_mapped_page
> >>>> 1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
> >>>>
> >>>
> >>> The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
> >>> func:ice_alloc_mapped_page" is just growing...
> >>>
> >>> # uptime ; sort -g /proc/allocinfo| tail -n 15
> >>> 09:33:58 up 4 days, 6 min, 1 user, load average: 6.65, 8.18, 9.81
> >>>
> >>> # sort -g /proc/allocinfo| tail -n 15
> >>> 85216896 443838 fs/dcache.c:1681 func:__d_alloc
> >>> 106156032 25917 mm/shmem.c:1854 func:shmem_alloc_folio
> >>> 116850096 102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> >>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> >>> 143556608 6894 mm/slub.c:2452 func:alloc_slab_page
> >>> 186793984 45604 mm/memory.c:1056 func:folio_prealloc
> >>> 362807296 88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> >>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> >>> 598237184 51309 mm/slub.c:2450 func:alloc_slab_page
> >>> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> >>> 929083392 226827 mm/filemap.c:1978 func:__filemap_get_folio
> >>> 1034657792 252602 mm/memory.c:1054 func:folio_prealloc
> >>> 1262485504 602 mm/khugepaged.c:1084 func:alloc_charge_folio
> >>> 1335377920 325970 mm/readahead.c:186 func:ractl_alloc_folio
> >>> 2544877568 315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> >>> [ice] func:ice_alloc_mapped_page
> >>>
> >> ice_alloc_mapped_page is the function used to allocate the pages for the
> >> Rx ring buffers.
> >>
> >> There were a number of fixes for the hot path from Maciej which might be
> >> related. Although those fixes were primarily for XDP they do impact the
> >> regular hot path as well.
> >>
> >> These were fixes on top of work he did which landed in v6.13, so it
> >> seems plausible they might be related. In particular one which mentions
> >> a missing buffer put:
> >>
> >> 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
> >>
> >> It says the following:
> >>> While at it, address an error path of ice_add_xdp_frag() - we were
> >>> missing buffer putting from day 1 there.
> >>>
> >>
> >> It seems to me the issue must be somehow related to the buffer cleanup
> >> logic for the Rx ring, since thats the only thing allocated by
> >> ice_alloc_mapped_page.
> >>
> >> It might be something fixed with the work Maciej did.. but it seems very
> >> weird that 492a044508ad ("ice: Add support for persistent NAPI config")
> >> would affect that logic at all....
> >
> > I believe there were/are at least two separate issues. Regarding
> > commit 492a044508ad (“ice: Add support for persistent NAPI config”):
> > * On 6.13.y and 6.14.y kernels, this change prevented us from lowering
> > the driver’s initial, large memory allocation immediately after server
> > power-up. A few hours (max few days) later, this inevitably led to an
> > out-of-memory condition.
> > * Reverting the commit in those series only delayed the OOM, it
> > allowed the queue size (and thus memory footprint) to shrink on boot
> > just as it did in 6.12.y but didn’t eliminate the underlying 'leak'.
> > * In 6.15.y, however, that revert isn’t required (and isn’t even
> > applicable). The after boot allocation can once again be tuned down
> > without patching. Still, we observe the same increase in memory use
> > over time, as shown in the 'allocmap' output.
> > Thus, commit 492a044508ad led us down a false trail, or at the very
> > least hastened the inevitable OOM.
>
> That seems reasonable. I'm still surprised the specific commit leads to
> any large increase in memory, since it should only be a few bytes per
> NAPI. But there may be some related driver-specific issues.
Actually, the large base allocation has existed for quite some time.
The mentioned commit didn’t suddenly grow our memory usage; it only
prevented us from shrinking it via "ethtool -L <iface> combined
<small-number>" after boot. In other words, we’re still stuck with the
same big allocation, we just can’t tune it down (until reverting the
commit).
>
> Either way, we clearly need to isolate how we're leaking memory in the
> hot path. I think it might be related to the fixes from Maciej which are
> pretty recent so might not be in 6.13 or 6.14
I’m fine with a fix for mainline (now 6.15.y); 6.13.y and 6.14.y are
already EOL. Could you please tell me which 6.15.y stable release first
incorporates that patch? Is it included in the current 6.15.5, or will
it arrive in a later point release?
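For what it's worth, once the relevant commit id is confirmed I can also
check a linux-stable clone locally with something like the following (the
tag is just an example):

$ git merge-base --is-ancestor 743bbd93cf29 v6.15.5 && echo "included in 6.15.5"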
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 20:01 ` Jaroslav Pulchart
@ 2025-06-30 20:42 ` Jacob Keller
2025-06-30 21:56 ` Jacob Keller
1 sibling, 0 replies; 46+ messages in thread
From: Jacob Keller @ 2025-06-30 20:42 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 6777 bytes --]
On 6/30/2025 1:01 PM, Jaroslav Pulchart wrote:
>>
>>
>>
>> On 6/30/2025 10:24 AM, Jaroslav Pulchart wrote:
>>>>
>>>>
>>>>
>>>> On 6/30/2025 12:35 AM, Jaroslav Pulchart wrote:
>>>>>>
>>>>>>>
>>>>>>> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
>>>>>>>> Great, please send me a link to the related patch set. I can apply them in
>>>>>>>> our kernel build and try them ASAP!
>>>>>>>
>>>>>>> Sorry if I'm repeating the question - have you tried
>>>>>>> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
>>>>>>> is low enough to use it for production workloads.
>>>>>>
>>>>>> I try it now, the fresh booted server:
>>>>>>
>>>>>> # sort -g /proc/allocinfo| tail -n 15
>>>>>> 45409728 236509 fs/dcache.c:1681 func:__d_alloc
>>>>>> 71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>>>>> 71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
>>>>>> 85098496 4486 mm/slub.c:2452 func:alloc_slab_page
>>>>>> 115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>>>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>>>>> 141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
>>>>>> 191594496 46776 mm/memory.c:1056 func:folio_prealloc
>>>>>> 360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
>>>>>> 444076032 33790 mm/slub.c:2450 func:alloc_slab_page
>>>>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>>>>>> 975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>>>>> 1022427136 249616 mm/memory.c:1054 func:folio_prealloc
>>>>>> 1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>>>>>> [ice] func:ice_alloc_mapped_page
>>>>>> 1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
>>>>>>
>>>>>
>>>>> The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
>>>>> func:ice_alloc_mapped_page" is just growing...
>>>>>
>>>>> # uptime ; sort -g /proc/allocinfo| tail -n 15
>>>>> 09:33:58 up 4 days, 6 min, 1 user, load average: 6.65, 8.18, 9.81
>>>>>
>>>>> # sort -g /proc/allocinfo| tail -n 15
>>>>> 85216896 443838 fs/dcache.c:1681 func:__d_alloc
>>>>> 106156032 25917 mm/shmem.c:1854 func:shmem_alloc_folio
>>>>> 116850096 102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>>>> 143556608 6894 mm/slub.c:2452 func:alloc_slab_page
>>>>> 186793984 45604 mm/memory.c:1056 func:folio_prealloc
>>>>> 362807296 88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>>>>> 598237184 51309 mm/slub.c:2450 func:alloc_slab_page
>>>>> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>>>> 929083392 226827 mm/filemap.c:1978 func:__filemap_get_folio
>>>>> 1034657792 252602 mm/memory.c:1054 func:folio_prealloc
>>>>> 1262485504 602 mm/khugepaged.c:1084 func:alloc_charge_folio
>>>>> 1335377920 325970 mm/readahead.c:186 func:ractl_alloc_folio
>>>>> 2544877568 315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>>>>> [ice] func:ice_alloc_mapped_page
>>>>>
>>>> ice_alloc_mapped_page is the function used to allocate the pages for the
>>>> Rx ring buffers.
>>>>
>>>> There were a number of fixes for the hot path from Maciej which might be
>>>> related. Although those fixes were primarily for XDP they do impact the
>>>> regular hot path as well.
>>>>
>>>> These were fixes on top of work he did which landed in v6.13, so it
>>>> seems plausible they might be related. In particular one which mentions
>>>> a missing buffer put:
>>>>
>>>> 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>>>>
>>>> It says the following:
>>>>> While at it, address an error path of ice_add_xdp_frag() - we were
>>>>> missing buffer putting from day 1 there.
>>>>>
>>>>
>>>> It seems to me the issue must be somehow related to the buffer cleanup
>>>> logic for the Rx ring, since thats the only thing allocated by
>>>> ice_alloc_mapped_page.
>>>>
>>>> It might be something fixed with the work Maciej did.. but it seems very
>>>> weird that 492a044508ad ("ice: Add support for persistent NAPI config")
>>>> would affect that logic at all....
>>>
>>> I believe there were/are at least two separate issues. Regarding
>>> commit 492a044508ad (“ice: Add support for persistent NAPI config”):
>>> * On 6.13.y and 6.14.y kernels, this change prevented us from lowering
>>> the driver’s initial, large memory allocation immediately after server
>>> power-up. A few hours (max few days) later, this inevitably led to an
>>> out-of-memory condition.
>>> * Reverting the commit in those series only delayed the OOM, it
>>> allowed the queue size (and thus memory footprint) to shrink on boot
>>> just as it did in 6.12.y but didn’t eliminate the underlying 'leak'.
>>> * In 6.15.y, however, that revert isn’t required (and isn’t even
>>> applicable). The after boot allocation can once again be tuned down
>>> without patching. Still, we observe the same increase in memory use
>>> over time, as shown in the 'allocmap' output.
>>> Thus, commit 492a044508ad led us down a false trail, or at the very
>>> least hastened the inevitable OOM.
>>
>> That seems reasonable. I'm still surprised the specific commit leads to
>> any large increase in memory, since it should only be a few bytes per
>> NAPI. But there may be some related driver-specific issues.
>
> Actually, the large base allocation has existed for quite some time,
> the mentioned commit didn’t suddenly grow our memory usage, it only
> prevented us from shrinking it via "ethtool -L <iface> combined
> <small-number>"
> after boot. In other words, we’re still stuck with the same big
> allocation, we just can’t tune it down (till reverting the commit)
>
Yes. My point is that I still don't understand the mechanism by which
that change *prevents* ethtool -L from working as you describe.
>>
>> Either way, we clearly need to isolate how we're leaking memory in the
>> hot path. I think it might be related to the fixes from Maciej which are
>> pretty recent so might not be in 6.13 or 6.14
>
> I’m fine with the fix for the mainline (now 6.15.y), the 6.13.y and
> 6.14.y are already EOL. Could you please tell me which 6.15.y stable
> release first incorporates that patch? Is it included in current
> 6.15.5, or will it arrive in a later point release?
I'm not certain whether this fix actually resolves your issue, but I will
figure out which stable kernels have it shortly.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 20:01 ` Jaroslav Pulchart
2025-06-30 20:42 ` Jacob Keller
@ 2025-06-30 21:56 ` Jacob Keller
2025-06-30 23:16 ` Jacob Keller
1 sibling, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-06-30 21:56 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 6966 bytes --]
On 6/30/2025 1:01 PM, Jaroslav Pulchart wrote:
>>
>>
>>
>> On 6/30/2025 10:24 AM, Jaroslav Pulchart wrote:
>>>>
>>>>
>>>>
>>>> On 6/30/2025 12:35 AM, Jaroslav Pulchart wrote:
>>>>>>
>>>>>>>
>>>>>>> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
>>>>>>>> Great, please send me a link to the related patch set. I can apply them in
>>>>>>>> our kernel build and try them ASAP!
>>>>>>>
>>>>>>> Sorry if I'm repeating the question - have you tried
>>>>>>> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
>>>>>>> is low enough to use it for production workloads.
>>>>>>
>>>>>> I try it now, the fresh booted server:
>>>>>>
>>>>>> # sort -g /proc/allocinfo| tail -n 15
>>>>>> 45409728 236509 fs/dcache.c:1681 func:__d_alloc
>>>>>> 71041024 17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>>>>> 71524352 11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
>>>>>> 85098496 4486 mm/slub.c:2452 func:alloc_slab_page
>>>>>> 115470992 101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>>>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>>>>> 141426688 34528 mm/filemap.c:1978 func:__filemap_get_folio
>>>>>> 191594496 46776 mm/memory.c:1056 func:folio_prealloc
>>>>>> 360710144 172 mm/khugepaged.c:1084 func:alloc_charge_folio
>>>>>> 444076032 33790 mm/slub.c:2450 func:alloc_slab_page
>>>>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>>>>>> 975175680 465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>>>>> 1022427136 249616 mm/memory.c:1054 func:folio_prealloc
>>>>>> 1105125376 139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>>>>>> [ice] func:ice_alloc_mapped_page
>>>>>> 1621598208 395848 mm/readahead.c:186 func:ractl_alloc_folio
>>>>>>
>>>>>
>>>>> The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
>>>>> func:ice_alloc_mapped_page" is just growing...
>>>>>
>>>>> # uptime ; sort -g /proc/allocinfo| tail -n 15
>>>>> 09:33:58 up 4 days, 6 min, 1 user, load average: 6.65, 8.18, 9.81
>>>>>
>>>>> # sort -g /proc/allocinfo| tail -n 15
>>>>> 85216896 443838 fs/dcache.c:1681 func:__d_alloc
>>>>> 106156032 25917 mm/shmem.c:1854 func:shmem_alloc_folio
>>>>> 116850096 102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>>>> 143556608 6894 mm/slub.c:2452 func:alloc_slab_page
>>>>> 186793984 45604 mm/memory.c:1056 func:folio_prealloc
>>>>> 362807296 88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>>>>> 598237184 51309 mm/slub.c:2450 func:alloc_slab_page
>>>>> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>>>> 929083392 226827 mm/filemap.c:1978 func:__filemap_get_folio
>>>>> 1034657792 252602 mm/memory.c:1054 func:folio_prealloc
>>>>> 1262485504 602 mm/khugepaged.c:1084 func:alloc_charge_folio
>>>>> 1335377920 325970 mm/readahead.c:186 func:ractl_alloc_folio
>>>>> 2544877568 315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>>>>> [ice] func:ice_alloc_mapped_page
>>>>>
>>>> ice_alloc_mapped_page is the function used to allocate the pages for the
>>>> Rx ring buffers.
>>>>
>>>> There were a number of fixes for the hot path from Maciej which might be
>>>> related. Although those fixes were primarily for XDP they do impact the
>>>> regular hot path as well.
>>>>
>>>> These were fixes on top of work he did which landed in v6.13, so it
>>>> seems plausible they might be related. In particular one which mentions
>>>> a missing buffer put:
>>>>
>>>> 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>>>>
>>>> It says the following:
>>>>> While at it, address an error path of ice_add_xdp_frag() - we were
>>>>> missing buffer putting from day 1 there.
>>>>>
>>>>
>>>> It seems to me the issue must be somehow related to the buffer cleanup
>>>> logic for the Rx ring, since thats the only thing allocated by
>>>> ice_alloc_mapped_page.
>>>>
>>>> It might be something fixed with the work Maciej did.. but it seems very
>>>> weird that 492a044508ad ("ice: Add support for persistent NAPI config")
>>>> would affect that logic at all....
>>>
>>> I believe there were/are at least two separate issues. Regarding
>>> commit 492a044508ad (“ice: Add support for persistent NAPI config”):
>>> * On 6.13.y and 6.14.y kernels, this change prevented us from lowering
>>> the driver’s initial, large memory allocation immediately after server
>>> power-up. A few hours (max few days) later, this inevitably led to an
>>> out-of-memory condition.
>>> * Reverting the commit in those series only delayed the OOM, it
>>> allowed the queue size (and thus memory footprint) to shrink on boot
>>> just as it did in 6.12.y but didn’t eliminate the underlying 'leak'.
>>> * In 6.15.y, however, that revert isn’t required (and isn’t even
>>> applicable). The after boot allocation can once again be tuned down
>>> without patching. Still, we observe the same increase in memory use
>>> over time, as shown in the 'allocmap' output.
>>> Thus, commit 492a044508ad led us down a false trail, or at the very
>>> least hastened the inevitable OOM.
>>
>> That seems reasonable. I'm still surprised the specific commit leads to
>> any large increase in memory, since it should only be a few bytes per
>> NAPI. But there may be some related driver-specific issues.
>
> Actually, the large base allocation has existed for quite some time,
> the mentioned commit didn’t suddenly grow our memory usage, it only
> prevented us from shrinking it via "ethtool -L <iface> combined
> <small-number>"
> after boot. In other words, we’re still stuck with the same big
> allocation, we just can’t tune it down (till reverting the commit)
>
>>
>> Either way, we clearly need to isolate how we're leaking memory in the
>> hot path. I think it might be related to the fixes from Maciej which are
>> pretty recent so might not be in 6.13 or 6.14
>
> I’m fine with the fix for the mainline (now 6.15.y), the 6.13.y and
> 6.14.y are already EOL. Could you please tell me which 6.15.y stable
> release first incorporates that patch? Is it included in current
> 6.15.5, or will it arrive in a later point release?
Unfortunately it looks like the fix I mentioned already landed in 6.14, so
it's not a fix for your issue (since you mentioned that 6.14 failed
testing on your systems).
$ git describe --first-parent --contains --match=v* --exclude=*rc*
743bbd93cf29f653fae0e1416a31f03231689911
v6.14~251^2~15^2~2
I don't see any other relevant changes since v6.14. I can check whether
I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
systems here.
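For reference, a quick way to list the ice changes in that range is
something like the following (the range is just an example), in case
anyone wants to double-check for candidates I may have missed:

$ git log --oneline v6.14..v6.15 -- drivers/net/ethernet/intel/ice/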
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 21:56 ` Jacob Keller
@ 2025-06-30 23:16 ` Jacob Keller
2025-07-01 6:48 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-06-30 23:16 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 2175 bytes --]
On 6/30/2025 2:56 PM, Jacob Keller wrote:
> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
> its not a fix for your issue (since you mentioned 6.14 has failed
> testing in your system)
>
> $ git describe --first-parent --contains --match=v* --exclude=*rc*
> 743bbd93cf29f653fae0e1416a31f03231689911
> v6.14~251^2~15^2~2
>
> I don't see any other relevant changes since v6.14. I can try to see if
> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
> systems here.
On my system I see this at boot after loading the ice module:
$ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec
> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice] func:ice_get_irq_res
> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
It's about 1GB for the mapped pages. I don't see any increase from moment
to moment. I've started an iperf session to simulate some traffic, and I'll
leave it running to see if anything changes overnight.
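For reference, the traffic is roughly along these lines (the address and
parameters are placeholders, not the exact invocation):

$ iperf3 -s                             # on the link partner
$ iperf3 -c 192.0.2.1 -P 8 -t 43200     # parallel TCP streams for ~12 hours

i.e. fairly uniform bulk TCP.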
Is there anything else that you can share about the traffic setup or
otherwise that I could look into? Your system seems to use ~2.5x the
buffer memory that mine does, but that might just come down to the number
of CPUs (and thus queues) on each system.
Hopefully I'll get some more results overnight.
Thanks,
Jake
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-06-30 23:16 ` Jacob Keller
@ 2025-07-01 6:48 ` Jaroslav Pulchart
2025-07-01 20:48 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-01 6:48 UTC (permalink / raw)
To: Jacob Keller
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
> On 6/30/2025 2:56 PM, Jacob Keller wrote:
> > Unfortunately it looks like the fix I mentioned has landed in 6.14, so
> > its not a fix for your issue (since you mentioned 6.14 has failed
> > testing in your system)
> >
> > $ git describe --first-parent --contains --match=v* --exclude=*rc*
> > 743bbd93cf29f653fae0e1416a31f03231689911
> > v6.14~251^2~15^2~2
> >
> > I don't see any other relevant changes since v6.14. I can try to see if
> > I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
> > systems here.
>
> On my system I see this at boot after loading the ice module from
>
> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec
> > 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice] func:ice_get_irq_res
> > 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
> > 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
> > 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
> > 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
> > 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
> > 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
> > 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
> > 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
> > 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
>
> Its about 1GB for the mapped pages. I don't see any increase moment to
> moment. I've started an iperf session to simulate some traffic, and I'll
> leave this running to see if anything changes overnight.
>
> Is there anything else that you can share about the traffic setup or
> otherwise that I could look into? Your system seems to use ~2.5 x the
> buffer size as mine, but that might just be a smaller number of CPUs.
>
> Hopefully I'll get some more results overnight.
The traffic is random production workloads from VMs, using standard
Linux or OVS bridges. There is no specific pattern to it. I haven’t
had any luck reproducing this with iperf3 myself (or I was not patient
enough). The two active (UP) interfaces are in an LACP bonding setup.
Here are our ethtool settings for the two member ports (em1 and p3p1):
# ethtool -l em1
Channel parameters for em1:
Pre-set maximums:
RX: 64
TX: 64
Other: 1
Combined: 64
Current hardware settings:
RX: 0
TX: 0
Other: 1
Combined: 8
# ethtool -g em1
Ring parameters for em1:
Pre-set maximums:
RX: 8160
RX Mini: n/a
RX Jumbo: n/a
TX: 8160
TX push buff len: n/a
Current hardware settings:
RX: 8160
RX Mini: n/a
RX Jumbo: n/a
TX: 8160
RX Buf Len: n/a
CQE Size: n/a
TX Push: off
RX Push: off
TX push buff len: n/a
TCP data split: n/a
# ethtool -c em1
Coalesce parameters for em1:
Adaptive RX: off TX: off
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 12
rx-frames: n/a
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 28
tx-frames: n/a
tx-usecs-irq: n/a
tx-frames-irq: n/a
rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a
rx-usecs-high: 0
rx-frame-high: n/a
tx-usecs-high: n/a
tx-frame-high: n/a
CQE mode RX: n/a TX: n/a
tx-aggr-max-bytes: n/a
tx-aggr-max-frames: n/a
tx-aggr-time-usecs: n/a
# ethtool -k em1
Features for em1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
tx-tcp-accecn-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
tx-gso-list: off [fixed]
tx-nocache-copy: off
loopback: off
rx-fcs: off
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off
rx-vlan-stag-hw-parse: off
rx-vlan-stag-filter: on
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
# ethtool -i em1
driver: ice
version: 6.15.3-3.gdc.el9.x86_64
firmware-version: 4.51 0x8001e501 23.0.8
expansion-rom-version:
bus-info: 0000:63:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
# ethtool em1
Settings for em1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseT/Full
25000baseCR/Full
25000baseSR/Full
1000baseX/Full
10000baseCR/Full
10000baseSR/Full
10000baseLR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None RS BASER
Advertised link modes: 25000baseCR/Full
10000baseCR/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: None RS BASER
Speed: 25000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: g
Wake-on: d
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
>
> Thanks,
> Jake
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-01 6:48 ` Jaroslav Pulchart
@ 2025-07-01 20:48 ` Jacob Keller
2025-07-02 9:48 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-01 20:48 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 3099 bytes --]
On 6/30/2025 11:48 PM, Jaroslav Pulchart wrote:
>> On 6/30/2025 2:56 PM, Jacob Keller wrote:
>>> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
>>> its not a fix for your issue (since you mentioned 6.14 has failed
>>> testing in your system)
>>>
>>> $ git describe --first-parent --contains --match=v* --exclude=*rc*
>>> 743bbd93cf29f653fae0e1416a31f03231689911
>>> v6.14~251^2~15^2~2
>>>
>>> I don't see any other relevant changes since v6.14. I can try to see if
>>> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
>>> systems here.
>>
>> On my system I see this at boot after loading the ice module from
>>
>> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec
>>> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice] func:ice_get_irq_res
>>> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
>>> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
>>> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
>>> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
>>> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
>>> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
>>> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
>>
>> Its about 1GB for the mapped pages. I don't see any increase moment to
>> moment. I've started an iperf session to simulate some traffic, and I'll
>> leave this running to see if anything changes overnight.
>>
>> Is there anything else that you can share about the traffic setup or
>> otherwise that I could look into? Your system seems to use ~2.5 x the
>> buffer size as mine, but that might just be a smaller number of CPUs.
>>
>> Hopefully I'll get some more results overnight.
>
> The traffic is random production workloads from VMs, using standard
> Linux or OVS bridges. There is no specific pattern to it. I haven’t
> had any luck reproducing (or was not patient enough) this with iperf3
> myself. The two active (UP) interfaces are in an LACP bonding setup.
> Here are our ethtool settings for the two member ports (em1 and p3p1)
>
I had iperf3 running overnight and the memory usage for
ice_alloc_mapped_page is constant here. Mine were direct connections
without bridging or bonding. From your description I assume there's no XDP
happening either.
I guess the traffic patterns of an iperf session are too regular, or it
is something to do with bridging or bonding, but I also struggle to see how
those could play a role in the buffer management in the ice driver...
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-01 20:48 ` Jacob Keller
@ 2025-07-02 9:48 ` Jaroslav Pulchart
2025-07-02 18:01 ` Jacob Keller
2025-07-02 21:56 ` Jacob Keller
0 siblings, 2 replies; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-02 9:48 UTC (permalink / raw)
To: Jacob Keller
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
>
> On 6/30/2025 11:48 PM, Jaroslav Pulchart wrote:
> >> On 6/30/2025 2:56 PM, Jacob Keller wrote:
> >>> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
> >>> its not a fix for your issue (since you mentioned 6.14 has failed
> >>> testing in your system)
> >>>
> >>> $ git describe --first-parent --contains --match=v* --exclude=*rc*
> >>> 743bbd93cf29f653fae0e1416a31f03231689911
> >>> v6.14~251^2~15^2~2
> >>>
> >>> I don't see any other relevant changes since v6.14. I can try to see if
> >>> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
> >>> systems here.
> >>
> >> On my system I see this at boot after loading the ice module from
> >>
> >> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec
> >>> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice] func:ice_get_irq_res
> >>> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
> >>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
> >>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
> >>> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
> >>> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
> >>> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
> >>> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
> >>> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
> >>> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
> >>
> >> Its about 1GB for the mapped pages. I don't see any increase moment to
> >> moment. I've started an iperf session to simulate some traffic, and I'll
> >> leave this running to see if anything changes overnight.
> >>
> >> Is there anything else that you can share about the traffic setup or
> >> otherwise that I could look into? Your system seems to use ~2.5 x the
> >> buffer size as mine, but that might just be a smaller number of CPUs.
> >>
> >> Hopefully I'll get some more results overnight.
> >
> > The traffic is random production workloads from VMs, using standard
> > Linux or OVS bridges. There is no specific pattern to it. I haven’t
> > had any luck reproducing (or was not patient enough) this with iperf3
> > myself. The two active (UP) interfaces are in an LACP bonding setup.
> > Here are our ethtool settings for the two member ports (em1 and p3p1)
> >
>
> I had iperf3 running overnight and the memory usage for
> ice_alloc_mapped_pages is constant here. Mine was direct connections
> without bridge or bonding. From your description I assume there's no XDP
> happening either.
Yes, no XDP in use.
BTW, the allocinfo after 6 days of uptime:
# uptime ; sort -g /proc/allocinfo| tail -n 15
11:46:44 up 6 days, 2:18, 1 user, load average: 9.24, 11.33, 15.07
102489024 533797 fs/dcache.c:1681 func:__d_alloc
106229760 25935 mm/shmem.c:1854 func:shmem_alloc_folio
117118192 103097 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
162783232 7656 mm/slub.c:2452 func:alloc_slab_page
189906944 46364 mm/memory.c:1056 func:folio_prealloc
499384320 121920 mm/percpu-vm.c:95 func:pcpu_alloc_pages
530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
625876992 54186 mm/slub.c:2450 func:alloc_slab_page
838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
1014710272 247732 mm/filemap.c:1978 func:__filemap_get_folio
1056710656 257986 mm/memory.c:1054 func:folio_prealloc
1279262720 610 mm/khugepaged.c:1084 func:alloc_charge_folio
1334530048 325763 mm/readahead.c:186 func:ractl_alloc_folio
3341238272 412215 drivers/net/ethernet/intel/ice/ice_txrx.c:681
[ice] func:ice_alloc_mapped_page
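For scale, that is (3341238272 - 1105125376) bytes gained since the
fresh-boot sample earlier in this thread, i.e. roughly 2.1 GiB over ~6
days, or about 350-400 MB of Rx page memory per day.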
>
> I guess the traffic patterns of an iperf session are too regular, or
> something to do with bridge or bonding.. but I also struggle to see how
> those could play a role in the buffer management in the ice driver...
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-02 9:48 ` Jaroslav Pulchart
@ 2025-07-02 18:01 ` Jacob Keller
2025-07-02 21:56 ` Jacob Keller
1 sibling, 0 replies; 46+ messages in thread
From: Jacob Keller @ 2025-07-02 18:01 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 4996 bytes --]
On 7/2/2025 2:48 AM, Jaroslav Pulchart wrote:
>>
>> On 6/30/2025 11:48 PM, Jaroslav Pulchart wrote:
>>>> On 6/30/2025 2:56 PM, Jacob Keller wrote:
>>>>> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
>>>>> its not a fix for your issue (since you mentioned 6.14 has failed
>>>>> testing in your system)
>>>>>
>>>>> $ git describe --first-parent --contains --match=v* --exclude=*rc*
>>>>> 743bbd93cf29f653fae0e1416a31f03231689911
>>>>> v6.14~251^2~15^2~2
>>>>>
>>>>> I don't see any other relevant changes since v6.14. I can try to see if
>>>>> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
>>>>> systems here.
>>>>
>>>> On my system I see this at boot after loading the ice module from
>>>>
>>>> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec
>>>>> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice] func:ice_get_irq_res
>>>>> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
>>>>> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
>>>>> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
>>>>> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
>>>>> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
>>>>> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
>>>>> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
>>>>
>>>> Its about 1GB for the mapped pages. I don't see any increase moment to
>>>> moment. I've started an iperf session to simulate some traffic, and I'll
>>>> leave this running to see if anything changes overnight.
>>>>
>>>> Is there anything else that you can share about the traffic setup or
>>>> otherwise that I could look into? Your system seems to use ~2.5 x the
>>>> buffer size as mine, but that might just be a smaller number of CPUs.
>>>>
>>>> Hopefully I'll get some more results overnight.
>>>
>>> The traffic is random production workloads from VMs, using standard
>>> Linux or OVS bridges. There is no specific pattern to it. I haven’t
>>> had any luck reproducing (or was not patient enough) this with iperf3
>>> myself. The two active (UP) interfaces are in an LACP bonding setup.
>>> Here are our ethtool settings for the two member ports (em1 and p3p1)
>>>
>>
>> I had iperf3 running overnight and the memory usage for
>> ice_alloc_mapped_pages is constant here. Mine was direct connections
>> without bridge or bonding. From your description I assume there's no XDP
>> happening either.
>
> Yes, no XDP in use.
>
> BTW the allocinfo after 6days uptime:
> # uptime ; sort -g /proc/allocinfo| tail -n 15
> 11:46:44 up 6 days, 2:18, 1 user, load average: 9.24, 11.33, 15.07
> 102489024 533797 fs/dcache.c:1681 func:__d_alloc
> 106229760 25935 mm/shmem.c:1854 func:shmem_alloc_folio
> 117118192 103097 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> 162783232 7656 mm/slub.c:2452 func:alloc_slab_page
> 189906944 46364 mm/memory.c:1056 func:folio_prealloc
> 499384320 121920 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> 625876992 54186 mm/slub.c:2450 func:alloc_slab_page
> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> 1014710272 247732 mm/filemap.c:1978 func:__filemap_get_folio
> 1056710656 257986 mm/memory.c:1054 func:folio_prealloc
> 1279262720 610 mm/khugepaged.c:1084 func:alloc_charge_folio
> 1334530048 325763 mm/readahead.c:186 func:ractl_alloc_folio
> 3341238272 412215 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> [ice] func:ice_alloc_mapped_page
>
3.2GB, meaning an entire extra GB wasted compared to where it was right
after boot :(
Unfortunately, I've had no luck trying to reproduce the conditions that
trigger this. We do have a series in flight to convert ice to page pool,
which we hope resolves this, but of course that isn't really a suitable
backport candidate.
It's quite frustrating that I can't figure out how to reproduce this and
further debug where the leak is.
I also discovered that the leak sanitizer doesn't cover page allocations :(
>>
>> I guess the traffic patterns of an iperf session are too regular, or
>> something to do with bridge or bonding.. but I also struggle to see how
>> those could play a role in the buffer management in the ice driver...
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-02 9:48 ` Jaroslav Pulchart
2025-07-02 18:01 ` Jacob Keller
@ 2025-07-02 21:56 ` Jacob Keller
2025-07-03 6:46 ` Jaroslav Pulchart
1 sibling, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-02 21:56 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 5974 bytes --]
On 7/2/2025 2:48 AM, Jaroslav Pulchart wrote:
>>
>> On 6/30/2025 11:48 PM, Jaroslav Pulchart wrote:
>>>> On 6/30/2025 2:56 PM, Jacob Keller wrote:
>>>>> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
>>>>> its not a fix for your issue (since you mentioned 6.14 has failed
>>>>> testing in your system)
>>>>>
>>>>> $ git describe --first-parent --contains --match=v* --exclude=*rc*
>>>>> 743bbd93cf29f653fae0e1416a31f03231689911
>>>>> v6.14~251^2~15^2~2
>>>>>
>>>>> I don't see any other relevant changes since v6.14. I can try to see if
>>>>> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
>>>>> systems here.
>>>>
>>>> On my system I see this at boot after loading the ice module from
>>>>
>>>> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec>
>>>> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice]
>>>> func:ice_get_irq_res
>>>>> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
>>>>> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
>>>>> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
>>>>> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
>>>>> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
>>>>> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
>>>>> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
>>>>
>>>> Its about 1GB for the mapped pages. I don't see any increase moment to
>>>> moment. I've started an iperf session to simulate some traffic, and I'll
>>>> leave this running to see if anything changes overnight.
>>>>
>>>> Is there anything else that you can share about the traffic setup or
>>>> otherwise that I could look into? Your system seems to use ~2.5 x the
>>>> buffer size as mine, but that might just be a smaller number of CPUs.
>>>>
>>>> Hopefully I'll get some more results overnight.
>>>
>>> The traffic is random production workloads from VMs, using standard
>>> Linux or OVS bridges. There is no specific pattern to it. I haven’t
>>> had any luck reproducing (or was not patient enough) this with iperf3
>>> myself. The two active (UP) interfaces are in an LACP bonding setup.
>>> Here are our ethtool settings for the two member ports (em1 and p3p1)
>>>
>>
>> I had iperf3 running overnight and the memory usage for
>> ice_alloc_mapped_pages is constant here. Mine was direct connections
>> without bridge or bonding. From your description I assume there's no XDP
>> happening either.
>
> Yes, no XDP in use.
>
> BTW the allocinfo after 6days uptime:
> # uptime ; sort -g /proc/allocinfo| tail -n 15
> 11:46:44 up 6 days, 2:18, 1 user, load average: 9.24, 11.33, 15.07
> 102489024 533797 fs/dcache.c:1681 func:__d_alloc
> 106229760 25935 mm/shmem.c:1854 func:shmem_alloc_folio
> 117118192 103097 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> 162783232 7656 mm/slub.c:2452 func:alloc_slab_page
> 189906944 46364 mm/memory.c:1056 func:folio_prealloc
> 499384320 121920 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> 625876992 54186 mm/slub.c:2450 func:alloc_slab_page
> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> 1014710272 247732 mm/filemap.c:1978 func:__filemap_get_folio
> 1056710656 257986 mm/memory.c:1054 func:folio_prealloc
> 1279262720 610 mm/khugepaged.c:1084 func:alloc_charge_folio
> 1334530048 325763 mm/readahead.c:186 func:ractl_alloc_folio
> 3341238272 412215 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> [ice] func:ice_alloc_mapped_page
>
I have a suspicion that the issue is related to the updating of
page_count in ice_get_rx_pgcnt(). The i40e driver has very similar
logic for page reuse but doesn't do this. It also has a counter to track
failures to re-use Rx pages.
Commit 11c4aa074d54 ("ice: gather page_count()'s of each frag right
before XDP prog call") changed the logic to update page_count of the Rx
page just prior to the XDP call instead of at the point where we get the
page from ice_get_rx_buf(). I think this change was originally
introduced while we were trying out an experimental refactor of the
hotpath to handle fragments differently, which no longer happens since
743bbd93cf29 ("ice: put Rx buffers after being done with current
frame"), which ironically was part of this very same series..
I think this updating of the page count is accidentally causing us to
miscount when we could perform page reuse, and ultimately causes us to
leak the page somehow. I'm still investigating, but I think this might
trigger if somehow the page's pgcnt - pagecnt_bias becomes >1, in which
case we don't reuse the page.
The i40e driver stores the page count in i40e_get_rx_buffer, and I think
our updating it later can somehow get things out-of-sync.
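To make the check I mean concrete, here is a tiny user-space model of
the reuse decision (simplified, with made-up names; it is not the actual
ice code, and the values in main() are arbitrary example inputs):

/* Reuse a page only while the driver is its sole owner: pgcnt is a
 * snapshot of page_count(), pagecnt_bias is the number of references
 * the driver still holds.  If the difference ends up >1 (for instance
 * because the snapshot was taken at the wrong moment), the page is not
 * recycled and a fresh one is allocated instead.
 */
#include <stdbool.h>
#include <stdio.h>

struct rx_buf_model {
	int pgcnt;		/* snapshot of page_count() */
	int pagecnt_bias;	/* references still owned by the driver */
};

static bool can_reuse(const struct rx_buf_model *b)
{
	return (b->pgcnt - b->pagecnt_bias) <= 1;
}

int main(void)
{
	struct rx_buf_model sole_owner = { .pgcnt = 2, .pagecnt_bias = 1 };
	struct rx_buf_model stale_snap = { .pgcnt = 3, .pagecnt_bias = 1 };

	printf("sole owner     -> reuse: %d\n", can_reuse(&sole_owner)); /* 1 */
	printf("stale snapshot -> reuse: %d\n", can_reuse(&stale_snap)); /* 0 */
	return 0;
}

If the bookkeeping drifts so that the difference stays above 1, every
buffer falls into the "allocate a new page" path, which would look
exactly like the steady growth of ice_alloc_mapped_page above.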
Do you know if your traffic pattern happens to send fragmented frames? I
think iperf doesn't do that, which might be part of what's causing this
issue. I'm going to try to see if I can generate such fragmentation to
confirm. Is your MTU kept at the default ethernet size?
At the very least I'm going to propose a patch for ice similar to the
one from Joe Damato to track the rx busy page count. That might at least
help track something..
Thanks,
Jake
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-02 21:56 ` Jacob Keller
@ 2025-07-03 6:46 ` Jaroslav Pulchart
2025-07-03 16:16 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-03 6:46 UTC (permalink / raw)
To: Jacob Keller
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
>
> On 7/2/2025 2:48 AM, Jaroslav Pulchart wrote:
> >>
> >> On 6/30/2025 11:48 PM, Jaroslav Pulchart wrote:
> >>>> On 6/30/2025 2:56 PM, Jacob Keller wrote:
> >>>>> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
> >>>>> its not a fix for your issue (since you mentioned 6.14 has failed
> >>>>> testing in your system)
> >>>>>
> >>>>> $ git describe --first-parent --contains --match=v* --exclude=*rc*
> >>>>> 743bbd93cf29f653fae0e1416a31f03231689911
> >>>>> v6.14~251^2~15^2~2
> >>>>>
> >>>>> I don't see any other relevant changes since v6.14. I can try to see if
> >>>>> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
> >>>>> systems here.
> >>>>
> >>>> On my system I see this at boot after loading the ice module from
> >>>>
> >>>> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec>
> >>>> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice]
> >>>> func:ice_get_irq_res
> >>>>> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
> >>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
> >>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
> >>>>> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
> >>>>> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
> >>>>> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
> >>>>> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
> >>>>> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
> >>>>> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
> >>>>
> >>>> Its about 1GB for the mapped pages. I don't see any increase moment to
> >>>> moment. I've started an iperf session to simulate some traffic, and I'll
> >>>> leave this running to see if anything changes overnight.
> >>>>
> >>>> Is there anything else that you can share about the traffic setup or
> >>>> otherwise that I could look into? Your system seems to use ~2.5 x the
> >>>> buffer size as mine, but that might just be a smaller number of CPUs.
> >>>>
> >>>> Hopefully I'll get some more results overnight.
> >>>
> >>> The traffic is random production workloads from VMs, using standard
> >>> Linux or OVS bridges. There is no specific pattern to it. I haven’t
> >>> had any luck reproducing (or was not patient enough) this with iperf3
> >>> myself. The two active (UP) interfaces are in an LACP bonding setup.
> >>> Here are our ethtool settings for the two member ports (em1 and p3p1)
> >>>
> >>
> >> I had iperf3 running overnight and the memory usage for
> >> ice_alloc_mapped_pages is constant here. Mine was direct connections
> >> without bridge or bonding. From your description I assume there's no XDP
> >> happening either.
> >
> > Yes, no XDP in use.
> >
> > BTW the allocinfo after 6days uptime:
> > # uptime ; sort -g /proc/allocinfo| tail -n 15
> > 11:46:44 up 6 days, 2:18, 1 user, load average: 9.24, 11.33, 15.07
> > 102489024 533797 fs/dcache.c:1681 func:__d_alloc
> > 106229760 25935 mm/shmem.c:1854 func:shmem_alloc_folio
> > 117118192 103097 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> > 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> > 162783232 7656 mm/slub.c:2452 func:alloc_slab_page
> > 189906944 46364 mm/memory.c:1056 func:folio_prealloc
> > 499384320 121920 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> > 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> > 625876992 54186 mm/slub.c:2450 func:alloc_slab_page
> > 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> > 1014710272 247732 mm/filemap.c:1978 func:__filemap_get_folio
> > 1056710656 257986 mm/memory.c:1054 func:folio_prealloc
> > 1279262720 610 mm/khugepaged.c:1084 func:alloc_charge_folio
> > 1334530048 325763 mm/readahead.c:186 func:ractl_alloc_folio
> > 3341238272 412215 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> > [ice] func:ice_alloc_mapped_page
> >
> I have a suspicion that the issue is related to the updating of
> page_count in ice_get_rx_pgcnt(). The i40e driver has a very similar
> logic for page reuse but doesn't do this. It also has a counter to track
> failure to re-use the Rx pages.
>
> Commit 11c4aa074d54 ("ice: gather page_count()'s of each frag right
> before XDP prog call") changed the logic to update page_count of the Rx
> page just prior to the XDP call instead of at the point where we get the
> page from ice_get_rx_buf(). I think this change was originally
> introduced while we were trying out an experimental refactor of the
> hotpath to handle fragments differently, which no longer happens since
> 743bbd93cf29 ("ice: put Rx buffers after being done with current
> frame"), which ironically was part of this very same series..
>
> I think this updating of page count is accidentally causing us to
> miscount when we could perform page-reuse, and ultimately causes us to
> leak the page somehow. I'm still investigating, but I think this might
> trigger if somehow the page pgcnt - pagecnt_bias becomes >1, we don't
> reuse the page.
>
> The i40e driver stores the page count in i40e_get_rx_buffer, and I think
> our updating it later can somehow get things out-of-sync.
>
> Do you know if your traffic pattern happens to send fragmented frames? I
Hmm, I checked:
* the node_netstat_Ip_Frag* metrics, and they are empty (do not exist),
* a short run of "tcpdump -n -i any 'ip[6:2] & 0x3fff != 0'", which found nothing,
so it looks to me like there is no fragmentation.
> think iperf doesn't do that, which might be part of whats causing this
> issue. I'm going to try to see if I can generate such fragmentation to
> confirm. Is your MTU kept at the default ethernet size?
Our MTU size is set to 9000 everywhere.
>
> At the very least I'm going to propose a patch for ice similar to the
> one from Joe Damato to track the rx busy page count. That might at least
> help track something..
>
> Thanks,
> Jake
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-03 6:46 ` Jaroslav Pulchart
@ 2025-07-03 16:16 ` Jacob Keller
2025-07-04 19:30 ` Maciej Fijalkowski
2025-07-07 18:32 ` Jacob Keller
0 siblings, 2 replies; 46+ messages in thread
From: Jacob Keller @ 2025-07-03 16:16 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 6719 bytes --]
On 7/2/2025 11:46 PM, Jaroslav Pulchart wrote:
>>
>> On 7/2/2025 2:48 AM, Jaroslav Pulchart wrote:
>>>>
>>>> On 6/30/2025 11:48 PM, Jaroslav Pulchart wrote:
>>>>>> On 6/30/2025 2:56 PM, Jacob Keller wrote:
>>>>>>> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
>>>>>>> its not a fix for your issue (since you mentioned 6.14 has failed
>>>>>>> testing in your system)
>>>>>>>
>>>>>>> $ git describe --first-parent --contains --match=v* --exclude=*rc*
>>>>>>> 743bbd93cf29f653fae0e1416a31f03231689911
>>>>>>> v6.14~251^2~15^2~2
>>>>>>>
>>>>>>> I don't see any other relevant changes since v6.14. I can try to see if
>>>>>>> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
>>>>>>> systems here.
>>>>>>
>>>>>> On my system I see this at boot after loading the ice module from
>>>>>>
>>>>>> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec>
>>>>>> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice]
>>>>>> func:ice_get_irq_res
>>>>>>> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
>>>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
>>>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
>>>>>>> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
>>>>>>> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
>>>>>>> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
>>>>>>> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
>>>>>>> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
>>>>>>> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
>>>>>>
>>>>>> Its about 1GB for the mapped pages. I don't see any increase moment to
>>>>>> moment. I've started an iperf session to simulate some traffic, and I'll
>>>>>> leave this running to see if anything changes overnight.
>>>>>>
>>>>>> Is there anything else that you can share about the traffic setup or
>>>>>> otherwise that I could look into? Your system seems to use ~2.5 x the
>>>>>> buffer size as mine, but that might just be a smaller number of CPUs.
>>>>>>
>>>>>> Hopefully I'll get some more results overnight.
>>>>>
>>>>> The traffic is random production workloads from VMs, using standard
>>>>> Linux or OVS bridges. There is no specific pattern to it. I haven’t
>>>>> had any luck reproducing (or was not patient enough) this with iperf3
>>>>> myself. The two active (UP) interfaces are in an LACP bonding setup.
>>>>> Here are our ethtool settings for the two member ports (em1 and p3p1)
>>>>>
>>>>
>>>> I had iperf3 running overnight and the memory usage for
>>>> ice_alloc_mapped_pages is constant here. Mine was direct connections
>>>> without bridge or bonding. From your description I assume there's no XDP
>>>> happening either.
>>>
>>> Yes, no XDP in use.
>>>
>>> BTW the allocinfo after 6days uptime:
>>> # uptime ; sort -g /proc/allocinfo| tail -n 15
>>> 11:46:44 up 6 days, 2:18, 1 user, load average: 9.24, 11.33, 15.07
>>> 102489024 533797 fs/dcache.c:1681 func:__d_alloc
>>> 106229760 25935 mm/shmem.c:1854 func:shmem_alloc_folio
>>> 117118192 103097 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>> 162783232 7656 mm/slub.c:2452 func:alloc_slab_page
>>> 189906944 46364 mm/memory.c:1056 func:folio_prealloc
>>> 499384320 121920 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
>>> 625876992 54186 mm/slub.c:2450 func:alloc_slab_page
>>> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>> 1014710272 247732 mm/filemap.c:1978 func:__filemap_get_folio
>>> 1056710656 257986 mm/memory.c:1054 func:folio_prealloc
>>> 1279262720 610 mm/khugepaged.c:1084 func:alloc_charge_folio
>>> 1334530048 325763 mm/readahead.c:186 func:ractl_alloc_folio
>>> 3341238272 412215 drivers/net/ethernet/intel/ice/ice_txrx.c:681
>>> [ice] func:ice_alloc_mapped_page
>>>
>> I have a suspicion that the issue is related to the updating of
>> page_count in ice_get_rx_pgcnt(). The i40e driver has a very similar
>> logic for page reuse but doesn't do this. It also has a counter to track
>> failure to re-use the Rx pages.
>>
>> Commit 11c4aa074d54 ("ice: gather page_count()'s of each frag right
>> before XDP prog call") changed the logic to update page_count of the Rx
>> page just prior to the XDP call instead of at the point where we get the
>> page from ice_get_rx_buf(). I think this change was originally
>> introduced while we were trying out an experimental refactor of the
>> hotpath to handle fragments differently, which no longer happens since
>> 743bbd93cf29 ("ice: put Rx buffers after being done with current
>> frame"), which ironically was part of this very same series..
>>
>> I think this updating of page count is accidentally causing us to
>> miscount when we could perform page-reuse, and ultimately causes us to
>> leak the page somehow. I'm still investigating, but I think this might
>> trigger if somehow the page pgcnt - pagecnt_bias becomes >1, we don't
>> reuse the page.
>>
>> The i40e driver stores the page count in i40e_get_rx_buffer, and I think
>> our updating it later can somehow get things out-of-sync.
>>
>> Do you know if your traffic pattern happens to send fragmented frames? I
>
> Hmm, I check the
> * node_netstat_Ip_Frag* metrics and they are empty(do-not-exists),
> * shortly run "tcpdump -n -i any 'ip[6:2] & 0x3fff != 0'" and nothing was found
> looks to me like there is no fragmentation.
>
Good to rule it out at least.
>> think iperf doesn't do that, which might be part of whats causing this
>> issue. I'm going to try to see if I can generate such fragmentation to
>> confirm. Is your MTU kept at the default ethernet size?
>
> Our MTU size is set to 9000 everywhere.
>
Ok. I am re-trying with MTU 9000 and using some traffic generated by wrk
now. I do see much larger memory use (~2GB) when using MTU 9000, so that
tracks with what your system shows. Currently it's fluctuating between
1.9 and 2 GB. I'll leave this going for a couple of days while on vacation
and see if anything pops up.
Thanks,
Jake
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-04-14 16:29 Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad) Jaroslav Pulchart
2025-04-14 17:15 ` [Intel-wired-lan] " Paul Menzel
2025-04-15 14:38 ` Przemek Kitszel
@ 2025-07-04 16:55 ` Michal Kubiak
2025-07-05 7:01 ` Jaroslav Pulchart
2 siblings, 1 reply; 46+ messages in thread
From: Michal Kubiak @ 2025-07-04 16:55 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Tony Nguyen, Kitszel, Przemyslaw, jdamato, intel-wired-lan,
netdev, Igor Raits, Daniel Secik, Zdenek Pesek
On Mon, Apr 14, 2025 at 06:29:01PM +0200, Jaroslav Pulchart wrote:
> Hello,
>
> While investigating increased memory usage after upgrading our
> host/hypervisor servers from Linux kernel 6.12.y to 6.13.y, I observed
> a regression in available memory per NUMA node. Our servers allocate
> 60GB of each NUMA node’s 64GB of RAM to HugePages for VMs, leaving 4GB
> for the host OS.
>
> After the upgrade, we noticed approximately 500MB less free RAM on
> NUMA nodes 0 and 2 compared to 6.12.y, even with no VMs running (just
> the host OS after reboot). These nodes host Intel 810-XXV NICs. Here's
> a snapshot of the NUMA stats on vanilla 6.13.y:
>
> NUMA nodes: 0 1 2 3 4 5 6 7 8
> 9 10 11 12 13 14 15
> HPFreeGiB: 60 60 60 60 60 60 60 60 60
> 60 60 60 60 60 60 60
> MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453
> 65470 65470 65470 65470 65470 65470 65470 65462
> MemFree: 2793 3559 3150 3438 3616 3722 3520 3547 3547
> 3536 3506 3452 3440 3489 3607 3729
>
> We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> "ice: Add support for persistent NAPI config".
>
> We limit the number of channels on the NICs to match local NUMA cores
> or less if unused interface (from ridiculous 96 default), for example:
> ethtool -L em1 combined 6 # active port; from 96
> ethtool -L p3p2 combined 2 # unused port; from 96
>
> This typically aligns memory use with local CPUs and keeps NUMA-local
> memory usage within expected limits. However, starting with kernel
> 6.13.y and this commit, the high memory usage by the ICE driver
> persists regardless of reduced channel configuration.
>
> Reverting the commit restores expected memory availability on nodes 0
> and 2. Below are stats from 6.13.y with the commit reverted:
> NUMA nodes: 0 1 2 3 4 5 6 7 8
> 9 10 11 12 13 14 15
> HPFreeGiB: 60 60 60 60 60 60 60 60 60
> 60 60 60 60 60 60 60
> MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470
> 65470 65470 65470 65470 65470 65470 65462
> MemFree: 3208 3765 3668 3507 3811 3727 3812 3546 3676 3596 ...
>
> This brings nodes 0 and 2 back to ~3.5GB free RAM, similar to kernel
> 6.12.y, and avoids swap pressure and memory exhaustion when running
> services and VMs.
>
> I also do not see any practical benefit in persisting the channel
> memory allocation. After a fresh server reboot, channels are not
> explicitly configured, and the system will not automatically resize
> them back to a higher count unless manually set again. Therefore,
> retaining the previous memory footprint appears unnecessary and
> potentially harmful in memory-constrained environments
>
> Best regards,
> Jaroslav Pulchart
>
Hello Jaroslav,
I have just sent a series converting the Rx path of the ice driver
to use the Page Pool.
We suspect it may help with the memory consumption issue, since it removes
the problematic code and delegates some memory management to the generic
code.
Could you please give it a try and check if it helps with your issue?
The link to the series: https://lore.kernel.org/intel-wired-lan/20250704161859.871152-1-michal.kubiak@intel.com/
Thanks,
Michal
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-03 16:16 ` Jacob Keller
@ 2025-07-04 19:30 ` Maciej Fijalkowski
2025-07-07 18:32 ` Jacob Keller
1 sibling, 0 replies; 46+ messages in thread
From: Maciej Fijalkowski @ 2025-07-04 19:30 UTC (permalink / raw)
To: Jacob Keller
Cc: Jaroslav Pulchart, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
On Thu, Jul 03, 2025 at 09:16:35AM -0700, Jacob Keller wrote:
>
>
> On 7/2/2025 11:46 PM, Jaroslav Pulchart wrote:
> >>
> >> On 7/2/2025 2:48 AM, Jaroslav Pulchart wrote:
> >>>>
> >>>> On 6/30/2025 11:48 PM, Jaroslav Pulchart wrote:
> >>>>>> On 6/30/2025 2:56 PM, Jacob Keller wrote:
> >>>>>>> Unfortunately it looks like the fix I mentioned has landed in 6.14, so
> >>>>>>> its not a fix for your issue (since you mentioned 6.14 has failed
> >>>>>>> testing in your system)
> >>>>>>>
> >>>>>>> $ git describe --first-parent --contains --match=v* --exclude=*rc*
> >>>>>>> 743bbd93cf29f653fae0e1416a31f03231689911
> >>>>>>> v6.14~251^2~15^2~2
> >>>>>>>
> >>>>>>> I don't see any other relevant changes since v6.14. I can try to see if
> >>>>>>> I see similar issues with CONFIG_MEM_ALLOC_PROFILING on some test
> >>>>>>> systems here.
> >>>>>>
> >>>>>> On my system I see this at boot after loading the ice module from
> >>>>>>
> >>>>>> $ grep -F "/ice/" /proc/allocinfo | sort -g | tail | numfmt --to=iec>
> >>>>>> 26K 230 drivers/net/ethernet/intel/ice/ice_irq.c:84 [ice]
> >>>>>> func:ice_get_irq_res
> >>>>>>> 48K 2 drivers/net/ethernet/intel/ice/ice_arfs.c:565 [ice] func:ice_init_arfs
> >>>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:397 [ice] func:ice_vsi_alloc_ring_stats
> >>>>>>> 57K 226 drivers/net/ethernet/intel/ice/ice_lib.c:416 [ice] func:ice_vsi_alloc_ring_stats
> >>>>>>> 85K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1398 [ice] func:ice_vsi_alloc_rings
> >>>>>>> 339K 226 drivers/net/ethernet/intel/ice/ice_lib.c:1422 [ice] func:ice_vsi_alloc_rings
> >>>>>>> 678K 226 drivers/net/ethernet/intel/ice/ice_base.c:109 [ice] func:ice_vsi_alloc_q_vector
> >>>>>>> 1.1M 257 drivers/net/ethernet/intel/ice/ice_fwlog.c:40 [ice] func:ice_fwlog_alloc_ring_buffs
> >>>>>>> 7.2M 114 drivers/net/ethernet/intel/ice/ice_txrx.c:493 [ice] func:ice_setup_rx_ring
> >>>>>>> 896M 229264 drivers/net/ethernet/intel/ice/ice_txrx.c:680 [ice] func:ice_alloc_mapped_page
> >>>>>>
> >>>>>> Its about 1GB for the mapped pages. I don't see any increase moment to
> >>>>>> moment. I've started an iperf session to simulate some traffic, and I'll
> >>>>>> leave this running to see if anything changes overnight.
> >>>>>>
> >>>>>> Is there anything else that you can share about the traffic setup or
> >>>>>> otherwise that I could look into? Your system seems to use ~2.5 x the
> >>>>>> buffer size as mine, but that might just be a smaller number of CPUs.
> >>>>>>
> >>>>>> Hopefully I'll get some more results overnight.
> >>>>>
> >>>>> The traffic is random production workloads from VMs, using standard
> >>>>> Linux or OVS bridges. There is no specific pattern to it. I haven’t
> >>>>> had any luck reproducing (or was not patient enough) this with iperf3
> >>>>> myself. The two active (UP) interfaces are in an LACP bonding setup.
> >>>>> Here are our ethtool settings for the two member ports (em1 and p3p1)
> >>>>>
> >>>>
> >>>> I had iperf3 running overnight and the memory usage for
> >>>> ice_alloc_mapped_pages is constant here. Mine was direct connections
> >>>> without bridge or bonding. From your description I assume there's no XDP
> >>>> happening either.
> >>>
> >>> Yes, no XDP in use.
> >>>
> >>> BTW the allocinfo after 6days uptime:
> >>> # uptime ; sort -g /proc/allocinfo| tail -n 15
> >>> 11:46:44 up 6 days, 2:18, 1 user, load average: 9.24, 11.33, 15.07
> >>> 102489024 533797 fs/dcache.c:1681 func:__d_alloc
> >>> 106229760 25935 mm/shmem.c:1854 func:shmem_alloc_folio
> >>> 117118192 103097 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
> >>> 134479872 32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
> >>> 162783232 7656 mm/slub.c:2452 func:alloc_slab_page
> >>> 189906944 46364 mm/memory.c:1056 func:folio_prealloc
> >>> 499384320 121920 mm/percpu-vm.c:95 func:pcpu_alloc_pages
> >>> 530579456 129536 mm/page_ext.c:271 func:alloc_page_ext
> >>> 625876992 54186 mm/slub.c:2450 func:alloc_slab_page
> >>> 838860800 400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
> >>> 1014710272 247732 mm/filemap.c:1978 func:__filemap_get_folio
> >>> 1056710656 257986 mm/memory.c:1054 func:folio_prealloc
> >>> 1279262720 610 mm/khugepaged.c:1084 func:alloc_charge_folio
> >>> 1334530048 325763 mm/readahead.c:186 func:ractl_alloc_folio
> >>> 3341238272 412215 drivers/net/ethernet/intel/ice/ice_txrx.c:681
> >>> [ice] func:ice_alloc_mapped_page
> >>>
> >> I have a suspicion that the issue is related to the updating of
> >> page_count in ice_get_rx_pgcnt(). The i40e driver has a very similar
> >> logic for page reuse but doesn't do this. It also has a counter to track
> >> failure to re-use the Rx pages.
> >>
> >> Commit 11c4aa074d54 ("ice: gather page_count()'s of each frag right
> >> before XDP prog call") changed the logic to update page_count of the Rx
> >> page just prior to the XDP call instead of at the point where we get the
> >> page from ice_get_rx_buf(). I think this change was originally
> >> introduced while we were trying out an experimental refactor of the
> >> hotpath to handle fragments differently, which no longer happens since
> >> 743bbd93cf29 ("ice: put Rx buffers after being done with current
> >> frame"), which ironically was part of this very same series..
> >>
> >> I think this updating of page count is accidentally causing us to
> >> miscount when we could perform page-reuse, and ultimately causes us to
> >> leak the page somehow. I'm still investigating, but I think this might
> >> trigger if somehow the page pgcnt - pagecnt_bias becomes >1, we don't
> >> reuse the page.
> >>
> >> The i40e driver stores the page count in i40e_get_rx_buffer, and I think
> >> our updating it later can somehow get things out-of-sync.
> >>
> >> Do you know if your traffic pattern happens to send fragmented frames? I
> >
> > Hmm, I check the
> > * node_netstat_Ip_Frag* metrics and they are empty(do-not-exists),
> > * shortly run "tcpdump -n -i any 'ip[6:2] & 0x3fff != 0'" and nothing was found
> > looks to me like there is no fragmentation.
> >
>
> Good to rule it out at least.
>
> >> think iperf doesn't do that, which might be part of whats causing this
> >> issue. I'm going to try to see if I can generate such fragmentation to
> >> confirm. Is your MTU kept at the default ethernet size?
> >
> > Our MTU size is set to 9000 everywhere.
> >
>
> Ok. I am re-trying with MTU 9000 and using some traffic generated by wrk
> now. I do see much larger memory use (~2GB) when using MTU 9000, so that
> tracks with what your system shows. Currently its fluctuating between
> 1.9 and 2G. I'll leave this going for a couple of days while on vacation
> and see if anything pops up.
I was wondering whether order-1 pages might be making the mess here for
some reason, since for 9k MTU we pull them and split them in half.
Maybe it would be worth trying out whether legacy-rx (which works on
order-0 pages) has the same issue? But that would require an 8k MTU.
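For reference, the arithmetic behind the order-1 remark, assuming a 4k
base page and ~3k Rx buffers for jumbo MTU without legacy-rx
(illustrative only, not driver code):

#include <stdio.h>

#define BASE_PAGE_SIZE	4096u
#define RX_BUF_LEN	3072u	/* assumed jumbo-MTU buffer size */

int main(void)
{
	/* a ~3k buffer no longer fits twice into a single 4k page ... */
	unsigned int order = (RX_BUF_LEN > BASE_PAGE_SIZE / 2) ? 1 : 0;
	/* ... so an order-1 (8k) page is allocated and split in half */
	unsigned int pg_size = BASE_PAGE_SIZE << order;

	printf("page order %u (%u bytes), split into 2 x %u-byte halves\n",
	       order, pg_size, pg_size / 2);
	return 0;
}

With legacy-rx the buffers come from order-0 pages, so trying it could
help isolate whether the order-1 splitting is involved.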
>
> Thanks,
> Jake
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-04 16:55 ` Michal Kubiak
@ 2025-07-05 7:01 ` Jaroslav Pulchart
2025-07-07 15:37 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-05 7:01 UTC (permalink / raw)
To: Michal Kubiak
Cc: Tony Nguyen, Kitszel, Przemyslaw, jdamato, intel-wired-lan,
netdev, Igor Raits, Daniel Secik, Zdenek Pesek
> On Mon, Apr 14, 2025 at 06:29:01PM +0200, Jaroslav Pulchart wrote:
> > Hello,
> >
> > While investigating increased memory usage after upgrading our
> > host/hypervisor servers from Linux kernel 6.12.y to 6.13.y, I observed
> > a regression in available memory per NUMA node. Our servers allocate
> > 60GB of each NUMA node’s 64GB of RAM to HugePages for VMs, leaving 4GB
> > for the host OS.
> >
> > After the upgrade, we noticed approximately 500MB less free RAM on
> > NUMA nodes 0 and 2 compared to 6.12.y, even with no VMs running (just
> > the host OS after reboot). These nodes host Intel 810-XXV NICs. Here's
> > a snapshot of the NUMA stats on vanilla 6.13.y:
> >
> > NUMA nodes: 0 1 2 3 4 5 6 7 8
> > 9 10 11 12 13 14 15
> > HPFreeGiB: 60 60 60 60 60 60 60 60 60
> > 60 60 60 60 60 60 60
> > MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453
> > 65470 65470 65470 65470 65470 65470 65470 65462
> > MemFree: 2793 3559 3150 3438 3616 3722 3520 3547 3547
> > 3536 3506 3452 3440 3489 3607 3729
> >
> > We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> > "ice: Add support for persistent NAPI config".
> >
> > We limit the number of channels on the NICs to match local NUMA cores
> > or less if unused interface (from ridiculous 96 default), for example:
> > ethtool -L em1 combined 6 # active port; from 96
> > ethtool -L p3p2 combined 2 # unused port; from 96
> >
> > This typically aligns memory use with local CPUs and keeps NUMA-local
> > memory usage within expected limits. However, starting with kernel
> > 6.13.y and this commit, the high memory usage by the ICE driver
> > persists regardless of reduced channel configuration.
> >
> > Reverting the commit restores expected memory availability on nodes 0
> > and 2. Below are stats from 6.13.y with the commit reverted:
> > NUMA nodes: 0 1 2 3 4 5 6 7 8
> > 9 10 11 12 13 14 15
> > HPFreeGiB: 60 60 60 60 60 60 60 60 60
> > 60 60 60 60 60 60 60
> > MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470
> > 65470 65470 65470 65470 65470 65470 65462
> > MemFree: 3208 3765 3668 3507 3811 3727 3812 3546 3676 3596 ...
> >
> > This brings nodes 0 and 2 back to ~3.5GB free RAM, similar to kernel
> > 6.12.y, and avoids swap pressure and memory exhaustion when running
> > services and VMs.
> >
> > I also do not see any practical benefit in persisting the channel
> > memory allocation. After a fresh server reboot, channels are not
> > explicitly configured, and the system will not automatically resize
> > them back to a higher count unless manually set again. Therefore,
> > retaining the previous memory footprint appears unnecessary and
> > potentially harmful in memory-constrained environments
> >
> > Best regards,
> > Jaroslav Pulchart
> >
>
>
> Hello Jaroslav,
>
> I have just sent a series for converting the Rx path of the ice driver
> to use the Page Pool.
> We suspect it may help for the memory consumption issue since it removes
> the problematic code and delegates some memory management to the generic
> code.
>
> Could you please give it a try and check if it helps for your issue.
> The link to the series: https://lore.kernel.org/intel-wired-lan/20250704161859.871152-1-michal.kubiak@intel.com/
I can try it; however, I cannot apply the patch as-is on 6.15.y:
$ git am ~/ice-convert-Rx-path-to-Page-Pool.patch
Applying: ice: remove legacy Rx and construct SKB
Applying: ice: drop page splitting and recycling
error: patch failed: drivers/net/ethernet/intel/ice/ice_txrx.h:480
error: drivers/net/ethernet/intel/ice/ice_txrx.h: patch does not apply
Patch failed at 0002 ice: drop page splitting and recycling
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
>
> Thanks,
> Michal
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-05 7:01 ` Jaroslav Pulchart
@ 2025-07-07 15:37 ` Jaroslav Pulchart
0 siblings, 0 replies; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-07 15:37 UTC (permalink / raw)
To: Michal Kubiak
Cc: Tony Nguyen, Kitszel, Przemyslaw, jdamato, intel-wired-lan,
netdev, Igor Raits, Daniel Secik, Zdenek Pesek
so 5. 7. 2025 v 9:01 odesílatel Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> napsal:
>
> > On Mon, Apr 14, 2025 at 06:29:01PM +0200, Jaroslav Pulchart wrote:
> > > Hello,
> > >
> > > While investigating increased memory usage after upgrading our
> > > host/hypervisor servers from Linux kernel 6.12.y to 6.13.y, I observed
> > > a regression in available memory per NUMA node. Our servers allocate
> > > 60GB of each NUMA node’s 64GB of RAM to HugePages for VMs, leaving 4GB
> > > for the host OS.
> > >
> > > After the upgrade, we noticed approximately 500MB less free RAM on
> > > NUMA nodes 0 and 2 compared to 6.12.y, even with no VMs running (just
> > > the host OS after reboot). These nodes host Intel 810-XXV NICs. Here's
> > > a snapshot of the NUMA stats on vanilla 6.13.y:
> > >
> > > NUMA nodes: 0 1 2 3 4 5 6 7 8
> > > 9 10 11 12 13 14 15
> > > HPFreeGiB: 60 60 60 60 60 60 60 60 60
> > > 60 60 60 60 60 60 60
> > > MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453
> > > 65470 65470 65470 65470 65470 65470 65470 65462
> > > MemFree: 2793 3559 3150 3438 3616 3722 3520 3547 3547
> > > 3536 3506 3452 3440 3489 3607 3729
> > >
> > > We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> > > "ice: Add support for persistent NAPI config".
> > >
> > > We limit the number of channels on the NICs to match local NUMA cores
> > > or less if unused interface (from ridiculous 96 default), for example:
> > > ethtool -L em1 combined 6 # active port; from 96
> > > ethtool -L p3p2 combined 2 # unused port; from 96
> > >
> > > This typically aligns memory use with local CPUs and keeps NUMA-local
> > > memory usage within expected limits. However, starting with kernel
> > > 6.13.y and this commit, the high memory usage by the ICE driver
> > > persists regardless of reduced channel configuration.
> > >
> > > Reverting the commit restores expected memory availability on nodes 0
> > > and 2. Below are stats from 6.13.y with the commit reverted:
> > > NUMA nodes: 0 1 2 3 4 5 6 7 8
> > > 9 10 11 12 13 14 15
> > > HPFreeGiB: 60 60 60 60 60 60 60 60 60
> > > 60 60 60 60 60 60 60
> > > MemTotal: 64989 65470 65470 65470 65470 65470 65470 65453 65470
> > > 65470 65470 65470 65470 65470 65470 65462
> > > MemFree: 3208 3765 3668 3507 3811 3727 3812 3546 3676 3596 ...
> > >
> > > This brings nodes 0 and 2 back to ~3.5GB free RAM, similar to kernel
> > > 6.12.y, and avoids swap pressure and memory exhaustion when running
> > > services and VMs.
> > >
> > > I also do not see any practical benefit in persisting the channel
> > > memory allocation. After a fresh server reboot, channels are not
> > > explicitly configured, and the system will not automatically resize
> > > them back to a higher count unless manually set again. Therefore,
> > > retaining the previous memory footprint appears unnecessary and
> > > potentially harmful in memory-constrained environments
> > >
> > > Best regards,
> > > Jaroslav Pulchart
> > >
> >
> >
> > Hello Jaroslav,
> >
> > I have just sent a series for converting the Rx path of the ice driver
> > to use the Page Pool.
> > We suspect it may help for the memory consumption issue since it removes
> > the problematic code and delegates some memory management to the generic
> > code.
> >
> > Could you please give it a try and check if it helps for your issue.
> > The link to the series: https://lore.kernel.org/intel-wired-lan/20250704161859.871152-1-michal.kubiak@intel.com/
>
> I can try it, however I cannot apply the patch as-is @ 6.15.y:
> $ git am ~/ice-convert-Rx-path-to-Page-Pool.patch
> Applying: ice: remove legacy Rx and construct SKB
> Applying: ice: drop page splitting and recycling
> error: patch failed: drivers/net/ethernet/intel/ice/ice_txrx.h:480
> error: drivers/net/ethernet/intel/ice/ice_txrx.h: patch does not apply
> Patch failed at 0002 ice: drop page splitting and recycling
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> hint: When you have resolved this problem, run "git am --continue".
> hint: If you prefer to skip this patch, run "git am --skip" instead.
> hint: To restore the original branch and stop patching, run "git am --abort".
> hint: Disable this message with "git config set advice.mergeConflict false"
>
My colleague and I have applied the missing bits and have it building
on 6.15.5 (note that we had to disable CONFIG_MEM_ALLOC_PROFILING, or
the kernel won’t boot). The patches we used are:
0001-libeth-convert-to-netmem.patch
0002-libeth-support-native-XDP-and-register-memory-model.patch
0003-libeth-xdp-add-XDP_TX-buffers-sending.patch
0004-libeth-xdp-add-.ndo_xdp_xmit-helpers.patch
0005-libeth-xdp-add-XDPSQE-completion-helpers.patch
0006-libeth-xdp-add-XDPSQ-locking-helpers.patch
0007-libeth-xdp-add-XDPSQ-cleanup-timers.patch
0008-libeth-xdp-add-helpers-for-preparing-processing-libe.patch
0009-libeth-xdp-add-XDP-prog-run-and-verdict-result-handl.patch
0010-libeth-xdp-add-templates-for-building-driver-side-ca.patch
0011-libeth-xdp-add-RSS-hash-hint-and-XDP-features-setup-.patch
0012-libeth-xsk-add-XSk-XDP_TX-sending-helpers.patch
0013-libeth-xsk-add-XSk-xmit-functions.patch
0014-libeth-xsk-add-XSk-Rx-processing-support.patch
0015-libeth-xsk-add-XSkFQ-refill-and-XSk-wakeup-helpers.patch
0016-libeth-xdp-xsk-access-adjacent-u32s-as-u64-where-app.patch
0017-ice-add-a-separate-Rx-handler-for-flow-director-comm.patch
0018-ice-remove-legacy-Rx-and-construct-SKB.patch
0019-ice-drop-page-splitting-and-recycling.patch
0020-ice-switch-to-Page-Pool.patch
Unfortunately, the new setup crashes after VMs are started. Here’s the
oops trace:
[ 82.816544] tun: Universal TUN/TAP device driver, 1.6
[ 82.823923] tap2c2b8dfc-91: entered promiscuous mode
[ 82.848913] tapa92181fc-b5: entered promiscuous mode
[ 84.030527] tap54ab9888-90: entered promiscuous mode
[ 84.043251] tap89f4f7ae-d1: entered promiscuous mode
[ 85.768578] tapf1e9f4f9-17: entered promiscuous mode
[ 85.780372] tap72c64909-77: entered promiscuous mode
[ 87.580455] tape1b2d2dd-bc: entered promiscuous mode
[ 87.593224] tap34fb2668-4a: entered promiscuous mode
[ 150.406899] Oops: general protection fault, probably for
non-canonical address 0xffff3b95e757d5a0: 0000 [#1] SMP NOPTI
[ 150.417626] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Tainted: G
E 6.15.5-1.gdc+ice.el9.x86_64 #1 PREEMPT(lazy)
[ 150.428845] Tainted: [E]=UNSIGNED_MODULE
[ 150.432773] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS
2.19.0 03/07/2025
[ 150.440432] RIP: 0010:page_pool_put_unrefed_netmem+0xe2/0x250
[ 150.446186] Code: 18 48 85 d2 0f 84 58 ff ff ff 8b 52 2c 4c 89 e7
39 d0 41 0f 94 c5 e8 0d f2 ff ff 84 c0 0f 85 4f ff ff ff 48 8b 85 60
06 00 00 <65> 48 ff 40 20 5b 4c 89 e6 48 89 ef 5d 41 5c 41 5d e9 f8 fa
ff ff
[ 150.464947] RSP: 0018:ffffbc4a003fcd18 EFLAGS: 00010246
[ 150.470173] RAX: ffff9dcabfc37580 RBX: 00000000ffffffff RCX: 0000000000000000
[ 150.477496] RDX: 0000000000000000 RSI: fffff2ec441924c0 RDI: fffff2ec441924c0
[ 150.484773] RBP: ffff9dcabfc36f20 R08: ffff9dc330536d20 R09: 0000000000551618
[ 150.492045] R10: 0000000000000000 R11: 0000000000000f82 R12: fffff2ec441924c0
[ 150.499317] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001b69
[ 150.506584] FS: 0000000000000000(0000) GS:ffff9dcb27946000(0000)
knlGS:0000000000000000
[ 150.514806] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 150.520677] CR2: 00007f82d00041b8 CR3: 000000012bcab00a CR4: 0000000000770ef0
[ 150.527937] PKRU: 55555554
[ 150.530770] Call Trace:
[ 150.533342] <IRQ>
[ 150.535484] ice_clean_rx_irq+0x288/0x530 [ice]
[ 150.540171] ? sched_balance_find_src_group+0x13f/0x210
[ 150.545521] ? ice_clean_tx_irq+0x18f/0x3a0 [ice]
[ 150.550373] ice_napi_poll+0xe2/0x290 [ice]
[ 150.554709] __napi_poll+0x27/0x1e0
[ 150.558323] net_rx_action+0x1d3/0x3f0
[ 150.562194] ? __napi_schedule+0x8e/0xb0
[ 150.566239] ? sched_clock+0xc/0x30
[ 150.569852] ? sched_clock_cpu+0xb/0x190
[ 150.573897] handle_softirqs+0xd0/0x2b0
[ 150.577858] __irq_exit_rcu+0xcd/0xf0
[ 150.581636] common_interrupt+0x7f/0xa0
[ 150.585601] </IRQ>
[ 150.587826] <TASK>
[ 150.590049] asm_common_interrupt+0x22/0x40
[ 150.594352] RIP: 0010:flush_smp_call_function_queue+0x39/0x50
[ 150.600218] Code: 80 c0 bb 2e 98 48 85 c0 74 31 53 9c 5b fa bf 01
00 00 00 e8 49 f5 ff ff 65 66 83 3d 58 af 90 02 00 75 0c 80 e7 02 74
01 fb 5b <c3> cc cc cc cc e8 8d 1d f1 ff 80 e7 02 74 f0 eb ed c3 cc cc
cc cc
[ 150.619204] RSP: 0018:ffffbc4a001e7ed8 EFLAGS: 00000202
[ 150.624550] RAX: 0000000000000000 RBX: ffff9dc2c0088000 RCX: 00000000000f4240
[ 150.631806] RDX: 0000000000007f0c RSI: 0000000000000008 RDI: ffff9dcabfc30880
[ 150.639057] RBP: 0000000000000004 R08: 0000000000000008 R09: ffff9dcabfc311e8
[ 150.646314] R10: ffff9dcabfc1fd80 R11: 0000000000000004 R12: ffff9dc2c1e64400
[ 150.653569] R13: ffffffff978da0e0 R14: 0000000000000001 R15: 0000000000000000
[ 150.660829] do_idle+0x13a/0x200
[ 150.664186] cpu_startup_entry+0x25/0x30
[ 150.668241] start_secondary+0x114/0x140
[ 150.672292] common_startup_64+0x13e/0x141
[ 150.676525] </TASK>
[ 150.678840] Modules linked in: target_core_user(E) uio(E)
target_core_pscsi(E) target_core_file(E) target_core_iblock(E)
nf_conntrack_netlink(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E)
tun(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E)
nfs(E) lockd(E) grace(E) netfs(E) netconsole(E)
scsi_transport_iscsi(E) sch_ingress(E) iscsi_target_mod(E)
target_core_mod(E) 8021q(E) garp(E) mrp(E) bonding(E) tls(E)
nfnetlink_cttimeout(E) nfnetlink(E) openvswitch(E) nf_conncount(E)
nf_nat(E) psample(E) ib_core(E) binfmt_misc(E) dell_rbu(E) sunrpc(E)
vfat(E) fat(E) dm_service_time(E) dm_multipath(E) amd_atl(E)
intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) ipmi_ssif(E)
edac_mce_amd(E) kvm_amd(E) kvm(E) dell_pc(E) platform_profile(E)
dell_smbios(E) dcdbas(E) mgag200(E) irqbypass(E)
dell_wmi_descriptor(E) wmi_bmof(E) i2c_algo_bit(E) rapl(E)
acpi_cpufreq(E) ptdma(E) i2c_piix4(E) acpi_power_meter(E) ipmi_si(E)
k10temp(E) i2c_smbus(E) acpi_ipmi(E) wmi(E) ipmi_devintf(E)
ipmi_msghandler(E) tcp_bbr(E) fuse(E) zram(E)
[ 150.678894] lz4hc_compress(E) lz4_compress(E) zstd_compress(E)
ext4(E) crc16(E) mbcache(E) jbd2(E) dm_crypt(E) sd_mod(E) sg(E) ice(E)
ahci(E) polyval_clmulni(E) libie(E) libeth_xdp(E) polyval_generic(E)
libahci(E) libeth(E) ghash_clmulni_intel(E) sha512_ssse3(E) libata(E)
ccp(E) megaraid_sas(E) gnss(E) sp5100_tco(E) dm_mirror(E)
dm_region_hash(E) dm_log(E) dm_mod(E) nf_conntrack(E)
nf_defrag_ipv6(E) nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E)
llc(E)
[ 150.770112] Unloaded tainted modules: fmpm(E):1 fjes(E):2 padlock_aes(E):2
[ 150.818140] ---[ end trace 0000000000000000 ]---
[ 150.913536] pstore: backend (erst) writing error (-22)
[ 150.918850] RIP: 0010:page_pool_put_unrefed_netmem+0xe2/0x250
[ 150.924764] Code: 18 48 85 d2 0f 84 58 ff ff ff 8b 52 2c 4c 89 e7
39 d0 41 0f 94 c5 e8 0d f2 ff ff 84 c0 0f 85 4f ff ff ff 48 8b 85 60
06 00 00 <65> 48 ff 40 20 5b 4c 89 e6 48 89 ef 5d 41 5c 41 5d e9 f8 fa
ff ff
[ 150.943854] RSP: 0018:ffffbc4a003fcd18 EFLAGS: 00010246
[ 150.949245] RAX: ffff9dcabfc37580 RBX: 00000000ffffffff RCX: 0000000000000000
[ 150.956556] RDX: 0000000000000000 RSI: fffff2ec441924c0 RDI: fffff2ec441924c0
[ 150.963860] RBP: ffff9dcabfc36f20 R08: ffff9dc330536d20 R09: 0000000000551618
[ 150.971166] R10: 0000000000000000 R11: 0000000000000f82 R12: fffff2ec441924c0
[ 150.978475] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001b69
[ 150.985782] FS: 0000000000000000(0000) GS:ffff9dcb27946000(0000)
knlGS:0000000000000000
[ 150.994036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 150.999958] CR2: 00007f82d00041b8 CR3: 000000012bcab00a CR4: 0000000000770ef0
[ 151.007270] PKRU: 55555554
[ 151.010151] Kernel panic - not syncing: Fatal exception in interrupt
[ 151.488873] Kernel Offset: 0x14600000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 151.581163] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---
> >
> > Thanks,
> > Michal
> >
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-03 16:16 ` Jacob Keller
2025-07-04 19:30 ` Maciej Fijalkowski
@ 2025-07-07 18:32 ` Jacob Keller
2025-07-07 22:03 ` Jacob Keller
1 sibling, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-07 18:32 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 1394 bytes --]
On 7/3/2025 9:16 AM, Jacob Keller wrote:
> On 7/2/2025 11:46 PM, Jaroslav Pulchart wrote:
>>> think iperf doesn't do that, which might be part of whats causing this
>>> issue. I'm going to try to see if I can generate such fragmentation to
>>> confirm. Is your MTU kept at the default ethernet size?
>>
>> Our MTU size is set to 9000 everywhere.
>>
>
> Ok. I am re-trying with MTU 9000 and using some traffic generated by wrk
> now. I do see much larger memory use (~2GB) when using MTU 9000, so that
> tracks with what your system shows. Currently its fluctuating between
> 1.9 and 2G. I'll leave this going for a couple of days while on vacation
> and see if anything pops up.
>
> Thanks,
> Jake
Good news! After several days of running a wrk and iperf3 workload with
9k MTU, I see a significant increase in the memory usage from the page
allocations:
7.3G 953314 drivers/net/ethernet/intel/ice/ice_txrx.c:682 [ice]
func:ice_alloc_mapped_page
~5GB extra.
At least I can reproduce this now. It's unclear how long it took, since I
was out on vacation from Wednesday until now.
I do have a singular hypothesis regarding the way we're currently
tracking the page count (just based on differences between ice and
i40e). I'm going to attempt to align with i40e and re-run the test.
Hopefully I'll have some more information in a day or two.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-07 18:32 ` Jacob Keller
@ 2025-07-07 22:03 ` Jacob Keller
2025-07-09 0:50 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-07 22:03 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
[-- Attachment #1.1: Type: text/plain, Size: 2279 bytes --]
On 7/7/2025 11:32 AM, Jacob Keller wrote:
>
>
> On 7/3/2025 9:16 AM, Jacob Keller wrote:
>> On 7/2/2025 11:46 PM, Jaroslav Pulchart wrote:
>>>> think iperf doesn't do that, which might be part of whats causing this
>>>> issue. I'm going to try to see if I can generate such fragmentation to
>>>> confirm. Is your MTU kept at the default ethernet size?
>>>
>>> Our MTU size is set to 9000 everywhere.
>>>
>>
>> Ok. I am re-trying with MTU 9000 and using some traffic generated by wrk
>> now. I do see much larger memory use (~2GB) when using MTU 9000, so that
>> tracks with what your system shows. Currently its fluctuating between
>> 1.9 and 2G. I'll leave this going for a couple of days while on vacation
>> and see if anything pops up.
>>
>> Thanks,
>> Jake
>
> Good news! After several days of running a wrk and iperf3 workload with
> 9k MTU, I see a significant increase in the memory usage from the page
> allocations:
>
> 7.3G 953314 drivers/net/ethernet/intel/ice/ice_txrx.c:682 [ice]
> func:ice_alloc_mapped_page
>
> ~5GB extra.
>
> At least I can reproduce this now. Its unclear how long it took since I
> was out on vacation from Wednesday through until now.
>
> I do have a singular hypothesis regarding the way we're currently
> tracking the page count, (just based on differences between ice and
> i40e). I'm going to attempt to align with i40e and re-run the test.
> Hopefully I'll have some more information in a day or two.
Bad news: my hypothesis was incorrect.
Good news: I can immediately see the problem if I set MTU to 9K and
start an iperf3 session and just watch the count of allocations from
ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
if a change is helping.
I ported the stats from i40e for tracking the page allocations, and I
can see that we're allocating new pages despite not actually performing
releases.
I don't yet have a good understanding of what causes this, and the logic
in ice is pretty hard to track...
I'm going to try the page pool patches myself to see if this test bed
triggers the same problems. Unfortunately I think I need someone else
with more experience in the hotpath code to help figure out what's
going wrong here...
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-07 22:03 ` Jacob Keller
@ 2025-07-09 0:50 ` Jacob Keller
2025-07-09 19:11 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-09 0:50 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
On 7/7/2025 3:03 PM, Jacob Keller wrote:
> Bad news: my hypothesis was incorrect.
>
> Good news: I can immediately see the problem if I set MTU to 9K and
> start an iperf3 session and just watch the count of allocations from
> ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
> if a change is helping.
>
> I ported the stats from i40e for tracking the page allocations, and I
> can see that we're allocating new pages despite not actually performing
> releases.
>
> I don't yet have a good understanding of what causes this, and the logic
> in ice is pretty hard to track...
>
> I'm going to try the page pool patches myself to see if this test bed
> triggers the same problems. Unfortunately I think I need someone else
> with more experience with the hotpath code to help figure out whats
> going wrong here...
I believe I have isolated this and figured out the issue: with 9K MTU,
the hardware sometimes posts a multi-buffer frame with an extra
descriptor of size 0 that carries no data. When this happens, our
buffer-tracking logic fails to free the corresponding buffer. We then
later overwrite the page, because we neither freed nor re-used it, and
the overwriting logic doesn't check for this case.
I will have a fix with a more detailed description posted tomorrow.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-09 0:50 ` Jacob Keller
@ 2025-07-09 19:11 ` Jacob Keller
2025-07-09 21:04 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-09 19:11 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
On 7/8/2025 5:50 PM, Jacob Keller wrote:
>
>
> On 7/7/2025 3:03 PM, Jacob Keller wrote:
>> Bad news: my hypothesis was incorrect.
>>
>> Good news: I can immediately see the problem if I set MTU to 9K and
>> start an iperf3 session and just watch the count of allocations from
>> ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
>> if a change is helping.
>>
>> I ported the stats from i40e for tracking the page allocations, and I
>> can see that we're allocating new pages despite not actually performing
>> releases.
>>
>> I don't yet have a good understanding of what causes this, and the logic
>> in ice is pretty hard to track...
>>
>> I'm going to try the page pool patches myself to see if this test bed
>> triggers the same problems. Unfortunately I think I need someone else
>> with more experience with the hotpath code to help figure out whats
>> going wrong here...
>
> I believe I have isolated this and figured out the issue: With 9K MTU,
> sometimes the hardware posts a multi-buffer frame with an extra
> descriptor that has a size of 0 bytes with no data in it. When this
> happens, our logic for tracking buffers fails to free this buffer. We
> then later overwrite the page because we failed to either free or re-use
> the page, and our overwriting logic doesn't verify this.
>
> I will have a fix with a more detailed description posted tomorrow.
@Jaroslav, I've posted a fix which I believe should resolve your issue:
https://lore.kernel.org/intel-wired-lan/20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com/T/#u
I am reasonably confident it should resolve the issue you reported. If
possible, it would be appreciated if you could test it and report back
to confirm.
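In case it is easier than saving the patch from lore by hand: assuming
the b4 tool is installed and you are inside your kernel source tree, the
change can be fetched and applied in one step, e.g.:
b4 shazam 20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com   # fetch the patch from lore and apply it onto the current branch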
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-09 19:11 ` Jacob Keller
@ 2025-07-09 21:04 ` Jaroslav Pulchart
2025-07-09 21:15 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-09 21:04 UTC (permalink / raw)
To: Jacob Keller
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
>
>
> On 7/8/2025 5:50 PM, Jacob Keller wrote:
> >
> >
> > On 7/7/2025 3:03 PM, Jacob Keller wrote:
> >> Bad news: my hypothesis was incorrect.
> >>
> >> Good news: I can immediately see the problem if I set MTU to 9K and
> >> start an iperf3 session and just watch the count of allocations from
> >> ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
> >> if a change is helping.
> >>
> >> I ported the stats from i40e for tracking the page allocations, and I
> >> can see that we're allocating new pages despite not actually performing
> >> releases.
> >>
> >> I don't yet have a good understanding of what causes this, and the logic
> >> in ice is pretty hard to track...
> >>
> >> I'm going to try the page pool patches myself to see if this test bed
> >> triggers the same problems. Unfortunately I think I need someone else
> >> with more experience with the hotpath code to help figure out whats
> >> going wrong here...
> >
> > I believe I have isolated this and figured out the issue: With 9K MTU,
> > sometimes the hardware posts a multi-buffer frame with an extra
> > descriptor that has a size of 0 bytes with no data in it. When this
> > happens, our logic for tracking buffers fails to free this buffer. We
> > then later overwrite the page because we failed to either free or re-use
> > the page, and our overwriting logic doesn't verify this.
> >
> > I will have a fix with a more detailed description posted tomorrow.
>
> @Jaroslav, I've posted a fix which I believe should resolve your issue:
>
> https://lore.kernel.org/intel-wired-lan/20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com/T/#u
>
> I am reasonably confident it should resolve the issue you reported. If
> possible, it would be appreciated if you could test it and report back
> to confirm.
@Jacob that’s excellent news!
I’ve built and installed 6.15.5 with your patch on one of our servers
(strangely, I had to disable CONFIG_MEM_ALLOC_PROFILING with this patch
or the kernel wouldn’t boot) and started a VM running our production
traffic. I’ll let it run for a day or two, observe the memory
utilization per NUMA node, and report back.
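For the per-NUMA-node numbers, sampling the kernel's own counters is
enough, e.g. something along these lines (numastat comes from the
numactl package):
grep MemFree /sys/devices/system/node/node*/meminfo   # per-node free memory straight from sysfs
numastat -m | grep -E 'MemTotal|MemFree'              # the same data as one table, one node per column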
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-09 21:04 ` Jaroslav Pulchart
@ 2025-07-09 21:15 ` Jacob Keller
2025-07-11 18:16 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-09 21:15 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
On 7/9/2025 2:04 PM, Jaroslav Pulchart wrote:
>>
>>
>> On 7/8/2025 5:50 PM, Jacob Keller wrote:
>>>
>>>
>>> On 7/7/2025 3:03 PM, Jacob Keller wrote:
>>>> Bad news: my hypothesis was incorrect.
>>>>
>>>> Good news: I can immediately see the problem if I set MTU to 9K and
>>>> start an iperf3 session and just watch the count of allocations from
>>>> ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
>>>> if a change is helping.
>>>>
>>>> I ported the stats from i40e for tracking the page allocations, and I
>>>> can see that we're allocating new pages despite not actually performing
>>>> releases.
>>>>
>>>> I don't yet have a good understanding of what causes this, and the logic
>>>> in ice is pretty hard to track...
>>>>
>>>> I'm going to try the page pool patches myself to see if this test bed
>>>> triggers the same problems. Unfortunately I think I need someone else
>>>> with more experience with the hotpath code to help figure out whats
>>>> going wrong here...
>>>
>>> I believe I have isolated this and figured out the issue: With 9K MTU,
>>> sometimes the hardware posts a multi-buffer frame with an extra
>>> descriptor that has a size of 0 bytes with no data in it. When this
>>> happens, our logic for tracking buffers fails to free this buffer. We
>>> then later overwrite the page because we failed to either free or re-use
>>> the page, and our overwriting logic doesn't verify this.
>>>
>>> I will have a fix with a more detailed description posted tomorrow.
>>
>> @Jaroslav, I've posted a fix which I believe should resolve your issue:
>>
>> https://lore.kernel.org/intel-wired-lan/20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com/T/#u
>>
>> I am reasonably confident it should resolve the issue you reported. If
>> possible, it would be appreciated if you could test it and report back
>> to confirm.
>
> @Jacob that’s excellent news!
>
> I’ve built and installed 6.15.5 with your patch on one of our servers
> (strange that I had to disable CONFIG_MEM_ALLOC_PROFILING with this
> patch or the kernel wouldn’t boot) and started a VM running our
> production traffic. I’ll let it run for a day-two, observe the memory
> utilization per NUMA node and report back.
Great! A bit odd you had to disable CONFIG_MEM_ALLOC_PROFILING. I didn't
have trouble on my kernel with it enabled.
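In case it helps to compare our setups, this is roughly how the option
can be checked on a running kernel; the sysctl only exists when the
option is built in, and /boot/config-* depends on the distro:
grep MEM_ALLOC_PROFILING /boot/config-$(uname -r)   # is the option (and its enabled-by-default variant) compiled in?
sysctl vm.mem_profiling                             # runtime on/off state, if the sysctl is present on this kernel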
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-09 21:15 ` Jacob Keller
@ 2025-07-11 18:16 ` Jaroslav Pulchart
2025-07-11 22:30 ` Jacob Keller
0 siblings, 1 reply; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-11 18:16 UTC (permalink / raw)
To: Jacob Keller
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
>
>
>
> On 7/9/2025 2:04 PM, Jaroslav Pulchart wrote:
> >>
> >>
> >> On 7/8/2025 5:50 PM, Jacob Keller wrote:
> >>>
> >>>
> >>> On 7/7/2025 3:03 PM, Jacob Keller wrote:
> >>>> Bad news: my hypothesis was incorrect.
> >>>>
> >>>> Good news: I can immediately see the problem if I set MTU to 9K and
> >>>> start an iperf3 session and just watch the count of allocations from
> >>>> ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
> >>>> if a change is helping.
> >>>>
> >>>> I ported the stats from i40e for tracking the page allocations, and I
> >>>> can see that we're allocating new pages despite not actually performing
> >>>> releases.
> >>>>
> >>>> I don't yet have a good understanding of what causes this, and the logic
> >>>> in ice is pretty hard to track...
> >>>>
> >>>> I'm going to try the page pool patches myself to see if this test bed
> >>>> triggers the same problems. Unfortunately I think I need someone else
> >>>> with more experience with the hotpath code to help figure out whats
> >>>> going wrong here...
> >>>
> >>> I believe I have isolated this and figured out the issue: With 9K MTU,
> >>> sometimes the hardware posts a multi-buffer frame with an extra
> >>> descriptor that has a size of 0 bytes with no data in it. When this
> >>> happens, our logic for tracking buffers fails to free this buffer. We
> >>> then later overwrite the page because we failed to either free or re-use
> >>> the page, and our overwriting logic doesn't verify this.
> >>>
> >>> I will have a fix with a more detailed description posted tomorrow.
> >>
> >> @Jaroslav, I've posted a fix which I believe should resolve your issue:
> >>
> >> https://lore.kernel.org/intel-wired-lan/20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com/T/#u
> >>
> >> I am reasonably confident it should resolve the issue you reported. If
> >> possible, it would be appreciated if you could test it and report back
> >> to confirm.
> >
> > @Jacob that’s excellent news!
> >
> > I’ve built and installed 6.15.5 with your patch on one of our servers
> > (strange that I had to disable CONFIG_MEM_ALLOC_PROFILING with this
> > patch or the kernel wouldn’t boot) and started a VM running our
> > production traffic. I’ll let it run for a day-two, observe the memory
> > utilization per NUMA node and report back.
>
> Great! A bit odd you had to disable CONFIG_MEM_ALLOC_PROFILING. I didn't
> have trouble on my kernel with it enabled.
Status update after ~45h of uptime: so far so good. I do not see the
continuous memory-consumption increase on the NICs' home NUMA nodes that
we saw before. See the attached "status_before_after_45h_uptime.png"
comparison.
[-- Attachment #2: status_before_after_45h_uptime.png --]
[-- Type: image/png, Size: 355801 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-11 18:16 ` Jaroslav Pulchart
@ 2025-07-11 22:30 ` Jacob Keller
2025-07-14 5:34 ` Jaroslav Pulchart
0 siblings, 1 reply; 46+ messages in thread
From: Jacob Keller @ 2025-07-11 22:30 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
On 7/11/2025 11:16 AM, Jaroslav Pulchart wrote:
>>
>>
>>
>> On 7/9/2025 2:04 PM, Jaroslav Pulchart wrote:
>>>>
>>>>
>>>> On 7/8/2025 5:50 PM, Jacob Keller wrote:
>>>>>
>>>>>
>>>>> On 7/7/2025 3:03 PM, Jacob Keller wrote:
>>>>>> Bad news: my hypothesis was incorrect.
>>>>>>
>>>>>> Good news: I can immediately see the problem if I set MTU to 9K and
>>>>>> start an iperf3 session and just watch the count of allocations from
>>>>>> ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
>>>>>> if a change is helping.
>>>>>>
>>>>>> I ported the stats from i40e for tracking the page allocations, and I
>>>>>> can see that we're allocating new pages despite not actually performing
>>>>>> releases.
>>>>>>
>>>>>> I don't yet have a good understanding of what causes this, and the logic
>>>>>> in ice is pretty hard to track...
>>>>>>
>>>>>> I'm going to try the page pool patches myself to see if this test bed
>>>>>> triggers the same problems. Unfortunately I think I need someone else
>>>>>> with more experience with the hotpath code to help figure out whats
>>>>>> going wrong here...
>>>>>
>>>>> I believe I have isolated this and figured out the issue: With 9K MTU,
>>>>> sometimes the hardware posts a multi-buffer frame with an extra
>>>>> descriptor that has a size of 0 bytes with no data in it. When this
>>>>> happens, our logic for tracking buffers fails to free this buffer. We
>>>>> then later overwrite the page because we failed to either free or re-use
>>>>> the page, and our overwriting logic doesn't verify this.
>>>>>
>>>>> I will have a fix with a more detailed description posted tomorrow.
>>>>
>>>> @Jaroslav, I've posted a fix which I believe should resolve your issue:
>>>>
>>>> https://lore.kernel.org/intel-wired-lan/20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com/T/#u
>>>>
>>>> I am reasonably confident it should resolve the issue you reported. If
>>>> possible, it would be appreciated if you could test it and report back
>>>> to confirm.
>>>
>>> @Jacob that’s excellent news!
>>>
>>> I’ve built and installed 6.15.5 with your patch on one of our servers
>>> (strange that I had to disable CONFIG_MEM_ALLOC_PROFILING with this
>>> patch or the kernel wouldn’t boot) and started a VM running our
>>> production traffic. I’ll let it run for a day-two, observe the memory
>>> utilization per NUMA node and report back.
>>
>> Great! A bit odd you had to disable CONFIG_MEM_ALLOC_PROFILING. I didn't
>> have trouble on my kernel with it enabled.
>
> Status update after ~45h of uptime. So far so good, I do not see
> continuous memory consumption increase on home numa nodes like before.
> See attached "status_before_after_45h_uptime.png" comparison.
Great news! Would you like your "Tested-by" added to the commit
message when we submit the fix to netdev?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad)
2025-07-11 22:30 ` Jacob Keller
@ 2025-07-14 5:34 ` Jaroslav Pulchart
0 siblings, 0 replies; 46+ messages in thread
From: Jaroslav Pulchart @ 2025-07-14 5:34 UTC (permalink / raw)
To: Jacob Keller
Cc: Maciej Fijalkowski, Jakub Kicinski, Przemek Kitszel,
intel-wired-lan@lists.osuosl.org, Damato, Joe,
netdev@vger.kernel.org, Nguyen, Anthony L, Michal Swiatkowski,
Czapnik, Lukasz, Dumazet, Eric, Zaki, Ahmed, Martin Karsten,
Igor Raits, Daniel Secik, Zdenek Pesek
>
> On 7/11/2025 11:16 AM, Jaroslav Pulchart wrote:
> >>
> >>
> >>
> >> On 7/9/2025 2:04 PM, Jaroslav Pulchart wrote:
> >>>>
> >>>>
> >>>> On 7/8/2025 5:50 PM, Jacob Keller wrote:
> >>>>>
> >>>>>
> >>>>> On 7/7/2025 3:03 PM, Jacob Keller wrote:
> >>>>>> Bad news: my hypothesis was incorrect.
> >>>>>>
> >>>>>> Good news: I can immediately see the problem if I set MTU to 9K and
> >>>>>> start an iperf3 session and just watch the count of allocations from
> >>>>>> ice_alloc_mapped_pages(). It goes up consistently, so I can quickly tell
> >>>>>> if a change is helping.
> >>>>>>
> >>>>>> I ported the stats from i40e for tracking the page allocations, and I
> >>>>>> can see that we're allocating new pages despite not actually performing
> >>>>>> releases.
> >>>>>>
> >>>>>> I don't yet have a good understanding of what causes this, and the logic
> >>>>>> in ice is pretty hard to track...
> >>>>>>
> >>>>>> I'm going to try the page pool patches myself to see if this test bed
> >>>>>> triggers the same problems. Unfortunately I think I need someone else
> >>>>>> with more experience with the hotpath code to help figure out whats
> >>>>>> going wrong here...
> >>>>>
> >>>>> I believe I have isolated this and figured out the issue: With 9K MTU,
> >>>>> sometimes the hardware posts a multi-buffer frame with an extra
> >>>>> descriptor that has a size of 0 bytes with no data in it. When this
> >>>>> happens, our logic for tracking buffers fails to free this buffer. We
> >>>>> then later overwrite the page because we failed to either free or re-use
> >>>>> the page, and our overwriting logic doesn't verify this.
> >>>>>
> >>>>> I will have a fix with a more detailed description posted tomorrow.
> >>>>
> >>>> @Jaroslav, I've posted a fix which I believe should resolve your issue:
> >>>>
> >>>> https://lore.kernel.org/intel-wired-lan/20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com/T/#u
> >>>>
> >>>> I am reasonably confident it should resolve the issue you reported. If
> >>>> possible, it would be appreciated if you could test it and report back
> >>>> to confirm.
> >>>
> >>> @Jacob that’s excellent news!
> >>>
> >>> I’ve built and installed 6.15.5 with your patch on one of our servers
> >>> (strange that I had to disable CONFIG_MEM_ALLOC_PROFILING with this
> >>> patch or the kernel wouldn’t boot) and started a VM running our
> >>> production traffic. I’ll let it run for a day-two, observe the memory
> >>> utilization per NUMA node and report back.
> >>
> >> Great! A bit odd you had to disable CONFIG_MEM_ALLOC_PROFILING. I didn't
> >> have trouble on my kernel with it enabled.
> >
> > Status update after ~45h of uptime. So far so good, I do not see
> > continuous memory consumption increase on home numa nodes like before.
> > See attached "status_before_after_45h_uptime.png" comparison.
>
> Great news! Would you like your "Tested-by" being added to the commit
> message when we submit the fix to netdev?
Jacob, absolutely.
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread [~2025-07-14 5:35 UTC | newest]
Thread overview: 46+ messages
2025-04-14 16:29 Increased memory usage on NUMA nodes with ICE driver after upgrade to 6.13.y (regression in commit 492a044508ad) Jaroslav Pulchart
2025-04-14 17:15 ` [Intel-wired-lan] " Paul Menzel
2025-04-15 14:38 ` Przemek Kitszel
2025-04-16 0:53 ` Jakub Kicinski
2025-04-16 7:13 ` Jaroslav Pulchart
2025-04-16 13:48 ` Jakub Kicinski
2025-04-16 16:03 ` Jaroslav Pulchart
2025-04-16 22:44 ` Jakub Kicinski
2025-04-16 22:57 ` [Intel-wired-lan] " Keller, Jacob E
2025-04-16 22:57 ` Keller, Jacob E
2025-04-17 0:13 ` Jakub Kicinski
2025-04-17 17:52 ` Keller, Jacob E
2025-05-21 10:50 ` Jaroslav Pulchart
2025-06-04 8:42 ` Jaroslav Pulchart
[not found] ` <CAK8fFZ5XTO9dGADuMSV0hJws-6cZE9equa3X6dfTBgDyzE1pEQ@mail.gmail.com>
2025-06-25 14:03 ` Przemek Kitszel
[not found] ` <CAK8fFZ7LREBEdhXjBAKuaqktOz1VwsBTxcCpLBsa+dkMj4Pyyw@mail.gmail.com>
2025-06-25 20:25 ` Jakub Kicinski
2025-06-26 7:42 ` Jaroslav Pulchart
2025-06-30 7:35 ` Jaroslav Pulchart
2025-06-30 16:02 ` Jacob Keller
2025-06-30 17:24 ` Jaroslav Pulchart
2025-06-30 18:59 ` Jacob Keller
2025-06-30 20:01 ` Jaroslav Pulchart
2025-06-30 20:42 ` Jacob Keller
2025-06-30 21:56 ` Jacob Keller
2025-06-30 23:16 ` Jacob Keller
2025-07-01 6:48 ` Jaroslav Pulchart
2025-07-01 20:48 ` Jacob Keller
2025-07-02 9:48 ` Jaroslav Pulchart
2025-07-02 18:01 ` Jacob Keller
2025-07-02 21:56 ` Jacob Keller
2025-07-03 6:46 ` Jaroslav Pulchart
2025-07-03 16:16 ` Jacob Keller
2025-07-04 19:30 ` Maciej Fijalkowski
2025-07-07 18:32 ` Jacob Keller
2025-07-07 22:03 ` Jacob Keller
2025-07-09 0:50 ` Jacob Keller
2025-07-09 19:11 ` Jacob Keller
2025-07-09 21:04 ` Jaroslav Pulchart
2025-07-09 21:15 ` Jacob Keller
2025-07-11 18:16 ` Jaroslav Pulchart
2025-07-11 22:30 ` Jacob Keller
2025-07-14 5:34 ` Jaroslav Pulchart
2025-06-25 14:53 ` Paul Menzel
2025-07-04 16:55 ` Michal Kubiak
2025-07-05 7:01 ` Jaroslav Pulchart
2025-07-07 15:37 ` Jaroslav Pulchart