netdev.vger.kernel.org archive mirror
* [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute
@ 2025-09-01  2:47 Vincent Li
  2025-09-01  9:23 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Vincent Li @ 2025-09-01  2:47 UTC (permalink / raw)
  To: netdev, xdp-newbies, loongarch
  Cc: Maxime Coquelin, Alexandre Torgue, Jesper Dangaard Brouer,
	Huacai Chen

Hi,

I noticed that once I attach an XDP program to a dwmac-loongson-pci
network device on a LoongArch PC, the kernel logs the stalled pool
message below every minute. It does not seem to affect network traffic,
though. It does not seem to be architecture dependent, so I decided to
report this to the netdev and XDP mailing lists in case there is a bug
in stmmac-based network devices with XDP.

xdp-filter load green0

Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec
Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
Aug 31 19:21:08 loongfire kernel: [200993.642391] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200520 sec
Aug 31 19:22:08 loongfire kernel: [201054.058292] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200581 sec

Thanks!

Vincent


* Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute
  2025-09-01  2:47 [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute Vincent Li
@ 2025-09-01  9:23 ` Jesper Dangaard Brouer
  2025-09-01 17:56   ` Vincent Li
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2025-09-01  9:23 UTC (permalink / raw)
  To: Vincent Li, netdev, xdp-newbies, loongarch, Dragos Tatulea,
	Furong Xu
  Cc: Maxime Coquelin, Alexandre Torgue, Huacai Chen, Jakub Kicinski,
	Mina Almasry, Philipp Stanner, Ilias Apalodimas, Qunqin Zhao,
	Yanteng Si, Andrew Lunn

Hi Vincent,

Thanks for reporting.
Please see my instructions inline below.
I would appreciate it if you reply inline to my questions.


On 01/09/2025 04.47, Vincent Li wrote:
> Hi,
> 
> I noticed once I attached a XDP program to a dwmac-loongson-pci
> network device on a loongarch PC, the kernel logs stalled pool message
> below every minute, it seems  not to affect network traffic though. it
> does not seem to be architecture dependent, so I decided to report
> this to netdev and XDP mailing list in case there is a bug in stmmac
> related network device with XDP.
> 

Dragos (Cc'ed) gave a very detailed talk [1] about debugging page_pool
leaks, which I highly recommend:
  [1] https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html

Before doing kernel debugging with drgn, I have some easier steps that I
want you to perform on your hardware (I cannot reproduce this myself, as
I don't have this hardware).

The first step is to check whether a socket has unprocessed packets
stalled in its receive queue (Recv-Q).  Use the command 'netstat -tapenu'
and look at the "Recv-Q" column.  If any socket/application has not
emptied its Recv-Q, try restarting that service and see if the "stalled
pool shutdown" message goes away.
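
For anyone who prefers to script this check, here is a minimal sketch in
C. It assumes the usual /proc/net/tcp line layout, where the fifth field
is "tx_queue:rx_queue" in hex, and that the rx_queue half is what netstat
reports as Recv-Q for TCP sockets. The function name
proc_net_tcp_rx_queue is made up for this illustration.

```c
#include <stdio.h>

/* Parse one socket line in the /proc/net/tcp layout and return its
 * rx_queue value (assumed here to be what netstat shows as Recv-Q),
 * or -1 if the line does not match, e.g. the header line. */
long proc_net_tcp_rx_queue(const char *line)
{
	unsigned int tx_queue, rx_queue;
	/* Skip "sl:", local addr:port, remote addr:port and state, then
	 * read the two hex queue lengths. */
	int n = sscanf(line,
		       " %*d: %*64[0-9A-Fa-f]:%*x %*64[0-9A-Fa-f]:%*x %*x %x:%x",
		       &tx_queue, &rx_queue);

	return n == 2 ? (long)rx_queue : -1;
}
```

A caller would read /proc/net/tcp line by line and flag any socket for
which this returns a value greater than zero.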

The second step is compiling the kernel with CONFIG_DEBUG_VM enabled.
This will warn us if the driver leaked a page_pool-controlled page
without first "releasing" it correctly.  See commit dba1b8a7ab68
("mm/page_pool: catch page_pool memory leaks") for what the warning
looks like.
  (P.S. CONFIG_DEBUG_VM has surprisingly low overhead as long as you
don't select any sub-options, so we choose to run with it in
production.)

Third step is doing kernel debugging like Dragos did in [1].

What kernel version are you using?

In kernel v6.8 we (Kuba) silenced some of the cases.  See commit
be0096676e23 ("net: page_pool: mute the periodic warning for visible
page pools").
Jakub/Kuba, can you remind us how to use the netlink tools that can
help us inspect the page_pools active on the system?


> xdp-filter load green0
> 

Most drivers change the memory model and reset the RX rings when
attaching XDP.  So it makes sense that the existing page_pool instances
(one per RXq) are freed and new ones allocated, which reveals any leaked
or unprocessed page_pool pages.
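
To illustrate why a single outstanding page keeps the old pool alive,
here is a simplified userspace model in C. It assumes the pool tracks
one counter for pages handed out and one for pages released, and treats
their difference as "inflight". The struct and function names are
hypothetical and are not the kernel's struct page_pool.

```c
#include <stdint.h>

/* Simplified model for illustration, NOT the kernel's struct page_pool. */
struct pool_model {
	uint32_t hold_cnt;    /* pages handed out by the pool */
	uint32_t release_cnt; /* pages given back or released */
};

/* Wrap-safe distance between the two 32-bit counters. */
int32_t pool_inflight(const struct pool_model *p)
{
	return (int32_t)(p->hold_cnt - p->release_cnt);
}

/* In this model, shutdown can only finish once nothing is in flight. */
int pool_can_shutdown(const struct pool_model *p)
{
	return pool_inflight(p) == 0;
}
```

In this model, a pool with 10 pages handed out and 9 released reports
1 inflight and cannot complete shutdown, which is the shape of the
"1 inflight" value in the log.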


> Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec

It is very weird that a stall time of 200399 sec is reported. This
indicates that the stall began *before* xdp-filter was attached. Given
the uptime "200871.855044", the leak happened about 472 sec after
booting this system.
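
The arithmetic behind that estimate is just the log timestamp minus the
reported stall duration. A tiny C sketch, where stall_start_uptime is a
hypothetical helper name and not a kernel function:

```c
/* Hypothetical helper (not a kernel API): estimate the uptime, in
 * seconds, at which the pool shutdown started stalling, from the log
 * timestamp (seconds since boot) and the stall duration in the warning. */
double stall_start_uptime(double log_uptime_sec, long stalled_sec)
{
	return log_uptime_sec - (double)stalled_sec;
}
```

With the values above, stall_start_uptime(200871.855044, 200399) comes
out between 472 and 473 seconds after boot.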

Have you seen these dmesg logs before attaching XDP?

This will help us know if this page_pool became "invisible" according to
Kuba's change, if you run kernel >= v6.8.


> Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
> Aug 31 19:21:08 loongfire kernel: [200993.642391]
> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> 200520 sec
> Aug 31 19:22:08 loongfire kernel: [201054.058292]
> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> 200581 sec
> 

Cc'ed some people that might have access to this hardware, can any of
you reproduce?

--Jesper


* Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute
  2025-09-01  9:23 ` Jesper Dangaard Brouer
@ 2025-09-01 17:56   ` Vincent Li
  2025-09-01 23:27     ` Vincent Li
  0 siblings, 1 reply; 6+ messages in thread
From: Vincent Li @ 2025-09-01 17:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, xdp-newbies, loongarch, Dragos Tatulea, Furong Xu,
	Maxime Coquelin, Alexandre Torgue, Huacai Chen, Jakub Kicinski,
	Mina Almasry, Philipp Stanner, Ilias Apalodimas, Qunqin Zhao,
	Yanteng Si, Andrew Lunn

Hi Jesper,

Thank you for your input!

On Mon, Sep 1, 2025 at 2:23 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> Hi Vincent,
>
> Thanks for reporting.
> Please see my instruction inlined below.
> Will appreciate if you reply inline below to my questions.
>
>
> On 01/09/2025 04.47, Vincent Li wrote:
> > Hi,
> >
> > I noticed once I attached a XDP program to a dwmac-loongson-pci
> > network device on a loongarch PC, the kernel logs stalled pool message
> > below every minute, it seems  not to affect network traffic though. it
> > does not seem to be architecture dependent, so I decided to report
> > this to netdev and XDP mailing list in case there is a bug in stmmac
> > related network device with XDP.
> >
>
> Dragos (Cc'ed) gave a very detailed talk[1] about debugging page_pool
> leaks, that I highly recommend:
>   [1]
> https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html
>
> Before doing kernel debugging with drgn, I have some easier steps, I
> want you to perform on your hardware (I cannot reproduce given I don't
> have this hardware).

I watched the video and slides. I would have difficulty running drgn,
since the loongfire OS [0] I am running does not have proper Python
support. loongfire is a port of IPFire to the LoongArch architecture.
The kernel is upstream stable release 6.15.9 with a backport of the
LoongArch BPF trampoline to support xdp-tools. I run loongfire on a
LoongArch PC for my home Internet connection. I tried to reproduce this
issue on the LoongArch PC with a Fedora desktop OS release and the same
6.15.9 kernel, but could not. I am not sure whether this is only
reproducible on a firewall/router-like Linux OS with an stmmac device.

>
> First step is to check is a socket have unprocessed packets stalled in
> it receive-queue (Recv-Q).  Use command 'netstat -tapenu' and look at
> column "Recv-Q".  If any socket/application have not emptied it's Recv-Q
> try to restart this service and see if the "stalled pool shutdown" goes
> away.

Recv-Q shows 0 for every socket in the 'netstat -tapenu' output:

[root@loongfire ~]# netstat -tapenu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 127.0.0.1:8953          0.0.0.0:*               LISTEN      0          10283      1896/unbound
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      0          10281      1896/unbound
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      0          8708       2823/sshd: /usr/sbi
tcp        0    272 192.168.9.1:22          192.168.9.13:58660      ESTABLISHED 0          8754       3004/sshd-session:
tcp6       0      0 :::81                   :::*                    LISTEN      0          7828       2841/httpd
tcp6       0      0 :::444                  :::*                    LISTEN      0          7832       2841/httpd
tcp6       0      0 :::1013                 :::*                    LISTEN      0          7836       2841/httpd
tcp6       0      0 10.0.0.229:444          192.168.9.13:58762      TIME_WAIT   0          0          -
udp        0      0 0.0.0.0:53              0.0.0.0:*                           0          10280      1896/unbound
udp        0      0 0.0.0.0:67              0.0.0.0:*                           0          10647      2803/dhcpd
udp        0      0 10.0.0.229:68           0.0.0.0:*                           0          8644       2659/dhcpcd: [BOOTP
udp        0      0 10.0.0.229:123          0.0.0.0:*                           0          8679       2757/ntpd
udp        0      0 192.168.9.1:123         0.0.0.0:*                           0          8678       2757/ntpd
udp        0      0 127.0.0.1:123           0.0.0.0:*                           0          8677       2757/ntpd
udp        0      0 0.0.0.0:123             0.0.0.0:*                           0          8670       2757/ntpd
udp        0      0 0.0.0.0:514             0.0.0.0:*                           0          5689       1864/syslogd
udp6       0      0 :::123                  :::*                                0          8667       2757/ntpd
> Second step is compiling kernel with CONFIG_DEBUG_VM enabled. This will
> warn us if the driver leaked the a page_pool controlled page, without
> first "releasing" is correctly.  See commit dba1b8a7ab68 ("mm/page_pool:
> catch page_pool memory leaks") for how the warning will look like.
>   (p.s. this CONFIG_DEBUG_VM have surprisingly low-overhead, as long as
> you don't select any sub-options, so we choose to run with this in
> production).
>

I enabled CONFIG_DEBUG_VM and recompiled the kernel, but I see no kernel
warning about a page leak. Could the stalled pool message be a false
positive?

[root@loongfire ~]# grep 'CONFIG_DEBUG_VM=y' /boot/config-6.15.9-ipfire
CONFIG_DEBUG_VM=y

[root@loongfire ~]# grep -E 'MEM_TYPE_PAGE_POOL|stalled' /var/log/kern.log
Sep  1 10:23:19 loongfire kernel: [    7.484986] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
Sep  1 10:26:44 loongfire kernel: [  212.514302] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
Sep  1 10:27:44 loongfire kernel: [  272.911878] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 60 sec
Sep  1 10:28:44 loongfire kernel: [  333.327876] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 120 sec
Sep  1 10:29:45 loongfire kernel: [  393.743877] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 181 sec

> Third step is doing kernel debugging like Dragos did in [1].
>
> What kernel version are you using?

kernel 6.15.9

>
> In kernel v6.8 we (Kuba) silenced some of the cases.  See commit
> be0096676e23 ("net: page_pool: mute the periodic warning for visible
> page pools").
> To Jakub/kuba can you remind us how to use the netlink tools that can
> help us inspect the page_pools active on the system?
>
>
> > xdp-filter load green0
> >
>
> Most drivers change memory model and reset the RX rings, when attaching
> XDP.  So, it makes sense that the existing page_pool instances (per RXq)
> are freed and new allocated.  Revealing any leaked or unprocessed
> page_pool pages.
>
>
> > Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> > Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec
>
> It is very weird that a stall time of 200399 sec is reported. This
> indicate that this have been happening *before* the xdp-filter was
> attached. The uptime "200871.855044" indicate leak happened 472 sec
> after booting this system.
>

I am not sure whether I pasted the previous log messages correctly, but
the log I pasted this time should be correct.

> Have you seen these dmesg logs before attaching XDP?

I didn't see such a log before attaching XDP.

>
> This will help us know if this page_pool became "invisible" according to
> Kuba's change, if you run kernel >= v6.8.
>
>
> > Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
> > Aug 31 19:21:08 loongfire kernel: [200993.642391]
> > page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> > 200520 sec
> > Aug 31 19:22:08 loongfire kernel: [201054.058292]
> > page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> > 200581 sec
> >
>
> Cc'ed some people that might have access to this hardware, can any of
> you reproduce?
>
> --Jesper

[0]: https://github.com/vincentmli/loongfire


* Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute
  2025-09-01 17:56   ` Vincent Li
@ 2025-09-01 23:27     ` Vincent Li
  2025-09-05  7:59       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Vincent Li @ 2025-09-01 23:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, xdp-newbies, loongarch, Dragos Tatulea, Furong Xu,
	Maxime Coquelin, Alexandre Torgue, Huacai Chen, Jakub Kicinski,
	Mina Almasry, Philipp Stanner, Ilias Apalodimas, Qunqin Zhao,
	Yanteng Si, Andrew Lunn

On Mon, Sep 1, 2025 at 10:56 AM Vincent Li <vincent.mc.li@gmail.com> wrote:
>
> Hi Jesper,
>
> Thank you for your input!
>
> On Mon, Sep 1, 2025 at 2:23 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >
> > Hi Vincent,
> >
> > Thanks for reporting.
> > Please see my instruction inlined below.
> > Will appreciate if you reply inline below to my questions.
> >
> >
> > On 01/09/2025 04.47, Vincent Li wrote:
> > > Hi,
> > >
> > > I noticed once I attached a XDP program to a dwmac-loongson-pci
> > > network device on a loongarch PC, the kernel logs stalled pool message
> > > below every minute, it seems  not to affect network traffic though. it
> > > does not seem to be architecture dependent, so I decided to report
> > > this to netdev and XDP mailing list in case there is a bug in stmmac
> > > related network device with XDP.
> > >
> >
> > Dragos (Cc'ed) gave a very detailed talk[1] about debugging page_pool
> > leaks, that I highly recommend:
> >   [1]
> > https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html
> >
> > Before doing kernel debugging with drgn, I have some easier steps, I
> > want you to perform on your hardware (I cannot reproduce given I don't
> > have this hardware).
>
> I watched the video and slide, I would have difficulty running drgn
> since the loongfire OS [0] I am running does not have proper python
> support. loongfire is a port of IPFire for LoongArch architecture. The
> kernel is upstream stable release 6.15.9  with a backport of LoongArch
> BPF trampoline for supporting xdp-tools. I run loongfire on a
> LoongArch PC for my home Internet. I tried to reproduce this issue on
> the LoongArch PC with a Fedora desktop OS release with the same kernel
> 6.15.9, I can't reproduce the issue, not sure if this is only
> reproducible for firewall/router like Linux OS with stmmac device.
>
> >
> > First step is to check is a socket have unprocessed packets stalled in
> > it receive-queue (Recv-Q).  Use command 'netstat -tapenu' and look at
> > column "Recv-Q".  If any socket/application have not emptied it's Recv-Q
> > try to restart this service and see if the "stalled pool shutdown" goes
> > away.
>
> the Recv-Q shows 0 from  'netstat -tapenu'
>
>  [root@loongfire ~]#  netstat -tapenu
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address           Foreign Address
> State       User       Inode      PID/Program name
> tcp        0      0 127.0.0.1:8953          0.0.0.0:*
> LISTEN      0          10283      1896/unbound
> tcp        0      0 0.0.0.0:53              0.0.0.0:*
> LISTEN      0          10281      1896/unbound
> tcp        0      0 0.0.0.0:22              0.0.0.0:*
> LISTEN      0          8708       2823/sshd: /usr/sbi
> tcp        0    272 192.168.9.1:22          192.168.9.13:58660
> ESTABLISHED 0          8754       3004/sshd-session:
> tcp6       0      0 :::81                   :::*
> LISTEN      0          7828       2841/httpd
> tcp6       0      0 :::444                  :::*
> LISTEN      0          7832       2841/httpd
> tcp6       0      0 :::1013                 :::*
> LISTEN      0          7836       2841/httpd
> tcp6       0      0 10.0.0.229:444          192.168.9.13:58762
> TIME_WAIT   0          0          -
> udp        0      0 0.0.0.0:53              0.0.0.0:*
>          0          10280      1896/unbound
> udp        0      0 0.0.0.0:67              0.0.0.0:*
>          0          10647      2803/dhcpd
> udp        0      0 10.0.0.229:68           0.0.0.0:*
>          0          8644       2659/dhcpcd: [BOOTP
> udp        0      0 10.0.0.229:123          0.0.0.0:*
>          0          8679       2757/ntpd
> udp        0      0 192.168.9.1:123         0.0.0.0:*
>          0          8678       2757/ntpd
> udp        0      0 127.0.0.1:123           0.0.0.0:*
>          0          8677       2757/ntpd
> udp        0      0 0.0.0.0:123             0.0.0.0:*
>          0          8670       2757/ntpd
> udp        0      0 0.0.0.0:514             0.0.0.0:*
>          0          5689       1864/syslogd
> udp6       0      0 :::123                  :::*
>          0          8667       2757/ntpd
>
> > Second step is compiling kernel with CONFIG_DEBUG_VM enabled. This will
> > warn us if the driver leaked the a page_pool controlled page, without
> > first "releasing" is correctly.  See commit dba1b8a7ab68 ("mm/page_pool:
> > catch page_pool memory leaks") for how the warning will look like.
> >   (p.s. this CONFIG_DEBUG_VM have surprisingly low-overhead, as long as
> > you don't select any sub-options, so we choose to run with this in
> > production).
> >
>
> I added CONFIG_DEBUG_VM and recompiled the kernel, but no kernel
> warning message about page leak, maybe false positive?
>
> [root@loongfire ~]# grep 'CONFIG_DEBUG_VM=y' /boot/config-6.15.9-ipfire
>
> CONFIG_DEBUG_VM=y
>
> [root@loongfire ~]# grep -E 'MEM_TYPE_PAGE_POOL|stalled' /var/log/kern.log
>
> Sep  1 10:23:19 loongfire kernel: [    7.484986] dwmac-loongson-pci
> 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> Sep  1 10:26:44 loongfire kernel: [  212.514302] dwmac-loongson-pci
> 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> Sep  1 10:27:44 loongfire kernel: [  272.911878]
> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 60
> sec
> Sep  1 10:28:44 loongfire kernel: [  333.327876]
> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 120
> sec
> Sep  1 10:29:45 loongfire kernel: [  393.743877]
> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 181
> sec
>

I came up with an fentry BPF program
(https://github.com/vincentmli/loongfire/issues/3) to trace the netdev
value in page_pool_release_retry():

        /* Periodic warning for page pools the user can't see */
        netdev = READ_ONCE(pool->slow.netdev);
        if (time_after_eq(jiffies, pool->defer_warn) &&
            (!netdev || netdev == NET_PTR_POISON)) {
                int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;

                pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
                        __func__, pool->user.id, inflight, sec);
                pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
        }

The BPF program prints a NULL netdev. I wonder whether there is a
leftover page pool that was initially allocated by the stmmac driver,
and whether attaching the XDP program changed that pool's netdev to
NULL?

Page Pool: 0x900000010b54f000
  netdev pointer: 0x0
  is NULL: YES
  is NET_PTR_POISON: NO
  condition (!netdev || netdev == NET_PTR_POISON): TRUE

Page Pool: 0x900000010b54f000
  netdev pointer: 0x0
  is NULL: YES
  is NET_PTR_POISON: NO
  condition (!netdev || netdev == NET_PTR_POISON): TRUE
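
For readers following along, here is a small userspace model in C of the
two pieces of the quoted kernel snippet: the wraparound-safe elapsed
time and the "hidden pool" condition. MODEL_HZ and the function names
are assumptions made for this sketch; the real HZ depends on the kernel
configuration, and the real check uses NET_PTR_POISON.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_HZ 250 /* assumed tick rate, the real HZ is a config choice */

/* Mirrors the quoted expression: subtract in 32 bits so that a jiffies
 * wraparound between defer_start and now still gives the right delta. */
int stalled_seconds(uint32_t jiffies_now, uint32_t defer_start)
{
	return (int32_t)(jiffies_now - defer_start) / MODEL_HZ;
}

/* The warning is only printed for pools whose netdev is NULL or equal
 * to the poison value passed in by the caller. */
bool pool_is_hidden(const void *netdev, const void *poison)
{
	return netdev == NULL || netdev == poison;
}
```

With the NULL netdev printed by the trace above, pool_is_hidden()
returns true in this model, so the periodic warning would be emitted.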

> > Third step is doing kernel debugging like Dragos did in [1].
> >
> > What kernel version are you using?
>
> kernel 6.15.9
>
> >
> > In kernel v6.8 we (Kuba) silenced some of the cases.  See commit
> > be0096676e23 ("net: page_pool: mute the periodic warning for visible
> > page pools").
> > To Jakub/kuba can you remind us how to use the netlink tools that can
> > help us inspect the page_pools active on the system?
> >
> >
> > > xdp-filter load green0
> > >
> >
> > Most drivers change memory model and reset the RX rings, when attaching
> > XDP.  So, it makes sense that the existing page_pool instances (per RXq)
> > are freed and new allocated.  Revealing any leaked or unprocessed
> > page_pool pages.
> >
> >
> > > Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> > > Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec
> >
> > It is very weird that a stall time of 200399 sec is reported. This
> > indicate that this have been happening *before* the xdp-filter was
> > attached. The uptime "200871.855044" indicate leak happened 472 sec
> > after booting this system.
> >
>
> Not sure if I pasted the previous log message correctly, but this time
> the log I pasted should be correct,
>
> > Have you seen these dmesg logs before attaching XDP?
>
> I didn't see such a log before attaching XDP.
>
> >
> > This will help us know if this page_pool became "invisible" according to
> > Kuba's change, if you run kernel >= v6.8.
> >
> >
> > > Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
> > > Aug 31 19:21:08 loongfire kernel: [200993.642391]
> > > page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> > > 200520 sec
> > > Aug 31 19:22:08 loongfire kernel: [201054.058292]
> > > page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> > > 200581 sec
> > >
> >
> > Cc'ed some people that might have access to this hardware, can any of
> > you reproduce?
> >
> > --Jesper
>
> [0]: https://github.com/vincentmli/loongfire


* Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute
  2025-09-01 23:27     ` Vincent Li
@ 2025-09-05  7:59       ` Jesper Dangaard Brouer
  2025-09-05 13:31         ` Vincent Li
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2025-09-05  7:59 UTC (permalink / raw)
  To: Vincent Li, Dragos Tatulea
  Cc: netdev, xdp-newbies, loongarch, Furong Xu, Maxime Coquelin,
	Alexandre Torgue, Huacai Chen, Jakub Kicinski, Mina Almasry,
	Philipp Stanner, Ilias Apalodimas, Qunqin Zhao, Yanteng Si,
	Andrew Lunn, Toke Høiland-Jørgensen



On 02/09/2025 01.27, Vincent Li wrote:
> On Mon, Sep 1, 2025 at 10:56 AM Vincent Li <vincent.mc.li@gmail.com> wrote:
>>
>> On Mon, Sep 1, 2025 at 2:23 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>>
>>> On 01/09/2025 04.47, Vincent Li wrote:
>>>> Hi,
>>>>
>>>> I noticed once I attached a XDP program to a dwmac-loongson-pci
>>>> network device on a loongarch PC, the kernel logs stalled pool message
>>>> below every minute, it seems  not to affect network traffic though. it
>>>> does not seem to be architecture dependent, so I decided to report
>>>> this to netdev and XDP mailing list in case there is a bug in stmmac
>>>> related network device with XDP.
>>>>
>>>
>>> Dragos (Cc'ed) gave a very detailed talk[1] about debugging page_pool
>>> leaks, that I highly recommend:
>>>    [1]
>>> https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html
>>>
>>> Before doing kernel debugging with drgn, I have some easier steps, I
>>> want you to perform on your hardware (I cannot reproduce given I don't
>>> have this hardware).
>>
>> I watched the video and slide, I would have difficulty running drgn
>> since the loongfire OS [0] I am running does not have proper python
>> support. loongfire is a port of IPFire for LoongArch architecture. The
>> kernel is upstream stable release 6.15.9  with a backport of LoongArch
>> BPF trampoline for supporting xdp-tools. I run loongfire on a
>> LoongArch PC for my home Internet. I tried to reproduce this issue on
>> the LoongArch PC with a Fedora desktop OS release with the same kernel
>> 6.15.9, I can't reproduce the issue, not sure if this is only
>> reproducible for firewall/router like Linux OS with stmmac device.
>>
>>>
>>> First step is to check is a socket have unprocessed packets stalled in
>>> it receive-queue (Recv-Q).  Use command 'netstat -tapenu' and look at
>>> column "Recv-Q".  If any socket/application have not emptied it's Recv-Q
>>> try to restart this service and see if the "stalled pool shutdown" goes
>>> away.
>>
>> the Recv-Q shows 0 from  'netstat -tapenu'
>>

This tells us that it wasn't the easy case of packets waiting in a
socket queue, which indicates a higher probability of a driver issue.

>>   [root@loongfire ~]#  netstat -tapenu
>> Active Internet connections (servers and established)
>> Proto Recv-Q Send-Q Local Address           Foreign Address
>> State       User       Inode      PID/Program name
>> tcp        0      0 127.0.0.1:8953          0.0.0.0:*
>> LISTEN      0          10283      1896/unbound
>> tcp        0      0 0.0.0.0:53              0.0.0.0:*
>> LISTEN      0          10281      1896/unbound
>> tcp        0      0 0.0.0.0:22              0.0.0.0:*
>> LISTEN      0          8708       2823/sshd: /usr/sbi
>> tcp        0    272 192.168.9.1:22          192.168.9.13:58660
>> ESTABLISHED 0          8754       3004/sshd-session:
>> tcp6       0      0 :::81                   :::*
>> LISTEN      0          7828       2841/httpd
>> tcp6       0      0 :::444                  :::*
>> LISTEN      0          7832       2841/httpd
>> tcp6       0      0 :::1013                 :::*
>> LISTEN      0          7836       2841/httpd
>> tcp6       0      0 10.0.0.229:444          192.168.9.13:58762
>> TIME_WAIT   0          0          -
>> udp        0      0 0.0.0.0:53              0.0.0.0:*
>>           0          10280      1896/unbound
>> udp        0      0 0.0.0.0:67              0.0.0.0:*
>>           0          10647      2803/dhcpd
>> udp        0      0 10.0.0.229:68           0.0.0.0:*
>>           0          8644       2659/dhcpcd: [BOOTP
>> udp        0      0 10.0.0.229:123          0.0.0.0:*
>>           0          8679       2757/ntpd
>> udp        0      0 192.168.9.1:123         0.0.0.0:*
>>           0          8678       2757/ntpd
>> udp        0      0 127.0.0.1:123           0.0.0.0:*
>>           0          8677       2757/ntpd
>> udp        0      0 0.0.0.0:123             0.0.0.0:*
>>           0          8670       2757/ntpd
>> udp        0      0 0.0.0.0:514             0.0.0.0:*
>>           0          5689       1864/syslogd
>> udp6       0      0 :::123                  :::*
>>           0          8667       2757/ntpd
>>
>>> Second step is compiling kernel with CONFIG_DEBUG_VM enabled. This will
>>> warn us if the driver leaked the a page_pool controlled page, without
>>> first "releasing" is correctly.  See commit dba1b8a7ab68 ("mm/page_pool:
>>> catch page_pool memory leaks") for how the warning will look like.
>>>    (p.s. this CONFIG_DEBUG_VM have surprisingly low-overhead, as long as
>>> you don't select any sub-options, so we choose to run with this in
>>> production).
>>>
>>
>> I added CONFIG_DEBUG_VM and recompiled the kernel, but no kernel
>> warning message about page leak, maybe false positive?
>>

This just tells us that the inflight page_pool page wasn't "illegally"
returned to the MM subsystem.  So this page is stuck somewhere in the
system, still "registered" to a page_pool instance. This is a further
indication of a driver bug.

We are almost out of easy options to try.  The last thing I want you to
try is to unload the NIC driver's kernel module (via rmmod), and then
wait to see if the "stalled pool shutdown" messages disappear. I hope
you have a serial console, so you can still observe the kernel log.

If the "stalled pool shutdown" messages continue, then we have to use
the techniques Dragos used.

This basically means scanning all pages in memory, looking for the
PP_SIGNATURE marker. Here is some example code [1] that walks all memory
pages from the kernel side.  It doesn't actually work as a kernel
module... if I were you, I would just copy-paste it into the driver or
page_pool and call it when we see the stalled messages.  This will help
us identify the page_pool page (after which I would use drgn to
investigate its state).

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench06_walk_all.c#L68-L97
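
To make the idea concrete without kernel headers, here is a userspace
model in C of the signature test such a scan would apply to each page.
It assumes the page carries a pp_magic word whose two low bits may hold
unrelated flags, so they are masked off before comparing. The struct,
the function names and the MODEL_PP_SIGNATURE value are stand-ins for
illustration only; this is not the code at [1].

```c
#include <stddef.h>

/* Assumed value for illustration only; the kernel derives PP_SIGNATURE
 * from 0x40 plus an architecture-specific poison delta. */
#define MODEL_PP_SIGNATURE 0x40UL

/* Minimal stand-in for the one struct page field this check reads. */
struct page_model {
	unsigned long pp_magic;
};

/* Mask off the two low bits before comparing with the signature. */
int page_is_pp(const struct page_model *page)
{
	return (page->pp_magic & ~0x3UL) == MODEL_PP_SIGNATURE;
}

/* Walk a range of pages and count those still marked as page_pool pages. */
size_t count_pp_pages(const struct page_model *pages, size_t n)
{
	size_t i, found = 0;

	for (i = 0; i < n; i++)
		if (page_is_pp(&pages[i]))
			found++;
	return found;
}
```

In the real scan, each hit would be printed (for example its PFN) so the
page can then be inspected with drgn.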


>> [root@loongfire ~]# grep 'CONFIG_DEBUG_VM=y' /boot/config-6.15.9-ipfire
>>
>> CONFIG_DEBUG_VM=y
>>
>> [root@loongfire ~]# grep -E 'MEM_TYPE_PAGE_POOL|stalled' /var/log/kern.log
>>
>> Sep  1 10:23:19 loongfire kernel: [    7.484986] dwmac-loongson-pci
>> 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
>> Sep  1 10:26:44 loongfire kernel: [  212.514302] dwmac-loongson-pci
>> 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
>> Sep  1 10:27:44 loongfire kernel: [  272.911878]
>> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 60
>> sec
>> Sep  1 10:28:44 loongfire kernel: [  333.327876]
>> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 120
>> sec
>> Sep  1 10:29:45 loongfire kernel: [  393.743877]
>> page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 181
>> sec
>>
> 
> I came up a fentry bpf program [0]
> https://github.com/vincentmli/loongfire/issues/3 to trace the netdev
> value in page_pool_release_retry():
> 
>          /* Periodic warning for page pools the user can't see */
>          netdev = READ_ONCE(pool->slow.netdev);
>          if (time_after_eq(jiffies, pool->defer_warn) &&
>              (!netdev || netdev == NET_PTR_POISON)) {
>                  int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
> 
>                  pr_warn("%s() stalled pool shutdown: id %u, %d
> inflight %d sec\n",
>                          __func__, pool->user.id, inflight, sec);
>                  pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>          }
> 
> The bpf program prints netdev  NULL, I wonder if there is left over
> page pool allocated initially by the stmmac driver, and  after
> attaching XDP program, the page pool allocated initially had netdev
> changed to NULL?
> 
> Page Pool: 0x900000010b54f000
>    netdev pointer: 0x0
>    is NULL: YES
>    is NET_PTR_POISON: NO
>    condition (!netdev || netdev == NET_PTR_POISON): TRUE
> 
> Page Pool: 0x900000010b54f000
>    netdev pointer: 0x0
>    is NULL: YES
>    is NET_PTR_POISON: NO
>    condition (!netdev || netdev == NET_PTR_POISON): TRUE
> 
>>> Third step is doing kernel debugging like Dragos did in [1].
>>>
>>> What kernel version are you using?
>>
>> kernel 6.15.9
>>

Nice, that is a very recent kernel.
The above shows us that we are indeed hitting the issue of a "hidden"
page_pool instance (related to the page_pool commit Jakub/Kuba added).
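
To make the "hidden" pool condition concrete, here is a minimal userspace
sketch of the check quoted above from page_pool_release_retry(). It is only
a model: the model_* names, the 32-bit tick counter and the
MODEL_NET_PTR_POISON sentinel are assumptions made for illustration, not
kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: stand-in for the kernel's NET_PTR_POISON sentinel. The real
 * value is configuration dependent; any unique address works for the model. */
static const int model_poison_anchor = 0;
#define MODEL_NET_PTR_POISON ((const void *)&model_poison_anchor)

/* Model of time_after_eq(a, b) on a 32-bit tick counter: true when a is at
 * or after b, tolerating counter wraparound. */
static inline bool model_time_after_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Mirrors the quoted condition: warn only when the warning deadline has
 * passed AND the pool has no netdev the user could see. */
static inline bool model_should_warn(uint32_t now, uint32_t defer_warn,
				     const void *netdev)
{
	return model_time_after_eq(now, defer_warn) &&
	       (netdev == NULL || netdev == MODEL_NET_PTR_POISON);
}

/* Small usage example covering the cases discussed in this thread. */
static inline void model_demo(void)
{
	/* netdev is NULL, as the fentry trace reported: warning is printed */
	assert(model_should_warn(100, 50, NULL));
	/* pool still attached to a visible netdev: warning is muted */
	assert(!model_should_warn(100, 50, "green0"));
}
```

With netdev reported as 0x0 by the trace, the second half of the condition
is always true, so the message repeats each time the deadline passes.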


>>>
>>> In kernel v6.8 we (Kuba) silenced some of the cases.  See commit
>>> be0096676e23 ("net: page_pool: mute the periodic warning for visible
>>> page pools").
>>> Jakub/Kuba, can you remind us how to use the netlink tools that can
>>> help us inspect the page_pools active on the system?
>>>
>>>
>>>> xdp-filter load green0
>>>>
>>>
>>> Most drivers change the memory model and reset the RX rings when
>>> attaching XDP.  So it makes sense that the existing page_pool instances
>>> (per RXq) are freed and new ones allocated, revealing any leaked or
>>> unprocessed page_pool pages.
>>>
>>>
>>>> Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
>>>> Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec
>>>
>>> It is very weird that a stall time of 200399 sec is reported. This
>>> indicates that this has been happening since *before* the xdp-filter
>>> was attached. The uptime "200871.855044" indicates the leak happened
>>> 472 sec after booting this system.
>>>
>>
>> Not sure if I pasted the previous log message correctly, but this time
>> the log I pasted should be correct.
>>
>>> Have you seen these dmesg logs before attaching XDP?
>>
>> I didn't see such a log before attaching XDP.
>>

From the above we have established that this makes sense, as the
mentioned commit would have "blocked it" from being printed.
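
As a rough cross-check of the timestamps in the logs quoted in this thread,
the stall counter can be subtracted from the uptime of the warning. The
sketch below is a userspace model under stated assumptions: the model_*
names are hypothetical, and the tick rate is passed in as a parameter
because the thread does not say which HZ this kernel uses.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the elapsed-seconds expression in page_pool_release_retry():
 *   (s32)((u32)jiffies - (u32)pool->defer_start) / HZ
 * The unsigned subtraction keeps the result correct across a wraparound
 * of the tick counter. */
static inline int32_t model_stall_seconds(uint32_t jiffies_now,
					  uint32_t defer_start, int32_t hz)
{
	assert(hz > 0); /* ASSUMPTION: caller supplies the kernel's HZ */
	return (int32_t)(jiffies_now - defer_start) / hz;
}

/* Uptime (in seconds) at which the pool shutdown must have started, given
 * the uptime of a warning line and the stall time it reports. */
static inline double model_shutdown_start(double uptime_at_warning,
					  int32_t stall_sec)
{
	return uptime_at_warning - (double)stall_sec;
}
```

For the newer log, model_shutdown_start(272.911878, 60) comes out near
212.9, close to the second "Register MEM_TYPE_PAGE_POOL" line at 212.514302,
which is presumably when XDP was attached. For the first report,
model_shutdown_start(200871.855044, 200399) gives the roughly 472 sec
mentioned above.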

>>>
>>> This will help us know if this page_pool became "invisible" according to
>>> Kuba's change, if you run kernel >= v6.8.
>>>
>>>
>>>> Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
>>>> Aug 31 19:21:08 loongfire kernel: [200993.642391] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200520 sec
>>>> Aug 31 19:22:08 loongfire kernel: [201054.058292] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200581 sec
>>>>
>>>
>>> Cc'ed some people that might have access to this hardware, can any of
>>> you reproduce?
>>>

Anyone with this hardware?

>>
>> [0]: https://github.com/vincentmli/loongfire

--Jesper


* Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute
  2025-09-05  7:59       ` Jesper Dangaard Brouer
@ 2025-09-05 13:31         ` Vincent Li
  0 siblings, 0 replies; 6+ messages in thread
From: Vincent Li @ 2025-09-05 13:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Dragos Tatulea, netdev, xdp-newbies, loongarch, Furong Xu,
	Maxime Coquelin, Alexandre Torgue, Huacai Chen, Jakub Kicinski,
	Mina Almasry, Philipp Stanner, Ilias Apalodimas, Qunqin Zhao,
	Yanteng Si, Andrew Lunn, Toke Høiland-Jørgensen

On Fri, Sep 5, 2025 at 12:59 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 02/09/2025 01.27, Vincent Li wrote:
> > On Mon, Sep 1, 2025 at 10:56 AM Vincent Li <vincent.mc.li@gmail.com> wrote:
> >>
> >> On Mon, Sep 1, 2025 at 2:23 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >>>
> >>> On 01/09/2025 04.47, Vincent Li wrote:
> >>>> Hi,
> >>>>
> >>>> I noticed that once I attached an XDP program to a dwmac-loongson-pci
> >>>> network device on a loongarch PC, the kernel logs the stalled pool
> >>>> message below every minute. It does not seem to affect network
> >>>> traffic, though. It does not seem to be architecture dependent, so I
> >>>> decided to report this to the netdev and XDP mailing lists in case
> >>>> there is a bug in stmmac-related network devices with XDP.
> >>>>
> >>>
> >>> Dragos (Cc'ed) gave a very detailed talk[1] about debugging page_pool
> >>> leaks, that I highly recommend:
> >>>    [1]
> >>> https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html
> >>>
> >>> Before doing kernel debugging with drgn, I have some easier steps I
> >>> want you to perform on your hardware (I cannot reproduce, given I
> >>> don't have this hardware).
> >>
> >> I watched the video and slides. I would have difficulty running drgn,
> >> since the loongfire OS [0] I am running does not have proper Python
> >> support. loongfire is a port of IPFire to the LoongArch architecture.
> >> The kernel is upstream stable release 6.15.9 with a backport of the
> >> LoongArch BPF trampoline to support xdp-tools. I run loongfire on a
> >> LoongArch PC for my home Internet. I tried to reproduce this issue on
> >> the LoongArch PC with a Fedora desktop OS release and the same 6.15.9
> >> kernel, but could not. I am not sure whether it is only reproducible
> >> on a firewall/router-like Linux OS with an stmmac device.
> >>
> >>>
> >>> First step is to check if a socket has unprocessed packets stalled in
> >>> its receive-queue (Recv-Q).  Use the command 'netstat -tapenu' and look
> >>> at the column "Recv-Q".  If any socket/application has not emptied its
> >>> Recv-Q, try to restart that service and see if the "stalled pool
> >>> shutdown" message goes away.
> >>
> >> The Recv-Q column shows 0 in the 'netstat -tapenu' output:
> >>
>
> This tells us that it wasn't an easy case of packets waiting in a socket
> queue, which indicates a higher probability of a driver issue.
>
> >>   [root@loongfire ~]#  netstat -tapenu
> >> Active Internet connections (servers and established)
> >> Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
> >> tcp        0      0 127.0.0.1:8953          0.0.0.0:*               LISTEN      0          10283      1896/unbound
> >> tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      0          10281      1896/unbound
> >> tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      0          8708       2823/sshd: /usr/sbi
> >> tcp        0    272 192.168.9.1:22          192.168.9.13:58660      ESTABLISHED 0          8754       3004/sshd-session:
> >> tcp6       0      0 :::81                   :::*                    LISTEN      0          7828       2841/httpd
> >> tcp6       0      0 :::444                  :::*                    LISTEN      0          7832       2841/httpd
> >> tcp6       0      0 :::1013                 :::*                    LISTEN      0          7836       2841/httpd
> >> tcp6       0      0 10.0.0.229:444          192.168.9.13:58762      TIME_WAIT   0          0          -
> >> udp        0      0 0.0.0.0:53              0.0.0.0:*                           0          10280      1896/unbound
> >> udp        0      0 0.0.0.0:67              0.0.0.0:*                           0          10647      2803/dhcpd
> >> udp        0      0 10.0.0.229:68           0.0.0.0:*                           0          8644       2659/dhcpcd: [BOOTP
> >> udp        0      0 10.0.0.229:123          0.0.0.0:*                           0          8679       2757/ntpd
> >> udp        0      0 192.168.9.1:123         0.0.0.0:*                           0          8678       2757/ntpd
> >> udp        0      0 127.0.0.1:123           0.0.0.0:*                           0          8677       2757/ntpd
> >> udp        0      0 0.0.0.0:123             0.0.0.0:*                           0          8670       2757/ntpd
> >> udp        0      0 0.0.0.0:514             0.0.0.0:*                           0          5689       1864/syslogd
> >> udp6       0      0 :::123                  :::*                                0          8667       2757/ntpd
> >>
> >>> Second step is compiling the kernel with CONFIG_DEBUG_VM enabled. This
> >>> will warn us if the driver leaked a page_pool controlled page without
> >>> first "releasing" it correctly.  See commit dba1b8a7ab68 ("mm/page_pool:
> >>> catch page_pool memory leaks") for what the warning will look like.
> >>>    (p.s. CONFIG_DEBUG_VM has surprisingly low overhead, as long as
> >>> you don't select any sub-options, so we choose to run with it in
> >>> production).
> >>>
> >>
> >> I added CONFIG_DEBUG_VM and recompiled the kernel, but there is no
> >> kernel warning message about a page leak. Maybe a false positive?
> >>
>
> This just tells us that the inflight page_pool page wasn't "illegally"
> returned to the MM-subsystem.  So, this page is stuck somewhere in the
> system, still "registered" to a page_pool instance. This is even more
> indication of a driver bug.
>
> We are almost out of easy options to try.  The last attempt I want you
> to try is to unload the NIC driver's kernel module (via rmmod), and then
> wait to see if the "stalled pool shutdown" messages disappear. I hope
> you have some serial console, so you can still observe the kernel log.
>

Should I rmmod the driver kernel module while the XDP program is still
attached? If I detach the XDP program, the message disappears.

> If the "stalled pool shutdown" messages continue, then we have to use
> the techniques Dragos used.
>
> Basically, scanning all pages in memory looking for PP_SIGNATURE.
> Here is some example[1] code that walks all memory pages from the kernel
> side.  This doesn't actually work as a kernel module... if I were you, I
> would just copy-paste this into the driver or page_pool, and call it
> when we see the stalled messages.  This will help us identify the
> page_pool page. (After which I would use Drgn to investigate the state).
>
> [1]
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench06_walk_all.c#L68-L97
>
Can I insert the code into page_pool_release_retry() below, right after
the pr_warn() message at line 1187?  What should I print or do in the
above code inside the if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) block?

1164 static void page_pool_release_retry(struct work_struct *wq)
1165 {
1166         struct delayed_work *dwq = to_delayed_work(wq);
1167         struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
1168         void *netdev;
1169         int inflight;
1170
1171         inflight = page_pool_release(pool);
1172         /* In rare cases, a driver bug may cause inflight to go negative.
1173          * Don't reschedule release if inflight is 0 or negative.
1174          * - If 0, the page_pool has been destroyed
1175          * - if negative, we will never recover
1176          * in both cases no reschedule is necessary.
1177          */
1178         if (inflight <= 0)
1179                 return;
1180
1181         /* Periodic warning for page pools the user can't see */
1182         netdev = READ_ONCE(pool->slow.netdev);
1183         if (time_after_eq(jiffies, pool->defer_warn) &&
1184             (!netdev || netdev == NET_PTR_POISON)) {
1185                 int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
1186
1187                 pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
1188                         __func__, pool->user.id, inflight, sec);
1189                 pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
1190         }
1191
1192         /* Still not ready to be disconnected, retry later */
1193         schedule_delayed_work(&pool->release_dw, DEFER_TIME);
1194 }
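
For reference, the signature test in that if-condition can be modelled in
userspace as below. This is only a sketch under stated assumptions: the
model_* names are hypothetical, and MODEL_PP_SIGNATURE is a placeholder
value, since the kernel's PP_SIGNATURE includes POISON_POINTER_DELTA, which
depends on the kernel configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: placeholder value for illustration only. The real
 * PP_SIGNATURE is configuration dependent. */
#define MODEL_PP_SIGNATURE 0x40UL

/* Same shape as the quoted condition: mask off the two low bits of
 * pp_magic and compare the rest against the signature. */
static inline bool model_page_is_pp(unsigned long pp_magic)
{
	return (pp_magic & ~0x3UL) == MODEL_PP_SIGNATURE;
}

/* Walk an array of pp_magic values (a stand-in for walking all pages)
 * and count how many still carry the page_pool signature. */
static inline size_t model_count_pp_pages(const unsigned long *magic, size_t n)
{
	size_t i, hits = 0;

	assert(magic != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (model_page_is_pp(magic[i]))
			hits++;
	}
	return hits;
}
```

In the real walk the match branch is where per-page details would be
printed; which fields are most useful there is not settled in this thread.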

>
> >> [root@loongfire ~]# grep 'CONFIG_DEBUG_VM=y' /boot/config-6.15.9-ipfire
> >>
> >> CONFIG_DEBUG_VM=y
> >>
> >> [root@loongfire ~]# grep -E 'MEM_TYPE_PAGE_POOL|stalled' /var/log/kern.log
> >>
> >> Sep  1 10:23:19 loongfire kernel: [    7.484986] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> >> Sep  1 10:26:44 loongfire kernel: [  212.514302] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> >> Sep  1 10:27:44 loongfire kernel: [  272.911878] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 60 sec
> >> Sep  1 10:28:44 loongfire kernel: [  333.327876] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 120 sec
> >> Sep  1 10:29:45 loongfire kernel: [  393.743877] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 181 sec
> >>
> >
> > I came up with an fentry BPF program [0]
> > https://github.com/vincentmli/loongfire/issues/3 to trace the netdev
> > value in page_pool_release_retry():
> >
> >          /* Periodic warning for page pools the user can't see */
> >          netdev = READ_ONCE(pool->slow.netdev);
> >          if (time_after_eq(jiffies, pool->defer_warn) &&
> >              (!netdev || netdev == NET_PTR_POISON)) {
> >                  int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
> >
> >                  pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
> >                          __func__, pool->user.id, inflight, sec);
> >                  pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
> >          }
> >
> > The BPF program prints a NULL netdev. I wonder whether a page pool
> > originally allocated by the stmmac driver is left over, and whether
> > its netdev was changed to NULL after the XDP program was attached?
> >
> > Page Pool: 0x900000010b54f000
> >    netdev pointer: 0x0
> >    is NULL: YES
> >    is NET_PTR_POISON: NO
> >    condition (!netdev || netdev == NET_PTR_POISON): TRUE
> >
> > Page Pool: 0x900000010b54f000
> >    netdev pointer: 0x0
> >    is NULL: YES
> >    is NET_PTR_POISON: NO
> >    condition (!netdev || netdev == NET_PTR_POISON): TRUE
> >
> >>> Third step is doing kernel debugging like Dragos did in [1].
> >>>
> >>> What kernel version are you using?
> >>
> >> kernel 6.15.9
> >>
>
> Nice, that is a very recent kernel.
> The above shows us that we are indeed hitting the issue of a "hidden"
> page_pool instance (related to the page_pool commit Jakub/Kuba added).
>
>
> >>>
> >>> In kernel v6.8 we (Kuba) silenced some of the cases.  See commit
> >>> be0096676e23 ("net: page_pool: mute the periodic warning for visible
> >>> page pools").
> >>> Jakub/Kuba, can you remind us how to use the netlink tools that can
> >>> help us inspect the page_pools active on the system?
> >>>
> >>>
> >>>> xdp-filter load green0
> >>>>
> >>>
> >>> Most drivers change the memory model and reset the RX rings when
> >>> attaching XDP.  So it makes sense that the existing page_pool instances
> >>> (per RXq) are freed and new ones allocated, revealing any leaked or
> >>> unprocessed page_pool pages.
> >>>
> >>>
> >>>> Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> >>>> Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec
> >>>
> >>> It is very weird that a stall time of 200399 sec is reported. This
> >>> indicates that this has been happening since *before* the xdp-filter
> >>> was attached. The uptime "200871.855044" indicates the leak happened
> >>> 472 sec after booting this system.
> >>>
> >>
> >> Not sure if I pasted the previous log message correctly, but this time
> >> the log I pasted should be correct.
> >>
> >>> Have you seen these dmesg logs before attaching XDP?
> >>
> >> I didn't see such a log before attaching XDP.
> >>
>
> From the above we have established that this makes sense, as the
> mentioned commit would have "blocked it" from being printed.
>
> >>>
> >>> This will help us know if this page_pool became "invisible" according to
> >>> Kuba's change, if you run kernel >= v6.8.
> >>>
> >>>
> >>>> Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
> >>>> Aug 31 19:21:08 loongfire kernel: [200993.642391] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200520 sec
> >>>> Aug 31 19:22:08 loongfire kernel: [201054.058292] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200581 sec
> >>>>
> >>>
> >>> Cc'ed some people that might have access to this hardware, can any of
> >>> you reproduce?
> >>>
>
> Anyone with this hardware?
>
> >>
> >> [0]: https://github.com/vincentmli/loongfire
>
> --Jesper


end of thread, other threads:[~2025-09-05 13:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-01  2:47 [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute Vincent Li
2025-09-01  9:23 ` Jesper Dangaard Brouer
2025-09-01 17:56   ` Vincent Li
2025-09-01 23:27     ` Vincent Li
2025-09-05  7:59       ` Jesper Dangaard Brouer
2025-09-05 13:31         ` Vincent Li
