[BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)

public inbox for linux-wireless@vger.kernel.org

* [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
@ 2026-03-09 21:48 LB F
  2026-03-10  2:02 ` Ping-Ke Shih
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-09 21:48 UTC (permalink / raw)
  To: pkshih; +Cc: linux-wireless, linux-kernel

Hi Ping-Ke,

I am writing to formally report a critical bug that causes a hard
system freeze on laptops equipped with the RTL8821CE WiFi module, and
to propose solutions.

Description:
On an HP laptop equipped with a Realtek RTL8821CE 802.11ac PCIe
adapter (PCI ID: 10ec:c821), the system experiences a hard lockup
(complete freeze of the UI and kernel, sysrq doesn't work, requires
holding the power button) when the WiFi adapter enters the power
saving state.

This issue occurs consistently across multiple Linux distributions and
kernel versions (reproduced on upstream kernel 6.13 and 6.19-rc).

Steps to Reproduce:
1. Use a system with RTL8821CE (pci:10ec:c821).
2. Ensure NetworkManager is configured with wifi.powersave = 3 (or
power saving is enabled via TLP/iw).
3. Connect to a WiFi network and let the system idle.
4. The system will eventually freeze completely.

Workarounds that successfully prevent the freeze:
* Passing disable_lps_deep=y to rtw88_core.
* Passing disable_aspm=y to rtw88_pci (or pcie_aspm=off).
* Disabling WiFi power save via NetworkManager.
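
A minimal sketch of how either driver-level workaround above could be made persistent through a modprobe.d fragment. The helper name `rtw88_workaround_conf` and the file name in the usage comment are hypothetical; the module and parameter names are the ones listed above.

```bash
#!/bin/bash
# Print a modprobe.d line for one of the two driver-level workarounds.
# Usage: rtw88_workaround_conf aspm|lps_deep
rtw88_workaround_conf() {
    case "$1" in
        aspm)     echo "options rtw88_pci disable_aspm=y" ;;
        lps_deep) echo "options rtw88_core disable_lps_deep=y" ;;
        *)        echo "usage: rtw88_workaround_conf aspm|lps_deep" >&2
                  return 1 ;;
    esac
}

# Example (as root; the file name is an arbitrary choice):
#   rtw88_workaround_conf aspm > /etc/modprobe.d/rtw88-workaround.conf
```

The option only takes effect once the modules are reloaded or the system is rebooted.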

Technical Analysis:
The root cause appears to be an unhandled race condition or hardware
bug between the adapter's LPS (Leisure Power Save) Deep mode
(LPS_DEEP_MODE_LCLK) and the PCIe Active State Power Management (ASPM
L1) mechanism.

When the firmware drops into LPS_DEEP_MODE_LCLK concurrently with the
PCIe bus entering ASPM L1, the chip fails to handle PCIe Wake
signaling correctly. While there is an existing workaround in
rtw_pci_napi_poll (pci.c:1806) that sets `rtwpci->rx_no_aspm = true`
during NAPI poll for 8821CE, this polling wrapper is insufficient. The
deadlock often occurs during idle states when polling isn't actively
disabling ASPM, but the system suddenly needs to wake the radio.
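
To check whether ASPM L1 is actually enabled on the adapter's link, the `LnkCtl` line of `lspci -vv` can be inspected. The sketch below assumes the line has the form `LnkCtl: ASPM <state>; ...`, as printed by common pciutils versions; the helper name `aspm_state` is hypothetical.

```bash
#!/bin/bash
# Extract the ASPM state from `lspci -vv` output read on stdin.
# Prints e.g. "L1 Enabled" or "Disabled"; prints nothing if no
# matching LnkCtl line is found.
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p' | head -n 1
}

# Example (as root, so the PCIe capability list is readable):
#   lspci -vv -d 10ec:c821 | aspm_state
```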

Proposed Solutions:
Given that LPS_DEEP_MODE_LCLK seems fundamentally unreliable on 8821ce
PCIe variants when paired with standard Windows-era ASPM
implementations on laptops (HP, Lenovo, ASUS are all affected), the
most robust solution is to strip the unsupported deep sleep flag from
the hardware spec.

```diff
--- a/drivers/net/wireless/realtek/rtw88/rtw8821c.c
+++ b/drivers/net/wireless/realtek/rtw88/rtw8821c.c
@@ -1999,7 +1999,7 @@ struct rtw_chip_info rtw8821c_hw_spec = {
 	.bt_supported = true,
 	.fbtc_has_ext_ctrl = true,
 	.coex_info_hw_supported = true,
-	.lps_deep_mode_supported = BIT(LPS_DEEP_MODE_LCLK),
+	.lps_deep_mode_supported = 0, /* Disabled due to ASPM L1 hard locks */
 	.dpk_supported = true,
 	.pstdma_type = COEX_PSTDMA_FORCE_LPSOFF,
 	.bfee_support = false,
```

Alternatively, a PCI Subsystem-based quirk should be introduced in
rtw_pci_aspm_set() to refuse ASPM BIT_L1_SW_EN transitions for
affected hardware IDs, similar to how CLKREQ issues are handled for
8822C via efuse->rfe_option.

Cross-Reference Analysis of other RTL8821CE Bugs:
After aggregating recent open bug reports for the 8821ce chip on
Bugzilla (https://bugzilla.kernel.org), it is apparent that almost all
of them are victims of the exact same underlying race condition.
1. Bug 215131: System freeze preceded by 'pci bus timeout, check dma
status'. Workaround used: disable_aspm=1.
2. Bug 219830: Log shows 'firmware failed to leave lps state' and
'failed to send h2c command'. A direct smoking gun for LPS Deep mode
freezing.
3. Bug 218697 & Bug 217491: Endless 'timed out to flush queue' floods.
4. Bug 217781 & Bug 216685: Random dropouts and low wireless speed.

Given the volume and age of these unresolved reports, disabling
.lps_deep_mode_supported (or restricting ASPM L1) specifically for
10ec:c821 is desperately needed.

System Information:
- Hardware: HP Notebook (SKU: P3S95EA#ACB, Family: 103C_5335KV)
- CPU: Intel Core i3-5005U
- WiFi PCI ID: 10ec:c821, Subsystem: 103c:831a
- Kernel: 6.13 / 6.19
- Driver module: rtw88_8821ce

I am happy to test any patches provided or formally submit the patch
above if maintainers agree it is the right approach. Thank you!


* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-09 21:48 [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict) LB F
@ 2026-03-10  2:02 ` Ping-Ke Shih
  2026-03-10 11:01   ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-10  2:02 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Hi Ping-Ke,
> 
> I am writing to formally report a critical bug that causes a hard
> system freeze on laptops equipped with the RTL8821CE WiFi module, and
> to propose solutions.
> 
> Description:
> On an HP laptop equipped with a Realtek RTL8821CE 802.11ac PCIe
> adapter (PCI ID: 10ec:c821), the system experiences a hard lockup
> (complete freeze of the UI and kernel, sysrq doesn't work, requires
> holding the power button) when the WiFi adapter enters the power
> saving state.
> 
> This issue occurs consistently across multiple Linux distributions and
> kernel versions (reproduced on upstream kernel 6.13 and 6.19-rc).
> 
> Steps to Reproduce:
> 1. Use a system with RTL8821CE (pci:10ec:c821).
> 2. Ensure NetworkManager is configured with wifi.powersave = 3 (or
> power saving is enabled via TLP/iw).
> 3. Connect to a WiFi network and let the system idle.
> 4. The system will eventually freeze completely.

Can you dig through the kernel log (via netconsole or ramoops) for
anything useful? I'd like to know whether this is a hardware-level
freeze or whether the kernel can capture something going wrong.

> 
> Workarounds that successfully prevent the freeze:
> * Passing disable_lps_deep=y to rtw88_core.
> * Passing disable_aspm=y to rtw88_pci (or pcie_aspm=off).
> * Disabling WiFi power save via NetworkManager.

Are all of these needed to work around the problem, or is disable_aspm
enough?

I'd list them in order of power consumption impact (the topmost has the lowest impact):
  1. disable_aspm=y
  2. disable_lps_deep=y
  3. disable WiFi power save

If you can run experiments on your platform, it will be easier for us
to decide which workaround to adopt.

> 
> Technical Analysis:
> The root cause appears to be an unhandled race condition or hardware
> bug between the adapter's Low Power State (LPS) Deep mode
> (LPS_DEEP_MODE_LCLK) and the PCIe Active State Power Management (ASPM
> L1) mechanism.
> 
> When the firmware drops into LPS_DEEP_MODE_LCLK concurrently with the
> PCIe bus entering ASPM L1, the chip fails to handle PCIe Wake
> signaling correctly. While there is an existing workaround in
> rtw_pci_napi_poll (pci.c:1806) that sets `rtwpci->rx_no_aspm = true`
> during NAPI poll for 8821CE, this polling wrapper is insufficient. The
> deadlock often occurs during idle states when polling isn't actively
> disabling ASPM, but the system suddenly needs to wake the radio.

`rtwpci->rx_no_aspm = true` was another workaround added years ago for
certain platforms. I'd say ASPM has had many interoperability problems,
even back then.

But what does 'deadlock' mean? As far as I know, NAPI poll is scheduled
by the ISR and goes on to receive packets. The rx_no_aspm workaround
forcibly turns off ASPM during this period.

> 
> Proposed Solutions:
> Given that LPS_DEEP_MODE_LCLK seems fundamentally unreliable on 8821ce
> PCIe variants when paired with standard Windows-era ASPM
> implementations on laptops (HP, Lenovo, ASUS are all affected), the
> most robust solution is to strip the unsupported deep sleep flag from
> the hardware spec.
> 
> ```diff
> --- a/drivers/net/wireless/realtek/rtw88/rtw8821c.c
> +++ b/drivers/net/wireless/realtek/rtw88/rtw8821c.c
> @@ -1999,7 +1999,7 @@ struct rtw_chip_info rtw8821c_hw_spec = {
> .bt_supported = true,
> .fbtc_has_ext_ctrl = true,
> .coex_info_hw_supported = true,
> - .lps_deep_mode_supported = BIT(LPS_DEEP_MODE_LCLK),
> + .lps_deep_mode_supported = 0, /* Disabled due to ASPM L1 hard locks */
> .dpk_supported = true,
> .pstdma_type = COEX_PSTDMA_FORCE_LPSOFF,
> .bfee_support = false,
> ```
> 
> Alternatively, a PCI Subsystem-based quirk should be introduced in
> rtw_pci_aspm_set() to refuse ASPM BIT_L1_SW_EN transitions for
> affected hardware IDs, similar to how CLKREQ issues are handled for
> 8822C via efuse->rfe_option.

I'd add a quirk for your platform, so that other platforms can still
have better power consumption.

> 
> Cross-Reference Analysis of other RTL8821CE Bugs:
> After aggregating recent open bug reports for the 8821ce chip on
> Bugzilla (https://bugzilla.kernel.org), it is apparent that almost all
> of them are victims of the exact same underlying race condition.
> 1. Bug 215131: System freeze preceded by 'pci bus timeout, check dma
> status'. Workaround used: disable_aspm=1.
> 2. Bug 219830: Log shows 'firmware failed to leave lps state' and
> 'failed to send h2c command'. A direct smoking gun for LPS Deep mode
> freezing.
> 3. Bug 218697 & Bug 217491: Endless 'timed out to flush queue' floods.
> 4. Bug 217781 & Bug 216685: Random dropouts and low wireless speed.
> 
> Given the volume and age of these unresolved reports, disabling
> .lps_deep_mode_supported (or restricting ASPM L1) specifically for
> 10ec:c821 is desperately needed.
> 
> System Information:
> - Hardware: HP Notebook (SKU: P3S95EA#ACB, Family: 103C_5335KV)
> - CPU: Intel Core i3-5005U
> - WiFi PCI ID: 10ec:c821, Subsystem: 103c:831a
> - Kernel: 6.13 / 6.19
> - Driver module: rtw88_8821ce
> 
> I am happy to test any patches provided or formally submit the patch
> above if maintainers agree it is the right approach. Thank you!

We have not modified RTL8821CE for a long time, so I'd add a workaround
for the specific platform, as mentioned above.

Ping-Ke



* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-10  2:02 ` Ping-Ke Shih
@ 2026-03-10 11:01   ` LB F
  2026-03-10 15:12     ` LB F
  2026-03-11  2:15     ` Ping-Ke Shih
  0 siblings, 2 replies; 34+ messages in thread
From: LB F @ 2026-03-10 11:01 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Ping-Ke,

Thank you for the incredibly fast response and assistance!

> Can you dig kernel log (by netconsole or ramoops) if something useful?
> I'd like to know this is hardware level freeze or kernel can capture something wrong.

I managed to pull a call trace from a historic journald log just
before the system hung. The kernel gets trapped in an IRQ thread
inside `rtw_pci_interrupt_threadfn`, calling up into `mac80211`
`ieee80211_rx_list` before everything freezes. Here is the relevant
snippet:

```text
Call Trace:
<IRQ>
? __alloc_skb+0x23a/0x2a0
? __alloc_skb+0x10c/0x2a0
? __pfx_irq_thread_fn+0x10/0x10
[ ... truncated module list ... ]
Tainted: G W I 6.19.6-2-cachyos #1 PREEMPT(full)
Hardware name: HP HP Notebook/81F0, BIOS F.50 11/20/2020
RIP: 0010:ieee80211_rx_list+0x1012/0x1020 [mac80211]
CPU: 2 UID: 0 PID: 765 Comm: irq/56-rtw88_pc
rtw_pci_interrupt_threadfn+0x239/0x310 [rtw88_pci]
```

It behaves exactly like a PCIe bus deadlock or a hardware fault that
eventually brings down the CPU handling the IRQ.

> Are these totally needed to workaround the problem? Or disable_aspm is enough?
> I'd list them in order of power consumption impact:
> 1. disable_aspm=y
> 2. disable_lps_deep=y
> 3. disable WiFi power save

To verify which parameters are strictly necessary, I performed
isolated testing today. I ensured no other modprobe configs were
active, rebuilt the initramfs, and manually enforced that
`wifi.powersave` was active via `iw dev wlan0 set power_save on`
during all tests (as the OS power management profiles were defaulting
it to off, which initially masked the issue).

I tested each workaround individually across multiple sleep/wake
cycles and active usage:

**Test 1 (ASPM Disabled, LPS Deep Enabled):**
- Module parameters: `rtw88_pci disable_aspm=y` (and `rtw88_core
disable_lps_deep=n`)
- Result: Stable. No freezes were observed during usage or transitions
into/out of S3 sleep while power saving was enforced.

**Test 2 (ASPM Enabled, LPS Deep Disabled):**
- Module parameters: `rtw88_core disable_lps_deep=y` (and `rtw88_pci
disable_aspm=n`)
- Result: Stable. No freezes were observed under the same forced power
save conditions.

**Conclusion:** It appears we do not need both workarounds
simultaneously for this specific hardware. Using only `disable_aspm=y`
seems to be sufficient to prevent the system freeze. Given your note
about the power consumption impact ranking, this looks like the
optimal path forward.
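
To confirm which configuration is actually in effect for a given test, the loaded module parameters can be read back from sysfs. This is a sketch that assumes the driver exposes both parameters as readable files under `/sys/module`; the helper name `rtw88_param` and the `SYSFS_ROOT` override are hypothetical.

```bash
#!/bin/bash
# Print the current value of an rtw88 module parameter.
# SYSFS_ROOT is a hypothetical override so the helper can be exercised
# without the driver loaded; it defaults to the real /sys.
rtw88_param() {
    local module=$1 param=$2
    cat "${SYSFS_ROOT:-/sys}/module/${module}/parameters/${param}"
}

# Example, run before each test (boolean parameters read back as Y or N):
#   rtw88_param rtw88_pci  disable_aspm
#   rtw88_param rtw88_core disable_lps_deep
```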

> But what does 'deadlock' mean? As I know NAPI poll is scheduled by ISR,
> and going to receive packets. The rx_no_aspm workaround is to forcely turn
> off ASPM during this period.

By "deadlock" I meant a hardware-level bus lockup. It seems the
physical RTL8821CE chip itself crashes or hangs the system's PCIe bus
when trying to negotiate waking up from ASPM L1 while simultaneously
being in `LPS_DEEP_MODE_LCLK`. The `rx_no_aspm` workaround in NAPI
helps during active Rx decoding, but the laptop often freezes while
completely idle. Presumably the AP sends a basic beacon, the chip
attempts to leave LPS Deep + L1, and the hardware simply gives up and
halts the system.

> We have not modified RTL8821CE for a long time, so I'd add workaround
> to specific platform as mentioned above.

Adding a DMI/platform quirk specifically for this laptop to disable
ASPM would be wonderful and deeply appreciated. I agree it is safer
than touching the global flags for hardware that is functioning
correctly out in the wild.

Here is the exact identifying information for my system:

System Vendor: HP
Product Name: HP Notebook
SKU Number: P3S95EA#ACB
Family: 103C_5335KV
PCI ID: 10ec:c821
Subsystem ID: 103c:831a

I am completely ready to test any patch or quirk you send my way.
Thank you so much for your time and helping track this down!

Best regards,
Oleksandr


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-10 11:01   ` LB F
@ 2026-03-10 15:12     ` LB F
  2026-03-11  2:20       ` Ping-Ke Shih
  2026-03-11  2:15     ` Ping-Ke Shih
  1 sibling, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-10 15:12 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Ping-Ke,

Thank you for your guidance. To provide you with the cleanest possible
diagnostic data, we devised a strict testing environment:

1. **Live USB Environment:** We booted a completely fresh Live USB of
CachyOS (Kernel 6.19.6) to eliminate any potential interference from
installed software, TLP profiles, or custom NetworkManager
configurations.
2. **Aggressive Local Logging:** Because the system freeze physically
locks the PCIe bus and disables the Wi-Fi adapter instantly, using
`netconsole` was impossible (the network drops microseconds before the
freeze).

To overcome this, we wrote an "aggressive logger" script that pipes
`dmesg -w` directly to an independent FAT32 USB drive while issuing a
`sync` command twice a second. This bypassed RAM caching and
physically burned the logs to the drive right up to the moment of the
hard freeze. The script we used was:

```bash
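#!/bin/bash
# Stream kernel messages to a file on a separate USB drive.
LOG_FILE="/run/media/liveuser/LOGS/kernel_freeze.log"
dmesg -w > "$LOG_FILE" &
# Flush dirty pages to disk twice a second so the log survives a hard freeze.
while true; do
    sync
    sleep 0.5
done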
```

3. No workarounds (`disable_aspm=n`, `disable_lps_deep=n`) were active
in this test. We manually enabled power saving (`iw dev wlan0 set
power_save on`) and triggered the freeze via typical web browsing.
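
Since power management profiles can silently leave power save off and mask the issue, it helps to verify the setting before each run. The sketch below assumes `iw dev wlan0 get power_save` prints a line of the form `Power save: on`; the helper name `power_save_is_on` is hypothetical.

```bash
#!/bin/bash
# Succeed if the `iw ... get power_save` output on stdin reports "on".
power_save_is_on() {
    grep -q '^Power save: on$'
}

# Example:
#   iw dev wlan0 set power_save on
#   iw dev wlan0 get power_save | power_save_is_on \
#       || echo "power save is still off" >&2
```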

Here are the precise, unadulterated logs showing the adapter
successfully connecting to the network, sitting idle for about 10
seconds (presumably entering power-saving states), and then suffering
a fatal firmware lockup right before the PCIe bus froze:

```
[  304.709201] audit: type=1111 ... op=connection-add-activate ...
name="Andrey_5G" ...
[  305.617785] wlan0: authenticate with 6c:68:a4:1c:97:5b ...
[  305.660333] wlan0: authenticated
[  305.661661] wlan0: associate with 6c:68:a4:1c:97:5b (try 1/3)
[  305.663404] wlan0: associated
[  305.719997] wlan0: Limiting TX power to 30 (30 - 0) dBm as
advertised by 6c:68:a4:1c:97:5b
... (~10 seconds of idle network time) ...
[  316.907114] rtw88_8821ce 0000:13:00.0: failed to send h2c command
[  316.911190] rtw88_8821ce 0000:13:00.0: failed to send h2c command
[  316.921504] rtw88_8821ce 0000:13:00.0: coex request time out
...
[  349.630952] rtw88_8821ce 0000:13:00.0: failed to send h2c command
[  349.635023] rtw88_8821ce 0000:13:00.0: failed to send h2c command
[  357.811235] rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state
[  359.797238] rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state
... (repeats indefinitely until hard reset) ...
```

As the logs clearly demonstrate, the adapter authenticates perfectly
but the firmware explicitly fails to leave the LPS state after a brief
idle period, dropping all H2C commands immediately before the
system-wide hard freeze begins.
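
For comparing runs, the three failure messages in the excerpt above can be counted in a captured log. This is only a sketch; the helper name `count_rtw88_failures` is hypothetical and the patterns are copied from the log lines shown here.

```bash
#!/bin/bash
# Count kernel log lines (read from stdin) that carry one of the three
# rtw88 failure messages seen in the log excerpt.
count_rtw88_failures() {
    grep -c -E 'failed to send h2c command|coex request time out|firmware failed to leave lps state'
}

# Example:
#   count_rtw88_failures < /run/media/liveuser/LOGS/kernel_freeze.log
```

A count of zero in a run with a workaround active, against a non-zero count without it, would support the link between these messages and the freeze.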

We will upload the full, unabridged `.log` file to our Bugzilla thread
(Bug 221195) momentarily, but we wanted to provide you with this exact
'smoking gun' trace right away to help identify the root cause.

Please let us know if this information is helpful or if there are any
specific module patches or further tests you would like us to perform
to assist with debugging.

Best regards,
Oleksandr

On Tue, 10 Mar 2026 at 13:01, LB F <goainwo@gmail.com> wrote:
>
> Hi Ping-Ke,
>
> Thank you for the incredibly fast response and assistance!
>
> > Can you dig kernel log (by netconsole or ramoops) if something useful?
> > I'd like to know this is hardware level freeze or kernel can capture something wrong.
>
> I managed to pull a call trace from a historic journald log just
> before the system hung. The kernel gets trapped in an IRQ thread
> inside `rtw_pci_interrupt_threadfn`, calling up into `mac80211`
> `ieee80211_rx_list` before everything freezes. Here is the relevant
> snippet:
>
> ```text
> Call Trace:
> <IRQ>
> ? __alloc_skb+0x23a/0x2a0
> ? __alloc_skb+0x10c/0x2a0
> ? __pfx_irq_thread_fn+0x10/0x10
> [ ... truncated module list ... ]
> Tainted: G W I 6.19.6-2-cachyos #1 PREEMPT(full)
> Hardware name: HP HP Notebook/81F0, BIOS F.50 11/20/2020
> RIP: 0010:ieee80211_rx_list+0x1012/0x1020 [mac80211]
> CPU: 2 UID: 0 PID: 765 Comm: irq/56-rtw88_pc
> rtw_pci_interrupt_threadfn+0x239/0x310 [rtw88_pci]
> ```
>
> It behaves exactly like a PCIe bus deadlock or a hardware fault that
> eventually brings down the CPU handling the IRQ.
>
> > Are these totally needed to workaround the problem? Or disable_aspm is enough?
> > I'd list them in order of power consumption impact:
> > 1. disable_aspm=y
> > 2. disable_lps_deep=y
> > 3. disable WiFi power save
>
> To verify which parameters are strictly necessary, I performed
> isolated testing today. I ensured no other modprobe configs were
> active, rebuilt the initramfs, and manually enforced that
> `wifi.powersave` was active via `iw dev wlan0 set power_save on`
> during all tests (as the OS power management profiles were defaulting
> it to off, which initially masked the issue).
>
> I tested each workaround individually across multiple sleep/wake
> cycles and active usage:
>
> **Test 1 (ASPM Disabled, LPS Deep Enabled):**
> - Kernel parameters: `rtw88_pci disable_aspm=y` (and `rtw88_core
> disable_lps_deep=n`)
> - Result: Stable. No freezes were observed during usage or transitions
> into/out of S3 sleep while power saving was enforced.
>
> **Test 2 (ASPM Enabled, LPS Deep Disabled):**
> - Kernel parameters: `rtw88_core disable_lps_deep=y` (and `rtw88_pci
> disable_aspm=n`)
> - Result: Stable. No freezes were observed under the same forced power
> save conditions.
>
> **Conclusion:** It appears we do not need both workarounds
> simultaneously for this specific hardware. Using only `disable_aspm=y`
> seems to be sufficient to prevent the system freeze. Given your note
> about the power consumption impact ranking, this looks like the
> optimal path forward.
>
> > But what does 'deadlock' mean? As I know NAPI poll is scheduled by ISR,
> > and going to receive packets. The rx_no_aspm workaround is to forcely turn
> > off ASPM during this period.
>
> By "deadlock" I meant a hardware-level bus lockup. It seems the
> physical RTL8821CE chip itself crashes or hangs the system's PCIe bus
> when trying to negotiate waking up from ASPM L1 while simultaneously
> existing in `LPS_DEEP_MODE_LCLK`. The `rx_no_aspm` workaround in NAPI
> helps during active Rx decoding, but the laptop often freezes while
> completely idle, presumably when the AP sends a basic beacon, the chip
> attempts to leave LPS Deep + L1, and the hardware simply gives up and
> halts the system.
>
> > We have not modified RTL8821CE for a long time, so I'd add workaround
> > to specific platform as mentioned above.
>
> Adding a DMI/platform quirk specifically for this laptop to disable
> ASPM would be wonderful and deeply appreciated. I agree it is safer
> than touching the global flags for hardware that is functioning
> correctly out in the wild.
>
> Here is the exact identifying information for my system:
>
> System Vendor: HP
> Product Name: HP Notebook
> SKU Number: P3S95EA#ACB
> Family: 103C_5335KV
> PCI ID: 10ec:c821
> Subsystem ID: 103c:831a
>
> I am completely ready to test any patch or quirk you send my way.
> Thank you so much for your time and helping track this down!
>
> Best regards,
> Oleksandr


* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-10 11:01   ` LB F
  2026-03-10 15:12     ` LB F
@ 2026-03-11  2:15     ` Ping-Ke Shih
  2026-03-11  2:22       ` Ping-Ke Shih
  1 sibling, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-11  2:15 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> 
> Hi Ping-Ke,
> 
> Thank you for the incredibly fast response and assistance!
> 
> > Can you dig kernel log (by netconsole or ramoops) if something useful?
> > I'd like to know this is hardware level freeze or kernel can capture something
> wrong.
> 
> I managed to pull a call trace from a historic journald log just
> before the system hung. The kernel gets trapped in an IRQ thread
> inside `rtw_pci_interrupt_threadfn`, calling up into `mac80211`
> `ieee80211_rx_list` before everything freezes. Here is the relevant
> snippet:
> 
> ```text
> Call Trace:
> <IRQ>
> ? __alloc_skb+0x23a/0x2a0
> ? __alloc_skb+0x10c/0x2a0
> ? __pfx_irq_thread_fn+0x10/0x10
> [ ... truncated module list ... ]
> Tainted: G W I 6.19.6-2-cachyos #1 PREEMPT(full)
> Hardware name: HP HP Notebook/81F0, BIOS F.50 11/20/2020
> RIP: 0010:ieee80211_rx_list+0x1012/0x1020 [mac80211]
> CPU: 2 UID: 0 PID: 765 Comm: irq/56-rtw88_pc
> rtw_pci_interrupt_threadfn+0x239/0x310 [rtw88_pci]
> ```
> 
> It behaves exactly like a PCIe bus deadlock or a hardware fault that
> eventually brings down the CPU handling the IRQ.

I wonder whether malformed data is causing this trace, which then leads
to the kernel freeze. If we validate RX data before calling
ieee80211_rx_list(), maybe the trace disappears and everything will be
fine, with no workaround needed at all?

> 
> > Are these totally needed to workaround the problem? Or disable_aspm is enough?
> > I'd list them in order of power consumption impact:
> > 1. disable_aspm=y
> > 2. disable_lps_deep=y
> > 3. disable WiFi power save
> 
> To verify which parameters are strictly necessary, I performed
> isolated testing today. I ensured no other modprobe configs were
> active, rebuilt the initramfs, and manually enforced that
> `wifi.powersave` was active via `iw dev wlan0 set power_save on`
> during all tests (as the OS power management profiles were defaulting
> it to off, which initially masked the issue).
> 
> I tested each workaround individually across multiple sleep/wake
> cycles and active usage:
> 
> **Test 1 (ASPM Disabled, LPS Deep Enabled):**
> - Kernel parameters: `rtw88_pci disable_aspm=y` (and `rtw88_core
> disable_lps_deep=n`)
> - Result: Stable. No freezes were observed during usage or transitions
> into/out of S3 sleep while power saving was enforced.
> 
> **Test 2 (ASPM Enabled, LPS Deep Disabled):**
> - Kernel parameters: `rtw88_core disable_lps_deep=y` (and `rtw88_pci
> disable_aspm=n`)
> - Result: Stable. No freezes were observed under the same forced power
> save conditions.
> 
> **Conclusion:** It appears we do not need both workarounds
> simultaneously for this specific hardware. Using only `disable_aspm=y`
> seems to be sufficient to prevent the system freeze. Given your note
> about the power consumption impact ranking, this looks like the
> optimal path forward.

Let's test my RFT patch to disable ASPM then.

> 
> > But what does 'deadlock' mean? As I know NAPI poll is scheduled by ISR,
> > and going to receive packets. The rx_no_aspm workaround is to forcely turn
> > off ASPM during this period.
> 
> By "deadlock" I meant a hardware-level bus lockup. It seems the
> physical RTL8821CE chip itself crashes or hangs the system's PCIe bus
> when trying to negotiate waking up from ASPM L1 while simultaneously
> existing in `LPS_DEEP_MODE_LCLK`. The `rx_no_aspm` workaround in NAPI
> helps during active Rx decoding, but the laptop often freezes while
> completely idle, presumably when the AP sends a basic beacon, the chip
> attempts to leave LPS Deep + L1, and the hardware simply gives up and
> halts the system.

I think this is your own interpretation and inference, right? Did you
measure real hardware signals?

My point is that if this is a hardware-level bus lockup, let's apply a
quirk. If malformed data is causing the kernel to hang, I'd add a sanity
check on RX data, but for now I don't actually know what we should check.

> 
> > We have not modified RTL8821CE for a long time, so I'd add workaround
> > to specific platform as mentioned above.
> 
> Adding a DMI/platform quirk specifically for this laptop to disable
> ASPM would be wonderful and deeply appreciated. I agree it is safer
> than touching the global flags for hardware that is functioning
> correctly out in the wild.
> 
> Here is the exact identifying information for my system:
> 
> System Vendor: HP
> Product Name: HP Notebook
> SKU Number: P3S95EA#ACB
> Family: 103C_5335KV
> PCI ID: 10ec:c821
> Subsystem ID: 103c:831a
> 
> I am completely ready to test any patch or quirk you send my way.
> Thank you so much for your time and helping track this down!

I sent an RFT [1] for testing. Please check whether it works on your HP
notebook. If you check the rtw88 log, you can see that I added a similar
patch 5 years ago; it was replaced by the preferred "rtwpci->rx_no_aspm"
change, which I think resolves the problem on only some notebooks, though.

[1] https://lore.kernel.org/linux-wireless/20260311020816.7065-1-pkshih@realtek.com/T/#u

Ping-Ke




* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-10 15:12     ` LB F
@ 2026-03-11  2:20       ` Ping-Ke Shih
  0 siblings, 0 replies; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-11  2:20 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> 
> Hi Ping-Ke,
> 
> Thank you for your guidance. To provide you with the cleanest possible
> diagnostic data, we devised a strict testing environment:
> 
> 1. **Live USB Environment:** We booted a completely fresh Live USB of
> CachyOS (Kernel 6.19.6) to eliminate any potential interference from
> installed software, TLP profiles, or custom NetworkManager
> configurations.
> 2. **Aggressive Local Logging:** Because the system freeze physically
> locks the PCIe bus and disables the Wi-Fi adapter instantly, using
> `netconsole` was impossible (the network drops microseconds before the
> freeze).
> 
> To overcome this, we wrote an "aggressive logger" script that pipes
> `dmesg -w` directly to an independent FAT32 USB drive while issuing a
> `sync` command twice a second. This bypassed RAM caching and
> physically burned the logs to the drive right up to the moment of the
> hard freeze. The script we used was:
> 
> ```bash
> #!/bin/bash
> LOG_FILE="/run/media/liveuser/LOGS/kernel_freeze.log"
> dmesg -w > "$LOG_FILE" &
> while true; do
>     sync
>     sleep 0.5
> done
> ```
> 
> 3. No workarounds (`disable_aspm=n`, `disable_lps_deep=n`) were active
> in this test. We manually enabled power saving (`iw dev wlan0 set
> power_save on`) and triggered the freeze via typical web browsing.
> 
> Here are the precise, unadulterated logs showing the adapter
> successfully connecting to the network, sitting idle for about 10
> seconds (presumably entering power-saving states), and then suffering
> a fatal firmware lockup right before the PCIe bus froze:
> 
> ```
> [  304.709201] audit: type=1111 ... op=connection-add-activate ...
> name="Andrey_5G" ...
> [  305.617785] wlan0: authenticate with 6c:68:a4:1c:97:5b ...
> [  305.660333] wlan0: authenticated
> [  305.661661] wlan0: associate with 6c:68:a4:1c:97:5b (try 1/3)
> [  305.663404] wlan0: associated
> [  305.719997] wlan0: Limiting TX power to 30 (30 - 0) dBm as
> advertised by 6c:68:a4:1c:97:5b
> ... (~10 seconds of idle network time) ...
> [  316.907114] rtw88_8821ce 0000:13:00.0: failed to send h2c command
> [  316.911190] rtw88_8821ce 0000:13:00.0: failed to send h2c command
> [  316.921504] rtw88_8821ce 0000:13:00.0: coex request time out
> ...
> [  349.630952] rtw88_8821ce 0000:13:00.0: failed to send h2c command
> [  349.635023] rtw88_8821ce 0000:13:00.0: failed to send h2c command
> [  357.811235] rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state
> [  359.797238] rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state
> ... (repeats indefinitely until hard reset) ...
> ```

Just to clarify: these logs appear only in test 3, right?
They do not appear in tests 1 and 2?

> 
> As the logs clearly demonstrate, the adapter authenticates perfectly
> but the firmware explicitly fails to leave the LPS state after a brief
> idle period, dropping all H2C commands immediately before the
> system-wide hard freeze begins.
> 
> We will upload the full, unabridged `.log` file to our Bugzilla thread
> (Bug 221195) momentarily, but we wanted to provide you with this exact
> 'smoking gun' trace right away to help identify the root cause.
> 
> Please let us know if this information is helpful or if there are any
> specific module patches or further tests you would like us to perform
> to assist with debugging.

Thanks for your detailed tests and logs. To dig into the cause of this
kind of hardware problem, we would need the real hardware and a hardware
scope to measure signals. I'd rather apply a quirk or some validation on
the RX path; that would be a better way.

Ping-Ke



* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-11  2:15     ` Ping-Ke Shih
@ 2026-03-11  2:22       ` Ping-Ke Shih
  2026-03-11 11:00         ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-11  2:22 UTC (permalink / raw)
  To: Ping-Ke Shih, LB F
  Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> 
> LB F <goainwo@gmail.com> wrote:
> >
> > Hi Ping-Ke,
> >
> > Thank you for the incredibly fast response and assistance!
> >
> > > Can you dig kernel log (by netconsole or ramoops) if something useful?
> > > I'd like to know this is hardware level freeze or kernel can capture
> > > something wrong.
> >
> > I managed to pull a call trace from a historic journald log just
> > before the system hung. The kernel gets trapped in an IRQ thread
> > inside `rtw_pci_interrupt_threadfn`, calling up into `mac80211`
> > `ieee80211_rx_list` before everything freezes. Here is the relevant
> > snippet:
> >
> > ```text
> > Call Trace:
> > <IRQ>
> > ? __alloc_skb+0x23a/0x2a0
> > ? __alloc_skb+0x10c/0x2a0
> > ? __pfx_irq_thread_fn+0x10/0x10
> > [ ... truncated module list ... ]
> > Tainted: G W I 6.19.6-2-cachyos #1 PREEMPT(full)
> > Hardware name: HP HP Notebook/81F0, BIOS F.50 11/20/2020
> > RIP: 0010:ieee80211_rx_list+0x1012/0x1020 [mac80211]
> > CPU: 2 UID: 0 PID: 765 Comm: irq/56-rtw88_pc
> > rtw_pci_interrupt_threadfn+0x239/0x310 [rtw88_pci]
> > ```
> >
> > It behaves exactly like a PCIe bus deadlock or a hardware fault that
> > eventually brings down the CPU handling the IRQ.
> 
> I wonder if there is malformed data causing this trace, which in turn leads
> to the kernel freeze. If we can validate RX data before calling
> ieee80211_rx_list(), maybe the trace disappears and everything will be fine,
> without even needing a workaround.
> 
> >
> > > Are these totally needed to workaround the problem? Or disable_aspm is enough?
> > > I'd list them in order of power consumption impact:
> > > 1. disable_aspm=y
> > > 2. disable_lps_deep=y
> > > 3. disable WiFi power save
> >
> > To verify which parameters are strictly necessary, I performed
> > isolated testing today. I ensured no other modprobe configs were
> > active, rebuilt the initramfs, and manually enforced that
> > `wifi.powersave` was active via `iw dev wlan0 set power_save on`
> > during all tests (as the OS power management profiles were defaulting
> > it to off, which initially masked the issue).
> >
> > I tested each workaround individually across multiple sleep/wake
> > cycles and active usage:
> >
> > **Test 1 (ASPM Disabled, LPS Deep Enabled):**
> > - Kernel parameters: `rtw88_pci disable_aspm=y` (and `rtw88_core
> > disable_lps_deep=n`)
> > - Result: Stable. No freezes were observed during usage or transitions
> > into/out of S3 sleep while power saving was enforced.
> >
> > **Test 2 (ASPM Enabled, LPS Deep Disabled):**
> > - Kernel parameters: `rtw88_core disable_lps_deep=y` (and `rtw88_pci
> > disable_aspm=n`)
> > - Result: Stable. No freezes were observed under the same forced power
> > save conditions.
> >
> > **Conclusion:** It appears we do not need both workarounds
> > simultaneously for this specific hardware. Using only `disable_aspm=y`
> > seems to be sufficient to prevent the system freeze. Given your note
> > about the power consumption impact ranking, this looks like the
> > optimal path forward.
> 
> Let's test my RFT patch to disable ASPM then.
> 
> >
> > > But what does 'deadlock' mean? As I know NAPI poll is scheduled by ISR,
> > > and going to receive packets. The rx_no_aspm workaround is to forcely turn
> > > off ASPM during this period.
> >
> > By "deadlock" I meant a hardware-level bus lockup. It seems the
> > physical RTL8821CE chip itself crashes or hangs the system's PCIe bus
> > when trying to negotiate waking up from ASPM L1 while simultaneously
> > existing in `LPS_DEEP_MODE_LCLK`. The `rx_no_aspm` workaround in NAPI
> > helps during active Rx decoding, but the laptop often freezes while
> > completely idle, presumably when the AP sends a basic beacon, the chip
> > attempts to leave LPS Deep + L1, and the hardware simply gives up and
> > halts the system.
> 
> I think this is your perspective and induction, right? Did you measure
> real hardware signals?
> 
> My point is that if this is a hardware-level bus lockup, let's apply a
> quirk. If malformed data is causing the kernel hang, I'd add a sanity check
> on RX data, but for now I don't actually know what we should check.
> 
> >
> > > We have not modified RTL8821CE for a long time, so I'd add workaround
> > > to specific platform as mentioned above.
> >
> > Adding a DMI/platform quirk specifically for this laptop to disable
> > ASPM would be wonderful and deeply appreciated. I agree it is safer
> > than touching the global flags for hardware that is functioning
> > correctly out in the wild.
> >
> > Here is the exact identifying information for my system:
> >
> > System Vendor: HP
> > Product Name: HP Notebook
> > SKU Number: P3S95EA#ACB
> > Family: 103C_5335KV
> > PCI ID: 10ec:c821
> > Subsystem ID: 103c:831a
> >
> > I am completely ready to test any patch or quirk you send my way.
> > Thank you so much for your time and helping track this down!
> 
> I sent an RFT [1] for testing. Please check if it works on your HP notebook.
> If you check the rtw88 log, you can see that I added a similar patch 5 years
> ago, which was later replaced by the preferred "rtwpci->rx_no_aspm" change;
> I think that change can only resolve the problem on some notebooks, though.
> 
> [1]
> https://lore.kernel.org/linux-wireless/20260311020816.7065-1-pkshih@realtek.com/T/#u

I forgot to ask: could you share your full name so that I can credit you
as the reporter in the commit message?



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-11  2:22       ` Ping-Ke Shih
@ 2026-03-11 11:00         ` LB F
  2026-03-11 15:22           ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-11 11:00 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Ping-Ke,

Thank you for the incredibly fast turnaround and for providing the RFT
patch with the DMI quirk!

First, I want to mention that I am not an IT professional or a
programmer. I am just a regular Linux user who really wants to help
solve this problem. I am trying my best to verify everything
carefully, so please forgive me if my terminology or induction was
slightly off.

To answer your clarifying questions from the previous emails:

> Just to clarify: these logs only appear in test 3, right?
> There are no such logs in tests 1/2.

Yes, exactly. The `failed to send h2c command` errors only caused a
complete system freeze when no workarounds were active and the adapter
attempted to sleep (Test 3).

> I think this is your perspective and induction, right? Did you measure
> real hardware signals?

You are entirely correct. This is just my induction based solely on
the timing of the logs and system behavior. I do not have access to an
oscilloscope or any hardware diagnostic tools. Given this, I
completely agree that your approach of applying a platform-specific
quirk is the safest and best solution.
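
For readers unfamiliar with the idea, a platform quirk of this kind boils
down to a table lookup keyed on system identifiers. The sketch below is a
user-space model only, not the RFT patch: `struct platform_id`,
`quirk_table` and `platform_needs_no_aspm()` are hypothetical names, the
exact-string comparison is an assumption, and the thread does not show which
fields the real patch matches. The identifier strings are the ones quoted
earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the identifiers a platform quirk could match on. */
struct platform_id {
	const char *sys_vendor;
	const char *product_name;
	const char *product_sku;
};

/* Hypothetical quirk entry: one platform plus the action to take on it. */
struct aspm_quirk {
	struct platform_id id;
	bool disable_aspm;
};

/* Values taken from the system information reported in this thread. */
static const struct aspm_quirk quirk_table[] = {
	{ { "HP", "HP Notebook", "P3S95EA#ACB" }, true },
};

/* Returns true when the running platform has an entry asking for ASPM off.
 * Assumption: all three fields must match exactly. */
bool platform_needs_no_aspm(const struct platform_id *sys)
{
	size_t i;

	assert(sys && sys->sys_vendor && sys->product_name && sys->product_sku);

	for (i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
		const struct platform_id *q = &quirk_table[i].id;

		if (strcmp(q->sys_vendor, sys->sys_vendor) == 0 &&
		    strcmp(q->product_name, sys->product_name) == 0 &&
		    strcmp(q->product_sku, sys->product_sku) == 0)
			return quirk_table[i].disable_aspm;
	}
	return false;
}
```

Any platform that is not listed keeps the default behaviour, which is what
makes this safer than changing a global flag.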

> I forgot to ask: could you share your full name so that I can credit you
> as the reporter in the commit message?

My full name is Oleksandr Havrylov. I would be honored to be included
as the reporter in the commit message.

### Recent Baseline Testing Before Your Patch

Before applying your patch today, we ran a few more controlled tests
to double-check our baseline. We verified that our local workaround
(`modprobe.d disable_aspm=y`) **does indeed keep the system completely
stable** and prevents the hard freeze, even when NetworkManager's
`wifi.powersave` is set to ON (default).

However, we noticed one interesting detail in the kernel logs: while
the system no longer freezes with `disable_aspm=y`, `dmesg` still
constantly logs `firmware failed to leave lps state` and `failed to
send h2c command` when the laptop is completely idle. It seems the
firmware still crashes during LPS, but because ASPM is disabled, the
PCIe bus ignores the crash and the system survives perfectly fine. I
just wanted to mention this for completeness!

### Testing Plan

I have **not** applied your RFT patch just yet. I wanted to make sure
our testing baseline was 100% clean and documented first.

I will compile your patch and perform rigorous testing this evening (I
am in the EET timezone, Ukraine). I will test it with the native
`power_save` fully enabled to ensure your patch successfully prevents
the hard lockups as intended.

I will stay in touch and reply back to this thread with a formal
`Tested-by` confirmation (and any logs if needed) as soon as my
testing is complete. Thank you again for all your help!

Best regards,
Oleksandr Havrylov


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-11 11:00         ` LB F
@ 2026-03-11 15:22           ` LB F
  2026-03-12  1:56             ` Ping-Ke Shih
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-11 15:22 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Ping-Ke,

I successfully applied your patch out-of-tree and performed rigorous
testing on the host machine.

I can officially confirm that the patch works flawlessly. The DMI
quirk triggered correctly and successfully prevented the
hardware-level PCIe bus lockups on my HP P3S95EA#ACB.

Testing Environment & Methodology:
- Kernel: CachyOS Linux 6.19.6-2-cachyos x86_64
- Toolchain: Clang/LLVM 21.1.8 (`make CC=clang LLVM=1 modules`)
- Extraction: We fetched the strict
`drivers/net/wireless/realtek/rtw88` sub-tree out of the
torvalds/linux `v6.19` tree utilizing `git sparse-checkout` to cleanly
apply the patch without having to compile the entire 2.5GB+ kernel.
- The resulting `.ko` object files were compressed to `.zst` and
installed successfully over the generic CachyOS system driver objects.

Verification Conditions:
- Removed ALL local workarounds. `disable_aspm=Y` is no longer forced
via `/etc/modprobe.d/` overrides.
- Power saving remains natively ON `wifi.powersave = 3` (managed by
NetworkManager).
- Left the laptop in multiple 5-10 minute complete idle states to
enforce sleep modes.

Post-Boot Log Analysis & Potential Improvement Proposition:
The system remained 100% stable without any kernel panics or UI freezes.
However, I continuously monitored the `dmesg` ring buffer and noticed
an intriguing behavior. While the laptop sits completely idle
(NetworkManager connected, but no active traffic), the `rtw88` driver
starts flooding the logs with thousands of firmware errors:

[ 1084.746485] rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state
[ 1084.749662] rtw88_8821ce 0000:13:00.0: failed to send h2c command
[ 1084.752895] rtw88_8821ce 0000:13:00.0: failed to send h2c command

If my understanding of this architecture is correct, previously, when
ASPM wasn't disabled, this exact failure of the adapter firmware inside
`LPS_DEEP_MODE_LCLK` would violently lock up the PCIe bus and crash
the host. Now, thanks to your DMI ASPM quirk at the `rtw88_pci` level,
the host PCIe controller doesn't enter `L1` and is perfectly shielded
from the adapter locking itself up! The OS handles the timeouts
gracefully and driver recovery prevents a hard freeze.

A question for your consideration: Given the immense volume of these
`h2c` timeout errors (and the underlying firmware's fundamental
inability to cleanly enter/exit its own sleep states without L1
participation on this HP model), do you think it would be beneficial
to *also* dynamically disable LPS Deep sleep when this specific ASPM
quirk is triggered?

For example, dynamically forcing `rtwdev->lps_conf.deep_mode =
LPS_DEEP_MODE_NONE` when the DMI ASPM flag is active, strictly to
prevent the firmware from attempting a sleep cycle that is doomed to
fail and polluting the queues and logs? Perhaps this might also save
microscopic CPU interrupts from continuous H2C polling timeouts?
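
A minimal sketch of that idea, written as a self-contained user-space model
rather than as a driver patch. The enum reuses the two mode names mentioned
above, but `struct lps_conf_model`, `pick_deep_mode()`, `apply_deep_mode()`
and the `aspm_quirk_active` flag are hypothetical, and where such a decision
would actually live in rtw88 is not established in this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model enum; only the two modes discussed in this thread are listed. */
enum lps_deep_mode_model {
	LPS_DEEP_MODE_NONE = 0,
	LPS_DEEP_MODE_LCLK = 1,
};

/* Hypothetical stand-in for the driver's LPS configuration. */
struct lps_conf_model {
	enum lps_deep_mode_model deep_mode;
};

/* Deep LPS is used only if neither the module parameter nor the
 * platform ASPM quirk asks for it to be turned off. */
enum lps_deep_mode_model pick_deep_mode(enum lps_deep_mode_model chip_default,
					bool disable_lps_deep,
					bool aspm_quirk_active)
{
	if (disable_lps_deep || aspm_quirk_active)
		return LPS_DEEP_MODE_NONE;
	return chip_default;
}

/* Stores the chosen mode in the (model) configuration. */
void apply_deep_mode(struct lps_conf_model *conf,
		     enum lps_deep_mode_model chip_default,
		     bool disable_lps_deep, bool aspm_quirk_active)
{
	assert(conf != NULL);
	conf->deep_mode = pick_deep_mode(chip_default, disable_lps_deep,
					 aspm_quirk_active);
}
```

With this coupling, a platform that matches the ASPM quirk would never be
asked to enter `LPS_DEEP_MODE_LCLK`, while all other platforms keep the
chip's default.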

If you believe that simply letting the driver recover and tolerating
the error spam in `dmesg` is the preferred/safer upstream approach, I
am perfectly happy. The patch functions as advertised and system
stability is unequivocally restored!

Thank you immensely for your rapid debugging and definitive patch for
this long-standing issue and for bringing stability to this model.

Tested-by: Oleksandr Havrylov <goainwo@gmail.com>

*(Note: I was a bit unsure which of the two active mailing list
threads was the most appropriate place for this final report — the
original bug discussion or the new RFT patch submission thread — so I
replied to both just to ensure it is correctly attached to the patch.
Apologies for the duplicate email!)*

Best regards,
Oleksandr Havrylov


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-11 15:22           ` LB F
@ 2026-03-12  1:56             ` Ping-Ke Shih
  2026-03-12 21:42               ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-12  1:56 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Hi Ping-Ke,
> 
> I successfully applied your patch out-of-tree and performed rigorous
> testing on the host machine.
> 
> I can officially confirm that the patch works flawlessly. The DMI
> quirk triggered correctly and successfully prevented the
> hardware-level PCIe bus lockups on my HP P3S95EA#ACB.

Thanks for quickly testing my patch. :)

> 
> Testing Environment & Methodology:
> - Kernel: CachyOS Linux 6.19.6-2-cachyos x86_64
> - Toolchain: Clang/LLVM 21.1.8 (`make CC=clang LLVM=1 modules`)
> - Extraction: We fetched the strict
> `drivers/net/wireless/realtek/rtw88` sub-tree out of the
> torvalds/linux `v6.19` tree utilizing `git sparse-checkout` to cleanly
> apply the patch without having to compile the entire 2.5GB+ kernel.
> - The resulting `.ko` object files were compressed to `.zst` and
> installed successfully over the generic CachyOS system driver objects.
> 
> Verification Conditions:
> - Removed ALL local workarounds. `disable_aspm=Y` is no longer forced
> via `/etc/modprobe.d/` overrides.
> - Power saving remains natively ON `wifi.powersave = 3` (managed by
> NetworkManager).
> - Left the laptop in multiple 5-10 minute complete idle states to
> enforce sleep modes.
> 
> Post-Boot Log Analysis & Potential Improvement Proposition:
> The system remained 100% stable without any kernel panics or UI freezes.
> However, I continuously monitored the `dmesg` ring buffer and noticed
> an intriguing behavior. While the laptop sits completely idle
> (NetworkManager connected, but no active traffic), the `rtw88` driver
> starts flooding the logs with thousands of firmware errors:
> 
> [ 1084.746485] rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state
> [ 1084.749662] rtw88_8821ce 0000:13:00.0: failed to send h2c command
> [ 1084.752895] rtw88_8821ce 0000:13:00.0: failed to send h2c command
> 
> If my understanding of this architecture is correct, previously, when
> ASPM wasn't disabled, this exact failure of the adapter firmware inside
> `LPS_DEEP_MODE_LCLK` would violently lock up the PCIe bus and crash
> the host. Now, thanks to your DMI ASPM quirk at the `rtw88_pci` level,
> the host PCIe controller doesn't enter `L1` and is perfectly shielded
> from the adapter locking itself up! The OS handles the timeouts
> gracefully and driver recovery prevents a hard freeze.

I'm really not sure how or why the kernel freezes. As I mentioned before,
it might be because malformed data is received and not fully validated
before the RX packet is reported to mac80211.

Could you try to dig into this and add some validation?

(Current DMI patch is fine to me.)
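
As a rough illustration of the kind of validation meant here, the following
user-space C sketch checks the length fields of an RX descriptor before a
frame would be handed to mac80211. It is not rtw88 code:
`struct rx_desc_info`, `rx_len_is_sane()` and `MIN_FRAME_LEN` are
hypothetical names and values chosen for illustration, and which checks are
actually needed is still an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the length fields parsed from an RX descriptor. */
struct rx_desc_info {
	uint32_t pkt_len;    /* payload length reported by the hardware */
	uint32_t pkt_offset; /* bytes of descriptor data preceding the payload */
};

/* Assumed lower bound for illustration: the shortest 802.11 control
 * frames (ACK/CTS) are 10 bytes long without the FCS. */
#define MIN_FRAME_LEN 10u

/* Returns true only if the reported lengths describe a non-trivial frame
 * that fits entirely inside an RX buffer of buf_size bytes. */
bool rx_len_is_sane(const struct rx_desc_info *d, size_t buf_size)
{
	assert(d != NULL);

	if (d->pkt_len < MIN_FRAME_LEN)
		return false; /* empty or truncated frame */
	if (d->pkt_offset > buf_size)
		return false; /* header alone overruns the buffer */
	if (d->pkt_len > buf_size - d->pkt_offset)
		return false; /* header + payload overruns the buffer */
	return true;
}
```

A caller would drop the frame instead of passing it up the stack whenever
`rx_len_is_sane()` returns false, for example when the descriptor reports a
`pkt_len` of zero.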

> 
> A question for your consideration: Given the immense volume of these
> `h2c` timeout errors (and the underlying firmware's fundamental
> inability to cleanly enter/exit its own sleep states without L1
> participation on this HP model), do you think it would be beneficial
> to *also* dynamically disable LPS Deep sleep when this specific ASPM
> quirk is triggered?
> 
> For example, dynamically forcing `rtwdev->lps_conf.deep_mode =
> LPS_DEEP_MODE_NONE` when the DMI ASPM flag is active, strictly to
> prevent the firmware from attempting a sleep cycle that is doomed to
> fail and polluting the queues and logs? Perhaps this might also save
> microscopic CPU interrupts from continuous H2C polling timeouts?

Are the 'h2c' timeout messages flooding the log, or do they appear
periodically? Do they really affect connection stability?

If you switch to another AP or connect on the 5GHz band, are the messages
still present?

I think it isn't easy to find the cause without measuring hardware
signals, since I have seen this message only very rarely. So, I'd adopt your
suggestion (dynamic LPS_DEEP_MODE_NONE) if the test is positive.

> 
> If you believe that simply letting the driver recover and tolerating
> the error spam in `dmesg` is the preferred/safer upstream approach, I
> am perfectly happy. The patch functions as advertised and system
> stability is unequivocally restored!
> 
> Thank you immensely for your rapid debugging and definitive patch for
> this long-standing issue and for bringing stability to this model.
> 
> Tested-by: Oleksandr Havrylov <goainwo@gmail.com>

I will add this to my patch then.

> 
> *(Note: I was a bit unsure which of the two active mailing list
> threads was the most appropriate place for this final report — the
> original bug discussion or the new RFT patch submission thread — so I
> replied to both just to ensure it is correctly attached to the patch.
> Apologies for the duplicate email!)*
> 

Let's discuss in this thread. For the RFT patch, I'd expect you to reply
only with the test result, and to give a Tested-by tag if it works.

By the way, your reply was top-posted, which is not preferred on this
mailing list, so I deleted the old discussion. Please avoid this in the future.

Ping-Ke



* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-12  1:56             ` Ping-Ke Shih
@ 2026-03-12 21:42               ` LB F
  2026-03-13  0:03                 ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-12 21:42 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> I'm really not sure how or why the kernel freezes. As I mentioned before,
> it might be because malformed data is received and not fully validated
> before the RX packet is reported to mac80211.
> Could you try to dig into this and add some validation?

I reviewed both rx.c and pci.c in detail and found a genuine validation
gap specific to the 8821CE chip.

In rtw_pci_rx_napi() (pci.c), the RX path allocates a new skb based
on the pkt_len field from the RX descriptor:

  new_len = pkt_stat.pkt_len + pkt_offset;
  new = dev_alloc_skb(new_len);
  skb_put_data(new, skb->data, new_len);
  /* ... */
  skb_pull(new, pkt_offset);
  ieee80211_rx_napi(rtwdev->hw, NULL, new, napi);

If pkt_stat.pkt_len is zero, new_len equals pkt_offset, skb_put_data
copies only the descriptor header, and skb_pull then removes that header
-- leaving an empty skb (len=0) that is passed unconditionally to
ieee80211_rx_napi() with no length guard.

Protection already exists for the 8703B chip in rtw_rx_fill_rx_status():

  if (rtwdev->chip->id == RTW_CHIP_TYPE_8703B && pkt_stat->pkt_len == 0) {
      rx_status->flag |= RX_FLAG_NO_PSDU;
      rtw_dbg(rtwdev, RTW_DBG_RX, "zero length packet");
  }

No equivalent check exists for RTW_CHIP_TYPE_8821CE. Removing the
chip-id restriction would be a minimal, safe fix for all chips:

--- a/rx.c
+++ b/rx.c
-  if (rtwdev->chip->id == RTW_CHIP_TYPE_8703B && pkt_stat->pkt_len == 0) {
+  if (pkt_stat->pkt_len == 0) {

I also checked PHY-level error counters from debugfs during normal
operation (phy_info):

  OFDM cnt (ok, err) = (867,  11)  ->   1.3% PHY CRC error rate
  VHT  cnt (ok, err) = (267,  32)  ->  10.7% PHY CRC error rate
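
The percentages above are err / (ok + err) for each counter pair. A minimal
sketch of that arithmetic, assuming the debugfs counters are read as plain
integers (the helper name phy_err_rate_pct is made up for illustration and
is not part of rtw88):

```c
/*
 * Hypothetical helper, not part of rtw88: share of errored frames, in
 * percent, among all frames counted for one modulation type (ok + err).
 */
double phy_err_rate_pct(unsigned int ok, unsigned int err)
{
	unsigned int total = ok + err;

	if (total == 0)
		return 0.0;	/* no frames counted yet */
	return 100.0 * (double)err / (double)total;
}
```

Calling phy_err_rate_pct(867, 11) and phy_err_rate_pct(267, 32) gives the
two rates quoted above, rounded to one decimal place.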

Frames with crc_err are passed to mac80211 with RX_FLAG_FAILED_FCS_CRC
set (not dropped by the driver), which is the correct approach.

However, I do not believe the freeze is caused by malformed RX data.
The freeze occurs deterministically about 10 seconds after the system
becomes fully idle with zero active network traffic, which matches the
LPS_DEEP_MODE_LCLK entry sequence rather than a random data corruption
pattern. The freeze behaviour also disappears entirely when ASPM L1 is
disabled (as confirmed by the Live USB logs I provided earlier), which
is the hallmark of a PCIe bus gating deadlock, not a data path issue.

> Are the 'h2c' timeout messages flooding the log, or do they appear
> periodically? Do they really affect connection stability?

The errors appear periodically in bursts during idle; network
connectivity is never affected (parallel ping tests show 0% packet
loss). The flooding documented in previous tests (hundreds per minute)
was observed under conditions where the LPS state machine had reached
a persistent failure mode after extended uptime. In shorter tests from
a fresh module load, the errors are sporadic (3-5 per 10 minutes).

> If you switch to another AP or connect on the 5GHz band, are the messages
> still present?

Yes. The issue has persisted for 2 years across 3 completely different
Access Points. I can only test on 5GHz, since 2.4GHz is disabled on
all my networks.

> I think it isn't easy to find out the cause without measuring hardware
> signals, since I saw the message very very rare. So, I'd adopt your
> suggestion (dynamic LPS_DEEP_MODE_NONE) if the test is positive.

The test is definitively positive.

Test environment: stock CachyOS 6.19.6 kernel, PCIe ASPM L1 confirmed
ENABLED via lspci ('LnkCtl: ASPM L1 Enabled'), no out-of-tree patches.
The rtw88 module stack was fully reloaded (including rtw88_core) for
each scenario. The disable_lps_deep parameter, which belongs to
rtw88_core, was verified via /sys/module/rtw88_core/parameters/
before and after each reload.

Test protocol: after module reload and Wi-Fi reconnect (verified via
HTTP 204 check), a 5-minute warm-up period elapsed before the
5-minute measurement window began. This ensures the firmware's LPS
state machine has fully initialised before results are recorded.

Methodology verified: 'modprobe -r rtw88_8821ce' removes only the
chip-specific modules, leaving rtw88_core in memory. The correct
procedure used was to explicitly also remove rtw88_core, then reload
all modules with the desired parameter.

Results (battery power, true idle each):

  disable_lps_deep=N (DEFAULT):
    Warm-up (5 min cumulative):  h2c=4   lps=0
    Measurement (5 min):         h2c=0   lps=0  [errors are bursty]

  disable_lps_deep=Y (CONFIRMED via sysfs):
    Warm-up (5 min cumulative):  h2c=0   lps=0
    Measurement (5 min):         h2c=0   lps=0
    ALL 10 minutes:              h2c=0

With disable_lps_deep=Y, not a single h2c timeout was recorded across
the entire 10-minute observation window (warm-up + measurement). With
disable_lps_deep=N, errors appeared within the first 5 minutes of idle.
Setting disable_lps_deep=Y completely eliminates the firmware timeout
loop, confirming that the root cause is the firmware attempting
LPS_DEEP_MODE_LCLK while PCIe constraints prevent it from completing.

Dynamic LPS_DEEP_MODE_NONE for the ASPM DMI quirk entry is the correct
and complete architectural solution.

--- Technical Appendix: RX Validation Audit Findings ---

I performed a deep audit of the RX descriptor parsing logic in rx.c and pci.c.
I found two concrete areas where validation is incomplete for the 8821CE:

1. Out-of-Bounds Read in rtw_pci_rx_napi (pci.c):
The DMA buffer size is fixed at ~11.5KB (RTK_PCI_RX_BUF_SIZE). However, the
hardware descriptor (W0_PKT_LEN) is 14 bits, allowing it to indicate up to
16KB. The driver calculates new_len = pkt_stat.pkt_len + pkt_offset and calls
skb_put_data(new, skb->data, new_len) without checking if new_len exceeds the
DMA source buffer. If hardware sends a malformed large length, this leads
to an OOB read of adjacent memory.

2. Missing 8821CE guard in rtw_rx_fill_rx_status (rx.c):
The check for pkt_len == 0 (which results in an empty SKB being passed
to mac80211) is manually restricted to RTW_CHIP_TYPE_8703B:
  if (rtwdev->chip->id == RTW_CHIP_TYPE_8703B && pkt_stat->pkt_len == 0)

Expanding this guard to all chips (or specifically 8821CE) would be safer.
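
The two findings above could be covered by a single host-side length check.
The following is only a userspace sketch of the idea: rx_len_is_sane is a
hypothetical helper, not existing rtw88 code, and the buffer size is passed
in as a parameter rather than taken from RTK_PCI_RX_BUF_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not existing rtw88 code: decide whether the length
 * reported by an RX descriptor may be copied out of a DMA buffer of
 * buf_size bytes. pkt_offset is the descriptor/header part in front of the
 * frame, pkt_len is the frame length taken from the descriptor.
 */
bool rx_len_is_sane(size_t pkt_len, size_t pkt_offset, size_t buf_size)
{
	if (pkt_len == 0)
		return false;	/* would leave an empty skb after skb_pull() */
	if (pkt_offset > buf_size)
		return false;	/* the header alone does not fit */
	if (pkt_len > buf_size - pkt_offset)
		return false;	/* the copy would read past the DMA buffer */
	return true;
}
```

In the driver, a check of this kind would presumably sit before the
skb_put_data() call, with the frame dropped when it fails. Where exactly it
belongs and how a failure should be reported is left open here.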

While these vulnerabilities exist, I still believe the freeze is
PCIe-timing related (LCLK entry/ASPM conflict), as no RX-related warnings
or memory corruption traces were found in dmesg prior to the hard freeze.

Best regards,
Oleksandr Havrylov


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-12 21:42               ` LB F
@ 2026-03-13  0:03                 ` LB F
  2026-03-13  0:29                   ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-13  0:03 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> I'm really not sure how or why the kernel freezes. As I mentioned before,
> it might be because malformed data is received and not fully validated
> before the RX packet is reported to mac80211.
> Could you try to dig into this and add some validation?

Hi Ping-Ke,

I took your advice and performed a deeper audit of the rtw88 PCI implementation,
focusing on both validation and concurrency. While the RX gaps I previously
mentioned are real, I found two critical architectural issues in the TX path
that likely contribute to the "hard freezes" and DMA stalls we've seen.

1. Concurrency: TX Descriptor Management Race (pci.c:836)
---------------------------------------------------------
In rtw_pci_tx_write_data(), rtw88 fetches the descriptor address based on
the current write pointer (wp) BEFORE acquiring the irq_lock:

```c
/* drivers/net/wireless/realtek/rtw88/pci.c:836 */
buf_desc = get_tx_buffer_desc(ring, tx_buf_desc_sz);
memset(buf_desc, 0, tx_buf_desc_sz);
/* ... packets are filled ... */
spin_lock_bh(&rtwpci->irq_lock); // [!] Lock is taken too late
```

Since mac80211 can call rtw_ops_tx and rtw_ops_wake_tx_queue (the latter
calling __rtw_tx_work) concurrently on different CPUs—especially for
high-priority AC_VO traffic—two threads can fetch the same wp for the
same queue simultaneously.

Result: CPU 0 prepares data in slot [N], while CPU 1 simultaneously zeros out
or overwrites slot [N]. This explains why we see intermittent descriptor
corruption and subsequent DMA/firmware hangs.
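
If the concern above holds (nothing in this thread has confirmed it), the
usual pattern is to read and advance the write pointer inside one critical
section. The sketch below is a userspace model only: demo_ring,
demo_ring_reserve and the C11 atomic_flag spinlock are stand-ins invented
for illustration, not the driver's rtw_pci_tx_ring or irq_lock.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_LEN 8u	/* illustrative ring size, not the driver's */

/* Simplified stand-in for a TX ring. */
struct demo_ring {
	atomic_flag lock;	/* plays the role of irq_lock */
	uint32_t wp;		/* write pointer: next free slot */
};

void demo_ring_init(struct demo_ring *ring)
{
	atomic_flag_clear(&ring->lock);
	ring->wp = 0;
}

/*
 * Reserve one slot. The write pointer is read and advanced while the lock
 * is held, so two callers are never handed the same slot index.
 */
uint32_t demo_ring_reserve(struct demo_ring *ring)
{
	uint32_t slot;

	while (atomic_flag_test_and_set_explicit(&ring->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */

	slot = ring->wp;
	ring->wp = (ring->wp + 1) % DEMO_RING_LEN;

	atomic_flag_clear_explicit(&ring->lock, memory_order_release);
	return slot;
}
```

The caller fills the descriptor only after demo_ring_reserve() has returned
its private slot index, which is the ordering point 1 argues for.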

2. Synchronization: Missing DMA Memory Barrier (pci.c:786)
----------------------------------------------------------
In rtw_pci_tx_kick_off_queue(), the doorbell is hit without a memory barrier:

```c
/* drivers/net/wireless/realtek/rtw88/pci.c:786 */
rtw_write16(rtwdev, bd_idx, ring->r.wp & TRX_BD_IDX_MASK);
```

For PCIe DMA, it is vital to ensure descriptor RAM writes are visible to
the device before the MMIO register doorbell hits. Standard Linux practice
usually dictates a wmb() here. Without it, the Wi-Fi controller may read
stale or uninitialized memory, leading to the "failed to leave lps state"
timeouts and H2C command failures we've logged.

3. Confirmed RX Limit Mismatch (rtw8821c.c:254)
-----------------------------------------------
I verified that the hardware is explicitly programmed with a 12KB limit:

```c
/* drivers/net/wireless/realtek/rtw88/rtw8821c.c:254 */
rtw_write8(rtwdev, REG_RX_PKT_LIMIT, WLAN_RX_PKT_LIMIT_512);
```

Since the driver's RX buffer (RTK_PCI_RX_BUF_SIZE) is only 11.2KB, any
malformed or large packet will result in an OOB read in rtw_pci_rx_napi().

I believe addressing these three points (TX locking, TX barriers, and
RX buffer consistency) would significantly harden the driver against
the stability issues reported in Bug 221195.

Best regards,
Oleksandr Havrylov


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-13  0:03                 ` LB F
@ 2026-03-13  0:29                   ` LB F
  2026-03-14 10:52                     ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-13  0:29 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Ping-Ke,

I apologize for the rapid follow-up and for being perhaps a bit over-assertive
in my previous email. As I continued to dig into the code, I realized that
some of my interpretations of hardware registers (like REG_RX_PKT_LIMIT)
and kernel serialization might be simplified compared to the real-world
complexities you deal with.

I'd like to reframe my previous notes as "curious observations" that I
stumbled upon while testing, and I'd value your professional take on whether
they are relevant:

1. RX Host-Side Validation:
While searching for the 12KB limit I mentioned, I noticed that in
rtw_pci_rx_napi(), the driver uses the pkt_len field from the descriptor
directly for skb_put_data() without checking it against the host buffer
size (RTK_PCI_RX_BUF_SIZE). Even if the hardware normally clips DMA,
would it be worth adding a host-side guard there as a "hardening" measure
against potentially malformed hardware reports?

2. TX Write Pointer (wp) Fetch:
I noticed that in rtw_pci_tx_write_data(), get_tx_buffer_desc() fetches
the wp outside the irq_lock. I wasn't sure if mac80211 guarantees that
the direct TX path and the background worker threads can never collide on
the same queue, but I thought it was worth mentioning just in case.

3. Memory Barriers:
The wmb() point was more of an architectural observation regarding
PCI best practices for non-x86 platforms. I understand x86 is quite
forgiving here, but I noticed it was a pattern that stood out.

Please treat these as humble suggestions from someone trying to learn
the driver's internals. I didn't mean to imply these were "critical bugs"
without your expert verification.

Thank you for your patience with my technical excitement!

Best regards,
Oleksandr


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-13  0:29                   ` LB F
@ 2026-03-14 10:52                     ` LB F
  2026-03-14 12:39                       ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-14 10:52 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

After extended testing with your DMI patch applied, the hard freeze is
gone. However, with ASPM disabled but LPS Deep still active, I observe
periodic h2c timeouts during idle which cause occasional WiFi
throughput drops and Bluetooth audio stuttering. When I additionally
set disable_lps_deep=Y, all symptoms disappear completely. This
confirms that combining the ASPM quirk with dynamic LPS_DEEP_MODE_NONE
would be the complete fix. Ready to test an updated patch if you
decide to include this.


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-14 10:52                     ` LB F
@ 2026-03-14 12:39                       ` LB F
  2026-03-15  0:24                         ` LB F
  2026-03-16  2:50                         ` Ping-Ke Shih
  0 siblings, 2 replies; 34+ messages in thread
From: LB F @ 2026-03-14 12:39 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> I'd adopt your suggestion (dynamic LPS_DEEP_MODE_NONE) if the test
> is positive.

Hi Ping-Ke,

Following your suggestion, I performed an additional experiment to
validate the dynamic LPS_DEEP_MODE_NONE idea. Please treat this
purely as a field test report -- I am not a kernel developer, and the
implementation below is certainly not upstream-quality. I am sharing
it only in the hope that it helps you design a proper solution.

What I did:

I extended your DMI quirk in pci.c with an additional capability flag
for LPS Deep mode. The only file touched was pci.c (your patch) --
main.c was left completely unmodified.

The changes to your patch are as follows:

  /* 1. Extended the capabilities enum */
  enum rtw88_quirk_dis_pci_caps {
          QUIRK_DIS_PCI_CAP_ASPM,
          QUIRK_DIS_PCI_CAP_LPS_DEEP,  /* test addition */
  };

  /* 2. Extended disable_pci_caps() callback */
  static int disable_pci_caps(const struct dmi_system_id *dmi)
  {
          uintptr_t dis_caps = (uintptr_t)dmi->driver_data;

          if (dis_caps & BIT(QUIRK_DIS_PCI_CAP_ASPM))
                  rtw_pci_disable_aspm = true;

          if (dis_caps & BIT(QUIRK_DIS_PCI_CAP_LPS_DEEP))
                  rtw_disable_lps_deep_mode = true;

          return 1;
  }

  /* 3. Both flags set for the HP P3S95EA#ACB entry */
  .driver_data = (void *)(BIT(QUIRK_DIS_PCI_CAP_ASPM) |
                          BIT(QUIRK_DIS_PCI_CAP_LPS_DEEP)),

I am aware that setting rtw_disable_lps_deep_mode from pci.c is
architecturally impure -- it is a global flag that would affect all
rtw88 devices in a hypothetical multi-adapter system. A proper
per-device solution (e.g. a flag inside struct rtw_dev set during
probe) would be cleaner. I simply used the existing global as the
most straightforward way to validate the concept.
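
For comparison, the per-device variant mentioned above might look roughly
like the following userspace sketch. Everything prefixed with demo_ or DEMO_
is invented for illustration; in particular, the two flags are assumed
fields and do not correspond to anything confirmed to exist in struct
rtw_dev.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions, mirroring the enum from the test patch above. */
enum demo_quirk_caps {
	DEMO_QUIRK_DIS_ASPM,
	DEMO_QUIRK_DIS_LPS_DEEP,
};

#define DEMO_BIT(n) ((uintptr_t)1 << (n))

/* Hypothetical per-device state; field names invented for this sketch. */
struct demo_dev {
	bool disable_aspm;
	bool disable_lps_deep;
};

/* Decode the quirk mask into one device instead of module-wide globals. */
void demo_apply_quirks(struct demo_dev *dev, uintptr_t dis_caps)
{
	dev->disable_aspm = !!(dis_caps & DEMO_BIT(DEMO_QUIRK_DIS_ASPM));
	dev->disable_lps_deep = !!(dis_caps & DEMO_BIT(DEMO_QUIRK_DIS_LPS_DEEP));
}
```

With this shape, a second adapter in the same machine keeps its own
settings, because the quirk only writes into the device it was applied to.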

Verification:

Confirmed no rtw88-related entries exist in /etc/modprobe.d/,
/lib/modprobe.d/, or /run/modprobe.d/, ruling out any external
parameter injection.

After loading the patched modules, the following was confirmed via
sysfs:

  /sys/module/rtw88_core/parameters/disable_lps_deep_mode = Y
  /sys/module/rtw88_pci/parameters/disable_aspm = Y

This confirms the DMI quirk is the sole source of both values.

Results (10-minute idle observation, battery power, wifi.powersave=3):

  With your ASPM patch alone (LPS Deep still active):
    - periodic "failed to send h2c command" bursts observed
    - occasional WiFi throughput drops and Bluetooth audio stuttering

  With ASPM patch + LPS Deep disabled via the quirk:
    - h2c=0, lps=0 across the entire observation window
    - WiFi throughput stable, Bluetooth audio uninterrupted

The result confirms that disabling LPS Deep Mode in addition to ASPM
completely eliminates the remaining firmware timeout loop on this
platform.

I hope this experiment is useful as a data point. Please feel free to
discard the implementation and design a proper solution -- I am ready
to test any updated patch you send.

Best regards,
Oleksandr Havrylov


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-14 12:39                       ` LB F
@ 2026-03-15  0:24                         ` LB F
  2026-03-16  2:55                           ` Ping-Ke Shih
  2026-03-16  2:50                         ` Ping-Ke Shih
  1 sibling, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-15  0:24 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Oleksandr Havrylov <goainwo@gmail.com> wrote:
> After extended testing with your DMI patch applied, the hard freeze is
> gone. However, with ASPM disabled but LPS Deep still active, I observe
> periodic h2c timeouts during idle which cause occasional WiFi throughput
> drops and Bluetooth audio stuttering. When I additionally set
> disable_lps_deep=Y, all symptoms disappear completely. This confirms
> that combining the ASPM quirk with dynamic LPS_DEEP_MODE_NONE would be
> the complete fix. Ready to test an updated patch if you decide to
> include this.

Hi Ping-Ke,

While monitoring logs with the current patch applied, I noticed two
things that might be useful.

First, the following message appears each time the driver loads:

  rtw88_8821ce 0000:13:00.0: can't disable ASPM; OS doesn't have ASPM control

This suggests the BIOS retains control over ASPM and prevents any
OS-level override via pci_disable_link_state(). The system remains
stable regardless, which confirms that the rtw_pci_disable_aspm flag
approach in your patch is the correct and effective method here.

Second, during normal operation I observe this warning periodically:

  WARNING: net/mac80211/rx.c:5491 at ieee80211_rx_list+0x177/0x1020 [mac80211]

This is the same location that appeared in the call trace just before
the hard freeze. You mentioned earlier that malformed RX data reaching
mac80211 could be a factor. I'm not sure if this warning is related,
but I wanted to flag it in case it is useful for your RX validation
investigation.

No h2c timeouts or firmware errors have been observed. The system
remains fully stable.

Best regards,
Oleksandr Havrylov


* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-14 12:39                       ` LB F
  2026-03-15  0:24                         ` LB F
@ 2026-03-16  2:50                         ` Ping-Ke Shih
  1 sibling, 0 replies; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-16  2:50 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Ping-Ke Shih <pkshih@realtek.com> wrote:
> > I'd adopt your suggestion (dynamic LPS_DEEP_MODE_NONE) if the test
> > is positive.
> 
> Hi Ping-Ke,
> 
> Following your suggestion, I performed an additional experiment to
> validate the dynamic LPS_DEEP_MODE_NONE idea. Please treat this
> purely as a field test report -- I am not a kernel developer, and the
> implementation below is certainly not upstream-quality. I am sharing
> it only in the hope that it helps you design a proper solution.
> 
> What I did:
> 
> I extended your DMI quirk in pci.c with an additional capability flag
> for LPS Deep mode. The only file touched was pci.c (your patch) --
> main.c was left completely unmodified.
> 
> The changes to your patch are as follows:
> 
>   /* 1. Extended the capabilities enum */
>   enum rtw88_quirk_dis_pci_caps {
>           QUIRK_DIS_PCI_CAP_ASPM,
>           QUIRK_DIS_PCI_CAP_LPS_DEEP,  /* test addition */
>   };
> 
>   /* 2. Extended disable_pci_caps() callback */
>   static int disable_pci_caps(const struct dmi_system_id *dmi)
>   {
>           uintptr_t dis_caps = (uintptr_t)dmi->driver_data;
> 
>           if (dis_caps & BIT(QUIRK_DIS_PCI_CAP_ASPM))
>                   rtw_pci_disable_aspm = true;
> 
>           if (dis_caps & BIT(QUIRK_DIS_PCI_CAP_LPS_DEEP))
>                   rtw_disable_lps_deep_mode = true;
> 
>           return 1;
>   }
> 
>   /* 3. Both flags set for the HP P3S95EA#ACB entry */
>   .driver_data = (void *)(BIT(QUIRK_DIS_PCI_CAP_ASPM) |
>                           BIT(QUIRK_DIS_PCI_CAP_LPS_DEEP)),
> 
> I am aware that setting rtw_disable_lps_deep_mode from pci.c is
> architecturally impure -- it is a global flag that would affect all
> rtw88 devices in a hypothetical multi-adapter system. A proper
> per-device solution (e.g. a flag inside struct rtw_dev set during
> probe) would be cleaner. I simply used the existing global as the
> most straightforward way to validate the concept.
> 
> Verification:
> 
> Confirmed no rtw88-related entries exist in /etc/modprobe.d/,
> /lib/modprobe.d/, or /run/modprobe.d/, ruling out any external
> parameter injection.
> 
> After loading the patched modules, the following was confirmed via
> sysfs:
> 
>   /sys/module/rtw88_core/parameters/disable_lps_deep_mode = Y
>   /sys/module/rtw88_pci/parameters/disable_aspm = Y
> 
> This confirms the DMI quirk is the sole source of both values.
> 
> Results (10-minute idle observation, battery power, wifi.powersave=3):
> 
>   With your ASPM patch alone (LPS Deep still active):
>     - periodic "failed to send h2c command" bursts observed
>     - occasional WiFi throughput drops and Bluetooth audio stuttering
> 
>   With ASPM patch + LPS Deep disabled via the quirk:
>     - h2c=0, lps=0 across the entire observation window
>     - WiFi throughput stable, Bluetooth audio uninterrupted
> 
> The result confirms that disabling LPS Deep Mode in addition to ASPM
> completely eliminates the remaining firmware timeout loop on this
> platform.
> 
> I hope this experiment is useful as a data point. Please feel free to
> discard the implementation and design a proper solution -- I am ready
> to test any updated patch you send.

Thanks for your analysis of TX/RX paths, and the changes above and
verifications. :)

I'll update the patch according to your proposal and send it. As for the
TX/RX path suggestions, I have only skimmed them so far, and I will study
them in full when I have more free time.

Ping-Ke



* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-15  0:24                         ` LB F
@ 2026-03-16  2:55                           ` Ping-Ke Shih
  2026-03-16 20:27                             ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-16  2:55 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> 
> Oleksandr Havrylov <goainwo@gmail.com> wrote:
> > After extended testing with your DMI patch applied, the hard freeze is
> > gone. However, with ASPM disabled but LPS Deep still active, I observe
> > periodic h2c timeouts during idle which cause occasional WiFi throughput
> > drops and Bluetooth audio stuttering. When I additionally set
> > disable_lps_deep=Y, all symptoms disappear completely. This confirms
> > that combining the ASPM quirk with dynamic LPS_DEEP_MODE_NONE would be
> > the complete fix. Ready to test an updated patch if you decide to
> > include this.
> 
> Hi Ping-Ke,
> 
> While monitoring logs with the current patch applied, I noticed two
> things that might be useful.
> 
> First, the following message appears each time the driver loads:
> 
>   rtw88_8821ce 0000:13:00.0: can't disable ASPM; OS doesn't have ASPM control
> 
> This suggests the BIOS retains control over ASPM and prevents any
> OS-level override via pci_disable_link_state(). The system remains
> stable regardless, which confirms that the rtw_pci_disable_aspm flag
> approach in your patch is the correct and effective method here.

Could this be because the PCIe bridge has no ASPM capability?

> 
> Second, during normal operation I observe this warning periodically:
> 
>   WARNING: net/mac80211/rx.c:5491 at ieee80211_rx_list+0x177/0x1020 [mac80211]

Line 5491 (kernel v6.19.6) is:

                case RX_ENC_VHT:
                        if (WARN_ONCE(status->rate_idx > 11 ||
                                      !status->nss ||
                                      status->nss > 8,
                                      "Rate marked as a VHT rate but data is invalid: MCS: %d, NSS: %d\n",
                                      status->rate_idx, status->nss))
                                goto drop;
                        break;

It looks like the driver reports an improper VHT nss/rate. But this only
warns once, and your message doesn't look like this.

Could you check line 5491 of the source code you are using?

Ping-Ke




* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-16  2:55                           ` Ping-Ke Shih
@ 2026-03-16 20:27                             ` LB F
  2026-03-17  1:28                               ` Ping-Ke Shih
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-16 20:27 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> Could this be because the PCIe bridge has no ASPM capability?

That could indeed be the case -- I do not have a way to confirm
without further hardware-level inspection.

> Line 5491 (kernel v6.19.6) is:
>                 case RX_ENC_VHT:
>                         if (WARN_ONCE(status->rate_idx > 11 ||
>                                       !status->nss ||
>                                       status->nss > 8,
>                                       "Rate marked as a VHT rate but data is invalid: MCS: %d, NSS: %d\n",
>                                       status->rate_idx, status->nss))
>                                 goto drop;
>                         break;
> It looks like the driver reports an improper VHT nss/rate. But this only
> warns once, and your message doesn't look like this.
> Could you check line 5491 of the source code you are using?

The file net/mac80211/rx.c is not available on disk on my system
(CachyOS ships only .h files in the headers package), but I located
the exact warning message in journalctl:

  Rate marked as a VHT rate but data is invalid: MCS: 0, NSS: 0

This confirms that line 5491 in my kernel matches exactly what you
showed from v6.19.6 -- the RX_ENC_VHT case checking for
status->nss == 0. The offset in my trace is slightly different
(+0x183 vs +0x177), which is likely due to CachyOS's LTO/AutoFDO
compiler optimizations.

The warning appeared once in my initial test session:

  Rate marked as a VHT rate but data is invalid: MCS: 0, NSS: 0
  WARNING: net/mac80211/rx.c:5491 at ieee80211_rx_list+0x183/0x1020 [mac80211]

However, in subsequent module reload and reconnect cycles I was unable
to reproduce it. This is consistent with WARN_ONCE behavior -- it
likely fired on the first invalid nss=0 packet after the initial
driver load and has not triggered since. I cannot confirm it as a
reliable symptom.
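
To illustrate the WARN_ONCE behaviour described above, here is a small
userspace model of the quoted check. It is not mac80211 code:
vht_rate_invalid and warn_count are names invented for this sketch, and the
static flag only imitates how WARN_ONCE suppresses repeated messages.

```c
#include <stdbool.h>
#include <stdio.h>

static unsigned int warn_count;	/* how many times the warning was printed */

/*
 * Returns true when the frame would be dropped. The message is printed for
 * the first bad frame only; later bad frames are still dropped, silently.
 */
bool vht_rate_invalid(unsigned int rate_idx, unsigned int nss)
{
	static bool warned;
	bool bad = rate_idx > 11 || nss == 0 || nss > 8;

	if (bad && !warned) {
		warned = true;
		warn_count++;
		fprintf(stderr,
			"Rate marked as a VHT rate but data is invalid: MCS: %u, NSS: %u\n",
			rate_idx, nss);
	}
	return bad;
}
```

In the kernel the equivalent flag belongs to the module containing the call
site, here mac80211. Reloading only the rtw88 modules would therefore not
re-arm the warning, which fits the observation that it could not be
reproduced in later reload cycles.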

---

Regarding patch stability: the results below are from testing your
original RFT patch [1], not any newer submission. I want to be
explicit to avoid confusion:

  [1] https://lore.kernel.org/linux-wireless/20260311020816.7065-1-pkshih@realtek.com/

This is the exact diff I compiled and tested:

--- a/drivers/net/wireless/realtek/rtw88/pci.c
+++ b/drivers/net/wireless/realtek/rtw88/pci.c
@@ -2,6 +2,7 @@
 /* Copyright(c) 2018-2019  Realtek Corporation
  */

+#include <linux/dmi.h>
 #include <linux/module.h>
 #include <linux/pci.h>
 #include "main.h"
@@ -1744,6 +1745,34 @@ const struct pci_error_handlers rtw_pci_err_handler = {
 };
 EXPORT_SYMBOL(rtw_pci_err_handler);

+enum rtw88_quirk_dis_pci_caps {
+	QUIRK_DIS_PCI_CAP_ASPM,
+};
+
+static int disable_pci_caps(const struct dmi_system_id *dmi)
+{
+	uintptr_t dis_caps = (uintptr_t)dmi->driver_data;
+
+	if (dis_caps & BIT(QUIRK_DIS_PCI_CAP_ASPM))
+		rtw_pci_disable_aspm = true;
+
+	return 1;
+}
+
+static const struct dmi_system_id rtw88_pci_quirks[] = {
+	{
+		.callback = disable_pci_caps,
+		.ident = "HP Notebook - P3S95EA#ACB",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "HP"),
+			DMI_MATCH(DMI_PRODUCT_NAME, "HP Notebook"),
+			DMI_MATCH(DMI_PRODUCT_SKU, "P3S95EA#ACB"),
+		},
+		.driver_data = (void *)BIT(QUIRK_DIS_PCI_CAP_ASPM),
+	},
+	{}
+};
+
 int rtw_pci_probe(struct pci_dev *pdev,
 		  const struct pci_device_id *id)
 {
@@ -1808,6 +1837,7 @@ int rtw_pci_probe(struct pci_dev *pdev,
 	    bridge && bridge->vendor == PCI_VENDOR_ID_INTEL)
 		rtwpci->rx_no_aspm = true;

+	dmi_check_system(rtw88_pci_quirks);
 	rtw_pci_phy_cfg(rtwdev);

 	ret = rtw_register_hw(rtwdev, hw);

Results with only this patch applied:

  - The hard freeze lockup is gone.
  - However, during idle the logs are flooded with:

      rtw88_8821ce 0000:13:00.0: failed to send h2c command
      rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state

  - To give a concrete sense of the volume: over an ~80-minute
    observation window after a clean module reload, I recorded
    11,757 "failed to send h2c command" events and 2 "firmware
    failed to leave lps state" events -- approximately 110 errors
    per minute during active periods.
  - These errors cause Bluetooth audio stuttering and WiFi
    throughput drops.

When I additionally set disable_lps_deep=Y alongside your ASPM patch,
all h2c errors vanish completely and Bluetooth/WiFi remain fully
stable. This confirms that disabling LPS Deep is necessary for
complete stability on this specific HP SKU.

I also noticed what appears to be a new patch in a separate mailing
list thread. I will test it shortly and report back with the results.

Best regards,
Oleksandr Havrylov


* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-16 20:27                             ` LB F
@ 2026-03-17  1:28                               ` Ping-Ke Shih
  2026-03-18  0:00                                 ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-17  1:28 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> 
> Ping-Ke Shih <pkshih@realtek.com> wrote:
> > Not sure if this is because PCIE bridge has no ASPM capability?
> 
> That could indeed be the case -- I do not have a way to confirm
> without further hardware-level inspection.
> 
> > Line 5491 (kernel v6.19.6) is:
> >                 case RX_ENC_VHT:
> >                         if (WARN_ONCE(status->rate_idx > 11 ||
> >                                       !status->nss ||
> >                                       status->nss > 8,
> >                                       "Rate marked as a VHT rate but data is invalid: MCS: %d, NSS: %d\n",
> >                                       status->rate_idx, status->nss))
> >                                 goto drop;
> >                         break;
> > It looks like the driver reports an improper VHT nss/rate? But this warns
> > only once, and your message doesn't look like this.
> > Could you check line 5491 of the source code you are using?
> 
> The file net/mac80211/rx.c is not available on disk on my system
> (CachyOS ships only .h files in the headers package), but I located
> the exact warning message in journalctl:
> 
>   Rate marked as a VHT rate but data is invalid: MCS: 0, NSS: 0
> 
> This confirms that line 5491 in my kernel matches exactly what you
> showed from v6.19.6 -- the RX_ENC_VHT case checking for
> status->nss == 0. The offset in my trace is slightly different
> (+0x183 vs +0x177), which is likely due to CachyOS's LTO/AutoFDO
> compiler optimizations.
> 
> The warning appeared once in my initial test session:
> 
>   Rate marked as a VHT rate but data is invalid: MCS: 0, NSS: 0
>   WARNING: net/mac80211/rx.c:5491 at ieee80211_rx_list+0x183/0x1020 [mac80211]
> 
> However, in subsequent module reload and reconnect cycles I was unable
> to reproduce it. This is consistent with WARN_ONCE behavior -- it
> likely fired on the first invalid nss=0 packet after the initial
> driver load and has not triggered since. I cannot confirm it as a
> reliable symptom.

To reproduce this reliably, you need to remove the driver .ko and
mac80211.ko, and then reload them.

However, you have already confirmed that this is the symptom. Unless you
want to dig into why the rate reported by the hardware is weird, we can
ignore this warning.

> 
> ---
> 
> Regarding patch stability: the results below are from testing your
> original RFT patch [1], not any newer submission. I want to be
> explicit to avoid confusion:
> 
>   [1] https://lore.kernel.org/linux-wireless/20260311020816.7065-1-pkshih@realtek.com/
> 
> This is the exact diff I compiled and tested:
> 
> --- a/drivers/net/wireless/realtek/rtw88/pci.c
> +++ b/drivers/net/wireless/realtek/rtw88/pci.c
> @@ -2,6 +2,7 @@
>  /* Copyright(c) 2018-2019  Realtek Corporation
>   */
> 
> +#include <linux/dmi.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
>  #include "main.h"
> @@ -1744,6 +1745,34 @@ const struct pci_error_handlers rtw_pci_err_handler = {
>  };
>  EXPORT_SYMBOL(rtw_pci_err_handler);
> 
> +enum rtw88_quirk_dis_pci_caps {
> + QUIRK_DIS_PCI_CAP_ASPM,
> +};
> +
> +static int disable_pci_caps(const struct dmi_system_id *dmi)
> +{
> + uintptr_t dis_caps = (uintptr_t)dmi->driver_data;
> +
> + if (dis_caps & BIT(QUIRK_DIS_PCI_CAP_ASPM))
> + rtw_pci_disable_aspm = true;
> +
> + return 1;
> +}
> +
> +static const struct dmi_system_id rtw88_pci_quirks[] = {
> + {
> + .callback = disable_pci_caps,
> + .ident = "HP Notebook - P3S95EA#ACB",
> + .matches = {
> + DMI_MATCH(DMI_SYS_VENDOR, "HP"),
> + DMI_MATCH(DMI_PRODUCT_NAME, "HP Notebook"),
> + DMI_MATCH(DMI_PRODUCT_SKU, "P3S95EA#ACB"),
> + },
> + .driver_data = (void *)BIT(QUIRK_DIS_PCI_CAP_ASPM),
> + },
> + {}
> +};
> +
>  int rtw_pci_probe(struct pci_dev *pdev,
>     const struct pci_device_id *id)
>  {
> @@ -1808,6 +1837,7 @@ int rtw_pci_probe(struct pci_dev *pdev,
>       bridge && bridge->vendor == PCI_VENDOR_ID_INTEL)
>   rtwpci->rx_no_aspm = true;
> 
> + dmi_check_system(rtw88_pci_quirks);
>   rtw_pci_phy_cfg(rtwdev);
> 
>   ret = rtw_register_hw(rtwdev, hw);
> 
> Results with only this patch applied:
> 
>   - The hard freeze lockup is gone.
>   - However, during idle the logs are flooded with:
> 
>       rtw88_8821ce 0000:13:00.0: failed to send h2c command
>       rtw88_8821ce 0000:13:00.0: firmware failed to leave lps state
> 
>   - To give a concrete sense of the volume: over an ~80-minute
>     observation window after a clean module reload, I recorded
>     11,757 "failed to send h2c command" events and 2 "firmware
>     failed to leave lps state" events -- approximately 110 errors
>     per minute during active periods.
>   - These errors cause Bluetooth audio stuttering and WiFi
>     throughput drops.
> 
> When I additionally set disable_lps_deep=Y alongside your ASPM patch,
> all h2c errors vanish completely and Bluetooth/WiFi remain fully
> stable. This confirms that disabling LPS Deep is necessary for
> complete stability on this specific HP SKU.
> 
> I also noticed what appears to be a new patch in a separate mailing
> list thread. I will test it shortly and report back with the results.

Thanks for your detailed experiments. :)

Ping-Ke


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-17  1:28                               ` Ping-Ke Shih
@ 2026-03-18  0:00                                 ` LB F
  2026-03-18  0:58                                   ` Ping-Ke Shih
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-18  0:00 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> To reproduce this reliably, you need to remove the driver .ko and
> mac80211.ko, and then reload them.
>
> However, you have already confirmed that this is the symptom. Unless you
> want to dig into why the rate reported by the hardware is weird, we can
> ignore this warning.

Following your suggestion, I performed a full stack reload including
mac80211.ko and cfg80211.ko, and was able to reproduce the warning:

  [152.226055] Rate marked as a VHT rate but data is invalid: MCS: 0, NSS: 0
  [152.226057] WARNING: net/mac80211/rx.c:5491 at ieee80211_rx_list+0x177/0x1020 [mac80211]
  [152.226336] CPU: 2 UID: 0 PID: 638 Comm: irq/56-rtw_pci Tainted: G IOE 6.19.7-1-cachyos
  [152.226344] Hardware name: HP HP Notebook/81F0, BIOS F.50 11/20/2020

One observation worth mentioning: the warning triggered approximately
72 seconds after initial association, coinciding with a Bluetooth
device connecting to the system. This may suggest the NSS=0 condition
occurs during BT coexistence negotiation rather than during normal
WiFi traffic. I am not sure if this is relevant, but I wanted to
mention it in case it helps narrow down the root cause.

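I also noticed the offset is now +0x177, which matches exactly what
you showed from v6.19.6. The earlier +0x183 most likely came from a
differently built mac80211 module (for example an earlier CachyOS
kernel package built with LTO/AutoFDO), since the offset inside a
function is fixed at build time and does not change while a module
stays loaded.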

As you noted, this appears to be a separate issue from the freeze and
h2c timeout problems, so I leave it to your judgment whether it
warrants further investigation.

---

I would like to take this opportunity to thank you sincerely for your
time, patience, and expertise throughout this whole process. From your
very first response to the final v2 patch, your guidance made it
possible to properly identify and resolve a bug that had been causing
real frustration for users of this hardware for a long time.

If any further testing of the rtw88 driver is needed -- whether for
this hardware or for other patches -- I am happy to help to the best
of my abilities and available time. This HP Notebook with RTL8821CE
will remain available for testing whenever it is useful.

Best regards,
Oleksandr Havrylov

* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-18  0:00                                 ` LB F
@ 2026-03-18  0:58                                   ` Ping-Ke Shih
  2026-03-18 23:55                                     ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-18  0:58 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Ping-Ke Shih <pkshih@realtek.com> wrote:
> > To reproduce this reliably, you need to remove the driver .ko and
> > mac80211.ko, and then reload them.
> >
> > However, you have already confirmed that this is the symptom. Unless you
> > want to dig into why the rate reported by the hardware is weird, we can
> > ignore this warning.
> 
> Following your suggestion, I performed a full stack reload including
> mac80211.ko and cfg80211.ko, and was able to reproduce the warning:
> 
>   [152.226055] Rate marked as a VHT rate but data is invalid: MCS: 0, NSS: 0
>   [152.226057] WARNING: net/mac80211/rx.c:5491 at ieee80211_rx_list+0x177/0x1020 [mac80211]
>   [152.226336] CPU: 2 UID: 0 PID: 638 Comm: irq/56-rtw_pci Tainted: G IOE 6.19.7-1-cachyos
>   [152.226344] Hardware name: HP HP Notebook/81F0, BIOS F.50 11/20/2020
> 
> One observation worth mentioning: the warning triggered approximately
> 72 seconds after initial association, coinciding with a Bluetooth
> device connecting to the system. This may suggest the NSS=0 condition
> occurs during BT coexistence negotiation rather than during normal
> WiFi traffic. I am not sure if this is relevant, but I wanted to
> mention it in case it helps narrow down the root cause.
> 
> I also noticed the offset is now +0x177, which matches exactly what
> you showed from v6.19.6. The earlier +0x183 was likely an artifact of
> CachyOS's LTO optimizations while mac80211 had been resident for a
> long time.
> 
> As you noted, this appears to be a separate issue from the freeze and
> h2c timeout problems, so I leave it to your judgment whether it
> warrants further investigation.

I added a printk for the case of VHT with NSS == 0, as shown below.
Please collect the output so that I can see what happened.

diff --git a/drivers/net/wireless/realtek/rtw88/rx.c b/drivers/net/wireless/realtek/rtw88/rx.c
index 8b0afaaffaa0..a4e3a3bce748 100644
--- a/drivers/net/wireless/realtek/rtw88/rx.c
+++ b/drivers/net/wireless/realtek/rtw88/rx.c
@@ -230,6 +230,11 @@ static void rtw_rx_fill_rx_status(struct rtw_dev *rtwdev,
                                    &rx_status->nss);
        }

+       if (rx_status->encoding == RX_ENC_VHT && rx_status->nss == 0) {
+               printk("VHT NSS=0 pkt_stat->rate=0x%x rx_status->band=%d rx_status->rate_idx=%d\n",
+                       pkt_stat->rate, rx_status->band, rx_status->rate_idx);
+       }
+
        rx_status->flag |= RX_FLAG_MACTIME_START;
        rx_status->mactime = pkt_stat->tsf_low;

> 
> ---
> 
> I would like to take this opportunity to thank you sincerely for your
> time, patience, and expertise throughout this whole process. From your
> very first response to the final v2 patch, your guidance made it
> possible to properly identify and resolve a bug that had been causing
> real frustration for users of this hardware for a long time.

Thank you as well for your time and help. :)

Ping-Ke


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-18  0:58                                   ` Ping-Ke Shih
@ 2026-03-18 23:55                                     ` LB F
  2026-03-19  0:22                                       ` LB F
  2026-03-19  1:24                                       ` Ping-Ke Shih
  0 siblings, 2 replies; 34+ messages in thread
From: LB F @ 2026-03-18 23:55 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> I added a printk for the case of VHT with NSS == 0, as shown below.
> Please collect the output so that I can see what happened.

Hi Ping-Ke,

I applied your diagnostic patch (using pr_err for maximum log
visibility) and spent the last couple of days testing it on the
affected hardware. The results answer both open questions cleanly.

---

Regarding your earlier question:
> Not sure if this is because PCIE bridge has no ASPM capability?

You were correct. The very beginning of the boot log shows:

  [0.177872] ACPI FADT declares the system doesn't support PCIe ASPM,
              so disable it
  [15.157752] r8169 0000:07:00.0: can't disable ASPM; OS doesn't have
               ASPM control

The BIOS on this HP laptop uses the ACPI FADT table to globally revoke
OS control over PCIe ASPM before Linux even takes over. This has an
important implication: since ASPM is already disabled at the hardware
level by firmware, the instability on this specific SKU is caused
entirely by LPS Deep Mode, not ASPM itself.

This explains why the ASPM-only quirk (v1 patch) did not stop the h2c
timeouts -- ASPM was never actually active on this machine to begin
with. Disabling LPS Deep Mode via the v2 quirk is what eliminates the
firmware timeout loop entirely.

---

Regarding the VHT NSS=0 diagnostic patch:

During normal idle, active pinging, and heavy VHT throughput
(175.5 Mb/s), the pr_err condition never triggered -- no
"VHT NSS=0" lines appeared in dmesg during active use.

However, the standard WARNING at mac80211/rx.c:5491 does reliably
appear exactly once after a fresh full stack reload (including
mac80211.ko and cfg80211.ko) or after resume from suspend:

  [167.708201] WARNING: net/mac80211/rx.c:5491 at
               ieee80211_rx_list+0x177/0x1020 [mac80211]

This suggests the hardware reports a malformed nss=0 VHT rate only
during initial link establishment. Since mac80211 uses WARN_ONCE, it
is suppressed on all subsequent packets.
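
To illustrate why the message shows up only once per load, here is a
minimal userspace sketch of warn-once behaviour. It is not the kernel
macro: sketch_warn_once(), sketch_vht_status_invalid() and the counter
are hypothetical names made up for this example, and the model keeps a
single flag where the kernel keeps one per call site. The point is only
that the text is printed once while the condition is still returned
(and the frame still dropped) on every call.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how many times the warning text was actually printed. */
static unsigned int sketch_warnings_printed;

/*
 * Userspace model of a warn-once call site (hypothetical helper, not the
 * kernel macro). The message is printed only the first time the condition
 * holds, but the condition itself is returned on every call.
 */
static bool sketch_warn_once(bool cond, const char *msg)
{
	/* Persists until the "module" (here: the process) is reloaded. */
	static bool warned;

	if (cond && !warned) {
		warned = true;
		sketch_warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Mirrors the RX_ENC_VHT check quoted in this thread; true means "drop". */
static bool sketch_vht_status_invalid(unsigned int rate_idx, unsigned int nss)
{
	return sketch_warn_once(rate_idx > 11 || nss == 0 || nss > 8,
				"Rate marked as a VHT rate but data is invalid");
}
```

Calling sketch_vht_status_invalid(0, 0) twice reports an invalid frame
both times but prints the warning only on the first call, which would
be consistent with the trace reappearing only after mac80211.ko itself
has been reloaded.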

The diagnostic module remains installed. I will report back
immediately if the pr_err condition is caught, or if any other
relevant symptoms appear.

Best regards,
Oleksandr Havrylov

* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-18 23:55                                     ` LB F
@ 2026-03-19  0:22                                       ` LB F
  2026-03-19  0:49                                         ` Ping-Ke Shih
  2026-03-19  1:24                                       ` Ping-Ke Shih
  1 sibling, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-19  0:22 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Ping-Ke,

I successfully collected the output with your diagnostic printk.

Here is the exact log entry triggered when the warning fires:

[  180.424146] VHT NSS=0 pkt_stat->rate=0x65 rx_status->band=1 rx_status->rate_idx=0
[  180.424157] WARNING: net/mac80211/rx.c:5491 at ieee80211_rx_list+0x177/0x1020 [mac80211]

Looking at the rtw88 source code, this explains why `nss` is 0:
1. The hardware/firmware reports `pkt_stat->rate = 0x65` (101 in decimal).
2. `rtw_rx_fill_rx_status()` checks whether `pkt_stat->rate >=
DESC_RATEVHT1SS_MCS0` (which is `0x2c`). Since `0x65 >= 0x2c`, it
sets `rx_status->encoding = RX_ENC_VHT`.
3. It then calls `rtw_desc_to_mcsrate(pkt_stat->rate,
&rx_status->rate_idx, &rx_status->nss)`.
4. Inside `rtw_desc_to_mcsrate()`, the value `0x65` falls outside every
known range. The highest defined rate in `enum rtw_trx_desc_rate` is
`DESC_RATEVHT4SS_MCS9` (`0x53`). The HT range (`DESC_RATEMCS0` to
`DESC_RATEMCS31`) ends at `0x2b`.
5. Because `0x65` matches none of the `if/else` branches in
`rtw_desc_to_mcsrate()`, the function returns without writing
`mcs` or `nss`.
6. Since `rx_status` was initialized with `memset(rx_status, 0, ...)`
at the beginning of the function, `rx_status->nss` remains `0`.

So mac80211 complains because the rtw88 driver doesn't know what rate
`0x65` means, leaves NSS at 0, but still flags it as a VHT packet.
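
The fall-through in steps 4 to 6 can be sketched in userspace C. This
is a simplified model under stated assumptions, not the driver
function: the SKETCH_* names and sketch_desc_to_mcsrate() are
hypothetical, the range boundaries are the ones quoted above (with the
start of the HT range derived as 0x2b - 31 = 0x0c), and the VHT branch
assumes ten MCS values per spatial stream.

```c
#include <stdint.h>

/* Boundaries quoted in this thread; the HT start follows from 0x2b - 31. */
enum {
	SKETCH_RATEMCS0        = 0x0c,
	SKETCH_RATEMCS31       = 0x2b,
	SKETCH_RATEVHT1SS_MCS0 = 0x2c,
	SKETCH_RATEVHT4SS_MCS9 = 0x53,
};

/*
 * Simplified model of the decode step: *mcs and *nss are written only when
 * the rate falls in a known range, and are otherwise left untouched.
 */
static void sketch_desc_to_mcsrate(uint16_t rate, uint8_t *mcs, uint8_t *nss)
{
	if (rate >= SKETCH_RATEVHT1SS_MCS0 && rate <= SKETCH_RATEVHT4SS_MCS9) {
		/* Assumed layout: ten MCS values per stream, 1SS through 4SS. */
		*nss = (uint8_t)((rate - SKETCH_RATEVHT1SS_MCS0) / 10 + 1);
		*mcs = (uint8_t)((rate - SKETCH_RATEVHT1SS_MCS0) % 10);
	} else if (rate >= SKETCH_RATEMCS0 && rate <= SKETCH_RATEMCS31) {
		*nss = 0;
		*mcs = (uint8_t)(rate - SKETCH_RATEMCS0);
	}
	/* Anything else, such as 0x65, falls through with no write at all. */
}
```

With both outputs zero-initialized, passing 0x65 leaves them at zero,
while 0x2c decodes to one stream at MCS 0.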

Any idea what `0x65` represents from the hardware's perspective? Is it
a firmware bug or a proprietary control/management frame rate index?

Looking forward to your thoughts!

Best regards,
Oleksandr

* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-19  0:22                                       ` LB F
@ 2026-03-19  0:49                                         ` Ping-Ke Shih
  0 siblings, 0 replies; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-19  0:49 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Hi Ping-Ke,
> 
> I successfully collected the output with your diagnostic printk.
> 
> Here is the exact log entry triggered when the warning fires:
> 
> [  180.424146] VHT NSS=0 pkt_stat->rate=0x65 rx_status->band=1
> rx_status->rate_idx=0
> [  180.424157] WARNING: net/mac80211/rx.c:5491 at
> ieee80211_rx_list+0x177/0x1020 [mac80211]
> 
> Looking at the rtw88 source code, this perfectly explains why `nss` is 0:
> 1. The hardware/firmware reports `pkt_stat->rate = 0x65` (101 in decimal).
> 2. `rtw_rx_fill_rx_status()` checks if `pkt_stat->rate >=
> DESC_RATEVHT1SS_MCS0` (which is `0x2c`). Since `0x65 >= 0x2c`, it
> correctly sets `rx_status->encoding = RX_ENC_VHT`.
> 3. It then calls `rtw_desc_to_mcsrate(pkt_stat->rate,
> &rx_status->rate_idx, &rx_status->nss)`.
> 4. Inside `rtw_desc_to_mcsrate()`, the value `0x65` falls completely
> outside any known bounds. The highest defined rate in `enum
> rtw_trx_desc_rate` is `DESC_RATEVHT4SS_MCS9` (`0x53`). The HT range
> (`DESC_RATEMCS0` to `DESC_RATEMCS31`) ends at `0x2b`.
> 5. Because `0x65` matches absolutely none of the `if/else` brackets in
> `rtw_desc_to_mcsrate()`, the function simply returns without mutating
> `mcs` and `nss`.
> 6. Since `rx_status` was initialized with `memset(rx_status, 0, ...)`
> at the beginning of the function, `rx_status->nss` remains `0`.
> 
> So mac80211 complains because the rtw88 driver doesn't know what rate
> `0x65` means, leaves NSS at 0, but still flags it as a VHT packet.
> 
> Any idea what `0x65` represents from the hardware's perspective? Is it
> a firmware bug or a proprietary control/management frame rate index?
> 
> Looking forward to your thoughts!

I am not sure what the hardware gets wrong. Let's validate the rate when
reading it from the hardware. Since the 1M rate is only valid at 20 MHz,
I set the bandwidth as well.

Please test the patch below. I expect you will see "weird rate=xxx",
while "WARNING: net/mac80211/rx.c:5491" should disappear.

diff --git a/drivers/net/wireless/realtek/rtw88/rx.c b/drivers/net/wireless/realtek/rtw88/rx.c
index 8b0afaaffaa0..3d5e48264fc5 100644
--- a/drivers/net/wireless/realtek/rtw88/rx.c
+++ b/drivers/net/wireless/realtek/rtw88/rx.c
@@ -295,6 +295,12 @@ void rtw_rx_query_rx_desc(struct rtw_dev *rtwdev, void *rx_desc8,

        pkt_stat->tsf_low = le32_get_bits(rx_desc->w5, RTW_RX_DESC_W5_TSFL);

+       if (pkt_stat->rate >= DESC_RATE_MAX) {
+               printk("weird rate=%d\n", pkt_stat->rate);
+               pkt_stat->rate = DESC_RATE1M;
+               pkt_stat->bw = RTW_CHANNEL_WIDTH_20;
+       }
+
        /* drv_info_sz is in unit of 8-bytes */
        pkt_stat->drv_info_sz *= 8;
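
For readers who want to try the idea outside the kernel, the check in
this patch can be modelled in userspace roughly as follows. This is
only a sketch: struct sketch_pkt_stat and sketch_sanitize_rate() are
hypothetical, and the numeric values (0x54 as one past the highest
defined rate 0x53, and 0 as the 20 MHz encoding) are assumptions made
for the example rather than values taken from the driver headers.

```c
#include <stdint.h>

/* Assumed values for this sketch only. */
enum {
	SKETCH_RATE1M           = 0x00,
	SKETCH_RATE_MAX         = 0x54, /* assumed: one past the highest rate */
	SKETCH_CHANNEL_WIDTH_20 = 0,    /* assumed encoding for 20 MHz */
};

/* Minimal stand-in for the two fields the patch touches. */
struct sketch_pkt_stat {
	uint8_t rate;
	uint8_t bw;
};

/* Returns 1 if the reported rate was out of range and had to be replaced. */
static int sketch_sanitize_rate(struct sketch_pkt_stat *pkt_stat)
{
	if (pkt_stat->rate < SKETCH_RATE_MAX)
		return 0;

	/* Fall back to the lowest legacy rate, which only exists at 20 MHz. */
	pkt_stat->rate = SKETCH_RATE1M;
	pkt_stat->bw = SKETCH_CHANNEL_WIDTH_20;
	return 1;
}
```

A descriptor carrying 0x65 would come out as rate 0x00 at 20 MHz, so
the later VHT classification would no longer apply to it.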

Ping-Ke



* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-18 23:55                                     ` LB F
  2026-03-19  0:22                                       ` LB F
@ 2026-03-19  1:24                                       ` Ping-Ke Shih
  2026-03-19 23:58                                         ` LB F
  1 sibling, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-19  1:24 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Ping-Ke Shih <pkshih@realtek.com> wrote:
> > I added a printk for the case of VHT with NSS == 0, as shown below.
> > Please collect the output so that I can see what happened.
> 
> Hi Ping-Ke,
> 
> I applied your diagnostic patch (using pr_err for maximum log
> visibility) and spent the last couple of days testing it on the
> affected hardware. The results answer both open questions cleanly.
> 
> ---
> 
> Regarding your earlier question:
> > Not sure if this is because PCIE bridge has no ASPM capability?
> 
> You were correct. The very beginning of the boot log shows:
> 
>   [0.177872] ACPI FADT declares the system doesn't support PCIe ASPM,
>               so disable it
>   [15.157752] r8169 0000:07:00.0: can't disable ASPM; OS doesn't have
>                ASPM control
> 
> The BIOS on this HP laptop uses the ACPI FADT table to globally revoke
> OS control over PCIe ASPM before Linux even takes over. This has an
> important implication: since ASPM is already disabled at the hardware
> level by firmware, the instability on this specific SKU is caused
> entirely by LPS Deep Mode, not ASPM itself.

Checking the rtw88 code related to rtw_pci_disable_aspm, I found that the
driver does check the device's ASPM capability before configuring ASPM.
It is a little odd that the OS doesn't turn off these capabilities on the
device. Maybe we should check the capabilities on the PCI bridge side?

> 
> This explains why the ASPM-only quirk (v1 patch) did not stop the h2c
> timeouts -- ASPM was never actually active on this machine to begin
> with. Disabling LPS Deep Mode via the v2 quirk is what eliminates the
> firmware timeout loop entirely.

I think there are two problems. One is ASPM causing the system freeze,
and the other is LPS deep mode causing the H2C timeouts. If you turn on
ASPM and disable LPS deep mode, I expect the H2C timeouts to disappear,
although the system might freeze first.

Ping-Ke


* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-19  1:24                                       ` Ping-Ke Shih
@ 2026-03-19 23:58                                         ` LB F
  2026-03-20  0:41                                           ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-19 23:58 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> Maybe we should check the capabilities on the PCI bridge side?
> I think there are two problems. One is ASPM causing the system freeze,
> and the other is LPS deep mode causing the H2C timeouts.

Hi Ping-Ke,

You were right on both counts. Here are the PCI bridge capabilities.

The upstream bridge for the RTL8821CE (13:00.0) is:
  Intel Corporation Wildcat Point-LP PCI Express Root Port #5 (00:1c.4)

Bridge (00:1c.4):
  LnkCap: Port #5, Speed 5GT/s, Width x1, ASPM L0s L1,
          Exit Latency L0s <512ns, L1 <16us
  LnkCtl: ASPM L0s L1 Enabled

WiFi card (13:00.0):
  LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1,
          Exit Latency L0s <2us, L1 <64us
  LnkCtl: ASPM L0s L1 Enabled

So ASPM L0s and L1 are enabled by the BIOS on both ends of the bus,
despite the ACPI FADT claiming the OS has no ASPM control. ASPM was
active on this machine all along. I apologize for the incorrect
earlier conclusion that ASPM was not active.

This confirms your analysis: there are indeed two separate problems --
ASPM causing the hard freeze, and LPS Deep Mode causing the H2C
timeouts. The v2 patch correctly addresses both.
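
For reference, the "ASPM ... Enabled" text in the LnkCtl lines above
comes from the two-bit ASPM Control field of the PCIe Link Control
register. A minimal userspace sketch of that decoding is shown below.
The SKETCH_* macros and sketch_lnkctl_aspm() are hypothetical names for
this example; only the bit layout (bits 1:0, with 01 = L0s, 10 = L1,
11 = both) is taken from the PCIe specification.

```c
#include <stdint.h>

/* ASPM Control field: bits 1:0 of the PCIe Link Control register. */
#define SKETCH_LNKCTL_ASPM_MASK 0x0003u
#define SKETCH_LNKCTL_ASPM_L0S  0x0001u
#define SKETCH_LNKCTL_ASPM_L1   0x0002u

/* Maps a raw Link Control value to a human-readable ASPM state. */
static const char *sketch_lnkctl_aspm(uint16_t lnkctl)
{
	switch (lnkctl & SKETCH_LNKCTL_ASPM_MASK) {
	case SKETCH_LNKCTL_ASPM_L0S:
		return "L0s Enabled";
	case SKETCH_LNKCTL_ASPM_L1:
		return "L1 Enabled";
	case SKETCH_LNKCTL_ASPM_L0S | SKETCH_LNKCTL_ASPM_L1:
		return "L0s L1 Enabled";
	default:
		return "Disabled";
	}
}
```

Any Link Control value with both low bits set decodes to "L0s L1
Enabled", which is the state reported for both the bridge and the WiFi
card here.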

---

Regarding your rate validation patch: I applied it (removing the
earlier pr_err block and inserting the new check in
rtw_rx_query_rx_desc). The patch compiled and installed correctly --
verified via strings on the installed .zst module.

I was unable to reproduce the "weird rate" condition or the WARNING
during this test session. The diagnostic module remains installed and
active -- I will report back immediately if I manage to catch it.

Best regards,
Oleksandr Havrylov

* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-19 23:58                                         ` LB F
@ 2026-03-20  0:41                                           ` LB F
  2026-03-20  1:00                                             ` Ping-Ke Shih
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-20  0:41 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> I am not sure what the hardware gets wrong. Let's validate the rate
> when reading it from the hardware.

Hi Ping-Ke,

One additional observation while monitoring logs with your rate
validation patch installed.

During normal usage with Wi-Fi connected and a Bluetooth A2DP device
connecting to the system, the following message appeared in dmesg:

  [180.420000] rtw_8821ce 0000:13:00.0: unused phy status page (11)

Looking at rtw_rx_fill_phy_info() in rx.c, this message is emitted
when the firmware sends a PHY status report with a page number that
the driver does not recognize. In this case page 11 appeared at the
moment the Bluetooth device was establishing its connection.

We have not observed any stability issues or connectivity drops
associated with this message -- the driver appears to handle it
gracefully by ignoring it. We are not sure whether this is related
to the rate=0x65 issue or is simply a separate artifact of BT/Wi-Fi
coexistence on this chip. We wanted to mention it in case it is
useful context.

Best regards,
Oleksandr Havrylov

* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-20  0:41                                           ` LB F
@ 2026-03-20  1:00                                             ` Ping-Ke Shih
  2026-03-20  1:19                                               ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-20  1:00 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Ping-Ke Shih <pkshih@realtek.com> wrote:
> > Not sure what hardware get wrong. Let's validate rate when reading
> > from hardware.
> 
> Hi Ping-Ke,
> 
> One additional observation while monitoring logs with your rate
> validation patch installed.
> 
> During normal usage with Wi-Fi connected and a Bluetooth A2DP device
> connecting to the system, the following message appeared in dmesg:
> 
>   [180.420000] rtw_8821ce 0000:13:00.0: unused phy status page (11)
> 
> Looking at rtw_rx_fill_phy_info() in rx.c, this message is emitted
> when the firmware sends a PHY status report with a page number that
> the driver does not recognize. In this case page 11 appeared at the
> moment the Bluetooth device was establishing its connection.

It seems the hardware reports an incorrect PHY status page, where
only 0 or 1 is expected. I don't know how that can happen. Maybe we
can ignore this message, or change it to debug level if it appears
frequently and you don't want to see it.

> 
> We have not observed any stability issues or connectivity drops
> associated with this message -- the driver appears to handle it
> gracefully by ignoring it. We are not sure whether this is related
> to the rate=0x65 issue or is simply a separate artifact of BT/Wi-Fi
> coexistence on this chip. We wanted to mention it in case it is
> useful context.

Both messages suggest the hardware is misbehaving and its reported
values become unpredictable. Maybe we need more validation...
However, the driver would become very dirty, since I can't derive a
single rule that addresses them all.

Ping-Ke



* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-20  1:00                                             ` Ping-Ke Shih
@ 2026-03-20  1:19                                               ` LB F
  2026-03-20  2:02                                                 ` Ping-Ke Shih
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-20  1:19 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> Not sure what hardware get wrong. Let's validate rate when reading
> from hardware. Since 1M rate can only 20MHz, I set it together.
> Please help to test below. I suppose you can see "weird rate=xxx",
> but "WARNING: net/mac80211/rx.c:5491" disappears.

Hi Ping-Ke,

I can confirm your patch works as expected. Here are the full results.

--- Test environment ---

Kernel:  6.19.7-1-cachyos
Patch:   your rate validation patch applied to rtw_rx_query_rx_desc(),
         on top of the v2 DMI quirk (ASPM + LPS Deep disabled)

--- Captured log (relevant excerpt) ---

  [  43.046] input: Soundcore Q10i (AVRCP)   <-- BT headset connected
  [ 111.551] rtw_8821ce 0000:13:00.0: unused phy status page (13)
  [ 111.635] weird rate=101
  [ 111.635] rtw_8821ce 0000:13:00.0: unused phy status page (7)
  [ 111.741] weird rate=102
  [ 115.045] weird rate=98
  [ 118.371] weird rate=104

--- Analysis ---

1. Timing: the anomalous events began approximately 68 seconds after
   the Bluetooth A2DP headset (Soundcore Q10i) established its
   connection. No anomalies were observed before BT connected.

2. Multiple invalid rate values were captured, not just 0x65:

     weird rate=101  (0x65)
     weird rate=102  (0x66)
     weird rate=98   (0x62)
     weird rate=104  (0x68)

   All four values exceed DESC_RATE_MAX (0x53 = 83 decimal). This
   suggests the hardware occasionally reports a range of out-of-bounds
   rate values during BT/Wi-Fi coexistence, not a single fixed value.

3. The "unused phy status page" messages (pages 13 and 7) appeared
   immediately before and alongside the "weird rate" events. As you
   noted in your earlier reply, only pages 0 and 1 are expected. This
   further suggests the firmware leaks internal coexistence state
   into the RX ring during BT antenna arbitration.

4. Most importantly: the WARNING: net/mac80211/rx.c:5491 did NOT
   appear anywhere in the log. Your rate clamping patch successfully
   intercepts the out-of-bounds values before they propagate to
   mac80211, preventing the invalid VHT NSS=0 warning entirely.
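
To make the clamping behavior concrete, here is a minimal user-space sketch of the validation step as I understand it from your description (fall back to the 1M rate and set 20 MHz together). It is not the actual patch: the struct, the function name, and the constant values for the 1M rate and the 20 MHz encoding are assumptions, and the upper bound is simply the 0x53 figure quoted in point 2.

```c
#include <stdint.h>
#include <stdio.h>

#define SKETCH_RATE_1M   0x00  /* assumed descriptor value of the 1M rate */
#define SKETCH_RATE_MAX  0x53  /* upper bound as quoted in point 2 above */
#define SKETCH_BW_20MHZ  0     /* assumed encoding for 20 MHz */

/* Hypothetical stand-in for the per-packet RX status. */
struct sketch_rx_stat {
	uint8_t rate;
	uint8_t bw;
};

/* Returns 1 if the descriptor had to be corrected, 0 otherwise. */
static int sketch_validate_rx_rate(struct sketch_rx_stat *stat)
{
	if (stat->rate <= SKETCH_RATE_MAX)
		return 0;

	printf("weird rate=%d\n", stat->rate);

	/* 1M is only valid at 20 MHz, so both fields are reset together. */
	stat->rate = SKETCH_RATE_1M;
	stat->bw = SKETCH_BW_20MHZ;
	return 1;
}
```

Fed the four captured values (98, 101, 102, 104), this prints the same "weird rate" lines as the log above and hands a rate the upper layer can interpret to the caller.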

--- Conclusion ---

Your patch achieves the intended result. The "weird rate" printk
confirms the hardware is the source of the invalid values (occurring
during BT coexistence), and the mac80211 WARNING is suppressed.

Please let me know if you need any additional data or further testing.

Best regards,
Oleksandr Havrylov


* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-20  1:19                                               ` LB F
@ 2026-03-20  2:02                                                 ` Ping-Ke Shih
  2026-03-21 12:07                                                   ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-20  2:02 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Ping-Ke Shih <pkshih@realtek.com> wrote:
> > Not sure what hardware get wrong. Let's validate rate when reading
> > from hardware. Since 1M rate can only 20MHz, I set it together.
> > Please help to test below. I suppose you can see "weird rate=xxx",
> > but "WARNING: net/mac80211/rx.c:5491" disappears.
> 
> Hi Ping-Ke,
> 
> I can confirm your patch works as expected. Here are the full results.
> 
> --- Test environment ---
> 
> Kernel:  6.19.7-1-cachyos
> Patch:   your rate validation patch applied to rtw_rx_query_rx_desc(),
>          on top of the v2 DMI quirk (ASPM + LPS Deep disabled)
> 
> --- Captured log (relevant excerpt) ---
> 
>   [  43.046] input: Soundcore Q10i (AVRCP)   <-- BT headset connected
>   [ 111.551] rtw_8821ce 0000:13:00.0: unused phy status page (13)
>   [ 111.635] weird rate=101
>   [ 111.635] rtw_8821ce 0000:13:00.0: unused phy status page (7)
>   [ 111.741] weird rate=102
>   [ 115.045] weird rate=98
>   [ 118.371] weird rate=104
> 
> --- Analysis ---
> 
> 1. Timing: the anomalous events began approximately 68 seconds after
>    the Bluetooth A2DP headset (Soundcore Q10i) established its
>    connection. No anomalies were observed before BT connected.
> 
> 2. Multiple invalid rate values were captured, not just 0x65:
> 
>      weird rate=101  (0x65)
>      weird rate=102  (0x66)
>      weird rate=98   (0x62)
>      weird rate=104  (0x68)
> 
>    All four values exceed DESC_RATE_MAX (0x53 = 83 decimal). This
>    suggests the hardware occasionally reports a range of out-of-bounds
>    rate values during BT/Wi-Fi coexistence, not a single fixed value.
> 
> 3. The "unused phy status page" messages (pages 13 and 7) appeared
>    immediately before and alongside the "weird rate" events. As noted
>    in my previous message, only pages 0 and 1 are expected. This
>    further suggests the firmware leaks internal coexistence state
>    into the RX ring during BT antenna arbitration.
> 
> 4. Most importantly: the WARNING: net/mac80211/rx.c:5491 did NOT
>    appear anywhere in the log. Your rate clamping patch successfully
>    intercepts the out-of-bounds values before they propagate to
>    mac80211, preventing the invalid VHT NSS=0 warning entirely.
> 
> --- Conclusion ---
> 
> Your patch achieves the intended result. The "weird rate" printk
> confirms the hardware is the source of the invalid values (occurring
> during BT coexistence), and the mac80211 WARNING is suppressed.
> 
> Please let me know if you need any additional data or further testing.

I'll send a formal patch (Cc you) for the invalid VHT NSS=0, but it
will not handle "unused phy status page". Please give me a Tested-by
tag on the patch after I send it.

Ping-Ke



* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-20  2:02                                                 ` Ping-Ke Shih
@ 2026-03-21 12:07                                                   ` LB F
  2026-03-23  2:01                                                     ` Ping-Ke Shih
  0 siblings, 1 reply; 34+ messages in thread
From: LB F @ 2026-03-21 12:07 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Ping-Ke Shih <pkshih@realtek.com> wrote:
> I'll send formal patch (Cc you) for the invalid VHT NSS=0, but not
> to handle "unused phy status page". Please give me Tested-by tag on
> the patch after I send it.

Hi Ping-Ke,

Just a quick update to keep you informed -- no rush on anything.

My kernel updated from 6.19.7 to 6.19.9, which wiped the previously
installed out-of-tree modules. I rebuilt and reinstalled both patches:

  1. The v2 DMI quirk (main.h + pci.c) disabling ASPM and LPS Deep
     Mode for the HP P3S95EA#ACB SKU.
  2. The rate validation patch (rx.c) clamping out-of-bounds rate
     values before they reach mac80211.

Both patches apply cleanly and the system remains fully stable on
6.19.9. The DMI quirk is confirmed active via sysfs (disable_aspm=Y,
disable_lps_deep=Y) with no manual modprobe overrides.
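
For anyone who wants to repeat the sysfs check programmatically, a minimal sketch follows. The helper name is hypothetical. It relies only on the standard /sys/module/<module>/parameters/<name> layout, where a boolean parameter reads back as Y or N.

```c
#include <stdio.h>

/*
 * Reads a boolean module parameter file and returns 1 for 'Y',
 * 0 for 'N', and -1 if the file is missing or holds something else.
 * Hypothetical helper for illustration only.
 */
static int sketch_read_bool_param(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;

	c = fgetc(f);
	fclose(f);

	if (c == 'Y')
		return 1;
	if (c == 'N')
		return 0;
	return -1;
}
```

On my system the two paths to pass would be /sys/module/rtw88_pci/parameters/disable_aspm and /sys/module/rtw88_core/parameters/disable_lps_deep. Both should return 1 when the quirk is active.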

I am looking forward to your formal patch for the VHT NSS=0 issue and
will provide a Tested-by tag as soon as it arrives. Thank you again
for all your work and patience throughout this process.

Best regards,
Oleksandr Havrylov


* RE: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-21 12:07                                                   ` LB F
@ 2026-03-23  2:01                                                     ` Ping-Ke Shih
  2026-03-25 20:38                                                       ` LB F
  0 siblings, 1 reply; 34+ messages in thread
From: Ping-Ke Shih @ 2026-03-23  2:01 UTC (permalink / raw)
  To: LB F; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

LB F <goainwo@gmail.com> wrote:
> Ping-Ke Shih <pkshih@realtek.com> wrote:
> > I'll send formal patch (Cc you) for the invalid VHT NSS=0, but not
> > to handle "unused phy status page". Please give me Tested-by tag on
> > the patch after I send it.
> 
> Hi Ping-Ke,
> 
> Just a quick update to keep you informed -- no rush on anything.
> 
> My kernel updated from 6.19.7 to 6.19.9, which wiped the previously
> installed out-of-tree modules. I rebuilt and reinstalled both patches:
> 
>   1. The v2 DMI quirk (main.h + pci.c) disabling ASPM and LPS Deep
>      Mode for the HP P3S95EA#ACB SKU.
>   2. The rate validation patch (rx.c) clamping out-of-bounds rate
>      values before they reach mac80211.
> 
> Both patches apply cleanly and the system remains fully stable on
> 6.19.9. The DMI quirk is confirmed active via sysfs (disable_aspm=Y,
> disable_lps_deep=Y) with no manual modprobe overrides.
> 
> I am looking forward to your formal patch for the VHT NSS=0 issue and
> will provide a Tested-by tag as soon as it arrives. Thank you again
> for all your work and patience throughout this process.

I sent the VHT NSS=0 patch [1]. Please give it a test.
Thanks.

[1] https://lore.kernel.org/linux-wireless/20260323015849.9424-1-pkshih@realtek.com/T/#u

Ping-Ke




* Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
  2026-03-23  2:01                                                     ` Ping-Ke Shih
@ 2026-03-25 20:38                                                       ` LB F
  0 siblings, 0 replies; 34+ messages in thread
From: LB F @ 2026-03-25 20:38 UTC (permalink / raw)
  To: Ping-Ke Shih; +Cc: linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org

Subject: Cross-platform analysis: RTL8821CE ASPM/LPS instability
        affects multiple OEM platforms beyond HP P3S95EA#ACB

Hi Ping-Ke,

First of all, thank you very much for your work on the rtw88 driver
and for the time you spent helping us resolve the issues on our HP
laptop. Both patches -- the v2 DMI quirk (ASPM + LPS Deep) and the
v2 RX rate validation (rx.c) -- have been tested and verified stable
on our system across two kernel updates (6.19.9-1 and 6.19.9-2),
sustained network load (~1.9 GB), and multiple suspend/resume cycles.
The system is now completely free of freezes, h2c errors, and
mac80211 warnings. Your patches genuinely solved every issue we had.

While working through this, I noticed that many other users across
different hardware platforms appear to be experiencing the same
problems that your patches resolved for us. I decided to collect
and organize these observations in case they might be useful to you.

Please note that this is an amateur analysis, not a professional
one -- I am just a user trying to help. It is entirely possible
that I have missed nuances or made incorrect assumptions. My only
goal is to share what I found, in case it provides useful data
points or sparks ideas for broader improvements. If any of this
is not relevant or not useful, please feel free to disregard it.

1. KERNEL BUGZILLA: OPEN RTL8821CE REPORTS
==========================================

I reviewed all open RTL8821CE bugs in kernel.org Bugzilla. Three
of the six show symptoms that directly match the root causes
addressed by your patches (ASPM deadlock and LPS Deep h2c failures).

--- Directly correlated with ASPM/LPS patches ---

Bug 215131 - System freeze (ASPM L1 deadlock)
  Title:    "Realtek 8821CE makes the system freeze after 9e2fd29864c5
             (rtw88: add napi support)"
  Reporter: Kai-Heng Feng (Canonical)
  Kernel:   5.15+
  Symptoms: Hard freeze preceded by "pci bus timeout, check dma status"
            warnings. RX tag mismatch in rtw_pci_dma_check().
  Workaround confirmed by reporter: rtw88_pci.disable_aspm=1
  Reporter note: "disable_aspm=1 is not a viable workaround because
                  it increases power consumption significantly"
  Status:   OPEN since 2021-11-24.
  Link:     https://bugzilla.kernel.org/show_bug.cgi?id=215131
  Relevance: Identical root cause to Bug 221195. The reporter's
             confirmed workaround (disable_aspm=1) is exactly what
             the DMI quirk implements.

Bug 219830 - h2c/LPS failures + BT crackling
  Title:    "rtw88_8821ce (RTL8821CE) slow performance and error
             messages in dmesg"
  Reporter: Bmax Y14 laptop, Fedora 41, kernel 6.13.5
  Symptoms: - "failed to send h2c command" (periodic)
            - "firmware failed to leave lps state" (periodic)
            - Lower signal strength vs Windows
            - Bluetooth crackling during audio playback
  Cross-ref: https://github.com/lwfinger/rtw88/issues/306
  Status:   OPEN since 2025-03-02.
  Link:     https://bugzilla.kernel.org/show_bug.cgi?id=219830
  Relevance: The h2c/lps errors are the same messages we observed
             before the DMI quirk disabled LPS Deep Mode. The BT
             crackling may correlate with the invalid RX rate
             condition addressed by your rx.c validation patch.

Bug 218697 - TX queue flush timeout during scan
  Title:    "rtw88_8821ce timed out to flush queue 2"
  Reporter: Arch Linux, kernel 6.8.4 / 6.8.5
  Symptoms: - "timed out to flush queue 2" every ~30 seconds
            - "failed to get tx report from firmware"
            - Stack trace: ieee80211_scan_work -> rtw_ops_flush ->
              rtw_mac_flush_queues timeout
  Status:   OPEN since 2024-04-08.
  Link:     https://bugzilla.kernel.org/show_bug.cgi?id=218697
  Relevance: The flush timeout occurs when the firmware cannot
             respond to TX queue operations -- consistent with
             firmware being stuck in LPS Deep during scan.

--- Potentially related (no confirmed workaround data) ---

Bug 217491 - "timed out to flush queue 1" regression (kernel 6.3)
  Manjaro user. Floods of "timed out to flush queue 1/2".
  Similar pattern to Bug 218697.
  Link: https://bugzilla.kernel.org/show_bug.cgi?id=217491

Bug 217781 - Random sudden dropouts
  Arch user. Random disconnections during streaming/transfers.
  Reproduced on Ubuntu and Fedora (kernels 5.15 to 6.4).
  Link: https://bugzilla.kernel.org/show_bug.cgi?id=217781

Bug 216685 - Low wireless speed
  Reduced throughput vs expected 802.11ac performance.
  Link: https://bugzilla.kernel.org/show_bug.cgi?id=216685

2. SYMPTOM-TO-PATCH MAPPING
=============================

dmesg signature                    Patch that resolves it
--------------------------         ----------------------
Hard system freeze                 pci.c DMI quirk (disable ASPM)
"pci bus timeout, check dma"       pci.c DMI quirk (disable ASPM)
"firmware failed to leave lps"     pci.c DMI quirk (disable LPS Deep)
"failed to send h2c command"       pci.c DMI quirk (disable LPS Deep)
"timed out to flush queue N"       pci.c DMI quirk (disable LPS Deep) [1]
"failed to get tx report"          pci.c DMI quirk (disable LPS Deep) [1]
VHT NSS=0 mac80211 WARNING         rx.c rate validation (v2)

Confirmed in bugs: 215131, 219830, 218697, 221195.
[1] Inferred: flush timeout occurs when firmware cannot exit LPS
    to process TX queue operations.
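
The mapping above can be turned into a small triage helper for dmesg lines. This is only a sketch with hypothetical names. It covers the rows that have a stable text signature and leaves out the hard freeze (nothing is logged) and the mac80211 WARNING (its file:line location varies by kernel version).

```c
#include <stddef.h>
#include <string.h>

enum sketch_fix {
	SKETCH_FIX_NONE,
	SKETCH_FIX_DISABLE_ASPM,
	SKETCH_FIX_DISABLE_LPS_DEEP
};

struct sketch_signature {
	const char *needle;   /* substring to look for in a dmesg line */
	enum sketch_fix fix;  /* workaround the table above associates with it */
};

static const struct sketch_signature sketch_table[] = {
	{ "pci bus timeout, check dma",   SKETCH_FIX_DISABLE_ASPM },
	{ "firmware failed to leave lps", SKETCH_FIX_DISABLE_LPS_DEEP },
	{ "failed to send h2c command",   SKETCH_FIX_DISABLE_LPS_DEEP },
	{ "timed out to flush queue",     SKETCH_FIX_DISABLE_LPS_DEEP }, /* inferred, see [1] */
	{ "failed to get tx report",      SKETCH_FIX_DISABLE_LPS_DEEP }, /* inferred, see [1] */
};

/* Returns the first matching entry's workaround, or SKETCH_FIX_NONE. */
static enum sketch_fix sketch_classify_line(const char *line)
{
	size_t i;

	if (!line)
		return SKETCH_FIX_NONE;

	for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++) {
		if (strstr(line, sketch_table[i].needle))
			return sketch_table[i].fix;
	}
	return SKETCH_FIX_NONE;
}
```

A match only suggests which workaround is worth trying. It does not prove the same root cause on another platform.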

3. AFFECTED HARDWARE
=====================

Current DMI quirk coverage: HP P3S95EA#ACB only.

Platforms confirmed affected in Bugzilla:
  Bug 221195: HP Notebook 81F0 (P3S95EA#ACB)
  Bug 215131: unknown (Canonical upstream testing)
  Bug 219830: Bmax Y14
  Bug 218697: unknown (Arch Linux user)

Platforms reported in community forums as requiring
disable_aspm=Y and/or disable_lps_deep=Y for stable RTL8821CE:
  - HP 17-by4063CL
  - Lenovo IdeaPad S145-15AST, IdeaPad 3, IdeaPad 330S
  - ASUS VivoBook X series
  - Acer Aspire 3/5 series

All use PCI Device ID 10ec:c821 with different Subsystem IDs.

4. LPS_DEEP_MODE_LCLK IN THE rtw88 TREE
=========================================

I verified in the source which chips have the same
lps_deep_mode_supported flag:

Chip file       lps_deep_mode_supported            PCIe variant
---------       -----------------------            ------------
rtw8821c.c      BIT(LPS_DEEP_MODE_LCLK)            rtw8821ce ✓
rtw8822c.c      BIT(LPS_DEEP_MODE_LCLK) | PG       rtw8822ce ✓
rtw8822b.c      BIT(LPS_DEEP_MODE_LCLK)            rtw8822be ✓
rtw8814a.c      BIT(LPS_DEEP_MODE_LCLK)            rtw8814ae ✓
rtw8723d.c      0                                  rtw8723de ✗
rtw8703b.c      0                                  (SDIO)    -
rtw8821a.c      0                                  (legacy)  -

Source references:
  rtw8821c.c:2002  rtw8822c.c:5365  rtw8822b.c:2545
  rtw8814a.c:2211  rtw8723d.c:2144

RTL8822CE community reports (Manjaro, Arch forums) confirm the
same disable_aspm=Y + disable_lps_deep=Y workaround is effective,
consistent with rtw8822c.c having LCLK enabled.
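
For clarity on how I read the table: lps_deep_mode_supported is a bitmask, and a mode counts as supported when its bit is set. Below is a minimal sketch of that test. The enum values and names are my assumptions for illustration, and the driver's own definitions are authoritative.

```c
#include <stdint.h>

#define SKETCH_BIT(n) (1u << (n))

/* Assumed numbering of the deep power-save modes. */
enum sketch_lps_deep_mode {
	SKETCH_LPS_DEEP_MODE_NONE = 0,
	SKETCH_LPS_DEEP_MODE_LCLK = 1,
	SKETCH_LPS_DEEP_MODE_PG   = 2
};

/* Returns 1 if the chip's mask advertises the requested mode. */
static int sketch_deep_mode_supported(uint8_t supported_mask,
				      enum sketch_lps_deep_mode mode)
{
	return (supported_mask & SKETCH_BIT(mode)) != 0;
}
```

Under this reading, a chip whose mask is 0 (for example rtw8723d.c in the table) never enters LCLK deep mode, while the four chips with the LCLK bit set can.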

5. COMMUNITY WORKAROUND REFERENCES
====================================

The following are concrete examples of forums and wikis where the
same modprobe workarounds are actively recommended:

Arch Wiki - RTW88 section:
  https://wiki.archlinux.org/title/Network_configuration/Wireless
  (section "RTW88" and "rtl8821ce" under Troubleshooting/Realtek)
  Recommends rtw88-dkms-git and documents the rtw88_8821ce issues.

Arch Wiki - RTW89 section (same page):
  Documents the identical ASPM pattern for the newer RTW89 driver:
    options rtw89_pci disable_aspm_l1=y disable_aspm_l1ss=y
    options rtw89_core disable_ps_mode=y
  This suggests the ASPM/LPS interaction is a systemic Realtek
  design issue, not specific to rtw88 or the 8821CE chip.

Arch Linux Forum - RTL8821CE thread:
  https://bbs.archlinux.org/viewtopic.php?id=273440
  Referenced by the Arch Wiki as the primary rtl8821ce discussion.

lwfinger/rtw88 GitHub:
  https://github.com/lwfinger/rtw88/issues/306
  Directly cross-referenced by Bug 219830. The reporter describes h2c
  errors and investigated the antenna hardware (no fault found).

lwfinger/rtw89 GitHub:
  https://github.com/lwfinger/rtw89/issues/275#issuecomment-1784155449
  Same ASPM-disable pattern documented for the newer RTW89 driver
  on HP and Lenovo laptops.

6. OBSERVATIONS
================

a) Three open Bugzilla reporters (215131, 219830, 218697) show
   symptoms identical to those resolved by your patches but have
   no upstream fix available, since their platforms are not the
   HP P3S95EA#ACB.

b) Bug 215131 reporter (Kai-Heng Feng, Canonical) explicitly
   confirmed disable_aspm=1 as a workaround and called it
   "not viable" due to power cost. A DMI quirk for their
   platform would give them a proper fix.

c) The Arch Wiki documents the same ASPM-disable pattern for
   both RTW88 and RTW89 drivers, suggesting the underlying
   hardware/firmware limitation is shared across generations.

d) Asking Bugzilla reporters to provide their DMI data
   (dmidecode -t 1,2) could allow extending the quirk table
   with minimal effort and zero risk to unaffected platforms.

e) The rx.c rate validation patch is chip-agnostic and requires
   no platform-specific consideration.
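
To illustrate observation (d), here is a user-space sketch of what extending a quirk table amounts to. In the kernel this would be a dmi_system_id table with DMI_MATCH entries. The struct, flag names, and lookup function below are hypothetical simplifications, and the only entry is the SKU already covered by the quirk.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_QUIRK_DISABLE_ASPM      (1u << 0)
#define SKETCH_QUIRK_DISABLE_LPS_DEEP  (1u << 1)

/* Hypothetical, simplified stand-in for a DMI quirk entry. */
struct sketch_quirk {
	const char *sku;      /* product SKU string from DMI */
	unsigned int flags;   /* workarounds to apply on this platform */
};

static const struct sketch_quirk sketch_quirks[] = {
	{ "P3S95EA#ACB",
	  SKETCH_QUIRK_DISABLE_ASPM | SKETCH_QUIRK_DISABLE_LPS_DEEP },
	/* Further SKUs would be appended once reporters supply DMI data. */
};

/* Returns the quirk flags for a SKU, or 0 for unlisted platforms. */
static unsigned int sketch_lookup_quirks(const char *sku)
{
	size_t i;

	if (!sku)
		return 0;

	for (i = 0; i < sizeof(sketch_quirks) / sizeof(sketch_quirks[0]); i++) {
		if (strcmp(sketch_quirks[i].sku, sku) == 0)
			return sketch_quirks[i].flags;
	}
	return 0;
}
```

Because unlisted SKUs get 0, adding an entry changes behavior only on the platform it names, which is why I describe the approach as low risk for unaffected systems.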

7. REFERENCES
==============

Kernel Bugzilla:
  https://bugzilla.kernel.org/show_bug.cgi?id=215131
  https://bugzilla.kernel.org/show_bug.cgi?id=219830
  https://bugzilla.kernel.org/show_bug.cgi?id=218697
  https://bugzilla.kernel.org/show_bug.cgi?id=217491
  https://bugzilla.kernel.org/show_bug.cgi?id=217781
  https://bugzilla.kernel.org/show_bug.cgi?id=216685

GitHub:
  https://github.com/lwfinger/rtw88/issues/306
  https://github.com/lwfinger/rtw89/issues/275

Arch Wiki:
  https://wiki.archlinux.org/title/Network_configuration/Wireless

Arch Linux Forum:
  https://bbs.archlinux.org/viewtopic.php?id=273440

Source code (lps_deep_mode_supported):
  drivers/net/wireless/realtek/rtw88/rtw8821c.c:2002
  drivers/net/wireless/realtek/rtw88/rtw8822c.c:5365
  drivers/net/wireless/realtek/rtw88/rtw8822b.c:2545
  drivers/net/wireless/realtek/rtw88/rtw8814a.c:2211
  drivers/net/wireless/realtek/rtw88/rtw8723d.c:2144

Best regards,
Oleksandr Havrylov <goainwo@gmail.com>
