From: LB F <goainwo@gmail.com>
To: Ping-Ke Shih <pkshih@realtek.com>
Cc: "linux-wireless@vger.kernel.org" <linux-wireless@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)
Date: Thu, 12 Mar 2026 23:42:14 +0200
Message-ID: <CALdGYqQykO9ZzO=-+D17R_8LC=Win5nGN6-9zFqChtNEyUzEfg@mail.gmail.com>
In-Reply-To: <458ed80e39734ea99610050140bb31ce@realtek.com>
Ping-Ke Shih <pkshih@realtek.com> wrote:
> I'm really not sure how/why kernel becomes frozen. As I mentioned before
> it might because of received malformed data and no complete validation
> before reporting RX packet to mac80211.
> Not sure if you can try to dig and add some validation?
I reviewed both rx.c and pci.c in detail and found a genuine validation
gap that affects the 8821CE chip.
In rtw_pci_rx_napi() (pci.c), the RX path allocates a new skb based
on the pkt_len field from the RX descriptor:

    new_len = pkt_stat.pkt_len + pkt_offset;
    new = dev_alloc_skb(new_len);
    skb_put_data(new, skb->data, new_len);
    /* ... */
    skb_pull(new, pkt_offset);
    ieee80211_rx_napi(rtwdev->hw, NULL, new, napi);

If pkt_stat.pkt_len is zero, new_len equals pkt_offset, skb_put_data
copies only the descriptor header, and skb_pull then removes that header
-- leaving an empty skb (len=0) that is passed unconditionally to
ieee80211_rx_napi() with no length guard.
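
To make the arithmetic explicit, here is a small user-space model of the
two lengths involved. It is only a sketch: the struct and function names
are hypothetical, and the offset of 24 used in the usage note is an
example value, not something read from the hardware.

```c
#include <stddef.h>

/*
 * User-space model of the length arithmetic in rtw_pci_rx_napi().
 * The struct and function names are hypothetical; only pkt_len,
 * pkt_offset and new_len mirror the driver snippet quoted above.
 */
struct rx_copy_model {
	size_t new_len;    /* bytes copied by skb_put_data() */
	size_t len_to_mac; /* skb length left after skb_pull(new, pkt_offset) */
};

static struct rx_copy_model rx_copy_lengths(size_t pkt_len, size_t pkt_offset)
{
	struct rx_copy_model m;

	m.new_len = pkt_len + pkt_offset;      /* descriptor + payload */
	m.len_to_mac = m.new_len - pkt_offset; /* payload only */
	return m;
}
```

With pkt_len = 0 and an example pkt_offset of 24, rx_copy_lengths()
yields new_len = 24 and len_to_mac = 0, i.e. the empty skb described
above.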
Protection already exists for the 8703B chip in rtw_rx_fill_rx_status():

    if (rtwdev->chip->id == RTW_CHIP_TYPE_8703B && pkt_stat->pkt_len == 0) {
        rx_status->flag |= RX_FLAG_NO_PSDU;
        rtw_dbg(rtwdev, RTW_DBG_RX, "zero length packet");
    }

No equivalent check exists for RTW_CHIP_TYPE_8821C, the chip id used by
the 8821CE. Removing the chip-id restriction would be a minimal fix that
covers all chips:

--- a/rx.c
+++ b/rx.c
-    if (rtwdev->chip->id == RTW_CHIP_TYPE_8703B && pkt_stat->pkt_len == 0) {
+    if (pkt_stat->pkt_len == 0) {

I also checked PHY-level error counters from debugfs during normal
operation (phy_info):

    OFDM cnt (ok, err) = (867, 11)  ->  1.3% PHY CRC error rate
    VHT cnt  (ok, err) = (267, 32)  -> 10.7% PHY CRC error rate

Frames with crc_err are passed to mac80211 with RX_FLAG_FAILED_FCS_CRC
set (not dropped by the driver), which is the correct approach.
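
For reference, the percentages above are err / (ok + err). A minimal
sketch of that calculation (the function name is mine, not something
from the driver):

```c
/*
 * PHY CRC error rate as a percentage of all counted frames.
 * Hypothetical helper; ok and err are the two debugfs counters.
 */
static double crc_err_rate_pct(unsigned int ok, unsigned int err)
{
	unsigned int total = ok + err;

	if (total == 0)
		return 0.0; /* no frames counted yet */
	return 100.0 * (double)err / (double)total;
}
```

crc_err_rate_pct(867, 11) is about 1.25 and crc_err_rate_pct(267, 32)
is about 10.70, which round to the 1.3% and 10.7% quoted above.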
However, I do not believe the freeze is caused by malformed RX data.
The freeze occurs deterministically about 10 seconds after the system
becomes fully idle with zero active network traffic, which matches the
LPS_DEEP_MODE_LCLK entry sequence rather than a random data corruption
pattern. The freeze behaviour also disappears entirely when ASPM L1 is
disabled (as confirmed by the Live USB logs I provided earlier), which
is the hallmark of a PCIe bus gating deadlock, not a data path issue.
> Are the 'h2c' timeout messages flooding? or appears periodically? Does it
> really affect connection stable?
The errors appear periodically in bursts during idle; network
connectivity is never affected (parallel ping tests show 0% packet
loss). The flooding documented in previous tests (hundreds per minute)
was observed under conditions where the LPS state machine had reached
a persistent failure mode after extended uptime. In shorter tests from
a fresh module load, the errors are sporadic (3-5 per 10 minutes).
> If you change another AP or connection on 5GHz band, does the messages
> still present?
Yes. The issue has persisted for 2 years across 3 completely different
Access Points. It is reproducible on 5GHz only (2.4GHz is disabled on
all my networks).
> I think it isn't easy to find out the cause without measuring hardware
> signals, since I saw the message very very rare. So, I'd adopt your
> suggestion (dynamic LPS_DEEP_MODE_NONE) if the test is positive.
The test is definitively positive.
Test environment: stock CachyOS 6.19.6 kernel, PCIe ASPM L1 confirmed
ENABLED via lspci ('LnkCtl: ASPM L1 Enabled'), no out-of-tree patches.
The rtw88 module stack was fully reloaded (including rtw88_core) for
each scenario. The disable_lps_deep parameter, which belongs to
rtw88_core, was verified via /sys/module/rtw88_core/parameters/
before and after each reload.
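
The sysfs check can also be scripted. Below is a minimal sketch of such
a reader, assuming the parameter file holds 'Y' or 'N' as its first
character, which is how the kernel prints boolean module parameters.
The function name is hypothetical.

```c
#include <stdio.h>

/*
 * Read a boolean module parameter file such as
 * /sys/module/rtw88_core/parameters/disable_lps_deep.
 * Returns 1 for 'Y', 0 for 'N', and -1 if the file cannot be
 * opened or starts with anything else.
 */
static int read_bool_param(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);

	if (c == 'Y')
		return 1;
	if (c == 'N')
		return 0;
	return -1;
}
```

A return value of -1 for the rtw88_core path would indicate that the
module is not loaded, which is a useful sanity check between the unload
and reload steps.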
Test protocol: after module reload and Wi-Fi reconnect (verified via
HTTP 204 check), a 5-minute warm-up period elapsed before the
5-minute measurement window began. This ensures the firmware's LPS
state machine has fully initialised before results are recorded.
Methodology verified: 'modprobe -r rtw88_8821ce' removes only the
chip-specific modules, leaving rtw88_core in memory. The correct
procedure used was to explicitly also remove rtw88_core, then reload
all modules with the desired parameter.
Results (battery power, true idle in each run):

  disable_lps_deep=N (default):
    Warm-up (5 min, cumulative):  h2c=4  lps=0
    Measurement (5 min):          h2c=0  lps=0   [errors are bursty]

  disable_lps_deep=Y (confirmed via sysfs):
    Warm-up (5 min, cumulative):  h2c=0  lps=0
    Measurement (5 min):          h2c=0  lps=0
    All 10 minutes:               h2c=0

With disable_lps_deep=Y, not a single h2c timeout was recorded across
the entire 10-minute observation window (warm-up + measurement). With
disable_lps_deep=N, errors appeared within the first 5 minutes of idle.
Setting disable_lps_deep=Y completely eliminates the firmware timeout
loop, confirming that the root cause is the firmware attempting
LPS_DEEP_MODE_LCLK while PCIe constraints prevent it from completing.
Dynamic LPS_DEEP_MODE_NONE for the ASPM DMI quirk entry is the correct
and complete architectural solution.
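
To illustrate the intended decision, here is a user-space sketch of the
selection logic. It is not driver code: the two mode names are the ones
discussed in this thread, but the enum values, the function name and
both boolean inputs are hypothetical, and how the DMI match itself is
obtained is deliberately left out.

```c
#include <stdbool.h>

/* Mode names as discussed in this thread; the values are illustrative. */
enum lps_deep_mode {
	LPS_DEEP_MODE_NONE,
	LPS_DEEP_MODE_LCLK,
};

/*
 * Hypothetical selection helper: fall back to LPS_DEEP_MODE_NONE when
 * either the module parameter or a platform ASPM quirk asks for it,
 * otherwise keep the LCLK deep mode.
 */
static enum lps_deep_mode pick_lps_deep_mode(bool disable_lps_deep,
					     bool aspm_quirk_match)
{
	if (disable_lps_deep || aspm_quirk_match)
		return LPS_DEEP_MODE_NONE;
	return LPS_DEEP_MODE_LCLK;
}
```

The point of the sketch is that machines without a quirk entry keep
their current behaviour, while quirked machines get the same result as
loading with disable_lps_deep=Y.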
--- Technical Appendix: RX Validation Audit Findings ---
I performed a deep audit of the RX descriptor parsing logic in rx.c and pci.c.
I found two concrete areas where validation is incomplete for the 8821CE:
1. Out-of-Bounds Read in rtw_pci_rx_napi (pci.c):
   The DMA buffer size is fixed at ~11.5KB (RTK_PCI_RX_BUF_SIZE). However,
   the hardware descriptor field (W0_PKT_LEN) is 14 bits wide, so it can
   indicate a length of up to 16383 bytes (just under 16KB). The driver
   calculates new_len = pkt_stat.pkt_len + pkt_offset and calls
   skb_put_data(new, skb->data, new_len) without checking whether new_len
   exceeds the DMA source buffer. If the hardware reports a malformed,
   overly large length, this leads to an OOB read of adjacent memory.
2. Missing 8821CE guard in rtw_rx_fill_rx_status (rx.c):
   The check for pkt_len == 0 (which results in an empty skb being passed
   to mac80211) is explicitly restricted to RTW_CHIP_TYPE_8703B:

       if (rtwdev->chip->id == RTW_CHIP_TYPE_8703B && pkt_stat->pkt_len == 0)

   Expanding this guard to all chips (or specifically to the 8821CE) would
   be safer.
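
As a rough illustration of both findings, a pre-copy length check could
look like the user-space sketch below. The function and macro names are
hypothetical, the buffer size is passed in rather than hard-coded, and
rejecting zero-length frames outright is only one option (the existing
8703B path flags them with RX_FLAG_NO_PSDU instead).

```c
#include <stdbool.h>
#include <stddef.h>

/* Largest value a 14-bit length field can carry (16383). */
#define PKT_LEN_FIELD_MAX ((1u << 14) - 1)

/*
 * Hypothetical pre-copy check: reject zero-length frames and any
 * length that would make the copy read past the DMA source buffer.
 * buf_size is the size of that buffer in bytes.
 */
static bool rx_len_valid(size_t pkt_len, size_t pkt_offset, size_t buf_size)
{
	if (pkt_len == 0 || pkt_len > PKT_LEN_FIELD_MAX)
		return false;
	if (pkt_offset > buf_size)
		return false;
	/* Written as a subtraction so the comparison cannot overflow. */
	return pkt_len <= buf_size - pkt_offset;
}
```

With an example buffer of 11500 bytes and an offset of 24, a 1500-byte
frame passes, while pkt_len = 0 and pkt_len = 16383 are both rejected.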
While these vulnerabilities exist, I still believe the freeze is
PCIe-timing related (LCLK entry/ASPM conflict), as no RX-related warnings
or memory corruption traces were found in dmesg prior to the hard freeze.

Best regards,
Oleksandr Havrylov