public inbox for linux-mm@kvack.org
From: Tejun Heo <tj@kernel.org>
To: Ben Greear <greearb@candelatech.com>
Cc: Johannes Berg <johannes@sipsolutions.net>,
	linux-wireless <linux-wireless@vger.kernel.org>,
	Miriam Rachel <miriam.rachel.korenblit@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.
Date: Tue, 10 Mar 2026 08:06:08 -1000	[thread overview]
Message-ID: <5b9b93df8774810a43fceb359906604b@kernel.org> (raw)
In-Reply-To: <bba74cab-7305-a052-7e1c-7a7736ba4531@candelatech.com>

[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]

Hello,

Thanks for the detailed dump. One thing that doesn't look right is the
number of pending work items on pool 22 (CPU 5). The pool reports 2 idle
workers, yet there are 7+ work items sitting in the pending list across
multiple workqueues. If the pool were making forward progress, those items
would have been picked up by the idle workers. So, the pool itself seems to
be stuck for some reason, and the cfg80211 mutex stall may be a consequence
rather than the cause.

Let's try using drgn on the crash dump. I'm attaching a prompt that you can
feed to Claude (or any LLM with tool access to drgn). It contains workqueue
internals documentation, drgn code snippets, and a systematic investigation
procedure. The idea is:

1. Generate the crash dump when the deadlock is happening:

     echo c > /proc/sysrq-trigger

2. After the crash kernel boots, create the dump file:

     makedumpfile -c -d 31 /proc/vmcore /tmp/vmcore.dmp

3. Feed the attached prompt to Claude with drgn access to the dump. It
   should produce a Markdown report with its findings that you can post
   back here.

This is a bit experimental, so let's see whether it works. Either way, the
report should at least give us concrete data points to work with.

Thanks.

-- 
tejun

[-- Attachment #2: wq-drgn-prompt.txt --]
[-- Type: text/plain, Size: 27606 bytes --]

# Workqueue Lockup Investigation with drgn

You are investigating a Linux kernel workqueue lockup using drgn on a crash
dump. The system reported a workqueue pool stall on CPU 5 with `reg_todo`
[cfg80211] stuck for ~57500 seconds. Your job is to determine the root cause.

## HOW TO RUN DRGN

```bash
# Install drgn (if not already installed):
#   pip3 install drgn
#   OR on Fedora: dnf install drgn

# Run drgn on a crash dump:
drgn -c /path/to/vmcore

# If symbols aren't found automatically, point to the vmlinux:
drgn -c /path/to/vmcore -s /path/to/vmlinux

# For module debug info, also point at the module directory of the crashed
# kernel (note: $(uname -r) only matches if the analysis host runs the same
# kernel version as the dump):
drgn -c /path/to/vmcore -s /path/to/vmlinux -s /lib/modules/$(uname -r)/

# Inside drgn, 'prog' is the program object. You can run Python interactively
# or pass a script file as a positional argument:
drgn -c /path/to/vmcore -s /path/to/vmlinux my_script.py
```

All code blocks in this document are Python code to run inside the drgn
interactive shell or from a script file.

## METHODOLOGY — READ THIS FIRST

**CRITICAL RULES — violating these will produce wrong conclusions:**

1. **NEVER jump on any specific lead without concrete evidence.** Do not
   assume you know the answer from the dmesg alone. The dmesg gives you a
   starting point, not a conclusion.

2. **Draw conclusions if and only if the hard facts support them.** Every
   claim you make must be backed by specific drgn output — an address, a
   value, a stack trace. If you cannot show the evidence, say "I don't have
   evidence for this" and move on.

3. **Present results and thought process with specific, concrete evidence.**
   Show the drgn commands you ran and the relevant output. Then explain what
   that output means. Evidence first, interpretation second.

4. **Think holistically — do NOT separate workqueue stall from lock stalls.**
   A stuck workqueue pool can CAUSE what looks like deadlocks elsewhere.
   Work items that are expected to run but cannot (because the pool is
   stuck) will stall anything waiting on their completion. A mutex holder
   might be waiting for a work item that will never run. What looks like a
   "deadlock" might actually be a consequence of the pool stall, not the
   cause. Always consider both directions of causality.

5. **Check everything systematically.** Do not skip steps because you think
   you already know the answer. Complete Phase 1 fully before moving to
   Phase 2. If the pool has multiple pending work items that are not being
   processed, the pool IS stuck — do not dismiss this as "a transient
   snapshot."

## WORKQUEUE ARCHITECTURE

### Overview

The Linux workqueue subsystem processes deferred work using kernel threads
(workers) organized into pools:

- **worker_pool**: A group of kernel threads (workers) that share a worklist.
  Each CPU has two standard pools: pool[0] (normal priority) and pool[1]
  (high priority). Unbound pools serve work not tied to a specific CPU.

- **workqueue_struct**: A named workqueue (e.g., "events",
  "events_power_efficient"). Each workqueue connects to pools via
  pool_workqueue (pwq) structures — one pwq per pool the workqueue uses.
  Multiple workqueues share the same underlying pool.

- **pool_workqueue (pwq)**: Links a workqueue to a pool. Tracks nr_active
  (how many work items from this workqueue are active in the pool) and
  enforces max_active limits. Work items exceeding max_active go to
  pwq->inactive_works instead of pool->worklist.

- **worker**: A kernel thread that picks work from pool->worklist and
  executes it. Workers are either idle (on pool->idle_list) or busy
  (in pool->busy_hash, executing a work item).

### Concurrency Management (CMWQ)

For bound (per-CPU) pools, the workqueue uses a concurrency management
protocol based on `pool->nr_running`:

- **nr_running** counts workers actively running on CPU (not sleeping, not
  idle, not marked CPU_INTENSIVE).

- When a worker sleeps (e.g., waiting on a mutex), the scheduler calls
  `wq_worker_sleeping()` which decrements nr_running. If nr_running hits 0
  and there is pending work, `kick_pool()` wakes an idle worker.

- When a worker wakes up, `wq_worker_running()` increments nr_running.

- The decision functions:
  - `need_more_worker(pool)`: `!list_empty(&pool->worklist) && !pool->nr_running`
    — need a worker if work is pending AND nobody is running.
  - `may_start_working(pool)`: `pool->nr_idle` — can proceed only if there
    are idle workers remaining (so there's always a reserve).
  - `keep_working(pool)`: `!list_empty(&pool->worklist) && pool->nr_running <= 1`
    — current worker keeps going if work pending and it's the only runner.

- **Key insight**: If nr_running > 0, the pool assumes someone is handling
  work and does NOT wake idle workers, even if work is pending. A stuck
  nr_running > 0 with no worker actually on CPU would prevent all forward
  progress.
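The effect of these predicates can be seen in a tiny pure-Python model (a
deliberate simplification for illustration; the real predicates in
kernel/workqueue.c operate on a struct worker_pool under pool->lock):

```python
# Toy model of the CMWQ decision predicates described above.
def need_more_worker(worklist, nr_running):
    # Wake an idle worker only if work is pending AND nobody is running.
    return bool(worklist) and nr_running == 0

def keep_working(worklist, nr_running):
    # Current worker keeps processing while work is pending and it is
    # (at most) the only runner.
    return bool(worklist) and nr_running <= 1

# A stuck nr_running > 0 with pending work means nobody gets woken:
print(need_more_worker(["reg_todo"], nr_running=1))  # False -> pool stalls
print(need_more_worker(["reg_todo"], nr_running=0))  # True  -> worker woken
```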

### Worker Lifecycle and the 2-Idle-Worker Invariant

The worker_thread() main loop:
1. Wake up, leave idle state (nr_idle--)
2. Check `need_more_worker()` — if no work or nr_running > 0, go to sleep
3. Check `may_start_working()` — if nr_idle == 0, become manager and create
   new workers before proceeding
4. Clear PREP flag, enter concurrency management (nr_running++)
5. Process work items from pool->worklist in a loop
6. When done, enter idle state (nr_idle++) and sleep

**The 2-idle-worker invariant**: The pool maintains at least 2 idle workers
(enforced by `too_many_workers()` which only culls when nr_idle > 2). This
ensures that when one idle worker wakes to process work (step 1: nr_idle--),
there is still at least one idle worker remaining. If nr_idle hits 0, the
woken worker must become the "manager" and create new workers before it can
process any work.

**Worker creation** (`create_worker()`): Allocates memory with GFP_KERNEL
and calls `kthread_create_on_node()`. Both operations can stall indefinitely
if the system is under memory pressure and reclaim is not making progress.
Small GFP_KERNEL allocations almost never fail; instead they block in the
allocator waiting for pages. If memory reclaim is broken for any reason,
`create_worker()` will hang indefinitely, preventing the pool from recovering
once it runs out of idle workers.

**Mayday/rescuer mechanism**: If `create_worker()` cannot make progress, the
pool's mayday_timer fires and sends distress signals to workqueues that have
WQ_MEM_RECLAIM set. Those workqueues have a dedicated rescuer thread that
can process their work items without needing new workers. However, regular
workqueues like "events" do NOT have rescuers — if they run out of workers,
they are stuck.

### Watchdog

The pool watchdog checks whether pool->watchdog_ts has advanced. watchdog_ts
is updated each time a worker picks up a new work item from the worklist.
If watchdog_ts hasn't advanced for wq_watchdog_thresh seconds (default 30),
the pool is considered stalled. The "hung=Ns" in the stall warning shows
`jiffies - pool->watchdog_ts` converted to seconds.
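As a sketch of that arithmetic (HZ and BITS_PER_LONG here are assumptions for
a 64-bit x86 kernel; verify against the actual config), note that jiffies is
an unsigned long, so the subtraction must wrap:

```python
HZ = 1000           # assumption: CONFIG_HZ=1000
BITS_PER_LONG = 64  # assumption: 64-bit kernel

def hung_seconds(jiffies, watchdog_ts):
    # Unsigned wraparound, as the kernel's time_after() comparisons assume.
    delta = (jiffies - watchdog_ts) & ((1 << BITS_PER_LONG) - 1)
    return delta // HZ

print(hung_seconds(57_500_123, 123))  # 57500, matching the reported stall
```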

## DATA STRUCTURE REFERENCE (for drgn)

### Accessing Pools

```python
from drgn.helpers.linux.percpu import per_cpu
from drgn.helpers.linux.list import list_for_each_entry
from drgn import Object, cast

# Per-CPU normal-priority pool for CPU N:
pool = per_cpu(prog["cpu_worker_pools"], cpu)[0]

# Per-CPU high-priority pool for CPU N:
pool = per_cpu(prog["cpu_worker_pools"], cpu)[1]
```

### worker_pool fields
```
pool.cpu              — int, associated CPU (-1 for unbound)
pool.id               — int, pool ID
pool.nr_running       — int, workers currently running on CPU
pool.nr_workers       — int, total workers
pool.nr_idle          — int, currently idle workers
pool.worklist         — list_head, pending work items
pool.idle_list        — list_head, idle workers
pool.workers          — list_head, all workers (iterate via "node" member)
pool.busy_hash        — hashtable of busy workers
pool.manager          — struct worker *, current manager (or NULL)
pool.flags            — uint: POOL_MANAGER_ACTIVE=0x2, POOL_DISASSOCIATED=0x4
pool.watchdog_ts      — unsigned long, jiffies of last forward progress
pool.cpu_stall        — bool, set by watchdog when stalled
```

### worker fields (iterate via pool.workers, link member "node")
```
worker.task           — struct task_struct *, the kthread
worker.current_work   — struct work_struct *, work being executed (NULL if idle)
worker.current_func   — work_func_t, function of current work
worker.current_pwq    — struct pool_workqueue *, pwq of current work
worker.current_at     — u64, ktime at start of current work
worker.sleeping       — int, 1 if worker went to sleep (decremented nr_running)
worker.flags          — uint: WORKER_DIE=0x2, WORKER_IDLE=0x4, WORKER_PREP=0x8,
                        WORKER_CPU_INTENSIVE=0x40, WORKER_UNBOUND=0x80
worker.id             — int, worker ID (shows in task name as kworker/CPU:ID)
worker.last_active    — unsigned long, jiffies of last activity
worker.pool           — struct worker_pool *, associated pool
worker.scheduled      — list_head, scheduled works for this worker
```

### work_struct fields (iterate via pool.worklist, link member "entry")
```
work.data             — atomic_long_t, encodes pwq pointer + flags
work.func             — work_func_t, the function to execute
work.entry            — list_head, linkage in worklist

# Extracting pwq from work->data:
data = work.data.counter.value_()
WORK_STRUCT_PWQ_BIT = 1 << 2
WORK_STRUCT_PWQ_SHIFT = 8  # bits 0-7 are flags, bits 8+ are pwq pointer
if data & WORK_STRUCT_PWQ_BIT:
    pwq_addr = data & ~((1 << WORK_STRUCT_PWQ_SHIFT) - 1)
    pwq = Object(prog, "struct pool_workqueue", address=pwq_addr)
    wq_name = pwq.wq.name.string_().decode()
```
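The decoding above can be sanity-checked standalone with a synthetic value
(the address below is fake; the assumption is that pool_workqueue is
256-byte aligned, leaving bits 0-7 of work->data free for flags):

```python
WORK_STRUCT_PENDING_BIT = 1 << 0
WORK_STRUCT_PWQ_BIT = 1 << 2
WORK_STRUCT_PWQ_SHIFT = 8

pwq_addr = 0xffff88810a2b3c00  # fake, 256-byte-aligned pwq address
data = pwq_addr | WORK_STRUCT_PWQ_BIT | WORK_STRUCT_PENDING_BIT

assert data & WORK_STRUCT_PWQ_BIT
decoded = data & ~((1 << WORK_STRUCT_PWQ_SHIFT) - 1)
assert decoded == pwq_addr
print(hex(decoded))  # 0xffff88810a2b3c00
```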

### pool_workqueue fields
```
pwq.pool              — struct worker_pool *, the pool
pwq.wq                — struct workqueue_struct *, the workqueue
pwq.nr_active         — int, active work items from this wq in this pool
pwq.inactive_works    — list_head, work items waiting for nr_active < max_active
pwq.stats[]           — u64 array: [0]=STARTED, [1]=COMPLETED, [2]=CPU_TIME,
                        [3]=CPU_INTENSIVE, [4]=CM_WAKEUP, [5]=REPATRIATED,
                        [6]=MAYDAY, [7]=RESCUED
```

### workqueue_struct fields
```
wq.name               — char[], workqueue name
wq.flags              — uint, WQ_UNBOUND=0x2, WQ_FREEZABLE=0x4,
                        WQ_MEM_RECLAIM=0x8, WQ_HIGHPRI=0x10
wq.max_active         — int, max concurrent work items per pwq
wq.cpu_pwq            — per-cpu pointer to pwqs (for bound workqueues)
wq.pwqs               — list_head, all pwqs (iterate via "pwqs_node")
wq.rescuer            — struct worker * (non-NULL if WQ_MEM_RECLAIM)
```

### Mutex inspection
```python
# struct mutex has an owner field (atomic_long_t)
# Low 3 bits are flags, remaining bits are task_struct pointer
owner_val = mutex.owner.counter.value_()
owner_ptr = owner_val & ~0x7
if owner_ptr:
    owner_task = Object(prog, "struct task_struct", address=owner_ptr)
    print(f"mutex owner: {owner_task.comm.string_().decode()} "
          f"pid={owner_task.pid.value_()}")
    for frame in prog.stack_trace(owner_task):
        print(f"  {frame}")
```
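A quick standalone check of the masking with a synthetic address (the low 3
bits of mutex->owner hold the MUTEX_FLAG_* bits, never pointer bits, since
task_struct pointers are at least 8-byte aligned):

```python
MUTEX_FLAG_WAITERS = 0x1        # waiter(s) queued on the mutex
task_addr = 0xffff888123456780  # fake task_struct address, 8-byte aligned

owner_val = task_addr | MUTEX_FLAG_WAITERS
owner_ptr = owner_val & ~0x7
assert owner_ptr == task_addr
print(hex(owner_ptr))  # 0xffff888123456780
```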

### Stack traces
```python
# IMPORTANT: prog.stack_trace() takes a task_struct or PID, NOT a CPU number
# By task_struct pointer:
for frame in prog.stack_trace(task):
    print(frame)

# By PID:
for frame in prog.stack_trace(pid_number):
    print(frame)
```

### Jiffies time delta
```python
jiffies = prog["jiffies"].value_()
hz = 1000  # CONFIG_HZ, usually 1000 on x86 — verify with the kernel config
delta_jiffies = jiffies - pool.watchdog_ts.value_()
delta_seconds = delta_jiffies / hz
```

## INVESTIGATION PROCEDURE

### Phase 1: Is the workqueue pool stuck?

The dmesg says pool 22 (cpus=5, normal priority) is hung for ~57500s with
multiple pending work items. If a pool has pending work items that are not
being processed, the pool IS stuck. Do not dismiss pending items as "about
to be run" — at 57500s, anything pending is stuck.

Your goal in this phase is to determine WHY the pool is stuck: is it a
concurrency management state bug (nr_running wrong), a worker shortage
(no idle workers, can't create new ones), or something else?

**Step 1.1: Pool overview**
```python
cpu = 5
pool = per_cpu(prog["cpu_worker_pools"], cpu)[0]
print(f"Pool {pool.id.value_()} on CPU {pool.cpu.value_()}")
print(f"  nr_running:  {pool.nr_running.value_()}")
print(f"  nr_workers:  {pool.nr_workers.value_()}")
print(f"  nr_idle:     {pool.nr_idle.value_()}")
print(f"  flags:       0x{pool.flags.value_():x}")
print(f"  cpu_stall:   {pool.cpu_stall.value_()}")
print(f"  manager:     {pool.manager}")

jiffies = prog["jiffies"].value_()
wts = pool.watchdog_ts.value_()
print(f"  watchdog_ts: {wts} (jiffies={jiffies}, delta={jiffies - wts})")
print(f"  worklist empty: {pool.worklist.next.value_() == pool.worklist.address_of_().value_()}")
```

**What to look for:**
- **nr_running**: If > 0 but no worker is actually executing on CPU, the
  concurrency management thinks someone is running when nobody is. This
  would prevent idle workers from being woken. Check every worker to verify
  whether nr_running matches reality.
- **nr_idle**: If 0 and there's pending work, the pool has no reserve
  workers. Check if a manager is active trying to create one — and if so,
  what the manager is stuck on (likely a GFP_KERNEL allocation that won't
  return).
- **nr_workers**: Compare to nr_idle to see how many are busy.
- **flags**: Check POOL_MANAGER_ACTIVE (0x2) — is someone trying to create
  workers?

**Step 1.2: Enumerate ALL workers**
```python
WORKER_DIE = 0x2
WORKER_IDLE = 0x4
WORKER_PREP = 0x8
WORKER_CPU_INTENSIVE = 0x40
WORKER_UNBOUND = 0x80

for worker in list_for_each_entry("struct worker",
        pool.workers.address_of_(), "node"):
    task = worker.task
    pid = task.pid.value_()
    flags = worker.flags.value_()
    state = []
    if flags & WORKER_DIE: state.append("DIE")
    if flags & WORKER_IDLE: state.append("IDLE")
    if flags & WORKER_PREP: state.append("PREP")
    if flags & WORKER_CPU_INTENSIVE: state.append("CPU_INTENSIVE")
    if flags & WORKER_UNBOUND: state.append("UNBOUND")
    if not state: state.append("RUNNING")

    cur = worker.current_work
    func_name = str(worker.current_func) if int(cur) else "(none)"
    sleeping = worker.sleeping.value_()
    last_active = worker.last_active.value_()

    print(f"  worker {worker.id.value_()}: pid={pid}, "
          f"flags=0x{flags:x} [{','.join(state)}], "
          f"sleeping={sleeping}, last_active={last_active}, "
          f"current_func={func_name}")

    # Stack trace for ALL non-idle workers:
    if not (flags & WORKER_IDLE):
        try:
            print(f"    Stack trace:")
            for frame in prog.stack_trace(task):
                print(f"      {frame}")
        except Exception as e:
            print(f"    (stack trace failed: {e})")
    # Also check idle workers' task state — are they truly sleeping idle?
    else:
        tstate = task.__state.value_() if hasattr(task, '__state') else task.state.value_()
        print(f"    task state: {tstate}")
```

**What to look for:**
- For each non-idle worker: What is it doing? Is it sleeping on a lock?
  Is it stuck in an allocation? Is it the manager trying to create workers?
- For idle workers: Are they in TASK_IDLE state as expected? If not,
  something is wrong.
- Cross-check: Does the count of non-IDLE, non-PREP, non-CPU_INTENSIVE
  workers match nr_running? If not, the concurrency management state is
  inconsistent.
- Check worker.sleeping for non-idle workers: if sleeping==1, the worker
  went through wq_worker_sleeping() and decremented nr_running. If all
  non-idle workers have sleeping==1, nr_running should be 0.
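The cross-check in the last two bullets can be expressed as a small helper,
fed with the (flags, sleeping) pairs printed by the loop above. This is a
pure model of how nr_running is maintained per the CMWQ description earlier,
not kernel code:

```python
WORKER_IDLE, WORKER_PREP, WORKER_CPU_INTENSIVE = 0x4, 0x8, 0x40

def expected_nr_running(workers):
    """workers: list of (flags, sleeping) tuples from Step 1.2 output."""
    n = 0
    for flags, sleeping in workers:
        if flags & (WORKER_IDLE | WORKER_PREP | WORKER_CPU_INTENSIVE):
            continue  # these states are excluded from nr_running
        if sleeping:
            continue  # wq_worker_sleeping() already decremented nr_running
        n += 1
    return n

# One busy worker asleep on a mutex plus two idle workers: expect 0.
print(expected_nr_running([(0x0, 1), (0x4, 0), (0x4, 0)]))  # 0
```

If the pool's actual nr_running is higher than this count, the concurrency
management state is inconsistent (Scenario A below).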

**Step 1.3: Check pending work items**
```python
count = 0
for work in list_for_each_entry("struct work_struct",
        pool.worklist.address_of_(), "entry"):
    data = work.data.counter.value_()
    func = str(work.func)
    wq_name = "?"
    if data & (1 << 2):  # WORK_STRUCT_PWQ_BIT
        pwq_addr = data & ~((1 << 8) - 1)
        try:
            pwq = Object(prog, "struct pool_workqueue", address=pwq_addr)
            wq_name = pwq.wq.name.string_().decode()
        except:
            wq_name = f"(pwq@{pwq_addr:#x})"
    print(f"  [{count}] func={func}, wq={wq_name}")
    count += 1
print(f"Total pending: {count}")
```

These are work items waiting to be executed. If there are multiple items and
idle workers exist, something is preventing the workers from picking them up.
Correlate with nr_running and worker states from Step 1.2.

**Step 1.4: Check pwq statistics**

Find pwqs associated with this pool and check started vs completed counts:
```python
# Iterate all workqueues and find pwqs for this pool
pool_id = pool.id.value_()
for wq in list_for_each_entry("struct workqueue_struct",
        prog["workqueues"].address_of_(), "list"):
    try:
        pwq = per_cpu(wq.cpu_pwq, cpu)
        if pwq.pool.id.value_() == pool_id:
            started = pwq.stats[0].value_()
            completed = pwq.stats[1].value_()
            nr_active = pwq.nr_active.value_()
            if started > 0 or nr_active > 0:
                print(f"  wq '{wq.name.string_().decode()}': "
                      f"started={started}, completed={completed}, "
                      f"in_flight={started-completed}, "
                      f"nr_active={nr_active}, "
                      f"max_active={wq.max_active.value_()}")
    except:
        pass  # unbound workqueues don't have cpu_pwq
```

**Step 1.5: Determine the stall mechanism**

Based on the above evidence, determine WHICH of these scenarios applies:

**Scenario A — nr_running stuck > 0**: nr_running is positive but no worker
is actually executing on CPU. All non-idle workers have sleeping==1
(decremented nr_running) or are in PREP state, yet nr_running hasn't
reached 0. This prevents kick_pool() from waking idle workers. This would
be a concurrency management bug.

**Scenario B — No idle workers, manager stuck**: nr_idle is 0,
POOL_MANAGER_ACTIVE is set, and the manager worker is stuck trying to create
a new worker (blocked in GFP_KERNEL allocation or kthread_create). The pool
cannot make progress because there are no idle workers to wake and creating
new ones is blocked. Check the manager's stack trace to see where it's stuck.
If it's in the page allocator, this suggests system-wide memory pressure
where reclaim is not working.

**Scenario C — Workers all blocked on locks**: There are workers processing
work items, but every single one is sleeping on a mutex/lock. nr_running
correctly went to 0, idle workers were woken, but they too picked up work
items that immediately blocked. Eventually all workers are blocked and
no idle workers remain.

**Scenario D — Something else**: If none of the above, describe exactly what
you see and what doesn't add up.

### Phase 2: Investigate the broader stall

**IMPORTANT**: Do not assume the lock that reg_todo is waiting on is the
"root cause." The pool being stuck can CAUSE lock stalls elsewhere. Consider:

- If work item X is expected to release lock L, and work item X is pending
  on a stuck pool, then anything waiting on lock L will appear deadlocked —
  but the real cause is the pool stall, not a lock ordering bug.
- The cfg80211 reg_todo waiting on a mutex might be a VICTIM, not the cause.
  The mutex holder might itself be waiting for something that depends on the
  stuck pool.

**Step 2.1: Identify what the stuck worker(s) are waiting on**

For each non-idle worker found in Step 1.2, examine its stack trace. If it's
in `__mutex_lock`, `__rwsem_down_*`, `schedule_preempt_disabled`, or similar,
identify which lock:

```python
# For a worker stuck in __mutex_lock, try to extract the lock:
for frame in prog.stack_trace(stuck_task):
    if "mutex_lock" in str(frame):
        try:
            lock = frame["lock"]
            print(f"  Mutex at: {lock}")
            owner_val = lock.owner.counter.value_()
            owner_ptr = owner_val & ~0x7
            if owner_ptr:
                owner = Object(prog, "struct task_struct",
                               address=owner_ptr)
                print(f"  Owner: {owner.comm.string_().decode()} "
                      f"pid={owner.pid.value_()}")
                print(f"  Owner stack:")
                for f in prog.stack_trace(owner):
                    print(f"    {f}")
            else:
                print(f"  No owner (owner_val=0x{owner_val:x})")
        except Exception as e:
            print(f"  (couldn't extract lock: {e})")
```

**Step 2.2: Follow the dependency chain**

For each lock owner found above:
1. Is the owner running, sleeping, or in D state?
2. If sleeping — what is it waiting on? Another lock? A completion? I/O?
3. If waiting on another lock — find THAT lock's owner and repeat.
4. At each step: is the waited-on resource something that depends on a
   work item running? If so, check which pool/workqueue that work item
   would run on — is THAT pool also stuck?

Continue until you find either:
- A cycle (A→B→C→A) — true deadlock
- A task waiting on something that depends on the stuck pool — the pool
  stall is the root cause, the "deadlock" is a symptom
- A task waiting on something unrelated (I/O, userspace, etc.)

**Step 2.3: Check other pools and CPUs**

The stall might not be limited to pool 22. Check all CPU pools:
```python
nr_cpus = int(prog["nr_cpu_ids"])
jiffies = prog["jiffies"].value_()
for cpu in range(nr_cpus):
    pool = per_cpu(prog["cpu_worker_pools"], cpu)[0]
    nr_r = pool.nr_running.value_()
    nr_w = pool.nr_workers.value_()
    nr_i = pool.nr_idle.value_()
    wts = pool.watchdog_ts.value_()
    delta = (jiffies - wts) / 1000
    empty = pool.worklist.next.value_() == pool.worklist.address_of_().value_()
    if not empty or delta > 30:
        print(f"CPU {cpu}: pool {pool.id.value_()}, nr_running={nr_r}, "
              f"nr_workers={nr_w}, nr_idle={nr_i}, "
              f"hung={delta:.0f}s, worklist_empty={empty}")
```

**Step 2.4: Check for system-wide memory pressure**

If the pool stall involves a manager stuck in allocation:
```python
# Check memory state: global free pages. NR_FREE_PAGES is the first entry
# (index 0) of the zone_stat_item enum on current kernels.
free_pages = prog["vm_zone_stat"][0].counter.value_()
print(f"free pages: {free_pages} (~{free_pages >> 8} MiB with 4K pages)")

# Check if any workers are stuck in page allocation:
from drgn.helpers.linux.pid import for_each_task
alloc_stuck = []
for task in for_each_task(prog):
    tstate = task.__state.value_() if hasattr(task, '__state') else task.state.value_()
    if tstate != 0:  # not TASK_RUNNING
        try:
            for frame in prog.stack_trace(task):
                if "alloc_pages" in str(frame) or "__alloc_pages" in str(frame):
                    alloc_stuck.append(
                        f"{task.comm.string_().decode()} pid={task.pid.value_()}")
                    break
        except:
            pass
if alloc_stuck:
    print(f"Tasks stuck in page allocation: {len(alloc_stuck)}")
    for t in alloc_stuck[:20]:
        print(f"  {t}")
```

**Step 2.5: Check all D-state tasks for patterns**
```python
from drgn.helpers.linux.pid import for_each_task
d_state_tasks = []
for task in for_each_task(prog):
    tstate = task.__state.value_() if hasattr(task, '__state') else task.state.value_()
    if tstate == 2:  # TASK_UNINTERRUPTIBLE
        comm = task.comm.string_().decode()
        pid = task.pid.value_()
        try:
            frames = [str(f) for f in prog.stack_trace(task)]
            # Categorize by what they're stuck on
            category = "unknown"
            for f in frames:
                if "mutex_lock" in f: category = "mutex"; break
                if "rwsem" in f: category = "rwsem"; break
                if "alloc_pages" in f: category = "alloc"; break
                if "wait_for_completion" in f: category = "completion"; break
                if "worker_thread" in f: category = "wq_idle"; break
            d_state_tasks.append((category, comm, pid, frames[0] if frames else "?"))
        except:
            d_state_tasks.append(("error", comm, pid, "?"))

# Group by category
from collections import Counter
cats = Counter(cat for cat, _, _, _ in d_state_tasks)
print(f"D-state tasks by category: {dict(cats)}")
print(f"Total D-state tasks: {len(d_state_tasks)}")
for cat, comm, pid, top_frame in sorted(d_state_tasks)[:30]:
    print(f"  [{cat}] {comm} pid={pid}: {top_frame}")
```

## REPORTING

Present your findings in this order:

1. **Pool State** — Hard facts from Phase 1: nr_running, nr_idle,
   nr_workers, pending work count, watchdog delta. For each worker: state,
   current function, sleeping flag, stack trace. These are raw facts — state
   them without interpretation first.

2. **Stall Mechanism** — Based on the pool state, which scenario (A/B/C/D)
   applies and why. Cite specific values: "nr_running=1 but only worker is
   in __mutex_lock with sleeping=1, so nr_running should be 0" or
   "nr_idle=0, POOL_MANAGER_ACTIVE=1, manager pid=X stuck in __alloc_pages."

3. **Dependency Analysis** — The lock/wait chain from Phase 2. For each
   link: who is waiting, on what, who holds it, what are they doing. Note
   explicitly whether any link depends on a work item that would run on the
   stuck pool.

4. **Root Cause** — What started the stall. This might be:
   - A workqueue concurrency management bug (nr_running inconsistency)
   - A worker creation failure (memory pressure preventing GFP_KERNEL allocs)
   - A lock ordering issue in cfg80211/networking that caused workers to
     all block
   - Something else entirely
   State this with evidence. If you cannot determine root cause with
   certainty, say so and list the candidates with evidence for/against each.

5. **Cascade Effects** — What other subsystems are stalled as a consequence,
   and through what mechanism (blocked on same lock, waiting for stuck work
   item, etc.).

For EVERY claim, cite the specific drgn output. If you cannot support a
claim, explicitly say so and mark it as hypothesis.

## OUTPUT FORMAT

Write your complete findings as a Markdown file. The file should be
self-contained and suitable for posting as a reply to the bug report on
LKML. Structure it as follows:

```markdown
# Workqueue Lockup Analysis — CPU 5 Pool

## 1. Pool State

(raw facts: nr_running, nr_idle, nr_workers, flags, watchdog delta)

### Workers

(table or list of every worker: id, pid, state, current_func, sleeping,
stack trace for non-idle workers)

### Pending Work Items

(list of all pending work items on the worklist with function and workqueue)

### PWQ Statistics

(started/completed counts for active pwqs on this pool)

## 2. Stall Mechanism

(which scenario applies and WHY, with specific values cited as evidence)

## 3. Lock / Dependency Analysis

(the chain: who waits on what, who holds it, what are THEY waiting on —
with addresses and stack traces at each step. note whether any link depends
on a work item on the stuck pool.)

## 4. Other Pools and System-Wide State

(any other stuck pools, D-state task summary, memory pressure indicators)

## 5. Root Cause

(what started the stall, with evidence. or candidates if uncertain.)

## 6. Cascade Effects

(what else is broken as a consequence)
```

Save this file and tell the user where it is so they can attach it to their
reply.


Thread overview: 22+ messages
2026-02-23 22:36 6.18.13 iwlwifi deadlock allocating cma while work-item is active Ben Greear
2026-02-27 16:31 ` Ben Greear
2026-03-01 15:38   ` Ben Greear
2026-03-02  8:07     ` Johannes Berg
2026-03-02 15:26       ` Ben Greear
2026-03-02 15:38         ` Johannes Berg
2026-03-02 15:50           ` Ben Greear
2026-03-03 11:49             ` Johannes Berg
2026-03-03 20:52               ` Tejun Heo
2026-03-03 21:03                 ` Johannes Berg
2026-03-03 21:12                 ` Johannes Berg
2026-03-03 21:40                   ` Ben Greear
2026-03-03 21:54                     ` Tejun Heo
2026-03-04  0:02                       ` Ben Greear
2026-03-04 17:14                         ` Tejun Heo
2026-03-10 16:10                           ` Ben Greear
2026-03-10 18:06                             ` Tejun Heo [this message]
2026-03-10 19:18                               ` Ben Greear
2026-03-10 19:47                                 ` Tejun Heo
2026-03-10 19:48                                   ` Tejun Heo
2026-03-04  3:08               ` Hillf Danton
2026-03-04  6:57                 ` Johannes Berg
