Re: [next-20260216]NULL pointer dereference in drain_obj_stock() (RCU free path)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Shakeel Butt <shakeel.butt@linux.dev>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>,
	 Vlastimil Babka <vbabka@suse.cz>,
	Carlos Maiolino <cem@kernel.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	 Roman Gushchin <roman.gushchin@linux.dev>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	 LKML <linux-kernel@vger.kernel.org>,
	Madhavan Srinivasan <maddy@linux.ibm.com>,
	 Ritesh Harjani <riteshh@linux.ibm.com>,
	ojaswin@linux.ibm.com, Muchun Song <muchun.song@linux.dev>,
	 Cgroups <cgroups@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	 Hao Li <hao.li@linux.dev>
Subject: Re: [next-20260216]NULL pointer dereference in drain_obj_stock() (RCU free path)
Date: Sun, 22 Feb 2026 15:48:53 -0800	[thread overview]
Message-ID: <aZuVgStlrvZ87duZ@linux.dev> (raw)
In-Reply-To: <aZuR6_Mm9uqt_6Fp@linux.dev>

On Sun, Feb 22, 2026 at 03:36:46PM -0800, Shakeel Butt wrote:
> On Sun, Feb 22, 2026 at 08:47:03PM +0900, Harry Yoo wrote:
> [...]
> > 
> > It seems it crashed while dereferencing objcg->ref->data->count.
> > I think that implies that obj_cgroup_release()->percpu_ref_exit()
> > is already called due to the refcount reaching zero and set
> > ref->data = NULL.
> > 
> > Wait, was the stock->objcg ever a valid objcg?
> > I think it should be valid when refilling the obj stock, otherwise
> > it should have crashed in refill_obj_stock() -> obj_cgroup_get() path
> > in the first place, rather than crashing when draining.
> > 
> > And that sounds like we're somehow calling obj_cgroup_put() more times
> > than obj_cgroup_get().
> > 
> > Anyway, this is my theory that it may be due to mis-refcounting of objcgs.
> > 
> 
> I have not looked deeper into recent slub changes (sheafs or obj_exts savings)
> but one thing looks weird to me:
> 
> allocate_slab() // for cache with SLAB_OBJ_EXT_IN_OBJ
> 	-> alloc_slab_obj_exts_early()
> 		-> slab_set_stride(slab, s->size)
> 	-> account_slab()
> 		-> alloc_slab_obj_exts()
> 			-> slab_set_stride(slab, sizeof(struct slabobj_ext));
> 
> Unconditional overwrite of stride. Not sure if it is issue or even related to
> this crash but looks odd.

I asked AI to debug this crash report along with a nudge towards to look for
stride corruption, it gave me the following output:


# Stride Corruption Bug Analysis

## Bug Report Context

- **Crash Location**: `drain_obj_stock+0x620/0xa48` in `obj_cgroup_put(old)` at mm/memcontrol.c:3059
- **Root Cause**: `percpu_ref.data` is NULL, meaning `obj_cgroup_release()` already ran
- **Platform**: IBM Power11 (pSeries LPAR, Radix MMU, LE, 64K pages, kernel 6.19.0-next-20260216)
- **Trigger**: xfstests generic/428

## Identified Bug: Unconditional Stride Overwrite

### Location: mm/slub.c lines 2196-2223 (alloc_slab_obj_exts)

```c
retry:
    old_exts = READ_ONCE(slab->obj_exts);
    handle_failed_objexts_alloc(old_exts, vec, objects);
    slab_set_stride(slab, sizeof(struct slabobj_ext));  // BUG: UNCONDITIONALLY SET

    if (new_slab) {
        slab->obj_exts = new_exts;
    } else if (old_exts & ~OBJEXTS_FLAGS_MASK) {
        // obj_exts already exists, BUT stride was already modified above!
        mark_objexts_empty(vec);
        kfree(vec);
        return 0;
    } else if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts) {
        goto retry;
    }
```

### The Problem

The stride is set to `sizeof(struct slabobj_ext)` **BEFORE** checking if `obj_exts` already
exists. If a slab was created with `SLAB_OBJ_EXT_IN_OBJ` mode (where stride = `s->size`),
and later `alloc_slab_obj_exts` is called for any reason, the stride gets corrupted.

### Stride Modes

There are two stride modes (see alloc_slab_obj_exts_early):

1. **Normal mode**: stride = `sizeof(struct slabobj_ext)` (~16 bytes)
   - obj_exts is a separate array or in slab leftover space

2. **SLAB_OBJ_EXT_IN_OBJ mode**: stride = `s->size` (object size, e.g., 64-256+ bytes)
   - obj_ext is embedded within each object at a fixed offset

### Consequences of Wrong Stride

When `slab_obj_ext` is later called:
```c
obj_ext = (struct slabobj_ext *)(obj_exts + slab_get_stride(slab) * index);
```

With corrupted stride (16 instead of 256):
- **Expected**: `obj_exts + 256 * 5 = obj_exts + 1280` (correct obj_ext for object 5)
- **Actual**: `obj_exts + 16 * 5 = obj_exts + 80` (WRONG obj_ext - belongs to object 0!)

This causes:
1. Reading wrong object's objcg pointer
2. Releasing wrong objcg reference (`obj_cgroup_put`)
3. Reference underflow on victim objcg
4. Early `obj_cgroup_release()` → `percpu_ref_exit()` → `data = NULL`
5. Stock still caches the objcg pointer
6. Later `drain_obj_stock()` tries to put it → **CRASH**

## Missing Safety Check

`slab_obj_ext()` in mm/slab.h has **no bounds checking**:

```c
static inline struct slabobj_ext *slab_obj_ext(struct slab *slab,
                                               unsigned long obj_exts,
                                               unsigned int index)
{
    struct slabobj_ext *obj_ext;

    VM_WARN_ON_ONCE(obj_exts != slab_obj_exts(slab));
    // MISSING: VM_WARN_ON_ONCE(index >= slab->objects);

    obj_ext = (struct slabobj_ext *)(obj_exts +
                                     slab_get_stride(slab) * index);
    return kasan_reset_tag(obj_ext);
}
```

## CRITICAL: Memory Ordering Bug on PowerPC (Likely Root Cause)

### The Problem

In `alloc_slab_obj_exts` (mm/slub.c lines 2199-2220), there is **NO memory barrier**
between the stride store and the obj_exts visibility via cmpxchg:

```c
slab_set_stride(slab, sizeof(struct slabobj_ext));  // Store to stride (line 2199)
                                                     // NO MEMORY BARRIER HERE!
if (new_slab) {
    slab->obj_exts = new_exts;                       // Store to obj_exts (line 2207)
} else if (...) {
} else if (cmpxchg(&slab->obj_exts, ...) != ...) {   // Atomic on obj_exts (line 2220)
    goto retry;
}
```

### Why This Crashes on PowerPC

PowerPC has a **weakly-ordered memory model**. Stores can be reordered and may not be
immediately visible to other processors. The cmpxchg provides a barrier AFTER it
executes, but the stride store BEFORE cmpxchg may not be visible when obj_exts becomes
visible.

**Race Scenario:**
1. CPU A: `slab_set_stride(slab, 16)` (store to stride, in CPU A's store buffer)
2. CPU A: `cmpxchg(&slab->obj_exts, 0, new_exts)` succeeds, obj_exts is now visible
3. CPU B: Sees `obj_exts` is set (from step 2)
4. CPU B: Reads `slab->stride` → **sees OLD value (0 or garbage)** due to reordering!
5. CPU B: `slab_obj_ext` calculates `obj_exts + 0 * index = obj_exts` for ALL indices!
6. **All objects appear to share the same obj_ext at offset 0**

### Consequences

- Object 0's objcg is correct
- Object 1..N all read object 0's objcg (WRONG!)
- When freeing multiple objects, we `obj_cgroup_put` the SAME objcg multiple times
- Reference count underflows → early `obj_cgroup_release()`
- `percpu_ref_exit()` sets `data = NULL`
- Later stock drain tries to put the objcg → **CRASH in `drain_obj_stock`**

### Why This Matches the Bug Report

- **Platform**: IBM Power11 (PowerPC) - weakly ordered memory
- **Trigger**: xfstests generic/428 - creates high filesystem activity with many allocations
- **Crash location**: `drain_obj_stock` → `obj_cgroup_put` with NULL data
- **Pattern**: Stock has cached objcg that was prematurely released

### Proposed Fix

Add a write memory barrier to ensure stride is visible before obj_exts:

```c
slab_set_stride(slab, sizeof(struct slabobj_ext));
smp_wmb();  // Ensure stride is visible before obj_exts

if (new_slab) {
    slab->obj_exts = new_exts;
} else if (...) {
} else if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts) {
    goto retry;
}
```

And correspondingly, the reader side should use a read barrier:
```c
static inline unsigned short slab_get_stride(struct slab *slab)
{
    // Need acquire semantics when reading stride after seeing obj_exts
    return smp_load_acquire(&slab->stride);
}
```

Or use `smp_store_release` / `smp_load_acquire` pairs for proper ordering.

### Also Applies to alloc_slab_obj_exts_early

The same issue exists in `alloc_slab_obj_exts_early` (lines 2290-2291 and 2308-2309):

```c
slab->obj_exts = obj_exts;                           // Store obj_exts
slab_set_stride(slab, sizeof(struct slabobj_ext));   // Store stride AFTER!
```

Here the order is **reversed** - obj_exts is set BEFORE stride! This is even worse
for memory ordering, as other CPUs could see obj_exts before stride is set.

## Original Theory: Unconditional Stride Overwrite

(Kept for reference - less likely to be the root cause on this specific crash)

The stride is set to `sizeof(struct slabobj_ext)` **BEFORE** checking if `obj_exts`
already exists. However, analysis shows this is protected by the TOCTOU check in
callers (`!slab_obj_exts(slab)`).

## Trigger Scenarios

1. **Memory ordering on PowerPC** (MOST LIKELY): Stride not visible when obj_exts
   becomes visible due to missing memory barriers.

2. **Race between alloc_slab_obj_exts calls**: Two CPUs trying to allocate obj_exts
   for the same slab simultaneously.

3. **Interaction with RCU free path**: Objects in RCU sheaf being processed when
   stride is stale/zero.

## Confirmed Code Analysis (CONFIG_64BIT)

On 64-bit systems (including IBM Power11), the stride is stored dynamically:

**mm/slab.h:562-569**:
```c
#ifdef CONFIG_64BIT
static inline void slab_set_stride(struct slab *slab, unsigned short stride)
{
    slab->stride = stride;  // Plain store - NO memory ordering!
}
static inline unsigned short slab_get_stride(struct slab *slab)
{
    return slab->stride;    // Plain load - NO memory ordering!
}
```

**mm/slab.h:533-548** (`slab_obj_exts`):
```c
static inline unsigned long slab_obj_exts(struct slab *slab)
{
    unsigned long obj_exts = READ_ONCE(slab->obj_exts);  // Only compiler barrier!
    // ... validation ...
    return obj_exts & ~OBJEXTS_FLAGS_MASK;
}
```

`READ_ONCE` only provides compiler ordering, NOT CPU memory ordering. There's no
acquire barrier to ensure the stride read happens after seeing obj_exts.

## Complete Fix Using Release/Acquire Semantics

### Fix 1: Reader side - slab_obj_exts (mm/slab.h)

Change `READ_ONCE` to `smp_load_acquire`:

```c
static inline unsigned long slab_obj_exts(struct slab *slab)
{
    unsigned long obj_exts = smp_load_acquire(&slab->obj_exts);  // Acquire barrier
    // ... validation ...
    return obj_exts & ~OBJEXTS_FLAGS_MASK;
}
```

### Fix 2: Writer side - alloc_slab_obj_exts (mm/slub.c:2196-2223)

Use `smp_store_release` for obj_exts after setting stride:

```c
retry:
    old_exts = READ_ONCE(slab->obj_exts);
    handle_failed_objexts_alloc(old_exts, vec, objects);

    if (new_slab) {
        slab_set_stride(slab, sizeof(struct slabobj_ext));
        smp_store_release(&slab->obj_exts, new_exts);  // Release barrier
    } else if (old_exts & ~OBJEXTS_FLAGS_MASK) {
        mark_objexts_empty(vec);
        kfree(vec);
        return 0;
    } else {
        slab_set_stride(slab, sizeof(struct slabobj_ext));
        // cmpxchg already provides release semantics, but stride must be
        // visible before cmpxchg. Need explicit barrier:
        smp_wmb();
        if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts)
            goto retry;
    }
```

### Fix 3: Writer side - alloc_slab_obj_exts_early (mm/slub.c:2290-2291, 2308-2309)

The order is REVERSED here - obj_exts is set BEFORE stride! Fix by using
`smp_store_release`:

```c
// For normal obj_exts (lines 2290-2291):
slab_set_stride(slab, sizeof(struct slabobj_ext));  // Set stride FIRST
smp_store_release(&slab->obj_exts, obj_exts);       // Then release obj_exts

// For SLAB_OBJ_EXT_IN_OBJ (lines 2308-2309):
slab_set_stride(slab, s->size);                     // Set stride FIRST
smp_store_release(&slab->obj_exts, obj_exts);       // Then release obj_exts
```

## Why This Fixes the Crash

With proper release/acquire ordering:

1. **Writer** (CPU A): Sets stride, then `smp_store_release(&obj_exts, ...)` ensures
   stride is visible to all CPUs before obj_exts becomes visible

2. **Reader** (CPU B): `smp_load_acquire(&obj_exts)` ensures that if obj_exts is
   seen as set, the subsequent stride read will see the correct value

This prevents the race where CPU B sees obj_exts but reads stale/zero stride,
which caused all objects to appear to share obj_ext at offset 0, leading to
multiple `obj_cgroup_put` calls on the same objcg → reference underflow → crash.

## Additional Safety: Bounds Check in slab_obj_ext

Add bounds check to catch any remaining issues:
```c
VM_WARN_ON_ONCE(index >= slab->objects);
```

next prev parent reply	other threads:[~2026-02-22 23:49 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-17 11:29 [next-20260216]NULL pointer dereference in drain_obj_stock() (RCU free path) Venkat Rao Bagalkote
2026-02-17 12:40 ` Carlos Maiolino
2026-02-18 11:36   ` Vlastimil Babka
2026-02-18 21:25     ` Shakeel Butt
2026-02-22 10:08     ` Venkat Rao Bagalkote
2026-02-22 11:47       ` Harry Yoo
2026-02-22 23:36         ` Shakeel Butt
2026-02-22 23:48           ` Shakeel Butt [this message]
2026-02-23  2:36             ` Harry Yoo
2026-02-24  2:07               ` Harry Yoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aZuVgStlrvZ87duZ@linux.dev \
    --to=shakeel.butt@linux.dev \
    --cc=cem@kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hao.li@linux.dev \
    --cc=harry.yoo@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=maddy@linux.ibm.com \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=ojaswin@linux.ibm.com \
    --cc=riteshh@linux.ibm.com \
    --cc=roman.gushchin@linux.dev \
    --cc=vbabka@suse.cz \
    --cc=venkat88@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.