From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev, fvdl@google.com,
	Rik van Riel
Subject: [RFC PATCH 03/40] mm: page_alloc: split-path PCP free with local-trylock + remote-llist
Date: Wed, 20 May 2026 10:59:09 -0400
Message-ID: <20260520150018.2491267-4-riel@surriel.com>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>
References: <20260520150018.2491267-1-riel@surriel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

The page allocator's PCP free path needs lock-inversion protection
against zone->lock. The natural form -- always take pcp->lock with
spin_lock -- can deadlock because callers may hold locks (e.g. xa_lock
via slab/stack_depot) that are also taken in hardirq context, and
pcp->lock is acquired with IRQs enabled on the allocation side.

A coarse fix is to use spin_trylock and fall back to free_one_page()
(direct zone-buddy free) on contention. That removes the inversion risk
but defeats the per-CPU pageset benefits on a busy multi-CPU system:
many frees take the slow zone->lock path, and the per-CPU pcp->count
visible to allocators understates real free-page availability for the
remote CPU's pageset.

Replace the trylock-fallback with a per-CPU remote free list (llist)
consumed by the owning CPU. Local frees still use the trylock path;
remote frees push onto the target's lockless llist; the owning CPU
absorbs the queued pages back onto its PCP buddy lists at the next
opportunity.

Result: zero lock-inversion risk, no zone->lock fallback storm, and
remote frees become near-free at the freer's side.

Mechanics:

- per_cpu_pages gains struct llist_head free_llist.

- absorb_remote_frees(pcp) drains the llist into the local PCP buddy
  lists. Called from pcp_rmqueue_smallest(), free_pcppages_bulk(), and
  drain_pages_zone().

- __free_frozen_pages and free_unref_folios are split into a local path
  (spin_trylock on pcp->lock; on success enqueue locally) and a remote
  path (llist_add to the target CPU's free_llist).

- The local-side spin_trylock no longer takes irqsave: lockdep analysis
  showed no IRQ-context caller of the local PCP free path that is also
  a holder of pcp->lock; the remote-from-IRQ case routes through
  llist_add (NMI-safe).

- Memory hot-add lazy init: page_alloc_cpu_dead drains the dead PCP via
  existing drain_pages_zone (which now also drains the llist via
  absorb_remote_frees).
  For the narrow race where a remote freer raced PCPF_CPU_DEAD and
  pushed onto the dead PCP's llist after the drain,
  page_alloc_cpu_online absorbs any stranded pages.

- page_alloc_cpu_dead detaches every entry from owned_blocks via
  list_del_init before reinitializing the list head. A simpler
  INIT_LIST_HEAD-only form leaves owned PB entries with stale
  ->prev/->next pointing at the dead head -- they get list_del()'d
  later by clear_pcpblock_owner() under zone->lock, corrupting whatever
  now happens to be at the dead head address. A stress-test reproducer
  surfaced this as a list_del prev->next == prev WARN.

QEMU stress (234K worker iters + 5 hotplug cycles + 30 hugepages): zero
WARN/BUG. Bare-metal test machine ran for ~14 hours under
production-style load with no list_del corruption, no WARN, no panic.

Signed-off-by: Rik van Riel
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 include/linux/mmzone.h |   9 ++
 mm/page_alloc.c        | 249 ++++++++++++++++++++++++++++++-----------
 2 files changed, 193 insertions(+), 65 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f0eb16390906..732e4dd181b9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -843,6 +843,15 @@ struct per_cpu_pages {
 	/* Pageblocks owned by this CPU, for fragment recovery */
 	struct list_head owned_blocks;
 
+	/*
+	 * Pages remotely freed by other CPUs into pageblocks owned by
+	 * this CPU. Lock-free push by remote freers via llist_add(); the
+	 * owning CPU drains and merges them into its PCP buddy lists at
+	 * convenient moments (start of pcp_rmqueue_smallest, drain
+	 * paths) under pcp->lock.
+	 */
+	struct llist_head free_llist;
+
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
 } ____cacheline_aligned_in_smp;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3448a97bab2..47d314e77151 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1482,6 +1482,8 @@ bool free_pages_prepare(struct page *page, unsigned int order)
 	return __free_pages_prepare(page, order, FPI_NONE);
 }
 
+static void absorb_remote_frees(struct per_cpu_pages *pcp);
+
 /*
  * Free PCP pages to zone buddy. First does a bottom-up merge pass
  * over PagePCPBuddy entries under pcp->lock only (already held by
@@ -1502,6 +1504,13 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	struct page *page;
 	int mt, pindex;
 
+	/*
+	 * Pull in any pages remotely freed to our pageblocks before the
+	 * merge pass -- they participate in merging just like locally
+	 * freed pages.
+	 */
+	absorb_remote_frees(pcp);
+
 	/*
 	 * Ensure proper count is passed which otherwise would stuck in the
 	 * below while (list_empty(list)) loop.
@@ -1596,6 +1605,45 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
+/*
+ * Absorb pages remotely freed into this CPU's pageblocks. Remote freers
+ * push pages onto pcp->free_llist lock-free (no remote PCP lock taken);
+ * the owning CPU pulls them onto its PCP buddy lists here, where they
+ * become eligible for normal merging on the next free_pcppages_bulk()
+ * pass.
+ *
+ * Called with pcp->lock held. Must be cheap on the empty path; the
+ * llist_empty() check is the fast-path bail-out.
+ */
+static void absorb_remote_frees(struct per_cpu_pages *pcp)
+{
+	struct llist_node *node;
+	struct page *p, *tmp;
+	int absorbed = 0;
+
+	if (likely(llist_empty(&pcp->free_llist)))
+		return;
+
+	node = llist_del_all(&pcp->free_llist);
+	llist_for_each_entry_safe(p, tmp, node, pcp_llist) {
+		unsigned long pfn = page_to_pfn(p);
+		unsigned int order = pcp_buddy_order(p);
+		int mt = pbd_migratetype(pfn_to_pageblock(p, pfn));
+
+		if (unlikely(mt >= MIGRATE_PCPTYPES))
+			mt = MIGRATE_MOVABLE;
+
+		/*
+		 * Pages on the llist came from pageblocks owned by this CPU
+		 * (that's how the freer picked our llist), so they are
+		 * eligible for PCP-buddy merging.
+		 */
+		__SetPagePCPBuddy(p);
+		pcp_enqueue(pcp, p, mt, order);
+		absorbed += 1 << order;
+	}
+}
+
 /*
  * Search PCP free lists for a page of at least the requested order.
  * If found at a higher order, split and place remainders on PCP lists.
@@ -1606,6 +1654,8 @@ static struct page *pcp_rmqueue_smallest(struct per_cpu_pages *pcp,
 {
 	unsigned int high;
 
+	absorb_remote_frees(pcp);
+
 	for (high = order; high <= pageblock_order; high++) {
 		struct list_head *list;
 		unsigned long size;
@@ -2884,6 +2934,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 	do {
 		pcp_spin_lock_nopin(pcp);
+		absorb_remote_frees(pcp);
 		count = pcp->count;
 		if (count) {
 			int to_drain = min(count,
@@ -3247,11 +3298,22 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	}
 
 	/*
-	 * Route page to the owning CPU's PCP for merging, or to
-	 * the local PCP for batching (zone-owned pages). Zone-owned
-	 * pages are cached without PagePCPBuddy -- the merge pass
-	 * skips them, so they're inert on any PCP list and drain
-	 * individually to zone buddy.
+	 * Route the page based on pageblock ownership:
+	 *
+	 * - owner_cpu == this CPU (or no owner): take the local PCP
+	 *   lock with spin_trylock and enqueue normally. The trylock
+	 *   fails only on rare local self re-entry (IRQ/NMI fires
+	 *   while the interrupted task already holds the lock) or
+	 *   while a remote drain is active; either way, fall back to
+	 *   free_one_page (or the zone-llist for FPI_TRYLOCK). No
+	 *   irqsave: the trylock cannot block on self, and remote
+	 *   CPUs never take this pcp->lock (they go via free_llist),
+	 *   so an interruption cannot deadlock against another freer.
+	 *
+	 * - owner_cpu != this CPU: lock-free push onto the owner's
+	 *   free_llist. The owner absorbs the page into its PCP buddy
+	 *   lists at its next alloc/drain. No remote PCP lock taken,
+	 *   so no cross-CPU contention.
 	 *
 	 * Ownership is stable here: it can only change when the
 	 * pageblock is complete -- either fully free in zone buddy
@@ -3259,31 +3321,46 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	 * Since we hold this page, neither can happen.
 	 */
 	owner_cpu = pbd->cpu - 1;
-	cache_cpu = owner_cpu;
-	if (cache_cpu < 0)
-		cache_cpu = raw_smp_processor_id();
+	cache_cpu = raw_smp_processor_id();
+
+	if (owner_cpu < 0 || owner_cpu == cache_cpu) {
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
 
-	pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
-	if (unlikely(fpi_flags & FPI_TRYLOCK) || !in_task()) {
 		if (!spin_trylock(&pcp->lock)) {
+			if (fpi_flags & FPI_TRYLOCK)
+				add_page_to_zone_llist(zone, page, order);
+			else
+				free_one_page(zone, page, pfn, order, fpi_flags);
+			return;
+		}
+
+		if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
+			spin_unlock(&pcp->lock);
 			free_one_page(zone, page, pfn, order, fpi_flags);
 			return;
 		}
-	} else {
-		spin_lock(&pcp->lock);
+
+		if (free_frozen_page_commit(zone, pcp, page, migratetype,
+					    order, fpi_flags,
+					    owner_cpu == cache_cpu))
+			spin_unlock(&pcp->lock);
+		/* If commit returned false, pcp was already unlocked
+		 * (migration or trylock failure inside the batched-free
+		 * loop). */
+		return;
 	}
 
-	if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
-		spin_unlock(&pcp->lock);
+	/* Remote owner: lock-free llist hand-off. */
+	pcp = per_cpu_ptr(zone->per_cpu_pageset, owner_cpu);
+
+	if (unlikely(READ_ONCE(pcp->flags) & PCPF_CPU_DEAD)) {
 		free_one_page(zone, page, pfn, order, fpi_flags);
 		return;
 	}
 
-	if (free_frozen_page_commit(zone, pcp, page, migratetype, order,
-				    fpi_flags, cache_cpu == owner_cpu))
-		spin_unlock(&pcp->lock);
-	/* If commit returned false, pcp was already unlocked (migration or
-	 * trylock failure inside the batched-free loop). */
+	set_pcp_order(page, order);
+	llist_add(&page->pcp_llist, &pcp->free_llist);
+	__count_vm_events(PGFREE, 1 << order);
 }
 
 void free_frozen_pages(struct page *page, unsigned int order)
@@ -3335,60 +3412,78 @@ void free_unref_folios(struct folio_batch *folios)
 		struct zone *zone = folio_zone(folio);
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = (unsigned long)folio->private;
+		struct per_cpu_pages *remote_pcp;
 		struct pageblock_data *pbd;
 		int migratetype;
-		int owner_cpu, cache_cpu;
+		int owner_cpu;
 
 		folio->private = NULL;
 		pbd = pfn_to_pageblock(&folio->page, pfn);
 		migratetype = pbd_migratetype(pbd);
 		owner_cpu = pbd->cpu - 1;
-		cache_cpu = owner_cpu;
-		if (cache_cpu < 0)
-			cache_cpu = raw_smp_processor_id();
 
-		/*
-		 * Re-lock needed if zone changed, page is isolate,
-		 * or target CPU changed.
-		 */
-		if (zone != locked_zone ||
-		    is_migrate_isolate(migratetype) ||
-		    cache_cpu != locked_cpu) {
+		/* Isolated pages always go directly to the zone buddy. */
+		if (unlikely(is_migrate_isolate(migratetype))) {
 			if (pcp) {
 				spin_unlock(&pcp->lock);
+				pcp = NULL;
 				locked_zone = NULL;
 				locked_cpu = -1;
-				pcp = NULL;
 			}
+			free_one_page(zone, &folio->page, pfn,
+				      order, FPI_NONE);
+			continue;
+		}
 
-			/*
-			 * Free isolated pages directly to the
-			 * allocator, see comment in free_frozen_pages.
-			 */
-			if (is_migrate_isolate(migratetype)) {
+		if (locked_cpu < 0)
+			locked_cpu = raw_smp_processor_id();
+
+		/*
+		 * Remote owner: lock-free push onto the owner's free_llist.
+		 * Drop any local PCP lock first; the remote llist needs no
+		 * lock and the next folio may belong to a different owner.
+		 */
+		if (owner_cpu >= 0 && owner_cpu != locked_cpu) {
+			if (pcp) {
+				spin_unlock(&pcp->lock);
+				pcp = NULL;
+				locked_zone = NULL;
+			}
+			remote_pcp = per_cpu_ptr(zone->per_cpu_pageset,
+						 owner_cpu);
+			if (unlikely(READ_ONCE(remote_pcp->flags) &
+				     PCPF_CPU_DEAD)) {
 				free_one_page(zone, &folio->page, pfn,
 					      order, FPI_NONE);
 				continue;
 			}
+			set_pcp_order(&folio->page, order);
+			llist_add(&folio->page.pcp_llist,
+				  &remote_pcp->free_llist);
+			__count_vm_events(PGFREE, 1 << order);
+			trace_mm_page_free_batched(&folio->page);
+			continue;
+		}
 
-			pcp = per_cpu_ptr(zone->per_cpu_pageset,
-					  cache_cpu);
-			/*
-			 * Use trylock when not in task context (IRQ,
-			 * softirq) to avoid spinning with IRQs
-			 * disabled. In task context, spin -- brief
-			 * contention on a per-CPU lock beats the
-			 * unbatched zone->lock fallback.
-			 */
-			if (!in_task()) {
-				if (unlikely(!spin_trylock(&pcp->lock))) {
-					pcp = NULL;
-					free_one_page(zone, &folio->page, pfn,
-						      order, FPI_NONE);
-					continue;
-				}
-			} else {
-				spin_lock(&pcp->lock);
+		/*
+		 * Local owner (or unowned): take the local PCP lock with
+		 * spin_trylock. On failure (rare local re-entry or a remote
+		 * drain in progress) fall back to the zone buddy. No
+		 * irqsave -- trylock cannot block on self, and remote
+		 * CPUs never take this pcp->lock (they go via free_llist).
+		 */
+		if (zone != locked_zone) {
+			if (pcp) {
+				spin_unlock(&pcp->lock);
+				pcp = NULL;
+				locked_zone = NULL;
+			}
+			pcp = per_cpu_ptr(zone->per_cpu_pageset, locked_cpu);
+			if (!spin_trylock(&pcp->lock)) {
+				pcp = NULL;
+				free_one_page(zone, &folio->page, pfn,
+					      order, FPI_NONE);
+				continue;
 			}
 			if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
 				spin_unlock(&pcp->lock);
@@ -3398,7 +3493,6 @@ void free_unref_folios(struct folio_batch *folios)
 				continue;
 			}
 			locked_zone = zone;
-			locked_cpu = cache_cpu;
 		}
 
 		/*
@@ -3411,7 +3505,7 @@ void free_unref_folios(struct folio_batch *folios)
 		trace_mm_page_free_batched(&folio->page);
 		if (!free_frozen_page_commit(zone, pcp, &folio->page,
 					     migratetype, order, FPI_NONE,
-					     cache_cpu == owner_cpu)) {
+					     owner_cpu == locked_cpu)) {
 			pcp = NULL;
 			locked_zone = NULL;
 			locked_cpu = -1;
@@ -6361,6 +6455,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
 		INIT_LIST_HEAD(&pcp->lists[pindex]);
 	INIT_LIST_HEAD(&pcp->owned_blocks);
+	init_llist_head(&pcp->free_llist);
 
 	/*
 	 * Set batch and high values safe for a boot pageset. A true percpu
@@ -6581,19 +6676,38 @@ static int page_alloc_cpu_dead(unsigned int cpu)
 			drain_pages_zone(cpu, zone);
 
 			/*
-			 * Drain released all pages. Reinitialize the
-			 * owned-blocks list -- any remaining entries are
-			 * stale (fragments that merged in zone buddy and
-			 * cleared ownership, but weren't removed from
-			 * the list because __free_one_page doesn't hold
-			 * pcp->lock).
+			 * drain_pages_zone iterates absorb_remote_frees +
+			 * free_pcppages_bulk until both pcp->count and the
+			 * remote-free llist are empty. A remote freer that
+			 * read PCPF_CPU_DEAD as clear *before* the flag was set
+			 * above and does llist_add *after* the drain exits will
+			 * leave a few pages on the dead PCP's free_llist; they
+			 * are harmless and absorbed when the CPU comes back
+			 * online (any first alloc/free runs absorb_remote_frees).
 			 *
-			 * Hold zone lock to prevent racing with other
-			 * CPUs doing list_del_init on stale entries
-			 * from this list during their Phase 1.
+			 * Drain released all pages. Tear down the owned-blocks
+			 * list cleanly: walk each entry and list_del_init() it
+			 * before INIT_LIST_HEAD on the head. INIT_LIST_HEAD
+			 * alone would leave stale entries with prev/next
+			 * pointing at the (now self-pointing) head, so a future
+			 * clear_pcpblock_owner -> list_del_init on a stale
+			 * pbd->cpu_node would corrupt the list head it walks
+			 * back through. Detaching each entry first makes the
+			 * subsequent list_del_init a safe self-loop no-op.
+			 *
+			 * Hold zone lock to serialize with concurrent Phase 0
+			 * iteration on this same list from other CPUs (which
+			 * also hold zone->lock).
 			 */
 			pcp_spin_lock_nopin(pcp);
 			spin_lock_irqsave(&zone->lock, zflags);
+			while (!list_empty(&pcp->owned_blocks)) {
+				struct pageblock_data *pbd =
+					list_first_entry(&pcp->owned_blocks,
+							 struct pageblock_data,
+							 cpu_node);
+				list_del_init(&pbd->cpu_node);
+			}
 			INIT_LIST_HEAD(&pcp->owned_blocks);
 			spin_unlock_irqrestore(&zone->lock, zflags);
 			pcp_spin_unlock_nopin(pcp);
@@ -6632,6 +6746,11 @@ static int page_alloc_cpu_online(unsigned int cpu)
 		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 		pcp_spin_lock_nopin(pcp);
 		pcp->flags &= ~PCPF_CPU_DEAD;
+		/*
+		 * Pull in any pages that landed on the free_llist while
+		 * the CPU was down (rare race in page_alloc_cpu_dead).
+		 */
+		absorb_remote_frees(pcp);
 		pcp_spin_unlock_nopin(pcp);
 
 		zone_pcp_update(zone, 1);
-- 
2.54.0