Date: Mon, 23 Mar 2026 14:05:14 +0000
X-Mailing-List: linux-kernel@vger.kernel.org
X-Mailer: git-send-email 2.53.0.959.g497ff81fa9-goog
Message-ID: <20260323140514.3487563-1-joonwonkang@google.com>
Subject: Re: [PATCH] percpu: Fix hint invariant breakage
From: Joonwon Kang
To: dennis@kernel.org
Cc: akpm@linux-foundation.org, cl@gentwo.org, joonwonkang@google.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, tj@kernel.org,
	dodam@google.com
Content-Type: text/plain; charset="UTF-8"

> Hello,
>
> On Fri, Mar 20, 2026 at 11:52:14AM +0000, Joonwon Kang wrote:
> > The invariant "scan_hint_start > contig_hint_start if and only if
> > scan_hint == contig_hint" should be kept for hint management. However,
> > it could be broken in some cases:
> >
>
> First I'd just like to apologize. I spent an hour yesterday trying to
> remember why the invariant exists and the reality is this code is more
> clever than it needs to be.

Thanks for taking the time to look at this and for sharing more context.
While you are at it, I have a fundamental question about the invariant. I
have been thinking about, and discussing, what benefit the invariant
brings to the percpu allocator. My understanding is that if we put
contig_hint before scan_hint when they are the same size, contig_hint is
more likely to be broken by a future allocation, which leads to a linear
scan after scan_hint to update the hints, although we save scanning up to
scan_hint when contig_hint is not broken. On the other hand, if we put
scan_hint before contig_hint instead, scan_hint is more likely to be
broken while contig_hint is kept, which does not lead to the linear scan
for the hint update, although we lose the scanning that would be saved in
the other case. In other words, if allocations that break contig_hint are
common under the current invariant, performance may suffer more than it
would without the invariant. I also do not see a strict reason for having
the invariant. So, could you clarify the necessity of the invariant?
If there is no hard requirement for it, I could post a follow-up patch
that removes the invariant altogether so that we can simplify the code
and evaluate the result. What do you think?

>
> As Andrew asked, how did you come across this? It's pretty obscure so
> thank you for taking the time to look at it.

I came across this issue during manual code review. Thank you for saying
that.

> >
> > - if (new contig == contig_hint == scan_hint) && (contig_hint_start <
> >   scan_hint_start < new contig start) && the new contig is to become a
> >   new contig_hint due to its better alignment, then scan_hint should
> >   be invalidated instead of keeping it.
> >
> > - if (new contig == contig_hint > scan_hint) && (start <
> >   contig_hint_start) && the new contig is not to become a new
> >   contig_hint, then scan_hint should be invalidated instead of being
> >   updated to the new contig.
> >
> > This commit fixes this invariant breakage and also optimizes scan_hint
> > by keeping it or updating it when acceptable:
> >
> > - if (new contig > contig_hint > scan_hint) && (scan_hint_start < new
> >   contig start < contig_hint_start), then keep scan_hint instead of
> >   invalidating it.
> >
> > - if (new contig > contig_hint == scan_hint) && (contig_hint_start <
> >   new contig start < scan_hint_start), then update scan_hint to the
> >   old contig_hint instead of invalidating it.
> >
> > - if (new contig == contig_hint > scan_hint) && (new contig start <
> >   contig_hint_start) && the new contig is to become a new contig_hint
> >   due to its better alignment, then update scan_hint to the old
> >   contig_hint instead of invalidating or keeping it.
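
For reference, the invariant as worded in the commit message can be
written as a small predicate. This is only a sketch under assumptions:
`struct hints` is a simplified stand-in for the hint fields of
struct pcpu_block_md, `hint_invariant_holds()` is a hypothetical helper
that does not exist in mm/percpu.c, and treating scan_hint == 0 as "no
scan_hint" is inferred from how the patch invalidates it.

```c
#include <stdbool.h>

/* Simplified stand-in for the hint fields of struct pcpu_block_md. */
struct hints {
	int contig_hint_start;
	int contig_hint;	/* size of the contig hint */
	int scan_hint_start;
	int scan_hint;		/* size of the scan hint; 0 means "none" (assumption) */
};

/*
 * Hypothetical helper: checks the invariant as worded in the commit message,
 *   scan_hint_start > contig_hint_start  iff  scan_hint == contig_hint.
 */
static bool hint_invariant_holds(const struct hints *b)
{
	if (b->scan_hint == 0)
		return true;	/* an invalidated scan_hint cannot break it */

	return (b->scan_hint_start > b->contig_hint_start) ==
	       (b->scan_hint == b->contig_hint);
}
```

With hints given as {contig_hint_start, contig_hint, scan_hint_start,
scan_hint}, the state {64, 64, 160, 64} satisfies the predicate, while
equal-sized hints with scan_hint placed first, such as {64, 64, 16, 64},
do not.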
> >
> > Signed-off-by: Joonwon Kang
> > ---
> >  mm/percpu.c | 60 ++++++++++++++++++++++++++++++++++-------------------
> >  1 file changed, 39 insertions(+), 21 deletions(-)
> >
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 81462ce5866e..a0e4f8acb7c2 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -641,19 +641,13 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end)
> >  	if (contig > block->contig_hint) {
> >  		/* promote the old contig_hint to be the new scan_hint */
> >  		if (start > block->contig_hint_start) {
> > -			if (block->contig_hint > block->scan_hint) {
> > +			if (block->contig_hint > block->scan_hint ||
> > +			    start < block->scan_hint_start) {
>
> I think this should be <=.
> Given hints as [hint_start, size].
>
> contig_hint = [64, 64]
> scan_hint = [160, 64]
>
> Free [224, 32].
>
> Without <=, we don't promote the contig_hint and leave the stale
> scan_hint.

Ah, I missed that the new contig can overlap the existing hints. I will
reconsider the cases and fix it. Thanks.

>
> >  				block->scan_hint_start =
> >  					block->contig_hint_start;
> >  				block->scan_hint = block->contig_hint;
> > -			} else if (start < block->scan_hint_start) {
> > -				/*
> > -				 * The old contig_hint == scan_hint. But, the
> > -				 * new contig is larger so hold the invariant
> > -				 * scan_hint_start < contig_hint_start.
> > -				 */
> > -				block->scan_hint = 0;
> >  			}
> > -		} else {
> > +		} else if (start < block->scan_hint_start) {
>
> I think this too should be <=.
>
> scan_hint = [16, 8]
> contig_hint = [32, 96]
>
> free [24, 8]
>
> scan_hint stays [16, 8] instead of being cleared.

Will fix it, thanks.
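
The two examples above can be replayed against a small userspace model.
This is a sketch under stated assumptions, not the kernel code: it copies
only the "contig > contig_hint" branch of the patched
pcpu_block_update(), the `inclusive` switch between "<" and "<=" is a
hypothetical test knob, and `scan_hint_is_stale()` is a hypothetical
helper that calls a scan_hint stale when it overlaps the new contig_hint.
It also assumes the caller passes the merged free area, so freeing
[224, 32] next to scan_hint [160, 64] arrives as start = 160, end = 256.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the hint fields of struct pcpu_block_md. */
struct hints {
	int contig_hint_start;
	int contig_hint;
	int scan_hint_start;
	int scan_hint;
};

/* "<" as posted in the patch, "<=" as suggested in the review. */
static bool start_before(int start, int scan_hint_start, bool inclusive)
{
	return inclusive ? start <= scan_hint_start : start < scan_hint_start;
}

/* Models only the "contig > contig_hint" branch of the patched function. */
static void update_larger(struct hints *b, int start, int end, bool inclusive)
{
	int contig = end - start;

	assert(contig > b->contig_hint);	/* other branches are not modeled */

	if (start > b->contig_hint_start) {
		if (b->contig_hint > b->scan_hint ||
		    start_before(start, b->scan_hint_start, inclusive)) {
			/* promote the old contig_hint to be the new scan_hint */
			b->scan_hint_start = b->contig_hint_start;
			b->scan_hint = b->contig_hint;
		}
	} else if (start_before(start, b->scan_hint_start, inclusive)) {
		b->scan_hint = 0;
	}
	b->contig_hint_start = start;
	b->contig_hint = contig;
}

/* Hypothetical check: a valid scan_hint that overlaps the contig_hint. */
static bool scan_hint_is_stale(const struct hints *b)
{
	return b->scan_hint != 0 &&
	       b->scan_hint_start < b->contig_hint_start + b->contig_hint &&
	       b->contig_hint_start < b->scan_hint_start + b->scan_hint;
}
```

Starting from {64, 64, 160, 64} and calling update_larger(&b, 160, 256,
false) leaves scan_hint at [160, 64] inside the new contig_hint
[160, 96]; with inclusive set to true, scan_hint becomes the old
contig_hint [64, 64]. Starting from {32, 96, 16, 8} and calling
update_larger(&b, 16, 128, false) keeps scan_hint [16, 8]; with
inclusive set to true it is cleared.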
>
>
> >  			block->scan_hint = 0;
> >  		}
> >  		block->contig_hint_start = start;
> > @@ -662,20 +656,44 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end)
> >  		if (block->contig_hint_start &&
> >  		    (!start ||
> >  		     __ffs(start) > __ffs(block->contig_hint_start))) {
> > +			if (block->contig_hint > block->scan_hint) {
> > +				if (start < block->contig_hint_start) {
> > +					block->scan_hint = block->contig_hint;
> > +					block->scan_hint_start = block->contig_hint_start;
> > +				}
> > +			} else if (start > block->scan_hint_start) {
> > +				/*
> > +				 * old contig_hint == old scan_hint == contig.
> > +				 * But, the new contig is farther than the old
> > +				 * scan_hint so hold the invariant
> > +				 * scan_hint_start > contig_hint_start iff
> > +				 * scan_hint == contig_hint.
> > +				 */
> > +				block->scan_hint = 0;
> > +			}
> > +
> >  			/* start has a better alignment so use it */
> >  			block->contig_hint_start = start;
> > -			if (start < block->scan_hint_start &&
> > -			    block->contig_hint > block->scan_hint)
> > -				block->scan_hint = 0;
> > -		} else if (start > block->scan_hint_start ||
> > -			   block->contig_hint > block->scan_hint) {
> > -			/*
> > -			 * Knowing contig == contig_hint, update the scan_hint
> > -			 * if it is farther than or larger than the current
> > -			 * scan_hint.
> > -			 */
> > -			block->scan_hint_start = start;
> > -			block->scan_hint = contig;
> > +		} else {
> > +			if (block->contig_hint > block->scan_hint) {
> > +				if (start < block->contig_hint_start) {
> > +					/*
> > +					 * old scan_hint < contig == old
> > +					 * contig_hint. But, the new contig is
> > +					 * before the old contig_hint so hold
> > +					 * the invariant
> > +					 * scan_hint_start > contig_hint_start
> > +					 * iff scan_hint == contig_hint.
> > +					 */
> > +					block->scan_hint = 0;
> > +				} else {
> > +					block->scan_hint_start = start;
> > +					block->scan_hint = contig;
> > +				}
> > +			} else if (start > block->scan_hint_start) {
> > +				block->scan_hint_start = start;
> > +				block->scan_hint = contig;
> > +			}
> >  		}
> >  	} else {
> >  		/*
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
>
> Ultimately as I re-read this code, it might be nice to rewrite it so
> that scan_hint can be kept separately. The code is a little too clever
> with trying to avoid stating new_region overlaps scan_hint or
> contig_hint.
>
> I recently started shimming out the bitmap code in userspace so
> hopefully I can test it for performance / correctness more rigorously.

Thanks for letting me know about this. It would be great if we could
test this subtle change to the invariant and the hints.

>
> Thanks,
> Dennis