From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f182.google.com (mail-pl1-f182.google.com [209.85.214.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C686AA59 for ; Tue, 22 Apr 2025 00:06:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.182 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745280410; cv=none; b=RGk9OD1jKHJkTAo+K4uCiPtAVdqW+Uuvx4KDu+nBeNygplL8QtHM7EfYv1CJv461+eeTQ6icH1Q2qAU7qjP7HZUhcEFCzMbaYQkuWO1KH2lJD+tzJn/SbbdS+2qnaghHme89H6BCjPTc8F6Ymt2hRg171YY2FaDQ5XtMCL6OXKA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745280410; c=relaxed/simple; bh=gaJ8CNHhuRpy69oqCd2Ln/ciIYRxbwnl4yp/yguh25g=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=Lpzu/zB+QNBYllvf723oNQ+O/QcBbVIBc2j3xTMZ3rEvItQsNciFZfFxH3Ev6y/ye+zROax/vOvg3RX/WcRrTuGs+yWaL6Pbfln6s7hNzEsxyKHedJQO87FQCoGNfKwRz2rRxVYS4wJBsfHj5Vno7FCdIRGAazVmwxeN+igRZNg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=osandov.com; spf=none smtp.mailfrom=osandov.com; dkim=pass (2048-bit key) header.d=osandov-com.20230601.gappssmtp.com header.i=@osandov-com.20230601.gappssmtp.com header.b=kJ7ZbGq8; arc=none smtp.client-ip=209.85.214.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=osandov.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=osandov.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20230601.gappssmtp.com header.i=@osandov-com.20230601.gappssmtp.com header.b="kJ7ZbGq8" Received: by mail-pl1-f182.google.com with SMTP id d9443c01a7336-227c7e4d5feso8873375ad.2 for ; Mon, 21 Apr 2025 17:06:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20230601.gappssmtp.com; s=20230601; t=1745280408; x=1745885208; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=QUH+zFjzG8QJ2B56eHcHCDW8HiQiaQrWAYR3RSYagxA=; b=kJ7ZbGq852RoxK403axaEV4US3NFNkL09/GI/7tCTrQJRWdjmCEQi7h2p+F7ZGnSzq iS9TLE/ezaRqflAHiDQRsxRA2XGFHdaoBoLB1KxCipsnULciwoffL3uTiC3JwGDMMOYC 8wg4nfHWaDLmF1qvBnumCmG4a5/jpzKr5re+EaJpwZz7Po+81/Y3qmikt9+zh0/KGtLj zojqX6Ym9VQzV29D4ie0vi9Kbg7uYlg6n/hOMnl6rFIx+d40tsHDM8QWwNumv9lQoHBO Tw+Xc080Pg55zmt0lN0RIZ/nnTPuFljTL1rhO3EnIpoy2fsqmwC7/0FG5LrVityuPA7j 5psg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745280408; x=1745885208; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=QUH+zFjzG8QJ2B56eHcHCDW8HiQiaQrWAYR3RSYagxA=; b=WDeIbK/v1wPGjwl7l/z1UiYAvchnq6eTzRWUEJZ/UP6BO8X4X/i9laNQDU4NOgRSUD OS/uM66AMeijI7vTKUqtIdHvPhqfN2VqR6p8J2Bx3sXBJmK/9FwZ65D20bG2XdAXGL9K QNinVFkyOMCeXQiFhwkaqmsYs40zR3jdnJGxPRIPbN0hW3PJXmwoU+Ezys6gq4SfJaas Bix832D9JLiGTZSile2f5+KG0dzqHlf8VZfeEheA4kRsXBDO3ptJ3WRlu6B5YpTR2Hn9 fpeG37+pMD5tkkSN7cYXCoL3kKIY9yy0ILOo2PNlqotJIxhTKbFN826iiDM/Z0m1w7c1 drcw== X-Forwarded-Encrypted: i=1; AJvYcCXDDCVv98cZXqCDEQxLJWxbaE1L4q9UXSmRjyV6gxdZOWEAIhzuieWVZ0t9LFI7AnYtbCUEgJ/v10CQdL8=@vger.kernel.org X-Gm-Message-State: AOJu0YyC8FNFOl2pDAzDHKP1c/LdGmqkrDJBGt5P6Pw6xjHN/RS9gbzp QiElNrtcXFeARPvcD1C8LC5pQKE+kHOQBay0u7tqZ4+9KoiaQX7okoM08UsvKXWKQa1hhpvY4p7 l X-Gm-Gg: ASbGncvWy7ikLdSsQlj9RgLwIVuiB77RilnTOPKnTsZPytYPoHfFozzgNhC8nKNSHM+ q9otMqFWnykf0kbA+3m9WXscaHzgW9XZwLnrWA+qjSwkbTbwYYk72ZJ7Hf+P6M9tO4DcQcCYwdC KD2heqzqyJUNlh9MI0qjX4XJICmaZMFCb70vKpDbse8Wue5aVw7rpArkae4uxZrv+uyRXRB0tcg 5CpvSsR6wM3hoYhpAU4WZsVlVapOcuW1SVW0TUTivNfGuvw0DbPnld7TeWmVB/BKbPZawg4GyZj uv+N3VKuP3G355WwEmP6MS8BvTj0Tw0a0w== X-Google-Smtp-Source: AGHT+IGEj5Tc9GtP9AEqFXLd/raNtdiK1BvYNx7GDQlh/ZRDY3vZrH9qR62I82ueFCQMiWAmU1ll3w== X-Received: by 2002:a17:903:1a2d:b0:224:88c:9255 with SMTP id d9443c01a7336-22c535dd409mr90831785ad.3.1745280407898; Mon, 21 Apr 2025 17:06:47 -0700 (PDT) Received: from telecaster ([2620:10d:c090:500::5:806]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-22c50fda18esm71217805ad.212.2025.04.21.17.06.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 21 Apr 2025 17:06:47 -0700 (PDT) Date: Mon, 21 Apr 2025 17:06:45 -0700 From: Omar Sandoval To: Peter Zijlstra Cc: Rik van Riel , Chris Mason , Pat Cody , mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org, patcody@meta.com, kernel-team@meta.com, Breno Leitao Subject: Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity() Message-ID: References: <20250320205310.779888-1-pat@patcody.io> <20250324115613.GD14944@noisy.programming.kicks-ass.net> <7B2CFC16-1ADE-4565-B555-7525A50494C2@surriel.com> <20250402082221.GT5880@noisy.programming.kicks-ass.net> <20250415080235.GK5600@noisy.programming.kicks-ass.net> <20250416124442.GC6580@noisy.programming.kicks-ass.net> <20250418154438.GH17910@noisy.programming.kicks-ass.net> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Fri, Apr 18, 2025 at 04:49:08PM -0700, Omar Sandoval wrote: > On Fri, Apr 18, 2025 at 05:44:38PM +0200, Peter Zijlstra wrote: > > On Wed, Apr 16, 2025 at 10:19:42AM -0400, Rik van Riel wrote: > > > > > The most common warning by far, hitting > > > about 90% of the time we hit anything > > > in avg_vruntime_validate is the > > > WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime) > > > > > > The most common code path to getting there, > > > covering about 85% of the cases: > > > > > > avg_vruntime_validate > > > avg_vruntime_sub > > > __dequeue_entity > > > set_next_entity > > > pick_task_fair > > > pick_next_task_fair > > > __pick_next_task > > > pick_next_task > > > __schedule > > > schedule > > > > (I'm assuming zero_vruntime patch here, the stock kernel has more > > problems vs min_vruntime getting stuck in the future etc..) > > > > So I have a theory about this. Key is that you're running a PREEMPT-NONE > > kernel. > > > > There is one important site the overflow patch does not cover: > > avg_vruntime_update(). > > > > However, due to PREEMPT_NONE, it is possible (Chris mentioned direct > > reclaim and OOM) to have very long (in-kernel) runtimes without > > scheduling. > > > > (I suppose this should be visible by tracing sched_switch) > > > > Were this to happen, then avg_vruntime_add() gets to deal with 'key * > > weight' for a relatively large key. But per the overflow checks there, > > this isn't hitting (much). > > > > But then we try and update zero_vruntime, and that is doing 'delta * > > cfs_rq->avg_load', and avg_load is a larger number and might just cause > > overflow. We don't have a check there. > > > > If that overflows then avg_vruntime is buggered, but none of the > > individual tree entities hit overflow, exactly as you observe. > > > > I've modified the zero_vruntime patch a little (new one below) to update > > zero_vruntime in update_curr(). Additionally, I have every tick re-align > > it with the tree avg (the update and the tree can drift due to numerical > > funnies). > > > > This should ensure these very long in-kernel runtimes are broken up in > > at most tick sized chunks, and by keeping zero_vruntime more or less > > around the actual 0-lag point, the key values are minimized. > > > > I should probably go play with a kernel module that spins for a few > > seconds with preemption disabled, see if I can make it go BOOM :-) But > > I've not yet done so. [snip] > > static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq) > > @@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim > > static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) > > { > > avg_vruntime_add(cfs_rq, se); > > + update_zero_vruntime(cfs_rq); > > Won't this double-count cfs_rq->curr in the avg_vruntime() calculation > in update_zero_vruntime()? When cfs_rq->curr->on_rq, put_prev_entity() > calls this with se == cfs_rq->curr _before_ setting cfs_rq->curr to > NULL. So the avg_vruntime_add() call will add entity_key(cfs_rq->curr) > to cfs_rq->avg_vruntime and se_weight(cfs_rq->curr) to cfs_rq->avg_load. > Then update_zero_vruntime() calls avg_vruntime(), which still sees > curr->on_rq and will add curr's key and weight again. This throws > zero_vruntime off, maybe by enough to bust zero_vruntime and/or > avg_vruntime? > > Should the call to update_zero_vruntime() go before avg_vruntime_add()? > > Thanks, > Omar Hey, Peter, We haven't been able to test your latest patch, but I dug through some core dumps from crashes with your initial zero_vruntime patch. It looks like on just about all of them, the entity vruntimes are way too spread out, so we would get overflows regardless of what we picked as zero_vruntime. As a representative example, we have a cfs_rq with 3 entities with the follow vruntimes and (scaled down) weights: vruntime weight 39052385155836636 2 (curr) 43658311782076206 2 42824722322062111 4886 The difference between the minimum and maximum is 4605926626239570, which is 53 bits. The total load is 4890. Even if you picked zero_vruntime to be equidistant from the minimum and maximum, the (vruntime - zero_vruntime) * load calculation in entity_eligible() is doomed to overflow. That range in vruntime seems too absurd to be due to only to running too long without preemption. We're only seeing these crashes on internal node cgroups (i.e., cgroups whose children are cgroups, not tasks). This all leads me to suspect reweight_entity(). Specifically, this line in reweight_entity(): se->vlag = div_s64(se->vlag * se->load.weight, weight); seems like it could create a very large vlag, which could cause place_entity() to adjust vruntime by a large value. (place_entity() has a similarly suspect adjustment on se->vlag, only update_entity_lag() clamps it). I'll try to reproduce something like this, but do you have any thoughts on this theory in the meantime? Thanks, Omar