From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <75678f1b-ed98-4fc2-a167-ce01ad555eef@amd.com>
Date: Tue, 31 Mar 2026 10:28:48 +0530
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Subject: Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
From: K Prateek Nayak
To: Peter Zijlstra
CC: John Stultz, [...], Suleiman Souhlal
References: <20260219075840.162631716@infradead.org>
 <20260219080624.438854780@infradead.org>
 <20260330101018.GN3738786@noisy.programming.kicks-ass.net>
 <73dab51a-650f-4c82-9e73-13236b2a26c2@amd.com>
 <20260330144005.GP3738786@noisy.programming.kicks-ass.net>
 <20260330191108.GU2872@noisy.programming.kicks-ass.net>
Content-Language: en-US
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit

On 3/31/2026 6:08 AM, K Prateek Nayak wrote:
>> I'm thinking that if you have two groups, and the tick always hits the
>> one group, the other group can go a while without ever getting updated.
>
> Ack! That could be but I only have once cgroup on top of root cgroup as
> far as cpu controllers are concerned so the sched_yield() catching up
> the avg_vruntime() should have worked.
> Either ways, I have more data:
>
> When I hit the overflow warning, I have:
>
>     se: entity_key(-83106064385) weight(90891264) overflow(-7553615238018032640)
>     cfs_rq: zero_vruntime(138430453113448575) sum_w_vruntime(0) sum_weight(0)
>     cfs_rq->curr: entity_key(0) vruntime(138430453113448575) deadline(138430500540426854)
>     Post avg_vruntime():
>     se: entity_key(-83106064385) weight(90891264) overflow(-7553615238018032640)
>     cfs_rq: zero_vruntime(138430453113448575) sum_w_vruntime(0) sum_weight(0)
>     cfs_rq->curr: entity_key(0) vruntime(138430453113448575) deadline(138430500540426854)
>
> so running avg_vruntime() doesn't make a difference and it seems to be a
> genuine case of place_entity() putting the newly woken entity pretty
> far back in the timeline. (I forgot to print weights!)
>
> Now, the funny part is, if I leave the system undisturbed, I get a few
> of the above warning and nothing interesting but as soon as I do a:
>
>     grep bits /sys/kernel/debug/sched/debug
>
> Boom! Pick fails very consistently (Because of copy-pasta this too
> doesn't contain weights):
>
>     NULL Pick!
>     cfs_rq: zero_vruntime(89029406877992895) sum_w_vruntime(-135049248768) sum_weight(1048576)
>     cfs_rq->curr: entity_key(149162) vruntime(89029406878142057) deadline(89029406976268435)
>     queued se: entity_key(-123294) vruntime(89029406877869601) deadline(89029406880669601)
>
>     after avg_vruntime()!
>     cfs_rq: zero_vruntime(89029406877868114) sum_w_vruntime(-4206886912) sum_weight(1048576)
>     cfs_rq->curr: entity_key(273943) vruntime(89029406878142057) deadline(89029406976268435)
>     queued se: entity_key(1487) vruntime(89029406877869601) deadline(89029406880669601)
>
>     NULL Pick!
>
> The above doesn't recover after a avg_vruntime(). Btw I'm running:
>
>     nice -n 19 stress-ng --yield 32 -t 1000000s&
>     while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>
> Nice 19 is to get a large deadline and keep catching up to that deadline
> at every yield to see if that makes any difference.
>
>>
>> But if there's no cgroups, this can't be it.
>>
>> Anyway, something like the below would rule this out I suppose.
>
> I'll add that in and see if it makes a difference. I'll add in
> weights and look at place_entity() to see if we have anything
> interesting going on there.

Still trips the issue :-( This time I have logs with weights.

For the warning:

    se: entity_key(-72358759771) weight(90891264) warning_mul(-6576779137058540544) vlag(39009) delayed?(0)
    cfs_rq: zero_vruntime(18695504496613622) sum_w_vruntime(0) sum_weight(0)
    cfs_rq->curr: entity_key(0) vruntime(18695504496613622) deadline(18695540588878716) weight(49)
    Post avg_vruntime():
    se: entity_key(-72358759771) weight(90891264) overflow?(-6576779137058540544)
    cfs_rq: zero_vruntime(18695504496613622) sum_w_vruntime(0) sum_weight(0)
    cfs_rq->curr: entity_key(0) vruntime(18695504496613622) deadline(18695540588878716) weight(49)

And the NULL pick while reading debugfs (probably something in the
initial task wakeup path that trips it?):

    NULL Pick!
    cfs_rq: zero_vruntime(21126236598445952) sum_w_vruntime(-1074569456640) sum_weight(15360)
    cfs_rq->curr: entity_key(69958950) vruntime(21126236668404902) deadline(21126236859551568) weight(15360)
    queued se: entity_key(32498584) vruntime(21126236630944536) deadline(21126236822091202) weight(15360)
    After avg_vruntime():
    cfs_rq: zero_vruntime(21126236598445952) sum_w_vruntime(-1074569456640) sum_weight(15360)
    cfs_rq->curr: entity_key(69958950) vruntime(21126236668404902) deadline(21126236859551568) weight(15360)
    queued se: entity_key(32498584) vruntime(21126236630944536) deadline(21126236822091202) weight(15360)
    NULL Pick!

The updated zero_vruntime is behind the vruntime of both entities
(curr and the queued se).

Now that I have a reliable trigger for the crash, I'll just start
tracing everything before I run grep (although I suspect something
may have gone bad a long time ago, but we can be hopeful).

-- 
Thanks and Regards,
Prateek
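
A quick cross-check of the logged numbers, as a minimal sketch rather than
the kernel code. It assumes that avg_vruntime() folds curr into
sum_w_vruntime / sum_weight and divides rounding down, that an entity is
eligible when key * load does not exceed that weighted sum, and that the
logged weight() values are the ones entering the sums. The helper names
(weighted_key, avg_key, eligible) are hypothetical.

```python
# Hypothetical helpers; all input values are copied from the logs above.
S64_MAX = 2**63 - 1


def weighted_key(entity_key, weight):
    """key * weight, the per-entity term assumed to feed sum_w_vruntime."""
    return entity_key * weight


def avg_key(sum_w_vruntime, sum_weight, curr_key, curr_weight):
    """Weighted-average key relative to zero_vruntime, with curr folded in.

    Assumption: same shape as the upstream avg_vruntime(), i.e. add curr's
    contribution, then divide rounding toward negative infinity
    (Python's // already floors).
    """
    total = sum_w_vruntime + curr_key * curr_weight
    load = sum_weight + curr_weight
    if load:
        total //= load
    return total


def eligible(key, sum_w_vruntime, sum_weight, curr_key, curr_weight):
    """Assumed eligibility test: key * load <= weighted sum of keys."""
    total = sum_w_vruntime + curr_key * curr_weight
    load = sum_weight + curr_weight
    return total >= key * load


# Warning case: compare the product against the logged warning_mul value.
mul = weighted_key(-72358759771, 90891264)
print(mul, abs(mul) <= S64_MAX, abs(mul).bit_length())

# NULL pick case: curr and the queued se, both with weight 15360.
dump = dict(sum_w_vruntime=-1074569456640, sum_weight=15360,
            curr_key=69958950, curr_weight=15360)
print(avg_key(**dump))
print(eligible(69958950, **dump), eligible(32498584, **dump))
```

Under these assumptions the product reproduces the logged warning_mul
exactly and still fits in an s64 (63 bits of magnitude). The average key
in the NULL pick dump comes out as 0, which would leave zero_vruntime
unchanged, and neither curr nor the queued se passes the eligibility
test. That would be consistent with the NULL pick.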