From: Sonam Sanju
To: vineeth@bitbyteword.org
Cc: dmaluka@chromium.org, kunwu.chan@linux.dev, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, paulmck@kernel.org, pbonzini@redhat.com,
    rcu@vger.kernel.org, seanjc@google.com, sonam.sanju@intel.com,
    stable@vger.kernel.org, tj@kernel.org
Subject: Re: [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race
Date: Tue, 21 Apr 2026 22:24:55 +0530
Message-Id: <20260421165455.2486211-1-sonam.sanju@intel.com>

Hi Vineeth, Kunwu, Tejun,

I collected new crash logs with additional debug instrumentation in
wq_worker_sleeping(), kick_pool(), and show_one_worker_pool() to capture
pool state during the hang.
The results conclusively confirm Vineeth's preemption race theory.

From the new logs:

1. Pool dump with nr_running/nr_idle (added instrumentation):

   pool 10: cpus=2 flags=0x0 hung=201s workers=11 nr_running=1 nr_idle=5

   11 workers, 5 idle, 6 in D-state (all in irqfd_shutdown) -- yet
   nr_running=1. No worker is actually running on CPU 2.

2. NMI backtrace confirms CPU 2 is completely idle:

   NMI backtrace for cpu 2 skipped: idling at intel_idle+0x57/0xa0

   So nr_running=1 is a phantom count -- no worker is running, but
   the pool thinks one is.

3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
   in wq_worker_sleeping():

   kworker/2:0 state:D Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    __schedule+0x87a/0xd60
    preempt_schedule_irq+0x4a/0x90
    asm_fred_entrypoint_kernel+0x41/0x70
    ___ratelimit+0x1a1/0x1f0            <-- inside pr_info_ratelimited
    wq_worker_sleeping+0x53/0x190       <-- preempted HERE
    schedule+0x30/0xe0
    schedule_preempt_disabled+0x10/0x20
    __mutex_lock+0x413/0xe40
    irqfd_resampler_shutdown+0x53/0x200
    irqfd_shutdown+0xfa/0x190

   This confirms the exact race: a reschedule IPI interrupted
   wq_worker_sleeping() after worker->sleeping was set to 1 but
   before pool->nr_running was decremented. The preemption triggered
   wq_worker_running(), which incremented nr_running (1->2); then,
   on resume, the decrement brought it back to 1 instead of 0.

4. A second pool dump 31 seconds later shows the stall is permanent:

   pool 10: cpus=2 flags=0x0 hung=232s workers=11 nr_running=1 nr_idle=5

   Same phantom nr_running=1, with the hung time still growing.

5. The deadlock chain:

   - PID 33: holds the resampler_lock mutex, stuck in wq_worker_sleeping()
   - PID 520: past the mutex, stuck in synchronize_srcu_expedited()
   - PIDs 120, 4792, 4793, 4796: waiting on the resampler_lock mutex
   - crosvm_vcpu2: waiting in kvm_vm_release -> __flush_workqueue
   - init (PID 1): stuck in pci_device_shutdown -> __flush_work
   - multiple userspace processes stuck in fsnotify_destroy_group
   - the reboot thread timed out, and the system triggered a sysrq crash

6. The kick_pool_skip debug print fired for other pools but NOT for
   pool 10 -- because nr_running was never 0, need_more_worker() was
   never true, so kick_pool() was never even called for this pool.

Regarding a fix, we can consider a workqueue-level fix in
wq_worker_sleeping() itself:

  void wq_worker_sleeping(struct task_struct *task)
  {
  	...
  	if (READ_ONCE(worker->sleeping))
  		return;

  +	preempt_disable();
  	WRITE_ONCE(worker->sleeping, 1);
  	raw_spin_lock_irq(&pool->lock);

  	if (worker->flags & WORKER_NOT_RUNNING) {
  		raw_spin_unlock_irq(&pool->lock);
  +		preempt_enable();
  		return;
  	}

  	pool->nr_running--;
  	if (kick_pool(pool))
  		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;

  	raw_spin_unlock_irq(&pool->lock);
  +	preempt_enable();
  }

The idea is to disable preemption from sleeping=1 until we hold the pool
lock (which disables IRQs). This prevents the reschedule IPI from
triggering preempt_schedule_irq() inside this window. Note that
wq_worker_running() already uses preempt_disable()/preempt_enable()
around its nr_running++ for a similar race against unbind_workers().

Does this approach look correct to you?


Thanks,
Sonam