Date: Mon, 27 Nov 2023 11:50:26 -0800
From: Tony Luck
To: Borislav Petkov
Cc: Yazen Ghannam, Smita.KoralahalliChannabasappa@amd.com,
    dave.hansen@linux.intel.com, x86@kernel.org, linux-edac@vger.kernel.org,
    linux-kernel@vger.kernel.org, patches@lists.linux.dev
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > Whichever of the timer and the CMCI happens first will run. Second to
> > arrive will pend the interrupt and be handled when interrupts are
> > enabled as the first completes.
>
> So I still don't like the timer calling machine_check_poll() and
> cmci_mc_poll_banks() doing the same without any proper synchronization
> between the two.

But it isn't doing the same thing. The timer calls:

	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

and cmci_mc_poll_banks() calls:

	machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

A bank is either in the bitmap of banks to poll from the timer, or in
one of the per-cpu bitmaps of banks "owned" by that CPU to be checked
when a CMCI occurs. But it can't be in both.

> Yes, when you get a CMCI interrupt, you poll and do the call the storm
> code. Now what happens if the polling runs from softirq context and you
> get a CMCI interrupt at exactly the same time.
> I.e., is machine_check_poll() reentrant and audited properly?

So nothing bad happens. If Linux was polling some set of banks from the
timer and is interrupted by CMCI, the interrupt will check some disjoint
set of banks. All the history tracking code is done per-bank, so there
is no overlap.

> I hope I'm making more sense.

Yes. Totally making sense. I was under the mistaken impression that the
mce timers used TIMER_IRQSAFE and the nested CMCI while processing a
timed poll couldn't happen. So I learned something here too.

I'll think of some comment to add to the history tracking code to
summarize this thread.

-Tony
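
For anyone following the thread, the disjointness argument can be sketched
as a small user-space model. This is not the kernel code: poll_banks,
banks_owned, history[], NUM_BANKS and every function name below are
hypothetical stand-ins, and the two polls run one after the other rather
than truly nested. It only illustrates that when no bank sits in both
bitmaps, the two paths never update the same per-bank history slot.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 8 /* hypothetical bank count for this toy model */

/* Hypothetical user-space stand-ins for the two per-CPU bitmaps. */
static uint32_t poll_banks;  /* banks the timer polls */
static uint32_t banks_owned; /* banks checked when a CMCI arrives */

/* Per-bank history: each bank has its own slot, nothing is shared. */
static unsigned int history[NUM_BANKS];

/* Put a bank in the timer bitmap, removing it from the owned bitmap. */
static void bank_set_polled(int bank)
{
	banks_owned &= ~(UINT32_C(1) << bank);
	poll_banks |= UINT32_C(1) << bank;
}

/* Put a bank in the owned bitmap, removing it from the timer bitmap. */
static void bank_set_owned(int bank)
{
	poll_banks &= ~(UINT32_C(1) << bank);
	banks_owned |= UINT32_C(1) << bank;
}

/* The invariant the argument relies on: no bank is in both bitmaps. */
static int bitmaps_disjoint(void)
{
	return (poll_banks & banks_owned) == 0;
}

/* Toy poll: update history only for banks set in the given bitmap. */
static void poll_bitmap(uint32_t banks)
{
	for (int bank = 0; bank < NUM_BANKS; bank++)
		if (banks & (UINT32_C(1) << bank))
			history[bank]++;
}

/*
 * Model a CMCI-driven poll and a timer-driven poll hitting the same CPU.
 * Returns 1 if every bank's history was updated exactly once, meaning no
 * bank was touched by both paths.
 */
static int simulate_nested_poll(void)
{
	poll_banks = 0;
	banks_owned = 0;
	for (int bank = 0; bank < NUM_BANKS; bank++) {
		history[bank] = 0;
		bank_set_owned(bank);
	}
	bank_set_polled(2); /* arbitrary banks moved to timer polling */
	bank_set_polled(5);
	assert(bitmaps_disjoint());

	poll_bitmap(poll_banks);  /* timer path */
	poll_bitmap(banks_owned); /* CMCI path */

	for (int bank = 0; bank < NUM_BANKS; bank++)
		if (history[bank] != 1)
			return 0;
	return 1;
}
```

Calling simulate_nested_poll() from a test program is enough to exercise
the model. Because the setters always clear a bank from one bitmap before
setting it in the other, the invariant holds by construction in this
sketch; whether the real code maintains it on every path is what the
thread is about.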