From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <93030e7f-dcd2-45dd-a3a2-efa0128753f1@zhaoxin.com>
Date: Mon, 1 Jul 2024 19:09:18 +0800
From: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
To: Thomas Gleixner, Dave Hansen
CC: Linus Torvalds
Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
In-Reply-To: <87wmnda8mc.ffs@tglx>
References: <20240528063836.5248-1-TonyWWang-oc@zhaoxin.com>
 <50fc1bd3-909e-41c4-a991-9d81e32ef92c@intel.com>
 <87wmnda8mc.ffs@tglx>
X-Mailing-List: linux-hardening@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"; format=flowed
On 2024/5/29 06:12, Thomas Gleixner wrote:
>
> [This email is from an external sender; beware of risks.]
>
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...
>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>  	if (in_nmi())
>>>  		return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> +	/*
>>> +	 * Read HPET directly if panic in progress.
>>> +	 */
>>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> +		return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred
> clock sources on other architectures, are perfectly fine. They just
> read and are done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated specially even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses an NMI-safe time accessor function
> which is then tripping over the HPET lock. That's really a
> HPET-specific problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you
> can see in the patch above:
>
>         if (in_nmi())
>                 return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in-kernel MCEs are NMI class exceptions
> and therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
>   1) A non-NMI class exception within the lock held region
>
>      CPU A
>        acquire(hpet_lock);
>        ...   <- #PF, #GP, #DE, ...  -> panic()
>
>      If any of that happens within that lock held section then the
>      live lock on the hpet_lock is the least of your worries.
>      Seriously, I don't care about this at all.
>
>   2) The actual scenario is:
>
>      CPU A                          CPU B
>      lock(hpet_lock)
>                                     MCE hits user space
>                                     ...
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode(regs);
>
>      irqentry_enter_from_user_mode() obviously does not mark the
>      exception as NMI class, so in_nmi() evaluates to false.
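(Side note for reference: the two #MC entry paths look roughly like the
sketch below. This is paraphrased and heavily simplified from
arch/x86/kernel/cpu/mce/core.c; the exact code differs between kernel
versions.)

	/* Kernel-mode #MC: entered as an NMI-class exception. */
	static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
	{
		irqentry_state_t irq_state;

		/* NMI-class entry: in_nmi() is true inside do_machine_check(). */
		irq_state = irqentry_nmi_enter(regs);
		do_machine_check(regs);
		irqentry_nmi_exit(regs, irq_state);
	}

	/* User-mode #MC: entered as a regular exception. */
	static __always_inline void exc_machine_check_user(struct pt_regs *regs)
	{
		/* Regular exception entry: in_nmi() stays false. */
		irqentry_enter_from_user_mode(regs);
		do_machine_check(regs);
		irqentry_exit_to_user_mode(regs);
	}

irqentry_nmi_enter() is what makes in_nmi() true for the kernel-mode
path; the user-mode path never gets that marking.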
>      That would actually dead lock if CPU A is not making progress
>      and never releases hpet_lock.
>
>      Sounds unlikely to happen, right? But in reality it can because
>      of MCE broadcast. Assume that both CPUs go into MCE:
>
>      CPU A                          CPU B
>      lock(hpet_lock)
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode();
>      exc_machine_check_kernel()        do_machine_check()
>        irqentry_nmi_enter();             mce_panic()
>        do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
>          mce_panic()                           wait_for_panic();  <- Not taken
>            if (atomic_inc_return(&mce_panicked) > 1)
>                wait_for_panic();  <- Taken
>
>                                          ....
>                                          hpet_read()
>
>  -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>     obviously can't release the lock.
>

For this scenario, an experiment was designed around printk:

a. Install a driver module that repeatedly sends IPIs to multiple cores
   to execute printk().
b. Run a user-level stress tool such as stream on all cores.
c. Trigger an MCE hardware error.

(A rough sketch of the module from step a is appended at the end of
this mail.)

During burn-in tests a-c, the following case was reproduced:

CPU A                         CPU B
printk()
  console_owner               exc_machine_check_user()
                                irqentry_enter_from_user_mode()
exc_machine_check_kernel()        do_machine_check()
  irqentry_nmi_enter()              mce_panic()
  do_machine_check()                  print_mce()
    ...                                 ...
                                        while (console_waiter)
                                          cpu_relax();  <- dead loop
    mce_timed_out()  <- timeout
    wait_for_panic()
      panic("Panicing machine check CPU died");

In this case CPU B is the monarch CPU in the MCE handler: CPU B is
waiting to become the console_owner, and CPU A cannot release the
console_owner. So the monarch CPU B ends up in a dead loop, and as a
result the other CPUs, whose wait for the monarch CPU times out, call
the panic function.

This problem is caused by the use of printk in the MCE handler.
Actually, I found this comment on the MCE handler:

 * This is executed in #MC context not subject to normal locking rules.
 * This implies that most kernel services cannot be safely used. Don't even
 * think about putting a printk in there!

Should we consider not using printk() in the MCE handler?

Sincerely!
TonyWWang-oc

> So the proposed patch makes sense to some extent. But it only cures
> the symptom. The real underlying questions are:
>
>   1) Should we provide a panic mode read callback for clocksources
>      which are affected by this?
>
>   2) Is it correct to claim that an MCE which hits user space and ends
>      up in mce_panic() is still just a regular exception, or should we
>      upgrade to NMI class context when we enter mce_panic(), or even
>      go as far as to upgrade to NMI class context for any panic()
>      invocation?
>
>   #1 Solves it at the clocksource level. It still needs HPET-specific
>      changes.
>
>   #2 Solves a whole class of issues
>
>      ... while potentially introducing new ones :)
>
>   To me upgrading any panic() invocation to NMI class context makes a
>   lot of sense because in that case all bets are off.
>
>   in_nmi() is used in quite some places to avoid such problems. IOW,
>   that would kill a whole class of issues instead of "curing" the HPET
>   problem locally, for the price of an extra conditional. Not that the
>   extra conditional matters much if HPET is the clocksource, as that's
>   awfully slow anyway and I really don't care about that.
>
>   But I very much care about avoiding sprinkling panic_cpu checks all
>   over the place.
>
> Thanks,
>
>         tglx
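As referenced above, here is a rough sketch of the kind of test module
used in step (a). This is an illustration only; the names and details
are made up for this mail and it is not the exact module used in the
experiment. It simply makes every online CPU call printk() repeatedly
via IPIs so that console ownership is heavily contended:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/smp.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *stress_thread;

/* Runs on each CPU (IPI context on remote CPUs) and prints from there. */
static void printk_ipi_func(void *info)
{
	printk(KERN_INFO "printk stress from CPU %d\n", smp_processor_id());
}

static int printk_stress_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/* Run the printk on every online CPU; remote CPUs get an IPI. */
		on_each_cpu(printk_ipi_func, NULL, 0);
		msleep(10);
	}
	return 0;
}

static int __init printk_stress_init(void)
{
	stress_thread = kthread_run(printk_stress_fn, NULL, "printk_stress");
	return PTR_ERR_OR_ZERO(stress_thread);
}

static void __exit printk_stress_exit(void)
{
	kthread_stop(stress_thread);
}

module_init(printk_stress_init);
module_exit(printk_stress_exit);
MODULE_LICENSE("GPL");

With this loaded, step (b) keeps all cores busy in user space and step
(c) injects the MCE, which is how the console_owner contention shown in
the diagram above was reproduced.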