From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.ozlabs.org (lists.ozlabs.org [112.213.38.117]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 539A4CAC592 for ; Mon, 22 Sep 2025 16:02:01 +0000 (UTC) Received: from boromir.ozlabs.org (localhost [127.0.0.1]) by lists.ozlabs.org (Postfix) with ESMTP id 4cVntJ08K7z30N8; Tue, 23 Sep 2025 02:02:00 +1000 (AEST) Authentication-Results: lists.ozlabs.org; arc=none smtp.remote-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=lists.ozlabs.org; s=201707; t=1758556919; cv=none; b=leLvzDKVfbE5p+Hcd7XJPvLmj4PcksvLdJrLBhyRgSsVJ69kr78Hd5/12Rppr8d+Rx3YWcIRqNR3h5NXtK7i6V5d+Zy5MUysmSOhi6b7s8kKNat6ek1ljk729iAKUhSzkxYwrncUfx5oiP6dI3H/7a7WIOxkcLd/wXbJGAw2k2E3RvniQiGyMpmXNxyCj3cws6d/8vNiHQwxs5CO8XYOP2P+jJmueg+mj+Ghot0NNQXAa349/5N8gImvjFfGeLz6f7WMw/AdlJKAfIuvKYirO09LYq/ZjLjOJMxELWMqHdAA0U9y5b2sNkz1oGnXppUqDFdYtjUKtfImY5OBzu9IWA== ARC-Message-Signature: i=1; a=rsa-sha256; d=lists.ozlabs.org; s=201707; t=1758556919; c=relaxed/relaxed; bh=tXNIluJ/Y0q23mzmIzZIg9/Tv+rjRchaEvz0cHtgPjw=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=HCEUiaVtXFI6LbQNja/OP6NSMC1B2jEajSb19pW/FIudINmYdAs0OinLC6EToFC1hL3524I8X5NoYLMUXbfRJgNidlQxk948CD3kIGooErhIXHeD+crVwgzWQp9Em6QmxUV2eaWG6R0dBcUpsurAnDIyj3M/IoL8jx2V7TsZVWsfYLGerjHX6HTDis//eWQgLhBxGbJPnL+ik5trwdRpxJhqUUQT0RSPME7LXCZqTwBVYje7oPQS/lKsS5MC7Qrg2f4UxmsZCcUwGkdjIRVrnLMjUe5NwaQCnWIv70bhHiJUrPxVVI3CRQWnwNWHoWZDeA/aQeIeKFLyYHoSlKCWTQ== ARC-Authentication-Results: i=1; lists.ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=S6EMTbdx; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=S6EMTbdx; dkim-atps=neutral; spf=pass (client-ip=170.10.133.124; helo=us-smtp-delivery-124.mimecast.com; envelope-from=alex.williamson@redhat.com; receiver=lists.ozlabs.org) smtp.mailfrom=redhat.com Authentication-Results: lists.ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: lists.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=S6EMTbdx; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=S6EMTbdx; dkim-atps=neutral Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=redhat.com (client-ip=170.10.133.124; helo=us-smtp-delivery-124.mimecast.com; envelope-from=alex.williamson@redhat.com; receiver=lists.ozlabs.org) Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4cVntG49jjz30HB for ; Tue, 23 Sep 2025 02:01:56 +1000 (AEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1758556912; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=tXNIluJ/Y0q23mzmIzZIg9/Tv+rjRchaEvz0cHtgPjw=; b=S6EMTbdxm75GhKEHW5qxmK2xBaAEHoO/wwKAOy3Ec/8gj6CJ/uG7Oc+bifift4VIJhUEyo malS32megac6D+UxnZubcjctNBwE7E2E/YNHTN4NbYJkOtXoKe2UpgX/SwKeK16nkR9KqK 97bYLndXujDT/+Oh9aZAdxPADyeuKAg= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1758556912; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=tXNIluJ/Y0q23mzmIzZIg9/Tv+rjRchaEvz0cHtgPjw=; b=S6EMTbdxm75GhKEHW5qxmK2xBaAEHoO/wwKAOy3Ec/8gj6CJ/uG7Oc+bifift4VIJhUEyo malS32megac6D+UxnZubcjctNBwE7E2E/YNHTN4NbYJkOtXoKe2UpgX/SwKeK16nkR9KqK 97bYLndXujDT/+Oh9aZAdxPADyeuKAg= Received: from mail-oa1-f69.google.com (mail-oa1-f69.google.com [209.85.160.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-352-vpqURtFOOba6SfVUKMDDVg-1; Mon, 22 Sep 2025 12:01:50 -0400 X-MC-Unique: vpqURtFOOba6SfVUKMDDVg-1 X-Mimecast-MFC-AGG-ID: vpqURtFOOba6SfVUKMDDVg_1758556910 Received: by mail-oa1-f69.google.com with SMTP id 586e51a60fabf-3197f534082so2005693fac.2 for ; Mon, 22 Sep 2025 09:01:50 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1758556909; x=1759161709; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:subject:cc:to:from:date:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tXNIluJ/Y0q23mzmIzZIg9/Tv+rjRchaEvz0cHtgPjw=; b=Y9EfNKCjRJUVyCNkupE9ka2w0wDZHokmWAzrV3VOXAb+ECjzjrFtTX5jwtMh/e2eGO BP8TARwZGDK5gKCEUCXJ0CEKah7h8gHtEtZSQvnmqbKwxuhMIrgGyHjTkZQEaWgg7wsT mdUp5SS5ELrqAc/B8svzdVaZQ4mdVm+2b37HGNVdJMizd4ZscaidE08Qi8TAhMIjA8NN tLsWK+h2jUAtu9MGtsVokS1w32afE8Bvc2QWFfpSoSimot+GB3C2PE4/tcobLfmjLqNz b6BKJKHGbICkRKfcVcgJZ3d/oKO9GVsCb1aOZz0UjGQkqApqmpfseGH9Km15ZG8/6QSg I4GA== X-Forwarded-Encrypted: i=1; AJvYcCUaIMi06L2ZyjTI1cJ2U4yPZo6qXYHXLjSzTUQUjAUB5gv8XMWFkG0WjlAsQcCZ+wsYQ4K6lV1ClcuNEoQ=@lists.ozlabs.org X-Gm-Message-State: AOJu0YxvydKZwrrsHgknPK1Yge4VkFmpd+orlCp7OMvYOueikiRH/UF3 BtDFZWVn8i+IXmslfElOFbTY7f4tptFgKhAk50EDU3XfOh+vBDVMnGPT8IgPo/6FUfXycJ/Bouo pFmUMUetHJzoDq0BXZZVt5FS0WMJkMPSxYe2adoq5kI0SOMwW2bzvY0X2I10NrjkKdTkjp5/iZB A= X-Gm-Gg: ASbGncup8WM3FkTpWdLcegyqV+7qjOBR7e+xWmnb8VaZQiYVsiv9Xj7Xn4y/0kjyQx3 hvfEjf8dTMtgAHaIj981Je5jTb33uaSLDj4CQxJir4sCxu+GDrxk0iuiMrcccdFF61TLAm/KmyD c+Qs3ALLTxuraxn/2Rnv0DJsFvSgEOBA90jOPv+W+mUU9N+n4G9uLHGROGH9i5CxGVqtBswEX9w Wn7U47G4xeBaUCaMUseo3amoJPD4bxGeCNa9CDuB3gYd46xTtDa8ui5OaLMUHtPGEWvvxfS/MtJ lB4v8VmwZrh8svSyY0ozOs7jyLKD+RazzfGo/4Hfe/U= X-Received: by 2002:a05:6871:2987:b0:331:ebe7:f673 with SMTP id 586e51a60fabf-33bb49655aamr3564718fac.9.1758556908232; Mon, 22 Sep 2025 09:01:48 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEGz5/rWt2sNidlRW6kyh8a7NA1uRncwrteUD3AJv60SOcvoVzWhVpJzASNVa+IT+VYZbfD6A== X-Received: by 2002:a05:6871:2987:b0:331:ebe7:f673 with SMTP id 586e51a60fabf-33bb49655aamr3564695fac.9.1758556907521; Mon, 22 Sep 2025 09:01:47 -0700 (PDT) Received: from redhat.com ([38.15.36.11]) by smtp.gmail.com with ESMTPSA id 586e51a60fabf-336e411fcb8sm7901202fac.8.2025.09.22.09.01.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Sep 2025 09:01:46 -0700 (PDT) Date: Mon, 22 Sep 2025 10:01:43 -0600 From: Alex Williamson To: Timothy Pearson Cc: kvm , linuxppc-dev Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices Message-ID: <20250922100143.1397e28b.alex.williamson@redhat.com> In-Reply-To: <537354829.1740670.1758396303861.JavaMail.zimbra@raptorengineeringinc.com> References: <663798478.1707537.1757450926706.JavaMail.zimbra@raptorengineeringinc.com> <20250919125603.08f600ac.alex.williamson@redhat.com> <1916735949.1739694.1758315074669.JavaMail.zimbra@raptorengineeringinc.com> <20250919162721.7a38d3e2.alex.williamson@redhat.com> <537354829.1740670.1758396303861.JavaMail.zimbra@raptorengineeringinc.com> X-Mailer: Claws Mail 4.3.1 (GTK 3.24.43; x86_64-redhat-linux-gnu) X-Mailing-List: linuxppc-dev@lists.ozlabs.org List-Id: List-Help: List-Owner: List-Post: List-Archive: , List-Subscribe: , , List-Unsubscribe: Precedence: list MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 24HzJ17S21fe7lwg_UjkbYKoyFXNzPqaocghDRU70Is_1758556910 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit On Sat, 20 Sep 2025 14:25:03 -0500 (CDT) Timothy Pearson wrote: > ----- Original Message ----- > > From: "Alex Williamson" > > To: "Timothy Pearson" > > Cc: "kvm" , "linuxppc-dev" > > Sent: Friday, September 19, 2025 5:27:21 PM > > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices > > > On Fri, 19 Sep 2025 15:51:14 -0500 (CDT) > > Timothy Pearson wrote: > > > >> ----- Original Message ----- > >> > From: "Alex Williamson" > >> > To: "Timothy Pearson" > >> > Cc: "kvm" , "linuxppc-dev" > >> > Sent: Friday, September 19, 2025 1:56:03 PM > >> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices > >> > >> > On Tue, 9 Sep 2025 15:48:46 -0500 (CDT) > >> > Timothy Pearson wrote: > >> > > >> >> PCI devices prior to PCI 2.3 both use level interrupts and do not support > >> >> interrupt masking, leading to a failure when passed through to a KVM guest on > >> >> at least the ppc64 platform, which does not utilize the resample IRQFD. This > >> >> failure manifests as receiving and acknowledging a single interrupt in the guest > >> >> while leaving the host physical device VFIO IRQ pending. > >> >> > >> >> Level interrupts in general require special handling due to their inherently > >> >> asynchronous nature; both the host and guest interrupt controller need to > >> >> remain in synchronization in order to coordinate mask and unmask operations. > >> >> When lazy IRQ masking is used on DisINTx- hardware, the following sequence > >> >> occurs: > >> >> > >> >> * Level IRQ assertion on host > >> >> * IRQ trigger within host interrupt controller, routed to VFIO driver > >> >> * Host EOI with hardware level IRQ still asserted > >> >> * Software mask of interrupt source by VFIO driver > >> >> * Generation of event and IRQ trigger in KVM guest interrupt controller > >> >> * Level IRQ deassertion on host > >> >> * Guest EOI > >> >> * Guest IRQ level deassertion > >> >> * Removal of software mask by VFIO driver > >> >> > >> >> Note that no actual state change occurs within the host interrupt controller, > >> >> unlike what would happen with either DisINTx+ hardware or message interrupts. > >> >> The host EOI is not fired with the hardware level IRQ deasserted, and the > >> >> level interrupt is not re-armed within the host interrupt controller, leading > >> >> to an unrecoverable stall of the device. > >> >> > >> >> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices. > >> > > >> > I'm not really following here. It's claimed above that no actual state > >> > change occurs within the host interrupt controller, but that's exactly > >> > what disable_irq_nosync() intends to do, mask the interrupt line at the > >> > controller. > >> > >> While it seems that way on the surface (and this tripped me up > >> originally), the actual call chain is: > >> > >> disable_irq_nosync() > >> __disable_irq_nosync() > >> __disable_irq() > >> irq_disable() > >> > >> Inside void irq_disable(), __irq_disable() is gated on > >> irq_settings_disable_unlazy(). The lazy disable is intended to *not* > >> touch the interrupt controller itself, instead lazy mode masks the > >> interrupt at the device level (DisINT+ registers). If the IRQ is set > >> up to run in lazy mode, the interrupt is not disabled at the actual > >> interrupt controller by disable_irq_nosync(). > > > > What chip handler are you using? The comment above irq_disable > > reiterates the behavior, yes if the chip doesn't support irq_disable it > > marks the interrupt masked but leaves the hardware unmasked. It does > > not describe using DisINTx to mask the device, which would be at a > > different level from the chip. In this case __irq_disable() just calls > > irq_state_set_disabled(). Only with the change proposed here would we > > also call mask_irq(). > > This is all tested on a POWER XIVE controller, so arch/powerpc/sysdev/xive (CONFIG_PPC_XIVE_NATIVE=y) > > >> > The lazy optimization that's being proposed here should > >> > only change the behavior such that the interrupt is masked at the > >> > call to disable_irq_nosync() rather than at a subsequent > >> > re-assertion of the interrupt. In any case, enable_irq() should > >> > mark the line enabled and reenable the controller if necessary. > >> > >> If the interrupt was not disabled at the controller, then reenabling > >> a level interrupt is not guaranteed to actually do anything (although > >> it *might*). The hardware in the interrupt controller will still > >> "see" an active level assert for which it fired an interrupt without > >> a prior acknowledge (or disable/enable cycle) from software, and can > >> then decide to not re-raise the IRQ on a platform-specific basis. > >> > >> The key here is that the interrupt controllers differ somewhat in > >> behavior across various architectures. On POWER, the controller will > >> only raise the external processor interrupt once for each level > >> interrupt when that interrupt changes state to asserted, and will > >> only re-raise the external processor interrupt once an acknowledge > >> for that interrupt has been sent to the interrupt controller hardware > >> while the level interrupt is deasserted. As a result, if the > >> interrupt handler executes (acknowledging the interrupt), but does > >> not first clear the interrupt on the device itself, the interrupt > >> controller will never re-raise that interrupt -- from its > >> perspective, it has issued another IRQ (because the device level > >> interrupt was left asserted) and the associated handler has never > >> completed. Disabling the interrupt causes the controller to reassert > >> the interrupt if the level interrupt is still asserted when the > >> interrupt is reenabled at the controller level. > > > > This sounds a lot more like the problem than the previous description. > > Apologies for that. There's a lot of moving parts here and I guess I muddled it all up in the first description. > > > Is the actual scenario something like the irq is marked disabled, the > > eventfd is delivered to userspace, userspace handles the device, the > > interrupt is de-asserted at the device, but then the device re-asserts > > the interrupt before the unmask ioctl, causing the interrupt chip to > > mask the interrupt, then enable_irq() from the unmask ioctl doesn't > > reassert the interrupt? > > That is exactly it, yes. This particular scenario only occurs on old > pre-PCI 2.3 devices that advertise DisINT- ; newer devices that > advertise DisINT+ don't trip the fault since the host interrupt > handler sets the device interrupt mask (thus deasserting the IRQ at > the interrupt controller) before exiting. > > >> On other platforms the external processor interrupt itself is > >> disabled until the interrupt handler has finished, and the > >> controller doesn't auto-mask the level interrupts at the hardware > >> level; instead, it will happily re-assert the processor interrupt > >> if the interrupt was not cleared at the device level after IRQ > >> acknowledge. I suspect on those platforms this bug may be masked > >> at the expense of a bunch of "spurious" / unwanted interrupts if > >> the interrupt handler hasn't acked the interrupt at the device > >> level; as long as the guest interrupt handler is able to somewhat > >> rapidly clear the device interrupt, performance won't be impacted > >> too much by the extra interrupt load, further hiding the bug on > >> these platforms. > > > > It seems this is the trade off the lazy handling makes > > intentionally, we risk some spurious interrupts while the line is > > disabled to avoid poking the hardware. So long as we're not > > triggering the spurious interrupt handler to permanently disabling > > the interrupt line, this is a valid choice. > > At least for some platforms, yes, though it's not exactly clear in > the documentation that this is intentional or that the side effect > exists at all. > > > That's also a consideration for this patch, we're slowing down all > > non-PCI2.3 INTx handling for the behavior of this platform. Is > > there some behavior of the interrupt controller we can query to > > know which path to use? Can this issue be seen with the irqfd INTx > > handling disabled on other architectures? Do we actually care > > whether we're making old INTx devices slower? Thanks, > > I honestly don't know. In reality, we're talking about amd64 and > ppc64el as the only two platforms that I know of that both support > VFIO and such old PCIe INTX pre-2.3 devices, and their respective > interrupt controller handling in a spurious INTX IRQ context is very > different from one another. > > Personally, I'd argue that such old devices were intended to work > with much slower host systems, therefore the slowdown probably > doesn't matter vs. being more correct in terms of interrupt handling. > In terms of general kernel design, my understanding has always been > is that best practice is to always mask, disable, or clear a level > interrupt before exiting the associated IRQ handler, and the current > design seems to violate that rule. In that context, I'd personally > want to see an argument as to why echewing this traditional IRQ > handler design is beneficial enough to justify making the VFIO driver > dependent on platform-specific behavior. Yep, I kind of agree. The unlazy flag seems to provide the more intended behavior. It moves the irq chip masking into the fast path, whereas it would have been asynchronous on a subsequent interrupt previously, but the impact is only to ancient devices operating in INTx mode, so as long as we can verify those still work on both ppc and x86, I don't think it's worth complicating the code to make setting the unlazy flag conditional on anything other than the device support. Care to send out a new version documenting the actual sequence fixed by this change and updating the code based on this thread? Note that we can test non-pci2.3 mode for any device/driver that supports INTx using the nointxmask=1 option for vfio-pci and booting a linux guest with pci=nomsi. Thanks, Alex