Date: Tue, 30 Jun 2026 18:30:41 +0000
From: Pranjal Shrivastava
To: Mostafa Saleh
Cc: Nicolin Chen, will@kernel.org, robin.murphy@arm.com, jgg@nvidia.com,
 joro@8bytes.org, kees@kernel.org, baolu.lu@linux.intel.com,
 kevin.tian@intel.com, miko.lenczewski@arm.com,
 linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
 linux-kernel@vger.kernel.org, stable@vger.kernel.org, jamien@nvidia.com
Subject: Re: [PATCH rc v7 0/7] iommu/arm-smmu-v3: Fix device crash on kdump kernel

On Tue, Jun 30, 2026 at 03:33:12PM +0000, Mostafa Saleh wrote:
> On Tue, Jun 30, 2026 at 02:51:40PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jun 30, 2026 at 01:17:30PM +0000, Mostafa Saleh wrote:
> > > On Mon, Jun 29, 2026 at 11:15:33PM -0700, Nicolin Chen wrote:
> > > > When transitioning to a kdump kernel, the primary kernel might have crashed
> > > > while endpoint devices were actively bus-mastering DMA. Currently, the SMMU
> > > > driver aggressively resets the hardware during probe by clearing CR0_SMMUEN
> > > > and setting the Global Bypass Attribute (GBPA) to ABORT.
> > > >
> > > > In a kdump scenario, this aggressive reset is highly destructive:
> > > > a) If GBPA is set to ABORT, in-flight DMA will be aborted, generating fatal
> > > > PCIe AER or SErrors that may panic the kdump kernel
> > >
> > > Can you please clarify more on those errors, what conditions will
> > > trigger that?
> > > For example, patch 4 disables the EVTQ to avoid events as there might
> > > be a lot, why are they not fatal also?
> > >
> > > > b) If GBPA is set to BYPASS, in-flight DMA targeting some IOVAs will bypass
> > > > the SMMU and corrupt the physical memory at those 1:1 mapped IOVAs.
> > > >
> > > > To safely absorb in-flight DMA, the kdump kernel must leave SMMUEN=1 intact
> > > > and avoid modifying STRTAB_BASE. This allows HW to continue translating in-
> > > > flight DMA using the crashed kernel's page tables until the endpoint device
> > > > drivers probe and quiesce their respective hardware.
> > > >
> > > > However, the ARM SMMUv3 architecture specification states that updating the
> > > > SMMU_STRTAB_BASE register while SMMUEN == 1 is UNPREDICTABLE or ignored.
> > > >
> > > > This leaves a kdump kernel no choice but to adopt the stream table from the
> > > > crashed kernel.
> > >
> > > In many cases the patches assume that the CDs/STE might be corrupted,
> > > but still attempt to retrieve them with some validation
> > > (log2size/split...)
> > > However, the base address might be broken, TLBs state is unknown...
> > >
> > > IMO, although that might improve the status quo, there are still
> > > heuristics, in addition to noticeable complexity to transition the
> > > stream tables. I wonder if FW can deal with AER in that case before
> > > booting the kdump kernel.
> >
> > I guess we're reading the base address from the HW register itself so
> > that should be fine? CDs are in-memory so that's why they could be
> > corrupted?
>
> For example patch#1 verifies log2size and split and both are read
> from HW registers.
> Same for the base address or other addresses as
> the page tables, they might be corrupted due to a buggy driver.
> My point is that, it is really hard to assume that the previous state
> of registers/STE/page-tables were valid or even consistent, when the
> kernel crashed and did not transition the state gracefully.
>
> >
> > About the TLB state, I'm not sure what might pollute it, since this is a
> > kexec, I don't expect any non-kernel entity to gain program control
> > before the kdump kernel.. Hence, IMO, we can't configure FW to deal with
> > AER here..
>
> Similarly for TLBs, the kernel might have panicked in the middle of an
> unmap or free domain. (not to mention what that means for RPM where
> a device reset with unknown TLBs?
>
> Why can't the FW deal with it?

The FW can't handle it because there's no FW-based handoff in a kexec
from the main kernel -> kdump, hence we can't set up a handler in FW.

> As I mentioned above in the previous
> reply I am not sure I understand what situation leads into this, when
> does a device trigger SError to the system vs when not which is observed
> as an event in that case.
>

Ack. I see what you mean now. How does a DMA fault raise an SError?
I'm guessing the HW (PCIe RC) is wired to raise an SError on an error?
But I agree that sounds pretty unusual: why should a DMA abort panic
the CPU?

Is the DMA happening between a platform device & a PCIe EP? Even if
that's true, why would it raise an SError (no CPU was involved)? Unless
we have a fabric that raises an SError on a SLVERR or something.

Thanks,
Praan