Date: Tue, 21 Oct 2025 13:37:51 -0700
From: Jacob Pan
To: Robin Murphy
Cc: Mostafa Saleh, linux-kernel@vger.kernel.org, "iommu@lists.linux.dev",
    Will Deacon, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
    Alexander Grest, Jason Gunthorpe
Subject: Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
Message-ID: <20251021133751.0000418f@linux.microsoft.com>
In-Reply-To: <336c5a8a-ac2c-4e4c-b2f9-a0bc056aa498@arm.com>
References: <20250924175438.7450-1-jacob.pan@linux.microsoft.com>
 <20251017135145.GL3901471@nvidia.com>
 <20251017095031.00000b84@linux.microsoft.com>
 <20251020120240.GI316284@nvidia.com>
 <20251020115710.0000258b@linux.microsoft.com>
 <336c5a8a-ac2c-4e4c-b2f9-a0bc056aa498@arm.com>

On Tue, 21 Oct 2025 12:45:48 +0100 Robin Murphy wrote:

> On 2025-10-20 7:57 pm, Jacob Pan wrote:
> > On Mon, 20 Oct 2025 09:02:40 -0300
> > Jason Gunthorpe wrote:
> >
> >> On Fri, Oct 17, 2025 at 09:50:31AM -0700, Jacob Pan wrote:
> >>> On Fri, 17 Oct 2025 10:51:45 -0300
> >>> Jason Gunthorpe wrote:
> >>>
> >>>> On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
> >>>>> On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
> >>>>>> Hi Will et al,
> >>>>>>
> >>>>>> These two patches are derived from testing SMMU driver with
> >>>>>> smaller CMDQ sizes where we see soft lockups.
> >>>>>>
> >>>>>> This happens on HyperV emulated SMMU v3 as well as baremetal
> >>>>>> ARM servers with artificially reduced queue size and
> >>>>>> microbenchmark to stress test concurrency.
> >>>>>
> >>>>> Is it possible to share what are the artificial sizes and does
> >>>>> the HW/emulation support range invalidation (IRD3.RIL)?
> >>>>>
> >>>>> I'd expect it would be really hard to overwhelm the command
> >>>>> queue, unless the HW doesn't support range invalidation and/or
> >>>>> the queue entries are close to the number of CPUs.
> >>>>
> >>>> At least on Jacob's system there is no RIL and there are 72/144
> >>>> CPU cores potentially banging on this.
> >>>>
> >>>> I think it is combination of lots of required invalidation
> >>>> commands, low queue depth and slow retirement of commands that
> >>>> make it easier to create a queue full condition.
> >>>>
> >>>> Without RIL one SVA invalidation may take out the entire small
> >>>> queue, for example.
> >>> Right, no range invalidation and queue depth is 256 in this case.
> >>>
> >>
> >> I think Robin is asking you to justify why the queue depth is 256
> >> when ARM is recommending much larger depths specifically to fix
> >> issues like this?
> > The smaller queue depth is chosen for CMD_SYNC latency reasons. But
> > I don't know the implementation details of HyperV and host SMMU
> > driver.
>
> TBH that sounds highly dubious. The only way I could imagine CMDQ
> size bearing any relation to CMD_SYNC at all is if a hypervisor is
> emulating a stage 1 vCMDQ in such a naive and lazy manner that a)
> performance is already off the table, and b) it has a good chance of
> being broken anyway.
>
> For the hardware to actually process, say, 1023 invalidations
> followed by a sync takes as long as it takes, based on how busy the
> SMMU is. The only difference in issuing that sequence of commands on
> a 256-entry queue vs. a 1024-entry queue is that in the latter case,
> software does not have to sit waiting for the first 768 to actually
> be consumed before it can finish the submission and potentially get
> on with something else until the sync completes. Yes, one could claim
> that technically the time between *issuing* the CMD_SYNC and its
> completion is then lower, but only because that extra time has now
> been wasted in a polling loop waiting for CMDQ space instead - it's a
> meaningless distinction overall.

I agree that a smaller emulated queue size does not change the time it
takes the physical IOMMU to do the invalidation. I am not defending the
current emulation, which I think can be improved over time,
transparently to the guest.

> > IMHO, queue size is orthogonal to what this patch is trying to
> > address, which is a specific locking problem and improved efficiency,
> > e.g. the eliminated cmpxchg:
> >
> > -	do {
> > -		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
> > -	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
> > +	atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
> >
> > Even on BM with restricted queue size, this patch reduces latency of
> > concurrent madvise(MADV_DONTNEED) from multiple CPUs (I tested 32
> > CPUs, cutting 50% latency unmap 1GB buffer in 2MB chunks per CPU).
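For readers without the driver source open, here is a minimal sketch of
where the quoted hunk sits: roughly arm_smmu_cmdq_shared_lock() from
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c as in current mainline, with
the slow path shown before and after the change. The fast path is assumed
unchanged by the series, and the matching change to the exclusive
lock/unlock side (needed so that waiters' increments survive an exclusive
release) is not shown here.

static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
{
	int val;

	/* Fast path: the counter only stays negative while the lock is
	 * held exclusively, so a simple increment usually suffices.
	 */
	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
		return;

	/* Slow path, current mainline: wait for the exclusive holder to
	 * go away, then re-add ourselves with a cmpxchg retry loop.
	 */
	do {
		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);

	/* Slow path with the change applied: the increment above is kept,
	 * so just wait for the exclusive holder to release the lock:
	 *
	 *	atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
	 */
}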
> My point is that a 50% improvement on nonsense is likely still
> nonsense. With only 256 entries, every single one of those 2MB unmaps
> needs to fill the entire CMDQ more than twice over. 32 CPUs all
> jostling to issue about 34x as many commands as the queue can hold
> *each* is a ridiculous level of contention. If a hypervisor is
> advertising an SMMU_IDR1.CMDQS value that is absurdly tiny for the
> size of the VM then that represents such an obvious bottleneck that
> it's hardly mainline Linux's responsibility to maintain code to help
> "debug" it. As for "BM with restricted queue size", like I said, just
> don't do that.

I don't think we are asking mainline Linux to debug our emulation
problem; quite the contrary, this setup helped expose a mainline Linux
bug (patch 1/2 clearly shows that the queue-space contention path has
not been adequately tested).

This madvise test is intended to show:
a) the locking change is functionally sound, with no harm to existing
   mainline users
b) it creates extreme contention that exposes the problems with the
   current code
c) the performance differences

> What is the difference on an un-hacked bare-metal system with a
> normally-sized queue? Is it even measurable?

Not measurable on my BM system with a large cmdq size. The condition
!queue_has_space() is rarely met, so the exclusive lock is almost never
acquired.

> That's what's actually
> interesting. Furthermore, what exactly does that measurement even
> mean?

I agree this is not a measurement of real workload performance, but the
test shows:
a) no more occasional soft lockups, as the shared lock is no longer
   starved
b) the shared lock can be taken quickly, as we get rid of the
   unnecessary cmpxchg

> If we're still issuing the same number of commands I struggle
> to believe we could lose 50% of the *overall* time just bouncing a
> cacheline between shared and exclusive state - is this actually just
> the *maximum* per-CPU latency going down, at the cost of the minimum
> latency correspondingly increasing just as much (if not comparatively
> more) due to better fairness? And if so, how important is that
> really? I can imagine there are equally cases where other callers
> might prefer a lower minimum/mean latency at the price of some longer
> outliers.

The important point is that this change avoided a soft lockup in which
the exclusive lock is taken all the time. It is not about bouncing a
cacheline between shared and exclusive state: I tried flushing the lock
cacheline after the shared lock won the cmpxchg, but it did not help
avoid the lockup.

> Note I'm not saying I'm necessarily against making these changes,
> just that I'm against making them without a believable justification
> that it is actually beneficial to mainline users.

The benefit to mainline users (I assume you mean SMMUs with a large
cmdq size) is, at a minimum, that the shared lock can be taken more
quickly once it is released from the exclusive state.
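A rough sketch of what such a madvise test could look like (illustrative
only, not the actual harness; the thread count, buffer size and chunk
size are just the numbers quoted above, and the SMMU only sees these
invalidations when the process address space is SVA-bound to a device):

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define NR_THREADS	32
#define BUF_SIZE	(1UL << 30)	/* 1 GiB per thread */
#define CHUNK_SIZE	(2UL << 20)	/* 2 MiB per madvise() call */

static void *worker(void *arg)
{
	unsigned long off;
	char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* Fault everything in so there is something to invalidate */
	memset(buf, 0, BUF_SIZE);

	/* Tear the buffer down in 2 MiB chunks; without RIL each chunk
	 * turns into hundreds of per-page invalidations plus a sync.
	 */
	for (off = 0; off < BUF_SIZE; off += CHUNK_SIZE)
		madvise(buf + off, CHUNK_SIZE, MADV_DONTNEED);

	munmap(buf, BUF_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}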
> > -	do {
> > -		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
> > -	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
> > +	atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);

Here is the assembly code diff:

BEFORE:
    27bc:	14000008	b	27dc
		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
    27c0:	93407c00	sxtw	x0, w0
	__CMPWAIT_CASE(w, , 32);
    27c4:	d50320bf	sevl
    27c8:	d503205f	wfe
    27cc:	885f7c22	ldxr	w2, [x1]
    27d0:	4a000042	eor	w2, w2, w0
    27d4:	35000042	cbnz	w2, 27dc
    27d8:	d503205f	wfe
    27dc:	b940ce60	ldr	w0, [x19, #204]
    27e0:	37ffff00	tbnz	w0, #31, 27c0
		} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
    27e4:	11000403	add	w3, w0, #0x1
    27e8:	14000004	b	27f8
	__CMPXCHG_CASE(w, , , 32, )
    27ec:	2a0003e2	mov	w2, w0
    27f0:	88a27c23	cas	w2, w3, [x1]
    27f4:	14000008	b	2814
	__CMPXCHG_CASE( , 32)
    27f8:	93407c04	sxtw	x4, w0
	__CMPXCHG_CASE(w, , , 32, , , , , K)
    27fc:	f9800031	prfm	pstl1strm, [x1]
    2800:	885f7c22	ldxr	w2, [x1]
    2804:	4a040045	eor	w5, w2, w4
    2808:	35000065	cbnz	w5, 2814
    280c:	88057c23	stxr	w5, w3, [x1]
    2810:	35ffff85	cbnz	w5, 2800
    2814:	6b02001f	cmp	w0, w2
    2818:	54fffe21	b.ne	27dc	// b.any

AFTER:
		atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
    27bc:	b940ce60	ldr	w0, [x19, #204]
    27c0:	7100001f	cmp	w0, #0x0
    27c4:	5400006d	b.le	27d0

> Thanks,
> Robin.