From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-wm1-f43.google.com (mail-wm1-f43.google.com [209.85.128.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BFA6D17E0 for ; Sat, 22 Mar 2025 12:02:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.43 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742644942; cv=none; b=OmxXUVFMXmPg5kXrzF/TMW04VqQmohyzRw8HGOrgyZQhNN5e52A2WM8Dp9PFq9lB3SKzBoGYnaFw6F6dpebnA2NCAkuOuJOLo5sAGnVWFHyaO5AXerSWxQaIvBfv9M1HGd8uM9Qh/2T4nz1Kds22SeAZUNEReyud4Lr1TGOsCnA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742644942; c=relaxed/simple; bh=dJbr/+xbd2tD9DFz+WgFGuMH+D5ajrVlrfrLESKBtaE=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=F8i7yezQNHV5FzR2gfGzbWwTPnLcwRpbeQCWoApJpCqOF8X97ktEMz1CNneu/AybOJDcDlaGw3VkXdjLOk5dHTuxP5Tf2LcLwhw8IuxTFB57I5TRbZkq7qdJ2xTvtv1D+BHQMpyUt/x23ZrBAaceewqfnMas2j0LlVt11JdxdS0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=LRmr/0z4; arc=none smtp.client-ip=209.85.128.43 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="LRmr/0z4" Received: by mail-wm1-f43.google.com with SMTP id 5b1f17b1804b1-4394a823036so28258505e9.0 for ; Sat, 22 Mar 2025 05:02:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1742644939; x=1743249739; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:subject:cc:to:from:date:from:to:cc:subject:date :message-id:reply-to; bh=VYbsuPyU4m6K394Fk+cbUBhjL1OliK8MPdiQnerVR+k=; b=LRmr/0z4aBHWprHAHRWXJUp/aIJLE9v0JivMA8UVJBYjP/e5pZs5Xi8odaX8Q8+JXA zFaUcRE9B8hEArpvK+kFRDjwSWMuto4bOuG/d+a7aOVMo38rNOuaxkaxlE3cbZftHdt0 BA6yclLqrOo35foj+9Raj///kJ5kJiP+3rDS03LBtRSKw7Ah9i1g/StrYCSv+qvuispf 2snFkBbGGfLoZ3ZOCpVilw+y00nMVpB6W5ZGtV5B6vESyRr5X9HrhBaf1JvD8yYR1jBq k3m7ZVYUSjPi22mg/j2Fuc+5/KVDbdnsXvfoVrg5arvZbMwI6Jczs9HaruGrwph+MP4X evjQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742644939; x=1743249739; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:subject:cc:to:from:date:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=VYbsuPyU4m6K394Fk+cbUBhjL1OliK8MPdiQnerVR+k=; b=k+2+ucvIjQblJsYKKXmqyEVnHk4TsfsSyw0wjlJXBrkJj0aoYdnIeX6IPybVw8aoH3 glxDLCu4IYPBVwbbAQSYy2RrCXKDIEH//dRMVO1p4Li6Cy7k4b1Nfp+clDY2No3bguvS 0zr7eLKMb40qjHy5GZ1o7j048GobG+iRntSk4dK+85ecwa0SXp8zc/eQ04PX71/PawFU OmGFKnELkltoKcACrktj9rD+wUrUYoASuyt+lS23xRpaL/gkUU29xAKEfps/ZBlv6Fm9 w8vjCEnfkeyr4ebF/5tN2vrPACR/DlUu96FA/M169/VcqYzP4a7xDfsGawLTVGoR19KQ QpDg== X-Forwarded-Encrypted: i=1; AJvYcCX48g+KyjRV/oRastRNy6tA3Qx0MuK209tZCo0Wn0aj5hyY2MTAnhJJBkqh3ypgsQ5DnXoFafsue7BcRzA=@vger.kernel.org X-Gm-Message-State: AOJu0YyBqqyGzwjlY1w6vo5l35esZt3x8PNK1efPoEC+/F1ZDwyzK7RX MiUa7/6CPeBVX+qDLpid4RT7lh8Xm9qLi0unU8f4CdZCDxhwgbze X-Gm-Gg: ASbGncuhEqUit9rrmvZrPr3wZ0HM51u5UwtEfYMJ7MKkMDuZ+PmtZKoOkhu17BqRsEN gENIImHFltHQey/+pWRx2y3bMiYmkxrvFhEWdrCUQtfTUXj4LmMEquPyrvZle3e+N2dCr7okZuO Kaneh+dHrMXACBnmor8d33+Z1NdCkyVXrrvU/qSQkx/FHAHRXe9s5Zf/pnQpeS2el/YP3NBNRKQ aifhonif8l4EljCVTrT7/0lcaz7cIr6keTI7OfIJLXuUD1NKtLHboQfHoXVvNL0h4Pl58LvjTT9 4W0rz2KokEyz9qYAmEENIRKJQGyOLD4zsJ1FsNxQOcJRrko4Hq7JFEBQpm+sGXpow+DlnNK2l4n QS5m7Mac= X-Google-Smtp-Source: AGHT+IHawvyD/6qm9CmF5fyE4A/k5X8DwcgLZa0UiCzozXfHeWI1TD8+7xmEItYKcPg/g14Dna16mg== X-Received: by 2002:a05:600c:4f55:b0:439:9b2a:1b2f with SMTP id 5b1f17b1804b1-43d568cdf1bmr43793395e9.3.1742644938659; Sat, 22 Mar 2025 05:02:18 -0700 (PDT) Received: from pumpkin (82-69-66-36.dsl.in-addr.zen.co.uk. [82.69.66.36]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-43d440ed5e0sm104921415e9.37.2025.03.22.05.02.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 22 Mar 2025 05:02:18 -0700 (PDT) Date: Sat, 22 Mar 2025 12:02:13 +0000 From: David Laight To: Mateusz Guzik Cc: Linus Torvalds , x86@kernel.org, hkrzesin@redhat.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, olichtne@redhat.com, atomasov@redhat.com, aokuliar@redhat.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store Message-ID: <20250322120213.1420450a@pumpkin> In-Reply-To: References: <20250320190514.1961144-1-mjguzik@gmail.com> X-Mailer: Claws Mail 4.1.1 (GTK 3.24.38; arm-unknown-linux-gnueabihf) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Thu, 20 Mar 2025 21:24:38 +0100 Mateusz Guzik wrote: > On Thu, Mar 20, 2025 at 8:33=E2=80=AFPM Mateusz Guzik = wrote: > > > > On Thu, Mar 20, 2025 at 8:23=E2=80=AFPM Linus Torvalds > > wrote: =20 > > > > > > On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik wrote= : =20 > > > > > > > > Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to= a > > > > 1 byte at a time loop to handle the tail. =20 > > > > > > I definitely do not mind this patch, but I think it doesn't go far en= ough. > > > > > > It gets rid of the byte-at-a-time loop at the end, but only for the > > > short-copy case of 8-63 bytes. > > > =20 > > > > This bit I can vouch for. > > =20 > > > The .Llarge_movsq ends up still doing > > > > > > testl %ecx,%ecx > > > jne .Lcopy_user_tail > > > RET > > > > > > and while that is only triggered by the non-ERMS case, that's what > > > most older AMD CPU's will trigger, afaik. > > > =20 > > > > This bit I can't. > > > > Per my other e-mail it has been several years since I was seriously > > digging in the area (around 7 by now I think) and details are rather > > fuzzy. > > > > I have a recollection that handling the tail after rep movsq with an > > overlapping store was suffering a penalty big enough to warrant a > > "normal" copy instead, avoiding the just written to area. I see my old > > routine $elsewhere makes sure to do it. I don't have sensible hw to > > bench this on either at the moment. > > =20 >=20 > So I did some testing on Sapphire Rapids vs movsq (it is an > FSRM-enabled sucker so I had to force it to use it) and the penalty I > remembered is still there. >=20 > I patched read1 from will-it-scale to do 128 and 129 byte reads. >=20 > On the stock kernel I get about 5% throughput drop rate when adding > just the one byte. >=20 > On a kernel patched to overlap these with prior movsq stores I get a 10% = drop. >=20 > While a CPU without ERMS/FSRM could have a different penalty, this > very much lines up with my prior experiments back in the day, so I'm > gonna have to assume this still is still a problem for others. A different penalty for ERMS/FSRM wouldn't surprise me. If you tested on an Intel cpu it probably supported ERMS/FSRM (you have to go back to 'core-2' to not have the support). So I suspect you'd need to have tested an AMD cpu for it to matter. And the report of slow misaligned writes was for a fairly recent AMD cpu - I think one that didn't report ERMS/FSRM support. I've only got a zen-5, although I might 'borrow' a piledriver from work. Definitely no access to anything amd in between. I do wonder if the 'medium length' copy loop is needed at all. The non ERMS/FSRM case needs 'short copy' and aligned 'rep movsq' copies, but is the setup cost for 'rep movsq' significant enough for the 'copy loop' to be worth while? I also suspect the copy loop is 'over unrolled'. The max throughput is 8 bytes/clock and Intel cpu will execute 2 clock loops, so if you can reduce the loop control (etc) instructions to ones that will run in two clocks you can do 16 bytes per iteration. That should be achievable using negative offsets from the end of the buffer - the loop control is 'add $16,%rcx; js 10b'. (I've mentioned that before ...) I've just done some more searching and found a comment that FRMS is very slow on zen3 for misaligned destinations. (Possibly to the level of dropping back to byte copies.) That means that the 'rep movsq' also needs to be aligned. Which is what started all this. I'm going to have to see if my Sandy bridge system still works. David >=20 > So I stand by only patching the short range for the time being. >=20 > Note that should someone(tm) rework this to use overlapping stores > with bigger ranges (like memcpy), this will become less of an issue as > there will be no per-byte copying.