Date: Thu, 27 Nov 2025 09:58:01 +0000
From: david laight
To: Mateusz Guzik
Cc: x86@kernel.org, glx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, olichtne@redhat.com, atomasov@redhat.com, aokuliar@redhat.com
Subject: Re: performance anomaly in rep movsq/movsb as seen on Sapphire Rapids executing sync_regs()
Message-ID: <20251127095801.0473d641@pumpkin>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 27 Nov 2025 07:55:27 +0100
Mateusz Guzik wrote:

> Sapphire Rapids has both ERMS (of course) and FSRM.
>
> sync_regs() runs into a corner case where both rep movsq and rep movsb
> suffer massive penalty for being used to copy 168 bytes, which clear
> itself when data is copied by a bunch of movq instead.
>
> I verified the issue is not present on AMD EPYC 9454, I don't know about
> other Intel CPUs.

On pretty much all Intel CPUs 'rep movsb' and 'rep movsq' seem to be
implemented in the same hardware - so the length in the 'q' case is just
multiplied by 8.
(That goes all the way back to Sandy Bridge.)

I'm guessing all the copies are at the same page alignment?

I found some strange alignment-related issues on a Zen 5 CPU.
Mostly neither the source nor the destination alignment made much difference.
(Apart from, IIRC, 64-byte aligning the destination, which doubled throughput.)
But some copies were horribly slow.
It was something like copies where the page offset of the destination was
less than 64 bytes from the page offset of the src and the src wasn't on
a page boundary (the byte alignment wasn't relevant).

I wonder if Sapphire Rapids has some similar perversion?

Or is that one of the big/little CPUs where most of the cores are actually
Atom ones - which may not have either ERMS or FSRM?

I need to rerun those tests using data dependencies instead of lfence
and get a much better estimate of the instruction setup time.
But I lack old AMD and new Intel hardware.

David
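
For anyone wanting to poke at this from userspace, here is a minimal C sketch of the two points above: the same 168-byte copy done with 'rep movsb' (count 168) and 'rep movsq' (count 21), plus a helper for checking where source and destination sit within their pages. It is not the kernel's sync_regs() code. The helper names (copy_rep_movsb, copy_rep_movsq, page_offset) are made up for illustration, 4 KiB pages are assumed, and the inline asm assumes GCC or Clang on x86-64 with the direction flag clear (as the ABI requires). Other targets fall back to memcpy() so that the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096u	/* assumption: 4 KiB pages */

/* Offset of an address within its (assumed 4 KiB) page. */
static size_t page_offset(const void *p)
{
	return (size_t)((uintptr_t)p & (PAGE_SIZE_ASSUMED - 1));
}

/* Copy 'bytes' bytes with 'rep movsb'; the count register holds bytes. */
static void copy_rep_movsb(void *dst, const void *src, size_t bytes)
{
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__ volatile("rep movsb"
			 : "+D"(dst), "+S"(src), "+c"(bytes)
			 :
			 : "memory");
#else
	memcpy(dst, src, bytes);	/* fallback for non-x86-64 builds */
#endif
}

/* Same copy with 'rep movsq'; the count register holds 8-byte units. */
static void copy_rep_movsq(void *dst, const void *src, size_t bytes)
{
	size_t qwords;

	assert(bytes % 8 == 0);		/* movsq moves whole quadwords only */
	qwords = bytes / 8;		/* e.g. 168 bytes -> 21 quadwords */
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__ volatile("rep movsq"
			 : "+D"(dst), "+S"(src), "+c"(qwords)
			 :
			 : "memory");
#else
	memcpy(dst, src, qwords * 8);	/* fallback for non-x86-64 builds */
#endif
}
```

Logging page_offset(src) and page_offset(dst) for each timed copy would show whether the slow cases share a page alignment. This sketch does no timing itself.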