Date: Thu, 20 Mar 2025 21:31:45 +0000
From: David Laight
To: Mateusz Guzik
Cc: Herton Krzesinski, x86@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, olichtne@redhat.com, atomasov@redhat.com, aokuliar@redhat.com
Subject: Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Message-ID: <20250320213145.6d016e21@pumpkin>
References: <20250320142213.2623518-1-herton@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 20 Mar 2025 19:02:21 +0100
Mateusz Guzik wrote:

> On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski wrote:
> >
> > On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik wrote:
...
> > > That said, have you experimented with aligning the target to 16 bytes
> > > or more bytes?
> >
> > Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
> > and got no improvement at least in the specific benchmark I'm doing here.
> > Also after your question here I tried 16-byte/32-byte on the AMD cpu as
> > well and got no difference from the 8-byte alignment, same bench as well.
> > I tried to do 8-byte alignment for the ERMS case on Intel and got no
> > difference on the systems I tested.
> > I'm not saying it may not improve in
> > some other case, just that in my specific testing I couldn't tell/measure
> > any improvement.
> >
>
> oof, I would not go as far back as Sandy Bridge. ;)

It is a boundary point.
Agner's tables (fairly reliable) have:
Sandy Bridge
Page 222
MOVS      5      4
REP MOVS  2n     1.5 n  worst case
REP MOVS  3/16B  1/16B  best case
which is the same as Ivy Bridge - which you'd sort of expect since
Ivy Bridge is a minor update, Agner's tables have the same values for it.
Haswell jumps to 1/32B.

I didn't test Sandy Bridge (I've got one, powered off), but did test Ivy Bridge.
Neither the source nor destination alignment made any difference at all.

As I said earlier the only alignment that made any difference was 32-byte
aligning the destination on Haswell (and later).
That is needed to get 32 bytes/clock rather than 16 bytes/clock.

>
> I think Skylake is the oldest yeller to worry about, if one insists on it.
>
> That said, if memory serves right these bufs like to be misaligned to
> weird extents, it very well may be in your tests aligning to 8 had a
> side effect of aligning it to 16 even.
>
> > >
> > > Moreover, I have some recollection that there were uarchs with ERMS
> > > which also liked the target to be aligned -- as in perhaps this should
> > > be done regardless of FSRM?

Dunno, the only report is some AMD cpu being slow with misaligned writes.
But that is the copy loop, not 'rep movsq'.
I don't have one to test.

> >
> > Where I tested I didn't see improvements but may be there is some case,
> > but I didn't have any.
> >
> > >
> > > And most importantly memset, memcpy and clear_user would all use a
> > > revamp and they are missing rep handling for bigger sizes (I verified
> > > they *do* show up). Not only that, but memcpy uses overlapping stores
> > > while memset just loops over stuff.
> > >
> > > I intended to sort it out long time ago and maybe will find some time
> > > now that I got reminded of it, but I would be delighted if it got
> > > picked up.
> > >
> > > Hacking this up is just some screwing around, the real time consuming
> > > part is the benchmarking so I completely understand if you are not
> > > interested.
> >
> > Yes, the most time you spend is on benchmarking. May be later I could
> > try to take a look but will not put any promises on it.

I found I needed to use the performance counter to get a proper cycle count.
But then directly read the register to avoid all the 'library' overhead.
Then add lfence/mfence both sides of the cycle count read.
After subtracting the overhead of a 'null function' I could measure the
number of clocks each operation took.
So could tell when I was actually getting 32 bytes copied per clock.
(Or testing the ip checksum code the number of bytes/clock - can get to 12).

	David

> >
>
> Now I'm curious enough what's up here. If I don't run out of steam,
> I'm gonna cover memset and memcpy myself.
>
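
For reference, the measurement recipe described above (fenced counter read,
subtract a 'null function', derive bytes/clock) might be sketched roughly as
below. This is only an illustration under stated assumptions, not the code
that was actually used: all names (read_cycles, null_op, copy_op, min_cycles,
bytes_per_clock) are hypothetical, RDTSC stands in for the performance
counter read (it counts reference cycles rather than core clocks, so the
absolute numbers are only approximate), only lfence is used, and taking the
minimum over several runs is an added assumption.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/*
 * Fenced timestamp read. ASSUMPTION: RDTSC replaces the performance
 * counter read from the message; it counts reference cycles, not core clocks.
 */
static inline uint64_t read_cycles(void)
{
	uint64_t t;

	_mm_lfence();	/* keep earlier instructions from drifting past the read */
	t = __rdtsc();
	_mm_lfence();	/* keep later instructions from starting before the read */
	return t;
}
#else
#include <time.h>
/* Fallback so the sketch builds elsewhere: nanoseconds, not cycles. */
static inline uint64_t read_cycles(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

typedef void (*bench_fn)(void *dst, const void *src, size_t len);

/* 'Null function': same call overhead as the real operation, no work. */
__attribute__((noinline)) static void null_op(void *dst, const void *src, size_t len)
{
	(void)dst;
	(void)src;
	(void)len;
}

/* Operation under test; memcpy() is only a placeholder. */
__attribute__((noinline)) static void copy_op(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Smallest observed count over 'runs' calls, to filter out interrupts etc. */
static uint64_t min_cycles(bench_fn fn, void *dst, const void *src, size_t len, int runs)
{
	uint64_t best = UINT64_MAX;

	assert(runs > 0);
	for (int i = 0; i < runs; i++) {
		uint64_t start = read_cycles();

		fn(dst, src, len);

		uint64_t delta = read_cycles() - start;

		if (delta < best)
			best = delta;
	}
	return best;
}

/* Bytes per clock after subtracting the null-function overhead. */
static double bytes_per_clock(size_t len, uint64_t op_cycles, uint64_t null_cycles)
{
	if (op_cycles <= null_cycles)
		return 0.0;	/* measurement noise swallowed the operation */
	return (double)len / (double)(op_cycles - null_cycles);
}
```

Calling min_cycles() once with null_op and once with copy_op on the same
buffers, then passing both results to bytes_per_clock(), gives a figure that
can be compared against the 16 and 32 bytes/clock values mentioned above.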