From: Feng Tang <feng.tang@intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel test robot <yujie.liu@intel.com>,
Frank Denis <j@pureftpd.org>, <oe-lkp@lists.linux.dev>,
<lkp@intel.com>, <linux-kernel@vger.kernel.org>,
<ying.huang@intel.com>, <fengwei.yin@intel.com>
Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression
Date: Tue, 9 May 2023 20:24:47 +0800 [thread overview]
Message-ID: <ZFo7j/M1suY/Ey5M@feng-clx> (raw)
In-Reply-To: <CAHk-=wjMz_=WyfQMx1ebofeq+w6Cr_kRcJ1Xc=D6iKc4My16bQ@mail.gmail.com>
Hi Linus,
On Thu, May 04, 2023 at 11:56:12AM -0700, Linus Torvalds wrote:
> On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And I suspect that the 'blogbench.read_score' thing probably does
> > mostly big reads, and really really liked the cacheline optimizations
> > of 'rep movs' even when it was slow in other things.
>
> Oh, and I went to look at the blogbench code, and it's a
> *horrifically* bad benchmark for this case.
>
> It may actually be a good benchmark for other things - the "look up
> random filenames" etc might be somewhat indicative of real performance
> on some loads.
>
> But the read() system call part is very much *not* good.
>
> It looks like the reason blogbench likes 'rep movs' is the traditional
> reason why memory copy benchmarks tend to like uncached copies: it
> looks like the benchmark doesn't actually even look at the data it
> reads.
>
> It's a common pattern, but it's a bad pattern. It's not very different
> from doing CPU benchmarks using an empty loop:
>
> for (int i = 0; i < 1000000; i++) /*nothing*/;
>
> and then being surprised when the compiler optimized the loop away
> because it doesn't actually *do* anything.
>
> Of course, unlike the empty loop, the kernel will still do the reads,
> but the kernel will do the reads with the assumption that the result
> matters.
>
> And for blogbench, that is *doubly* not the case.
>
> Because it looks like blogbench does a loop over
>
> read_dummy_file(file_name);
>
> which will open that file, and then loop over
>
> #define READ_CHUNK_SIZE 65536
> ...
> static char garbage[READ_CHUNK_SIZE];
> ..
> loop:
> if ((readen = read(fd, garbage, sizeof garbage))
>
> and that variable is very appropriately named indeed. It is very much garbage.
>
> So it really looks like what is going on is that you have
>
> (a) a threaded application (looks like N=100 for reads by default)
>
> (b) doing 'read()' calls into a *shared* buffer in parallel
>
> (c) and never using the buffer for anything else
>
> which means that a memory copy that does non-temporal writes is going
> to look artificially really good, because nobody wants to ever see the
> end result.
>
> Now, the "nobody wants to ever see the end result", might actually be
> at least _partially_ valid in some historical setting, if you were to
> use DMA to then push things out to a network. Back in the olden days,
> that was actually a valid reason to use non-temporal writes, because
> the DMA would have to invalidate the caches anyway.
>
> So that part is wrong, but isn't necessarily *horrifically* wrong.
> It's just bad.
>
> But using one shared destination buffer for N readers is *completely*
> bogus. Any cache-lines would bounce around like crazy for the
> (pointless) sharing. I'd like to call it "false sharing", but it's not
> really false: it's intentionally just using one big shared buffer as a
> sink for all these read() calls.
>
> End result: I think the benchmark scores for reading are entirely
> pointless random numbers, and trying to optimize the kernel for this
> benchmark is exactly the wrong thing to do.
>
> The benchmark doesn't actually do anything half-way valid.
>
> At a *minimum* that benchmark should have different threads using
> different read() buffers.
We tried a debug patch which allocates a dedicated buffer for
each reader thread, run it on the same Cacade Lake platform, and
the regression is _gone_, after the noise of cache false sharing is
reduced.
20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
2119572 ± 2% +1.7% 2155704 ± 2% blogbench.read_score
3700996 ± 9% -7.4% 3427562 ± 16% perf-stat.ps.dTLB-load-misses
8.171e+09 +25.3% 1.024e+10 perf-stat.ps.dTLB-loads
431224 ± 9% -13.1% 374903 ± 12% perf-stat.ps.dTLB-store-misses
2.399e+09 +87.5% 4.497e+09 perf-stat.ps.dTLB-stores
8495507 ± 7% -8.2% 7794738 perf-stat.ps.iTLB-load-misses
3.07 -3.1 0.00 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
0.00 +3.4 3.40 perf-profile.self.cycles-pp.copy_user_generic_unrolled
We also run the original blogbench binary on a IceLake platform
(which has both ERMS and FSRM), and there is almost no peformance
change for the 2 commits.
Please let us know if you need more profiling info or want to run
more tests.
Thanks,
Feng
---
The debug patch for blobbench is:
diff --git a/src/blogbench.h b/src/blogbench.h
index bf538a8..d99d5bb 100644
--- a/src/blogbench.h
+++ b/src/blogbench.h
@@ -178,6 +178,8 @@ void *rewriter(void * const fodder);
void *reader(void * const fodder);
void *commenter(void * const fodder);
+extern pthread_key_t read_buf_key;
+
#include "globals.h"
#endif
diff --git a/src/helpers.c b/src/helpers.c
index 4c17e73..85631ec 100644
--- a/src/helpers.c
+++ b/src/helpers.c
@@ -107,15 +107,18 @@ int create_atomic_file(const char * const file_name,
int read_dummy_file(const char * const file_name)
{
- static char garbage[READ_CHUNK_SIZE];
ssize_t readen;
+ void * read_buf;
int fd;
if ((fd = open(file_name, O_RDONLY)) == -1) {
return -1;
}
+
+ read_buf = pthread_getspecific(read_buf_key);
+
do {
- if ((readen = read(fd, garbage, sizeof garbage))
+ if ((readen = read(fd, read_buf, READ_CHUNK_SIZE))
< (ssize_t) 0) {
if (errno == EINTR) {
continue;
diff --git a/src/process.c b/src/process.c
index a83a980..04f1411 100644
--- a/src/process.c
+++ b/src/process.c
@@ -58,9 +58,13 @@ static int wait_rewriters(void)
return 0;
}
+pthread_key_t read_buf_key = 0;
+
static int spawn_readers(void)
{
unsigned int t = nb_readers;
+
+ pthread_key_create(&read_buf_key, NULL);
do {
t--;
diff --git a/src/reader.c b/src/reader.c
index 91bab7a..2c0efdb 100644
--- a/src/reader.c
+++ b/src/reader.c
@@ -66,6 +66,11 @@ static int read_random_blog(void)
void *reader(void * const fodder)
{
(void) fodder;
+
+ void *read_buf;
+
+ read_buf = malloc(READ_CHUNK_SIZE);
+ pthread_setspecific(read_buf_key, read_buf);
do {
if (read_random_blog() != 0) {
@@ -75,6 +80,8 @@ void *reader(void * const fodder)
usleep(USLEEP_READER);
#endif
} while (stop == (sig_atomic_t) 0);
+
+ free(read_buf);
return NULL;
}
> I suspect it might be a good idea to also touch some of the data,
> because that would be the normal situation (even with old-style
> zero-copy DMA you'd probably not do raw file data and push it out to
> the network unmodified).
>
> Because otherwise you will always find that objectively bad memory
> copies will do better than a good memory copy that makes sense.
>
> Anyway, that was a very long-winded way of saying that I will not use
> that benchmark as a reason to touch the kernel "copy_to_user()"
> implementation.
>
> Of course, that's not to say that this might not be a real regression
> on real loads, but I'd want to see those other numbers.
>
> It looks like this is an old benchmark that hasn't been touched in
> years (just going by the github activity), but I'm cc'ing the author
> anyway, and will add a pointer to this email as a github issue. Maybe
> the developer cares, maybe he doesn't, but no reason not to at least
> mention this and see if maybe the benchmark can be improved to at
> least use a thread buffers for the reads.
>
> (The same issues are true for the writing side too, of course, but the
> writing side has less parallelism and a shared *source* is not the
> same kind of problematic from a cache pattern standpoint, so probably
> isn't going to have anything like the same effect from any memcpy
> implementation).
>
> Linus
next prev parent reply other threads:[~2023-05-09 12:30 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-04 8:09 [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression kernel test robot
2023-05-04 17:39 ` Linus Torvalds
2023-05-04 18:56 ` Linus Torvalds
2023-05-09 12:24 ` Feng Tang [this message]
2023-05-09 17:05 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZFo7j/M1suY/Ey5M@feng-clx \
--to=feng.tang@intel.com \
--cc=fengwei.yin@intel.com \
--cc=j@pureftpd.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lkp@intel.com \
--cc=oe-lkp@lists.linux.dev \
--cc=torvalds@linux-foundation.org \
--cc=ying.huang@intel.com \
--cc=yujie.liu@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.