From: Huang, Ying <ying.huang@intel.com>
To: lkp@lists.01.org
Subject: Re: [sched/fair] 2c83362734: pft.faults_per_sec_per_cpu -41.4% regression
Date: Sat, 02 Mar 2019 16:15:19 +0800
Message-ID: <87wolhn148.fsf@yhuang-dev.intel.com>
In-Reply-To: <20190228111023.GC9565@techsingularity.net>
Mel Gorman <mgorman@techsingularity.net> writes:
> On Thu, Feb 28, 2019 at 03:17:51PM +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -41.4% regression of pft.faults_per_sec_per_cpu due to commit:
>>
>>
>> commit: 2c83362734dad8e48ccc0710b5cd2436a0323893 ("sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: pft
>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>> with following parameters:
>>
>> runtime: 300s
>> nr_task: 50%
>> cpufreq_governor: performance
>> ucode: 0xb00002e
>>
>
> The headline regression looks high but it's also a known consequence for
> some microbenchmarks, particularly those that are short-lived and consist
> of non-communicating tasks.
>
> The impact of the patch is to favour starting a new task on the local
> node unless the socket is saturated. This is to avoid a pattern where a
> task that clones a helper that it communicates with starts on a remote
> node. Starting remote negatively impacts basic workloads like
> shellscripts, client/server workloads or pipelined tasks. The workloads
> that benefit from spreading early are parallelised tasks that do not
> communicate until end of the task.
>
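A toy sketch of the two placement behaviours described in the quoted paragraph, for illustration only. It is not the kernel's scheduler code; the function names, the two-node layout and the 44 CPUs per node are assumptions chosen to resemble the 88-thread test machine.

```python
# Toy model of task placement at fork time. NOT the kernel's logic;
# all names and sizes here are hypothetical.

def pick_node_local_first(local, idle_cpus):
    """Stay on the local node unless it has no idle CPU left."""
    if idle_cpus[local] > 0:
        return local
    # Local node saturated: fall back to the node with the most idle CPUs.
    return max(range(len(idle_cpus)), key=lambda n: idle_cpus[n])


def pick_node_spread(local, idle_cpus):
    """Always pick the idlest node, preferring local only on a tie."""
    return max(range(len(idle_cpus)), key=lambda n: (idle_cpus[n], n == local))


def place(n_tasks, cpus_per_node, nodes, policy):
    """Fork n_tasks from node 0 and return the task count per node."""
    idle = [cpus_per_node] * nodes
    for _ in range(n_tasks):
        node = policy(0, idle)
        idle[node] -= 1
    return [cpus_per_node - i for i in idle]


if __name__ == "__main__":
    # nr_task=50% of 88 CPUs, assuming 2 nodes of 44 CPUs each.
    print(place(44, 44, 2, pick_node_local_first))  # [44, 0]
    print(place(44, 44, 2, pick_node_spread))       # [22, 22]
```

Under these assumptions a 50% load lands entirely on one node with the local-first policy and is split evenly with the spreading policy, which is the difference the non-communicating benchmarks in this report are sensitive to.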
> PFT is an example of the latter. If spread early, it maximises the total
> memory bandwidth of the machine early in the lifetime of the workload.
> It would quickly recover if it ran long enough, but the early
> measurements are low as it saturates the bandwidth of the local node.
> This configuration is at 50% and the machine is likely to be 2-socket,
> so in all likelihood it has half the bandwidth, hence the 41.4%
> regression (very close to half, so some tasks probably got
> load-balanced).
>
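A rough sketch of the arithmetic behind the "very close to half" remark, assuming a deliberately simple linear model: throughput scales with the memory bandwidth in use, every node has the same bandwidth, and the local node is fully used. The function names and the model itself are assumptions, not something stated in the report.

```python
# Back-of-envelope model; every name here is hypothetical.

def effective_bandwidth_fraction(regression_pct):
    """Fraction of the baseline throughput that remains after the regression."""
    return 1.0 - regression_pct / 100.0


def remote_node_share(regression_pct, nodes=2):
    """Fraction of the remote nodes' bandwidth in use under the linear model.

    The baseline is assumed to use all `nodes` nodes fully, and the
    regressed run is assumed to saturate the local node.
    """
    # Bandwidth in use, measured in units of one node's bandwidth.
    used = effective_bandwidth_fraction(regression_pct) * nodes
    return (used - 1.0) / (nodes - 1)


if __name__ == "__main__":
    # A pure one-node run on a 2-node machine would be a 50% regression.
    print(remote_node_share(50.0))
    # The reported 41.4% leaves some remote bandwidth in use.
    print(round(remote_node_share(41.4), 3))
```

With the reported 41.4% the model puts roughly 17% of the remote node's bandwidth in use, which is consistent with a minority of tasks having been load-balanced away from the local node.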
> On to the other examples;
>
>> test-description: Pft is the page fault test micro benchmark.
>> test-url: https://github.com/gormanm/pft
>>
>> In addition to that, the commit also has significant impact on the following tests:
>>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream: |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
>> | test parameters | array_size=10000000 |
>> | | cpufreq_governor=performance |
>> | | nr_threads=25% |
>> | | omp=true |
>> | | ucode=0xb00002e |
>
> STREAM is typically short-lived. Again, it benefits from spreading early
> to maximise memory bandwidth. 25% of threads would fit in one node. For
> parallelised stream tests it's usually the case that OMP is used to bind 1
> thread per memory channel using the openmp directives to measure the total
> machine memory bandwidth rather than using it as a scaling test. I'm
> guessing this machine didn't have 22 memory channels that would make
> nr_threads=25% a sensible configuration.
>
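A small sketch of the thread-count arithmetic in the quoted STREAM comment, assuming the 88-thread machine has 2 NUMA nodes with 44 logical CPUs each (the report does not state the node count). The helper names are hypothetical.

```python
# Thread-count arithmetic for the nr_threads percentages in the report.

def threads_for(percent, total_cpus):
    """Number of benchmark threads for a given nr_threads percentage."""
    return total_cpus * percent // 100


def fits_on_one_node(percent, total_cpus, nodes):
    """True if all threads fit on the logical CPUs of a single node."""
    return threads_for(percent, total_cpus) <= total_cpus // nodes


if __name__ == "__main__":
    for pct in (25, 50, 100):
        print(pct, threads_for(pct, 88), fits_on_one_node(pct, 88, 2))
```

Under that assumption both nr_threads=25% (22 threads) and nr_threads=50% (44 threads) fit on a single node, so a local-first placement can keep the whole run on one node's memory channels.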
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.jobs_per_min 1.3% improvement |
>> | test machine | 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 256G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | nr_job=3000 |
>> | | nr_task=100% |
>> | | runtime=300s |
>> | | test=custom |
>> | | ucode=0x3d |
>
> reaim is generally a mess so in this case it's unclear. The load is a
> mix of task creation, IO operations, signal and others. It might have
> benefitted slightly from running local. One reason I don't particularly
> like reaim is that historically it was dominated by sending/receiving
> signals. In my own tests, signal is typically removed, as is its
> tendency to sync the entire filesystem at high frequency.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream: stream.add_bandwidth_MBps -32.0% regression |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
>> | test parameters | array_size=10000000 |
>> | | cpufreq_governor=performance |
>> | | nr_threads=50% |
>> | | omp=true |
>> | | ucode=0xb00002e |
>
> STREAM covered already other than noting that it's unlikely it has 44
> memory channels to work with so any imbalance in the task distribution
> should show up as a regression. Again, the patch favours using local node
> first which would saturate the local memory channel earlier.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | plzip: |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | nr_threads=100% |
>> | | ucode=0xb00002e |
>
> Doesn't state what change happened, be it positive or negative.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.jobs_per_min -11.9% regression |
>> | test machine | 192 threads Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz with 512G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | nr_task=100% |
>> | | runtime=300s |
>> | | test=all_utime |
>> | | ucode=0xb00002e |
>
> This is completely user-space bound running basic math operations. Not
> clear why it would suffer *but* if hyperthreading is enabled, the patch
> might mean that hyperthread siblings were used early due to favouring
> the local node.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | hackbench: hackbench.throughput -7.3% regression |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | ipc=pipe |
>> | | mode=process |
>> | | nr_threads=1600% |
>> | | ucode=0xb00002e |
>
> Hackbench is very short-lived but the workload is also heavily saturating
> the machine to an extent where it would be hard to tell from this report
> if the 7.3% is statistically significant or not. The patch might mean a
> socket is severely over-saturated in the very early phases of the workload.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.std_dev_percent 11.4% undefined |
>> | test machine | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | nr_task=100% |
>> | | runtime=300s |
>> | | test=custom |
>> | | ucode=0x200004d |
>
> Not sure what the change is saying. Possibly that it's less variable.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: boot-time.boot 95.3% regression |
>> | test machine | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | nr_task=100% |
>> | | runtime=300s |
>> | | test=alltests |
>> | | ucode=0x200004d |
>
> boot-time.boot?
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | pft: pft.faults_per_sec_per_cpu -42.7% regression |
>> | test machine | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | nr_task=50% |
>> | | runtime=300s |
>> | | ucode=0x200004d |
>
> PFT already discussed.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream: stream.add_bandwidth_MBps -28.8% regression |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
>> | test parameters | array_size=50000000 |
>> | | cpufreq_governor=performance |
>> | | nr_threads=50% |
>> | | omp=true |
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream: stream.add_bandwidth_MBps -30.6% regression |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
>> | test parameters | array_size=10000000 |
>> | | cpufreq_governor=performance |
>> | | nr_threads=50% |
>> | | omp=true |
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | pft: pft.faults_per_sec_per_cpu -42.5% regression |
>> | test machine | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | nr_task=50% |
>> | | runtime=300s |
>
> Already discussed.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.child_systime -1.4% undefined |
>> | test machine | 144 threads Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | iterations=30 |
>> | | nr_task=1600% |
>> | | test=compute |
>
> 1.4% change in system time could be overhead in the fork phase as it
> looks for local idle cores then remote idle cores early but the
> difference is tiny.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stress-ng: stress-ng.fifo.ops_per_sec 76.2% improvement |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
>> | test parameters | class=pipe |
>> | | cpufreq_governor=performance |
>> | | nr_threads=100% |
>> | | testtime=1s |
>
> A case where short-lived communicating tasks benefit by starting local.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stress-ng: stress-ng.tsearch.ops_per_sec -17.1% regression |
>> | test machine | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
>> | test parameters | class=cpu |
>> | | cpufreq_governor=performance |
>> | | nr_threads=100% |
>> | | testtime=1s |
>> +------------------+--------------------------------------------------------------------------+
>>
>
> Given full machine utilisation and a 1 second duration, it's a case
> where saturating the local node early was sub-optimal and 1 second is
> too short for load balancing or other factors to correct it.
>
> Bottom line, the patch is a trade-off but from a range of tests, I found
> that on balance we benefit more from having tasks start local until
> there is evidence that the kernel is justified in spreading the load to
> remote nodes.
Thanks a lot for the detailed explanation!
Best Regards,
Huang, Ying