From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============4218238302119988488=="
MIME-Version: 1.0
From: Huang, Ying
To: lkp@lists.01.org
Subject: Re: [sched/fair] 2c83362734: pft.faults_per_sec_per_cpu -41.4% regression
Date: Sat, 02 Mar 2019 16:15:19 +0800
Message-ID: <87wolhn148.fsf@yhuang-dev.intel.com>
In-Reply-To: <20190228111023.GC9565@techsingularity.net>
List-Id:

--===============4218238302119988488==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

Mel Gorman writes:

> On Thu, Feb 28, 2019 at 03:17:51PM +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -41.4% regression of pft.faults_per_sec_per_cpu due to commit:
>>
>>
>> commit: 2c83362734dad8e48ccc0710b5cd2436a0323893 ("sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: pft
>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>> with following parameters:
>>
>>     runtime: 300s
>>     nr_task: 50%
>>     cpufreq_governor: performance
>>     ucode: 0xb00002e
>>
>
> The headline regression looks high but it's also a known consequence for
> some microbenchmarks, particularly those that are short-lived and consist
> of non-communicating tasks.
>
> The impact of the patch is to favour starting a new task on the local
> node unless the socket is saturated. This is to avoid a pattern where a
> task that clones a helper that it communicates with starts on a remote
> node. Starting remote negatively impacts basic workloads like
> shell scripts, client/server workloads or pipelined tasks. The workloads
> that benefit from spreading early are parallelised tasks that do not
> communicate until the end of the task.
>
> PFT is an example of the latter.
> If spread early, it maximises the total
> memory bandwidth of the machine early in the lifetime of the machine. It
> would quickly recover if it ran long enough; the early measurements are
> low as it saturates the bandwidth of the local node. This configuration
> is at 50% and the machine is likely to be 2-socket so it has half the
> bandwidth in all likelihood and hence the 41.4% regression (very close
> to half so some tasks probably got load-balanced).
>
> On to the other examples:
>
>> test-description: Pft is the page fault test micro benchmark.
>> test-url: https://github.com/gormanm/pft
>>
>> In addition to that, the commit also has significant impact on the following tests:
>>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream:                                                                  |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory    |
>> | test parameters  | array_size=10000000                                                      |
>> |                  | cpufreq_governor=performance                                             |
>> |                  | nr_threads=25%                                                           |
>> |                  | omp=true                                                                 |
>> |                  | ucode=0xb00002e                                                          |
>
> STREAM is typically short-lived. Again, it benefits from spreading early
> to maximise memory bandwidth. 25% of threads would fit in one node. For
> parallelised stream tests it's usually the case that OMP is used to bind 1
> thread per memory channel using the OpenMP directives to measure the total
> machine memory bandwidth rather than using it as a scaling test. I'm
> guessing this machine didn't have 22 memory channels that would make
> nr_threads=25% a sensible configuration.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.jobs_per_min 1.3% improvement                               |
>> | test machine     | 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 256G memory    |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | nr_job=3000                                                              |
>> |                  | nr_task=100%                                                             |
>> |                  | runtime=300s                                                             |
>> |                  | test=custom                                                              |
>> |                  | ucode=0x3d                                                               |
>
> reaim is generally a mess so in this case it's unclear. The load is a
> mix of task creation, IO operations, signals and others. It might have
> benefitted slightly from running local. One reason I don't particularly
> like reaim is that historically it was dominated by sending/receiving
> signals. In my own tests, signal is typically removed as well as its
> tendency to sync the entire filesystem at high frequency.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream: stream.add_bandwidth_MBps -32.0% regression                      |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory    |
>> | test parameters  | array_size=10000000                                                      |
>> |                  | cpufreq_governor=performance                                             |
>> |                  | nr_threads=50%                                                           |
>> |                  | omp=true                                                                 |
>> |                  | ucode=0xb00002e                                                          |
>
> STREAM is covered already, other than noting that it's unlikely it has 44
> memory channels to work with so any imbalance in the task distribution
> should show up as a regression. Again, the patch favours using the local
> node first, which would saturate the local memory channel earlier.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | plzip:                                                                   |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory    |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | nr_threads=100%                                                          |
>> |                  | ucode=0xb00002e                                                          |
>
> Doesn't state what change happened, be it positive or negative.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.jobs_per_min -11.9% regression                              |
>> | test machine     | 192 threads Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz with 512G memory   |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | nr_task=100%                                                             |
>> |                  | runtime=300s                                                             |
>> |                  | test=all_utime                                                           |
>> |                  | ucode=0xb00002e                                                          |
>
> This is completely user-space bound running basic math operations. Not
> clear why it would suffer *but* if hyperthreading is enabled, the patch
> might mean that hyperthread siblings were used early due to favouring
> the local node.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | hackbench: hackbench.throughput -7.3% regression                         |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory     |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | ipc=pipe                                                                 |
>> |                  | mode=process                                                             |
>> |                  | nr_threads=1600%                                                         |
>> |                  | ucode=0xb00002e                                                          |
>
> Hackbench is very short-lived but the workload is also heavily saturating
> the machine to an extent where it would be hard to tell from this report
> if the 7.3% is statistically significant or not. The patch might mean a
> socket is severely over-saturated in the very early phases of the workload.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.std_dev_percent 11.4% undefined                             |
>> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | nr_task=100%                                                             |
>> |                  | runtime=300s                                                             |
>> |                  | test=custom                                                              |
>> |                  | ucode=0x200004d                                                          |
>
> Not sure what the change is saying. Possibly that it's less variable.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: boot-time.boot 95.3% regression                                   |
>> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | nr_task=100%                                                             |
>> |                  | runtime=300s                                                             |
>> |                  | test=alltests                                                            |
>> |                  | ucode=0x200004d                                                          |
>
> boot-time.boot?
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | pft: pft.faults_per_sec_per_cpu -42.7% regression                        |
>> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | nr_task=50%                                                              |
>> |                  | runtime=300s                                                             |
>> |                  | ucode=0x200004d                                                          |
>
> PFT already discussed.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream: stream.add_bandwidth_MBps -28.8% regression                      |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory    |
>> | test parameters  | array_size=50000000                                                      |
>> |                  | cpufreq_governor=performance                                             |
>> |                  | nr_threads=50%                                                           |
>> |                  | omp=true                                                                 |
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stream: stream.add_bandwidth_MBps -30.6% regression                      |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory    |
>> | test parameters  | array_size=10000000                                                      |
>> |                  | cpufreq_governor=performance                                             |
>> |                  | nr_threads=50%                                                           |
>> |                  | omp=true                                                                 |
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | pft: pft.faults_per_sec_per_cpu -42.5% regression                        |
>> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | nr_task=50%                                                              |
>> |                  | runtime=300s                                                             |
>
> Already discussed.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | reaim: reaim.child_systime -1.4% undefined                               |
>> | test machine     | 144 threads Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory   |
>> | test parameters  | cpufreq_governor=performance                                             |
>> |                  | iterations=30                                                            |
>> |                  | nr_task=1600%                                                            |
>> |                  | test=compute                                                             |
>
> 1.4% change in system time could be overhead in the fork phase as it
> looks for local idle cores then remote idle cores early but the
> difference is tiny.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stress-ng: stress-ng.fifo.ops_per_sec 76.2% improvement                  |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory    |
>> | test parameters  | class=pipe                                                               |
>> |                  | cpufreq_governor=performance                                             |
>> |                  | nr_threads=100%                                                          |
>> |                  | testtime=1s                                                              |
>
> A case where short-lived communicating tasks benefit by starting local.
>
>> +------------------+--------------------------------------------------------------------------+
>> | testcase: change | stress-ng: stress-ng.tsearch.ops_per_sec -17.1% regression               |
>> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory    |
>> | test parameters  | class=cpu                                                                |
>> |                  | cpufreq_governor=performance                                             |
>> |                  | nr_threads=100%                                                          |
>> |                  | testtime=1s                                                              |
>> +------------------+--------------------------------------------------------------------------+
>>
>
> Given full machine utilisation and a 1 second duration, it's a case
> where saturating the local node early was sub-optimal and 1 second is
> too short for load balancing or other factors to correct it.
>
> Bottom line, the patch is a trade-off but from a range of tests, I found
> that on balance we benefit more from having tasks start local until
> there is evidence that the kernel is justified in spreading the load to
> remote nodes.

Thanks a lot for the detailed explanation!

Best Regards,
Huang, Ying

--===============4218238302119988488==--