From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752155AbdIMKYr (ORCPT ); Wed, 13 Sep 2017 06:24:47 -0400
Received: from mail-lf0-f54.google.com ([209.85.215.54]:35394 "EHLO mail-lf0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751634AbdIMKYq (ORCPT ); Wed, 13 Sep 2017 06:24:46 -0400
X-Google-Smtp-Source: AOwi7QBT1H+p/UMEUevoOPK/baRt49aBS/3qjnRKhFP7AiQVaxPl6GVHFEzbTtzSADfUDs2R3jItuw==
From: "Uladzislau Rezki (Sony)"
To: Peter Zijlstra
Cc: LKML , Ingo Molnar , Mike Galbraith , Oleksiy Avramchenko , Paul Turner , Oleg Nesterov , Steven Rostedt , Mike Galbraith , Kirill Tkhai , Tim Chen , Nicolas Pitre , "Uladzislau Rezki (Sony)"
Subject: [RFC PATCH v2] sched/fair: search a task from the tail of the queue
Date: Wed, 13 Sep 2017 12:24:29 +0200
Message-Id: <20170913102430.8985-1-urezki@gmail.com>
X-Mailer: git-send-email 2.11.0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Objective:
To improve the criteria for choosing which tasks should be considered
for migration (SMP case) during load balance operations, I have done
some performance evaluations.

Test environment:
- set performance governor
- echo 0 > /proc/sys/kernel/nmi_watchdog
- intel_pstate=disable
- i5-3320M CPU @ 2.60GHz

Test results:
A first test evaluated hackbench with different numbers of groups:
10, 20 and 40.
See the plots below for the results:

i=0; while [ $i -le 1000 ]; do ./hackbench 10 | grep "Time" | awk '{print $2}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/hacknench_1000_samples_10_groups.png

i=0; while [ $i -le 1000 ]; do ./hackbench 20 | grep "Time" | awk '{print $2}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/hacknench_1000_samples_20_groups.png

i=0; while [ $i -le 1000 ]; do ./hackbench 40 | grep "Time" | awk '{print $2}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/hacknench_1000_samples_40_groups.png

A second test evaluated how "perf bench sched pipe" behaves in a
single CPU scenario. As Peter Zijlstra suggested earlier, the purpose
is to check the caches and find any extra overhead caused by list
manipulation:

i=0; while [ $i -le 500 ]; do taskset 1 perf bench sched pipe | grep "Total" | awk '{print $3}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/taskset_1_perf_bench_sched_pipe.png

Added overhead:
First, I checked whether "cfs_tasks" and "group_node" are in cache by
annotating the pick_next_task_fair symbol while running the single
CPU test.

perf record -F 100000 -a -e L1-dcache-misses -- taskset 1 perf bench sched pipe -l 10000000
perf annotate pick_next_task_fair

Most of the time I see that cfs_tasks and group_node are in the
L1 dcache:

       │     __list_del(entry->prev, entry->next);
  3.51 │       mov    0xb0(%rbp),%rdx
  1.75 │       mov    0xa8(%rbp),%rcx
       │     pick_next_task_fair():
       │     list_move(&p->se.group_node, &rq->cfs_tasks);
       │       lea    0xa8(%rbp),%rax
       │     __list_del():

group_node: 3.51 corresponds to 2 samples (misses). Across 10 runs
the minimum is 0 and the maximum is 2 misses.

       │     list_add():
       │     __list_add(new, head, head->next);
  2.44 │       mov    0x940(%r15),%rdx
       │     __list_add():

cfs_tasks: 2.44 corresponds to 1 sample (miss). Across 10 runs
the minimum is 0 and the maximum is 2 misses.

When checking cache misses at all levels ("-e cache-misses"), I do
not see any samples or misses.
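The sample loops above print one number per line and the results are only
shown as plots. As a rough aid for anyone who wants plain numbers instead,
here is a minimal sketch of a helper that reduces such a stream to count,
minimum, maximum and mean. The name summarize_samples is hypothetical and the
helper was not used to produce any of the results in this mail.

```shell
#!/bin/sh
# summarize_samples: hypothetical helper, not part of the original runs.
# Reads one numeric sample per line on stdin and prints the sample count,
# minimum, maximum and arithmetic mean on a single line.
summarize_samples() {
    awk '
        NF == 0 { next }                    # ignore blank lines
        {
            x = $1 + 0                      # force numeric interpretation
            if (n == 0 || x < min) min = x  # first sample seeds min
            if (n == 0 || x > max) max = x  # first sample seeds max
            sum += x
            n++
        }
        END {
            if (n == 0) { print "n=0"; exit }
            printf "n=%d min=%.3f max=%.3f mean=%.3f\n", n, min, max, sum / n
        }
    '
}

# Example use with one of the loops above (assumes ./hackbench is present):
#   i=0; while [ $i -le 1000 ]; do ./hackbench 10 | grep "Time" | \
#       awk '{print $2}'; i=$(($i+1)); done | summarize_samples
```

Piping a loop into the helper keeps the measurement commands unchanged. Only
the post-processing differs, so the same samples can still be plotted.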

Conclusion:
Based on the results above and my subjective opinion, it is worth
sorting the cfs_tasks list and pulling from the back of the list
during load balance (+ active) or idle balance operations.

Any comments, proposals or ideas regarding this small investigation
would be appreciated.

Best Regards,
Uladzislau Rezki

Uladzislau Rezki (1):
  sched/fair: search a task from the tail of the queue

 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

--
2.11.0