From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752155AbdIMKYr (ORCPT <rfc822;w@1wt.eu>);
        Wed, 13 Sep 2017 06:24:47 -0400
Received: from mail-lf0-f54.google.com ([209.85.215.54]:35394 "EHLO
        mail-lf0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751634AbdIMKYq (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 13 Sep 2017 06:24:46 -0400
X-Google-Smtp-Source: AOwi7QBT1H+p/UMEUevoOPK/baRt49aBS/3qjnRKhFP7AiQVaxPl6GVHFEzbTtzSADfUDs2R3jItuw==
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@redhat.com>,
        Mike Galbraith <efault@gmx.de>,
        Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>,
        Paul Turner <pjt@google.com>, Oleg Nesterov <oleg@redhat.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Mike Galbraith <umgwanakikbuti@gmail.com>,
        Kirill Tkhai <tkhai@yandex.ru>, Tim Chen <tim.c.chen@linux.intel.com>,
        Nicolas Pitre <nicolas.pitre@linaro.org>,
        "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Subject: [RFC PATCH v2] sched/fair: search a task from the tail of the queue
Date: Wed, 13 Sep 2017 12:24:29 +0200
Message-Id: <20170913102430.8985-1-urezki@gmail.com>
X-Mailer: git-send-email 2.11.0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Objective:

In an attempt to improve the criteria of which tasks we should consider to
be migrated (SMP case) during load balance operations, i have done some
performance evaluations.

Test environment:

- set performance governor
- echo 0 > /proc/sys/kernel/nmi_watchdog
- intel_pstate=disable
- i5-3320M CPU @ 2.60GHz

Test results:

A first test was to evaluate hackbench with different number of groups,
i used 10, 20, 40. See below plots with results:

i=0; while [ $i -le 1000 ]; do ./hackbench 10 | grep "Time" | awk '{print $2}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/hacknench_1000_samples_10_groups.png

i=0; while [ $i -le 1000 ]; do ./hackbench 20 | grep "Time" | awk '{print $2}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/hacknench_1000_samples_20_groups.png

i=0; while [ $i -le 1000 ]; do ./hackbench 40 | grep "Time" | awk '{print $2}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/hacknench_1000_samples_40_groups.png

A second test was to evaluate how "perf bench sched pipe" behaves in a single
CPU scenario. As Peter Zijlstra suggested before, to check caches and find out
extra overhead caused by list manipulation:

i=0; while [ $i -le 500 ]; do taskset 1 perf bench sched pipe | grep "Total" | awk '{print $3}'; i=$(($i+1)); done
ftp://vps418301.ovh.net/incoming/taskset_1_perf_bench_sched_pipe.png

Added overhead:

First, i checked if "cfs_tasks" and "group_node" are in a cache line
by annotating pick_next_task_fair symbol and running single CPU test.

perf record -F 100000 -a -e L1-dcache-misses -- taskset 1 perf bench sched pipe -l 10000000
perf annotate pick_next_task_fair

Most of the time i see that cfs_tasks and group_node are in L1-dcache line:

       │             __list_del(entry->prev, entry->next);
  3.51 │       mov    0xb0(%rbp),%rdx
  1.75 │       mov    0xa8(%rbp),%rcx
       │     pick_next_task_fair():
       │                     list_move(&p->se.group_node, &rq->cfs_tasks);
       │       lea    0xa8(%rbp),%rax
       │     __list_del():

group_node: 3.51 corresponds to 2 samples or misses. Minimum value is 0
maximum is 2 misses, among 10 runs.

       │     list_add():
       │             __list_add(new, head, head->next);
  2.44 │       mov    0x940(%r15),%rdx
       │     __list_add():

cfs_tasks: 2.44 corresponds to 1 sample or misses. Minimum value is 0
maximum is 2 misses, among 10 runs.

In case of checking all level cache misses "-e cache-misses" i do not
see any samples or misses.

Conclusion:

according to provided results and my subjective opinion, it worth to
sort cfs_task list and start pulling from the back of the list during
load balance (+ active) or idle balance operations.

It would be appreciated if there are any comments, proposals or ideas
regarding this small investigation.

Best Regards,
Uladzislau Rezki

Uladzislau Rezki (1):
  sched/fair: search a task from the tail of the queue

 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

-- 
2.11.0