From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [RESEND PATCH v3 1/n] perf/core: addressing 4x slowdown during
 per-process profiling of STREAM benchmark on Intel Xeon Phi
From: Alexey Budankov
To: Mark Rutland
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
 Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov,
 Valery Cherepennikov, David Carrillo-Cisneros, Stephane Eranian,
 linux-kernel@vger.kernel.org
References: <20170620134442.GG28157@leverpostej>
 <48e77d47-20cc-67b4-577e-00489767b263@linux.intel.com>
In-Reply-To: <48e77d47-20cc-67b4-577e-00489767b263@linux.intel.com>
Organization: Intel Corp.
Date: Wed, 21 Jun 2017 12:25:04 +0300

On 20.06.2017 18:12, Alexey Budankov wrote:
> On 20.06.2017 16:44, Mark Rutland wrote:
>> On Fri, Jun 16, 2017 at 02:03:58AM +0300, Alexey Budankov wrote:
>>> perf/core: use rb trees for pinned/flexible groups
>>>
>>> By default, the userspace perf tool opens per-cpu task-bound events
>>> when sampling, so for N logical events requested by the user, the
>>> tool will open N * NR_CPUS events.
>>>
>>> In the kernel, we mux events with a hrtimer, periodically rotating
>>> the flexible group list and trying to schedule each group in turn.
>>> We skip groups whose cpu filter doesn't match. So when we get
>>> unlucky, we can walk N * (NR_CPUS - 1) groups pointlessly for each
>>> hrtimer invocation.
>>>
>>> This has been observed to result in significant overhead when
>>> running the STREAM benchmark on 272 core Xeon Phi systems.
>>>
>>> One way to avoid this is to place our events into an rb tree sorted
>>> by CPU filter, so that our hrtimer can skip to the current CPU's
>>> list and ignore everything else.
>>>
>>> As a step towards that, this patch replaces event group lists with
>>> rb trees.
>>>
>>> Signed-off-by: Alexey Budankov
>>> ---
>>>  include/linux/perf_event.h |  18 ++-
>>>  kernel/events/core.c       | 393 +++++++++++++++++++++++++++++++++------------
>>>  2 files changed, 307 insertions(+), 104 deletions(-)
>>>
>>> Addressed Mark Rutland's comments from the previous patch version.
>>
>> ... then this should be v4, no?
>>
>> Which comments? Could you please write a changelog in future?
>
> Changes are:
>
> 1. changed type of pinned_groups/flexible_groups to rb_tree;
> 2. removed group_list_entry and reused group_entry for that purpose;
> 3. added add_to_groups()/del_from_groups() helper functions;
>
>> In future, please send patches as a series, with a known upper bound
>> rather than N. It's really painful to find them when they're sent
>> separately, without a known upper bound.
>
> Accepted.
>
>> [...]
>>
>>> +/*
>>> + * Delete a group from a tree. If the group is directly attached
>>> + * to the tree it also detaches all groups on the group's
>>> + * group_list list.
>>> + */
>>> +static void
>>> +perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
>>> +{
>>> +	WARN_ON_ONCE(!tree || !event);
>>> +
>>> +	if (RB_EMPTY_NODE(&event->group_node)) {
>>> +		list_del_init(&event->group_entry);
>>> +	} else {
>>> +		struct perf_event *list_event, *list_next;
>>> +
>>> +		rb_erase(&event->group_node, tree);
>>> +		list_for_each_entry_safe(list_event, list_next,
>>> +					 &event->group_list, group_entry)
>>> +			list_del_init(&list_event->group_entry);
>>> +	}
>>> +}
>>
>> As I commented on the last version, this means that all groups which
>> were (incidentally) hanging off of this one are removed, and can no
>> longer be reached via the tree.
>>
>> Surely one of the remaining groups should be added to the tree?
>
> Aww, I see. That needs to be implemented. Thanks.

Addressing that inconsistency like this:

static void
perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
{
	struct perf_event *next;

	WARN_ON_ONCE(!tree || !event);

	/* Unlink the event from whatever group_list it is on. */
	list_del_init(&event->group_entry);

	if (!RB_EMPTY_NODE(&event->group_node)) {
		if (!list_empty(&event->group_list)) {
			/*
			 * The event heads a non-empty group_list:
			 * hand the list over to the next group and
			 * put that group into the tree slot.
			 */
			next = list_first_entry(&event->group_list,
						struct perf_event,
						group_entry);
			list_replace_init(&event->group_list,
					  &next->group_list);
			rb_replace_node(&event->group_node,
					&next->group_node, tree);
		} else {
			rb_erase(&event->group_node, tree);
		}
		RB_CLEAR_NODE(&event->group_node);
	}
}

solves the list_del corruption:

list_del corruption. prev->next should be ffff88c2c4654010, but was ffff88c31eb0c020
[  607.632813] ------------[ cut here ]------------
[  607.632816] kernel BUG at lib/list_debug.c:53!
[  607.635531] Call Trace:
[  607.635583]  list_del_event+0x1d7/0x210

and the x86_pmu_start warning:

[  484.804737] WARNING: CPU: 15 PID: 31168 at arch/x86/events/core.c:1257 x86_pmu_start+0x8f/0xb0
[  484.804938] RIP: 0010:x86_pmu_start+0x8f/0xb0
[  484.804971] Call Trace:
[  484.804976]
[  484.804984]  x86_pmu_enable+0x27f/0x2f0
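
The essence of the fix: when the deleted event heads a non-empty
group_list, list_replace_init() hands that list over to the next group
and rb_replace_node() promotes that group into the deleted event's
slot in the tree, so the remaining groups stay reachable. Only an
event with an empty group_list is actually erased from the tree. That
addresses the reachability problem noted above.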

>
>>
>> I don't believe that is correct.
>>
>> I believe it would be simpler to reason about a threaded rb-tree
>> here, since that special case wouldn't exist.
>>
>> Thanks,
>> Mark.
>>
>
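
P.S. For context, below is the kind of lookup this cpu-keyed tree is
meant to enable on the hrtimer path. It is an illustrative sketch
only, not code from the patch: perf_cpu_tree_find() is a made-up
name, and it assumes the tree is keyed by event->cpu, with groups
sharing a CPU hanging off the head group's group_list as in the
delete path above.

/*
 * Sketch (hypothetical helper): find the group heading the given
 * CPU's list in a tree keyed by event->cpu. The hrtimer path could
 * then walk just that group_list instead of all N * NR_CPUS groups.
 */
static struct perf_event *
perf_cpu_tree_find(struct rb_root *tree, int cpu)
{
	struct rb_node *node = tree->rb_node;

	while (node) {
		struct perf_event *event =
			container_of(node, struct perf_event, group_node);

		if (cpu < event->cpu)
			node = node->rb_left;
		else if (cpu > event->cpu)
			node = node->rb_right;
		else
			return event;
	}

	return NULL;
}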