From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933724AbdJQHER (ORCPT ); Tue, 17 Oct 2017 03:04:17 -0400 Received: from mga03.intel.com ([134.134.136.65]:34371 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932699AbdJQHEQ (ORCPT ); Tue, 17 Oct 2017 03:04:16 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.43,389,1503385200"; d="scan'208";a="163464896" Subject: Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry To: "Rafael J. Wysocki" , Mike Galbraith Cc: Aubrey Li , tglx@linutronix.de, peterz@infradead.org, len.brown@intel.com, ak@linux.intel.com, tim.c.chen@linux.intel.com, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org References: <1506756034-6340-1-git-send-email-aubrey.li@intel.com> <2672521.fEEa1b19Vu@aspire.rjw.lan> <6f17f378-defc-c668-de00-250da5c41b87@linux.intel.com> <10643458.iWc7GTROAz@aspire.rjw.lan> From: "Li, Aubrey" Message-ID: Date: Tue, 17 Oct 2017 15:04:11 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.1.1 MIME-Version: 1.0 In-Reply-To: <10643458.iWc7GTROAz@aspire.rjw.lan> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2017/10/17 8:05, Rafael J. Wysocki wrote: > On Monday, October 16, 2017 5:11:57 AM CEST Li, Aubrey wrote: >> On 2017/10/14 8:35, Rafael J. Wysocki wrote: >>> On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote: >>>> Record the overhead of idle entry in micro-second >>>> >>> >>> What is this needed for? >> >> We need to figure out how long of a idle is a short idle and recording >> the overhead is for this purpose. The short idle threshold is based >> on this overhead. > > I don't really understand this statement. > > Pretent I'm not familiar with this stuff and try to explain it to me. :-) > Okay, let me try, :-) Today what we did in idle loop as follows: do_idle { idle_entry { - deferrable stuff like quiet_vmstat - turn off tick(without looking at historical/predicted idle interval) - rcu idle enter, c-state selection, etc } idle_call { - poll or halt or mwait } idle_exit { - rcu idle exit - restore the tick if tick is stopped before enter idle } } And we already measured idle_entry and idle_exit costs several micro-seconds, say 10us. Now if idle_call is 1000us, much larger than idle_entry and idle_exit, we can ignore the time cost in idle_entry and idle_exit. But for some workloads with short idle pattern, like netperf, the idle_call is 2us, then idle_entry and idle_exit start to dominate. If we can reduce the time in idle_entry and idle_exit, we then get better workload performance significantly. Modem high-speed network and low-latency I/O like Nvme disk has this requirement. Mike's patch was made several years ago though I don't know the details. Here is an article related to this. https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext >>> >>>> +void cpuidle_entry_end(void) >>>> +{ >>>> + struct cpuidle_device *dev = cpuidle_get_device(); >>>> + u64 overhead; >>>> + s64 diff; >>>> + >>>> + if (dev) { >>>> + dev->idle_stat.entry_end = local_clock(); >>>> + overhead = div_u64(dev->idle_stat.entry_end - >>>> + dev->idle_stat.entry_start, NSEC_PER_USEC); >>> >>> Is the conversion really necessary? >>> >>> If so, then why? >> >> We can choose nano-second and micro-second. Given that workload results >> in the short idle pattern, I think micro-second is good enough for the >> real workload. >> >> Another reason is that prediction from idle governor is micro-second, so >> I convert it for comparing purpose. >>> >>> And if there is a good reason, what about using right shift to do >>> an approximate conversion to avoid the extra division here? >> >> Sure >> 10 works for me as I don't think here precision is a big deal. >> >>> >>>> + diff = overhead - dev->idle_stat.overhead; >>>> + dev->idle_stat.overhead += diff >> 3; >>> >>> Can you please explain what happens in the two lines above? >> >> Online average computing algorithm, stolen from update_avg() @ kernel/sched/core.c. > > OK > > Maybe care to add a comment to that effect? Sure, I'll add in the next version. Thanks, -Aubrey