From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail-oi0-f67.google.com ([209.85.218.67]:39315 "EHLO
        mail-oi0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1753528AbdKQLX5 (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Fri, 17 Nov 2017 06:23:57 -0500
Subject: Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter
 real idle path
To: Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>
Cc: Quan Xu <quan.xu03@gmail.com>, kvm@vger.kernel.org,
        linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        LKML <linux-kernel@vger.kernel.org>,
        virtualization@lists.linux-foundation.org, x86@kernel.org,
        xen-devel@lists.xenproject.org,
        Yang Zhang <yang.zhang.wz@gmail.com>,
        Ingo Molnar <mingo@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>,
        Kyle Huey <me@kylehuey.com>, Len Brown <len.brown@intel.com>,
        Andy Lutomirski <luto@kernel.org>,
        Tom Lendacky <thomas.lendacky@amd.com>,
        Tobias Klauser <tklauser@distanz.ch>,
        Daniel Lezcano <daniel.lezcano@linaro.org>
References: <1510567565-5118-1-git-send-email-quan.xu0@gmail.com>
 <1510567565-5118-4-git-send-email-quan.xu0@gmail.com>
 <20171115121152.gqug5wzerlo3eimd@hirez.programming.kicks-ass.net>
 <alpine.DEB.2.20.1711152240010.2146@nanos>
 <46086489-5a01-16e1-9314-70ae53c01952@gmail.com>
 <alpine.DEB.2.20.1711161048000.2191@nanos>
From: Quan Xu <quan.xu0@gmail.com>
Message-ID: <564b8a6e-8ddd-4e3d-c670-10f1697e6c06@gmail.com>
Date: Fri, 17 Nov 2017 19:23:43 +0800
MIME-Version: 1.0
In-Reply-To: <alpine.DEB.2.20.1711161048000.2191@nanos>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-US
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>


On 2017-11-16 17:53, Thomas Gleixner wrote:
> On Thu, 16 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 06:03, Thomas Gleixner wrote:
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv,
>>                  target_state = &drv->states[index];
>>          }
>>
>> +#ifdef CONFIG_PARAVIRT
>> +       paravirt_idle_poll();
>> +
>> +       if (need_resched())
>> +               return -EBUSY;
>> +#endif
> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> anything outside the architecture/hypervisor interfacing code which really
> needs it.
>
> The problem can and must be solved at the generic level in the first place
> to gather the data which can be used to make such decisions.
>
> How that information is used might be either completely generic or requires
> system specific variants. But as long as we don't have any information at
> all we cannot discuss that.
>
> Please sit down and write up which data needs to be considered to make
> decisions about probabilistic polling. Then we need to compare and contrast
> that with the data which is necessary to make power/idle state decisions.
>
> I would be very surprised if this data would not overlap by at least 90%.
>

Peter, tglx,
Thanks for your comments.

After rethinking this patch set:

1. Which data needs to be considered to make decisions about
probabilistic polling

I do need to write up which data has to be considered to make
decisions about probabilistic polling. For the last several months
I have focused on the data _from idle to reschedule_, with the goal of
bypassing the idle loop. Unfortunately, that inevitably made me touch
the scheduler/idle/nohz code.

Following tglx's suggestion, the data necessary to make power/idle
state decisions is the last idle state's residency time. IIUC this is
the duration from entering idle to wakeup, where the wakeup may be
caused by a reschedule IRQ or by any other IRQ.

In my test, more than 90% of these wakeups overlap with the reschedule
IRQ (I traced the need_resched status after cpuidle_idle_call) while
running ctxsw/netperf for one minute.

Given that overlap, I think I can use the last idle state's residency
time as the input for decisions about probabilistic polling, the same
way @dev->last_residency is used. That data is much easier to get.
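
As a rough illustration of item 1, the residency could drive the poll
window like this. It is a userspace sketch, not code from the patch set:
update_poll_ns(), POLL_NS_MAX, POLL_NS_START, the nanosecond unit and
the double/halve policy are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tunables, not taken from any posted patch. */
#define POLL_NS_MAX   200000ULL  /* upper bound on the poll window */
#define POLL_NS_START  10000ULL  /* first non-zero window when growing from 0 */

/*
 * Return the next poll window, given the current window and the
 * residency of the idle period that just ended (both in nanoseconds).
 */
static uint64_t update_poll_ns(uint64_t poll_ns, uint64_t last_residency_ns)
{
	assert(poll_ns <= POLL_NS_MAX);

	if (last_residency_ns > POLL_NS_MAX) {
		/* Long sleep: no allowed window could have covered it, back off. */
		return poll_ns / 2;
	}

	if (last_residency_ns > poll_ns) {
		/* Short sleep that the current window missed: grow the window. */
		uint64_t next = poll_ns ? poll_ns * 2 : POLL_NS_START;

		return next > POLL_NS_MAX ? POLL_NS_MAX : next;
	}

	/* The wakeup already fits inside the window: keep it unchanged. */
	return poll_ns;
}
```

For example, starting from a window of 0, a 5000 ns idle period opens
the window to POLL_NS_START, and a later 1 ms idle period halves it
again.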


2. Do a HV-specific idle driver (function)

So far, power management is not exposed to the guest, so idle is simple
for a KVM guest: it just calls "sti" / "hlt"
(cpuidle_idle_call() --> default_idle_call()). Thanks to the Xen guys,
who implemented the paravirt framework, I can implement it as easily as
the following:

              --- a/arch/x86/kernel/kvm.c
              +++ b/arch/x86/kernel/kvm.c
              @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
                      update_intr_gate(X86_TRAP_PF, async_page_fault);
               }

              +static __cpuidle void kvm_safe_halt(void)
              +{
              +        /* 1. POLL, if need_resched() --> return */
              +
              +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
              +
              +        /* 3. get the last idle state's residency time */
              +
              +        /* 4. update poll duration based on last idle state's
              +         *    residency time */
              +}
              +
               void __init kvm_guest_init(void)
               {
                      int i;
              @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
                      if (kvmclock_vsyscall)
                              kvm_setup_vsyscall_timeinfo();

              +       pv_irq_ops.safe_halt = kvm_safe_halt;
              +
               #ifdef CONFIG_SMP


Then there is no need to introduce a new pvops, and I never have to
modify the schedule/idle/nohz code again. I can also confine all of
the code to arch/x86/kernel/kvm.c.
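
Step 1 of kvm_safe_halt() above could look roughly like the bounded
poll below. This is a userspace sketch under assumptions:
resched_pending stands in for need_resched(), and poll_before_halt()
and now_ns() are hypothetical helper names. A real guest would use
kernel time sources and cpu_relax() instead.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for need_resched(); another thread would set it in this sketch. */
static atomic_bool resched_pending;

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Spin for at most poll_ns nanoseconds.
 * Returns true if work showed up (the caller should skip the halt),
 * false if the window expired (the caller should halt).
 */
static bool poll_before_halt(uint64_t poll_ns)
{
	uint64_t deadline = now_ns() + poll_ns;

	do {
		if (atomic_load_explicit(&resched_pending, memory_order_acquire))
			return true;
	} while (now_ns() < deadline);

	return false;
}
```

The flag is checked at least once even with a zero window, so a pending
reschedule is never missed just because polling is currently disabled.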

If this is the right direction, I will send a new patch set next week.

thanks,

Quan
Alibaba Cloud