Date: Fri, 3 Feb 2017 17:43:50 +0100
From: Radim Krcmar
To: Marcelo Tosatti
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Paolo Bonzini,
    "Rafael J. Wysocki", Viresh Kumar
Subject: Re: [patch 0/3] KVM CPU frequency change hypercalls
Message-ID: <20170203164349.GA5582@potion>
References: <20170202174755.946578704@redhat.com>
In-Reply-To: <20170202174755.946578704@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

2017-02-02 15:47-0200, Marcelo Tosatti:
> Implement KVM hypercalls for the guest
> to issue frequency changes.
>
> Current situation with DPDK and frequency changes is as follows:
> An algorithm in the guest decides when to increase/decrease
> frequency based on the queue length of the device.

Does the algorithm compute with the magnitude of frequency steps?
(e.g. if CPU can step with 200 MHz granularity, does the algorithm ever
 do 400 MHz at once, because it assumes that frequency would be enough
 to handle the load?)

> On the host, a power manager daemon is used to listen for
> frequency change requests (on another core) and issue these
> requests.
>
> However frequency changes are performance sensitive events because:
> On a change from low load condition to max load condition,
> the frequency should be raised as soon as possible.
> Sending a virtio-serial notification to another pCPU,
> waiting for that pCPU to initiate an IPI to the requestor pCPU
> to change frequency, is slower and more cache costly than
> a direct hypercall to host to switch the frequency.
>
> If the pCPU where the power manager daemon is running
> is not busy spinning on requests from the isolated DPDK vcpus,
> there is also the cost of HLT wakeup for that pCPU.
>
> Moreover, the daemon serves multiple VMs, meaning that
> the scheme is subject to additional delays from
> queueing of power change requests from VMs.

(Wow, this must be bringing humanity to its doom faster than the heat it
 helps to eliminate.)

> A direct hypercall from userspace is the fastest most direct
> method for the guest to change frequency and does not suffer
> from the issues above.

Right, userspace on bare-metal cannot change frequency directly.

> The usage scenario for this hypercalls is for pinned vCPUs <-> pCPUs.

And pinned tasks <-> vCPUs, because the guest kernel has no idea what
frequency is being used or desired on its virtualware, so the kernel
cannot even change frequency without introducing a bug ...

I'm not happy about this hole through layers of isolations.
The domain of valid users is very small and a problem is that any
program with access to /dev/kvm gains the ability to change host CPU
frequency if the host happens to use the userspace governor.
We should at least enable this feature only if /dev/kvm is root-only.
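
To make "root-only" concrete, here is a rough userspace sketch of the kind of check I have in mind. It is only an illustration, not part of the posted series: the helper names mode_is_root_only() and path_is_root_only() are made up, and the sketch assumes that "root-only" means the node is owned by uid 0 with no group or other permission bits set. It ignores ACLs and capabilities, which a real check would have to consider.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: true if a node with this owner and mode can be
 * opened only by root.  Assumption: any group or other permission bit
 * counts as exposure to non-root users.
 */
static bool mode_is_root_only(uid_t owner, mode_t mode)
{
	return owner == 0 && (mode & (S_IRWXG | S_IRWXO)) == 0;
}

/*
 * Hypothetical helper: returns 1 if path is root-only, 0 if it is not,
 * and -1 if stat() fails (e.g. the node does not exist).
 */
static int path_is_root_only(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return mode_is_root_only(st.st_uid, st.st_mode) ? 1 : 0;
}
```

A management tool could call path_is_root_only("/dev/kvm") and refuse to advertise the frequency hypercalls unless the result is 1. A device node with the common 0660 root:kvm setup would fail this check on purpose.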