From: Leonardo Bras
To: Leonardo Bras
Cc: "Paul E. McKenney", Leonardo Bras, Sean Christopherson,
    Frederic Weisbecker, Paolo Bonzini, Marcelo Tosatti,
    linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit
Date: Mon, 24 Jun 2024 23:34:04 -0300

On Mon, Jun 24, 2024 at 11:31:51PM -0300, Leonardo Bras wrote:
> On Thu, Jun 20, 2024 at 10:26:38AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 20, 2024 at 04:03:41AM -0300, Leonardo Bras wrote:
> > > On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> > > > On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > > > > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson wrote:
> > > > > > > >
> > > > > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > > > > >
> > > > > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > > > > > note a quiescent state.
> > > > > > > > >
> > > > > > > > > This occasional schedule work will introduce tens of microseconds of
> > > > > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > > > > applications, such as real-time workloads.
> > > > > > > > >
> > > > > > > > > So, note a quiescent state in guest exit, so the interrupted guest is able
> > > > > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > > > > > >
> > > > > > > > Are there any downsides to this? E.g. extra latency or anything? KVM will note
> > > > > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > > > > >
> > > > > > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > > > > doing that.
> > > > > > > > I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > > > > case?
> > > > > > >
> > > > > > > I don't know, by the lore I see it happening in guest entry since the
> > > > > > > first time it was introduced at
> > > > > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > > > > >
> > > > > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > > > > me know if I got it wrong).
> > > > > >
> > > > > > Yes, it is cheap, especially if interrupts are already disabled.
> > > > > > (As in the scheduler asks RCU to do the same amount of work on its
> > > > > > context-switch fastpath.)
> > > > >
> > > > > Thanks!
> > > > >
> > > > > >
> > > > > > > I don't have a historic context on why it was just implemented on
> > > > > > > guest_entry, but it would make sense when we don't worry about latency
> > > > > > > to take the entry-only approach:
> > > > > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > > > >   twice per guest entry in the loop
> > > > > > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > > > > >   so there is no need to run it twice
> > > > > > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > > > > >   state every guest entry/exit cycle
> > > > > > >
> > > > > > > Upsides of the new strategy:
> > > > > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > > > >   there was a grace period request while guest was running, and timer
> > > > > > >   interrupt hits the cpu.
> > > > > > > - If the loop re-enters quickly there is a high chance that guest
> > > > > > >   entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > > > > >   as there is low probability of a grace period request happening
> > > > > > >   between exit & re-entry.
> > > > > > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > > > > >   running if any grace period request happens between guest exit and
> > > > > > >   guest re-entry, which is very important for low latency workloads
> > > > > > >   running on guests as it reduces maximum latency in long runs.
> > > > > > >
> > > > > > > What do you think?
> > > > > >
> > > > > > Try both on the workload of interest with appropriate tracing and
> > > > > > see what happens? The hardware's opinion overrides mine. ;-)
> > > > >
> > > > > That's a great approach!
> > > > >
> > > > > But in this case I think noting a quiescent state in guest exit is
> > > > > necessary to avoid a scenario in which a VM takes longer than RCU
> > > > > patience, and it ends up running rcuc in a nohz_full cpu, even if guest
> > > > > exit was quite brief.
> > > > >
> > > > > IIUC Sean's question is more on the tone of "Why KVM does not note a
> > > > > quiescent state in guest exit already, if it does in guest entry", and I
> > > > > just came with a few arguments to try finding a possible rationale, since
> > > > > I could find no discussion on that topic in the lore for the original
> > > > > commit.
> > > >
> > > > Understood, and maybe trying it would answer that question quickly.
> > > > Don't get me wrong, just because it appears to work in a few tests doesn't
> > > > mean that it really works, but if it visibly blows up, that answers the
> > > > question quite quickly and easily. ;-)
> > > >
> > > > But yes, if it appears to work, there must be a full investigation into
> > > > whether or not the change really is safe.
> > > >
> > > > Thanx, Paul
> > >
> > > Hello Paul, Sean, sorry for the delay on this.
> > >
> > > I tested x86 by counting cycles (using rdtsc_ordered()).
> > >
> > > Cycles were counted upon function entry/exit on
> > > {svm,vmx}_vcpu_enter_exit(), and right before / after
> > > __{svm,vmx}_vcpu_run() in the same function.
> > >
> > > The main idea was to get cycles spent in the procedures before entering
> > > guest (such as reporting RCU quiescent state in entry / exit) and the
> > > cycles actually used by the VM.
> > >
> > > Those cycles were summed-up and stored in per-cpu structures, with a
> > > counter to get the average value. I then created a debug file to read the
> > > results and reset the counters.
> > >
> > > As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
> > >
> > > The workload inside the VM consisted of cyclictest in 16 vcpus
> > > (SCHED_FIFO,p95), while maintaining its main routine in 4 other cpus
> > > (SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus.
> > >
> > >  $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200
> > >   --mainaffinity 0-3
> > >
> > > All tests were run for exactly 1 hour, and the clock counter was reset at
> > > the same moment cyclictest started. After that the VM was powered off from
> > > the guest. Results show the average for all CPUs in the same category, in cycles.
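As a rough illustration of the accounting described above (per-CPU sums plus a sample counter, averaged on read and then reset), here is a minimal Python model. It is not the debug patch used for the measurements: the class, field and parameter names are hypothetical, and treating "setup" as the time spent on both sides of the guest run inside the enter/exit function is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CycleStats:
    """Hypothetical per-CPU accumulator; a model, not the real debug code."""
    setup_cycles: int = 0   # cycles spent around the guest run (assumed definition)
    guest_cycles: int = 0   # cycles between the two inner timestamps
    entries: int = 0        # number of samples recorded

    def record(self, t_enter, t_before_run, t_after_run, t_exit):
        # The four timestamps stand in for rdtsc_ordered() reads taken at
        # function entry, right before the run, right after it, and at exit.
        self.setup_cycles += (t_before_run - t_enter) + (t_exit - t_after_run)
        self.guest_cycles += t_after_run - t_before_run
        self.entries += 1

    def averages(self):
        # Returns (average setup cycles, average guest cycles) per entry.
        if self.entries == 0:
            return (0.0, 0.0)
        return (self.setup_cycles / self.entries,
                self.guest_cycles / self.entries)

    def reset(self):
        # Mirrors "read the results and reset the counters".
        self.setup_cycles = 0
        self.guest_cycles = 0
        self.entries = 0


# Two made-up samples, only to exercise the arithmetic.
stats = CycleStats()
stats.record(t_enter=1000, t_before_run=1100, t_after_run=124400, t_exit=124486)
stats.record(t_enter=200000, t_before_run=200090, t_after_run=323390, t_exit=323486)
avg_setup, avg_guest = stats.averages()
print(avg_setup, avg_guest)
```

Keeping only sums and a counter makes the hot path a couple of additions, with the division deferred to the (rare) read of the debug file.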
> > >
> > > With above setup, I tested 2 use cases:
> > > 1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
> > > 2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and
> > >     RCU patience = 1000ms (best case for RT)
> > >
> > > Results are:
> > > # Test case 1:
> > > Vanilla: (average on all vcpus)
> > > VM Cycles / RT vcpu:          123287.75
> > > VM Cycles / non-RT vcpu:      709847.25
> > > Setup Cycles:                 186.00
> > > VM entries / RT vcpu:         58737094.81
> > > VM entries / non-RT vcpu:     10527869.25
> > > Total cycles in RT VM:        7241564260969.80
> > > Total cycles in non-RT VM:    7473179035472.06
> > >
> > > Patched: (average on all vcpus)
> > > VM Cycles / RT vcpu:          124695.31         (+ 1.14%)
> > > VM Cycles / non-RT vcpu:      710479.00         (+ 0.09%)
> > > Setup Cycles:                 218.65            (+17.55%)
> > > VM entries / RT vcpu:         60654285.44       (+ 3.26%)
> > > VM entries / non-RT vcpu:     11003516.75       (+ 4.52%)
> > > Total cycles in RT VM:        7563305077093.26  (+ 4.44%)
> > > Total cycles in non-RT VM:    7817767577023.25  (+ 4.61%)
> > >
> > > Discussion:
> > > Setup cycles rose by ~33 cycles, increasing overhead.
> > > It proves that noting a quiescent state in guest entry introduces setup
> > > routine costs, which is expected.
> > >
> > > On the other hand, both the average time spent inside the VM and the number
> > > of VM entries rose, causing the VM to have ~4.5% more cpu cycles
> > > available to run, which is positive. Extra cycles probably came from not
> > > having invoke_rcu_core() run after VM exit.
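The percentages in the test case 1 table can be re-derived from the raw figures. The short Python sketch below does that; the values are copied from the table above, while the helper name pct_change() and the dictionary layout are just illustrative choices.

```python
def pct_change(vanilla, patched):
    """Relative change from vanilla to patched, in percent."""
    return (patched - vanilla) / vanilla * 100.0


# (vanilla, patched) pairs copied from the test case 1 table.
case1 = {
    "VM Cycles / RT vcpu": (123287.75, 124695.31),
    "VM Cycles / non-RT vcpu": (709847.25, 710479.00),
    "Setup Cycles": (186.00, 218.65),
    "VM entries / RT vcpu": (58737094.81, 60654285.44),
    "VM entries / non-RT vcpu": (10527869.25, 11003516.75),
    "Total cycles in RT VM": (7241564260969.80, 7563305077093.26),
    "Total cycles in non-RT VM": (7473179035472.06, 7817767577023.25),
}

for name, (vanilla, patched) in case1.items():
    delta = patched - vanilla
    print(f"{name:28s} {delta:+.2f} ({pct_change(vanilla, patched):+.2f}%)")
```

The "Setup Cycles" row is where the "~33 cycles" in the discussion comes from: 218.65 - 186.00 = 32.65.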
> > >
> > >
> > > # Test case 2:
> > > Vanilla: (average on all vcpus)
> > > VM Cycles / RT vcpu:          123785.63
> > > VM Cycles / non-RT vcpu:      698758.25
> > > Setup Cycles:                 187.20
> > > VM entries / RT vcpu:         61096820.75
> > > VM entries / non-RT vcpu:     11191873.00
> > > Total cycles in RT VM:        7562908142051.72
> > > Total cycles in non-RT VM:    7820413591702.25
> > >
> > > Patched: (average on all vcpus)
> > > VM Cycles / RT vcpu:          123137.13         (- 0.52%)
> > > VM Cycles / non-RT vcpu:      696824.25         (- 0.28%)
> > > Setup Cycles:                 229.35            (+22.52%)
> > > VM entries / RT vcpu:         61424897.13       (+ 0.54%)
> > > VM entries / non-RT vcpu:     11237660.50       (+ 0.41%)
> > > Total cycles in RT VM:        7563685235393.27  (+ 0.01%)
> > > Total cycles in non-RT VM:    7830674349667.13  (+ 0.13%)
> > >
> > > Discussion:
> > > Setup cycles rose by ~42 cycles, increasing overhead.
> > > It proves that noting a quiescent state in guest entry introduces setup
> > > routine costs, which is expected.
> > >
> > > The average time spent inside the VM was reduced, but the number of VM
> > > entries rose, causing the VM to have around the same number of cpu cycles
> > > available to run, meaning that the overhead caused by reporting RCU
> > > quiescent state in VM exit got absorbed, and it may have to do with those
> > > rare invoke_rcu_core()s that were bothering latency.
> > >
> > > The difference is much smaller compared to case 1, and this is probably
> > > because there is a clause in rcu_pending() for isolated (nohz_full) cpus
> > > which may be already inhibiting a lot of invoke_rcu_core()s.
> > >
> > > Sean, Paul, what do you think?
> >
> > First, thank you for doing this work!
>
> Thank you!
>
> >
> > I must defer to Sean on how to handle this tradeoff. My kneejerk reaction
> > would be to add yet another knob, but knobs are not free.
>
> Sure, let's wait for Sean's feedback. But I am very positive with the
> results, as we have latency improvements for RT, and IIUC performance
> improvements for non-RT.
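A quick sanity check on the test case 2 figures: the reported totals for the RT vcpus appear to match (average cycles per VM entry) x (average number of VM entries) very closely, which is why a slightly shorter stay in the guest combined with slightly more entries leaves the total almost unchanged. The Python sketch below only re-does that multiplication; the numbers are copied from the tables above, the helper name relative_gap() is hypothetical, and the relation itself is an observation about the posted figures, not something the measurement code is known to compute.

```python
# (avg cycles per VM entry, avg VM entries, reported total) for RT vcpus,
# copied from the test case 2 tables.
rt_case2 = {
    "vanilla": (123785.63, 61096820.75, 7562908142051.72),
    "patched": (123137.13, 61424897.13, 7563685235393.27),
}


def relative_gap(cycles_per_entry, entries, reported_total):
    """Relative distance between cycles_per_entry * entries and the reported total."""
    estimate = cycles_per_entry * entries
    return abs(estimate - reported_total) / reported_total


for label, row in rt_case2.items():
    print(f"{label}: relative gap {relative_gap(*row):.2e}")
```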
>
> While we wait, I ran stress-ng in a VM with non-RT kernel (rcu.patience=0),

More info here:
The VM was on a host with non-RT kernel, without cpu isolation and with
unset rcu.patience.

> and could see instruction count rising by around 1.5% after applying the patch.
>
> Maybe I am missing something, but I thought we could get the ~4.5%
> improvement as seen above.
>
> Thanks!
> Leo
>
>
> >
> > Thanx, Paul
> >