From: Venkatesh Srinivas
Subject: Re: [PATCH] virtio_ring: Shadow available ring flags & index
Date: Fri, 20 Nov 2015 10:30:11 -0800
Message-ID: <20151120183011.GA24228@google.com>
References: <1447201267-30024-1-git-send-email-venkateshs@google.com>
 <20151111142647-mutt-send-email-mst@redhat.com>
 <20151113234140.GA6512@google.com>
 <20151118040818.GA17436@google.com>
 <20151118042839.GA24662@google.com>
To: "Xie, Huawei"
Cc: "Michael S. Tsirkin", "virtualization@lists.linux-foundation.org",
 Paolo Bonzini, David Matlack, KVM list, "luto@kernel.org",
 Rusty Russell, Venkatesh Srinivas

On Thu, Nov 19, 2015 at 04:15:48PM +0000, Xie, Huawei wrote:
> On 11/18/2015 12:28 PM, Venkatesh Srinivas wrote:
> > On Tue, Nov 17, 2015 at 08:08:18PM -0800, Venkatesh Srinivas wrote:
> >> On Mon, Nov 16, 2015 at 7:46 PM, Xie, Huawei wrote:
> >>
> >>> On 11/14/2015 7:41 AM, Venkatesh Srinivas wrote:
> >>>> On Wed, Nov 11, 2015 at 02:34:33PM +0200, Michael S. Tsirkin wrote:
> >>>>> On Tue, Nov 10, 2015 at 04:21:07PM -0800, Venkatesh Srinivas wrote:
> >>>>>> Improves cacheline transfer flow of available ring header.
> >>>>>>
> >>>>>> Virtqueues are implemented as a pair of rings, one producer->consumer
> >>>>>> avail ring and one consumer->producer used ring; preceding the
> >>>>>> avail ring in memory are two contiguous u16 fields -- avail->flags
> >>>>>> and avail->idx. A producer posts work by writing to avail->idx and
> >>>>>> a consumer reads avail->idx.
> >>>>>>
> >>>>>> The flags and idx fields only need to be written by a producer CPU
> >>>>>> and only read by a consumer CPU; when the producer and consumer are
> >>>>>> running on different CPUs and the virtio_ring code is structured to
> >>>>>> only have source writes/sink reads, we can continuously transfer the
> >>>>>> avail header cacheline between cores in 'M' state. This flow
> >>>>>> optimizes core -> core bandwidth on certain CPUs.
> >>>>>>
> >>>>>> (see: "Software Optimization Guide for AMD Family 15h Processors",
> >>>>>> Section 11.6; similar language appears in the 10h guide and should
> >>>>>> apply to CPUs w/ exclusive caches, using LLC as a transfer cache)
> >>>>>>
> >>>>>> Unfortunately, the existing virtio_ring code issued reads of
> >>>>>> avail->idx and read-modify-writes of avail->flags on the producer.
> >>>>>>
> >>>>>> This change shadows the flags and index fields in producer memory;
> >>>>>> the vring code now reads from the shadows and only ever writes to
> >>>>>> avail->flags and avail->idx, allowing the cacheline to transfer
> >>>>>> core -> core optimally.
> >>>>> Sounds logical, I'll apply this after a bit of testing
> >>>>> of my own, thanks!
> >>>> Thanks!
> >>> Venkatesh:
> >>> Is it that your patch only applies to CPUs w/ exclusive caches?
> >> No -- it applies whenever the inter-cache coherence flow is optimized
> >> by 'M' -> 'M' transfers and whenever producer reads might interfere
> >> with consumer PREFETCHW/reads. The AMD optimization guides have
> >> specific language on this subject, but other platforms may benefit
> >> as well. (see Intel #'s below)
> For the core-to-core case (not an HT pair), after the consumer reads that
> 'M' cache line for avail_idx, is that line still in the producer core's
> L1 data cache, with its state changing from M -> O?

Textbook MOESI would not allow that state combination -- when the consumer
gets the line in 'M' state, the producer cannot hold it in 'O' state.
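The producer-side shadowing described in the quoted commit message can be
sketched in a few lines of C. This is a simplified model, not the kernel
code: `struct avail_hdr`, `struct producer`, and the function names are
illustrative, and the real `struct vring_avail` also carries the ring array
itself. The point is that every producer *read* hits producer-private
shadows, and the shared header only ever sees stores:

```c
#include <stdint.h>

/* Simplified stand-in for the avail ring header: two contiguous u16
 * fields, written by the producer and read by the consumer. */
struct avail_hdr {
    volatile uint16_t flags;
    volatile uint16_t idx;
};

/* Producer-private state: shadow copies of flags and idx. The producer
 * reads only the shadows, so it never issues a load (or RMW) against the
 * shared cacheline and never pulls it back from the consumer's cache. */
struct producer {
    struct avail_hdr *hdr;
    uint16_t shadow_flags;
    uint16_t shadow_idx;
};

/* Publish n_bufs new buffers: bump the shadow, then do a single store
 * to the shared line -- no prior read of hdr->idx. */
static void producer_publish(struct producer *p, uint16_t n_bufs)
{
    p->shadow_idx += n_bufs;
    p->hdr->idx = p->shadow_idx;
}

/* Update flags, comparing against the shadow rather than reading
 * hdr->flags back (the old code's read-modify-write). */
static void producer_set_flags(struct producer *p, uint16_t flags)
{
    if (p->shadow_flags != flags) {
        p->shadow_flags = flags;
        p->hdr->flags = flags;
    }
}
```

With this structure, the shared cacheline can stay in a write-optimized
flow: producer stores, consumer reads, with no producer loads in between.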
On the AMD Piledriver, per the Optimization Guide, I use PREFETCHW + load
to get the line into 'M' state on the consumer (invalidating it in the
producer's cache):

"* Use PREFETCHW on the consumer side, even if the consumer will not
modify the data."

That, plus the "Optimizing Inter-Core Data Transfer" section, implies that
PREFETCHW + MOV will cause the consumer to load the line in 'M' state.

PREFETCHW was not available on Intel CPUs before Broadwell; from the
public documentation alone, I don't think we can tell what transition the
producer's cacheline undergoes on those cores. For that matter, the latest
documentation I can find (for Nehalem) indicates there was no 'O' state at
all -- Nehalem implemented MESIF, not MOESI.

HTH,
-- vs;
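P.S. For anyone wanting to experiment with the consumer-side write-hint
prefetch discussed above: with GCC/Clang it can be expressed portably via
`__builtin_prefetch` (second argument 1 = prefetch for write), which emits
PREFETCHW on targets that have it. A minimal sketch -- `avail_hdr` and
`consumer_read_idx` are illustrative names, not kernel APIs:

```c
#include <stdint.h>

struct avail_hdr {
    volatile uint16_t flags;
    volatile uint16_t idx;
};

/* Consumer-side read of avail->idx, preceded by a write-hint prefetch.
 * Per the AMD guide, prefetching for write even though we will not write
 * pulls the line into the consumer's cache in 'M' state, so the producer's
 * next store finds it exclusively owned rather than shared. On targets
 * without PREFETCHW the builtin degrades to a plain prefetch (or nothing),
 * so this is safe everywhere. */
static uint16_t consumer_read_idx(const struct avail_hdr *h)
{
    __builtin_prefetch((const void *)&h->idx, 1 /* rw = write */);
    return h->idx;
}
```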