From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: Jon Olson <jonolson@google.com>,
	willemdebruijn.kernel@gmail.com, caleb.raitto@gmail.com,
	davem@davemloft.net, netdev@vger.kernel.org,
	Caleb Raitto <caraitto@google.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
Date: Thu, 2 Aug 2018 01:27:13 +0300
Message-ID: <20180802012556-mutt-send-email-mst@kernel.org>
In-Reply-To: <706da951-ed30-85e8-c0aa-cb9ae8b3deb7@redhat.com>

On Mon, Jul 30, 2018 at 02:06:50PM +0800, Jason Wang wrote:
> 
> 
> On 2018-07-25 08:17, Jon Olson wrote:
> > On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > > > On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > > > > From the above linked patch, I understand that there are yet
> > > > > > other special cases in production, such as a hard cap on #tx queues to
> > > > > > 32 regardless of number of vcpus.
> > > > > I don't think upstream kernels have this limit - we can
> > > > > now use vmalloc for higher number of queues.
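
(For context on the vmalloc point: per-queue arrays can be allocated with
kvmalloc_array(), which falls back to vmalloc when a contiguous kmalloc
would be too large, so a high queue count is no longer a hard allocation
problem. A generic sketch only, not the exact netdev/virtio_net code path:)

/* Generic sketch: kvmalloc_array() tries kmalloc first and falls back to
 * vmalloc, so a large per-queue array does not need physically contiguous
 * memory.  "struct my_queue" is a hypothetical placeholder.
 */
#include <linux/mm.h>
#include <linux/slab.h>

struct my_queue { int dummy; };	/* hypothetical per-queue state */

static struct my_queue *alloc_queue_array(unsigned int nr_queues)
{
	/* zeroed; falls back to vmalloc for large nr_queues */
	return kvmalloc_array(nr_queues, sizeof(struct my_queue),
			      GFP_KERNEL | __GFP_ZERO);
}

/* free with kvfree(), which handles both kmalloc'd and vmalloc'd memory */
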
> > > > Yes. that patch* mentioned it as a google compute engine imposed
> > > > limit. It is exactly such cloud provider imposed rules that I'm
> > > > concerned about working around in upstream drivers.
> > > > 
> > > > * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> > > Yea. Why does GCE do it btw?
> > There are a few reasons for the limit, some historical, some current.
> > 
> > Historically we did this because of a kernel limit on the number of
> > TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> > the limit upstream at the time we did it was actually eight. We had
> > increased the limit from eight to 32 internally, and it appears
> > upstream has subsequently raised the limit to 256. We no longer
> > use TAP for networking, so that constraint no longer applies for us,
> > but when looking at removing/raising the limit we discovered no
> > workloads that clearly benefited from lifting it, and it also placed
> > more pressure on our virtual networking stack particularly on the Tx
> > side. We left it as-is.
> > 
> > In terms of current reasons there are really two. One is memory usage.
> > As you know, virtio-net uses rx/tx pairs, so there's an expectation
> > that the guest will have an Rx queue for every Tx queue. We run our
> > individual virtqueues fairly deep (4096 entries) to give guests a wide
> > time window for re-posting Rx buffers and avoiding starvation on
> > packet delivery. Filling an Rx vring with max-sized mergeable buffers
> > (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> > be up to 512MB of memory posted for network buffers. Scaling this to
> > the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> > all of the Rx rings full would (in the large average Rx packet size
> > case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> > of RAM available, but I don't believe we've observed a situation where
> > they would have benefited from having 2.5 gigs of buffers posted for
> > incoming network traffic :)
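
(Just to spell out the arithmetic quoted above -- a trivial sketch, the
numbers are the ones Jon gives, nothing driver-specific:)

/* 4096-entry rx rings filled with 4096-byte mergeable buffers. */
#include <stdio.h>

int main(void)
{
	unsigned long ring_depth = 4096;	/* entries per rx virtqueue */
	unsigned long buf_size   = 4096;	/* bytes per mergeable buffer */
	unsigned long per_ring   = ring_depth * buf_size;

	printf("per rx ring: %lu MB\n", per_ring >> 20);		/* 16 MB */
	printf("32 queues:   %lu MB\n", (32 * per_ring) >> 20);	/* 512 MB */
	printf("160 queues:  %.1f GB\n",
	       (160.0 * per_ring) / (1UL << 30));		/* 2.5 GB */
	return 0;
}
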
> 
> We can work to have async txq and rxq instead of pairs if there's a strong
> requirement.

I think the reason we don't is that RX queueing is programmed by TX
packets.  It might make sense if we supported an RX queueing policy that
isn't dependent on TX.
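
(To spell out "programmed by TX": with automatic receive steering the
device remembers which tx queue a flow was last transmitted on and steers
that flow's incoming packets to the paired rx queue. A rough sketch of the
rule, with made-up names, neither actual device nor driver code:)

#include <linux/types.h>

struct flow_entry {
	u32 flow_hash;	/* identifies the flow, e.g. a 5-tuple hash */
	u16 last_txq;	/* tx queue the flow was last transmitted on */
};

/* on transmit: remember the tx queue the flow used */
static void note_tx(struct flow_entry *fe, u16 txq)
{
	fe->last_txq = txq;
}

/* on receive: steer to the rx queue paired with that tx queue */
static u16 pick_rxq(const struct flow_entry *fe)
{
	return fe->last_txq;	/* rxqN is paired with txqN */
}

So decoupling the rx queue count from the tx queue count would first need
a steering policy that isn't keyed to the tx queue.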


> > 
> > The second reason is interrupt related -- as I mentioned above, we
> > have found no workloads that clearly benefit from so many queues, but
> > we have found workloads that degrade. In particular workloads that do
> > a lot of small packet processing but which aren't extremely latency
> > sensitive can achieve higher PPS by taking fewer interrupts across
> > fewer VCPUs due to better batching (this also incurs higher latency,
> > but at the limit the "busy" cores end up suppressing most interrupts
> > and spending most of their cycles farming out work). Memcache is a
> > good example here, particularly if the latency targets for request
> > completion are in the ~milliseconds range (rather than the
> > microseconds we typically strive for with TCP_RR-style workloads).
> > 
> > All of that said, we haven't been forthcoming with data (and
> > unfortunately I don't have it handy in a useful form, otherwise I'd
> > simply post it here), so I understand the hesitation to simply run
> > with napi_tx across the board. As Willem said, this patch seemed like
> > the least disruptive way to allow us to continue down the road of
> > "universal" NAPI Tx and to hopefully get data across enough workloads
> > (with VMs small, large, and absurdly large :) to present a compelling
> > argument in one direction or another. As far as I know there aren't
> > currently any NAPI related ethtool commands (based on a quick perusal
> > of ethtool.h)
> 
> As I suggested before, maybe we can (ab)use tx-frames-irq.
> 
> Thanks
> 
> > -- it seems like it would be fairly involved/heavyweight
> > to plumb one solely for this unless NAPI Tx is something many users
> > will want to tune (and for which other drivers would support tuning).
> > 
> > --
> > Jon Olson
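
On the tx-frames-irq idea, something along these lines might be enough: let
ethtool -C switch tx napi per device instead of a module parameter. A very
rough sketch only -- the helpers are hypothetical, this is not the actual
virtio_net code:

#include <linux/ethtool.h>
#include <linux/netdevice.h>

/* hypothetical helpers that would enable/disable tx napi at runtime */
void virtnet_enable_tx_napi(struct net_device *dev);
void virtnet_disable_tx_napi(struct net_device *dev);

static int virtnet_set_coalesce(struct net_device *dev,
				struct ethtool_coalesce *ec)
{
	/* treat tx-frames-irq as a boolean switch for tx napi */
	if (ec->tx_max_coalesced_frames_irq)
		virtnet_enable_tx_napi(dev);
	else
		virtnet_disable_tx_napi(dev);

	return 0;
}

Userspace would then run e.g. "ethtool -C eth0 tx-frames-irq 1" (or 0) on a
per-device basis rather than loading the module with a global parameter.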

