From: "Michael S. Tsirkin" <mst@redhat.com>
To: Krishna Kumar2 <krkumar2@in.ibm.com>
Cc: anthony@codemonkey.ws, arnd@arndb.de, avi@redhat.com,
	davem@davemloft.net, kvm@vger.kernel.org, netdev@vger.kernel.org,
	rusty@rustcorp.com.au
Subject: Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Date: Tue, 5 Oct 2010 20:23:23 +0200
Message-ID: <20101005182323.GA25852@redhat.com>
In-Reply-To: <OF13594229.1A55A20C-ON652577B3.00393C8D-652577B3.003A54C9@in.ibm.com>

On Tue, Oct 05, 2010 at 04:10:00PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/19/2010 06:14:43 PM:
> 
> > Could you document exactly how you measure multistream bandwidth:
> > netperf flags, etc.?
> 
> All results were without any netperf flags or system tuning:
>     for i in $list
>     do
>         netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
>     done
>     wait
> Another script processes the result files.  It also displays the
> start time/end time of each iteration to make sure skew due to
> parallel netperfs is minimal.
> 
> I changed the vhost functionality once more to try to get the
> best model, the new model being:
> 1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
> 2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles
>    TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
>    queues are handled by vhost threads in round-robin fashion.
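
The queue-to-thread model quoted above can be sketched as follows. This is
only an illustration, not the patch's code: the function names are
hypothetical, and plain modulo wrap-around is an assumed reading of
"round-robin fashion".

```python
MAX_TX_VHOSTS = 4  # "MAX is 4" in the quoted description


def vhost_for_tx_queue(txq: int, numtxqs: int) -> int:
    """Return the index of the vhost thread that services TX queue `txq`."""
    if not 0 <= txq < numtxqs:
        raise ValueError("txq out of range")
    if numtxqs == 1:
        return 0  # a single thread handles both RX and TX
    # vhost[0] is reserved for RX; TX queues wrap over vhost[1..MAX]
    # (assumption: round-robin means simple modulo assignment).
    return 1 + (txq % MAX_TX_VHOSTS)


def vhost_thread_count(numtxqs: int) -> int:
    """Total vhost threads: one for RX plus at most MAX_TX_VHOSTS for TX."""
    if numtxqs == 1:
        return 1
    return 1 + min(numtxqs, MAX_TX_VHOSTS)


for n in (1, 2, 4, 8, 16):
    mapping = [vhost_for_tx_queue(q, n) for q in range(n)]
    print(f"numtxqs={n}: vhosts={vhost_thread_count(n)} tx->vhost={mapping}")
```

Under these assumptions the thread counts agree with the "#vhost" figures
quoted in the summaries further down (#txqs=2 gives 3, #txqs=4 and above
give 5).
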
> 
> Results from here on are with these changes, and only "tuning" is
> to set each vhost's affinity to CPUs[0-3] ("taskset -p f <vhost-pids>").
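
For readers unfamiliar with taskset masks: without -c, taskset reads its
argument as a hexadecimal CPU bitmask, so "f" selects CPUs 0-3. A small
sketch of that decoding (the helper name is hypothetical):

```python
from typing import List


def cpus_in_mask(hex_mask: str) -> List[int]:
    """Decode a taskset-style hex affinity mask into a list of CPU numbers."""
    mask = int(hex_mask, 16)
    # Bit N set in the mask means CPU N is allowed.
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


print(cpus_in_mask("f"))
```
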
> 
> > Any idea where this comes from?
> > Do you see more TX interrupts? RX interrupts? Exits?
> > Do interrupts bounce more between guest CPUs?
> > 4. Identify reasons for single netperf BW regression.
> 
> After testing various combinations of #txqs, #vhosts, #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.

Right. Can we fix it?

>  I did two more tests:
>     1. Pin vhosts to same CPU:
>         - BW drop is much lower for 1 stream case (- 5 to -8% range)
>         - But performance is not so high for more sessions.
>     2. Changed vhost to be single threaded:
>         - No degradation for 1 session, and improvement for up to
>           8, sometimes 16 streams (5-12%).
>         - BW degrades after that, all the way to 128 netperf sessions.
>         - But overall CPU utilization improves.
>             Summary of the entire run (for 1-128 sessions):
>                 txq=4:  BW: (-2.3)      CPU: (-16.5)    RCPU: (-5.3)
>                 txq=16: BW: (-1.9)      CPU: (-24.9)    RCPU: (-9.6)
> 
> I don't see any of the causes mentioned above.  However, for a higher
> number of netperf sessions, I see a big increase in retransmissions:

Hmm, ok, and do you see any errors?

> _______________________________________
> #netperf      ORG           NEW
>             BW (#retr)    BW (#retr)
> _______________________________________
> 1          70244 (0)     64102 (0)
> 4          21421 (0)     36570 (416)
> 8          21746 (0)     38604 (148)
> 16         21783 (0)     40632 (464)
> 32         22677 (0)     37163 (1053)
> 64         23648 (4)     36449 (2197)
> 128        23251 (2)     31676 (3185)
> _______________________________________
> 
> Single netperf case didn't have any retransmissions so that is not
> the cause for drop.  I tested ixgbe (MQ):
> ___________________________________________________________
> #netperf      ixgbe             ixgbe (pin intrs to cpu#0 on
>                                        both server/client)
>             BW (#retr)          BW (#retr)
> ___________________________________________________________
> 1           3567 (117)          6000 (251)
> 2           4406 (477)          6298 (725)
> 4           6119 (1085)         7208 (3387)
> 8           6595 (4276)         7381 (15296)
> 16          6651 (11651)        6856 (30394)

Interesting.
You are saying we get many more retransmissions with a physical NIC as
well?

> ___________________________________________________________
> 
> > 5. Test perf in more scenarios:
> >    small packets
> 
> 512 byte packets - BW drops for up to 8 (sometimes 16) netperf sessions,
> but increases with more sessions:
> _______________________________________________________________________________
> #       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> 1       4043    3800 (-6.0)     50      50 (0)          86      98 (13.9)
> 2       8358    7485 (-10.4)    153     178 (16.3)      230     264 (14.7)
> 4       20664   13567 (-34.3)   448     490 (9.3)       530     624 (17.7)
> 8       25198   17590 (-30.1)   967     1021 (5.5)      1085    1257 (15.8)
> 16      23791   24057 (1.1)     1904    2220 (16.5)     2156    2578 (19.5)
> 24      23055   26378 (14.4)    2807    3378 (20.3)     3225    3901 (20.9)
> 32      22873   27116 (18.5)    3748    4525 (20.7)     4307    5239 (21.6)
> 40      22876   29106 (27.2)    4705    5717 (21.5)     5388    6591 (22.3)
> 48      23099   31352 (35.7)    5642    6986 (23.8)     6475    8085 (24.8)
> 64      22645   30563 (34.9)    7527    9027 (19.9)     8619    10656 (23.6)
> 80      22497   31922 (41.8)    9375    11390 (21.4)    10736   13485 (25.6)
> 96      22509   32718 (45.3)    11271   13710 (21.6)    12927   16269 (25.8)
> 128     22255   32397 (45.5)    15036   18093 (20.3)    17144   21608 (26.0)
> _______________________________________________________________________________
> SUM:    BW: (16.7)      CPU: (20.6)     RCPU: (24.3)
> _______________________________________________________________________________
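
The mail does not say how the percentages are derived. The figures are
consistent with (new - old) / old * 100, truncated to one decimal (for
example RCPU 86 -> 98 is listed as 13.9, not 14.0), and with the SUM row
being the same calculation applied to the column totals. A minimal sketch
under that inferred reading, using the BW columns of the 512 byte table:

```python
import math

# BW1 (original) and BW2 (multiqueue) columns from the 512 byte table above.
bw1 = [4043, 8358, 20664, 25198, 23791, 23055, 22873,
       22876, 23099, 22645, 22497, 22509, 22255]
bw2 = [3800, 7485, 13567, 17590, 24057, 26378, 27116,
       29106, 31352, 30563, 31922, 32718, 32397]


def pct_change(old: float, new: float) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


def truncate1(x: float) -> float:
    """Truncate (not round) to one decimal, as the table appears to do."""
    return math.trunc(x * 10) / 10


# Per-row change, then the change of the column totals (the SUM row).
for old, new in zip(bw1, bw2):
    print(old, new, truncate1(pct_change(old, new)))
print("SUM BW:", truncate1(pct_change(sum(bw1), sum(bw2))))
```
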
> 
> > host -> guest
> _______________________________________________________________________________
> #       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> *1      70706   90398 (27.8)    300     327 (9.0)       140     175 (25.0)
> 2       20951   21937 (4.7)     188     196 (4.2)       93      103 (10.7)
> 4       19952   25281 (26.7)    397     496 (24.9)      210     304 (44.7)
> 8       18559   24992 (34.6)    802     1010 (25.9)     439     659 (50.1)
> 16      18882   25608 (35.6)    1642    2082 (26.7)     953     1454 (52.5)
> 24      19012   26955 (41.7)    2465    3153 (27.9)     1452    2254 (55.2)
> 32      19846   26894 (35.5)    3278    4238 (29.2)     1914    3081 (60.9)
> 40      19704   27034 (37.2)    4104    5303 (29.2)     2409    3866 (60.4)
> 48      19721   26832 (36.0)    4924    6418 (30.3)     2898    4701 (62.2)
> 64      19650   26849 (36.6)    6595    8611 (30.5)     3975    6433 (61.8)
> 80      19432   26823 (38.0)    8244    10817 (31.2)    4985    8165 (63.7)
> 96      20347   27886 (37.0)    9913    13017 (31.3)    5982    9860 (64.8)
> 128     19108   27715 (45.0)    13254   17546 (32.3)    8153    13589 (66.6)
> _______________________________________________________________________________
> SUM:    BW: (32.4)      CPU: (30.4)     RCPU: (62.6)
> _______________________________________________________________________________
> *: Sum over 7 iterations, remaining test cases are sum over 2 iterations
> 
> > guest <-> external
> 
> I haven't done this right now since I don't have a setup.  I guess
> it would be limited by wire speed and gains may not be there.  I
> will try to do this later when I get the setup.

OK, but we at least need to check that it does not hurt things.

> > in last case:
> > find some other way to measure host CPU utilization,
> > try multiqueue and single queue devices
> > 6. Use above to figure out what is a sane default for numtxqs
> 
> A. Summary for default I/O (16K):
> #txqs=2 (#vhost=3):       BW: (37.6)      CPU: (69.2)     RCPU: (40.8)
> #txqs=4 (#vhost=5):       BW: (36.9)      CPU: (60.9)     RCPU: (25.2)
> #txqs=8 (#vhost=5):       BW: (41.8)      CPU: (50.0)     RCPU: (15.2)
> #txqs=16 (#vhost=5):      BW: (40.4)      CPU: (49.9)     RCPU: (10.0)
> 
> B. Summary for 512 byte I/O:
> #txqs=2 (#vhost=3):       BW: (31.6)      CPU: (35.7)     RCPU: (28.6)
> #txqs=4 (#vhost=5):       BW: (5.7)       CPU: (27.2)     RCPU: (22.7)
> #txqs=8 (#vhost=5):       BW: (-.6)       CPU: (25.1)     RCPU: (22.5)
> #txqs=16 (#vhost=5):      BW: (-6.6)      CPU: (24.7)     RCPU: (21.7)
> 
> Summary:
> 
> 1. Average BW increase for regular I/O is best for #txq=16 with the
>    least CPU utilization increase.
> 2. The average BW for 512 byte I/O is best for the lower #txq=2. For higher
>    #txqs, BW increased only beyond a certain number of netperf sessions - in
>    my testing that threshold was 32 netperf sessions.
> 3. Multiple txqs in the guest do not, by themselves, seem to cause any
>    issues.  The guest CPU% increase is slightly higher than the BW
>    improvement.  I think this is true for all mq drivers since more paths
>    run in parallel up to the device instead of sleeping and allowing one
>    thread to send all packets via qdisc_restart.
> 4. Having a high number of txqs gives better gains and reduces CPU util
>    on the guest and the host.
> 5. MQ is intended for server loads.  MQ should probably not be explicitly
>    specified for client systems.
> 6. No regression with numtxqs=1 (or if mq option is not used) in any
>    testing scenario.

Of course txq=1 can be considered a kind of fix, but if we know the
issue is TX/RX flows getting bounced between CPUs, can we fix this?
Workload-specific optimizations can only get us so far.

> 
> I will send the v3 patch within a day after some more testing.
> 
> Thanks,
> 
> - KK
