From: Jason Wang <jasowang@redhat.com>
To: Sridhar Samudrala <sri@us.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	habanero@linux.vnet.ibm.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, krkumar2@in.ibm.com,
	tahm@linux.vnet.ibm.com, akong@redhat.com, davem@davemloft.net,
	shemminger@vyatta.com, mashirle@us.ibm.com
Subject: Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support
Date: Thu, 28 Jun 2012 13:31:24 +0800	[thread overview]
Message-ID: <4FEBEC2C.8020700@redhat.com> (raw)
In-Reply-To: <4FEBE2FF.3010105@us.ibm.com>

On 06/28/2012 12:52 PM, Sridhar Samudrala wrote:
> On 6/27/2012 8:02 PM, Jason Wang wrote:
>> On 06/27/2012 04:44 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 27, 2012 at 01:16:30PM +0800, Jason Wang wrote:
>>>> On 06/26/2012 06:42 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Jun 26, 2012 at 11:42:17AM +0800, Jason Wang wrote:
>>>>>> On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
>>>>>>>> This patch adds multiqueue support to the tap device. This is
>>>>>>>> done by abstracting each queue as a file/socket and allowing
>>>>>>>> multiple sockets to be attached to the tuntap device (an array
>>>>>>>> of tun_file is stored in the tun_struct). Userspace can read
>>>>>>>> from and write to those files to send and receive packets in
>>>>>>>> parallel.
>>>>>>>>
>>>>>>>> Unlike the previous single queue implementation, the socket and
>>>>>>>> the device are loosely coupled: either of them is allowed to go
>>>>>>>> away first. To make the tx path lockless, netif_tx_lock_bh() is
>>>>>>>> replaced by RCU/NETIF_F_LLTX to synchronize the data path with
>>>>>>>> the system calls.
>>>>>>> Don't use LLTX/RCU. It's not worth it.
>>>>>>> Use something like netif_set_real_num_tx_queues.
>>>>>>>
>>>>>>>> The tx queue is selected first from the rxq index recorded in
>>>>>>>> the skb; if none is recorded, it is chosen based on the rx hash
>>>>>>>> (skb_get_rxhash()).
>>>>>>>>
>>>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>>>> Interestingly macvtap switched to hashing first:
>>>>>>> ef0002b577b52941fb147128f30bd1ecfdd3ff6d
>>>>>>> (the commit log is corrupted but see what it
>>>>>>> does in the patch).
>>>>>>> Any idea why?
>>>>>> Yes, so tap should be changed to behave the same as macvtap. I
>>>>>> remember the reason we did that was to make sure the packets of a
>>>>>> single flow are queued to a fixed socket/virtqueue, since 10G
>>>>>> cards like ixgbe choose the rx queue for a flow based on the last
>>>>>> tx queue from which the packets of that flow came. So if we used
>>>>>> the recorded rx queue in macvtap, the queue index of a flow would
>>>>>> change as the vhost thread moves among processors.
>>>>> Hmm. OTOH if you override this, if TX is sent from VCPU0, RX might 
>>>>> land
>>>>> on VCPU1 in the guest, which is not good, right?
>>>> Yes, but that is better than having rx move between vcpus when we
>>>> use the recorded rx queue.
>>> Why isn't this a problem with native TCP?
>>> I think what happens is one of the following:
>>> - moving between CPUs is more expensive with tun
>>>    because it can queue so much data on xmit
>>> - scheduler makes very bad decisions about VCPUs
>>>    bouncing them around all the time
>>
>> For a usual native TCP/host process, which reads and writes the TCP
>> sockets itself, it makes sense to move rx to the processor the
>> process moves to. But vhost does not do the TCP work, and ixgbe
>> would still move rx when the vhost process moves; we can't even make
>> sure the vhost process handling rx is running on the processor that
>> handles the rx interrupt.
>
> We also saw this behavior with the default ixgbe configuration. If
> vhost is pinned to a CPU, all packets for that VM are received on a
> single RX queue. So even if the VM is doing multiple TCP_RR sessions,
> packets for all the flows are received on a single RX queue. Without
> pinning, vhost moves around and so do the packets across the RX
> queues.
>
> I think
>         ethtool -K ethX ntuple on
> will disable this behavior, and it should be possible to program the
> flow director using ethtool -U. This way we can split the packets
> across the host NIC RX queues based on the flows, but it is not clear
> whether this would help with the current model of a single vhost per
> device. With per-cpu vhost, each RX queue can be handled by the
> matching vhost, but if we have only 1 queue in the VM's virtio-net
> device, that could become the bottleneck.

Yes, I've been thinking about this. Instead of using ethtool -U (maybe
possible for macvtap but hard for tuntap), we can 'teach' ixgbe which
rxq it should use for a flow, because ixgbe_select_queue() first
selects the txq based on the recorded rxq. So if we want a flow to use
a dedicated rxq, say N, we can record N as the rxq in tuntap before
passing the skb to the bridge.
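
A minimal sketch of that idea (not from the posted patch; the helper
name tun_steer_flow_to_rxq() and how N is picked are hypothetical):
tuntap would record the chosen queue on the skb before handing it to
the bridge, so a driver such as ixgbe, whose queue selection prefers
the recorded rx queue, keeps the whole flow on that queue:

#include <linux/skbuff.h>

/* Hypothetical helper, only to illustrate the idea above: record queue
 * N on the skb so that ixgbe_select_queue(), which prefers the recorded
 * rx queue, keeps the whole flow on queue N.
 */
static inline void tun_steer_flow_to_rxq(struct sk_buff *skb, u16 rxq)
{
        /* skb_record_rx_queue() stores rxq + 1 in skb->queue_mapping,
         * so skb_rx_queue_recorded() can later tell "set" from "unset".
         */
        skb_record_rx_queue(skb, rxq);
}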

> Multi-queue virtio-net should help here, but we need the same number
> of queues in the VM's virtio-net device as in the host's NIC so that
> each vhost can handle the corresponding virtio queue. But if the VM
> has only 2 vcpus, I think it is not efficient to have 8 virtio-net
> queues (to match a host with 8 physical cpus and 8 RX queues in the
> NIC).

Ideally, if we have 2 queues in the guest, it's better to use only 2
queues in the host to avoid extra contention.
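
Roughly, limiting the host side to the guest's queue count could be
done along the lines of the netif_set_real_num_tx_queues() suggestion
earlier in the thread. The helper below is only a sketch; its name and
error handling are assumptions, not code from the patch:

#include <linux/netdevice.h>

/* Sketch only: shrink the number of queues the stack actually uses to
 * match the guest (e.g. 2), even if the device was created with more.
 */
static int tun_limit_queues(struct net_device *dev, unsigned int nqueues)
{
        int err;

        err = netif_set_real_num_tx_queues(dev, nqueues);
        if (err)
                return err;

        return netif_set_real_num_rx_queues(dev, nqueues);
}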
>
> Thanks
> Sridhar
>
>>
>>> Could we isolate which it is? Does the problem
>>> still happen if you pin VCPUs to host cpus?
>>> If not it's the queue depth.
>>
>> It may not help, as tun does not record the vcpu/queue that sent the
>> stream, so it can't transmit the packets back to the same
>> vcpu/queue.
>>>> Flow steering is needed to make sure tx and rx stay on the same
>>>> vcpu.
>>> That involves IPI between processes, so it might be
>>> very expensive for kvm.
>>>
>>>>>> But while testing tun/tap, one interesting thing I found is that
>>>>>> even though ixgbe has recorded the queue index during rx, it
>>>>>> seems to be lost when tap tries to transmit skbs to userspace.
>>>>> dev_pick_tx does this I think but ndo_select_queue
>>>>> should be able to get it without trouble.
>>>>>
>>>>>
>

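For reference, here is a minimal sketch of the tx queue selection
described above (recorded rxq first, skb_get_rxhash() as fallback). It
is not the code from the patch; the function name, the standalone
'numqueues' parameter and the plain modulo mapping are assumptions for
illustration:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch of the selection policy discussed in this thread: prefer the
 * rx queue recorded on the skb, otherwise hash the flow. 'numqueues'
 * stands for the number of attached tun_file sockets.
 */
static u16 tun_select_queue_sketch(struct net_device *dev,
                                   struct sk_buff *skb,
                                   unsigned int numqueues)
{
        u32 rxhash;

        if (skb_rx_queue_recorded(skb))
                return skb_get_rx_queue(skb) % numqueues;

        rxhash = skb_get_rxhash(skb);
        return rxhash ? rxhash % numqueues : 0;
}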

Thread overview: 28+ messages
     [not found] <20120625060830.6765.27584.stgit@amd-6168-8-1.englab.nay.redhat.com>
     [not found] ` <20120625061018.6765.76633.stgit@amd-6168-8-1.englab.nay.redhat.com>
2012-06-25  8:25   ` [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support Michael S. Tsirkin
2012-06-25  8:41     ` Michael S. Tsirkin
2012-06-26  3:42     ` Jason Wang
2012-06-26 10:42       ` Michael S. Tsirkin
2012-06-27  5:16         ` Jason Wang
2012-06-27  8:44           ` Michael S. Tsirkin
2012-06-28  3:02             ` Jason Wang
2012-06-28  4:52               ` Sridhar Samudrala
2012-06-28  5:31                 ` Jason Wang [this message]
2012-06-26  5:52     ` Jason Wang
2012-06-26 11:54       ` Michael S. Tsirkin
2012-06-27  5:59         ` Jason Wang
2012-06-27  8:26           ` Michael S. Tsirkin
2012-06-28  3:15             ` Jason Wang
     [not found] ` <20120625060945.6765.98618.stgit@amd-6168-8-1.englab.nay.redhat.com>
2012-06-25  8:27   ` [net-next RFC V3 PATCH 1/6] tuntap: move socket to tun_file Michael S. Tsirkin
2012-06-26  5:55     ` Jason Wang
2012-06-25 11:59 ` [net-next RFC V3 0/6] Multiqueue support in tun/tap Jason Wang
2012-06-25 11:59 ` [PATCH 1/6] tuntap: move socket to tun_file Jason Wang
2012-06-25 11:59 ` [PATCH 2/6] tuntap: categorize ioctl Jason Wang
2012-06-25 11:59 ` [PATCH 3/6] tuntap: introduce multiqueue flags Jason Wang
2012-06-25 11:59 ` [PATCH 4/6] tuntap: multiqueue support Jason Wang
2012-06-25 11:59 ` [PATCH 5/6] tuntap: per queue 64 bit stats Jason Wang
2012-06-25 12:52   ` Eric Dumazet
2012-06-26  6:00     ` Jason Wang
2012-06-26  6:10       ` Eric Dumazet
2012-06-26  6:28         ` Jason Wang
2012-06-26 19:46       ` [PATCH 5/6] tuntap: per queue 64 bit stats Michael S. Tsirkin
2012-06-25 11:59 ` [PATCH 6/6] tuntap: add ioctls to attach or detach a file form tuntap device Jason Wang
