Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access

From: John Fastabend <john.r.fastabend@intel.com>
To: Willem de Bruijn <willemb@google.com>,
	"Zhou, Danny" <danny.zhou@intel.com>
Cc: John Fastabend <john.fastabend@gmail.com>,
	Daniel Borkmann <dborkman@redhat.com>,
	Florian Westphal <fw@strlen.de>,
	"gerlitz.or@gmail.com" <gerlitz.or@gmail.com>,
	Hannes Frederic Sowa <hannes@stressinduktion.org>,
	Network Development <netdev@vger.kernel.org>,
	"Ronciak, John" <john.ronciak@intel.com>,
	Amir Vadai <amirv@mellanox.com>,
	Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
Date: Tue, 07 Oct 2014 08:55:23 -0700
Message-ID: <54340CEB.204@intel.com>
In-Reply-To: <CA+FuTSe=vo1-Xpk+318SNc-mCH_c0WQadXo3usiA_dRBNx_fEQ@mail.gmail.com>

On 10/07/2014 08:46 AM, Willem de Bruijn wrote:
>>>> Typically in an af_packet interface a packet_type handler is
>>>> registered and used to filter traffic to the socket and do other
>>>> things such as fan out traffic to multiple sockets. In this case the
>>>> networking stack is being bypassed so this code is not run. So the
>>>> hardware must push the correct traffic to the queues obtained from
>>>> the ndo callback ndo_split_queue_pairs().
>>>
>>> Why does the interface work at the level of queue_pairs instead of
>>> individual queues?
>>
>> The user mode "slave" driver (I call it a slave driver because it is only responsible for packet I/O
>> on certain queue pairs) needs to take over at least one rx queue and one tx queue for ingress and
>> egress traffic respectively, although the flow director only applies to ingress traffic.
> 
> That requirement of co-allocation is absent in existing packet
> rings. Many applications only receive or transmit. For
> receive-only, it would even be possible to map descriptor
> rings read-only, if the kernel remains responsible for posting
> buffers -- but I see below that that is not the case, so that's
> not very relevant here.
> 
> Still, some workloads want asymmetric sets of rx and tx rings.
> For instance, instead of using RSS, a process may want to
> receive on as few rings as possible, load balance across
> workers in software, but still give each worker thread its own
> private transmit ring.
> 

We can build this into the interface by having the setsockopt
provide both the number of tx rings and number of rx rings. It
might not be immediately available in any drivers because at
least ixgbe is pretty dependent on tx/rx pairing.

I would have to look through the other drivers to see how
much work it would be to support this on them. If I can't find
a good candidate we might leave it out until we can fix up the
drivers.
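
To make the idea concrete, here is a rough sketch of what a request
with separate rx and tx counts could look like, plus the check a
paired-only driver would need. This is not the struct from the patch:
ring_split_req, ring_split_caps and ring_split_validate() are
hypothetical names, and the error codes are only a suggestion.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical request layout (not the struct used in the patch). */
struct ring_split_req {
	int start_idx;		/* first queue index, or -1: driver picks */
	unsigned int num_rx;	/* number of rx rings wanted */
	unsigned int num_tx;	/* number of tx rings wanted */
};

/* Hypothetical description of what a driver can currently offer. */
struct ring_split_caps {
	bool paired_only;	/* driver can only hand out rx/tx pairs */
	unsigned int free_rx;	/* rx rings still available */
	unsigned int free_tx;	/* tx rings still available */
};

/* Returns 0 if the request can be satisfied, a negative errno if not. */
int ring_split_validate(const struct ring_split_req *req,
			const struct ring_split_caps *caps)
{
	if (req->start_idx < -1)
		return -EINVAL;
	if (req->num_rx == 0 && req->num_tx == 0)
		return -EINVAL;
	/* A pairing-dependent driver cannot serve asymmetric requests. */
	if (caps->paired_only && req->num_rx != req->num_tx)
		return -EOPNOTSUPP;
	if (req->num_rx > caps->free_rx || req->num_tx > caps->free_tx)
		return -EBUSY;
	return 0;
}
```

With something like this, a request for one rx ring and four tx rings
would fail cleanly on a paired-only driver and succeed on a driver
that can split rings independently.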

> 
>>>
>>>>         /* Get the layout of ring space offset, page_sz, cnt */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>>>>                    &info, &optlen);
>>>>
>>>>         /* request some queues from the driver */
>>>>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>>
>>>>         /* if we let the driver pick queues for us, learn which
>>>>          * queues we were given
>>>>          */
>>>>         optlen = sizeof(qpairs_info);
>>>>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, &optlen);
>>>
>>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>>> then the setsockopt can pass the exact id of that queue and there
>>> is no need for a getsockopt follow-up.
>>
>> Very good point. It supports passing "-1" as the queue id (followed by the number of qpairs needed) via
>> setsockopt to af_packet and the NIC kernel driver, to ask the driver to dynamically allocate free and
>> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
>> Initially, we had an implementation that calls getsockopt once and af_packet treats qpairs_info
>> as an IN/OUT parameter, but it is semantically wrong, so we think the above implementation is most suitable.
>> But I agree with you: if setsockopt can pass the exact id with a valid queue pair index, there is no need
>> to call getsockopt.
> 
> One step further would be to move the entire configuration behind
> the packet socket interface. It's perhaps out of scope of this patch,
> but the difference between using `ethtool -U` and passing the same
> expression through the packet socket is that in the latter case the
> kernel can automatically rollback the configuration change when the
> process dies.
> 

Hmm, that might be interesting. I think this is a follow-on path to
investigate after the initial support.
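
Purely as an illustration of the lifetime idea (nothing like this is
in the current patch), the socket could remember which steering rules
it installed and remove them in its release path. The names below
(sock_rules, sock_rules_track(), sock_rules_release()) and the fixed
table size are hypothetical.

```c
#include <stddef.h>

#define MAX_RULES 8

/* Hypothetical per-socket record of the rule ids this socket installed. */
struct sock_rules {
	int ids[MAX_RULES];
	size_t count;
};

/* Remember a rule id; returns -1 when the table is full. */
int sock_rules_track(struct sock_rules *sr, int rule_id)
{
	if (sr->count >= MAX_RULES)
		return -1;
	sr->ids[sr->count++] = rule_id;
	return 0;
}

/*
 * Meant to be called from the socket release path, which also runs
 * when the owning process dies. Rules are undone newest first.
 * del_rule is whatever removes one rule from the device; it may be
 * NULL for a dry run. Returns the number of rules dropped.
 */
size_t sock_rules_release(struct sock_rules *sr, void (*del_rule)(int rule_id))
{
	size_t removed = 0;

	while (sr->count > 0) {
		int id = sr->ids[--sr->count];

		if (del_rule)
			del_rule(id);
		removed++;
	}
	return removed;
}
```

The point is only that the rule lifetime follows the socket, which is
what `ethtool -U` cannot give us.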

>>
>>>
>>>>         /* And mmap queue pairs to user space */
>>>>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>>>>              MAP_SHARED, fd, 0);
>>>
>>> How will packet data be mapped and how will userspace translate
>>> from paddr to vaddr? Is the goal to maintain long lived mappings
>>> and instruct drivers to allocate from this restricted range (to
>>> avoid per-packet system calls and vma operations)?
>>>
>>
>> Once the qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
>> completely in user space by using paddr (in the case of DPDK, the vaddr of the huge pages DPDK uses
>> is translated to paddr) to fill in the packet descriptors.
> 
> Ah, userspace is responsible for posting buffers and translation
> from vaddr to paddr is straightforward. Yes that makes sense.
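
For reference, one common way for user space to do the vaddr to paddr
translation is to read /proc/self/pagemap. The sketch below is not
taken from the patch or from DPDK; the function names are mine. It
assumes the 64-bit pagemap entry layout (bit 63 = page present, bits
0-54 = PFN) and that the buffer is pinned (e.g. locked huge pages), so
the result stays valid. On newer kernels the PFN is hidden from
unprivileged readers, so a PFN of 0 is treated as a failure here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT	(1ULL << 63)		/* page is present in RAM */
#define PM_PFN_MASK	((1ULL << 55) - 1)	/* bits 0-54: page frame number */

/*
 * Decode one pagemap entry. Returns 0 and stores the physical address,
 * or -1 if the page is not present or the PFN is not visible.
 */
int pagemap_entry_to_paddr(uint64_t entry, uintptr_t vaddr,
			   uint64_t page_sz, uint64_t *paddr)
{
	uint64_t pfn = entry & PM_PFN_MASK;

	if (!(entry & PM_PRESENT) || pfn == 0)
		return -1;
	*paddr = pfn * page_sz + (vaddr % page_sz);
	return 0;
}

/* Translate a virtual address of the calling process; 0 on success. */
int virt_to_phys(const void *addr, uint64_t *paddr)
{
	uintptr_t vaddr = (uintptr_t)addr;
	long psz = sysconf(_SC_PAGESIZE);
	uint64_t page_sz, entry;
	ssize_t n;
	int fd;

	if (psz <= 0)
		return -1;
	page_sz = (uint64_t)psz;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	/* One 8-byte entry per virtual page, indexed by base page size. */
	n = pread(fd, &entry, sizeof(entry),
		  (off_t)((vaddr / page_sz) * sizeof(entry)));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return pagemap_entry_to_paddr(entry, vaddr, page_sz, paddr);
}
```

The slave driver would call something like virt_to_phys() once per
buffer at setup time and write the result into the descriptors, so
there is no per-packet system call.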
