From: John Fastabend <john.fastabend@gmail.com>
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: dborkman@redhat.com, fw@strlen.de, gerlitz.or@gmail.com,
hannes@stressinduktion.org, netdev@vger.kernel.org,
john.ronciak@intel.com, amirv@mellanox.com,
eric.dumazet@gmail.com, danny.zhou@intel.com
Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
Date: Mon, 06 Oct 2014 13:42:28 -0700
Message-ID: <5432FEB4.7040909@gmail.com>
In-Reply-To: <20141006095553.18097d68@urahara>
On 10/06/2014 09:55 AM, Stephen Hemminger wrote:
> On Sun, 05 Oct 2014 17:06:31 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
>
>> This patch adds a net_device ops to split off a set of driver queues
>> from the driver and map the queues into user space via mmap. This
>> allows the queues to be directly manipulated from user space. For
>> raw packet interface this removes any overhead from the kernel network
>> stack.
>>
>> Typically in an af_packet interface a packet_type handler is
>> registered and used to filter traffic to the socket and do other
>> things such as fan out traffic to multiple sockets. In this case the
>> networking stack is being bypassed so this code is not run. So the
>> hardware must push the correct traffic to the queues obtained from
>> the ndo callback ndo_split_queue_pairs().
>>
>> Fortunately there is already a flow classification interface which
>> is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
>> currently supported by multiple drivers including sfc, mlx4, niu,
>> ixgbe, and i40e. Supporting some way to steer traffic to a queue
>> is the _only_ hardware requirement to support the interface, plus
>> the driver needs to implement the correct ndo ops. A follow on
>> patch adds support for ixgbe but we expect at least the subset of
>> drivers implementing ETHTOOL_SRXCLSRLINS to be supported later.
>>
>> The interface is driven over an af_packet socket, which we believe
>> is the most natural interface to use because it is already used
>> for raw packet interfaces, which is what we are providing here.
>> The high level flow for this interface looks like:
>>
>> bind(fd, &sockaddr, sizeof(sockaddr));
>>
>> /* Get the device type and info */
>> getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
>> &optlen);
>>
>> /* With device info we can look up descriptor format */
>>
>> /* Get the layout of ring space offset, page_sz, cnt */
>> getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> &info, &optlen);
>>
>> /* request some queues from the driver */
>> setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> &qpairs_info, sizeof(qpairs_info));
>>
>> /* if we let the driver pick us queues learn which queues
>> * we were given
>> */
>> getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> &qpairs_info, &optlen);
>>
>> /* And mmap queue pairs to user space */
>> mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> MAP_SHARED, fd, 0);
>>
>> /* Now we have some user space queues to read/write to*/
>>
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapolate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
>>
>> The formats are usually provided by the device vendor documentation.
>> If folks want I can provide a follow up patch to provide the formats
>> in a .h file in ./include/uapi/linux/ for ease of use. I have access
>> to formats for ixgbe and mlx drivers; other driver owners would need to
>> provide their formats.
>>
>> We tested this interface using traffic generators and doing basic
>> L2 forwarding tests on ixgbe devices. Our tests use a set of patches
>> to DPDK to enable an interface using this socket interface. With
>> this interface we can xmit/receive @ line rate from a test user space
>> application on a single core.
>>
>> Additionally we have a set of DPDK patches to enable DPDK with this
>> interface. DPDK can be downloaded @ dpdk.org although, as I hope is
>> clear from above, DPDK is just our particular test environment; we
>> expect other libraries could be built on this interface.
>>
>> Signed-off-by: Danny Zhou <danny.zhou@intel.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>
> I like the ability to share a device between kernel and user mode networking.
> The model used for DPDK for this is really ugly and fragile/broken.
> Your proposal assumes that you fully trust the user mode networking application
> which is not a generally safe assumption.
>
> A device can DMA from/to any arbitrary physical memory.
> And it would be hard to use IOMMU to protect because the
> IOMMU doesn't know the difference between the application's queue and
> the rest of the queues.
>
> At least with DPDK you can use VFIO, and you are claiming the whole device to
> allow protection against random memory being read/written.
>
>
However, not all platforms support VFIO, and when the application
only wants to handle specific traffic types, a queue maps well to
this.
--
John Fastabend Intel Corporation