From: Jesper Dangaard Brouer <brouer@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: David Miller <davem@davemloft.net>,
cl@linux.com, rppt@linux.vnet.ibm.com, netdev@vger.kernel.org,
linux-mm@kvack.org, willemdebruijn.kernel@gmail.com,
bjorn.topel@intel.com, magnus.karlsson@intel.com,
alexander.duyck@gmail.com, mgorman@techsingularity.net,
tom@herbertland.com, bblanco@plumgrid.com, tariqt@mellanox.com,
saeedm@mellanox.com, jesse.brandeburg@intel.com, METH@il.ibm.com,
vyasevich@gmail.com, brouer@redhat.com
Subject: Re: Designing a safe RX-zero-copy Memory Model for Networking
Date: Wed, 14 Dec 2016 22:04:38 +0100
Message-ID: <20161214220438.4608f2bb@redhat.com>
In-Reply-To: <5851740A.2080806@gmail.com>

On Wed, 14 Dec 2016 08:32:10 -0800
John Fastabend <john.fastabend@gmail.com> wrote:
> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 13 Dec 2016 12:08:21 -0800
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >
> >> On 16-12-13 11:53 AM, David Miller wrote:
> >>> From: John Fastabend <john.fastabend@gmail.com>
> >>> Date: Tue, 13 Dec 2016 09:43:59 -0800
> >>>
> >>>> What does "zero-copy send packet-pages to the application/socket that
> >>>> requested this" mean? At the moment on x86 page-flipping appears to be
> >>>> more expensive than memcpy (I can post some data shortly) and shared
> >>>> memory was proposed and rejected for security reasons when we were
> >>>> working on bifurcated driver.
> >>>
> >>> The whole idea is that we map all the active RX ring pages into
> >>> userspace from the start.
> >>>
> >>> And just how Jesper's page pool work will avoid DMA map/unmap,
> >>> it will also avoid changing the userspace mapping of the pages
> >>> as well.
> >>>
> >>> Thus avoiding the TLB/VM overhead altogether.
> >>>
> >
> > Exactly. It is worth mentioning that pages entering the page pool need
> > to be cleared (measured cost 143 cycles), in order to not leak any
> > kernel info. The primary focus of this design is to make sure not to
> > leak kernel info to userspace, but with an "exclusive" mode also
> > support isolation between applications.
> >
> >
> >> I get this but it requires applications to be isolated. The pages from
> >> a queue can not be shared between multiple applications in different
> >> trust domains. And the application has to be cooperative meaning it
> >> can't "look" at data that has not been marked by the stack as OK. In
> >> these schemes we tend to end up with something like virtio/vhost or
> >> af_packet.
> >
> > I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> > two would require CAP_NET_ADMIN privileges. All modes have a trust
> > domain id that needs to match, e.g. when the page reaches the socket.
>
> Even mode 3 should require CAP_NET_ADMIN; we don't want userspace to
> grab queues off the NIC without it IMO.

Good point.

> >
> > Mode-1 "Shared": Application chooses the lowest isolation level, allowing
> > multiple applications to mmap the VMA area.
>
> My only point here is that applications can read each other's data, and
> all applications need to cooperate. For example, one app could try to
> write continuously to read-only pages, causing faults and whatnot. This
> is all non-standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
>
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
>
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
>
> >
> > Mode-2 "Single-user": Application requests to be the only user
> > of the RX queue. This blocks other applications from mmap'ing the
> > VMA area.
> >
>
> Assuming data is read-only, sharing with the stack is possibly OK :/. I
> guess you would need two pools of memory, for data and skb, so you don't
> leak skb into user space.

Yes, as described in the original email and here[1]: "once an application
request zero-copy RX, then the driver must use a specific SKB
allocation mode and might have to reconfigure the RX-ring."

The SKB allocation mode is "read-only packet page", which is the
current default mode (also described in document[1]) of using skb-frags.

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

> The devil's in the details here. There are lots of hooks in the kernel
> that can, for example, push the packet with a 'redirect' tc action.
> And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down; see all the
> hardware offload efforts.
>
> I assume all these concerns are shared between mode-1 and mode-2
>
> > Mode-3 "Exclusive": Application requests to own the RX queue. Packets
> > are no longer allowed for normal netstack delivery.
> >
>
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.

Interesting.

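To make the three modes concrete, here is a minimal sketch of the policy each one implies, as I read the descriptions above. All names (zc_mode, zc_may_mmap, and so on) are hypothetical and not an existing kernel API. It assumes "exclusive" implies a single mmap user, and that John's suggestion of requiring CAP_NET_ADMIN for mode-3 as well is adopted.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not an existing kernel API. */
enum zc_mode {
	ZC_MODE_SHARED      = 1, /* mode-1: multiple applications may mmap */
	ZC_MODE_SINGLE_USER = 2, /* mode-2: one application, netstack still delivers */
	ZC_MODE_EXCLUSIVE   = 3, /* mode-3: one application owns the RX queue */
};

/* May one more application mmap the RX pages, given the current user count? */
static bool zc_may_mmap(enum zc_mode mode, unsigned int current_users)
{
	if (mode == ZC_MODE_SHARED)
		return true;
	/* Assumption: single-user and exclusive both allow exactly one user. */
	return current_users == 0;
}

/* Are packets on this RX queue still allowed to take the normal netstack path? */
static bool zc_netstack_delivery_allowed(enum zc_mode mode)
{
	return mode != ZC_MODE_EXCLUSIVE;
}

/* Assumption: all three modes require CAP_NET_ADMIN, as suggested in-thread. */
static bool zc_requires_net_admin(enum zc_mode mode)
{
	(void)mode;
	return true;
}
```

The difference between mode-2 and mode-3 is then only whether the netstack may still see the packets, which is also why mode-2 pages can contain data belonging to other applications.
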
> > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> > still allowed to travel netstack and thus can contain packet data from
> > other normal applications. This is part of the design, to share the
> > NIC between netstack and an accelerated userspace application using RX
> > zero-copy delivery.
> >
>
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I thought the bifurcated driver work was rejected because it could leak
kernel info in the pages. This approach cannot.

> >> Any ACLs/filtering/switching/headers need to be done in hardware or
> >> the application trust boundaries are broken.
> >
> > The software solution outlined allows the application to choose
> > what trust boundary it wants.
> >
> > The "exclusive" mode-3 makes most sense together with HW filters.
> > Already today, we support creating a new RX queue based on an ethtool
> > ntuple HW filter, and then you simply attach your application to that
> > queue in mode-3, and have full isolation.
> >
>
> Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters?
> Without hardware filters we have no way of knowing who/what data is
> put in the page.

For sockets, an SKB carrying an RX zero-copy-able page can be steered
(as normal) into a given socket. Then we check whether the socket
requested zero-copy, and verify that the domain-id matches between the
page_pool and the socket.

You can also use XDP to filter and steer the packet (which will be
faster than using the normal steering code).

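The socket-side check can be sketched roughly as below. This is only an illustration: struct zc_pool, struct zc_sock and zc_rx_verdict() are hypothetical names, and the real struct page_pool and struct sock have no such fields. It also assumes that any mismatch simply falls back to the normal copy path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not real kernel structures. */
struct zc_pool {
	uint32_t domain_id;    /* trust domain set when RX-zero-copy was enabled */
};

struct zc_sock {
	bool     zc_requested; /* socket asked for zero-copy RX */
	uint32_t domain_id;    /* trust domain the socket belongs to */
};

enum zc_verdict {
	ZC_COPY,     /* assumption: fall back to the normal copy to userspace */
	ZC_ZEROCOPY, /* deliver the already-mapped page to the application */
};

/* Decide how a page that reached a socket is handed to userspace. */
static enum zc_verdict zc_rx_verdict(const struct zc_pool *pool,
				     const struct zc_sock *sk)
{
	if (pool == NULL)           /* page is not from a zero-copy page_pool */
		return ZC_COPY;
	if (!sk->zc_requested)      /* socket never requested zero-copy */
		return ZC_COPY;
	if (sk->domain_id != pool->domain_id) /* trust domains differ */
		return ZC_COPY;
	return ZC_ZEROCOPY;
}
```

The point is that steering stays unchanged: the decision is made only once the SKB has reached its socket, so no HW filter is needed for correctness in this model.
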
> >
> >> If the above can not be met then a copy is needed. What I am trying
> >> to tease out is the above comment along with other statements like
> >> this "can be done with out HW filter features".
> >
> > Does this address your concerns?
> >
>
> I think we need to enforce strong isolation. An application should not
> be able to read data or impact other applications. I gather this is
> the case per comment about normal applications in mode-2. A slightly
> weaker statement would be to say applications can only impact/read
> data of other applications in their domain. This might be OK as well.

I think this approach covers the "weaker statement", because only pages
within the pool are "exposed". Thus, the domain is the NIC (possibly
restricted to a single RX queue).

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer