linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found] <20170222163901.90834-1-willemdebruijn.kernel@gmail.com>
@ 2017-02-27 18:57 ` Michael Kerrisk
  2017-02-28 19:46   ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Kerrisk @ 2017-02-27 18:57 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn, Linux API

[CC += linux-api@vger.kernel.org]

Hi Willem

This is a change to the kernel-user-space API. Please CC
linux-api@vger.kernel.org on any future iterations of this patch.

Thanks,

Michael



On Wed, Feb 22, 2017 at 5:38 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> RFCv2:
>
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:
>
>   RFC -> RFCv2:
>     - review comment: do not loop skb with zerocopy frags onto rx:
>           add skb_orphan_frags_rx to orphan even refcounted frags
>           call this in __netif_receive_skb_core, deliver_skb and tun:
>           the same as 1080e512d44d ("net: orphan frags on receive")
>     - fix: hold an explicit sk reference on each notification skb.
>           previously relied on the reference (or wmem) held by the
>           data skb that would trigger notification, but this breaks
>           on skb_orphan.
>     - fix: when aborting a send, do not inc the zerocopy counter
>           this caused gaps in the notification chain
>     - fix: in packet with SOCK_DGRAM, pull ll headers before calling
>           zerocopy_sg_from_iter
>     - fix: if sock_zerocopy_realloc does not allow coalescing,
>           do not fail, just allocate a new ubuf
>     - fix: in tcp, check return value of second allocation attempt
>     - chg: allocate notification skbs from optmem
>           to avoid affecting tcp write queue accounting (TSQ)
>     - chg: limit #locked pages (ulimit) per user instead of per process
>     - chg: grow notification ids from 16 to 32 bit
>       - pass range [lo, hi] through 32 bit fields ee_info and ee_data
>     - chg: rebased to davem-net-next on top of v4.10-rc7
>     - add: limit notification coalescing
>           sharing ubufs limits overhead, but delays notification until
>           the last packet is released, possibly unbounded. Add a cap.
>     - tests: add snd_zerocopy_lo pf_packet test
>     - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
>
> The change to allocate notification skbuffs from optmem requires
> ensuring that net.core.optmem is at least a few 100KB. To
> experiment, run
>
>   sysctl -w net.core.optmem_max=1048576
>
> The snd_zerocopy_lo benchmarks reported in the individual patches were
> rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
> replaced with skb_orphan_frags to allow looping to local sockets. The
> netperf results below are also rerun with v2.
>
> In application load, copy avoidance shows a roughly 5% systemwide
> reduction in cycles when streaming large flows and a 4-8% reduction in
> wall clock time on early tensorflow test workloads.
>
>
> Overview (from original RFC):
>
> Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
> Implement the feature for TCP, UDP, RAW and packet sockets. This is
> a generalization of a previous packet socket RFC patch
>
>   http://patchwork.ozlabs.org/patch/413184/
>
> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
> creates skbuff fragments directly from these pages. On tx completion,
> it notifies the socket owner that it is safe to modify memory by
> queuing a completion notification onto the socket error queue.
>
> The kernel already implements such copy avoidance with vmsplice plus
> splice and with ubuf_info for tun and virtio. Extend the second
> with features required by TCP and others: reference counting to
> support cloning (retransmit queue) and shared fragments (GSO) and
> notification coalescing to handle corking.
>
> Notifications are queued onto the socket error queue as a range
> range [N, N+m], where N is a per-socket counter incremented on each
> successful zerocopy send call.
>
> * Performance
>
> The below table shows cycles reported by perf for a netperf process
> sending a single 10 Gbps TCP_STREAM. The first three columns show
> Mcycles spent in the netperf process context. The second three columns
> show time spent systemwide (-a -C A,B) on the two cpus that run the
> process and interrupt handler. Reported is the median of at least 3
> runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
> Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
> are disabled and the kernel is booted with idle=halt.
>
> NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
>
> perf stat -e cycles $NETPERF
> perf stat -C 2,3 -a -e cycles $NETPERF
>
>         --process cycles--      ----cpu cycles----
>            std      zc   %      std         zc   %
> 4K      27,609  11,217  41      49,217  39,175  79
> 16K     21,370   3,823  18      43,540  29,213  67
> 64K     20,557   2,312  11      42,189  26,910  64
> 256K    21,110   2,134  10      43,006  27,104  63
> 1M      20,987   1,610   8      42,759  25,931  61
>
> Perf record indicates the main source of these differences. Process
> cycles only at 1M writes (perf record; perf report -n):
>
> std:
> Samples: 42K of event 'cycles', Event count (approx.): 21258597313
>  79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
>   3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
>   1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
>   0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
>   0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb
>
> zc:
> Samples: 1K of event 'cycles', Event count (approx.): 1439509124
>  30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
>  14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
>   8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
>   4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
>   3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node
>
>
> * Safety
>
> The number of pages that can be pinned on behalf of a user with
> MSG_ZEROCOPY is bound by the locked memory ulimit.
>
> While the kernel holds process memory pinned, a process cannot safely
> reuse those pages for other purposes. Packets looped onto the receive
> stack and queued to a socket can be held indefinitely. Avoid unbounded
> notification latency by restricting user pages to egress paths only.
> skb_orphan_frags_rx() will create a private copy of pages even for
> refcounted packets when these are looped, as did skb_orphan_frags for
> the original tun zerocopy implementation.
>
> Pages are not remapped read-only. Processes can modify packet contents
> while packets are in flight in the kernel path. Bytes on which kernel
> control flow depends (headers) are copied to avoid TOCTTOU attacks.
> Datapath integrity does not otherwise depend on payload, with three
> exceptions: checksums, optional sk_filter/tc u32/.. and device +
> driver logic. The effect of wrong checksums is limited to the
> misbehaving process. TC filters that access contents may have to be
> excluded by adding an skb_orphan_frags_rx.
>
> Processes can also safely avoid OOM conditions by bounding the number
> of bytes passed with MSG_ZEROCOPY and by removing shared pages after
> transmission from their own memory map.
>
>
> * Limitations / Known Issues
>
> - PF_INET6 is not yet supported.
> - TCP does not build max GSO packets, especially for
>      small send buffers (< 4 KB)
>
> Willem de Bruijn (12):
>   sock: allocate skbs from optmem
>   sock: skb_copy_ubufs support for compound pages
>   sock: add generic socket zerocopy
>   sock: enable sendmsg zerocopy
>   sock: sendmsg zerocopy notification coalescing
>   sock: sendmsg zerocopy ulimit
>   sock: sendmsg zerocopy limit bytes per notification
>   tcp: enable sendmsg zerocopy
>   udp: enable sendmsg zerocopy
>   raw: enable sendmsg zerocopy with IP_HDRINCL
>   packet: enable sendmsg zerocopy
>   test: add sendmsg zerocopy tests
>
>  drivers/net/tun.c                             |   2 +-
>  drivers/vhost/net.c                           |   1 +
>  include/linux/sched.h                         |   2 +-
>  include/linux/skbuff.h                        |  94 +++-
>  include/linux/socket.h                        |   1 +
>  include/net/sock.h                            |   4 +
>  include/uapi/linux/errqueue.h                 |   1 +
>  net/core/datagram.c                           |  35 +-
>  net/core/dev.c                                |   4 +-
>  net/core/skbuff.c                             | 327 ++++++++++++--
>  net/core/sock.c                               |  29 ++
>  net/ipv4/ip_output.c                          |  34 +-
>  net/ipv4/raw.c                                |  27 +-
>  net/ipv4/tcp.c                                |  37 +-
>  net/packet/af_packet.c                        |  52 ++-
>  tools/testing/selftests/net/.gitignore        |   2 +
>  tools/testing/selftests/net/Makefile          |   1 +
>  tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
>  tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
>  19 files changed, 1536 insertions(+), 67 deletions(-)
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c
>
> --
> 2.11.0.483.g087da7b7c-goog
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-27 18:57 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Michael Kerrisk
@ 2017-02-28 19:46   ` Andy Lutomirski
  2017-02-28 20:43     ` Willem de Bruijn
  2017-03-01  3:25     ` David Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Andy Lutomirski @ 2017-02-28 19:46 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Willem de Bruijn, netdev, Willem de Bruijn, Linux API

On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
<mtk.manpages@gmail.com> wrote:
> [CC += linux-api@vger.kernel.org]
>
> Hi Willem
>

>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>> creates skbuff fragments directly from these pages. On tx completion,
>> it notifies the socket owner that it is safe to modify memory by
>> queuing a completion notification onto the socket error queue.

What happens if the user writes to the pages while it's not safe?

How about if you're writing to an interface or a route that has crypto
involved and a malicious user can make the data change in the middle
of a crypto operation, thus perhaps leaking the entire key?  (I
wouldn't be at all surprised if a lot of provably secure AEAD
constructions are entirely compromised if an attacker can get the
ciphertext and tag computed from a message that changed during the
computation.

I can see this working if you have a special type of skb that
indicates that the data might be concurrently written and have all the
normal skb APIs (including, especially, anything that clones it) make
a copy first.

--Andy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 19:46   ` Andy Lutomirski
@ 2017-02-28 20:43     ` Willem de Bruijn
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01  3:25     ` David Miller
  1 sibling, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2017-02-28 20:43 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
> <mtk.manpages@gmail.com> wrote:
>> [CC += linux-api@vger.kernel.org]
>>
>> Hi Willem
>>
>
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
>
> What happens if the user writes to the pages while it's not safe?
>
> How about if you're writing to an interface or a route that has crypto
> involved and a malicious user can make the data change in the middle
> of a crypto operation, thus perhaps leaking the entire key?  (I
> wouldn't be at all surprised if a lot of provably secure AEAD
> constructions are entirely compromised if an attacker can get the
> ciphertext and tag computed from a message that changed during the
> computation.

Operations that read or write payload, such as this crypto example,
but also ebpf in tc or iptables, for instance, demand a deep copy using
skb_copy_ubufs before the operation.

This blacklist approach requires caution, but these paths should be
few and countable. It is not possible to predict at the socket layer
whether a packet will encounter any such operation, so white-listing
a subset of end-to-end paths is not practical.

> I can see this working if you have a special type of skb that
> indicates that the data might be concurrently written and have all the
> normal skb APIs (including, especially, anything that clones it) make
> a copy first.

Support for cloned skbs is required for TCP, both at tcp_transmit_skb
and segmentation offload. Patch 4 especially adds reference counting
of shared pages across clones and other sk_buff operations like
pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
on clones in specific datapaths like the above.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-28 21:06         ` Andy Lutomirski
  2017-03-01  3:28           ` David Miller
  2017-02-28 21:09         ` Andy Lutomirski
  1 sibling, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2017-02-28 21:06 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
<willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> [CC += linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org]
>>>
>>> Hi Willem
>>>
>>
>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>> it notifies the socket owner that it is safe to modify memory by
>>>> queuing a completion notification onto the socket error queue.
>>
>> What happens if the user writes to the pages while it's not safe?
>>
>> How about if you're writing to an interface or a route that has crypto
>> involved and a malicious user can make the data change in the middle
>> of a crypto operation, thus perhaps leaking the entire key?  (I
>> wouldn't be at all surprised if a lot of provably secure AEAD
>> constructions are entirely compromised if an attacker can get the
>> ciphertext and tag computed from a message that changed during the
>> computation.
>
> Operations that read or write payload, such as this crypto example,
> but also ebpf in tc or iptables, for instance, demand a deep copy using
> skb_copy_ubufs before the operation.
>
> This blacklist approach requires caution, but these paths should be
> few and countable. It is not possible to predict at the socket layer
> whether a packet will encounter any such operation, so white-listing
> a subset of end-to-end paths is not practical.

How about hardware that malfunctions if the packet changes out from
under it?  A whitelist seems quite a bit safer.

--Andy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-02-28 21:06         ` Andy Lutomirski
@ 2017-02-28 21:09         ` Andy Lutomirski
  2017-02-28 21:28           ` Willem de Bruijn
  2017-02-28 21:47           ` Eric Dumazet
  1 sibling, 2 replies; 19+ messages in thread
From: Andy Lutomirski @ 2017-02-28 21:09 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
<willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>> I can see this working if you have a special type of skb that
>> indicates that the data might be concurrently written and have all the
>> normal skb APIs (including, especially, anything that clones it) make
>> a copy first.
>
> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
> and segmentation offload. Patch 4 especially adds reference counting
> of shared pages across clones and other sk_buff operations like
> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
> on clones in specific datapaths like the above.

Does this mean that a user program that does a zerocopy send can cause
a retransmitted segment to contain different data than the original
segment?  If so, is that okay?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:09         ` Andy Lutomirski
@ 2017-02-28 21:28           ` Willem de Bruijn
  2017-02-28 21:47           ` Eric Dumazet
  1 sibling, 0 replies; 19+ messages in thread
From: Willem de Bruijn @ 2017-02-28 21:28 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

>>> I can see this working if you have a special type of skb that
>>> indicates that the data might be concurrently written and have all the
>>> normal skb APIs (including, especially, anything that clones it) make
>>> a copy first.
>>
>> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
>> and segmentation offload. Patch 4 especially adds reference counting
>> of shared pages across clones and other sk_buff operations like
>> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
>> on clones in specific datapaths like the above.
>
> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment? If so, is that okay?

That is possible, indeed. The bytestream at the receiver is then
likely undefined. Though integrity of the tcp receive stack should
not be affected. A valid question is whether the stack should protect
against such users. The pattern is reminiscent of evasion attacks. In
the limited case, privileged users already can generate this data
pattern, of course.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:09         ` Andy Lutomirski
  2017-02-28 21:28           ` Willem de Bruijn
@ 2017-02-28 21:47           ` Eric Dumazet
       [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2017-02-28 21:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn,
	Linux API

On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:

> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment?  If so, is that okay?

Same remark applies to sendfile() already, or other zero copy modes
(vmsplice() + splice() )

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2017-02-28 22:25               ` Andy Lutomirski
       [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2017-02-28 22:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn,
	Linux API

On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>
>> Does this mean that a user program that does a zerocopy send can cause
>> a retransmitted segment to contain different data than the original
>> segment?  If so, is that okay?
>
> Same remark applies to sendfile() already

True.

>, or other zero copy modes
> (vmsplice() + splice() )

I hate vmsplice().  I thought I remembered it being essentially
disabled at some point due to security problems.

>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-28 22:40                   ` Eric Dumazet
  2017-02-28 22:52                     ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2017-02-28 22:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn,
	Linux API

On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
> >
> >> Does this mean that a user program that does a zerocopy send can cause
> >> a retransmitted segment to contain different data than the original
> >> segment?  If so, is that okay?
> >
> > Same remark applies to sendfile() already
> 
> True.
> 
> >, or other zero copy modes
> > (vmsplice() + splice() )
> 
> I hate vmsplice().  I thought I remembered it being essentially
> disabled at some point due to security problems.

Right, zero copy is hard ;)

vmsplice() is not disabled in current kernels, unless I missed
something.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 22:40                   ` Eric Dumazet
@ 2017-02-28 22:52                     ` Andy Lutomirski
  2017-02-28 23:22                       ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2017-02-28 22:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn,
	Linux API

On Tue, Feb 28, 2017 at 2:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
>> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>> >
>> >> Does this mean that a user program that does a zerocopy send can cause
>> >> a retransmitted segment to contain different data than the original
>> >> segment?  If so, is that okay?
>> >
>> > Same remark applies to sendfile() already
>>
>> True.
>>
>> >, or other zero copy modes
>> > (vmsplice() + splice() )
>>
>> I hate vmsplice().  I thought I remembered it being essentially
>> disabled at some point due to security problems.
>
> Right, zero copy is hard ;)
>
> vmsplice() is not disabled in current kernels, unless I missed
> something.
>

I think you're right.  That being said, from the man page:

The user pages are a gift to the kernel.  The application  may  not
modify this memory ever, otherwise the page cache and on-disk data may
differ.

This is just not okay IMO.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 22:52                     ` Andy Lutomirski
@ 2017-02-28 23:22                       ` Eric Dumazet
       [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2017-02-28 23:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn,
	Linux API

On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:

> The user pages are a gift to the kernel.  The application  may  not
> modify this memory ever, otherwise the page cache and on-disk data may
> differ.
> 
> This is just not okay IMO.

TCP works just fine in this case.

TX checksum will be computed by the NIC after/while data is copied.

If really the application changes the data, that will not cause any
problems, other than user side consistency.

This is why we require a copy (for all buffers that came from zero-copy)
if network stack hits a device that can not offload TX checksum.

Even pwrite() does not guarantee consistency if multiple threads are
using it on overlapping regions.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2017-03-01  0:28                           ` Tom Herbert
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Herbert @ 2017-03-01  0:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andy Lutomirski, Willem de Bruijn, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>
>> The user pages are a gift to the kernel.  The application  may  not
>> modify this memory ever, otherwise the page cache and on-disk data may
>> differ.
>>
>> This is just not okay IMO.
>
> TCP works just fine in this case.
>
> TX checksum will be computed by the NIC after/while data is copied.
>
> If really the application changes the data, that will not cause any
> problems, other than user side consistency.
>
> This is why we require a copy (for all buffers that came from zero-copy)
> if network stack hits a device that can not offload TX checksum.
>
> Even pwrite() does not guarantee consistency if multiple threads are
> using it on overlapping regions.
>
The Mellanox team working on TLS offload pointed out to us that if
data is changed for a retransmit then it becomes trivial for someone
snooping to break the encryption. Sounds pretty scary and it would be
a shame if we couldn't use zero-copy in that use case :-( Hopefully we
can find a solution...

Tom

>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-03-01  0:37                               ` Eric Dumazet
  2017-03-01  0:58                               ` Willem de Bruijn
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2017-03-01  0:37 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Andy Lutomirski, Willem de Bruijn, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 16:28 -0800, Tom Herbert wrote:

> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...

Right, this is why offloading encryption over TCP is also hard ;)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01  0:37                               ` Eric Dumazet
@ 2017-03-01  0:58                               ` Willem de Bruijn
  2017-03-01  1:50                                 ` Tom Herbert
  1 sibling, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2017-03-01  0:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Andy Lutomirski, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>
>>> The user pages are a gift to the kernel.  The application  may  not
>>> modify this memory ever, otherwise the page cache and on-disk data may
>>> differ.
>>>
>>> This is just not okay IMO.
>>
>> TCP works just fine in this case.
>>
>> TX checksum will be computed by the NIC after/while data is copied.
>>
>> If really the application changes the data, that will not cause any
>> problems, other than user side consistency.
>>
>> This is why we require a copy (for all buffers that came from zero-copy)
>> if network stack hits a device that can not offload TX checksum.
>>
>> Even pwrite() does not guarantee consistency if multiple threads are
>> using it on overlapping regions.
>>
> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...
>

This requires collusion by the process initiating the zerocopy send
to help the entity snooping the link. That could be an attack on admin
configured tunnels, but user-directed encryption offload like AF_TLS
can still use zerocopy.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  0:58                               ` Willem de Bruijn
@ 2017-03-01  1:50                                 ` Tom Herbert
  0 siblings, 0 replies; 19+ messages in thread
From: Tom Herbert @ 2017-03-01  1:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, Andy Lutomirski, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 4:58 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>>
>>>> The user pages are a gift to the kernel.  The application  may  not
>>>> modify this memory ever, otherwise the page cache and on-disk data may
>>>> differ.
>>>>
>>>> This is just not okay IMO.
>>>
>>> TCP works just fine in this case.
>>>
>>> TX checksum will be computed by the NIC after/while data is copied.
>>>
>>> If really the application changes the data, that will not cause any
>>> problems, other than user side consistency.
>>>
>>> This is why we require a copy (for all buffers that came from zero-copy)
>>> if network stack hits a device that can not offload TX checksum.
>>>
>>> Even pwrite() does not guarantee consistency if multiple threads are
>>> using it on overlapping regions.
>>>
>> The Mellanox team working on TLS offload pointed out to us that if
>> data is changed for a retransmit then it becomes trivial for someone
>> snooping to break the encryption. Sounds pretty scary and it would be
>> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
>> can find a solution...
>>
>
> This requires collusion by the process initiating the zerocopy send
> to help the entity snooping the link. That could be an attack on admin
> configured tunnels, but user-directed encryption offload like AF_TLS
> can still use zerocopy.

Yes, but we can't trust the user to always understand or correctly
implement the semantic nuances when security is involved. If we can't
provide a  robust API then the only recourse is to not allow zero copy
in that case. We could suggest COW to solve all problems, but I think
we know where the conversation will go ;-)

Tom

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 19:46   ` Andy Lutomirski
  2017-02-28 20:43     ` Willem de Bruijn
@ 2017-03-01  3:25     ` David Miller
  1 sibling, 0 replies; 19+ messages in thread
From: David Miller @ 2017-03-01  3:25 UTC (permalink / raw)
  To: luto; +Cc: mtk.manpages, willemdebruijn.kernel, netdev, willemb, linux-api

From: Andy Lutomirski <luto@amacapital.net>
Date: Tue, 28 Feb 2017 11:46:23 -0800

> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
> <mtk.manpages@gmail.com> wrote:
>> [CC += linux-api@vger.kernel.org]
>>
>> Hi Willem
>>
> 
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
> 
> What happens if the user writes to the pages while it's not safe?

Just want to mention that this ability to write to data behind a
network send's back is not a new thing added by MSG_ZEROCOPY.

All of this is already possible with sendfile().  The pages can be
written to completely asynchronously to the data being pulled out of
the page cache into the transmit path.

The crypto case is interesting, but that is a seperate discussion
about an existing problem rather than something specifically new to
the MSG_ZEROCOPY changes.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:06         ` Andy Lutomirski
@ 2017-03-01  3:28           ` David Miller
  2017-03-01  3:43             ` Eric Dumazet
  2017-03-02 19:26             ` Andy Lutomirski
  0 siblings, 2 replies; 19+ messages in thread
From: David Miller @ 2017-03-01  3:28 UTC (permalink / raw)
  To: luto; +Cc: willemdebruijn.kernel, mtk.manpages, netdev, willemb, linux-api

From: Andy Lutomirski <luto@amacapital.net>
Date: Tue, 28 Feb 2017 13:06:49 -0800

> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>> <mtk.manpages@gmail.com> wrote:
>>>> [CC += linux-api@vger.kernel.org]
>>>>
>>>> Hi Willem
>>>>
>>>
>>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>>> it notifies the socket owner that it is safe to modify memory by
>>>>> queuing a completion notification onto the socket error queue.
>>>
>>> What happens if the user writes to the pages while it's not safe?
>>>
>>> How about if you're writing to an interface or a route that has crypto
>>> involved and a malicious user can make the data change in the middle
>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>> constructions are entirely compromised if an attacker can get the
>>> ciphertext and tag computed from a message that changed during the
>>> computation.
>>
>> Operations that read or write payload, such as this crypto example,
>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>> skb_copy_ubufs before the operation.
>>
>> This blacklist approach requires caution, but these paths should be
>> few and countable. It is not possible to predict at the socket layer
>> whether a packet will encounter any such operation, so white-listing
>> a subset of end-to-end paths is not practical.
> 
> How about hardware that malfunctions if the packet changes out from
> under it?  A whitelist seems quite a bit safer.

These device are already choking, because as I stated this can already
be done via sendfile().

Networking card wise this isn't an issue, chips bring the entire packet
into their FIFO, compute checksums on the fly mid-stream, and then write
the 16-bit checksum field before starting to write the packet onto the
wire.

I think this is completely a non-issue, and we thought about this right
from the start when sendfile() support was added nearly two decades ago.
If network cards from back then didn't crap out in this situation I
think the ones out there now are probably ok.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  3:28           ` David Miller
@ 2017-03-01  3:43             ` Eric Dumazet
  2017-03-02 19:26             ` Andy Lutomirski
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2017-03-01  3:43 UTC (permalink / raw)
  To: David Miller
  Cc: luto, willemdebruijn.kernel, mtk.manpages, netdev, willemb,
	linux-api

On Tue, 2017-02-28 at 22:28 -0500, David Miller wrote:

> These device are already choking, because as I stated this can already
> be done via sendfile().
> 
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
> 
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Well, we had to fix one issue with GSO fall back about 4 years ago.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cef401de7be8c4e155c6746bfccf721a4fa5fab9

So extra scrutiny would be nice.

Zero copy is incredibly hard to get right.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  3:28           ` David Miller
  2017-03-01  3:43             ` Eric Dumazet
@ 2017-03-02 19:26             ` Andy Lutomirski
  1 sibling, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2017-03-02 19:26 UTC (permalink / raw)
  To: David Miller
  Cc: Willem de Bruijn, Michael Kerrisk, Network Development,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 7:28 PM, David Miller <davem@davemloft.net> wrote:
> From: Andy Lutomirski <luto@amacapital.net>
> Date: Tue, 28 Feb 2017 13:06:49 -0800
>
>> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>>> <mtk.manpages@gmail.com> wrote:
>>>>> [CC += linux-api@vger.kernel.org]
>>>>>
>>>>> Hi Willem
>>>>>
>>>>
>>>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>>>> it notifies the socket owner that it is safe to modify memory by
>>>>>> queuing a completion notification onto the socket error queue.
>>>>
>>>> What happens if the user writes to the pages while it's not safe?
>>>>
>>>> How about if you're writing to an interface or a route that has crypto
>>>> involved and a malicious user can make the data change in the middle
>>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>>> constructions are entirely compromised if an attacker can get the
>>>> ciphertext and tag computed from a message that changed during the
>>>> computation.
>>>
>>> Operations that read or write payload, such as this crypto example,
>>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>>> skb_copy_ubufs before the operation.
>>>
>>> This blacklist approach requires caution, but these paths should be
>>> few and countable. It is not possible to predict at the socket layer
>>> whether a packet will encounter any such operation, so white-listing
>>> a subset of end-to-end paths is not practical.
>>
>> How about hardware that malfunctions if the packet changes out from
>> under it?  A whitelist seems quite a bit safer.
>
> These device are already choking, because as I stated this can already
> be done via sendfile().
>
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
>
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Fair enough.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-03-02 19:26 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20170222163901.90834-1-willemdebruijn.kernel@gmail.com>
2017-02-27 18:57 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Michael Kerrisk
2017-02-28 19:46   ` Andy Lutomirski
2017-02-28 20:43     ` Willem de Bruijn
     [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-28 21:06         ` Andy Lutomirski
2017-03-01  3:28           ` David Miller
2017-03-01  3:43             ` Eric Dumazet
2017-03-02 19:26             ` Andy Lutomirski
2017-02-28 21:09         ` Andy Lutomirski
2017-02-28 21:28           ` Willem de Bruijn
2017-02-28 21:47           ` Eric Dumazet
     [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2017-02-28 22:25               ` Andy Lutomirski
     [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-28 22:40                   ` Eric Dumazet
2017-02-28 22:52                     ` Andy Lutomirski
2017-02-28 23:22                       ` Eric Dumazet
     [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2017-03-01  0:28                           ` Tom Herbert
     [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-01  0:37                               ` Eric Dumazet
2017-03-01  0:58                               ` Willem de Bruijn
2017-03-01  1:50                                 ` Tom Herbert
2017-03-01  3:25     ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).