Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3 3/3] qedi: Add QLogic FastLinQ offload iSCSI driver framework.
From: Martin K. Petersen @ 2016-12-14 20:02 UTC (permalink / raw)
  To: Manish Rangankar
  Cc: martin.petersen, lduncan, cleech, linux-scsi, netdev,
	QLogic-Storage-Upstream, Yuval.Mintz
In-Reply-To: <1480580468-31567-4-git-send-email-manish.rangankar@cavium.com>

>>>>> "Manish" == Manish Rangankar <manish.rangankar@cavium.com> writes:

Manish> The QLogic FastLinQ Driver for iSCSI (qedi) is the iSCSI
Manish> specific module for 41000 Series Converged Network Adapters by
Manish> QLogic.

Applied to 4.10/scsi-fixes.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH scsi 0/3] cxgb4i: add support for Chelsio T6 adapters
From: Martin K. Petersen @ 2016-12-14 20:10 UTC (permalink / raw)
  To: Varun Prakash; +Cc: martin.petersen, jejb, linux-scsi, netdev, davem, indranil
In-Reply-To: <cover.1480603332.git.varun@chelsio.com>

>>>>> "Varun" == Varun Prakash <varun@chelsio.com> writes:

Varun> This patch series adds support for Chelsio T6 adapters in iSCSI
Varun> initiator offload driver(cxgb4i).

Applied to 4.10/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: Tom Herbert @ 2016-12-14 20:12 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Laight, Netdev, kernel-hardening, Andi Kleen, LKML,
	Linux Crypto Mailing List
In-Reply-To: <CAHmME9pEM=cDC5S=j1BU2oCF8-WdnbRfiVojcet4rXcRLcpJRw@mail.gmail.com>

On Wed, Dec 14, 2016 at 4:53 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hi David,
>
> On Wed, Dec 14, 2016 at 10:51 AM, David Laight <David.Laight@aculab.com> wrote:
>> From: Jason A. Donenfeld
>>> Sent: 14 December 2016 00:17
>>> This gives a clear speed and security improvement. Rather than manually
>>> filling MD5 buffers, we simply create a layout by a simple anonymous
>>> struct, for which gcc generates rather efficient code.
>> ...
>>> +     const struct {
>>> +             struct in6_addr saddr;
>>> +             struct in6_addr daddr;
>>> +             __be16 sport;
>>> +             __be16 dport;
>>> +     } __packed combined = {
>>> +             .saddr = *(struct in6_addr *)saddr,
>>> +             .daddr = *(struct in6_addr *)daddr,
>>> +             .sport = sport,
>>> +             .dport = dport
>>> +     };
>>
>> You need to look at the effect of marking this (and the other)
>> structures 'packed' on architectures like sparc64.
>
> In all current uses of __packed in the code, I think the impact is
> precisely zero, because all structures have members in descending
> order of size, with each member being a perfect multiple of the one
> below it. The __packed is therefore just there for safety, in case
> somebody comes in and screws everything up by sticking a u8 in
> between. In that case, it wouldn't be desirable to hash the structure
> padding bits. In the worst case, I don't believe the impact would be
> worse than a byte-by-byte memcpy, which is what the old code did. But
> anyway, these structures are already naturally packed anyway, so the
> present impact is nil.
>
If you pad the data structure to 64 bits then we can call the version
of siphash that only deals in 64 bit words. Writing a zero in the
padding will be cheaper than dealing with odd lengths in siphash24.

Tom

> Jason

^ permalink raw reply

* Re: net/arp: ARP cache aging failed.
From: Julian Anastasov @ 2016-12-14 20:15 UTC (permalink / raw)
  To: YueHaibing; +Cc: Hannes Frederic Sowa, Eric Dumazet, David S. Miller, netdev
In-Reply-To: <74d41c47-7091-3c52-096d-5b9af2e0e9cf@huawei.com>


	Hello,

On Wed, 14 Dec 2016, YueHaibing wrote:

> On 2016/11/26 4:40, Julian Anastasov wrote:
> > 
> > 	So, the idea is to move TCP and other similar
> > users to the new dst_confirm_sk() method. If other
> > dst_confirm() users are left, they should be checked
> > if dsts with rt_gateway = 0 can be wrongly used.
> 
> Sorry for so late.

	In fact, I'm late too because I almost finished
my changes, the only remaining part is the cxgb files...

> Based on your ideas, I make a patch and test it for a while.

	The problem is that it is valid only for TCP.
Also, this flag should be reset sometimes, eg. when sk dst
changes...

> It works for me.
> 
> @@ -847,7 +847,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  	return err;
> 
>  do_confirm:
> -	dst_confirm(&rt->dst);
> +	dst_confirm_sk(sk);

	MSG_CONFIRM from sendmsg needs special treatment. The
problem is that UDP sending does not lock the socket, so I also
added a skb flag to handle this situation in ip*_append_data.
We do not want threaded application firing at different
destinations to confirm the wrong neighbour. MSG_PROBE is
another issue, the XFRM dst chaining, etc...

	I hope, I'll be ready this weekend with few patches
that change all dst_confirm users... There I'll explain
everything in detail.

Regards

^ permalink raw reply

* Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: Hannes Frederic Sowa @ 2016-12-14 20:27 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Miller, David Laight, Netdev, kernel-hardening, Andi Kleen,
	LKML, Linux Crypto Mailing List
In-Reply-To: <CAHmME9rp2oSCo0eu92jKm00S0eJHz65bJKXRpeS7=_EV6zZNYw@mail.gmail.com>

Hey Jason,

On 14.12.2016 20:38, Jason A. Donenfeld wrote:
> On Wed, Dec 14, 2016 at 8:22 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> I don't think this helps. Did you test it? I don't see reason why
>> padding could be left out between `d' and `end' because of the flexible
>> array member?
> 
> Because the type u8 doesn't require any alignment requirements, it can
> nestle right up there cozy with the u16:
> 
> zx2c4@thinkpad ~ $ cat a.c
> #include <stdint.h>
> #include <stdio.h>
> #include <stddef.h>
> int main()
> {
>        struct {
>                uint64_t a;
>                uint32_t b;
>                uint32_t c;
>                uint16_t d;
>                char x[];
>        } a;
>        printf("%zu\n", sizeof(a));
>        printf("%zu\n", offsetof(typeof(a), x));
>        return 0;
> }
> zx2c4@thinkpad ~ $ gcc a.c
> zx2c4@thinkpad ~ $ ./a.out
> 24
> 18

Sorry, I misread the patch. You are using offsetof. In this case remove
the char x[] and just use offsetofend because it is misleading
otherwise. Should work like that though.

What I don't really understand is that the addition of this complexity
actually reduces the performance, as you have to take the "if (left)"
branch during hashing and causes you to make a load_unaligned_zeropad.

Bye,
Hannes

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Hannes Frederic Sowa @ 2016-12-14 20:37 UTC (permalink / raw)
  To: Christoph Lameter, David Laight
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport,
	netdev@vger.kernel.org, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, Vladislav Yasevich
In-Reply-To: <alpine.DEB.2.20.1612141342080.23516@east.gentwo.org>

On 14.12.2016 20:43, Christoph Lameter wrote:
> On Wed, 14 Dec 2016, David Laight wrote:
> 
>> If the kernel is doing ANY validation on the frames it must copy the
>> data to memory the application cannot modify before doing the validation.
>> Otherwise the application could change the data afterwards.
> 
> The application is not allowed to change the data after a work request has
> been submitted to send the frame. Changes are possible after the
> completion request has been received.
> 
> The kernel can enforce that by making the frame(s) readonly and thus
> getting a page fault if the app would do such a thing.

As far as I remember right now, if you gift with vmsplice the memory
over a pipe to a tcp socket, you can in fact change the user data while
the data is in transmit. So you should not touch the memory region until
you received a SOF_TIMESTAMPING_TX_ACK error message in your sockets
error queue or stuff might break horribly. I don't think we have a
proper event for UDP that fires after we know the data left the hardware.

In my opinion this is still fine within the kernel protection limits.
E.g. due to scatter gather I/O you don't get access to the TCP header
nor UDP header and thus can't e.g. spoof or modify the header or
administration policies, albeit TOCTTOU races with netfilter which
matches inside the TCP/UDP packets are very well possible on transmit.

Wouldn't changing of the pages cause expensive TLB flushes?

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net] ibmveth: calculate gso_segs for large packets
From: marcelo.leitner @ 2016-12-14 20:39 UTC (permalink / raw)
  To: Thomas Falcon; +Cc: netdev, brking, pradeeps, jmaxwell37, zdai, eric.dumazet
In-Reply-To: <1481674509-14256-1-git-send-email-tlfalcon@linux.vnet.ibm.com>

On Tue, Dec 13, 2016 at 06:15:09PM -0600, Thomas Falcon wrote:
> Include calculations to compute the number of segments
> that comprise an aggregated large packet.
> 
> Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>


> ---
>  drivers/net/ethernet/ibm/ibmveth.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index fbece63..a831f94 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1181,7 +1181,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>  
>  static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
>  {
> +	struct tcphdr *tcph;
>  	int offset = 0;
> +	int hdr_len;
>  
>  	/* only TCP packets will be aggregated */
>  	if (skb->protocol == htons(ETH_P_IP)) {
> @@ -1208,14 +1210,20 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
>  	/* if mss is not set through Large Packet bit/mss in rx buffer,
>  	 * expect that the mss will be written to the tcp header checksum.
>  	 */
> +	tcph = (struct tcphdr *)(skb->data + offset);
>  	if (lrg_pkt) {
>  		skb_shinfo(skb)->gso_size = mss;
>  	} else if (offset) {
> -		struct tcphdr *tcph = (struct tcphdr *)(skb->data + offset);
> -
>  		skb_shinfo(skb)->gso_size = ntohs(tcph->check);
>  		tcph->check = 0;
>  	}
> +
> +	if (skb_shinfo(skb)->gso_size) {
> +		hdr_len = offset + tcph->doff * 4;
> +		skb_shinfo(skb)->gso_segs =
> +				DIV_ROUND_UP(skb->len - hdr_len,
> +					     skb_shinfo(skb)->gso_size);
> +	}
>  }
>  
>  static int ibmveth_poll(struct napi_struct *napi, int budget)
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* [PATCH net] vxlan: fix unused variable warning
From: Stephen Hemminger @ 2016-12-14 20:43 UTC (permalink / raw)
  To: davem; +Cc: netdev

Fixes commit 4528520d315ac1 ("vxlan: add ipv6 proxy support")
which added code that caused this warning:

drivers/net/vxlan.c: In function ‘neigh_reduce’:
drivers/net/vxlan.c:1556:25: warning: variable ‘saddr’ set but not used

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/vxlan.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index bb70dd5..4147fd6 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1553,7 +1553,7 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb)
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	struct nd_msg *msg;
 	const struct ipv6hdr *iphdr;
-	const struct in6_addr *saddr, *daddr;
+	const struct in6_addr *daddr;
 	struct neighbour *n;
 	struct inet6_dev *in6_dev;
 
@@ -1562,7 +1562,6 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb)
 		goto out;
 
 	iphdr = ipv6_hdr(skb);
-	saddr = &iphdr->saddr;
 	daddr = &iphdr->daddr;
 
 	msg = (struct nd_msg *)skb_transport_header(skb);
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld @ 2016-12-14 20:55 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	Jean-Philippe Aumasson, Daniel J . Bernstein, Linus Torvalds,
	Eric Biggers, David Laight
In-Reply-To: <CAHmME9oMaQOhzJbNKh8GN759iJngeRdXt3naOnFhY9mD6t5Kxg@mail.gmail.com>

Hey Tom,

Just following up on what I mentioned in my last email...

On Wed, Dec 14, 2016 at 8:35 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> I think your suggestion for (2) will contribute to further
> optimizations for (1). In v2, I had another patch in there adding
> siphash_1word, siphash_2words, etc, like jhash, but I implemented it
> by taking u32 variables and then just concatenating these into a
> buffer and passing them to the main siphash function. I removed it
> from v3 because I thought that these kind of missed the whole point.
> In particular:
>
> a) siphash24_1word, siphash24_2words, siphash24_3words, etc should
> take u64, not u32, since that's what siphash operates on natively

I implemented these here:
https://git.zx2c4.com/linux-dev/commit/?h=siphash&id=4652b6f3643bdba217e2194d89661348bbac48a0

This will be part of the next version of the series I submit. It's not
immediately clear that using it is strictly faster than the struct
trick though. However, I'm not yet sure why this would be.

Jason

^ permalink raw reply

* Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: Jason A. Donenfeld @ 2016-12-14 21:01 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Laight, Netdev, kernel-hardening, Andi Kleen, LKML,
	Linux Crypto Mailing List
In-Reply-To: <CALx6S37mGaLJoacxyu3_ZQANSNz9UU38-b-V6g1nma=Gye3pjw@mail.gmail.com>

On Wed, Dec 14, 2016 at 9:12 PM, Tom Herbert <tom@herbertland.com> wrote:
> If you pad the data structure to 64 bits then we can call the version
> of siphash that only deals in 64 bit words. Writing a zero in the
> padding will be cheaper than dealing with odd lengths in siphash24.
On Wed, Dec 14, 2016 at 9:27 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> What I don't really understand is that the addition of this complexity
> actually reduces the performance, as you have to take the "if (left)"
> branch during hashing and causes you to make a load_unaligned_zeropad.

Oh, duh, you guys are right. Fixed in my repo [1]. I'll submit the
next version in a day or so to let some other comments come in.

Thanks again for your reviews.

Jason

[1] https://git.zx2c4.com/linux-dev/log/?h=siphash

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Jesper Dangaard Brouer @ 2016-12-14 21:04 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich,
	brouer
In-Reply-To: <5851740A.2080806@gmail.com>

On Wed, 14 Dec 2016 08:32:10 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 13 Dec 2016 12:08:21 -0800
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> >> On 16-12-13 11:53 AM, David Miller wrote:  
> >>> From: John Fastabend <john.fastabend@gmail.com>
> >>> Date: Tue, 13 Dec 2016 09:43:59 -0800
> >>>     
> >>>> What does "zero-copy send packet-pages to the application/socket that
> >>>> requested this" mean? At the moment on x86 page-flipping appears to be
> >>>> more expensive than memcpy (I can post some data shortly) and shared
> >>>> memory was proposed and rejected for security reasons when we were
> >>>> working on bifurcated driver.    
> >>>
> >>> The whole idea is that we map all the active RX ring pages into
> >>> userspace from the start.
> >>>
> >>> And just how Jesper's page pool work will avoid DMA map/unmap,
> >>> it will also avoid changing the userspace mapping of the pages
> >>> as well.
> >>>
> >>> Thus avoiding the TLB/VM overhead altogether.
> >>>     
> > 
> > Exactly.  It is worth mentioning that pages entering the page pool need
> > to be cleared (measured cost 143 cycles), in order to not leak any
> > kernel info.  The primary focus of this design is to make sure not to
> > leak kernel info to userspace, but with an "exclusive" mode also
> > support isolation between applications.
> > 
> >   
> >> I get this but it requires applications to be isolated. The pages from
> >> a queue can not be shared between multiple applications in different
> >> trust domains. And the application has to be cooperative meaning it
> >> can't "look" at data that has not been marked by the stack as OK. In
> >> these schemes we tend to end up with something like virtio/vhost or
> >> af_packet.  
> > 
> > I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> > two would require CAP_NET_ADMIN privileges.  All modes have a trust
> > domain id, that need to match e.g. when page reach the socket.  
> 
> Even mode 3 should required cap_net_admin we don't want userspace to
> grab queues off the nic without it IMO.

Good point.

> > 
> > Mode-1 "Shared": Application choose lowest isolation level, allowing
> >  multiple application to mmap VMA area.  
> 
> My only point here is applications can read each others data and all
> applications need to cooperate for example one app could try to write
> continuously to read only pages causing faults and what not. This is
> all non standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
> 
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
> 
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
> 
> > 
> > Mode-2 "Single-user": Application request it want to be the only user
> >  of the RX queue.  This blocks other application to mmap VMA area.
> >   
> 
> Assuming data is read-only sharing with the stack is possibly OK :/. I
> guess you would need to pools of memory for data and skb so you don't
> leak skb into user space.

Yes, as describe in orig email and here[1]: "once an application
request zero-copy RX, then the driver must use a specific SKB
allocation mode and might have to reconfigure the RX-ring."

The SKB allocation mode is "read-only packet page", which is the
current default mode (also desc in document[1]) of using skb-frags.

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
 
> The devils in the details here. There are lots of hooks in the kernel
> that can for example push the packet with a 'redirect' tc action for
> example. And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down see all the
> hardware offload efforts.
> 
> I assume all these concerns are shared between mode-1 and mode-2
> 
> > Mode-3 "Exclusive": Application request to own RX queue.  Packets are
> >  no longer allowed for normal netstack delivery.
> >   
> 
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.

Interesting.

> > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> > still allowed to travel netstack and thus can contain packet data from
> > other normal applications.  This is part of the design, to share the
> > NIC between netstack and an accelerated userspace application using RX
> > zero-copy delivery.
> >   
> 
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I though the bifurcated driver work was rejected because it could leak
kernel info in the pages. This approach cannot.

   
> >> Any ACLs/filtering/switching/headers need to be done in hardware or
> >> the application trust boundaries are broken.  
> > 
> > The software solution outlined allow the application to make the
> > choice of what trust boundary it wants.
> > 
> > The "exclusive" mode-3 make most sense together with HW filters.
> > Already today, we support creating a new RX queue based on ethtool
> > ntuple HW filter and then you simply attach your application that
> > queue in mode-3, and have full isolation.
> >   
> 
> Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters?
> Without hardware filters we have no way of knowing who/what data is
> put in the page.

For sockets, an SKB carrying a RX zero-copy-able page can be steered
(as normal) into a given socket. Then we check if socket requested
zero-copy, and verify if the domain-id match between the page_pool and
socket.

You can also use XDP to filter and steer the packet (which will be
faster and using normal steering code). 

> >    
> >> If the above can not be met then a copy is needed. What I am trying
> >> to tease out is the above comment along with other statements like
> >> this "can be done with out HW filter features".  
> > 
> > Does this address your concerns?
> >   
> 
> I think we need to enforce strong isolation. An application should not
> be able to read data or impact other applications. I gather this is
> the case per comment about normal applications in mode-2. A slightly
> weaker statement would be to say applications can only impace/read
> data of other applications in their domain. This might be OK as well.

I think this approach covers the "weaker statement".  Because only page
within the pool are "exposed".  Thus, the domain is the NIC (possibly
restricted to a single RX queue).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: kbuild test robot @ 2016-12-14 21:15 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: kbuild-all, Netdev, kernel-hardening, LKML, linux-crypto,
	Jason A. Donenfeld, Jean-Philippe Aumasson, Daniel J . Bernstein,
	Linus Torvalds, Eric Biggers, David Laight
In-Reply-To: <20161214184605.24006-1-Jason@zx2c4.com>

[-- Attachment #1: Type: text/plain, Size: 1420 bytes --]

Hi Jason,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-A-Donenfeld/siphash-add-cryptographically-secure-hashtable-function/20161215-041458
config: i386-randconfig-i1-201650 (attached as .config)
compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   lib/test_siphash.c: In function 'siphash_test_init':
>> lib/test_siphash.c:49:2: error: requested alignment is not an integer constant
     u8 in[64] __aligned(SIPHASH24_ALIGNMENT);
     ^
   lib/test_siphash.c:50:2: error: requested alignment is not an integer constant
     u8 k[16] __aligned(SIPHASH24_ALIGNMENT);
     ^

vim +49 lib/test_siphash.c

    43		0x6ca4ecb15c5f91e1ULL, 0x9f626da15c9625f3ULL, 0xe51b38608ef25f57ULL,
    44		0x958a324ceb064572ULL
    45	};
    46	
    47	static int __init siphash_test_init(void)
    48	{
  > 49		u8 in[64] __aligned(SIPHASH24_ALIGNMENT);
    50		u8 k[16] __aligned(SIPHASH24_ALIGNMENT);
    51		u8 in_unaligned[65];
    52		u8 k_unaligned[65];

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26712 bytes --]

^ permalink raw reply

* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld @ 2016-12-14 21:21 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Netdev, kernel-hardening, LKML,
	Linux Crypto Mailing List, Jean-Philippe Aumasson,
	Daniel J . Bernstein, Linus Torvalds, Eric Biggers, David Laight
In-Reply-To: <201612150515.xggXiOp3%fengguang.wu@intel.com>

Interesting. Evidently gcc 4.8 doesn't like my use of:

enum siphash_lengths {
       SIPHASH24_KEY_LEN = 16,
       SIPHASH24_ALIGNMENT = 8
};

I'll convert this to the more boring:

#define SIPHASH24_KEY_LEN 16
#define SIPHASH24_ALIGNMENT 8

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter @ 2016-12-14 21:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Laight, Jesper Dangaard Brouer, John Fastabend,
	Mike Rapoport, netdev@vger.kernel.org, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, Vladislav Yasevich
In-Reply-To: <c122d91d-9506-ac35-29e5-3d80791259ef@stressinduktion.org>

On Wed, 14 Dec 2016, Hannes Frederic Sowa wrote:

> Wouldn't changing of the pages cause expensive TLB flushes?

Yes so you would only want that feature if its realized at the page
table level for debugging issues.

Once you have memory registered with the hardware device then also the
device could itself could perform snooping to realize that data was
changed and thus abort the operation.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Jesper Dangaard Brouer @ 2016-12-14 21:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich, brouer
In-Reply-To: <CAKgT0UfnBurxz9f+ceD81hAp3U0tGHEi_5MEtxk6PiehG=X8ag@mail.gmail.com>

On Wed, 14 Dec 2016 08:45:08 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> I agree.  This is a no-go from the performance perspective as well.
> At a minimum you would have to be zeroing out the page between uses to
> avoid leaking data, and that assumes that the program we are sending
> the pages to is slightly well behaved.  If we think zeroing out an
> sk_buff is expensive wait until we are trying to do an entire 4K page.

Again, yes the page will be zero'ed out, but only when entering the
page_pool. Because they are recycled they are not cleared on every use.
Thus, performance does not suffer.

Besides clearing large mem area is not as bad as clearing small.
Clearing an entire page does cost something, as mentioned before 143
cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
cost 36 cycles which is only 7 bytes-per-cycle (256/36).

> I think we are stuck with having to use a HW filter to split off
> application traffic to a specific ring, and then having to share the
> memory between the application and the kernel on that ring only.  Any
> other approach just opens us up to all sorts of security concerns
> since it would be possible for the application to try to read and
> possibly write any data it wants into the buffers.

This is why I wrote a document[1], trying to outline how this is possible,
going through all the combinations, and asking the community to find
faults in my idea.  Inlining it again, as nobody really replied on the
content of the doc.

- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===========================
Memory Model for Networking
===========================

This design describes how the page_pool change the memory model for
networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application
          request zero-copy RX, then the driver must use a specific
          SKB allocation mode and might have to reconfigure the
          RX-ring.

Design target
=============

Allow the NIC to function as a normal Linux NIC and be shared in a
safe manor, between the kernel network stack and an accelerated
userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in
a memory safe manor.  An efficient communication channel for userspace
delivery is out of scope for this document, but OOM considerations are
discussed below (`Userspace delivery and OOM`_).

Background
==========

The SKB or ``struct sk_buff`` is the fundamental meta-data structure
for network packets in the Linux Kernel network stack.  It is a fairly
complex object and can be constructed in several ways.

From a memory perspective there are two ways depending on
RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full potential of the page_pool, the drivers must actually
support handling both options depending on the configuration state of
the page_pool.

Writable packet page
--------------------

When the RX packet page is writable, the SKB setup is fairly straight
forward.  The SKB->data (and skb->head) can point directly to the page
data, adjusting the offset according to drivers headroom (for adding
headers) and setting the length according to the DMA descriptor info.

The page/data need to be writable, because the network stack need to
adjust headers (like TimeToLive and checksum) or even add or remove
headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB
also have an accompanying "shared info" data-structure ``struct
skb_shared_info``.  This "skb_shared_info" is written into the
skb->data memory area at the end (skb->end) of the (header) data.  The
skb_shared_info contains semi-sensitive information, like kernel
memory pointers to other pages (which might be pointers to more packet
data).  This would be bad from a zero-copy point of view to leak this
kind of information.

Read-only packet page
---------------------

When the RX packet page is read-only, the construction of the SKB is
significantly more complicated and even involves one more memory
allocation.

1) Allocate a new separate writable memory area, and point skb->data
   here.  This is needed due to (above described) skb_shared_info.

2) Memcpy packet headers into this (skb->data) area.

3) Clear part of skb_shared_info struct in writable-area.

4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
   and adjust the page_offset to be past the headers just copied.

It is useful (later) that the network stack have this notion that part
of the packet and a page can be read-only.  This implies that the
kernel will not "pollute" this memory with any sensitive information.
This is good from a zero-copy point of view, but bad from a
performance perspective.

NIC RX Zero-Copy
================

Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
involves costly mapping and unmapping operations in the address space
of the userspace process.  Plus for doing this safely, the page memory
need to be cleared before using it, to avoid leaking kernel
information to userspace, also a costly operation.  The page_pool base
"class" of optimization is moving these kind of operations out of the
fastpath, by recycling and lifetime control.

Once a NIC RX-queue's page_pool have been configured for zero-copy
into userspace, then can packets still be allowed to travel the normal
stack?

Yes, this should be possible, because the driver can use the
SKB-read-only mode, which avoids polluting the page data with
kernel-side sensitive data.  This implies, when a driver RX-queue
switch page_pool to RX-zero-copy mode it MUST also switch to
SKB-read-only mode (for normal stack delivery for this RXq).

XDP can be used for controlling which pages that gets RX zero-copied
to userspace.  The page is still writable for the XDP program, but
read-only for normal stack delivery.

Kernel safety
-------------

For the paranoid, how do we protect the kernel from a malicious
userspace program.  Sure there will be a communication interface
between kernel and userspace, that synchronize ownership of pages.
But a userspace program can violate this interface, given pages are
kept VMA mapped, the program can in principle access all the memory
pages in the given page_pool.  This opens up for a malicious (or
defect) program modifying memory pages concurrently with the kernel
and DMA engine using them.

An easy way to get around userspace modifying page data contents is
simply to map pages read-only into userspace.

.. Note:: The first implementation target is read-only zero-copy RX
          page to userspace and require driver to use SKB-read-only
          mode.

Advanced: Allowing userspace write access?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What if userspace need write access? Flipping the page permissions per
transfer will likely kill performance (as this likely affects the
TLB-cache).

I will argue that giving userspace write access is still possible,
without risking a kernel crash.  This is related to the SKB-read-only
mode that copies the packet headers (in to another memory area,
inaccessible to userspace).  The attack angle is to modify packet
headers after they passed some kernel network stack validation step
(as once headers are copied they are out of "reach").

Situation classes where memory page can be modified concurrently:

1) When DMA engine owns the page.  Not a problem, as DMA engine will
   simply overwrite data.

2) Just after DMA engine finish writing.  Not a problem, the packet
   will go through netstack validation and be rejected.

3) While XDP reads data. This can lead to XDP/eBPF program goes into a
   wrong code branch, but the eBPF virtual machine should not be able
   to crash the kernel. The worst outcome is a wrong or invalid XDP
   return code.

4) Before SKB with read-only page is constructed. Not a problem, the
   packet will go through netstack validation and be rejected.

5) After SKB with read-only page has been constructed.  Remember the
   packet headers were copied into a separate memory area, and the
   page data is pointed to with an offset passed the copied headers.
   Thus, userspace cannot modify the headers used for netstack
   validation.  It can only modify packet data contents, which is less
   critical as it cannot crash the kernel, and eventually this will be
   caught by packet checksum validation.

6) After netstack delivered packet to another userspace process. Not a
   problem, as it cannot crash the kernel.  It might corrupt
   packet-data being read by another userspace process, which one
   argument for requiring elevated privileges to get write access
   (like NET_CAP_ADMIN).

Userspace delivery and OOM
--------------------------

These RX pages are likely mapped to userspace via mmap(), so-far so
good.  It is key to performance to get an efficient way of signaling
between kernel and userspace, e.g what page are ready for consumption,
and when userspace are done with the page.

It is outside the scope of page_pool to provide such a queuing
structure, but the page_pool can offer some means of protecting the
system resource usage.  It is a classical problem that resources
(e.g. the page) must be returned in a timely manor, else the system,
in this case, will run out of memory.  Any system/design with
unbounded memory allocation can lead to Out-Of-Memory (OOM)
situations.

Communication between kernel and userspace is likely going to be some
kind of queue.  Given transferring packets individually will have too
much scheduling overhead.  A queue can implicitly function as a
bulking interface, and offers a natural way to split the workload
across CPU cores.

This essentially boils down-to a two queue system, with the RX-ring
queue and the userspace delivery queue.

Two bad situations exists for the userspace queue:

1) Userspace is not consuming objects fast-enough. This should simply
   result in packets getting dropped when enqueueing to a full
   userspace queue (as queue *must* implement some limit). Open
   question is; should this be reported or communicated to userspace.

2) Userspace is consuming objects fast, but not returning them in a
   timely manor.  This is a bad situation, because it threatens the
   system stability as it can lead to OOM.

The page_pool should somehow protect the system in case 2.  The
page_pool can detect the situation as it is able to track the number
of outstanding pages, due to the recycle feedback loop.  Thus, the
page_pool can have some configurable limit of allowed outstanding
pages, which can protect the system against OOM.

Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
pages to be "pageable", i.e. swap-able, but that is not an option for
the page_pool as these pages are DMA mapped.

.. _`Fbufs paper`:
   http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688

Effect of blocking allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The effect of page_pool, in case 2, that denies more allocations
essentially result-in the RX-ring queue cannot be refilled and HW
starts dropping packets due to "out-of-buffers".  For NICs with
several HW RX-queues, this can be limited to a subset of queues (and
admin can control which RX queue with HW filters).

The question is if the page_pool can do something smarter in this
case, to signal the consumers of these pages, before the maximum limit
is hit (of allowed outstanding packets).  The MM-subsystem already
have a concept of emergency PFMEMALLOC reserves and associate
page-flags (e.g. page_is_pfmemalloc).  And the network stack already
handle and react to this.  Could the same PFMEMALLOC system be used
for marking pages when limit is close?

This requires further analysis. One can imagine; this could be used at
RX by XDP to mitigate the situation by dropping less-important frames.
Given XDP choose which pages are being send to userspace it might have
appropriate knowledge of what it relevant to drop(?).

.. Note:: An alternative idea is using a data-structure that blocks
          userspace from getting new pages before returning some.
          (out of scope for the page_pool)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Tom Herbert @ 2016-12-14 21:35 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	Jean-Philippe Aumasson, Daniel J . Bernstein, Linus Torvalds,
	Eric Biggers, David Laight
In-Reply-To: <CAHmME9pR3tD2zknKsYaFaTJm_3aBBOA6c174hypm6S-q9wp5nw@mail.gmail.com>

On Wed, Dec 14, 2016 at 12:55 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Tom,
>
> Just following up on what I mentioned in my last email...
>
> On Wed, Dec 14, 2016 at 8:35 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> I think your suggestion for (2) will contribute to further
>> optimizations for (1). In v2, I had another patch in there adding
>> siphash_1word, siphash_2words, etc, like jhash, but I implemented it
>> by taking u32 variables and then just concatenating these into a
>> buffer and passing them to the main siphash function. I removed it
>> from v3 because I thought that these kind of missed the whole point.
>> In particular:
>>
>> a) siphash24_1word, siphash24_2words, siphash24_3words, etc should
>> take u64, not u32, since that's what siphash operates on natively
>
> I implemented these here:
> https://git.zx2c4.com/linux-dev/commit/?h=siphash&id=4652b6f3643bdba217e2194d89661348bbac48a0
>
Those look good, although I would probably just do 1,2,3 words and
then have a function that takes n words like jhash. Might want to call
these dword to distinguish from 32 bit words in jhash.

Also, what is the significance of "24" in the function and constant
names? Can we just drop that and call this siphash?

Tom

> This will be part of the next version of the series I submit. It's not
> immediately clear that using it is strictly faster than the struct
> trick though. However, I'm not yet sure why this would be.
>
> Jason

^ permalink raw reply

* Re: [PATCH v3 2/3] secure_seq: use siphash24 instead of md5_transform
From: kbuild test robot @ 2016-12-14 21:44 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: kbuild-all, Netdev, kernel-hardening, LKML, linux-crypto,
	Jason A. Donenfeld, Andi Kleen, David Miller, David Laight
In-Reply-To: <20161214184605.24006-2-Jason@zx2c4.com>

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

Hi Jason,

[auto build test ERROR on linus/master]
[also build test ERROR on next-20161214]
[cannot apply to v4.9]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-A-Donenfeld/siphash-add-cryptographically-secure-hashtable-function/20161215-041458
config: openrisc-or1ksim_defconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All errors (new ones prefixed by >>):

>> net/core/secure_seq.c:20:1: error: requested alignment is not a constant
   net/core/secure_seq.c: In function 'secure_tcp_sequence_number':
   net/core/secure_seq.c:99:2: error: requested alignment is not a constant
   net/core/secure_seq.c: In function 'secure_ipv4_port_ephemeral':
   net/core/secure_seq.c:119:2: error: requested alignment is not a constant

vim +20 net/core/secure_seq.c

    14	#include <net/secure_seq.h>
    15	
    16	#if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
    17	#include <linux/in6.h>
    18	#include <net/tcp.h>
    19	
  > 20	static u8 net_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT);
    21	
    22	static __always_inline void net_secret_init(void)
    23	{

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7325 bytes --]

^ permalink raw reply

* Re: [PATCH v3 3/3] random: use siphash24 instead of md5 for get_random_int/long
From: kbuild test robot @ 2016-12-14 21:56 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: kbuild-all, Netdev, kernel-hardening, LKML, linux-crypto,
	Jason A. Donenfeld, Jean-Philippe Aumasson, Ted Tso
In-Reply-To: <20161214184605.24006-3-Jason@zx2c4.com>

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]

Hi Jason,

[auto build test ERROR on linus/master]
[also build test ERROR on next-20161214]
[cannot apply to v4.9]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-A-Donenfeld/siphash-add-cryptographically-secure-hashtable-function/20161215-041458
config: i386-randconfig-i1-201650 (attached as .config)
compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> drivers/char/random.c:2046:1: error: requested alignment is not an integer constant
    static u8 random_int_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT);
    ^
   drivers/char/random.c: In function 'get_random_int':
   drivers/char/random.c:2071:2: error: requested alignment is not an integer constant
     } __aligned(SIPHASH24_ALIGNMENT) combined;
     ^
   drivers/char/random.c: In function 'get_random_long':
   drivers/char/random.c:2100:2: error: requested alignment is not an integer constant
     } __aligned(SIPHASH24_ALIGNMENT) combined;
     ^

vim +2046 drivers/char/random.c

  2040		},
  2041	#endif
  2042		{ }
  2043	};
  2044	#endif 	/* CONFIG_SYSCTL */
  2045	
> 2046	static u8 random_int_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT);
  2047	
  2048	int random_int_secret_init(void)
  2049	{

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26712 bytes --]

^ permalink raw reply

* Re: [PATCH v3 3/3] random: use siphash24 instead of md5 for get_random_int/long
From: kbuild test robot @ 2016-12-14 21:57 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: kbuild-all, Netdev, kernel-hardening, LKML, linux-crypto,
	Jason A. Donenfeld, Jean-Philippe Aumasson, Ted Tso
In-Reply-To: <20161214184605.24006-3-Jason@zx2c4.com>

[-- Attachment #1: Type: text/plain, Size: 1528 bytes --]

Hi Jason,

[auto build test ERROR on linus/master]
[also build test ERROR on next-20161214]
[cannot apply to v4.9]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-A-Donenfeld/siphash-add-cryptographically-secure-hashtable-function/20161215-041458
config: openrisc-or1ksim_defconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All errors (new ones prefixed by >>):

>> drivers/char/random.c:2046:1: error: requested alignment is not a constant
   drivers/char/random.c: In function 'get_random_int':
   drivers/char/random.c:2071:2: error: requested alignment is not a constant
   drivers/char/random.c: In function 'get_random_long':
   drivers/char/random.c:2100:2: error: requested alignment is not a constant

vim +2046 drivers/char/random.c

  2040		},
  2041	#endif
  2042		{ }
  2043	};
  2044	#endif 	/* CONFIG_SYSCTL */
  2045	
> 2046	static u8 random_int_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT);
  2047	
  2048	int random_int_secret_init(void)
  2049	{

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7325 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: Hannes Frederic Sowa @ 2016-12-14 22:03 UTC (permalink / raw)
  To: Jason A. Donenfeld, David Laight
  Cc: Netdev, kernel-hardening, Jean-Philippe Aumasson, LKML,
	Linux Crypto Mailing List, Daniel J . Bernstein, Linus Torvalds,
	Eric Biggers
In-Reply-To: <CAHmME9q9ffdtWRbzyMO1ktTdtEFdfumTfojTTNLXFhoa+R+MWQ@mail.gmail.com>

On 14.12.2016 13:46, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Wed, Dec 14, 2016 at 10:56 AM, David Laight <David.Laight@aculab.com> wrote:
>> ...
>>> +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN])
>> ...
>>> +     u64 k0 = get_unaligned_le64(key);
>>> +     u64 k1 = get_unaligned_le64(key + sizeof(u64));
>> ...
>>> +             m = get_unaligned_le64(data);
>>
>> All these unaligned accesses are going to get expensive on architectures
>> like sparc64.
> 
> Yes, the unaligned accesses aren't pretty. Since in pretty much all
> use cases thus far, the data can easily be made aligned, perhaps it
> makes sense to create siphash24() and siphash24_unaligned(). Any
> thoughts on doing something like that?

I fear that the alignment requirement will be a source of bugs on 32 bit
machines, where you cannot even simply take a well aligned struct on a
stack and put it into the normal siphash(aligned) function without
adding alignment annotations everywhere. Even blocks returned from
kmalloc on 32 bit are not aligned to 64 bit.

Can we do this a runtime check and just have one function (siphash)
dealing with that?

Bye,
Hannes

^ permalink raw reply

* [PATCH net] net: vrf: Drop conntrack data after pass through VRF device on Tx
From: David Ahern @ 2016-12-14 22:31 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern

Locally originated traffic in a VRF fails in the presence of a POSTROUTING
rule. For example,

    $ iptables -t nat -A POSTROUTING -s 11.1.1.0/24  -j MASQUERADE
    $ ping -I red -c1 11.1.1.3
    ping: Warning: source address might be selected on device other than red.
    PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data.
    ping: sendmsg: Operation not permitted

Worse, the above causes random corruption resulting in a panic in random
places (I have not seen a consistent backtrace).

Call nf_reset to drop the conntrack info following the pass through the
VRF device.  The nf_reset is needed on Tx but not Rx because of the order
in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress
device and on Tx it is is before the real egress device. Connection
tracking should be tied to the real egress device and not the VRF device.

Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device")
Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 drivers/net/vrf.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 015a1321c7dd..7532646c3b7b 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -366,6 +366,8 @@ static int vrf_finish_output6(struct net *net, struct sock *sk,
 	struct in6_addr *nexthop;
 	int ret;
 
+	nf_reset(skb);
+
 	skb->protocol = htons(ETH_P_IPV6);
 	skb->dev = dev;
 
@@ -547,6 +549,8 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 	u32 nexthop;
 	int ret = -EINVAL;
 
+	nf_reset(skb);
+
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		struct sk_buff *skb2;
-- 
2.1.4

^ permalink raw reply related

* [PATCH perf/core REBASE 4/5] samples/bpf: Remove perf_event_open() declaration
From: Joe Stringer @ 2016-12-14 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: netdev, wangnan0, ast, daniel, acme, Arnaldo Carvalho de Melo
In-Reply-To: <20161214224342.12858-1-joe@ovn.org>

This declaration was made in samples/bpf/libbpf.c for convenience, but
there's already one in tools/perf/perf-sys.h. Reuse that one.

Committer notes:

Testing it:

  $ make -j4 O=../build/v4.9.0-rc8+ samples/bpf/
  make[1]: Entering directory '/home/build/v4.9.0-rc8+'
    CHK     include/config/kernel.release
    GEN     ./Makefile
    CHK     include/generated/uapi/linux/version.h
    Using /home/acme/git/linux as source for kernel
    CHK     include/generated/utsrelease.h
    CHK     include/generated/timeconst.h
    CHK     include/generated/bounds.h
    CHK     include/generated/asm-offsets.h
    CALL    /home/acme/git/linux/scripts/checksyscalls.sh
    HOSTCC  samples/bpf/test_verifier.o
    HOSTCC  samples/bpf/libbpf.o
    HOSTCC  samples/bpf/../../tools/lib/bpf/bpf.o
    HOSTCC  samples/bpf/test_maps.o
    HOSTCC  samples/bpf/sock_example.o
    HOSTCC  samples/bpf/bpf_load.o
<SNIP>
    HOSTLD  samples/bpf/trace_event
    HOSTLD  samples/bpf/sampleip
    HOSTLD  samples/bpf/tc_l2_redirect
  make[1]: Leaving directory '/home/build/v4.9.0-rc8+'
  $

Signed-off-by: Joe Stringer <joe@ovn.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/20161209024620.31660-7-joe@ovn.org
[ Use -I$(srctree)/tools/lib/ to support out of source code tree builds ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 samples/bpf/Makefile            | 2 ++
 samples/bpf/bpf_load.c          | 3 ++-
 samples/bpf/libbpf.c            | 7 -------
 samples/bpf/libbpf.h            | 3 ---
 samples/bpf/sampleip_user.c     | 3 ++-
 samples/bpf/trace_event_user.c  | 9 +++++----
 samples/bpf/trace_output_user.c | 3 ++-
 samples/bpf/tracex6_user.c      | 3 ++-
 8 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index add514e2984a..9718f664fedf 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -108,6 +108,8 @@ always += xdp_tx_iptunnel_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
+HOSTCFLAGS += -I$(srctree)/tools/lib/ -I$(srctree)/tools/include
+HOSTCFLAGS += -I$(srctree)/tools/perf
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_fds_example += -lelf
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index f5b186c46b7c..982c3be702d7 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -23,6 +23,7 @@
 #include <ctype.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "perf-sys.h"
 
 #define DEBUGFS "/sys/kernel/debug/tracing/"
 
@@ -178,7 +179,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	id = atoi(buf);
 	attr.config = id;
 
-	efd = perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+	efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
 	if (efd < 0) {
 		printf("event %d fd %d err %s\n", id, efd, strerror(errno));
 		return -1;
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index d9af876b4a2c..bee473a494f1 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -34,10 +34,3 @@ int open_raw_sock(const char *name)
 
 	return sock;
 }
-
-int perf_event_open(struct perf_event_attr *attr, int pid, int cpu,
-		    int group_fd, unsigned long flags)
-{
-	return syscall(__NR_perf_event_open, attr, pid, cpu,
-		       group_fd, flags);
-}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index cc815624aacf..09aedc320009 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -188,7 +188,4 @@ struct bpf_insn;
 /* create RAW socket and bind to interface 'name' */
 int open_raw_sock(const char *name);
 
-struct perf_event_attr;
-int perf_event_open(struct perf_event_attr *attr, int pid, int cpu,
-		    int group_fd, unsigned long flags);
 #endif
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
index 5ac5adf75931..be59d7dcbdde 100644
--- a/samples/bpf/sampleip_user.c
+++ b/samples/bpf/sampleip_user.c
@@ -21,6 +21,7 @@
 #include <sys/ioctl.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "perf-sys.h"
 
 #define DEFAULT_FREQ	99
 #define DEFAULT_SECS	5
@@ -49,7 +50,7 @@ static int sampling_start(int *pmu_fd, int freq)
 	};
 
 	for (i = 0; i < nr_cpus; i++) {
-		pmu_fd[i] = perf_event_open(&pe_sample_attr, -1 /* pid */, i,
+		pmu_fd[i] = sys_perf_event_open(&pe_sample_attr, -1 /* pid */, i,
 					    -1 /* group_fd */, 0 /* flags */);
 		if (pmu_fd[i] < 0) {
 			fprintf(stderr, "ERROR: Initializing perf sampling\n");
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 704fe9fa77b2..0c5561d193a4 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -20,6 +20,7 @@
 #include <sys/resource.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "perf-sys.h"
 
 #define SAMPLE_FREQ 50
 
@@ -125,9 +126,9 @@ static void test_perf_event_all_cpu(struct perf_event_attr *attr)
 
 	/* open perf_event on all cpus */
 	for (i = 0; i < nr_cpus; i++) {
-		pmu_fd[i] = perf_event_open(attr, -1, i, -1, 0);
+		pmu_fd[i] = sys_perf_event_open(attr, -1, i, -1, 0);
 		if (pmu_fd[i] < 0) {
-			printf("perf_event_open failed\n");
+			printf("sys_perf_event_open failed\n");
 			goto all_cpu_err;
 		}
 		assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_SET_BPF, prog_fd[0]) == 0);
@@ -146,9 +147,9 @@ static void test_perf_event_task(struct perf_event_attr *attr)
 	int pmu_fd;
 
 	/* open task bound event */
-	pmu_fd = perf_event_open(attr, 0, -1, -1, 0);
+	pmu_fd = sys_perf_event_open(attr, 0, -1, -1, 0);
 	if (pmu_fd < 0) {
-		printf("perf_event_open failed\n");
+		printf("sys_perf_event_open failed\n");
 		return;
 	}
 	assert(ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd[0]) == 0);
diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c
index 3bedd945def1..7449fac2819e 100644
--- a/samples/bpf/trace_output_user.c
+++ b/samples/bpf/trace_output_user.c
@@ -21,6 +21,7 @@
 #include <signal.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "perf-sys.h"
 
 static int pmu_fd;
 
@@ -159,7 +160,7 @@ static void test_bpf_perf_event(void)
 	};
 	int key = 0;
 
-	pmu_fd = perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+	pmu_fd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
 
 	assert(pmu_fd >= 0);
 	assert(bpf_map_update_elem(map_fd[0], &key, &pmu_fd, BPF_ANY) == 0);
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
index 179297cb4d35..ca7874ed77f4 100644
--- a/samples/bpf/tracex6_user.c
+++ b/samples/bpf/tracex6_user.c
@@ -10,6 +10,7 @@
 #include <linux/bpf.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "perf-sys.h"
 
 #define SAMPLE_PERIOD  0x7fffffffffffffffULL
 
@@ -30,7 +31,7 @@ static void test_bpf_perf_event(void)
 	};
 
 	for (i = 0; i < nr_cpus; i++) {
-		pmu_fd[i] = perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
+		pmu_fd[i] = sys_perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
 		if (pmu_fd[i] < 0) {
 			printf("event syscall failed\n");
 			goto exit;
-- 
2.10.2

^ permalink raw reply related

* [PATCH perf/core REBASE 5/5] samples/bpf: Move open_raw_sock to separate header
From: Joe Stringer @ 2016-12-14 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: netdev, wangnan0, ast, daniel, acme, Arnaldo Carvalho de Melo
In-Reply-To: <20161214224342.12858-1-joe@ovn.org>

This function was declared in libbpf.c and was the only remaining
function in this library, but has nothing to do with BPF. Shift it out
into a new header, sock_example.h, and include it from the relevant
samples.

Signed-off-by: Joe Stringer <joe@ovn.org>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/20161209024620.31660-8-joe@ovn.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 samples/bpf/Makefile                     | 2 +-
 samples/bpf/fds_example.c                | 1 +
 samples/bpf/libbpf.h                     | 3 ---
 samples/bpf/sock_example.c               | 1 +
 samples/bpf/{libbpf.c => sock_example.h} | 3 +--
 samples/bpf/sockex1_user.c               | 1 +
 samples/bpf/sockex2_user.c               | 1 +
 samples/bpf/sockex3_user.c               | 1 +
 8 files changed, 7 insertions(+), 6 deletions(-)
 rename samples/bpf/{libbpf.c => sock_example.h} (92%)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9718f664fedf..c9496fb98ba1 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -36,7 +36,7 @@ hostprogs-y += lwt_len_hist
 hostprogs-y += xdp_tx_iptunnel
 
 # Libbpf dependencies
-LIBBPF := libbpf.o ../../tools/lib/bpf/bpf.o
+LIBBPF := ../../tools/lib/bpf/bpf.o
 
 test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
 sock_example-objs := sock_example.o $(LIBBPF)
diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c
index 8a4fc4ef3993..f8080dbc8356 100644
--- a/samples/bpf/fds_example.c
+++ b/samples/bpf/fds_example.c
@@ -14,6 +14,7 @@
 
 #include "bpf_load.h"
 #include "libbpf.h"
+#include "sock_example.h"
 
 #define BPF_F_PIN	(1 << 0)
 #define BPF_F_GET	(1 << 1)
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 09aedc320009..3705fba453a0 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -185,7 +185,4 @@ struct bpf_insn;
 		.off   = 0,					\
 		.imm   = 0 })
 
-/* create RAW socket and bind to interface 'name' */
-int open_raw_sock(const char *name);
-
 #endif
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
index d6b91e9a38ad..682ffb1e8c1a 100644
--- a/samples/bpf/sock_example.c
+++ b/samples/bpf/sock_example.c
@@ -27,6 +27,7 @@
 #include <linux/ip.h>
 #include <stddef.h>
 #include "libbpf.h"
+#include "sock_example.h"
 
 char bpf_log_buf[BPF_LOG_BUF_SIZE];
 
diff --git a/samples/bpf/libbpf.c b/samples/bpf/sock_example.h
similarity index 92%
rename from samples/bpf/libbpf.c
rename to samples/bpf/sock_example.h
index bee473a494f1..09f7fe7e5fd7 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/sock_example.h
@@ -1,4 +1,3 @@
-/* eBPF mini library */
 #include <stdlib.h>
 #include <stdio.h>
 #include <linux/unistd.h>
@@ -11,7 +10,7 @@
 #include <arpa/inet.h>
 #include "libbpf.h"
 
-int open_raw_sock(const char *name)
+static inline int open_raw_sock(const char *name)
 {
 	struct sockaddr_ll sll;
 	int sock;
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
index 9454448bf198..6cd2feb3e9b3 100644
--- a/samples/bpf/sockex1_user.c
+++ b/samples/bpf/sockex1_user.c
@@ -3,6 +3,7 @@
 #include <linux/bpf.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "sock_example.h"
 #include <unistd.h>
 #include <arpa/inet.h>
 
diff --git a/samples/bpf/sockex2_user.c b/samples/bpf/sockex2_user.c
index 6a40600d5a83..0e0207c90841 100644
--- a/samples/bpf/sockex2_user.c
+++ b/samples/bpf/sockex2_user.c
@@ -3,6 +3,7 @@
 #include <linux/bpf.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "sock_example.h"
 #include <unistd.h>
 #include <arpa/inet.h>
 #include <sys/resource.h>
diff --git a/samples/bpf/sockex3_user.c b/samples/bpf/sockex3_user.c
index 9099c4255f23..b5524d417eb5 100644
--- a/samples/bpf/sockex3_user.c
+++ b/samples/bpf/sockex3_user.c
@@ -3,6 +3,7 @@
 #include <linux/bpf.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "sock_example.h"
 #include <unistd.h>
 #include <arpa/inet.h>
 #include <sys/resource.h>
-- 
2.10.2

^ permalink raw reply related

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Alexander Duyck @ 2016-12-14 22:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich
In-Reply-To: <20161214222927.587a8ac4@redhat.com>

On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 14 Dec 2016 08:45:08 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> I agree.  This is a no-go from the performance perspective as well.
>> At a minimum you would have to be zeroing out the page between uses to
>> avoid leaking data, and that assumes that the program we are sending
>> the pages to is slightly well behaved.  If we think zeroing out an
>> sk_buff is expensive wait until we are trying to do an entire 4K page.
>
> Again, yes the page will be zero'ed out, but only when entering the
> page_pool. Because they are recycled they are not cleared on every use.
> Thus, performance does not suffer.

So you are talking about recycling, but not clearing the page when it
is recycled.  That right there is my problem with this.  It is fine if
you assume the pages are used by the application only, but you are
talking about using them for both the application and for the regular
network path.  You can't do that.  If you are recycling you will have
to clear the page every time you put it back onto the Rx ring,
otherwise you can leak the recycled memory into user space and end up
with a user space program being able to snoop data out of the skb.

> Besides clearing large mem area is not as bad as clearing small.
> Clearing an entire page does cost something, as mentioned before 143
> cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
> cost 36 cycles which is only 7 bytes-per-cycle (256/36).

What I am saying is that you are going to be clearing the 4K blocks
each time they are recycled.  You can't have the pages shared between
user-space and the network stack unless you have true isolation.  If
you are allowing network stack pages to be recycled back into the
user-space application you open up all sorts of leaks where the
application can snoop into data it shouldn't have access to.

>> I think we are stuck with having to use a HW filter to split off
>> application traffic to a specific ring, and then having to share the
>> memory between the application and the kernel on that ring only.  Any
>> other approach just opens us up to all sorts of security concerns
>> since it would be possible for the application to try to read and
>> possibly write any data it wants into the buffers.
>
> This is why I wrote a document[1], trying to outline how this is possible,
> going through all the combinations, and asking the community to find
> faults in my idea.  Inlining it again, as nobody really replied on the
> content of the doc.
>
> -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>
> ===========================
> Memory Model for Networking
> ===========================
>
> This design describes how the page_pool change the memory model for
> networking in the NIC (Network Interface Card) drivers.
>
> .. Note:: The catch for driver developers is that, once an application
>           request zero-copy RX, then the driver must use a specific
>           SKB allocation mode and might have to reconfigure the
>           RX-ring.
>
>
> Design target
> =============
>
> Allow the NIC to function as a normal Linux NIC and be shared in a
> safe manor, between the kernel network stack and an accelerated
> userspace application using RX zero-copy delivery.
>
> Target is to provide the basis for building RX zero-copy solutions in
> a memory safe manor.  An efficient communication channel for userspace
> delivery is out of scope for this document, but OOM considerations are
> discussed below (`Userspace delivery and OOM`_).
>
> Background
> ==========
>
> The SKB or ``struct sk_buff`` is the fundamental meta-data structure
> for network packets in the Linux Kernel network stack.  It is a fairly
> complex object and can be constructed in several ways.
>
> From a memory perspective there are two ways depending on
> RX-buffer/page state:
>
> 1) Writable packet page
> 2) Read-only packet page
>
> To take full potential of the page_pool, the drivers must actually
> support handling both options depending on the configuration state of
> the page_pool.
>
> Writable packet page
> --------------------
>
> When the RX packet page is writable, the SKB setup is fairly straight
> forward.  The SKB->data (and skb->head) can point directly to the page
> data, adjusting the offset according to drivers headroom (for adding
> headers) and setting the length according to the DMA descriptor info.
>
> The page/data need to be writable, because the network stack need to
> adjust headers (like TimeToLive and checksum) or even add or remove
> headers for encapsulation purposes.
>
> A subtle catch, which also requires a writable page, is that the SKB
> also have an accompanying "shared info" data-structure ``struct
> skb_shared_info``.  This "skb_shared_info" is written into the
> skb->data memory area at the end (skb->end) of the (header) data.  The
> skb_shared_info contains semi-sensitive information, like kernel
> memory pointers to other pages (which might be pointers to more packet
> data).  This would be bad from a zero-copy point of view to leak this
> kind of information.

This should be the default once we get things moved over to using the
DMA_ATTR_SKIP_CPU_SYNC DMA attribute.  It will be a little while more
before it gets fully into Linus's tree.  It looks like the swiotlb
bits have been accepted, just waiting on the ability to map a page w/
attributes and the remainder of the patches that are floating around
in mmotm and linux-next.

BTW, any ETA on when we might expect to start seeing code related to
the page_pool?  It is much easier to review code versus these kind of
blueprints.

> Read-only packet page
> ---------------------
>
> When the RX packet page is read-only, the construction of the SKB is
> significantly more complicated and even involves one more memory
> allocation.
>
> 1) Allocate a new separate writable memory area, and point skb->data
>    here.  This is needed due to (above described) skb_shared_info.
>
> 2) Memcpy packet headers into this (skb->data) area.
>
> 3) Clear part of skb_shared_info struct in writable-area.
>
> 4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
>    and adjust the page_offset to be past the headers just copied.
>
> It is useful (later) that the network stack have this notion that part
> of the packet and a page can be read-only.  This implies that the
> kernel will not "pollute" this memory with any sensitive information.
> This is good from a zero-copy point of view, but bad from a
> performance perspective.

This will hopefully become a legacy approach.

>
> NIC RX Zero-Copy
> ================
>
> Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
> involves costly mapping and unmapping operations in the address space
> of the userspace process.  Plus for doing this safely, the page memory
> need to be cleared before using it, to avoid leaking kernel
> information to userspace, also a costly operation.  The page_pool base
> "class" of optimization is moving these kind of operations out of the
> fastpath, by recycling and lifetime control.
>
> Once a NIC RX-queue's page_pool have been configured for zero-copy
> into userspace, then can packets still be allowed to travel the normal
> stack?
>
> Yes, this should be possible, because the driver can use the
> SKB-read-only mode, which avoids polluting the page data with
> kernel-side sensitive data.  This implies, when a driver RX-queue
> switch page_pool to RX-zero-copy mode it MUST also switch to
> SKB-read-only mode (for normal stack delivery for this RXq).

This is the part that is wrong.  Once userspace has access to the
pages in an Rx ring that ring cannot be used for regular kernel-side
networking.  If it is, then sensitive kernel data may be leaked
because the application has full access to any page on the ring so it
could read the data at any time regardless of where the data is meant
to be delivered.

> XDP can be used for controlling which pages that gets RX zero-copied
> to userspace.  The page is still writable for the XDP program, but
> read-only for normal stack delivery.

Making the page read-only doesn't get you anything.  You still have a
conflict since user-space can read any packet directly out of the
page.

> Kernel safety
> -------------
>
> For the paranoid, how do we protect the kernel from a malicious
> userspace program.  Sure there will be a communication interface
> between kernel and userspace, that synchronize ownership of pages.
> But a userspace program can violate this interface, given pages are
> kept VMA mapped, the program can in principle access all the memory
> pages in the given page_pool.  This opens up for a malicious (or
> defect) program modifying memory pages concurrently with the kernel
> and DMA engine using them.
>
> An easy way to get around userspace modifying page data contents is
> simply to map pages read-only into userspace.
>
> .. Note:: The first implementation target is read-only zero-copy RX
>           page to userspace and require driver to use SKB-read-only
>           mode.

This allows for Rx but what do we do about Tx?  It sounds like
Christoph's RDMA approach might be the way to go.

> Advanced: Allowing userspace write access?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> What if userspace need write access? Flipping the page permissions per
> transfer will likely kill performance (as this likely affects the
> TLB-cache).
>
> I will argue that giving userspace write access is still possible,
> without risking a kernel crash.  This is related to the SKB-read-only
> mode that copies the packet headers (in to another memory area,
> inaccessible to userspace).  The attack angle is to modify packet
> headers after they passed some kernel network stack validation step
> (as once headers are copied they are out of "reach").
>
> Situation classes where memory page can be modified concurrently:
>
> 1) When DMA engine owns the page.  Not a problem, as DMA engine will
>    simply overwrite data.
>
> 2) Just after DMA engine finish writing.  Not a problem, the packet
>    will go through netstack validation and be rejected.
>
> 3) While XDP reads data. This can lead to XDP/eBPF program goes into a
>    wrong code branch, but the eBPF virtual machine should not be able
>    to crash the kernel. The worst outcome is a wrong or invalid XDP
>    return code.
>
> 4) Before SKB with read-only page is constructed. Not a problem, the
>    packet will go through netstack validation and be rejected.
>
> 5) After SKB with read-only page has been constructed.  Remember the
>    packet headers were copied into a separate memory area, and the
>    page data is pointed to with an offset passed the copied headers.
>    Thus, userspace cannot modify the headers used for netstack
>    validation.  It can only modify packet data contents, which is less
>    critical as it cannot crash the kernel, and eventually this will be
>    caught by packet checksum validation.
>
> 6) After netstack delivered packet to another userspace process. Not a
>    problem, as it cannot crash the kernel.  It might corrupt
>    packet-data being read by another userspace process, which one
>    argument for requiring elevated privileges to get write access
>    (like NET_CAP_ADMIN).

If userspace has access to a ring we shouldn't be using SKBs on it
really anyway.  We should probably expect XDP to be handling all the
packaging so items 4-6 can probably be dropped.

>
> Userspace delivery and OOM
> --------------------------
>
> These RX pages are likely mapped to userspace via mmap(), so-far so
> good.  It is key to performance to get an efficient way of signaling
> between kernel and userspace, e.g what page are ready for consumption,
> and when userspace are done with the page.
>
> It is outside the scope of page_pool to provide such a queuing
> structure, but the page_pool can offer some means of protecting the
> system resource usage.  It is a classical problem that resources
> (e.g. the page) must be returned in a timely manor, else the system,
> in this case, will run out of memory.  Any system/design with
> unbounded memory allocation can lead to Out-Of-Memory (OOM)
> situations.
>
> Communication between kernel and userspace is likely going to be some
> kind of queue.  Given transferring packets individually will have too
> much scheduling overhead.  A queue can implicitly function as a
> bulking interface, and offers a natural way to split the workload
> across CPU cores.
>
> This essentially boils down-to a two queue system, with the RX-ring
> queue and the userspace delivery queue.
>
> Two bad situations exists for the userspace queue:
>
> 1) Userspace is not consuming objects fast-enough. This should simply
>    result in packets getting dropped when enqueueing to a full
>    userspace queue (as queue *must* implement some limit). Open
>    question is; should this be reported or communicated to userspace.
>
> 2) Userspace is consuming objects fast, but not returning them in a
>    timely manor.  This is a bad situation, because it threatens the
>    system stability as it can lead to OOM.
>
> The page_pool should somehow protect the system in case 2.  The
> page_pool can detect the situation as it is able to track the number
> of outstanding pages, due to the recycle feedback loop.  Thus, the
> page_pool can have some configurable limit of allowed outstanding
> pages, which can protect the system against OOM.
>
> Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
> pages to be "pageable", i.e. swap-able, but that is not an option for
> the page_pool as these pages are DMA mapped.
>
> .. _`Fbufs paper`:
>    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688
>
> Effect of blocking allocation
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The effect of page_pool, in case 2, that denies more allocations
> essentially result-in the RX-ring queue cannot be refilled and HW
> starts dropping packets due to "out-of-buffers".  For NICs with
> several HW RX-queues, this can be limited to a subset of queues (and
> admin can control which RX queue with HW filters).
>
> The question is if the page_pool can do something smarter in this
> case, to signal the consumers of these pages, before the maximum limit
> is hit (of allowed outstanding packets).  The MM-subsystem already
> have a concept of emergency PFMEMALLOC reserves and associate
> page-flags (e.g. page_is_pfmemalloc).  And the network stack already
> handle and react to this.  Could the same PFMEMALLOC system be used
> for marking pages when limit is close?
>
> This requires further analysis. One can imagine; this could be used at
> RX by XDP to mitigate the situation by dropping less-important frames.
> Given XDP choose which pages are being send to userspace it might have
> appropriate knowledge of what it relevant to drop(?).
>
> .. Note:: An alternative idea is using a data-structure that blocks
>           userspace from getting new pages before returning some.
>           (out of scope for the page_pool)
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
From: Joe Stringer @ 2016-12-14 22:46 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Daniel Borkmann, LKML, netdev, Wang Nan, ast
In-Reply-To: <20161214145512.GQ5482@kernel.org>

On 14 December 2016 at 06:55, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> Em Wed, Dec 14, 2016 at 10:25:01AM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann escreveu:
>> > On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote:
>> > Please note that this might result in hopefully just a minor merge issue
>> > with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c,
>> > which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's
>> > just log buffer and some renames there, which can be discarded for both
>> > files sitting in selftests.
>>
>> Yeah, I've got to this point, and the merge has a little bit more than
>> that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it...
>
> So, Joe, can you try refreshing this work, starting from what I have in
> perf/core? It has the changes coming from net-next that Daniel warned us about
> and some more.

Hi Arnaldo,

I've just respun this series based on the version you previously
applied to perf/core. Since bpf_prog_{attach,detach}() were added to
samples/libbpf, a new patch will shift these over to tools/lib/bpf.
Other than that, I folded "samples/bpf: Drop unnecessary build
targets." back into "samples/bpf: Switch over to libbpf", and I
noticed that there were a couple of unnecessary log buffers with the
latest changes. For any new sample programs, those were fixed up to
use libbpf as well.

Don't forget to do a "make headers_install" before attempting to build
the samples, access to the latest headers is required (as per the
readme in samples/bpf).

Thanks,
Joe

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox