Netdev List


* [Patch net-next] audit: remove useless synchronize_net()
From: Cong Wang @ 2016-11-29 17:14 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Richard Guy Briggs

netlink kernel socket is protected by refcount, not RCU.
Its rcv path is not protected by RCU either. So the synchronize_net()
is just pointless.

Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 kernel/audit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 92c463d..67b9fbd8 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1172,9 +1172,8 @@ static void __net_exit audit_net_exit(struct net *net)
 		audit_sock = NULL;
 	}
 
-	RCU_INIT_POINTER(aunet->nlsk, NULL);
-	synchronize_net();
 	netlink_kernel_release(sock);
+	aunet->nlsk = NULL;
 }
 
 static struct pernet_operations audit_net_ops __net_initdata = {
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH v3 net-next 1/2] net: ethernet: slicoss: add slicoss gigabit ethernet driver
From: Florian Fainelli @ 2016-11-29 17:14 UTC (permalink / raw)
  To: Lino Sanfilippo, davem, charrer, liodot, gregkh, andrew
  Cc: devel, netdev, linux-kernel
In-Reply-To: <6b7de79d-b149-9888-2d0f-2acee5c2905e@gmx.de>

On 11/28/2016 01:41 PM, Lino Sanfilippo wrote:
> The problem is that the HW does not provide a tx completion index. Instead we have to 
> iterate the status descriptors until we get an invalid idx which indicates that there 
> are no further tx descriptors done for now. I am afraid that if we do not limit the 
> number of descriptors processed in the tx completion handler, a continuous transmission 
> of frames could keep the loop in xmit_complete() running endlessly. I don't know if this 
> can actually happen but I wanted to make sure that this is avoided.

OK, it might be a good idea to put that comment somewhere around the tx
completion handler to understand why it is bounded with a specific value.
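
A minimal userspace sketch of such a bounded completion loop, carrying the
kind of comment asked for above. The names tx_complete_bounded,
STAT_IDX_INVALID and TX_COMPLETE_BUDGET are hypothetical and not taken from
the slicoss driver; the budget value is an arbitrary assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define STAT_IDX_INVALID   UINT32_MAX  /* hypothetical "no more completions" marker */
#define TX_COMPLETE_BUDGET 64          /* hypothetical per-call upper bound */

/*
 * The HW provides no tx completion index, so status entries are walked
 * until an invalid idx is found. Under continuous transmission new
 * entries could keep arriving, so the walk is capped at a fixed budget
 * to guarantee that the handler terminates.
 *
 * Returns the number of status entries consumed.
 */
size_t tx_complete_bounded(uint32_t *stat, size_t n)
{
    size_t done = 0;

    while (done < n && done < TX_COMPLETE_BUDGET) {
        if (stat[done] == STAT_IDX_INVALID)
            break;                      /* nothing further completed for now */
        stat[done] = STAT_IDX_INVALID;  /* mark the entry as consumed */
        done++;
    }
    return done;
}
```

Entries left over when the budget is hit would simply be picked up by the
next invocation of the handler.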

> 
>> [snip]
>>
>>> +	while (slic_get_free_rx_descs(rxq) > SLIC_MAX_REQ_RX_DESCS) {
>>> +		skb = alloc_skb(maplen + ALIGN_MASK, gfp);
>>> +		if (!skb)
>>> +			break;
>>> +
>>> +		paddr = dma_map_single(&sdev->pdev->dev, skb->data, maplen,
>>> +				       DMA_FROM_DEVICE);
>>> +		if (dma_mapping_error(&sdev->pdev->dev, paddr)) {
>>> +			netdev_err(dev, "mapping rx packet failed\n");
>>> +			/* drop skb */
>>> +			dev_kfree_skb_any(skb);
>>> +			break;
>>> +		}
>>> +		/* ensure head buffer descriptors are 256 byte aligned */
>>> +		offset = 0;
>>> +		misalign = paddr & ALIGN_MASK;
>>> +		if (misalign) {
>>> +			offset = SLIC_RX_BUFF_ALIGN - misalign;
>>> +			skb_reserve(skb, offset);
>>> +		}
>>> +		/* the HW expects dma chunks for descriptor + frame data */
>>> +		desc = (struct slic_rx_desc *)skb->data;
>>> +		memset(desc, 0, sizeof(*desc));
>>
>> Do you really need to zero-out the prepending RX descriptor? Are not you
>> missing a write barrier here?
> 
> Indeed, it should be sufficient to make sure that the bit SLIC_IRHDDR_SVALID is not set.
> I will adjust it. 
> Concerning the write barrier: You mean a wmb() before slic_write() to ensure that the zeroing
>  of the status desc is done before the descriptor is passed to the HW, right?

Correct, that's what I meant here.

> 
> 
>> [snip]
>>
>>> +
>>> +		dma_sync_single_for_cpu(&sdev->pdev->dev,
>>> +					dma_unmap_addr(buff, map_addr),
>>> +					buff->addr_offset + sizeof(*desc),
>>> +					DMA_FROM_DEVICE);
>>> +
>>> +		status = le32_to_cpu(desc->status);
>>> +		if (!(status & SLIC_IRHDDR_SVALID))
>>> +			break;
>>> +
>>> +		buff->skb = NULL;
>>> +
>>> +		dma_unmap_single(&sdev->pdev->dev,
>>> +				 dma_unmap_addr(buff, map_addr),
>>> +				 dma_unmap_len(buff, map_len),
>>> +				 DMA_FROM_DEVICE);
>>
>> This is potentially inefficient, you already did a cache invalidation
>> for the RX descriptor here, you could be more efficient with just
>> invalidating the packet length, minus the descriptor length.
>>
> 
> I am not sure I understand: We have to unmap the complete dma area, no matter if we synced
> part of it before, don't we? AFAIK a dma sync is different from unmapping dma, or am I missing
> something?

Sorry, I was not very clear, what I meant is that you can allocate and
do the initial dma_map_single() of your RX skbs during ndo_open(), and
then, in your RX path, you only need to do dma_sync_single_for_cpu() twice
(once for the RX descriptor status, second time for the actual packet
contents), and when you return the SKB to the HW, do a
dma_sync_single_for_device(). The advantage of doing that, is that if
your cache operations are slow, you only do them on exactly packet
length, and not the actual RX buffer size (e.g: 2KB).
-- 
Florian
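
The saving described above can be put into numbers with a small userspace
model. This is only a sketch of the reasoning, not driver code: both function
names are hypothetical, and it assumes that cache maintenance cost scales
with the number of bytes synced.

```c
#include <stddef.h>

/* Bytes touched by cache maintenance when the whole RX buffer is
 * unmapped (and therefore synced) for every received packet. */
size_t rx_sync_bytes_unmap(size_t buf_size)
{
    return buf_size;
}

/* Bytes touched when the buffer stays mapped and only the RX
 * descriptor and the received packet are synced for the CPU. */
size_t rx_sync_bytes_partial(size_t desc_len, size_t pkt_len)
{
    return desc_len + pkt_len;
}
```

For a 2KB buffer holding a 64 byte frame behind a (hypothetically) 32 byte
descriptor, the first variant touches 2048 bytes and the second 96.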

^ permalink raw reply

* Re: [PATCH v2 net-next 10/11] qede: Add basic XDP support
From: Jakub Kicinski @ 2016-11-29 17:10 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Yuval Mintz, davem, netdev, alexei.starovoitov
In-Reply-To: <583DA362.7010702@iogearbox.net>

On Tue, 29 Nov 2016 16:48:50 +0100, Daniel Borkmann wrote:
> On 11/29/2016 03:47 PM, Yuval Mintz wrote:
> > Add support for the ndo_xdp callback. This patch would support XDP_PASS,
> > XDP_DROP and XDP_ABORTED commands.
> >
> > This also adds a per Rx queue statistic which counts number of packets
> > which didn't reach the stack [due to XDP].
> >
> > Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>  
> [...]
> > @@ -1560,6 +1593,7 @@ static int qede_rx_process_cqe(struct qede_dev *edev,
> >   			       struct qede_fastpath *fp,
> >   			       struct qede_rx_queue *rxq)
> >   {
> > +	struct bpf_prog *xdp_prog = READ_ONCE(rxq->xdp_prog);
> >   	struct eth_fast_path_rx_reg_cqe *fp_cqe;
> >   	u16 len, pad, bd_cons_idx, parse_flag;
> >   	enum eth_rx_cqe_type cqe_type;
> > @@ -1596,6 +1630,11 @@ static int qede_rx_process_cqe(struct qede_dev *edev,
> >   	len = le16_to_cpu(fp_cqe->len_on_first_bd);
> >   	pad = fp_cqe->placement_offset;
> >
> > +	/* Run eBPF program if one is attached */
> > +	if (xdp_prog)
> > +		if (!qede_rx_xdp(edev, fp, rxq, xdp_prog, bd, fp_cqe))
> > +			return 1;
> > +  
> 
> You also need to wrap this under rcu_read_lock() (at least I haven't seen
> it in your patches) for same reasons as stated in 326fe02d1ed6 ("net/mlx4_en:
> protect ring->xdp_prog with rcu_read_lock"), as otherwise xdp_prog could
> disappear underneath you. mlx4 and nfp do it correctly, looks like mlx5
> doesn't.

My understanding was that Yuval is always doing full stop()/start() so
there should be no RX packets in flight while the XDP prog is being
changed.  But thinking about it again, perhaps it is worth adding the
optimization to forego the full qede_reload() in qede_xdp_set() if there
is a program already loaded and just do the xchg()+put() (and add RCU
protection on the fast path)?

^ permalink raw reply

* Re: bpf debug info
From: Alexei Starovoitov @ 2016-11-29 17:01 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, Brenden Blanco, Thomas Graf, Wangnan, He Kuang,
	kernel-team
In-Reply-To: <583D9AA4.8050601@iogearbox.net>

On Tue, Nov 29, 2016 at 04:11:32PM +0100, Daniel Borkmann wrote:
> On 11/29/2016 07:42 AM, Alexei Starovoitov wrote:
> [...]
> >The support for debug information in BPF was recently added to llvm.
> >
> >In order to use it recompile bpf programs with the following patch
> >in samples/bpf/Makefile
> >@@ -155,4 +155,4 @@ $(obj)/%.o: $(src)/%.c
> >         $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
> >                 -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
> >                 -Wno-compare-distinct-pointer-types \
> >-               -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
> >+               -O2 -emit-llvm -g -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
> >
> >and compiled .o files can be consumed by standard llvm-objdump utility.
> >
> >$ llvm-objdump -S -no-show-raw-insn samples/bpf/xdp1_kern.o
> >xdp1_kern.o:    file format ELF64-BPF
> >
> >Disassembly of section xdp1:
> >xdp_prog1:
> >; {
> >        0:       r2 = *(u32 *)(r1 + 4)
> >; void *data = (void *)(long)ctx->data;
> >        8:       r1 = *(u32 *)(r1 + 0)
> >; if (data + nh_off > data_end)
> >       10:       r3 = r1
> >       18:       r3 += 14
> >       20:       if r3 > r2 goto 55
> >; h_proto = eth->h_proto;
> >       28:       r3 = *(u8 *)(r1 + 12)
> >       30:       r4 = *(u8 *)(r1 + 13)
> >       38:       r4 <<= 8
> >       40:       r4 |= r3
> >; if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
> >       48:       if r4 == 43144 goto 2
> >       50:       r3 = 14
> >       58:       if r4 != 129 goto 5
> >
> >LBB0_3:
> >; if (data + nh_off > data_end)
> >       60:       r3 = r1
> >       68:       r3 += 18
> >       70:       if r3 > r2 goto 45
> >       78:       r3 = 18
> >; h_proto = vhdr->h_vlan_encapsulated_proto;
> >       80:       r4 = *(u16 *)(r1 + 16)
> >
> >LBB0_5:
> >       88:       r5 = r4
> >       90:       r5 &= 65535
> >; if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
> >       98:       if r5 == 43144 goto 1
> >       a0:       if r5 != 129 goto 9
> >
> >Notice that 'clang -S -o a.s' output and llvm-objdump disassembler
> >were changed to use kernel verifier style, so now it should be easier
> >to see what's going on.
> 
> Sounds really useful, is that scheduled for llvm 3.10 release?

llvm 4.0 :)

> That debugging info is stored in dwarf format into the obj, right?

right.

> Would be nice if also pahole could work on bpf object files.

yeah. pahole needs to be taught to recognize bpf e_machine type and relocations.

> >The main advantage of debug info is that verifier error messages
> >are now easier to correlate to original C code.
> 
> Does that mean that the old -S output format is not available anymore?
> Personally, I liked that one better tbh, was hoping someone would have
> added asm parsing for it, so compilation from .S to .o also works.

we can still do a proper assembler from .s into .o
llvm infra is flexible enough. I can point somebody on what places
inside llvm bpf backend to extend to add full parsing of .s

> >If I try to run samples/bpf/test_cls_bpf.sh the verifier will complain:
> >R0=imm0,min_value=0,max_value=0 R1=pkt(id=0,off=0,r=42) R2=pkt_end
> >112: (0f) r4 += r3
> >113: (0f) r1 += r4
> >114: (b7) r0 = 2
> >115: (69) r2 = *(u16 *)(r1 +2)
> >invalid access to packet, off=2 size=2, R1(id=3,off=0,r=0)
> >
> >Now multiply 115 * 8 and convert to hex. This is address 0x398 in llvm-objdump:
> >; struct udphdr *udp = data + tp_off;
> >      388:       r1 += r4
> >      390:       r0 = 2
> >; if (udp->dest == htons(DEFAULT_PKTGEN_UDP_PORT) ||
> >      398:       r2 = *(u16 *)(r1 + 2)
> >      3a0:       if r2 == 2304 goto 16
> >
> >Now it's clear which line of C code is causing the verifier to reject.
> [...]
> 
> Could llvm-objdump switch line numbering for bpf same way as verifier
> output, so mapping step is not really needed?

you mean for llvm-objdump to print 113,114,115?
I guess it's doable. Will give it a try.
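
Until then, the mapping step quoted above (verifier instruction index times 8,
read as hex) can be sketched as follows. The helper name bpf_insn_offset is
hypothetical.

```c
#include <stdint.h>

#define BPF_INSN_SIZE 8  /* each BPF instruction slot is 8 bytes */

/* Convert a verifier instruction index into the byte offset that
 * llvm-objdump prints in front of the instruction. */
uint64_t bpf_insn_offset(uint32_t insn_idx)
{
    return (uint64_t)insn_idx * BPF_INSN_SIZE;
}
```

With the log above, index 115 gives 920, which is 0x398, and indexes 113 and
114 give 0x388 and 0x390.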

> So the BPF_COMMENT pseudo insn will get stripped away from the insn array
> after verification step, so we don't need to hold/account for this mem?

that was the idea, but if we keep src in there, we might want to keep
it for some future 'dump program' debugging step?

> I assume in its ->imm member it will just hold offset into text blob?

yes. 

> Given that the generated verifier log can already become huge nowadays up
> to a point where less than 4k insns prog fails to load due to reaching max
> kernel allowed verifier buffer size, is the plan to only dump C code for
> last few lines or for everything?

right. that's a good point.

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-29 16:50 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Vishwanathapura, Niranjana, Weiny, Ira, Doug Ledford,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
	Dalessandro, Dennis
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373AB0B9B46@ORSMSX109.amr.corp.intel.com>

On Tue, Nov 29, 2016 at 04:44:37PM +0000, Hefty, Sean wrote:
> > You are not making a subsystem. Don't overcomplicate things. A
> > multi-part device can just directly link.
> 
> The VNIC may be usable over multiple generations of HFIs, but I
> don't know if that is the intent.

If Intel wants to build a HFI subsystem within RDMA with multiple
drivers then sure, but they are not there yet, and we don't even know
what that could look like. So it is better to leave it simple for now.

Jason

^ permalink raw reply

* Re: [PATCH net] bpf: fix states equal logic for varlen access
From: Alexei Starovoitov @ 2016-11-29 16:49 UTC (permalink / raw)
  To: Josef Bacik; +Cc: davem, netdev, daniel, ast, jannh
In-Reply-To: <db0d1d81-b497-4f95-48c1-87f5f52a906e@fb.com>

On Tue, Nov 29, 2016 at 09:45:33AM -0500, Josef Bacik wrote:
> On 11/28/2016 10:04 PM, Alexei Starovoitov wrote:
> >On Mon, Nov 28, 2016 at 02:44:10PM -0500, Josef Bacik wrote:
> >>If we have a branch that looks something like this
> >>
> >>int foo = map->value;
> >>if (condition) {
> >>  foo += blah;
> >>} else {
> >>  foo = bar;
> >>}
> >>map->array[foo] = baz;
> >>
> >>We will incorrectly assume that the !condition branch is equal to the condition
> >>branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
> >>adjust this logic to only do this if we didn't do a varlen access after we
> >>processed the !condition branch, otherwise we have different ranges and need to
> >>check the other branch as well.
> >>
> >>Signed-off-by: Josef Bacik <jbacik@fb.com>
> >>---
> >> kernel/bpf/verifier.c | 10 ++++++++--
> >> 1 file changed, 8 insertions(+), 2 deletions(-)
> >>
> >>diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>index 89f787c..2c8a688 100644
> >>--- a/kernel/bpf/verifier.c
> >>+++ b/kernel/bpf/verifier.c
> >>@@ -2478,6 +2478,7 @@ static bool states_equal(struct bpf_verifier_env *env,
> >> {
> >> 	struct bpf_reg_state *rold, *rcur;
> >> 	int i;
> >>+	bool map_access = env->varlen_map_value_access;
> >
> >that's a bit misleading name for the variable.
> >Pls call it varlen_map_access.
> >
> >> 	for (i = 0; i < MAX_BPF_REG; i++) {
> >> 		rold = &old->regs[i];
> >>@@ -2489,12 +2490,17 @@ static bool states_equal(struct bpf_verifier_env *env,
> >> 		/* If the ranges were not the same, but everything else was and
> >> 		 * we didn't do a variable access into a map then we are a-ok.
> >> 		 */
> >>-		if (!env->varlen_map_value_access &&
> >>+		if (!map_access &&
> >> 		    rold->type == rcur->type && rold->imm == rcur->imm)
> >
> >just noticed that this one is missing comparing rold->id == rcur->id
> >
> 
> Do you want me to fix that here?  I'll fix up the rest of the stuff, and
> Daniels things as well.  Thanks,

Nevermind. Comparing 'id' is not needed in net, only in net-next.

^ permalink raw reply

* Re: netlink: GPF in sock_sndtimeo
From: Richard Guy Briggs @ 2016-11-29 16:48 UTC (permalink / raw)
  To: Cong Wang
  Cc: Dmitry Vyukov, David Miller, Johannes Berg, Florian Westphal,
	Eric Dumazet, Herbert Xu, netdev, LKML, syzkaller
In-Reply-To: <CAM_iQpVeLvfYV+1jX1ZKOntZim4roof4=>

On 2016-11-26 17:11, Cong Wang wrote:
> On Sat, Nov 26, 2016 at 7:44 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> > Hello,

Eric, thanks for the heads up on this.  (more below...)

> > The following program triggers GPF in sock_sndtimeo:
> > https://gist.githubusercontent.com/dvyukov/c19cadd309791cf5cb9b2bf936d3f48d/raw/1743ba0211079a5465d039512b427bc6b59b1a76/gistfile1.txt
> >
> > On commit 16ae16c6e5616c084168740990fc508bda6655d4 (Nov 24).
> >
> > general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > Modules linked in:
> > CPU: 1 PID: 19950 Comm: syz-executor Not tainted 4.9.0-rc5+ #54
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > task: ffff88002a0d0840 task.stack: ffff880036920000
> > RIP: 0010:[<ffffffff86cb35e1>]  [<     inline     >] sock_sndtimeo
> > include/net/sock.h:2075
> > RIP: 0010:[<ffffffff86cb35e1>]  [<ffffffff86cb35e1>]
> > netlink_unicast+0xe1/0x730 net/netlink/af_netlink.c:1232
> > RSP: 0018:ffff880036926f68  EFLAGS: 00010202
> > RAX: 0000000000000068 RBX: ffff880036927000 RCX: ffffc900021d0000
> > RDX: 0000000000000d63 RSI: 00000000024000c0 RDI: 0000000000000340
> > RBP: ffff880036927028 R08: ffffed0006ea7aab R09: ffffed0006ea7aab
> > R10: 0000000000000001 R11: ffffed0006ea7aaa R12: dffffc0000000000
> > R13: 0000000000000000 R14: ffff880035de3400 R15: ffff880035de3400
> > FS:  00007f90a2fc7700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000006de0c0 CR3: 0000000035de6000 CR4: 00000000000006e0
> > Stack:
> >  ffff880035de3400 ffffffff819f02a1 1ffff10006d24df4 0000000000000004
> >  00004db400000014 ffff880036926fd8 ffffffff00000000 0000000041b58ab3
> >  ffffffff89653c11 ffffffff86cb3500 ffffffff819f0345 ffff880035de3400
> > Call Trace:
> >  [<     inline     >] audit_replace kernel/audit.c:817
> >  [<ffffffff816c34b9>] audit_receive_msg+0x22c9/0x2ce0 kernel/audit.c:894
> >  [<     inline     >] audit_receive_skb kernel/audit.c:1120
> >  [<ffffffff816c40ac>] audit_receive+0x1dc/0x360 kernel/audit.c:1133
> >  [<     inline     >] netlink_unicast_kernel net/netlink/af_netlink.c:1214
> >  [<ffffffff86cb3a14>] netlink_unicast+0x514/0x730 net/netlink/af_netlink.c:1240
> >  [<ffffffff86cb46d4>] netlink_sendmsg+0xaa4/0xe50 net/netlink/af_netlink.c:1786
> >  [<     inline     >] sock_sendmsg_nosec net/socket.c:621
> >  [<ffffffff86a6d54f>] sock_sendmsg+0xcf/0x110 net/socket.c:631
> >  [<ffffffff86a6d8bb>] sock_write_iter+0x32b/0x620 net/socket.c:829
> >  [<     inline     >] new_sync_write fs/read_write.c:499
> >  [<ffffffff81a6f24e>] __vfs_write+0x4fe/0x830 fs/read_write.c:512
> >  [<ffffffff81a70cf5>] vfs_write+0x175/0x4e0 fs/read_write.c:560
> >  [<     inline     >] SYSC_write fs/read_write.c:607
> >  [<ffffffff81a75180>] SyS_write+0x100/0x240 fs/read_write.c:599
> >  [<ffffffff81009a24>] do_syscall_64+0x2f4/0x940 arch/x86/entry/common.c:280
> >  [<ffffffff88149e8d>] entry_SYSCALL64_slow_path+0x25/0x25
> > Code: fe 4c 89 f7 e8 31 16 ff ff 8b 8d 70 ff ff ff 49 89 c7 31 c0 85
> > c9 75 25 e8 7d 4a a3 fa 49 8d bd 40 03 00 00 48 89 f8 48 c1 e8 03 <42>
> > 80 3c 20 00 0f 85 3a 06 00 00 49 8b 85 40 03 00 00 4c 8d 73
> > RIP  [<     inline     >] sock_sndtimeo include/net/sock.h:2075
> > RIP  [<ffffffff86cb35e1>] netlink_unicast+0xe1/0x730
> > net/netlink/af_netlink.c:1232
> >  RSP <ffff880036926f68>
> > ---[ end trace 8383a15fba6fdc59 ]---
> 
> It is racy on audit_sock, especially on the netns exit path.

I think that is the only place it is racy.  The other places audit_sock
is set is when the socket failure has just triggered a reset.

Is there a notifier callback for failed or reaped sockets?

> Could the following patch help a little bit? Also, I don't see how the
> synchronize_net() here could sync with netlink rcv path, since unlike
> packets from wire, netlink messages are not handled in BH context
> nor I see any RCU taken on rcv path.

Thanks Cong, this looks like it could help.  I'll have a closer look.

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f1ca116..20bc79e 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1167,10 +1167,13 @@ static void __net_exit audit_net_exit(struct net *net)
>  {
>         struct audit_net *aunet = net_generic(net, audit_net_id);
>         struct sock *sock = aunet->nlsk;
> +
> +       mutex_lock(&audit_cmd_mutex);
>         if (sock == audit_sock) {
>                 audit_pid = 0;
>                 audit_sock = NULL;
>         }
> +       mutex_unlock(&audit_cmd_mutex);
> 
>         RCU_INIT_POINTER(aunet->nlsk, NULL);
>         synchronize_net();

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* RE: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Hefty, Sean @ 2016-11-29 16:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Vishwanathapura, Niranjana
  Cc: Weiny, Ira, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Dalessandro, Dennis
In-Reply-To: <20161129161950.GB742-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

> You are not making a subsystem. Don't overcomplicate things. A
> multi-part device can just directly link.

The VNIC may be usable over multiple generations of HFIs, but I don't know if that is the intent.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next] net: can: usb: kvaser_usb: fix spelling mistake of "outstanding"
From: Colin King @ 2016-11-29 16:27 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde, Jimmy Assarsson,
	Wolfram Sang, David S . Miller, linux-can, netdev
  Cc: linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Trivial fix to spelling mistake "oustanding" to "outstanding" in
comment and dev_dbg message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/can/usb/kvaser_usb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/can/usb/kvaser_usb.c b/drivers/net/can/usb/kvaser_usb.c
index d51e0c4..18cc529 100644
--- a/drivers/net/can/usb/kvaser_usb.c
+++ b/drivers/net/can/usb/kvaser_usb.c
@@ -459,7 +459,7 @@ struct kvaser_usb {
 	struct usb_endpoint_descriptor *bulk_in, *bulk_out;
 	struct usb_anchor rx_submitted;
 
-	/* @max_tx_urbs: Firmware-reported maximum number of oustanding,
+	/* @max_tx_urbs: Firmware-reported maximum number of outstanding,
 	 * not yet ACKed, transmissions on this device. This value is
 	 * also used as a sentinel for marking free tx contexts.
 	 */
@@ -2027,7 +2027,7 @@ static int kvaser_usb_probe(struct usb_interface *intf,
 		((dev->fw_version >> 16) & 0xff),
 		(dev->fw_version & 0xffff));
 
-	dev_dbg(&intf->dev, "Max oustanding tx = %d URBs\n", dev->max_tx_urbs);
+	dev_dbg(&intf->dev, "Max outstanding tx = %d URBs\n", dev->max_tx_urbs);
 
 	err = kvaser_usb_get_card_info(dev);
 	if (err) {
-- 
2.10.2


^ permalink raw reply related

* Re: [net-next] neigh: remove duplicate check for same neigh
From: David Ahern @ 2016-11-29 16:22 UTC (permalink / raw)
  To: Zhang Shengju, netdev
In-Reply-To: <1480407771-12884-1-git-send-email-zhangshengju@cmss.chinamobile.com>

On 11/29/16 1:22 AM, Zhang Shengju wrote:
> @@ -2285,20 +2295,15 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
>  	rcu_read_lock_bh();
>  	nht = rcu_dereference_bh(tbl->nht);
>  
> -	for (h = s_h; h < (1 << nht->hash_shift); h++) {
> -		if (h > s_h)
> -			s_idx = 0;
> +	for (h = s_h; h < (1 << nht->hash_shift); h++, s_idx = 0) {
>  		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
>  		     n != NULL;
> -		     n = rcu_dereference_bh(n->next)) {
> -			if (!net_eq(dev_net(n->dev), net))
> -				continue;
> -			if (neigh_ifindex_filtered(n->dev, filter_idx))
> +		     n = rcu_dereference_bh(n->next), idx++) {

incrementing idx here ...

> +			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
>  				continue;
> -			if (neigh_master_filtered(n->dev, filter_master_idx))
> +			if (neigh_dump_filtered(n->dev, filter_idx,
> +						filter_master_idx))
>  				continue;
> -			if (idx < s_idx)
> -				goto next;
>  			if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
>  					    cb->nlh->nlmsg_seq,
>  					    RTM_NEWNEIGH,

... causes a missed entry when an error happens here

idx is only incremented when a neigh has been successfully added to the skb.


Your intention is to save a few cycles by making idx an absolute index within the hash bucket and jumping to the next entry to be dumped without evaluating any filters per entry.

So why not keep it simple and do this:

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 2ae929f9bd06..2e49a85e696a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2291,14 +2291,12 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
                for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
                     n != NULL;
                     n = rcu_dereference_bh(n->next)) {
-                       if (!net_eq(dev_net(n->dev), net))
-                               continue;
-                       if (neigh_ifindex_filtered(n->dev, filter_idx))
-                               continue;
-                       if (neigh_master_filtered(n->dev, filter_master_idx))
-                               continue;
                        if (idx < s_idx)
                                goto next;
+                       if (!net_eq(dev_net(n->dev), net) ||
+                           neigh_ifindex_filtered(n->dev, filter_idx) ||
+                           neigh_master_filtered(n->dev, filter_master_idx))
+                               goto next;
                        if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
                                            cb->nlh->nlmsg_seq,
                                            RTM_NEWNEIGH,
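
A userspace model of the ordering in the diff above, as a sketch only: the
resume check comes first, the filters second, and idx is advanced only after
an entry was skipped or written. struct entry and dump_bucket are
hypothetical stand-ins, and a full output buffer plays the role of a
neigh_fill_info() failure.

```c
#include <stddef.h>

/* Hypothetical stand-in for a neighbour entry in one hash bucket. */
struct entry {
    int id;
    int filtered;  /* non-zero if a net/ifindex/master filter rejects it */
};

/*
 * Dump one bucket into out[] (capacity cap), resuming at s_idx.
 * idx is an absolute position within the bucket: entries below s_idx
 * are skipped before any filter is evaluated, and idx is not advanced
 * past an entry that could not be written.
 * Returns the idx to resume from; *count is the number of ids written.
 */
size_t dump_bucket(const struct entry *b, size_t n, size_t s_idx,
                   int *out, size_t cap, size_t *count)
{
    size_t idx = 0;

    *count = 0;
    while (idx < n) {
        if (idx < s_idx)
            goto next;           /* handled by an earlier call */
        if (b[idx].filtered)
            goto next;
        if (*count == cap)
            break;               /* "error": keep idx so this entry is retried */
        out[(*count)++] = b[idx].id;
next:
        idx++;
    }
    return idx;
}
```

Calling it again with the returned index as s_idx continues with the entry
that did not fit, so no entry is dumped twice or dropped.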

^ permalink raw reply related

* Re: [PATCH v2 12/13] net: ethernet: ti: cpts: calc mult and shift from refclk freq
From: Grygorii Strashko @ 2016-11-29 16:22 UTC (permalink / raw)
  To: Richard Cochran
  Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
	linux-omap, Rob Herring, devicetree, Murali Karicheri,
	Wingman Kwok, John Stultz, Thomas Gleixner
In-Reply-To: <20161129103453.GJ3110@localhost.localdomain>



On 11/29/2016 04:34 AM, Richard Cochran wrote:
> On Mon, Nov 28, 2016 at 05:03:36PM -0600, Grygorii Strashko wrote:
>> +static void cpts_calc_mult_shift(struct cpts *cpts)
>> +{
>> +	u64 frac, maxsec, ns;
>> +	u32 freq, mult, shift;
>> +
>> +	freq = clk_get_rate(cpts->refclk);
>> +
>> +	/* Calc the maximum number of seconds which we can run before
>> +	 * wrapping around.
>> +	 */
>> +	maxsec = cpts->cc.mask;
>> +	do_div(maxsec, freq);
>> +	if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
>> +		maxsec = 600;
> 
> The reason for this test is not obvious.  Why check cc.mask against
> UINT_MAX?  Please use the comment to explain it.
> 

Yeah. This is copy paste from __clocksource_update_freq_scale(), but
I'm going to limit it to 10 sec for now, because otherwise it will result in too small
mult in case of 64bit counter.

if (maxsec > 10)
	maxsec = 10;

-- 
regards,
-grygorii
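
The effect of the clamp on mult can be illustrated with a userspace sketch.
calc_mult_shift below is modeled on the kernel's clocks_calc_mult_shift()
and is an approximation written for this example, not a copy of the driver
or of the patch; cpts_maxsec is a hypothetical helper, and the 500 MHz
refclk used in the note below is an assumed value.

```c
#include <stdint.h>

/* Choose mult/shift for converting 'from' Hz cycles into 'to' Hz units
 * such that converting up to maxsec seconds worth of cycles does not
 * overflow 64 bits. */
void calc_mult_shift(uint32_t *mult, uint32_t *shift,
                     uint32_t from, uint32_t to, uint32_t maxsec)
{
    uint64_t tmp;
    uint32_t sft, sftacc = 32;

    /* Reduce the bits available for mult by the bits needed above 32
     * for maxsec seconds of cycles. */
    tmp = ((uint64_t)maxsec * from) >> 32;
    while (tmp) {
        tmp >>= 1;
        sftacc--;
    }

    /* Find the largest shift whose mult fits into sftacc bits. */
    for (sft = 32; sft > 0; sft--) {
        tmp = (uint64_t)to << sft;
        tmp += from / 2;
        tmp /= from;
        if ((tmp >> sftacc) == 0)
            break;
    }
    *mult = (uint32_t)tmp;
    *shift = sft;
}

/* Clamp the conversion range as proposed in the reply: at most 10 s. */
uint64_t cpts_maxsec(uint64_t cc_mask, uint32_t freq)
{
    uint64_t maxsec = cc_mask / freq;

    return maxsec > 10 ? 10 : maxsec;
}
```

For a 500 MHz clock converted to nanoseconds, maxsec = 600 yields
mult = 1 << 24 with shift = 23, while maxsec = 10 yields mult = 1 << 30 with
shift = 29, so the smaller range leaves more bits of precision in mult.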

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-29 16:21 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161129062938.GA84990-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

On Mon, Nov 28, 2016 at 10:29:38PM -0800, Vishwanathapura, Niranjana wrote:
> On Thu, Nov 24, 2016 at 09:15:45AM -0700, Jason Gunthorpe wrote:
> >>And will move the hfi_vnic module under
> >>"drivers/infiniband/ulp/hfi_vnic".
> >
> >I would prefer drivers/net/ethernet
> >
> >This is clearly not a ULP since it doesn't use verbs.
> >
> 
> I understand it is not using verbs, but the control path (ib_device client)
> is using verbs (IB MAD).
> Our preference is to keep it somewhere under drivers/infiniband. Summarizing
> reasons again here,
> 
> - VNIC control driver (ib_device client) is an IB MAD agent.
> - It is purely a software construct, encapsulates ethernet packets in
> Omni-path packet and depends on hfi1 driver here for HW access.

Is the majority of the code MAD focused or net stack focused?

I'm not sure it matters, it isn't like we can review Intel's
proprietary mad stuff anyhow. :\

Jason

^ permalink raw reply

* Re: [PATCH] tun: Use netif_receive_skb instead of netif_rx
From: Michael S. Tsirkin @ 2016-11-29 16:20 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Herbert Xu, David S . Miller, Jason Wang, Eric Dumazet,
	Paolo Abeni, Soheil Hassas Yeganeh, Markus Elfring, Mike Rapoport,
	netdev, linux-kernel, Dmitry Vyukov, Kostya Serebryany, syzkaller
In-Reply-To: <1480433136-7922-1-git-send-email-andreyknvl@google.com>

On Tue, Nov 29, 2016 at 04:25:36PM +0100, Andrey Konovalov wrote:
> This patch changes tun.c to call netif_receive_skb instead of netif_rx
> when a packet is received. The difference between the two is that netif_rx
> queues the packet into the backlog, and netif_receive_skb processes the
> packet in the current context.
> 
> This patch is required for syzkaller [1] to collect coverage from packet
> receive paths, when a packet being received through tun (syzkaller collects
> coverage per process in the process context).
> 
> A similar patch was introduced back in 2010 [2, 3], but the author found
> out that the patch doesn't help with the task he had in mind (for cgroups
> to shape network traffic based on the original process) and decided not to
> go further with it. The main concern back then was about possible stack
> exhaustion with 4K stacks, but CONFIG_4KSTACKS was removed and stacks are
> 8K now.
> 
> [1] https://github.com/google/syzkaller
> 
> [2] https://www.spinics.net/lists/netdev/thrd440.html#130570
> 
> [3] https://www.spinics.net/lists/netdev/msg130570.html
> 
> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>

This was on my list of things to investigate ever since
8k stack default went in. Thanks for looking into this!

I note that there still seem to exist 3 architectures that do have
CONFIG_4KSTACKS.

How about a wrapper that does netif_rx_ni with CONFIG_4KSTACKS?


> ---
>  drivers/net/tun.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8093e39..4b56e91 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1304,7 +1304,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>  	skb_probe_transport_header(skb, 0);
>  
>  	rxhash = skb_get_hash(skb);
> -	netif_rx_ni(skb);
> +	local_bh_disable();
> +	netif_receive_skb(skb);
> +	local_bh_enable();
>  
>  	stats = get_cpu_ptr(tun->pcpu_stats);
>  	u64_stats_update_begin(&stats->syncp);
> -- 
> 2.8.0.rc3.226.g39d4020

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-29 16:19 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161129063106.GB84990-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

On Mon, Nov 28, 2016 at 10:31:06PM -0800, Vishwanathapura, Niranjana wrote:
> On Fri, Nov 25, 2016 at 12:05:09PM -0700, Jason Gunthorpe wrote:
> >On Thu, Nov 24, 2016 at 06:13:50PM -0800, Vishwanathapura, Niranjana wrote:
> >
> >>In order to be truly device independent the hfi_vnic ULP should not depend
> >>on a device exported symbol. Instead device should register its functions
> >>with the ULP. Hence the approaches a) and b).
> >
> >It is not device independent, it is hard linked to hfi1, just like our
> >other multi-component drivers.. So don't worry about that.
> >
> 
> We would like to keep the design clean and avoid any tight coupling here
> (our original design in this series tackled these).
> Any strong reason not to go with a) or b) ?

You are not making a subsystem. Don't overcomplicate things. A
multi-part device can just directly link.

Jason

^ permalink raw reply

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-29 16:15 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: syzkaller, Dmitry Vyukov, David Miller, Tom Herbert,
	Alexander Duyck, Hannes Frederic Sowa, Jiri Benc, Sabrina Dubroca,
	netdev, LKML
In-Reply-To: <CAAeHK+zLO=Gv33TqS6tGTTC=QUYO-HOPgSB2oS1SBOAyJzPDTQ@mail.gmail.com>

On Tue, 2016-11-29 at 16:31 +0100, Andrey Konovalov wrote:
> The issue is not with skb_network_offset(), but with __skb_pull()
> using skb_network_offset() as an argument.
> 

No. The issue can happen with _any_ __skb_pull() with a 'negative'
argument, on 64bit arches.

skb_network_offset() is only one of the many cases where this could happen if
a bug is added at some random place, including memory corruption from
a different kernel layer, or buggy hardware.
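
To make the 64bit point concrete, here is a small user-space model of the
`data += len` step, assuming the length is an unsigned 32-bit value as in
__skb_pull(). The helper names are made up for this illustration and plain
integers stand in for the pointers; the checked variant is only a sketch of
the kind of test being discussed, not a proposed change.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Models `data += len` with a 64-bit address: the unsigned 32-bit length
 * is zero-extended, so a 'negative' value becomes a jump of almost 4GB
 * forward instead of a small step backward.
 */
static uint64_t pull_addr64(uint64_t data, uint32_t len)
{
	return data + len;
}

/* Same step with a 32-bit address: the sum wraps modulo 2^32. */
static uint32_t pull_addr32(uint32_t data, uint32_t len)
{
	return data + len;
}

/*
 * Hypothetical checked variant: refuse to pull more than the buffer
 * holds. Returns 0 when the request is rejected.
 */
static uint64_t pull_checked64(uint64_t data, uint32_t len, uint32_t avail)
{
	return len > avail ? 0 : pull_addr64(data, len);
}

static void pull_demo(void)
{
	uint32_t len = (uint32_t)-2;	/* an offset of -2 seen as unsigned */

	/* 32-bit address: wraps around, ends up 2 bytes before the start. */
	assert(pull_addr32(0x1000, len) == 0x0ffe);
	/* 64-bit address: ends up far outside the buffer. */
	assert(pull_addr64(0x1000, len) == 0x100000ffeULL);
	/* The extra comparison catches it, at the cost of a branch. */
	assert(pull_checked64(0x1000, len, 64) == 0);
}
```

The comparison in pull_checked64() is the sort of per-call cost meant by
keeping things reasonably fast.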

> I'm not sure what would be the best way to fix this, to add a check
> before every __skb_pull(skb_network_offset()), to fix __skb_pull() to
> work with signed ints, to add BUG_ON()'s in __skb_pull, or something
> else.
> 
> What I meant is that you fixed this very instance of the bug, and I'm
> pointing out that a similar one might hit us again.

As I said, adding a check in skb_network_offset() would not be generic
enough.

Sure, we can be proactive and add tests everywhere in the kernel, but we
also want to keep it reasonably fast.

^ permalink raw reply

* [PATCH net 2/2] esp6: Fix integrity verification when ESN are used
From: Tobias Brunner @ 2016-11-29 16:05 UTC (permalink / raw)
  To: David S. Miller, Herbert Xu; +Cc: Steffen Klassert, netdev

When handling inbound packets, the two halves of the sequence number
stored on the skb are already in network order.
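
As a rough user-space illustration of why the extra htonl() matters (the
helper names here are invented for the example and are not part of the
patch): converting a value that is already in network order a second time
swaps its bytes back on little-endian hosts, so the bytes placed in the
header no longer match the high sequence number the sender used.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Read the bytes of a 32-bit object from memory in network (big-endian)
 * order, i.e. the value a peer would see if they went out on the wire. */
static uint32_t wire_value(uint32_t be32)
{
	unsigned char b[4];

	memcpy(b, &be32, sizeof(b));
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Returns 1 on little-endian hosts, 0 on big-endian hosts. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	unsigned char first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

static void esn_demo(void)
{
	uint32_t seq_hi = 1;			/* host-order value */
	uint32_t stored = htonl(seq_hi);	/* already network order */

	/* Copying the stored value unchanged keeps the wire bytes right. */
	assert(wire_value(stored) == seq_hi);

	/* A second htonl() swaps the bytes back on little-endian hosts. */
	if (host_is_little_endian())
		assert(wire_value(htonl(stored)) == 0x01000000u);
	else
		assert(wire_value(htonl(stored)) == seq_hi);
}
```

On big-endian hosts htonl() is a no-op, so the double conversion goes
unnoticed there.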

Fixes: 000ae7b2690e ("esp6: Switch to new AEAD interface")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
---
 net/ipv6/esp6.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 060a60b2f8a6..111ba55fd512 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -418,7 +418,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 		esph = (void *)skb_push(skb, 4);
 		*seqhi = esph->spi;
 		esph->spi = esph->seq_no;
-		esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.input.hi);
+		esph->seq_no = XFRM_SKB_CB(skb)->seq.input.hi;
 		aead_request_set_callback(req, 0, esp_input_done_esn, skb);
 	}

-- 
1.9.1

^ permalink raw reply related

* [PATCH net 1/2] esp4: Fix integrity verification when ESN are used
From: Tobias Brunner @ 2016-11-29 16:05 UTC (permalink / raw)
  To: David S. Miller, Herbert Xu; +Cc: Steffen Klassert, netdev

When handling inbound packets, the two halves of the sequence number
stored on the skb are already in network order.

Fixes: 7021b2e1cddd ("esp4: Switch to new AEAD interface")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
---
 net/ipv4/esp4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index d95631d09248..20fb25e3027b 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -476,7 +476,7 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 		esph = (void *)skb_push(skb, 4);
 		*seqhi = esph->spi;
 		esph->spi = esph->seq_no;
-		esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.input.hi);
+		esph->seq_no = XFRM_SKB_CB(skb)->seq.input.hi;
 		aead_request_set_callback(req, 0, esp_input_done_esn, skb);
 	}

-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next] samples/bpf: fix include path
From: David Ahern @ 2016-11-29 15:55 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, David S . Miller; +Cc: netdev
In-Reply-To: <583D2187.7060405@iogearbox.net>

On 11/28/16 11:34 PM, Daniel Borkmann wrote:
> On 11/29/2016 07:07 AM, Alexei Starovoitov wrote:
>> Fix the following build error:
>> HOSTCC  samples/bpf/test_lru_dist.o
>> ../samples/bpf/test_lru_dist.c:25:22: fatal error: bpf_util.h: No such file or directory
>>
>> This is due to objtree != srctree.
>> Use srctree, since that's where bpf_util.h is located.
>>
>> Fixes: e00c7b216f34 ("bpf: fix multiple issues in selftest suite and samples")
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> 
> Didn't run into this so far, thanks for fixing!
> 
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>

I have, and this fixed it for me.

Tested-by: David Ahern <dsa@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH v2 08/13] net: ethernet: ti: cpts: move dt props parsing to cpts driver
From: Grygorii Strashko @ 2016-11-29 15:54 UTC (permalink / raw)
  To: Richard Cochran
  Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
	linux-omap, Rob Herring, devicetree, Murali Karicheri,
	Wingman Kwok
In-Reply-To: <20161129101112.GG3110@localhost.localdomain>



On 11/29/2016 04:11 AM, Richard Cochran wrote:
> On Mon, Nov 28, 2016 at 05:03:32PM -0600, Grygorii Strashko wrote:
>> +static int cpts_of_parse(struct cpts *cpts, struct device_node *node)
>> +{
>> +	int ret = -EINVAL;
>> +	u32 prop;
>> +
>> +	if (of_property_read_u32(node, "cpts_clock_mult", &prop))
>> +		goto  of_error;
>> +	cpts->cc_mult = prop;
> 
> Why not set cc.mult here at the same time?

The same reason as in the previous patch: cpts->cc_mult is the original/initial
mult value loaded from DT (or calculated), while cc.mult is a dynamic value
that can be changed as part of frequency adjustment.
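
A simplified user-space sketch of that split, assuming the adjustment is
expressed in ppb and always computed relative to the saved initial value.
The struct layouts and function names below are made up for illustration
and are not the driver code; the mult/shift numbers are example values.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-ins for the kernel structures; illustration only. */
struct cyclecounter { uint32_t mult; uint32_t shift; };

struct cpts {
	struct cyclecounter cc;	/* cc.mult is the live, adjusted value */
	uint32_t cc_mult;	/* initial value, e.g. from "cpts_clock_mult" */
};

/* Hypothetical: DT parsing stores only the initial multiplier. */
static void cpts_store_dt_mult(struct cpts *cpts, uint32_t dt_mult)
{
	cpts->cc_mult = dt_mult;
}

/* Hypothetical: (re)start the counter from the initial multiplier. */
static void cpts_start(struct cpts *cpts)
{
	cpts->cc.mult = cpts->cc_mult;
}

/* Hypothetical: apply a ppb adjustment, always relative to cc_mult. */
static void cpts_adjust(struct cpts *cpts, int32_t ppb)
{
	int neg = ppb < 0;
	uint64_t mag = neg ? (uint64_t)-(int64_t)ppb : (uint64_t)ppb;
	uint32_t diff = (uint32_t)((uint64_t)cpts->cc_mult * mag /
				   1000000000ULL);

	cpts->cc.mult = neg ? cpts->cc_mult - diff : cpts->cc_mult + diff;
}

static void cpts_demo(void)
{
	struct cpts cpts = { { 0, 29 }, 0 };

	cpts_store_dt_mult(&cpts, 0x80000000u);
	cpts_start(&cpts);

	cpts_adjust(&cpts, 1000);	/* speed up by 1000 ppb */
	assert(cpts.cc.mult == 0x80000000u + 2147);

	cpts_adjust(&cpts, -1000);	/* slow down, not cumulative */
	assert(cpts.cc.mult == 0x80000000u - 2147);

	/* The initial value is never touched by adjustments. */
	assert(cpts.cc_mult == 0x80000000u);
}
```

If cc.mult were the only copy, each adjustment would be applied on top of
the previous one and the original DT value would be lost.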

> 
>> +
>> +	if (of_property_read_u32(node, "cpts_clock_shift", &prop))
>> +		goto  of_error;
>> +	cpts->cc.shift = prop;
>> +
>> +	return 0;
>> +
>> +of_error:
>> +	dev_err(cpts->dev, "CPTS: Missing property in the DT.\n");
>> +	return ret;
>> +}
> 


-- 
regards,
-grygorii

^ permalink raw reply

* [PATCH net-next v5 3/3] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: David Ahern @ 2016-11-29 15:53 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480434813-3141-1-git-send-email-dsa@cumulusnetworks.com>

Add a simple program to demonstrate the ability to attach a bpf program
to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
are created.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v5
- changed BPF_CGROUP_INET_SOCK to BPF_CGROUP_INET_SOCK_CREATE

v4
- added test_cgrp2_sock.sh for an automated test

v3
- revert to BPF_PROG_TYPE_CGROUP_SOCK prog type

v2
- removed bpf_sock_store_u32 references
- changed BPF_CGROUP_INET_SOCK_CREATE to BPF_CGROUP_INET_SOCK
- remove BPF_PROG_TYPE_CGROUP_SOCK prog type and add prog_subtype

 samples/bpf/Makefile           |  2 +
 samples/bpf/test_cgrp2_sock.c  | 83 ++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock.sh | 48 ++++++++++++++++++++++++
 3 files changed, 133 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_sock.c
 create mode 100755 samples/bpf/test_cgrp2_sock.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 22b6407efa4f..3a404dd4bb46 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -23,6 +23,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += test_cgrp2_attach
+hostprogs-y += test_cgrp2_sock
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -51,6 +52,7 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
+test_cgrp2_sock-objs := libbpf.o test_cgrp2_sock.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
new file mode 100644
index 000000000000..2831e5f41f86
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -0,0 +1,83 @@
+/* eBPF example program:
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
+ *   sockets opened by processes in the cgroup.
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+static int prog_load(int idx)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_MOV64_IMM(BPF_REG_3, idx),
+		BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
+			     "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+	printf("Usage: %s cg-path device-index\n", argv0);
+	return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+	int cg_fd, prog_fd, ret;
+	int idx = 0;
+
+	if (argc < 3)
+		return usage(argv[0]);
+
+	idx = atoi(argv[2]);
+	if (!idx) {
+		printf("Invalid device index\n");
+		return EXIT_FAILURE;
+	}
+
+	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	prog_fd = prog_load(idx);
+	printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+	if (prog_fd < 0) {
+		printf("Failed to load prog: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = bpf_prog_detach(cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
+	ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
+	if (ret < 0) {
+		printf("Failed to attach prog to cgroup: '%s'\n",
+		       strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	return EXIT_SUCCESS;
+}
diff --git a/samples/bpf/test_cgrp2_sock.sh b/samples/bpf/test_cgrp2_sock.sh
new file mode 100755
index 000000000000..35ce9e1566d8
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.sh
@@ -0,0 +1,48 @@
+#!/bin/bash
+
+function config_device {
+	ip netns add at_ns0
+	ip link add veth0 type veth peer name veth0b
+	ip link set veth0b up
+	ip link set veth0 netns at_ns0
+	ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
+	ip netns exec at_ns0 ip addr add 2401:db00::1/64 dev veth0 nodad
+	ip netns exec at_ns0 ip link set dev veth0 up
+	ip link add foo type vrf table 1234
+	ip link set foo up
+	ip addr add 172.16.1.101/24 dev veth0b
+	ip addr add 2401:db00::2/64 dev veth0b nodad
+	ip link set veth0b master foo
+}
+
+function attach_bpf {
+	idx=$(ip -o li sh dev foo | awk -F':' '{print $1}')
+	rm -rf /tmp/cgroupv2
+	mkdir -p /tmp/cgroupv2
+	mount -t cgroup2 none /tmp/cgroupv2
+	mkdir -p /tmp/cgroupv2/foo
+	test_cgrp2_sock /tmp/cgroupv2/foo $idx
+	echo $$ >> /tmp/cgroupv2/foo/cgroup.procs
+}
+
+function cleanup {
+	set +ex
+	ip netns delete at_ns0
+	ip link del veth0
+	ip link del foo
+	umount /tmp/cgroupv2
+	rm -rf /tmp/cgroupv2
+	set -ex
+}
+
+function do_test {
+	ping -c1 -w1 172.16.1.100
+	ping6 -c1 -w1 2401:db00::1
+}
+
+cleanup 2>/dev/null
+config_device
+attach_bpf
+do_test
+cleanup
+echo "*** PASS ***"
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v5 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: David Ahern @ 2016-11-29 15:53 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480434813-3141-1-git-send-email-dsa@cumulusnetworks.com>

Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB, programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v5
- no change

v4
- dropped tweak to bpf_func signature
- dropped cg_sock_func_proto in favor of sk_filter_func_proto
- new __cgroup_bpf_run_filter_sk versus overloading __cgroup_bpf_run_filter
- reverted BPF_CGROUP_INET_SOCK to BPF_CGROUP_INET_SOCK_CREATE

v3
- reverted to new prog type BPF_PROG_TYPE_CGROUP_SOCK
- dropped the subtype

v2
- dropped the bpf_sock_store_u32 helper
- dropped the new prog type BPF_PROG_TYPE_CGROUP_SOCK
- moved valid access and context conversion to use subtype
- dropped CREATE from BPF_CGROUP_INET_SOCK and related function names
- moved running of filter from sk_alloc to inet{6}_create

 include/linux/bpf-cgroup.h | 14 +++++++++++
 include/uapi/linux/bpf.h   |  6 +++++
 kernel/bpf/cgroup.c        | 33 ++++++++++++++++++++++++++
 kernel/bpf/syscall.c       |  5 +++-
 net/core/filter.c          | 59 ++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         | 12 +++++++++-
 net/ipv6/af_inet6.c        |  8 +++++++
 7 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 7f0fc635b13e..7de376e37c5c 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -41,6 +41,9 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 				struct sk_buff *skb,
 				enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sk(struct sock *sk,
+			       enum bpf_attach_type type);
+
 /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			      \
 ({									      \
@@ -64,6 +67,16 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled && sk) {					       \
+		__ret = __cgroup_bpf_run_filter_sk(sk,			       \
+						 BPF_CGROUP_INET_SOCK_CREATE); \
+	}								       \
+	__ret;								       \
+})
+
 #else
 
 struct cgroup_bpf {};
@@ -73,6 +86,7 @@ static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
 
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 
 #endif /* CONFIG_CGROUP_BPF */
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d1456f..75964e00d947 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,11 +101,13 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
 	BPF_PROG_TYPE_CGROUP_SKB,
+	BPF_PROG_TYPE_CGROUP_SOCK,
 };
 
 enum bpf_attach_type {
 	BPF_CGROUP_INET_INGRESS,
 	BPF_CGROUP_INET_EGRESS,
+	BPF_CGROUP_INET_SOCK_CREATE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -537,6 +539,10 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+struct bpf_sock {
+	__u32 bound_dev_if;
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 19892973a78a..fe1c9ad03a36 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -165,3 +165,36 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	return ret;
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
+
+/**
+ * __cgroup_bpf_run_filter_sk() - Run a program on a sock
+ * @sk: sock structure to manipulate
+ * @type: The type of program to be executed
+ *
+ * The socket passed is expected to be of type INET or INET6.
+ *
+ * The program type passed in via @type must be suitable for sock
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if an attached program was found
+ * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter_sk(struct sock *sk,
+			       enum bpf_attach_type type)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_prog *prog;
+	int ret = 0;
+
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, sk) == 1 ? 0 : -EPERM;
+
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5518a6839ab1..85af86c496cd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -869,7 +869,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_EGRESS:
 		ptype = BPF_PROG_TYPE_CGROUP_SKB;
 		break;
-
+	case BPF_CGROUP_INET_SOCK_CREATE:
+		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -905,6 +907,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	switch (attr->attach_type) {
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
+	case BPF_CGROUP_INET_SOCK_CREATE:
 		cgrp = cgroup_get_from_fd(attr->target_fd);
 		if (IS_ERR(cgrp))
 			return PTR_ERR(cgrp);
diff --git a/net/core/filter.c b/net/core/filter.c
index 698a262b8ebb..404aaa1bfa1f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2676,6 +2676,29 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool sock_filter_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					enum bpf_reg_type *reg_type)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sock, bound_dev_if):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	if (off < 0 || off + size > sizeof(struct bpf_sock))
+		return false;
+
+	/* The verifier guarantees that size > 0. */
+	if (off % size != 0)
+		return false;
+
+	return true;
+}
+
 static int tc_cls_act_prologue(struct bpf_insn *insn_buf, bool direct_write,
 			       const struct bpf_prog *prog)
 {
@@ -2934,6 +2957,30 @@ static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
+					  int dst_reg, int src_reg,
+					  int ctx_off,
+					  struct bpf_insn *insn_buf,
+					  struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct bpf_sock, bound_dev_if):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_bound_dev_if) != 4);
+
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_W, dst_reg, src_reg,
+					offsetof(struct sock, sk_bound_dev_if));
+		else
+			*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sock, sk_bound_dev_if));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 					 int src_reg, int ctx_off,
 					 struct bpf_insn *insn_buf,
@@ -3007,6 +3054,12 @@ static const struct bpf_verifier_ops cg_skb_ops = {
 	.convert_ctx_access	= sk_filter_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_sock_ops = {
+	.get_func_proto		= sk_filter_func_proto,
+	.is_valid_access	= sock_filter_is_valid_access,
+	.convert_ctx_access	= sock_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -3032,6 +3085,11 @@ static struct bpf_prog_type_list cg_skb_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+static struct bpf_prog_type_list cg_sock_type __read_mostly = {
+	.ops	= &cg_sock_ops,
+	.type	= BPF_PROG_TYPE_CGROUP_SOCK
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
@@ -3039,6 +3097,7 @@ static int __init register_sk_filter_ops(void)
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
 	bpf_register_prog_type(&cg_skb_type);
+	bpf_register_prog_type(&cg_sock_type);
 
 	return 0;
 }
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5ddf5cda07f4..24d2550492ee 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -374,8 +374,18 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
 
 	if (sk->sk_prot->init) {
 		err = sk->sk_prot->init(sk);
-		if (err)
+		if (err) {
+			sk_common_release(sk);
+			goto out;
+		}
+	}
+
+	if (!kern) {
+		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+		if (err) {
 			sk_common_release(sk);
+			goto out;
+		}
 	}
 out:
 	return err;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d424f3a3737a..237e654ba717 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -258,6 +258,14 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 			goto out;
 		}
 	}
+
+	if (!kern) {
+		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+		if (err) {
+			sk_common_release(sk);
+			goto out;
+		}
+	}
 out:
 	return err;
 out_rcu_unlock:
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v5 1/3] bpf: Refactor cgroups code in prep for new type
From: David Ahern @ 2016-11-29 15:53 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480434813-3141-1-git-send-email-dsa@cumulusnetworks.com>

Code move and rename only; no functional change intended.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v5
- no change

v4
- dropped refactor of __cgroup_bpf_run_filter and renamed it
  to __cgroup_bpf_run_filter_skb

v3
- dropped the rename

v2
- fix bpf_prog_run_clear_cb to bpf_prog_run_save_cb as caught by Daniel

- rename BPF_PROG_TYPE_CGROUP_SKB and its cg_skb functions to
  BPF_PROG_TYPE_CGROUP and cgroup

 include/linux/bpf-cgroup.h | 46 +++++++++++++++++++++++-----------------------
 kernel/bpf/cgroup.c        | 10 +++++-----
 kernel/bpf/syscall.c       | 28 +++++++++++++++-------------
 3 files changed, 43 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index ec80d0c0953e..7f0fc635b13e 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -37,31 +37,31 @@ void cgroup_bpf_update(struct cgroup *cgrp,
 		       struct bpf_prog *prog,
 		       enum bpf_attach_type type);
 
-int __cgroup_bpf_run_filter(struct sock *sk,
-			    struct sk_buff *skb,
-			    enum bpf_attach_type type);
-
-/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
-#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb)			\
-({									\
-	int __ret = 0;							\
-	if (cgroup_bpf_enabled)						\
-		__ret = __cgroup_bpf_run_filter(sk, skb,		\
-						BPF_CGROUP_INET_INGRESS); \
-									\
-	__ret;								\
+int __cgroup_bpf_run_filter_skb(struct sock *sk,
+				struct sk_buff *skb,
+				enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			      \
+({									      \
+	int __ret = 0;							      \
+	if (cgroup_bpf_enabled)						      \
+		__ret = __cgroup_bpf_run_filter_skb(sk, skb,		      \
+						    BPF_CGROUP_INET_INGRESS); \
+									      \
+	__ret;								      \
 })
 
-#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb)				\
-({									\
-	int __ret = 0;							\
-	if (cgroup_bpf_enabled && sk && sk == skb->sk) {		\
-		typeof(sk) __sk = sk_to_full_sk(sk);			\
-		if (sk_fullsock(__sk))					\
-			__ret = __cgroup_bpf_run_filter(__sk, skb,	\
-						BPF_CGROUP_INET_EGRESS); \
-	}								\
-	__ret;								\
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled && sk && sk == skb->sk) {		       \
+		typeof(sk) __sk = sk_to_full_sk(sk);			       \
+		if (sk_fullsock(__sk))					       \
+			__ret = __cgroup_bpf_run_filter_skb(__sk, skb,	       \
+						      BPF_CGROUP_INET_EGRESS); \
+	}								       \
+	__ret;								       \
 })
 
 #else
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f264b0..19892973a78a 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -118,7 +118,7 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
 }
 
 /**
- * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * __cgroup_bpf_run_filter_skb() - Run a program for packet filtering
  * @sk: The socken sending or receiving traffic
  * @skb: The skb that is being sent or received
  * @type: The type of program to be exectuted
@@ -132,9 +132,9 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
  * This function will return %-EPERM if any if an attached program was found
  * and if it returned != 1 during execution. In all other cases, 0 is returned.
  */
-int __cgroup_bpf_run_filter(struct sock *sk,
-			    struct sk_buff *skb,
-			    enum bpf_attach_type type)
+int __cgroup_bpf_run_filter_skb(struct sock *sk,
+				struct sk_buff *skb,
+				enum bpf_attach_type type)
 {
 	struct bpf_prog *prog;
 	struct cgroup *cgrp;
@@ -164,4 +164,4 @@ int __cgroup_bpf_run_filter(struct sock *sk,
 
 	return ret;
 }
-EXPORT_SYMBOL(__cgroup_bpf_run_filter);
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4caa18e6860a..5518a6839ab1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -856,6 +856,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 {
 	struct bpf_prog *prog;
 	struct cgroup *cgrp;
+	enum bpf_prog_type ptype;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -866,25 +867,26 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	switch (attr->attach_type) {
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
-		prog = bpf_prog_get_type(attr->attach_bpf_fd,
-					 BPF_PROG_TYPE_CGROUP_SKB);
-		if (IS_ERR(prog))
-			return PTR_ERR(prog);
-
-		cgrp = cgroup_get_from_fd(attr->target_fd);
-		if (IS_ERR(cgrp)) {
-			bpf_prog_put(prog);
-			return PTR_ERR(cgrp);
-		}
-
-		cgroup_bpf_update(cgrp, prog, attr->attach_type);
-		cgroup_put(cgrp);
+		ptype = BPF_PROG_TYPE_CGROUP_SKB;
 		break;
 
 	default:
 		return -EINVAL;
 	}
 
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	cgrp = cgroup_get_from_fd(attr->target_fd);
+	if (IS_ERR(cgrp)) {
+		bpf_prog_put(prog);
+		return PTR_ERR(cgrp);
+	}
+
+	cgroup_bpf_update(cgrp, prog, attr->attach_type);
+	cgroup_put(cgrp);
+
 	return 0;
 }
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v5 0/3] net: Add bpf support to set sk_bound_dev_if
From: David Ahern @ 2016-11-29 15:53 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern

The recently added VRF support in Linux leverages the bind-to-device
API for programs to specify an L3 domain for a socket. While
SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
program has support for it. Even for those programs that do support it,
the API requires processes to be started as root (CAP_NET_RAW), which
is not desirable from a general security perspective.
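
For reference, this is roughly what a program has to do on its own today to
get the same effect. It is only a sketch: the helper name bind_to_device()
is made up for this example, while setsockopt() with SO_BINDTODEVICE is the
existing socket API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Bind the socket fd to the device named ifname using SO_BINDTODEVICE.
 * Returns 0 on success or a negative errno value on failure (for
 * example -EPERM when the caller lacks the required privilege).
 */
static int bind_to_device(int fd, const char *ifname)
{
	assert(ifname != NULL);

	if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
		       ifname, (socklen_t)strlen(ifname) + 1) < 0)
		return -errno;

	return 0;
}
```

A program would call something like bind_to_device(fd, "mgmt") right after
socket(), and every program would need to carry such a call.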

This patch set leverages Daniel Mack's work to attach bpf programs to
a cgroup to provide a capability to set sk_bound_dev_if for all
AF_INET{6} sockets opened by a process in a cgroup when the sockets
are allocated.

For example:
 1. configure vrf (e.g., using ifupdown2)
        auto eth0
        iface eth0 inet dhcp
            vrf mgmt

        auto mgmt
        iface mgmt
            vrf-table auto

 2. configure cgroup
        mount -t cgroup2 none /tmp/cgroupv2
        mkdir /tmp/cgroupv2/mgmt
        test_cgrp2_sock /tmp/cgroupv2/mgmt 15

 3. set shell into cgroup (e.g., can be done at login using pam)
        echo $$ >> /tmp/cgroupv2/mgmt/cgroup.procs

At this point all commands run in the shell (e.g., apt) have sockets
automatically bound to the VRF (see output of ss -ap 'dev == <vrf>'),
including processes not running as root.

This capability enables running any program in a VRF context and is key
to deploying Management VRF, a fundamental configuration for networking
gear, with any Linux OS installation.

David Ahern (3):
  bpf: Refactor cgroups code in prep for new type
  bpf: Add new cgroup attach type to enable sock modifications
  samples: bpf: add userspace example for modifying sk_bound_dev_if

 include/linux/bpf-cgroup.h     | 60 ++++++++++++++++++------------
 include/uapi/linux/bpf.h       |  6 +++
 kernel/bpf/cgroup.c            | 43 +++++++++++++++++++---
 kernel/bpf/syscall.c           | 33 ++++++++++-------
 net/core/filter.c              | 59 ++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c             | 12 +++++-
 net/ipv6/af_inet6.c            |  8 ++++
 samples/bpf/Makefile           |  2 +
 samples/bpf/test_cgrp2_sock.c  | 83 ++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock.sh | 48 ++++++++++++++++++++++++
 10 files changed, 311 insertions(+), 43 deletions(-)
 create mode 100644 samples/bpf/test_cgrp2_sock.c
 create mode 100755 samples/bpf/test_cgrp2_sock.sh

-- 
2.1.4

^ permalink raw reply

* Re: [PATCH net-next v4 3/3] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: David Ahern @ 2016-11-29 15:52 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480394329-24847-4-git-send-email-dsa@cumulusnetworks.com>

On 11/28/16 9:38 PM, David Ahern wrote:
> +	ret = bpf_prog_detach(cg_fd, BPF_CGROUP_INET_SOCK);
> +	ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK);

forgot to update this to BPF_CGROUP_INET_SOCK_CREATE.

will send v5

^ permalink raw reply

* [PATCH iproute2 net-next 4/4] tc: flower: make use of flower_port_attr_type() safe and silent
From: Simon Horman @ 2016-11-29 15:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Simon Horman
In-Reply-To: <1480434693-29397-1-git-send-email-simon.horman@netronome.com>

Make use of flower_port_attr_type() safe:
* flower_port_attr_type() may return a valid index into tb[] or -1.
  Only access tb[] in the case of the former.
* Do not access null entries in tb[]

Also make usage silent - it is valid for ip_proto to be invalid,
for example if it is not specified as part of the filter.

Fixes: ("tc: flower: Support matching on SCTP ports")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 tc/f_flower.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/tc/f_flower.c b/tc/f_flower.c
index bca910d95729..21655f950386 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -159,19 +159,17 @@ static int flower_parse_ip_addr(char *str, __be16 eth_type,
 
 static int flower_port_attr_type(__u8 ip_proto, bool is_src)
 {
-	if (ip_proto == IPPROTO_TCP) {
+	if (ip_proto == IPPROTO_TCP)
 		return is_src ? TCA_FLOWER_KEY_TCP_SRC :
 			TCA_FLOWER_KEY_TCP_DST;
-	} else if (ip_proto == IPPROTO_UDP) {
+	else if (ip_proto == IPPROTO_UDP)
 		return is_src ? TCA_FLOWER_KEY_UDP_SRC :
 			TCA_FLOWER_KEY_UDP_DST;
-	} else if (ip_proto == IPPROTO_SCTP) {
+	else if (ip_proto == IPPROTO_SCTP)
 		return is_src ? TCA_FLOWER_KEY_SCTP_SRC :
 			TCA_FLOWER_KEY_SCTP_DST;
-	} else {
-		fprintf(stderr, "Illegal \"ip_proto\" for port\n");
+	else
 		return -1;
-	}
 }
 
 static int flower_parse_port(char *str, __u8 ip_proto, bool is_src,
@@ -505,7 +503,8 @@ static void flower_print_ip_addr(FILE *f, char *name, __be16 eth_type,
 
 static void flower_print_port(FILE *f, char *name, struct rtattr *attr)
 {
-	fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
+	if (attr)
+		fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
 }
 
 static int flower_print_opt(struct filter_util *qu, FILE *f,
@@ -514,6 +513,7 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
 	struct rtattr *tb[TCA_FLOWER_MAX + 1];
 	__be16 eth_type = 0;
 	__u8 ip_proto = 0xff;
+	int nl_type;
 
 	if (!opt)
 		return 0;
@@ -568,10 +568,12 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
 			     tb[TCA_FLOWER_KEY_IPV6_SRC],
 			     tb[TCA_FLOWER_KEY_IPV6_SRC_MASK]);
 
-	flower_print_port(f, "dst_port",
-			  tb[flower_port_attr_type(ip_proto, false)]);
-	flower_print_port(f, "src_port",
-			  tb[flower_port_attr_type(ip_proto, true)]);
+	nl_type = flower_port_attr_type(ip_proto, false);
+	if (nl_type >= 0)
+		flower_print_port(f, "dst_port", tb[nl_type]);
+	nl_type = flower_port_attr_type(ip_proto, true);
+	if (nl_type >= 0)
+		flower_print_port(f, "src_port", tb[nl_type]);
 
 	if (tb[TCA_FLOWER_FLAGS])  {
 		__u32 flags = rta_getattr_u32(tb[TCA_FLOWER_FLAGS]);
-- 
2.7.0.rc3.207.g0ac5344

^ permalink raw reply related
