* [PATCH net-next 1/3] sfc: fix debug message format string in efx_farch_handle_rx_not_ok
From: Edward Cree @ 2016-12-01 17:00 UTC (permalink / raw)
To: linux-net-drivers, davem; +Cc: bkenward, netdev
In-Reply-To: <c52a0276-e379-7841-8d10-d5a834b81c4e@solarflare.com>
Defalconisation removed one of the string arguments, but missed the
corresponding %s.
Fixes: 5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/farch.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index c282d66..91aa3ec 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -913,7 +913,7 @@ static u16 efx_farch_handle_rx_not_ok(struct efx_rx_queue *rx_queue,
if (rx_ev_other_err && net_ratelimit()) {
netif_dbg(efx, rx_err, efx->net_dev,
" RX queue %d unexpected RX event "
- EFX_QWORD_FMT "%s%s%s%s%s%s%s%s\n",
+ EFX_QWORD_FMT "%s%s%s%s%s%s%s\n",
efx_rx_queue_index(rx_queue), EFX_QWORD_VAL(*event),
rx_ev_buf_owner_id_err ? " [OWNER_ID_ERR]" : "",
rx_ev_ip_hdr_chksum_err ?
^ permalink raw reply related
* Re: [PATCH v7 net-next 0/6] net: Add bpf support for sockets
From: Alexei Starovoitov @ 2016-12-01 16:59 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480610888-31082-1-git-send-email-dsa@cumulusnetworks.com>
On Thu, Dec 01, 2016 at 08:48:02AM -0800, David Ahern wrote:
> The recently added VRF support in Linux leverages the bind-to-device
> API for programs to specify an L3 domain for a socket. While
> SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
> program has support for it. Even for those programs that do support it,
> the API requires processes to be started as root (CAP_NET_RAW) which
> is not desirable from a general security perspective.
>
> This patch set leverages Daniel Mack's work to attach bpf programs to
> a cgroup to provide a capability to set sk_bound_dev_if for all
> AF_INET{6} sockets opened by a process in a cgroup when the sockets
> are allocated.
>
> For example:
> 1. configure vrf (e.g., using ifupdown2)
> auto eth0
> iface eth0 inet dhcp
> vrf mgmt
>
> auto mgmt
> iface mgmt
> vrf-table auto
>
> 2. configure cgroup
> mount -t cgroup2 none /tmp/cgroupv2
> mkdir /tmp/cgroupv2/mgmt
> test_cgrp2_sock /tmp/cgroupv2/mgmt 15
>
> 3. set shell into cgroup (e.g., can be done at login using pam)
> echo $$ >> /tmp/cgroupv2/mgmt/cgroup.procs
>
> At this point all commands run in the shell (e.g, apt) have sockets
> automatically bound to the VRF (see output of ss -ap 'dev == <vrf>'),
> including processes not running as root.
>
> This capability enables running any program in a VRF context and is key
> to deploying Management VRF, a fundamental configuration for networking
> gear, with any Linux OS installation.
>
> This patchset also exports the socket family, type and protocol as
> read-only allowing bpf filters to deny a process in a cgroup the ability
> to open specific types of AF_INET or AF_INET6 sockets.
>
> v7
> - comments from Alexei
Looks great.
In case you need to change something. Please keep my Acks
on patches that were kept as-is.
Thanks
^ permalink raw reply
* [PATCH net-next 0/3] sfc: defalconisation fixups
From: Edward Cree @ 2016-12-01 16:59 UTC (permalink / raw)
To: linux-net-drivers, davem; +Cc: bkenward, netdev
A bug fix, the Kconfig change, and cleaning up a bit more unused code.
Edward Cree (3):
sfc: fix debug message format string in efx_farch_handle_rx_not_ok
sfc: don't select SFC_FALCON
sfc: remove RESET_TYPE_RX_RECOVERY
drivers/net/ethernet/sfc/Kconfig | 1 -
drivers/net/ethernet/sfc/efx.c | 1 -
drivers/net/ethernet/sfc/enum.h | 5 +----
drivers/net/ethernet/sfc/farch.c | 2 +-
4 files changed, 2 insertions(+), 7 deletions(-)
^ permalink raw reply
* RE: [PATCH V2 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: Salil Mehta @ 2016-12-01 16:59 UTC (permalink / raw)
To: David Miller
Cc: Zhuangyuzeng (Yisen), mehta.salil.lnk@gmail.com,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm
In-Reply-To: <20161130.142539.1927956259851457047.davem@davemloft.net>
> -----Original Message-----
> From: Salil Mehta
> Sent: Thursday, December 01, 2016 12:09 PM
> To: 'David Miller'
> Cc: Zhuangyuzeng (Yisen); mehta.salil.lnk@gmail.com;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm
> Subject: RE: [PATCH V2 net-next] net: hns: Fix to conditionally convey
> RX checksum flag to stack
>
> > -----Original Message-----
> > From: David Miller [mailto:davem@davemloft.net]
> > Sent: Wednesday, November 30, 2016 7:26 PM
> > To: Salil Mehta
> > Cc: Zhuangyuzeng (Yisen); mehta.salil.lnk@gmail.com;
> > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm
> > Subject: Re: [PATCH V2 net-next] net: hns: Fix to conditionally
> convey
> > RX checksum flag to stack
> >
> > From: Salil Mehta <salil.mehta@huawei.com>
> > Date: Tue, 29 Nov 2016 13:09:45 +0000
> >
> > > + /* We only support checksum for IPv4,UDP(over IPv4 or IPv6),
> > TCP(over
> > > + * IPv4 or IPv6) and SCTP but we support many L3(IPv4, IPv6,
> > MPLS,
> > > + * PPPoE etc) and L4(TCP, UDP, GRE, SCTP, IGMP, ICMP etc.)
> > protocols.
> > > + * We want to filter out L3 and L4 protocols early on for which
> > checksum
> > > + * is not supported.
> > ...
> > > + */
> > > + l3id = hnae_get_field(flag, HNS_RXD_L3ID_M, HNS_RXD_L3ID_S);
> > > + l4id = hnae_get_field(flag, HNS_RXD_L4ID_M, HNS_RXD_L4ID_S);
> > > + if ((l3id != HNS_RX_FLAG_L3ID_IPV4) &&
> > > + ((l3id != HNS_RX_FLAG_L3ID_IPV6) ||
> > > + (l4id != HNS_RX_FLAG_L4ID_UDP)) &&
> > > + ((l3id != HNS_RX_FLAG_L3ID_IPV6) ||
> > > + (l4id != HNS_RX_FLAG_L4ID_TCP)) &&
> > > + (l4id != HNS_RX_FLAG_L4ID_SCTP))
> > > + return;
> >
> > I have a hard time understanding this seemingly overcomplicated
> > check.
> >
> > It looks like if L3 is IPV4 it will accept any underlying L4
> protocol,
> > but is that what is really intended? That doesn't match what this
> new
> > comment states.
> I agree that it is bit difficult to read. Earlier, I was banking on the
> register(mistakenly, its hardware implementation err ) to de-multiplex
> the checksum error type. The register supported indication of just IPv4
> Header Checksum Error as well (which meant it could carry any L4
> protocol).
> IPv6 does not have similar Header checksum error. Therefore, to check
> if it is just IPv4 Header checksum error or any supported L4 transport
> (UDP or TCP) error over IPv4 or IPv6 I had to bank upon above complex
> check
>
> Below suggested solution check would have been insufficient for
> example, if packet had IPv4/IGMP and there was a checksum error in IPv4
> header.
>
> Comment states:
> " We only support checksum for IPv4, UDP(over IPv4 or IPv6),
> TCP(over IPv4 or IPv6) and SCTP"
> 1) Checksum of IPv4 (IPv4 header)
> 2) UDP(over IPv4 or IPv6)
> 3) TCP(over IPv4 or IPv6)
> 4) SCTP (over IPv4 or IPv6)*
>
> (*) I should have put IPv4/IPv6 check for SCTP in the code
> and made it clear in the comment as well?
>
> >
> > My understanding is that the chip supports checksums for:
> >
> > UDP/IPV4
> > UDP/IPV6
> > TCP/IPV4
> > TCP/IPV6
> > SCTP/IPV4
> > SCTP/IPV6
>
> Hardware also supports checksum of IPv4 Header.
>
> >
> > So the simplest thing is to validate each level one at a time:
> >
> > if (l3 != IPV4 && l3 != IPV6)
> > return;
> > if (l4 != UDP && l4 != TCP && l4 != SCTP)
> > return;
> I guess above check will fail to catch cases like IPv4/IGMP, when there
> is a bad checksum in The IPv4 header.
>
> But maybe now since we don't have any method to de-multiplex the kind
> of
> checksum error (cannot depend upon register) we can have below code
> re-arrangement:
>
> hns_nic_rx_checksum() {
> /* check supported L3 protocol */
> if (l3 != IPV4 && l3 != IPV6)
> return;
> /* check if L3 protocols error */
> if (l3e)
> return;
>
> /* check if the packets are fragmented */
> If (l3frags)
> Return;
>
> /* check supported L4 protocol */
> if (l4 != UDP && l4 != TCP && l4 != SCTP)
> return;
Hi David,
This logic might not do as well since IGMP packet with valid IP
Checksum will be bypassed here. We would want to set
CHECKSUM_UNNECESSARY for such IP packets as well.
It looks to me the cumbersome check in the PATCH V2 should
be retained.
> /* check if any L4 protocol error */
> if (l3e)
> return;
>
> /* packet with valid checksum - covey to stack */
> skb->ip_summed = CHECKSUM_UNNECESSARY
> }
>
> Hope I am not missing something here. Please correct my understanding
> if there is any gap here. Thanks!
>
> Best regards
> Salil
^ permalink raw reply
* Re: [PATCH v7 net-next 6/6] samples/bpf: add userspace example for prohibiting sockets
From: Alexei Starovoitov @ 2016-12-01 16:58 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480610888-31082-7-git-send-email-dsa@cumulusnetworks.com>
On Thu, Dec 01, 2016 at 08:48:08AM -0800, David Ahern wrote:
> Add examples preventing a process in a cgroup from opening a socket
> based family, protocol and type.
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH v7 net-next 5/6] samples/bpf: Update bpf loader for cgroup section names
From: Alexei Starovoitov @ 2016-12-01 16:58 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480610888-31082-6-git-send-email-dsa@cumulusnetworks.com>
On Thu, Dec 01, 2016 at 08:48:07AM -0800, David Ahern wrote:
> Add support for section names starting with cgroup/skb and cgroup/sock.
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH v7 net-next 3/6] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: Alexei Starovoitov @ 2016-12-01 16:57 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480610888-31082-4-git-send-email-dsa@cumulusnetworks.com>
On Thu, Dec 01, 2016 at 08:48:05AM -0800, David Ahern wrote:
> Add a simple program to demonstrate the ability to attach a bpf program
> to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
> are created.
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH v7 net-next 4/6] bpf: Add support for reading socket family, type, protocol
From: Alexei Starovoitov @ 2016-12-01 16:57 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480610888-31082-5-git-send-email-dsa@cumulusnetworks.com>
On Thu, Dec 01, 2016 at 08:48:06AM -0800, David Ahern wrote:
> Add socket family, type and protocol to bpf_sock allowing bpf programs
> read-only access.
>
> Add __sk_flags_offset[0] to struct sock before the bitfield to
> programmtically determine the offset of the unsigned int containing
> protocol and type.
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH v7 net-next 1/6] bpf: Refactor cgroups code in prep for new type
From: Alexei Starovoitov @ 2016-12-01 16:56 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480610888-31082-2-git-send-email-dsa@cumulusnetworks.com>
On Thu, Dec 01, 2016 at 08:48:03AM -0800, David Ahern wrote:
> Code move and rename only; no functional change intended.
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [flamebait] xdp, well meaning but pointless
From: Florian Westphal @ 2016-12-01 16:51 UTC (permalink / raw)
To: David Miller; +Cc: tgraf, fw, netdev
In-Reply-To: <20161201.111947.888676978252329124.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
> Saying that ntuple filters can handle the early drop use case doesn't
> take into consideration the nature of the tables (hundreds of
> thousands of "evil" IP addresses),
Thats not what I said.
But Ok, message received. I rest my case.
^ permalink raw reply
* Re: DSA vs. SWTICHDEV ?
From: Murali Karicheri @ 2016-12-01 16:50 UTC (permalink / raw)
To: Joakim Tjernlund, netdev@vger.kernel.org, Roger Quadros,
Grygorii Strashko
In-Reply-To: <1480495831.3563.135.camel@infinera.com>
On 11/30/2016 03:50 AM, Joakim Tjernlund wrote:
> I am trying to wrap my head around these two "devices" and have a hard time telling them apart.
> We are looking att adding a faily large switch(over PCIe) to our board and from what I can tell
> switchdev is the new way to do it but DSA is still there. Is it possible to just list
> how they differ?
>
> What can switchdev do that DSA cannot?
>
> What can DSA do that switchdev cannot?
>
>
> Can one enable switchdev and dsa for the same switch device?
>
> Jocke
>
DSA/Switchdev experts,
Nice to see this discussion as I am trying to evaluat what model
works best for our hardware. From my evaluation so far, DSA can be
used even though there is no tag protocol used between the Host and
Switch. In our hardware, the Host and Switch are part of the SoC.
The Host interface is a shared memory with queues implemented at
hardware. The Phy is attached to the mii ports externally on the board.
Also this hardware is programmable through firmware. More details
can be seen at http://processors.wiki.ti.com/index.php/PRU-ICSS
PRU can run a firmware to configure the hardware in one of the following:-
1. EMAC mode where it appears as two Ethernet ports
2. Switch mode where it implements a simple Ethernet switch. Currently
it doesn't have address learning capability, but in future it
can.
3. Switch with HSR/PRP offload where it provides HSR/PRP protocol
support and cut through switch.
So a device need to function in one of the modes. A a regular Ethernet
driver that provides two network devices, one per port, and switchdev
for each physical port (in switch mode) will look ideal in this case.
This will allow attaching the associated interfaces to a bridge (where
a L2 offload is possible in the future). This also helps to attach the
interfaces to an HSR device at the top layer like bridge to support
HSR/PRP protocol with offload possible to the PRU Switch.
Using a DSA for this appears to be adding more complexity to the driver
model and may not be ideal. What do you think?
--
Murali Karicheri
Linux Kernel, Keystone
^ permalink raw reply
* [PATCH v7 net-next 5/6] samples/bpf: Update bpf loader for cgroup section names
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480610888-31082-1-git-send-email-dsa@cumulusnetworks.com>
Add support for section names starting with cgroup/skb and cgroup/sock.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v7
- no change
v6
- new patch for version 6
samples/bpf/bpf_load.c | 14 +++++++++++---
samples/bpf/bpf_load.h | 1 +
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 62f54d6eb8bf..49b45ccbe153 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -52,6 +52,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
+ bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
+ bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -72,6 +74,10 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_XDP;
} else if (is_perf_event) {
prog_type = BPF_PROG_TYPE_PERF_EVENT;
+ } else if (is_cgroup_skb) {
+ prog_type = BPF_PROG_TYPE_CGROUP_SKB;
+ } else if (is_cgroup_sk) {
+ prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -85,7 +91,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_fd[prog_cnt++] = fd;
- if (is_xdp || is_perf_event)
+ if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
return 0;
if (is_socket) {
@@ -334,7 +340,8 @@ int load_bpf_file(char *path)
memcmp(shname_prog, "tracepoint/", 11) == 0 ||
memcmp(shname_prog, "xdp", 3) == 0 ||
memcmp(shname_prog, "perf_event", 10) == 0 ||
- memcmp(shname_prog, "socket", 6) == 0)
+ memcmp(shname_prog, "socket", 6) == 0 ||
+ memcmp(shname_prog, "cgroup/", 7) == 0)
load_and_attach(shname_prog, insns, data_prog->d_size);
}
}
@@ -353,7 +360,8 @@ int load_bpf_file(char *path)
memcmp(shname, "tracepoint/", 11) == 0 ||
memcmp(shname, "xdp", 3) == 0 ||
memcmp(shname, "perf_event", 10) == 0 ||
- memcmp(shname, "socket", 6) == 0)
+ memcmp(shname, "socket", 6) == 0 ||
+ memcmp(shname, "cgroup/", 7) == 0)
load_and_attach(shname, data->d_buf, data->d_size);
}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index dfa57fe65c8e..4adeeef53ad6 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -7,6 +7,7 @@
extern int map_fd[MAX_MAPS];
extern int prog_fd[MAX_PROGS];
extern int event_fd[MAX_PROGS];
+extern int prog_cnt;
/* parses elf file compiled by llvm .c->.o
* . parses 'maps' section and creates maps via BPF syscall
--
2.1.4
^ permalink raw reply related
* [PATCH v7 net-next 6/6] samples/bpf: add userspace example for prohibiting sockets
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480610888-31082-1-git-send-email-dsa@cumulusnetworks.com>
Add examples preventing a process in a cgroup from opening a socket
based family, protocol and type.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v7
- fix header file includes to use symbolic names in sock_flags_kern.c
v6
- new patch for version 6
samples/bpf/Makefile | 4 ++
samples/bpf/sock_flags_kern.c | 44 ++++++++++++++++++++++
samples/bpf/test_cgrp2_sock2.c | 66 +++++++++++++++++++++++++++++++++
samples/bpf/test_cgrp2_sock2.sh | 81 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 195 insertions(+)
create mode 100644 samples/bpf/sock_flags_kern.c
create mode 100644 samples/bpf/test_cgrp2_sock2.c
create mode 100755 samples/bpf/test_cgrp2_sock2.sh
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a335b218198e..8df12f9429dc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -24,6 +24,7 @@ hostprogs-y += test_overhead
hostprogs-y += test_cgrp2_array_pin
hostprogs-y += test_cgrp2_attach
hostprogs-y += test_cgrp2_sock
+hostprogs-y += test_cgrp2_sock2
hostprogs-y += xdp1
hostprogs-y += xdp2
hostprogs-y += test_current_task_under_cgroup
@@ -53,6 +54,7 @@ test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
test_cgrp2_sock-objs := libbpf.o test_cgrp2_sock.o
+test_cgrp2_sock2-objs := bpf_load.o libbpf.o test_cgrp2_sock2.o
xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
# reuse xdp1 source intentionally
xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
@@ -73,6 +75,7 @@ always += tracex3_kern.o
always += tracex4_kern.o
always += tracex5_kern.o
always += tracex6_kern.o
+always += sock_flags_kern.o
always += test_probe_write_user_kern.o
always += trace_output_kern.o
always += tcbpf1_kern.o
@@ -106,6 +109,7 @@ HOSTLOADLIBES_tracex3 += -lelf
HOSTLOADLIBES_tracex4 += -lelf -lrt
HOSTLOADLIBES_tracex5 += -lelf
HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
HOSTLOADLIBES_test_probe_write_user += -lelf
HOSTLOADLIBES_trace_output += -lelf -lrt
HOSTLOADLIBES_lathist += -lelf
diff --git a/samples/bpf/sock_flags_kern.c b/samples/bpf/sock_flags_kern.c
new file mode 100644
index 000000000000..533dd11a6baa
--- /dev/null
+++ b/samples/bpf/sock_flags_kern.c
@@ -0,0 +1,44 @@
+#include <uapi/linux/bpf.h>
+#include <linux/socket.h>
+#include <linux/net.h>
+#include <uapi/linux/in.h>
+#include <uapi/linux/in6.h>
+#include "bpf_helpers.h"
+
+SEC("cgroup/sock1")
+int bpf_prog1(struct bpf_sock *sk)
+{
+ char fmt[] = "socket: family %d type %d protocol %d\n";
+
+ bpf_trace_printk(fmt, sizeof(fmt), sk->family, sk->type, sk->protocol);
+
+ /* block PF_INET6, SOCK_RAW, IPPROTO_ICMPV6 sockets
+ * ie., make ping6 fail
+ */
+ if (sk->family == PF_INET6 &&
+ sk->type == SOCK_RAW &&
+ sk->protocol == IPPROTO_ICMPV6)
+ return 0;
+
+ return 1;
+}
+
+SEC("cgroup/sock2")
+int bpf_prog2(struct bpf_sock *sk)
+{
+ char fmt[] = "socket: family %d type %d protocol %d\n";
+
+ bpf_trace_printk(fmt, sizeof(fmt), sk->family, sk->type, sk->protocol);
+
+ /* block PF_INET, SOCK_RAW, IPPROTO_ICMP sockets
+ * ie., make ping fail
+ */
+ if (sk->family == PF_INET &&
+ sk->type == SOCK_RAW &&
+ sk->protocol == IPPROTO_ICMP)
+ return 0;
+
+ return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_cgrp2_sock2.c b/samples/bpf/test_cgrp2_sock2.c
new file mode 100644
index 000000000000..455ef0d06e93
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock2.c
@@ -0,0 +1,66 @@
+/* eBPF example program:
+ *
+ * - Loads eBPF program
+ *
+ * The eBPF program loads a filter from file and attaches the
+ * program to a cgroup using BPF_PROG_ATTACH
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <net/if.h>
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+#include "bpf_load.h"
+
+static int usage(const char *argv0)
+{
+ printf("Usage: %s cg-path filter-path [filter-id]\n", argv0);
+ return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+ int cg_fd, ret, filter_id = 0;
+
+ if (argc < 3)
+ return usage(argv[0]);
+
+ cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+ if (cg_fd < 0) {
+ printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ if (load_bpf_file(argv[2]))
+ return EXIT_FAILURE;
+
+ printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+ if (argc > 3)
+ filter_id = atoi(argv[3]);
+
+ if (filter_id > prog_cnt) {
+ printf("Invalid program id; program not found in file\n");
+ return EXIT_FAILURE;
+ }
+
+ ret = bpf_prog_attach(prog_fd[filter_id], cg_fd,
+ BPF_CGROUP_INET_SOCK_CREATE);
+ if (ret < 0) {
+ printf("Failed to attach prog to cgroup: '%s'\n",
+ strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ return EXIT_SUCCESS;
+}
diff --git a/samples/bpf/test_cgrp2_sock2.sh b/samples/bpf/test_cgrp2_sock2.sh
new file mode 100755
index 000000000000..891f12a0e26f
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock2.sh
@@ -0,0 +1,81 @@
+#!/bin/bash
+
+function config_device {
+ ip netns add at_ns0
+ ip link add veth0 type veth peer name veth0b
+ ip link set veth0b up
+ ip link set veth0 netns at_ns0
+ ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
+ ip netns exec at_ns0 ip addr add 2401:db00::1/64 dev veth0 nodad
+ ip netns exec at_ns0 ip link set dev veth0 up
+ ip addr add 172.16.1.101/24 dev veth0b
+ ip addr add 2401:db00::2/64 dev veth0b nodad
+}
+
+function config_cgroup {
+ rm -rf /tmp/cgroupv2
+ mkdir -p /tmp/cgroupv2
+ mount -t cgroup2 none /tmp/cgroupv2
+ mkdir -p /tmp/cgroupv2/foo
+ echo $$ >> /tmp/cgroupv2/foo/cgroup.procs
+}
+
+
+function attach_bpf {
+ test_cgrp2_sock2 /tmp/cgroupv2/foo sock_flags_kern.o $1
+ [ $? -ne 0 ] && exit 1
+}
+
+function cleanup {
+ ip link del veth0b
+ ip netns delete at_ns0
+ umount /tmp/cgroupv2
+ rm -rf /tmp/cgroupv2
+}
+
+cleanup 2>/dev/null
+
+set -e
+config_device
+config_cgroup
+set +e
+
+#
+# Test 1 - fail ping6
+#
+attach_bpf 0
+ping -c1 -w1 172.16.1.100
+if [ $? -ne 0 ]; then
+ echo "ping failed when it should succeed"
+ cleanup
+ exit 1
+fi
+
+ping6 -c1 -w1 2401:db00::1
+if [ $? -eq 0 ]; then
+ echo "ping6 succeeded when it should not"
+ cleanup
+ exit 1
+fi
+
+#
+# Test 2 - fail ping
+#
+attach_bpf 1
+ping6 -c1 -w1 2401:db00::1
+if [ $? -ne 0 ]; then
+ echo "ping6 failed when it should succeed"
+ cleanup
+ exit 1
+fi
+
+ping -c1 -w1 172.16.1.100
+if [ $? -eq 0 ]; then
+ echo "ping succeeded when it should not"
+ cleanup
+ exit 1
+fi
+
+cleanup
+echo
+echo "*** PASS ***"
--
2.1.4
^ permalink raw reply related
* [PATCH v7 net-next 4/6] bpf: Add support for reading socket family, type, protocol
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480610888-31082-1-git-send-email-dsa@cumulusnetworks.com>
Add socket family, type and protocol to bpf_sock allowing bpf programs
read-only access.
Add __sk_flags_offset[0] to struct sock before the bitfield to
programmtically determine the offset of the unsigned int containing
protocol and type.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v7
- remove convert_sock_access helper and put code inline
v6
- new patch for version 6 of set
include/net/sock.h | 15 +++++++++++++++
include/uapi/linux/bpf.h | 3 +++
net/core/filter.c | 21 +++++++++++++++++++++
3 files changed, 39 insertions(+)
diff --git a/include/net/sock.h b/include/net/sock.h
index 442cbb118a07..69afda6bea15 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -389,6 +389,21 @@ struct sock {
* Because of non atomicity rules, all
* changes are protected by socket lock.
*/
+ unsigned int __sk_flags_offset[0];
+#ifdef __BIG_ENDIAN_BITFIELD
+#define SK_FL_PROTO_SHIFT 16
+#define SK_FL_PROTO_MASK 0x00ff0000
+
+#define SK_FL_TYPE_SHIFT 0
+#define SK_FL_TYPE_MASK 0x0000ffff
+#else
+#define SK_FL_PROTO_SHIFT 8
+#define SK_FL_PROTO_MASK 0x0000ff00
+
+#define SK_FL_TYPE_SHIFT 16
+#define SK_FL_TYPE_MASK 0xffff0000
+#endif
+
kmemcheck_bitfield_begin(flags);
unsigned int sk_padding : 2,
sk_no_check_tx : 1,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 75964e00d947..b47ffd117fd6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -541,6 +541,9 @@ struct bpf_tunnel_key {
struct bpf_sock {
__u32 bound_dev_if;
+ __u32 family;
+ __u32 type;
+ __u32 protocol;
};
/* User return codes for XDP prog type.
diff --git a/net/core/filter.c b/net/core/filter.c
index 5ee722dc097d..efcc22b44ec1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2979,6 +2979,27 @@ static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
offsetof(struct sock, sk_bound_dev_if));
break;
+
+ case offsetof(struct bpf_sock, family):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_family) != 2);
+
+ *insn++ = BPF_LDX_MEM(BPF_H, dst_reg, src_reg,
+ offsetof(struct sock, sk_family));
+ break;
+
+ case offsetof(struct bpf_sock, type):
+ *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+ offsetof(struct sock, __sk_flags_offset));
+ *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, SK_FL_TYPE_MASK);
+ *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, SK_FL_TYPE_SHIFT);
+ break;
+
+ case offsetof(struct bpf_sock, protocol):
+ *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+ offsetof(struct sock, __sk_flags_offset));
+ *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, SK_FL_PROTO_MASK);
+ *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, SK_FL_PROTO_SHIFT);
+ break;
}
return insn - insn_buf;
--
2.1.4
^ permalink raw reply related
* [PATCH v7 net-next 3/6] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480610888-31082-1-git-send-email-dsa@cumulusnetworks.com>
Add a simple program to demonstrate the ability to attach a bpf program
to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
are created.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v7
- no change
v6
- added conversion from device name to index in test program
v5
- changed BPF_CGROUP_INET_SOCK to BPF_CGROUP_INET_SOCK_CREATE
v4
- added test_cgrp2_sock.sh for an automated test
v3
- revert to BPF_PROG_TYPE_CGROUP_SOCK prog type
v2
- removed bpf_sock_store_u32 references
- changed BPF_CGROUP_INET_SOCK_CREATE to BPF_CGROUP_INET_SOCK
- remove BPF_PROG_TYPE_CGROUP_SOCK prog type and add prog_subtype
samples/bpf/Makefile | 2 +
samples/bpf/test_cgrp2_sock.c | 83 ++++++++++++++++++++++++++++++++++++++++++
samples/bpf/test_cgrp2_sock.sh | 47 ++++++++++++++++++++++++
3 files changed, 132 insertions(+)
create mode 100644 samples/bpf/test_cgrp2_sock.c
create mode 100755 samples/bpf/test_cgrp2_sock.sh
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3ceb5a9d86df..a335b218198e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -23,6 +23,7 @@ hostprogs-y += map_perf_test
hostprogs-y += test_overhead
hostprogs-y += test_cgrp2_array_pin
hostprogs-y += test_cgrp2_attach
+hostprogs-y += test_cgrp2_sock
hostprogs-y += xdp1
hostprogs-y += xdp2
hostprogs-y += test_current_task_under_cgroup
@@ -51,6 +52,7 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
+test_cgrp2_sock-objs := libbpf.o test_cgrp2_sock.o
xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
# reuse xdp1 source intentionally
xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
new file mode 100644
index 000000000000..d467b3c1c55c
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -0,0 +1,83 @@
+/* eBPF example program:
+ *
+ * - Loads eBPF program
+ *
+ * The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
+ * sockets opened by processes in the cgroup.
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <net/if.h>
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+static int prog_load(int idx)
+{
+ struct bpf_insn prog[] = {
+ BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+ BPF_MOV64_IMM(BPF_REG_3, idx),
+ BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, bound_dev_if)),
+ BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct bpf_sock, bound_dev_if)),
+ BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
+ BPF_EXIT_INSN(),
+ };
+
+ return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
+ "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+ printf("Usage: %s cg-path device-index\n", argv0);
+ return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+ int cg_fd, prog_fd, ret;
+ unsigned int idx;
+
+ if (argc < 2)
+ return usage(argv[0]);
+
+ idx = if_nametoindex(argv[2]);
+ if (!idx) {
+ printf("Invalid device name\n");
+ return EXIT_FAILURE;
+ }
+
+ cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+ if (cg_fd < 0) {
+ printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ prog_fd = prog_load(idx);
+ printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+ if (prog_fd < 0) {
+ printf("Failed to load prog: '%s'\n", strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
+ if (ret < 0) {
+ printf("Failed to attach prog to cgroup: '%s'\n",
+ strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ return EXIT_SUCCESS;
+}
diff --git a/samples/bpf/test_cgrp2_sock.sh b/samples/bpf/test_cgrp2_sock.sh
new file mode 100755
index 000000000000..925fd467c7cc
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+
+function config_device {
+ ip netns add at_ns0
+ ip link add veth0 type veth peer name veth0b
+ ip link set veth0b up
+ ip link set veth0 netns at_ns0
+ ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
+ ip netns exec at_ns0 ip addr add 2401:db00::1/64 dev veth0 nodad
+ ip netns exec at_ns0 ip link set dev veth0 up
+ ip link add foo type vrf table 1234
+ ip link set foo up
+ ip addr add 172.16.1.101/24 dev veth0b
+ ip addr add 2401:db00::2/64 dev veth0b nodad
+ ip link set veth0b master foo
+}
+
+function attach_bpf {
+ rm -rf /tmp/cgroupv2
+ mkdir -p /tmp/cgroupv2
+ mount -t cgroup2 none /tmp/cgroupv2
+ mkdir -p /tmp/cgroupv2/foo
+ test_cgrp2_sock /tmp/cgroupv2/foo foo
+ echo $$ >> /tmp/cgroupv2/foo/cgroup.procs
+}
+
+function cleanup {
+ set +ex
+ ip netns delete at_ns0
+ ip link del veth0
+ ip link del foo
+ umount /tmp/cgroupv2
+ rm -rf /tmp/cgroupv2
+ set -ex
+}
+
+function do_test {
+ ping -c1 -w1 172.16.1.100
+ ping6 -c1 -w1 2401:db00::1
+}
+
+cleanup 2>/dev/null
+config_device
+attach_bpf
+do_test
+cleanup
+echo "*** PASS ***"
--
2.1.4
^ permalink raw reply related
* [PATCH v7 net-next 2/6] bpf: Add new cgroup attach type to enable sock modifications
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480610888-31082-1-git-send-email-dsa@cumulusnetworks.com>
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.
This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v7
- no change
v6
- added size check to sock_filter_is_valid_access; accesses must be u32
v5
- no change
v4
- dropped tweak to bpf_func signature
- dropped cg_sock_func_proto in favor of sk_filter_func_proto
- new __cgroup_bpf_run_filter_sk versus overloading __cgroup_bpf_run_filter
- reverted BPF_CGROUP_INET_SOCK to BPF_CGROUP_INET_SOCK_CREATE
v3
- reverted to new prog type BPF_PROG_TYPE_CGROUP_SOCK
- dropped the subtype
v2
- dropped the bpf_sock_store_u32 helper
- dropped the new prog type BPF_PROG_TYPE_CGROUP_SOCK
- moved valid access and context conversion to use subtype
- dropped CREATE from BPF_CGROUP_INET_SOCK and related function names
- moved running of filter from sk_alloc to inet{6}_create
include/linux/bpf-cgroup.h | 14 +++++++++++
include/uapi/linux/bpf.h | 6 +++++
kernel/bpf/cgroup.c | 33 ++++++++++++++++++++++++
kernel/bpf/syscall.c | 5 +++-
net/core/filter.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
net/ipv4/af_inet.c | 12 ++++++++-
net/ipv6/af_inet6.c | 8 ++++++
7 files changed, 138 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index af2ca8b432c0..7b6e5d168c95 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -40,6 +40,9 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
struct sk_buff *skb,
enum bpf_attach_type type);
+int __cgroup_bpf_run_filter_sk(struct sock *sk,
+ enum bpf_attach_type type);
+
/* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb) \
({ \
@@ -63,6 +66,16 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
__ret; \
})
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) \
+({ \
+ int __ret = 0; \
+ if (cgroup_bpf_enabled && sk) { \
+ __ret = __cgroup_bpf_run_filter_sk(sk, \
+ BPF_CGROUP_INET_SOCK_CREATE); \
+ } \
+ __ret; \
+})
+
#else
struct cgroup_bpf {};
@@ -72,6 +85,7 @@ static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
#endif /* CONFIG_CGROUP_BPF */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d1456f..75964e00d947 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,11 +101,13 @@ enum bpf_prog_type {
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
+ BPF_PROG_TYPE_CGROUP_SOCK,
};
enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
+ BPF_CGROUP_INET_SOCK_CREATE,
__MAX_BPF_ATTACH_TYPE
};
@@ -537,6 +539,10 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
+struct bpf_sock {
+ __u32 bound_dev_if;
+};
+
/* User return codes for XDP prog type.
* A valid XDP program must return one of these defined values. All other
* return codes are reserved for future use. Unknown return codes will result
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 8fe55ffd109d..a515f7b007c6 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -165,3 +165,36 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
return ret;
}
EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
+
+/**
+ * __cgroup_bpf_run_filter_sk() - Run a program on a sock
+ * @sk: sock structure to manipulate
+ * @type: The type of program to be exectuted
+ *
+ * socket is passed is expected to be of type INET or INET6.
+ *
+ * The program type passed in via @type must be suitable for sock
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if any if an attached program was found
+ * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter_sk(struct sock *sk,
+ enum bpf_attach_type type)
+{
+ struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+ struct bpf_prog *prog;
+ int ret = 0;
+
+
+ rcu_read_lock();
+
+ prog = rcu_dereference(cgrp->bpf.effective[type]);
+ if (prog)
+ ret = BPF_PROG_RUN(prog, sk) == 1 ? 0 : -EPERM;
+
+ rcu_read_unlock();
+
+ return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5518a6839ab1..85af86c496cd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -869,7 +869,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_INET_EGRESS:
ptype = BPF_PROG_TYPE_CGROUP_SKB;
break;
-
+ case BPF_CGROUP_INET_SOCK_CREATE:
+ ptype = BPF_PROG_TYPE_CGROUP_SOCK;
+ break;
default:
return -EINVAL;
}
@@ -905,6 +907,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
switch (attr->attach_type) {
case BPF_CGROUP_INET_INGRESS:
case BPF_CGROUP_INET_EGRESS:
+ case BPF_CGROUP_INET_SOCK_CREATE:
cgrp = cgroup_get_from_fd(attr->target_fd);
if (IS_ERR(cgrp))
return PTR_ERR(cgrp);
diff --git a/net/core/filter.c b/net/core/filter.c
index 698a262b8ebb..5ee722dc097d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2676,6 +2676,32 @@ static bool sk_filter_is_valid_access(int off, int size,
return __is_valid_access(off, size, type);
}
+static bool sock_filter_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ enum bpf_reg_type *reg_type)
+{
+ if (type == BPF_WRITE) {
+ switch (off) {
+ case offsetof(struct bpf_sock, bound_dev_if):
+ break;
+ default:
+ return false;
+ }
+ }
+
+ if (off < 0 || off + size > sizeof(struct bpf_sock))
+ return false;
+
+ /* The verifier guarantees that size > 0. */
+ if (off % size != 0)
+ return false;
+
+ if (size != sizeof(__u32))
+ return false;
+
+ return true;
+}
+
static int tc_cls_act_prologue(struct bpf_insn *insn_buf, bool direct_write,
const struct bpf_prog *prog)
{
@@ -2934,6 +2960,30 @@ static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
return insn - insn_buf;
}
+static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
+ int dst_reg, int src_reg,
+ int ctx_off,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ switch (ctx_off) {
+ case offsetof(struct bpf_sock, bound_dev_if):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_bound_dev_if) != 4);
+
+ if (type == BPF_WRITE)
+ *insn++ = BPF_STX_MEM(BPF_W, dst_reg, src_reg,
+ offsetof(struct sock, sk_bound_dev_if));
+ else
+ *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+ offsetof(struct sock, sk_bound_dev_if));
+ break;
+ }
+
+ return insn - insn_buf;
+}
+
static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type, int dst_reg,
int src_reg, int ctx_off,
struct bpf_insn *insn_buf,
@@ -3007,6 +3057,12 @@ static const struct bpf_verifier_ops cg_skb_ops = {
.convert_ctx_access = sk_filter_convert_ctx_access,
};
+static const struct bpf_verifier_ops cg_sock_ops = {
+ .get_func_proto = sk_filter_func_proto,
+ .is_valid_access = sock_filter_is_valid_access,
+ .convert_ctx_access = sock_filter_convert_ctx_access,
+};
+
static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops = &sk_filter_ops,
.type = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -3032,6 +3088,11 @@ static struct bpf_prog_type_list cg_skb_type __read_mostly = {
.type = BPF_PROG_TYPE_CGROUP_SKB,
};
+static struct bpf_prog_type_list cg_sock_type __read_mostly = {
+ .ops = &cg_sock_ops,
+ .type = BPF_PROG_TYPE_CGROUP_SOCK
+};
+
static int __init register_sk_filter_ops(void)
{
bpf_register_prog_type(&sk_filter_type);
@@ -3039,6 +3100,7 @@ static int __init register_sk_filter_ops(void)
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
bpf_register_prog_type(&cg_skb_type);
+ bpf_register_prog_type(&cg_sock_type);
return 0;
}
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5ddf5cda07f4..24d2550492ee 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -374,8 +374,18 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
if (sk->sk_prot->init) {
err = sk->sk_prot->init(sk);
- if (err)
+ if (err) {
+ sk_common_release(sk);
+ goto out;
+ }
+ }
+
+ if (!kern) {
+ err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+ if (err) {
sk_common_release(sk);
+ goto out;
+ }
}
out:
return err;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d424f3a3737a..237e654ba717 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -258,6 +258,14 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
goto out;
}
}
+
+ if (!kern) {
+ err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+ if (err) {
+ sk_common_release(sk);
+ goto out;
+ }
+ }
out:
return err;
out_rcu_unlock:
--
2.1.4
^ permalink raw reply related
* [PATCH v7 net-next 1/6] bpf: Refactor cgroups code in prep for new type
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480610888-31082-1-git-send-email-dsa@cumulusnetworks.com>
Code move and rename only; no functional change intended.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v7, v6, v5
- no change
v4
- dropped refactor of __cgroup_bpf_run_filter and renamed it
to __cgroup_bpf_run_filter_skb
v3
- dropped the rename
v2
- fix bpf_prog_run_clear_cb to bpf_prog_run_save_cb as caught by Daniel
- rename BPF_PROG_TYPE_CGROUP_SKB and its cg_skb functions to
BPF_PROG_TYPE_CGROUP and cgroup
include/linux/bpf-cgroup.h | 46 +++++++++++++++++++++++-----------------------
kernel/bpf/cgroup.c | 10 +++++-----
kernel/bpf/syscall.c | 28 +++++++++++++++-------------
3 files changed, 43 insertions(+), 41 deletions(-)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 0cf1adfadd2d..af2ca8b432c0 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -36,31 +36,31 @@ void cgroup_bpf_update(struct cgroup *cgrp,
struct bpf_prog *prog,
enum bpf_attach_type type);
-int __cgroup_bpf_run_filter(struct sock *sk,
- struct sk_buff *skb,
- enum bpf_attach_type type);
-
-/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
-#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) \
-({ \
- int __ret = 0; \
- if (cgroup_bpf_enabled) \
- __ret = __cgroup_bpf_run_filter(sk, skb, \
- BPF_CGROUP_INET_INGRESS); \
- \
- __ret; \
+int __cgroup_bpf_run_filter_skb(struct sock *sk,
+ struct sk_buff *skb,
+ enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb) \
+({ \
+ int __ret = 0; \
+ if (cgroup_bpf_enabled) \
+ __ret = __cgroup_bpf_run_filter_skb(sk, skb, \
+ BPF_CGROUP_INET_INGRESS); \
+ \
+ __ret; \
})
-#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) \
-({ \
- int __ret = 0; \
- if (cgroup_bpf_enabled && sk && sk == skb->sk) { \
- typeof(sk) __sk = sk_to_full_sk(sk); \
- if (sk_fullsock(__sk)) \
- __ret = __cgroup_bpf_run_filter(__sk, skb, \
- BPF_CGROUP_INET_EGRESS); \
- } \
- __ret; \
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb) \
+({ \
+ int __ret = 0; \
+ if (cgroup_bpf_enabled && sk && sk == skb->sk) { \
+ typeof(sk) __sk = sk_to_full_sk(sk); \
+ if (sk_fullsock(__sk)) \
+ __ret = __cgroup_bpf_run_filter_skb(__sk, skb, \
+ BPF_CGROUP_INET_EGRESS); \
+ } \
+ __ret; \
})
#else
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 8c784f8c67cd..8fe55ffd109d 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -118,7 +118,7 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
}
/**
- * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * __cgroup_bpf_run_filter_skb() - Run a program for packet filtering
* @sk: The socken sending or receiving traffic
* @skb: The skb that is being sent or received
* @type: The type of program to be exectuted
@@ -132,9 +132,9 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
* This function will return %-EPERM if any if an attached program was found
* and if it returned != 1 during execution. In all other cases, 0 is returned.
*/
-int __cgroup_bpf_run_filter(struct sock *sk,
- struct sk_buff *skb,
- enum bpf_attach_type type)
+int __cgroup_bpf_run_filter_skb(struct sock *sk,
+ struct sk_buff *skb,
+ enum bpf_attach_type type)
{
struct bpf_prog *prog;
struct cgroup *cgrp;
@@ -164,4 +164,4 @@ int __cgroup_bpf_run_filter(struct sock *sk,
return ret;
}
-EXPORT_SYMBOL(__cgroup_bpf_run_filter);
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4caa18e6860a..5518a6839ab1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -856,6 +856,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
{
struct bpf_prog *prog;
struct cgroup *cgrp;
+ enum bpf_prog_type ptype;
if (!capable(CAP_NET_ADMIN))
return -EPERM;
@@ -866,25 +867,26 @@ static int bpf_prog_attach(const union bpf_attr *attr)
switch (attr->attach_type) {
case BPF_CGROUP_INET_INGRESS:
case BPF_CGROUP_INET_EGRESS:
- prog = bpf_prog_get_type(attr->attach_bpf_fd,
- BPF_PROG_TYPE_CGROUP_SKB);
- if (IS_ERR(prog))
- return PTR_ERR(prog);
-
- cgrp = cgroup_get_from_fd(attr->target_fd);
- if (IS_ERR(cgrp)) {
- bpf_prog_put(prog);
- return PTR_ERR(cgrp);
- }
-
- cgroup_bpf_update(cgrp, prog, attr->attach_type);
- cgroup_put(cgrp);
+ ptype = BPF_PROG_TYPE_CGROUP_SKB;
break;
default:
return -EINVAL;
}
+ prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+
+ cgrp = cgroup_get_from_fd(attr->target_fd);
+ if (IS_ERR(cgrp)) {
+ bpf_prog_put(prog);
+ return PTR_ERR(cgrp);
+ }
+
+ cgroup_bpf_update(cgrp, prog, attr->attach_type);
+ cgroup_put(cgrp);
+
return 0;
}
--
2.1.4
^ permalink raw reply related
* [PATCH v7 net-next 0/6] net: Add bpf support for sockets
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
The recently added VRF support in Linux leverages the bind-to-device
API for programs to specify an L3 domain for a socket. While
SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
program has support for it. Even for those programs that do support it,
the API requires processes to be started as root (CAP_NET_RAW) which
is not desirable from a general security perspective.
This patch set leverages Daniel Mack's work to attach bpf programs to
a cgroup to provide a capability to set sk_bound_dev_if for all
AF_INET{6} sockets opened by a process in a cgroup when the sockets
are allocated.
For example:
1. configure vrf (e.g., using ifupdown2)
auto eth0
iface eth0 inet dhcp
vrf mgmt
auto mgmt
iface mgmt
vrf-table auto
2. configure cgroup
mount -t cgroup2 none /tmp/cgroupv2
mkdir /tmp/cgroupv2/mgmt
test_cgrp2_sock /tmp/cgroupv2/mgmt 15
3. set shell into cgroup (e.g., can be done at login using pam)
echo $$ >> /tmp/cgroupv2/mgmt/cgroup.procs
At this point all commands run in the shell (e.g, apt) have sockets
automatically bound to the VRF (see output of ss -ap 'dev == <vrf>'),
including processes not running as root.
This capability enables running any program in a VRF context and is key
to deploying Management VRF, a fundamental configuration for networking
gear, with any Linux OS installation.
This patchset also exports the socket family, type and protocol as
read-only allowing bpf filters to deny a process in a cgroup the ability
to open specific types of AF_INET or AF_INET6 sockets.
v7
- comments from Alexei
v6
- add export of socket family, type and protocol
David Ahern (6):
bpf: Refactor cgroups code in prep for new type
bpf: Add new cgroup attach type to enable sock modifications
samples: bpf: add userspace example for modifying sk_bound_dev_if
bpf: Add support for reading socket family, type, protocol
samples/bpf: Update bpf loader for cgroup section names
samples/bpf: add userspace example for prohibiting sockets
include/linux/bpf-cgroup.h | 60 +++++++++++++++++------------
include/net/sock.h | 15 ++++++++
include/uapi/linux/bpf.h | 9 +++++
kernel/bpf/cgroup.c | 43 ++++++++++++++++++---
kernel/bpf/syscall.c | 33 +++++++++-------
net/core/filter.c | 83 +++++++++++++++++++++++++++++++++++++++++
net/ipv4/af_inet.c | 12 +++++-
net/ipv6/af_inet6.c | 8 ++++
samples/bpf/Makefile | 6 +++
samples/bpf/bpf_load.c | 14 +++++--
samples/bpf/bpf_load.h | 1 +
samples/bpf/sock_flags_kern.c | 44 ++++++++++++++++++++++
samples/bpf/test_cgrp2_sock.c | 83 +++++++++++++++++++++++++++++++++++++++++
samples/bpf/test_cgrp2_sock.sh | 47 +++++++++++++++++++++++
samples/bpf/test_cgrp2_sock2.c | 66 ++++++++++++++++++++++++++++++++
samples/bpf/test_cgrp2_sock2.sh | 81 ++++++++++++++++++++++++++++++++++++++++
16 files changed, 559 insertions(+), 46 deletions(-)
create mode 100644 samples/bpf/sock_flags_kern.c
create mode 100644 samples/bpf/test_cgrp2_sock.c
create mode 100755 samples/bpf/test_cgrp2_sock.sh
create mode 100644 samples/bpf/test_cgrp2_sock2.c
create mode 100755 samples/bpf/test_cgrp2_sock2.sh
--
2.1.4
^ permalink raw reply
* Re: pull request (net-next): ipsec-next 2016-12-01
From: David Miller @ 2016-12-01 16:45 UTC (permalink / raw)
To: steffen.klassert; +Cc: herbert, netdev
In-Reply-To: <1480592885-3903-1-git-send-email-steffen.klassert@secunet.com>
From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Thu, 1 Dec 2016 12:48:04 +0100
> git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master
>
> for you to fetch changes up to 2258d927a691ddd2ab585adb17ea9f96e89d0638:
>
> xfrm: remove unused helper (2016-09-30 08:20:56 +0200)
Hmmm, when I try to pull I don't get anything:
[davem@dhcp-10-15-49-210 net-next]$ git pull --no-ff git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master
>From git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
* branch master -> FETCH_HEAD
Already up-to-date.
^ permalink raw reply
* Re: [PATCH net] RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
From: Santosh Shilimkar @ 2016-12-01 16:40 UTC (permalink / raw)
To: Sowmini Varadhan, netdev; +Cc: davem
In-Reply-To: <1480596283-204869-1-git-send-email-sowmini.varadhan@oracle.com>
On 12/1/2016 4:44 AM, Sowmini Varadhan wrote:
> If some error is encountered in rds_tcp_init_net, make sure to
> unregister_netdevice_notifier(), else we could trigger a panic
> later on, when the modprobe from a netns fails.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
^ permalink raw reply
* Re: pull request (net): ipsec 2016-12-01
From: David Miller @ 2016-12-01 16:36 UTC (permalink / raw)
To: steffen.klassert; +Cc: herbert, netdev
In-Reply-To: <1480592692-3653-1-git-send-email-steffen.klassert@secunet.com>
From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Thu, 1 Dec 2016 12:44:49 +0100
> 1) Change the error value when someone tries to run 32bit
> userspace on a 64bit host from -ENOTSUPP to the userspace
> exported -EOPNOTSUPP. Fix from Yi Zhao.
>
> 2) On inbound, ESN sequence numbers are already in network
> byte order. So don't try to convert it again, this fixes
> integrity verification for ESN. Fixes from Tobias Brunner.
>
> Please pull or let me know if there are problems.
Pulled, thanks Steffen.
^ permalink raw reply
* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-12-01 16:33 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480607729.18162.311.camel@edumazet-glaptop3.roam.corp.google.com>
On Thu, Dec 1, 2016 at 5:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-12-01 at 17:38 +0200, Saeed Mahameed wrote:
>
>>
>> Hi Eric, Thanks for the patch, I already acked it.
>
> Thanks !
>
>>
>> I have one educational question (not related to this patch, but
>> related to stats reading in general).
>> I was wondering why do we need to disable bh every time we read stats
>> "spin_lock_bh" ? is it essential ?
>>
>> I checked and in mlx4 we don't hold stats_lock in softirq
>> (en_rx.c/en_tx.c), so I don't see any deadlock risk in here..
>
> Excellent question, and I chose to keep the spinlock.
>
> That would be doable, only if we do not overwrite dev->stats.
>
> Current code is :
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
> struct mlx4_en_priv *priv = netdev_priv(dev);
>
> spin_lock_bh(&priv->stats_lock);
> mlx4_en_fold_software_stats(dev);
> netdev_stats_to_stats64(stats, &dev->stats);
> spin_unlock_bh(&priv->stats_lock);
>
> return stats;
> }
>
> If you remove the spin_lock_bh() :
>
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
> struct mlx4_en_priv *priv = netdev_priv(dev);
>
> mlx4_en_fold_software_stats(dev); // possible races
>
> netdev_stats_to_stats64(stats, &dev->stats);
>
> return stats;
> }
>
> 1) one mlx4_en_fold_software_stats(dev) could be preempted
> on a CONFIG_PREEMPT kernel, or interrupted by long irqs.
>
> 2) Another cpu would also call mlx4_en_fold_software_stats(dev) while
> first cpu is busy.
>
> 3) Then when resuming first cpu/thread, part of the dev->stats fieds
> would be updated with 'old counters',
> while another thread might have updated them with newer values.
>
> 4) A SNMP reader could then get counters that are not monotonically
> increasing,
> which would be confusing/buggy.
>
> So removing the spinlock is doable, but needs to add a new parameter
> to mlx4_en_fold_software_stats() and call netdev_stats_to_stats64()
> before mlx4_en_fold_software_stats(dev)
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
> struct mlx4_en_priv *priv = netdev_priv(dev);
>
> netdev_stats_to_stats64(stats, &dev->stats);
>
> // Passing a non NULL stats asks mlx4_en_fold_software_stats()
> // to not update dev->stats, but stats directly.
>
> mlx4_en_fold_software_stats(dev, stats)
>
>
> return stats;
> }
>
>
Thanks for the detailed answer !!
BTW you went 5 steps ahead of my original question :)), so far you
already have a patch without locking at all (really impressive).
What i wanted to ask originally, was regarding the "_bh", i didn't
mean to completely remove the "spin_lock_bh",
I meant, what happens if we replace "spin_lock_bh" with "spin_lock",
without disabling bh ?
I gues raw "sping_lock" handles points (2 to 4) from above, but it
won't handle long irqs.
^ permalink raw reply
* Re: [Patch net-next] audit: remove useless synchronize_net()
From: David Miller @ 2016-12-01 16:29 UTC (permalink / raw)
To: xiyou.wangcong; +Cc: netdev, rgb
In-Reply-To: <1480439696-21818-1-git-send-email-xiyou.wangcong@gmail.com>
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 29 Nov 2016 09:14:56 -0800
> netlink kernel socket is protected by refcount, not RCU.
> Its rcv path is neither protected by RCU. So the synchronize_net()
> is just pointless.
>
> Cc: Richard Guy Briggs <rgb@redhat.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next v4 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Thomas Graf @ 2016-12-01 16:28 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: davem, netdev, alexei.starovoitov, tom, roopa, hannes
In-Reply-To: <584012CC.4030004@iogearbox.net>
On 12/01/16 at 01:08pm, Daniel Borkmann wrote:
> For the verifier change in may_access_direct_pkt_data(), would be
> great if you could later on follow up with a selftest-suite case,
> one where BPF_PROG_TYPE_LWT_IN/OUT prog tries to write and fails,
> and one where BPF_PROG_TYPE_LWT_IN/OUT prog uses pkt data to pass
> to helpers, for example, so that we can keep testing it when future
> changes in that area are made. Thanks.
Good idea, will do.
^ permalink raw reply
* Re: [flamebait] xdp, well meaning but pointless
From: Thomas Graf @ 2016-12-01 16:28 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: Florian Westphal, netdev
In-Reply-To: <7e2be2fc-7c04-b333-59c7-43d4fcfcb451@stressinduktion.org>
On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
> XDP manipulates packets at free will and thus all security guarantees
> are off as well as in any user space solution.
>
> Secondly user space provides policy, acl, more controlled memory
> protection, restartability and better debugability. If I had multi
> tenant workloads I would definitely put more complex "business/acl"
> logic into user space, so I can make use of LSM and other features to
> especially prevent a network facing service to attack the tenants. If
> stuff gets put into the kernel you run user controlled code in the
> kernel exposing a much bigger attack vector.
>
> What use case do you see in XDP specifically e.g. for container networking?
DDOS mitigation to protect distributed applications in large clusters.
Relying on CDN works to protect API gateways and frontends (as long as
they don't throw you out of their network) but offers no protection
beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
level and allowing the mitigation capability to scale up with the number
of servers is natural and cheap.
> > I agree with you if the LB is a software based appliance in either a
> > dedicated VM or on dedicated baremetal.
> >
> > The reality is turning out to be different in many cases though, LB
> > needs to be performed not only for north south but east west as well.
> > So even if I would handle LB for traffic entering my datacenter in user
> > space, I will need the same LB for packets from my applications and
> > I definitely don't want to move all of that into user space.
>
> The open question to me is why is programmability needed here.
>
> Look at the discussion about ECMP and consistent hashing. It is not very
> easy to actually write this code correctly. Why can't we just put C code
> into the kernel that implements this once and for all and let user space
> update the policies?
Whatever LB logic is put in place with native C code now is unlikely the
logic we need in two years. We can't really predict the future. If it
was the case, networking would have been done long ago and we would all
be working on self eating ice cream now.
> Load balancers have to deal correctly with ICMP packets, e.g. they even
> have to be duplicated to every ECMP route. This seems to be problematic
> to do in eBPF programs due to looping constructs so you end up with
> complicated user space anyway.
Feel free to implement such complex LBs in user space or natively. It is
not required for the majority of use cases. The most popular LBs for
application load balancing have no idea of ECMP and require ECMP aware
routers to be made redundant itself.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox