Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Daniel Mack @ 2016-09-19 19:30 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev, sargun,
	cgroups
In-Reply-To: <20160919191910.GA984@salvia>

On 09/19/2016 09:19 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 06:44:00PM +0200, Daniel Mack wrote:
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 6001e78..5dc90aa 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/module.h>
>>  #include <linux/slab.h>
>>  
>> +#include <linux/bpf-cgroup.h>
>>  #include <linux/netfilter.h>
>>  #include <linux/netfilter_ipv6.h>
>>  
>> @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>  {
>>  	struct net_device *dev = skb_dst(skb)->dev;
>>  	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
>> +	int ret;
>>  
>>  	if (unlikely(idev->cnf.disable_ipv6)) {
>>  		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
>> @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>  		return 0;
>>  	}
>>  
>> +	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
>> +	if (ret) {
>> +		kfree_skb(skb);
>> +		return ret;
>> +	}
> 
> 1) If your goal is to filter packets, why so late? The sooner you
>    enforce your policy, the less cycles you waste.
> 
> Actually, did you look at Google's approach to this problem?  They
> want to control this at socket level, so you restrict what the process
> can actually bind. That is enforcing the policy way before you even
> send packets. On top of that, what they submitted is infrastructured
> so any process with CAP_NET_ADMIN can access that policy that is being
> applied and fetch a readable policy through kernel interface.

Yes, I've seen what they propose, but I want this approach to support
accounting, and so the code has to look at each and every packet in
order to count bytes and packets. Do you know of any better place to put
the hook then?

That said, I can well imagine more hooks types that also operate at port
bind time. That would be easy to add on top.

> 2) This will turn the stack into a nightmare to debug I predict. If
>    any process with CAP_NET_ADMIN can potentially attach bpf blobs
>    via these hooks, we will have to include in the network stack
>    traveling documentation something like: "Probably you have to check
>    that your orchestrator is not dropping your packets for some
>    reason". So I wonder how users will debug this and how the policy that
>    your orchestrator applies will be exposed to userspace.

Sure, every new limitation mechanism adds another knob to look at if
things don't work. But apart from taking care at userspace level to make
the behavior as obvious as possible, I'm open to suggestions of how to
improve the transparency of attached eBPF programs on the kernel side.


Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Pablo Neira Ayuso @ 2016-09-19 19:19 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev, sargun,
	cgroups
In-Reply-To: <1474303441-3745-6-git-send-email-daniel@zonque.org>

On Mon, Sep 19, 2016 at 06:44:00PM +0200, Daniel Mack wrote:
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 6001e78..5dc90aa 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -39,6 +39,7 @@
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  
> +#include <linux/bpf-cgroup.h>
>  #include <linux/netfilter.h>
>  #include <linux/netfilter_ipv6.h>
>  
> @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
>  	struct net_device *dev = skb_dst(skb)->dev;
>  	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
> +	int ret;
>  
>  	if (unlikely(idev->cnf.disable_ipv6)) {
>  		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
> @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  		return 0;
>  	}
>  
> +	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
> +	if (ret) {
> +		kfree_skb(skb);
> +		return ret;
> +	}

1) If your goal is to filter packets, why so late? The sooner you
   enforce your policy, the less cycles you waste.

Actually, did you look at Google's approach to this problem?  They
want to control this at socket level, so you restrict what the process
can actually bind. That is enforcing the policy way before you even
send packets. On top of that, what they submitted is infrastructured
so any process with CAP_NET_ADMIN can access that policy that is being
applied and fetch a readable policy through kernel interface.

2) This will turn the stack into a nightmare to debug I predict. If
   any process with CAP_NET_ADMIN can potentially attach bpf blobs
   via these hooks, we will have to include in the network stack
   traveling documentation something like: "Probably you have to check
   that your orchestrator is not dropping your packets for some
   reason". So I wonder how users will debug this and how the policy that
   your orchestrator applies will be exposed to userspace.

>  	return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
>  			    net, sk, skb, NULL, dev,
>  			    ip6_finish_output,
> -- 
> 2.5.5
> 

^ permalink raw reply

* [PATCH] 6pack: fix buffer length mishandling
From: Alan @ 2016-09-19 19:15 UTC (permalink / raw)
  To: netdev

Dmitry Vyukov wrote:
> different runs). Looking at code, the following looks suspicious -- we
> limit copy by 512 bytes, but use the original count which can be
> larger than 512:
>
> static void sixpack_receive_buf(struct tty_struct *tty,
>     const unsigned char *cp, char *fp, int count)
> {
>     unsigned char buf[512];
>     ....
>     memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
>     ....
>     sixpack_decode(sp, buf, count1);

With the sane tty locking we now have I believe the following is safe as
we consume the bytes and move them into the decoded buffer before
returning.

Signed-off-by: Alan Cox <alan@linux.intel.com>
---
 drivers/net/hamradio/6pack.c |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 5a1e985..470b3dc 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -127,7 +127,7 @@ struct sixpack {
 
 #define AX25_6PACK_HEADER_LEN 0
 
-static void sixpack_decode(struct sixpack *, unsigned char[], int);
+static void sixpack_decode(struct sixpack *, const unsigned char[], int);
 static int encode_sixpack(unsigned char *, unsigned char *, int, unsigned char);
 
 /*
@@ -428,7 +428,7 @@ static void sixpack_write_wakeup(struct tty_struct *tty)
 
 /*
  * Handle the 'receiver data ready' interrupt.
- * This function is called by the 'tty_io' module in the kernel when
+ * This function is called by the tty module in the kernel when
  * a block of 6pack data has been received, which can now be decapsulated
  * and sent on to some IP layer for further processing.
  */
@@ -436,7 +436,6 @@ static void sixpack_receive_buf(struct tty_struct *tty,
 	const unsigned char *cp, char *fp, int count)
 {
 	struct sixpack *sp;
-	unsigned char buf[512];
 	int count1;
 
 	if (!count)
@@ -446,10 +445,7 @@ static void sixpack_receive_buf(struct tty_struct *tty,
 	if (!sp)
 		return;
 
-	memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
-
 	/* Read the characters out of the buffer */
-
 	count1 = count;
 	while (count) {
 		count--;
@@ -459,7 +455,7 @@ static void sixpack_receive_buf(struct tty_struct *tty,
 			continue;
 		}
 	}
-	sixpack_decode(sp, buf, count1);
+	sixpack_decode(sp, cp, count1);
 
 	sp_put(sp);
 	tty_unthrottle(tty);
@@ -992,7 +988,7 @@ static void decode_std_command(struct sixpack *sp, unsigned char cmd)
 /* decode a 6pack packet */
 
 static void
-sixpack_decode(struct sixpack *sp, unsigned char *pre_rbuff, int count)
+sixpack_decode(struct sixpack *sp, const unsigned char *pre_rbuff, int count)
 {
 	unsigned char inbyte;
 	int count1;

^ permalink raw reply related

* Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
From: Jason Baron @ 2016-09-19 18:33 UTC (permalink / raw)
  To: Mintz, Yuval, davem@davemloft.net
  Cc: netdev@vger.kernel.org, Ariel.Elior@qlogic.com
In-Reply-To: <BL2PR07MB2306CEA6AD75D9D4EB77FB1F8DF50@BL2PR07MB2306.namprd07.prod.outlook.com>

On 09/18/2016 06:25 AM, Mintz, Yuval wrote:
>> Currently, we can have high order page allocations that specify
>> GFP_ATOMIC when configuring multicast MAC address filters.
>>
>> For example, we have seen order 2 page allocation failures with
>> ~500 multicast addresses configured.
>>
>> Convert the allocation for the pending list to be done in PAGE_SIZE
>> increments.
>>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>
> While I appreciate the effort, I wonder whether it's worth it:
>
> - The hardware [even in its newer generation] provides an approximate
> based classification [I.e., hashed] with 256 bins.
> When configuring 500 multicast addresses, one can argue the
> difference between multicast-promisc mode and actual configuration
> is insignificant.

With 256 bins, I think it takes close to: 256*lg(256) or 2,048
multicast addresses to expect to have all bins have at least one hash, 
assuming a uniform distribution of the hashes.

> Perhaps the easier-to-maintain alternative would simply be to
> determine the maximal number of multicast addresses that can be
> configured using a single PAGE, and if in need of more than that
> simply move into multicast-promisc.
>

sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
page on x86. So if we want to fit 2,048 elements, we need 12 pages.

>   - While GFP_ATOMIC is required in this flow due to the fact it's being
> called from sleepless context, I do believe this is mostly a remnant -
> it's possible that by slightly changing the locking scheme we can have
> the configuration done from sleepless context and simply switch to
> GFP_KERNEL instead.
>

Ok if its GFP_KERNEL, I think its still undesirable to do large page 
order allocations (unless of course its converted to a single page, but
I'm not sure this makes sense as mentioned).

> Regarding the patch itself, only comment I have:
>> +			elem_group = (struct bnx2x_mcast_elem_group *)
>> +				     elem_group->mcast_group_link.next;
> Let's use list_next_entry() instead.
>
>

Yes, agreed.

I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to
enforce some maximum number of elements (perhaps 2,048 based on the
above math) for the !CHIP_IS_E1() case on top of what I already posted.

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH 1/2] ptp_clock: allow for it to be optional
From: Nicolas Pitre @ 2016-09-19 18:09 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Jiri Benc, John Stultz, Thomas Gleixner, Richard Cochran, netdev,
	linux-kernel, linux-kbuild
In-Reply-To: <20160919173112.tpexkd44ggorg7ww@x>

On Mon, 19 Sep 2016, Josh Triplett wrote:

> But it does seem unfortunate that this can't happen at build time via
> Kconfig.  CCing linux-kbuild in case someone has an idea for how to fix
> this.

I hoped something like this could work:

config FOO
	prompt "Blah-blah"
	tristate if (BAR != y)
	bool if (BAR = y)
	default BAR

This way FOO could be y, m or n when BAR is m or n, otherwise FOO could 
be y or n only.

But that didn't work.


Nicolas

^ permalink raw reply

* Re: eBPF: how to check the flow table
From: Alexei Starovoitov @ 2016-09-19 17:53 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel
In-Reply-To: <1474268215.32028.1.camel@regit.org>

On Mon, Sep 19, 2016 at 08:56:55AM +0200, Eric Leblond wrote:
> Hello,
> 
> On Sun, 2016-09-18 at 14:02 +0200, Eric Leblond wrote:
> > Hello,
> > 
> > I'm currently testing a code implementing AF_PACKET bypass for
> > Suricata. The idea is that Suricata is updating a hash table
> > containing
> > a list of flows it does not want to see anymore.
> > 
> > I want to check flow timeout from the userspace, so my current
> > algorithm is doing:
> > 
> >     while (bpf_get_next_key(mapfd, &key, &next_key) == 0) {
> >         bpf_lookup_elem(mapfd, &next_key, &value);
> >         FlowCallback(mapfd, &next_key, &value, data);
> >         key = next_key;
> >     }
> > 
> > In the FlowCallback, I check the timing in the flow entry and I
> > remove
> > the key if the flow is timeout.
> > 
> > This is currently working well when there is only a few flows but on
> > a
> > real system with log of insertion in the table, the loop is never
> > returning because we dequeue slower than we enqueue.
> > 
> > Is there a better algorithm or an other way to do it ? 
> 
> It seems I missed an obvious race condition in my existing code. I'll
> continue to test and relive this thread if necessary.

On tracing side we do similar loops all the time.
yes they're racy by design, since kernel always have to proceed.
User space have to be more careful with iteration over the map
to make sure that get_next actually moves forward.
It's indeed possible to custom craft a situation where user space
will forever loop.
Here is one:
https://github.com/iovisor/bcc/issues/665

^ permalink raw reply

* Re: [PATCH v2 2/2] net: skbuff: Coding: Use eth_type_vlan() instead of open coding it
From: Eric Dumazet @ 2016-09-19 17:37 UTC (permalink / raw)
  To: Shmulik Ladkani; +Cc: David S . Miller, Jiri Pirko, Daniel Borkmann, netdev
In-Reply-To: <20160919192310.3c5d610d@halley>

On Mon, 2016-09-19 at 19:23 +0300, Shmulik Ladkani wrote:
> On Mon, 19 Sep 2016 09:20:55 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2016-09-19 at 18:53 +0300, Shmulik Ladkani wrote:
> > > Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
> > > skb->protocol to ETH_P_8021Q or ETH_P_8021AD.
> > > 
> > > Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> > > ---  
> > 
> > You forgot to explicitly state this was targeting net-next tree.
> 
> Thanks, was an oversight.
> Requires resubmission?

I do not think so.

I only reminded it because it seems many netdev submissions
are forgetting this, and it does not help David Miller job.

Thanks

^ permalink raw reply

* Re: [PATCH 1/2] ptp_clock: allow for it to be optional
From: Josh Triplett @ 2016-09-19 17:31 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Nicolas Pitre, John Stultz, Thomas Gleixner, Richard Cochran,
	netdev, linux-kernel, linux-kbuild
In-Reply-To: <20160919190415.26bd9343@griffin>

On Mon, Sep 19, 2016 at 07:04:15PM +0200, Jiri Benc wrote:
> On Mon, 19 Sep 2016 10:10:21 -0400 (EDT), Nicolas Pitre wrote:
> > --- a/include/linux/ptp_clock_kernel.h
> > +++ b/include/linux/ptp_clock_kernel.h
> > @@ -207,7 +207,16 @@ int ptp_find_pin(struct ptp_clock *ptp,
> >  #else
> >  static inline struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
> >  						   struct device *parent)
> > -{ return NULL; }
> > +{
> > +	if (IS_MODULE(CONFIG_PTP_1588_CLOCK)) {
> > +		pr_warn("%s is built-in while PTP clock subsystem is modular, "
> > +			"PTP clock ignored\n", KBUILD_MODNAME);
> > +	} else {
> > +		pr_warn("ignoring PTP clock from %s as PTP clock subsystem "
> > +			"is configured out\n", KBUILD_MODNAME);
> > +	}
> > +	return NULL;
> > +}
> 
> I think the else part is not needed. If PTP is disabled, it is
> disabled, nobody should be surprised by that. Looks good otherwise.

This works, and it should compile away to nothing in the normal case.
But it does seem unfortunate that this can't happen at build time via
Kconfig.  CCing linux-kbuild in case someone has an idea for how to fix
this.

^ permalink raw reply

* Re: [PATCH net v2] qed: Fix stack corruption on probe
From: Mintz, Yuval @ 2016-09-19 17:24 UTC (permalink / raw)
  To: David Laight, 'Yuval Mintz', davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0103817@AcuExch.aculab.com>

> > Commit fe56b9e6a8d95 ("qed: Add module with basic common support")
> > has introduced a stack corruption during probe, where filling a
> > local struct with data to be sent to management firmware is incorrectly
> > filled; The data is written outside of the struct and corrupts
> > the stack.
> > 
> > @@ -1153,8 +1153,8 @@ qed_mcp_send_drv_version(struct qed_hwfn *p_hwfn,
> >        p_drv_version = &union_data.drv_version;
> >        p_drv_version->version = p_ver->version;
> > 
> > -     for (i = 0; i < MCP_DRV_VER_STR_SIZE - 1; i += 4) {
> > -             val = cpu_to_be32(p_ver->name[i]);
> > +     for (i = 0; i < (MCP_DRV_VER_STR_SIZE - 4) / sizeof(u32); i++) {
> > +             val = cpu_to_be32(*((u32 *)&p_ver->name[i * sizeof(u32)]));
> >                *(__be32 *)&p_drv_version->name[i * sizeof(u32)] = val;
> >        }
> 
> Or the much simpler:
>      /* Byteswap driver name */
>         for (i = 0; i < MCP_DRV_VER_STR_SIZE; i++)
>                 p_drv_version->name[i ^ 3] = p_ver->name[i];

You're suggesting an unconditional swap; This will break
the message on a big endian machine [we'll swap needlessly].
    

^ permalink raw reply

* Re: [PATCH 1/2] ptp_clock: allow for it to be optional
From: Jiri Benc @ 2016-09-19 17:04 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: John Stultz, Thomas Gleixner, Richard Cochran, Josh Triplett,
	netdev, linux-kernel
In-Reply-To: <alpine.LFD.2.20.1609190942230.9311@knanqh.ubzr>

On Mon, 19 Sep 2016 10:10:21 -0400 (EDT), Nicolas Pitre wrote:
> --- a/include/linux/ptp_clock_kernel.h
> +++ b/include/linux/ptp_clock_kernel.h
> @@ -207,7 +207,16 @@ int ptp_find_pin(struct ptp_clock *ptp,
>  #else
>  static inline struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
>  						   struct device *parent)
> -{ return NULL; }
> +{
> +	if (IS_MODULE(CONFIG_PTP_1588_CLOCK)) {
> +		pr_warn("%s is built-in while PTP clock subsystem is modular, "
> +			"PTP clock ignored\n", KBUILD_MODNAME);
> +	} else {
> +		pr_warn("ignoring PTP clock from %s as PTP clock subsystem "
> +			"is configured out\n", KBUILD_MODNAME);
> +	}
> +	return NULL;
> +}

I think the else part is not needed. If PTP is disabled, it is
disabled, nobody should be surprised by that. Looks good otherwise.

Thanks,

 Jiri

^ permalink raw reply

* RE: [PATCH 1/1] ixgbe: replace defined with IS_ENABLED
From: Tantilov, Emil S @ 2016-09-19 17:00 UTC (permalink / raw)
  To: zyjzyj2000@gmail.com, netdev@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, Kirsher, Jeffrey T,
	e1000-devel@lists.sourceforge.net
In-Reply-To: <1473866814-21374-1-git-send-email-zyjzyj2000@gmail.com>

>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of zyjzyj2000@gmail.com
>Sent: Wednesday, September 14, 2016 8:27 AM
>To: netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org; Kirsher,
>Jeffrey T <jeffrey.t.kirsher@intel.com>; zyjzyj2000@gmail.com; e1000-
>devel@lists.sourceforge.net
>Subject: [PATCH 1/1] ixgbe: replace defined with IS_ENABLED
>
>From: Zhu Yanjun <zyjzyj2000@gmail.com>
>
>Replace defined macro with IS_ENABLED in ixgbe.h file
>
>Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
>---
> drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>index 9475ff9..f8bc1d0 100644
>--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>@@ -45,7 +45,7 @@
> #include "ixgbe_type.h"
> #include "ixgbe_common.h"
> #include "ixgbe_dcb.h"
>-#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
>+#if IS_ENABLED(CONFIG_FCOE)
> #define IXGBE_FCOE
> #include "ixgbe_fcoe.h"
> #endif /* CONFIG_FCOE or CONFIG_FCOE_MODULE */
>--
>2.7.4

Already taken care of by the following patch:
http://patchwork.ozlabs.org/patch/668894/

Thanks,
Emil
 

^ permalink raw reply

* RE: [PATCH] ixgbe: mark symbols static where possible
From: Tantilov, Emil S @ 2016-09-19 16:57 UTC (permalink / raw)
  To: Baoyou Xie, Kirsher, Jeffrey T, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	arnd@arndb.de, xie.baoyou@zte.com.cn
In-Reply-To: <1474188454-8872-1-git-send-email-baoyou.xie@linaro.org>


>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Baoyou Xie
>Sent: Sunday, September 18, 2016 1:48 AM
>To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; intel-wired-
>lan@lists.osuosl.org
>Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; arnd@arndb.de;
>baoyou.xie@linaro.org; xie.baoyou@zte.com.cn
>Subject: [PATCH] ixgbe: mark symbols static where possible
>
>We get 2 warnings when building kernel with W=1:
>drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c:2128:5: warning: no previous
>prototype for 'ixgbe_led_on_t_x550em' [-Wmissing-prototypes]
>drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c:2150:5: warning: no previous
>prototype for 'ixgbe_led_off_t_x550em' [-Wmissing-prototypes]

Already taken care of:
http://patchwork.ozlabs.org/patch/661647/

Thanks,
Emil

^ permalink raw reply

* [PATCH v6 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups
From: Daniel Mack @ 2016-09-19 16:44 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
	Daniel Mack
In-Reply-To: <1474303441-3745-1-git-send-email-daniel@zonque.org>

Add a simple userpace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  21 ++++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_bpf_fd = prog_fd,
+		.attach_type = type,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_type = type,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
 	union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, int insn_len,
 		  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 0000000..19e4ec0
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+enum {
+	MAP_KEY_PACKETS,
+	MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* save r6 so it's not clobbered by BPF_CALL */
+
+		/* Count packets */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_PACKETS), /* r0 = 0 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* load map fd to r1 */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		/* Count bytes */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_BYTES), /* r0 = 1 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct __sk_buff, len)), /* r1 = skb->len */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCKET,
+			     prog, sizeof(prog), "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+	printf("Usage: %s <cg-path> <egress|ingress> [drop]\n", argv0);
+	return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+	int cg_fd, map_fd, prog_fd, key, ret;
+	long long pkt_cnt, byte_cnt;
+	enum bpf_attach_type type;
+	int verdict = 1;
+
+	if (argc < 3)
+		return usage(argv[0]);
+
+	if (strcmp(argv[2], "ingress") == 0)
+		type = BPF_CGROUP_INET_INGRESS;
+	else if (strcmp(argv[2], "egress") == 0)
+		type = BPF_CGROUP_INET_EGRESS;
+	else
+		return usage(argv[0]);
+
+	if (argc > 3 && strcmp(argv[3], "drop") == 0)
+		verdict = 0;
+
+	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY,
+				sizeof(key), sizeof(byte_cnt),
+				256, 0);
+	if (map_fd < 0) {
+		printf("Failed to create map: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	prog_fd = prog_load(map_fd, verdict);
+	printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+	if (prog_fd < 0) {
+		printf("Failed to load prog: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = bpf_prog_detach(cg_fd, type);
+	printf("bpf_prog_detach() returned '%s' (%d)\n", strerror(errno), errno);
+
+	ret = bpf_prog_attach(prog_fd, cg_fd, type);
+	if (ret < 0) {
+		printf("Failed to attach prog to cgroup: '%s'\n",
+		       strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	while (1) {
+		key = MAP_KEY_PACKETS;
+		assert(bpf_lookup_elem(map_fd, &key, &pkt_cnt) == 0);
+
+		key = MAP_KEY_BYTES;
+		assert(bpf_lookup_elem(map_fd, &key, &byte_cnt) == 0);
+
+		printf("cgroup received %lld packets, %lld bytes\n",
+		       pkt_cnt, byte_cnt);
+		sleep(1);
+	}
+
+	return EXIT_SUCCESS;
+}
-- 
2.5.5

^ permalink raw reply related

* [PATCH v6 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
From: Daniel Mack @ 2016-09-19 16:43 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
	Daniel Mack
In-Reply-To: <1474303441-3745-1-git-send-email-daniel@zonque.org>

Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped automically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/bpf.h |  8 +++++
 kernel/bpf/syscall.c     | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 55f815e..7cd3616 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
 	BPF_PROG_LOAD,
 	BPF_OBJ_PIN,
 	BPF_OBJ_GET,
+	BPF_PROG_ATTACH,
+	BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
 		__aligned_u64	pathname;
 		__u32		bpf_fd;
 	};
+
+	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+		__u32		target_fd;	/* container object to attach to */
+		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		__u32		attach_type;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1a8592a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
 	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_ATTACH))
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_CGROUP_INET_INGRESS:
+	case BPF_CGROUP_INET_EGRESS:
+		prog = bpf_prog_get_type(attr->attach_bpf_fd,
+					 BPF_PROG_TYPE_CGROUP_SOCKET);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp)) {
+			bpf_prog_put(prog);
+			return PTR_ERR(cgrp);
+		}
+
+		cgroup_bpf_update(cgrp, prog, attr->attach_type);
+		cgroup_put(cgrp);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+	struct cgroup *cgrp;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_DETACH))
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_CGROUP_INET_INGRESS:
+	case BPF_CGROUP_INET_EGRESS:
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp))
+			return PTR_ERR(cgrp);
+
+		cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+		cgroup_put(cgrp);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
+
+#ifdef CONFIG_CGROUP_BPF
+	case BPF_PROG_ATTACH:
+		err = bpf_prog_attach(&attr);
+		break;
+	case BPF_PROG_DETACH:
+		err = bpf_prog_detach(&attr);
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 		break;
-- 
2.5.5

^ permalink raw reply related

* Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes
From: Florian Fainelli @ 2016-09-19 16:44 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, eladr, yotamg, nogahf, ogerlitz, roopa,
	nikolay, linville, tgraf, gospo, sfeldma, ast, edumazet, hannes,
	dsa, jhs, vivien.didelot, john.fastabend, andrew, ivecera
In-Reply-To: <20160919060854.GB1846@nanopsycho.orion>

On 09/18/2016 11:08 PM, Jiri Pirko wrote:
> Sun, Sep 18, 2016 at 10:00:44PM CEST, f.fainelli@gmail.com wrote:
>> Le 06/09/2016 à 05:01, Jiri Pirko a écrit :
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
>>> This is RFC, unfinished. I came across some issues in the process so I would
>>> like to share those and restart the fib offload discussion in order to make it
>>> really usable.
>>>
>>> So the goal of this patchset is to allow driver to propagate all prefixes
>>> configured in kernel down HW. This is necessary for routing to work
>>> as expected. If we don't do that HW might forward prefixes known to kernel
>>> incorrectly. Take an example when default route is set in switch HW and there
>>> is an IP address set on a management (non-switch) port.
>>>
>>> Currently, only fibs related to the switch port netdev are offloaded using
>>> switchdev ops. This model is not extendable so the first patch introduces
>>> a replacement: notifier to propagate fib additions and removals to whoever
>>> interested. The second patch makes mlxsw to adopt this new way, registering
>>> one notifier block for each mlxsw (asic) instance.
>>
>> Instead of introducing another specialization of a notifier_block
>> implementation, could we somehow have a kernel-based netlink listener
>> which receives the same kind of event information from rtmsg_fib()?
> 
> rtmsg_fib destination is userspace. The message format is netlink. I
> don't think it is wise to pass netlink messages inside kernel when we
> can pass nice structures.

True, which does not mean you cannot have a small piece of kernel code
that listens for netlink events and unmarshalls what you want to have
your drivers operate on a structure-based message. The point I am
getting it is that rtmsg_* exists all over the place, and with close to
zero modification in the networking stack you can get the notification
you want if you hook a netlink listener in kernel space. Just like the
notifier you are proposing, the interface drivers could work with can be
structure (and not netlink message) based. An argument could be made
that with the notifier approach and hooking where necessary you have
finer control of message delivery, especially for managing overhead and
potential subscribers to these messages (it is easier to compile out the
notifier code for what you are interested in).


> Lower overhead. This is how it is done in the
> rest of kernel. Not sure how your comment is related specifically to
> this patch.

This is not directly related to your patch, it just made me think about
it as I was considering adding a notifier for other purposes (turns out
we may be able to do differently, without one).
-- 
Florian

^ permalink raw reply

* [PATCH v6 1/6] bpf: add new prog type for cgroup socket filtering
From: Daniel Mack @ 2016-09-19 16:43 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
	Daniel Mack
In-Reply-To: <1474303441-3745-1-git-send-email-daniel@zonque.org>

This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/bpf.h |  9 +++++++++
 net/core/filter.c        | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f896dfa..55f815e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_TRACEPOINT,
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
+	BPF_PROG_TYPE_CGROUP_SOCKET,
 };
 
+enum bpf_attach_type {
+	BPF_CGROUP_INET_INGRESS,
+	BPF_CGROUP_INET_EGRESS,
+	__MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD	1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 298b146..e46c98e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2496,6 +2496,17 @@ xdp_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+cg_sk_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	default:
+		return sk_filter_func_proto(func_id);
+	}
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2818,6 +2829,12 @@ static const struct bpf_verifier_ops xdp_ops = {
 	.convert_ctx_access	= xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_sk_ops = {
+	.get_func_proto		= cg_sk_func_proto,
+	.is_valid_access	= sk_filter_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2838,12 +2855,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_sk_type __read_mostly = {
+	.ops	= &cg_sk_ops,
+	.type	= BPF_PROG_TYPE_CGROUP_SOCKET,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
+	bpf_register_prog_type(&cg_sk_type);
 
 	return 0;
 }
-- 
2.5.5

^ permalink raw reply related

* [PATCH v6 0/6] Add eBPF hooks for cgroups
From: Daniel Mack @ 2016-09-19 16:43 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
	Daniel Mack

This is v6 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic also
allows to be extendeded for other cgroup based eBPF logic.

Changes from v5:

* The eBPF programs now operate on L3 rather than on L2 of the packets,
  and the egress hooks were moved from __dev_queue_xmit() to
  ip*_output().

* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
  through BPF_LD_[ABS|IND] instructions, but hook up the
  bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
  for the help.

Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.

Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather that at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgroups@vger.kernel.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adoptions not worth mentioning in particular.

Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.

Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.

As always, feedback is much appreciated.

Thanks,
Daniel

Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
    cgroups

 include/linux/bpf-cgroup.h      |  71 +++++++++++++++++
 include/linux/cgroup-defs.h     |   4 +
 include/uapi/linux/bpf.h        |  17 ++++
 init/Kconfig                    |  12 +++
 kernel/bpf/Makefile             |   1 +
 kernel/bpf/cgroup.c             | 166 ++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c            |  81 ++++++++++++++++++++
 kernel/cgroup.c                 |  18 +++++
 net/core/filter.c               |  27 +++++++
 net/ipv4/ip_output.c            |  15 ++++
 net/ipv6/ip6_output.c           |   8 ++
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  21 +++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++++++++++++++++++++++++++++++++++
 15 files changed, 593 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.5.5

^ permalink raw reply

* [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Daniel Mack @ 2016-09-19 16:44 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack
In-Reply-To: <1474303441-3745-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
---
 net/ipv4/ip_output.c  | 15 +++++++++++++++
 net/ipv6/ip6_output.c |  8 ++++++++
 2 files changed, 23 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 05d1058..3ca3d7a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include <net/checksum.h>
 #include <net/inetpeer.h>
 #include <net/lwtunnel.h>
+#include <linux/bpf-cgroup.h>
 #include <linux/igmp.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_bridge.h>
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct rtable *rt = skb_rtable(skb);
 	struct net_device *dev = rt->dst.dev;
+	int ret;
 
 	/*
 	 *	If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	/*
 	 *	Multicasts are looped back for other local users
 	 */
@@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
+	int ret;
 
 	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..5dc90aa 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 
+#include <linux/bpf-cgroup.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv6.h>
 
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+	int ret;
 
 	if (unlikely(idev->cnf.disable_ipv6)) {
 		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
+	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip6_finish_output,
-- 
2.5.5

^ permalink raw reply related

* [PATCH v6 4/6] net: filter: run cgroup eBPF ingress programs
From: Daniel Mack @ 2016-09-19 16:43 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack
In-Reply-To: <1474303441-3745-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
---
 net/core/filter.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index e46c98e..ce6e527 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
 		return -ENOMEM;
 
+	err = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_INGRESS);
+	if (err)
+		return err;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;
-- 
2.5.5

^ permalink raw reply related

* [PATCH v6 2/6] cgroup: add support for eBPF programs
From: Daniel Mack @ 2016-09-19 16:43 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack
In-Reply-To: <1474303441-3745-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
---
 include/linux/bpf-cgroup.h  |  71 +++++++++++++++++++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig                |  12 ++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/cgroup.c         | 166 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup.c             |  18 +++++
 6 files changed, 272 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 0000000..fc076de
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,71 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include <linux/bpf.h>
+#include <linux/jump_label.h>
+#include <uapi/linux/bpf.h>
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+	/*
+	 * Store two sets of bpf_prog pointers, one for programs that are
+	 * pinned directly to this cgroup, and one for those that are effective
+	 * when this cgroup is accessed.
+	 */
+	struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+	struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	if (cgroup_bpf_enabled)
+		return __cgroup_bpf_run_filter(sk, skb, type);
+
+	return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+				      struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/workqueue.h>
+#include <linux/bpf-cgroup.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
+	/* used to store eBPF programs */
+	struct cgroup_bpf bpf;
+
 	/* ids of the ancestors at each level including self */
 	int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..71c71b0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,18 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config CGROUP_BPF
+	bool "Support for eBPF programs attached to cgroups"
+	depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+	help
+	  Allow attaching eBPF programs to a cgroup using the bpf(2)
+	  syscall command BPF_PROG_ATTACH.
+
+	  In which context these programs are accessed depends on the type
+	  of attachment. For instance, programs that are attached using
+	  BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+	  inet sockets.
+
 config CGROUP_DEBUG
 	bool "Example controller"
 	default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
+obj-$(CONFIG_CGROUP_BPF) += cgroup.o
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
new file mode 100644
index 0000000..81b96a4
--- /dev/null
+++ b/kernel/bpf/cgroup.c
@@ -0,0 +1,166 @@
+/*
+ * Functions to manage eBPF programs attached to cgroups
+ *
+ * Copyright (c) 2016 Daniel Mack
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/bpf-cgroup.h>
+#include <net/sock.h>
+
+DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
+EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+
+/**
+ * cgroup_bpf_put() - put references of all bpf programs
+ * @cgrp: the cgroup to modify
+ */
+void cgroup_bpf_put(struct cgroup *cgrp)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
+		struct bpf_prog *prog = cgrp->bpf.prog[type];
+
+		if (prog) {
+			bpf_prog_put(prog);
+			static_branch_dec(&cgroup_bpf_enabled_key);
+		}
+	}
+}
+
+/**
+ * cgroup_bpf_inherit() - inherit effective programs from parent
+ * @cgrp: the cgroup to modify
+ * @parent: the parent to inherit from
+ */
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
+		struct bpf_prog *e;
+
+		e = rcu_dereference_protected(parent->bpf.effective[type],
+					      lockdep_is_held(&cgroup_mutex));
+		rcu_assign_pointer(cgrp->bpf.effective[type], e);
+	}
+}
+
+/**
+ * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
+ *                         propagate the change to descendants
+ * @cgrp: The cgroup which descendants to traverse
+ * @parent: The parent of @cgrp, or %NULL if @cgrp is the root
+ * @prog: A new program to pin
+ * @type: Type of pinning operation (ingress/egress)
+ *
+ * Each cgroup has a set of two pointers for bpf programs; one for eBPF
+ * programs it owns, and which is effective for execution.
+ *
+ * If @prog is %NULL, this function attaches a new program to the cgroup and
+ * releases the one that is currently attached, if any. @prog is then made
+ * the effective program of type @type in that cgroup.
+ *
+ * If @prog is %NULL, the currently attached program of type @type is released,
+ * and the effective program of the parent cgroup (if any) is inherited to
+ * @cgrp.
+ *
+ * Then, the descendants of @cgrp are walked and the effective program for
+ * each of them is set to the effective program of @cgrp unless the
+ * descendant has its own program attached, in which case the subbranch is
+ * skipped. This ensures that delegated subcgroups with own programs are left
+ * untouched.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type)
+{
+	struct bpf_prog *old_prog, *effective;
+	struct cgroup_subsys_state *pos;
+
+	old_prog = xchg(cgrp->bpf.prog + type, prog);
+
+	effective = (!prog && parent) ?
+		rcu_dereference_protected(parent->bpf.effective[type],
+					  lockdep_is_held(&cgroup_mutex)) :
+		prog;
+
+	css_for_each_descendant_pre(pos, &cgrp->self) {
+		struct cgroup *desc = container_of(pos, struct cgroup, self);
+
+		/* skip the subtree if the descendant has its own program */
+		if (desc->bpf.prog[type] && desc != cgrp)
+			pos = css_rightmost_descendant(pos);
+		else
+			rcu_assign_pointer(desc->bpf.effective[type],
+					   effective);
+	}
+
+	if (prog)
+		static_branch_inc(&cgroup_bpf_enabled_key);
+
+	if (old_prog) {
+		bpf_prog_put(old_prog);
+		static_branch_dec(&cgroup_bpf_enabled_key);
+	}
+}
+
+/**
+ * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * @sk: The socken sending or receiving traffic
+ * @skb: The skb that is being sent or received
+ * @type: The type of program to be exectuted
+ *
+ * If no socket is passed, or the socket is not of type INET or INET6,
+ * this function does nothing and returns 0.
+ *
+ * The program type passed in via @type must be suitable for network
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if any if an attached program was found
+ * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret = 0;
+
+	if (!sk || !sk_fullsock(sk))
+		return 0;
+
+	if (sk->sk_family != AF_INET &&
+	    sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog) {
+		unsigned int offset = skb->data - skb_network_header(skb);
+
+		__skb_push(skb, offset);
+		ret = bpf_prog_run_clear_cb(prog, skb) == 1 ? 0 : -EPERM;
+		__skb_pull(skb, offset);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d1c51b7..57ade89 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5038,6 +5038,8 @@ static void css_release_work_fn(struct work_struct *work)
 		if (cgrp->kn)
 			RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv,
 					 NULL);
+
+		cgroup_bpf_put(cgrp);
 	}
 
 	mutex_unlock(&cgroup_mutex);
@@ -5245,6 +5247,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (!cgroup_on_dfl(cgrp))
 		cgrp->subtree_control = cgroup_control(cgrp);
 
+	if (parent)
+		cgroup_bpf_inherit(cgrp, parent);
+
 	cgroup_propagate_control(cgrp);
 
 	/* @cgrp doesn't have dir yet so the following will only create csses */
@@ -6417,6 +6422,19 @@ static __init int cgroup_namespaces_init(void)
 }
 subsys_initcall(cgroup_namespaces_init);
 
+#ifdef CONFIG_CGROUP_BPF
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+
+	mutex_lock(&cgroup_mutex);
+	__cgroup_bpf_update(cgrp, parent, prog, type);
+	mutex_unlock(&cgroup_mutex);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)
-- 
2.5.5

^ permalink raw reply related

* RE: [PATCH net v2] qed: Fix stack corruption on probe
From: David Laight @ 2016-09-19 16:37 UTC (permalink / raw)
  To: 'Yuval Mintz', davem@davemloft.net,
	netdev@vger.kernel.org
  Cc: Yuval Mintz
In-Reply-To: <1474296461-22558-1-git-send-email-Yuval.Mintz@qlogic.com>

From: Yuval Mintz 
> Sent: 19 September 2016 15:48
> Commit fe56b9e6a8d95 ("qed: Add module with basic common support")
> has introduced a stack corruption during probe, where filling a
> local struct with data to be sent to management firmware is incorrectly
> filled; The data is written outside of the struct and corrupts
> the stack.
> 
> Changes from v1:
> ----------------
>  - Correct the value written [Caught by David Laight]
> 
> Fixes: fe56b9e6a8d95 ("qed: Add module with basic common support")
> Signed-off-by: Yuval Mintz <Yuval.Mintz@caviumnetworks.com>
> ---
> Hi Dave,
> 
> Please consider applying this to `net'.
> 
> Thanks,
> Yuval
> ---
>  drivers/net/ethernet/qlogic/qed/qed_mcp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> index a240f26..59c81ce 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
> @@ -1153,8 +1153,8 @@ qed_mcp_send_drv_version(struct qed_hwfn *p_hwfn,
>  	p_drv_version = &union_data.drv_version;
>  	p_drv_version->version = p_ver->version;
> 
> -	for (i = 0; i < MCP_DRV_VER_STR_SIZE - 1; i += 4) {
> -		val = cpu_to_be32(p_ver->name[i]);
> +	for (i = 0; i < (MCP_DRV_VER_STR_SIZE - 4) / sizeof(u32); i++) {
> +		val = cpu_to_be32(*((u32 *)&p_ver->name[i * sizeof(u32)]));
>  		*(__be32 *)&p_drv_version->name[i * sizeof(u32)] = val;
>  	}

Or the much simpler:
	/* Byteswap driver name */
	for (i = 0; i < MCP_DRV_VER_STR_SIZE; i++)
		p_drv_version->name[i ^ 3] = p_ver->name[i];

	David

^ permalink raw reply

* Re: [PATCH net-next 05/10] bnxt_en: Fix ethtool -l|-L inconsistent channel counts.
From: Michael Chan @ 2016-09-19 16:39 UTC (permalink / raw)
  To: Mintz, Yuval; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <BL2PR07MB2306246BCB25E3C323A3F3A88DF40@BL2PR07MB2306.namprd07.prod.outlook.com>

On Mon, Sep 19, 2016 at 2:51 AM, Mintz, Yuval <Yuval.Mintz@cavium.com> wrote:

>>
>> If the user chooses 2 rx and 1 tx, he will use 3 msix vectors (3 completion rings,
>> etc).  If the user chooses 2 combined (1 with rx/tx,
>> 1 with rx only), he will use 2 msix vectors (2 completion rings, etc).
>> With a large number of NPAR functions and SRIOV functions, the number of
>> rings available per function may not be symmetrical.  We just want maximum
>> flexibility to make use of all available resources in 2 different modes (one that
>> uses more resources and one that uses less).
>
> Sounds like the user should chose 1 combined in 1 rx in this scenario.

We only support all combined or all rx/tx.  Supporting arbitrary
combination of the 2 will add a lot more complication to the driver
code.  I don't think it will add any value to the user either.  In the
end, it's all the same and the user sees a simpler interface.  If
combined is chosen, the driver logic will combine all available rx/tx
ring pairs.  Any remaining rings will still be used to reach the
desired combined count.

^ permalink raw reply

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
From: Daniel Mack @ 2016-09-19 16:34 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20160916195728.GA14736-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>

Hi,

On 09/16/2016 09:57 PM, Sargun Dhillon wrote:
> On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:

>> I have no idea what makes you think this is limited to systemd. As I
>> said, I provided an example for userspace that works from the command
>> line. The same limitation apply as for all other users of cgroups.
>>
> So, at least in my work, we have Mesos, but on nearly every machine that Mesos 
> runs, people also have systemd. Now, there's recently become a bit of a battle 
> of ownership of things like cgroups on these machines. We can usually solve it 
> by nesting under systemd cgroups, and thus so far we've avoided making too many 
> systemd-specific concessions.
> 
> The reason this works (mostly), is because everything we touch has a sense of 
> nesting, where we can apply policy at a place lower in the hierarchy, and yet 
> systemd's monitoring and policy still stays in place. 
> 
> Now, with this patch, we don't have that, but I think we can reasonably add some 
> flag like "no override" when applying policies, or alternatively something like 
> "no new privileges", to prevent children from applying policies that override 
> top-level policy.

Yes, but the API is already guarded by CAP_NET_ADMIN. Take that
capability away from your children, and they can't tamper with the
policy. Does that work for you?

> I realize there is a speed concern as well, but I think for 
> people who want nested policy, we're willing to make the tradeoff. The cost
> of traversing a few extra pointers still outweighs the overhead of network
> namespaces, iptables, etc.. for many of us. 

Not sure. Have you tried it?

> What do you think Daniel?

I think we should look at an implementation once we really need it, and
then revisit the performance impact. In any case, this can be changed
under the hood, without touching the userspace API (except for adding
flags if we need them).

>> Not necessarily. You can as well do it the inetd way, and pass the
>> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
>> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
>> is not transparent to the launched binary at all. The proposed cgroup
>> eBPF solution implements a very similar behavior in that regard.
>
> It would be nice to be able to see whether or not a filter is attached to a 
> cgroup, but given this is going through syscalls, at least introspection
> is possible as opposed to something like netlink.

Sure, there are many ways. I implemented the bpf cgroup logic using an
own cgroup controller once, which made it possible to read out the
status. But as we agreed on attaching programs through the bpf(2) system
call, I moved back to the implementation that directly stores the
pointers in the cgroup.

First enabling the controller through the fs-backed cgroup interface,
then come back through the bpf(2) syscall and then go back to the fs
interface to read out status values is a bit weird.

>> And FWIW, I agree with Thomas - there is nothing wrong with having
>> multiple options to use for such use-cases.
>
> Right now, for containers, we have netfilter and network namespaces.
> There's a lot of performance overhead that comes with this.

Out of curiosity: Could you express that in numbers? And how exactly are
you testing?

> Not only
> that, but iptables doesn't really have a simple way of usage by
> automated infrastructure. We (firewalld, systemd, dockerd, mesos)
> end up fighting with one another for ownership over firewall rules.

Yes, that's a common problem.

> Although, I have problems with this approach, I think that it's
> a good baseline where we can have top level owned by systemd,
> docker underneath that, and Mesos underneath that. We can add
> additional hooks for things like Checmate and Landlock, and
> with a little more work, we can do compositition, solving
> all of our problems.

It is supposed to be just a baseline, yes.


Thanks for your feedback,
Daniel

^ permalink raw reply

* Re: [PATCH v2 1/2] net: skbuff: Remove errornous length validation in skb_vlan_pop()
From: Shmulik Ladkani @ 2016-09-19 16:28 UTC (permalink / raw)
  To: David S . Miller; +Cc: Jiri Pirko, Daniel Borkmann, netdev, Pravin Shelar
In-Reply-To: <1474300400-32362-1-git-send-email-shmulik.ladkani@gmail.com>

On Mon, 19 Sep 2016 18:53:19 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
> It is superfluous since inner '__skb_vlan_pop' already verifies there
> are VLAN_ETH_HLEN writable bytes at the mac_header.

Please disregard, will send a v3.

^ permalink raw reply

* Re: [PATCH v2 2/2] net: skbuff: Coding: Use eth_type_vlan() instead of open coding it
From: Shmulik Ladkani @ 2016-09-19 16:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, Jiri Pirko, Daniel Borkmann, netdev
In-Reply-To: <1474302055.22679.89.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, 19 Sep 2016 09:20:55 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-09-19 at 18:53 +0300, Shmulik Ladkani wrote:
> > Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
> > skb->protocol to ETH_P_8021Q or ETH_P_8021AD.
> > 
> > Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> > ---  
> 
> You forgot to explicitly state this was targeting net-next tree.

Thanks, was an oversight.
Requires resubmission?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox