Netdev List


* [PATCH net 2/2] Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
From: Eli Cooper @ 2016-11-29  2:35 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet
In-Reply-To: <20161129023529.17645-1-elicooper@gmx.com>

This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d
("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").

skb->protocol is now set in ip_local_out() and ip6_local_out() right before
dst_output() is called. It is no longer necessary to do it in each tunnel.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
 net/ipv6/ip6_tunnel.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0a4759b..d76674e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1181,7 +1181,6 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	if (err)
 		return err;

-	skb->protocol = htons(ETH_P_IPV6);
 	skb_push(skb, sizeof(struct ipv6hdr));
 	skb_reset_network_header(skb);
 	ipv6h = ipv6_hdr(skb);
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 1/2] Set skb->protocol properly before calling dst_output()
From: Eli Cooper @ 2016-11-29  2:35 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet

When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol properly before dst_output() is called,
fixing a bug where GSO packets sent through a sit or ipip6 tunnel are
dropped when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
 net/ipv4/ip_output.c   | 4 +++-
 net/ipv6/output_core.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 105908d..0180e44 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -117,8 +117,10 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 	int err;
 
 	err = __ip_local_out(net, sk, skb);
-	if (likely(err == 1))
+	if (likely(err == 1)) {
+		skb->protocol = htons(ETH_P_IP);
 		err = dst_output(net, sk, skb);
+	}
 
 	return err;
 }
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 7cca8ac..d6e850d 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -166,8 +166,10 @@ int ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 	int err;
 
 	err = __ip6_local_out(net, sk, skb);
-	if (likely(err == 1))
+	if (likely(err == 1)) {
+		skb->protocol = htons(ETH_P_IPV6);
 		err = dst_output(net, sk, skb);
+	}
 
 	return err;
 }
-- 
2.10.2

^ permalink raw reply related

* [PATCH v2 net-next 1/2] openvswitch: Add a missing break statement.
From: Jarno Rajahalme @ 2016-11-29  2:41 UTC (permalink / raw)
  To: netdev; +Cc: jbenc, jarno

Add a break statement to prevent fall-through from
OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL.  Without the break,
actions that set Ethernet addresses fail to validate, with log messages
complaining about invalid tunnel attributes.
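
The effect of the missing break can be reproduced in a small user-space sketch. Everything below is hypothetical (the enum, `validate_tunnel()` and both validate functions are simplified stand-ins, not the openvswitch code); it only shows how a valid Ethernet attribute ends up being judged by the tunnel checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum key_attr { KEY_ATTR_ETHERNET, KEY_ATTR_TUNNEL };

/* Hypothetical tunnel check: succeeds only for real tunnel payloads. */
static int validate_tunnel(bool is_tunnel_payload)
{
	return is_tunnel_payload ? 0 : -EINVAL;
}

/* Without the break, a valid Ethernet attribute is also run
 * through the tunnel validation and rejected there. */
int validate_set_buggy(enum key_attr attr, bool mac_is_ethernet,
		       bool is_tunnel_payload)
{
	switch (attr) {
	case KEY_ATTR_ETHERNET:
		if (!mac_is_ethernet)
			return -EINVAL;
		/* falls through - this is the missing break */
	case KEY_ATTR_TUNNEL:
		return validate_tunnel(is_tunnel_payload);
	}
	return 0;
}

/* With the break, each attribute is checked only by its own case. */
int validate_set_fixed(enum key_attr attr, bool mac_is_ethernet,
		       bool is_tunnel_payload)
{
	switch (attr) {
	case KEY_ATTR_ETHERNET:
		if (!mac_is_ethernet)
			return -EINVAL;
		break;
	case KEY_ATTR_TUNNEL:
		return validate_tunnel(is_tunnel_payload);
	}
	return 0;
}
```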

Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
---
v2: No change.

 net/openvswitch/flow_netlink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d19044f..c87d359 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a,
 	case OVS_KEY_ATTR_ETHERNET:
 		if (mac_proto != MAC_PROTO_ETHERNET)
 			return -EINVAL;
+		break;
 
 	case OVS_KEY_ATTR_TUNNEL:
 		if (masked)
-- 
2.1.4

^ permalink raw reply related

* [PATCH v2 net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
From: Jarno Rajahalme @ 2016-11-29  2:41 UTC (permalink / raw)
  To: netdev; +Cc: jbenc, jarno
In-Reply-To: <1480387276-123557-1-git-send-email-jarno@ovn.org>

Do not set skb->protocol to be the ethertype of the L3 header, unless
the packet only has the L3 header.  For a non-hardware offloaded VLAN
frame skb->protocol needs to be one of the VLAN ethertypes.

Any VLAN offloading is undone on the OVS netlink interface.  Also any
VLAN tags added by userspace are non-offloaded.

An incorrect skb->protocol value on a full-size non-offloaded VLAN skb
causes the packet to be dropped by the MTU check in ovs_vport_send(),
because the VLAN header should not be counted against the MTU.
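
A simplified user-space model of the two pieces involved may help: choosing the protocol value for a frame, and discounting the VLAN header when measuring the L3 length. The helper names, the `HDR_*` and `TYPE_*` constants, and the use of host byte order are assumptions made for this sketch; the real checks live in the kernel and are only approximated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_ETH_LEN  14     /* Ethernet header length */
#define HDR_VLAN_LEN 4      /* one VLAN tag */
#define TYPE_8021Q   0x8100 /* VLAN ethertypes, host byte order here */
#define TYPE_8021AD  0x88A8
#define TYPE_IPV4    0x0800

static bool is_vlan_type(uint16_t type)
{
	return type == TYPE_8021Q || type == TYPE_8021AD;
}

/* Rule described in the patch: a VLAN tag that is still in the
 * payload (not offloaded) means the protocol is the VLAN TPID;
 * otherwise it is the ethertype of the L3 payload. */
uint16_t pick_protocol(bool vlan_in_key, bool vlan_offloaded,
		       uint16_t tpid, uint16_t eth_type)
{
	if (vlan_in_key && !vlan_offloaded)
		return tpid;
	return eth_type;
}

/* Length compared against the MTU in this model: the VLAN header
 * is only discounted when the protocol says the frame carries one. */
unsigned int l3_length(unsigned int frame_len, uint16_t protocol)
{
	assert(frame_len >= HDR_ETH_LEN);
	unsigned int len = frame_len - HDR_ETH_LEN;

	if (is_vlan_type(protocol)) {
		assert(len >= HDR_VLAN_LEN);
		len -= HDR_VLAN_LEN;
	}
	return len;
}
```

In this model a 1518-byte tagged frame measures 1500 bytes when the protocol is the VLAN TPID, but 1504 bytes when the protocol is wrongly set to the L3 ethertype, which would exceed a 1500-byte MTU.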

Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
v2: Set skb->protocol when an ETH_P_TEB frame is received via ARPHRD_NONE
    interface.

 net/openvswitch/datapath.c |  1 -
 net/openvswitch/flow.c     | 30 ++++++++++++++++++++++--------
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 2d4c4d3..9c62b63 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -606,7 +606,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	rcu_assign_pointer(flow->sf_acts, acts);
 	packet->priority = flow->key.phy.priority;
 	packet->mark = flow->key.phy.skb_mark;
-	packet->protocol = flow->key.eth.type;
 
 	rcu_read_lock();
 	dp = get_dp_rcu(net, ovs_header->dp_ifindex);
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 08aa926..b9aae99 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -477,12 +477,17 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
 }
 
 /**
- * key_extract - extracts a flow key from an Ethernet frame.
+ * key_extract - extracts a flow key from a packet with or without an
+ * Ethernet header.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
- * Ethernet header
+ * beginning of the packet.
  * @key: output flow key
  *
- * The caller must ensure that skb->len >= ETH_HLEN.
+ * 'key->mac_proto' must be initialized to indicate the frame type.
+ * For an L3 frame 'key->mac_proto' must equal 'MAC_PROTO_NONE', and the
+ * caller must ensure that 'skb->protocol' is set to the ethertype of the L3
+ * header.  Otherwise the presence of an Ethernet header is assumed and
+ * the caller must ensure that skb->len >= ETH_HLEN.
  *
  * Returns 0 if successful, otherwise a negative errno value.
  *
@@ -498,8 +503,9 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
  *      of a correct length, otherwise the same as skb->network_header.
  *      For other key->eth.type values it is left untouched.
  *
- *    - skb->protocol: the type of the data starting at skb->network_header.
- *      Equals to key->eth.type.
+ *    - skb->protocol: For non-accelerated VLAN, one of the VLAN ether types,
+ *      otherwise the same as key->eth.type, the ether type of the payload
+ *      starting at skb->network_header.
  */
 static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 {
@@ -518,6 +524,7 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 			return -EINVAL;
 
 		skb_reset_network_header(skb);
+		key->eth.type = skb->protocol;
 	} else {
 		eth = eth_hdr(skb);
 		ether_addr_copy(key->eth.src, eth->h_source);
@@ -531,15 +538,22 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 		if (unlikely(parse_vlan(skb, key)))
 			return -ENOMEM;
 
-		skb->protocol = parse_ethertype(skb);
-		if (unlikely(skb->protocol == htons(0)))
+		key->eth.type = parse_ethertype(skb);
+		if (unlikely(key->eth.type == htons(0)))
 			return -ENOMEM;
 
+		if (skb->protocol == htons(ETH_P_TEB)) {
+			if (key->eth.vlan.tci & htons(VLAN_TAG_PRESENT)
+			    && !skb_vlan_tag_present(skb))
+				skb->protocol = key->eth.vlan.tpid;
+			else
+				skb->protocol = key->eth.type;
+		}
+
 		skb_reset_network_header(skb);
 		__skb_push(skb, skb->data - skb_mac_header(skb));
 	}
 	skb_reset_mac_len(skb);
-	key->eth.type = skb->protocol;
 
 	/* Network layer. */
 	if (key->eth.type == htons(ETH_P_IP)) {
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH net-next v2 2/2] net: phy: bcm7xxx: Plug in support for reading PHY error counters
From: Andrew Lunn @ 2016-11-29  2:50 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju
In-Reply-To: <20161129010457.17438-3-f.fainelli@gmail.com>

> +struct bcm7xxx_phy_priv {
> +	u64	*stats;
> +};

> +static int bcm7xxx_28nm_probe(struct phy_device *phydev)
> +{
> +	struct bcm7xxx_phy_priv *priv;
> +
> +	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	phydev->priv = priv;
> +
> +	priv->stats = devm_kzalloc(&phydev->mdio.dev,
> +				   bcm_phy_get_sset_count(phydev), GFP_KERNEL);

Hi Florian

Should there be a * sizeof(u64) in there?

       Andrew
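
For illustration only, here is a user-space sketch of the sizing issue, with calloc() standing in for devm_kzalloc(). The helper names are hypothetical; the point is that an allocator taking a byte count needs the counter count multiplied by sizeof(u64).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed to hold one 64-bit slot per counter. */
size_t stats_bytes(size_t count)
{
	assert(count <= SIZE_MAX / sizeof(uint64_t)); /* no overflow */
	return count * sizeof(uint64_t);
}

/* Allocate a zeroed array of `count` 64-bit counters.
 * calloc() takes the element count and the element size separately,
 * so the multiplication cannot be forgotten. */
uint64_t *alloc_stats(size_t count)
{
	return calloc(count, sizeof(uint64_t));
}
```

Passing only the counter count as the size, as in the quoted hunk, would reserve one byte per counter instead of eight.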

^ permalink raw reply

* Re: [PATCH net] bpf: fix states equal logic for varlen access
From: Alexei Starovoitov @ 2016-11-29  3:04 UTC (permalink / raw)
  To: Josef Bacik; +Cc: davem, netdev, daniel, ast, jannh
In-Reply-To: <1480362250-2132-1-git-send-email-jbacik@fb.com>

On Mon, Nov 28, 2016 at 02:44:10PM -0500, Josef Bacik wrote:
> If we have a branch that looks something like this
> 
> int foo = map->value;
> if (condition) {
>   foo += blah;
> } else {
>   foo = bar;
> }
> map->array[foo] = baz;
> 
> We will incorrectly assume that the !condition branch is equal to the condition
> branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
> adjust this logic to only do this if we didn't do a varlen access after we
> processed the !condition branch, otherwise we have different ranges and need to
> check the other branch as well.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  kernel/bpf/verifier.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 89f787c..2c8a688 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2478,6 +2478,7 @@ static bool states_equal(struct bpf_verifier_env *env,
>  {
>  	struct bpf_reg_state *rold, *rcur;
>  	int i;
> +	bool map_access = env->varlen_map_value_access;

That's a slightly misleading name for the variable.
Please call it varlen_map_access.

>  	for (i = 0; i < MAX_BPF_REG; i++) {
>  		rold = &old->regs[i];
> @@ -2489,12 +2490,17 @@ static bool states_equal(struct bpf_verifier_env *env,
>  		/* If the ranges were not the same, but everything else was and
>  		 * we didn't do a variable access into a map then we are a-ok.
>  		 */
> -		if (!env->varlen_map_value_access &&
> +		if (!map_access &&
>  		    rold->type == rcur->type && rold->imm == rcur->imm)

just noticed that this one is missing comparing rold->id == rcur->id

>  			continue;
>  
> +		/* If we didn't map access then again we don't care about the
> +		 * mismatched range values and it's ok if our old type was
> +		 * UNKNOWN and we didn't go to a NOT_INIT'ed reg.
> +		 */
>  		if (rold->type == NOT_INIT ||
> -		    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
> +		    (!map_access && (rold->type == UNKNOWN_VALUE &&
> +				     rcur->type != NOT_INIT)))

please drop unnecessary ( )

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] net: phy: bcm7xxx: Plug in support for reading PHY error counters
From: Florian Fainelli @ 2016-11-29  3:20 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, davem, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju
In-Reply-To: <20161129025018.GA28968@lunn.ch>

On 11/28/16 at 18:50, Andrew Lunn wrote:
>> +struct bcm7xxx_phy_priv {
>> +	u64	*stats;
>> +};
> 
>> +static int bcm7xxx_28nm_probe(struct phy_device *phydev)
>> +{
>> +	struct bcm7xxx_phy_priv *priv;
>> +
>> +	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
>> +	if (!priv)
>> +		return -ENOMEM;
>> +
>> +	phydev->priv = priv;
>> +
>> +	priv->stats = devm_kzalloc(&phydev->mdio.dev,
>> +				   bcm_phy_get_sset_count(phydev), GFP_KERNEL);
> 
> Hi Florian
> 
> Should there be a * sizeof(u64) in there?

It should, thanks for noticing!
-- 
Florian

^ permalink raw reply

* [net-next] macvtap: replace printk with netdev_err
From: Zhang Shengju @ 2016-11-29  3:26 UTC (permalink / raw)
  To: netdev; +Cc: jasowang

This patch replaces printk() with netdev_err() for macvtap device.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
---
 drivers/net/macvtap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 5da9861..2513939 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -437,7 +437,7 @@ static int macvtap_get_minor(struct macvlan_dev *vlan)
 	if (retval >= 0) {
 		vlan->minor = retval;
 	} else if (retval == -ENOSPC) {
-		printk(KERN_ERR "too many macvtap devices\n");
+		netdev_err(vlan->dev, "Too many macvtap devices\n");
 		retval = -EINVAL;
 	}
 	mutex_unlock(&minor_lock);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [net-next 1/1] samples: bpf: Refactor test_cgrp2_attach -- use getopt, and add mode
From: Alexei Starovoitov @ 2016-11-29  3:50 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: netdev, daniel, ast
In-Reply-To: <20161128225240.GA8761@ircssh.c.rugged-nimbus-611.internal>

On Mon, Nov 28, 2016 at 02:52:42PM -0800, Sargun Dhillon wrote:
> This patch modifies test_cgrp2_attach to use getopt so we can use standard
> command line parsing.
> 
> It also adds an option to run the program in detach only mode. This does
> not attach a new filter at the cgroup, but only runs the detach command.
> 
> Lastly, it changes the attach code to not detach and then attach. It relies
> on the 'hotswap' behaviour of CGroup BPF programs to be able to change
> in-place. If detach-then-attach behaviour needs to be tested, the example
> can be run in detach only mode prior to attachment.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>

looks fine to me.
I'd really prefer this example to become an automated test eventually.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [RFC PATCH 1/2] net: macb: Add MDIO driver for accessing multiple PHY devices
From: Harini Katakam @ 2016-11-29  4:11 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Harini Katakam, Nicolas Ferre, davem, Rob Herring, Pawel Moll,
	Mark Rutland, ijc+devicetree@hellion.org.uk, Kumar Gala,
	Boris Brezillon, alexandre.belloni, netdev,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	michals@xilinx.com, Punnaiah Choudary Kalluri
In-Reply-To: <20161128163356.GJ17704@lunn.ch>

Hi Andrew,

On Mon, Nov 28, 2016 at 10:03 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Mon, Nov 28, 2016 at 03:19:14PM +0530, Harini Katakam wrote:
>> This patch is to add support for the hardware with multiple ethernet
>> MAC controllers and a single MDIO bus connected to multiple PHY devices.
>> MDIO lines are connected to any one of the ethernet MAC controllers and
>> all the PHY devices will be accessed using the PHY maintenance interface
>> in that MAC controller. This handling along with PHY functionality is
>> moved to macb_mdio.c
>>
>> Signed-off-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
>> Signed-off-by: Harini Katakam <harinik@xilinx.com>
>> ---
>>  drivers/net/ethernet/cadence/Makefile    |   2 +-
>>  drivers/net/ethernet/cadence/macb.c      | 169 +++-----------------
>>  drivers/net/ethernet/cadence/macb.h      |   2 +
>>  drivers/net/ethernet/cadence/macb_mdio.c | 266 +++++++++++++++++++++++++++++++
>>  4 files changed, 294 insertions(+), 145 deletions(-)
>>  create mode 100644 drivers/net/ethernet/cadence/macb_mdio.c
>>
<snip>
>> +     bus->irq = devm_kzalloc(&pdev->dev, sizeof(int) * PHY_MAX_ADDR,
>> +                             GFP_KERNEL);
>
> This looks wrong, or at least old. It used to be a pointer to an array,
> but it is now an actual array.

Sorry, this was a mistake.
I changed this after rebase, will update in next version.

>
>> +static const struct of_device_id macb_mdio_dt_ids[] = {
>> +     { .compatible = "cdns,macb-mdio" },
>> +
>> +};
>
>
> I've not looked hard enough to know, but can you keep backwards
> compatibility? Won't old device trees assume the mdio bus is always
> present? Now you need an explicit node otherwise there will not be an
> mdio bus?

Yes, an explicit MDIO bus is required. But I'm not sure
how to maintain backward compatibility (without using this separate
macb_mdio) and have different MACs use the same MDIO bus
with separate PHYs.

Regards,
Harini

^ permalink raw reply

* Re: Crash due to mutex genl_lock called from RCU context
From: Cong Wang @ 2016-11-29  4:33 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, Subash Abhinov Kasiviswanathan, Thomas Graf,
	Linux Kernel Network Developers
In-Reply-To: <20161128112211.GA990@gondor.apana.org.au>

On Mon, Nov 28, 2016 at 3:22 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> netlink: Call cb->done from a worker thread
>
> The cb->done interface expects to be called in process context.
> This was broken by the netlink RCU conversion.  This patch fixes
> it by adding a worker struct to make the cb->done call where
> necessary.
>
> Fixes: 21e4902aea80 ("netlink: Lockless lookup with RCU grace...")
> Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>


Looks good,

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

Thanks!

^ permalink raw reply

* [PATCH] openvswitch: add sanity check in queue_userspace_packet.
From: Haishuang Yan @ 2016-11-29  4:36 UTC (permalink / raw)
  To: Pravin Shelar, David S. Miller; +Cc: netdev, linux-kernel, Haishuang Yan

The kernel will oops if genlmsg_put() returns NULL,
so add a sanity check.
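
The pattern can be sketched in user space as follows. `struct msg_buf`, `buf_put()` and `queue_packet()` are hypothetical stand-ins for the netlink buffer, genlmsg_put() and the caller; the sketch only shows the NULL check guarding the first write through the returned pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define BUF_SLOTS 4

/* Hypothetical fixed-size message buffer. */
struct msg_buf {
	int slots[BUF_SLOTS];
	size_t used;
};

/* Reserve `n` slots, or return NULL when the buffer lacks room. */
int *buf_put(struct msg_buf *b, size_t n)
{
	assert(b != NULL);
	if (b->used > BUF_SLOTS || n > BUF_SLOTS - b->used)
		return NULL;

	int *p = &b->slots[b->used];
	b->used += n;
	return p;
}

/* Caller with the sanity check: bail out instead of writing
 * through a NULL header pointer. */
int queue_packet(struct msg_buf *b, int dp_ifindex)
{
	int *hdr = buf_put(b, 1);

	if (!hdr)
		return -EMSGSIZE;
	*hdr = dp_ifindex;
	return 0;
}
```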

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
---
 net/openvswitch/datapath.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 2d4c4d3..ceb1b1e 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -474,6 +474,10 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 
 	upcall = genlmsg_put(user_skb, 0, 0, &dp_packet_genl_family,
 			     0, upcall_info->cmd);
+	if (!upcall) {
+		err = -EMSGSIZE;
+		goto out;
+	}
 	upcall->dp_ifindex = dp_ifindex;
 
 	err = ovs_nla_put_key(key, key, OVS_PACKET_ATTR_KEY, false, user_skb);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v4 0/3] net: Add bpf support to set sk_bound_dev_if
From: David Ahern @ 2016-11-29  4:38 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern

The recently added VRF support in Linux leverages the bind-to-device
API for programs to specify an L3 domain for a socket. While
SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
program has support for it. Even for those programs that do support it,
the API requires processes to be started as root (CAP_NET_RAW) which
is not desirable from a general security perspective.

This patch set leverages Daniel Mack's work to attach bpf programs to
a cgroup to provide a capability to set sk_bound_dev_if for all
AF_INET{6} sockets opened by a process in a cgroup when the sockets
are allocated.

For example:
 1. configure vrf (e.g., using ifupdown2)
        auto eth0
        iface eth0 inet dhcp
            vrf mgmt

        auto mgmt
        iface mgmt
            vrf-table auto

 2. configure cgroup
        mount -t cgroup2 none /tmp/cgroupv2
        mkdir /tmp/cgroupv2/mgmt
        test_cgrp2_sock /tmp/cgroupv2/mgmt 15

 3. set shell into cgroup (e.g., can be done at login using pam)
        echo $$ >> /tmp/cgroupv2/mgmt/cgroup.procs

At this point all commands run in the shell (e.g., apt) have sockets
automatically bound to the VRF (see output of ss -ap 'dev == <vrf>'),
including processes not running as root.

This capability enables running any program in a VRF context and is key
to deploying Management VRF, a fundamental configuration for networking
gear, with any Linux OS installation.

David Ahern (3):
  bpf: Refactor cgroups code in prep for new type
  bpf: Add new cgroup attach type to enable sock modifications
  samples: bpf: add userspace example for modifying sk_bound_dev_if

 include/linux/bpf-cgroup.h     | 60 ++++++++++++++++++------------
 include/uapi/linux/bpf.h       |  6 +++
 kernel/bpf/cgroup.c            | 43 +++++++++++++++++++---
 kernel/bpf/syscall.c           | 33 ++++++++++-------
 net/core/filter.c              | 59 ++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c             | 12 +++++-
 net/ipv6/af_inet6.c            |  8 ++++
 samples/bpf/Makefile           |  2 +
 samples/bpf/test_cgrp2_sock.c  | 83 ++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock.sh | 48 ++++++++++++++++++++++++
 10 files changed, 311 insertions(+), 43 deletions(-)
 create mode 100644 samples/bpf/test_cgrp2_sock.c
 create mode 100755 samples/bpf/test_cgrp2_sock.sh

-- 
2.1.4

^ permalink raw reply

* [PATCH net-next v4 1/3] bpf: Refactor cgroups code in prep for new type
From: David Ahern @ 2016-11-29  4:38 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480394329-24847-1-git-send-email-dsa@cumulusnetworks.com>

Code move and rename only; no functional change intended.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v4
- dropped refactor of __cgroup_bpf_run_filter and renamed it
  to __cgroup_bpf_run_filter_skb

v3
- dropped the rename

v2
- fix bpf_prog_run_clear_cb to bpf_prog_run_save_cb as caught by Daniel

- rename BPF_PROG_TYPE_CGROUP_SKB and its cg_skb functions to
  BPF_PROG_TYPE_CGROUP and cgroup

 include/linux/bpf-cgroup.h | 46 +++++++++++++++++++++++-----------------------
 kernel/bpf/cgroup.c        | 10 +++++-----
 kernel/bpf/syscall.c       | 28 +++++++++++++++-------------
 3 files changed, 43 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index ec80d0c0953e..7f0fc635b13e 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -37,31 +37,31 @@ void cgroup_bpf_update(struct cgroup *cgrp,
 		       struct bpf_prog *prog,
 		       enum bpf_attach_type type);
 
-int __cgroup_bpf_run_filter(struct sock *sk,
-			    struct sk_buff *skb,
-			    enum bpf_attach_type type);
-
-/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
-#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb)			\
-({									\
-	int __ret = 0;							\
-	if (cgroup_bpf_enabled)						\
-		__ret = __cgroup_bpf_run_filter(sk, skb,		\
-						BPF_CGROUP_INET_INGRESS); \
-									\
-	__ret;								\
+int __cgroup_bpf_run_filter_skb(struct sock *sk,
+				struct sk_buff *skb,
+				enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			      \
+({									      \
+	int __ret = 0;							      \
+	if (cgroup_bpf_enabled)						      \
+		__ret = __cgroup_bpf_run_filter_skb(sk, skb,		      \
+						    BPF_CGROUP_INET_INGRESS); \
+									      \
+	__ret;								      \
 })
 
-#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb)				\
-({									\
-	int __ret = 0;							\
-	if (cgroup_bpf_enabled && sk && sk == skb->sk) {		\
-		typeof(sk) __sk = sk_to_full_sk(sk);			\
-		if (sk_fullsock(__sk))					\
-			__ret = __cgroup_bpf_run_filter(__sk, skb,	\
-						BPF_CGROUP_INET_EGRESS); \
-	}								\
-	__ret;								\
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled && sk && sk == skb->sk) {		       \
+		typeof(sk) __sk = sk_to_full_sk(sk);			       \
+		if (sk_fullsock(__sk))					       \
+			__ret = __cgroup_bpf_run_filter_skb(__sk, skb,	       \
+						      BPF_CGROUP_INET_EGRESS); \
+	}								       \
+	__ret;								       \
 })
 
 #else
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f264b0..19892973a78a 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -118,7 +118,7 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
 }
 
 /**
- * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * __cgroup_bpf_run_filter_skb() - Run a program for packet filtering
  * @sk: The socken sending or receiving traffic
  * @skb: The skb that is being sent or received
  * @type: The type of program to be exectuted
@@ -132,9 +132,9 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
  * This function will return %-EPERM if any if an attached program was found
  * and if it returned != 1 during execution. In all other cases, 0 is returned.
  */
-int __cgroup_bpf_run_filter(struct sock *sk,
-			    struct sk_buff *skb,
-			    enum bpf_attach_type type)
+int __cgroup_bpf_run_filter_skb(struct sock *sk,
+				struct sk_buff *skb,
+				enum bpf_attach_type type)
 {
 	struct bpf_prog *prog;
 	struct cgroup *cgrp;
@@ -164,4 +164,4 @@ int __cgroup_bpf_run_filter(struct sock *sk,
 
 	return ret;
 }
-EXPORT_SYMBOL(__cgroup_bpf_run_filter);
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4caa18e6860a..5518a6839ab1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -856,6 +856,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 {
 	struct bpf_prog *prog;
 	struct cgroup *cgrp;
+	enum bpf_prog_type ptype;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -866,25 +867,26 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	switch (attr->attach_type) {
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
-		prog = bpf_prog_get_type(attr->attach_bpf_fd,
-					 BPF_PROG_TYPE_CGROUP_SKB);
-		if (IS_ERR(prog))
-			return PTR_ERR(prog);
-
-		cgrp = cgroup_get_from_fd(attr->target_fd);
-		if (IS_ERR(cgrp)) {
-			bpf_prog_put(prog);
-			return PTR_ERR(cgrp);
-		}
-
-		cgroup_bpf_update(cgrp, prog, attr->attach_type);
-		cgroup_put(cgrp);
+		ptype = BPF_PROG_TYPE_CGROUP_SKB;
 		break;
 
 	default:
 		return -EINVAL;
 	}
 
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	cgrp = cgroup_get_from_fd(attr->target_fd);
+	if (IS_ERR(cgrp)) {
+		bpf_prog_put(prog);
+		return PTR_ERR(cgrp);
+	}
+
+	cgroup_bpf_update(cgrp, prog, attr->attach_type);
+	cgroup_put(cgrp);
+
 	return 0;
 }
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v4 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: David Ahern @ 2016-11-29  4:38 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480394329-24847-1-git-send-email-dsa@cumulusnetworks.com>

Add a new cgroup-based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB, programs of this type can be attached to a cgroup
and are run any time a process in the cgroup opens an AF_INET or AF_INET6
socket. Currently only sk_bound_dev_if is exported to userspace for
modification by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.
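
The access rule this implies for the program's context can be sketched in user space: reads may touch any aligned, in-bounds part of the exported structure, while writes are limited to the bound_dev_if field. The names `struct sock_ctx`, `enum access_type` and `ctx_access_ok()` are hypothetical and only loosely mirror the check added in net/core/filter.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the exported context: one writable field. */
struct sock_ctx {
	uint32_t bound_dev_if;
};

static_assert(sizeof(struct sock_ctx) == sizeof(uint32_t),
	      "sketch assumes a single 32-bit field");

enum access_type { ACCESS_READ, ACCESS_WRITE };

/* Accept an access only if it is in bounds, naturally aligned and,
 * for writes, starts at the bound_dev_if field. */
bool ctx_access_ok(int off, int size, enum access_type type)
{
	if (size <= 0 || off < 0)
		return false;
	if ((size_t)off + (size_t)size > sizeof(struct sock_ctx))
		return false;
	if (off % size != 0)
		return false;
	if (type == ACCESS_WRITE &&
	    (size_t)off != offsetof(struct sock_ctx, bound_dev_if))
		return false;
	return true;
}
```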

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v4
- dropped tweak to bpf_func signature
- dropped cg_sock_func_proto in favor of sk_filter_func_proto
- new __cgroup_bpf_run_filter_sk versus overloading __cgroup_bpf_run_filter
- reverted BPF_CGROUP_INET_SOCK to BPF_CGROUP_INET_SOCK_CREATE

v3
- reverted to new prog type BPF_PROG_TYPE_CGROUP_SOCK
- dropped the subtype

v2
- dropped the bpf_sock_store_u32 helper
- dropped the new prog type BPF_PROG_TYPE_CGROUP_SOCK
- moved valid access and context conversion to use subtype
- dropped CREATE from BPF_CGROUP_INET_SOCK and related function names
- moved running of filter from sk_alloc to inet{6}_create

 include/linux/bpf-cgroup.h | 14 +++++++++++
 include/uapi/linux/bpf.h   |  6 +++++
 kernel/bpf/cgroup.c        | 33 ++++++++++++++++++++++++++
 kernel/bpf/syscall.c       |  5 +++-
 net/core/filter.c          | 59 ++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         | 12 +++++++++-
 net/ipv6/af_inet6.c        |  8 +++++++
 7 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 7f0fc635b13e..7de376e37c5c 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -41,6 +41,9 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 				struct sk_buff *skb,
 				enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sk(struct sock *sk,
+			       enum bpf_attach_type type);
+
 /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			      \
 ({									      \
@@ -64,6 +67,16 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled && sk) {					       \
+		__ret = __cgroup_bpf_run_filter_sk(sk,			       \
+						 BPF_CGROUP_INET_SOCK_CREATE); \
+	}								       \
+	__ret;								       \
+})
+
 #else
 
 struct cgroup_bpf {};
@@ -73,6 +86,7 @@ static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
 
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 
 #endif /* CONFIG_CGROUP_BPF */
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d1456f..75964e00d947 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,11 +101,13 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
 	BPF_PROG_TYPE_CGROUP_SKB,
+	BPF_PROG_TYPE_CGROUP_SOCK,
 };
 
 enum bpf_attach_type {
 	BPF_CGROUP_INET_INGRESS,
 	BPF_CGROUP_INET_EGRESS,
+	BPF_CGROUP_INET_SOCK_CREATE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -537,6 +539,10 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+struct bpf_sock {
+	__u32 bound_dev_if;
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 19892973a78a..fe1c9ad03a36 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -165,3 +165,36 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	return ret;
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
+
+/**
+ * __cgroup_bpf_run_filter_sk() - Run a program on a sock
+ * @sk: sock structure to manipulate
+ * @type: The type of program to be executed
+ *
+ * The socket passed is expected to be of type INET or INET6.
+ *
+ * The program type passed in via @type must be suitable for sock
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if an attached program was found
+ * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter_sk(struct sock *sk,
+			       enum bpf_attach_type type)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_prog *prog;
+	int ret = 0;
+
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, sk) == 1 ? 0 : -EPERM;
+
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5518a6839ab1..85af86c496cd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -869,7 +869,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_EGRESS:
 		ptype = BPF_PROG_TYPE_CGROUP_SKB;
 		break;
-
+	case BPF_CGROUP_INET_SOCK_CREATE:
+		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -905,6 +907,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	switch (attr->attach_type) {
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
+	case BPF_CGROUP_INET_SOCK_CREATE:
 		cgrp = cgroup_get_from_fd(attr->target_fd);
 		if (IS_ERR(cgrp))
 			return PTR_ERR(cgrp);
diff --git a/net/core/filter.c b/net/core/filter.c
index 698a262b8ebb..404aaa1bfa1f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2676,6 +2676,29 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool sock_filter_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					enum bpf_reg_type *reg_type)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sock, bound_dev_if):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	if (off < 0 || off + size > sizeof(struct bpf_sock))
+		return false;
+
+	/* The verifier guarantees that size > 0. */
+	if (off % size != 0)
+		return false;
+
+	return true;
+}
+
 static int tc_cls_act_prologue(struct bpf_insn *insn_buf, bool direct_write,
 			       const struct bpf_prog *prog)
 {
@@ -2934,6 +2957,30 @@ static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
+					  int dst_reg, int src_reg,
+					  int ctx_off,
+					  struct bpf_insn *insn_buf,
+					  struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct bpf_sock, bound_dev_if):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_bound_dev_if) != 4);
+
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_W, dst_reg, src_reg,
+					offsetof(struct sock, sk_bound_dev_if));
+		else
+			*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sock, sk_bound_dev_if));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 					 int src_reg, int ctx_off,
 					 struct bpf_insn *insn_buf,
@@ -3007,6 +3054,12 @@ static const struct bpf_verifier_ops cg_skb_ops = {
 	.convert_ctx_access	= sk_filter_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_sock_ops = {
+	.get_func_proto		= sk_filter_func_proto,
+	.is_valid_access	= sock_filter_is_valid_access,
+	.convert_ctx_access	= sock_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -3032,6 +3085,11 @@ static struct bpf_prog_type_list cg_skb_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+static struct bpf_prog_type_list cg_sock_type __read_mostly = {
+	.ops	= &cg_sock_ops,
+	.type	= BPF_PROG_TYPE_CGROUP_SOCK
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
@@ -3039,6 +3097,7 @@ static int __init register_sk_filter_ops(void)
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
 	bpf_register_prog_type(&cg_skb_type);
+	bpf_register_prog_type(&cg_sock_type);
 
 	return 0;
 }
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5ddf5cda07f4..24d2550492ee 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -374,8 +374,18 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
 
 	if (sk->sk_prot->init) {
 		err = sk->sk_prot->init(sk);
-		if (err)
+		if (err) {
+			sk_common_release(sk);
+			goto out;
+		}
+	}
+
+	if (!kern) {
+		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+		if (err) {
 			sk_common_release(sk);
+			goto out;
+		}
 	}
 out:
 	return err;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d424f3a3737a..237e654ba717 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -258,6 +258,14 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 			goto out;
 		}
 	}
+
+	if (!kern) {
+		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+		if (err) {
+			sk_common_release(sk);
+			goto out;
+		}
+	}
 out:
 	return err;
 out_rcu_unlock:
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v4 3/3] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: David Ahern @ 2016-11-29  4:38 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480394329-24847-1-git-send-email-dsa@cumulusnetworks.com>

Add a simple program to demonstrate the ability to attach a bpf program
to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
are created.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v4
- added test_cgrp2_sock.sh for an automated test

v3
- revert to BPF_PROG_TYPE_CGROUP_SOCK prog type

v2
- removed bpf_sock_store_u32 references
- changed BPF_CGROUP_INET_SOCK_CREATE to BPF_CGROUP_INET_SOCK
- remove BPF_PROG_TYPE_CGROUP_SOCK prog type and add prog_subtype

 samples/bpf/Makefile           |  2 +
 samples/bpf/test_cgrp2_sock.c  | 83 ++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock.sh | 48 ++++++++++++++++++++++++
 3 files changed, 133 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_sock.c
 create mode 100755 samples/bpf/test_cgrp2_sock.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 22b6407efa4f..3a404dd4bb46 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -23,6 +23,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += test_cgrp2_attach
+hostprogs-y += test_cgrp2_sock
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -51,6 +52,7 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
+test_cgrp2_sock-objs := libbpf.o test_cgrp2_sock.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
new file mode 100644
index 000000000000..2831e5f41f86
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -0,0 +1,83 @@
+/* eBPF example program:
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
+ *   sockets opened by processes in the cgroup.
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+static int prog_load(int idx)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_MOV64_IMM(BPF_REG_3, idx),
+		BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
+			     "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+	printf("Usage: %s cg-path device-index\n", argv0);
+	return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+	int cg_fd, prog_fd, ret;
+	int idx = 0;
+
+	if (argc < 3)
+		return usage(argv[0]);
+
+	idx = atoi(argv[2]);
+	if (!idx) {
+		printf("Invalid device index\n");
+		return EXIT_FAILURE;
+	}
+
+	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	prog_fd = prog_load(idx);
+	printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+	if (prog_fd < 0) {
+		printf("Failed to load prog: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = bpf_prog_detach(cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
+	ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
+	if (ret < 0) {
+		printf("Failed to attach prog to cgroup: '%s'\n",
+		       strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	return EXIT_SUCCESS;
+}
diff --git a/samples/bpf/test_cgrp2_sock.sh b/samples/bpf/test_cgrp2_sock.sh
new file mode 100755
index 000000000000..35ce9e1566d8
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.sh
@@ -0,0 +1,48 @@
+#!/bin/bash
+
+function config_device {
+	ip netns add at_ns0
+	ip link add veth0 type veth peer name veth0b
+	ip link set veth0b up
+	ip link set veth0 netns at_ns0
+	ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
+	ip netns exec at_ns0 ip addr add 2401:db00::1/64 dev veth0 nodad
+	ip netns exec at_ns0 ip link set dev veth0 up
+	ip link add foo type vrf table 1234
+	ip link set foo up
+	ip addr add 172.16.1.101/24 dev veth0b
+	ip addr add 2401:db00::2/64 dev veth0b nodad
+	ip link set veth0b master foo
+}
+
+function attach_bpf {
+	idx=$(ip -o li sh dev foo | awk -F':' '{print $1}')
+	rm -rf /tmp/cgroupv2
+	mkdir -p /tmp/cgroupv2
+	mount -t cgroup2 none /tmp/cgroupv2
+	mkdir -p /tmp/cgroupv2/foo
+	test_cgrp2_sock /tmp/cgroupv2/foo $idx
+	echo $$ >> /tmp/cgroupv2/foo/cgroup.procs
+}
+
+function cleanup {
+	set +ex
+	ip netns delete at_ns0
+	ip link del veth0
+	ip link del foo
+	umount /tmp/cgroupv2
+	rm -rf /tmp/cgroupv2
+	set -ex
+}
+
+function do_test {
+	ping -c1 -w1 172.16.1.100
+	ping6 -c1 -w1 2401:db00::1
+}
+
+cleanup 2>/dev/null
+config_device
+attach_bpf
+do_test
+cleanup
+echo "*** PASS ***"
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH] openvswitch: add sanity check in queue_userspace_packet.
From: Pravin Shelar @ 2016-11-29  5:15 UTC (permalink / raw)
  To: Haishuang Yan
  Cc: Pravin Shelar, David S. Miller, Linux Kernel Network Developers,
	linux-kernel
In-Reply-To: <1480394196-73882-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

On Mon, Nov 28, 2016 at 8:36 PM, Haishuang Yan
<yanhaishuang@cmss.chinamobile.com> wrote:
> kernel will crash in oops if genlmsg_put return NULL,
> so add the sanity check.
>
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> ---
>  net/openvswitch/datapath.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 2d4c4d3..ceb1b1e 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -474,6 +474,10 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>
>         upcall = genlmsg_put(user_skb, 0, 0, &dp_packet_genl_family,
>                              0, upcall_info->cmd);
> +       if (!upcall) {
> +               err = -EMSGSIZE;
> +               goto out;
> +       }

user_skb already has enough space allocated, so there is no need to
check the upcall pointer.

^ permalink raw reply

* Re: [PATCH net] net, sched: respect rcu grace period on cls destruction
From: Roi Dayan @ 2016-11-29  5:16 UTC (permalink / raw)
  To: Daniel Borkmann, davem
  Cc: roid, xiyou.wangcong, john.fastabend, ast, hannes, jiri, netdev
In-Reply-To: <0d6d89f885033f1739e97f7f3372ae6e1db72892.1480204343.git.daniel@iogearbox.net>


On 27/11/2016 02:18, Daniel Borkmann wrote:
> Roi reported a crash in flower where tp->root was NULL in ->classify()
> callbacks. Reason is that in ->destroy() tp->root is set to NULL via
> RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
> this doesn't respect RCU grace period for them, and as a result, still
> outstanding readers from tc_classify() will try to blindly dereference
> a NULL tp->root.
>
> The tp->root object is strictly private to the classifier implementation
> and holds internal data the core such as tc_ctl_tfilter() doesn't know
> about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
> is only checked for NULL in ->get() callback, but nowhere else. This is
> misleading and seemed to be copied from old classifier code that was not
> cleaned up properly. For example, d3fa76ee6b4a ("[NET_SCHED]: cls_basic:
> fix NULL pointer dereference") moved tp->root initialization into ->init()
> routine, where before it was part of ->change(), so ->get() had to deal
> with tp->root being NULL back then, so that was indeed a valid case, after
> d3fa76ee6b4a, not really anymore. We used to set tp->root to NULL long
> ago in ->destroy(), see 47a1a1d4be29 ("pkt_sched: remove unnecessary xchg()
> in packet classifiers"); but the NULLifying was reintroduced with the
> RCUification, but it's not correct for every classifier implementation.
>
> In the cases that are fixed here with one exception of cls_cgroup, tp->root
> object is allocated and initialized inside ->init() callback, which is always
> performed at a point in time after we allocate a new tp, which means tp and
> thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
> Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
> handler, same for the tp which is kfree_rcu()'ed right when we return
> from ->destroy() in tcf_destroy(). This means, the head object's lifetime
> for such classifiers is always tied to the tp lifetime. The RCU callback
> invocation for the two kfree_rcu() could be out of order, but that's fine
> since both are independent.
>
> Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
> means that 1) we don't need a useless NULL check in fast-path and, 2) that
> outstanding readers of that tp in tc_classify() can still execute under
> respect with RCU grace period as it is actually expected.
>
> Things that haven't been touched here: cls_fw and cls_route. They each
> handle tp->root being NULL in ->classify() path for historic reasons, so
> their ->destroy() implementation can stay as is. If someone actually
> cares, they could get cleaned up at some point to avoid the test in fast
> path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
> !head should anyone actually be using/testing it, so it at least aligns with
> cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
> destruction (to a sleepable context) after RCU grace period as concurrent
> readers might still access it. (Note that in this case we need to hold module
> reference to keep work callback address intact, since we only wait on module
> unload for all call_rcu()s to finish.)
>
> This fixes one race to bring RCU grace period guarantees back. Next step
> as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy
> proto tp when all filters are gone") to get the order of unlinking the tp
> in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
> RCU_INIT_POINTER() before tcf_destroy() and let the notification for
> removal be done through the prior ->delete() callback. Both are independent
> issues. Once we have that right, we can then clean tp->root up for a number
> of classifiers by not making them RCU pointers, which requires a new callback
> (->uninit) that is triggered from tp's RCU callback, where we just kfree()
> tp->root from there.
>
> Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
> Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
> Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
> Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
> Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
> Fixes: 952313bd6258 ("net: sched: cls_cgroup use RCU")
> Reported-by: Roi Dayan <roid@mellanox.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Roi Dayan <roid@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>
> ---
>

Hi,

Replying here as well, rather than only in the other thread: I could not
reproduce my original issue after applying this patch.

Thanks,
Roi

^ permalink raw reply

* Re: [net-next 1/1] samples: bpf: Refactor test_cgrp2_attach -- use getopt, and add mode
From: Sargun Dhillon @ 2016-11-29  5:42 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, Daniel Mack, Alexei Starovoitov
In-Reply-To: <20161129035020.GA15161@ast-mbp.thefacebook.com>

On Mon, Nov 28, 2016 at 7:50 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Nov 28, 2016 at 02:52:42PM -0800, Sargun Dhillon wrote:
>> This patch modifies test_cgrp2_attach to use getopt so we can use standard
>> command line parsing.
>>
>> It also adds an option to run the program in detach only mode. This does
>> not attach a new filter at the cgroup, but only runs the detach command.
>>
>> Lastly, it changes the attach code to not detach and then attach. It relies
>> on the 'hotswap' behaviour of CGroup BPF programs to be able to change
>> in-place. If detach-then-attach behaviour needs to be tested, the example
>> can be run in detach only mode prior to attachment.
>>
>> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
>
> looks fine to me.
> I'd really prefer this example to become an automated test eventually.
I can do that. As far as test cases:

1. create /foo
2. enter foo
3. attach drop filter to foo
4. try to ping 127.0.0.1 (make sure it returns 0 replies)
5. create /foo/bar
6. enter /foo/bar
7. try to ping 127.0.0.1 (make sure it returns 0 replies)
8. attach passthrough filter to foo/bar
9. try to ping 127.0.0.1 (make sure it returns 1 replies)
10. Detach filter from foo/bar
11. try to ping 127.0.0.1 (make sure it returns 0 replies)
Reasonable?


>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
>

^ permalink raw reply

* Re: [patch net] net: fec: cache statistics while device is down
From: Nikita Yushchenko @ 2016-11-29  5:43 UTC (permalink / raw)
  To: Andy Duan, David S. Miller, Troy Kisky, Andrew Lunn, Eric Nelson,
	Philippe Reynes, Johannes Berg, netdev@vger.kernel.org
  Cc: Chris Healy, Fabio Estevam, linux-kernel@vger.kernel.org
In-Reply-To: <AM4PR0401MB2260081367AC9A140448A631FF8D0@AM4PR0401MB2260.eurprd04.prod.outlook.com>

>  >
>  >+	fec_enet_update_ethtool_stats(ndev);
>  >+
> If the user never opens the interface, ethtool_stats[] is always 0, which is not expected.
> So, should it also be called in fec_enet_init()?

I don't think zero stats are wrong for a never-opened interface.

However, a call in the init path won't hurt, so I'll add it, just to settle
the question.

Nikita

^ permalink raw reply

* Re: [net-next 1/1] samples: bpf: Refactor test_cgrp2_attach -- use getopt, and add mode
From: Alexei Starovoitov @ 2016-11-29  5:50 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: netdev, Daniel Mack, Alexei Starovoitov
In-Reply-To: <CAMp4zn8tGFxEBSMicv51hSofE_bRT9xd6rgtLFnVSLmhZ_sF_w@mail.gmail.com>

On Mon, Nov 28, 2016 at 09:42:25PM -0800, Sargun Dhillon wrote:
> On Mon, Nov 28, 2016 at 7:50 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Nov 28, 2016 at 02:52:42PM -0800, Sargun Dhillon wrote:
> >> This patch modifies test_cgrp2_attach to use getopt so we can use standard
> >> command line parsing.
> >>
> >> It also adds an option to run the program in detach only mode. This does
> >> not attach a new filter at the cgroup, but only runs the detach command.
> >>
> >> Lastly, it changes the attach code to not detach and then attach. It relies
> >> on the 'hotswap' behaviour of CGroup BPF programs to be able to change
> >> in-place. If detach-then-attach behaviour needs to be tested, the example
> >> can be run in detach only mode prior to attachment.
> >>
> >> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> >
> > looks fine to me.
> > I'd really prefer this example to become an automated test eventually.
> I can do that. As far as test cases:
> 
> 1. create /foo
> 2. enter foo
> 3. attach drop filter to foo
> 4. try to ping 127.0.0.1 (make sure it returns 0 replies)
> 5. create /foo/bar
> 6. enter /foo/bar
> 7. try to ping 127.0.0.1 (make sure it returns 0 replies)
> 8. attach passthrough filter to foo/bar
> 9. try to ping 127.0.0.1 (make sure it returns 1 replies)
> 10. Detach filter from foo/bar
> 11. try to ping 127.0.0.1 (make sure it returns 0 replies)
> Reasonable?

awesome. sounds like a plan.

^ permalink raw reply

* [PATCH v2] net: macb: Write only necessary bits in NCR in macb reset
From: Harini Katakam @ 2016-11-29  5:56 UTC (permalink / raw)
  To: nicolas.ferre, davem, harinikatakamlinux
  Cc: netdev, linux-kernel, harinik, michals

In macb_reset_hw, use read-modify-write to disable RX and TX.
The current implementation writes 0 to NCR, which also clears
existing settings such as management port enable.
In addition, certain reserved bits are read-only,
so read-modify-write is the safer approach.
Use the same method for clearing statistics as well.

Signed-off-by: Harini Katakam <harinik@xilinx.com>
---

v2:
Make ctrl a u32
Improve commit description

---
 drivers/net/ethernet/cadence/macb.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index 0e489bb..2ce3407 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -1744,14 +1744,18 @@ static void macb_reset_hw(struct macb *bp)
 {
 	struct macb_queue *queue;
 	unsigned int q;
+	u32 ctrl;
 
 	/* Disable RX and TX (XXX: Should we halt the transmission
 	 * more gracefully?)
 	 */
-	macb_writel(bp, NCR, 0);
+	ctrl = macb_readl(bp, NCR);
+	ctrl &= ~(MACB_BIT(RE) | MACB_BIT(TE));
+	macb_writel(bp, NCR, ctrl);
 
 	/* Clear the stats registers (XXX: Update stats first?) */
-	macb_writel(bp, NCR, MACB_BIT(CLRSTAT));
+	ctrl |= MACB_BIT(CLRSTAT);
+	macb_writel(bp, NCR, ctrl);
 
 	/* Clear all status flags */
 	macb_writel(bp, TSR, -1);
-- 
2.7.4

^ permalink raw reply related

* Re: net: deadlock on genl_mutex
From: subashab @ 2016-11-29  5:59 UTC (permalink / raw)
  To: Eric Dumazet, Dmitry Vyukov
  Cc: David Miller, Matti Vaittinen, Tycho Andersen, Cong Wang,
	Florian Westphal, stephen hemminger, Tom Herbert, netdev, LKML,
	Richard Guy Briggs, syzkaller, netdev-owner
In-Reply-To: <CANn89iJ1G4e655RnfcGK9RwmSSb2L=ZBSq9e2gpHVNmkUOYUxg@mail.gmail.com>

> 
> Issue was reported yesterday and is under investigation.
> 
> 
> http://marc.info/?l=linux-netdev&m=148014004331663&w=2
> 
> 
> Thanks !

Hi Dmitry

Can you try the patch below with your reproducer? I haven't seen similar 
crashes reported after this (or even with Eric's patch).

https://patchwork.ozlabs.org/patch/699937/

^ permalink raw reply

* Re: net: deadlock on genl_mutex
From: Eric Dumazet @ 2016-11-29  6:06 UTC (permalink / raw)
  To: subashab
  Cc: Eric Dumazet, Dmitry Vyukov, David Miller, Matti Vaittinen,
	Tycho Andersen, Cong Wang, Florian Westphal, stephen hemminger,
	Tom Herbert, netdev, LKML, Richard Guy Briggs, syzkaller,
	netdev-owner
In-Reply-To: <0227d7e83cc5ac0a192d1ba0fee61413@codeaurora.org>

On Mon, 2016-11-28 at 22:59 -0700, subashab@codeaurora.org wrote:
> > 
> > Issue was reported yesterday and is under investigation.
> > 
> > 
> > http://marc.info/?l=linux-netdev&m=148014004331663&w=2
> > 
> > 
> > Thanks !
> 
> Hi Dmitry
> 
> Can you try the patch below with your reproducer? I haven't seen similar 
> crashes reported after this (or even with Eric's patch).
> 
> https://patchwork.ozlabs.org/patch/699937/

Yeah, I will post my patch on top of this one.

^ permalink raw reply
