Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 0/3] Add support for the ethernet switch on the ESPRESSObin
From: Romain Perier @ 2016-12-20  8:53 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Jason Cooper,
	Sebastian Hesselbarth, Gregory Clement
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	Rob Herring, Ian Campbell, Pawel Moll, Mark Rutland, Kumar Gala,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Thomas Petazzoni, Nadav Haklai
In-Reply-To: <20161220085138.3998-1-romain.perier-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org>

Hi,

Le 20/12/2016 à 09:51, Romain Perier a écrit :
> This set of patches adds support for the Marvell ethernet switch 88E6341.
> It also add the devicetree definition of this switch to the DT board.
>
> Romain Perier (3):
>   net: dsa: mv88e6xxx: Don't forbid MDIO I/Os for PHY addr >=
>     num_of_ports
>   net: dsa: mv88e6xxx: Add support for ethernet switch 88E6341/88E6141
>   arm64: dts: marvell: Add ethernet switch definition for the
>     ESPRESSObin
>
>  .../boot/dts/marvell/armada-3720-espressobin.dts   | 67 ++++++++++++++++++++++
>  drivers/net/dsa/mv88e6xxx/chip.c                   | 48 ++++++++++++++--
>  drivers/net/dsa/mv88e6xxx/mv88e6xxx.h              |  4 +-
>  3 files changed, 112 insertions(+), 7 deletions(-)
>

Oh and I rebased the whole series onto net-next.

Romain
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2 1/3] net: dsa: mv88e6xxx: Don't forbid MDIO I/Os for PHY addr >= num_of_ports
From: Romain Perier @ 2016-12-20  8:51 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Jason Cooper,
	Sebastian Hesselbarth, Gregory Clement
  Cc: netdev, devicetree, Rob Herring, Ian Campbell, Pawel Moll,
	Mark Rutland, Kumar Gala, linux-arm-kernel, Thomas Petazzoni,
	Nadav Haklai, Romain Perier
In-Reply-To: <20161220085138.3998-1-romain.perier@free-electrons.com>

Some Marvell ethernet switches have internal ethernet transceivers with
hardcoded phy addresses. These addresses can be greater than the number
of ports or its value might be different than the associated port number.
This is for example the case for MV88E6341 that has 6 ports and internal
Port 1 to Port4 PHYs mapped at SMI addresses from 0x11 to 0x14.

This commits fixes the issue by removing the condition in MDIO callbacks.

Signed-off-by: Romain Perier <romain.perier@free-electrons.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---

Changes in v2:
 - Added tag "Reviewed-by" by Andrew
 - Fixed typo in the commit log

 drivers/net/dsa/mv88e6xxx/chip.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index b5f0e1e..76d944e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2881,9 +2881,6 @@ static int mv88e6xxx_mdio_read(struct mii_bus *bus, int phy, int reg)
 	u16 val;
 	int err;
 
-	if (phy >= mv88e6xxx_num_ports(chip))
-		return 0xffff;
-
 	mutex_lock(&chip->reg_lock);
 	err = mv88e6xxx_phy_read(chip, phy, reg, &val);
 	mutex_unlock(&chip->reg_lock);
@@ -2896,9 +2893,6 @@ static int mv88e6xxx_mdio_write(struct mii_bus *bus, int phy, int reg, u16 val)
 	struct mv88e6xxx_chip *chip = bus->priv;
 	int err;
 
-	if (phy >= mv88e6xxx_num_ports(chip))
-		return 0xffff;
-
 	mutex_lock(&chip->reg_lock);
 	err = mv88e6xxx_phy_write(chip, phy, reg, val);
 	mutex_unlock(&chip->reg_lock);
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 2/3] net: dsa: mv88e6xxx: Add support for ethernet switch 88E6341/88E6141
From: Romain Perier @ 2016-12-20  8:51 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Jason Cooper,
	Sebastian Hesselbarth, Gregory Clement
  Cc: netdev, devicetree, Rob Herring, Ian Campbell, Pawel Moll,
	Mark Rutland, Kumar Gala, linux-arm-kernel, Thomas Petazzoni,
	Nadav Haklai, Romain Perier
In-Reply-To: <20161220085138.3998-1-romain.perier@free-electrons.com>

The Marvell 88E6341 device is single-chip, 6-port ethernet switch with
four integrated 10/100/1000Mbps ethernet transceivers and one high speed
SerDes interfaces. It is compatible with switches of family 88E6352.

This commit adds basic support for this switch by describing its
capabilities to the driver.

Signed-off-by: Romain Perier <romain.perier@free-electrons.com>
---

 Changes in v2:
  - Add a dedicated data structure for the operations of the 88E6341
  - Re-ordered PORT_SWITCH_ID_PROD_NUM_6341 in alphabetic order with other
    macros

 drivers/net/dsa/mv88e6xxx/chip.c      | 42 +++++++++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  4 +++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 76d944e..5e97dc4 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3625,6 +3625,34 @@ static const struct mv88e6xxx_ops mv88e6321_ops = {
 	.reset = mv88e6352_g1_reset,
 };
 
+static const struct mv88e6xxx_ops mv88e6341_ops = {
+	/* MV88E6XXX_FAMILY_6352 */
+	.get_eeprom = mv88e6xxx_g2_get_eeprom16,
+	.set_eeprom = mv88e6xxx_g2_set_eeprom16,
+	.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
+	.phy_read = mv88e6xxx_g2_smi_phy_read,
+	.phy_write = mv88e6xxx_g2_smi_phy_write,
+	.port_set_link = mv88e6xxx_port_set_link,
+	.port_set_duplex = mv88e6xxx_port_set_duplex,
+	.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
+	.port_set_speed = mv88e6352_port_set_speed,
+	.port_tag_remap = mv88e6095_port_tag_remap,
+	.port_set_frame_mode = mv88e6351_port_set_frame_mode,
+	.port_set_egress_unknowns = mv88e6351_port_set_egress_unknowns,
+	.port_set_ether_type = mv88e6351_port_set_ether_type,
+	.port_jumbo_config = mv88e6165_port_jumbo_config,
+	.port_egress_rate_limiting = mv88e6097_port_egress_rate_limiting,
+	.port_pause_config = mv88e6097_port_pause_config,
+	.stats_snapshot = mv88e6320_g1_stats_snapshot,
+	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
+	.stats_get_strings = mv88e6095_stats_get_strings,
+	.stats_get_stats = mv88e6095_stats_get_stats,
+	.g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+	.g1_set_egress_port = mv88e6095_g1_set_egress_port,
+	.mgmt_rsvd2cpu = mv88e6095_g2_mgmt_rsvd2cpu,
+	.reset = mv88e6352_g1_reset,
+};
+
 static const struct mv88e6xxx_ops mv88e6350_ops = {
 	/* MV88E6XXX_FAMILY_6351 */
 	.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
@@ -4086,6 +4114,20 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
 		.ops = &mv88e6321_ops,
 	},
 
+	[MV88E6341] = {
+		.prod_num = PORT_SWITCH_ID_PROD_NUM_6341,
+		.family = MV88E6XXX_FAMILY_6352,
+		.name = "Marvell 88E6341",
+		.num_databases = 4096,
+		.num_ports = 6,
+		.port_base_addr = 0x10,
+		.global1_addr = 0x1b,
+		.age_time_coeff = 15000,
+		.tag_protocol = DSA_TAG_PROTO_EDSA,
+		.flags = MV88E6XXX_FLAGS_FAMILY_6352,
+		.ops = &mv88e6341_ops,
+	},
+
 	[MV88E6350] = {
 		.prod_num = PORT_SWITCH_ID_PROD_NUM_6350,
 		.family = MV88E6XXX_FAMILY_6351,
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index af54bae..cb55fdb 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -100,6 +100,7 @@
 #define PORT_SWITCH_ID_PROD_NUM_6240	0x240
 #define PORT_SWITCH_ID_PROD_NUM_6290	0x290
 #define PORT_SWITCH_ID_PROD_NUM_6321	0x310
+#define PORT_SWITCH_ID_PROD_NUM_6341	0x340
 #define PORT_SWITCH_ID_PROD_NUM_6352	0x352
 #define PORT_SWITCH_ID_PROD_NUM_6350	0x371
 #define PORT_SWITCH_ID_PROD_NUM_6351	0x375
@@ -432,6 +433,7 @@ enum mv88e6xxx_model {
 	MV88E6290,
 	MV88E6320,
 	MV88E6321,
+	MV88E6341,
 	MV88E6350,
 	MV88E6351,
 	MV88E6352,
@@ -448,7 +450,7 @@ enum mv88e6xxx_family {
 	MV88E6XXX_FAMILY_6185,	/* 6108 6121 6122 6131 6152 6155 6182 6185 */
 	MV88E6XXX_FAMILY_6320,	/* 6320 6321 */
 	MV88E6XXX_FAMILY_6351,	/* 6171 6175 6350 6351 */
-	MV88E6XXX_FAMILY_6352,	/* 6172 6176 6240 6352 */
+	MV88E6XXX_FAMILY_6352,	/* 6172 6176 6240 6341 6352 */
 	MV88E6XXX_FAMILY_6390,  /* 6190 6190X 6191 6290 6390 6390X */
 };
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 3/3] arm64: dts: marvell: Add ethernet switch definition for the ESPRESSObin
From: Romain Perier @ 2016-12-20  8:51 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Jason Cooper,
	Sebastian Hesselbarth, Gregory Clement
  Cc: netdev, devicetree, Rob Herring, Ian Campbell, Pawel Moll,
	Mark Rutland, Kumar Gala, linux-arm-kernel, Thomas Petazzoni,
	Nadav Haklai, Romain Perier
In-Reply-To: <20161220085138.3998-1-romain.perier@free-electrons.com>

This defines and enables the Marvell ethernet switch MVE886341 on the
Marvell ESPRESSObin board.

Signed-off-by: Romain Perier <romain.perier@free-electrons.com>
---

Changes in v2:
 - EXPRESSObin -> ESPRESSObin
 - phy nodes definition must contain the internal bus address after the @

 .../boot/dts/marvell/armada-3720-espressobin.dts   | 67 ++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/arch/arm64/boot/dts/marvell/armada-3720-espressobin.dts b/arch/arm64/boot/dts/marvell/armada-3720-espressobin.dts
index 83178d9..9582661 100644
--- a/arch/arm64/boot/dts/marvell/armada-3720-espressobin.dts
+++ b/arch/arm64/boot/dts/marvell/armada-3720-espressobin.dts
@@ -80,3 +80,70 @@
 &usb3 {
 	status = "okay";
 };
+
+&mdio {
+	switch0: switch0@0 {
+		compatible = "marvell,mv88e6085";
+		#address-cells = <1>;
+		#size-cells = <0>;
+		reg = <1>;
+
+		dsa,member = <0 0>;
+
+		ports {
+			#address-cells = <1>;
+			#size-cells = <0>;
+
+			port@0 {
+				reg = <0>;
+				label = "cpu";
+				ethernet = <&eth0>;
+			};
+
+			port@1 {
+				reg = <1>;
+				label = "wan";
+				phy-handle = <&switch0phy0>;
+			};
+
+			port@2 {
+				reg = <2>;
+				label = "lan0";
+				phy-handle = <&switch0phy1>;
+			};
+
+			port@3 {
+				reg = <3>;
+				label = "lan1";
+				phy-handle = <&switch0phy2>;
+			};
+
+		};
+
+		mdio {
+			#address-cells = <1>;
+			#size-cells = <0>;
+			reg = <1>;
+
+			switch0phy0: switch0phy0@11 {
+				reg = <0x11>;
+			};
+			switch0phy1: switch0phy1@12 {
+				reg = <0x12>;
+			};
+			switch0phy2: switch0phy2@13 {
+				reg = <0x13>;
+			};
+		};
+	};
+};
+
+&eth0 {
+	phy-mode = "rgmii-id";
+	status = "okay";
+
+	fixed-link {
+		speed = <1000>;
+		full-duplex;
+	};
+};
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 0/3] Add support for the ethernet switch on the ESPRESSObin
From: Romain Perier @ 2016-12-20  8:51 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Jason Cooper,
	Sebastian Hesselbarth, Gregory Clement
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	Rob Herring, Ian Campbell, Pawel Moll, Mark Rutland, Kumar Gala,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Thomas Petazzoni, Nadav Haklai, Romain Perier


This set of patches adds support for the Marvell ethernet switch 88E6341.
It also add the devicetree definition of this switch to the DT board.

Romain Perier (3):
  net: dsa: mv88e6xxx: Don't forbid MDIO I/Os for PHY addr >=
    num_of_ports
  net: dsa: mv88e6xxx: Add support for ethernet switch 88E6341/88E6141
  arm64: dts: marvell: Add ethernet switch definition for the
    ESPRESSObin

 .../boot/dts/marvell/armada-3720-espressobin.dts   | 67 ++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/chip.c                   | 48 ++++++++++++++--
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h              |  4 +-
 3 files changed, 112 insertions(+), 7 deletions(-)

-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v8 3/8] thunderbolt: Communication with the ICM (firmware)
From: Mika Westerberg @ 2016-12-20  8:09 UTC (permalink / raw)
  To: Mario.Limonciello
  Cc: luto, amir.jer.levy, gregkh, andreas.noever, bhelgaas, corbet,
	linux-kernel, linux-pci, netdev, linux-doc, thunderbolt-linux,
	tomas.winkler, xiong.y.zhang
In-Reply-To: <05d9c1db67af410981bbb9672071a237@ausx13mpc120.AMER.DELL.COM>

On Mon, Dec 19, 2016 at 05:21:39PM +0000, Mario.Limonciello@dell.com wrote:
> Dell - Internal Use - Confidential  
> 
> > 
> > There is small problem, though. On non-Apple systems the host controller only
> > appears when something is connected to thunderbolt ports. So the char device
> > would not be there all the time. However, I think we can still notify the
> > userspace by sending an extra uevent when we detect there is a PCIe device or
> > inter-domain connection plugged in.
> > 
> 
> Why couldn't you just create it the first time a device is plugged into a Thunderbolt
> port and leave it until the module is cleaned up?  If the host controller goes to sleep
> an event could be sent to the daemon to let it know it disappeared and not to expect
> data on the char device for now, but leave the node around. 

We don't do that for USB memory sticks (or any other removable media)
either - once you disconnect the device the nodes are also removed. I
suppose same goes with USB network adapters, which is closest to
thunderbolt networking I can think of.

^ permalink raw reply

* Re: "rfkill: Add rfkill-any LED trigger" causes deadlock
From: Johannes Berg @ 2016-12-20  7:47 UTC (permalink / raw)
  To: Michał Kępień
  Cc: Михаил Кринкин,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161219140115.GA18194-d0zqvqtRkg7MisJl3QRgO5JNpPtI3pG4@public.gmane.org>


> Just to ensure things get cleaned up properly, as of now it looks
> like you only reverted patch 2/2 of my v2 and Arnd's fix to patch
> 1/2, not patch 1/2 itself.

Oops. I've fixed that up to only revert your patch - I wanted the
revert in the tree to document the issue, rather than just dropping the
patch entirely. Thanks for noticing that mistake.

johannes

^ permalink raw reply

* [PATCHv2 net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list
From: Xin Long @ 2016-12-20  5:49 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman
In-Reply-To: <cover.1482212764.git.lucien.xin@gmail.com>

sctp.local_addr_list is a global address list that is supposed to include
all the local addresses. sctp updates this list according to NETDEV_UP/
NETDEV_DOWN notifications.

However, if multiple NICs have the same address, the global list would
have duplicate addresses. Even if for one NIC, promote secondaries in
__inet_del_ifa can also lead to accumulating duplicate addresses.

When sctp binds address 'ANY' and creates a connection, it copies all
the addresses from global list into asoc's bind addr list, which makes
sctp pack the duplicate addresses into INIT/INIT_ACK packets.

This patch is to filter the duplicate addresses when copying the addrs
from global list in sctp_copy_local_addr_list and unpacking addr_param
from cookie in sctp_raw_to_bind_addrs to asoc's bind addr list.

Note that we can't filter the duplicate addrs when global address list
gets updated, As NETDEV_DOWN event may remove an addr that still exists
in another NIC.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/bind_addr.c | 3 +++
 net/sctp/protocol.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index 401c607..1ebc184 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -292,6 +292,8 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, __u8 *raw_addr_list,
 		}
 
 		af->from_addr_param(&addr, rawaddr, htons(port), 0);
+		if (sctp_bind_addr_state(bp, &addr) != -1)
+			goto next;
 		retval = sctp_add_bind_addr(bp, &addr, sizeof(addr),
 					    SCTP_ADDR_SRC, gfp);
 		if (retval) {
@@ -300,6 +302,7 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, __u8 *raw_addr_list,
 			break;
 		}
 
+next:
 		len = ntohs(param->length);
 		addrs_len -= len;
 		raw_addr_list += len;
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index da5d82b..616a942 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -220,6 +220,9 @@ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *bp,
 		     !(copy_flags & SCTP_ADDR6_PEERSUPP)))
 			continue;
 
+		if (sctp_bind_addr_state(bp, &addr->a) != -1)
+			continue;
+
 		error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
 					   SCTP_ADDR_SRC, GFP_ATOMIC);
 		if (error)
-- 
2.1.0

^ permalink raw reply related

* [PATCHv2 net 1/2] sctp: reduce indent level in sctp_copy_local_addr_list
From: Xin Long @ 2016-12-20  5:49 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman
In-Reply-To: <cover.1482212764.git.lucien.xin@gmail.com>

This patch is to reduce indent level by using continue when the addr
is not allowed, and also drop end_copy by using break.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/protocol.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 7b523e3..da5d82b 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -205,26 +205,27 @@ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *bp,
 	list_for_each_entry_rcu(addr, &net->sctp.local_addr_list, list) {
 		if (!addr->valid)
 			continue;
-		if (sctp_in_scope(net, &addr->a, scope)) {
-			/* Now that the address is in scope, check to see if
-			 * the address type is really supported by the local
-			 * sock as well as the remote peer.
-			 */
-			if ((((AF_INET == addr->a.sa.sa_family) &&
-			      (copy_flags & SCTP_ADDR4_PEERSUPP))) ||
-			    (((AF_INET6 == addr->a.sa.sa_family) &&
-			      (copy_flags & SCTP_ADDR6_ALLOWED) &&
-			      (copy_flags & SCTP_ADDR6_PEERSUPP)))) {
-				error = sctp_add_bind_addr(bp, &addr->a,
-						    sizeof(addr->a),
-						    SCTP_ADDR_SRC, GFP_ATOMIC);
-				if (error)
-					goto end_copy;
-			}
-		}
+		if (!sctp_in_scope(net, &addr->a, scope))
+			continue;
+
+		/* Now that the address is in scope, check to see if
+		 * the address type is really supported by the local
+		 * sock as well as the remote peer.
+		 */
+		if (addr->a.sa.sa_family == AF_INET &&
+		    !(copy_flags & SCTP_ADDR4_PEERSUPP))
+			continue;
+		if (addr->a.sa.sa_family == AF_INET6 &&
+		    (!(copy_flags & SCTP_ADDR6_ALLOWED) ||
+		     !(copy_flags & SCTP_ADDR6_PEERSUPP)))
+			continue;
+
+		error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
+					   SCTP_ADDR_SRC, GFP_ATOMIC);
+		if (error)
+			break;
 	}
 
-end_copy:
 	rcu_read_unlock();
 	return error;
 }
-- 
2.1.0

^ permalink raw reply related

* [PATCHv2 net 0/2] fix the issue that may copy duplicate addrs into assoc's bind address list
From: Xin Long @ 2016-12-20  5:49 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

Patch 1/2 is to fix some indent level.

Given that we have kernels out there with this issue, patch 2/2 also
fix sctp_raw_to_bind_addrs.

v1 -> v2:
  Explain why we didn't filter the duplicate addresses when global
  address list gets updated in patch 2/2 changelog.

Xin Long (2):
  sctp: reduce indent level in sctp_copy_local_addr_list
  sctp: not copying duplicate addrs to the assoc's bind address list

 net/sctp/bind_addr.c |  3 +++
 net/sctp/protocol.c  | 40 ++++++++++++++++++++++------------------
 2 files changed, 25 insertions(+), 18 deletions(-)

-- 
2.1.0

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  5:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Ahern, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <CALCETrVxkdZA3SsRv0KKhBz9YvNMsnmHSjS8HN1GHrgWRYNM1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Dec 19, 2016 at 09:27:18PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
> >>
> >> struct cgroup_bpf {
> >>         /*
> >>          * Store two sets of bpf_prog pointers, one for programs that are
> >>          * pinned directly to this cgroup, and one for those that are effective
> >>          * when this cgroup is accessed.
> >>          */
> >>         struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
> >>         struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> >> };
> >>
> >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
> >>
> >> This would change to something like:
> >>
> >> struct cgroup_filter_slot {
> >>   struct bpf_prog *effective;
> >>   struct cgroup_filter_slot *next;
> >>   struct bpf_prog *local;
> >> }
> >>
> >> local is NULL unless *this* cgroup has a filter.  effective points to
> >> the bpf_prog that's active in this cgroup or the nearest ancestor that
> >> has a filter.  next is NULL if there are no filters higher in the
> >> chain or points to the next slot that has a filter.  struct cgroup
> >> has:
> >>
> >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
> >>
> >> To evaluate it, you do:
> >>
> >> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
> >>
> >> if (!slot->effective)
> >>   return;
> >>
> >> do {
> >>   evaluate(slot->effective);
> >>   slot = slot->next;
> >> } while (unlikely(slot));
> >
> > yes. something like this can work as a future extension
> > to support multiple programs for security use case.
> > Please propose a patch.
> > Again, it's not needed today and there is no rush to implement it.
> >
> 
> If this happens after 4.10 and 4.10 is released as is, then this
> change would be an ABI break.

it won't break existing apps.
please study how bpf syscall was extended in the past without
breaking anything.
Same thing here. The default is one program per hook per cgroup.
Everything else is future extensions.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  5:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <20161220044440.GB86803-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>

On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
>>
>> struct cgroup_bpf {
>>         /*
>>          * Store two sets of bpf_prog pointers, one for programs that are
>>          * pinned directly to this cgroup, and one for those that are effective
>>          * when this cgroup is accessed.
>>          */
>>         struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>>         struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
>> };
>>
>> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
>>
>> This would change to something like:
>>
>> struct cgroup_filter_slot {
>>   struct bpf_prog *effective;
>>   struct cgroup_filter_slot *next;
>>   struct bpf_prog *local;
>> }
>>
>> local is NULL unless *this* cgroup has a filter.  effective points to
>> the bpf_prog that's active in this cgroup or the nearest ancestor that
>> has a filter.  next is NULL if there are no filters higher in the
>> chain or points to the next slot that has a filter.  struct cgroup
>> has:
>>
>> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
>>
>> To evaluate it, you do:
>>
>> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
>>
>> if (!slot->effective)
>>   return;
>>
>> do {
>>   evaluate(slot->effective);
>>   slot = slot->next;
>> } while (unlikely(slot));
>
> yes. something like this can work as a future extension
> to support multiple programs for security use case.
> Please propose a patch.
> Again, it's not needed today and there is no rush to implement it.
>

If this happens after 4.10 and 4.10 is released as is, then this
change would be an ABI break.

--Andy

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  5:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Andrew Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <20161220045155.GC86803@ast-mbp.thefacebook.com>

On Mon, Dec 19, 2016 at 8:51 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
>>
>> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
>> even take a reference to a BPF program as an argument.  What is it
>> supposed to do if this mechanism ever gets extended?
>
> we just add another field to that anonymous union just like
> we did for other commands and everything is backwards compatible.
> It's the basics of bpf syscall that we've been relying on for some
> time now and it worked just fine.

And what happens if you don't specify that member and two programs are attached?

--Andy

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-20  4:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers
In-Reply-To: <1482209536.1521.21.camel@edumazet-glaptop3.roam.corp.google.com>


> On Dec 19, 2016, at 11:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
>>> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> 
>>>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>>>> 
>>>> When sockets created SO_REUSEPORT move to TW state they are placed
>>>> back on the the tb->owners. fastreuse port is no longer set so we have
>>>> to walk potential long list of sockets in tb->owners to open a new
>>>> listener socket. I imagine this is happens when we try to open a new
>>>> listener SO_REUSEPORT after the system has been running a while and so
>>>> we hit the long tb->owners list.
>>> 
>>> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
>>> 
>>> So where tb->fastreuse is cleared ?
>>> 
>>> If all your sockets have SO_REUSEPORT set, this should not happen.
>>> 
>> 
>> The app starts out with no SO_REUSEPORT, and then we restart it with
>> that option enabled.
> 
> But... why would the application do this dance ?
> 
> I now better understand why we never had these issues...
> 

It doesn't do it as a part of it's normal operation.  The old version didn't use SO_REUSEPORT and then somebody added support for it, restarted the service with the new option enabled and boom.  They immediately stopped doing anything and gave it to me to figure out.

> 
>>  What I suspect is we have all the twsks from the original service,
>> and the fastreuse stuff is cleared.  My naive patch resets it once we
>> add a reuseport sk to the tb and that makes the problem go away.  I'm
>> reworking all of this logic and adding some extra info to the tb to
>> make the reset actually safe.  I'll send those patches out tomorrow.
>> Thanks,
> 
> Okay, we will review them ;)
> 
> Note that Willy Tarreau wants some mechanism to be able to freeze a
> listener, to allow haproxy to be replaced without closing any sessions.
> 

I assume that's what these guys would want as well.  They have some weird handoff thing they do when the app starts but I'm not entirely convinced it does what they think it does.  Thanks,

Josef

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Eric Dumazet @ 2016-12-20  4:52 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers
In-Reply-To: <9DF94C8E-1463-4C10-81E3-E6F4534097CB@fb.com>

On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
> > On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> >> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
> >> 
> >> When sockets created SO_REUSEPORT move to TW state they are placed
> >> back on the the tb->owners. fastreuse port is no longer set so we have
> >> to walk potential long list of sockets in tb->owners to open a new
> >> listener socket. I imagine this is happens when we try to open a new
> >> listener SO_REUSEPORT after the system has been running a while and so
> >> we hit the long tb->owners list.
> > 
> > Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> > 
> > So where tb->fastreuse is cleared ?
> > 
> > If all your sockets have SO_REUSEPORT set, this should not happen.
> > 
> 
> The app starts out with no SO_REUSEPORT, and then we restart it with
> that option enabled.

But... why would the application do this dance ?

I now better understand why we never had these issues...


>   What I suspect is we have all the twsks from the original service,
> and the fastreuse stuff is cleared.  My naive patch resets it once we
> add a reuseport sk to the tb and that makes the problem go away.  I'm
> reworking all of this logic and adding some extra info to the tb to
> make the reset actually safe.  I'll send those patches out tomorrow.
> Thanks,

Okay, we will review them ;)

Note that Willy Tarreau wants some mechanism to be able to freeze a
listener, to allow haproxy to be replaced without closing any sessions.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  4:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Miller, Andrew Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrUc-iDUT8isKc43PPZ3xz31Sz+QTU+_SQQTAsWhH+zkLw@mail.gmail.com>

On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
> 
> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
> even take a reference to a BPF program as an argument.  What is it
> supposed to do if this mechanism ever gets extended?

we just add another field to that anonymous union just like
we did for other commands and everything is backwards compatible.
It's the basics of bpf syscall that we've been relying on for some
time now and it worked just fine.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  4:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Ahern, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrVKu63BFVQFAJcLcd6ovPtq-WDdTh-BwyAPSprw8UarNQ@mail.gmail.com>

On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
> 
> struct cgroup_bpf {
>         /*
>          * Store two sets of bpf_prog pointers, one for programs that are
>          * pinned directly to this cgroup, and one for those that are effective
>          * when this cgroup is accessed.
>          */
>         struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>         struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> };
> 
> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
> 
> This would change to something like:
> 
> struct cgroup_filter_slot {
>   struct bpf_prog *effective;
>   struct cgroup_filter_slot *next;
>   struct bpf_prog *local;
> }
> 
> local is NULL unless *this* cgroup has a filter.  effective points to
> the bpf_prog that's active in this cgroup or the nearest ancestor that
> has a filter.  next is NULL if there are no filters higher in the
> chain or points to the next slot that has a filter.  struct cgroup
> has:
> 
> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
> 
> To evaluate it, you do:
> 
> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
> 
> if (!slot->effective)
>   return;
> 
> do {
>   evaluate(slot->effective);
>   slot = slot->next;
> } while (unlikely(slot));

yes. something like this can work as a future extension
to support multiple programs for security use case.
Please propose a patch.
Again, it's not needed today and there is no rush to implement it.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  4:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A@mail.gmail.com>

On Mon, Dec 19, 2016 at 07:50:01PM -0800, Andy Lutomirski wrote:
> >>
> >> net.socket_create_filter = "none": no filter
> >> net.socket_create_filter = "bpf:baadf00d": bpf filter
> >
> > i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> > and kernel needs to parse above? will you allow capital and lower
> > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> > can program fd expressed as decimal or hex or both?
> > how do you return the error? as a text string for user space
> > to parse?
> 
> No.  The kernel does not parse it because you cannot write this to the
> file.  You set a bpf filter with ioctl and pass an fd.  If you *read*
> the file, you get the same bpf program hash that fdinfo on the bpf
> object would show -- this is for debugging and (eventually) CRIU.

my understanding that cgroup is based on kernfs and both don't support
ioctl, so you'd need quite some hacking to introduce such concepts
and buy-in from a bunch of people first.

> >> net.socket_create_filter = "iptables:foobar": some iptables thingy
> >> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> >
> > iptables/nft are not applicable to 'bpf_socket_create' which
> > looks into:
> > struct bpf_sock {
> >         __u32 bound_dev_if;
> >         __u32 family;
> >         __u32 type;
> >         __u32 protocol;
> > };
> > so don't fit as an example.
> 
> The code that takes a 'struct sock' and sets up bpf_sock is
> bpf-specific and would obviously not be used for non-bpf filter.  But
> if you had a filter that was just a like of address families, that
> filter would look at struct sock and do its own thing.  iptables
> wouldn't make sense for a socket creation filter, but it would make
> perfect sense for an ingress filter.  Obviously the bpf-specific state
> object wouldn't be used, but it would still be a hook, invoked from
> the same network code, guarded by the same static key, looking at the
> same skb.

I strongly suggest to go back and read my first reply where
I think I explained well enough that something like iptables
will not able to reuse the ioctl mechanism you're proposing here,
hook ids will be different, attachment mechanism will be different too.
So your proposed cgroup ioctl is already dead as a reusable interface.

> >> net.socket_create_filter = "disallow": no sockets created period
> >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
> >
> > so you're proposing to add a bunch of hard coded logic to the kernel.
> > First to parse such text into some sort of syntax tree or list/set
> > and then have hard coded logic specifically for these two use cases?
> > While above two can be implemented as trivial bpf programs already?!
> > That goes 180% degree vs bpf philosophy. bpf is about moving
> > the specific code out of the kernel and keeping kernel generic that
> > it can solve as many use cases as possible by being programmable.
> 
> I'm not seriously proposing implementing these.  My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.

Then please let's talk about real use cases.
This daydreaming of some future llvm in the kernel that you were
bringing up during LPC doesn't help the discussion.
Just like these artificial examples.

> > ...
> >> What exactly isn't sensible about using cgroup_bpf for containers?
> >
> > my use case is close to zero overhead application network monitoring.
> 
> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?

yes. both are root and must talk to each other if they want
to co-exist. When root process is asking kernel to do X, this X has
to happen.

> Then disallow nesting.  You're welcome to not rush the decision, but
> that's not what you've done.  If 4.10 is released as is, you've made
> the decision and you're going to have a hard time changing it.

Nothing needs to be changed.

> > No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> > type programs and only root can attach them.
> 
> Why?  It really seems to me that you expect that future namespaceable
> bpf hooks will use a totally different API.  At KS, I sat in a room
> full of people arguing about cgroup v2 and a lot of them pointed out
> that there are valid, paying use cases that want to stick cgroup v1 in
> a container, because the code that uses cgroup v1 in the container is
> called systemd and the contained OS is called RHEL (or SuSE or Ubuntu
> or Gentoo or whatever), and that code is *already written* and needs
> to be contained.

bpf in general is not namespace aware. It's global and this cgroup scoping
of bpf programs is the first of this kind. Namespacing of bpf is completely
different topic.

> The current approach to bpf hooks will bite you down the road.  David
> Ahern is already proposing using it for something that is not tracing
> at all, and someone will want that in a container, and there will be a
> problem.

vrf use case already supported by existing code.

> How about slowing down a wee bit and trying to come up with cgroup
> hook semantics that work for all of these use cases?  I think my
> proposal is quite close to workable.

you've started the topic by claiming that things are broken
and non-extensible. At the course of this thread it was explained
that the interface is extensible and not broken for the use case
it was designed for. The 'security' use case like lsm+bpf+cgroup
is not supported by the current model yet and that's what we need
to discuss in the future. So, yes, please slow down.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  3:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <20161220031802.GA77838-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>

On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
>> I think we're still talking past each other.  A big part of the point
>> of changing it is that none of this is specific to bpf.  You could (in
>
> the hooks and context passed into the program is very much bpf specific.
> That's what I've been trying to convey all along.

You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)?  There is nothing bpf
specfic about the hook except that the name of this macro has "BPF" in
it.  There is nothing whatsoever that's bpf-specific about the context
-- sk is not bpf-specific at all.

The only thing bpf-specific about it is that it currently only invokes
bpf programs.  That could easily change.

>
>> theory -- I'm not proposing implementing these until there's demand)
>> have:
>>
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>
> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> and kernel needs to parse above? will you allow capital and lower
> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> can program fd expressed as decimal or hex or both?
> how do you return the error? as a text string for user space
> to parse?

No.  The kernel does not parse it because you cannot write this to the
file.  You set a bpf filter with ioctl and pass an fd.  If you *read*
the file, you get the same bpf program hash that fdinfo on the bpf
object would show -- this is for debugging and (eventually) CRIU.

>
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>
> iptables/nft are not applicable to 'bpf_socket_create' which
> looks into:
> struct bpf_sock {
>         __u32 bound_dev_if;
>         __u32 family;
>         __u32 type;
>         __u32 protocol;
> };
> so don't fit as an example.

The code that takes a 'struct sock' and sets up bpf_sock is
bpf-specific and would obviously not be used for non-bpf filter.  But
if you had a filter that was just a like of address families, that
filter would look at struct sock and do its own thing.  iptables
wouldn't make sense for a socket creation filter, but it would make
perfect sense for an ingress filter.  Obviously the bpf-specific state
object wouldn't be used, but it would still be a hook, invoked from
the same network code, guarded by the same static key, looking at the
same skb.

>
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> so you're proposing to add a bunch of hard coded logic to the kernel.
> First to parse such text into some sort of syntax tree or list/set
> and then have hard coded logic specifically for these two use cases?
> While above two can be implemented as trivial bpf programs already?!
> That goes 180% degree vs bpf philosophy. bpf is about moving
> the specific code out of the kernel and keeping kernel generic that
> it can solve as many use cases as possible by being programmable.

I'm not seriously proposing implementing these.  My point is that
*bpf*, while wonderful, is not the be-all-and-end-all of kernel
configurability, and other types of hooks might want to be hooked in
here.

> ...
>> What exactly isn't sensible about using cgroup_bpf for containers?
>
> my use case is close to zero overhead application network monitoring.

So if I set up a cgroup that's monitored and call it /cgroup/a and
enable delegation and if the program running there wants to do its own
monitoring in /cgroup/a/b (via delegation), then you really want the
outer monitor to silently drop events coming from /cgroup/a/b?

>
>> >> > As you're pointing out, in case of security, we probably
>> >> > want to preserve original bpf program that should always be
>> >> > run first and only after it returned 'ok' (we'd need to define
>> >> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>> >>
>> >> It's already defined AFAICT.  1 means okay.  0 means not okay.
>> >
>> > sorry that doesn't make any sense. For seccomp we have a set of
>> > ranges that mean different things. Here you're proposing to
>> > hastily assign 1 and 0 ? How is that extensible?
>> > We need to carefully think through what should be the semantics
>> > of attaching multiple programs, consider performance implications,
>> > return codes and so on.
>>
>> You already assigned it.  The return value of the bpf program, loaded
>> in Linus' tree today, tells the kernel whether to accept or reject.
>
> yes. that's what the program tells the hook.
> I'm saying that whenever we have a link list of the programs
> interaction between them may or may not be expressible with 1/0
> and I don't want to rush such decision.

Then disallow nesting.  You're welcome to not rush the decision, but
that's not what you've done.  If 4.10 is released as is, you've made
the decision and you're going to have a hard time changing it.

>
>> >
>> >> > Another alternative is to disallow attaching programs in sub-hierarchy
>> >> > if parent has something already attached, but it's not useful
>> >> > for general case.
>> >> > All of these are possible future extensions.
>> >>
>> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>> >> have to do it now (or disable the feature for 4.10).  This is why I'm
>> >> bringing this whole thing up now.
>> >
>> > We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh?  My example in the original email attaches a program in a
>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>> working?
>
> No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> type programs and only root can attach them.

Why?  It really seems to me that you expect that future namespaceable
bpf hooks will use a totally different API.  At KS, I sat in a room
full of people arguing about cgroup v2 and a lot of them pointed out
that there are valid, paying use cases that want to stick cgroup v1 in
a container, because the code that uses cgroup v1 in the container is
called systemd and the contained OS is called RHEL (or SuSE or Ubuntu
or Gentoo or whatever), and that code is *already written* and needs
to be contained.

The current approach to bpf hooks will bite you down the road.  David
Ahern is already proposing using it for something that is not tracing
at all, and someone will want that in a container, and there will be a
problem.

> Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches
> we'll keep debating the right security model for it.
>

How about slowing down a wee bit and trying to come up with cgroup
hook semantics that work for all of these use cases?  I think my
proposal is quite close to workable.

--Andy

^ permalink raw reply

* Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support
From: Brian Norris @ 2016-12-20  3:41 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev, devicetree, linux-bluetooth, linux-kernel, Rajat Jain
In-Reply-To: <CACK8Z6FteRYmrNq5Oq9LpHOUOYg49XrL0WsxD8G45pTw_cwaSg@mail.gmail.com>

On Mon, Dec 19, 2016 at 05:46:19PM -0800, Rajat Jain wrote:
> On Mon, Dec 19, 2016 at 3:10 PM, Brian Norris <briannorris@chromium.org> wrote:
> > On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote:
> >> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
> >> index ce22cef..beca4e9 100644
> >> --- a/drivers/bluetooth/btusb.c
> >> +++ b/drivers/bluetooth/btusb.c

> >> +static const struct of_device_id btusb_match_table[] = {
> >> +     { .compatible = "usb1286,204e" },
> >> +     { }
> >> +};
> >> +MODULE_DEVICE_TABLE(of, btusb_match_table);
> >
> > You define a match table here, but you also define essentially same
> > table for Marvell-specific additions in patch 3. It looks like maybe
> > it's legal to have more than one OF table in a module? But it seems like
> > it would get confusing, besides being somewhat strange to maintain. It
> > might also produce duplicate 'modinfo' output.
> >
> > If you really want to independently opt into device-tree-specified
> > interrupts vs. Marvell-specific interrrupt configuration, then you
> > should probably just merge the latter into the former table, and
> > implement a Marvell/GPIO flag to stick in the .data field of this table.
> 
> The data we want to stick (The pin number on the chip) is board
> specific rather than being chip specific, and hence .data may not be
> useful here.

I just meant that if you conceptually wanted Marvell devices to "opt in"
to this, then you could turn the .data field into some kind of flag or
struct for describing capabilities (e.g., MARVELL_GPIO_WAKE or similar),
effectively merging the two tables instead of having two
mostly-overlapping tables.

But you could do the same by putting such flag(s) in the
{btusb,blacklist}_table[], or add such flag(s) later whenever there's a
Marvell device that doesn't support this feature.

> > Or it might be fine to drop one or both "match" checks. Particularly for
> > the Marvell-specific stuff, it's probably fair just to check if it has
> > an ->of_node and if 'id->driver_info & BTUSB_MARVELL'. Any other Marvell
> > device-specific quirks could probably be keyed off of the
> > (weirdly-named?) blacklist_table[], which already matches PID/VID.
> 
> Yup, I think I can remove the compatible string checks.
> 
> I'll be sending a V3.

Great!

Brian

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-20  3:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers
In-Reply-To: <1482201702.1521.13.camel@edumazet-glaptop3.roam.corp.google.com>


> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>> 
>> When sockets created SO_REUSEPORT move to TW state they are placed
>> back on the the tb->owners. fastreuse port is no longer set so we have
>> to walk potential long list of sockets in tb->owners to open a new
>> listener socket. I imagine this is happens when we try to open a new
>> listener SO_REUSEPORT after the system has been running a while and so
>> we hit the long tb->owners list.
> 
> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> 
> So where tb->fastreuse is cleared ?
> 
> If all your sockets have SO_REUSEPORT set, this should not happen.
> 

The app starts out with no SO_REUSEPORT, and then we restart it with that option enabled.  What I suspect is we have all the twsks from the original service, and the fastreuse stuff is cleared.  My naive patch resets it once we add a reuseport sk to the tb and that makes the problem go away.  I'm reworking all of this logic and adding some extra info to the tb to make the reset actually safe.  I'll send those patches out tomorrow. Thanks,

Josef

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  3:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
> >> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> >> Hi all-
> >> >>
> >> >> I apologize for being rather late with this.  I didn't realize that
> >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> >> it on the linux-api list, so I missed the discussion.
> >> >>
> >> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> >> but I think the API has some issues that will bite us down the road
> >> >> if it becomes stable in its current form.
> >> >>
> >> >> Most of the problems I see are summarized in this transcript:
> >> >>
> >> >> # mkdir cg2
> >> >> # mount -t cgroup2 none cg2
> >> >> # mkdir cg2/nosockets
> >> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> >> ...
> >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >> >>
> >> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
> >> >>
> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >> >>
> >> >> ^^^^ This is fine.  The bpf() syscall manipulates bpf objects.
> >> >>
> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >> >>
> >> >> ^^^^ This is not so good:
> >> >> ^^^^
> >> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects.  This
> >> >> ^^^^    is manipulating a cgroup.  There's no reason that a socket creation
> >> >> ^^^^    filter couldn't be written in a different language (new iptables
> >> >> ^^^^    table?  Simple list of address families?), but if that happened,
> >> >> ^^^^    then using bpf() to install it would be entirely nonsensical.
> >> >
> >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> >> > network stack is using cgroup as an application group that should
> >> > invoke bpf program at the certain point in the stack.
> >> > imo cgroup management is orthogonal.
> >>
> >> It is literally modifying the struct cgroup, and, as a practical
> >> matter, it's causing membership in the cgroup to have a certain
> >> effect.  But rather than pointless arguing, let me propose an
> >> alternative API that I think solves most of the problems here.
> >>
> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> >> Instead, the cgroup gets three new control files:
> >> "net.ingress_filter", "net.egress_filter", and
> >> "net.socket_create_filter".  Initially, if you read these files, you
> >> see "none\n".
> >>
> >> To attach a bpf filter, you open the file for write and do an ioctl on
> >> it.  After doing the ioctl, if you read the file, you'll see
> >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> >> the bpf program.
> >>
> >> To detach any type of filter, bpf or otherwise, you open the file for
> >> write and write "none\n" (or just "none").
> >>
> >> If you write anything else to the file, you get -EINVAL.  But, if
> >> someone writes a new type of filter (perhaps a simple list of address
> >> families), maybe you can enable the filter by writing something
> >> appropriate to the file.
> >
> > I see no difference in what you're proposing vs what is already implemented
> > from feature set point of view, but the file approach is very ugly, since
> > it's a mismatch to FD style access that bpf is using everywhere.
> > In your proposal you'd also need to add bpf prefix everywhere.
> > So the control file names should be bpf_inet_ingress, bpf_inet_egress
> > and bpf_socket_create.
> 
> I think we're still talking past each other.  A big part of the point
> of changing it is that none of this is specific to bpf.  You could (in

the hooks and context passed into the program is very much bpf specific.
That's what I've been trying to convey all along.

> theory -- I'm not proposing implementing these until there's demand)
> have:
> 
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter

i'm assuming 'baadf00d' is bpf program fd expressed a text string?
and kernel needs to parse above? will you allow capital and lower
case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
can program fd expressed as decimal or hex or both?
how do you return the error? as a text string for user space
to parse?

> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy

iptables/nft are not applicable to 'bpf_socket_create' which
looks into:
struct bpf_sock {
        __u32 bound_dev_if;
        __u32 family;
        __u32 type;
        __u32 protocol;
};
so don't fit as an example.

> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

so you're proposing to add a bunch of hard coded logic to the kernel.
First to parse such text into some sort of syntax tree or list/set
and then have hard coded logic specifically for these two use cases?
While above two can be implemented as trivial bpf programs already?!
That goes 180% degree vs bpf philosophy. bpf is about moving
the specific code out of the kernel and keeping kernel generic that
it can solve as many use cases as possible by being programmable.

> See?  This API is not bpf-specific.  It's an API for filtering.  The

no. I don't see it. BPF_CGROUP_INET_SOCK_CREATE is very much bpf specific
and we just discussed it to the last the detail.

> Can you explain your use case more clearly?
...
> What exactly isn't sensible about using cgroup_bpf for containers?

my use case is close to zero overhead application network monitoring.

> >> > As you're pointing out, in case of security, we probably
> >> > want to preserve original bpf program that should always be
> >> > run first and only after it returned 'ok' (we'd need to define
> >> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
> >>
> >> It's already defined AFAICT.  1 means okay.  0 means not okay.
> >
> > sorry that doesn't make any sense. For seccomp we have a set of
> > ranges that mean different things. Here you're proposing to
> > hastily assign 1 and 0 ? How is that extensible?
> > We need to carefully think through what should be the semantics
> > of attaching multiple programs, consider performance implications,
> > return codes and so on.
> 
> You already assigned it.  The return value of the bpf program, loaded
> in Linus' tree today, tells the kernel whether to accept or reject.

yes. that's what the program tells the hook.
I'm saying that whenever we have a link list of the programs
interaction between them may or may not be expressible with 1/0
and I don't want to rush such decision.

> >
> >> > Another alternative is to disallow attaching programs in sub-hierarchy
> >> > if parent has something already attached, but it's not useful
> >> > for general case.
> >> > All of these are possible future extensions.
> >>
> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
> >> have to do it now (or disable the feature for 4.10).  This is why I'm
> >> bringing this whole thing up now.
> >
> > We don't have to touch user visible api here, so extensions are fine.
> 
> Huh?  My example in the original email attaches a program in a
> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> working?

No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
type programs and only root can attach them.
and this root semantics obviously have to be preserved from now on,
but that doesn't mean that non-root combinations have to follow the same.
For example, if for some bizarre reason you want to do
net.socket_create_filter = "disallow": no sockets created period
in the hard coded way without using bpf at all
(I would certainly oppose that as a waste of kernel .text,
but I'm not going to nack it), so you can do whatever semantics you like.
Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches
we'll keep debating the right security model for it.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  3:12 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <80574175-3692-0278-a74e-23b752d44f73@gmail.com>

On Mon, Dec 19, 2016 at 6:52 PM, David Ahern <dsahern@gmail.com> wrote:
> On 12/19/16 6:56 PM, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern@gmail.com> wrote:
>>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>>> net.socket_create_filter = "none": no filter
>>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>>> net.socket_create_filter = "disallow": no sockets created period
>>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>>
>>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.
>>
>> Can you elaborate on what goes wrong?  (Obviously the
>> "address_family_list" example makes no sense in that context.)
>
> Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability.

Oh -- I'm not proposing that at all.  Let me clarify.  For the bpf
case, if you *read* the file, you'd see "bpf:baadf00d".  But writing
"bpf:baadf00d" is nonsense and would give you -EINVAL.  Instead you
install a bpf filter by opening the file for write (O_RDWR or
O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd,
CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd);  It's kind of like
BPF_PROG_ATTACH except that it respects filesystem permissions, it
isn't in the bpf() multiplexer, the filter being set is implied (by
the fd in use), and everything is nicely seccompable.

To remove the filter, you write "none" or "none\n" to the file.

As a future extension, if someone wanted more than one filter to be
able to coexist in the cgroup socket_create_filter slot, you could
plausibly do 'echo disallow >>net.socket_create_filter' or use a new
ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and
see more than one line.  But this would be a future extension and may
never be needed.

>>
>> a) sub-cgroups cannot have a filter at all of the parent has a filter.
>> (This is the "punt" approach -- it lets different semantics be
>> assigned later without breaking userspace.)
>>
>> b) sub-cgroups can have a filter if a parent does, too.  The semantics
>> are that the sub-cgroup filter runs first and all side-effects occur.
>> If that filter says "reject" then ancestor filters are skipped.  If
>> that filter says "accept", then the ancestor filter is run and its
>> side-effects happen as well.  (And so on, all the way up to the root.)
>
> That comes with a big performance hit for skb / data path cases.
>
> I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.

I'm not sure I buy the performance hit.  If you do it naively, then
performance will indeed suck.  But there's already a bunch of code in
there to pre-populate a filter list for faster use.  Currently, we
have:

struct cgroup_bpf {
        /*
         * Store two sets of bpf_prog pointers, one for programs that are
         * pinned directly to this cgroup, and one for those that are effective
         * when this cgroup is accessed.
         */
        struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
        struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
};

in struct cgroup, there's a 'struct cgroup_bpf bpf;'.

This would change to something like:

struct cgroup_filter_slot {
  struct bpf_prog *effective;
  struct cgroup_filter_slot *next;
  struct bpf_prog *local;
}

local is NULL unless *this* cgroup has a filter.  effective points to
the bpf_prog that's active in this cgroup or the nearest ancestor that
has a filter.  next is NULL if there are no filters higher in the
chain or points to the next slot that has a filter.  struct cgroup
has:

struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];

To evaluate it, you do:

struct cgroup_filter_slot *slot = &cgroup->slot[the index];

if (!slot->effective)
  return;

do {
  evaluate(slot->effective);
  slot = slot->next;
} while (unlikely(slot));

The old code was one branch per evaluation.  The new code is n+1
branches where n is the number of filters, so it's one extra branch in
the worst case.  But the extra branch is cache-hot (the variable is
right next to slot->effective, which is needed regardless) and highly
predictable (in the case where nesting isn't used, the branch is not
taken, and it's a backward branch which most CPUs will predict as not
taken).

I expect that, on x86, this adds at most a cycle or two and quite
possibly adds no measurable overhead at all.

^ permalink raw reply

* [PATCH v3 3/3] Bluetooth: btusb: Configure Marvell to use one of the pins for oob wakeup
From: Rajat Jain @ 2016-12-20  2:56 UTC (permalink / raw)
  To: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA, Brian Norris,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Rajat Jain, rajatxjain-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <1482202592-76314-1-git-send-email-rajatja-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

The Marvell devices may have many gpio pins, and hence for wakeup
on these out-of-band pins, the chip needs to be told which pin is
to be used for wakeup, using an hci command.

Thus, we read the pin number etc from the device tree node and send
a command to the chip.

Signed-off-by: Rajat Jain <rajatja-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Acked-by: Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
v3: * remove the Marvell specific id table and check
    * Add reference to marvell-bt-8xxx.txt in btusb.txt
    * Add "Reviewed-by" and "Acked-by"    
v2: Fix the binding document to specify to use "wakeup" interrupt-name

 Documentation/devicetree/bindings/net/btusb.txt    |  3 ++
 .../{marvell-bt-sd8xxx.txt => marvell-bt-8xxx.txt} | 46 +++++++++++++++----
 drivers/bluetooth/btusb.c                          | 51 ++++++++++++++++++++++
 3 files changed, 92 insertions(+), 8 deletions(-)
 rename Documentation/devicetree/bindings/net/{marvell-bt-sd8xxx.txt => marvell-bt-8xxx.txt} (50%)

diff --git a/Documentation/devicetree/bindings/net/btusb.txt b/Documentation/devicetree/bindings/net/btusb.txt
index 2c0355c..01fa2d4 100644
--- a/Documentation/devicetree/bindings/net/btusb.txt
+++ b/Documentation/devicetree/bindings/net/btusb.txt
@@ -10,6 +10,9 @@ Required properties:
 
 		  "usb1286,204e" (Marvell 8997)
 
+Also, vendors that use btusb may have device additional properties, e.g:
+Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
+
 Optional properties:
 
   - interrupt-parent: phandle of the parent interrupt controller
diff --git a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
similarity index 50%
rename from Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt
rename to Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
index 6a9a63c..9be1059 100644
--- a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt
+++ b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
@@ -1,16 +1,21 @@
-Marvell 8897/8997 (sd8897/sd8997) bluetooth SDIO devices
+Marvell 8897/8997 (sd8897/sd8997) bluetooth devices (SDIO or USB based)
 ------
+The 8997 devices supports multiple interfaces. When used on SDIO interfaces,
+the btmrvl driver is used and when used on USB interface, the btusb driver is
+used.
 
 Required properties:
 
   - compatible : should be one of the following:
-	* "marvell,sd8897-bt"
-	* "marvell,sd8997-bt"
+	* "marvell,sd8897-bt" (for SDIO)
+	* "marvell,sd8997-bt" (for SDIO)
+	* "usb1286,204e"      (for USB)
 
 Optional properties:
 
   - marvell,cal-data: Calibration data downloaded to the device during
 		      initialization. This is an array of 28 values(u8).
+		      This is only applicable to SDIO devices.
 
   - marvell,wakeup-pin: It represents wakeup pin number of the bluetooth chip.
 		        firmware will use the pin to wakeup host system (u16).
@@ -18,10 +23,15 @@ Optional properties:
 		      platform. The value will be configured to firmware. This
 		      is needed to work chip's sleep feature as expected (u16).
   - interrupt-parent: phandle of the parent interrupt controller
-  - interrupts : interrupt pin number to the cpu. Driver will request an irq based
-		 on this interrupt number. During system suspend, the irq will be
-		 enabled so that the bluetooth chip can wakeup host platform under
-		 certain condition. During system resume, the irq will be disabled
+  - interrupt-names: Used only for USB based devices (See below)
+  - interrupts : specifies the interrupt pin number to the cpu. For SDIO, the
+		 driver will use the first interrupt specified in the interrupt
+		 array. For USB based devices, the driver will use the interrupt
+		 named "wakeup" from the interrupt-names and interrupt arrays.
+		 The driver will request an irq based on this interrupt number.
+		 During system suspend, the irq will be enabled so that the
+		 bluetooth chip can wakeup host platform under certain
+		 conditions. During system resume, the irq will be disabled
 		 to make sure unnecessary interrupt is not received.
 
 Example:
@@ -29,7 +39,9 @@ Example:
 IRQ pin 119 is used as system wakeup source interrupt.
 wakeup pin 13 and gap 100ms are configured so that firmware can wakeup host
 using this device side pin and wakeup latency.
-calibration data is also available in below example.
+
+Example for SDIO device follows (calibration data is also available in
+below example).
 
 &mmc3 {
 	status = "okay";
@@ -54,3 +66,21 @@ calibration data is also available in below example.
 		marvell,wakeup-gap-ms = /bits/ 16 <0x64>;
 	};
 };
+
+Example for USB device:
+
+&usb_host1_ohci {
+    status = "okay";
+    #address-cells = <1>;
+    #size-cells = <0>;
+
+    mvl_bt1: bt@1 {
+	compatible = "usb1286,204e";
+	reg = <1>;
+	interrupt-parent = <&gpio0>;
+	interrupt-names = "wakeup";
+	interrupts = <119 IRQ_TYPE_LEVEL_LOW>;
+	marvell,wakeup-pin = /bits/ 16 <0x0d>;
+	marvell,wakeup-gap-ms = /bits/ 16 <0x64>;
+    };
+};
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index beca4e9..a5c2cf9 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -2343,6 +2343,50 @@ static int btusb_shutdown_intel(struct hci_dev *hdev)
 	return 0;
 }
 
+#ifdef CONFIG_PM
+/* Configure an out-of-band gpio as wake-up pin, if specified in device tree */
+static int marvell_config_oob_wake(struct hci_dev *hdev)
+{
+	struct sk_buff *skb;
+	struct btusb_data *data = hci_get_drvdata(hdev);
+	struct device *dev = &data->udev->dev;
+	u16 pin, gap, opcode;
+	int ret;
+	u8 cmd[5];
+
+	/* Move on if no wakeup pin specified */
+	if (of_property_read_u16(dev->of_node, "marvell,wakeup-pin", &pin) ||
+	    of_property_read_u16(dev->of_node, "marvell,wakeup-gap-ms", &gap))
+		return 0;
+
+	/* Vendor specific command to configure a GPIO as wake-up pin */
+	opcode = hci_opcode_pack(0x3F, 0x59);
+	cmd[0] = opcode & 0xFF;
+	cmd[1] = opcode >> 8;
+	cmd[2] = 2; /* length of parameters that follow */
+	cmd[3] = pin;
+	cmd[4] = gap; /* time in ms, for which wakeup pin should be asserted */
+
+	skb = bt_skb_alloc(sizeof(cmd), GFP_KERNEL);
+	if (!skb) {
+		bt_dev_err(hdev, "%s: No memory\n", __func__);
+		return -ENOMEM;
+	}
+
+	memcpy(skb_put(skb, sizeof(cmd)), cmd, sizeof(cmd));
+	hci_skb_pkt_type(skb) = HCI_COMMAND_PKT;
+
+	ret = btusb_send_frame(hdev, skb);
+	if (ret) {
+		bt_dev_err(hdev, "%s: configuration failed\n", __func__);
+		kfree_skb(skb);
+		return ret;
+	}
+
+	return 0;
+}
+#endif
+
 static int btusb_set_bdaddr_marvell(struct hci_dev *hdev,
 				    const bdaddr_t *bdaddr)
 {
@@ -2917,6 +2961,13 @@ static int btusb_probe(struct usb_interface *intf,
 	err = btusb_config_oob_wake(hdev);
 	if (err)
 		goto out_free_dev;
+
+	/* Marvell devices may need a specific chip configuration */
+	if (id->driver_info & BTUSB_MARVELL && data->oob_wake_irq) {
+		err = marvell_config_oob_wake(hdev);
+		if (err)
+			goto out_free_dev;
+	}
 #endif
 	if (id->driver_info & BTUSB_CW6622)
 		set_bit(HCI_QUIRK_BROKEN_STORED_LINK_KEY, &hdev->quirks);
-- 
2.8.0.rc3.226.g39d4020

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v3 2/3] Bluetooth: btusb: Add out-of-band wakeup support
From: Rajat Jain @ 2016-12-20  2:56 UTC (permalink / raw)
  To: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev, devicetree, linux-bluetooth, Brian Norris, linux-kernel
  Cc: Rajat Jain, rajatxjain
In-Reply-To: <1482202592-76314-1-git-send-email-rajatja@google.com>

Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that
can be connected to a gpio on the CPU side, and can be used to wakeup
the host out-of-band. This can be useful in situations where the
in-band wakeup is not possible or not preferable (e.g. the in-band
wakeup may require the USB host controller to remain active, and
hence consuming more system power during system sleep).

The oob gpio interrupt to be used for wakeup on the CPU side, is
read from the device tree node, (using standard interrupt descriptors).
A devcie tree binding document is also added for the driver. The
compatible string is in compliance with
Documentation/devicetree/bindings/usb/usb-device.txt

Signed-off-by: Rajat Jain <rajatja@google.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
---
v3: Add Brian's "Reviewed-by"
v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt.
    * Leave it on device tree to specify IRQ flags (level /edge triggered)
    * Mark the device as non wakeable on exit.

 Documentation/devicetree/bindings/net/btusb.txt | 40 ++++++++++++
 drivers/bluetooth/btusb.c                       | 84 +++++++++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/btusb.txt

diff --git a/Documentation/devicetree/bindings/net/btusb.txt b/Documentation/devicetree/bindings/net/btusb.txt
new file mode 100644
index 0000000..2c0355c
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/btusb.txt
@@ -0,0 +1,40 @@
+Generic Bluetooth controller over USB (btusb driver)
+---------------------------------------------------
+
+Required properties:
+
+  - compatible : should comply with the format "usbVID,PID" specified in
+		 Documentation/devicetree/bindings/usb/usb-device.txt
+		 At the time of writing, the only OF supported devices
+		 (more may be added later) are:
+
+		  "usb1286,204e" (Marvell 8997)
+
+Optional properties:
+
+  - interrupt-parent: phandle of the parent interrupt controller
+  - interrupt-names: (see below)
+  - interrupts : The interrupt specified by the name "wakeup" is the interrupt
+		 that shall be used for out-of-band wake-on-bt. Driver will
+		 request this interrupt for wakeup. During system suspend, the
+		 irq will be enabled so that the bluetooth chip can wakeup host
+		 platform out of band. During system resume, the irq will be
+		 disabled to make sure unnecessary interrupt is not received.
+
+Example:
+
+Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt:
+
+&usb_host1_ehci {
+    status = "okay";
+    #address-cells = <1>;
+    #size-cells = <0>;
+
+    mvl_bt1: bt@1 {
+	compatible = "usb1286,204e";
+	reg = <1>;
+	interrupt-parent = <&gpio0>;
+	interrupt-name = "wakeup";
+	interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
+    };
+};
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index ce22cef..beca4e9 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -24,6 +24,8 @@
 #include <linux/module.h>
 #include <linux/usb.h>
 #include <linux/firmware.h>
+#include <linux/of_device.h>
+#include <linux/of_irq.h>
 #include <asm/unaligned.h>
 
 #include <net/bluetooth/bluetooth.h>
@@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = {
 #define BTUSB_BOOTING		9
 #define BTUSB_RESET_RESUME	10
 #define BTUSB_DIAG_RUNNING	11
+#define BTUSB_OOB_WAKE_DISABLED	12
 
 struct btusb_data {
 	struct hci_dev       *hdev;
@@ -416,6 +419,8 @@ struct btusb_data {
 	int (*recv_bulk)(struct btusb_data *data, void *buffer, int count);
 
 	int (*setup_on_usb)(struct hci_dev *hdev);
+
+	int oob_wake_irq;   /* irq for out-of-band wake-on-bt */
 };
 
 static inline void btusb_free_frags(struct btusb_data *data)
@@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, bool enable)
 }
 #endif
 
+#ifdef CONFIG_PM
+static irqreturn_t btusb_oob_wake_handler(int irq, void *priv)
+{
+	struct btusb_data *data = priv;
+
+	/* Disable only if not already disabled (keep it balanced) */
+	if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
+		disable_irq_nosync(irq);
+		disable_irq_wake(irq);
+	}
+	pm_wakeup_event(&data->udev->dev, 0);
+	return IRQ_HANDLED;
+}
+
+static const struct of_device_id btusb_match_table[] = {
+	{ .compatible = "usb1286,204e" },
+	{ }
+};
+MODULE_DEVICE_TABLE(of, btusb_match_table);
+
+/* Use an oob wakeup pin? */
+static int btusb_config_oob_wake(struct hci_dev *hdev)
+{
+	struct btusb_data *data = hci_get_drvdata(hdev);
+	struct device *dev = &data->udev->dev;
+	int irq, ret;
+
+	if (!of_match_device(btusb_match_table, dev))
+		return 0;
+
+	/* Move on if no IRQ specified */
+	irq = of_irq_get_byname(dev->of_node, "wakeup");
+	if (irq <= 0) {
+		bt_dev_dbg(hdev, "%s: no OOB Wakeup IRQ in DT", __func__);
+		return 0;
+	}
+
+	set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
+
+	ret = devm_request_irq(&hdev->dev, irq, btusb_oob_wake_handler,
+			       0, "OOB Wake-on-BT", data);
+	if (ret) {
+		bt_dev_err(hdev, "%s: IRQ request failed", __func__);
+		return ret;
+	}
+
+	ret = device_init_wakeup(dev, true);
+	if (ret) {
+		bt_dev_err(hdev, "%s: failed to init_wakeup\n", __func__);
+		return ret;
+	}
+
+	data->oob_wake_irq = irq;
+	disable_irq(irq);
+	bt_dev_info(hdev, "OOB Wake-on-BT configured at IRQ %u\n", irq);
+	return 0;
+}
+#endif
+
 static int btusb_probe(struct usb_interface *intf,
 		       const struct usb_device_id *id)
 {
@@ -2849,6 +2913,11 @@ static int btusb_probe(struct usb_interface *intf,
 	hdev->send   = btusb_send_frame;
 	hdev->notify = btusb_notify;
 
+#ifdef CONFIG_PM
+	err = btusb_config_oob_wake(hdev);
+	if (err)
+		goto out_free_dev;
+#endif
 	if (id->driver_info & BTUSB_CW6622)
 		set_bit(HCI_QUIRK_BROKEN_STORED_LINK_KEY, &hdev->quirks);
 
@@ -3061,6 +3130,9 @@ static void btusb_disconnect(struct usb_interface *intf)
 			usb_driver_release_interface(&btusb_driver, data->isoc);
 	}
 
+	if (data->oob_wake_irq)
+		device_init_wakeup(&data->udev->dev, false);
+
 	hci_free_dev(hdev);
 }
 
@@ -3089,6 +3161,12 @@ static int btusb_suspend(struct usb_interface *intf, pm_message_t message)
 	btusb_stop_traffic(data);
 	usb_kill_anchored_urbs(&data->tx_anchor);
 
+	if (data->oob_wake_irq && device_may_wakeup(&data->udev->dev)) {
+		clear_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
+		enable_irq_wake(data->oob_wake_irq);
+		enable_irq(data->oob_wake_irq);
+	}
+
 	/* Optionally request a device reset on resume, but only when
 	 * wakeups are disabled. If wakeups are enabled we assume the
 	 * device will stay powered up throughout suspend.
@@ -3126,6 +3204,12 @@ static int btusb_resume(struct usb_interface *intf)
 	if (--data->suspend_count)
 		return 0;
 
+	/* Disable only if not already disabled (keep it balanced) */
+	if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
+		disable_irq(data->oob_wake_irq);
+		disable_irq_wake(data->oob_wake_irq);
+	}
+
 	if (!test_bit(HCI_RUNNING, &hdev->flags))
 		goto done;
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox