Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 1/2] net: netcp: ethss: fix errors in ethtool ops
From: Murali Karicheri @ 2016-12-19 22:55 UTC (permalink / raw)
  To: netdev, davem, linux-kernel

From: WingMan Kwok <w-kwok2@ti.com>

In ethtool ops, it needs to retrieve the corresponding
ethss module (gbe or xgbe) from the net_device structure.
Prior to this patch, the retrieving procedure only
checks for the gbe module.  This patch fixes the issue
by checking the xgbe module if the net_device structure
does not correspond to the gbe module.

Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
---
 drivers/net/ethernet/ti/netcp_ethss.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ti/netcp_ethss.c b/drivers/net/ethernet/ti/netcp_ethss.c
index c7e547e..a31931c 100644
--- a/drivers/net/ethernet/ti/netcp_ethss.c
+++ b/drivers/net/ethernet/ti/netcp_ethss.c
@@ -1746,6 +1746,17 @@ static void keystone_set_msglevel(struct net_device *ndev, u32 value)
 	netcp->msg_enable = value;
 }
 
+static struct gbe_intf *keystone_get_intf_data(struct netcp_intf *netcp)
+{
+	struct gbe_intf *gbe_intf;
+
+	gbe_intf = netcp_module_get_intf_data(&gbe_module, netcp);
+	if (!gbe_intf)
+		gbe_intf = netcp_module_get_intf_data(&xgbe_module, netcp);
+
+	return gbe_intf;
+}
+
 static void keystone_get_stat_strings(struct net_device *ndev,
 				      uint32_t stringset, uint8_t *data)
 {
@@ -1754,7 +1765,7 @@ static void keystone_get_stat_strings(struct net_device *ndev,
 	struct gbe_priv *gbe_dev;
 	int i;
 
-	gbe_intf = netcp_module_get_intf_data(&gbe_module, netcp);
+	gbe_intf = keystone_get_intf_data(netcp);
 	if (!gbe_intf)
 		return;
 	gbe_dev = gbe_intf->gbe_dev;
@@ -1778,7 +1789,7 @@ static int keystone_get_sset_count(struct net_device *ndev, int stringset)
 	struct gbe_intf *gbe_intf;
 	struct gbe_priv *gbe_dev;
 
-	gbe_intf = netcp_module_get_intf_data(&gbe_module, netcp);
+	gbe_intf = keystone_get_intf_data(netcp);
 	if (!gbe_intf)
 		return -EINVAL;
 	gbe_dev = gbe_intf->gbe_dev;
@@ -1896,7 +1907,7 @@ static void keystone_get_ethtool_stats(struct net_device *ndev,
 	struct gbe_intf *gbe_intf;
 	struct gbe_priv *gbe_dev;
 
-	gbe_intf = netcp_module_get_intf_data(&gbe_module, netcp);
+	gbe_intf = keystone_get_intf_data(netcp);
 	if (!gbe_intf)
 		return;
 
@@ -1920,7 +1931,7 @@ static int keystone_get_link_ksettings(struct net_device *ndev,
 	if (!phy)
 		return -EINVAL;
 
-	gbe_intf = netcp_module_get_intf_data(&gbe_module, netcp);
+	gbe_intf = keystone_get_intf_data(netcp);
 	if (!gbe_intf)
 		return -EINVAL;
 
@@ -1953,7 +1964,7 @@ static int keystone_set_link_ksettings(struct net_device *ndev,
 	if (!phy)
 		return -EINVAL;
 
-	gbe_intf = netcp_module_get_intf_data(&gbe_module, netcp);
+	gbe_intf = keystone_get_intf_data(netcp);
 	if (!gbe_intf)
 		return -EINVAL;
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH net 2/2] net: netcp: ethss: fix 10gbe host port tx pri map configuration
From: Murali Karicheri @ 2016-12-19 22:55 UTC (permalink / raw)
  To: netdev, davem, linux-kernel
In-Reply-To: <1482188157-24490-1-git-send-email-m-karicheri2@ti.com>

From: WingMan Kwok <w-kwok2@ti.com>

This patch adds the missing 10gbe host port tx priority map
configurations.

Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
---
 drivers/net/ethernet/ti/netcp_ethss.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/netcp_ethss.c b/drivers/net/ethernet/ti/netcp_ethss.c
index a31931c..7d9e36f 100644
--- a/drivers/net/ethernet/ti/netcp_ethss.c
+++ b/drivers/net/ethernet/ti/netcp_ethss.c
@@ -94,6 +94,7 @@
 
 /* offset relative to base of XGBE_SS_REG_INDEX */
 #define XGBE10_SGMII_MODULE_OFFSET	0x100
+#define IS_SS_ID_XGBE(d)		((d)->ss_version == XGBE_SS_VERSION_10)
 /* offset relative to base of XGBE_SM_REG_INDEX */
 #define XGBE10_HOST_PORT_OFFSET		0x34
 #define XGBE10_SLAVE_PORT_OFFSET	0x64
@@ -2322,7 +2323,7 @@ static void gbe_init_host_port(struct gbe_priv *priv)
 	int bypass_en = 1;
 
 	/* Host Tx Pri */
-	if (IS_SS_ID_NU(priv))
+	if (IS_SS_ID_NU(priv) || IS_SS_ID_XGBE(priv))
 		writel(HOST_TX_PRI_MAP_DEFAULT,
 		       GBE_REG_ADDR(priv, host_port_regs, tx_pri_map));
 
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support
From: Brian Norris @ 2016-12-19 23:10 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev, devicetree, linux-bluetooth, linux-kernel, rajatxjain
In-Reply-To: <1481916604-114279-2-git-send-email-rajatja@google.com>

Hi Rajat,

On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote:
> Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that
> can be connected to a gpio on the CPU side, and can be used to wakeup
> the host out-of-band. This can be useful in situations where the
> in-band wakeup is not possible or not preferable (e.g. the in-band
> wakeup may require the USB host controller to remain active, and
> hence consuming more system power during system sleep).
> 
> The oob gpio interrupt to be used for wakeup on the CPU side, is
> read from the device tree node, (using standard interrupt descriptors).
> A devcie tree binding document is also added for the driver. The
> compatible string is in compliance with
> Documentation/devicetree/bindings/usb/usb-device.txt
> 
> Signed-off-by: Rajat Jain <rajatja@google.com>

A few small comments below, but other than those, for the whole series:

Reviewed-by: Brian Norris <briannorris@chromium.org>

> ---
> v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt.
>     * Leave it on device tree to specify IRQ flags (level /edge triggered)
>     * Mark the device as non wakeable on exit.
> 
>  Documentation/devicetree/bindings/net/btusb.txt | 40 ++++++++++++
>  drivers/bluetooth/btusb.c                       | 84 +++++++++++++++++++++++++
>  2 files changed, 124 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/btusb.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/btusb.txt b/Documentation/devicetree/bindings/net/btusb.txt
> new file mode 100644
> index 0000000..2c0355c
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/btusb.txt
> @@ -0,0 +1,40 @@
> +Generic Bluetooth controller over USB (btusb driver)
> +---------------------------------------------------
> +
> +Required properties:
> +
> +  - compatible : should comply with the format "usbVID,PID" specified in
> +		 Documentation/devicetree/bindings/usb/usb-device.txt
> +		 At the time of writing, the only OF supported devices
> +		 (more may be added later) are:
> +
> +		  "usb1286,204e" (Marvell 8997)

I don't know if anyone cares about these sort of organizational aspects,
but in patch 3 you're extending this binding with device-specific
Marvell bindings. Seems like it'd be good to have some clear way to
correlate those 2. e.g., add a reference to marvell-bt-sd8xxx.txt (or
marvell-bt-8xxx.txt, once you rename it) in patch 3.

> +
> +Optional properties:
> +
> +  - interrupt-parent: phandle of the parent interrupt controller
> +  - interrupt-names: (see below)
> +  - interrupts : The interrupt specified by the name "wakeup" is the interrupt
> +		 that shall be used for out-of-band wake-on-bt. Driver will
> +		 request this interrupt for wakeup. During system suspend, the
> +		 irq will be enabled so that the bluetooth chip can wakeup host
> +		 platform out of band. During system resume, the irq will be
> +		 disabled to make sure unnecessary interrupt is not received.
> +
> +Example:
> +
> +Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt:
> +
> +&usb_host1_ehci {
> +    status = "okay";
> +    #address-cells = <1>;
> +    #size-cells = <0>;
> +
> +    mvl_bt1: bt@1 {
> +	compatible = "usb1286,204e";
> +	reg = <1>;
> +	interrupt-parent = <&gpio0>;
> +	interrupt-name = "wakeup";
> +	interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
> +    };
> +};
> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
> index ce22cef..beca4e9 100644
> --- a/drivers/bluetooth/btusb.c
> +++ b/drivers/bluetooth/btusb.c
> @@ -24,6 +24,8 @@
>  #include <linux/module.h>
>  #include <linux/usb.h>
>  #include <linux/firmware.h>
> +#include <linux/of_device.h>
> +#include <linux/of_irq.h>
>  #include <asm/unaligned.h>
>  
>  #include <net/bluetooth/bluetooth.h>
> @@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = {
>  #define BTUSB_BOOTING		9
>  #define BTUSB_RESET_RESUME	10
>  #define BTUSB_DIAG_RUNNING	11
> +#define BTUSB_OOB_WAKE_DISABLED	12
>  
>  struct btusb_data {
>  	struct hci_dev       *hdev;
> @@ -416,6 +419,8 @@ struct btusb_data {
>  	int (*recv_bulk)(struct btusb_data *data, void *buffer, int count);
>  
>  	int (*setup_on_usb)(struct hci_dev *hdev);
> +
> +	int oob_wake_irq;   /* irq for out-of-band wake-on-bt */
>  };
>  
>  static inline void btusb_free_frags(struct btusb_data *data)
> @@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, bool enable)
>  }
>  #endif
>  
> +#ifdef CONFIG_PM
> +static irqreturn_t btusb_oob_wake_handler(int irq, void *priv)
> +{
> +	struct btusb_data *data = priv;
> +
> +	/* Disable only if not already disabled (keep it balanced) */
> +	if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
> +		disable_irq_nosync(irq);
> +		disable_irq_wake(irq);
> +	}
> +	pm_wakeup_event(&data->udev->dev, 0);
> +	return IRQ_HANDLED;
> +}
> +
> +static const struct of_device_id btusb_match_table[] = {
> +	{ .compatible = "usb1286,204e" },
> +	{ }
> +};
> +MODULE_DEVICE_TABLE(of, btusb_match_table);

You define a match table here, but you also define essentially same
table for Marvell-specific additions in patch 3. It looks like maybe
it's legal to have more than one OF table in a module? But it seems like
it would get confusing, besides being somewhat strange to maintain. It
might also produce duplicate 'modinfo' output.

If you really want to independently opt into device-tree-specified
interrupts vs. Marvell-specific interrrupt configuration, then you
should probably just merge the latter into the former table, and
implement a Marvell/GPIO flag to stick in the .data field of this table.

Or it might be fine to drop one or both "match" checks. Particularly for
the Marvell-specific stuff, it's probably fair just to check if it has
an ->of_node and if 'id->driver_info & BTUSB_MARVELL'. Any other Marvell
device-specific quirks could probably be keyed off of the
(weirdly-named?) blacklist_table[], which already matches PID/VID.

> +
> +/* Use an oob wakeup pin? */
> +static int btusb_config_oob_wake(struct hci_dev *hdev)
> +{
> +	struct btusb_data *data = hci_get_drvdata(hdev);
> +	struct device *dev = &data->udev->dev;
> +	int irq, ret;
> +
> +	if (!of_match_device(btusb_match_table, dev))
> +		return 0;
> +
> +	/* Move on if no IRQ specified */
> +	irq = of_irq_get_byname(dev->of_node, "wakeup");
> +	if (irq <= 0) {
> +		bt_dev_dbg(hdev, "%s: no OOB Wakeup IRQ in DT", __func__);
> +		return 0;
> +	}
> +
> +	set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
> +
> +	ret = devm_request_irq(&hdev->dev, irq, btusb_oob_wake_handler,
> +			       0, "OOB Wake-on-BT", data);
> +	if (ret) {
> +		bt_dev_err(hdev, "%s: IRQ request failed", __func__);
> +		return ret;
> +	}
> +
> +	ret = device_init_wakeup(dev, true);
> +	if (ret) {
> +		bt_dev_err(hdev, "%s: failed to init_wakeup\n", __func__);
> +		return ret;
> +	}
> +
> +	data->oob_wake_irq = irq;
> +	disable_irq(irq);
> +	bt_dev_info(hdev, "OOB Wake-on-BT configured at IRQ %u\n", irq);
> +	return 0;
> +}
> +#endif
> +
>  static int btusb_probe(struct usb_interface *intf,
>  		       const struct usb_device_id *id)
>  {
> @@ -2849,6 +2913,11 @@ static int btusb_probe(struct usb_interface *intf,
>  	hdev->send   = btusb_send_frame;
>  	hdev->notify = btusb_notify;
>  
> +#ifdef CONFIG_PM
> +	err = btusb_config_oob_wake(hdev);
> +	if (err)
> +		goto out_free_dev;
> +#endif
>  	if (id->driver_info & BTUSB_CW6622)
>  		set_bit(HCI_QUIRK_BROKEN_STORED_LINK_KEY, &hdev->quirks);
>  
> @@ -3061,6 +3130,9 @@ static void btusb_disconnect(struct usb_interface *intf)
>  			usb_driver_release_interface(&btusb_driver, data->isoc);
>  	}
>  
> +	if (data->oob_wake_irq)
> +		device_init_wakeup(&data->udev->dev, false);
> +
>  	hci_free_dev(hdev);
>  }
>  
> @@ -3089,6 +3161,12 @@ static int btusb_suspend(struct usb_interface *intf, pm_message_t message)
>  	btusb_stop_traffic(data);
>  	usb_kill_anchored_urbs(&data->tx_anchor);
>  
> +	if (data->oob_wake_irq && device_may_wakeup(&data->udev->dev)) {
> +		clear_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
> +		enable_irq_wake(data->oob_wake_irq);
> +		enable_irq(data->oob_wake_irq);
> +	}
> +
>  	/* Optionally request a device reset on resume, but only when
>  	 * wakeups are disabled. If wakeups are enabled we assume the
>  	 * device will stay powered up throughout suspend.
> @@ -3126,6 +3204,12 @@ static int btusb_resume(struct usb_interface *intf)
>  	if (--data->suspend_count)

Not related to your patch exactly, but isn't this 'suspend_count' stuff
useless? I'd send a patch, but it might conflict with yours. Just wanted
to note it and see if I'm crazy...

Brian

>  		return 0;
>  
> +	/* Disable only if not already disabled (keep it balanced) */
> +	if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
> +		disable_irq(data->oob_wake_irq);
> +		disable_irq_wake(data->oob_wake_irq);
> +	}
> +
>  	if (!test_bit(HCI_RUNNING, &hdev->flags))
>  		goto done;
>  
> -- 
> 2.8.0.rc3.226.g39d4020
> 

^ permalink raw reply

* Re: [PATCH 1/3] NFC: trf7970a: add device tree option for 27MHz clock
From: Geoff Lansberry @ 2016-12-19 23:23 UTC (permalink / raw)
  To: Rob Herring
  Cc: linux-wireless, Lauro Ramos Venancio, Aloisio Almeida Jr,
	Samuel Ortiz, mark.rutland-5wv7dgnIgG8,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mark Greer, Justin Bronder
In-Reply-To: <20161219223111.itu77g7wsqtj6cvs@rob-hp-laptop>

I can make that change, however, I worry that it may be a bit
misleading, since there are only two supported clock frequencies, but
a number like that to me implies that it could be set to any number
you want.   I'm new at this, and so I'll go ahead and change it as you
request, but I'd like to hear your thoughts on my concern.

Thanks
Geoff
Geoff Lansberry


Engineering Guy
Kuvée, Inc
125 Kingston St., 3rd Floor
Boston, MA 02111
1-617-290-1118 (m)
geoff.lansberry (skype)
http://www.kuvee.com



On Mon, Dec 19, 2016 at 5:31 PM, Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Thu, Dec 15, 2016 at 05:30:42PM -0500, Geoff Lansberry wrote:
>> From: Geoff Lansberry <geoff-R+k406RtEhcAvxtiuMwx3w@public.gmane.org>
>>
>> ---
>>  .../devicetree/bindings/net/nfc/trf7970a.txt       |  3 ++
>>  drivers/nfc/trf7970a.c                             | 42 ++++++++++++++++------
>>  2 files changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
>> index 32b35a0..9dda879 100644
>> --- a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
>> +++ b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
>> @@ -21,6 +21,8 @@ Optional SoC Specific Properties:
>>  - t5t-rmb-extra-byte-quirk: Specify that the trf7970a has the erratum
>>    where an extra byte is returned by Read Multiple Block commands issued
>>    to Type 5 tags.
>> +- crystal_27mhz: Set to specify that the input frequency to the trf7970a is 27.12MHz
>> +
>
> Can't you use 'clock-frequency = "27000000";'?
>
>>
>>  Example (for ARM-based BeagleBone with TRF7970A on SPI1):
>>
>> @@ -43,6 +45,7 @@ Example (for ARM-based BeagleBone with TRF7970A on SPI1):
>>               irq-status-read-quirk;
>>               en2-rf-quirk;
>>               t5t-rmb-extra-byte-quirk;
>> +             crystal_27mhz;
>>               status = "okay";
>>       };
>>  };
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: mlx4: Bug in XDP_TX + 16 rx-queues
From: Martin KaFai Lau @ 2016-12-19 23:37 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: Saeed Mahameed, netdev@vger.kernel.org, Alexei Starovoitov
In-Reply-To: <20161217101803.GB8732@kafai-mba.local>

Hi Tariq,

On Sat, Dec 17, 2016 at 02:18:03AM -0800, Martin KaFai Lau wrote:
> Hi All,
>
> I have been debugging with XDP_TX and 16 rx-queues.
>
> 1) When 16 rx-queues is used and an XDP prog is doing XDP_TX,
> it seems that the packet cannot be XDP_TX out if the pkt
> is received from some particular CPUs (/rx-queues).
>
> 2) If 8 rx-queues is used, it does not have problem.
>
> 3) The 16 rx-queues problem also went away after reverting these
> two patches:
> 15fca2c8eb41 net/mlx4_en: Add ethtool statistics for XDP cases
> 67f8b1dcb9ee net/mlx4_en: Refactor the XDP forwarding rings scheme
>
After taking a closer look at 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
and armed with the fact that '>8 rx-queues does not work', I have
made the attached change that fixed the issue.

Making change in mlx4_en_fill_qp_context() could be an easier fix
but I think this change will be easier for discussion purpose.

I don't want to lie that I know anything about how this variable
works in CX3.  If this change makes sense, I can cook up a diff.
Otherwise, can you shed some light on what could be happening
and hopefully can lead to a diff?

Thanks
--Martin


diff --git i/drivers/net/ethernet/mellanox/mlx4/en_netdev.c w/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index bcd955339058..b3bfb987e493 100644
--- i/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ w/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1638,10 +1638,10 @@ int mlx4_en_start_port(struct net_device *dev)

 	/* Configure tx cq's and rings */
 	for (t = 0 ; t < MLX4_EN_NUM_TX_TYPES; t++) {
-		u8 num_tx_rings_p_up = t == TX ? priv->num_tx_rings_p_up : 1;
-
 		for (i = 0; i < priv->tx_ring_num[t]; i++) {
 			/* Configure cq */
+			int user_prio;
+
 			cq = priv->tx_cq[t][i];
 			err = mlx4_en_activate_cq(priv, cq, i);
 			if (err) {
@@ -1660,9 +1660,14 @@ int mlx4_en_start_port(struct net_device *dev)

 			/* Configure ring */
 			tx_ring = priv->tx_ring[t][i];
+			if (t != TX_XDP)
+				user_prio = i / priv->num_tx_rings_p_up;
+			else
+				user_prio = i & 0x07;
+
 			err = mlx4_en_activate_tx_ring(priv, tx_ring,
 						       cq->mcq.cqn,
-						       i / num_tx_rings_p_up);
+						       user_prio);
 			if (err) {
 				en_err(priv, "Failed allocating Tx ring\n");
 				mlx4_en_deactivate_cq(priv, cq);

^ permalink raw reply related

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  0:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn,
	Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrWr5XMkexdGp7HdkiLkQV=P9ycj+sNO7xWSRoCVxihVZA@mail.gmail.com>

On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> Hi all-
> >>
> >> I apologize for being rather late with this.  I didn't realize that
> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> it on the linux-api list, so I missed the discussion.
> >>
> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> but I think the API has some issues that will bite us down the road
> >> if it becomes stable in its current form.
> >>
> >> Most of the problems I see are summarized in this transcript:
> >>
> >> # mkdir cg2
> >> # mount -t cgroup2 none cg2
> >> # mkdir cg2/nosockets
> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> ...
> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >>
> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
> >>
> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >>
> >> ^^^^ This is fine.  The bpf() syscall manipulates bpf objects.
> >>
> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >>
> >> ^^^^ This is not so good:
> >> ^^^^
> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects.  This
> >> ^^^^    is manipulating a cgroup.  There's no reason that a socket creation
> >> ^^^^    filter couldn't be written in a different language (new iptables
> >> ^^^^    table?  Simple list of address families?), but if that happened,
> >> ^^^^    then using bpf() to install it would be entirely nonsensical.
> >
> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> > network stack is using cgroup as an application group that should
> > invoke bpf program at the certain point in the stack.
> > imo cgroup management is orthogonal.
> 
> It is literally modifying the struct cgroup, and, as a practical
> matter, it's causing membership in the cgroup to have a certain
> effect.  But rather than pointless arguing, let me propose an
> alternative API that I think solves most of the problems here.
> 
> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> Instead, the cgroup gets three new control files:
> "net.ingress_filter", "net.egress_filter", and
> "net.socket_create_filter".  Initially, if you read these files, you
> see "none\n".
> 
> To attach a bpf filter, you open the file for write and do an ioctl on
> it.  After doing the ioctl, if you read the file, you'll see
> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> the bpf program.
> 
> To detach any type of filter, bpf or otherwise, you open the file for
> write and write "none\n" (or just "none").
> 
> If you write anything else to the file, you get -EINVAL.  But, if
> someone writes a new type of filter (perhaps a simple list of address
> families), maybe you can enable the filter by writing something
> appropriate to the file.

I see no difference in what you're proposing vs what is already implemented
from feature set point of view, but the file approach is very ugly, since
it's a mismatch to FD style access that bpf is using everywhere.
In your proposal you'd also need to add bpf prefix everywhere.
So the control file names should be bpf_inet_ingress, bpf_inet_egress
and bpf_socket_create.
If you want to prepare such patch for them, I don't mind,
but you cannot kill syscall command, since it's more flexible
and your control-file approach _will_ be obsolete pretty quickly.

> Now the API matches the effect.  A cgroup with something other than
> "none" in one of its net.*_filter files is a cgroup that filters
> network activity.  And you get CRIU compatibility for free: CRIU can
> read the control file and use whatever mechanism is uses for BPF in
> general to restore the cgroup filter state.   As an added bonus, you
> get ACLs for free and the ugly multiplexer goes away.

extended bpf is not supported by criu. only classic, so having
control_file-style attachment doesn't buy us anything.

> >> # mkdir cg2/nosockets/sockets
> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
> >>
> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
> >> ^^^^ then we're stuck with its semantics.  If it returned -EINVAL instead,
> >> ^^^^ there would be a chance to refine it.
> >
> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
> > Imange that socket accounting program is attached to cg2/nosockets.
> > The program is readonly and carry no security meaning.
> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
> 
> I think you're misunderstanding me.  What I'm saying is that, if you
> allow a cgroup and one of its descendents to both enable the same type
> of filter, you have just committed to some particular semantics for
> what happens.  And I think that the current semantics are the *wrong*
> semantics for default long-term use, so you should either fix the
> semantics or disable the problematic case.

Are you saying that use cases I provided are also 'wrong'?
If you insist on that we won't be able to make any forward progress.
The current semantics is fine for what it's designed for.

> >> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> >> # ping 127.0.0.1
> >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
> >
> > hmm. in this case the admin created a subgroup with a group that allows
> > ping, so? how's that a problem?
> 
> I think you're forgetting about namespaces.  There are two different
> admins here.  There's the admin who created the outer container and
> said "no sockets here".  Then there's the admin inside the container
> who said "muahaha, I can create a sub-container and allow sockets
> *there*".

container management frameworks should be doing sensible
things. With any infra in the kernel there are many ways to
create non-sensical setups. It's not a job of the kernel
to interfere with user space.

> > In case of vrf I can imagine
> > a set of application auto-binding to one vrf and smaller
> > set of applications binding to a different vrf.
> > Or broader set of applications monitoring all tcp traffic
> > and subset of them monitoring udp instead.
> > Those are valid use cases.
> 
> > I guess you're advocating to run a link list of programs?
> > That won't be useful for the above use cases, where there is no
> > reason to run more than one program that control plane
> > assigned to run for this cgroup.
> 
> Yes there is, for both monitoring and policy.  If I want to monitor
> all activity in a cgroup, I probably want to monitor descendents as
> well.  If I want to restrict a cgroup, I want to restrict its
> descendents.

you're ignoring use cases I described earlier.
In vrf case there is only one ifindex it needs to bind to.

> In the case where I actually don't want to hook the descendents, I'd
> be find with having an option to turn off recursion.  Maybe
> net.egress_filter could also say "bpf(overridable):[hash]" or
> "bpf(nonrecursive):[hash]".  But you should have to opt in to allowing
> your filter to be overridden.

'opt in' only makes sense if you're implementing sandboxing environment.
It doesn't make sense as the default.

> > At this stage we decided to allow only one program per cgroup per hook
> > and later can extend it if necessary.
> 
> No you can't.  Since you allow the problematic case and you gave it
> the surprising "one program" semantics, you can't change it later.

please show me why we cannot? As far as I can see nothing prevents
that in the future. We can add any number of new fields to
BPF_PROG_ATTACH command just like we did in the past with
other commands, whereas control file interface is not extensible.

> > As you're pointing out, in case of security, we probably
> > want to preserve original bpf program that should always be
> > run first and only after it returned 'ok' (we'd need to define
> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
> 
> It's already defined AFAICT.  1 means okay.  0 means not okay.

sorry that doesn't make any sense. For seccomp we have a set of
ranges that mean different things. Here you're proposing to
hastily assign 1 and 0 ? How is that extensible?
We need to carefully think through what should be the semantics
of attaching multiple programs, consider performance implications,
return codes and so on.

> > Another alternative is to disallow attaching programs in sub-hierarchy
> > if parent has something already attached, but it's not useful
> > for general case.
> > All of these are possible future extensions.
> 
> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
> have to do it now (or disable the feature for 4.10).  This is why I'm
> bringing this whole thing up now.

We don't have to touch user visible api here, so extensions are fine.

> >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
> >> programs attached that can do things if various events occur. (Right
> >> now, this means socket operations, but there are plans in the works
> >> to do this for LSM hooks too.) These bpf programs can say yes or no,
> >> but they can also read out various data (including socket payloads!)
> >> and save them away where an attacker can find them. This sounds a
> >> lot like seccomp with a narrower scope but a much stronger ability to
> >> exfiltrate private information.
> >
> > that sounds like a future problem to solve when bpf+lsm+cgroup are
> > used for security.
> 
> [...]
> 
> >
> > I disagree with the urgency. I see nothing that needs immediate action.
> > bpf+lsm+cgroup is not in the tree yet.
> >
> 
> I disagree very strongly here.  The API in 4.10 is IMO quite ugly, but
> the result if bpf+lsm+cgroup works *differently* will be far, far
> uglier.

again we're talking about the future here and 'ugly' is the matter of taste.
I hear all the time that people say that netlink api is ugly, so?

> I think the right solution here is to clean up the API so that it'll
> work for future extensions that people are already imagining.  If this
> can happen for 4.10, great.  If not, then postpone this stuff
> entirely.  I've had code I've written for Linux postponed extensively
> until I've gotten the API right, and it's not so bad.  (Actually, I've
> even had API changes I've made reverted in -stable, IIRC.  This is
> much worse than postponing before a release.)

huh? 'not right api' because it's using bpf syscall instead
of cgroup control-file? I think the opposite is the truth.
syscall style is more extensible whereas control-file and text
based interfaces have proven to be huge pain to extend and maintain.
Again, I don't mind if you want to have both available.

^ permalink raw reply

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
From: Francois Romieu @ 2016-12-20  0:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Lino Sanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev
In-Reply-To: <20161219100215.GA6296@amd>

Pavel Machek <pavel@ucw.cz> :
[...]
> Considering the memory barriers... is something like this neccessary
> in the via-rhine ?

Yes.

> AFAICT... we need a barrier after making sure that descriptor is no
> longer owned by DMA (to make sure we don't get stale data in rest of
> descriptor)... and we need a barrier before giving the descriptor to
> the dma, to make sure DMA engine sees the complete update....?

I would not expect stale data while processing a single transmit
descriptor as the transmit completion does not use the rest of the
descriptor at all in the via-rhine driver. However I agree that transmit
descriptors should be read by the cpu with adequate ordering so the
dma_rmb() should stay.

Same kind of narrative for dma_wmb rhine_rx (s/read/written/ and
s/cpu/device/).

> diff --git a/drivers/net/ethernet/via/via-rhine.c b/drivers/net/ethernet/via/via-rhine.c
> index ba5c542..3806e72 100644
> --- a/drivers/net/ethernet/via/via-rhine.c
> +++ b/drivers/net/ethernet/via/via-rhine.c
[...]
> @@ -2061,6 +2062,7 @@ static int rhine_rx(struct net_device *dev, int limit)
>  
>  		if (desc_status & DescOwn)
>  			break;
> +		dma_rmb();
>  

I agree with your explanation for this one (late vlan processing in a 
different word from the same descriptor).

-- 
Ueimor

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  0:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <20161220000254.GA58895-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>

On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> >> Hi all-
>> >>
>> >> I apologize for being rather late with this.  I didn't realize that
>> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> >> it on the linux-api list, so I missed the discussion.
>> >>
>> >> I think that the inet ingress, egress etc filters are a neat feature,
>> >> but I think the API has some issues that will bite us down the road
>> >> if it becomes stable in its current form.
>> >>
>> >> Most of the problems I see are summarized in this transcript:
>> >>
>> >> # mkdir cg2
>> >> # mount -t cgroup2 none cg2
>> >> # mkdir cg2/nosockets
>> >> # strace cgrp_socket_rule cg2/nosockets/ 0
>> >> ...
>> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>> >>
>> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
>> >>
>> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
>> >>
>> >> ^^^^ This is fine.  The bpf() syscall manipulates bpf objects.
>> >>
>> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>> >>
>> >> ^^^^ This is not so good:
>> >> ^^^^
>> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> >> ^^^^    is manipulating a cgroup.  There's no reason that a socket creation
>> >> ^^^^    filter couldn't be written in a different language (new iptables
>> >> ^^^^    table?  Simple list of address families?), but if that happened,
>> >> ^^^^    then using bpf() to install it would be entirely nonsensical.
>> >
>> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
>> > network stack is using cgroup as an application group that should
>> > invoke bpf program at the certain point in the stack.
>> > imo cgroup management is orthogonal.
>>
>> It is literally modifying the struct cgroup, and, as a practical
>> matter, it's causing membership in the cgroup to have a certain
>> effect.  But rather than pointless arguing, let me propose an
>> alternative API that I think solves most of the problems here.
>>
>> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
>> Instead, the cgroup gets three new control files:
>> "net.ingress_filter", "net.egress_filter", and
>> "net.socket_create_filter".  Initially, if you read these files, you
>> see "none\n".
>>
>> To attach a bpf filter, you open the file for write and do an ioctl on
>> it.  After doing the ioctl, if you read the file, you'll see
>> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
>> the bpf program.
>>
>> To detach any type of filter, bpf or otherwise, you open the file for
>> write and write "none\n" (or just "none").
>>
>> If you write anything else to the file, you get -EINVAL.  But, if
>> someone writes a new type of filter (perhaps a simple list of address
>> families), maybe you can enable the filter by writing something
>> appropriate to the file.
>
> I see no difference in what you're proposing vs what is already implemented
> from feature set point of view, but the file approach is very ugly, since
> it's a mismatch to FD style access that bpf is using everywhere.
> In your proposal you'd also need to add bpf prefix everywhere.
> So the control file names should be bpf_inet_ingress, bpf_inet_egress
> and bpf_socket_create.

I think we're still talking past each other.  A big part of the point
of changing it is that none of this is specific to bpf.  You could (in
theory -- I'm not proposing implementing these until there's demand)
have:

net.socket_create_filter = "none": no filter
net.socket_create_filter = "bpf:baadf00d": bpf filter
net.socket_create_filter = "disallow": no sockets created period
net.socket_create_filter = "iptables:foobar": some iptables thingy
net.socket_create_filter = "nft:blahblahblah": some nft thingy
net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

See?  This API is not bpf-specific.  It's an API for filtering.  The
fact that struct cgroup currently contains a member called "bpf" is
purely an artifact of the fact that it currently only supports bpf.
Someone will want to rename it to "filters" some day, and
BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a
filter.

> If you want to prepare such patch for them, I don't mind,
> but you cannot kill syscall command, since it's more flexible
> and your control-file approach _will_ be obsolete pretty quickly.

BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO.
If you really really want a syscall, make it a new syscall.

>
>> Now the API matches the effect.  A cgroup with something other than
>> "none" in one of its net.*_filter files is a cgroup that filters
>> network activity.  And you get CRIU compatibility for free: CRIU can
>> read the control file and use whatever mechanism is uses for BPF in
>> general to restore the cgroup filter state.   As an added bonus, you
>> get ACLs for free and the ugly multiplexer goes away.
>
> extended bpf is not supported by criu. only classic, so having
> control_file-style attachment doesn't buy us anything.

CRIU will support it some day.  We might as well put fewer obstacles
in their way.

>
>> >> # mkdir cg2/nosockets/sockets
>> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
>> >>
>> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
>> >> ^^^^ then we're stuck with its semantics.  If it returned -EINVAL instead,
>> >> ^^^^ there would be a chance to refine it.
>> >
>> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
>> > Imange that socket accounting program is attached to cg2/nosockets.
>> > The program is readonly and carry no security meaning.
>> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
>>
>> I think you're misunderstanding me.  What I'm saying is that, if you
>> allow a cgroup and one of its descendents to both enable the same type
>> of filter, you have just committed to some particular semantics for
>> what happens.  And I think that the current semantics are the *wrong*
>> semantics for default long-term use, so you should either fix the
>> semantics or disable the problematic case.
>
> Are you saying that use cases I provided are also 'wrong'?
> If you insist on that we won't be able to make any forward progress.
> The current semantics is fine for what it's designed for.

Can you explain your use case more clearly?

>
>> >> # echo $$ >cg2/nosockets/sockets/cgroup.procs
>> >> # ping 127.0.0.1
>> >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
>> >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
>> >
>> > hmm. in this case the admin created a subgroup with a group that allows
>> > ping, so? how's that a problem?
>>
>> I think you're forgetting about namespaces.  There are two different
>> admins here.  There's the admin who created the outer container and
>> said "no sockets here".  Then there's the admin inside the container
>> who said "muahaha, I can create a sub-container and allow sockets
>> *there*".
>
> container management frameworks should be doing sensible
> things. With any infra in the kernel there are many ways to
> create non-sensical setups. It's not a job of the kernel
> to interfere with user space.

What exactly isn't sensible about using cgroup_bpf for containers?

>
>> > In case of vrf I can imagine
>> > a set of application auto-binding to one vrf and smaller
>> > set of applications binding to a different vrf.
>> > Or broader set of applications monitoring all tcp traffic
>> > and subset of them monitoring udp instead.
>> > Those are valid use cases.
>>
>> > I guess you're advocating to run a link list of programs?
>> > That won't be useful for the above use cases, where there is no
>> > reason to run more than one program that control plane
>> > assigned to run for this cgroup.
>>
>> Yes there is, for both monitoring and policy.  If I want to monitor
>> all activity in a cgroup, I probably want to monitor descendents as
>> well.  If I want to restrict a cgroup, I want to restrict its
>> descendents.
>
> you're ignoring use cases I described earlier.
> In vrf case there is only one ifindex it needs to bind to.

I'm totally lost.  Can you explain what this has to do with the cgroup
hierarchy?

>> > At this stage we decided to allow only one program per cgroup per hook
>> > and later can extend it if necessary.
>>
>> No you can't.  Since you allow the problematic case and you gave it
>> the surprising "one program" semantics, you can't change it later.
>
> please show me why we cannot? As far as I can see nothing prevents
> that in the future. We can add any number of new fields to
> BPF_PROG_ATTACH command just like we did in the past with
> other commands, whereas control file interface is not extensible.

Because people are going to start using the old API, tools won't be
aware of the new semantics, you have no usable introspection
mechanism, and everyone is going to screw it up.  But it's even worse:
because global privilege is currently needed to set up these filters,
containers really can use it today, but once you switch to ns_capable,
then it suddenly becomes insecure.  And *that* is something that you
can't do.

>
>> > As you're pointing out, in case of security, we probably
>> > want to preserve original bpf program that should always be
>> > run first and only after it returned 'ok' (we'd need to define
>> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>>
>> It's already defined AFAICT.  1 means okay.  0 means not okay.
>
> sorry that doesn't make any sense. For seccomp we have a set of
> ranges that mean different things. Here you're proposing to
> hastily assign 1 and 0 ? How is that extensible?
> We need to carefully think through what should be the semantics
> of attaching multiple programs, consider performance implications,
> return codes and so on.

You already assigned it.  The return value of the bpf program, loaded
in Linus' tree today, tells the kernel whether to accept or reject.

>
>> > Another alternative is to disallow attaching programs in sub-hierarchy
>> > if parent has something already attached, but it's not useful
>> > for general case.
>> > All of these are possible future extensions.
>>
>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>> have to do it now (or disable the feature for 4.10).  This is why I'm
>> bringing this whole thing up now.
>
> We don't have to touch user visible api here, so extensions are fine.

Huh?  My example in the original email attaches a program in a
sub-hierarchy.  Are you saying that 4.11 could make that example stop
working?

>
>> >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
>> >> programs attached that can do things if various events occur. (Right
>> >> now, this means socket operations, but there are plans in the works
>> >> to do this for LSM hooks too.) These bpf programs can say yes or no,
>> >> but they can also read out various data (including socket payloads!)
>> >> and save them away where an attacker can find them. This sounds a
>> >> lot like seccomp with a narrower scope but a much stronger ability to
>> >> exfiltrate private information.
>> >
>> > that sounds like a future problem to solve when bpf+lsm+cgroup are
>> > used for security.
>>
>> [...]
>>
>> >
>> > I disagree with the urgency. I see nothing that needs immediate action.
>> > bpf+lsm+cgroup is not in the tree yet.
>> >
>>
>> I disagree very strongly here.  The API in 4.10 is IMO quite ugly, but
>> the result if bpf+lsm+cgroup works *differently* will be far, far
>> uglier.
>
> again we're talking about the future here and 'ugly' is the matter of taste.
> I hear all the time that people say that netlink api is ugly, so?

Yeah, it's ugly, but that ship already sailed.  This ship hasn't.

^ permalink raw reply

* Re: ipv6: handle -EFAULT from skb_copy_bits
From: Dave Jones @ 2016-12-20  0:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161219.144848.736886978545428136.davem@davemloft.net>

On Mon, Dec 19, 2016 at 02:48:48PM -0500, David Miller wrote:

 > One thing that's interesting is that if the user picks "IPPROTO_RAW"
 > as the value of 'protocol' we set inet->hdrincl to 1.
 > 
 > The user can also set inet->hdrincl to 1 or 0 via setsockopt().
 > 
 > I think this is part of the problem.  The test above means to check
 > for "RAW socket with hdrincl set" and is trying to do this more simply.
 > But because setsockopt() can set this arbitrarily, testing sk_protocol
 > alone isn't enough.
 > 
 > So changing:
 > 
 > 	sk->sk_protocol == IPPROTO_RAW
 > 
 > into something like:
 > 
 > 	(sk->sk_socket->type == SOCK_RAW && inet_sk(sk)->hdrincl)
 > 
 > should correct the test.
 >  ..
 > 
 > You can test if the change I suggest above works.

Unfortunately, this made no difference.  I spent some time today trying
to make a better reproducer, but failed. I'll revisit again tomorrow.

Maybe I need >1 process/thread to trigger this.  That would explain why
I can trigger it with Trinity.

	Dave

^ permalink raw reply

* Re: ipv6: handle -EFAULT from skb_copy_bits
From: Dave Jones @ 2016-12-20  0:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161220003144.omoqyghgdfbxdyuu@codemonkey.org.uk>

On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote:

 > Unfortunately, this made no difference.  I spent some time today trying
 > to make a better reproducer, but failed. I'll revisit again tomorrow.
 > 
 > Maybe I need >1 process/thread to trigger this.  That would explain why
 > I can trigger it with Trinity.

scratch that last part, I finally just repro'd it with a single process.

	Dave

^ permalink raw reply

* Re: [PATCH net] netfilter: check duplicate config when initializing in ipt_CLUSTERIP
From: Pablo Neira Ayuso @ 2016-12-20  0:48 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, netfilter-devel, davem, Marcelo Ricardo Leitner
In-Reply-To: <9b3e1f76b5670d33727e63b6e166ae416b0d65af.1481776300.git.lucien.xin@gmail.com>

On Thu, Dec 15, 2016 at 12:31:40PM +0800, Xin Long wrote:
> @@ -185,6 +186,17 @@ clusterip_config_init(const struct ipt_clusterip_tgt_info *i, __be32 ip,
>  	atomic_set(&c->refcount, 1);
>  	atomic_set(&c->entries, 1);
>  
> +	spin_lock_bh(&cn->lock);
> +	if (__clusterip_config_find(net, ip)) {
> +		spin_unlock_bh(&cn->lock);
> +		kfree(c);
> +
> +		return NULL;
> +	}

This is going to result in ENOMEM error report to userspace on race,
which can be confusing. Time for clusterip_config_init() to return
PTR_ERR()?

> +
> +	list_add_rcu(&c->list, &cn->configs);
> +	spin_unlock_bh(&cn->lock);
> +
>  #ifdef CONFIG_PROC_FS
>  	{
>  		char buffer[16];

^ permalink raw reply

* [PATCH net] openvswitch: Add a missing break statement.
From: Jarno Rajahalme @ 2016-12-20  1:06 UTC (permalink / raw)
  To: netdev; +Cc: jarno, jbenc, pshelar

Add a break statement to prevent fall-through from
OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL.  Without the break
actions setting ethernet addresses fail to validate with log messages
complaining about invalid tunnel attributes.

Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
---
 net/openvswitch/flow_netlink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d19044f..c87d359 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a,
 	case OVS_KEY_ATTR_ETHERNET:
 		if (mac_proto != MAC_PROTO_ETHERNET)
 			return -EINVAL;
+		break;

 	case OVS_KEY_ATTR_TUNNEL:
 		if (masked)
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v3 net-next 1/3] openvswitch: Add a missing break statement.
From: Jarno Rajahalme @ 2016-12-20  1:07 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Linux Kernel Network Developers, Jiri Benc, Eric Garver
In-Reply-To: <CAOrHB_By4AoojkiA33bP3GT7ABsdZfJkp9j2kGpno76j9cqXug@mail.gmail.com>

> On Dec 13, 2016, at 9:07 PM, Pravin Shelar <pshelar@ovn.org> wrote:
> 
> On Tue, Nov 29, 2016 at 3:30 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
>> Add a break statement to prevent fall-through from
>> OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL.  Without the break
>> actions setting ethernet addresses fail to validate with log messages
>> complaining about invalid tunnel attributes.
>> 
>> Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>> Acked-by: Pravin B Shelar <pshelar@ovn.org>
>> Acked-by: Jiri Benc <jbenc@redhat.com>
> 
> Hi Jarno,
> Since this is straight forward patch. can you send it separately so
> that we can get it merged soon?
> 

I just did, against net. You’ll take over the rest?

  Jarno

> Thanks,
> Pravin.

^ permalink raw reply

* Re: [PATCH net] openvswitch: Add a missing break statement.
From: Jarno Rajahalme @ 2016-12-20  1:10 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev, jbenc, pshelar
In-Reply-To: <1482195993-97937-1-git-send-email-jarno@ovn.org>


> On Dec 19, 2016, at 5:06 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
> 
> Add a break statement to prevent fall-through from
> OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL.  Without the break
> actions setting ethernet addresses fail to validate with log messages
> complaining about invalid tunnel attributes.
> 
> Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> Acked-by: Pravin B Shelar <pshelar@ovn.org>
> Acked-by: Jiri Benc <jbenc@redhat.com>
> ---
> net/openvswitch/flow_netlink.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index d19044f..c87d359 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a,
> 	case OVS_KEY_ATTR_ETHERNET:
> 		if (mac_proto != MAC_PROTO_ETHERNET)
> 			return -EINVAL;
> +		break;
> 
> 	case OVS_KEY_ATTR_TUNNEL:
> 		if (masked)
> -- 
> 2.1.4
> 

I posted this separately from an earlier net-next patch series. Pravin agreed to pick up that rest of the series (skb->protocol w/ vlans).

  Jarno

^ permalink raw reply

* Which ethtool methods should I implement?
From: Timur Tabi @ 2016-12-20  1:29 UTC (permalink / raw)
  To: netdev

I'm adding support for ethtool to my driver 
(drivers/net/ethernet/qualcomm/emac/), and I can't find any meaningful 
HOWTO documentation, so I'm not sure which methods I need to implement.

Is there some minimal set of must-have ethtool methods that should be 
implemented?  Since I support phylib, I guess I should use 
phy_ethtool_get_link_ksettings and phy_ethtool_set_link_ksettings.  What 
else?

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: David Miller @ 2016-12-20  1:34 UTC (permalink / raw)
  To: alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w
  Cc: luto-DgEjT+Ai2ygdnm+yROfE0A, daniel-cYrQPVfZoowdnm+yROfE0A,
	mic-WFhQfpSGs3bR7s880joybQ, keescook-F7+t8E8rja9g9hUCZPvPmw,
	jann-XZ1E9jl8jIdeoWH0uzbU5w, tj-DgEjT+Ai2ygdnm+yROfE0A,
	dsahern-Re5JQEeQqe8AvxtiuMwx3w, tgraf-G/eBtMaohhA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161220000254.GA58895-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>

From: Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Mon, 19 Dec 2016 16:02:56 -0800

> huh? 'not right api' because it's using bpf syscall instead
> of cgroup control-file? I think the opposite is the truth.

I completely agree with Alexei on this.

^ permalink raw reply

* Re: [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string
From: Dongpo Li @ 2016-12-20  1:35 UTC (permalink / raw)
  To: Rob Herring
  Cc: David Miller, Mark Rutland, Michael Turquette, Stephen Boyd,
	Russell King, Zhangfei Gao, Yisen Zhuang, salil.mehta,
	Arnd Bergmann, Andrew Lunn, Jiancheng Xue, benjamin.chenhao,
	caizhiyong, netdev, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAL_Jsq+Su84DonQbh2KZ2o8JQ0Mqarag60Oq=Dby3y8Doyozxg@mail.gmail.com>



On 2016/12/20 0:04, Rob Herring wrote:
> On Mon, Dec 19, 2016 at 2:14 AM, Dongpo Li <lidongpo@hisilicon.com> wrote:
>> Hi Rob and David,
>>
>> On 2016/12/12 22:21, Rob Herring wrote:
>>> On Mon, Dec 12, 2016 at 5:16 AM, Dongpo Li <lidongpo@hisilicon.com> wrote:
>>>> Hi Rob,
>>>>
>>>> On 2016/12/10 6:35, Rob Herring wrote:
>>>>> On Mon, Dec 05, 2016 at 09:27:58PM +0800, Dongpo Li wrote:
> 
> [...]
> 
>>>>>> @@ -20,7 +25,7 @@ Required properties:
>>>>>>
>>>>>>  Example:
>>>>>>      gmac0: ethernet@f9840000 {
>>>>>> -            compatible = "hisilicon,hix5hd2-gmac";
>>>>>> +            compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1";
>>>>>
>>>>> You can't just change compatible strings.
>>>>>
>>>> Okay, maybe I should name all the compatible string with the suffix "-gmac" instead of
>>>> "-gemac". This can keep the compatible strings with the same suffix. Is this okay?
>>>> Can I just add the generic compatible string without changing the SoCs compatible string?
>>>> Like following:
>>>>         gmac0: ethernet@f9840000 {
>>>>  -              compatible = "hisilicon,hix5hd2-gmac";
>>>>  +              compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1";
>>>
>>> Yes, this is fine.
>>
>> As the patch series have been applied to net-next branch,
>> in which way should I commit this compatible fix?
>> Should I send a new patch fixing this compatible string error with "Fixes: xxx"?
>> Looking forward to your reply!
> 
> Yes to both.

Okay, thanks for your reply!
I will send the fix patch series as soon as possible.


    Regards,
    Dongpo

.

^ permalink raw reply

* Re: ipv6: handle -EFAULT from skb_copy_bits
From: David Miller @ 2016-12-20  1:36 UTC (permalink / raw)
  To: davej; +Cc: netdev
In-Reply-To: <20161220004013.43pcopemc6im32az@codemonkey.org.uk>

From: Dave Jones <davej@codemonkey.org.uk>
Date: Mon, 19 Dec 2016 19:40:13 -0500

> On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote:
> 
>  > Unfortunately, this made no difference.  I spent some time today trying
>  > to make a better reproducer, but failed. I'll revisit again tomorrow.
>  > 
>  > Maybe I need >1 process/thread to trigger this.  That would explain why
>  > I can trigger it with Trinity.
> 
> scratch that last part, I finally just repro'd it with a single process.

Thanks for the info, I'll try to think about this some more.

^ permalink raw reply

* Re: Which ethtool methods should I implement?
From: Florian Fainelli @ 2016-12-20  1:40 UTC (permalink / raw)
  To: Timur Tabi, netdev
In-Reply-To: <58588982.5030006@codeaurora.org>

On 12/19/2016 05:29 PM, Timur Tabi wrote:
> I'm adding support for ethtool to my driver
> (drivers/net/ethernet/qualcomm/emac/), and I can't find any meaningful
> HOWTO documentation, so I'm not sure which methods I need to implement.
> 
> Is there some minimal set of must-have ethtool methods that should be
> implemented?  Since I support phylib, I guess I should use
> phy_ethtool_get_link_ksettings and phy_ethtool_set_link_ksettings.  What
> else?

Ideally, everything that is supported by your HW, but I would with the
basic essential stuff that you would need in case someone reports
problems with your driver like:

- statistics (MAC for sure) and PHY (if possible), -S
- ability to restart auto-negotation (-r)
- reporting of driver information (-i)
- support toggling and reporting NETIF_F_* features -k/-K
-- 
Florian

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  1:40 UTC (permalink / raw)
  To: David Miller
  Cc: Alexei Starovoitov, Andrew Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <20161219.203422.500916400463091702.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Mon, Dec 19, 2016 at 5:34 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Date: Mon, 19 Dec 2016 16:02:56 -0800
>
>> huh? 'not right api' because it's using bpf syscall instead
>> of cgroup control-file? I think the opposite is the truth.
>
> I completely agree with Alexei on this.

So what happens when someone adds another type of filter?  Let's say
there's a simple, no-privilege-required list of allowed address
families that can hook up to the socket creation hook for a cgroup.
Does BPF_PROG_DETACH still detach it?  Or would both the bpf *and* the
list of allowed address families be in force?  If the latter, why
wouldn't two BPF programs on the same hook be allowed?

Concretely:

# mkdir /cgroup/a
# set_up_bpf_socket_rule /cgroup/a
# set_up_list_of_address_families /cgroup/a
# cat /cgroup/a/some_new_file [what gets displayed?]
# BPF_PROG_DETACH: what happens

By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
even take a reference to a BPF program as an argument.  What is it
supposed to do if this mechanism ever gets extended?

--Andy

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  1:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw@mail.gmail.com>

On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
>
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?
>

Okay, I figured out what you mean, I think.  You have a handful of vrf
devices.  Let's say they have ifindexes 1 and 2 (and maybe more).

The interesting case happens when you set up /cgroup/a with a bpf
program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf
program that binds new sockets to ifindex 2.  The question is: what
should happen if you're in /cgroup/a/b?  Presumably, if you do this,
you wanted to end up with ifindex 2.

I think the way it should actually work is that the kernel evaluates
/cgroup/a/b's hook and then /cgroup/a's hook.  Then /cgroup/a (which
is the more privileged hook) gets to make the choice.  If it wants
ifindex 2 to win, it can do (pseudocode):

if (!sk->sk_bound_if)
  sk->sk_bound_if = 1;

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: David Ahern @ 2016-12-20  1:44 UTC (permalink / raw)
  To: Andy Lutomirski, Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 12/19/16 5:25 PM, Andy Lutomirski wrote:
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter
> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.

...

>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
> 
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?

I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is

    cgrp2/vrf/NAME

where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense. 

...

>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>>> have to do it now (or disable the feature for 4.10).  This is why I'm
>>> bringing this whole thing up now.
>>
>> We don't have to touch user visible api here, so extensions are fine.
> 
> Huh?  My example in the original email attaches a program in a
> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> working?

Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?

^ permalink raw reply

* Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support
From: Rajat Jain @ 2016-12-20  1:46 UTC (permalink / raw)
  To: Brian Norris
  Cc: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev, devicetree, linux-bluetooth, linux-kernel, Rajat Jain
In-Reply-To: <20161219231021.GA142074@google.com>

Hi Brian,


On Mon, Dec 19, 2016 at 3:10 PM, Brian Norris <briannorris@chromium.org> wrote:
> Hi Rajat,
>
> On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote:
>> Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that
>> can be connected to a gpio on the CPU side, and can be used to wakeup
>> the host out-of-band. This can be useful in situations where the
>> in-band wakeup is not possible or not preferable (e.g. the in-band
>> wakeup may require the USB host controller to remain active, and
>> hence consuming more system power during system sleep).
>>
>> The oob gpio interrupt to be used for wakeup on the CPU side, is
>> read from the device tree node, (using standard interrupt descriptors).
>> A devcie tree binding document is also added for the driver. The
>> compatible string is in compliance with
>> Documentation/devicetree/bindings/usb/usb-device.txt
>>
>> Signed-off-by: Rajat Jain <rajatja@google.com>
>
> A few small comments below, but other than those, for the whole series:
>
> Reviewed-by: Brian Norris <briannorris@chromium.org>
>
>> ---
>> v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt.
>>     * Leave it on device tree to specify IRQ flags (level /edge triggered)
>>     * Mark the device as non wakeable on exit.
>>
>>  Documentation/devicetree/bindings/net/btusb.txt | 40 ++++++++++++
>>  drivers/bluetooth/btusb.c                       | 84 +++++++++++++++++++++++++
>>  2 files changed, 124 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/btusb.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/btusb.txt b/Documentation/devicetree/bindings/net/btusb.txt
>> new file mode 100644
>> index 0000000..2c0355c
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/btusb.txt
>> @@ -0,0 +1,40 @@
>> +Generic Bluetooth controller over USB (btusb driver)
>> +---------------------------------------------------
>> +
>> +Required properties:
>> +
>> +  - compatible : should comply with the format "usbVID,PID" specified in
>> +              Documentation/devicetree/bindings/usb/usb-device.txt
>> +              At the time of writing, the only OF supported devices
>> +              (more may be added later) are:
>> +
>> +               "usb1286,204e" (Marvell 8997)
>
> I don't know if anyone cares about these sort of organizational aspects,
> but in patch 3 you're extending this binding with device-specific
> Marvell bindings. Seems like it'd be good to have some clear way to
> correlate those 2. e.g., add a reference to marvell-bt-sd8xxx.txt (or
> marvell-bt-8xxx.txt, once you rename it) in patch 3.

Good idea, I'll do it. (I'll add a reference to marvell-bt-8xxx.txt in
btusb.txt).

>
>> +
>> +Optional properties:
>> +
>> +  - interrupt-parent: phandle of the parent interrupt controller
>> +  - interrupt-names: (see below)
>> +  - interrupts : The interrupt specified by the name "wakeup" is the interrupt
>> +              that shall be used for out-of-band wake-on-bt. Driver will
>> +              request this interrupt for wakeup. During system suspend, the
>> +              irq will be enabled so that the bluetooth chip can wakeup host
>> +              platform out of band. During system resume, the irq will be
>> +              disabled to make sure unnecessary interrupt is not received.
>> +
>> +Example:
>> +
>> +Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt:
>> +
>> +&usb_host1_ehci {
>> +    status = "okay";
>> +    #address-cells = <1>;
>> +    #size-cells = <0>;
>> +
>> +    mvl_bt1: bt@1 {
>> +     compatible = "usb1286,204e";
>> +     reg = <1>;
>> +     interrupt-parent = <&gpio0>;
>> +     interrupt-name = "wakeup";
>> +     interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
>> +    };
>> +};
>> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
>> index ce22cef..beca4e9 100644
>> --- a/drivers/bluetooth/btusb.c
>> +++ b/drivers/bluetooth/btusb.c
>> @@ -24,6 +24,8 @@
>>  #include <linux/module.h>
>>  #include <linux/usb.h>
>>  #include <linux/firmware.h>
>> +#include <linux/of_device.h>
>> +#include <linux/of_irq.h>
>>  #include <asm/unaligned.h>
>>
>>  #include <net/bluetooth/bluetooth.h>
>> @@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = {
>>  #define BTUSB_BOOTING                9
>>  #define BTUSB_RESET_RESUME   10
>>  #define BTUSB_DIAG_RUNNING   11
>> +#define BTUSB_OOB_WAKE_DISABLED      12
>>
>>  struct btusb_data {
>>       struct hci_dev       *hdev;
>> @@ -416,6 +419,8 @@ struct btusb_data {
>>       int (*recv_bulk)(struct btusb_data *data, void *buffer, int count);
>>
>>       int (*setup_on_usb)(struct hci_dev *hdev);
>> +
>> +     int oob_wake_irq;   /* irq for out-of-band wake-on-bt */
>>  };
>>
>>  static inline void btusb_free_frags(struct btusb_data *data)
>> @@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, bool enable)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_PM
>> +static irqreturn_t btusb_oob_wake_handler(int irq, void *priv)
>> +{
>> +     struct btusb_data *data = priv;
>> +
>> +     /* Disable only if not already disabled (keep it balanced) */
>> +     if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
>> +             disable_irq_nosync(irq);
>> +             disable_irq_wake(irq);
>> +     }
>> +     pm_wakeup_event(&data->udev->dev, 0);
>> +     return IRQ_HANDLED;
>> +}
>> +
>> +static const struct of_device_id btusb_match_table[] = {
>> +     { .compatible = "usb1286,204e" },
>> +     { }
>> +};
>> +MODULE_DEVICE_TABLE(of, btusb_match_table);
>
> You define a match table here, but you also define essentially same
> table for Marvell-specific additions in patch 3. It looks like maybe
> it's legal to have more than one OF table in a module? But it seems like
> it would get confusing, besides being somewhat strange to maintain. It
> might also produce duplicate 'modinfo' output.
>
> If you really want to independently opt into device-tree-specified
> interrupts vs. Marvell-specific interrrupt configuration, then you
> should probably just merge the latter into the former table, and
> implement a Marvell/GPIO flag to stick in the .data field of this table.

The data we want to stick (The pin number on the chip) is board
specific rather than being chip specific, and hence .data may not be
useful here.

>
> Or it might be fine to drop one or both "match" checks. Particularly for
> the Marvell-specific stuff, it's probably fair just to check if it has
> an ->of_node and if 'id->driver_info & BTUSB_MARVELL'. Any other Marvell
> device-specific quirks could probably be keyed off of the
> (weirdly-named?) blacklist_table[], which already matches PID/VID.

Yup, I think I can remove the compatible string checks.

I'll be sending a V3.

>
>> +
>> +/* Use an oob wakeup pin? */
>> +static int btusb_config_oob_wake(struct hci_dev *hdev)
>> +{
>> +     struct btusb_data *data = hci_get_drvdata(hdev);
>> +     struct device *dev = &data->udev->dev;
>> +     int irq, ret;
>> +
>> +     if (!of_match_device(btusb_match_table, dev))
>> +             return 0;
>> +
>> +     /* Move on if no IRQ specified */
>> +     irq = of_irq_get_byname(dev->of_node, "wakeup");
>> +     if (irq <= 0) {
>> +             bt_dev_dbg(hdev, "%s: no OOB Wakeup IRQ in DT", __func__);
>> +             return 0;
>> +     }
>> +
>> +     set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
>> +
>> +     ret = devm_request_irq(&hdev->dev, irq, btusb_oob_wake_handler,
>> +                            0, "OOB Wake-on-BT", data);
>> +     if (ret) {
>> +             bt_dev_err(hdev, "%s: IRQ request failed", __func__);
>> +             return ret;
>> +     }
>> +
>> +     ret = device_init_wakeup(dev, true);
>> +     if (ret) {
>> +             bt_dev_err(hdev, "%s: failed to init_wakeup\n", __func__);
>> +             return ret;
>> +     }
>> +
>> +     data->oob_wake_irq = irq;
>> +     disable_irq(irq);
>> +     bt_dev_info(hdev, "OOB Wake-on-BT configured at IRQ %u\n", irq);
>> +     return 0;
>> +}
>> +#endif
>> +
>>  static int btusb_probe(struct usb_interface *intf,
>>                      const struct usb_device_id *id)
>>  {
>> @@ -2849,6 +2913,11 @@ static int btusb_probe(struct usb_interface *intf,
>>       hdev->send   = btusb_send_frame;
>>       hdev->notify = btusb_notify;
>>
>> +#ifdef CONFIG_PM
>> +     err = btusb_config_oob_wake(hdev);
>> +     if (err)
>> +             goto out_free_dev;
>> +#endif
>>       if (id->driver_info & BTUSB_CW6622)
>>               set_bit(HCI_QUIRK_BROKEN_STORED_LINK_KEY, &hdev->quirks);
>>
>> @@ -3061,6 +3130,9 @@ static void btusb_disconnect(struct usb_interface *intf)
>>                       usb_driver_release_interface(&btusb_driver, data->isoc);
>>       }
>>
>> +     if (data->oob_wake_irq)
>> +             device_init_wakeup(&data->udev->dev, false);
>> +
>>       hci_free_dev(hdev);
>>  }
>>
>> @@ -3089,6 +3161,12 @@ static int btusb_suspend(struct usb_interface *intf, pm_message_t message)
>>       btusb_stop_traffic(data);
>>       usb_kill_anchored_urbs(&data->tx_anchor);
>>
>> +     if (data->oob_wake_irq && device_may_wakeup(&data->udev->dev)) {
>> +             clear_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
>> +             enable_irq_wake(data->oob_wake_irq);
>> +             enable_irq(data->oob_wake_irq);
>> +     }
>> +
>>       /* Optionally request a device reset on resume, but only when
>>        * wakeups are disabled. If wakeups are enabled we assume the
>>        * device will stay powered up throughout suspend.
>> @@ -3126,6 +3204,12 @@ static int btusb_resume(struct usb_interface *intf)
>>       if (--data->suspend_count)
>
> Not related to your patch exactly, but isn't this 'suspend_count' stuff
> useless? I'd send a patch, but it might conflict with yours. Just wanted
> to note it and see if I'm crazy...
>
> Brian
>
>>               return 0;
>>
>> +     /* Disable only if not already disabled (keep it balanced) */
>> +     if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
>> +             disable_irq(data->oob_wake_irq);
>> +             disable_irq_wake(data->oob_wake_irq);
>> +     }
>> +
>>       if (!test_bit(HCI_RUNNING, &hdev->flags))
>>               goto done;
>>
>> --
>> 2.8.0.rc3.226.g39d4020
>>

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  1:56 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <2dbec775-6304-e44c-19c5-fbf07877e7b1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.

Can you elaborate on what goes wrong?  (Obviously the
"address_family_list" example makes no sense in that context.)

>
> ...
>
>>> you're ignoring use cases I described earlier.
>>> In vrf case there is only one ifindex it needs to bind to.
>>
>> I'm totally lost.  Can you explain what this has to do with the cgroup
>> hierarchy?
>
> I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is
>
>     cgrp2/vrf/NAME
>
> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense.

I tend to agree.  I still think that the mechanism as it stands is
broken in other respects and should be fixed before it goes live.  I
have no desire to cause problems for the vrf use case.

But keep in mind that the vrf use case is, in Linus' tree, a bit
broken right now in its interactions with other users of the same
mechanism.  Suppose I create a container and want to trace all of its
created sockets.  I'll set up cgrp2/container and load my tracer as a
socket creation hook.  Then a container sets up
cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
filter.  Now the tracing stops working -- oops.

>
> ...
>
>>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>>>> have to do it now (or disable the feature for 4.10).  This is why I'm
>>>> bringing this whole thing up now.
>>>
>>> We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh?  My example in the original email attaches a program in a
>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>> working?
>
> Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?

Yes, exactly.  I think there are two sensible behaviors:

a) sub-cgroups cannot have a filter at all of the parent has a filter.
(This is the "punt" approach -- it lets different semantics be
assigned later without breaking userspace.)

b) sub-cgroups can have a filter if a parent does, too.  The semantics
are that the sub-cgroup filter runs first and all side-effects occur.
If that filter says "reject" then ancestor filters are skipped.  If
that filter says "accept", then the ancestor filter is run and its
side-effects happen as well.  (And so on, all the way up to the root.)

--Andy

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: David Miller @ 2016-12-20  1:56 UTC (permalink / raw)
  To: jbacik; +Cc: hannes, tom, kraigatgoog, eric.dumazet, netdev
In-Reply-To: <286A21B1-2A15-4DDF-B334-A016DA3D52EA@fb.com>

From: Josef Bacik <jbacik@fb.com>
Date: Sat, 17 Dec 2016 13:26:00 +0000

> So take my current duct tape fix and augment it with more
> information in the bind bucket?  I'm not sure how to make this work
> without at least having a list of the binded addrs as well to make
> sure we are really ok.  I suppose we could save the fastreuseport
> address that last succeeded to make it work properly, but I'd have
> to make it protocol agnostic and then have a callback to have the
> protocol to make sure we don't have to do the bind_conflict run.  Is
> that what you were thinking of?  Thanks,

So there isn't a deadlock or lockup here, something is just running
really slow, right?

And that "something" is a scan of the sockets on a tb list, and
there's lots of timewait sockets hung off of that tb.

As far as I can tell, this scan is happening in
inet_csk_bind_conflict().

Furthermore, reuseport is somehow required to make this problem
happen.  How exactly?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox