Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: Alexei Starovoitov @ 2016-11-28 20:06 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480348130-31354-2-git-send-email-dsa@cumulusnetworks.com>

On Mon, Nov 28, 2016 at 07:48:48AM -0800, David Ahern wrote:
> Code move only; no functional change intended.

not quite...

> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
>   * @sk: The socken sending or receiving traffic
> @@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>  
>  	prog = rcu_dereference(cgrp->bpf.effective[type]);
>  	if (prog) {
> -		unsigned int offset = skb->data - skb_network_header(skb);
> -
> -		__skb_push(skb, offset);
> -		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
> -		__skb_pull(skb, offset);
> +		switch (type) {
> +		case BPF_CGROUP_INET_INGRESS:
> +		case BPF_CGROUP_INET_EGRESS:
> +			ret = __cgroup_bpf_run_filter_skb(skb, prog);
> +			break;

hmm. what's a point of double jump table? It's only burning cycles
in the fast path. We already have
prog = rcu_dereference(cgrp->bpf.effective[type]); if (prog) {...}
Could you do a variant of __cgroup_bpf_run_filter() instead ?
That doesnt't take 'skb' as an argument.
It will also solve scary looking NULL skb from patch 2:
__cgroup_bpf_run_filter(sk, NULL, ...

and to avoid copy-pasting first dozen lines of current
__cgroup_bpf_run_filter can be moved into a helper that
__cgroup_bpf_run_filter_skb and
__cgroup_bpf_run_filter_sk will call.
Or some other way to rearrange that code.

^ permalink raw reply

* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: David Miller @ 2016-11-28 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.duyck, sfr, elicooper, netdev
In-Reply-To: <1480357043.18162.70.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2016 10:17:23 -0800

> It appears that Eli changelog was not very precise, because his bug was
> because of XFRM being involved.
> 
> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> So XFRM calls skb_gso_segment() before ip_output() or
> ip6_finish_output2() had a chance to change skb->protocol

I think if we do it in ip6_local_out and ip_local_out, then none of
these hacks will be necessary at all.

^ permalink raw reply

* Re: [PATCH net-next 1/1] driver: ipvlan: Add the sanity check for ipvlan mode
From: David Miller @ 2016-11-28 20:08 UTC (permalink / raw)
  To: maheshb; +Cc: fgao, edumazet, netdev, gfree.wind
In-Reply-To: <CAF2d9ji2_JhiNv9P9aUt7wV2Vfjv+z_q-MWSKKH8tzO_1THobg@mail.gmail.com>

From: Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com>
Date: Mon, 28 Nov 2016 11:02:45 -0800

> On Mon, Nov 28, 2016 at 5:23 AM, <fgao@ikuai8.com> wrote:
> 
>> From: Gao Feng <fgao@ikuai8.com>
>>
>> The ipvlan mode variable "nval" is from userspace, so the ipvlan codes
>> should check if the mode variable "nval" is valid.
>>
>> Signed-off-by: Gao Feng <fgao@ikuai8.com>
>> ---
>>  drivers/net/ipvlan/ipvlan_main.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_
>> main.c
>> index ab90b22..537b5a9 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>> @@ -65,6 +65,9 @@ static int ipvlan_set_port_mode(struct ipvl_port *port,
>> u16 nval)
>>         struct net_device *mdev = port->dev;
>>         int err = 0;
>>
>> +       if (nval >= IPVLAN_MODE_MAX)
>> +               return -EINVAL;
>> +
>>
> I'm curious to know how you encountered this issue? The values are
> validated in ipvlan_nl_validate() and it should fail at that time itself.

I'm not applying this without at least a better explanation.

^ permalink raw reply

* [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Woojung.Huh @ 2016-11-28 20:03 UTC (permalink / raw)
  To: davem, f.fainelli, andrew; +Cc: netdev, UNGLinuxDriver

From: Woojung Huh <woojung.huh@microchip.com>

Add LAN7801 MAC-only configuration support.

Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
---
 drivers/net/usb/Kconfig   |   5 +++
 drivers/net/usb/lan78xx.c | 111 +++++++++++++++++++++++++++++++++++++++++++++-
 drivers/net/usb/lan78xx.h |  14 ++++++
 3 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index cdde590..3dd490f5 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -114,6 +114,11 @@ config USB_LAN78XX
 	help
 	  This option adds support for Microchip LAN78XX based USB 2
 	  & USB 3 10/100/1000 Ethernet adapters.
+	  LAN7800 : USB 3 to 10/100/1000 Ethernet adapter
+	  LAN7850 : USB 2 to 10/100/1000 Ethernet adapter
+	  LAN7801 : USB 3 to 10/100/1000 Ethernet adapter (MAC only)
+
+	  Proper PHY driver is required for LAN7801.
 
 	  To compile this driver as a module, choose M here: the
 	  module will be called lan78xx.
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 0c459e9..08f2895 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -40,7 +40,7 @@
 #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
 #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
 #define DRIVER_NAME	"lan78xx"
-#define DRIVER_VERSION	"1.0.5"
+#define DRIVER_VERSION	"1.0.6"
 
 #define TX_TIMEOUT_JIFFIES		(5 * HZ)
 #define THROTTLE_JIFFIES		(HZ / 8)
@@ -67,6 +67,7 @@
 #define LAN78XX_USB_VENDOR_ID		(0x0424)
 #define LAN7800_USB_PRODUCT_ID		(0x7800)
 #define LAN7850_USB_PRODUCT_ID		(0x7850)
+#define LAN7801_USB_PRODUCT_ID		(0x7801)
 #define LAN78XX_EEPROM_MAGIC		(0x78A5)
 #define LAN78XX_OTP_MAGIC		(0x78F3)
 
@@ -400,6 +401,21 @@ struct lan78xx_net {
 	struct irq_domain_data	domain_data;
 };
 
+/* define external phy id */
+#define	PHY_LAN8835			(0x0007C130)
+#define	PHY_KSZ9031RNX			(0x00221620)
+
+/* phyid : masked external phy id
+ * pre_config : if needed, configure MAC and/or external PHY
+ *		such as irq pin mux and RGMII timing.
+ */
+struct ext_phy_config_table {
+	int phyid;
+	void (*pre_config)(struct lan78xx_net *dev,
+			   struct phy_device *phydev,
+			   phy_interface_t *interface);
+};
+
 /* use ethtool to change the level for any given device */
 static int msg_level = -1;
 module_param(msg_level, int, 0);
@@ -1697,6 +1713,7 @@ static int lan78xx_mdiobus_read(struct mii_bus *bus, int phy_id, int idx)
 done:
 	mutex_unlock(&dev->phy_mutex);
 	usb_autopm_put_interface(dev->intf);
+
 	return ret;
 }
 
@@ -1759,6 +1776,10 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
 		/* set to internal PHY id */
 		dev->mdiobus->phy_mask = ~(1 << 1);
 		break;
+	case ID_REV_CHIP_ID_7801_:
+		/* scan thru PHYAD[2..0] */
+		dev->mdiobus->phy_mask = ~(0xFF);
+		break;
 	}
 
 	ret = mdiobus_register(dev->mdiobus);
@@ -1933,11 +1954,58 @@ static void lan78xx_remove_irq_domain(struct lan78xx_net *dev)
 	dev->domain_data.irqdomain = NULL;
 }
 
+static void lan8835_pre_config(struct lan78xx_net *dev,
+			       struct phy_device *phydev,
+			       phy_interface_t *interface)
+{
+	int buf;
+	int ret;
+
+	/* LED2/PME_N/IRQ_N/RGMII_ID pin to IRQ_N mode */
+	buf = phy_read_mmd_indirect(phydev, 32784, 3);
+	buf &= ~0x1800;
+	buf |= 0x0800;
+	phy_write_mmd_indirect(phydev, 32784, 3, buf);
+
+	/* RGMII MAC TXC Delay Enable */
+	ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
+				MAC_RGMII_ID_TXC_DELAY_EN_);
+
+	/* RGMII TX DLL Tune Adjust */
+	ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
+
+	*interface = PHY_INTERFACE_MODE_RGMII_TXID;
+}
+
+static void ksz9031rnx_pre_config(struct lan78xx_net *dev,
+				  struct phy_device *phydev,
+				  phy_interface_t *interface)
+{
+	/* Micrel9301RNX PHY configuration */
+	/* RGMII Control Signal Pad Skew */
+	phy_write_mmd_indirect(phydev, 4, 2, 0x0077);
+	/* RGMII RX Data Pad Skew */
+	phy_write_mmd_indirect(phydev, 5, 2, 0x7777);
+	/* RGMII RX Clock Pad Skew */
+	phy_write_mmd_indirect(phydev, 8, 2, 0x1FF);
+
+	*interface = PHY_INTERFACE_MODE_RGMII_RXID;
+}
+
+/* external phyid & configure routine for LAN7801 */
+static struct ext_phy_config_table ext_phy_table[] = {
+	{ PHY_LAN8835, lan8835_pre_config },
+	{ PHY_KSZ9031RNX, ksz9031rnx_pre_config },
+	{}
+};
+
 static int lan78xx_phy_init(struct lan78xx_net *dev)
 {
 	int ret;
 	u32 mii_adv;
+	u32 masked_phyid;
 	struct phy_device *phydev = dev->net->phydev;
+	phy_interface_t interface;
 
 	phydev = phy_find_first(dev->mdiobus);
 	if (!phydev) {
@@ -1945,6 +2013,37 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 		return -EIO;
 	}
 
+	if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
+	    (dev->chipid == ID_REV_CHIP_ID_7850_)) {
+		phydev->is_internal = true;
+		interface = PHY_INTERFACE_MODE_GMII;
+
+	} else if (dev->chipid == ID_REV_CHIP_ID_7801_) {
+		int i;
+
+		if (!phydev->drv) {
+			netdev_err(dev->net, "no PHY driver found\n");
+			return -EIO;
+		}
+
+		masked_phyid = phydev->phy_id & phydev->drv->phy_id_mask;
+		interface = PHY_INTERFACE_MODE_RGMII;
+
+		for (i = 0; ext_phy_table[i].phyid != 0; i++) {
+			if ((masked_phyid == ext_phy_table[i].phyid) &&
+			    (ext_phy_table[i].pre_config)) {
+				ext_phy_table[i].pre_config(dev,
+							    phydev,
+							    &interface);
+			}
+		}
+
+		phydev->is_internal = false;
+	} else {
+		netdev_err(dev->net, "unknown ID found\n");
+		return -EIO;
+	}
+
 	/* if phyirq is not set, use polling mode in phylib */
 	if (dev->domain_data.phyirq > 0)
 		phydev->irq = dev->domain_data.phyirq;
@@ -1957,7 +2056,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 
 	ret = phy_connect_direct(dev->net, phydev,
 				 lan78xx_link_status_change,
-				 PHY_INTERFACE_MODE_GMII);
+				 interface);
 	if (ret) {
 		netdev_err(dev->net, "can't attach PHY to %s\n",
 			   dev->mdiobus->id);
@@ -2338,6 +2437,9 @@ static int lan78xx_reset(struct lan78xx_net *dev)
 	} while ((buf & PMT_CTL_PHY_RST_) || !(buf & PMT_CTL_READY_));
 
 	ret = lan78xx_read_reg(dev, MAC_CR, &buf);
+	/* LAN7801 only has RGMII mode */
+	if (dev->chipid == ID_REV_CHIP_ID_7801_)
+		buf &= ~MAC_CR_GMII_EN_;
 	buf |= MAC_CR_AUTO_DUPLEX_ | MAC_CR_AUTO_SPEED_;
 	ret = lan78xx_write_reg(dev, MAC_CR, buf);
 
@@ -2466,6 +2568,7 @@ static int lan78xx_stop(struct net_device *net)
 
 	phy_stop(net->phydev);
 	phy_disconnect(net->phydev);
+
 	net->phydev = NULL;
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
@@ -3887,6 +3990,10 @@ static const struct usb_device_id products[] = {
 	/* LAN7850 USB Gigabit Ethernet Device */
 	USB_DEVICE(LAN78XX_USB_VENDOR_ID, LAN7850_USB_PRODUCT_ID),
 	},
+	{
+	/* LAN7801 USB Gigabit Ethernet Device */
+	USB_DEVICE(LAN78XX_USB_VENDOR_ID, LAN7801_USB_PRODUCT_ID),
+	},
 	{},
 };
 MODULE_DEVICE_TABLE(usb, products);
diff --git a/drivers/net/usb/lan78xx.h b/drivers/net/usb/lan78xx.h
index 4092790..25aa546 100644
--- a/drivers/net/usb/lan78xx.h
+++ b/drivers/net/usb/lan78xx.h
@@ -108,6 +108,7 @@
 #define ID_REV_CHIP_REV_MASK_		(0x0000FFFF)
 #define ID_REV_CHIP_ID_7800_		(0x7800)
 #define ID_REV_CHIP_ID_7850_		(0x7850)
+#define ID_REV_CHIP_ID_7801_		(0x7801)
 
 #define FPGA_REV			(0x04)
 #define FPGA_REV_MINOR_MASK_		(0x0000FF00)
@@ -550,6 +551,7 @@
 #define LTM_INACTIVE1_TIMER10_		(0x0000FFFF)
 
 #define MAC_CR				(0x100)
+#define MAC_CR_GMII_EN_			(0x00080000)
 #define MAC_CR_EEE_TX_CLK_STOP_EN_	(0x00040000)
 #define MAC_CR_EEE_EN_			(0x00020000)
 #define MAC_CR_EEE_TLAR_EN_		(0x00010000)
@@ -787,6 +789,18 @@
 #define PHY_DEV_ID_MODEL_MASK_		(0x0FC00000)
 #define PHY_DEV_ID_OUI_MASK_		(0x003FFFFF)
 
+#define RGMII_TX_BYP_DLL		(0x708)
+#define RGMII_TX_BYP_DLL_TX_TUNE_ADJ_MASK_	(0x000FC00)
+#define RGMII_TX_BYP_DLL_TX_TUNE_SEL_MASK_	(0x00003F0)
+#define RGMII_TX_BYP_DLL_TX_DLL_RESET_		(0x0000002)
+#define RGMII_TX_BYP_DLL_TX_DLL_BYPASS_		(0x0000001)
+
+#define RGMII_RX_BYP_DLL		(0x70C)
+#define RGMII_RX_BYP_DLL_RX_TUNE_ADJ_MASK_	(0x000FC00)
+#define RGMII_RX_BYP_DLL_RX_TUNE_SEL_MASK_	(0x00003F0)
+#define RGMII_RX_BYP_DLL_RX_DLL_RESET_		(0x0000002)
+#define RGMII_RX_BYP_DLL_RX_DLL_BYPASS_		(0x0000001)
+
 #define OTP_BASE_ADDR			(0x00001000)
 #define OTP_ADDR_RANGE_			(0x1FF)
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net 11/16] net: ethernet: marvell: mvneta: fix fixed-link phydev leaks
From: Thomas Petazzoni @ 2016-11-28 20:10 UTC (permalink / raw)
  To: Johan Hovold
  Cc: David S. Miller, Vince Bridgers, Florian Fainelli, Fugang Duan,
	Pantelis Antoniou, Vitaly Bordug, Claudiu Manoil, Li Yang,
	Felix Fietkau, John Crispin, Matthias Brugger, Sergei Shtylyov,
	Lars Persson, Mugunthan V N, Grygorii Strashko, Rob Herring,
	Frank Rowand, Andrew Lunn, Vivien Didelot
In-Reply-To: <1480357509-28074-12-git-send-email-johan@kernel.org>

Hello,

On Mon, 28 Nov 2016 19:25:04 +0100, Johan Hovold wrote:
> Make sure to deregister and free any fixed-link PHY registered using
> of_phy_register_fixed_link() on probe errors and on driver unbind.
> 
> Fixes: 83895bedeee6 ("net: mvneta: add support for fixed links")
> Signed-off-by: Johan Hovold <johan@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 5 +++++
>  1 file changed, 5 insertions(+)

Reviewed-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>

Thanks!

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net-next V2 00/10] Mellanox 100G mlx5 DCBX and ethtool updates
From: David Miller @ 2016-11-28 20:10 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <1480258932-10302-1-git-send-email-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Sun, 27 Nov 2016 17:02:02 +0200

> This series provides the following mlx5 updates:
 ...

Series applied, thank you.

^ permalink raw reply

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Lebrun @ 2016-11-28 20:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161128.112248.2198395724649783009.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2123 bytes --]

On 11/28/2016 05:22 PM, David Miller wrote:
> Thanks for trying to solve this problem.
> 
> But we really don't want this to be Kconfig gated.  If we decide to
> support this it should be a run-time selectable option.  Every
> distribution on the planet is going to turn your Kconfig option on, so
> whatever you think we might be saving by putting this behind Kconfig
> checks won't be realized for large swaths of the userbase.
> 

OK for that

> I also question how necessary %100 consistent hashing of ECMP really
> is.
> 
> If we can do something at extremely low cost and arrive at a very low
> disruption rate, honestly that's probably good enough.
> 
> I assume you've looked over RFC2992.  A low disrutpion algorithm is
> described there and if we could just get rid of the divide it might
> be extremely desirable.

I wanted a solution that has very low disruption for both scale up and
scale down. When a next-hop is added, only 1/N flows should be disrupted
(i.e. in this case, reaffected to the new next-hop). When a next-hop is
removed, only the flows that were handled by this particular next-hop
should be disrupted, and those flows should be equally rebalanced across
the remaining next-hops. Although the solution presented in RFC2992 has
a "reasonably" low disruption (looks like it can get up to 1/2
disruption though), I'm not sure if the equal-rebalancing part holds.

The advantage of my solution over RFC2992 is lowest possible disruption
and equal rebalancing of affected flows. The disadvantage is the lookup
complexity of O(log n) vs O(1). Although from a theoretical viewpoint
O(1) is obviously better, would O(log n) have an effectively measurable
negative impact on scalability ? If we consider 32 next-hops for a route
and 100 pseudo-random numbers generated per next-hop, the lookup
algorithm would have to perform in the worst case log2 3200 = 11
comparisons to select a next-hop for that route.

For my part I prefer the minimal disruption over constant time lookup.
I'll try to come up with some measurements to see the actual impact.

David


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 163 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 10/11] qede: Add basic XDP support
From: Mintz, Yuval @ 2016-11-28 20:20 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <20161128191832.46191e1c@jkicinski-Precision-T1700>

> Any particular reason not to allow the XDP prog being set while the
> device is closed?  You seem to preserve the program across
> close()/open() cycles so the edev->xdp_prog is alive and valid while
> device is closed IIUC.  

> I think other drivers are allowing setting XDP while closed and it
> would be cool to keep the behaviour identical across drivers :)

You're right; No reason to prevent this.
I'll fix it in v2.

^ permalink raw reply

* Re: [PATCH net-next 1/3] ethtool: (uapi) Add ETHTOOL_PHY_LOOPBACK to PHY tunables
From: Andrew Lunn @ 2016-11-28 20:21 UTC (permalink / raw)
  To: Allan W. Nielsen; +Cc: netdev, f.fainelli, raju.lakkaraju
In-Reply-To: <20161128192306.GA11474@microsemi.com>

On Mon, Nov 28, 2016 at 08:23:06PM +0100, Allan W. Nielsen wrote:
> Hi Andrew and Florian,
> 
> On 28/11/16 15:14, Andrew Lunn wrote:
> > On Mon, Nov 28, 2016 at 02:24:30PM +0100, Allan W. Nielsen wrote:
> > > From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> > >
> > > 3 types of PHY loopback are supported.
> > > i.e. Near-End Loopback, Far-End Loopback and External Loopback.
> > Is this integrated with ethtool --test? You only want the PHY to go
> > into loopback mode when running ethtool --test external_lb or maybe
> > ethtool --test offline.
> There are other use-cases for enabling PHY loopback:
> 
> 1) If the PHY is connected to a switch then a loop-port is sometime
>    used to "force/enable" one or more extra pass through the switch
>    core. This "hack" can sometime be used to achieve new functionality
>    with existing silicon.

With Linux, switches are managed via switchdev, or DSA. You will have
to teach this infrastructure that something really odd is going on
with one of its ports before you do anything like this in the PHY
layer. I suggest you leave this use case alone until somebody
really-really wants it. From my knowledge of the Marvell DSA driver,
this is not easy.

> 2) Existing user-space application may expect to be able to do the
>    testing on its own (generate/validate the test traffic).

Please can you reference some existing user-space application and the
kernel API it uses to put the PHY into loopback mode?

> We are always happy to integrate with any existing functionality, but
> as I understand "ethtool --test" then intention is to perform a test
> and then bring back the PHY in to a "normal" state (I may be
> wrong...).

Correct.

> The idea with this patch is to allow configuring loopback more
> "permanently" (userspace can decide when to activate and when to
> de-activate). I should properly have made that clear in the cover
> letter.

Leaving it in loopback is a really bad idea. I've spent days once
working out why an Ethernet did not work. Turned out the PHY powered
up in loopback mode, and the embedded OS running on it did not
initialise it to sensible defaults on probe. Packets we going out,
dhcp server was replying but all incoming packets were discarded.

It is really not obvious when everything looks O.K, but nothing works,
because the PHY is in loopback. There needs to be a big red flag to
warn you.

If you really do what to do this, you should look at NETIF_F_LOOPBACK
and consider how this concept can be applied at the PHY, not the MAC.
But you need the full concept, so you see things like:

2: eth0: <PHY_LOOPBACK,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 80:ee:73:83:60:27 brd ff:ff:ff:ff:ff:ff

Humm, i've no idea how you actually enable the MAC loopback
NETIF_F_LOOPBACK represents. I don't see anything with ip link set.

     Andrew

^ permalink raw reply

* Re: Checking for MDIO phy address 0
From: Florian Fainelli @ 2016-11-28 20:25 UTC (permalink / raw)
  To: Phil Endecott, netdev
In-Reply-To: <1480359230293@dmwebmail.dmwebmail.chezphil.org>

On 11/28/2016 10:53 AM, Phil Endecott wrote:
> Dear Experts,
> 
> Is it true that phy address 0 is special, and should not be used?

It is special in that it can be made special, or not (very helpful, I
know). AFAIR, address 0 was defined a while ago (by Cisco) to be a
special broadcast address which could be used to e.g: blast the same
write to several devices. As you may imagine this can be helpful in a
switch environment where you may have to reset e.g: 48-ports.

In practice, the behavior varies a lot depending on:

- whether address 0 is potentially used to access a built-in PHY within
e.g: an Ethernet switch

- what kind of strapping options/configurations are selected for the
PHY, and whether address 0 is going to be snooped by the PHY devices's
MDIO block and how it

> I have this in my device tree (edited for brevity):
> 
>         mdio@17020000 {
>             compatible = "apm,xgene-mdio-rgmii";
>             #address-cells = <0x00000001>;
>             #size-cells = <0x00000000>;
>             phy@4 {
>                 reg = <0x00000000>;

Nit: why not make the unit address and the actual reg property match, e.g:

	phy@0 {
		reg = <0>;
	};

	phy@1 {
		reg = <1>;
	};

>                 linux,phandle = <0x00000021>;
>                 phandle = <0x00000021>;
>             };
>             phy@5 {
>                 reg = <0x00000001>;
>                 linux,phandle = <0x00000022>;
>                 phandle = <0x00000022>;
>             };
>         };
>         ethernet@1f210000 {
>             compatible = "apm,xgene1-sgenet";
>             phy-connection-type = "sgmii";
>             phy-handle = <0x00000021>;
>         };
>         ethernet@1f210030 {
>             compatible = "apm,xgene1-sgenet";
>             phy-connection-type = "sgmii";
>             phy-handle = <0x00000022>;
>         };
> 
> (This is on a Gigabyte MP30-AR1, which has an X-Gene ARM64 processor.)
> 
> I've spent a long time trying to get the two gigabit ethernet ports to 
> correctly auto-negotiate down to 100 Mbit, and it's possible that the 
> underlying problem is that one of the phys is using address 0.  Does 
> that make sense?

Maybe, but you are not exactly describing the problems you are seeing.

If the issue is that both PHYs are configured to respond to MDIO address
0, what could happen is that the Ethernet instance which references this
address 0 may end-up clobbering the other PHY's settings as well because
of the broadcast properties applied to this special address.

Few things for you to check:

- can you scan the entire bus range and see which addresses are valid,
outside of address 0 and 1, there is possibly another address at which
one of the two PHYs should respond, if not, check the next recommendation

- if you have access to the MV88E1512 datasheet, can you check if the
PHY is configured to respond to MDIO address 0, and what this implies
for reads and writes towards that address? Can you disable this and just
have address 0 be a normal (non-broadcast) address?

> 
> Anyway, my reason for this message is to suggest a runtime diagnostic 
> message somewhere if address 0 is used - this could have saved me a lot of 
> work! If someone can suggest the right place to add this I can prepare a 
> patch.

Address 0 can be made special or not, but there is no programmatic way
to tell without scanning the MDIO bus for devices, so I don't think a
warning is going to help at all to warn about that. There are also tons
of cases where the address is just treated like any other address, which
is typical with MDIO connected Ethernet switches where PHY@0 is just the
switch's port 0 built-in PHY for instance.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: David Ahern @ 2016-11-28 20:31 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <20161128200641.GA7634@ast-mbp.thefacebook.com>

On 11/28/16 1:06 PM, Alexei Starovoitov wrote:
> On Mon, Nov 28, 2016 at 07:48:48AM -0800, David Ahern wrote:
>> Code move only; no functional change intended.
> 
> not quite...
> 
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ...
>>   * @sk: The socken sending or receiving traffic
>> @@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>>  
>>  	prog = rcu_dereference(cgrp->bpf.effective[type]);
>>  	if (prog) {
>> -		unsigned int offset = skb->data - skb_network_header(skb);
>> -
>> -		__skb_push(skb, offset);
>> -		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
>> -		__skb_pull(skb, offset);
>> +		switch (type) {
>> +		case BPF_CGROUP_INET_INGRESS:
>> +		case BPF_CGROUP_INET_EGRESS:
>> +			ret = __cgroup_bpf_run_filter_skb(skb, prog);
>> +			break;
> 
> hmm. what's a point of double jump table? It's only burning cycles
> in the fast path. We already have
> prog = rcu_dereference(cgrp->bpf.effective[type]); if (prog) {...}
> Could you do a variant of __cgroup_bpf_run_filter() instead ?
> That doesnt't take 'skb' as an argument.
> It will also solve scary looking NULL skb from patch 2:
> __cgroup_bpf_run_filter(sk, NULL, ...
> 
> and to avoid copy-pasting first dozen lines of current
> __cgroup_bpf_run_filter can be moved into a helper that
> __cgroup_bpf_run_filter_skb and
> __cgroup_bpf_run_filter_sk will call.
> Or some other way to rearrange that code.
> 

sure
1. rename the existing __cgroup_bpf_run_filter to __cgroup_bpf_run_filter_skb

2. create new __cgroup_bpf_run_filter_sk for this new program type. the new run_filter does not need the family or full sock checks for this use case so the common code is only the sock_cgroup_ptr and prog lookups.

^ permalink raw reply

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Miller @ 2016-11-28 20:32 UTC (permalink / raw)
  To: david.lebrun; +Cc: netdev
In-Reply-To: <583C9093.7060404@uclouvain.be>

From: David Lebrun <david.lebrun@uclouvain.be>
Date: Mon, 28 Nov 2016 21:16:19 +0100

> The advantage of my solution over RFC2992 is lowest possible disruption
> and equal rebalancing of affected flows. The disadvantage is the lookup
> complexity of O(log n) vs O(1). Although from a theoretical viewpoint
> O(1) is obviously better, would O(log n) have an effectively measurable
> negative impact on scalability ? If we consider 32 next-hops for a route
> and 100 pseudo-random numbers generated per next-hop, the lookup
> algorithm would have to perform in the worst case log2 3200 = 11
> comparisons to select a next-hop for that route.

When I was working on the routing cache removal in ipv4 I compared
using a stupid O(1) hash lookup of the FIB entries vs. the O(log n)
fib_trie stuff actually in use.

It did make a difference.

This is a lookup that can be invoked 20 million times per second or
more.

Every cycle matters.

We already have a lot of trouble getting under the cycle budget one
has for routing at wire speed for very high link rates, please don't
make it worse.

^ permalink raw reply

* Re: [PATCH net-next v3 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: Alexei Starovoitov @ 2016-11-28 20:32 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480348130-31354-3-git-send-email-dsa@cumulusnetworks.com>

On Mon, Nov 28, 2016 at 07:48:49AM -0800, David Ahern wrote:
> Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
> BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
> any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
> Currently only sk_bound_dev_if is exported to userspace for modification
> by a bpf program.
> 
> This allows a cgroup to be configured such that AF_INET{6} sockets opened
> by processes are automatically bound to a specific device. In turn, this
> enables the running of programs that do not support SO_BINDTODEVICE in a
> specific VRF context / L3 domain.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1f09c521adfe..808e158742a2 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -408,7 +408,7 @@ struct bpf_prog {
>  	enum bpf_prog_type	type;		/* Type of BPF program */
>  	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
>  	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
> -	unsigned int		(*bpf_func)(const struct sk_buff *skb,
> +	unsigned int		(*bpf_func)(const void *ctx,
>  					    const struct bpf_insn *filter);

Daniel already tweaked it. pls rebase.

> +static const struct bpf_func_proto *
> +cg_sock_func_proto(enum bpf_func_id func_id)
> +{
> +	return NULL;
> +}

if you don't want any helpers, just don't set .get_func_proto.
See check_call() in verifier.
Though why not allow socket filter like helpers that
sk_filter_func_proto() provides?
tail call, bpf_trace_printk, maps are useful things that you get for free.
Developing programs without bpf_trace_printk is pretty hard.

> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 5ddf5cda07f4..24d2550492ee 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -374,8 +374,18 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
>  
>  	if (sk->sk_prot->init) {
>  		err = sk->sk_prot->init(sk);
> -		if (err)
> +		if (err) {
> +			sk_common_release(sk);
> +			goto out;
> +		}
> +	}
> +
> +	if (!kern) {
> +		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);

i guess from vrf use case point of view this is the best place,
since so_bindtodevice can still override it,
but thinking little bit into other use case like port binding
restrictions and port rewrites can we move it into inet_bind ?
My understanding nothing will be using bound_dev_if until that
time, so we can set it there?
And at that point we can extend 'struct bpf_sock' with other
fields like port and sockaddr...
and single BPF_PROG_TYPE_CGROUP_SOCK type will be used for
vrf and port binding use cases...
More users, more testing of that code path...

^ permalink raw reply

* Re: [PATCH net 0/2] mlx4 bug fixes for 4.9
From: David Miller @ 2016-11-28 20:34 UTC (permalink / raw)
  To: tariqt; +Cc: netdev, eranbe, sebott, swise
In-Reply-To: <1480267252-26146-1-git-send-email-tariqt@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>
Date: Sun, 27 Nov 2016 19:20:50 +0200

> This patchset includes 2 bug fixes:
> * In patch 1 we revert the commit that avoids invoking unregister_netdev
> in shutdown flow, as it introduces netdev presence issues where
> it can be accessed unsafely by ndo operations during the flow.
> * Patch 2 is a simple fix for a variable uninitialization issue.
> 
> Series generated against net commit:
> 6998cc6ec237 tipc: resolve connection flow control compatibility problem

Series applied, thanks.

^ permalink raw reply

* Re: [[PATCH net-next RFC] 1/4] net: dsa: mv88e6xxx: Implement mv88e6390 tag remap
From: Vivien Didelot @ 2016-11-28 20:35 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Andrew Lunn
In-Reply-To: <1479944598-20372-2-git-send-email-andrew@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

>  #define PORT_TAG_REGMAP_0123	0x18
>  #define PORT_TAG_REGMAP_4567	0x19
> +#define PORT_PRIO_MAP_TABLE	0x18    /* 6390 */
> +#define PORT_PRIO_MAP_TABLE_UPDATE		BIT(15)
> +#define PORT_PRIO_MAP_TABLE_INGRESS_PCP		(0x0 << 12)
> +#define PORT_PRIO_MAP_TABLE_EGRESS_GREEN_PCP	(0x1 << 12)
> +#define PORT_PRIO_MAP_TABLE_EGRESS_YELLOW_PCP	(0x2 << 12)
> +#define PORT_PRIO_MAP_TABLE_EGRESS_AVB_PCP	(0x3 << 12)
> +#define PORT_PRIO_MAP_TABLE_EGRESS_GREEN_DSCP	(0x5 << 12)
> +#define PORT_PRIO_MAP_TABLE_EGRESS_YELLOW_DSCP	(0x6 << 12)
> +#define PORT_PRIO_MAP_TABLE_EGRESS_AVB_DSCP	(0x7 << 12)

0x17 is the "IP Priority Mapping Table" register, so I'd define 0x18 as
PORT_IEEE_PRIO_MAP_TABLE to avoid later confusion.

>  
>  #define GLOBAL_STATUS		0x00
>  #define GLOBAL_STATUS_PPU_STATE BIT(15) /* 6351 and 6171 */
> @@ -813,6 +822,7 @@ struct mv88e6xxx_ops {
>  	void (*stats_get_strings)(struct mv88e6xxx_chip *chip,  uint8_t *data);
>  	void (*stats_get_stats)(struct mv88e6xxx_chip *chip,  int port,
>  				uint64_t *data);
> +	int (*tag_remap)(struct mv88e6xxx_chip *chip, int port);

I would've prefered an op like .tag_remap(*chip, port, prio, new) and a
wrapper in chip.c which loops over priority 0-7, but that would make the
implementation unnecessarily complex, so let's keep it as is for now ;)

>  };
>  
>  #define STATS_TYPE_PORT		BIT(0)
> diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
> index af4772d86086..b7fab70f6cd7 100644
> --- a/drivers/net/dsa/mv88e6xxx/port.c
> +++ b/drivers/net/dsa/mv88e6xxx/port.c
> @@ -496,3 +496,60 @@ int mv88e6xxx_port_set_8021q_mode(struct mv88e6xxx_chip *chip, int port,
>  
>  	return 0;
>  }

Please add an ordered comment:

/* Offset 0x18: Port IEEE Priority Remapping Registers [0-3]
 * Offset 0x19: Port IEEE Priority Remapping Registers [4-7]
 */

> +
> +int mv88e6095_tag_remap(struct mv88e6xxx_chip *chip, int port)
> +{
> +	int err;
> +
> +	/* Tag Remap: use an identity 802.1p prio -> switch prio
> +	 * mapping.
> +	 */
> +	err = mv88e6xxx_port_write(chip, port, PORT_TAG_REGMAP_0123, 0x3210);
> +	if (err)
> +		return err;
> +
> +	/* Tag Remap 2: use an identity 802.1p prio -> switch
> +	 * prio mapping.
> +	 */

A single comment like this before the 2 writes will be enough:

    /* Use a direct priority mapping for all IEEE tagged frames */

> +	return mv88e6xxx_port_write(chip, port, PORT_TAG_REGMAP_4567, 0x7654);
> +}

Functions of port.c implementing Port Registers features must be
prefixed mv88e6xxx_port_* (6xxx can be a model in case of conflict).
mv88e6xxx_port_tag_remap() seems fine.

> +
> +int mv88e6390_tag_remap(struct mv88e6xxx_chip *chip, int port)
> +{
> +	int err, reg, i;
> +
> +	for (i = 0; i <= 7; i++) {
> +		reg = i | (i << 4) |

The pointer offset is 9, not 4.

> +			PORT_PRIO_MAP_TABLE_INGRESS_PCP |
> +			PORT_PRIO_MAP_TABLE_UPDATE;
> +		err = mv88e6xxx_port_write(chip, port, PORT_PRIO_MAP_TABLE,
> +					   reg);
> +		if (err)
> +			return err;
> +
> +		reg = i | PORT_PRIO_MAP_TABLE_EGRESS_GREEN_PCP |
> +			PORT_PRIO_MAP_TABLE_UPDATE;
> +		err = mv88e6xxx_port_write(chip, port, PORT_PRIO_MAP_TABLE,
> +					   reg);
> +		if (err)
> +			return err;
> +
> +		reg = i |
> +			PORT_PRIO_MAP_TABLE_EGRESS_YELLOW_PCP |
> +			PORT_PRIO_MAP_TABLE_UPDATE;
> +		err = mv88e6xxx_port_write(chip, port, PORT_PRIO_MAP_TABLE,
> +					   reg);
> +		if (err)
> +			return err;
> +
> +		reg = i |
> +			PORT_PRIO_MAP_TABLE_EGRESS_AVB_PCP |
> +			PORT_PRIO_MAP_TABLE_UPDATE;
> +		err = mv88e6xxx_port_write(chip, port, PORT_PRIO_MAP_TABLE,
> +					   reg);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}

Please add a static helper first to write the table, e.g.

    /* Offset 0x18: Port IEEE Priority Mapping Table (88E6190) */

    static int mv88e6xxx_port_ieeepmt_write(struct mv88e6xxx_chip *chip,
                                            int port, u8 table,
                                            u8 pointer, u16 data)

And then provide

    int mv88e6xxx_port_ieeepmt_tag_remap(...)


Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH net-next v3 3/3] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: Alexei Starovoitov @ 2016-11-28 20:37 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480348130-31354-4-git-send-email-dsa@cumulusnetworks.com>

On Mon, Nov 28, 2016 at 07:48:50AM -0800, David Ahern wrote:
> Add a simple program to demonstrate the ability to attach a bpf program
> to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
> are created.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
> +static int prog_load(int idx)
> +{
> +	struct bpf_insn prog[] = {
> +		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
> +		BPF_MOV64_IMM(BPF_REG_3, idx),
> +		BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, bound_dev_if)),
> +		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct bpf_sock, bound_dev_if)),
> +		BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
> +		BPF_EXIT_INSN(),
> +	};
> +
> +	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
> +			     "GPL", 0);
> +}

the program looks trivial enough :)

Could you integrate it into iproute2 as well ?
Then the whole vrf management will be easier.
The user wouldn't even need to be aware that iproute2 sets up
this program. It will know ifindex and can delete
the prog when vrf configs change and so on.

Also please convert this sample into automated test like samples/bpf/*.sh
we're going to move all of them to tools/testing/selftests/ eventually.

^ permalink raw reply

* Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: David Miller @ 2016-11-28 20:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: saeedm, netdev, tariqt
In-Reply-To: <1480212960.18162.23.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 26 Nov 2016 18:16:00 -0800

> On Sun, 2016-11-27 at 00:47 +0200, Saeed Mahameed wrote:
>> On Fri, Nov 25, 2016 at 5:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> As you see here in SRIOV mode (PF only) reads   sw stats from FW.
>> Tariq, I think we need to fix this.
> 
> Sure, my patch does not change this at all.
> 
> If mlx4_is_master() is false, then we aggregate the software states and
> only the software stats.
> 
> My patch makes this aggregation possible at the time ethtool or
> ndo_get_stat64() are called, since this absolutely not depend on the 250
> ms timer fetching hardware stats.

Saeed please provide counter arguments or ACK this patch, thank you.

^ permalink raw reply

* Re: [PATCH v3 net-next 1/2] net: ethernet: slicoss: add slicoss gigabit ethernet driver
From: Lino Sanfilippo @ 2016-11-28 20:46 UTC (permalink / raw)
  To: Markus Böhme, davem, charrer, liodot, gregkh, andrew
  Cc: devel, netdev, linux-kernel
In-Reply-To: <910890ec-7acf-1ced-55c6-d854ee2cdccc@mailbox.org>

Hi Markus,

On 27.11.2016 18:59, Markus Böhme wrote:
> Hello Lino,
> 
> just some things barely worth mentioning:
> 

> 
> I found a bunch of unused #defines in slic.h. I cannot judge if they are
> worth keeping:
> 
> 	SLIC_VRHSTATB_LONGE
> 	SLIC_VRHSTATB_PREA
> 	SLIC_ISR_IO
> 	SLIC_ISR_PING_MASK
> 	SLIC_GIG_SPEED_MASK
> 	SLIC_GMCR_RESET
> 	SLIC_XCR_RESET
> 	SLIC_XCR_XMTEN
> 	SLIC_XCR_PAUSEEN
> 	SLIC_XCR_LOADRNG
> 	SLIC_REG_DBAR
> 	SLIC_REG_PING
> 	SLIC_REG_DUMP_CMD
> 	SLIC_REG_DUMP_DATA
> 	SLIC_REG_WRHOSTID
> 	SLIC_REG_LOW_POWER
> 	SLIC_REG_RESET_IFACE
> 	SLIC_REG_ADDR_UPPER
> 	SLIC_REG_HBAR64
> 	SLIC_REG_DBAR64
> 	SLIC_REG_CBAR64
> 	SLIC_REG_RBAR64
> 	SLIC_REG_WRVLANID
> 	SLIC_REG_READ_XF_INFO
> 	SLIC_REG_WRITE_XF_INFO
> 	SLIC_REG_TICKS_PER_SEC
> 
> These device IDs are not used, either, but maybe it's good to keep them
> for documentation purposes:
> 
> 	PCI_SUBDEVICE_ID_ALACRITECH_1000X1_2
> 	PCI_SUBDEVICE_ID_ALACRITECH_SES1001T
> 	PCI_SUBDEVICE_ID_ALACRITECH_SEN2002XT
> 	PCI_SUBDEVICE_ID_ALACRITECH_SEN2001XT
> 	PCI_SUBDEVICE_ID_ALACRITECH_SEN2104ET
> 	PCI_SUBDEVICE_ID_ALACRITECH_SEN2102ET
> 

I left these defines in for both documentation and to avoid gaps in
register ranges. I would like to keep this as it is.

>> +
>> +/* SLIC EEPROM structure for Oasis */
>> +struct slic_mojave_eeprom {
> 
> Comment: "for Mojave".

Will fix, thanks,

> 
> [...]
> 
>> +struct slic_device {
>> +	struct pci_dev *pdev;
>> +	struct net_device *netdev;
>> +	void __iomem *regs;
>> +	/* upper address setting lock */
>> +	spinlock_t upper_lock;
>> +	struct slic_shmem shmem;
>> +	struct napi_struct napi;
>> +	struct slic_rx_queue rxq;
>> +	struct slic_tx_queue txq;
>> +	struct slic_stat_queue stq;
>> +	struct slic_stats stats;
>> +	struct slic_upr_list upr_list;
>> +	/* link configuration lock */
>> +	spinlock_t link_lock;
>> +	bool promisc;
>> +	bool autoneg;
>> +	int speed;
>> +	int duplex;
> 
> Maybe make speed and duplex unsigned? They are assigned and compared
> against unsigned values in slicoss.c, so this would get rid of some
> (benign, because of the range of the values) -Wsign-compare warnings in
> slic_configure_link_locked. However, in a comparison there SPEED_UNKNOWN
> would need to be casted to unsigned to prevent another one popping up.
> 

There is indeed a bunch of warnings concerning signedness. Will have a look
at all of them. However I think I will keep "speed" as an int, because casting
SPEED_UNKNOWN to an unsigned int is IMHO an ugly thing to do.

> [...]
> 
>> +#endif /* _SLIC_H */
>> diff --git a/drivers/net/ethernet/alacritech/slicoss.c b/drivers/net/ethernet/alacritech/slicoss.c
>> new file mode 100644
>> index 0000000..8cd862a
>> --- /dev/null
>> +++ b/drivers/net/ethernet/alacritech/slicoss.c
>> @@ -0,0 +1,1867 @@
> 
> [...]
> 
>> +
>> +static const struct pci_device_id slic_id_tbl[] = {
>> +	{ PCI_DEVICE(PCI_VENDOR_ID_ALACRITECH,
>> +		     PCI_DEVICE_ID_ALACRITECH_MOAVE) },
> 
> I missed this in slic.h, but is this a typo and "MOAVE" should be
> "MOJAVE"? There are a couple similar #defines in slic.h.

This should definitely be "Mojave". Will fix it. 

> 
> [...]
> 
>> +static void slic_refill_rx_queue(struct slic_device *sdev, gfp_t gfp)
>> +{
>> +	const unsigned int ALIGN_MASK = SLIC_RX_BUFF_ALIGN - 1;
>> +	unsigned int maplen = SLIC_RX_BUFF_SIZE;
>> +	struct slic_rx_queue *rxq = &sdev->rxq;
>> +	struct net_device *dev = sdev->netdev;
>> +	struct slic_rx_buffer *buff;
>> +	struct slic_rx_desc *desc;
>> +	unsigned int misalign;
>> +	unsigned int offset;
>> +	struct sk_buff *skb;
>> +	dma_addr_t paddr;
>> +
>> +	while (slic_get_free_rx_descs(rxq) > SLIC_MAX_REQ_RX_DESCS) {
>> +		skb = alloc_skb(maplen + ALIGN_MASK, gfp);
>> +		if (!skb)
>> +			break;
>> +
>> +		paddr = dma_map_single(&sdev->pdev->dev, skb->data, maplen,
>> +				       DMA_FROM_DEVICE);
>> +		if (dma_mapping_error(&sdev->pdev->dev, paddr)) {
>> +			netdev_err(dev, "mapping rx packet failed\n");
>> +			/* drop skb */
>> +			dev_kfree_skb_any(skb);
>> +			break;
>> +		}
>> +		/* ensure head buffer descriptors are 256 byte aligned */
>> +		offset = 0;
>> +		misalign = paddr & ALIGN_MASK;
>> +		if (misalign) {
>> +			offset = SLIC_RX_BUFF_ALIGN - misalign;
>> +			skb_reserve(skb, offset);
>> +		}
>> +		/* the HW expects dma chunks for descriptor + frame data */
>> +		desc = (struct slic_rx_desc *)skb->data;
>> +		memset(desc, 0, sizeof(*desc));
>> +
>> +		buff = &rxq->rxbuffs[rxq->put_idx];
>> +		buff->skb = skb;
>> +		dma_unmap_addr_set(buff, map_addr, paddr);
>> +		dma_unmap_len_set(buff, map_len, maplen);
>> +		buff->addr_offset = offset;
>> +		/* head buffer descriptors are placed immediately before skb */
>> +		slic_write(sdev, SLIC_REG_HBAR, lower_32_bits(paddr) +
>> +						offset);
> 
> This fits nicely on one line. :-)

Right, will fix.

> 
> [...]
> 
>> +static int slic_init_tx_queue(struct slic_device *sdev)
>> +{
>> +	struct slic_tx_queue *txq = &sdev->txq;
>> +	struct slic_tx_buffer *buff;
>> +	struct slic_tx_desc *desc;
>> +	int err;
>> +	int i;
> 
> You could make i unsigned...
> 

>> +
>> +	txq->len = SLIC_NUM_TX_DESCS;
>> +	txq->put_idx = 0;
>> +	txq->done_idx = 0;
>> +
>> +	txq->txbuffs = kcalloc(txq->len, sizeof(*buff), GFP_KERNEL);
>> +	if (!txq->txbuffs)
>> +		return -ENOMEM;
>> +
>> +	txq->dma_pool = dma_pool_create("slic_pool", &sdev->pdev->dev,
>> +					sizeof(*desc), SLIC_TX_DESC_ALIGN,
>> +					4096);
>> +	if (!txq->dma_pool) {
>> +		err = -ENOMEM;
>> +		netdev_err(sdev->netdev, "failed to create dma pool\n");
>> +		goto free_buffs;
>> +	}
>> +
>> +	for (i = 0; i < txq->len; i++) {
> 
> ...to fix a signed/unsigned comparison warning here, but...
> 
>> +		buff = &txq->txbuffs[i];
>> +		desc = dma_pool_zalloc(txq->dma_pool, GFP_KERNEL,
>> +				       &buff->desc_paddr);
>> +		if (!desc) {
>> +			netdev_err(sdev->netdev,
>> +				   "failed to alloc pool chunk (%i)\n", i);
>> +			err = -ENOMEM;
>> +			goto free_descs;
>> +		}
>> +
>> +		desc->hnd = cpu_to_le32((u32)(i + 1));
>> +		desc->cmd = SLIC_CMD_XMT_REQ;
>> +		desc->flags = 0;
>> +		desc->type = cpu_to_le32(SLIC_CMD_TYPE_DUMB);
>> +		buff->desc = desc;
>> +	}
>> +
>> +	return 0;
>> +
>> +free_descs:
>> +	while (i--) {
> 
> ...this would require reworking this logic to prevent an endless loop,
> so probably not worth bothering, considering that txq->len is well
> within the positive signed range.

AFAICS the logic does not have to be changed. The while loop will also work
fine if "i" is unsigned.

> 
>> +		buff = &txq->txbuffs[i];
>> +		dma_pool_free(txq->dma_pool, buff->desc, buff->desc_paddr);
>> +	}
>> +	dma_pool_destroy(txq->dma_pool);
>> +
>> +free_buffs:
>> +	kfree(txq->txbuffs);
>> +
>> +	return err;
>> +}
>> +
>> +static void slic_free_tx_queue(struct slic_device *sdev)
>> +{
>> +	struct slic_tx_queue *txq = &sdev->txq;
>> +	struct slic_tx_buffer *buff;
>> +	int i;
> 
> Make i unsigned? One warning less, almost no work invested.
> 
>> +
>> +	for (i = 0; i < txq->len; i++) {
>> +		buff = &txq->txbuffs[i];
>> +		dma_pool_free(txq->dma_pool, buff->desc, buff->desc_paddr);
>> +		if (!buff->skb)
>> +			continue;
>> +
>> +		dma_unmap_single(&sdev->pdev->dev,
>> +				 dma_unmap_addr(buff, map_addr),
>> +				 dma_unmap_len(buff, map_len), DMA_TO_DEVICE);
>> +		consume_skb(buff->skb);
>> +	}
>> +	dma_pool_destroy(txq->dma_pool);
>> +
>> +	kfree(txq->txbuffs);
>> +}
>> +
> 
> [...]
> 
>> +static void slic_free_rx_queue(struct slic_device *sdev)
>> +{
>> +	struct slic_rx_queue *rxq = &sdev->rxq;
>> +	struct slic_rx_buffer *buff;
>> +	int i;
> 
> Unsigned?
> 
>> +
>> +	/* free rx buffers */
>> +	for (i = 0; i < rxq->len; i++) {
>> +		buff = &rxq->rxbuffs[i];
>> +
>> +		if (!buff->skb)
>> +			continue;
>> +
>> +		dma_unmap_single(&sdev->pdev->dev,
>> +				 dma_unmap_addr(buff, map_addr),
>> +				 dma_unmap_len(buff, map_len),
>> +				 DMA_FROM_DEVICE);
>> +		consume_skb(buff->skb);
>> +	}
>> +	kfree(rxq->rxbuffs);
>> +}
> 
> [...]
> 
>> +static int slic_load_firmware(struct slic_device *sdev)
>> +{
>> +	u32 sectstart[SLIC_FIRMWARE_MAX_SECTIONS];
>> +	u32 sectsize[SLIC_FIRMWARE_MAX_SECTIONS];
>> +	const struct firmware *fw;
>> +	unsigned int datalen;
>> +	const char *file;
>> +	int code_start;
>> +	u32 numsects;
>> +	int idx = 0;
>> +	u32 sect;
>> +	u32 instr;
>> +	u32 addr;
>> +	u32 base;
>> +	int err;
>> +	int i;
> 
> Make i unsigned?
> 
>> +
>> +	file = (sdev->model == SLIC_MODEL_OASIS) ?  SLIC_FIRMWARE_OASIS :
>> +						    SLIC_FIRMWARE_MOAVE;
>> +	err = request_firmware(&fw, file, &sdev->pdev->dev);
>> +	if (err) {
>> +		dev_err(&sdev->pdev->dev, "failed to load firmware %s\n", file);
>> +		return err;
>> +	}
>> +	/* Do an initial sanity check concerning firmware size now. A further
>> +	 * check follows below.
>> +	 */
>> +	if (fw->size < SLIC_FIRMWARE_MIN_SIZE) {
>> +		dev_err(&sdev->pdev->dev,
>> +			"invalid firmware size %zu (min is %u)\n", fw->size,
>> +			SLIC_FIRMWARE_MIN_SIZE);
>> +		err = -EINVAL;
>> +		goto release;
>> +	}
>> +
>> +	numsects = slic_read_dword_from_firmware(fw, &idx);
>> +	if (numsects == 0 || numsects > SLIC_FIRMWARE_MAX_SECTIONS) {
>> +		dev_err(&sdev->pdev->dev,
>> +			"invalid number of sections in firmware: %u", numsects);
>> +		err = -EINVAL;
>> +		goto release;
>> +	}
>> +
>> +	datalen = numsects * 8 + 4;
>> +	for (i = 0; i < numsects; i++) {
>> +		sectsize[i] = slic_read_dword_from_firmware(fw, &idx);
>> +		datalen += sectsize[i];
>> +	}
>> +
>> +	/* do another sanity check against firmware size */
>> +	if (datalen > fw->size) {
>> +		dev_err(&sdev->pdev->dev,
>> +			"invalid firmware size %zu (expected >= %u)\n",
>> +			fw->size, datalen);
>> +		err = -EINVAL;
>> +		goto release;
>> +	}
>> +	/* get sections */
>> +	for (i = 0; i < numsects; i++)
>> +		sectstart[i] = slic_read_dword_from_firmware(fw, &idx);
>> +
>> +	code_start = idx;
>> +	instr = slic_read_dword_from_firmware(fw, &idx);
>> +
>> +	for (sect = 0; sect < numsects; sect++) {
>> +		unsigned int ssize = sectsize[sect] >> 3;
>> +
>> +		base = sectstart[sect];
>> +
>> +		for (addr = 0; addr < ssize; addr++) {
>> +			/* write out instruction address */
>> +			slic_write(sdev, SLIC_REG_WCS, base + addr);
>> +			/* write out instruction to low addr */
>> +			slic_write(sdev, SLIC_REG_WCS, instr);
>> +			instr = slic_read_dword_from_firmware(fw, &idx);
>> +			/* write out instruction to high addr */
>> +			slic_write(sdev, SLIC_REG_WCS, instr);
>> +			instr = slic_read_dword_from_firmware(fw, &idx);
>> +		}
>> +	}
>> +
>> +	idx = code_start;
>> +
>> +	for (sect = 0; sect < numsects; sect++) {
>> +		unsigned int ssize = sectsize[sect] >> 3;
>> +
>> +		instr = slic_read_dword_from_firmware(fw, &idx);
>> +		base = sectstart[sect];
>> +		if (base < 0x8000)
>> +			continue;
>> +
>> +		for (addr = 0; addr < ssize; addr++) {
>> +			/* write out instruction address */
>> +			slic_write(sdev, SLIC_REG_WCS,
>> +				   SLIC_WCS_COMPARE | (base + addr));
>> +			/* write out instruction to low addr */
>> +			slic_write(sdev, SLIC_REG_WCS, instr);
>> +			instr = slic_read_dword_from_firmware(fw, &idx);
>> +			/* write out instruction to high addr */
>> +			slic_write(sdev, SLIC_REG_WCS, instr);
>> +			instr = slic_read_dword_from_firmware(fw, &idx);
>> +		}
>> +	}
>> +	slic_flush_write(sdev);
>> +	mdelay(10);
>> +	/* everything OK, kick off the card */
>> +	slic_write(sdev, SLIC_REG_WCS, SLIC_WCS_START);
>> +	slic_flush_write(sdev);
>> +	/* wait long enough for ucode to init card and reach the mainloop */
>> +	mdelay(20);
>> +release:
>> +	release_firmware(fw);
>> +
>> +	return err;
>> +}
> 
> [...]
> 
>> +static int slic_init_iface(struct slic_device *sdev)
>> +{
>> +	struct slic_shmem *sm = &sdev->shmem;
>> +	int err;
>> +
>> +	sdev->upr_list.pending = false;
>> +
>> +	err = slic_init_shmem(sdev);
>> +	if (err) {
>> +		netdev_err(sdev->netdev, "failed to load firmware\n");
> 
> Wrong error message.

Yep, will fix.

> 
>> +		return err;
>> +	}
> 
> [...]
> 
>> +static netdev_tx_t slic_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	struct slic_device *sdev = netdev_priv(dev);
>> +	struct slic_tx_queue *txq = &sdev->txq;
>> +	struct slic_tx_buffer *buff;
>> +	struct slic_tx_desc *desc;
>> +	dma_addr_t paddr;
>> +	u32 cbar_val;
>> +	u32 maplen;
>> +
>> +	if (unlikely(slic_get_free_tx_descs(txq) < SLIC_MAX_REQ_TX_DESCS)) {
>> +		netdev_err(dev, "BUG! not enought tx LEs left: %u\n",
> 
> "Enough"?

Will fix.

>> +			   slic_get_free_tx_descs(txq));
>> +		return NETDEV_TX_BUSY;
>> +	}
> 
> [...]
> 
>> +static int slic_read_eeprom(struct slic_device *sdev)
>> +{
>> +	unsigned int devfn = PCI_FUNC(sdev->pdev->devfn);
>> +	struct slic_shmem *sm = &sdev->shmem;
>> +	struct slic_shmem_data *sm_data = sm->shmem_data;
>> +	const unsigned int MAX_LOOPS = 5000;
> 
> Another benign -Wsign-compare warning can be fixed by either dropping
> the unsigned here or making i below unsigned, too.
> 
>> +	unsigned int codesize;
>> +	unsigned char *eeprom;
>> +	struct slic_upr *upr;
>> +	dma_addr_t paddr;
>> +	int err = 0;
>> +	u8 *mac[2];
>> +	int i = 0;
>> +
>> +	eeprom = dma_zalloc_coherent(&sdev->pdev->dev, SLIC_EEPROM_SIZE,
>> +				     &paddr, GFP_KERNEL);
>> +	if (!eeprom)
>> +		return -ENOMEM;
>> +
>> +	slic_write(sdev, SLIC_REG_ICR, SLIC_ICR_INT_OFF);
>> +	/* setup ISP temporarily */
>> +	slic_write(sdev, SLIC_REG_ISP, lower_32_bits(sm->isr_paddr));
>> +
>> +	err = slic_new_upr(sdev, SLIC_UPR_CONFIG, paddr);
>> +	if (!err) {
>> +		for (i = 0; i < MAX_LOOPS; i++) {
>> +			if (le32_to_cpu(sm_data->isr) & SLIC_ISR_UPC)
>> +				break;
>> +			mdelay(1);
>> +		}
>> +		if (i == MAX_LOOPS) {
>> +			dev_err(&sdev->pdev->dev,
>> +				"timed out while waiting for eeprom data\n");
>> +			err = -ETIMEDOUT;
>> +		}
>> +		upr = slic_dequeue_upr(sdev);
>> +		kfree(upr);
>> +	}
>> +
>> +	slic_write(sdev, SLIC_REG_ISP, 0);
>> +	slic_write(sdev, SLIC_REG_ISR, 0);
>> +	slic_flush_write(sdev);
>> +
>> +	if (err)
>> +		goto free_eeprom;
>> +
>> +	if (sdev->model == SLIC_MODEL_OASIS) {
>> +		struct slic_oasis_eeprom *oee;
>> +
>> +		oee = (struct slic_oasis_eeprom *)eeprom;
>> +		mac[0] = oee->mac;
>> +		mac[1] = oee->mac2;
>> +		codesize = le16_to_cpu(oee->eeprom_code_size);
>> +	} else {
>> +		struct slic_mojave_eeprom *mee;
>> +
>> +		mee = (struct slic_mojave_eeprom *)eeprom;
>> +		mac[0] = mee->mac;
>> +		mac[1] = mee->mac2;
>> +		codesize = le16_to_cpu(mee->eeprom_code_size);
>> +	}
>> +
>> +	if (!slic_eeprom_valid(eeprom, codesize)) {
>> +		dev_err(&sdev->pdev->dev, "invalid checksum in eeprom\n");
>> +		err = -EINVAL;
>> +		goto free_eeprom;
>> +	}
>> +	/* set mac address */
>> +	ether_addr_copy(sdev->netdev->dev_addr, mac[devfn]);
>> +free_eeprom:
>> +	dma_free_coherent(&sdev->pdev->dev, SLIC_EEPROM_SIZE, eeprom, paddr);
>> +
>> +	return err;
>> +}
> 
> [...]
> 
>> +static int slic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>> +{
> 
> [...]
> 
>> +	err = register_netdev(dev);
>> +	if (err) {
>> +		dev_err(&pdev->dev, "failed to register net device: %i\n",
>> +			err);
> 
> Could be on one line.

Right, will adjust it.


> Regards,
> Markus
> 

Thanks Markus!

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net-next v3 3/3] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: David Ahern @ 2016-11-28 20:47 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <20161128203752.GC7634@ast-mbp.thefacebook.com>

On 11/28/16 1:37 PM, Alexei Starovoitov wrote:
> On Mon, Nov 28, 2016 at 07:48:50AM -0800, David Ahern wrote:
>> Add a simple program to demonstrate the ability to attach a bpf program
>> to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
>> are created.
>>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ...
>> +static int prog_load(int idx)
>> +{
>> +	struct bpf_insn prog[] = {
>> +		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
>> +		BPF_MOV64_IMM(BPF_REG_3, idx),
>> +		BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, bound_dev_if)),
>> +		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct bpf_sock, bound_dev_if)),
>> +		BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
>> +		BPF_EXIT_INSN(),
>> +	};
>> +
>> +	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
>> +			     "GPL", 0);
>> +}
> 
> the program looks trivial enough :)
> 
> Could you integrate it into iproute2 as well ?

yes, that is the plan. iproute2 can be used for all things vrf. As infra goes into the kernel, support is added to iproute2

> Then the whole vrf management will be easier.
> The user wouldn't even need to be aware that iproute2 sets up
> this program. It will know ifindex and can delete
> the prog when vrf configs change and so on.
> 
> Also please convert this sample into automated test like samples/bpf/*.sh
> we're going to move all of them to tools/testing/selftests/ eventually.
> 

ok

^ permalink raw reply

* Receive offloads, small RCVBUF and zero TCP window
From: Alex Sidorenko @ 2016-11-28 20:49 UTC (permalink / raw)
  To: netdev

One of our customers has met a problem: TCP window closes and stays closed forever even though receive buffer is empty. This problem has been reported for RHEL6.8 and I think that the issue is in __tcp_select_window() subroutine. Comparing sources of RHEL6.8 kernel and the latest upstream kernel (pulled from GIT today), it looks that it should still be present in the latest kernels.

The problem is triggered by the following conditions:

(a) small RCVBUF (24576 in our case), as a result WS=0
(b) mss = icsk->icsk_ack.rcv_mss > MTU

I asked customer to trigger vmcore when the problem occurs to find why window stays closed forever. I can see in vmcore (doing calculations following __tcp_select_window sources):

        windows: rcv=0, snd=65535  advmss=1460 rcv_ws=0 snd_ws=0
        --- Emulating __tcp_select_window ---
          rcv_mss=7300 free_space=18432 allowed_space=18432 full_space=16972
          rcv_ssthresh=5840, so free_space->5840 

So when we reach the test

		if (window <= free_space - mss || window > free_space)
			window = (free_space / mss) * mss;
		else if (mss == full_space &&
			 free_space > window + (full_space >> 1))
			window = free_space;

we have  negative value of (free_space - mss) = -1460

As a result, we do not update the window and it stays zero forever - even though application has read all available data and we have sufficient free_space.


This occurs only due to the fact that we have interface with MTU=1500 (so that mss=1460 is expected), but icsk->icsk_ack.rcv_mss is 5*1460 = 7300.

As a result, "Get the largest window that is a nice multiple of mss" means a multiple of 7300, and this never happens!

All other mss-related values look reasonable:

crash64> struct tcp_sock 0xffff8801bcb8c840  | grep mss
    icsk_sync_mss = 0xffffffff814ce620 , 
      rcv_mss = 7300
  mss_cache = 1460, 
  advmss = 1460, 
    user_mss = 0, 
    mss_clamp = 1460


Now the question is whether is is OK to have icsk->icsk_ack.rcv_mss larger than MTU. I suspect the most important factor is that this host is running under VMWare. VMWare probably optimizes receive offloading dramatically, pushing to us merged SKBs larger than MTU. I have written a tool to print warnings when we have mss > advmss and ran it on my collection of vmcores. Almost in all cases where vmcore was taken on VMWare guest, we have some connections with mss > advmss. I have not found any vmcores showing this high mss value for any non-VMWare vmcore.

Obviously, this is a corner-case problem - it can happen only if we have a small RCVBUF. But I think this needs to be fixed anyway. I am not sure whether having 
icsk->icsk_ack.rcv_mss > MTU is expected. If not, this should be fixed in receiving offload subroutines (LRO?) or maybe VMWare NIC driver.

But if it is OK for NICs to merge received SKBs and present to TCP supersegments (similar to TSO), this needs to be fixed in __tcp_select_window - e.g. if we see a small RCVBUF and large icsk->icsk_ack.rcv_mss, switch to mss_clamp, as it was done in older versions. From __tcp_select_window() comment 

	/* MSS for the peer's data.  Previous versions used mss_clamp
	 * here.  I don't know if the value based on our guesses
	 * of peer's MSS is better for the performance.  It's more correct
	 * but may be worse for the performance because of rcv_mss
	 * fluctuations.  --SAW  1998/11/1
	 */

Regards,
Alex

-- 

------------------------------------------------------------------
Alex Sidorenko	email: asid@hpe.com
ERT  Linux 	Hewlett-Packard Enterprise (Canada)
------------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH net-next 10/11] qede: Add basic XDP support
From: Mintz, Yuval @ 2016-11-28 20:53 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David S. Miller, Linux Kernel Network Developers
In-Reply-To: <CALx6S36kiAiMeZoszx=5uBrUecwCodJx2tg3kL4HBk=4eVMSLg@mail.gmail.com>

> > +       if (act == XDP_PASS)
> > +               return true;
> > +
> > +       /* Count number of packets not to be passed to stack */
> > +       rxq->xdp_no_pass++;
> > +
> > +       switch (act) {
> > +       default:
> > +               bpf_warn_invalid_xdp_action(act);

> XDP_TX is a valid action and in fact some drivers already support that.
> Given that this isn't the first instance of driver not supporting XDP_TX
> I think we need to clear define what means. Personally, I think that we
> shouldn't allow a program to load that  returns XDP_TX but driver does
> not support it. I believe Jesper might be looking into capabilities
> support for XDP to handle that. For the purposes of this patch I'd suggest
> having an XDP_TX case and warn user that an unsupported action
> as opposed to invalid  one was returned.

Patch #11 does add XDP_TX support.
Adding an explicit case with a warning to be removed in the next patch
sounds like a waste to me. But if you think 'future generations' would
benefit from it, sure.
 
> > +       case XDP_ABORTED:
> > +       case XDP_DROP:

^ permalink raw reply

* Re: Receive offloads, small RCVBUF and zero TCP window
From: David Miller @ 2016-11-28 20:54 UTC (permalink / raw)
  To: alexandre.sidorenko; +Cc: netdev
In-Reply-To: <2080597.A38JFJZ1AD@zbook>

From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
Date: Mon, 28 Nov 2016 15:49:26 -0500

> Now the question is whether is is OK to have icsk->icsk_ack.rcv_mss
> larger than MTU.

It absolutely is not OK.

If VMWare wants to receive large frames for batching purposes it must
use GRO or similar to achieve that, not just send vanilla frames into
the stack which are larger than the device MTU.

^ permalink raw reply

* Re: [PATCH net-next v3 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: David Ahern @ 2016-11-28 20:57 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <20161128203252.GB7634@ast-mbp.thefacebook.com>

On 11/28/16 1:32 PM, Alexei Starovoitov wrote:
> On Mon, Nov 28, 2016 at 07:48:49AM -0800, David Ahern wrote:
>> Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
>> BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
>> any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
>> Currently only sk_bound_dev_if is exported to userspace for modification
>> by a bpf program.
>>
>> This allows a cgroup to be configured such that AF_INET{6} sockets opened
>> by processes are automatically bound to a specific device. In turn, this
>> enables the running of programs that do not support SO_BINDTODEVICE in a
>> specific VRF context / L3 domain.
>>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ...
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 1f09c521adfe..808e158742a2 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -408,7 +408,7 @@ struct bpf_prog {
>>  	enum bpf_prog_type	type;		/* Type of BPF program */
>>  	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
>>  	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
>> -	unsigned int		(*bpf_func)(const struct sk_buff *skb,
>> +	unsigned int		(*bpf_func)(const void *ctx,
>>  					    const struct bpf_insn *filter);
> 
> Daniel already tweaked it. pls rebase.

ack

> 
>> +static const struct bpf_func_proto *
>> +cg_sock_func_proto(enum bpf_func_id func_id)
>> +{
>> +	return NULL;
>> +}
> 
> if you don't want any helpers, just don't set .get_func_proto.
> See check_call() in verifier.

ack.

> Though why not allow socket filter like helpers that
> sk_filter_func_proto() provides?
> tail call, bpf_trace_printk, maps are useful things that you get for free.
> Developing programs without bpf_trace_printk is pretty hard.

this use case was trivial enough, but in general I get your point will use sk_filter_func_proto.

> 
>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> index 5ddf5cda07f4..24d2550492ee 100644
>> --- a/net/ipv4/af_inet.c
>> +++ b/net/ipv4/af_inet.c
>> @@ -374,8 +374,18 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
>>  
>>  	if (sk->sk_prot->init) {
>>  		err = sk->sk_prot->init(sk);
>> -		if (err)
>> +		if (err) {
>> +			sk_common_release(sk);
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	if (!kern) {
>> +		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
> 
> i guess from vrf use case point of view this is the best place,
> since so_bindtodevice can still override it,
> but thinking little bit into other use case like port binding
> restrictions and port rewrites can we move it into inet_bind ?

Deferring to inet_bind won't work for a number of use cases (e.g., udp, raw).

> My understanding nothing will be using bound_dev_if until that
> time, so we can set it there?

And yes, I do want to allow a sufficiently privileged process to override the inherited setting. For example, shell is management vrf cgroup and root user wants to run a program that sends packets out a data plane vrf using an option built into the program. The sequence is:

1. socket - inherits sk_bound_dev_if from bpf program attached to mgmt cgroup

2. setsockopt( new vrf )

3. connect - lookups to remote address use vrf from step 2.

Thanks for the review.

^ permalink raw reply

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Lebrun @ 2016-11-28 20:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161128.153209.2135257061368558724.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 635 bytes --]

On 11/28/2016 09:32 PM, David Miller wrote:
> When I was working on the routing cache removal in ipv4 I compared
> using a stupid O(1) hash lookup of the FIB entries vs. the O(log n)
> fib_trie stuff actually in use.
> 
> It did make a difference.
> 
> This is a lookup that can be invoked 20 million times per second or
> more.
> 
> Every cycle matters.
> 
> We already have a lot of trouble getting under the cycle budget one
> has for routing at wire speed for very high link rates, please don't
> make it worse.

OK, so O(1) mandatory. I will continue in that direction then.

Thanks for the feedback

David


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 163 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 1/3] ethtool: (uapi) Add ETHTOOL_PHY_LOOPBACK to PHY tunables
From: Florian Fainelli @ 2016-11-28 21:01 UTC (permalink / raw)
  To: Andrew Lunn, Allan W. Nielsen; +Cc: netdev, raju.lakkaraju
In-Reply-To: <20161128202114.GM17704@lunn.ch>

On 11/28/2016 12:21 PM, Andrew Lunn wrote:
> On Mon, Nov 28, 2016 at 08:23:06PM +0100, Allan W. Nielsen wrote:
>> Hi Andrew and Florian,
>>
>> On 28/11/16 15:14, Andrew Lunn wrote:
>>> On Mon, Nov 28, 2016 at 02:24:30PM +0100, Allan W. Nielsen wrote:
>>>> From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
>>>>
>>>> 3 types of PHY loopback are supported.
>>>> i.e. Near-End Loopback, Far-End Loopback and External Loopback.
>>> Is this integrated with ethtool --test? You only want the PHY to go
>>> into loopback mode when running ethtool --test external_lb or maybe
>>> ethtool --test offline.
>> There are other use-cases for enabling PHY loopback:
>>
>> 1) If the PHY is connected to a switch then a loop-port is sometime
>>    used to "force/enable" one or more extra pass through the switch
>>    core. This "hack" can sometime be used to achieve new functionality
>>    with existing silicon.
> 
> With Linux, switches are managed via switchdev, or DSA. You will have
> to teach this infrastructure that something really odd is going on
> with one of its ports before you do anything like this in the PHY
> layer. I suggest you leave this use case alone until somebody
> really-really wants it. From my knowledge of the Marvell DSA driver,
> this is not easy.

Agree with Andrew here, this particular use case with switches does not
need to be solved now, but if we imagine we need to support that,
chances are that we will want the network device as a configuration
entry point, more than the PHY device itself.

> 
>> 2) Existing user-space application may expect to be able to do the
>>    testing on its own (generate/validate the test traffic).
> 
> Please can you reference some existing user-space application and the
> kernel API it uses to put the PHY into loopback mode?
> 
>> We are always happy to integrate with any existing functionality, but
>> as I understand "ethtool --test" then intention is to perform a test
>> and then bring back the PHY in to a "normal" state (I may be
>> wrong...).
> 
> Correct.
> 
>> The idea with this patch is to allow configuring loopback more
>> "permanently" (userspace can decide when to activate and when to
>> de-activate). I should properly have made that clear in the cover
>> letter.
> 
> Leaving it in loopback is a really bad idea. I've spent days once
> working out why an Ethernet did not work. Turned out the PHY powered
> up in loopback mode, and the embedded OS running on it did not
> initialise it to sensible defaults on probe. Packets we going out,
> dhcp server was replying but all incoming packets were discarded.
> 
> It is really not obvious when everything looks O.K, but nothing works,
> because the PHY is in loopback. There needs to be a big red flag to
> warn you.
> 
> If you really do what to do this, you should look at NETIF_F_LOOPBACK
> and consider how this concept can be applied at the PHY, not the MAC.
> But you need the full concept, so you see things like:
> 
> 2: eth0: <PHY_LOOPBACK,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
>     link/ether 80:ee:73:83:60:27 brd ff:ff:ff:ff:ff:ff
> 
> Humm, i've no idea how you actually enable the MAC loopback
> NETIF_F_LOOPBACK represents. I don't see anything with ip link set.

I am afraid you lost me on this, NETIF_F_LOOPBACK is a netdev_features_t
bit, so it tells ethtool that this is a potential feature to be turned
on with ethtool -K <iface>. The semantics of this loopack feature are
not defined AFAICT, but a reasonable behavior from the driver is to put
itself in a mode where packets send by a socket-level application are
looped through the Ethernet adapter itself. Whether this happens at the
DMA level, the MII signals, or somewhere in the PHY, is driver specific
unfortunately.

Now, there is another way to toggle a loopback for a given Ethernet
adapter which is to actually set IFF_LOOPBACK in dev->flags for this
interface. Some drivers seem to be able to properly react to that as
well, although I see no way this can be done looking at the iproute2 or
ifconfig man pages..
-- 
Florian

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox