Netdev List
* Re: FIXED_PHY is broken...
From: David Miller @ 2014-12-16 20:15 UTC (permalink / raw)
  To: netdev; +Cc: f.fainelli
In-Reply-To: <20141216.143027.1243798311589198629.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 16 Dec 2014 14:30:27 -0500 (EST)

> From: David Miller <davem@davemloft.net>
> Date: Tue, 16 Dec 2014 11:25:34 -0500 (EST)
> 
>> I get this now when I run oldconfig:
>> 
>> warning: (NET_DSA_BCM_SF2 && BCMGENET && SYSTEMPORT) selects FIXED_PHY which has unmet direct dependencies (NETDEVICES && PHYLIB=y)
> 
> Here is how I'm going to fix this.
> 
> FIXED_PHY needs to be allowed to be modular, and built even if PHYLIB is
> modular too.
> 
> ====================
> [PATCH] net: Allow FIXED_PHY to be modular.

Ok, it takes a little more work, the problem is that there is already
a module named fixed.ko in the regulator layer, so we have to rename
this to something else.

====================
[PATCH] net: Allow FIXED_PHY to be modular.

Otherwise we get things like:

warning: (NET_DSA_BCM_SF2 && BCMGENET && SYSTEMPORT) selects FIXED_PHY which has unmet direct dependencies (NETDEVICES && PHYLIB=y)

In order to make this work we have to rename fixed.c to fixed_phy.c
because the regulator drivers already have a module named "fixed.o".

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/phy/Kconfig                  | 4 ++--
 drivers/net/phy/Makefile                 | 2 +-
 drivers/net/phy/{fixed.c => fixed_phy.c} | 0
 include/linux/phy_fixed.h                | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)
 rename drivers/net/phy/{fixed.c => fixed_phy.c} (100%)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index b4b0f80..a3c251b 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -119,8 +119,8 @@ config MICREL_PHY
 	  Supports the KSZ9021, VSC8201, KS8001 PHYs.
 
 config FIXED_PHY
-	bool "Driver for MDIO Bus/PHY emulation with fixed speed/link PHYs"
-	depends on PHYLIB=y
+	tristate "Driver for MDIO Bus/PHY emulation with fixed speed/link PHYs"
+	depends on PHYLIB
 	---help---
 	  Adds the platform "fixed" MDIO Bus to cover the boards that use
 	  PHYs that are not connected to the real MDIO bus.
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index eb3b18b..501ea76 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -17,7 +17,7 @@ obj-$(CONFIG_BCM87XX_PHY)	+= bcm87xx.o
 obj-$(CONFIG_ICPLUS_PHY)	+= icplus.o
 obj-$(CONFIG_REALTEK_PHY)	+= realtek.o
 obj-$(CONFIG_LSI_ET1011C_PHY)	+= et1011c.o
-obj-$(CONFIG_FIXED_PHY)		+= fixed.o
+obj-$(CONFIG_FIXED_PHY)		+= fixed_phy.o
 obj-$(CONFIG_MDIO_BITBANG)	+= mdio-bitbang.o
 obj-$(CONFIG_MDIO_GPIO)		+= mdio-gpio.o
 obj-$(CONFIG_NATIONAL_PHY)	+= national.o
diff --git a/drivers/net/phy/fixed.c b/drivers/net/phy/fixed_phy.c
similarity index 100%
rename from drivers/net/phy/fixed.c
rename to drivers/net/phy/fixed_phy.c
diff --git a/include/linux/phy_fixed.h b/include/linux/phy_fixed.h
index f2ca1b4..7e75bfe 100644
--- a/include/linux/phy_fixed.h
+++ b/include/linux/phy_fixed.h
@@ -11,7 +11,7 @@ struct fixed_phy_status {
 
 struct device_node;
 
-#ifdef CONFIG_FIXED_PHY
+#if IS_ENABLED(CONFIG_FIXED_PHY)
 extern int fixed_phy_add(unsigned int irq, int phy_id,
 			 struct fixed_phy_status *status);
 extern struct phy_device *fixed_phy_register(unsigned int irq,
-- 
1.7.11.7
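
Why the header guard has to change along with the Kconfig type can be shown in a few lines. This is a minimal sketch, not kernel code: it assumes FIXED_PHY=m, for which the generated autoconf.h defines CONFIG_FIXED_PHY_MODULE rather than CONFIG_FIXED_PHY, and it spells out by hand the condition that IS_ENABLED() stands for. The SEEN_BY_* macros and the two helper functions are made up for illustration.

```c
/* Simulated autoconf.h fragment for FIXED_PHY=m (assumption for this sketch):
 * a modular tristate option defines CONFIG_<NAME>_MODULE, not CONFIG_<NAME>. */
#define CONFIG_FIXED_PHY_MODULE 1

/* Old guard: only true when the option is built in (=y). */
#ifdef CONFIG_FIXED_PHY
#define SEEN_BY_IFDEF 1
#else
#define SEEN_BY_IFDEF 0
#endif

/* What IS_ENABLED(CONFIG_FIXED_PHY) amounts to: built in OR modular. */
#if defined(CONFIG_FIXED_PHY) || defined(CONFIG_FIXED_PHY_MODULE)
#define SEEN_BY_IS_ENABLED 1
#else
#define SEEN_BY_IS_ENABLED 0
#endif

/* Hypothetical helpers so the result can be inspected at run time. */
int seen_by_ifdef(void)
{
	return SEEN_BY_IFDEF;
}

int seen_by_is_enabled(void)
{
	return SEEN_BY_IS_ENABLED;
}
```

With the plain #ifdef, a modular build would fall through to the stub side of the header even though the real functions exist in the module, which is why the guard is switched together with the bool-to-tristate change.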


* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 20:13 UTC (permalink / raw)
  To: David Miller; +Cc: alexei.starovoitov, netdev
In-Reply-To: <20141216.150354.64901094367530710.davem@davemloft.net>



On 12/16/14, 3:03 PM, David Miller wrote:

> You shouldn't need to export any symbols.

As a point of clarification - is it acceptable to export symbols for use
with in-tree modules such as tcp_htcp? We are more than willing to do
the work required to bring this in line with best practices.


* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Samudrala, Sridhar @ 2014-12-16 20:11 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: John Fastabend, Jamal Hadi Salim, Hubert Sokolowski,
	netdev@vger.kernel.org, Vlad Yasevich
In-Reply-To: <54908856.1070708@cumulusnetworks.com>


On 12/16/2014 11:30 AM, Roopa Prabhu wrote:
> On 12/16/14, 9:21 AM, Samudrala, Sridhar wrote:
>>
>> On 12/16/2014 8:35 AM, John Fastabend wrote:
>>>
>>>> Is there no way to get the unicast/multicast mac addresses for such
>>>> a driver?
>>>
>>> You can almost infer it from ip link by looking at all the stacked
>>> drivers and figuring out how the addresses are propagated down. Then
>>> look at the routes and figure out multicast address. But other than
>>> the fdb dump mechanism I don't think there is anything.
>>
>> It looks like we can get the device specific unicast/multicast mac 
>> addresses via 'ip maddr' too.
> If I remember correctly, 'ip maddr' was only for the multicast list, and
> there was no way to dump the unicast list until bridge self was
> introduced.
> The only way to dump unicast addresses today is by using `bridge
> fdb show self`.
Yes. 'ip maddr show' only lists the multicast macs as the name suggests.
I stand corrected.
Maybe we need 'ip uaddr show' to list unicast macs instead of
overloading 'bridge fdb show' to show unicast lists.

Thanks
Sridhar


* [PATCH 0/2] ip_tunnel fixes
From: Thomas Graf @ 2014-12-16 20:05 UTC (permalink / raw)
  To: davem; +Cc: netdev, therbert

Thomas Graf (2):
  ip_tunnel: Add sanity checks to ip_tunnel_encap_add_ops()
  ip_tunnel: Add missing validation of encap type to
    ip_tunnel_encap_setup()

 net/ipv4/ip_tunnel.c | 9 +++++++++
 1 file changed, 9 insertions(+)

-- 
1.9.3


* [PATCH 1/2] ip_tunnel: Add sanity checks to ip_tunnel_encap_add_ops()
From: Thomas Graf @ 2014-12-16 20:05 UTC (permalink / raw)
  To: davem; +Cc: netdev, therbert
In-Reply-To: <cover.1418759998.git.tgraf@suug.ch>

The symbols are exported and could be used by external modules.

Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/ipv4/ip_tunnel.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 63e745a..2f498f8 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -514,6 +514,9 @@ const struct ip_tunnel_encap_ops __rcu *
 int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *ops,
 			    unsigned int num)
 {
+	if (num >= MAX_IPTUN_ENCAP_OPS)
+		return -ERANGE;
+
 	return !cmpxchg((const struct ip_tunnel_encap_ops **)
 			&iptun_encaps[num],
 			NULL, ops) ? 0 : -1;
@@ -525,6 +528,9 @@ int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *ops,
 {
 	int ret;
 
+	if (num >= MAX_IPTUN_ENCAP_OPS)
+		return -ERANGE;
+
 	ret = (cmpxchg((const struct ip_tunnel_encap_ops **)
 		       &iptun_encaps[num],
 		       ops, NULL) == ops) ? 0 : -1;
-- 
1.9.3
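
The range check added above can be tried outside the kernel. The following is a userspace sketch under stated assumptions: C11 atomic_compare_exchange_strong() stands in for the kernel's cmpxchg(), and MAX_ENCAP_OPS, struct encap_ops, encap_slots and both functions are hypothetical stand-ins for the names in net/ipv4/ip_tunnel.c.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_ENCAP_OPS 8	/* hypothetical; plays the role of MAX_IPTUN_ENCAP_OPS */

struct encap_ops {
	int id;	/* placeholder member, not part of the real structure */
};

/* One slot per encap type; static storage starts out as all NULL. */
static _Atomic(const struct encap_ops *) encap_slots[MAX_ENCAP_OPS];

/* Claim slot `num` for `ops`; fails if the slot is out of range or taken. */
int encap_add_ops(const struct encap_ops *ops, unsigned int num)
{
	const struct encap_ops *expected = NULL;

	/* Reject the index before it is ever used to address the array. */
	if (num >= MAX_ENCAP_OPS)
		return -ERANGE;

	return atomic_compare_exchange_strong(&encap_slots[num],
					      &expected, ops) ? 0 : -1;
}

/* Release slot `num`; succeeds only if `ops` is the current owner. */
int encap_del_ops(const struct encap_ops *ops, unsigned int num)
{
	const struct encap_ops *expected = ops;

	if (num >= MAX_ENCAP_OPS)
		return -ERANGE;

	return atomic_compare_exchange_strong(&encap_slots[num],
					      &expected, NULL) ? 0 : -1;
}
```

Because the functions are exported, the caller cannot be trusted to pass a valid index, so the check sits in front of the compare-and-swap rather than in each caller.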


* [PATCH 2/2] ip_tunnel: Add missing validation of encap type to ip_tunnel_encap_setup()
From: Thomas Graf @ 2014-12-16 20:05 UTC (permalink / raw)
  To: davem; +Cc: netdev, therbert
In-Reply-To: <cover.1418759998.git.tgraf@suug.ch>

The encap->type comes straight from Netlink. Validate it against
max supported encap types just like ip_encap_hlen() already does.

Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/ipv4/ip_tunnel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 2f498f8..d3e4479 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -573,6 +573,9 @@ int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
 	if (t->encap.type == TUNNEL_ENCAP_NONE)
 		return 0;
 
+	if (t->encap.type >= MAX_IPTUN_ENCAP_OPS)
+		return -EINVAL;
+
 	rcu_read_lock();
 	ops = rcu_dereference(iptun_encaps[t->encap.type]);
 	if (likely(ops && ops->build_header))
-- 
1.9.3


* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: David Miller @ 2014-12-16 20:03 UTC (permalink / raw)
  To: rapier; +Cc: alexei.starovoitov, netdev
In-Reply-To: <54908FAD.5060500@psc.edu>

From: rapier <rapier@psc.edu>
Date: Tue, 16 Dec 2014 15:01:49 -0500

> On 12/16/14, 2:09 PM, David Miller wrote:
>> It's a non-starter.
> 
> Understood. We're in the process of reviewing which symbols are
> required by the DLKM we've provided and which are required by in-tree
> modules (e.g. tcp_estats_enabled is required by tcp_htcp).

You shouldn't need to export any symbols.


* Re: [bisected] tg3 broken in 3.18.0?
From: Marcelo Ricardo Leitner @ 2014-12-16 20:02 UTC (permalink / raw)
  To: Michael Chan
  Cc: Bjorn Helgaas, Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan
In-Reply-To: <1418759684.4248.12.camel@LTIRV-MCHAN1.corp.ad.broadcom.com>

On 16-12-2014 17:54, Michael Chan wrote:
> On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
>> It's a
>> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
>> Gigabit Ethernet PCI Express
>> over here
>>
>> I put a WARN_ON(1) after those printks, and this is what I got:
>>
>> [    1.550640] pci 0000:02:00.0: 1st 1 1
>> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
>> [    1.550645] ------------[ cut here ]------------
>> [    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
>> [    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
>> [    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
>> [    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
>> [    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
>> [    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
>> [    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
>> [    1.550669] Call Trace:
>> [    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
>> [    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
>> [    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
>> [    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
>> [    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
>> [    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
>> [    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
>> [    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3]
>
> So does it work if you use a non-zero crs_timeout?  The driver has
> called tg3_halt() which may affect configuration read responses.  I need
> to check with the hardware team to see if the 5722 will return CRS in
> this scenario.

Sorry, I replied to the thread that you weren't in yet.
It didn't.
http://thread.gmane.org/gmane.linux.network/342566/focus=37932

   Marcelo


* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 20:01 UTC (permalink / raw)
  To: David Miller, alexei.starovoitov; +Cc: netdev
In-Reply-To: <20141216.140919.44020047339373813.davem@davemloft.net>



On 12/16/14, 2:09 PM, David Miller wrote:
> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Date: Tue, 16 Dec 2014 10:24:31 -0800
>
>> imo that is very questionable design choice.
>> export a lot of in-kernel bits to be used by out-of-tree kernel module?
>
> It's a non-starter.

Understood. We're in the process of reviewing which symbols are required
by the DLKM we've provided and which are required by in-tree modules
(e.g. tcp_estats_enabled is required by tcp_htcp). In the meantime, I do
want to stress that the KIS is distinct from the control and management
functions in this patch. We'd like to make sure that problems with this
patch aren't negating the value of the instrument set.


* Re: [bisected] tg3 broken in 3.18.0?
From: Michael Chan @ 2014-12-16 19:54 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Bjorn Helgaas, Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan
In-Reply-To: <54907300.9050902@gmail.com>

On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
> It's a 
> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
> Gigabit Ethernet PCI Express
> over here
> 
> I put a WARN_ON(1) after those printks, and this is what I got:
> 
> [    1.550640] pci 0000:02:00.0: 1st 1 1
> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
> [    1.550645] ------------[ cut here ]------------
> [    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
> [    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
> [    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
> [    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
> [    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
> [    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
> [    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
> [    1.550669] Call Trace:
> [    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
> [    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
> [    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
> [    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
> [    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
> [    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
> [    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
> [    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3] 

So does it work if you use a non-zero crs_timeout?  The driver has
called tg3_halt() which may affect configuration read responses.  I need
to check with the hardware team to see if the 5722 will return CRS in
this scenario.
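
To make the question about crs_timeout concrete, here is a simplified userspace model of a vendor-ID read that honours Configuration Request Retry Status. It is a sketch and not the code in drivers/pci/probe.c: struct fake_dev, fake_read_vendor_id() and read_dev_vendor_id() are hypothetical, the sleep is only counted, and it assumes CRS software visibility is enabled so that a CRS completion shows up as 0x0001 in the vendor ID field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model: answers with CRS `crs_left` times, then `id`. */
struct fake_dev {
	int crs_left;
	uint32_t id;
};

static uint32_t fake_read_vendor_id(struct fake_dev *dev)
{
	if (dev->crs_left > 0) {
		dev->crs_left--;
		return 0xffff0001u;	/* vendor ID field 0x0001 signals CRS */
	}
	return dev->id;
}

/* Returns true once a real ID is read, false if the device looks absent
 * or keeps answering CRS until the timeout budget is used up. */
bool read_dev_vendor_id(struct fake_dev *dev, uint32_t *l, int crs_timeout_ms)
{
	int delay_ms = 1;
	int waited_ms = 0;

	*l = fake_read_vendor_id(dev);
	if (*l == 0xffffffffu || *l == 0x00000000u)
		return false;

	while ((*l & 0xffff) == 0x0001) {
		/* With a timeout of 0 the first CRS answer is already fatal. */
		if (waited_ms >= crs_timeout_ms)
			return false;
		waited_ms += delay_ms;	/* a real implementation would sleep here */
		delay_ms *= 2;		/* back off exponentially between retries */
		*l = fake_read_vendor_id(dev);
	}
	return true;
}
```

In this model a caller that passes a timeout of 0 reports the device as not present on the first CRS answer, while a non-zero timeout gives the device time to come back after a reset.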


* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Roopa Prabhu @ 2014-12-16 19:30 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: John Fastabend, Jamal Hadi Salim, Hubert Sokolowski,
	netdev@vger.kernel.org, Vlad Yasevich
In-Reply-To: <54906A11.6030701@intel.com>

On 12/16/14, 9:21 AM, Samudrala, Sridhar wrote:
>
> On 12/16/2014 8:35 AM, John Fastabend wrote:
>>
>>> Is there no way to get the unicast/multicast mac addresses for such
>>> a driver?
>>
>> You can almost infer it from ip link by looking at all the stacked
>> drivers and figuring out how the addresses are propagated down. Then
>> look at the routes and figure out multicast address. But other than
>> the fdb dump mechanism I don't think there is anything.
>
> It looks like we can get the device specific unicast/multicast mac 
> addresses via 'ip maddr' too.
If I remember correctly, 'ip maddr' was only for the multicast list, and
there was no way to dump the unicast list until bridge self was introduced.
The only way to dump unicast addresses today is by using `bridge fdb
show self`.


* Re: FIXED_PHY is broken...
From: David Miller @ 2014-12-16 19:30 UTC (permalink / raw)
  To: netdev; +Cc: f.fainelli
In-Reply-To: <20141216.112534.1684849738404064350.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 16 Dec 2014 11:25:34 -0500 (EST)

> I get this now when I run oldconfig:
> 
> warning: (NET_DSA_BCM_SF2 && BCMGENET && SYSTEMPORT) selects FIXED_PHY which has unmet direct dependencies (NETDEVICES && PHYLIB=y)

Here is how I'm going to fix this.

FIXED_PHY needs to be allowed to be modular, and built even if PHYLIB is
modular too.

====================
[PATCH] net: Allow FIXED_PHY to be modular.

Otherwise we get things like:

warning: (NET_DSA_BCM_SF2 && BCMGENET && SYSTEMPORT) selects FIXED_PHY which has unmet direct dependencies (NETDEVICES && PHYLIB=y)

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/phy/Kconfig   | 4 ++--
 include/linux/phy_fixed.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index b4b0f80..a3c251b 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -119,8 +119,8 @@ config MICREL_PHY
 	  Supports the KSZ9021, VSC8201, KS8001 PHYs.
 
 config FIXED_PHY
-	bool "Driver for MDIO Bus/PHY emulation with fixed speed/link PHYs"
-	depends on PHYLIB=y
+	tristate "Driver for MDIO Bus/PHY emulation with fixed speed/link PHYs"
+	depends on PHYLIB
 	---help---
 	  Adds the platform "fixed" MDIO Bus to cover the boards that use
 	  PHYs that are not connected to the real MDIO bus.
diff --git a/include/linux/phy_fixed.h b/include/linux/phy_fixed.h
index f2ca1b4..7e75bfe 100644
--- a/include/linux/phy_fixed.h
+++ b/include/linux/phy_fixed.h
@@ -11,7 +11,7 @@ struct fixed_phy_status {
 
 struct device_node;
 
-#ifdef CONFIG_FIXED_PHY
+#if IS_ENABLED(CONFIG_FIXED_PHY)
 extern int fixed_phy_add(unsigned int irq, int phy_id,
 			 struct fixed_phy_status *status);
 extern struct phy_device *fixed_phy_register(unsigned int irq,
-- 
1.7.11.7


* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-16 19:25 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: John Fastabend, netdev@vger.kernel.org, Jamal Hadi Salim,
	Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB473D@ORSMSX101.amr.corp.intel.com>

On 12/16/14, 9:29 AM, Arad, Ronen wrote:
>
>> -----Original Message-----
>> From: John Fastabend [mailto:john.fastabend@gmail.com]
>> Sent: Tuesday, December 16, 2014 6:42 PM
>> To: Arad, Ronen
>> Cc: Roopa Prabhu; netdev@vger.kernel.org; Jamal Hadi Salim; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/16/2014 03:01 AM, Arad, Ronen wrote:
>>> In my reply (inline) I elaborate on the validity of bridge-less and offloaded-
>> bridge models for L2 switching.
>>> I also discuss the implied necessity of a bridge device for L3 routing and
>> potential issues with the upcoming FIB offloading proposal.
>>>> -----Original Message-----
>>>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>>>> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>>>> Sent: Tuesday, December 16, 2014 3:21 AM
>>>> To: Arad, Ronen
>>>> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri
>>>> Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>>>> stephen@networkplumber.org; linville@tuxdriver.com;
>>>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>>>> gospo@cumulusnetworks.com
>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>>>> del bridge port attributes
>>>>
>>>> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>>>>>> -----Original Message-----
>>>>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>>>>>> Sent: Tuesday, December 16, 2014 1:28 AM
>>>>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>>>>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>>>>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>>>>>> vyasevic@redhat.com; davem@davemloft.net;
>>>> shm@cumulusnetworks.com;
>>>>>> gospo@cumulusnetworks.com
>>>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set
>>>>>> and del bridge port attributes
>>>>>>
>>>>>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>>>>>> -----Original Message-----
>>>>>>> The behavior of a driver could depend on the presence of a bridge
>>>>>>> and
>>>>>> features such as FDB LEARNING and LEARNING_SYNC.
>>>>>>
>>>>>> Indeed, those are bridge attributes.
>>>>>>
>>>>>>> A switch port driver which is not enslaved to a bridge might need
>>>>>>> to implement VLAN-aware FDB within the driver and report its
>>>>>>> content to
>>>>>>> user-
>>>>>> space using ndo_fdb_dump.
>>>>>>     >
>>>>>>> A switch port driver which is enslaved to a bridge could do with
>>>>>>> only pass through for static FDB configuration
>>>>>>     > to the HW when LEARNING_SYNC is configured. FDB reporting to
>>>>>> user- space and soft aging are left to the bridge module FDB.
>>>>>>> Such driver, without LEARNING_SYNC could still avoid maintaining
>>>>>>> in-driver
>>>>>> FDB as long as it could dump the HW FDB on demand.
>>>>>>> LEARNING_SYNC also requires periodic updates of freshness
>>>>>>> information
>>>>>> from the driver to the bridge module.
>>>>>>
>>>>>> If you have an fdb - shouldn't that be exposed only if you have a
>>>>>> bridge abstraction exposed? i.e. that's where the Linux tools would work.
>>>>> I'm trying to find out what are the opinions of other people in the
>>>>> netdev
>>>> list.
>>>>> John has clearly stated that he'd like to see full L2 switching
>>>>> functionality
>>>> (at least) supported without making a bridge device mandatory.
>>>>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already
>>>>> support that
>>>> with proper setting of SELF/MASTER flags by iproute2.
>>>>> I see the value in supporting both approaches (bridge device
>>>>> mandatory and bridge device optional). If the choice is left to
>>>>> user-driven policy decision, we need to document both use models and
>>>>> map traditional L2 features to each model.
>>>>> The L2 offloading (or NETFUNC as it is currently called), which is
>>>>> being discussed on a different patch-set, is only needed when a
>>>>> bridge device is used.
>>>>> Without a bridge device, all configuration has to be targeted at the
>>>>> switch port driver directly using the SELF flag. FDB remains
>>>>> relevant and it is used to configure static MAC table entries and dump
>> the HW MAC table.
>>>> Your understanding is right here. So far all patches have kept both
>>>> models in mind.
>>>
>>>>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or
>>>>> even higher), there is a gap between what the HW is doing and what
>>>>> is explicitly modeled in Linux.
>>>
>>>> Can you elaborate more here? We use the linux model to accelerate a
>>>> multi-layer (l2-l3) switch today. There may be a few gaps, but these
>>>> gaps can be closed by having equivalent functionality in the software path.
>>> What I meant is that without a bridge device the HW switch is seen as a
>> collection of independent switch ports. Typical switch ASIC performs L2
>> switching by default. This is not expressed explicitly in Linux without a bridge
>> device.
>>> The SELF flag is used to target typical bridge port and bridge configuration
>> at a switch port device.
>>> Without an explicit bridge device, bridge attributes have to be
>>> directed at an arbitrary port (any port could represent the entire switch)
>> and interpreted by the switch port driver as intended for the entire switch
>> (this includes attributes like STP etc.) Each switch port device driver has to
>> implement similar functionality (i.e. all bridge and fdb related ndos)
>> independently without common functionality shared (e.g. FDB, soft aging).
>>> It is a valid use model and could avoid the complexity of having to deal with
>> the presence of both SW and HW bridge and to deal with explicit offloading
>> of data-path.
>>> I was trying to find out whether the intention was to continue and support
>> both bridge-less and offloaded-bridge models and leave it to the end-user to
>> choose the desirable model at configuration time.
>>> This would require dual support in the switch port driver in order to have
>> best user experience across multiple switch ASICs or other kinds of devices.
>> I'm still missing why there is duplicate implementations in the driver.
>> If the driver implements the set of ndo ops why should it care who calls
>> them? I think you tried to explain this already but I'm not seeing it.
>>
> Let's consider a bridge property. I'll use the default PVID attribute as an example. This is currently configurable by sysfs only, and netlink support for it is still due. Let's assume for our discussion that a DEFAULT_PVID attribute will be added as a bridge attribute within the AFSPEC nested attribute of the AF_BRIDGE SETLINK message.
> When a bridge device is present, this attribute is processed by the bridge module and saved as default_pvid field in net_bridge structure. When a switch port is enslaved to a bridge, the bridge driver creates a net_bridge_port instance and assigns it a pvid inherited from the default_pvid attribute of the bridge. Setting the pvid for a new enslaved switch port is not done via netlink. It only applies to the net_bridge_port structure which is internal to the bridge module. Offloading this to HW is not addressed with current bridge offloading.
Correct. And that's where I am going with my NETFUNC_OFFLOAD series. 
Hooks in the bridge driver to communicate all changes to the switch driver.
The default pvid will/should be also communicated to the switch driver. 
And the switch driver does not need to learn the default pvid for the 
bridge.
The callbacks (ndo's) into the switch driver will indicate the pvid for 
the switch port (just like all the userspace notifications for the 
bridge driver do today).

>
> When a bridge device is not used, the DEFAULT_PVID will be targeted using the SELF flag to any of the switch ports. The driver will recognize that as a bridge port and will need to maintain some switch global structure similar to net_bridge where it could save the default_pvid. The driver, knowing that the switch port is not enslaved to a bridge, will have to replicate the same functionality. In the HW case, it will have to configure default VLAN on all the switch ports.
> This is different from the yet to be defined way of propagating default PVID from a bridge device to offloaded bridge ports.
>
> Another example is STP. STP attributes are bridge attributes which are not offloaded when a bridge device is present. The bridge module handles STP protocol internally. Without bridge device, STP attributes have to be targeted at a switch port device and the driver should save them in driver-specific structures and have proprietary implementation of STP (as the one in the bridge module is not used).


Yes, for all the cases where you are trying to bypass the bridge driver,
the switch driver will need to implement similar functionality. So, for
switch ASIC offloads there is a disadvantage to not using the bridge driver.
>
>   
>> [...]
>>
>> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa might have
>> worked some of it out.
>>
>> --
>> John Fastabend         Intel Corporation


* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: B Viswanath @ 2014-12-16 19:23 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: John Fastabend, netdev@vger.kernel.org, Roopa Prabhu,
	Jamal Hadi Salim, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB473D@ORSMSX101.amr.corp.intel.com>

Hi,

This is my first email on this thread, and on this list. My apologies
if I have not understood something correctly. I would like to
participate in this discussion, which is one of the reasons I joined
this list recently. Some feedback inline below.

On 16 December 2014 at 22:59, Arad, Ronen <ronen.arad@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: John Fastabend [mailto:john.fastabend@gmail.com]
>> Sent: Tuesday, December 16, 2014 6:42 PM
>> To: Arad, Ronen
>> Cc: Roopa Prabhu; netdev@vger.kernel.org; Jamal Hadi Salim; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/16/2014 03:01 AM, Arad, Ronen wrote:
>> >
>> > In my reply (inline) I elaborate on the validity of bridge-less and offloaded-
>> bridge models for L2 switching.
>> >
>> > I also discuss the implied necessity of a bridge device for L3 routing and
>> potential issues with the upcoming FIB offloading proposal.
>> >
>> >> -----Original Message-----
>> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> >> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>> >> Sent: Tuesday, December 16, 2014 3:21 AM
>> >> To: Arad, Ronen
>> >> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri
>> >> Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> >> stephen@networkplumber.org; linville@tuxdriver.com;
>> >> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>> >> gospo@cumulusnetworks.com
>> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>> >> del bridge port attributes
>> >>
>> >> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>> >>>> Sent: Tuesday, December 16, 2014 1:28 AM
>> >>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>> >>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>> >>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>> >>>> vyasevic@redhat.com; davem@davemloft.net;
>> >> shm@cumulusnetworks.com;
>> >>>> gospo@cumulusnetworks.com
>> >>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set
>> >>>> and del bridge port attributes
>> >>>>
>> >>>> On 12/15/14 13:36, Arad, Ronen wrote:
>> >>>>>
>> >>>>>> -----Original Message-----
>> >>>>> The behavior of a driver could depend on the presence of a bridge
>> >>>>> and
>> >>>> features such as FDB LEARNING and LEARNING_SYNC.
>> >>>>
>> >>>> Indeed, those are bridge attributes.
>> >>>>
>> >>>>> A switch port driver which is not enslaved to a bridge might need
>> >>>>> to implement VLAN-aware FDB within the driver and report its
>> >>>>> content to
>> >>>>> user-
>> >>>> space using ndo_fdb_dump.
>> >>>>    >
>> >>>>> A switch port driver which is enslaved to a bridge could do with
>> >>>>> only pass through for static FDB configuration
>> >>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
>> >>>> user- space and soft aging are left to the bridge module FDB.
>> >>>>> Such driver, without LEARNING_SYNC could still avoid maintaining
>> >>>>> in-driver
>> >>>> FDB as long as it could dump the HW FDB on demand.
>> >>>>> LEARNING_SYNC also requires periodic updates of freshness
>> >>>>> information
>> >>>> from the driver to the bridge module.
>> >>>>
>> >>>> If you have an fdb - shouldn't that be exposed only if you have a
>> >>>> bridge abstraction exposed? i.e. that's where the Linux tools would work.
>> >>> I'm trying to find out what are the opinions of other people in the
>> >>> netdev
>> >> list.
>> >>> John has clearly stated that he'd like to see full L2 switching
>> >>> functionality
>> >> (at least) supported without making a bridge device mandatory.
>> >>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already
>> >>> support that
>> >> with proper setting of SELF/MASTER flags by iproute2.
>> >>> I see the value in supporting both approaches (bridge device
>> >>> mandatory and bridge device optional). If the choice is left to
>> >>> user-driven policy decision, we need to document both use models and
>> >>> map traditional L2 features to each model.
>> >>> The L2 offloading (or NETFUNC as it is currently called), which is
>> >>> being discussed on a different patch-set, is only needed when a
>> >>> bridge device is used.
>> >>> Without a bridge device, all configuration has to be targeted at the
>> >>> switch port driver directly using the SELF flag. FDB remains
>> >>> relevant and it is used to configure static MAC table entries and dump
>> the HW MAC table.
>> >
>> >> Your understanding is right here. So far all patches have kept both
>> >> models in mind.
>> >
>> >
>> >>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or
>> >>> even higher), there is a gap between what the HW is doing and what
>> >>> is explicitly modeled in Linux.
>> >
>> >
>> >> Can you elaborate more here ?. We use the linux model to accelerate a
>> >> multi-layer (l2-l3) switch today. There maybe a few gaps, but these
>> >> gaps can be closed by having equivalent functionality in the software path.
>> >
>> > What I meant is that without a bridge device the HW switch is seen as a
>> collection of independent switch ports. Typical switch ASIC performs L2
>> switching by default. This is not expressed explicitly in Linux without a bridge
>> device.
>> > The SELF flag is used to target typical bridge port and bridge configuration
>> at a switch port device.
>> > Without an explicit bridge device, bridge attributes have to be
>> > directed at an arbitrary port (any port could represent the entire switch)
>> and interpreted by the switch port driver as intended for the entire switch
>> (this includes attributes like STP etc.) Each switch port device driver has to
>> implement similar functionality (i.e. all bridge and fdb related ndos)
>> independently without common functionality shared (e.g. FDB, soft aging).
>> > It is a valid use model and could avoid the complexity of having to deal with
>> the presence of both SW and HW bridge and to deal with explicit offloading
>> of data-path.
>> >
>> > I was trying to find out whether the intention was to continue and support
>> both bridge-less and offloaded-bridge models and leave it to the end-user to
>> choose the desirable model at configuration time.
>> > This would require dual support in the switch port driver in order to have
>> best user experience across multiple switch ASICs or other kinds of devices.
>> >
>>
>> I'm still missing why there is duplicate implementations in the driver.
>> If the driver implements the set of ndo ops why should it care who calls
>> them? I think you tried to explain this already but I'm not seeing it.
>>
>
> Let's consider a bridge property. I'll use the default PVID attribute as an example. This is currently configurable by sysfs only, and netlink support for it is still pending. Let's assume for our discussion that a DEFAULT_PVID attribute will be added as a bridge attribute within the AFSPEC nested attribute of the AF_BRIDGE SETLINK message.
> When a bridge device is present, this attribute is processed by the bridge module and saved as default_pvid field in net_bridge structure. When a switch port is enslaved to a bridge, the bridge driver creates a net_bridge_port instance and assigns it a pvid inherited from the default_pvid attribute of the bridge. Setting the pvid for a new enslaved switch port is not done via netlink. It only applies to the net_bridge_port structure which is internal to the bridge module. Offloading this to HW is not addressed with current bridge offloading.
>
> When a bridge device is not used, the DEFAULT_PVID will be targeted using the SELF flag to any of the switch ports. The driver will recognize that as a bridge port and will need to maintain some switch global structure similar to net_bridge where it could save the default_pvid. The driver, knowing that the switch port is not enslaved to a bridge, will have to replicate the same functionality. In the HW case, it will have to configure default VLAN on all the switch ports.
> This is different from the yet to be defined way of propagating default PVID from a bridge device to offloaded bridge ports.
>
> Another example is STP. STP attributes are bridge attributes which are not offloaded when a bridge device is present. The bridge module handles STP protocol internally. Without bridge device, STP attributes have to be targeted at a switch port device and the driver should save them in driver-specific structures and have proprietary implementation of STP (as the one in the bridge module is not used).

In general I feel that the switch-device and port relation should be
that of the 'container-containee'. This is the actual physical
relationship.  Apart from some operations such as vlans and protocol
related, it is tricky to model all operations directly on ports. My
thinking is it is cleaner to have all operations be on switch-device,
which in turn percolates the operations downward, to its contained
ports as applicable.  The offloading is really a property of the
switch device and not individual ports. Similarly the FDB is
maintained by the switch and not the ports. As we extend the current
offloading mechanism to other L2, L3 and other features, we may find
it easier to have a 'switch-device' in place.
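
A minimal sketch of that container-containee idea, using the default PVID
discussed above as the example operation. The toy_switch and toy_port types
and both helpers are hypothetical names made up for illustration; they are
not existing kernel structures or ndos, and real drivers would also have to
program the hardware.

```c
#include <stddef.h>

#define TOY_MAX_PORTS 8

/* Hypothetical per-port state kept by the switch device. */
struct toy_port {
	int pvid;
};

/* Hypothetical switch device: it owns the ports and the switch-wide setting. */
struct toy_switch {
	int default_pvid;
	size_t nports;
	struct toy_port ports[TOY_MAX_PORTS];
};

/* Add a port; it inherits the switch-wide default PVID.
 * Returns the new port index, or -1 if the switch is full.
 */
static int toy_switch_add_port(struct toy_switch *sw)
{
	if (sw->nports >= TOY_MAX_PORTS)
		return -1;
	sw->ports[sw->nports].pvid = sw->default_pvid;
	return (int)sw->nports++;
}

/* Set the default PVID on the switch and percolate it down to every
 * contained port. Returns the number of ports updated.
 */
static size_t toy_switch_set_default_pvid(struct toy_switch *sw, int pvid)
{
	size_t i;

	sw->default_pvid = pvid;
	for (i = 0; i < sw->nports; i++)
		sw->ports[i].pvid = pvid;
	return sw->nports;
}
```

The caller only ever talks to the switch object; whether a port keeps a
locally configured PVID instead of being overwritten is a policy question
this sketch deliberately ignores.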

I am somewhat confused with the notion of bridges though. Many
existing linux-based routers use bridges differently than as a
vlan-broadcast-domain. For example it is common to have eth0.334 and
eth1 in the same bridge. What is being done internally is that the
additional vlan tag 334 (which indicates video traffic, say) is
removed and that video traffic is being bridged to eth1. There is no
default vlan for this bridge.  This is a software bridge. I am not
sure how this can be accomplished if there is a need to associate a
vlan with a bridge.

Thanks
Viswanath


>
>
>> [...]
>>
>> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa might have
>> worked some of it out.
>>
>> --
>> John Fastabend         Intel Corporation
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-16 19:20 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: netdev@vger.kernel.org, Jamal Hadi Salim, John Fastabend,
	Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB15CA@ORSMSX101.amr.corp.intel.com>

On 12/16/14, 3:01 AM, Arad, Ronen wrote:
> In my reply (inline) I elaborate on the validity of bridge-less and offloaded-bridge models for L2 switching.
>
> I also discuss the implied necessity of a bridge device for L3 routing and potential issues with the upcoming FIB offloading proposal.
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>> Sent: Tuesday, December 16, 2014 3:21 AM
>> To: Arad, Ronen
>> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>>>> -----Original Message-----
>>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>>>> Sent: Tuesday, December 16, 2014 1:28 AM
>>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>>>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>>>> gospo@cumulusnetworks.com
>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>>>> del bridge port attributes
>>>>
>>>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>>>> -----Original Message-----
>>>>> The behavior of a driver could depend on the presence of a bridge
>>>>> and
>>>> features such as FDB LEARNING and LEARNING_SYNC.
>>>>
>>>> Indeed, those are bridge attributes.
>>>>
>>>>> A switch port driver which is not enslaved to a bridge might need to
>>>>> implement VLAN-aware FDB within the driver and report its content to
>>>>> user-
>>>> space using ndo_fdb_dump.
>>>>    >
>>>>> A switch port driver which is enslaved to a bridge could do with
>>>>> only pass through for static FDB configuration
>>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
>>>> user- space and soft aging are left to the bridge module FDB.
>>>>> Such driver, without LEARNING_SYNC could still avoid maintaining
>>>>> in-driver
>>>> FDB as long as it could dump the HW FDB on demand.
>>>>> LEARNING_SYNC also requires periodic updates of freshness
>>>>> information
>>>> from the driver to the bridge module.
>>>>
>>>> If you have an fdb - shouldnt that be exposed only if you have a
>>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
>>> I'm trying to find out what are the opinions of other people in the netdev
>> list.
>>> John has clearly stated that he'd like to see full L2 switching functionality
>> (at least) supported without making a bridge device mandatory.
>>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that
>> with proper setting of SELF/MASTER flags by iproute2.
>>> I see the value in supporting both approaches (bridge device mandatory
>>> and bridge device optional). If the choice is left to user-driven policy decision,
>>> we need to document both use models and map traditional L2 features to
>>> each model.
>>> The L2 offloading (or NETFUNC as it is currently called), which is being
>>> discussed on a different patch-set, is only needed when a bridge device is
>>> used.
>>> Without a bridge device, all configuration has to be targeted at the switch
>>> port driver directly using the SELF flag. FDB remains relevant and it is used to
>>> configure static MAC table entries and dump the HW MAC table.
>> Your understanding is right here. So far all patches have kept both models in
>> mind.
>
>>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even
>>> higher), there is a gap between what the HW is doing and what is explicitly
>>> modeled in Linux.
>
>> Can you elaborate more here ?. We use the linux model to accelerate a
>> multi-layer (l2-l3) switch today. There maybe a few gaps, but these gaps can
>> be closed by having equivalent functionality in the software path.
> What I meant is that without a bridge device the HW switch is seen as a collection of independent switch ports. Typical switch ASIC performs L2 switching by default. This is not expressed explicitly in Linux without a bridge device.
> The SELF flag is used to target typical bridge port and bridge configuration at a switch port device.
> Without an explicit bridge device, bridge attributes have to be directed at an arbitrary port (any port could represent the entire switch) and interpreted by the switch port driver as intended for the entire switch (this includes attributes like STP etc.)
> Each switch port device driver has to implement similar functionality (i.e. all bridge and fdb related ndos) independently without common functionality shared (e.g. FDB, soft aging).
> It is a valid use model and could avoid the complexity of having to deal with the presence of both SW and HW bridge and to deal with explicit offloading of data-path.
>
> I was trying to find out whether the intention was to continue and support both bridge-less and offloaded-bridge models and leave it to the end-user to choose the desirable model at configuration time.
I have always been speaking for the offloaded-bridge model because we 
@cumulus work on such a model today.
But the infrastructure had to be generic to support NICs and switch
ASICs. And, from the current state of things both bridge-less and
offloaded-bridge models are supported. Rocker for example allows user to 
operate directly (ndo_bridge_set/getlink) on the switch hw. But it does 
require a bridge.
> This would require dual support in the switch port driver in order to have best user experience across multiple switch ASICs or other kinds of devices.
Agreed. However, if the switch driver wants to support just one model,
it is up to the switch driver to advertise the NETFUNC_OFFLOAD flag and
maybe not honor requests with the self flag set.
Our switch driver will likely do that because we plan to support only 
the offloaded-bridge model.

>>> Without a bridge device, the HW is represented by a set of switch port
>>> devices and the bridging (both control and data planes) takes place only in
>>> the HW and switch port driver.
>>> Each switch port driver has to implement its own FDB as there is no
>>> common shared code among drivers for different HW devices.
>>> Using a bridge device could partially alleviate that, but it comes with a cost.
>>> There is a need to properly implement offloading of both configuration and
>>> data-path. The transmit and receive path in the bridge module should be
>>> somehow bypassed to avoid unnecessary overhead or duplicate packets
>>> coming from both software bridging and HW bridging.
>>>
>>>> What I was referring to was a scenario where I have no interest in the
>>>> fdb despite such hardware capabilities. VLANs are a different issue;
>>>>
>>> VLAN is a fundamental feature of L2 and L3 switching and Linux is unclear
>> about it. Bridge device could model bridging of untagged packets which
>> requires a bridge device for each VLAN and a vlan device on each port that is
>> a member of the bridge's VLAN.
>>> This is different from the behavior and configuration of classic closed-source
>> switches.
>>> An alternative model is VLAN filtering where a bridge is VLAN-aware and
>>> switches tagged traffic. A bridge device represents multiple L2 domains with
>>> VLAN filtering policy that defines the switching rules within each domain.
>> And the linux bridge driver supports both models today.
>>
>>> Forwarding (e.g. L3 routing) is expected across such L2 domains using L3
>> entities.
>>> The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN
>>> filtering model is yet unclear to me.
>> In the vlan filtering bridge model, You can create a vlan device on the bridge
>> for l3 ...
>>
> That's what I'm thinking too (I experimented with such setup using veth interfaces, bridge device, and vlan interfaces). This, however, seems to require an explicit bridge for L3 support.
>
> Looking at the latest code of FIB offloading (not yet submitted to netdev), I noticed that a switch port device is expected as a lower descendent of the FIB destination device.
> This assumption is valid in the per-vlan bridge model where IP address is assigned to the bridge itself.
> This, however, is not consistent with the single multi-VLAN bridge model.
> VLAN interfaces on a bridge look like siblings of the switch port devices on the same bridge. They are not ancestors of the switch ports.
> The L3 domain ends at the bridge sub-interfaces. The only L3 entities are the vlan sub-interfaces on the bridge.
> Those are route next hops and the only possible fib_dev.
> L3 routing is not aware of the switch ports. Route is performed to next hop addresses on one of the vlan interfaces subnets. The actual resolution to a switch port device has to be performed by the neighbor subsystem (ARP/ND).
> It is unclear to me how the FIB offloading will be redirected to an ndo of a switch port device.
I have not looked at the rocker FIB offloading yet. But we have
discussed this a bit at LPC in Dusseldorf. The route can be pushed to
hw after the resolution by the neighbour subsystem, or, with the
introduction of a switch device (to represent the asic), the switch
driver can take care of this.
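
To make the "after the resolution by the neighbour subsystem" option
concrete, here is a rough sketch of the two lookups involved: next hop to
MAC (neighbour table), then MAC plus VLAN to switch port (FDB). The
toy_neigh and toy_fdb tables and toy_resolve_port() are hypothetical and
only model the idea; this is not how rocker or the kernel FIB code is
structured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical neighbour entry: next-hop IP on a VLAN interface -> MAC. */
struct toy_neigh {
	uint32_t ip;
	uint8_t mac[6];
	uint16_t vid;
};

/* Hypothetical FDB entry: (MAC, VLAN) -> switch port index. */
struct toy_fdb {
	uint8_t mac[6];
	uint16_t vid;
	int port;
};

/* Resolve a next hop to the switch port behind it.
 * Returns the port index, or -1 if either lookup fails, in which case
 * the route cannot be pushed to hardware yet.
 */
static int toy_resolve_port(const struct toy_neigh *neigh, size_t n_neigh,
			    const struct toy_fdb *fdb, size_t n_fdb,
			    uint32_t nexthop_ip, uint16_t vid)
{
	size_t i, j;

	for (i = 0; i < n_neigh; i++) {
		if (neigh[i].ip != nexthop_ip || neigh[i].vid != vid)
			continue;
		for (j = 0; j < n_fdb; j++) {
			if (fdb[j].vid == vid &&
			    memcmp(fdb[j].mac, neigh[i].mac, 6) == 0)
				return fdb[j].port;
		}
	}
	return -1;
}
```

The route itself never names a port; the port only appears once both
tables have an answer, which is why the push to hardware has to wait for
neighbour resolution in this model.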

>
>>>>>>> Will the decision about using a bridge device or avoiding it be
>>>>>>> left to the end-user?
>>>>>> Its a user policy decision. Again the offload bit gets us this in a
>>>>>> reasonably configurable way IMO.
>>>>>>
>>>>>>> (This requires switch port drivers to be able to work and provide
>>>>>>> similar functionality in both setups).
>>>>>> Right, but if the drivers "care" who is calling their ndo ops
>>>>>> something is seriously broken. For the driver it should not need to
>>>>>> know anything about the callers so it doesn't matter to the driver
>>>>>> if it's a netlink call from user space or an internal call from
>>>>>> bridge.ko
>>>>> LEARNING_SYNC only makes sense when a switch port driver is enslaved
>>>>> to
>>>> a bridge.
>>>>    > Rocker switch driver indeed monitors upper change notifications
>>>> and keeps track of master bridge presence.
>>>>> So bridge presence is not transparent.
>>>>>
>>>> Agreed - the challenge so far is that people have been fascinated by
>> "switch"
>>>> point of view. I think we are learning and the class device will
>>>> eventually become obvious as useful.
>>>>
>>>> cheers,
>>>> jamal

^ permalink raw reply

* Re: [PATCH net-next 0/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 19:13 UTC (permalink / raw)
  To: netdev
In-Reply-To: <549070C7.5070505@psc.edu>

Correction:

> A git repo is available at http://github.com/rapier/web10g

http://github.com/rapier1/web10g

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: David Miller @ 2014-12-16 19:11 UTC (permalink / raw)
  To: rapier; +Cc: alexei.starovoitov, netdev
In-Reply-To: <549080CC.3050704@psc.edu>

From: rapier <rapier@psc.edu>
Date: Tue, 16 Dec 2014 13:58:20 -0500

> I'm going to review those exports right now.

Please do not top-post.

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: David Miller @ 2014-12-16 19:09 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: rapier, netdev
In-Reply-To: <CAADnVQ+rmBMoDRRu-Dtp7QBMJEbto9nb2zvFEXBCQsqpg6tS9A@mail.gmail.com>

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 16 Dec 2014 10:24:31 -0800

> imo that is very questionable design choice.
> export a lot of in-kernel bits to be used by out-of-tree kernel module?

It's a non-starter.

^ permalink raw reply

* [PATCHv1 net] xen-netfront: use napi_complete() correctly to prevent Rx stalling
From: David Vrabel @ 2014-12-16 18:59 UTC (permalink / raw)
  To: netdev
  Cc: David Vrabel, xen-devel, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Eric Dumazet

After d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less interrupt
masking in NAPI) the napi instance is removed from the per-cpu list
prior to calling the n->poll(), and is only requeued if all of the
budget was used.  This inadvertently broke netfront because netfront
does not use NAPI correctly.

If netfront had not used all of its budget it would do a final check
for any Rx responses and avoid calling napi_complete() if there were
more responses.  It would still return under budget so it would never
be rescheduled.  The final check would also not re-enable the Rx
interrupt.

Additionally, xennet_poll() would also call napi_complete() /after/
enabling the interrupt.  This resulted in a race between the
napi_complete() and the napi_schedule() in the interrupt handler.  The
use of local_irq_save/restore() avoided the race iff the handler is
running on the same CPU but not if it was running on a different CPU.

Fix both of these by always calling napi_complete() if the budget was
not all used, and then calling napi_schedule() if the final check
says there's more work.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/xen-netfront.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 2f0a9ce..22bcb4e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -977,7 +977,6 @@ static int xennet_poll(struct napi_struct *napi, int budget)
 	struct sk_buff_head rxq;
 	struct sk_buff_head errq;
 	struct sk_buff_head tmpq;
-	unsigned long flags;
 	int err;
 
 	spin_lock(&queue->rx_lock);
@@ -1050,15 +1049,11 @@ err:
 	if (work_done < budget) {
 		int more_to_do = 0;
 
-		napi_gro_flush(napi, false);
-
-		local_irq_save(flags);
+		napi_complete(napi);
 
 		RING_FINAL_CHECK_FOR_RESPONSES(&queue->rx, more_to_do);
-		if (!more_to_do)
-			__napi_complete(napi);
-
-		local_irq_restore(flags);
+		if (more_to_do)
+			napi_schedule(napi);
 	}
 
 	spin_unlock(&queue->rx_lock);
-- 
1.7.10.4
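
The poll contract described in the changelog can be modelled in userspace.
The sketch below is only a toy: toy_napi, toy_poll() and toy_run() are
hypothetical stand-ins for the NAPI core, the Rx ring and the driver's poll
routine, and late_arrivals simulates responses that land just before the
final check. It shows the fixed ordering (complete, re-check, reschedule)
under the rule that an instance is dequeued before ->poll() and requeued
by the core only when the whole budget was used.

```c
#include <stdbool.h>

/* Toy stand-in for a NAPI instance plus its Rx ring (hypothetical). */
struct toy_napi {
	bool scheduled;     /* queued for polling */
	int ring_pending;   /* responses waiting in the ring */
	int late_arrivals;  /* responses that arrive before the final check */
};

static void toy_napi_complete(struct toy_napi *n) { n->scheduled = false; }
static void toy_napi_schedule(struct toy_napi *n) { n->scheduled = true; }

/* Poll routine following the fixed ordering from the patch. */
static int toy_poll(struct toy_napi *n, int budget)
{
	int work_done = 0;

	while (work_done < budget && n->ring_pending > 0) {
		n->ring_pending--;
		work_done++;
	}

	if (work_done < budget) {
		toy_napi_complete(n);

		/* Final check: pick up anything that raced in. */
		n->ring_pending += n->late_arrivals;
		n->late_arrivals = 0;
		if (n->ring_pending > 0)
			toy_napi_schedule(n);
	}
	return work_done;
}

/* Toy core loop: dequeue before polling, requeue only on full budget.
 * Returns the total number of responses processed.
 */
static int toy_run(struct toy_napi *n, int budget)
{
	int total = 0;

	while (n->scheduled) {
		int done;

		n->scheduled = false;
		done = toy_poll(n, budget);
		total += done;
		if (done == budget)
			n->scheduled = true;
	}
	return total;
}
```

If toy_poll() skipped the toy_napi_schedule() call and merely declined to
complete, the loop would exit with ring_pending still non-zero, which is
the Rx stall the patch addresses.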

^ permalink raw reply related

* Re: VLAN bridge patches backported to 2.6.x ?
From: Willy Tarreau @ 2014-12-16 18:59 UTC (permalink / raw)
  To: Ajit Warrier; +Cc: netdev
In-Reply-To: <CAKrgqt3rMUnkQChTE+_HDiTHx++bp8MWcQ_xp7-Y3tbFznTKMg@mail.gmail.com>

On Sun, Dec 07, 2014 at 11:25:18AM -0800, Ajit Warrier wrote:
> The VLAN aware bridge was introduced to the Linux mainline on the 3.x kernels:
> 
> http://lwn.net/Articles/538877/
> 
> Has this feature been backported to the 2.6.x kernels by anybody ? The
> official linux 2.6.32.64 does not have this feature.

No, we only backport fixes to 2.6.32, not new features. Such a large
patchset could definitely break working setups for various reasons ranging
from incompatibilities to errors in backports. So that's definitely not
something supposed to happen for 2.6.32.

Regards,
Willy

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 18:58 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev
In-Reply-To: <CAADnVQ+rmBMoDRRu-Dtp7QBMJEbto9nb2zvFEXBCQsqpg6tS9A@mail.gmail.com>

I'm going to review those exports right now.

On 12/16/14, 1:24 PM, Alexei Starovoitov wrote:
> On Tue, Dec 16, 2014 at 9:50 AM, rapier <rapier@psc.edu> wrote:
>> +struct idr tcp_estats_idr;
>> +EXPORT_SYMBOL(tcp_estats_idr);
>> +static int next_id = 1;
>> +DEFINE_SPINLOCK(tcp_estats_idr_lock);
>> +EXPORT_SYMBOL(tcp_estats_idr_lock);
>> +
>> +int tcp_estats_wq_enabled __read_mostly = 0;
>> +EXPORT_SYMBOL(tcp_estats_wq_enabled);
>> +struct workqueue_struct *tcp_estats_wq = NULL;
>> +EXPORT_SYMBOL(tcp_estats_wq);
>> +void (*create_notify_func)(struct work_struct *work);
>> +EXPORT_SYMBOL(create_notify_func);
>> +void (*establish_notify_func)(struct work_struct *work);
>> +EXPORT_SYMBOL(establish_notify_func);
>> +void (*destroy_notify_func)(struct work_struct *work);
>> +EXPORT_SYMBOL(destroy_notify_func);
>> +unsigned long persist_delay = 0;
>> +EXPORT_SYMBOL(persist_delay);
>> +
>> +struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
>> +EXPORT_SYMBOL(tcp_estats_enabled);
> ...
>> +EXPORT_SYMBOL(tcp_estats_create);
> ...
>> +/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu.
>> */
>> +void tcp_estats_free(struct rcu_head *rcu)
> ...
>> +EXPORT_SYMBOL(tcp_estats_free);
>
> imo that is very questionable design choice.
> export a lot of in-kernel bits to be used by out-of-tree kernel module?
>

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Jeff Kirsher @ 2014-12-16 18:56 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, jiri, gospo, Jamal Hadi Salim,
	John Fastabend
In-Reply-To: <1418573945-27840-1-git-send-email-ogerlitz@mellanox.com>

On Sun, Dec 14, 2014 at 8:19 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> The current implementations all use dev_uc_add_excl() and such whose API
> doesn't support vlans, so we can't make it with NICs HW for now.
>
> Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c |    5 +++++
>  net/core/rtnetlink.c                        |    5 +++++
>  2 files changed, 10 insertions(+), 0 deletions(-)
>

For the Intel driver changes...
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Martin Lau @ 2014-12-16 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: Blake Matheny, Laurent Chavey, Yuchung Cheng,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <D0B44739.74E8A%bmatheny@fb.com>

> >On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
> >
> >> I think patches 1 and 3 are good additions, since they establish
> >> few permanent points of instrumentation in tcp stack.
> >> Patches 4-5 look more like use cases of tracepoints established
> >> before. They may feel like simple additions and, no doubt,
> >> they are useful, but since they expose things via tracing
> >> infra they become part of api and cannot be changed later,
> >> when more stats would be needed.
We can consider reusing the events' format (tracing/events/*/format). I think
blktrace.c is using a similar approach in trace-cmd.

> >> I think systemtap like scripting on top of patches 1 and 3
> >> should solve your use case ?
We have quite a few different versions running in production.  It may not
be operationally easy.

> >> Also, have you looked at recent eBPF work?
> >> Though it's not completely ready yet, soon it should
> >> be able to do the same stats collection as you have
> >> in 4/5 without adding permanent pieces to the kernel.
We are keeping an eye on the eBPF work.


> On 12/15/14, 8:03 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
> 
> >So it looks like web10g like interfaces are very often requested by
> >various teams.
> >
> >And we have many different views on how to hack this. I am astonished by
> >number of hacks I saw about this stuff going on.
> >
> >What about a clean way, extending current TCP_INFO, which is both
> >available as a getsockopt() for socket owners and ss/iproute2
> >information for 'external entities'
> >
> >If we consider web10g info needed, then adding a ftrace/eBPF like
> >interface is simply yet another piece of code we need to maintain,
> >and the argument of 'this should cost nothing if not activated' is
> >nonsense since major players need to constantly monitor TCP metrics and
> >behavior.
For the data collecting part, it would be nice to do it in the TCP itself.

Having a getsockopt will be useful for the new application/library to take
advantage of.
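
For reference, this is roughly what the existing getsockopt() path looks
like from the socket owner's side today. It is a minimal sketch:
tcp_info_snapshot() is a hypothetical helper name, while TCP_INFO and
struct tcp_info are the existing interface that any extension would build
on.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* expose struct tcp_info in glibc's <netinet/tcp.h> */
#endif
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Fill *out with the kernel's TCP_INFO snapshot for socket fd.
 * Returns 0 on success, -1 on error with errno set by getsockopt().
 */
static int tcp_info_snapshot(int fd, struct tcp_info *out)
{
	socklen_t len = sizeof(*out);

	memset(out, 0, sizeof(*out));
	return getsockopt(fd, IPPROTO_TCP, TCP_INFO, out, &len);
}
```

After a successful call the caller can read fields such as tcpi_rtt or
tcpi_total_retrans; any additional web10g-style counters would have to be
appended to the structure in a way that keeps older callers working.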

For continuous monitoring/logging, ftrace can provide event-triggered
tracing instead of periodically consulting ss.

Thanks,
--Martin

^ permalink raw reply

* Re: [PATCH net-next 0/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 18:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <549070C7.5070505@psc.edu>

My apologies about the odd formatting in the previous message. Not sure 
what happened with my MUA.

Chris

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: Alexei Starovoitov @ 2014-12-16 18:24 UTC (permalink / raw)
  To: rapier; +Cc: netdev

On Tue, Dec 16, 2014 at 9:50 AM, rapier <rapier@psc.edu> wrote:
> +struct idr tcp_estats_idr;
> +EXPORT_SYMBOL(tcp_estats_idr);
> +static int next_id = 1;
> +DEFINE_SPINLOCK(tcp_estats_idr_lock);
> +EXPORT_SYMBOL(tcp_estats_idr_lock);
> +
> +int tcp_estats_wq_enabled __read_mostly = 0;
> +EXPORT_SYMBOL(tcp_estats_wq_enabled);
> +struct workqueue_struct *tcp_estats_wq = NULL;
> +EXPORT_SYMBOL(tcp_estats_wq);
> +void (*create_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(create_notify_func);
> +void (*establish_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(establish_notify_func);
> +void (*destroy_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(destroy_notify_func);
> +unsigned long persist_delay = 0;
> +EXPORT_SYMBOL(persist_delay);
> +
> +struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
> +EXPORT_SYMBOL(tcp_estats_enabled);
...
> +EXPORT_SYMBOL(tcp_estats_create);
...
> +/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu.
> */
> +void tcp_estats_free(struct rcu_head *rcu)
...
> +EXPORT_SYMBOL(tcp_estats_free);

imo that is very questionable design choice.
export a lot of in-kernel bits to be used by out-of-tree kernel module?

^ permalink raw reply

