Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 net-next] irda: fix link order if IRDA is built into the kernel
From: Greg KH @ 2017-08-29 19:05 UTC (permalink / raw)
  To: David Miller; +Cc: devel, samuel, netdev, linux-kernel, geert, fengguang.wu
In-Reply-To: <20170829.104945.1786969849973428586.davem@davemloft.net>

On Tue, Aug 29, 2017 at 10:49:45AM -0700, David Miller wrote:
> From: Greg KH <gregkh@linuxfoundation.org>
> Date: Tue, 29 Aug 2017 19:46:22 +0200
> 
> > When moving the IRDA code out of net/ into drivers/staging/irda/net, the
> > link order changes when IRDA is built into the kernel.  That causes a
> > kernel crash at boot time as netfilter isn't initialized yet.
> > 
> > To fix this, build and link the irda networking code in the same exact
> > order that it was previously before the move.
> > 
> > Reported-by: kernel test robot <fengguang.wu@intel.com>
> > Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> > Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
> 
> Greg, just change the initializer in IRDA so that it will run
> after subsys_init() when built statically.
> 
> IRDA is definitely not the first pontentially statically built
> thing that needs netlink up and available.

Ok, will do that tomorrow and test it and send you the patch.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH net-next v2 00/10] net: dsa: add generic debugfs interface
From: Arkadi Sharshevsky @ 2017-08-29 19:05 UTC (permalink / raw)
  To: Andrew Lunn, Jiri Pirko
  Cc: Vivien Didelot, netdev, linux-kernel, kernel, David S. Miller,
	Florian Fainelli, Egil Hjelmeland, John Crispin, Woojung Huh,
	Sean Wang, Nikita Yushchenko, Chris Healy, mlxsw
In-Reply-To: <20170829125004.GB22093@lunn.ch>



On 08/29/2017 03:50 PM, Andrew Lunn wrote:
> On Tue, Aug 29, 2017 at 08:25:23AM +0200, Jiri Pirko wrote:
>> Mon, Aug 28, 2017 at 10:08:34PM CEST, andrew@lunn.ch wrote:
>>>> I see this overlaps a lot with DPIPE. Why won't you use that to expose
>>>> your hw state?
>>>
>>> We took a look at dpipe and i talked to you about using it for this
>>> sort of thing at netconf/netdev. But dpipe has issues displaying the
>>> sort of information we have. I never figured out how to do two
>>> dimensional tables. The output of the dpipe command is pretty
>>> unreadable. A lot of the information being dumped here is not about
>>> the data pipe, etc.
>>
>> So improve it. No problem. Also, we extend it to support what you neede.
> 
> Will i did try to do this back in March. And i failed.
> 
> Lets start with stats. Vivien gives an example on the cover letter:
> 
>     # pr -mt switch0/port{5,6}/stats
>     in_good_octets      : 0             in_good_octets      : 13824
>     in_bad_octets       : 0             in_bad_octets       : 0
>     in_unicast          : 0             in_unicast          : 0
>     in_broadcasts       : 0             in_broadcasts       : 216
>     in_multicasts       : 0             in_multicasts       : 0
>     in_pause            : 0             in_pause            : 0
>     in_undersize        : 0             in_undersize        : 0
> 
> This is what i tried to implement using dpipe. It is a simple two
> dimensional table. First column is a string, second a u64. In debugfs
> we have such a table per port. That fits with the hierarchy that each
> port is a directory in debugfs. But it could also be implemented as
> one table with N+1 columns, for N switch ports.
>

Hi Andrew,

This looks to me like basic L2 statistics that are obtained via
ethtool, I remember you had this problem with the DSA and CPU port.
Is this still the case?

I remembered we wanted to use dpipe for the DSA routing table
and IP priority table.

I think both those processes really look like match/action table
, thus they can be modeled successfully by dpipe.

> How about you, or one of your team, implement that. It should be able
> to use the dsa_loop driver, which is a simple dummy switch. But it
> does have statistics counters for all ports. Florian or I can help you
> get it running if needed.
> 
> This branch contains some of the basic plumbing code from my previous
> attempt:
> 
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flunn%2Flinux.git&data=02%7C01%7Carkadis%40mellanox.com%7Cb3cac139af204f79259c08d4eedc8410%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636396078291326351&sdata=K5D3TAb2spckuF5k88oOaVt0dmtHj0AwE8bEEGPPdGI%3D&reserved=0 v4.11-rc4-net-next-dpipe
> 
> 	 Andrew
> 

^ permalink raw reply

* [PATCH net v2] sch_multiq: fix double free on init failure
From: Nikolay Aleksandrov @ 2017-08-29 19:11 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, jhs, xiyou.wangcong, jiri, roopa, Nikolay Aleksandrov
In-Reply-To: <CAM_iQpVm0EvXsdcsAR546urmqgkeVgVqhJu=GwgWAhOutjzgyw@mail.gmail.com>

The below commit added a call to ->destroy() on init failure, but multiq
still frees ->queues on error in init, but ->queues is also freed by
->destroy() thus we get double free and corrupted memory.

Very easy to reproduce (eth0 not multiqueue):
$ tc qdisc add dev eth0 root multiq
RTNETLINK answers: Operation not supported
$ ip l add dumdum type dummy
(crash)

Trace log:
[ 3929.467747] general protection fault: 0000 [#1] SMP
[ 3929.468083] Modules linked in:
[ 3929.468302] CPU: 3 PID: 967 Comm: ip Not tainted 4.13.0-rc6+ #56
[ 3929.468625] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 3929.469124] task: ffff88003716a700 task.stack: ffff88005872c000
[ 3929.469449] RIP: 0010:__kmalloc_track_caller+0x117/0x1be
[ 3929.469746] RSP: 0018:ffff88005872f6a0 EFLAGS: 00010246
[ 3929.470042] RAX: 00000000000002de RBX: 0000000058a59000 RCX: 00000000000002df
[ 3929.470406] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff821f7020
[ 3929.470770] RBP: ffff88005872f6e8 R08: 000000000001f010 R09: 0000000000000000
[ 3929.471133] R10: ffff88005872f730 R11: 0000000000008cdd R12: ff006d75646d7564
[ 3929.471496] R13: 00000000014000c0 R14: ffff88005b403c00 R15: ffff88005b403c00
[ 3929.471869] FS:  00007f0b70480740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[ 3929.472286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3929.472677] CR2: 00007ffcee4f3000 CR3: 0000000059d45000 CR4: 00000000000406e0
[ 3929.473209] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3929.474109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3929.474873] Call Trace:
[ 3929.475337]  ? kstrdup_const+0x23/0x25
[ 3929.475863]  kstrdup+0x2e/0x4b
[ 3929.476338]  kstrdup_const+0x23/0x25
[ 3929.478084]  __kernfs_new_node+0x28/0xbc
[ 3929.478478]  kernfs_new_node+0x35/0x55
[ 3929.478929]  kernfs_create_link+0x23/0x76
[ 3929.479478]  sysfs_do_create_link_sd.isra.2+0x85/0xd7
[ 3929.480096]  sysfs_create_link+0x33/0x35
[ 3929.480649]  device_add+0x200/0x589
[ 3929.481184]  netdev_register_kobject+0x7c/0x12f
[ 3929.481711]  register_netdevice+0x373/0x471
[ 3929.482174]  rtnl_newlink+0x614/0x729
[ 3929.482610]  ? rtnl_newlink+0x17f/0x729
[ 3929.483080]  rtnetlink_rcv_msg+0x188/0x197
[ 3929.483533]  ? rcu_read_unlock+0x3e/0x5f
[ 3929.483984]  ? rtnl_newlink+0x729/0x729
[ 3929.484420]  netlink_rcv_skb+0x6c/0xce
[ 3929.484858]  rtnetlink_rcv+0x23/0x2a
[ 3929.485291]  netlink_unicast+0x103/0x181
[ 3929.485735]  netlink_sendmsg+0x326/0x337
[ 3929.486181]  sock_sendmsg_nosec+0x14/0x3f
[ 3929.486614]  sock_sendmsg+0x29/0x2e
[ 3929.486973]  ___sys_sendmsg+0x209/0x28b
[ 3929.487340]  ? do_raw_spin_unlock+0xcd/0xf8
[ 3929.487719]  ? _raw_spin_unlock+0x27/0x31
[ 3929.488092]  ? __handle_mm_fault+0x651/0xdb1
[ 3929.488471]  ? check_chain_key+0xb0/0xfd
[ 3929.488847]  __sys_sendmsg+0x45/0x63
[ 3929.489206]  ? __sys_sendmsg+0x45/0x63
[ 3929.489576]  SyS_sendmsg+0x19/0x1b
[ 3929.489901]  entry_SYSCALL_64_fastpath+0x23/0xc2
[ 3929.490172] RIP: 0033:0x7f0b6fb93690
[ 3929.490423] RSP: 002b:00007ffcee4ed588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 3929.490881] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f0b6fb93690
[ 3929.491198] RDX: 0000000000000000 RSI: 00007ffcee4ed5d0 RDI: 0000000000000003
[ 3929.491521] RBP: ffff88005872ff98 R08: 0000000000000001 R09: 0000000000000000
[ 3929.491801] R10: 00007ffcee4ed350 R11: 0000000000000246 R12: 0000000000000002
[ 3929.492075] R13: 000000000066f1a0 R14: 00007ffcee4f5680 R15: 0000000000000000
[ 3929.492352]  ? trace_hardirqs_off_caller+0xa7/0xcf
[ 3929.492590] Code: 8b 45 c0 48 8b 45 b8 74 17 48 8b 4d c8 83 ca ff 44
89 ee 4c 89 f7 e8 83 ca ff ff 49 89 c4 eb 49 49 63 56 20 48 8d 48 01 4d
8b 06 <49> 8b 1c 14 48 89 c2 4c 89 e0 65 49 0f c7 08 0f 94 c0 83 f0 01
[ 3929.493335] RIP: __kmalloc_track_caller+0x117/0x1be RSP: ffff88005872f6a0

Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
---
v2: added the acks and folded the return as per Cong's comment

 net/sched/sch_multiq.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index f143b7bbaa0d..9c454f5d6c38 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -257,12 +257,7 @@ static int multiq_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < q->max_bands; i++)
 		q->queues[i] = &noop_qdisc;
 
-	err = multiq_tune(sch, opt);
-
-	if (err)
-		kfree(q->queues);
-
-	return err;
+	return multiq_tune(sch, opt);
 }
 
 static int multiq_dump(struct Qdisc *sch, struct sk_buff *skb)
-- 
2.1.4

^ permalink raw reply related

* Re: [patch v1 1/2] dt-bindings: net: add binding documentation for mlxsw thermal control
From: Andrew Lunn @ 2017-08-29 19:15 UTC (permalink / raw)
  To: Vadim Pasternak
  Cc: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org,
	ivecera-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <AM4PR05MB33300CC580ADD6CDABD13167A29F0-n5Jp0YuYvM1LPiJj6BpYmdqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

On Tue, Aug 29, 2017 at 05:57:01PM +0000, Vadim Pasternak wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Lunn [mailto:andrew-g2DYL2Zd6BY@public.gmane.org]
> > Sent: Tuesday, August 29, 2017 8:23 PM
> > To: Vadim Pasternak <vadimp-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org; jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org;
> > ivecera-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Subject: Re: [patch v1 1/2] dt-bindings: net: add binding documentation for
> > mlxsw thermal control
> > 
> > > +- compatible		: "mellanox,mlxsw_minimal"
> > 
> > Interesting product name. Is there a mlxsw_maximal planned?
> > 
> 
> Hi Andrew,
> 
> Thank you very much for review.
> 
> No plans for such product. We just have fully functional drivers for different
> kind of Mellanox switch devices like spectrum, switchib, switchx2. All of them
> work over PCI bus. The minimal is supposed to be used for the chassis
> management and we uses it at BMC side. It works over I2C bus and doesn't
> depend on switch type. So it has a minaml functionality, so name "minimal".

That is not really how device tree compatible names work. The
compatible should be the product name, or the first product which had
this functionality. So in this case is the product is the chassis?
What name does the chassis have?

At some point, it would be good if you made the .dts file public. That
would give people a better idea how this all fits together.

> > I don't know much about the thermal subsystem. But looking at other
> > example binding documents, you seem to do something different here to
> > other drivers. Why do you not use what seems to be the common format:
> 
> In mlxsw_thermal driver we have definition for the thermal trips, which contains
> the type, like  "active" of "passive", temperature  in millicelsius and min/max states
> for cooling device. These vector defines thermal trip points.
> The hysteresis parameter is not relevant.

As i said, i don't know much about the thermal subsystem. So i cannot
comment on if your binding is good or bad. It just seems different,
which is often bad.

You really need the thermal subsystem guys to review this. But you
didn't Cc: them.

./scripts/get_maintainer.pl drivers/thermal/thermal_core.c 
Zhang Rui <rui.zhang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> (supporter:THERMAL)
Eduardo Valentin <edubezval-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> (supporter:THERMAL)
linux-pm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:THERMAL)

This driver has nothing to do with netdev. I'm wondering if it
actually belongs in drivers/thermal? Or maybe drivers/hwmon?
The pm and hwmon guys will probably tell you.

	Andrew
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net] nfp: double free on error in probe
From: Dan Carpenter @ 2017-08-29 19:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Simon Horman, oss-drivers, netdev,
	kernel-janitors

Both the nfp_net_pf_app_start() and the nfp_net_pci_probe() functions
call nfp_net_pf_app_stop_ctrl(pf) so there is a double free.  The free
should be done from the probe function because it's allocated there so
I have removed the call from nfp_net_pf_app_start().

Fixes: 02082701b974 ("nfp: create control vNICs and wire up rx/tx")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index d55f6c5ed1be..7c22cc4654b7 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -458,7 +458,7 @@ static int nfp_net_pf_app_start(struct nfp_pf *pf)
 
 	err = nfp_app_start(pf->app, pf->ctrl_vnic);
 	if (err)
-		goto err_ctrl_stop;
+		return err;
 
 	if (pf->num_vfs) {
 		err = nfp_app_sriov_enable(pf->app, pf->num_vfs);
@@ -470,8 +470,6 @@ static int nfp_net_pf_app_start(struct nfp_pf *pf)
 
 err_app_stop:
 	nfp_app_stop(pf->app);
-err_ctrl_stop:
-	nfp_net_pf_app_stop_ctrl(pf);
 	return err;
 }
 

^ permalink raw reply related

* Re: mlxsw and rtnl lock
From: Arkadi Sharshevsky @ 2017-08-29 19:16 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw
In-Reply-To: <b6dc0a4e-4ed9-ed5f-ac0f-c3fe06d5ed68@gmail.com>



On 08/28/2017 09:00 PM, David Ahern wrote:
> On 8/26/17 11:04 AM, Ido Schimmel wrote:
>> Regarding the silent abort, that's intentional. You can look at the same
>> code in v4.9 - when the chain was still blocking - and you'll see that
>> we didn't propagate the error even then. This was discussed in the past
>> and the conclusion was that user doesn't expect to operation to fail. If
>> hardware resources are exceeded, we let the kernel take care of the
>> forwarding instead.
>>
> 
> In addition to Roopa's comments... The silent abort is not a good user
> experience. Right now it's add a network address or route, cross fingers
> and hope it does not overflow some limit (nexthop, ecmp, neighbor,
> prefix, etc) that triggers the offload abort.
> 
> The mlxsw driver queries for some limits (e.g., max rifs) but I don't
> see any query related to current usage, and there is no API to pass any
> of that data to user space so user space has no programmatic way to
> handle this. I realize you are aware of this limitation. The point is to

The first dpipe table that was introduced was the erif table.
which gathers L3 statistics.

The rifs are actually also fixed size resource, so maybe it is
more correct to introduce it as 'resources' and connect it to
the erif table. That way you will be able to obtain current
usage, and receive notification when it will be drained out.

> emphasize the need to resolve this.
> 

^ permalink raw reply

* [PATCH V2 net-next] liquidio: show NIC's U-Boot version in a dev_info() message
From: Felix Manlunas @ 2017-08-29 19:19 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	weilin.chang

From: Weilin Chang <weilin.chang@cavium.com>

Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
Patch Change Log:
  V1 -> V2:
    * Move octeon_get_uboot_version() to a proper place to avoid forward
      declaration.
    * Remove complicated for-loops that search for substrings; replace them
      with calls to strstr().
    * Don't add unnecessary fields to struct octeon_device.

 .../net/ethernet/cavium/liquidio/octeon_console.c  | 78 ++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_console.c b/drivers/net/ethernet/cavium/liquidio/octeon_console.c
index 19e5212..ec3dd69 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_console.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_console.c
@@ -574,6 +574,82 @@ int octeon_init_consoles(struct octeon_device *oct)
 	return ret;
 }
 
+static void octeon_get_uboot_version(struct octeon_device *oct)
+{
+	s32 bytes_read, tries, total_read;
+	struct octeon_console *console;
+	u32 console_num = 0;
+	char *uboot_ver;
+	char *buf;
+	char *p;
+
+#define OCTEON_UBOOT_VER_BUF_SIZE 512
+	buf = kmalloc(OCTEON_UBOOT_VER_BUF_SIZE, GFP_KERNEL);
+	if (!buf)
+		return;
+
+	if (octeon_console_send_cmd(oct, "setenv stdout pci\n", 50)) {
+		kfree(buf);
+		return;
+	}
+
+	if (octeon_console_send_cmd(oct, "version\n", 1)) {
+		kfree(buf);
+		return;
+	}
+
+	console = &oct->console[console_num];
+	tries = 0;
+	total_read = 0;
+
+	do {
+		/* Take console output regardless of whether it will
+		 * be logged
+		 */
+		bytes_read =
+			octeon_console_read(oct,
+					    console_num, buf + total_read,
+					    OCTEON_UBOOT_VER_BUF_SIZE - 1 -
+					    total_read);
+		if (bytes_read > 0) {
+			buf[bytes_read] = '\0';
+
+			total_read += bytes_read;
+			if (console->waiting)
+				octeon_console_handle_result(oct, console_num);
+		} else if (bytes_read < 0) {
+			dev_err(&oct->pci_dev->dev, "Error reading console %u, ret=%d\n",
+				console_num, bytes_read);
+		}
+
+		tries++;
+	} while ((bytes_read > 0) && (tries < 16));
+
+	/* If nothing is read after polling the console,
+	 * output any leftovers if any
+	 */
+	if ((total_read == 0) && (console->leftover[0])) {
+		dev_dbg(&oct->pci_dev->dev, "%u: %s\n",
+			console_num, console->leftover);
+		console->leftover[0] = '\0';
+	}
+
+	buf[OCTEON_UBOOT_VER_BUF_SIZE - 1] = '\0';
+
+	uboot_ver = strstr(buf, "U-Boot");
+	if (uboot_ver) {
+		p = strstr(uboot_ver, "mips");
+		if (p) {
+			p--;
+			*p = '\0';
+			dev_info(&oct->pci_dev->dev, "%s\n", uboot_ver);
+		}
+	}
+
+	kfree(buf);
+	octeon_console_send_cmd(oct, "setenv stdout serial\n", 50);
+}
+
 int octeon_add_console(struct octeon_device *oct, u32 console_num,
 		       char *dbg_enb)
 {
@@ -611,6 +687,8 @@ int octeon_add_console(struct octeon_device *oct, u32 console_num,
 
 		work = &oct->console_poll_work[console_num].work;
 
+		octeon_get_uboot_version(oct);
+
 		INIT_DELAYED_WORK(work, check_console);
 		oct->console_poll_work[console_num].ctxptr = (void *)oct;
 		oct->console_poll_work[console_num].ctxul = console_num;

^ permalink raw reply related

* Re: [PATCH net-next v2 00/10] net: dsa: add generic debugfs interface
From: Florian Fainelli @ 2017-08-29 19:19 UTC (permalink / raw)
  To: Arkadi Sharshevsky, Andrew Lunn, Jiri Pirko
  Cc: Vivien Didelot, netdev, linux-kernel, kernel, David S. Miller,
	Egil Hjelmeland, John Crispin, Woojung Huh, Sean Wang,
	Nikita Yushchenko, Chris Healy, mlxsw
In-Reply-To: <8e6fa7ea-fb7b-3026-b9c0-417ef2efae27@mellanox.com>

On 08/29/2017 12:05 PM, Arkadi Sharshevsky wrote:
> 
> 
> On 08/29/2017 03:50 PM, Andrew Lunn wrote:
>> On Tue, Aug 29, 2017 at 08:25:23AM +0200, Jiri Pirko wrote:
>>> Mon, Aug 28, 2017 at 10:08:34PM CEST, andrew@lunn.ch wrote:
>>>>> I see this overlaps a lot with DPIPE. Why won't you use that to expose
>>>>> your hw state?
>>>>
>>>> We took a look at dpipe and i talked to you about using it for this
>>>> sort of thing at netconf/netdev. But dpipe has issues displaying the
>>>> sort of information we have. I never figured out how to do two
>>>> dimensional tables. The output of the dpipe command is pretty
>>>> unreadable. A lot of the information being dumped here is not about
>>>> the data pipe, etc.
>>>
>>> So improve it. No problem. Also, we extend it to support what you neede.
>>
>> Will i did try to do this back in March. And i failed.
>>
>> Lets start with stats. Vivien gives an example on the cover letter:
>>
>>     # pr -mt switch0/port{5,6}/stats
>>     in_good_octets      : 0             in_good_octets      : 13824
>>     in_bad_octets       : 0             in_bad_octets       : 0
>>     in_unicast          : 0             in_unicast          : 0
>>     in_broadcasts       : 0             in_broadcasts       : 216
>>     in_multicasts       : 0             in_multicasts       : 0
>>     in_pause            : 0             in_pause            : 0
>>     in_undersize        : 0             in_undersize        : 0
>>
>> This is what i tried to implement using dpipe. It is a simple two
>> dimensional table. First column is a string, second a u64. In debugfs
>> we have such a table per port. That fits with the hierarchy that each
>> port is a directory in debugfs. But it could also be implemented as
>> one table with N+1 columns, for N switch ports.
>>
> 
> Hi Andrew,
> 
> This looks to me like basic L2 statistics that are obtained via
> ethtool, I remember you had this problem with the DSA and CPU port.
> Is this still the case?

Yes, there are no net_device representors for CPU and DSA ports because
if we did that, it would be confusing as we would be creating two
network devices for both ends of the pipe. For DSA (inter-switch)
interfaces you would have one "dsa" netdev for each adjacent switch so
two DSA interface represent the inter switch link.

For the "CPU" port, you have the master network device (e.g: eth0) and
the "cpu" network device, this is confusing. "cpu" is not usable, since
it does not make sense for the "cpu" to send traffic via this interface,
the model is to terminate user-facing ports and use a tag to deliver
packets to the right interfaces. For "dsa" it's pretty much the same story.

> 
> I remembered we wanted to use dpipe for the DSA routing table
> and IP priority table.
> 
> I think both those processes really look like match/action table
> , thus they can be modeled successfully by dpipe.
> 
>> How about you, or one of your team, implement that. It should be able
>> to use the dsa_loop driver, which is a simple dummy switch. But it
>> does have statistics counters for all ports. Florian or I can help you
>> get it running if needed.
>>
>> This branch contains some of the basic plumbing code from my previous
>> attempt:
>>
>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flunn%2Flinux.git&data=02%7C01%7Carkadis%40mellanox.com%7Cb3cac139af204f79259c08d4eedc8410%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636396078291326351&sdata=K5D3TAb2spckuF5k88oOaVt0dmtHj0AwE8bEEGPPdGI%3D&reserved=0 v4.11-rc4-net-next-dpipe
>>
>> 	 Andrew
>>


-- 
Florian

^ permalink raw reply

* [PATCH net-next] net: bcmgenet: Use correct I/O accessors
From: Florian Fainelli @ 2017-08-29 19:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, opendmb, jaedon.shin, Florian Fainelli, Florian Fainelli

From: Florian Fainelli <florian.fainelli@broadcom.com>

The GENET driver currently uses __raw_{read,write}l which means
native I/O endian. This works correctly for an ARM LE kernel (default)
but fails miserably on an ARM BE (BE8) kernel where registers are kept
little endian, so replace uses with {read,write}l_relaxed here which is
what we want because this is all performance sensitive code.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
This was tested on ARM LE and ARM BE, but I would appreciate Doug's
review on this to make sure I did not add bugs

 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 75 ++++++++++++++++----------
 drivers/net/ethernet/broadcom/genet/bcmgenet.h | 13 ++++-
 2 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index a981c4ee9d72..612d1ef3b5f5 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -72,23 +72,42 @@
 #define GENET_RDMA_REG_OFF	(priv->hw_params->rdma_offset + \
 				TOTAL_DESC * DMA_DESC_SIZE)
 
+static inline void bcmgenet_writel(u32 value, void __iomem *offset)
+{
+	/* MIPS chips strapped for BE will automagically configure the
+	 * peripheral registers for CPU-native byte order.
+	 */
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+		__raw_writel(value, offset);
+	else
+		writel_relaxed(value, offset);
+}
+
+static inline u32 bcmgenet_readl(void __iomem *offset)
+{
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+		return __raw_readl(offset);
+	else
+		return readl_relaxed(offset);
+}
+
 static inline void dmadesc_set_length_status(struct bcmgenet_priv *priv,
 					     void __iomem *d, u32 value)
 {
-	__raw_writel(value, d + DMA_DESC_LENGTH_STATUS);
+	bcmgenet_writel(value, d + DMA_DESC_LENGTH_STATUS);
 }
 
 static inline u32 dmadesc_get_length_status(struct bcmgenet_priv *priv,
 					    void __iomem *d)
 {
-	return __raw_readl(d + DMA_DESC_LENGTH_STATUS);
+	return bcmgenet_readl(d + DMA_DESC_LENGTH_STATUS);
 }
 
 static inline void dmadesc_set_addr(struct bcmgenet_priv *priv,
 				    void __iomem *d,
 				    dma_addr_t addr)
 {
-	__raw_writel(lower_32_bits(addr), d + DMA_DESC_ADDRESS_LO);
+	bcmgenet_writel(lower_32_bits(addr), d + DMA_DESC_ADDRESS_LO);
 
 	/* Register writes to GISB bus can take couple hundred nanoseconds
 	 * and are done for each packet, save these expensive writes unless
@@ -96,7 +115,7 @@ static inline void dmadesc_set_addr(struct bcmgenet_priv *priv,
 	 */
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
 	if (priv->hw_params->flags & GENET_HAS_40BITS)
-		__raw_writel(upper_32_bits(addr), d + DMA_DESC_ADDRESS_HI);
+		bcmgenet_writel(upper_32_bits(addr), d + DMA_DESC_ADDRESS_HI);
 #endif
 }
 
@@ -113,7 +132,7 @@ static inline dma_addr_t dmadesc_get_addr(struct bcmgenet_priv *priv,
 {
 	dma_addr_t addr;
 
-	addr = __raw_readl(d + DMA_DESC_ADDRESS_LO);
+	addr = bcmgenet_readl(d + DMA_DESC_ADDRESS_LO);
 
 	/* Register writes to GISB bus can take couple hundred nanoseconds
 	 * and are done for each packet, save these expensive writes unless
@@ -121,7 +140,7 @@ static inline dma_addr_t dmadesc_get_addr(struct bcmgenet_priv *priv,
 	 */
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
 	if (priv->hw_params->flags & GENET_HAS_40BITS)
-		addr |= (u64)__raw_readl(d + DMA_DESC_ADDRESS_HI) << 32;
+		addr |= (u64)bcmgenet_readl(d + DMA_DESC_ADDRESS_HI) << 32;
 #endif
 	return addr;
 }
@@ -156,8 +175,8 @@ static inline u32 bcmgenet_tbuf_ctrl_get(struct bcmgenet_priv *priv)
 	if (GENET_IS_V1(priv))
 		return bcmgenet_rbuf_readl(priv, TBUF_CTRL_V1);
 	else
-		return __raw_readl(priv->base +
-				priv->hw_params->tbuf_offset + TBUF_CTRL);
+		return bcmgenet_readl(priv->base +
+				      priv->hw_params->tbuf_offset + TBUF_CTRL);
 }
 
 static inline void bcmgenet_tbuf_ctrl_set(struct bcmgenet_priv *priv, u32 val)
@@ -165,7 +184,7 @@ static inline void bcmgenet_tbuf_ctrl_set(struct bcmgenet_priv *priv, u32 val)
 	if (GENET_IS_V1(priv))
 		bcmgenet_rbuf_writel(priv, val, TBUF_CTRL_V1);
 	else
-		__raw_writel(val, priv->base +
+		bcmgenet_writel(val, priv->base +
 				priv->hw_params->tbuf_offset + TBUF_CTRL);
 }
 
@@ -174,8 +193,8 @@ static inline u32 bcmgenet_bp_mc_get(struct bcmgenet_priv *priv)
 	if (GENET_IS_V1(priv))
 		return bcmgenet_rbuf_readl(priv, TBUF_BP_MC_V1);
 	else
-		return __raw_readl(priv->base +
-				priv->hw_params->tbuf_offset + TBUF_BP_MC);
+		return bcmgenet_readl(priv->base +
+				      priv->hw_params->tbuf_offset + TBUF_BP_MC);
 }
 
 static inline void bcmgenet_bp_mc_set(struct bcmgenet_priv *priv, u32 val)
@@ -183,7 +202,7 @@ static inline void bcmgenet_bp_mc_set(struct bcmgenet_priv *priv, u32 val)
 	if (GENET_IS_V1(priv))
 		bcmgenet_rbuf_writel(priv, val, TBUF_BP_MC_V1);
 	else
-		__raw_writel(val, priv->base +
+		bcmgenet_writel(val, priv->base +
 				priv->hw_params->tbuf_offset + TBUF_BP_MC);
 }
 
@@ -326,28 +345,28 @@ static inline struct bcmgenet_priv *dev_to_priv(struct device *dev)
 static inline u32 bcmgenet_tdma_readl(struct bcmgenet_priv *priv,
 				      enum dma_reg r)
 {
-	return __raw_readl(priv->base + GENET_TDMA_REG_OFF +
-			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_TDMA_REG_OFF +
+			      DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
 static inline void bcmgenet_tdma_writel(struct bcmgenet_priv *priv,
 					u32 val, enum dma_reg r)
 {
-	__raw_writel(val, priv->base + GENET_TDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_TDMA_REG_OFF +
 			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
 static inline u32 bcmgenet_rdma_readl(struct bcmgenet_priv *priv,
 				      enum dma_reg r)
 {
-	return __raw_readl(priv->base + GENET_RDMA_REG_OFF +
-			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_RDMA_REG_OFF +
+			      DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
 static inline void bcmgenet_rdma_writel(struct bcmgenet_priv *priv,
 					u32 val, enum dma_reg r)
 {
-	__raw_writel(val, priv->base + GENET_RDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_RDMA_REG_OFF +
 			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
@@ -418,16 +437,16 @@ static inline u32 bcmgenet_tdma_ring_readl(struct bcmgenet_priv *priv,
 					   unsigned int ring,
 					   enum dma_ring_reg r)
 {
-	return __raw_readl(priv->base + GENET_TDMA_REG_OFF +
-			(DMA_RING_SIZE * ring) +
-			genet_dma_ring_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_TDMA_REG_OFF +
+			      (DMA_RING_SIZE * ring) +
+			      genet_dma_ring_regs[r]);
 }
 
 static inline void bcmgenet_tdma_ring_writel(struct bcmgenet_priv *priv,
 					     unsigned int ring, u32 val,
 					     enum dma_ring_reg r)
 {
-	__raw_writel(val, priv->base + GENET_TDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_TDMA_REG_OFF +
 			(DMA_RING_SIZE * ring) +
 			genet_dma_ring_regs[r]);
 }
@@ -436,16 +455,16 @@ static inline u32 bcmgenet_rdma_ring_readl(struct bcmgenet_priv *priv,
 					   unsigned int ring,
 					   enum dma_ring_reg r)
 {
-	return __raw_readl(priv->base + GENET_RDMA_REG_OFF +
-			(DMA_RING_SIZE * ring) +
-			genet_dma_ring_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_RDMA_REG_OFF +
+			      (DMA_RING_SIZE * ring) +
+			      genet_dma_ring_regs[r]);
 }
 
 static inline void bcmgenet_rdma_ring_writel(struct bcmgenet_priv *priv,
 					     unsigned int ring, u32 val,
 					     enum dma_ring_reg r)
 {
-	__raw_writel(val, priv->base + GENET_RDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_RDMA_REG_OFF +
 			(DMA_RING_SIZE * ring) +
 			genet_dma_ring_regs[r]);
 }
@@ -991,12 +1010,12 @@ static void bcmgenet_eee_enable_set(struct net_device *dev, bool enable)
 	bcmgenet_umac_writel(priv, reg, UMAC_EEE_CTRL);
 
 	/* Enable EEE and switch to a 27Mhz clock automatically */
-	reg = __raw_readl(priv->base + off);
+	reg = bcmgenet_readl(priv->base + off);
 	if (enable)
 		reg |= TBUF_EEE_EN | TBUF_PM_EN;
 	else
 		reg &= ~(TBUF_EEE_EN | TBUF_PM_EN);
-	__raw_writel(reg, priv->base + off);
+	bcmgenet_writel(reg, priv->base + off);
 
 	/* Do the same for thing for RBUF */
 	reg = bcmgenet_rbuf_readl(priv, RBUF_ENERGY_CTRL);
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index 3a34fdba5301..4b31e6c172b6 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -671,12 +671,21 @@ struct bcmgenet_priv {
 static inline u32 bcmgenet_##name##_readl(struct bcmgenet_priv *priv,	\
 					u32 off)			\
 {									\
-	return __raw_readl(priv->base + offset + off);			\
+	/* MIPS chips strapped for BE will automagically configure the	\
+	 * peripheral registers for CPU-native byte order.		\
+	 */								\
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) \
+		return __raw_readl(priv->base + offset + off);		\
+	else								\
+		return readl_relaxed(priv->base + offset + off);	\
 }									\
 static inline void bcmgenet_##name##_writel(struct bcmgenet_priv *priv,	\
 					u32 val, u32 off)		\
 {									\
-	__raw_writel(val, priv->base + offset + off);			\
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) \
+		return __raw_writel(val, priv->base + offset + off);	\
+	else								\
+		writel_relaxed(val, priv->base + offset + off);		\
 }
 
 GENET_IO_MACRO(ext, GENET_EXT_OFF);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next] net: bcmgenet: Use correct I/O accessors
From: Florian Fainelli @ 2017-08-29 19:29 UTC (permalink / raw)
  To: netdev; +Cc: davem, opendmb, jaedon.shin
In-Reply-To: <1504034587-31113-1-git-send-email-f.fainelli@gmail.com>

On 08/29/2017 12:23 PM, Florian Fainelli wrote:
> From: Florian Fainelli <florian.fainelli@broadcom.com>

Oh nooo I just exposed myself, let me resend with the gmail address.

> 
> The GENET driver currently uses __raw_{read,write}l which means
> native I/O endian. This works correctly for an ARM LE kernel (default)
> but fails miserably on an ARM BE (BE8) kernel where registers are kept
> little endian, so replace uses with {read,write}l_relaxed here which is
> what we want because this is all performance sensitive code.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
> This was tested on ARM LE and ARM BE, but I would appreciate Doug's
> review on this to make sure I did not add bugs
> 
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 75 ++++++++++++++++----------
>  drivers/net/ethernet/broadcom/genet/bcmgenet.h | 13 ++++-
>  2 files changed, 58 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index a981c4ee9d72..612d1ef3b5f5 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -72,23 +72,42 @@
>  #define GENET_RDMA_REG_OFF	(priv->hw_params->rdma_offset + \
>  				TOTAL_DESC * DMA_DESC_SIZE)
>  
> +static inline void bcmgenet_writel(u32 value, void __iomem *offset)
> +{
> +	/* MIPS chips strapped for BE will automagically configure the
> +	 * peripheral registers for CPU-native byte order.
> +	 */
> +	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> +		__raw_writel(value, offset);
> +	else
> +		writel_relaxed(value, offset);
> +}
> +
> +static inline u32 bcmgenet_readl(void __iomem *offset)
> +{
> +	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> +		return __raw_readl(offset);
> +	else
> +		return readl_relaxed(offset);
> +}
> +
>  static inline void dmadesc_set_length_status(struct bcmgenet_priv *priv,
>  					     void __iomem *d, u32 value)
>  {
> -	__raw_writel(value, d + DMA_DESC_LENGTH_STATUS);
> +	bcmgenet_writel(value, d + DMA_DESC_LENGTH_STATUS);
>  }
>  
>  static inline u32 dmadesc_get_length_status(struct bcmgenet_priv *priv,
>  					    void __iomem *d)
>  {
> -	return __raw_readl(d + DMA_DESC_LENGTH_STATUS);
> +	return bcmgenet_readl(d + DMA_DESC_LENGTH_STATUS);
>  }
>  
>  static inline void dmadesc_set_addr(struct bcmgenet_priv *priv,
>  				    void __iomem *d,
>  				    dma_addr_t addr)
>  {
> -	__raw_writel(lower_32_bits(addr), d + DMA_DESC_ADDRESS_LO);
> +	bcmgenet_writel(lower_32_bits(addr), d + DMA_DESC_ADDRESS_LO);
>  
>  	/* Register writes to GISB bus can take couple hundred nanoseconds
>  	 * and are done for each packet, save these expensive writes unless
> @@ -96,7 +115,7 @@ static inline void dmadesc_set_addr(struct bcmgenet_priv *priv,
>  	 */
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  	if (priv->hw_params->flags & GENET_HAS_40BITS)
> -		__raw_writel(upper_32_bits(addr), d + DMA_DESC_ADDRESS_HI);
> +		bcmgenet_writel(upper_32_bits(addr), d + DMA_DESC_ADDRESS_HI);
>  #endif
>  }
>  
> @@ -113,7 +132,7 @@ static inline dma_addr_t dmadesc_get_addr(struct bcmgenet_priv *priv,
>  {
>  	dma_addr_t addr;
>  
> -	addr = __raw_readl(d + DMA_DESC_ADDRESS_LO);
> +	addr = bcmgenet_readl(d + DMA_DESC_ADDRESS_LO);
>  
>  	/* Register writes to GISB bus can take couple hundred nanoseconds
>  	 * and are done for each packet, save these expensive writes unless
> @@ -121,7 +140,7 @@ static inline dma_addr_t dmadesc_get_addr(struct bcmgenet_priv *priv,
>  	 */
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  	if (priv->hw_params->flags & GENET_HAS_40BITS)
> -		addr |= (u64)__raw_readl(d + DMA_DESC_ADDRESS_HI) << 32;
> +		addr |= (u64)bcmgenet_readl(d + DMA_DESC_ADDRESS_HI) << 32;
>  #endif
>  	return addr;
>  }
> @@ -156,8 +175,8 @@ static inline u32 bcmgenet_tbuf_ctrl_get(struct bcmgenet_priv *priv)
>  	if (GENET_IS_V1(priv))
>  		return bcmgenet_rbuf_readl(priv, TBUF_CTRL_V1);
>  	else
> -		return __raw_readl(priv->base +
> -				priv->hw_params->tbuf_offset + TBUF_CTRL);
> +		return bcmgenet_readl(priv->base +
> +				      priv->hw_params->tbuf_offset + TBUF_CTRL);
>  }
>  
>  static inline void bcmgenet_tbuf_ctrl_set(struct bcmgenet_priv *priv, u32 val)
> @@ -165,7 +184,7 @@ static inline void bcmgenet_tbuf_ctrl_set(struct bcmgenet_priv *priv, u32 val)
>  	if (GENET_IS_V1(priv))
>  		bcmgenet_rbuf_writel(priv, val, TBUF_CTRL_V1);
>  	else
> -		__raw_writel(val, priv->base +
> +		bcmgenet_writel(val, priv->base +
>  				priv->hw_params->tbuf_offset + TBUF_CTRL);
>  }
>  
> @@ -174,8 +193,8 @@ static inline u32 bcmgenet_bp_mc_get(struct bcmgenet_priv *priv)
>  	if (GENET_IS_V1(priv))
>  		return bcmgenet_rbuf_readl(priv, TBUF_BP_MC_V1);
>  	else
> -		return __raw_readl(priv->base +
> -				priv->hw_params->tbuf_offset + TBUF_BP_MC);
> +		return bcmgenet_readl(priv->base +
> +				      priv->hw_params->tbuf_offset + TBUF_BP_MC);
>  }
>  
>  static inline void bcmgenet_bp_mc_set(struct bcmgenet_priv *priv, u32 val)
> @@ -183,7 +202,7 @@ static inline void bcmgenet_bp_mc_set(struct bcmgenet_priv *priv, u32 val)
>  	if (GENET_IS_V1(priv))
>  		bcmgenet_rbuf_writel(priv, val, TBUF_BP_MC_V1);
>  	else
> -		__raw_writel(val, priv->base +
> +		bcmgenet_writel(val, priv->base +
>  				priv->hw_params->tbuf_offset + TBUF_BP_MC);
>  }
>  
> @@ -326,28 +345,28 @@ static inline struct bcmgenet_priv *dev_to_priv(struct device *dev)
>  static inline u32 bcmgenet_tdma_readl(struct bcmgenet_priv *priv,
>  				      enum dma_reg r)
>  {
> -	return __raw_readl(priv->base + GENET_TDMA_REG_OFF +
> -			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
> +	return bcmgenet_readl(priv->base + GENET_TDMA_REG_OFF +
> +			      DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
>  }
>  
>  static inline void bcmgenet_tdma_writel(struct bcmgenet_priv *priv,
>  					u32 val, enum dma_reg r)
>  {
> -	__raw_writel(val, priv->base + GENET_TDMA_REG_OFF +
> +	bcmgenet_writel(val, priv->base + GENET_TDMA_REG_OFF +
>  			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
>  }
>  
>  static inline u32 bcmgenet_rdma_readl(struct bcmgenet_priv *priv,
>  				      enum dma_reg r)
>  {
> -	return __raw_readl(priv->base + GENET_RDMA_REG_OFF +
> -			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
> +	return bcmgenet_readl(priv->base + GENET_RDMA_REG_OFF +
> +			      DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
>  }
>  
>  static inline void bcmgenet_rdma_writel(struct bcmgenet_priv *priv,
>  					u32 val, enum dma_reg r)
>  {
> -	__raw_writel(val, priv->base + GENET_RDMA_REG_OFF +
> +	bcmgenet_writel(val, priv->base + GENET_RDMA_REG_OFF +
>  			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
>  }
>  
> @@ -418,16 +437,16 @@ static inline u32 bcmgenet_tdma_ring_readl(struct bcmgenet_priv *priv,
>  					   unsigned int ring,
>  					   enum dma_ring_reg r)
>  {
> -	return __raw_readl(priv->base + GENET_TDMA_REG_OFF +
> -			(DMA_RING_SIZE * ring) +
> -			genet_dma_ring_regs[r]);
> +	return bcmgenet_readl(priv->base + GENET_TDMA_REG_OFF +
> +			      (DMA_RING_SIZE * ring) +
> +			      genet_dma_ring_regs[r]);
>  }
>  
>  static inline void bcmgenet_tdma_ring_writel(struct bcmgenet_priv *priv,
>  					     unsigned int ring, u32 val,
>  					     enum dma_ring_reg r)
>  {
> -	__raw_writel(val, priv->base + GENET_TDMA_REG_OFF +
> +	bcmgenet_writel(val, priv->base + GENET_TDMA_REG_OFF +
>  			(DMA_RING_SIZE * ring) +
>  			genet_dma_ring_regs[r]);
>  }
> @@ -436,16 +455,16 @@ static inline u32 bcmgenet_rdma_ring_readl(struct bcmgenet_priv *priv,
>  					   unsigned int ring,
>  					   enum dma_ring_reg r)
>  {
> -	return __raw_readl(priv->base + GENET_RDMA_REG_OFF +
> -			(DMA_RING_SIZE * ring) +
> -			genet_dma_ring_regs[r]);
> +	return bcmgenet_readl(priv->base + GENET_RDMA_REG_OFF +
> +			      (DMA_RING_SIZE * ring) +
> +			      genet_dma_ring_regs[r]);
>  }
>  
>  static inline void bcmgenet_rdma_ring_writel(struct bcmgenet_priv *priv,
>  					     unsigned int ring, u32 val,
>  					     enum dma_ring_reg r)
>  {
> -	__raw_writel(val, priv->base + GENET_RDMA_REG_OFF +
> +	bcmgenet_writel(val, priv->base + GENET_RDMA_REG_OFF +
>  			(DMA_RING_SIZE * ring) +
>  			genet_dma_ring_regs[r]);
>  }
> @@ -991,12 +1010,12 @@ static void bcmgenet_eee_enable_set(struct net_device *dev, bool enable)
>  	bcmgenet_umac_writel(priv, reg, UMAC_EEE_CTRL);
>  
>  	/* Enable EEE and switch to a 27Mhz clock automatically */
> -	reg = __raw_readl(priv->base + off);
> +	reg = bcmgenet_readl(priv->base + off);
>  	if (enable)
>  		reg |= TBUF_EEE_EN | TBUF_PM_EN;
>  	else
>  		reg &= ~(TBUF_EEE_EN | TBUF_PM_EN);
> -	__raw_writel(reg, priv->base + off);
> +	bcmgenet_writel(reg, priv->base + off);
>  
>  	/* Do the same for thing for RBUF */
>  	reg = bcmgenet_rbuf_readl(priv, RBUF_ENERGY_CTRL);
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> index 3a34fdba5301..4b31e6c172b6 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> @@ -671,12 +671,21 @@ struct bcmgenet_priv {
>  static inline u32 bcmgenet_##name##_readl(struct bcmgenet_priv *priv,	\
>  					u32 off)			\
>  {									\
> -	return __raw_readl(priv->base + offset + off);			\
> +	/* MIPS chips strapped for BE will automagically configure the	\
> +	 * peripheral registers for CPU-native byte order.		\
> +	 */								\
> +	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) \
> +		return __raw_readl(priv->base + offset + off);		\
> +	else								\
> +		return readl_relaxed(priv->base + offset + off);	\
>  }									\
>  static inline void bcmgenet_##name##_writel(struct bcmgenet_priv *priv,	\
>  					u32 val, u32 off)		\
>  {									\
> -	__raw_writel(val, priv->base + offset + off);			\
> +	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) \
> +		return __raw_writel(val, priv->base + offset + off);	\
> +	else								\
> +		writel_relaxed(val, priv->base + offset + off);		\
>  }
>  
>  GENET_IO_MACRO(ext, GENET_EXT_OFF);
> 


-- 
Florian

^ permalink raw reply

* [PATCH net-next v2] net: bcmgenet: Use correct I/O accessors
From: Florian Fainelli @ 2017-08-29 19:25 UTC (permalink / raw)
  To: netdev; +Cc: davem, opendmb, jaedon.shin, Florian Fainelli

The GENET driver currently uses __raw_{read,write}l which means
native I/O endian. This works correctly for an ARM LE kernel (default)
but fails miserably on an ARM BE (BE8) kernel where registers are kept
little endian, so replace uses with {read,write}l_relaxed here which is
what we want because this is all performance sensitive code.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v2:

- fixed email address

 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 75 ++++++++++++++++----------
 drivers/net/ethernet/broadcom/genet/bcmgenet.h | 13 ++++-
 2 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index a981c4ee9d72..612d1ef3b5f5 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -72,23 +72,42 @@
 #define GENET_RDMA_REG_OFF	(priv->hw_params->rdma_offset + \
 				TOTAL_DESC * DMA_DESC_SIZE)
 
+static inline void bcmgenet_writel(u32 value, void __iomem *offset)
+{
+	/* MIPS chips strapped for BE will automagically configure the
+	 * peripheral registers for CPU-native byte order.
+	 */
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+		__raw_writel(value, offset);
+	else
+		writel_relaxed(value, offset);
+}
+
+static inline u32 bcmgenet_readl(void __iomem *offset)
+{
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+		return __raw_readl(offset);
+	else
+		return readl_relaxed(offset);
+}
+
 static inline void dmadesc_set_length_status(struct bcmgenet_priv *priv,
 					     void __iomem *d, u32 value)
 {
-	__raw_writel(value, d + DMA_DESC_LENGTH_STATUS);
+	bcmgenet_writel(value, d + DMA_DESC_LENGTH_STATUS);
 }
 
 static inline u32 dmadesc_get_length_status(struct bcmgenet_priv *priv,
 					    void __iomem *d)
 {
-	return __raw_readl(d + DMA_DESC_LENGTH_STATUS);
+	return bcmgenet_readl(d + DMA_DESC_LENGTH_STATUS);
 }
 
 static inline void dmadesc_set_addr(struct bcmgenet_priv *priv,
 				    void __iomem *d,
 				    dma_addr_t addr)
 {
-	__raw_writel(lower_32_bits(addr), d + DMA_DESC_ADDRESS_LO);
+	bcmgenet_writel(lower_32_bits(addr), d + DMA_DESC_ADDRESS_LO);
 
 	/* Register writes to GISB bus can take couple hundred nanoseconds
 	 * and are done for each packet, save these expensive writes unless
@@ -96,7 +115,7 @@ static inline void dmadesc_set_addr(struct bcmgenet_priv *priv,
 	 */
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
 	if (priv->hw_params->flags & GENET_HAS_40BITS)
-		__raw_writel(upper_32_bits(addr), d + DMA_DESC_ADDRESS_HI);
+		bcmgenet_writel(upper_32_bits(addr), d + DMA_DESC_ADDRESS_HI);
 #endif
 }
 
@@ -113,7 +132,7 @@ static inline dma_addr_t dmadesc_get_addr(struct bcmgenet_priv *priv,
 {
 	dma_addr_t addr;
 
-	addr = __raw_readl(d + DMA_DESC_ADDRESS_LO);
+	addr = bcmgenet_readl(d + DMA_DESC_ADDRESS_LO);
 
 	/* Register writes to GISB bus can take couple hundred nanoseconds
 	 * and are done for each packet, save these expensive writes unless
@@ -121,7 +140,7 @@ static inline dma_addr_t dmadesc_get_addr(struct bcmgenet_priv *priv,
 	 */
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
 	if (priv->hw_params->flags & GENET_HAS_40BITS)
-		addr |= (u64)__raw_readl(d + DMA_DESC_ADDRESS_HI) << 32;
+		addr |= (u64)bcmgenet_readl(d + DMA_DESC_ADDRESS_HI) << 32;
 #endif
 	return addr;
 }
@@ -156,8 +175,8 @@ static inline u32 bcmgenet_tbuf_ctrl_get(struct bcmgenet_priv *priv)
 	if (GENET_IS_V1(priv))
 		return bcmgenet_rbuf_readl(priv, TBUF_CTRL_V1);
 	else
-		return __raw_readl(priv->base +
-				priv->hw_params->tbuf_offset + TBUF_CTRL);
+		return bcmgenet_readl(priv->base +
+				      priv->hw_params->tbuf_offset + TBUF_CTRL);
 }
 
 static inline void bcmgenet_tbuf_ctrl_set(struct bcmgenet_priv *priv, u32 val)
@@ -165,7 +184,7 @@ static inline void bcmgenet_tbuf_ctrl_set(struct bcmgenet_priv *priv, u32 val)
 	if (GENET_IS_V1(priv))
 		bcmgenet_rbuf_writel(priv, val, TBUF_CTRL_V1);
 	else
-		__raw_writel(val, priv->base +
+		bcmgenet_writel(val, priv->base +
 				priv->hw_params->tbuf_offset + TBUF_CTRL);
 }
 
@@ -174,8 +193,8 @@ static inline u32 bcmgenet_bp_mc_get(struct bcmgenet_priv *priv)
 	if (GENET_IS_V1(priv))
 		return bcmgenet_rbuf_readl(priv, TBUF_BP_MC_V1);
 	else
-		return __raw_readl(priv->base +
-				priv->hw_params->tbuf_offset + TBUF_BP_MC);
+		return bcmgenet_readl(priv->base +
+				      priv->hw_params->tbuf_offset + TBUF_BP_MC);
 }
 
 static inline void bcmgenet_bp_mc_set(struct bcmgenet_priv *priv, u32 val)
@@ -183,7 +202,7 @@ static inline void bcmgenet_bp_mc_set(struct bcmgenet_priv *priv, u32 val)
 	if (GENET_IS_V1(priv))
 		bcmgenet_rbuf_writel(priv, val, TBUF_BP_MC_V1);
 	else
-		__raw_writel(val, priv->base +
+		bcmgenet_writel(val, priv->base +
 				priv->hw_params->tbuf_offset + TBUF_BP_MC);
 }
 
@@ -326,28 +345,28 @@ static inline struct bcmgenet_priv *dev_to_priv(struct device *dev)
 static inline u32 bcmgenet_tdma_readl(struct bcmgenet_priv *priv,
 				      enum dma_reg r)
 {
-	return __raw_readl(priv->base + GENET_TDMA_REG_OFF +
-			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_TDMA_REG_OFF +
+			      DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
 static inline void bcmgenet_tdma_writel(struct bcmgenet_priv *priv,
 					u32 val, enum dma_reg r)
 {
-	__raw_writel(val, priv->base + GENET_TDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_TDMA_REG_OFF +
 			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
 static inline u32 bcmgenet_rdma_readl(struct bcmgenet_priv *priv,
 				      enum dma_reg r)
 {
-	return __raw_readl(priv->base + GENET_RDMA_REG_OFF +
-			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_RDMA_REG_OFF +
+			      DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
 static inline void bcmgenet_rdma_writel(struct bcmgenet_priv *priv,
 					u32 val, enum dma_reg r)
 {
-	__raw_writel(val, priv->base + GENET_RDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_RDMA_REG_OFF +
 			DMA_RINGS_SIZE + bcmgenet_dma_regs[r]);
 }
 
@@ -418,16 +437,16 @@ static inline u32 bcmgenet_tdma_ring_readl(struct bcmgenet_priv *priv,
 					   unsigned int ring,
 					   enum dma_ring_reg r)
 {
-	return __raw_readl(priv->base + GENET_TDMA_REG_OFF +
-			(DMA_RING_SIZE * ring) +
-			genet_dma_ring_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_TDMA_REG_OFF +
+			      (DMA_RING_SIZE * ring) +
+			      genet_dma_ring_regs[r]);
 }
 
 static inline void bcmgenet_tdma_ring_writel(struct bcmgenet_priv *priv,
 					     unsigned int ring, u32 val,
 					     enum dma_ring_reg r)
 {
-	__raw_writel(val, priv->base + GENET_TDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_TDMA_REG_OFF +
 			(DMA_RING_SIZE * ring) +
 			genet_dma_ring_regs[r]);
 }
@@ -436,16 +455,16 @@ static inline u32 bcmgenet_rdma_ring_readl(struct bcmgenet_priv *priv,
 					   unsigned int ring,
 					   enum dma_ring_reg r)
 {
-	return __raw_readl(priv->base + GENET_RDMA_REG_OFF +
-			(DMA_RING_SIZE * ring) +
-			genet_dma_ring_regs[r]);
+	return bcmgenet_readl(priv->base + GENET_RDMA_REG_OFF +
+			      (DMA_RING_SIZE * ring) +
+			      genet_dma_ring_regs[r]);
 }
 
 static inline void bcmgenet_rdma_ring_writel(struct bcmgenet_priv *priv,
 					     unsigned int ring, u32 val,
 					     enum dma_ring_reg r)
 {
-	__raw_writel(val, priv->base + GENET_RDMA_REG_OFF +
+	bcmgenet_writel(val, priv->base + GENET_RDMA_REG_OFF +
 			(DMA_RING_SIZE * ring) +
 			genet_dma_ring_regs[r]);
 }
@@ -991,12 +1010,12 @@ static void bcmgenet_eee_enable_set(struct net_device *dev, bool enable)
 	bcmgenet_umac_writel(priv, reg, UMAC_EEE_CTRL);
 
 	/* Enable EEE and switch to a 27Mhz clock automatically */
-	reg = __raw_readl(priv->base + off);
+	reg = bcmgenet_readl(priv->base + off);
 	if (enable)
 		reg |= TBUF_EEE_EN | TBUF_PM_EN;
 	else
 		reg &= ~(TBUF_EEE_EN | TBUF_PM_EN);
-	__raw_writel(reg, priv->base + off);
+	bcmgenet_writel(reg, priv->base + off);
 
 	/* Do the same for thing for RBUF */
 	reg = bcmgenet_rbuf_readl(priv, RBUF_ENERGY_CTRL);
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index 3a34fdba5301..4b31e6c172b6 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -671,12 +671,21 @@ struct bcmgenet_priv {
 static inline u32 bcmgenet_##name##_readl(struct bcmgenet_priv *priv,	\
 					u32 off)			\
 {									\
-	return __raw_readl(priv->base + offset + off);			\
+	/* MIPS chips strapped for BE will automagically configure the	\
+	 * peripheral registers for CPU-native byte order.		\
+	 */								\
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) \
+		return __raw_readl(priv->base + offset + off);		\
+	else								\
+		return readl_relaxed(priv->base + offset + off);	\
 }									\
 static inline void bcmgenet_##name##_writel(struct bcmgenet_priv *priv,	\
 					u32 val, u32 off)		\
 {									\
-	__raw_writel(val, priv->base + offset + off);			\
+	if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) \
+		return __raw_writel(val, priv->base + offset + off);	\
+	else								\
+		writel_relaxed(val, priv->base + offset + off);		\
 }
 
 GENET_IO_MACRO(ext, GENET_EXT_OFF);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-29 19:35 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Network Development, Koichiro Den, virtualization
In-Reply-To: <CAF=yD-Lre0NATUVH8p0zS=G5CiY=WfKSefLmOZw940jZFgECgQ@mail.gmail.com>

On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>>> >> >> > We don't enable network watchdog on virtio but we could and maybe
>>> >> >> > should.
>>> >> >>
>>> >> >> Can you elaborate?
>>> >> >
>>> >> > The issue is that holding onto buffers for very long times makes guests
>>> >> > think they are stuck. This is funamentally because from guest point of
>>> >> > view this is a NIC, so it is supposed to transmit things out in
>>> >> > a timely manner. If host backs the virtual NIC by something that is not
>>> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
>>> >> > guest will be confused.
>>> >>
>>> >> That assumes that guests are fragile in this regard. A linux guest
>>> >> does not make such assumptions.
>>> >
>>> > Yes it does. Examples above:
>>> >         > > - a single slow flow can occupy the whole ring, you will not
>>> >         > >   be able to make any new buffers available for the fast flow
>>>
>>> Oh, right. Though those are due to vring_desc pool exhaustion
>>> rather than an upper bound on latency of any single packet.
>>>
>>> Limiting the number of zerocopy packets in flight to some fraction
>>> of the ring ensures that fast flows can always grab a slot.
>>> Running
>>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>>> I read it correclty the zerocopy pool may be equal to or larger than
>>> the descriptor pool. Should we refine the zcopy_used test
>>>
>>>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>>
>>> to also return false if the number of outstanding ubuf_info is greater
>>> than, say, vq->num >> 1?
>>
>>
>> We'll need to think about where to put the threshold, but I think it's
>> a good idea.
>>
>> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
>> resources.
>>
>> In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.
>>
>> Need to experiment with some numbers.
>
> I can take a stab with two flows, one delayed in a deep host qdisc
> queue. See how this change affects the other flow and also how
> sensitive that is to the chosen threshold value.

Incomplete results at this stage, but I do see this correlation between
flows. It occurs even while not running out of zerocopy descriptors,
which I cannot yet explain.

Running two threads in a guest, each with a udp socket, each
sending up to 100 datagrams, or until EAGAIN, every msec.

Sender A sends 1B datagrams.
Sender B sends VHOST_GOODCOPY_LEN, which is enough
to trigger zcopy_used in vhost net.

A local receive process on the host receives both flows. To avoid
a deep copy when looping the packet onto the receive path,
changed skb_orphan_frags_rx to always return false (gross hack).

The flow with the larger packets is redirected through netem on ifb0:

  modprobe ifb
  ip link set dev ifb0 up
  tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit

  tc qdisc add dev tap0 ingress
  tc filter add dev tap0 parent ffff: protocol ip \
      u32 match ip dport 8000 0xffff \
      action mirred egress redirect dev ifb0

For 10 second run, packet count with various ifb0 queue lengths $LIMIT:

no filter
  rx.A: ~840,000
  rx.B: ~840,000

limit 1
  rx.A: ~500,000
  rx.B: ~3100
  ifb0: 3273 sent, 371141 dropped

limit 100
  rx.A: ~9000
  rx.B: ~4200
  ifb0: 4630 sent, 1491 dropped

limit 1000
  rx.A: ~6800
  rx.B: ~4200
  ifb0: 4651 sent, 0 dropped

Sender B is always correctly rate limited to 1 MBps or less. With a
short queue, it ends up dropping a lot and sending even less.

When a queue builds up for sender B, sender A throughput is strongly
correlated with queue length. With queue length 1, it can send almost
at unthrottled speed. But even at limit 100 its throughput is on the
same order as sender B.

What is surprising to me is that this happens even though the number
of ubuf_info in use at limit 100 is around 100 at all times. In other words,
it does not exhaust the pool.

When forcing zcopy_used to be false for all packets, this effect of
sender A throughput being correlated with sender B does not happen.

no filter
  rx.A: ~850,000
  rx.B: ~850,000

limit 100
  rx.A: ~850,000
  rx.B: ~4200
  ifb0: 4518 sent, 876182 dropped

Also relevant is that with zerocopy, the sender processes back off
and report the same count as the receiver. Without zerocopy,
both senders send at full speed, even if only 4200 packets from flow
B arrive at the receiver.

This is with the default virtio_net driver, so without napi-tx.

It appears that the zerocopy notifications are pausing the guest.
Will look at that now.

By the way, I have had an unrelated patch outstanding for a while
to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
command. Will send that as RFC.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Michael S. Tsirkin @ 2017-08-29 19:42 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com>

On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
> By the way, I have had an unrelated patch outstanding for a while
> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
> command. Will send that as RFC.

Oh nice. One needs to be careful about locking there which is why
no devices support that yet.

-- 
MST

^ permalink raw reply

* Re: [PATCH net] sch_hhf: fix null pointer dereference on init failure
From: Nikolay Aleksandrov @ 2017-08-29 19:46 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, jhs, xiyou.wangcong, jiri, roopa
In-Reply-To: <1504033335-19098-1-git-send-email-nikolay@cumulusnetworks.com>

On 29/08/17 22:02, Nikolay Aleksandrov wrote:
> If sch_hhf fails in its ->init() function (either due to wrong
> user-space arguments as below or memory alloc failure of hh_flows) it
> will do a null pointer deref of q->hh_flows in its ->destroy() function.
> 
> To reproduce the crash:
> $ tc qdisc add dev eth0 root hhf quantum 2000000 non_hh_weight 10000000
> 

Uh, sorry about sending these out separately. I'll send the rest as a set for
easier review once I test them, should be just a couple more.

^ permalink raw reply

* Re: XDP redirect measurements, gotchas and tracepoints
From: Alexander Duyck @ 2017-08-29 19:52 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Jesper Dangaard Brouer, Michael Chan, John Fastabend,
	Duyck, Alexander H, pstaszewski@itcare.pl, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, borkmann@iogearbox.net
In-Reply-To: <20170829190253.GA77682@C02RW35GFVH8.dhcp.broadcom.net>

On Tue, Aug 29, 2017 at 12:02 PM, Andy Gospodarek <andy@greyhouse.net> wrote:
> On Tue, Aug 29, 2017 at 09:23:49AM -0700, Alexander Duyck wrote:
>> On Tue, Aug 29, 2017 at 6:26 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> >
>> > On Mon, 28 Aug 2017 09:11:25 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> >
>> >> My advice would be to not over complicate this. My big concern with
>> >> all this buffer recycling is what happens the first time somebody
>> >> introduces something like mirroring? Are you going to copy the data to
>> >> a new page which would be quite expensive or just have to introduce
>> >> reference counts? You are going to have to deal with stuff like
>> >> reference counts eventually so you might as well bite that bullet now.
>> >> My advice would be to not bother with optimizing for performance right
>> >> now and instead focus on just getting functionality. The approach we
>> >> took in ixgbe for the transmit path should work for almost any other
>> >> driver since all you are looking at is having to free the page
>> >> reference which takes care of reference counting already.
>> >
>> > This return API is not about optimizing performance right now.  It is
>> > actually about allowing us to change the underlying memory model per RX
>> > queue for XDP.
>>
>>  I would disagree. To me this is a obvious case of premature optimization.
>>
>
> I'm with Jesper on this.  Though it may seem to you that this is an
> optimization that is not a goal.
>
>> > If a RX-ring is use for both SKBs and XDP, then the refcnt model is
>> > still enforced.  Although a driver using the 1-packet-per-page model,
>> > should be able to reuse refcnt==1 pages when returned from XDP.
>>
>> Isn't this the case for all Rx on XDP enabled rings. Last I knew there
>> was an option to pass packets up via an SKB if XDP_PASS is returned.
>> Are you saying we need to do a special allocation path if an XDP
>> program doesn't make use of XDP_PASS?
>
> I am not proposing that a special allocation path is needed depending on the
> return code from the XDP program.  I'm proposing that in a case where
> the return code is XDP_REDIRECT (or really anytime the ndo_xdp_xmit
> operation is called), that there should be:
>
> (1) notification back to the driver/resource/etc that allocated the page
> that resources are no longer in use.
>
> or
>
> (2) common alloc/free framework used by drivers that operate on
> xdp->data so that framework takes care of refcounting, etc.
>
> My preference is (1) since it provides drivers the most flexibility in
> the event that some hardware resource (rx ring buffer pointer) or
> software resource (page or other chunk of memory) can be freed.

So my preference would be (2) rather than (1). The simple reason being
that I would prefer it if we didn't have every driver doing their own
memory management API. With that said though I realize we need to have
a solid proof-of-concept before we can get there so I am okay with
each driver doing their own thing until we have a clear winner of some
sort in all these discussions.

The biggest issue I see with (1) is that it assumes that the Tx should
somehow be responsible for the recycling. The whole thing ends up
being way too racy. The problem is you will have two drivers sharing a
region of memory and each will have to do some sort of additional
reference counting if not using the existing page count. The result is
you can have either the Rx device needing to free resources which will
have to somehow be tracked from the Rx side, and the Tx side will have
to handle the freeing and dumping the packet back into the Rx pool if
it is the Tx that needs to free the resources. The standard case in
this gets messy, but the exception cases such as removing or resetting
a driver can get messy quickly.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-29 19:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <20170829224136-mutt-send-email-mst@kernel.org>

On Tue, Aug 29, 2017 at 3:42 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
>> By the way, I have had an unrelated patch outstanding for a while
>> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
>> command. Will send that as RFC.
>
> Oh nice.

Great :)

> One needs to be careful about locking there which is why
> no devices support that yet.

I originally wrote it based on the virtnet_reset function introduced
for xdp. Calling this from virtnet_config_changed_work is non trivial,
as virtnet_freeze_down waits until no config worker is running.

Otherwise, I could not find any constraints on when freeze may be
called, and it largely follows the same path. I hope I didn't miss anything.

^ permalink raw reply

* Re: mlxsw and rtnl lock
From: David Ahern @ 2017-08-29 20:04 UTC (permalink / raw)
  To: Arkadi Sharshevsky, Ido Schimmel
  Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw
In-Reply-To: <2723f732-b337-a792-05ee-4a1461bf2aad@mellanox.com>

On 8/29/17 12:10 AM, Arkadi Sharshevsky wrote:
> 
> 
> On 08/28/2017 09:00 PM, David Ahern wrote:
>> On 8/26/17 11:04 AM, Ido Schimmel wrote:
>>> Regarding the silent abort, that's intentional. You can look at the same
>>> code in v4.9 - when the chain was still blocking - and you'll see that
>>> we didn't propagate the error even then. This was discussed in the past
>>> and the conclusion was that user doesn't expect to operation to fail. If
>>> hardware resources are exceeded, we let the kernel take care of the
>>> forwarding instead.
>>>
>>
>> In addition to Roopa's comments... The silent abort is not a good user
>> experience. Right now it's add a network address or route, cross fingers
>> and hope it does not overflow some limit (nexthop, ecmp, neighbor,
>> prefix, etc) that triggers the offload abort.
>>
>> The mlxsw driver queries for some limits (e.g., max rifs) but I don't
>> see any query related to current usage, and there is no API to pass any
>> of that data to user space so user space has no programmatic way to
>> handle this. I realize you are aware of this limitation. The point is to
>> emphasize the need to resolve this.
>>
> 
> We actually thought about providing he user some tools to understand
> the ASIC's limitations by introducing the 'resource' object to devlink.
> 
> By linking dpipe tables to resources the user can understand which
> hardware processes share a common resource, furthermore this resources
> usage could be observed. By this more visibility can be obtained.
> 
> Its not a remedy for the silent abort, but, maybe a notification
> can be sent from devlink in case of abort that some resources is
> full.
> 
> This proposition was sent as RFC several weeks ago.
> 

Do you have patches (kernel and devlink) for the proposal?

^ permalink raw reply

* Re: [PATCH net] nfp: double free on error in probe
From: Jakub Kicinski @ 2017-08-29 20:05 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: David S. Miller, Simon Horman, oss-drivers, netdev,
	kernel-janitors
In-Reply-To: <20170829190855.fabfktv2dg57x3gs@mwanda>

On Tue, 29 Aug 2017 22:15:16 +0300, Dan Carpenter wrote:
> Both the nfp_net_pf_app_start() and the nfp_net_pci_probe() functions
> call nfp_net_pf_app_stop_ctrl(pf) so there is a double free.  The free
> should be done from the probe function because it's allocated there so
> I have removed the call from nfp_net_pf_app_start().
> 
> Fixes: 02082701b974 ("nfp: create control vNICs and wire up rx/tx")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* [PATCH RFC 0/2] virtio-net: implement reset request
From: Willem de Bruijn @ 2017-08-29 20:07 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Implement VIRTIO_CONFIG_S_NEEDS_RESET (patch 1) and
add a feature bit to signal support (patch 2)

Further details in the individual patches.

Willem de Bruijn (2):
  virtio_net: implement VIRTIO_CONFIG_S_NEEDS_RESET
  virtio_net: enable probing for NEEDS_RESET support

 drivers/net/virtio_net.c           | 71 +++++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_config.h |  3 ++
 include/uapi/linux/virtio_net.h    |  4 +++
 3 files changed, 69 insertions(+), 9 deletions(-)

-- 
2.14.1.342.g6490525c54-goog

^ permalink raw reply

* [PATCH RFC 1/2] virtio_net: implement VIRTIO_CONFIG_S_NEEDS_RESET
From: Willem de Bruijn @ 2017-08-29 20:07 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, Willem de Bruijn
In-Reply-To: <20170829200759.13975-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Implement the reset communication request defined in the VIRTIO 1.0
specification and introduces in Linux in commit c00bbcf862896 ("virtio:
add VIRTIO_CONFIG_S_NEEDS_RESET device status bit").

Since that patch, the virtio-net driver has added a virtnet_reset
function that implements the requested behavior through calls to the
power management freeze and restore functions.

That function has recently been reverted when its sole caller was
updated. Bring it back and listen for the request from the host on
the config channel.

Implement the feature analogous to other config requests. In
particular, acknowledge the request in the same manner as the
NET_S_ANNOUNCE link announce request, by responding with a new
VIRTIO_NET_CTRL_${TYPE} command. On reception, the host must check
the config status register for success or failure.

The existing freeze handler verifies that no config changes are
running concurrently. Elide this check for reset. The request is
always handled from the config workqueue. No other config requests
can be active or scheduled concurrently on vi->config.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/virtio_net.c        | 69 +++++++++++++++++++++++++++++++++++------
 include/uapi/linux/virtio_net.h |  4 +++
 2 files changed, 64 insertions(+), 9 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 52ae78ca3d38..5e349226f7c1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1458,12 +1458,11 @@ static void virtnet_netpoll(struct net_device *dev)
 }
 #endif
 
-static void virtnet_ack_link_announce(struct virtnet_info *vi)
+static void virtnet_ack(struct virtnet_info *vi, u8 class, u8 cmd)
 {
 	rtnl_lock();
-	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE,
-				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL))
-		dev_warn(&vi->dev->dev, "Failed to ack link announce.\n");
+	if (!virtnet_send_command(vi, class, cmd, NULL))
+		dev_warn(&vi->dev->dev, "Failed to ack %u.%u\n", class, cmd);
 	rtnl_unlock();
 }
 
@@ -1857,13 +1856,16 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
 	.set_link_ksettings = virtnet_set_link_ksettings,
 };
 
-static void virtnet_freeze_down(struct virtio_device *vdev)
+static void virtnet_freeze_down(struct virtio_device *vdev, bool in_reset)
 {
 	struct virtnet_info *vi = vdev->priv;
 	int i;
 
-	/* Make sure no work handler is accessing the device */
-	flush_work(&vi->config_work);
+	/* Make sure no work handler is accessing the device,
+	 * unless this call is made from the reset work handler itself.
+	 */
+	if (!in_reset)
+		flush_work(&vi->config_work);
 
 	netif_device_detach(vi->dev);
 	netif_tx_disable(vi->dev);
@@ -1878,6 +1880,7 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
 }
 
 static int init_vqs(struct virtnet_info *vi);
+static void remove_vq_common(struct virtnet_info *vi);
 
 static int virtnet_restore_up(struct virtio_device *vdev)
 {
@@ -1906,6 +1909,45 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 	return err;
 }
 
+static int virtnet_reset(struct virtnet_info *vi)
+{
+	struct virtio_device *dev = vi->vdev;
+	bool failed;
+	int ret;
+
+	virtio_config_disable(dev);
+	failed = dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED;
+	virtnet_freeze_down(dev, true);
+	remove_vq_common(vi);
+
+	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
+	virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER);
+
+	/* Restore the failed status (see virtio_device_restore). */
+	if (failed)
+		virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
+
+	ret = virtio_finalize_features(dev);
+	if (ret)
+		goto err;
+
+	ret = virtnet_restore_up(dev);
+	if (ret)
+		goto err;
+
+	ret = virtnet_set_queues(vi, vi->curr_queue_pairs);
+	if (ret)
+		goto err;
+
+	virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
+	virtio_config_enable(dev);
+	return 0;
+
+err:
+	virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
+	return ret;
+}
+
 static int virtnet_set_guest_offloads(struct virtnet_info *vi, u64 offloads)
 {
 	struct scatterlist sg;
@@ -2085,7 +2127,16 @@ static void virtnet_config_changed_work(struct work_struct *work)
 
 	if (v & VIRTIO_NET_S_ANNOUNCE) {
 		netdev_notify_peers(vi->dev);
-		virtnet_ack_link_announce(vi);
+		virtnet_ack(vi, VIRTIO_NET_CTRL_ANNOUNCE,
+			    VIRTIO_NET_CTRL_ANNOUNCE_ACK);
+	}
+
+	if (vi->vdev->config->get_status(vi->vdev) &
+	    VIRTIO_CONFIG_S_NEEDS_RESET) {
+		virtnet_reset(vi);
+		virtnet_ack(vi, VIRTIO_NET_CTRL_RESET,
+			    VIRTIO_NET_CTRL_RESET_ACK);
+
 	}
 
 	/* Ignore unknown (future) status bits */
@@ -2708,7 +2759,7 @@ static __maybe_unused int virtnet_freeze(struct virtio_device *vdev)
 	struct virtnet_info *vi = vdev->priv;
 
 	virtnet_cpu_notif_remove(vi);
-	virtnet_freeze_down(vdev);
+	virtnet_freeze_down(vdev, false);
 	remove_vq_common(vi);
 
 	return 0;
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b518288..188fdc528f13 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -245,4 +245,8 @@ struct virtio_net_ctrl_mq {
 #define VIRTIO_NET_CTRL_GUEST_OFFLOADS   5
 #define VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET        0
 
+/* Signal that the driver received and executed the reset command. */
+#define VIRTIO_NET_CTRL_RESET			6
+#define VIRTIO_NET_CTRL_RESET_ACK		0
+
 #endif /* _UAPI_LINUX_VIRTIO_NET_H */
-- 
2.14.1.342.g6490525c54-goog

^ permalink raw reply related

* [PATCH RFC 2/2] virtio_net: enable probing for NEEDS_RESET support
From: Willem de Bruijn @ 2017-08-29 20:07 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, Willem de Bruijn
In-Reply-To: <20170829200759.13975-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Implement a mechanism to signal that a virtio device implements the
VIRTIO_CONFIG_S_NEEDS_RESET command.

Testing for VIRTIO_CONFIG_S_NEEDS_RESET support by issuing the request
and verifying the reset state would require an expensive state change.

To avoid that, add a status bit to the feature register used during
feature negotiation between host and guest.

Set the bit for virtio-net, which supports the feature.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/virtio_net.c           | 2 ++
 include/uapi/linux/virtio_config.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5e349226f7c1..15483a982106 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2801,12 +2801,14 @@ static struct virtio_device_id id_table[] = {
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
+	VIRTIO_F_CAN_RESET,
 };
 
 static unsigned int features_legacy[] = {
 	VIRTNET_FEATURES,
 	VIRTIO_NET_F_GSO,
 	VIRTIO_F_ANY_LAYOUT,
+	VIRTIO_F_CAN_RESET,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e2096291f..726be44a04d3 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -56,6 +56,9 @@
  * suppressed them? */
 #define VIRTIO_F_NOTIFY_ON_EMPTY	24
 
+/* Driver supports the VIRTIO_CONFIG_S_NEEDS_RESET command */
+#define VIRTIO_F_CAN_RESET		25
+
 /* Can the device handle any descriptor layout? */
 #define VIRTIO_F_ANY_LAYOUT		27
 #endif /* VIRTIO_CONFIG_NO_LEGACY */
-- 
2.14.1.342.g6490525c54-goog

^ permalink raw reply related

* Re: [PATCH RFC 2/2] virtio_net: enable probing for NEEDS_RESET support
From: Michael S. Tsirkin @ 2017-08-29 20:16 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn, virtualization
In-Reply-To: <20170829200759.13975-3-willemdebruijn.kernel@gmail.com>

On Tue, Aug 29, 2017 at 04:07:59PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Implement a mechanism to signal that a virtio device implements the
> VIRTIO_CONFIG_S_NEEDS_RESET command.
> 
> Testing for VIRTIO_CONFIG_S_NEEDS_RESET support by issuing the request
> and verifying the reset state would require an expensive state change.
> 
> To avoid that, add a status bit to the feature register used during
> feature negotiation between host and guest.
> 
> Set the bit for virtio-net, which supports the feature.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>

All virtio 1 devices have the reset feature so maybe guest does
not need this flag. Does device need it? Does device really
care that guest can't recover?

> ---
>  drivers/net/virtio_net.c           | 2 ++
>  include/uapi/linux/virtio_config.h | 3 +++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5e349226f7c1..15483a982106 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2801,12 +2801,14 @@ static struct virtio_device_id id_table[] = {
>  
>  static unsigned int features[] = {
>  	VIRTNET_FEATURES,
> +	VIRTIO_F_CAN_RESET,
>  };
>  
>  static unsigned int features_legacy[] = {
>  	VIRTNET_FEATURES,
>  	VIRTIO_NET_F_GSO,
>  	VIRTIO_F_ANY_LAYOUT,
> +	VIRTIO_F_CAN_RESET,
>  };
>  
>  static struct virtio_driver virtio_net_driver = {
> diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
> index 308e2096291f..726be44a04d3 100644
> --- a/include/uapi/linux/virtio_config.h
> +++ b/include/uapi/linux/virtio_config.h
> @@ -56,6 +56,9 @@
>   * suppressed them? */
>  #define VIRTIO_F_NOTIFY_ON_EMPTY	24
>  
> +/* Driver supports the VIRTIO_CONFIG_S_NEEDS_RESET command */
> +#define VIRTIO_F_CAN_RESET		25
> +
>  /* Can the device handle any descriptor layout? */
>  #define VIRTIO_F_ANY_LAYOUT		27
>  #endif /* VIRTIO_CONFIG_NO_LEGACY */
> -- 
> 2.14.1.342.g6490525c54-goog

^ permalink raw reply

* [PATCH net-next v3] bridge: fdb add and delete tracepoints
From: Roopa Prabhu @ 2017-08-29 20:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, nikolay, f.fainelli, bridge

From: Roopa Prabhu <roopa@cumulusnetworks.com>

A few useful tracepoints to trace bridge forwarding
database updates.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
v2: address comments from florian
v3: remove stray character '=' in print (pointed out by florian)

 include/trace/events/bridge.h |   98 +++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_fdb.c           |    7 +++
 net/core/net-traces.c         |    6 +++
 3 files changed, 111 insertions(+)
 create mode 100644 include/trace/events/bridge.h

diff --git a/include/trace/events/bridge.h b/include/trace/events/bridge.h
new file mode 100644
index 0000000..0f1cde0
--- /dev/null
+++ b/include/trace/events/bridge.h
@@ -0,0 +1,98 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM bridge
+
+#if !defined(_TRACE_BRIDGE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_BRIDGE_H
+
+#include <linux/netdevice.h>
+#include <linux/tracepoint.h>
+
+#include "../../../net/bridge/br_private.h"
+
+TRACE_EVENT(br_fdb_add,
+
+	TP_PROTO(struct ndmsg *ndm, struct net_device *dev,
+		 const unsigned char *addr, u16 vid, u16 nlh_flags),
+
+	TP_ARGS(ndm, dev, addr, vid, nlh_flags),
+
+	TP_STRUCT__entry(
+		__field(u8, ndm_flags)
+		__string(dev, dev->name)
+		__array(unsigned char, addr, ETH_ALEN)
+		__field(u16, vid)
+		__field(u16, nlh_flags)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev, dev->name);
+		memcpy(__entry->addr, addr, ETH_ALEN);
+		__entry->vid = vid;
+		__entry->nlh_flags = nlh_flags;
+		__entry->ndm_flags = ndm->ndm_flags;
+	),
+
+	TP_printk("dev %s addr %02x:%02x:%02x:%02x:%02x:%02x vid %u nlh_flags %04x ndm_flags %02x",
+		  __get_str(dev), __entry->addr[0], __entry->addr[1],
+		  __entry->addr[2], __entry->addr[3], __entry->addr[4],
+		  __entry->addr[5], __entry->vid,
+		  __entry->nlh_flags, __entry->ndm_flags)
+);
+
+TRACE_EVENT(br_fdb_external_learn_add,
+
+	TP_PROTO(struct net_bridge *br, struct net_bridge_port *p,
+		 const unsigned char *addr, u16 vid),
+
+	TP_ARGS(br, p, addr, vid),
+
+	TP_STRUCT__entry(
+		__string(br_dev, br->dev->name)
+		__string(dev, p ? p->dev->name : "null")
+		__array(unsigned char, addr, ETH_ALEN)
+		__field(u16, vid)
+	),
+
+	TP_fast_assign(
+		__assign_str(br_dev, br->dev->name);
+		__assign_str(dev, p ? p->dev->name : "null");
+		memcpy(__entry->addr, addr, ETH_ALEN);
+		__entry->vid = vid;
+	),
+
+	TP_printk("br_dev %s port %s addr %02x:%02x:%02x:%02x:%02x:%02x vid %u",
+		  __get_str(br_dev), __get_str(dev), __entry->addr[0],
+		  __entry->addr[1], __entry->addr[2], __entry->addr[3],
+		  __entry->addr[4], __entry->addr[5], __entry->vid)
+);
+
+TRACE_EVENT(fdb_delete,
+
+	TP_PROTO(struct net_bridge *br, struct net_bridge_fdb_entry *f),
+
+	TP_ARGS(br, f),
+
+	TP_STRUCT__entry(
+		__string(br_dev, br->dev->name)
+		__string(dev, f->dst ? f->dst->dev->name : "null")
+		__array(unsigned char, addr, ETH_ALEN)
+		__field(u16, vid)
+	),
+
+	TP_fast_assign(
+		__assign_str(br_dev, br->dev->name);
+		__assign_str(dev, f->dst ? f->dst->dev->name : "null");
+		memcpy(__entry->addr, f->addr.addr, ETH_ALEN);
+		__entry->vid = f->vlan_id;
+	),
+
+	TP_printk("br_dev %s dev %s addr %02x:%02x:%02x:%02x:%02x:%02x vid %u",
+		  __get_str(br_dev), __get_str(dev), __entry->addr[0],
+		  __entry->addr[1], __entry->addr[2], __entry->addr[3],
+		  __entry->addr[4], __entry->addr[5], __entry->vid)
+);
+
+#endif /* _TRACE_BRIDGE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index a79b648..be5e1da 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -25,6 +25,7 @@
 #include <asm/unaligned.h>
 #include <linux/if_vlan.h>
 #include <net/switchdev.h>
+#include <trace/events/bridge.h>
 #include "br_private.h"
 
 static struct kmem_cache *br_fdb_cache __read_mostly;
@@ -171,6 +172,8 @@ static void fdb_del_hw_addr(struct net_bridge *br, const unsigned char *addr)
 
 static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
 {
+	trace_fdb_delete(br, f);
+
 	if (f->is_static)
 		fdb_del_hw_addr(br, f->addr.addr);
 
@@ -870,6 +873,8 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 	struct net_bridge *br = NULL;
 	int err = 0;
 
+	trace_br_fdb_add(ndm, dev, addr, vid, nlh_flags);
+
 	if (!(ndm->ndm_state & (NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE))) {
 		pr_info("bridge: RTM_NEWNEIGH with invalid state %#x\n", ndm->ndm_state);
 		return -EINVAL;
@@ -1066,6 +1071,8 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
 	bool modified = false;
 	int err = 0;
 
+	trace_br_fdb_external_learn_add(br, p, addr, vid);
+
 	spin_lock_bh(&br->hash_lock);
 
 	head = &br->hash[br_mac_hash(addr, vid)];
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index 4f1468c..4a0292c 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -37,6 +37,12 @@
 #include <trace/events/fib6.h>
 EXPORT_TRACEPOINT_SYMBOL_GPL(fib6_table_lookup);
 #endif
+#if IS_ENABLED(CONFIG_BRIDGE)
+#include <trace/events/bridge.h>
+EXPORT_TRACEPOINT_SYMBOL_GPL(br_fdb_add);
+EXPORT_TRACEPOINT_SYMBOL_GPL(br_fdb_external_learn_add);
+EXPORT_TRACEPOINT_SYMBOL_GPL(fdb_delete);
+#endif
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] net: remove dmaengine.h inclusion from netdevice.h
From: Dave Jiang @ 2017-08-29 20:17 UTC (permalink / raw)
  To: davem; +Cc: netdev

Since the removal of NET_DMA, dmaengine.h header file shouldn't be needed
by netdevice.h anymore.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 include/linux/netdevice.h |    1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 779b235..dd5b440 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -35,7 +35,6 @@
 
 #include <linux/percpu.h>
 #include <linux/rculist.h>
-#include <linux/dmaengine.h>
 #include <linux/workqueue.h>
 #include <linux/dynamic_queue_limits.h>
 

^ permalink raw reply related

* Re: [PATCH net-next 0/4] Endian fixes for SYSTEMPORT/SF2/MDIO
From: Florian Fainelli @ 2017-08-29 20:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, opendmb, andrew, vivien.didelot
In-Reply-To: <1504031985-52808-1-git-send-email-f.fainelli@gmail.com>

On 08/29/2017 11:39 AM, Florian Fainelli wrote:
> Hi David,
> 
> While trying an ARM BE kernel for kinks, the 3 drivers below started not
> working and the reasons why became pretty obvious because the register space
> remains LE (hardwired), except for Broadcom MIPS where it follows the CPU's
> native endian (let's call that a feature).

David, I found a minor problem in the RSB endian configuration for
SYSTEMPOR Lite that uses two bits for that, I will repost shortly.

> 
> Thanks!
> 
> Florian Fainelli (4):
>   net: systemport: Use correct I/O accessors
>   net: dsa: bcm_sf2: Use correct I/O accessors
>   net: systemport: Set correct RSB endian bits based on host
>   net: phy: mdio-bcm-unimac: Use correct I/O accessors
> 
>  drivers/net/dsa/bcm_sf2.h                  | 12 +++++------
>  drivers/net/ethernet/broadcom/bcmsysport.c | 21 ++++++++++++--------
>  drivers/net/phy/mdio-bcm-unimac.c          | 32 ++++++++++++++++++++++++------
>  3 files changed, 45 insertions(+), 20 deletions(-)
> 


-- 
Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox