Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: Jeff Kirsher @ 2014-12-16 12:12 UTC (permalink / raw)
  To: Zhu Yanjun, stable, Willy Tarreau
  Cc: netdev, w, Zhu Yanjun, Bruce Allan, David S. Miller
In-Reply-To: <1418725700-31465-2-git-send-email-Yanjun.Zhu@windriver.com>

[-- Attachment #1: Type: text/plain, Size: 1727 bytes --]

On Tue, 2014-12-16 at 18:28 +0800, Zhu Yanjun wrote:
> 2.6.x kernels require a similar logic change as commit 6dfaa76 
> [e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
> for newer kernels.
> 
> During Sx->S0 transitions, the interconnect between the MAC and PHY on
> 82577/82578 can remain in SMBus mode instead of transitioning to the
> PCIe-like mode required during normal operation.  Toggling the
> LANPHYPC
> Value bit essentially resets the interconnect forcing it to the
> correct
> mode.
> 
> after review of all intel drivers, found several instances where
> drivers had the incorrect pattern of:
> memory mapped write();
> delay();
> 
> which should always be:
> memory mapped write();
> write flush(); /* aka memory mapped read */
> delay();
> 
> explanation:
> The reason for including the flush is that writes can be held
> (posted) in PCI/PCIe bridges, but the read always has to complete
> synchronously and therefore has to flush all pending writes to a
> device.  If a write is held and followed by a delay, the delay
> means nothing because the write may not have reached hardware
> (maybe even not until the next read)
> 
> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
> ---
>  drivers/net/e1000e/defines.h |  2 ++
>  drivers/net/e1000e/ich8lan.c | 20 ++++++++++++++++++++
>  2 files changed, 22 insertions(+)

To be clear, Zhu is wanting this applied to stable trees (yet did not CC
stable@vger.kernel.org ).

Willy- I am fine with this series being applied to stable.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 1/1] net: fec: Fix NAPI race
From: Russell King - ARM Linux @ 2014-12-16 11:41 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: Fugang Duan, David S. Miller, netdev@vger.kernel.org,
	Estevam Fabio-R49496, Ben Hutchings, Stephen Hemminger,
	robert.daniels, Marek Vašut
In-Reply-To: <CAOMZO5D3cysOauyptcjCDM6sjK7ZEiAY==xYu-qqHbwFreoV8g@mail.gmail.com>

On Tue, Dec 16, 2014 at 09:33:53AM -0200, Fabio Estevam wrote:
> Hi Fugang,
> 
> On Tue, Dec 16, 2014 at 8:25 AM, Fugang Duan <b38611@freescale.com> wrote:
> > Do camera capture test on i.MX6q sabresd board, and save the capture data to
> > nfs rootfs. The command is:
> > gst-launch-1.0 -e imxv4l2src device=/dev/video1 num-buffers=2592000 ! tee name=t !
> > queue ! imxv4l2sink sync=false t. ! queue ! vpuenc ! queue ! mux. pulsesrc num-buffers=3720937
> > blocksize=4096 ! 'audio/x-raw, rate=44100, channels=2' ! queue ! imxmp3enc ! mpegaudioparse !
> > queue ! mux. qtmux name=mux ! filesink location=video_recording_long.mov
> >
> > After about 10 hours running, there have net watchdog timeout kernel dump:
> > ...
> > WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x2b4/0x2d8()
> > NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
> 
> Adding more people who reported similar issues in the past.
> 
> Marek,
> 
> Does this patch solve the problem you reported at
> http://www.spinics.net/lists/netdev/msg268167.html ?

My set of patches fixed stuff exactly like this...

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* Re: [PATCH net-next 1/1] net: fec: Fix NAPI race
From: Fabio Estevam @ 2014-12-16 11:33 UTC (permalink / raw)
  To: Fugang Duan
  Cc: David S. Miller, netdev@vger.kernel.org, Estevam Fabio-R49496,
	Ben Hutchings, Stephen Hemminger, robert.daniels,
	Marek Vašut, Russell King
In-Reply-To: <1418725558-17287-1-git-send-email-b38611@freescale.com>

Hi Fugang,

On Tue, Dec 16, 2014 at 8:25 AM, Fugang Duan <b38611@freescale.com> wrote:
> Do camera capture test on i.MX6q sabresd board, and save the capture data to
> nfs rootfs. The command is:
> gst-launch-1.0 -e imxv4l2src device=/dev/video1 num-buffers=2592000 ! tee name=t !
> queue ! imxv4l2sink sync=false t. ! queue ! vpuenc ! queue ! mux. pulsesrc num-buffers=3720937
> blocksize=4096 ! 'audio/x-raw, rate=44100, channels=2' ! queue ! imxmp3enc ! mpegaudioparse !
> queue ! mux. qtmux name=mux ! filesink location=video_recording_long.mov
>
> After about 10 hours running, there have net watchdog timeout kernel dump:
> ...
> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x2b4/0x2d8()
> NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out

Adding more people who reported similar issues in the past.

Marek,

Does this patch solve the problem you reported at
http://www.spinics.net/lists/netdev/msg268167.html ?

^ permalink raw reply

* [PATCH net] net/mlx4: Cache line CQE/EQE stride fixes
From: Amir Vadai @ 2014-12-16 11:28 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Amir Vadai, Yevgeny Petrilin, Or Gerlitz, clsoto,
	Ido Shamay, Wei Yang

From: Ido Shamay <idos@mellanox.com>

This commit contains 2 fixes for the 128B CQE/EQE stride feaure.
Wei found that mlx4_QUERY_HCA function marked the wrong capability
in flags (64B CQE/EQE), when CQE/EQE stride feature was enabled.
Also added small fix in initial CQE ownership bit assignment, when CQE
is size is not default 32B.

Fixes: 77507aa24 (net/mlx4: Enable CQE/EQE stride support)
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
Dave Hi,

Please pull this patch also to stable (at least 3.17)

Thanks,
Amir

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 11 +++++++++--
 drivers/net/ethernet/mellanox/mlx4/fw.c        |  4 ++--
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6ff214d..190cbd9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1569,8 +1569,15 @@ int mlx4_en_start_port(struct net_device *dev)
 			mlx4_en_free_affinity_hint(priv, i);
 			goto cq_err;
 		}
-		for (j = 0; j < cq->size; j++)
-			cq->buf[j].owner_sr_opcode = MLX4_CQE_OWNER_MASK;
+
+		for (j = 0; j < cq->size; j++) {
+			struct mlx4_cqe *cqe = NULL;
+
+			cqe = mlx4_en_get_cqe(cq->buf, j, priv->cqe_size) +
+			      priv->cqe_factor;
+			cqe->owner_sr_opcode = MLX4_CQE_OWNER_MASK;
+		}
+
 		err = mlx4_en_set_cq_moder(priv, cq);
 		if (err) {
 			en_err(priv, "Failed setting cq moderation parameters\n");
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index ef3b95b..32c848d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -1848,8 +1848,8 @@ int mlx4_QUERY_HCA(struct mlx4_dev *dev,
 	/* CX3 is capable of extending CQEs\EQEs to strides larger than 64B */
 	MLX4_GET(byte_field, outbox, INIT_HCA_EQE_CQE_STRIDE_OFFSET);
 	if (byte_field) {
-		param->dev_cap_enabled |= MLX4_DEV_CAP_64B_EQE_ENABLED;
-		param->dev_cap_enabled |= MLX4_DEV_CAP_64B_CQE_ENABLED;
+		param->dev_cap_enabled |= MLX4_DEV_CAP_EQE_STRIDE_ENABLED;
+		param->dev_cap_enabled |= MLX4_DEV_CAP_CQE_STRIDE_ENABLED;
 		param->cqe_size = 1 << ((byte_field &
 					 MLX4_CQE_SIZE_MASK_STRIDE) + 5);
 		param->eqe_size = 1 << (((byte_field &
-- 
1.8.3.1

^ permalink raw reply related

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16 11:01 UTC (permalink / raw)
  To: Roopa Prabhu, netdev@vger.kernel.org
  Cc: Jamal Hadi Salim, John Fastabend, Jiri Pirko, sfeldma@gmail.com,
	bcrl@kvack.org, tgraf@suug.ch, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <548F88E8.1010703@cumulusnetworks.com>


In my reply (inline) I elaborate on the validity of bridge-less and offloaded-bridge models for L2 switching.

I also discuss the implied necessity of a bridge device for L3 routing and potential issues with the upcoming FIB offloading proposal.

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
> Sent: Tuesday, December 16, 2014 3:21 AM
> To: Arad, Ronen
> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri Pirko;
> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
> >
> >> -----Original Message-----
> >> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
> >> Sent: Tuesday, December 16, 2014 1:28 AM
> >> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
> >> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
> >> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
> >> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >> On 12/15/14 13:36, Arad, Ronen wrote:
> >>>
> >>>> -----Original Message-----
> >>> The behavior of a driver could depend on the presence of a bridge
> >>> and
> >> features such as FDB LEARNING and LEARNING_SYNC.
> >>
> >> Indeed, those are bridge attributes.
> >>
> >>> A switch port driver which is not enslaved to a bridge might need to
> >>> implement VLAN-aware FDB within the driver and report its content to
> >>> user-
> >> space using ndo_fdb_dump.
> >>   >
> >>> A switch port driver which is enslaved to a bridge could do with
> >>> only pass through for static FDB configuration
> >>   > to the HW when LEARNING_SYNC is configured. FDB reporting to
> >> user- space and soft aging are left to the bridge module FDB.
> >>> Such driver, without LEARNING_SYNC could still avoid maintaing
> >>> in-driver
> >> FDB as long as it could dump the HW FDB on demand.
> >>> LEARNING_SYNC also requires periodic updates of freshness
> >>> information
> >> from the driver to the bridge module.
> >>
> >> If you have an fdb - shouldnt that be exposed only if you have a
> >> bridge abstraction exposed? i.e thats where the Linux tools would work.
> > I'm trying to find out what are the opinions of other people in the netdev
> list.
> > John have clearly stated that he'd like to see full L2 switching functionality
> (at least) supported without making a bridge device mandatory.
> > The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that
> with proper setting of SELF/MASTER flags by iproute2.
> > I see the value in supporting both approaches (bridge device mandatory
> > and bridge device optional). If the choice is left to user-driven policy decision,
> > we need to document both use models and map traditional L2 features to
> > each model.
> > The L2 offloading (or NETFUNC as it is currently called), which is being
> > discussed on a different patch-set, is only needed when a bridge device is
> > used.
> > Without a bridge device, all configuration has to be targeted at the switch
> > port driver directly using the SELF flag. FDB remains relevant and it is used to
> > configure static MAC table entries and dump the HW MAC table.

> Your understanding is right here. So far all patches have kept both models in
> mind.


> > When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even
> > higher), there is a gap between what the HW is doing and what is explicitly
> > modeled in Linux.


> Can you elaborate more here ?. We use the linux model to accelerate a
> multi-layer (l2-l3) switch today. There maybe a few gaps, but these gaps can
> be closed by having equivalent functionality in the software path.

What I meant is that without a bridge device the HW switch is seen as a collection of independent switch ports. Typical switch ASIC performs L2 switching by default. This is not expressed explicitly in Linux without a bridge device.
The SELF flag is used to target typical bridge port and bridge configuration at a switch port device.
Without an explicit bridge device, bridge attributes have to be directed at an arbitrary port (any port could represent the entire switch) and interpreted by the switch port driver as intended for the entire switch (this includes attributes like STP etc.)
Each switch port device driver has to implement similar functionality (i.e. all bridge and fdb related ndos) independently without common functionality shared (e.g. FDB, soft aging).
It is a valid use model and could avoid the complexity of having to deal with the presence of both SW and HW bridge and to deal with explicit offloading of data-path.

I was trying to find out whether the intention was to continue and support both bridge-less an offloaded-bridge models and leave it to the end-user to choose the desirable model at configuration time.
This would require dual support in the switch port driver in order to have best user experience across multiple switch ASICs or other kinds of devices.


> 
> > Without a bridge device, the HW is represented by a set of switch port
> > devices and the bridging (both control and data planes) takes place only in
> > the HW and switch port driver.
> > Each switch port driver has to implement its own FDB as there is no
> > common shared code among drivers for different HW devices.
> > Using a bridge device could partially alleviate that, but it comes with a cost.
> > There is a need to properly implement offloading of both configuration and
> > data-path. The transmit and receive path in the bridge module should be
> > somehow bypassed to avoid unnecessary overhead or duplicate packets
> > coming from both software bridging and HW bridging.
> >
> >> What i was refering to was a scenario where i have no interest in the
> >> fdb despite such a hardware capabilities. VLANs is a different issue;
> >>
> > VLAN is fundamental feature of L2 and L3 switching and Linux is unclear
> about it. Bridge device could model bridging of untagged packets which
> requires a bridge device for each VLAN and a vlan device on each port that is
> a member of the bridge's VLAN.
> > This different from the behavior and configuration of classic closed-source
> switches.
> > An alternative model is VLAN filtering where a bridge is VLAN-aware and
> > switches tagged traffic. A bridge device represents multiple L2 domains with
> > VLAN filtering policy that defines the switching rules within each domain.
> And the linux bridge driver supports both models today.
> 
> > Forwarding (e.g. L3 routing) is expected across such L2 domains using L3
> entities.
> > The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN
> > filtering model is yet unclear to me.

> In the vlan filtering bridge model, You can create a vlan device on the bridge
> for l3 ...
> 

That's what I'm thinking too (I experimented with such setup using veth interfaces, bridge device, and vlan interfaces). This, however, seems to require an explicit bridge for L3 support.

Looking at the latest code of FIB offloading (not yet submitted to netdev), I noticed that a switch port device is expected as a lower descendent of the FIB destination device.
This assumption is valid in the per-vlan bridge model where IP address is assigned to the bridge itself.
This, however, is not consistent with the single multi-VLAN bridge model.
Vlan interfaces on a bridge looks like siblings of the switch ports devices on the same bridge. They are not ancestors of the switch ports.
The L3 domain ends at the bridge sub-interfaces. The only L3 entities are the vlan sub-interfaces on the bridge. 
Those are route next hops and the only possible fib_dev.
L3 routing is not aware of the switch ports. Route is performed to next hop addresses on one of the vlan interfaces subnets. The actual resolution to a switch port device has to be performed by the neighbor subsystem (ARP/ND).
It is unclear to me how the FIB offloading will be redirected to an ndo of a switch port device.

> >
> >>>>> Will the decision about using a bridge device or avoiding it be
> >>>>> left to the end-user?
> >>>> Its a user policy decision. Again the offload bit gets us this in a
> >>>> reasonably configurable way IMO.
> >>>>
> >>>>> (This requires switch port drivers to be able to work and provide
> >>>>> similar functionality in both setups).
> >>>> Right, but if the drivers "care" who is calling their ndo ops
> >>>> something is seriously broken. For the driver it should not need to
> >>>> know anything about the callers so it doesn't matter to the driver
> >>>> if its a netlink call from user space or an internal call fro
> >>>> bridge.ko
> >>> LEARNING_SYNC only makes sense when a switch port driver is enslaved
> >>> to
> >> a bridge.
> >>   > Rocker switch driver indeed monitors upper change notifications
> >> and keep track of master bridge presence.
> >>> So bridge presence is not transparent.
> >>>
> >> Agreed - the challenge so far is that people have been fascinated by
> "switch"
> >> point of view. I think we are learning and the class device will
> >> eventually become obvious as useful.
> >>
> >> cheers,
> >> jamal
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org More majordomo
> info
> > at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next 1/1] net: fec: Fix NAPI race
From: Fugang Duan @ 2014-12-16 10:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, R49496, bhutchings, stephen, b38611

Do camera capture test on i.MX6q sabresd board, and save the capture data to
nfs rootfs. The command is:
gst-launch-1.0 -e imxv4l2src device=/dev/video1 num-buffers=2592000 ! tee name=t !
queue ! imxv4l2sink sync=false t. ! queue ! vpuenc ! queue ! mux. pulsesrc num-buffers=3720937
blocksize=4096 ! 'audio/x-raw, rate=44100, channels=2' ! queue ! imxmp3enc ! mpegaudioparse !
queue ! mux. qtmux name=mux ! filesink location=video_recording_long.mov

After about 10 hours running, there have net watchdog timeout kernel dump:
...
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x2b4/0x2d8()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.24-01051-gdb840b7 #440
[<80014e6c>] (unwind_backtrace) from [<800118ac>] (show_stack+0x10/0x14)
[<800118ac>] (show_stack) from [<806ae3f0>] (dump_stack+0x78/0xc0)
[<806ae3f0>] (dump_stack) from [<8002b504>] (warn_slowpath_common+0x68/0x8c)
[<8002b504>] (warn_slowpath_common) from [<8002b558>] (warn_slowpath_fmt+0x30/0x40)
[<8002b558>] (warn_slowpath_fmt) from [<8055e0d4>] (dev_watchdog+0x2b4/0x2d8)
[<8055e0d4>] (dev_watchdog) from [<800352d8>] (call_timer_fn.isra.33+0x24/0x8c)
[<800352d8>] (call_timer_fn.isra.33) from [<800354c4>] (run_timer_softirq+0x184/0x220)
[<800354c4>] (run_timer_softirq) from [<8002f420>] (__do_softirq+0xc0/0x22c)
[<8002f420>] (__do_softirq) from [<8002f804>] (irq_exit+0xa8/0xf4)
[<8002f804>] (irq_exit) from [<8000ee5c>] (handle_IRQ+0x54/0xb4)
[<8000ee5c>] (handle_IRQ) from [<80008598>] (gic_handle_irq+0x28/0x5c)
[<80008598>] (gic_handle_irq) from [<800123c0>] (__irq_svc+0x40/0x74)
Exception stack(0x80d27f18 to 0x80d27f60)
7f00:                                                       80d27f60 0000014c
7f20: 8858c60e 0000004d 884e4540 0000004d ab7250d0 80d34348 00000000 00000000
7f40: 00000001 00000000 00000017 80d27f60 800702a4 80476e6c 600f0013 ffffffff
[<800123c0>] (__irq_svc) from [<80476e6c>] (cpuidle_enter_state+0x50/0xe0)
[<80476e6c>] (cpuidle_enter_state) from [<80476fa8>] (cpuidle_idle_call+0xac/0x154)
[<80476fa8>] (cpuidle_idle_call) from [<8000f174>] (arch_cpu_idle+0x8/0x44)
[<8000f174>] (arch_cpu_idle) from [<80064c54>] (cpu_startup_entry+0x100/0x158)
[<80064c54>] (cpu_startup_entry) from [<80cd8a9c>] (start_kernel+0x304/0x368)
---[ end trace 09ebd32fb032f86d ]---
...

There might have a race in napi_schedule(), leaving interrupts disabled forever.
After these patch, the case still work more than 40 hours running.

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec_main.c |   19 +++++++------------
 1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 8c5b15e..5c4a8bd 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1558,20 +1558,21 @@ fec_enet_interrupt(int irq, void *dev_id)
 {
 	struct net_device *ndev = dev_id;
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF;
 	uint int_events;
 	irqreturn_t ret = IRQ_NONE;
 
 	int_events = readl(fep->hwp + FEC_IEVENT);
-	writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT);
+	writel(int_events, fep->hwp + FEC_IEVENT);
 	fec_enet_collect_events(fep, int_events);
 
-	if (int_events & napi_mask) {
+	if (fep->work_tx || fep->work_rx) {
 		ret = IRQ_HANDLED;
 
-		/* Disable the NAPI interrupts */
-		writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
-		napi_schedule(&fep->napi);
+		if (napi_schedule_prep(&fep->napi)) {
+			/* Disable the NAPI interrupts */
+			writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
+			__napi_schedule(&fep->napi);
+		}
 	}
 
 	if (int_events & FEC_ENET_MII) {
@@ -1591,12 +1592,6 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	int pkts;
 
-	/*
-	 * Clear any pending transmit or receive interrupts before
-	 * processing the rings to avoid racing with the hardware.
-	 */
-	writel(FEC_ENET_RXF | FEC_ENET_TXF, fep->hwp + FEC_IEVENT);
-
 	pkts = fec_enet_rx(ndev, budget);
 
 	fec_enet_tx(ndev);
-- 
1.7.8

^ permalink raw reply related

* Re: Bug: mv643xxx fails with highmem
From: Russell King - ARM Linux @ 2014-12-16 10:57 UTC (permalink / raw)
  To: fugang.duan@freescale.com
  Cc: David Miller, Fabio.Estevam@freescale.com,
	ezequiel.garcia@free-electrons.com, netdev@vger.kernel.org
In-Reply-To: <BLUPR03MB37305CF1F6797329BD9B183F56C0@BLUPR03MB373.namprd03.prod.outlook.com>

On Tue, Dec 16, 2014 at 02:19:38AM +0000, fugang.duan@freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk> Sent: Tuesday, December 16, 2014 2:05 AM
> > To: Duan Fugang-B38611
> > Cc: David Miller; Estevam Fabio-R49496; ezequiel.garcia@free-
> > electrons.com; netdev@vger.kernel.org
> > Subject: Re: Bug: mv643xxx fails with highmem
> > 
> > On Fri, Dec 12, 2014 at 05:34:01AM +0000, fugang.duan@freescale.com wrote:
> > > I will submit one patch to fix the issue.
> > 
> > There's more bugs in the FEC driver... here's the relevant bits:
> > 
> > static void
> > fec_enet_tx_queue(struct net_device *ndev, u16 queue_id) {
> >         bdp = txq->dirty_tx;
> > 
> >         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> > 
> >         while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> >                 /* current queue is empty */
> >                 if (bdp == txq->cur_tx)
> >                         break;
> > 
> >                 skb = txq->tx_skbuff[index];
> >                 txq->tx_skbuff[index] = NULL;
> >                 if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
> >                         dma_unmap_single(&fep->pdev->dev, bdp-
> > >cbd_bufaddr,
> >                                         bdp->cbd_datlen, DMA_TO_DEVICE);
> >                 bdp->cbd_bufaddr = 0;
> >                 if (!skb) {
> >                         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> >                         continue;
> >                 }
> > ...
> >                 txq->dirty_tx = bdp;
> >                 bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> >         }
> > 
> > Consider the following code path:
> > - we enter this function
> > - get the dirty_tx pointer
> > - move to the next descriptor (which we'll call descriptor A)
> > - next descriptor indicates that TX_READY = 0
> > - bdp != txq->cur_tx
> > - we unmap if needed
> > - we set bdp->cmdbufaddr = 0
> > - assume skb is NULL, so we move to the next descriptor (we'll call this
> > B)
> > - next descriptor _may_ have TX_READY = 1
> > - we break out of the loop, and return
> > 
> > Some time later, we re-enter:
> > - get the dirty_tx pointer
> > - move to the next descriptor (which is descriptor A above)
> > - next descriptor indicates that TX_READY = 0
> > - bdp != txq->cur_tx
> > - we call dma_unmap_single(..., bdp->cbd_bufaddr, which we previously
> > zeroed
> >   - the DMA API debugging complains that FEC is unmapping memory which it
> >     doesn't own
> > 
> > Unfortunately, this does appear to happen - from a paste from Jon
> > Nettleton from iMX6Q:
> > 
> >  32. [   45.033001] unmapping this address 0x0 size 66  33. [   45.037470]
> > ------------[ cut here ]------------  34. [   45.042127] WARNING: CPU: 0
> > PID: 102 at lib/dma-debug.c:1080 check_unmap+0x784/0x9f4()
> > 35. [   45.050066] fec 2188000.ethernet: DMA-API: device driver tries to
> > free DMA memory it has not a]
> > 
> > (where the printk at line 32 is something that was added to debug this.)
> > 
> > The sad thing is that the remainder of my FEC patches did go a long way
> > to clean up these kinds of issues in the driver (and there's /many/ of
> > them), but unfortunately other conflicting changes got merged before I
> > could finish rebasing them, I decided to move on to other things and
> > discard the remainder of my patch set.  Marek showed some interest in
> > taking the patch set over, but I've not heard anything more - and I'm not
> > about to resurect my efforts only to get into the same situation where
> > I'm carrying 50 odd patches which I can't merge back into mainline
> > without spending weeks endlessly rebasing them.
> > 
> Russell, many thanks for your effort and thanks for your pointing out the bug.
> I will think one method to fix it.
> 
> And I have one question for highmem dma mapping issue as below:
> fec_enet_txq_submit_frag_skb(struct fec_enet_priv_tx_q *txq, struct sk_buff *skb, struct net_device *ndev)
> {
> 	...
> 		    bufaddr = page_address(this_frag->page.p) + this_frag->page_offset;
> 
>                 index = fec_enet_get_bd_index(txq->tx_bd_base, bdp, fep);
>                 if (((unsigned long) bufaddr) & fep->tx_align ||
>                         fep->quirks & FEC_QUIRK_SWAP_FRAME) {
>                         memcpy(txq->tx_bounce[index], bufaddr, frag_len);
>                         bufaddr = txq->tx_bounce[index];
> 
>                         if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
>                                 swap_buffer(bufaddr, frag_len);
>                 }
> 
>                 addr = dma_map_single(&fep->pdev->dev, bufaddr, frag_len,
>                                       DMA_TO_DEVICE);
>                 if (dma_mapping_error(&fep->pdev->dev, addr)) {
>                         dev_kfree_skb_any(skb);
>                         if (net_ratelimit())
>                                 netdev_err(ndev, "Tx DMA memory map failed\n");
>                         goto dma_mapping_error;
>                 }
> 	...
> }
>
> If the frag page is located at high memory, use dma_map_single() is not
> right, must use skb_frag_dma_map() or dma_map_page().

Correct.

> But before mapping, if tx has buffer alignment limitation (tx_align is
> not zero), there need to do memcpy for buffer alignment.

Right, and that can be detected by simply checking this_frag->page_offset
as we know that a page address is always aligned to a page.

> So, there we need to check whether the page is in highmem, if so, we
> need to call kmap_atomic() or kmap_high_get() to get cpu address,
> And then do memcpy or swap buffer operation.

Yes - you'd need to do something like this:

	if (this_frag->page_offset & fep->tx_align ||
	    fep->quirks & FEC_QUIRK_SWAP_FRAME) {
		bufaddr = kmap_atomic(this_frag->page.p) + this_frag->page_offset;
		memcpy(txq->tx_bounce[index], bufaddr, frag_len);
		kunmap_atomic(bufaddr);

		bufaddr = txq->tx_bounce[index];
		if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
			swap_buffer(bufaddr, frag_len);
                addr = dma_map_single(&fep->pdev->dev, bufaddr, frag_len,
                                      DMA_TO_DEVICE);
	} else {
		addr = skb_frag_dma_map(&fep->pdev->dev, this_frag, 0,
					frag_len, DMA_TO_DEVICE);
	}

	if (dma_mapping_error(&fep->pdev->dev, addr)) {
		dev_kfree_skb_any(skb);
		if (net_ratelimit())
			netdev_err(ndev, "Tx DMA memory map failed\n");
		goto dma_mapping_error;
	}

You'll also need to record whether you should use dma_unmap_page() or
dma_unmap_single().

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* RE: [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: David Laight @ 2014-12-16 10:48 UTC (permalink / raw)
  To: 'Zhu Yanjun', netdev@vger.kernel.org, w@1wt.eu
  Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher, David S. Miller
In-Reply-To: <1418725700-31465-2-git-send-email-Yanjun.Zhu@windriver.com>

From: Zhu Yanjun
> 2.6.x kernels require a similar logic change as commit 6dfaa76
> [e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
> for newer kernels.
> 
> During Sx->S0 transitions, the interconnect between the MAC and PHY on
> 82577/82578 can remain in SMBus mode instead of transitioning to the
> PCIe-like mode required during normal operation.  Toggling the LANPHYPC
> Value bit essentially resets the interconnect forcing it to the correct
> mode.
> 
> after review of all intel drivers, found several instances where
> drivers had the incorrect pattern of:
> memory mapped write();
> delay();
> 
> which should always be:
> memory mapped write();
> write flush(); /* aka memory mapped read */
> delay();

Probably not only the intel drivers.
I'd bet that a lot of the delay() calls are wrong.

> explanation:
> The reason for including the flush is that writes can be held
> (posted) in PCI/PCIe bridges, but the read always has to complete
> synchronously and therefore has to flush all pending writes to a
> device.  If a write is held and followed by a delay, the delay
> means nothing because the write may not have reached hardware
> (maybe even not until the next read)

It might even be that the 'write flush' (ie a read) guarantees to
generates a long enough delay that the delay() itself isn't needed.

	David

^ permalink raw reply

* Re: [PATCH net] net: ipv6: allow explicitly choosing optimistic addresses
From: Erik Kline @ 2014-12-16 10:38 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: netdev@vger.kernel.org, YOSHIFUJI Hideaki, Hannes Frederic Sowa
In-Reply-To: <CAKD1Yr3+S81mKbVdQ-P6byx-e2Hfn8GJz+Jr7SLWrRjkfpD2qA@mail.gmail.com>

On Fri, Dec 12, 2014 at 4:37 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> On Fri, Dec 12, 2014 at 12:50 PM, Erik Kline <ek@google.com> wrote:
>>
>> @@ -1527,7 +1527,8 @@ int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
>>                 if (!net_eq(dev_net(ifp->idev->dev), net))
>>                         continue;
>>                 if (ipv6_addr_equal(&ifp->addr, addr) &&
>> -                   !(ifp->flags&IFA_F_TENTATIVE) &&
>> +                   (!(ifp->flags&IFA_F_TENTATIVE) ||
>> +                    ifp->flags&IFA_F_OPTIMISTIC) &&
>>                     (dev == NULL || ifp->idev->dev == dev ||
>>                      !(ifp->scope&(IFA_LINK|IFA_HOST) || strict))) {
>>                         rcu_read_unlock_bh();
>
> I looked at the callers of ipv6_chk_addr (lxr finds 19 files); from
> what I saw, this change will make all of them more correct except for
> ndisc_solicit. With this change, ndisc_solicit could now send
> neighbour solicitations from optimistic addresses, which is not
> allowed by the RFC.

True.  I have a version that fixes this, but ended up splitting
ipv6_chk_addr into two:

    - ipv6_chk_addr (same as today), and
    - ipv6_chk_addr_and_flags (which takes a "u32 banned_flags" argument)

The ugliness of this naming is not lost on me.  I briefly tried a
version that reuses the "strict" argument to achieve the same thing,
but that doesn't seem to be quite correct in all cases either.

^ permalink raw reply

* [PATCH 5/5] e1000e: cleanup use of check_reset_block function pointer
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 44abd5c 
[e1000e: cleanup use of check_reset_block function pointer] introduces
for newer kernels.

Replace e1000_check_reset_block() inline function with calls to the PHY ops
check_reset_block function pointer.

[yanjun.zhu: only modifications in function e1000_init_phy_params_pchlan will
be backported. Others remain unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index 0da2c2c..732cd48 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,7 +280,7 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
-	if (!e1000_check_reset_block(hw)) {
+	if (!hw->phy.ops.check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1

^ permalink raw reply related

* [PATCH 4/5] e1000e: update workaround for 82579 intermittently disabled during S0->Sx
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 90b8298 
[e1000e: update workaround for 82579 intermittently disabled during S0->Sx]
introduces for newer kernels.

The workaround which toggles the LANPHYPC (LAN PHY Power Control) value bit
to force the MAC-Phy interconnect into PCIe mode from SMBus mode during
driver load and resume should always be done except if PHY resets are
blocked by the Manageability Engine (ME).  Previously, the toggle was done
only if PHY resets are blocked and the ME was disabled.

The rest of the patch is just indentation changes as a consequence of the
updated workaround.

[yanjun.zhu: indentation changes are removed.
function e1000_init_phy_workarounds_pchlan does not exist]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index 8c7e4aa..0da2c2c 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,8 +280,7 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;

-	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID) && 
-		!e1000_check_reset_block(hw)) {
+	if (!e1000_check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1

^ permalink raw reply related

* [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 6cc7aae 
[e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked]
introduces for newer kernels.

When PHY reset is intentionally blocked on 82577/8/9, do not toggle the
LANPHYPC value bit (essentially performing a hard power reset of the
device) otherwise the PHY can be put into an unknown state.

Cleanup whitespace in the same function.

[yanjun.zhu: whitespace remains unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index c4b2d15..8c7e4aa 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,7 +280,8 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
-	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID) && 
+		!e1000_check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1

^ permalink raw reply related

* [PATCH 2/5] e1000e: workaround EEPROM configuration change on 82579 on kernel 2.6.x
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 62bc813 
[e1000e: workaround EEPROM configuration change on 82579] introduces
for newer kernels.

An update to the EEPROM on 82579 will extend a delay in hardware to fix an
issue with WoL not working after a G3->S5 transition which is unrelated to
the driver.  However, this extended delay conflicts with nominal operation
of the device when it is initialized by the driver and after every reset
of the hardware (i.e. the driver starts configuring the device before the
hardware is done with it's own configuration work).  The workaround for
when the driver is in control of the device is to tell the hardware after
every reset the configuration delay should be the original shorter one.

Some pre-existing variables are renamed generically to be re-used with
new register accesses.

[e1000_toggle_lanphypc_value_ich8lan does not exist. Its implementations
exist in e1000_init_phy_params_pchlan. Renamed variables remain unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/hw.h      |  1 +
 drivers/net/e1000e/ich8lan.c | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/e1000e/hw.h b/drivers/net/e1000e/hw.h
index 11f3b7c..b055d78 100644
--- a/drivers/net/e1000e/hw.h
+++ b/drivers/net/e1000e/hw.h
@@ -60,6 +60,7 @@ enum e1e_registers {
 	E1000_FEXTNVM  = 0x00028, /* Future Extended NVM - RW */
 	E1000_FCT      = 0x00030, /* Flow Control Type - RW */
 	E1000_VET      = 0x00038, /* VLAN Ether Type - RW */
+	E1000_FEXTNVM3 = 0x0003C, /* Future Extended NVM 3 - RW */
 	E1000_ICR      = 0x000C0, /* Interrupt Cause Read - R/clr */
 	E1000_ITR      = 0x000C4, /* Interrupt Throttling Rate - RW */
 	E1000_ICS      = 0x000C8, /* Interrupt Cause Set - WO */
diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index 020657c..c4b2d15 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -108,6 +108,9 @@
 #define E1000_FEXTNVM_SW_CONFIG		1
 #define E1000_FEXTNVM_SW_CONFIG_ICH8M (1 << 27) /* Bit redefined for ICH8M :/ */
 
+#define E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK    0x0C000000
+#define E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC  0x08000000
+
 #define PCIE_ICH8_SNOOP_ALL		PCIE_NO_SNOOP_ALL
 
 #define E1000_ICH_RAR_ENTRIES		7
@@ -278,6 +281,12 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
 	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+		/*Set Phy Config Counter to 50msec */
+		ctrl = er32(FEXTNVM3);
+		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
+		ctrl |= E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC;
+		ew32(FEXTNVM3, ctrl);
+
 		/*
 		 * The MAC-PHY interconnect may still be in SMBus mode
 		 * after Sx->S0.  Toggle the LANPHYPC Value bit to force
@@ -2685,6 +2694,14 @@ static s32 e1000_reset_hw_ich8lan(struct e1000_hw *hw)
 	ew32(CTRL, (ctrl | E1000_CTRL_RST));
 	msleep(20);
 
+	/* Set Phy Config Counter to 50msec */
+	if (hw->mac.type == e1000_pch2lan) {
+		u32 phycc_reg = er32(FEXTNVM3);
+		phycc_reg &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
+		phycc_reg |= E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC;
+		ew32(FEXTNVM3, phycc_reg);
+	}
+
 	if (!ret_val)
 		e1000_release_swflag_ich8lan(hw);
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000
  Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher, David S. Miller
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 6dfaa76 
[e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
for newer kernels.

During Sx->S0 transitions, the interconnect between the MAC and PHY on
82577/82578 can remain in SMBus mode instead of transitioning to the
PCIe-like mode required during normal operation.  Toggling the LANPHYPC
Value bit essentially resets the interconnect forcing it to the correct
mode.

after review of all intel drivers, found several instances where
drivers had the incorrect pattern of:
memory mapped write();
delay();

which should always be:
memory mapped write();
write flush(); /* aka memory mapped read */
delay();

explanation:
The reason for including the flush is that writes can be held
(posted) in PCI/PCIe bridges, but the read always has to complete
synchronously and therefore has to flush all pending writes to a
device.  If a write is held and followed by a delay, the delay
means nothing because the write may not have reached hardware
(maybe even not until the next read)

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/defines.h |  2 ++
 drivers/net/e1000e/ich8lan.c | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/e1000e/defines.h b/drivers/net/e1000e/defines.h
index 1190167..52283a6 100644
--- a/drivers/net/e1000e/defines.h
+++ b/drivers/net/e1000e/defines.h
@@ -214,6 +214,8 @@
 #define E1000_CTRL_SPD_1000 0x00000200  /* Force 1Gb */
 #define E1000_CTRL_FRCSPD   0x00000800  /* Force Speed */
 #define E1000_CTRL_FRCDPX   0x00001000  /* Force Duplex */
+#define E1000_CTRL_LANPHYPC_OVERRIDE 0x00010000 /* SW control of LANPHYPC */
+#define E1000_CTRL_LANPHYPC_VALUE    0x00020000 /* SW value of LANPHYPC */
 #define E1000_CTRL_SWDPIN0  0x00040000  /* SWDPIN 0 value */
 #define E1000_CTRL_SWDPIN1  0x00080000  /* SWDPIN 1 value */
 #define E1000_CTRL_SWDPIO0  0x00400000  /* SWDPIN 0 Input or output */
diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index de39f9a..020657c 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -88,6 +88,8 @@
 
 
 #define E1000_ICH_FWSM_RSPCIPHY	0x00000040 /* Reset PHY on PCI Reset */
+/* FW established a valid mode */ 
+#define E1000_ICH_FWSM_FW_VALID                0x00008000
 
 #define E1000_ICH_MNG_IAMT_MODE		0x2
 
@@ -260,6 +262,7 @@ static inline void __ew32flash(struct e1000_hw *hw, unsigned long reg, u32 val)
 static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 {
 	struct e1000_phy_info *phy = &hw->phy;
+	u32 ctrl;
 	s32 ret_val = 0;
 
 	phy->addr                     = 1;
@@ -274,6 +277,23 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
+	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+		/*
+		 * The MAC-PHY interconnect may still be in SMBus mode
+		 * after Sx->S0.  Toggle the LANPHYPC Value bit to force
+		 * the interconnect to PCIe mode, but only if there is no
+		 * firmware present otherwise firmware will have done it.
+		*/
+		ctrl = er32(CTRL);
+		ctrl |=  E1000_CTRL_LANPHYPC_OVERRIDE;
+		ctrl &= ~E1000_CTRL_LANPHYPC_VALUE;
+		ew32(CTRL, ctrl);
+		e1e_flush();
+		udelay(10);
+		ctrl &= ~E1000_CTRL_LANPHYPC_OVERRIDE;
+		ew32(CTRL, ctrl);
+		msleep(50);
+	}
 	/*
 	 * Reset the PHY before any acccess to it.  Doing so, ensures that
 	 * the PHY is in a known good state before we read/write PHY registers.
-- 
1.9.1

^ permalink raw reply related

* [PATCH V2 0/5] e1000e: fix nic not boot after rebooting
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun

With kernel 2.6.x, e1000e with 82577/8/9 sometimes will not boot
after rebooting. 

If a kernel 2.6.x board with 82577/8/9 e1000e nic is rebooted for 
100 times, there are 7~8 times that 82577/8/9 e1000e nic will not boot 
normally.

V2:
Follow the advice from Willy, the wrong upstream commit IDs in these
5 patches commit messages are fixed.

Zhu Yanjun (5):
  e1000e: reset MAC-PHY interconnect on 82577/82578
  e1000e: workaround EEPROM configuration change on 82579 on kernel
    2.6.x
  e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
  e1000e: update workaround for 82579 intermittently disabled during
    S0->Sx
  e1000e: cleanup use of check_reset_block function pointer

 drivers/net/e1000e/defines.h |  2 ++
 drivers/net/e1000e/hw.h      |  1 +
 drivers/net/e1000e/ich8lan.c | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

-- 
1.9.1

^ permalink raw reply

* Re: [PATCH net 1/2] net: dsa: handle non-existing PHYs on switch internal bus
From: Andrew Lunn @ 2014-12-16 10:24 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, computersforpeace, andrey.volkov, Andrew Lunn
In-Reply-To: <1418330956-17151-2-git-send-email-f.fainelli@gmail.com>

On Thu, Dec 11, 2014 at 12:49:15PM -0800, Florian Fainelli wrote:
> In case there is no PHY at the designated address on the internal
> switch, we would basically de-reference a null pointer here:
> 
> dsa_slave_phy_setup(...)
> {
> 	p->phy = ds->slave_mii_bus->phy_map[p->port];
> 	phy_connect_direct(slave_dev, p->phy, dsa_slave_adjust_link,
> 				      ^------
> 
> This can be triggered when the platform configuration (platform_data or
> Device Tree) indicates there should be a PHY device at this address, but
> the HW is non-responsive, such that we cannot attach a PHY device at
> this specific location.
> 
> Fix this by checking the return value prior to calling
> phy_connect_direct().
> 
> CC: Andrew Lunn <andrew@lunn.ch>
> Fixes: b31f65fb4383 ("net: dsa: slave: Fix autoneg for phys on switch MDIO bus")
> Reported-by: Brian Norris <computersforpeace@gmail.com>
> Signed-off-by: Andrey Volkov <andrey.volkov@nexvision.fr>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Acked-by: Andrew Lunn <andrew@lunn.ch>

	  Andrew


> ---
>  net/dsa/slave.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index ab03e00ffe8f..0efaab1e408b 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -491,6 +491,9 @@ static void dsa_slave_phy_setup(struct dsa_slave_priv *p,
>  	 */
>  	if (!p->phy) {
>  		p->phy = ds->slave_mii_bus->phy_map[p->port];
> +		if (!p->phy)
> +			return;
> +
>  		phy_connect_direct(slave_dev, p->phy, dsa_slave_adjust_link,
>  				   p->phy_interface);
>  	} else {
> -- 
> 2.1.0
> 

^ permalink raw reply

* Re: [PATCH net-next v9 0/3] add hisilicon hip04 ethernet driver
From: Ding Tianhong @ 2014-12-16 10:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, zhangfei.gao, davem, linux, f.fainelli,
	sergei.shtylyov, mark.rutland, David.Laight, eric.dumazet, xuwei5,
	netdev, devicetree
In-Reply-To: <26790181.hAWQCCSkRa@wuerfel>

On 2014/12/16 16:54, Arnd Bergmann wrote:
> On Tuesday 16 December 2014 15:57:21 Ding Tianhong wrote:
>> On 2014/12/15 21:29, Arnd Bergmann wrote:
>>> On Thursday 11 December 2014 19:42:27 Ding Tianhong wrote:
>>> @@ -381,57 +392,37 @@ static void hip04_tx_reclaim(struct net_device *ndev, bool force)
>>>  		dev_kfree_skb(priv->tx_skb[tx_tail]);
>>>  		priv->tx_skb[tx_tail] = NULL;
>>>  		tx_tail = TX_NEXT(tx_tail);
>>> -		priv->tx_count--;
>>> -
>>> -		if (priv->tx_count <= 0)
>>> -			break;
>>> +		count--;
>>>  	}
>>>  
> ...
>>> -	queue_delayed_work(priv->wq, &priv->tx_clean_task, delta_in_ticks);
>>> +	return count;
>>
>> I think should return pkts_compl, because may break from the loop, the
>> pkts_compl may smaller than count.
> 
> The calling convention I used is to return the packets that are remaining
> on the queue. Only if that is nonzero we need to reschedule the timer.
> 

OK, agree.

>> and we need to add netif_tx_lock() to protect this function to avoid concurrency conflict.
> 
> Oh, did I miss something? The idea was that the start_xmit function only updates
> the tx_head pointer and reads the tx_tail, while the tx_reclaim function does
> the reverse, and writes to a different cache line, in order to allow a lockless
> queue traversal.
> 
> Can you point to a specific struct member that still need to be protected by
> the lock? Did I miss a race that would allow both functions to exit with
> the timer disabled and entries left on the queue?
> 
OK, got it, no problem.

>>> @@ -623,8 +648,6 @@ static int hip04_mac_stop(struct net_device *ndev)
>>>  	struct hip04_priv *priv = netdev_priv(ndev);
>>>  	int i;
>>>  
>>> -	cancel_delayed_work_sync(&priv->tx_clean_task);
>>> -
>> I think we should cancle the hrtimer when closed and queue the timer when open.
> 
> I was expecting that force-cleaning up the tx queue would be enough for that.
> It it not?
> 
> I suppose it can't hurt to cancel the timer here anyway, and maybe use
> WARN_ON() if it's still active.
> 

Ok, I found no need to worry about this, when the dev is closed, the napi will disable and will not enter timer again.

> Starting the timer after opening seems wrong though: at that point there are
> no packets on the queue yet. The timer should always start ticking at the
> exact point when the first packet is put on the queue while the timer is
> not already pending.
> 
Ok.

>>>  	napi_disable(&priv->napi);
>>>  	netif_stop_queue(ndev);
>>>  	hip04_mac_disable(ndev);
>>> @@ -725,6 +748,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
>>>  	struct hip04_priv *priv;
>>>  	struct resource *res;
>>>  	unsigned int irq;
>>> +	ktime_t txtime;
>>>  	int ret;
>>>  
>>>  	ndev = alloc_etherdev(sizeof(struct hip04_priv));
>>> @@ -751,6 +775,21 @@ static int hip04_mac_probe(struct platform_device *pdev)
>>>  	priv->port = arg.args[0];
>>>  	priv->chan = arg.args[1] * RX_DESC_NUM;
>>>  
>>> +	hrtimer_init(&priv->tx_coalesce_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>>> +
>>> +	/*
>>> +	 * BQL will try to keep the TX queue as short as possible, but it can't
>>> +	 * be faster than tx_coalesce_usecs, so we need a fast timeout here,
>>> +	 * but also long enough to gather up enough frames to ensure we don't
>>> +	 * get more interrupts than necessary.
>>> +	 * 200us is enough for 16 frames of 1500 bytes at gigabit ethernet rate
>>> +	 */
>>> +	priv->tx_coalesce_frames = TX_DESC_NUM * 3 / 4;
>>> +	priv->tx_coalesce_usecs = 200;
>>> +	/* allow timer to fire after half the time at the earliest */
>>> +	txtime = ktime_set(0, priv->tx_coalesce_usecs * NSEC_PER_USEC / 2);
>>> +	hrtimer_set_expires_range(&priv->tx_coalesce_timer, txtime, txtime);
>>> +
>>
>> I think miss the line:
>>  priv->tx_coalesce_timer.function = tx_done;
> 
> Yes, good point.
> 

I will send v10 when the net-next open again, and these days will test this driver, thanks a lot.

Ding

> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> .
> 

^ permalink raw reply

* Re: [PATCH net-next v9 0/3] add hisilicon hip04 ethernet driver
From: Arnd Bergmann @ 2014-12-16  8:54 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: linux-arm-kernel, zhangfei.gao, davem, linux, f.fainelli,
	sergei.shtylyov, mark.rutland, David.Laight, eric.dumazet, xuwei5,
	netdev, devicetree
In-Reply-To: <548FE5E1.1090300@huawei.com>

On Tuesday 16 December 2014 15:57:21 Ding Tianhong wrote:
> On 2014/12/15 21:29, Arnd Bergmann wrote:
> > On Thursday 11 December 2014 19:42:27 Ding Tianhong wrote:
> > @@ -381,57 +392,37 @@ static void hip04_tx_reclaim(struct net_device *ndev, bool force)
> >  		dev_kfree_skb(priv->tx_skb[tx_tail]);
> >  		priv->tx_skb[tx_tail] = NULL;
> >  		tx_tail = TX_NEXT(tx_tail);
> > -		priv->tx_count--;
> > -
> > -		if (priv->tx_count <= 0)
> > -			break;
> > +		count--;
> >  	}
> >  
...
> > -	queue_delayed_work(priv->wq, &priv->tx_clean_task, delta_in_ticks);
> > +	return count;
> 
> I think should return pkts_compl, because may break from the loop, the
> pkts_compl may smaller than count.

The calling convention I used is to return the packets that are remaining
on the queue. Only if that is nonzero we need to reschedule the timer.

> and we need to add netif_tx_lock() to protect this function to avoid concurrency conflict.

Oh, did I miss something? The idea was that the start_xmit function only updates
the tx_head pointer and reads the tx_tail, while the tx_reclaim function does
the reverse, and writes to a different cache line, in order to allow a lockless
queue traversal.

Can you point to a specific struct member that still need to be protected by
the lock? Did I miss a race that would allow both functions to exit with
the timer disabled and entries left on the queue?

> > @@ -623,8 +648,6 @@ static int hip04_mac_stop(struct net_device *ndev)
> >  	struct hip04_priv *priv = netdev_priv(ndev);
> >  	int i;
> >  
> > -	cancel_delayed_work_sync(&priv->tx_clean_task);
> > -
> I think we should cancle the hrtimer when closed and queue the timer when open.

I was expecting that force-cleaning up the tx queue would be enough for that.
It it not?

I suppose it can't hurt to cancel the timer here anyway, and maybe use
WARN_ON() if it's still active.

Starting the timer after opening seems wrong though: at that point there are
no packets on the queue yet. The timer should always start ticking at the
exact point when the first packet is put on the queue while the timer is
not already pending.

> >  	napi_disable(&priv->napi);
> >  	netif_stop_queue(ndev);
> >  	hip04_mac_disable(ndev);
> > @@ -725,6 +748,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
> >  	struct hip04_priv *priv;
> >  	struct resource *res;
> >  	unsigned int irq;
> > +	ktime_t txtime;
> >  	int ret;
> >  
> >  	ndev = alloc_etherdev(sizeof(struct hip04_priv));
> > @@ -751,6 +775,21 @@ static int hip04_mac_probe(struct platform_device *pdev)
> >  	priv->port = arg.args[0];
> >  	priv->chan = arg.args[1] * RX_DESC_NUM;
> >  
> > +	hrtimer_init(&priv->tx_coalesce_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +
> > +	/*
> > +	 * BQL will try to keep the TX queue as short as possible, but it can't
> > +	 * be faster than tx_coalesce_usecs, so we need a fast timeout here,
> > +	 * but also long enough to gather up enough frames to ensure we don't
> > +	 * get more interrupts than necessary.
> > +	 * 200us is enough for 16 frames of 1500 bytes at gigabit ethernet rate
> > +	 */
> > +	priv->tx_coalesce_frames = TX_DESC_NUM * 3 / 4;
> > +	priv->tx_coalesce_usecs = 200;
> > +	/* allow timer to fire after half the time at the earliest */
> > +	txtime = ktime_set(0, priv->tx_coalesce_usecs * NSEC_PER_USEC / 2);
> > +	hrtimer_set_expires_range(&priv->tx_coalesce_timer, txtime, txtime);
> > +
> 
> I think miss the line:
>  priv->tx_coalesce_timer.function = tx_done;

Yes, good point.

	Arnd

^ permalink raw reply

* Re: [PATCH net-next v9 0/3] add hisilicon hip04 ethernet driver
From: Ding Tianhong @ 2014-12-16  7:57 UTC (permalink / raw)
  To: Arnd Bergmann, linux-arm-kernel
  Cc: zhangfei.gao, davem, linux, f.fainelli, sergei.shtylyov,
	mark.rutland, David.Laight, eric.dumazet, xuwei5, netdev,
	devicetree
In-Reply-To: <2708715.thQWqYuW6a@wuerfel>

On 2014/12/15 21:29, Arnd Bergmann wrote:
> On Thursday 11 December 2014 19:42:27 Ding Tianhong wrote:
>> v9:
>> - There is no tx completion interrupts to free DMAd Tx packets, it means taht
>>   we rely on new tx packets arriving to run the destructors of completed packets,
>>   which open up space in their sockets's send queues. Sometimes we don't get such
>>   new packets causing Tx to stall, a single UDP transmitter is a good example of
>>   this situation, so we need a clean up workqueue to reclaims completed packets,
>>   the workqueue will only free the last packets which is already stay for several jiffies.
>>   Also fix some format cleanups.
> 
> You must have missed the reply in which David Miller explained why
> you can't call skb_orphan and rely on an occasional cleanup of the
> queue. Please use something like the patch below:
> 
> - drop the broken skb_orphan call
> - drop the workqueue
> - batch cleanup based on tx_coalesce_frames/usecs for better throughput
> - use a reasonable default tx timeout (200us, could be shorted
>   based on measurements) with a range timer
> - fix napi poll function return value
> - use a lockless queue for cleanup
> 

Ok, agree, my comments see below.

> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> 
> Feel free to fold this into your patch rather than keeping it separate.
> Completely untested, probably contains some bugs. Please ask if you
> have questions about this.
> 
> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
> index 9d37b670db6a..d85c5287654e 100644
> --- a/drivers/net/ethernet/hisilicon/hip04_eth.c
> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
> @@ -12,6 +12,7 @@
>  #include <linux/etherdevice.h>
>  #include <linux/platform_device.h>
>  #include <linux/interrupt.h>
> +#include <linux/ktime.h>
>  #include <linux/of_address.h>
>  #include <linux/phy.h>
>  #include <linux/of_mdio.h>
> @@ -157,9 +158,11 @@ struct hip04_priv {
>  	struct sk_buff *tx_skb[TX_DESC_NUM];
>  	dma_addr_t tx_phys[TX_DESC_NUM];
>  	unsigned int tx_head;
> -	unsigned int tx_tail;
> -	int tx_count;
> -	unsigned long last_tx;
> +
> +	/* FIXME: make these adjustable through ethtool */
> +	int tx_coalesce_frames;
> +	int tx_coalesce_usecs;
> +	struct hrtimer tx_coalesce_timer;
>  
>  	unsigned char *rx_buf[RX_DESC_NUM];
>  	dma_addr_t rx_phys[RX_DESC_NUM];
> @@ -171,10 +174,15 @@ struct hip04_priv {
>  	struct regmap *map;
>  	struct work_struct tx_timeout_task;
>  
> -	struct workqueue_struct *wq;
> -	struct delayed_work tx_clean_task;
> +	/* written only by tx cleanup */
> +	unsigned int tx_tail ____cacheline_aligned_in_smp;
>  };
>  
> +static inline unsigned int tx_count(unsigned int head, unsigned int tail)
> +{
> +	return (head - tail) % (TX_DESC_NUM - 1);
> +}
> +
>  static void hip04_config_port(struct net_device *ndev, u32 speed, u32 duplex)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
> @@ -351,18 +359,21 @@ static int hip04_set_mac_address(struct net_device *ndev, void *addr)
>  	return 0;
>  }
>  
> -static void hip04_tx_reclaim(struct net_device *ndev, bool force)
> +static int hip04_tx_reclaim(struct net_device *ndev, bool force)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
>  	unsigned tx_tail = priv->tx_tail;
>  	struct tx_desc *desc;
>  	unsigned int bytes_compl = 0, pkts_compl = 0;
> +	unsigned int count;
>  
> -	if (priv->tx_count == 0)
> +	smp_rmb();
> +	count = tx_count(ACCESS_ONCE(priv->tx_head), tx_tail);
> +	if (count == 0)
>  		goto out;
>  
> -	while ((tx_tail != priv->tx_head) || (priv->tx_count == TX_DESC_NUM)) {
> -		desc = &priv->tx_desc[priv->tx_tail];
> +	while (count) {
> +		desc = &priv->tx_desc[tx_tail];
>  		if (desc->send_addr != 0) {
>  			if (force)
>  				desc->send_addr = 0;
> @@ -381,57 +392,37 @@ static void hip04_tx_reclaim(struct net_device *ndev, bool force)
>  		dev_kfree_skb(priv->tx_skb[tx_tail]);
>  		priv->tx_skb[tx_tail] = NULL;
>  		tx_tail = TX_NEXT(tx_tail);
> -		priv->tx_count--;
> -
> -		if (priv->tx_count <= 0)
> -			break;
> +		count--;
>  	}
>  
>  	priv->tx_tail = tx_tail;
> +	smp_wmb(); /* Ensure tx_tail visible to xmit */
>  
> -	/* Ensure tx_tail & tx_count visible to xmit */
> -	smp_mb();
>  out:
> -
>  	if (pkts_compl || bytes_compl)
>  		netdev_completed_queue(ndev, pkts_compl, bytes_compl);
>  
> -	if (unlikely(netif_queue_stopped(ndev)) &&
> -	    (priv->tx_count < TX_DESC_NUM))
> +	if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - 1)))
>  		netif_wake_queue(ndev);
> -}
>  
> -static void hip04_tx_clean_monitor(struct work_struct *work)
> -{
> -	struct hip04_priv *priv = container_of(work, struct hip04_priv,
> -					       tx_clean_task.work);
> -	struct net_device *ndev = priv->ndev;
> -	int delta_in_ticks = msecs_to_jiffies(1000);
> -
> -	if (!time_in_range(jiffies, priv->last_tx,
> -			   priv->last_tx + delta_in_ticks)) {
> -		netif_tx_lock(ndev);
> -		hip04_tx_reclaim(ndev, false);
> -		netif_tx_unlock(ndev);
> -	}
> -	queue_delayed_work(priv->wq, &priv->tx_clean_task, delta_in_ticks);
> +	return count;

I think should return pkts_compl, because may break from the loop, the pkts_compl may smaller than count.
and we need to add netif_tx_lock() to protect this function to avoid concurrency conflict.

>  }
>  
>  static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
>  	struct net_device_stats *stats = &ndev->stats;
> -	unsigned int tx_head = priv->tx_head;
> +	unsigned int tx_head = priv->tx_head, count;
>  	struct tx_desc *desc = &priv->tx_desc[tx_head];
>  	dma_addr_t phys;
>  
> -	if (priv->tx_count >= TX_DESC_NUM) {
> +	smp_rmb();
> +	count = tx_count(tx_head, ACCESS_ONCE(priv->tx_tail));
> +	if (count == (TX_DESC_NUM - 1)) {
>  		netif_stop_queue(ndev);
>  		return NETDEV_TX_BUSY;
>  	}
>  
> -	hip04_tx_reclaim(ndev, false);
> -
>  	phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
>  	if (dma_mapping_error(&ndev->dev, phys)) {
>  		dev_kfree_skb(skb);
> @@ -447,20 +438,33 @@ static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	desc->wb_addr = cpu_to_be32(phys);
>  	skb_tx_timestamp(skb);
>  
> -	/* Don't wait up for transmitted skbs to be freed. */
> -	skb_orphan(skb);
> -
> +	/* FIXME: eliminate this mmio access if xmit_more is set */
>  	hip04_set_xmit_desc(priv, phys);
>  	priv->tx_head = TX_NEXT(tx_head);
> +	count++;
>  	netdev_sent_queue(ndev, skb->len);
>  
>  	stats->tx_bytes += skb->len;
>  	stats->tx_packets++;
> -	priv->tx_count++;
> -	priv->last_tx = jiffies;
>  
> -	/* Ensure tx_head & tx_count update visible to tx reclaim */
> -	smp_mb();
> +	/* Ensure tx_head update visible to tx reclaim */
> +	smp_wmb();
> +
> +	/* queue is getting full, better start cleaning up now */
> +	if (count >= priv->tx_coalesce_frames) {
> +		if (napi_schedule_prep(&priv->napi)) {
> +			/* disable rx interrupt and timer */
> +			priv->reg_inten &= ~(RCV_INT);
> +			writel_relaxed(DEF_INT_MASK & ~RCV_INT,
> +				       priv->base + PPE_INTEN);
> +			hrtimer_cancel(&priv->tx_coalesce_timer);
> +			__napi_schedule(&priv->napi);
> +		}
> +	} else if (!hrtimer_is_queued(&priv->tx_coalesce_timer)) {
> +		/* cleanup not pending yet, start a new timer */
> +		hrtimer_start_expires(&priv->tx_coalesce_timer,
> +				      HRTIMER_MODE_REL);
> +	}
>  
>  	return NETDEV_TX_OK;
>  }
> @@ -477,6 +481,7 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
>  	bool last = false;
>  	dma_addr_t phys;
>  	int rx = 0;
> +	int tx_remaining;
>  	u16 len;
>  	u32 err;
>  
> @@ -513,11 +518,11 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
>  
>  		buf = netdev_alloc_frag(priv->rx_buf_size);
>  		if (!buf)
> -			return -ENOMEM;
> +			goto done;
>  		phys = dma_map_single(&ndev->dev, buf,
>  				      RX_BUF_SIZE, DMA_FROM_DEVICE);
>  		if (dma_mapping_error(&ndev->dev, phys))
> -			return -EIO;
> +			goto done;
>  		priv->rx_buf[priv->rx_head] = buf;
>  		priv->rx_phys[priv->rx_head] = phys;
>  		hip04_set_recv_desc(priv, phys);
> @@ -537,6 +542,11 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
>  	}
>  	napi_complete(napi);
>  done:
> +	/* clean up tx descriptors and start a new timer if necessary */
> +	tx_remaining = hip04_tx_reclaim(ndev, false);
> +	if (rx < budget && tx_remaining)
> +		hrtimer_start_expires(&priv->tx_coalesce_timer, HRTIMER_MODE_REL);
> +
>  	return rx;
>  }
>  
> @@ -547,6 +557,9 @@ static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
>  	struct net_device_stats *stats = &ndev->stats;
>  	u32 ists = readl_relaxed(priv->base + PPE_INTSTS);
>  
> +	if (!ists)
> +		return IRQ_NONE;
> +
>  	writel_relaxed(DEF_INT_MASK, priv->base + PPE_RINT);
>  
>  	if (unlikely(ists & DEF_INT_ERR)) {
> @@ -560,16 +573,32 @@ static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
>  		}
>  	}
>  
> -	if (ists & RCV_INT) {
> +	if (ists & RCV_INT && napi_schedule_prep(&priv->napi)) {
>  		/* disable rx interrupt */
>  		priv->reg_inten &= ~(RCV_INT);
> -		writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> -		napi_schedule(&priv->napi);
> +		writel_relaxed(DEF_INT_MASK & ~RCV_INT, priv->base + PPE_INTEN);
> +		hrtimer_cancel(&priv->tx_coalesce_timer);
> +		__napi_schedule(&priv->napi);
>  	}
>  
>  	return IRQ_HANDLED;
>  }
>  
> +enum hrtimer_restart tx_done(struct hrtimer *hrtimer)
> +{
> +	struct hip04_priv *priv;
> +	priv = container_of(hrtimer, struct hip04_priv, tx_coalesce_timer);
> +
> +	if (napi_schedule_prep(&priv->napi)) {
> +		/* disable rx interrupt */
> +		priv->reg_inten &= ~(RCV_INT);
> +		writel_relaxed(DEF_INT_MASK & ~RCV_INT, priv->base + PPE_INTEN);
> +		__napi_schedule(&priv->napi);
> +	}
> +
> +	return HRTIMER_NORESTART;
> +}
> +
>  static void hip04_adjust_link(struct net_device *ndev)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
> @@ -589,7 +618,6 @@ static int hip04_mac_open(struct net_device *ndev)
>  	priv->rx_head = 0;
>  	priv->tx_head = 0;
>  	priv->tx_tail = 0;
> -	priv->tx_count = 0;
>  	hip04_reset_ppe(priv);
>  
>  	for (i = 0; i < RX_DESC_NUM; i++) {
> @@ -612,9 +640,6 @@ static int hip04_mac_open(struct net_device *ndev)
>  	hip04_mac_enable(ndev);
>  	napi_enable(&priv->napi);
>  
> -	INIT_DELAYED_WORK(&priv->tx_clean_task, hip04_tx_clean_monitor);
> -	queue_delayed_work(priv->wq, &priv->tx_clean_task, 0);
> -
>  	return 0;
>  }
>  
> @@ -623,8 +648,6 @@ static int hip04_mac_stop(struct net_device *ndev)
>  	struct hip04_priv *priv = netdev_priv(ndev);
>  	int i;
>  
> -	cancel_delayed_work_sync(&priv->tx_clean_task);
> -
I think we should cancle the hrtimer when closed and queue the timer when open.

>  	napi_disable(&priv->napi);
>  	netif_stop_queue(ndev);
>  	hip04_mac_disable(ndev);
> @@ -725,6 +748,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
>  	struct hip04_priv *priv;
>  	struct resource *res;
>  	unsigned int irq;
> +	ktime_t txtime;
>  	int ret;
>  
>  	ndev = alloc_etherdev(sizeof(struct hip04_priv));
> @@ -751,6 +775,21 @@ static int hip04_mac_probe(struct platform_device *pdev)
>  	priv->port = arg.args[0];
>  	priv->chan = arg.args[1] * RX_DESC_NUM;
>  
> +	hrtimer_init(&priv->tx_coalesce_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> +	/*
> +	 * BQL will try to keep the TX queue as short as possible, but it can't
> +	 * be faster than tx_coalesce_usecs, so we need a fast timeout here,
> +	 * but also long enough to gather up enough frames to ensure we don't
> +	 * get more interrupts than necessary.
> +	 * 200us is enough for 16 frames of 1500 bytes at gigabit ethernet rate
> +	 */
> +	priv->tx_coalesce_frames = TX_DESC_NUM * 3 / 4;
> +	priv->tx_coalesce_usecs = 200;
> +	/* allow timer to fire after half the time at the earliest */
> +	txtime = ktime_set(0, priv->tx_coalesce_usecs * NSEC_PER_USEC / 2);
> +	hrtimer_set_expires_range(&priv->tx_coalesce_timer, txtime, txtime);
> +

I think miss the line:
 priv->tx_coalesce_timer.function = tx_done;


Regards
Ding
>  	priv->map = syscon_node_to_regmap(arg.np);
>  	if (IS_ERR(priv->map)) {
>  		dev_warn(d, "no syscon hisilicon,hip04-ppe\n");
> @@ -788,12 +827,6 @@ static int hip04_mac_probe(struct platform_device *pdev)
>  		}
>  	}
>  
> -	priv->wq = create_singlethread_workqueue(ndev->name);
> -	if (!priv->wq) {
> -		ret = -ENOMEM;
> -		goto init_fail;
> -	}
> -
>  	INIT_WORK(&priv->tx_timeout_task, hip04_tx_timeout_task);
>  
>  	ether_setup(ndev);
> @@ -848,8 +881,6 @@ static int hip04_remove(struct platform_device *pdev)
>  	free_irq(ndev->irq, ndev);
>  	of_node_put(priv->phy_node);
>  	cancel_work_sync(&priv->tx_timeout_task);
> -	if (priv->wq)
> -		destroy_workqueue(priv->wq);
>  	free_netdev(ndev);
>  
>  	return 0;
> 
> 
> .
> 

^ permalink raw reply

* Re: [PATCH net 0/3] net: broadcom: fix FIXED_PHY dependencies
From: David Miller @ 2014-12-16  5:57 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev
In-Reply-To: <1418666235-3109-1-git-send-email-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Mon, 15 Dec 2014 09:57:12 -0800

> This patch series removes the bogus "select FIXED_PHY if FOO=y" that
> I have been using in GENET, SYSTEMPORT and the SF2 DSA switch
> driver.

Series applied, thanks Florian.

^ permalink raw reply

* Re: errors in alignment changes..
From: Simon Horman @ 2014-12-16  5:33 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: David Miller, mitsuhiro.kimura.kc, netdev
In-Reply-To: <548AD8D0.4040207@cogentembedded.com>

On Fri, Dec 12, 2014 at 03:00:16PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 12/12/2014 7:30 AM, Simon Horman wrote:
> 
> >>From: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
> >>Date: Thu, 11 Dec 2014 01:08:07 +0300
> 
> >>>    I guess we can just do:
> 
> >>>		rxdesc->addr = dma_map_single(...);
> 
> >>Best not to leave a potentially invalid DMA address in a
> >>receive descriptor the chip can potentially fetch and
> >>look at.
> 
> >>That's why I said to put it into a local variable and
> >>check for errors first.
> 
> >Hi Dave,
> 
> >this patch ending up in net is partially my fault.
> 
> >Sergei, do you have time to address David's concerns in relation to this
> >patch?
> 
>    No, I probably don't.
> 
> >If not I would like to suggest reverting it for now.
> 
>    Why? The patch does what it was intended for. Getting rid of
> virt_to_phys() calls is a separate issue, IMO.

Thanks, that is fine by me.

^ permalink raw reply

* RE: Bug: mv643xxx fails with highmem
From: fugang.duan @ 2014-12-16  2:19 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: David Miller, Fabio.Estevam@freescale.com,
	ezequiel.garcia@free-electrons.com, netdev@vger.kernel.org
In-Reply-To: <20141215180430.GK11285@n2100.arm.linux.org.uk>

From: Russell King - ARM Linux <linux@arm.linux.org.uk> Sent: Tuesday, December 16, 2014 2:05 AM
> To: Duan Fugang-B38611
> Cc: David Miller; Estevam Fabio-R49496; ezequiel.garcia@free-
> electrons.com; netdev@vger.kernel.org
> Subject: Re: Bug: mv643xxx fails with highmem
> 
> On Fri, Dec 12, 2014 at 05:34:01AM +0000, fugang.duan@freescale.com wrote:
> > I will submit one patch to fix the issue.
> 
> There's more bugs in the FEC driver... here's the relevant bits:
> 
> static void
> fec_enet_tx_queue(struct net_device *ndev, u16 queue_id) {
>         bdp = txq->dirty_tx;
> 
>         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> 
>         while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
>                 /* current queue is empty */
>                 if (bdp == txq->cur_tx)
>                         break;
> 
>                 skb = txq->tx_skbuff[index];
>                 txq->tx_skbuff[index] = NULL;
>                 if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
>                         dma_unmap_single(&fep->pdev->dev, bdp-
> >cbd_bufaddr,
>                                         bdp->cbd_datlen, DMA_TO_DEVICE);
>                 bdp->cbd_bufaddr = 0;
>                 if (!skb) {
>                         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
>                         continue;
>                 }
> ...
>                 txq->dirty_tx = bdp;
>                 bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
>         }
> 
> Consider the following code path:
> - we enter this function
> - get the dirty_tx pointer
> - move to the next descriptor (which we'll call descriptor A)
> - next descriptor indicates that TX_READY = 0
> - bdp != txq->cur_tx
> - we unmap if needed
> - we set bdp->cmdbufaddr = 0
> - assume skb is NULL, so we move to the next descriptor (we'll call this
> B)
> - next descriptor _may_ have TX_READY = 1
> - we break out of the loop, and return
> 
> Some time later, we re-enter:
> - get the dirty_tx pointer
> - move to the next descriptor (which is descriptor A above)
> - next descriptor indicates that TX_READY = 0
> - bdp != txq->cur_tx
> - we call dma_unmap_single(..., bdp->cbd_bufaddr, which we previously
> zeroed
>   - the DMA API debugging complains that FEC is unmapping memory which it
>     doesn't own
> 
> Unfortunately, this does appear to happen - from a paste from Jon
> Nettleton from iMX6Q:
> 
>  32. [   45.033001] unmapping this address 0x0 size 66  33. [   45.037470]
> ------------[ cut here ]------------  34. [   45.042127] WARNING: CPU: 0
> PID: 102 at lib/dma-debug.c:1080 check_unmap+0x784/0x9f4()
> 35. [   45.050066] fec 2188000.ethernet: DMA-API: device driver tries to
> free DMA memory it has not a]
> 
> (where the printk at line 32 is something that was added to debug this.)
> 
> The sad thing is that the remainder of my FEC patches did go a long way
> to clean up these kinds of issues in the driver (and there's /many/ of
> them), but unfortunately other conflicting changes got merged before I
> could finish rebasing them, I decided to move on to other things and
> discard the remainder of my patch set.  Marek showed some interest in
> taking the patch set over, but I've not heard anything more - and I'm not
> about to resurect my efforts only to get into the same situation where
> I'm carrying 50 odd patches which I can't merge back into mainline
> without spending weeks endlessly rebasing them.
> 
Russell, many thanks for your effort and thanks for your pointing out the bug.
I will think one method to fix it.

And I have one question for highmem dma mapping issue as below:
fec_enet_txq_submit_frag_skb(struct fec_enet_priv_tx_q *txq, struct sk_buff *skb, struct net_device *ndev)
{
	...
		    bufaddr = page_address(this_frag->page.p) + this_frag->page_offset;

                index = fec_enet_get_bd_index(txq->tx_bd_base, bdp, fep);
                if (((unsigned long) bufaddr) & fep->tx_align ||
                        fep->quirks & FEC_QUIRK_SWAP_FRAME) {
                        memcpy(txq->tx_bounce[index], bufaddr, frag_len);
                        bufaddr = txq->tx_bounce[index];

                        if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
                                swap_buffer(bufaddr, frag_len);
                }

                addr = dma_map_single(&fep->pdev->dev, bufaddr, frag_len,
                                      DMA_TO_DEVICE);
                if (dma_mapping_error(&fep->pdev->dev, addr)) {
                        dev_kfree_skb_any(skb);
                        if (net_ratelimit())
                                netdev_err(ndev, "Tx DMA memory map failed\n");
                        goto dma_mapping_error;
                }
	...
}
If the frag page is located at high memory, use dma_map_single() is not right, must use skb_frag_dma_map() or dma_map_page().
But before mapping, if tx has buffer alignment limitation (tx_align is not zero), there need to do memcpy for buffer alignment.
So, there we need to check whether the page is in highmem, if so, we need to call kmap_atomic() or kmap_high_get() to get cpu address,
And then do memcpy or swap buffer operation.

Do you think the above solution is right ? or maybe there have better method to fix it ?

Regards,
Andy

^ permalink raw reply

* Re: [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: yzhu1 @ 2014-12-16  2:08 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Sergei Shtylyov, Zhu Yanjun, netdev@vger.kernel.org, ALLAN, BRUCE,
	KIRSHER, JEFFREY
In-Reply-To: <20141215133335.GC28701@1wt.eu>

Hi, Willy

Thanks for your reply.

I will fix it and send V2.

Best Regards!
Zhu Yanjun
On 12/15/2014 09:33 PM, Willy Tarreau wrote:
> On Mon, Dec 15, 2014 at 01:21:43PM +0000, Zhu, Yanjun wrote:
>> Hi, Willy
>>
>> Thanks for your reply.
>>
>> This patch "[PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked" exists in tag v3.1.
>>
>> Please follow these steps, this patch can be found:
>>
>> 1. git checkout -f v3.1
>>
>> 2. git log -p drivers/net/e1000e/ich8lan.c
>>
>> 3. search "b7d6e335"
>>
>> Then we will find this patch.
> Ah it's because you truncated the commit ID from the right instead of from
> the left. Truncated commit IDs are valid from the left, not from the right.
> In your case, the commit is 6cc7aaed70c96c3933fbacbad582fc79b7d6e335
> ("e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked"), so
> the truncated ID is 6cc7aae, not b7d6e335. It's important to fix that in
> your commit messages so that a "git show" works correctly (it failed for me
> for this precise reason).
>
> Thanks,
> Willy
>

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-16  1:20 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: Jamal Hadi Salim, John Fastabend, netdev@vger.kernel.org,
	Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DA6149@ORSMSX101.amr.corp.intel.com>

On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>
>> -----Original Message-----
>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>> Sent: Tuesday, December 16, 2014 1:28 AM
>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>
>>>> -----Original Message-----
>>> The behavior of a driver could depend on the presence of a bridge and
>> features such as FDB LEARNING and LEARNING_SYNC.
>>
>> Indeed, those are bridge attributes.
>>
>>> A switch port driver which is not enslaved to a bridge might need to
>>> implement VLAN-aware FDB within the driver and report its content to user-
>> space using ndo_fdb_dump.
>>   >
>>> A switch port driver which is enslaved to a bridge could do with only
>>> pass through for static FDB configuration
>>   > to the HW when LEARNING_SYNC is configured. FDB reporting to user-
>> space and soft aging are left to the bridge module FDB.
>>> Such driver, without LEARNING_SYNC could still avoid maintaing in-driver
>> FDB as long as it could dump the HW FDB on demand.
>>> LEARNING_SYNC also requires periodic updates of freshness information
>> from the driver to the bridge module.
>>
>> If you have an fdb - shouldnt that be exposed only if you have a bridge
>> abstraction exposed? i.e thats where the Linux tools would work.
> I'm trying to find out what are the opinions of other people in the netdev list.
> John have clearly stated that he'd like to see full L2 switching functionality (at least) supported without making a bridge device mandatory.
> The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that with proper setting of SELF/MASTER flags by iproute2.
> I see the value in supporting both approaches (bridge device mandatory and bridge device optional). If the choice is left to user-driven policy decision, we need to document both use models and map traditional L2 features to each model.
> The L2 offloading (or NETFUNC as it is currently called), which is being discussed on a different patch-set, is only needed when a bridge device is used.
> Without a bridge device, all configuration has to be targeted at the switch port driver directly using the SELF flag. FDB remains relevant and it is used to configure static MAC table entries and dump the HW MAC table.
Your understanding is right here. So far all patches have kept both 
models in mind.
> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even higher), there is a gap between what the HW is doing and what is explicitly modeled in Linux.
Can you elaborate more here ?. We use the linux model to accelerate a 
multi-layer (l2-l3) switch today. There maybe a few gaps, but these gaps 
can be closed by having equivalent functionality in the software path.

> Without a bridge device, the HW is represented by a set of switch port devices and the bridging (both control and data planes) takes place only in the HW and switch port driver.
> Each switch port driver has to implement its own FDB as there is no common shared code among drivers for different HW devices.
> Using a bridge device could partially alleviate that, but it comes with a cost. There is a need to properly implement offloading of both configuration and data-path. The transmit and receive path in the bridge module should be somehow bypassed to avoid unnecessary overhead or duplicate packets coming from both software bridging and HW bridging.
>
>> What i was refering to was a scenario where i have no interest in the fdb
>> despite such a hardware capabilities. VLANs is a different issue;
>>
> VLAN is fundamental feature of L2 and L3 switching and Linux is unclear about it. Bridge device could model bridging of untagged packets which requires a bridge device for each VLAN and a vlan device on each port that is a member of the bridge's VLAN.
> This different from the behavior and configuration of classic closed-source switches.
> An alternative model is VLAN filtering where a bridge is VLAN-aware and switches tagged traffic. A bridge device represents multiple L2 domains with VLAN filtering policy that defines the switching rules within each domain.
And the linux bridge driver supports both models today.

> Forwarding (e.g. L3 routing) is expected across such L2 domains using L3 entities.
> The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN filtering model is yet unclear to me.
In the vlan filtering bridge model, You can create a vlan device on the 
bridge for l3 ...

>   
>>>>> Will the decision about using a bridge device or avoiding it be left
>>>>> to the end-user?
>>>> Its a user policy decision. Again the offload bit gets us this in a
>>>> reasonably configurable way IMO.
>>>>
>>>>> (This requires switch port drivers to be able to work and provide
>>>>> similar functionality in both setups).
>>>> Right, but if the drivers "care" who is calling their ndo ops
>>>> something is seriously broken. For the driver it should not need to
>>>> know anything about the callers so it doesn't matter to the driver if
>>>> its a netlink call from user space or an internal call fro bridge.ko
>>> LEARNING_SYNC only makes sense when a switch port driver is enslaved to
>> a bridge.
>>   > Rocker switch driver indeed monitors upper change notifications and keep
>> track of master bridge presence.
>>> So bridge presence is not transparent.
>>>
>> Agreed - the challenge so far is that people have been fascinated by "switch"
>> point of view. I think we are learning and the class device will eventually
>> become obvious as useful.
>>
>> cheers,
>> jamal
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16  0:58 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend, netdev@vger.kernel.org
  Cc: Roopa Prabhu, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <548F6E62.1040500@mojatatu.com>

> -----Original Message-----
> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
> Sent: Tuesday, December 16, 2014 1:28 AM
> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/15/14 13:36, Arad, Ronen wrote:
> >
> >
> >> -----Original Message-----
> 
> > The behavior of a driver could depend on the presence of a bridge and
> features such as FDB LEARNING and LEARNING_SYNC.
> 
> Indeed, those are bridge attributes.
> 
> > A switch port driver which is not enslaved to a bridge might need to
> >implement VLAN-aware FDB within the driver and report its content to user-
> space using ndo_fdb_dump.
>  >
> > A switch port driver which is enslaved to a bridge could do with only
> > pass through for static FDB configuration
>  > to the HW when LEARNING_SYNC is configured. FDB reporting to user-
> space and soft aging are left to the bridge module FDB.
> > Such driver, without LEARNING_SYNC could still avoid maintaing in-driver
> FDB as long as it could dump the HW FDB on demand.
> > LEARNING_SYNC also requires periodic updates of freshness information
> from the driver to the bridge module.
> >
> 
> 
> If you have an fdb - shouldnt that be exposed only if you have a bridge
> abstraction exposed? i.e thats where the Linux tools would work.

I'm trying to find out what are the opinions of other people in the netdev list.
John have clearly stated that he'd like to see full L2 switching functionality (at least) supported without making a bridge device mandatory.
The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that with proper setting of SELF/MASTER flags by iproute2.
I see the value in supporting both approaches (bridge device mandatory and bridge device optional). If the choice is left to user-driven policy decision, we need to document both use models and map traditional L2 features to each model. 
The L2 offloading (or NETFUNC as it is currently called), which is being discussed on a different patch-set, is only needed when a bridge device is used.
Without a bridge device, all configuration has to be targeted at the switch port driver directly using the SELF flag. FDB remains relevant and it is used to configure static MAC table entries and dump the HW MAC table.
When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even higher), there is a gap between what the HW is doing and what is explicitly modeled in Linux. Without a bridge device, the HW is represented by a set of switch port devices and the bridging (both control and data planes) takes place only in the HW and switch port driver.
Each switch port driver has to implement its own FDB as there is no common shared code among drivers for different HW devices.
Using a bridge device could partially alleviate that, but it comes with a cost. There is a need to properly implement offloading of both configuration and data-path. The transmit and receive path in the bridge module should be somehow bypassed to avoid unnecessary overhead or duplicate packets coming from both software bridging and HW bridging.

> What i was refering to was a scenario where i have no interest in the fdb
> despite such a hardware capabilities. VLANs is a different issue;
>
VLAN is fundamental feature of L2 and L3 switching and Linux is unclear about it. Bridge device could model bridging of untagged packets which requires a bridge device for each VLAN and a vlan device on each port that is a member of the bridge's VLAN.
This different from the behavior and configuration of classic closed-source switches.
An alternative model is VLAN filtering where a bridge is VLAN-aware and switches tagged traffic. A bridge device represents multiple L2 domains with VLAN filtering policy that defines the switching rules within each domain. Forwarding (e.g. L3 routing) is expected across such L2 domains using L3 entities.
The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN filtering model is yet unclear to me.

> >>> Will the decision about using a bridge device or avoiding it be left
> >>> to the end-user?
> >>
> >> Its a user policy decision. Again the offload bit gets us this in a
> >> reasonably configurable way IMO.
> >>
> >>> (This requires switch port drivers to be able to work and provide
> >>> similar functionality in both setups).
> >>
> >> Right, but if the drivers "care" who is calling their ndo ops
> >> something is seriously broken. For the driver it should not need to
> >> know anything about the callers so it doesn't matter to the driver if
> >> its a netlink call from user space or an internal call fro bridge.ko
> >
> > LEARNING_SYNC only makes sense when a switch port driver is enslaved to
> a bridge.
>  > Rocker switch driver indeed monitors upper change notifications and keep
> track of master bridge presence.
> > So bridge presence is not transparent.
> >
> 
> Agreed - the challenge so far is that people have been fascinated by "switch"
> point of view. I think we are learning and the class device will eventually
> become obvious as useful.
> 
> cheers,
> jamal

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox