Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 6cc7aae 
[e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked]
introduces for newer kernels.

When PHY reset is intentionally blocked on 82577/8/9, do not toggle the
LANPHYPC value bit (essentially performing a hard power reset of the
device) otherwise the PHY can be put into an unknown state.

Cleanup whitespace in the same function.

[yanjun.zhu: whitespace remains unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index c4b2d15..8c7e4aa 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,7 +280,8 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
-	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID) && 
+		!e1000_check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1

^ permalink raw reply related

* [PATCH 4/5] e1000e: update workaround for 82579 intermittently disabled during S0->Sx
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 90b8298 
[e1000e: update workaround for 82579 intermittently disabled during S0->Sx]
introduces for newer kernels.

The workaround which toggles the LANPHYPC (LAN PHY Power Control) value bit
to force the MAC-Phy interconnect into PCIe mode from SMBus mode during
driver load and resume should always be done except if PHY resets are
blocked by the Manageability Engine (ME).  Previously, the toggle was done
only if PHY resets are blocked and the ME was disabled.

The rest of the patch is just indentation changes as a consequence of the
updated workaround.

[yanjun.zhu: indentation changes are removed.
function e1000_init_phy_workarounds_pchlan does not exist]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index 8c7e4aa..0da2c2c 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,8 +280,7 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;

-	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID) && 
-		!e1000_check_reset_block(hw)) {
+	if (!e1000_check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1

^ permalink raw reply related

* [PATCH 5/5] e1000e: cleanup use of check_reset_block function pointer
From: Zhu Yanjun @ 2014-12-16 10:28 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 44abd5c 
[e1000e: cleanup use of check_reset_block function pointer] introduces
for newer kernels.

Replace e1000_check_reset_block() inline function with calls to the PHY ops
check_reset_block function pointer.

[yanjun.zhu: only modifications in function e1000_init_phy_params_pchlan will
be backported. Others remain unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index 0da2c2c..732cd48 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,7 +280,7 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
-	if (!e1000_check_reset_block(hw)) {
+	if (!hw->phy.ops.check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net] net: ipv6: allow explicitly choosing optimistic addresses
From: Erik Kline @ 2014-12-16 10:38 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: netdev@vger.kernel.org, YOSHIFUJI Hideaki, Hannes Frederic Sowa
In-Reply-To: <CAKD1Yr3+S81mKbVdQ-P6byx-e2Hfn8GJz+Jr7SLWrRjkfpD2qA@mail.gmail.com>

On Fri, Dec 12, 2014 at 4:37 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> On Fri, Dec 12, 2014 at 12:50 PM, Erik Kline <ek@google.com> wrote:
>>
>> @@ -1527,7 +1527,8 @@ int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
>>                 if (!net_eq(dev_net(ifp->idev->dev), net))
>>                         continue;
>>                 if (ipv6_addr_equal(&ifp->addr, addr) &&
>> -                   !(ifp->flags&IFA_F_TENTATIVE) &&
>> +                   (!(ifp->flags&IFA_F_TENTATIVE) ||
>> +                    ifp->flags&IFA_F_OPTIMISTIC) &&
>>                     (dev == NULL || ifp->idev->dev == dev ||
>>                      !(ifp->scope&(IFA_LINK|IFA_HOST) || strict))) {
>>                         rcu_read_unlock_bh();
>
> I looked at the callers of ipv6_chk_addr (lxr finds 19 files); from
> what I saw, this change will make all of them more correct except for
> ndisc_solicit. With this change, ndisc_solicit could now send
> neighbour solicitations from optimistic addresses, which is not
> allowed by the RFC.

True.  I have a version that fixes this, but ended up splitting
ipv6_chk_addr into two:

    - ipv6_chk_addr (same as today), and
    - ipv6_chk_addr_and_flags (which takes a "u32 banned_flags" argument)

The ugliness of this naming is not lost on me.  I briefly tried a
version that reuses the "strict" argument to achieve the same thing,
but that doesn't seem to be quite correct in all cases either.

^ permalink raw reply

* RE: [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: David Laight @ 2014-12-16 10:48 UTC (permalink / raw)
  To: 'Zhu Yanjun', netdev@vger.kernel.org, w@1wt.eu
  Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher, David S. Miller
In-Reply-To: <1418725700-31465-2-git-send-email-Yanjun.Zhu@windriver.com>

From: Zhu Yanjun
> 2.6.x kernels require a similar logic change as commit 6dfaa76
> [e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
> for newer kernels.
> 
> During Sx->S0 transitions, the interconnect between the MAC and PHY on
> 82577/82578 can remain in SMBus mode instead of transitioning to the
> PCIe-like mode required during normal operation.  Toggling the LANPHYPC
> Value bit essentially resets the interconnect forcing it to the correct
> mode.
> 
> after review of all intel drivers, found several instances where
> drivers had the incorrect pattern of:
> memory mapped write();
> delay();
> 
> which should always be:
> memory mapped write();
> write flush(); /* aka memory mapped read */
> delay();

Probably not only the intel drivers.
I'd bet that a lot of the delay() calls are wrong.

> explanation:
> The reason for including the flush is that writes can be held
> (posted) in PCI/PCIe bridges, but the read always has to complete
> synchronously and therefore has to flush all pending writes to a
> device.  If a write is held and followed by a delay, the delay
> means nothing because the write may not have reached hardware
> (maybe even not until the next read)

It might even be that the 'write flush' (ie a read) guarantees to
generates a long enough delay that the delay() itself isn't needed.

	David

^ permalink raw reply

* Re: Bug: mv643xxx fails with highmem
From: Russell King - ARM Linux @ 2014-12-16 10:57 UTC (permalink / raw)
  To: fugang.duan@freescale.com
  Cc: David Miller, Fabio.Estevam@freescale.com,
	ezequiel.garcia@free-electrons.com, netdev@vger.kernel.org
In-Reply-To: <BLUPR03MB37305CF1F6797329BD9B183F56C0@BLUPR03MB373.namprd03.prod.outlook.com>

On Tue, Dec 16, 2014 at 02:19:38AM +0000, fugang.duan@freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk> Sent: Tuesday, December 16, 2014 2:05 AM
> > To: Duan Fugang-B38611
> > Cc: David Miller; Estevam Fabio-R49496; ezequiel.garcia@free-
> > electrons.com; netdev@vger.kernel.org
> > Subject: Re: Bug: mv643xxx fails with highmem
> > 
> > On Fri, Dec 12, 2014 at 05:34:01AM +0000, fugang.duan@freescale.com wrote:
> > > I will submit one patch to fix the issue.
> > 
> > There's more bugs in the FEC driver... here's the relevant bits:
> > 
> > static void
> > fec_enet_tx_queue(struct net_device *ndev, u16 queue_id) {
> >         bdp = txq->dirty_tx;
> > 
> >         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> > 
> >         while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> >                 /* current queue is empty */
> >                 if (bdp == txq->cur_tx)
> >                         break;
> > 
> >                 skb = txq->tx_skbuff[index];
> >                 txq->tx_skbuff[index] = NULL;
> >                 if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
> >                         dma_unmap_single(&fep->pdev->dev, bdp-
> > >cbd_bufaddr,
> >                                         bdp->cbd_datlen, DMA_TO_DEVICE);
> >                 bdp->cbd_bufaddr = 0;
> >                 if (!skb) {
> >                         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> >                         continue;
> >                 }
> > ...
> >                 txq->dirty_tx = bdp;
> >                 bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> >         }
> > 
> > Consider the following code path:
> > - we enter this function
> > - get the dirty_tx pointer
> > - move to the next descriptor (which we'll call descriptor A)
> > - next descriptor indicates that TX_READY = 0
> > - bdp != txq->cur_tx
> > - we unmap if needed
> > - we set bdp->cmdbufaddr = 0
> > - assume skb is NULL, so we move to the next descriptor (we'll call this
> > B)
> > - next descriptor _may_ have TX_READY = 1
> > - we break out of the loop, and return
> > 
> > Some time later, we re-enter:
> > - get the dirty_tx pointer
> > - move to the next descriptor (which is descriptor A above)
> > - next descriptor indicates that TX_READY = 0
> > - bdp != txq->cur_tx
> > - we call dma_unmap_single(..., bdp->cbd_bufaddr, which we previously
> > zeroed
> >   - the DMA API debugging complains that FEC is unmapping memory which it
> >     doesn't own
> > 
> > Unfortunately, this does appear to happen - from a paste from Jon
> > Nettleton from iMX6Q:
> > 
> >  32. [   45.033001] unmapping this address 0x0 size 66  33. [   45.037470]
> > ------------[ cut here ]------------  34. [   45.042127] WARNING: CPU: 0
> > PID: 102 at lib/dma-debug.c:1080 check_unmap+0x784/0x9f4()
> > 35. [   45.050066] fec 2188000.ethernet: DMA-API: device driver tries to
> > free DMA memory it has not a]
> > 
> > (where the printk at line 32 is something that was added to debug this.)
> > 
> > The sad thing is that the remainder of my FEC patches did go a long way
> > to clean up these kinds of issues in the driver (and there's /many/ of
> > them), but unfortunately other conflicting changes got merged before I
> > could finish rebasing them, I decided to move on to other things and
> > discard the remainder of my patch set.  Marek showed some interest in
> > taking the patch set over, but I've not heard anything more - and I'm not
> > about to resurect my efforts only to get into the same situation where
> > I'm carrying 50 odd patches which I can't merge back into mainline
> > without spending weeks endlessly rebasing them.
> > 
> Russell, many thanks for your effort and thanks for your pointing out the bug.
> I will think one method to fix it.
> 
> And I have one question for highmem dma mapping issue as below:
> fec_enet_txq_submit_frag_skb(struct fec_enet_priv_tx_q *txq, struct sk_buff *skb, struct net_device *ndev)
> {
> 	...
> 		    bufaddr = page_address(this_frag->page.p) + this_frag->page_offset;
> 
>                 index = fec_enet_get_bd_index(txq->tx_bd_base, bdp, fep);
>                 if (((unsigned long) bufaddr) & fep->tx_align ||
>                         fep->quirks & FEC_QUIRK_SWAP_FRAME) {
>                         memcpy(txq->tx_bounce[index], bufaddr, frag_len);
>                         bufaddr = txq->tx_bounce[index];
> 
>                         if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
>                                 swap_buffer(bufaddr, frag_len);
>                 }
> 
>                 addr = dma_map_single(&fep->pdev->dev, bufaddr, frag_len,
>                                       DMA_TO_DEVICE);
>                 if (dma_mapping_error(&fep->pdev->dev, addr)) {
>                         dev_kfree_skb_any(skb);
>                         if (net_ratelimit())
>                                 netdev_err(ndev, "Tx DMA memory map failed\n");
>                         goto dma_mapping_error;
>                 }
> 	...
> }
>
> If the frag page is located at high memory, use dma_map_single() is not
> right, must use skb_frag_dma_map() or dma_map_page().

Correct.

> But before mapping, if tx has buffer alignment limitation (tx_align is
> not zero), there need to do memcpy for buffer alignment.

Right, and that can be detected by simply checking this_frag->page_offset
as we know that a page address is always aligned to a page.

> So, there we need to check whether the page is in highmem, if so, we
> need to call kmap_atomic() or kmap_high_get() to get cpu address,
> And then do memcpy or swap buffer operation.

Yes - you'd need to do something like this:

	if (this_frag->page_offset & fep->tx_align ||
	    fep->quirks & FEC_QUIRK_SWAP_FRAME) {
		bufaddr = kmap_atomic(this_frag->page.p) + this_frag->page_offset;
		memcpy(txq->tx_bounce[index], bufaddr, frag_len);
		kunmap_atomic(bufaddr);

		bufaddr = txq->tx_bounce[index];
		if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
			swap_buffer(bufaddr, frag_len);
                addr = dma_map_single(&fep->pdev->dev, bufaddr, frag_len,
                                      DMA_TO_DEVICE);
	} else {
		addr = skb_frag_dma_map(&fep->pdev->dev, this_frag, 0,
					frag_len, DMA_TO_DEVICE);
	}

	if (dma_mapping_error(&fep->pdev->dev, addr)) {
		dev_kfree_skb_any(skb);
		if (net_ratelimit())
			netdev_err(ndev, "Tx DMA memory map failed\n");
		goto dma_mapping_error;
	}

You'll also need to record whether you should use dma_unmap_page() or
dma_unmap_single().

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* [PATCH net-next 1/1] net: fec: Fix NAPI race
From: Fugang Duan @ 2014-12-16 10:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, R49496, bhutchings, stephen, b38611

Do camera capture test on i.MX6q sabresd board, and save the capture data to
nfs rootfs. The command is:
gst-launch-1.0 -e imxv4l2src device=/dev/video1 num-buffers=2592000 ! tee name=t !
queue ! imxv4l2sink sync=false t. ! queue ! vpuenc ! queue ! mux. pulsesrc num-buffers=3720937
blocksize=4096 ! 'audio/x-raw, rate=44100, channels=2' ! queue ! imxmp3enc ! mpegaudioparse !
queue ! mux. qtmux name=mux ! filesink location=video_recording_long.mov

After about 10 hours running, there have net watchdog timeout kernel dump:
...
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x2b4/0x2d8()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.24-01051-gdb840b7 #440
[<80014e6c>] (unwind_backtrace) from [<800118ac>] (show_stack+0x10/0x14)
[<800118ac>] (show_stack) from [<806ae3f0>] (dump_stack+0x78/0xc0)
[<806ae3f0>] (dump_stack) from [<8002b504>] (warn_slowpath_common+0x68/0x8c)
[<8002b504>] (warn_slowpath_common) from [<8002b558>] (warn_slowpath_fmt+0x30/0x40)
[<8002b558>] (warn_slowpath_fmt) from [<8055e0d4>] (dev_watchdog+0x2b4/0x2d8)
[<8055e0d4>] (dev_watchdog) from [<800352d8>] (call_timer_fn.isra.33+0x24/0x8c)
[<800352d8>] (call_timer_fn.isra.33) from [<800354c4>] (run_timer_softirq+0x184/0x220)
[<800354c4>] (run_timer_softirq) from [<8002f420>] (__do_softirq+0xc0/0x22c)
[<8002f420>] (__do_softirq) from [<8002f804>] (irq_exit+0xa8/0xf4)
[<8002f804>] (irq_exit) from [<8000ee5c>] (handle_IRQ+0x54/0xb4)
[<8000ee5c>] (handle_IRQ) from [<80008598>] (gic_handle_irq+0x28/0x5c)
[<80008598>] (gic_handle_irq) from [<800123c0>] (__irq_svc+0x40/0x74)
Exception stack(0x80d27f18 to 0x80d27f60)
7f00:                                                       80d27f60 0000014c
7f20: 8858c60e 0000004d 884e4540 0000004d ab7250d0 80d34348 00000000 00000000
7f40: 00000001 00000000 00000017 80d27f60 800702a4 80476e6c 600f0013 ffffffff
[<800123c0>] (__irq_svc) from [<80476e6c>] (cpuidle_enter_state+0x50/0xe0)
[<80476e6c>] (cpuidle_enter_state) from [<80476fa8>] (cpuidle_idle_call+0xac/0x154)
[<80476fa8>] (cpuidle_idle_call) from [<8000f174>] (arch_cpu_idle+0x8/0x44)
[<8000f174>] (arch_cpu_idle) from [<80064c54>] (cpu_startup_entry+0x100/0x158)
[<80064c54>] (cpu_startup_entry) from [<80cd8a9c>] (start_kernel+0x304/0x368)
---[ end trace 09ebd32fb032f86d ]---
...

There might have a race in napi_schedule(), leaving interrupts disabled forever.
After these patch, the case still work more than 40 hours running.

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec_main.c |   19 +++++++------------
 1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 8c5b15e..5c4a8bd 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1558,20 +1558,21 @@ fec_enet_interrupt(int irq, void *dev_id)
 {
 	struct net_device *ndev = dev_id;
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF;
 	uint int_events;
 	irqreturn_t ret = IRQ_NONE;
 
 	int_events = readl(fep->hwp + FEC_IEVENT);
-	writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT);
+	writel(int_events, fep->hwp + FEC_IEVENT);
 	fec_enet_collect_events(fep, int_events);
 
-	if (int_events & napi_mask) {
+	if (fep->work_tx || fep->work_rx) {
 		ret = IRQ_HANDLED;
 
-		/* Disable the NAPI interrupts */
-		writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
-		napi_schedule(&fep->napi);
+		if (napi_schedule_prep(&fep->napi)) {
+			/* Disable the NAPI interrupts */
+			writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
+			__napi_schedule(&fep->napi);
+		}
 	}
 
 	if (int_events & FEC_ENET_MII) {
@@ -1591,12 +1592,6 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	int pkts;
 
-	/*
-	 * Clear any pending transmit or receive interrupts before
-	 * processing the rings to avoid racing with the hardware.
-	 */
-	writel(FEC_ENET_RXF | FEC_ENET_TXF, fep->hwp + FEC_IEVENT);
-
 	pkts = fec_enet_rx(ndev, budget);
 
 	fec_enet_tx(ndev);
-- 
1.7.8

^ permalink raw reply related

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16 11:01 UTC (permalink / raw)
  To: Roopa Prabhu, netdev@vger.kernel.org
  Cc: Jamal Hadi Salim, John Fastabend, Jiri Pirko, sfeldma@gmail.com,
	bcrl@kvack.org, tgraf@suug.ch, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <548F88E8.1010703@cumulusnetworks.com>


In my reply (inline) I elaborate on the validity of bridge-less and offloaded-bridge models for L2 switching.

I also discuss the implied necessity of a bridge device for L3 routing and potential issues with the upcoming FIB offloading proposal.

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
> Sent: Tuesday, December 16, 2014 3:21 AM
> To: Arad, Ronen
> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri Pirko;
> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
> >
> >> -----Original Message-----
> >> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
> >> Sent: Tuesday, December 16, 2014 1:28 AM
> >> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
> >> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
> >> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
> >> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >> On 12/15/14 13:36, Arad, Ronen wrote:
> >>>
> >>>> -----Original Message-----
> >>> The behavior of a driver could depend on the presence of a bridge
> >>> and
> >> features such as FDB LEARNING and LEARNING_SYNC.
> >>
> >> Indeed, those are bridge attributes.
> >>
> >>> A switch port driver which is not enslaved to a bridge might need to
> >>> implement VLAN-aware FDB within the driver and report its content to
> >>> user-
> >> space using ndo_fdb_dump.
> >>   >
> >>> A switch port driver which is enslaved to a bridge could do with
> >>> only pass through for static FDB configuration
> >>   > to the HW when LEARNING_SYNC is configured. FDB reporting to
> >> user- space and soft aging are left to the bridge module FDB.
> >>> Such driver, without LEARNING_SYNC could still avoid maintaing
> >>> in-driver
> >> FDB as long as it could dump the HW FDB on demand.
> >>> LEARNING_SYNC also requires periodic updates of freshness
> >>> information
> >> from the driver to the bridge module.
> >>
> >> If you have an fdb - shouldnt that be exposed only if you have a
> >> bridge abstraction exposed? i.e thats where the Linux tools would work.
> > I'm trying to find out what are the opinions of other people in the netdev
> list.
> > John have clearly stated that he'd like to see full L2 switching functionality
> (at least) supported without making a bridge device mandatory.
> > The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that
> with proper setting of SELF/MASTER flags by iproute2.
> > I see the value in supporting both approaches (bridge device mandatory
> > and bridge device optional). If the choice is left to user-driven policy decision,
> > we need to document both use models and map traditional L2 features to
> > each model.
> > The L2 offloading (or NETFUNC as it is currently called), which is being
> > discussed on a different patch-set, is only needed when a bridge device is
> > used.
> > Without a bridge device, all configuration has to be targeted at the switch
> > port driver directly using the SELF flag. FDB remains relevant and it is used to
> > configure static MAC table entries and dump the HW MAC table.

> Your understanding is right here. So far all patches have kept both models in
> mind.


> > When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even
> > higher), there is a gap between what the HW is doing and what is explicitly
> > modeled in Linux.


> Can you elaborate more here ?. We use the linux model to accelerate a
> multi-layer (l2-l3) switch today. There maybe a few gaps, but these gaps can
> be closed by having equivalent functionality in the software path.

What I meant is that without a bridge device the HW switch is seen as a collection of independent switch ports. Typical switch ASIC performs L2 switching by default. This is not expressed explicitly in Linux without a bridge device.
The SELF flag is used to target typical bridge port and bridge configuration at a switch port device.
Without an explicit bridge device, bridge attributes have to be directed at an arbitrary port (any port could represent the entire switch) and interpreted by the switch port driver as intended for the entire switch (this includes attributes like STP etc.)
Each switch port device driver has to implement similar functionality (i.e. all bridge and fdb related ndos) independently without common functionality shared (e.g. FDB, soft aging).
It is a valid use model and could avoid the complexity of having to deal with the presence of both SW and HW bridge and to deal with explicit offloading of data-path.

I was trying to find out whether the intention was to continue and support both bridge-less an offloaded-bridge models and leave it to the end-user to choose the desirable model at configuration time.
This would require dual support in the switch port driver in order to have best user experience across multiple switch ASICs or other kinds of devices.


> 
> > Without a bridge device, the HW is represented by a set of switch port
> > devices and the bridging (both control and data planes) takes place only in
> > the HW and switch port driver.
> > Each switch port driver has to implement its own FDB as there is no
> > common shared code among drivers for different HW devices.
> > Using a bridge device could partially alleviate that, but it comes with a cost.
> > There is a need to properly implement offloading of both configuration and
> > data-path. The transmit and receive path in the bridge module should be
> > somehow bypassed to avoid unnecessary overhead or duplicate packets
> > coming from both software bridging and HW bridging.
> >
> >> What i was refering to was a scenario where i have no interest in the
> >> fdb despite such a hardware capabilities. VLANs is a different issue;
> >>
> > VLAN is fundamental feature of L2 and L3 switching and Linux is unclear
> about it. Bridge device could model bridging of untagged packets which
> requires a bridge device for each VLAN and a vlan device on each port that is
> a member of the bridge's VLAN.
> > This different from the behavior and configuration of classic closed-source
> switches.
> > An alternative model is VLAN filtering where a bridge is VLAN-aware and
> > switches tagged traffic. A bridge device represents multiple L2 domains with
> > VLAN filtering policy that defines the switching rules within each domain.
> And the linux bridge driver supports both models today.
> 
> > Forwarding (e.g. L3 routing) is expected across such L2 domains using L3
> entities.
> > The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN
> > filtering model is yet unclear to me.

> In the vlan filtering bridge model, You can create a vlan device on the bridge
> for l3 ...
> 

That's what I'm thinking too (I experimented with such setup using veth interfaces, bridge device, and vlan interfaces). This, however, seems to require an explicit bridge for L3 support.

Looking at the latest code of FIB offloading (not yet submitted to netdev), I noticed that a switch port device is expected as a lower descendent of the FIB destination device.
This assumption is valid in the per-vlan bridge model where IP address is assigned to the bridge itself.
This, however, is not consistent with the single multi-VLAN bridge model.
Vlan interfaces on a bridge looks like siblings of the switch ports devices on the same bridge. They are not ancestors of the switch ports.
The L3 domain ends at the bridge sub-interfaces. The only L3 entities are the vlan sub-interfaces on the bridge. 
Those are route next hops and the only possible fib_dev.
L3 routing is not aware of the switch ports. Route is performed to next hop addresses on one of the vlan interfaces subnets. The actual resolution to a switch port device has to be performed by the neighbor subsystem (ARP/ND).
It is unclear to me how the FIB offloading will be redirected to an ndo of a switch port device.

> >
> >>>>> Will the decision about using a bridge device or avoiding it be
> >>>>> left to the end-user?
> >>>> Its a user policy decision. Again the offload bit gets us this in a
> >>>> reasonably configurable way IMO.
> >>>>
> >>>>> (This requires switch port drivers to be able to work and provide
> >>>>> similar functionality in both setups).
> >>>> Right, but if the drivers "care" who is calling their ndo ops
> >>>> something is seriously broken. For the driver it should not need to
> >>>> know anything about the callers so it doesn't matter to the driver
> >>>> if its a netlink call from user space or an internal call fro
> >>>> bridge.ko
> >>> LEARNING_SYNC only makes sense when a switch port driver is enslaved
> >>> to
> >> a bridge.
> >>   > Rocker switch driver indeed monitors upper change notifications
> >> and keep track of master bridge presence.
> >>> So bridge presence is not transparent.
> >>>
> >> Agreed - the challenge so far is that people have been fascinated by
> "switch"
> >> point of view. I think we are learning and the class device will
> >> eventually become obvious as useful.
> >>
> >> cheers,
> >> jamal
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org More majordomo
> info
> > at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net] net/mlx4: Cache line CQE/EQE stride fixes
From: Amir Vadai @ 2014-12-16 11:28 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Amir Vadai, Yevgeny Petrilin, Or Gerlitz, clsoto,
	Ido Shamay, Wei Yang

From: Ido Shamay <idos@mellanox.com>

This commit contains 2 fixes for the 128B CQE/EQE stride feaure.
Wei found that mlx4_QUERY_HCA function marked the wrong capability
in flags (64B CQE/EQE), when CQE/EQE stride feature was enabled.
Also added small fix in initial CQE ownership bit assignment, when CQE
is size is not default 32B.

Fixes: 77507aa24 (net/mlx4: Enable CQE/EQE stride support)
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
Dave Hi,

Please pull this patch also to stable (at least 3.17)

Thanks,
Amir

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 11 +++++++++--
 drivers/net/ethernet/mellanox/mlx4/fw.c        |  4 ++--
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6ff214d..190cbd9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1569,8 +1569,15 @@ int mlx4_en_start_port(struct net_device *dev)
 			mlx4_en_free_affinity_hint(priv, i);
 			goto cq_err;
 		}
-		for (j = 0; j < cq->size; j++)
-			cq->buf[j].owner_sr_opcode = MLX4_CQE_OWNER_MASK;
+
+		for (j = 0; j < cq->size; j++) {
+			struct mlx4_cqe *cqe = NULL;
+
+			cqe = mlx4_en_get_cqe(cq->buf, j, priv->cqe_size) +
+			      priv->cqe_factor;
+			cqe->owner_sr_opcode = MLX4_CQE_OWNER_MASK;
+		}
+
 		err = mlx4_en_set_cq_moder(priv, cq);
 		if (err) {
 			en_err(priv, "Failed setting cq moderation parameters\n");
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index ef3b95b..32c848d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -1848,8 +1848,8 @@ int mlx4_QUERY_HCA(struct mlx4_dev *dev,
 	/* CX3 is capable of extending CQEs\EQEs to strides larger than 64B */
 	MLX4_GET(byte_field, outbox, INIT_HCA_EQE_CQE_STRIDE_OFFSET);
 	if (byte_field) {
-		param->dev_cap_enabled |= MLX4_DEV_CAP_64B_EQE_ENABLED;
-		param->dev_cap_enabled |= MLX4_DEV_CAP_64B_CQE_ENABLED;
+		param->dev_cap_enabled |= MLX4_DEV_CAP_EQE_STRIDE_ENABLED;
+		param->dev_cap_enabled |= MLX4_DEV_CAP_CQE_STRIDE_ENABLED;
 		param->cqe_size = 1 << ((byte_field &
 					 MLX4_CQE_SIZE_MASK_STRIDE) + 5);
 		param->eqe_size = 1 << (((byte_field &
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next 1/1] net: fec: Fix NAPI race
From: Fabio Estevam @ 2014-12-16 11:33 UTC (permalink / raw)
  To: Fugang Duan
  Cc: David S. Miller, netdev@vger.kernel.org, Estevam Fabio-R49496,
	Ben Hutchings, Stephen Hemminger, robert.daniels,
	Marek Vašut, Russell King
In-Reply-To: <1418725558-17287-1-git-send-email-b38611@freescale.com>

Hi Fugang,

On Tue, Dec 16, 2014 at 8:25 AM, Fugang Duan <b38611@freescale.com> wrote:
> Do camera capture test on i.MX6q sabresd board, and save the capture data to
> nfs rootfs. The command is:
> gst-launch-1.0 -e imxv4l2src device=/dev/video1 num-buffers=2592000 ! tee name=t !
> queue ! imxv4l2sink sync=false t. ! queue ! vpuenc ! queue ! mux. pulsesrc num-buffers=3720937
> blocksize=4096 ! 'audio/x-raw, rate=44100, channels=2' ! queue ! imxmp3enc ! mpegaudioparse !
> queue ! mux. qtmux name=mux ! filesink location=video_recording_long.mov
>
> After about 10 hours running, there have net watchdog timeout kernel dump:
> ...
> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x2b4/0x2d8()
> NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out

Adding more people who reported similar issues in the past.

Marek,

Does this patch solve the problem you reported at
http://www.spinics.net/lists/netdev/msg268167.html ?

^ permalink raw reply

* Re: [PATCH net-next 1/1] net: fec: Fix NAPI race
From: Russell King - ARM Linux @ 2014-12-16 11:41 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: Fugang Duan, David S. Miller, netdev@vger.kernel.org,
	Estevam Fabio-R49496, Ben Hutchings, Stephen Hemminger,
	robert.daniels, Marek Vašut
In-Reply-To: <CAOMZO5D3cysOauyptcjCDM6sjK7ZEiAY==xYu-qqHbwFreoV8g@mail.gmail.com>

On Tue, Dec 16, 2014 at 09:33:53AM -0200, Fabio Estevam wrote:
> Hi Fugang,
> 
> On Tue, Dec 16, 2014 at 8:25 AM, Fugang Duan <b38611@freescale.com> wrote:
> > Do camera capture test on i.MX6q sabresd board, and save the capture data to
> > nfs rootfs. The command is:
> > gst-launch-1.0 -e imxv4l2src device=/dev/video1 num-buffers=2592000 ! tee name=t !
> > queue ! imxv4l2sink sync=false t. ! queue ! vpuenc ! queue ! mux. pulsesrc num-buffers=3720937
> > blocksize=4096 ! 'audio/x-raw, rate=44100, channels=2' ! queue ! imxmp3enc ! mpegaudioparse !
> > queue ! mux. qtmux name=mux ! filesink location=video_recording_long.mov
> >
> > After about 10 hours running, there have net watchdog timeout kernel dump:
> > ...
> > WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x2b4/0x2d8()
> > NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
> 
> Adding more people who reported similar issues in the past.
> 
> Marek,
> 
> Does this patch solve the problem you reported at
> http://www.spinics.net/lists/netdev/msg268167.html ?

My set of patches fixed stuff exactly like this...

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* Re: [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: Jeff Kirsher @ 2014-12-16 12:12 UTC (permalink / raw)
  To: Zhu Yanjun, stable, Willy Tarreau
  Cc: netdev, w, Zhu Yanjun, Bruce Allan, David S. Miller
In-Reply-To: <1418725700-31465-2-git-send-email-Yanjun.Zhu@windriver.com>

[-- Attachment #1: Type: text/plain, Size: 1727 bytes --]

On Tue, 2014-12-16 at 18:28 +0800, Zhu Yanjun wrote:
> 2.6.x kernels require a similar logic change as commit 6dfaa76 
> [e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
> for newer kernels.
> 
> During Sx->S0 transitions, the interconnect between the MAC and PHY on
> 82577/82578 can remain in SMBus mode instead of transitioning to the
> PCIe-like mode required during normal operation.  Toggling the
> LANPHYPC
> Value bit essentially resets the interconnect forcing it to the
> correct
> mode.
> 
> after review of all intel drivers, found several instances where
> drivers had the incorrect pattern of:
> memory mapped write();
> delay();
> 
> which should always be:
> memory mapped write();
> write flush(); /* aka memory mapped read */
> delay();
> 
> explanation:
> The reason for including the flush is that writes can be held
> (posted) in PCI/PCIe bridges, but the read always has to complete
> synchronously and therefore has to flush all pending writes to a
> device.  If a write is held and followed by a delay, the delay
> means nothing because the write may not have reached hardware
> (maybe even not until the next read)
> 
> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
> ---
>  drivers/net/e1000e/defines.h |  2 ++
>  drivers/net/e1000e/ich8lan.c | 20 ++++++++++++++++++++
>  2 files changed, 22 insertions(+)

To be clear, Zhu is wanting this applied to stable trees (yet did not CC
stable@vger.kernel.org ).

Willy- I am fine with this series being applied to stable.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: Willy Tarreau @ 2014-12-16 12:17 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: Zhu Yanjun, stable, netdev, Zhu Yanjun, Bruce Allan,
	David S. Miller
In-Reply-To: <1418731926.2467.15.camel@jtkirshe-mobl>

On Tue, Dec 16, 2014 at 04:12:06AM -0800, Jeff Kirsher wrote:
> On Tue, 2014-12-16 at 18:28 +0800, Zhu Yanjun wrote:
> > 2.6.x kernels require a similar logic change as commit 6dfaa76 
> > [e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
> > for newer kernels.
> > 
> > During Sx->S0 transitions, the interconnect between the MAC and PHY on
> > 82577/82578 can remain in SMBus mode instead of transitioning to the
> > PCIe-like mode required during normal operation.  Toggling the
> > LANPHYPC
> > Value bit essentially resets the interconnect forcing it to the
> > correct
> > mode.
> > 
> > after review of all intel drivers, found several instances where
> > drivers had the incorrect pattern of:
> > memory mapped write();
> > delay();
> > 
> > which should always be:
> > memory mapped write();
> > write flush(); /* aka memory mapped read */
> > delay();
> > 
> > explanation:
> > The reason for including the flush is that writes can be held
> > (posted) in PCI/PCIe bridges, but the read always has to complete
> > synchronously and therefore has to flush all pending writes to a
> > device.  If a write is held and followed by a delay, the delay
> > means nothing because the write may not have reached hardware
> > (maybe even not until the next read)
> > 
> > Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> > Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
> > ---
> >  drivers/net/e1000e/defines.h |  2 ++
> >  drivers/net/e1000e/ich8lan.c | 20 ++++++++++++++++++++
> >  2 files changed, 22 insertions(+)
> 
> To be clear, Zhu is wanting this applied to stable trees (yet did not CC
> stable@vger.kernel.org ).
> 
> Willy- I am fine with this series being applied to stable.

OK, thanks Jeff! I'm queuing the patches now.

Best regards,
Willy

^ permalink raw reply

* Re: [PATCH net] net/mlx4: Cache line CQE/EQE stride fixes
From: Amir Vadai @ 2014-12-16 12:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Yevgeny Petrilin, Or Gerlitz,
	clsoto@linux.vnet.ibm.com, Ido Shamay, Wei Yang
In-Reply-To: <1418729334-2974-1-git-send-email-amirv@mellanox.com>

On 12/16/2014 1:28 PM, Amir Vadai wrote:
> From: Ido Shamay <idos@mellanox.com>
> 
> This commit contains 2 fixes for the 128B CQE/EQE stride feaure.
> Wei found that mlx4_QUERY_HCA function marked the wrong capability
> in flags (64B CQE/EQE), when CQE/EQE stride feature was enabled.
> Also added small fix in initial CQE ownership bit assignment, when CQE
> is size is not default 32B.
> 
> Fixes: 77507aa24 (net/mlx4: Enable CQE/EQE stride support)
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Ido Shamay <idos@mellanox.com>
> Signed-off-by: Amir Vadai <amirv@mellanox.com>
> ---
> Dave Hi,
> 
> Please pull this patch also to stable (at least 3.17)
> 
> Thanks,
> Amir

Small correction: Should pull into -stable >= 3.18

Amir

^ permalink raw reply

* [PATCH 0/5] tun/macvtap: TUNSETIFF fixes
From: Michael S. Tsirkin @ 2014-12-16 13:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Miller, netdev, Dan Carpenter, Jason Wang

Dan Carpenter reported the following:
	static checker warning:

		drivers/net/tun.c:1694 tun_set_iff()
		warn: 0x17100 is larger than 16 bits

	drivers/net/tun.c
	  1692
	  1693          tun->flags = (tun->flags & ~TUN_FEATURES) |
	  1694                  (ifr->ifr_flags & TUN_FEATURES);
	  1695

	It's complaining because the "ifr->ifr_flags" variable is a short
	(should it be unsigned?).  The new define:

	#define IFF_VNET_LE    0x10000

	doesn't fit in two bytes.  Other suspect looking code could be:

		return __virtio16_to_cpu(q->flags & IFF_VNET_LE, val);

And that's true: we have run out of IFF flags in tun.

So let's not try to add more: add simple GET/SET ioctls
instead. Easy to test, leads to clear semantics.

Alternatively we'll have to revert the whole thing for 3.19,
but that seems more work as this has dependencies
in other places.

While here, I noticed that macvtap was actually reading
ifreq flags as a 32 bit field.
Fix that up as well.

Michael S. Tsirkin (5):
  macvtap: fix uninitialized access on TUNSETIFF
  if_tun: add TUNSETVNETLE/TUNGETVNETLE
  tun: drop broken IFF_VNET_LE
  macvtap: drop broken IFF_VNET_LE
  if_tun: drop broken IFF_VNET_LE

 include/uapi/linux/if_tun.h |  3 ++-
 drivers/net/macvtap.c       | 30 ++++++++++++++++++++++++------
 drivers/net/tun.c           | 26 +++++++++++++++++++++++---
 3 files changed, 49 insertions(+), 10 deletions(-)

-- 
MST

^ permalink raw reply

* [PATCH 1/5] macvtap: fix uninitialized access on TUNSETIFF
From: Michael S. Tsirkin @ 2014-12-16 13:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Miller, netdev, Dan Carpenter, Jason Wang, Tom Herbert,
	Ben Hutchings, Vlad Yasevich, Herbert Xu
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

flags field in ifreq is only 16 bit wide, but
we read it as a 32 bit value.
If userspace doesn't zero-initialize unused fields,
this will lead to failures.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/macvtap.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index af90ab5..de88285 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1011,7 +1011,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 	void __user *argp = (void __user *)arg;
 	struct ifreq __user *ifr = argp;
 	unsigned int __user *up = argp;
-	unsigned int u;
+	unsigned short u;
 	int __user *sp = argp;
 	int s;
 	int ret;
@@ -1026,7 +1026,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		if ((u & ~MACVTAP_FEATURES) != (IFF_NO_PI | IFF_TAP))
 			ret = -EINVAL;
 		else
-			q->flags = u;
+			q->flags = (q->flags & ~MACVTAP_FEATURES) | u;
 
 		return ret;
 
@@ -1039,8 +1039,9 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		}
 
 		ret = 0;
+		u = q->flags;
 		if (copy_to_user(&ifr->ifr_name, vlan->dev->name, IFNAMSIZ) ||
-		    put_user(q->flags, &ifr->ifr_flags))
+		    put_user(u, &ifr->ifr_flags))
 			ret = -EFAULT;
 		macvtap_put_vlan(vlan);
 		rtnl_unlock();
-- 
MST

^ permalink raw reply related

* [PATCH 2/5] if_tun: add TUNSETVNETLE/TUNGETVNETLE
From: Michael S. Tsirkin @ 2014-12-16 13:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Miller, netdev, Dan Carpenter, linux-api, Jason Wang
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

ifreq flags field is only 16 bit wide, so setting IFF_VNET_LE there has
no effect:
doesn't fit in two bytes.

The tests passed apparently because they have an even number of bugs,
all cancelling out.

Luckily we didn't release a kernel with this flag, so it's
not too late to fix this.

Add TUNSETVNETLE/TUNGETVNETLE to really achieve the purpose
of IFF_VNET_LE.

This has an added benefit that if we ever want a BE flag,
we won't have to deal with weird configurations like
setting both LE and BE at the same time.

IFF_VNET_LE will be dropped in a follow-up patch.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/if_tun.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 18b2403..274630c 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -48,6 +48,8 @@
 #define TUNSETQUEUE  _IOW('T', 217, int)
 #define TUNSETIFINDEX	_IOW('T', 218, unsigned int)
 #define TUNGETFILTER _IOR('T', 219, struct sock_fprog)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNGETVNETLE _IOR('T', 221, int)

 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
-- 
MST

^ permalink raw reply related

* [PATCH 3/5] tun: drop broken IFF_VNET_LE
From: Michael S. Tsirkin @ 2014-12-16 13:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Miller, netdev, Dan Carpenter, Jason Wang, Herbert Xu,
	Tom Herbert, Ben Hutchings, Xi Wang, Masatake YAMATO
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

Use TUNSETVNETLE/TUNGETVNETLE instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/tun.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c052bd6b..e3e8a0e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -109,9 +109,11 @@ do {								\
  * overload it to mean fasync when stored there.
  */
 #define TUN_FASYNC	IFF_ATTACH_QUEUE
+/* High bits in flags field are unused. */
+#define TUN_VNET_LE     0x80000000
 
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
-		      IFF_VNET_LE | IFF_MULTI_QUEUE)
+		      IFF_MULTI_QUEUE)
 #define GOODCOPY_LEN 128
 
 #define FLT_EXACT_COUNT 8
@@ -207,12 +209,12 @@ struct tun_struct {
 
 static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
 {
-	return __virtio16_to_cpu(tun->flags & IFF_VNET_LE, val);
+	return __virtio16_to_cpu(tun->flags & TUN_VNET_LE, val);
 }
 
 static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
 {
-	return __cpu_to_virtio16(tun->flags & IFF_VNET_LE, val);
+	return __cpu_to_virtio16(tun->flags & TUN_VNET_LE, val);
 }
 
 static inline u32 tun_hashfn(u32 rxhash)
@@ -1853,6 +1855,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	int sndbuf;
 	int vnet_hdr_sz;
 	unsigned int ifindex;
+	int le;
 	int ret;
 
 	if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || _IOC_TYPE(cmd) == 0x89) {
@@ -2052,6 +2055,23 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		tun->vnet_hdr_sz = vnet_hdr_sz;
 		break;
 
+	case TUNGETVNETLE:
+		le = !!(tun->flags & TUN_VNET_LE);
+		if (put_user(le, (int __user *)argp))
+			ret = -EFAULT;
+		break;
+
+	case TUNSETVNETLE:
+		if (get_user(le, (int __user *)argp)) {
+			ret = -EFAULT;
+			break;
+		}
+		if (le)
+			tun->flags |= TUN_VNET_LE;
+		else
+			tun->flags &= ~TUN_VNET_LE;
+		break;
+
 	case TUNATTACHFILTER:
 		/* Can be set only for TAPs */
 		ret = -EINVAL;
-- 
MST

^ permalink raw reply related

* [PATCH 4/5] macvtap: drop broken IFF_VNET_LE
From: Michael S. Tsirkin @ 2014-12-16 13:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Miller, netdev, Dan Carpenter, Jason Wang, Tom Herbert,
	Ben Hutchings, Vlad Yasevich, Herbert Xu
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

Use TUNSETVNETLE/TUNGETVNETLE instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/macvtap.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index de88285..f1f0df1 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -45,16 +45,18 @@ struct macvtap_queue {
 	struct list_head next;
 };
 
-#define MACVTAP_FEATURES (IFF_VNET_HDR | IFF_VNET_LE | IFF_MULTI_QUEUE)
+#define MACVTAP_FEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
+
+#define MACVTAP_VNET_LE 0x80000000
 
 static inline u16 macvtap16_to_cpu(struct macvtap_queue *q, __virtio16 val)
 {
-	return __virtio16_to_cpu(q->flags & IFF_VNET_LE, val);
+	return __virtio16_to_cpu(q->flags & MACVTAP_VNET_LE, val);
 }
 
 static inline __virtio16 cpu_to_macvtap16(struct macvtap_queue *q, u16 val)
 {
-	return __cpu_to_virtio16(q->flags & IFF_VNET_LE, val);
+	return __cpu_to_virtio16(q->flags & MACVTAP_VNET_LE, val);
 }
 
 static struct proto macvtap_proto = {
@@ -1082,6 +1084,21 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		q->vnet_hdr_sz = s;
 		return 0;
 
+	case TUNGETVNETLE:
+		s = !!(q->flags & MACVTAP_VNET_LE);
+		if (put_user(s, sp))
+			return -EFAULT;
+		return 0;
+
+	case TUNSETVNETLE:
+		if (get_user(s, sp))
+			return -EFAULT;
+		if (s)
+			q->flags |= MACVTAP_VNET_LE;
+		else
+			q->flags &= ~MACVTAP_VNET_LE;
+		return 0;
+
 	case TUNSETOFFLOAD:
 		/* let the user check for future flags */
 		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-- 
MST

^ permalink raw reply related

* [PATCH 5/5] if_tun: drop broken IFF_VNET_LE
From: Michael S. Tsirkin @ 2014-12-16 13:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Miller, netdev, Dan Carpenter, linux-api, Jason Wang
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

Everyone should use TUNSETVNETLE/TUNGETVNETLE instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/if_tun.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 274630c..50ae243 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -59,7 +59,6 @@
 #define IFF_ONE_QUEUE	0x2000
 #define IFF_VNET_HDR	0x4000
 #define IFF_TUN_EXCL	0x8000
-#define IFF_VNET_LE	0x10000
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_ATTACH_QUEUE 0x0200
 #define IFF_DETACH_QUEUE 0x0400
-- 
MST

^ permalink raw reply related

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Jamal Hadi Salim @ 2014-12-16 13:06 UTC (permalink / raw)
  To: John Fastabend
  Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <548F80B2.80408@gmail.com>

On 12/15/14 19:45, John Fastabend wrote:
> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:

>
> hmm good question. When I implemented this on the host nics with SR-IOV,
> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
> the driver.

So if i understand correctly, this is a NIC with an FDB. And there is no
concept of a bridge to which it is attached. To the point of
classical uni/multicast addresses on a netdev abstraction; these
are typically stored in *much simpler tables* (used to be IO
registers back in the day)
Do these NICs not have such a concept?
An fdb entry has an egress port column; I have seen cases where the
port is labeled as "Cpu port" which would mean it belongs to the host;
but in this case it just seems there is no such concept and as Or
brought up in another email - what does "VLANid" mean in such a case?
If we go with a CPU port concept,
We could then use the concept of a vlan filter on a port basis
but then what happens when you dont have an fdb (majority of cases)?

> My logic was if some netdev ethx has a set of MAC addresses
> above it well then any virtual function or virtual device also behind
> the hardware shouldn't be sending those addresses out the egress switch
> facing port. Otherwise the switch will see packets it knows are behind
> that port and drop them. Or flood them if it hasn't learned the address
> yet. Either way they will never get to the right netdev.
>
> Admittedly I wasn't thinking about switches with many ports at the time.
>

I often struggle with trying to "box" SRIOV into some concept of a
switch abstraction and sometimes i am puzzled.
Would exposing the SRIOV underlay as a switch not have solved this
problem? Then the virtual ports essentially are bridge ports.
Maybe what we need is a concept of a "edge relay" extended netdev?
These things would have an fdb as well down and uplink relay ports that
can be attached to them.

>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>> the other driver did).
>
> My original thinking here was... if it didn't implement fdb_add, fdb_del
> and fdb_dump then if you wanted to think of it as having forwarding
> database that was fine but it was really just a two port mac relay. In
> which case just dump all the mac addresses it knows about. In this case
> if it was something more fancy it could do its own dump like vxlan or
> macvlan.
>

The challenge here is lack of separation between a NICs uni/multicast
ports which it owns - which is a traditional operation regardless of
what capabilities the NIC has; vs an fdb which has may have many
other capabilities. Probably all NICs capable of many MACs implement
fdbs?

> For a host nic ucast/multicast and fdb are the same, I think? The
> code we had was just short-hand to allow the common case a host nic
> to work. Notice vxlan and bridge drivers didn't dump there addr lists
> from fdb_dump until your patch.
>
> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
> I shouldn't overload the addr lists.
>

Not just those - I am wondering about the general utility of what
Hubert was trying to do if all the driver does is call the default
dumper based on some flags presence and the default dumper
does a dump of uni/multicast host entries. Those are not really fdb
entries in the traditional sense.
Is there no way to get the unicast/multicast mac addresses for such
a driver?
I think that would help bring clarity to my confusion.

>
> I'm interested to see what Vlad says as well. But the current situation
> is previously some drivers dumped their addr lists others didn't.
> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
> every device will dump the addr lists. I'm not entirely convinced that
> is correct.
>

I am glad this happened ;-> Otherwise we wouldnt be having this
discussion. When Vlad was asking me I was in a rush to get the patch
out and didnt question because i thought this was something some crazy
virtualization people needed.
If Vlad's use case goes away, then Hubert's little restoration is fine.

> It works OK for host nics (NICS that can't forward between ports) and
> seems at best confusing for real switch asics.

So if these NICs have fdb entries and i programmed it (meaning setting
which port a given MAC should be sent to), would it not work?

> On a related question do
> you expect the switch asic to trap any packets with MAC addresses in
> the multi/unicast address lists and send them to the correct netdev? Or
> will the switch forward them using normal FDB tables?
>

I think there would be a separate table for that. Roopa, can you check
with the ASICs you guys work on? The point i was trying to make above
is today there is a uni/multicast list or table of sorts that all NICs
expose.
There's always the hack of a "cpu port". I have also seen the "cpu port"
being conceptualized in L3 tables to imply "next hop is cpu" where you
have an IP address owned by the host; so maybe we need a concept of a
cpu port or again the revival of TheThing class device.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net-next 1/1] net: fec: Fix NAPI race
From: Marek Vasut @ 2014-12-16 13:34 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Fabio Estevam, Fugang Duan, David S. Miller,
	netdev@vger.kernel.org, Estevam Fabio-R49496, Ben Hutchings,
	Stephen Hemminger, robert.daniels
In-Reply-To: <20141216114131.GQ11285@n2100.arm.linux.org.uk>

On Tuesday, December 16, 2014 at 12:41:31 PM, Russell King - ARM Linux wrote:
> On Tue, Dec 16, 2014 at 09:33:53AM -0200, Fabio Estevam wrote:
> > Hi Fugang,
> > 
> > On Tue, Dec 16, 2014 at 8:25 AM, Fugang Duan <b38611@freescale.com> wrote:
> > > Do camera capture test on i.MX6q sabresd board, and save the capture
> > > data to nfs rootfs. The command is:
> > > gst-launch-1.0 -e imxv4l2src device=/dev/video1 num-buffers=2592000 !
> > > tee name=t ! queue ! imxv4l2sink sync=false t. ! queue ! vpuenc !
> > > queue ! mux. pulsesrc num-buffers=3720937 blocksize=4096 !
> > > 'audio/x-raw, rate=44100, channels=2' ! queue ! imxmp3enc !
> > > mpegaudioparse ! queue ! mux. qtmux name=mux ! filesink
> > > location=video_recording_long.mov
> > > 
> > > After about 10 hours running, there have net watchdog timeout kernel
> > > dump: ...
> > > WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264
> > > dev_watchdog+0x2b4/0x2d8() NETDEV WATCHDOG: eth0 (fec): transmit queue
> > > 0 timed out
> > 
> > Adding more people who reported similar issues in the past.
> > 
> > Marek,
> > 
> > Does this patch solve the problem you reported at
> > http://www.spinics.net/lists/netdev/msg268167.html ?
> 
> My set of patches fixed stuff exactly like this...

I still keep your G+ post open, in case I ever manage to find free time to dive 
into it. It's be a terrible waste to let these patches go. Right now, I'm in the 
process of finishing my degree (finally) so things are just crap, apologies.

Best regards,
Marek Vasut

^ permalink raw reply

* Re: [iproute2] tc: Show classes more hierarchically]
From: Eric Dumazet @ 2014-12-16 13:49 UTC (permalink / raw)
  To: vadim4j; +Cc: netdev
In-Reply-To: <20141215224851.GB6734@angus-think.lan>

On Tue, 2014-12-16 at 00:48 +0200, vadim4j@gmail.com wrote:
> Hi All,
> 
> I am playing with showing classes in more hierarchically format and I
> have some code and example of output from my TC looks like:
> 
> # tc/tc -t class show dev tap0
> 
>  \---1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>         \---1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>         \---1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>         \---1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>  \---1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>         \---1:10 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>                \---1:11 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>                       \---1:111 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>         \---1:20 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
>         \---1:30 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
> 
> 

> So I'd like to ask if it might be useful for the TC users (may be
> better format ?) to have this ?

Sure, this seems interesting, thanks !

^ permalink raw reply

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Hubert Sokolowski @ 2014-12-16 14:35 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: John Fastabend, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <54902E5E.2070405@mojatatu.com>

>
> I am glad this happened ;-> Otherwise we wouldnt be having this
> discussion. When Vlad was asking me I was in a rush to get the patch
> out and didnt question because i thought this was something some crazy
> virtualization people needed.
> If Vlad's use case goes away, then Hubert's little restoration is fine.
>

I'm afraid there might be a little more to fix in here. I just tested
my patch after I moved the dflt_fdb_dump unconditional call inside br_fdb_dump,
so the "self" is returned again for the bridge and saw that br_fdb_dump is called
twice in some cases. As a result I saw duplicate "self" entries.
I think the problem is in rtnl_fdb_dump how it invokes ndo_fdb_dump.
In 3.16 the algorithm was very simple, now it is a little bit more complicated.

I tested on 3.18 without my changes only added pr_info inside br_fdb_dump:
pr_info("br fdb dump called dev %s filter dev %s.\n", dev->name,
			filter_dev->name);

I loaded dummy module, created bridge br0 with brctl and then attached dummy0
to that bridge:
    brctl  addif  br0 dummy0
Then when trying to filter by brport only:
./bridge fdb show brport dummy0
5e:e2:a0:21:0c:f5 vlan 1 master br0 permanent
5e:e2:a0:21:0c:f5 master br0 permanent
33:33:00:00:00:01 self permanent

Even though the output looks OK, I see in the journalctl logs the callback
was called twice with same attributes:
Dec 16 09:15:39 localhost.localdomain kernel: br fdb dump called dev br0 filter dev dummy0.
Dec 16 09:15:39 localhost.localdomain kernel: br fdb dump called dev br0 filter dev dummy0.

Do you have any idea why this is happening? I hope this test makes sense :).

Thanks,
Hubert

--
Hubert Sokolowski           Intel Corporation

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Rajat Jain @ 2014-12-16 16:04 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Nils Holland, David Miller, netdev, linux-pci@vger.kernel.org
In-Reply-To: <548EF90A.5070607@gmail.com>

Hello All,

Apologies for jumping in late, but for some reason I do not see the
original mail in my inbox. However I am taking a look at the mails as
sent on linux-pci (and I will keep an eye out for the bug report that
Bjorn asked for).

>
> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>
> $ grep 'pci 0000:02' tg3.bad
> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
> [    1.561555] pci 0000:02:00.0: 1st 1 1
> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
> [   20.412021] pci 0000:02:00.0: 1st 1 1
> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
> [   20.413596] pci 0000:02:00.0: 1st 1 1
> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>
> And without it:
>
> $ grep 'pci 0000:02' tg3.good
> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
> [    1.576778] pci 0000:02:00.0: 1st 1 1
> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>

It seems that in the first 2 attempts that were made to probe the
device are all OK and return regular device ID and vendor ID for TG3
(CRS does not have a role to play). However, later attempts return a
CRS.

1) May I ask if you are using acpihp or pciehp? I assume pciehp?

2) Can you please also send dmesg output while passing
pciehp.pciehp_debug=1? In the fail case, do you see a message
indicating the pciehp gave up since it got CRS for a long time
(something like "pci 0000:02:00.0 id reading try 50 times with
interval 20 ms to get ffff0001")?

3) Currently the pciehp passes "0" for the argument "crs_timeout" to
pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say
30 seconds (30 * 1000). (For comparison data, acpihp uses the value
60*1000 i.e. 60 seconds today) and run the fail case once again?

Thanks a lot in advance for the debugging help ;-)

Rajat

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox