Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v2 1/4] dtb: Add SGMII based 1GbE node to APM X-Gene SoC device tree
From: Iyappan Subramanian @ 2014-10-14  0:05 UTC (permalink / raw)
  To: davem, romieu, netdev, devicetree
  Cc: linux-arm-kernel, patches, kchudgar, Iyappan Subramanian
In-Reply-To: <1413245135-2989-1-git-send-email-isubramanian@apm.com>

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Keyur Chudgar <kchudgar@apm.com>
---
 arch/arm64/boot/dts/apm-mustang.dts |  4 ++++
 arch/arm64/boot/dts/apm-storm.dtsi  | 24 ++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/arch/arm64/boot/dts/apm-mustang.dts b/arch/arm64/boot/dts/apm-mustang.dts
index 2ae782b..71a1489 100644
--- a/arch/arm64/boot/dts/apm-mustang.dts
+++ b/arch/arm64/boot/dts/apm-mustang.dts
@@ -33,6 +33,10 @@
 	status = "ok";
 };
 
+&sgenet0 {
+	status = "ok";
+};
+
 &xgenet {
 	status = "ok";
 };
diff --git a/arch/arm64/boot/dts/apm-storm.dtsi b/arch/arm64/boot/dts/apm-storm.dtsi
index d16cc03..f45bbfe 100644
--- a/arch/arm64/boot/dts/apm-storm.dtsi
+++ b/arch/arm64/boot/dts/apm-storm.dtsi
@@ -176,6 +176,16 @@
 				clock-output-names = "menetclk";
 			};
 
+			sge0clk: sge0clk@1f21c000 {
+				compatible = "apm,xgene-device-clock";
+				#clock-cells = <1>;
+				clocks = <&socplldiv2 0>;
+				reg = <0x0 0x1f21c000 0x0 0x1000>;
+				reg-names = "csr-reg";
+				csr-mask = <0x3>;
+				clock-output-names = "sge0clk";
+			};
+
 			xge0clk: xge0clk@1f61c000 {
 				compatible = "apm,xgene-device-clock";
 				#clock-cells = <1>;
@@ -446,6 +456,20 @@
 			};
 		};
 
+		sgenet0: ethernet@1f210000 {
+			compatible = "apm,xgene-enet";
+			status = "disabled";
+			reg = <0x0 0x1f210000 0x0 0x10000>,
+			      <0x0 0x1f200000 0x0 0X10000>,
+			      <0x0 0x1B000000 0x0 0X20000>;
+			reg-names = "enet_csr", "ring_csr", "ring_cmd";
+			interrupts = <0x0 0xA0 0x4>;
+			dma-coherent;
+			clocks = <&sge0clk 0>;
+			local-mac-address = [00 00 00 00 00 00];
+			phy-connection-type = "sgmii";
+		};
+
 		xgenet: ethernet@1f610000 {
 			compatible = "apm,xgene-enet";
 			status = "disabled";
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 0/4] Add SGMII based 1GbE support to APM X-Gene SoC ethernet driver
From: Iyappan Subramanian @ 2014-10-14  0:05 UTC (permalink / raw)
  To: davem, romieu, netdev, devicetree
  Cc: linux-arm-kernel, patches, kchudgar, Iyappan Subramanian

Adding SGMII based 1GbE basic support to APM X-Gene SoC ethernet driver.

v2: Address comments from v1
* Split the patchset into two, the first one being preparatory patch
* Added link_state function pointer to the xgene_mac_ops structure
* Added xgene_indirect_ctl structure for indirect read/write arguments

v1:
* Initial version
---

Iyappan Subramanian (4):
  dtb: Add SGMII based 1GbE node to APM X-Gene SoC device tree
  drivers: net: xgene: Preparing for adding SGMII based 1GbE
  drivers: net: xgene: Add SGMII based 1GbE support
  drivers: net: xgene: Add SGMII based 1GbE ethtool support

 arch/arm64/boot/dts/apm-mustang.dts                |   4 +
 arch/arm64/boot/dts/apm-storm.dtsi                 |  24 ++
 drivers/net/ethernet/apm/xgene/Makefile            |   2 +-
 .../net/ethernet/apm/xgene/xgene_enet_ethtool.c    |  25 +-
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c     |   1 -
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.h     |   4 +-
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c   |  18 +-
 drivers/net/ethernet/apm/xgene/xgene_enet_main.h   |  12 +-
 drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c  | 389 +++++++++++++++++++++
 drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.h  |  41 +++
 drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c  |   3 +-
 drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h  |   4 -
 12 files changed, 506 insertions(+), 21 deletions(-)
 create mode 100644 drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c
 create mode 100644 drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.h

-- 
1.9.1

^ permalink raw reply

* Re: something is wrong in commit 971f10eca1 - tcp: better TCP_SKB_CB layout to reduce cache line misses
From: Cong Wang @ 2014-10-13 23:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Krzysztof Kolasa, netdev, edumazet, David Miller
In-Reply-To: <1413139613.9362.78.camel@edumazet-glaptop2.roam.corp.google.com>

Probably not related with this bug, but with regarding to the
offending commit, what's the point of the memmove() in tcp_v4_rcv()
since ip_rcv() already clears IPCB()?

^ permalink raw reply

* Re: [PATCH v1 2/3] drivers: net: xgene: Add SGMII based 1GbE support
From: Iyappan Subramanian @ 2014-10-13 23:19 UTC (permalink / raw)
  To: Francois Romieu
  Cc: David Miller, netdev, devicetree@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, patches, Keyur Chudgar
In-Reply-To: <20141010220112.GA7633@electric-eye.fr.zoreil.com>

Thanks for the review.

On Fri, Oct 10, 2014 at 3:01 PM, Francois Romieu <romieu@fr.zoreil.com> wrote:
> Iyappan Subramanian <isubramanian@apm.com> :
> [...]
>> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
>> index c8f3824..63ea194 100644
>> --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
>> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
>> @@ -410,7 +410,6 @@ static void xgene_gmac_set_mac_addr(struct xgene_enet_pdata *pdata)
>>       addr0 = (dev_addr[3] << 24) | (dev_addr[2] << 16) |
>>               (dev_addr[1] << 8) | dev_addr[0];
>>       addr1 = (dev_addr[5] << 24) | (dev_addr[4] << 16);
>> -     addr1 |= pdata->phy_addr & 0xFFFF;
>
> phy_addr removal is harmless (all zeroes from netdev priv data) but it's
> unrelated to $SUBJECT.
>
> You may split this patch as:
> 1. prettyfication / cruft removal
> 2. add link_state in xgene_mac_ops / pdata->rm shuffle
> 3. SGMII based 1GbE support

I will split the patch into two, the first one being preparatory patch.

>
> Mostly stylistic review below.
>
> [...]
>> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
>> index 15ec426..dc024c1 100644
>> --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
>> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
> [...]
>> @@ -179,7 +180,6 @@ enum xgene_enet_rm {
>>  #define TUND_ADDR                    0x4a
>>
>>  #define TSO_IPPROTO_TCP                      1
>> -#define      FULL_DUPLEX                     2
>
> See above.
>
>>
>>  #define USERINFO_POS                 0
>>  #define USERINFO_LEN                 32
> [...]
>> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c
>> new file mode 100644
>> index 0000000..6038596
>> --- /dev/null
>> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c
> [...]
>> +static void xgene_enet_wr_csr(struct xgene_enet_pdata *pdata,
>> +                           u32 offset, u32 val)
>> +{
>> +     void __iomem *addr = pdata->eth_csr_addr + offset;
>> +
>> +     iowrite32(val, addr);
>> +}
>
> Replace 'pdata' with one of 'xp', 'pd', 'p' ?
>
> You should be able to pack a lot.

That helps!  I will use 'p' where ever possible and try to be consistent.

>
> static void xgene_enet_wr_csr(struct xgene_enet_pdata *p, u32 offset, u32 val)
> {
>         iowrite32(val, p->eth_csr_addr + offset);
> }
>
> There are several of those.
>
> [...]
>> +static bool xgene_enet_wr_indirect(void __iomem *addr, void __iomem *wr,
>> +                                void __iomem *cmd, void __iomem *cmd_done,
>> +                                u32 wr_addr, u32 wr_data)
>> +{
>> +     u32 done;
>> +     u8 wait = 10;
>> +
>> +     iowrite32(wr_addr, addr);
>> +     iowrite32(wr_data, wr);
>> +     iowrite32(XGENE_ENET_WR_CMD, cmd);
>> +
>> +     /* wait for write command to complete */
>> +     while (!(done = ioread32(cmd_done)) && wait--)
>> +             udelay(1);
>> +
>> +     if (!done)
>> +             return false;
>
>         int i;
>
>         for (i = 0; i < 10; i++) {
>                 if (ioread32(cmd_done)) {
>                         iowrite32(0, cmd);
>                         return true;
>                 }
>                 udelay(1);
>         }
>
>         return false;
>

Sure.  I will use the 'for' loop.

>> +
>> +     iowrite32(0, cmd);
>> +
>> +     return true;
>> +}
>> +
>> +static void xgene_enet_wr_mac(struct xgene_enet_pdata *pdata,
>> +                           u32 wr_addr, u32 wr_data)
>> +{
>> +     void __iomem *addr, *wr, *cmd, *cmd_done;
>> +
>> +     addr = pdata->mcx_mac_addr + MAC_ADDR_REG_OFFSET;
>> +     wr = pdata->mcx_mac_addr + MAC_WRITE_REG_OFFSET;
>> +     cmd = pdata->mcx_mac_addr + MAC_COMMAND_REG_OFFSET;
>> +     cmd_done = pdata->mcx_mac_addr + MAC_COMMAND_DONE_REG_OFFSET;
>
> struct xgene_indirect_ctl {
>         void __iomem *addr;
>         void __iomem *ctl;
>         void __iomem *cmd;
>         void __iomem *cmd_done;
> };
>
> static void xgene_enet_wr_mac(struct xgene_enet_pdata *p, u32 addr, u32 data)
> {
>         struct xgene_indirect_ctl ctl = {
>                 .addr           = p->mcx_mac_addr + MAC_ADDR_REG_OFFSET;
>                 .ctl            = p->mcx_mac_addr + MAC_WRITE_REG_OFFSET;
>                 .cmd            = p->mcx_mac_addr + MAC_COMMAND_REG_OFFSET;
>                 .cmd_done       = p->mcx_mac_addr + MAC_COMMAND_DONE_REG_OFFSET;
>         };
>
>         if (!xgene_enet_wr_indirect(&ctl, wr_addr, wr_data)) {
>                 ...
>
> It's syntaxic sugar that avoids (an excess of) 'void *' parameters.

I agree and will use xgene_indirect_ctl.

>
> You could reuse it for xgene_enet_rd_mac.
>
>> +
>> +     if (!xgene_enet_wr_indirect(addr, wr, cmd, cmd_done, wr_addr, wr_data))
>> +             netdev_err(pdata->ndev, "MCX mac write failed, addr: %04x\n",
>> +                        wr_addr);
>> +}
>> +
>> +static void xgene_enet_rd_csr(struct xgene_enet_pdata *pdata,
>> +                           u32 offset, u32 *val)
>> +{
>> +     void __iomem *addr = pdata->eth_csr_addr + offset;
>> +
>> +     *val = ioread32(addr);
>> +}
>
> static u32 xgene_enet_rd_csr(struct xgene_enet_pdata *pdata, u32 offset)
> {
>         return ioread32(pdata->eth_csr_addr + offset);
> }

I will change as you suggested and it is consistent with ioread32().

>
>> +
>> +static void xgene_enet_rd_diag_csr(struct xgene_enet_pdata *pdata,
>> +                                u32 offset, u32 *val)
>> +{
>> +     void __iomem *addr = pdata->eth_diag_csr_addr + offset;
>> +
>> +     *val = ioread32(addr);
>> +}
>> +
>> +static bool xgene_enet_rd_indirect(void __iomem *addr, void __iomem *rd,
>> +                                void __iomem *cmd, void __iomem *cmd_done,
>> +                                u32 rd_addr, u32 *rd_data)
>> +{
>> +     u32 done;
>> +     u8 wait = 10;
>> +
>> +     iowrite32(rd_addr, addr);
>> +     iowrite32(XGENE_ENET_RD_CMD, cmd);
>> +
>> +     /* wait for read command to complete */
>> +     while (!(done = ioread32(cmd_done)) && wait--)
>> +             udelay(1);
>> +
>> +     if (!done)
>> +             return false;
>
> See above.
>
> [...]
>> +static void xgene_sgmac_rx_enable(struct xgene_enet_pdata *pdata)
>> +{
>> +     u32 data;
>> +
>> +     xgene_enet_rd_mac(pdata, MAC_CONFIG_1_ADDR, &data);
>> +     xgene_enet_wr_mac(pdata, MAC_CONFIG_1_ADDR, data | RX_EN);
>> +}
>> +
>> +static void xgene_sgmac_tx_enable(struct xgene_enet_pdata *pdata)
>> +{
>> +     u32 data;
>> +
>> +     xgene_enet_rd_mac(pdata, MAC_CONFIG_1_ADDR, &data);
>> +     xgene_enet_wr_mac(pdata, MAC_CONFIG_1_ADDR, data | TX_EN);
>> +}
>
> static void _xgene_sgmac_rxtx(struct xgene_enet_pdata *p, bool set, u32 bits)
> {
>         u32 data;
>
>         xgene_enet_rd_mac(pdata, MAC_CONFIG_1_ADDR, &data);
>
>         if (set)
>                 data |= bits;
>         else
>                 data &= ~bits;
>
>         xgene_enet_wr_mac(pdata, MAC_CONFIG_1_ADDR, data);
> }
>
> (or _xgene_sgmac_rxtx(struct xgene_enet_pdata *p, u32 set, u32 clear))
>
> static void xgene_sgmac_rxtx_set(struct xgene_enet_pdata *p, u32 bits)
> {
>         _xgene_sgmac_rxtx(p, true, bits);
> }
>
> static void xgene_sgmac_rx_enable(struct xgene_enet_pdata *p)
> {
>         xgene_sgmac_rxtx_set(p, RX_EN);

Thanks for the suggestion.  I would prefer to call _xgene_sgmac_rxtx
directly from here.

> }
>
> static void xgene_sgmac_tx_enable(struct xgene_enet_pdata *p)
> {
>         xgene_sgmac_rxtx_set(p, TX_EN);
> }
>
> static void xgene_sgmac_rxtx_clear(struct xgene_enet_pdata *p, u32 bits)
> {
>         _xgene_sgmac_rxtx(p, false, bits);
> }
>
> etc.
>
> [...]
>> +struct xgene_mac_ops xgene_sgmac_ops = {
>> +     .init = xgene_sgmac_init,
>> +     .reset = xgene_sgmac_reset,
>> +     .rx_enable = xgene_sgmac_rx_enable,
>> +     .tx_enable = xgene_sgmac_tx_enable,
>> +     .rx_disable = xgene_sgmac_rx_disable,
>> +     .tx_disable = xgene_sgmac_tx_disable,
>> +     .set_mac_addr = xgene_sgmac_set_mac_addr,
>> +     .link_state = xgene_enet_link_state
>
> Please use tabs before '='.

I will use tabs before "=" for xgene_mac_ops and xgene_port_ops
structure initialization.

>
>> +};
>> +
>> +struct xgene_port_ops xgene_sgport_ops = {
>> +     .reset = xgene_enet_reset,
>> +     .cle_bypass = xgene_enet_cle_bypass,
>> +     .shutdown = xgene_enet_shutdown,
>
> See above.
>
> [...]
>> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.h b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.h
>> new file mode 100644
>> index 0000000..de43246
>> --- /dev/null
>> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.h
> [...]
>> +#define PHY_ADDR(src)                (((src)<<8) & GENMASK(12, 8))
>
>    #define PHY_ADDR(src)                (((src) << 8) & GENMASK(12, 8))
>
> --
> Ueimor

^ permalink raw reply

* Re: [PATCH v9 net-next 2/4] net: filter: split filter.h and expose eBPF to user space
From: Alexei Starovoitov @ 2014-10-13 21:49 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Hannes Frederic Sowa, Chema Gonzalez,
	Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML
In-Reply-To: <543C0A2D.5040908@redhat.com>

On Mon, Oct 13, 2014 at 10:21 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 09/03/2014 05:46 PM, Daniel Borkmann wrote:
> ...
>>
>> Ok, given you post the remaining two RFCs later on this window as
>> you indicate, I have no objections:
>>
>> Acked-by: Daniel Borkmann <dborkman@redhat.com>
>
>
> Ping, Alexei, are you still sending the patch for bpf_common.h or
> do you want me to take care of this?

It's not forgotten.
I'm not sending it only because net-next is closed
and it seems to be -next material.

^ permalink raw reply

* Re: Fw: [Bug 86081] New: Can't free the return value of sock_kmalloc() when the value is NULL
From: Cong Wang @ 2014-10-13 20:58 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Andy Grover, netdev
In-Reply-To: <20141012170756.19d1d12e@uryu.home.lan>

On Sun, Oct 12, 2014 at 8:07 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> in function rds_cmsg_rdma_args() at net/rds/rdma.c:L546, the variable
> "iovstack" is an array and the pointer variable *iovs is equal to iovstack (at
> Line 554). As the the return value of sock_kmalloc() (called at line 578),when
> "iovs" is NULL, function sock_kfree_s() will be called(at line 697) and
> function sock_kfree_s() will free "iovs".
> The related code snippets in function rds_cmsg_rdma_args() are as followings.
> rds_cmsg_rdma_args() at net/rds/rdma.c:L546

Free'ing NULL is okay, but not sub'ing the sk_omem_alloc size.
I will send a patch.

^ permalink raw reply

* Re: vxlan gro problem ?
From: Or Gerlitz @ 2014-10-13 20:56 UTC (permalink / raw)
  To: yinpeijun
  Cc: Or Gerlitz, qinchuanyu, Linux Netdev List, Linux Kernel, lichunhe,
	wangfakai, liuyongan
In-Reply-To: <543B97F0.6040603@huawei.com>

On Mon, Oct 13, 2014 at 11:14 AM, yinpeijun <yinpeijun@huawei.com> wrote:
> On 2014/10/13 3:50, Or Gerlitz wrote:
> my test environment use mellanox ConnectX-3 Pro nic ,  as I know the nic support Rx  checksum offload.  but I am not confirm if should  I  do some special configure?
> or the nic driver or firmware need update  ?  also , I have used redhat7.0 ovs vxlan to test with the similar configure as before, but there is also no improvement .

The NIC (HW model and firmware) look just fine. As it seems now, this
boils down to get the RHEL7 inbox mlx4 driver to work properly on your
setup, something which goes a bit beyond the interest of the upstream
mailing lists...

Or.

>
> the nic infomation:
>
> 04:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
>
> root@localhost:~# ethtool -i eth4
> driver: mlx4_en
> version: 2.0(Dec 2011)
> firmware-version:  2.31.5050
> bus-info: 0000:04:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: no
> supports-register-dump: no
> supports-priv-flags: yes
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Dave Taht @ 2014-10-13 20:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Alexander Duyck, Hannes Frederic Sowa,
	John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
	Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, Tom Herbert, David Miller
In-Reply-To: <20141013222717.42f3c399@redhat.com>

On Mon, Oct 13, 2014 at 1:27 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 13 Oct 2014 10:20:17 -0700 Dave Taht <dave.taht@gmail.com> wrote:
>
>> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> >> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
> [...]
>>
>> I would like to also get better behavior out of gigE and below, and for
>> these changes to not impact the downstream behavior of the network
>> overall.
>
> I also care about 1Gbit/s and below, that why I did some many tests
> (with igb at 10Mbit/s, 100Mbit/s and 1Gbit/s).
>
>
>> To give you an example, I would like to see the tcp flows in the
>> 2nd chart here, to converge faster than the 5 seconds they currently
>> take at GigE speeds.
>>
>> http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html
>
> In the last graph, where you cannot saturate the link, because you
> turned off GSO, GRO and TSO.  Here I expect you will see the benefit of
> the qdisc bulking. That is, will be able to saturate the link and
> achieve the lower latency as BQL will cut off the bursts at +1 MTU.
> I would be interested in the results...

I am too!!!! 5 seconds to converge? 50x the baseline latency when under load?
vs not being able to saturate the link at all? Ugh. Two lousy choices.

I think xmit_more will
help a lot in the latter case, my other suggestions regarding reducing
the size of the offloads in the former.

But it looks like xmit_more support needs to be added to fq, and fq_codel (?),
and despite me reading the patches submitted thus far, it would be saner
for someone else to patch e1000e support for the nuc (and the zillions
of other e1000e platforms)

(did I miss that patch go by?)

I'm certainly willing to test the result on that platform (and I have
some other tweaks
in my queue at the qdisc layer that I can throw in, also).

>> > We made all these changes so that we can spend cpu cycles at the right
>> > place.
>
> Exactly.

+1.

So what happens when more cpu cycles are available in the right place?

The dequeue routines in both fq and fq_codel are a bit more complex than
pfifo_fast (and I've longed to kill the maxpacket concept in codel, btw)....

y'all are in such a lovely place with profilers and hardware at the ready
to just make a simple sysctl and analyze what happens at rates I can
only dream of.


>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Sr. Network Kernel Developer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer



-- 
Dave Täht

https://www.bufferbloat.net/projects/make-wifi-fast

^ permalink raw reply

* Re: [ovs-dev] [PATCH/RFC repost 7/8] ofproto: translate datapath select group action
From: Ben Pfaff @ 2014-10-13 20:46 UTC (permalink / raw)
  To: Simon Horman; +Cc: dev, netdev
In-Reply-To: <20141009011434.GE9339@vergenet.net>

On Thu, Oct 09, 2014 at 10:14:36AM +0900, Simon Horman wrote:
> On Fri, Sep 26, 2014 at 04:57:25PM -0700, Ben Pfaff wrote:
> > On Thu, Sep 18, 2014 at 10:55:10AM +0900, Simon Horman wrote:
> > > This patch is a prototype and has several limitations:
> > > 
> > > * It assumes that no actions follow a select group action
> > >   because the resulting packet after a select group action may
> > >   differ depending on the bucket used. It may be possible
> > >   to address this problem using recirculation. Or to not use
> > >   the datapath select group in such situations. In any case
> > >   this patch does not solve this problem or even prevent it
> > >   from occurring.
> > 
> > It seems like this limitation in particular is a pretty big one.  Do
> > you have a good plan in mind for how to resolve it?
> 
> Hi Ben,
> 
> it seems to me that this would be somewhat difficult to resolve in the
> datapath so I propose not doing so. And I have two ideas on how to
> resolve this problem outside of the datapath.
> 
> 1. Recirculation
> 
>    It seems to me that it ought to be possible to handle this by
>    recirculating if actions occur after an ODP select group action.
> 
>    This could be made slightly more selective by only recirculating
>    if the execution different buckets may result in different packet
>    contents and the actions after the ODP select group action rely on
>    the packet contents (e.g. set actions do but output actions do not).
> 
>    My feeling is that this could be implemented by adding a small amount
>    of extra state to action translation in ovs-vswitchd.
> 
> 2. Fall back to selecting buckets in ovs-vswtichd
> 
>    The idea here is to detect cases where there would be a problem
>    executing actions after an ODP select group action and in that
>    case to select buckets in ovs-vswtichd: that is use the existing bucket
>    translation code in ovs-vswtichd.
> 
>    Though this seems conceptually simpler than recirculation it
>    seems to me that it would be somewhat more difficult to implement
>    as it implies a two stage translation process: e.g. one stage to
>    determine if an ODP select group may be used; and one to perform
>    the translation.
> 
>    I seem to recall trying various two stage translation processes
>    as part some earlier unrelated work. And my recollection is that
>    the result of my previous efforts were not pretty.
> 
> Both of the above more or less negate any benefits of ODP select group
> action. In particular lowering flow setup cost and potentially allowing
> complete offload of select groups from the datapath to hardware. However I
> think that this case is not a common one as it requires both of the
> following. And I think they are both not usual use cases.
> 
> * Different buckets modifying packets in different ways
>   - My expectation is that it is common for buckets to be homogeneous in
>     regards to packet modifications. But perhaps this is na??ve in the
>     context of VLANs, MPLS, and similar tags that can be pushed and popped.
> * Actions that rely on packet contents after
>   - My expectation is that it is common to use a select group to output
>     packets and that is the final action performed.

I am glad that you have thought about it.  Your ideas seem like a good
start to me.  Personally, approach #2, of falling back to selecting
buckets in ovs-vswitchd, seems like a clean solution to me, although
if it really takes multiple stages in translation then that is
undesirable, so I hope that some clean and simple approach works out.

I think that we probably need a solution before we can apply the patch
series, because otherwise we end up with half-working code.

^ permalink raw reply

* Re: [PATCH 1/1 net-next] caif_usb: remove redundant memory message
From: Joe Perches @ 2014-10-13 20:34 UTC (permalink / raw)
  To: Fabian Frederick; +Cc: linux-kernel, Dmitry Tarnyagin, David S. Miller, netdev
In-Reply-To: <1413231946-9914-1-git-send-email-fabf@skynet.be>

On Mon, 2014-10-13 at 22:25 +0200, Fabian Frederick wrote:
> Let MM subsystem display out of memory messages.
[]
> diff --git a/net/caif/caif_usb.c b/net/caif/caif_usb.c
[]
> @@ -87,10 +87,9 @@ static struct cflayer *cfusbl_create(int phyid, u8 ethaddr[ETH_ALEN],
>  {
>  	struct cfusbl *this = kmalloc(sizeof(struct cfusbl), GFP_ATOMIC);
>  
> -	if (!this) {
> -		pr_warn("Out of memory\n");
> +	if (!this)
>  		return NULL;
> -	}
> +
>  	caif_assert(offsetof(struct cfusbl, layer) == 0);
>  
>  	memset(this, 0, sizeof(struct cflayer));

This bit should probably be:

	memset(&this->layer, 0, sizeof(this->layer));

or the allocation above should use kzalloc.

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Jesper Dangaard Brouer @ 2014-10-13 20:27 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, Alexander Duyck, Hannes Frederic Sowa,
	John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
	Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, Tom Herbert, David Miller,
	brouer
In-Reply-To: <CAA93jw5MKVaNaObVD3Lm=DP0ie6HcjwOvt0X-whDLJD_6hF5bw@mail.gmail.com>


On Mon, 13 Oct 2014 10:20:17 -0700 Dave Taht <dave.taht@gmail.com> wrote:

> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
[...]
> 
> I would like to also get better behavior out of gigE and below, and for
> these changes to not impact the downstream behavior of the network
> overall.

I also care about 1Gbit/s and below, that why I did some many tests
(with igb at 10Mbit/s, 100Mbit/s and 1Gbit/s).  

 
> To give you an example, I would like to see the tcp flows in the
> 2nd chart here, to converge faster than the 5 seconds they currently
> take at GigE speeds.
> 
> http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html

In the last graph, where you cannot saturate the link, because you
turned off GSO, GRO and TSO.  Here I expect you will see the benefit of
the qdisc bulking. That is, will be able to saturate the link and
achieve the lower latency as BQL will cut off the bursts at +1 MTU.
I would be interested in the results...


> > We made all these changes so that we can spend cpu cycles at the right
> > place.

Exactly. 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH 1/1 net-next] caif_usb: remove redundant memory message
From: Fabian Frederick @ 2014-10-13 20:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Fabian Frederick, Dmitry Tarnyagin, David S. Miller, netdev

Let MM subsystem display out of memory messages.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
---
 net/caif/caif_usb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/caif/caif_usb.c b/net/caif/caif_usb.c
index ba02db0..0e487b0 100644
--- a/net/caif/caif_usb.c
+++ b/net/caif/caif_usb.c
@@ -87,10 +87,9 @@ static struct cflayer *cfusbl_create(int phyid, u8 ethaddr[ETH_ALEN],
 {
 	struct cfusbl *this = kmalloc(sizeof(struct cfusbl), GFP_ATOMIC);
 
-	if (!this) {
-		pr_warn("Out of memory\n");
+	if (!this)
 		return NULL;
-	}
+
 	caif_assert(offsetof(struct cfusbl, layer) == 0);
 
 	memset(this, 0, sizeof(struct cflayer));
-- 
1.9.3

^ permalink raw reply related

* [PATCH 1/1 net-next] caif: replace kmalloc/memset 0 by kzalloc
From: Fabian Frederick @ 2014-10-13 20:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: Fabian Frederick, Dmitry Tarnyagin, David S. Miller, netdev

Also add blank line after declaration

Signed-off-by: Fabian Frederick <fabf@skynet.be>
---
 net/caif/cfmuxl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/caif/cfmuxl.c b/net/caif/cfmuxl.c
index 8c5d638..510aa5a 100644
--- a/net/caif/cfmuxl.c
+++ b/net/caif/cfmuxl.c
@@ -47,10 +47,10 @@ static struct cflayer *get_up(struct cfmuxl *muxl, u16 id);
 
 struct cflayer *cfmuxl_create(void)
 {
-	struct cfmuxl *this = kmalloc(sizeof(struct cfmuxl), GFP_ATOMIC);
+	struct cfmuxl *this = kzalloc(sizeof(struct cfmuxl), GFP_ATOMIC);
+
 	if (!this)
 		return NULL;
-	memset(this, 0, sizeof(*this));
 	this->layer.receive = cfmuxl_receive;
 	this->layer.transmit = cfmuxl_transmit;
 	this->layer.ctrlcmd = cfmuxl_ctrlcmd;
-- 
1.9.3

^ permalink raw reply related

* Bug#763325: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
From: ael @ 2014-10-13 20:20 UTC (permalink / raw)
  To: netdev; +Cc: 763325

It was suggested that I post this bug here.
>From /var/log/messages:
=======================================================================

Oct  6 14:16:29 shelf kernel: [  961.633769] ------------[ cut here ]------------
Oct  6 14:16:29 shelf kernel: [  961.633790] WARNING: CPU: 0 PID: 0 at /build/linux-P15SNz/linux-3.16.3/net/sched/sch_generic.c:264
dev_watchdog+0x236/0x240()
Oct  6 14:16:29 shelf kernel: [  961.633792] NETDEV WATCHDOG: eth0
(r8169): transmit queue 0 timed out
Oct  6 14:16:29 shelf kernel: [  961.633794] Modules linked in:
sha256_ssse3 sha256_generic dm_crypt dm_mod usb_storage uinput cpuid
snd_hrtimer binfmt_misc snd_seq snd_seq_device bnep joydev nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc
snd_hda_codec_realtek iTCO_wdt snd_hda_codec_hdmi snd_hda_codec_generic
iTCO_vendor_support ecb uvcvideo videobuf2_vmalloc btusb
videobuf2_memops videobuf2_core v4l2_common bluetooth videodev media
6lowpan_iphc arc4 rtsx_pci_sdmmc rtsx_pci_ms iwlmvm memstick mmc_core
mac80211 iwlwifi cfg80211 r8169 rtsx_pci mii rfkill x86_pkg_temp_thermal
intel_powerclamp intel_rapl coretemp kvm_intel kvm crc32_pclmul
crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd lpc_ich evdev psmouse snd_hda_intel
serio_raw sr_mod snd_hda_controller pcspkr sg snd_hda_codec cdrom
mfd_core tpm_infineon i915 snd_hwdep i2c_i801 wmi shpchp snd_pcm
ehci_pci snd_timer tpm_tis drm_kms_helper ehci_hcd xhci_hcd snd drm
soundcore tpm i2c_algo_bit mei_me usbcore i2c_core mei video usb_common
ac battery processor button fuse parport_pc ppdev lp parport autofs4
ext4 crc16 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ahci libahci
crct10dif_pclmul crct10dif_common libata scsi_mod thermal thermal_sys
Oct  6 14:16:29 shelf kernel: [  961.633898] CPU: 0 PID: 0 Comm:
swapper/0 Not tainted 3.16-2-amd64 #1 Debian 3.16.3-2
Oct  6 14:16:29 shelf kernel: [  961.633899] Hardware name:
Notebook                         W54_55SU1,SUW/W54_55SU1,SUW, BIOS 4.6.5
05/29/2014
Oct  6 14:16:29 shelf kernel: [  961.633901]  0000000000000009 ffffffff81506188 ffff88041fa03e28 ffffffff81065707
Oct  6 14:16:29 shelf kernel: [  961.633903]  0000000000000000
ffff88041fa03e78 0000000000000001 0000000000000000
Oct  6 14:16:29 shelf kernel: [  961.633907]  ffff88040a0ec000
ffffffff8106576c ffffffff81775ca0 0000000000000030
Oct  6 14:16:29 shelf kernel: [  961.633912] Call Trace:
Oct  6 14:16:29 shelf kernel: [  961.633914]  <IRQ>
[<ffffffff81506188>] ? dump_stack+0x41/0x51
Oct  6 14:16:29 shelf kernel: [  961.633922]  [<ffffffff81065707>] ?
warn_slowpath_common+0x77/0x90
Oct  6 14:16:29 shelf kernel: [  961.633926]  [<ffffffff8106576c>] ?
warn_slowpath_fmt+0x4c/0x50
Oct  6 14:16:29 shelf kernel: [  961.633930]  [<ffffffff81439af6>] ?
dev_watchdog+0x236/0x240
Oct  6 14:16:29 shelf kernel: [  961.633936]  [<ffffffff814398c0>] ?
dev_graft_qdisc+0x70/0x70
Oct  6 14:16:29 shelf kernel: [  961.633941]  [<ffffffff810709c1>] ?
call_timer_fn+0x31/0x100
Oct  6 14:16:29 shelf kernel: [  961.633945]  [<ffffffff814398c0>] ?
dev_graft_qdisc+0x70/0x70
Oct  6 14:16:29 shelf kernel: [  961.633948]  [<ffffffff81071ff9>] ?
run_timer_softirq+0x209/0x2f0
Oct  6 14:16:29 shelf kernel: [  961.633953]  [<ffffffff8106a5a1>] ?
__do_softirq+0xf1/0x290
Oct  6 14:16:29 shelf kernel: [  961.633956]  [<ffffffff8106a975>] ?
irq_exit+0x95/0xa0
Oct  6 14:16:29 shelf kernel: [  961.633961]  [<ffffffff8150f155>] ?
smp_apic_timer_interrupt+0x45/0x60
Oct  6 14:16:29 shelf kernel: [  961.633965]  [<ffffffff8150d25d>] ?
apic_timer_interrupt+0x6d/0x80
Oct  6 14:16:29 shelf kernel: [  961.633968]  <EOI>
[<ffffffff81088b5d>] ? __hrtimer_start_range_ns+0x1cd/0x390
Oct  6 14:16:29 shelf kernel: [  961.633976]  [<ffffffff813d96e2>] ?
cpuidle_enter_state+0x52/0xc0
Oct  6 14:16:29 shelf kernel: [  961.633980]  [<ffffffff813d96d8>] ?
cpuidle_enter_state+0x48/0xc0
Oct  6 14:16:29 shelf kernel: [  961.633984]  [<ffffffff810a5d28>] ?
cpu_startup_entry+0x2f8/0x400
Oct  6 14:16:29 shelf kernel: [  961.633989]  [<ffffffff8190305a>] ?
start_kernel+0x47b/0x486
Oct  6 14:16:29 shelf kernel: [  961.633993]  [<ffffffff81902a04>] ?
set_init_arg+0x4e/0x4e
Oct  6 14:16:29 shelf kernel: [  961.633997]  [<ffffffff81902120>] ?
early_idt_handlers+0x120/0x120
Oct  6 14:16:29 shelf kernel: [  961.634000]  [<ffffffff8190271f>] ?
x86_64_start_kernel+0x14d/0x15c
Oct  6 14:16:29 shelf kernel: [  961.634004] ---[ end trace
065e688ac03d9c53 ]---
Oct  6 14:16:29 shelf kernel: [  961.657322] r8169 0000:03:00.1 eth0:
link up
===================================================================

I believe I was changing from ethernet to wireless frequently while
doing some testing just before this particular instance. However this
bug is typicaly trigggered several times when just using eth0 and has
been happening for the last week.

For completeness, here are extracts from the log before the instance
above (when I was probably explicitly taking eth0 up and down, so it
possibly has irrelevant noise from those activities):

Oct  6 13:59:06 shelf kernel: [    1.466463] r8169 0000:03:00.1 eth0:
RTL8411 at 0xffffc900018c4000, 80:fa:5b:04:ca:96, XID 1c800880 IRQ 44
Oct  6 13:59:06 shelf kernel: [    1.466470] r8169 0000:03:00.1 eth0:
jumbo features [frames: 9200 bytes, tx checksumming: ko]
--[snip]--
Oct  6 13:59:06 shelf kernel: [    1.556245] r8169 0000:03:00.1:
firmware: direct-loading firmware rtl_nic/rtl8411-2.fw
Oct  6 13:59:06 shelf kernel: [    1.571948] r8169 0000:03:00.1 eth0:
link down
--[snip]--
Oct  6 13:59:09 shelf kernel: [    5.463398] r8169 0000:03:00.1 eth0:
link up
--[snip]--
Oct  6 14:09:17 shelf kernel: [  545.893079] r8169 0000:03:00.1 eth0:
link down
--[snip]--
Oct  6 14:09:17 shelf kernel: [  547.647632] r8169 0000:03:00.1: no
hotplug settings from platform
Oct  6 14:09:17 shelf kernel: [  547.648327] done.
Oct  6 14:09:17 shelf kernel: [  547.660691] Bluetooth: hci0: read Intel
version: 370710018002030d37
Oct  6 14:09:17 shelf kernel: [  547.660693] Bluetooth: hci0: Intel
device is already patched. patch num: 37
Oct  6 14:09:17 shelf kernel: [  547.805664] [drm] Enabling RC6 states:
RC6 on, RC6p off, RC6pp off
Oct  6 14:09:23 shelf kernel: [  553.115079] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:24 shelf kernel: [  554.134710] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:27 shelf kernel: [  557.556657] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:28 shelf kernel: [  558.578349] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:32 shelf kernel: [  562.018553] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:33 shelf kernel: [  563.040819] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:36 shelf kernel: [  566.548111] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:37 shelf kernel: [  567.568219] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:41 shelf kernel: [  571.786079] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:42 shelf kernel: [  572.808291] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:45 shelf kernel: [  575.383331] r8169 0000:03:00.1 eth0:
link up
--[snip]--
Oct  6 14:11:59 shelf kernel: [  691.106397] r8169 0000:03:00.1 eth0:
link down
--[snip]--
Oct  6 14:11:59 shelf kernel: [  691.974414] r8169 0000:03:00.1: no
hotplug settings from platform
--[snip]--
Oct  6 14:12:05 shelf kernel: [  697.522170] r8169 0000:03:00.1 eth0:
link up
========================================================

lspci:
-----
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)
00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation HM86 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
02:00.0 Network controller: Intel Corporation Wireless 7260 (rev 73)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 5287 (rev 01)
03:00.1 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12)
-----------------------------------------------------------

lspci -v -s 03:00.1:
-------------------

03:00.1 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12)
	Subsystem: CLEVO/KAPOK Computer Device 5455
	Flags: bus master, fast devsel, latency 0, IRQ 43
	I/O ports at e000 [size=256]
	Memory at f7c14000 (64-bit, non-prefetchable) [size=4K]
	Memory at f7c10000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
--------------------------------------------------------------------

I have not set the debug parameter on r8169 but can do so if that would
be useful.

ael

^ permalink raw reply

* Re: [PATCH] net: wireless: brcm80211: brcmfmac: dhd_sdio.c: Cleaning up missing null-terminate in conjunction with strncpy
From: Rickard Strandqvist @ 2014-10-13 20:14 UTC (permalink / raw)
  To: David Laight
  Cc: Brett Rudley, Arend van Spriel, Hante Meuleman, John W. Linville,
	Pieter-Paul Giesberts, Daniel Kim, linux-wireless@vger.kernel.org,
	brcm80211-dev-list@broadcom.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D174C992F@AcuExch.aculab.com>

2014-10-13 10:55 GMT+02:00 David Laight <David.Laight@aculab.com>:
> From: Rickard Strandqvist
>> Replacing strncpy with strlcpy to avoid strings that lacks null terminate.
>> And changed from using strncpy to strlcpy to simplify code.
>
> I think you should return an error if the strings get truncated.
> Silent truncation is going to lead to issues at some point in the future
> (in some places).
>
>> Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
>> ---
>>  drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c |   25 ++++++++++----------
>>  1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> b/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> index f55f625..d20d4e6 100644
>> --- a/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> +++ b/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> @@ -670,7 +670,6 @@ static int brcmf_sdio_get_fwnames(struct brcmf_chip *ci,
>>                                 struct brcmf_sdio_dev *sdiodev)
>>  {
>>       int i;
>> -     uint fw_len, nv_len;
>>       char end;
>>
>>       for (i = 0; i < ARRAY_SIZE(brcmf_fwname_data); i++) {
>> @@ -684,25 +683,25 @@ static int brcmf_sdio_get_fwnames(struct brcmf_chip *ci,
>>               return -ENODEV;
>>       }
>>
>> -     fw_len = sizeof(sdiodev->fw_name) - 1;
>> -     nv_len = sizeof(sdiodev->nvram_name) - 1;
>>       /* check if firmware path is provided by module parameter */
>>       if (brcmf_firmware_path[0] != '\0') {
>> -             strncpy(sdiodev->fw_name, brcmf_firmware_path, fw_len);
>> -             strncpy(sdiodev->nvram_name, brcmf_firmware_path, nv_len);
>> -             fw_len -= strlen(sdiodev->fw_name);
>> -             nv_len -= strlen(sdiodev->nvram_name);
>> +             strlcpy(sdiodev->fw_name, brcmf_firmware_path,
>> +                     sizeof(sdiodev->fw_name));
>> +             strlcpy(sdiodev->nvram_name, brcmf_firmware_path,
>> +                     sizeof(sdiodev->nvram_name));
>>
>>               end = brcmf_firmware_path[strlen(brcmf_firmware_path) - 1];
>
> If you are doing a strlen() here, you could use the length for the copy
> and/or use it to avoid the strcat().
>
>>               if (end != '/') {
>> -                     strncat(sdiodev->fw_name, "/", fw_len);
>> -                     strncat(sdiodev->nvram_name, "/", nv_len);
>> -                     fw_len--;
>> -                     nv_len--;
>> +                     strlcat(sdiodev->fw_name, "/",
>> +                             sizeof(sdiodev->fw_name));
>> +                     strlcat(sdiodev->nvram_name, "/",
>> +                             sizeof(sdiodev->nvram_name));
>>               }
>>       }
>> -     strncat(sdiodev->fw_name, brcmf_fwname_data[i].bin, fw_len);
>> -     strncat(sdiodev->nvram_name, brcmf_fwname_data[i].nv, nv_len);
>> +     strlcat(sdiodev->fw_name, brcmf_fwname_data[i].bin,
>> +             sizeof(sdiodev->fw_name));
>> +     strlcat(sdiodev->nvram_name, brcmf_fwname_data[i].nv,
>> +             sizeof(sdiodev->nvram_name));
>
> I assume something ensures that fw_name[0] == 0 here.
>
>         David


Hi David


What do you mean you would use the strlen, can you give some example
code instead?


Arend van Spriel wanted me to change the title before.
So this should really continue the conversation in that new mail. And
then you can also see my other comments.
And I have the same type of objection to the:
brcmf_firmware_path[0] == '\0'

Se:
https://lkml.org/lkml/2014/10/12/42

Kind regards
Rickard Strandqvist

^ permalink raw reply

* Fwd: micrel: ksz8051 badly detected as ksz8031
From: Angelo Dureghello @ 2014-10-13 19:34 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <543BB635.80608@gmail.com>

Hi,

i confirm, moving from kernel 3.5.1 to  3.17.0 the network Micrel phy 
link is non
functional.

The non working detection of KSZ8051 (taken default, ksz8031, from mdio 
driver)
breaks the functionality.

But this is probably due to the fact i am not using DT population, but 
just using
a minimal DT just to boot (to avoidto change the bootloader setup that's 
already
into production).

What change from 3.5.1 and 3.17.0 is mainly that DT support has been added.

If i copy the old 3.5.1 micrel.c i can have the link, but network was 
not working
properly.

So i brute-fixed the issue into 3.17.0 with this small patchfor my need,
Imiss the complete understanding still for a proper fix.Hope you can 
elucidate.




diff -crB drivers/net/ethernet/ti/davinci_mdio.c 
../linux-3.17/drivers/net/ethernet/ti/davinci_mdio.c
*** drivers/net/ethernet/ti/davinci_mdio.c      2014-10-13 
21:02:25.522022750 +0200
--- ../linux-3.17/drivers/net/ethernet/ti/davinci_mdio.c 2014-10-05 
21:23:04.000000000 +0200
***************
*** 320,327 ****
   }
   #endif

- extern struct phy_driver *micrel_pdrv;
-
   static int davinci_mdio_probe(struct platform_device *pdev)
   {
         struct mdio_platform_data *pdata = dev_get_platdata(&pdev->dev);
--- 320,325 ----
***************
*** 397,408 ****
         for (addr = 0; addr < PHY_MAX_ADDR; addr++) {
                 phy = data->bus->phy_map[addr];
                 if (phy) {
-                       /*
-                        * Angelo : micrel broken the link adding OF 
support.
-                        * waiting for the patch, i force the link to 
out phy.
-                        */
-                       phy->drv = micrel_pdrv;
-                       data->bus->phy_map[addr] = phy;
                         dev_info(dev, "phy[%d]: device %s, driver %s\n",
                                  phy->addr, dev_name(&phy->dev),
                                  phy->drv ? phy->drv->name : "unknown");
--- 395,400 ----

diff -crB drivers/net/phy/micrel.c ../linux-3.17/drivers/net/phy/micrel.c
*** drivers/net/phy/micrel.c    2014-10-13 21:00:03.273616086 +0200
--- ../linux-3.17/drivers/net/phy/micrel.c      2014-10-05 
21:23:04.000000000 +0200
***************
*** 640,650 ****
                 ARRAY_SIZE(ksphy_driver));
   }

- /*
-  * angelo, temporary patch
-  */
- struct phy_driver *micrel_pdrv = &ksphy_driver[5];
-
   module_init(ksphy_init);
   module_exit(ksphy_exit);



Regards, angelo

^ permalink raw reply

* Re: [PATCH] sfc: efx: add support for skb->xmit_more
From: Daniel Borkmann @ 2014-10-13 19:18 UTC (permalink / raw)
  To: Edward Cree; +Cc: davem, nikolay, netdev, Shradha Shah, linux-net-drivers
In-Reply-To: <543C13AE.80606@solarflare.com>

On 10/13/2014 08:02 PM, Edward Cree wrote:
...
> Until these issues are addressed, I have to NAK this.  Tomorrow I'll try
> to rustle up an rfc patch that handles these issues, unless someone
> beats me to it.

Thanks for your feedback! Yes, please feel free to take this onwards and
integrate/adapt it to your needs for sfc's xmit_more support.

Thanks a lot,
Daniel

^ permalink raw reply

* Re: IPv6 FIB related crash with MACVLANs in 3.9.11+ kernel.
From: Ben Greear @ 2014-10-13 18:06 UTC (permalink / raw)
  To: Vladislav Yasevich; +Cc: Cong Wang, Hannes Frederic Sowa, Hongmei Li, netdev
In-Reply-To: <CAGCdqXHh6AvabW2JwQEK5NcnTuxw9eMdRwd+JvYZUYqyHW4K7Q@mail.gmail.com>

On 10/12/2014 04:42 AM, Vladislav Yasevich wrote:
> On Mon, Sep 29, 2014 at 5:24 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 09/29/2014 01:39 PM, Cong Wang wrote:
>>> On Mon, Sep 29, 2014 at 12:48 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>>
>>>> We are going to be running up to 20 concurrent scripts that
>>>> will be dumping routes & ips, and configuring ip addresses
>>>> and routes during bringup.
>>>>
>>>> I don't have a simple stand-alone way to reproduce this,
>>>> but at least when we reported the problem is was very easy
>>>> for us to reproduce with our tool.
>>>>
>>>> Maybe 20 scripts running in parallel that randomly configured and dumped routes
>>>> and ip addresses on random interfaces would do the trick?
>>>>
>>>
>>> I think the most important question is why is this related with macvlan?
>>> IPv6 is L3 while macvlan pure L2, if this is a IPv6 routing bug, it should
>>> not be limited to macvlan.
>>
>> We saw it using mac-vlans on top of ixgbe.  Probably it can be reproduced
>> elsewhere, but since we had a good test case, we didn't bother trying lots
>> of other combinations.
>>
>> Just enabling some debug code caused the problem to be much harder
>> to reproduce, so it is probably some sort of race.  Maybe lots of mac-vlans
>> make it easier to hit.
>>
>>> What does your IPv6 routing table look like? And how do you configure those
>>> macvlan interfaces?
>>
>> Each interface would have it's own routing table, with at least as subnet
>> route.  I'm not sure we were adding a default gateway or not.
>>
>> The main routing table would also have some auto-created routes
>> I think.
>>
>> It is configured using 'ip' to add and dump routes.
>>
> 
> Ben
> 
> Just saw this thread.  Can you check if this commit makes a difference:
> commit 40b8fe45d1f094e3babe7b2dc2b71557ab71401d
> Author: Vlad Yasevich <vyasevich@gmail.com>
> Date:   Mon Sep 22 16:34:17 2014 -0400
> 
>     macvtap: Fix race between device delete and open.

We can retest this, but we do not use macvtap, and we have higher-priority
things to test at the moment, so might be a while.

Thanks,
Ben

> 
> Thanks
> -vlad
> 
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear <greearb@candelatech.com>
>> Candela Technologies Inc  http://www.candelatech.com
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH] sfc: efx: add support for skb->xmit_more
From: Edward Cree @ 2014-10-13 18:02 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, nikolay, netdev, Shradha Shah, linux-net-drivers
In-Reply-To: <1413219585-15854-1-git-send-email-dborkman@redhat.com>

On 13/10/14 17:59, Daniel Borkmann wrote:
> Add support for xmit_more batching: efx has BQL support, thus we need
> to only move the queue hang check before the hw tail pointer write
> and test for xmit_more bit resp. whether the queue/stack has stopped.
> Thanks to Nikolay Aleksandrov who had a Solarflare NIC to test!
I see two issues here.

1) The error handling path (labels dma_err: and err:) will unwind back
to tx_queue->insert_count, which (AFAICT) only gets updated by
efx_nic_push_buffers() (at least on EF10, where that calls
efx_ef10_tx_write()).
So if we transmit a bunch of skbs with xmit_more, then get an error, we
will unwind all the way back to the start, whereas (again, AFAICT) the
previous skbs were nicely completed and safe to send.
The unwind will leave us in a good state, but we'll have failed to send
some packets that there was nothing wrong with.
I think the appropriate fix for this would be to maintain two separate
insert_counts, as xmit_more means we can't conflate "last write pointer
pushed" and "write pointer at start of this skb" any longer.

2) I think we shouldn't do PIO when xmit_more is set - it's probably bad
for performance (though I haven't measured), and I'm not entirely
convinced it will behave correctly.  I think
efx_nic_tx_is_empty(tx_queue) will return true if we have xmit_more'd a
PIO packet, enabling us to potentially PIO the next packet as well, thus
overwriting the PIO buffer before the NIC's had a chance to read it.
I'm also unsure about what happens to TX Push (efx_ef10_push_tx_desc),
for similar reasons.
So I think TX PIO and TX Push both need to be suppressed when
skb->xmit_more (as a correctness issue), and should also be suppressed
on the !xmit_more skb that 'uncorks' a previous row of xmit_mores - as a
performance issue at the least, and for TX Push I'm also unsure about
the correctness (though I haven't worked that one through in detail).

Until these issues are addressed, I have to NAK this.  Tomorrow I'll try
to rustle up an rfc patch that handles these issues, unless someone
beats me to it.

-Edward

> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
> Cc: Shradha Shah <sshah@solarflare.com>
> ---
>  drivers/net/ethernet/sfc/tx.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
> index 3206098..566432c 100644
> --- a/drivers/net/ethernet/sfc/tx.c
> +++ b/drivers/net/ethernet/sfc/tx.c
> @@ -439,13 +439,13 @@ finish_packet:
>  
>  	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
>  
> +	efx_tx_maybe_stop_queue(tx_queue);
> +
>  	/* Pass off to hardware */
> -	efx_nic_push_buffers(tx_queue);
> +	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
> +		efx_nic_push_buffers(tx_queue);
>  
>  	tx_queue->tx_packets++;
> -
> -	efx_tx_maybe_stop_queue(tx_queue);
> -
>  	return NETDEV_TX_OK;
>  
>   dma_err:
> @@ -1308,11 +1308,12 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
>  
>  	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
>  
> -	/* Pass off to hardware */
> -	efx_nic_push_buffers(tx_queue);
> -
>  	efx_tx_maybe_stop_queue(tx_queue);
>  
> +	/* Pass off to hardware */
> +	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
> +		efx_nic_push_buffers(tx_queue);
> +
>  	tx_queue->tso_bursts++;
>  	return NETDEV_TX_OK;
>  

^ permalink raw reply

* Re: [PATCH net-next] tg3: Add skb->xmit_more support
From: Florian Westphal @ 2014-10-13 18:01 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Prashant Sreedharan, davem, netdev, dborkman, mchan
In-Reply-To: <20141013173806.GB26105@breakpoint.cc>

Florian Westphal <fw@strlen.de> wrote:
> Prashant Sreedharan <prashant@broadcom.com> wrote:
> > Ring TX doorbell only if xmit_more is not set or the queue is stopped.
> > 
> > Suggested-by: Daniel Borkmann <dborkman@redhat.com>
> > Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
> > Signed-off-by: Michael Chan <mchan@broadcom.com>
> > ---
> >  drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
> >  1 files changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> > index ba49948..dbb41c1 100644
> > --- a/drivers/net/ethernet/broadcom/tg3.c
> > +++ b/drivers/net/ethernet/broadcom/tg3.c
> > @@ -8099,9 +8099,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	/* Sync BD data before updating mailbox */
> >  	wmb();
> >  
> > -	/* Packets are ready, update Tx producer idx local and on card. */
> > -	tw32_tx_mbox(tnapi->prodmbox, entry);
> > -
> >  	tnapi->tx_prod = entry;
> >  	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
> >  		netif_tx_stop_queue(txq);
> > @@ -8116,7 +8113,12 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  			netif_tx_wake_queue(txq);
> >  	}
> >  
> > -	mmiowb();
> > +	if (!skb->xmit_more || netif_xmit_stopped(txq)) {
> > +		/* Packets are ready, update Tx producer idx on card. */
> > +		tw32_tx_mbox(tnapi->prodmbox, entry);
> 
> I think you need to swap the test, i.e.

Never mind, sorry for the noise.

^ permalink raw reply

* Re: [2/2] dsa: Replace mii_bus with a generic host device
From: Guenter Roeck @ 2014-10-13 17:51 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, f.fainelli, kernel, davem
In-Reply-To: <20140915170024.1261.97226.stgit@ahduyck-bv4.jf.intel.com>

On Mon, Sep 15, 2014 at 01:00:27PM -0400, Alexander Duyck wrote:
> This change makes it so that instead of passing and storing a mii_bus we
> instead pass and store a host_dev.  From there we can test to determine the
> exact type of device, and can verify it is the correct device for our switch.
> 
> So for example it would be possible to pass a device pointer from a pci_dev
> and instead of checking for a PHY ID we could check for a vendor and/or device
> ID.
> 

Hi Alexander,

We are thinking about porting a PCIe based switch to use the DSA infrastructure,
so this patch is coming quite handy.

Have you been working on a port or a PCIe based chip to DSA yourself, by any
chance ? If yes, could you possibly share the current state ?
That might save us some time.

Thanks,
Guenter

^ permalink raw reply

* Re: [PATCH stable v3.2 v3.4] ipv4: disable bh while doing route gc
From: David Miller @ 2014-10-13 17:51 UTC (permalink / raw)
  To: mleitner; +Cc: stable, netdev, hannes
In-Reply-To: <6c3d6eca5d6a15c01393b010f2116bd169477c5a.1413215324.git.mleitner@redhat.com>

From: Marcelo Ricardo Leitner <mleitner@redhat.com>
Date: Mon, 13 Oct 2014 14:03:30 -0300

> Further tests revealed that after moving the garbage collector to a work
> queue and protecting it with a spinlock may leave the system prone to
> soft lockups if bottom half gets very busy.
> 
> It was reproced with a set of firewall rules that REJECTed packets. If
> the NIC bottom half handler ends up running on the same CPU that is
> running the garbage collector on a very large cache, the garbage
> collector will not be able to do its job due to the amount of work
> needed for handling the REJECTs and also won't reschedule.
> 
> The fix is to disable bottom half during the garbage collecting, as it
> already was in the first place (most calls to it came from softirqs).
> 
> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Acked-by: David S. Miller <davem@davemloft.net>
> Cc: stable@vger.kernel.org

-stable folks, please integrate this directly, thanks!

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Tom Herbert @ 2014-10-13 17:43 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, Alexander Duyck, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, John Fastabend, Jamal Hadi Salim,
	Daniel Borkmann, Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, David Miller
In-Reply-To: <CAA93jw5MKVaNaObVD3Lm=DP0ie6HcjwOvt0X-whDLJD_6hF5bw@mail.gmail.com>

When TSO/GSO is enabled with BQL we tend to see large limits than if
they are disabled. This is due to the fact that we treat gso packets
as a single packet although way to queuing in device. BQL limit is
nominally 2*N+1 MSS where N is minimal number of bytes to keep queue
full. In gso MSS is up to 64K, so a limit of 192K is common (without
BQL I see limit of 30K).

For GSO, it seems like we can split larger segments. For instance if
in a bulk dequeue we need 30K but have 64K next in qdisc, maybe we can
split to do GSO on first 30K of segment, and requeue other 34K to the
qdisc.

With TSO we might do something similar, but probably harder to get
granularity since TX completions are only done on TSO packets (might
be interesting if a device could report partial completions somehow).


On Mon, Oct 13, 2014 at 10:20 AM, Dave Taht <dave.taht@gmail.com> wrote:
> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>>>         When I first got cc'd on these threads, and saw
>>>         netperf-wrapper being
>>>         used on it,
>>>         I thought: "Oh god, I've created a monster.". My intent with
>>>         helping create
>>>         such a measurement tool was not to routinely drive a network
>>>         to saturation
>>>         but to be able to measure the impact on latency of doing so.
>>>
>>>         I was trying get reasonable behavior when a "router" went into
>>>         overload.
>>>
>>>         Servers, on the other hand, have more options to avoid
>>>         overload than
>>>         routers do. There's been a great deal of really nice work on
>>>         that
>>>         front. I love all that.
>>>
>>>         and I like BQL because it provides enough backpressure to be
>>>         able to
>>>         do smarter things about scheduling packets higher in the
>>>         stack. (life
>>>         pre-BQL cost some hair)
>>>
>>>         But tom once told me me "BQL's objective is to keep the
>>>         hardware busy".
>>>         It uses an MIAD controller instead of a more sane AIMD one, in
>>>         particular,
>>>         I'd much rather it ramped down to smaller values after
>>>         absorbing a burst.
>>>
>>>         My objective is always to keep the *network's behavior
>>>         optimal*,
>>>         minimizing bursts, and subsequent tail loss on the other side,
>>>         and responding quickly to loss, and doing that by
>>>         preserving to the highest extent possible the ack clocking
>>>         that a
>>>         fluid model has. I LOVE BQL for providing more backpressure
>>>         than has
>>>         ever existed before, and I know it's incredibly difficult to
>>>         have fluid models
>>>         in a conventional cpu architecture that has to do other stuff.
>>>
>>>         But in order to get the best results for network behavior I'm
>>>         willing to
>>>         sacrifice a great deal of cpu, interrupts, whatever it takes!
>>>         to get the
>>>         most packets to all the destinations specified, whatever the
>>>         workload,
>>>         with the *minimum amount of latency between ack and reply*
>>>         possible.
>>>
>>>         What I'd hoped for in the new bulking and rcu stuff was to be
>>>         able to
>>>         see a net reduction in TSO/GSO Size, and/or BQL's size, and I
>>>         also did
>>>         keep hoping for some profiles on sch_fq, and for more complex
>>>         benchmarking of dozens or hundreds of realistically sized TCP
>>>         flows
>>>         (in both directions) to exercise it all.
>>>
>>>         Some of the data presented showed that a single BQL'd queue
>>>         was >400K,
>>>         and with hardware multi-queue, 128K, when TSO and GSO were
>>>         used, but
>>>         with hardware multi-queue and no TSO/GSO, BQL was closer to
>>>         30K.
>>>
>>>         This said to me that the maximum "right" size for a TSO/GSO
>>>         "packet" was
>>>         closer to 12k in this environment, and the right size for BQL,
>>>         30k,
>>>         before it started exerting backpressure to the qdisc.
>>>
>>>         This would reduce the potential inter-flow network latency by
>>>         a factor
>>>         of 10 on the single hw queue scenario, and 4 in the multi
>>>         queue one.
>>>
>>>         It would probably cost some interrupts, and in scenarios
>>>         lacking
>>>         packet loss, throughput, but in other scenarios with lots of
>>>         flows
>>>         each flow will ramp up in speed, faster, as you reduce the
>>>         RTTs.
>>>         Paying attention to this will also push profiling activities
>>>         into
>>>         areas of the stack that might be profitable.
>>>
>>>         I would very much like to have profiles of happens now both
>>>         here and
>>>         elsewhere in the stack with this new code with TSO/GSO sizes
>>>         capped
>>>         thusly and BQL capped to 30k, and a smarter qdisc like fq
>>>         used.
>>>
>>>         2) Most of the time, a server is not driving the wire to
>>>         saturation. If
>>>            it is, you are doing something wrong. The BQL queues are
>>>         empty, or
>>>            nearly so, so the instant someone creates a qdisc queue, it
>>>            drains.
>>>
>>>         But: if there are two or more flows under contention, creating
>>>         a qdisc queue
>>>             better multiplexing the results is highly desirable, and
>>>         the stack
>>>            should be smart enough to make that overload only last
>>>         briefly.
>>>
>>>            This is part of why I'm unfond of the deep and persistent
>>>         BQL queues as we
>>>         get today.
>>>
>>>         3) Pure ack-only workloads are rare. It is a useful test case,
>>>         but...
>>>
>>>         4) I thought the ring-cleanup optimization was rather
>>>         interesting and
>>>            could be made more dynamic.
>>>
>>>         5) I remain amazed at the vast improvements in throughput,
>>>         reductions in
>>>         interrupts, lockless operation and the RCU stuff that have
>>>         come out of
>>>         this so far, but had to make these points in the hope that the
>>>         big picture
>>>         is retained.
>>>
>>>         It does no good to blast packets through the network unless
>>>         there is a
>>>         high probability that they will actually be received on the
>>>         other side.
>>>
>>>         thanks for listening.
>>
>> Dave
>>
>> I am kind of surprised you wrote this nonsense.
>
> Sorry, it was definately pre-coffee! I get twitchy when all the joy
> seems to be in spewing packets at high rates rather than optimizing
> for low RTTs in packet paired flow behavior.
>
There are multiple dimensions we are trying to optimize for. Bulk
dequeue should not adversely affect latency, we are merely doing work
in one operation previously done in several.

The lack of granularity in GSO segments might be something that could
be address though. When TSO/GSO is enabled with BQL we tend to see
large limits than if they are disabled. This is due to the fact that
we treat gso packets as a single packet all the way to queuing in
device. BQL limit is nominally 2*N+1 MSS where N is minimal number of
bytes to keep queue full. In gso MSS is up to 64K, so a limit of 192K
is common (without BQL I see limit of 30K).

For GSO, it seems like we can split larger segments. For instance if
in a bulk dequeue we need 30K but have 64K segment next in qdisc,
maybe we can split to do GSO on first 30K of segment, and requeue
other 34K to the qdisc.

With TSO we might do something similar, but probably harder to get
granularity since TX completions are only done on TSO packets (might
be interesting if a device could report partial completions somehow).

>> Being able to send data at full speed has nothing to do about how
>> packets are scheduled. You are concerned about packet scheduling, and
>> not about how fast we can send raw data on a 40Gb NIC.
>
> I would like to also get better behavior out of gigE and below, and for
> these changes to not impact the downstream behavior of the network
> overall.
>
> To give you an example, I would like to see the tcp flows in the
> 2nd chart here, to converge faster than the 5 seconds they currently
> take at GigE speeds.
>
> http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html
>
>> We made all these changes so that we can spend cpu cycles at the right
>> place.
>
> I grok that.
>
>> There are reasons we have fq_codel, and fq. Do not forget this, please.
>
> Which is why I was hoping to see profiles along the way that showed where
> else locks were being taken, what those cpu cycles were like, etc, and what were
> those hotspots, when a smarter qdisc was engaged.
>
>
> --
> Dave Täht
>
> https://www.bufferbloat.net/projects/make-wifi-fast

^ permalink raw reply

* Re: [PATCH net-next] tg3: Add skb->xmit_more support
From: Florian Westphal @ 2014-10-13 17:38 UTC (permalink / raw)
  To: Prashant Sreedharan; +Cc: davem, netdev, dborkman, mchan
In-Reply-To: <1413217302-15396-1-git-send-email-prashant@broadcom.com>

Prashant Sreedharan <prashant@broadcom.com> wrote:
> Ring TX doorbell only if xmit_more is not set or the queue is stopped.
> 
> Suggested-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index ba49948..dbb41c1 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -8099,9 +8099,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	/* Sync BD data before updating mailbox */
>  	wmb();
>  
> -	/* Packets are ready, update Tx producer idx local and on card. */
> -	tw32_tx_mbox(tnapi->prodmbox, entry);
> -
>  	tnapi->tx_prod = entry;
>  	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
>  		netif_tx_stop_queue(txq);
> @@ -8116,7 +8113,12 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  			netif_tx_wake_queue(txq);
>  	}
>  
> -	mmiowb();
> +	if (!skb->xmit_more || netif_xmit_stopped(txq)) {
> +		/* Packets are ready, update Tx producer idx on card. */
> +		tw32_tx_mbox(tnapi->prodmbox, entry);

I think you need to swap the test, i.e.

netif_xmit_stopped() || xmit_more

Otherwise, hw may not be aware of further packets wainting
and queue might not be restarted?

^ permalink raw reply

* Re: [PATCH v2 net] x86: bpf_jit: fix two bugs in eBPF JIT compiler
From: Darrick J. Wong @ 2014-10-13 17:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, netdev, linux-kernel
In-Reply-To: <1412998223-3805-1-git-send-email-ast@plumgrid.com>

On Fri, Oct 10, 2014 at 08:30:23PM -0700, Alexei Starovoitov wrote:
> 1.
> JIT compiler using multi-pass approach to converge to final image size,
> since x86 instructions are variable length. It starts with large
> gaps between instructions (so some jumps may use imm32 instead of imm8)
> and iterates until total program size is the same as in previous pass.
> This algorithm works only if program size is strictly decreasing.
> Programs that use LD_ABS insn need additional code in prologue, but it
> was not emitted during 1st pass, so there was a chance that 2nd pass would
> adjust imm32->imm8 jump offsets to the same number of bytes as increase in
> prologue, which may cause algorithm to erroneously decide that size converged.
> Fix it by always emitting largest prologue in the first pass which
> is detected by oldproglen==0 check.
> Also change error check condition 'proglen != oldproglen' to fail gracefully.
> 
> 2.
> while staring at the code realized that 64-byte buffer may not be enough
> when 1st insn is large, so increase it to 128 to avoid buffer overflow
> (theoretical maximum size of prologue+div is 109) and add runtime check.
> 
> Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT")
> Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

This fixes the crash, thank you!

Tested-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> v1->v2: reduce chances of stack corruption in case of future bugs (suggested by Eric)
> 
> note in classic BPF programs 1st insn is always short move, but native eBPF
> programs may trigger buffer overflow. I couldn't force the crash with overflow,
> since there are no further calls while this part of stack is used.
> Both are ugly bugs regardless.
> When net-next opens I will add narrowed down testcase from 'nmap' to testsuite.
> 
>  arch/x86/net/bpf_jit_comp.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index d56cd1f..3f62734 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -182,12 +182,17 @@ struct jit_context {
>  	bool seen_ld_abs;
>  };
>  
> +/* maximum number of bytes emitted while JITing one eBPF insn */
> +#define BPF_MAX_INSN_SIZE	128
> +#define BPF_INSN_SAFETY		64
> +
>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  		  int oldproglen, struct jit_context *ctx)
>  {
>  	struct bpf_insn *insn = bpf_prog->insnsi;
>  	int insn_cnt = bpf_prog->len;
> -	u8 temp[64];
> +	bool seen_ld_abs = ctx->seen_ld_abs | (oldproglen == 0);
> +	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
>  	int i;
>  	int proglen = 0;
>  	u8 *prog = temp;
> @@ -225,7 +230,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  	EMIT2(0x31, 0xc0); /* xor eax, eax */
>  	EMIT3(0x4D, 0x31, 0xED); /* xor r13, r13 */
>  
> -	if (ctx->seen_ld_abs) {
> +	if (seen_ld_abs) {
>  		/* r9d : skb->len - skb->data_len (headlen)
>  		 * r10 : skb->data
>  		 */
> @@ -685,7 +690,7 @@ xadd:			if (is_imm8(insn->off))
>  		case BPF_JMP | BPF_CALL:
>  			func = (u8 *) __bpf_call_base + imm32;
>  			jmp_offset = func - (image + addrs[i]);
> -			if (ctx->seen_ld_abs) {
> +			if (seen_ld_abs) {
>  				EMIT2(0x41, 0x52); /* push %r10 */
>  				EMIT2(0x41, 0x51); /* push %r9 */
>  				/* need to adjust jmp offset, since
> @@ -699,7 +704,7 @@ xadd:			if (is_imm8(insn->off))
>  				return -EINVAL;
>  			}
>  			EMIT1_off32(0xE8, jmp_offset);
> -			if (ctx->seen_ld_abs) {
> +			if (seen_ld_abs) {
>  				EMIT2(0x41, 0x59); /* pop %r9 */
>  				EMIT2(0x41, 0x5A); /* pop %r10 */
>  			}
> @@ -804,7 +809,8 @@ emit_jmp:
>  			goto common_load;
>  		case BPF_LD | BPF_ABS | BPF_W:
>  			func = CHOOSE_LOAD_FUNC(imm32, sk_load_word);
> -common_load:		ctx->seen_ld_abs = true;
> +common_load:
> +			ctx->seen_ld_abs = seen_ld_abs = true;
>  			jmp_offset = func - (image + addrs[i]);
>  			if (!func || !is_simm32(jmp_offset)) {
>  				pr_err("unsupported bpf func %d addr %p image %p\n",
> @@ -878,6 +884,11 @@ common_load:		ctx->seen_ld_abs = true;
>  		}
>  
>  		ilen = prog - temp;
> +		if (ilen > BPF_MAX_INSN_SIZE) {
> +			pr_err("bpf_jit_compile fatal insn size error\n");
> +			return -EFAULT;
> +		}
> +
>  		if (image) {
>  			if (unlikely(proglen + ilen > oldproglen)) {
>  				pr_err("bpf_jit_compile fatal error\n");
> @@ -934,9 +945,11 @@ void bpf_int_jit_compile(struct bpf_prog *prog)
>  			goto out;
>  		}
>  		if (image) {
> -			if (proglen != oldproglen)
> +			if (proglen != oldproglen) {
>  				pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
>  				       proglen, oldproglen);
> +				goto out;
> +			}
>  			break;
>  		}
>  		if (proglen == oldproglen) {
> -- 
> 1.7.9.5
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox