Netdev List
 help / color / mirror / Atom feed
* [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: Zhu Yanjun @ 2014-12-15  8:39 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418632754-16698-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit b7d6e335
[e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked]
introduces for newer kernels.

When PHY reset is intentionally blocked on 82577/8/9, do not toggle the
LANPHYPC value bit (essentially performing a hard power reset of the
device) otherwise the PHY can be put into an unknown state.

Cleanup whitespace in the same function.

[yanjun.zhu: whitespace remains unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index c4b2d15..8c7e4aa 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,7 +280,8 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
-	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID) && 
+		!e1000_check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1

^ permalink raw reply related

* [PATCH 2/5] e1000e: workaround EEPROM configuration change on 82579 on kernel 2.6.x
From: Zhu Yanjun @ 2014-12-15  8:39 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418632754-16698-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 62f1d8d1
[e1000e: workaround EEPROM configuration change on 82579] introduces
for newer kernels.

An update to the EEPROM on 82579 will extend a delay in hardware to fix an
issue with WoL not working after a G3->S5 transition which is unrelated to
the driver.  However, this extended delay conflicts with nominal operation
of the device when it is initialized by the driver and after every reset
of the hardware (i.e. the driver starts configuring the device before the
hardware is done with it's own configuration work).  The workaround for
when the driver is in control of the device is to tell the hardware after
every reset the configuration delay should be the original shorter one.

Some pre-existing variables are renamed generically to be re-used with
new register accesses.

[e1000_toggle_lanphypc_value_ich8lan does not exist. Its implementations
exist in e1000_init_phy_params_pchlan. Renamed variables remain unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/hw.h      |  1 +
 drivers/net/e1000e/ich8lan.c | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/e1000e/hw.h b/drivers/net/e1000e/hw.h
index 11f3b7c..b055d78 100644
--- a/drivers/net/e1000e/hw.h
+++ b/drivers/net/e1000e/hw.h
@@ -60,6 +60,7 @@ enum e1e_registers {
 	E1000_FEXTNVM  = 0x00028, /* Future Extended NVM - RW */
 	E1000_FCT      = 0x00030, /* Flow Control Type - RW */
 	E1000_VET      = 0x00038, /* VLAN Ether Type - RW */
+	E1000_FEXTNVM3 = 0x0003C, /* Future Extended NVM 3 - RW */
 	E1000_ICR      = 0x000C0, /* Interrupt Cause Read - R/clr */
 	E1000_ITR      = 0x000C4, /* Interrupt Throttling Rate - RW */
 	E1000_ICS      = 0x000C8, /* Interrupt Cause Set - WO */
diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index 020657c..c4b2d15 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -108,6 +108,9 @@
 #define E1000_FEXTNVM_SW_CONFIG		1
 #define E1000_FEXTNVM_SW_CONFIG_ICH8M (1 << 27) /* Bit redefined for ICH8M :/ */
 
+#define E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK    0x0C000000
+#define E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC  0x08000000
+
 #define PCIE_ICH8_SNOOP_ALL		PCIE_NO_SNOOP_ALL
 
 #define E1000_ICH_RAR_ENTRIES		7
@@ -278,6 +281,12 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
 	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+		/*Set Phy Config Counter to 50msec */
+		ctrl = er32(FEXTNVM3);
+		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
+		ctrl |= E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC;
+		ew32(FEXTNVM3, ctrl);
+
 		/*
 		 * The MAC-PHY interconnect may still be in SMBus mode
 		 * after Sx->S0.  Toggle the LANPHYPC Value bit to force
@@ -2685,6 +2694,14 @@ static s32 e1000_reset_hw_ich8lan(struct e1000_hw *hw)
 	ew32(CTRL, (ctrl | E1000_CTRL_RST));
 	msleep(20);
 
+	/* Set Phy Config Counter to 50msec */
+	if (hw->mac.type == e1000_pch2lan) {
+		u32 phycc_reg = er32(FEXTNVM3);
+		phycc_reg &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
+		phycc_reg |= E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC;
+		ew32(FEXTNVM3, phycc_reg);
+	}
+
 	if (!ret_val)
 		e1000_release_swflag_ich8lan(hw);
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: Zhu Yanjun @ 2014-12-15  8:39 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000
  Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher, David S. Miller
In-Reply-To: <1418632754-16698-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a similar logic change as commit 901b2b95
[e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
for newer kernels.

During Sx->S0 transitions, the interconnect between the MAC and PHY on
82577/82578 can remain in SMBus mode instead of transitioning to the
PCIe-like mode required during normal operation.  Toggling the LANPHYPC
Value bit essentially resets the interconnect forcing it to the correct
mode.

after review of all intel drivers, found several instances where
drivers had the incorrect pattern of:
memory mapped write();
delay();

which should always be:
memory mapped write();
write flush(); /* aka memory mapped read */
delay();

explanation:
The reason for including the flush is that writes can be held
(posted) in PCI/PCIe bridges, but the read always has to complete
synchronously and therefore has to flush all pending writes to a
device.  If a write is held and followed by a delay, the delay
means nothing because the write may not have reached hardware
(maybe even not until the next read)

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/defines.h |  2 ++
 drivers/net/e1000e/ich8lan.c | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/e1000e/defines.h b/drivers/net/e1000e/defines.h
index 1190167..52283a6 100644
--- a/drivers/net/e1000e/defines.h
+++ b/drivers/net/e1000e/defines.h
@@ -214,6 +214,8 @@
 #define E1000_CTRL_SPD_1000 0x00000200  /* Force 1Gb */
 #define E1000_CTRL_FRCSPD   0x00000800  /* Force Speed */
 #define E1000_CTRL_FRCDPX   0x00001000  /* Force Duplex */
+#define E1000_CTRL_LANPHYPC_OVERRIDE 0x00010000 /* SW control of LANPHYPC */
+#define E1000_CTRL_LANPHYPC_VALUE    0x00020000 /* SW value of LANPHYPC */
 #define E1000_CTRL_SWDPIN0  0x00040000  /* SWDPIN 0 value */
 #define E1000_CTRL_SWDPIN1  0x00080000  /* SWDPIN 1 value */
 #define E1000_CTRL_SWDPIO0  0x00400000  /* SWDPIN 0 Input or output */
diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index de39f9a..020657c 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -88,6 +88,8 @@
 
 
 #define E1000_ICH_FWSM_RSPCIPHY	0x00000040 /* Reset PHY on PCI Reset */
+/* FW established a valid mode */ 
+#define E1000_ICH_FWSM_FW_VALID                0x00008000
 
 #define E1000_ICH_MNG_IAMT_MODE		0x2
 
@@ -260,6 +262,7 @@ static inline void __ew32flash(struct e1000_hw *hw, unsigned long reg, u32 val)
 static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 {
 	struct e1000_phy_info *phy = &hw->phy;
+	u32 ctrl;
 	s32 ret_val = 0;
 
 	phy->addr                     = 1;
@@ -274,6 +277,23 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
+	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+		/*
+		 * The MAC-PHY interconnect may still be in SMBus mode
+		 * after Sx->S0.  Toggle the LANPHYPC Value bit to force
+		 * the interconnect to PCIe mode, but only if there is no
+		 * firmware present otherwise firmware will have done it.
+		*/
+		ctrl = er32(CTRL);
+		ctrl |=  E1000_CTRL_LANPHYPC_OVERRIDE;
+		ctrl &= ~E1000_CTRL_LANPHYPC_VALUE;
+		ew32(CTRL, ctrl);
+		e1e_flush();
+		udelay(10);
+		ctrl &= ~E1000_CTRL_LANPHYPC_OVERRIDE;
+		ew32(CTRL, ctrl);
+		msleep(50);
+	}
 	/*
 	 * Reset the PHY before any acccess to it.  Doing so, ensures that
 	 * the PHY is in a known good state before we read/write PHY registers.
-- 
1.9.1

^ permalink raw reply related

* [PATCH 0/5] e1000e: fix nic not boot after rebooting
From: Zhu Yanjun @ 2014-12-15  8:39 UTC (permalink / raw)
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun

With kernel 2.6.x, e1000e with 82577/8/9 sometimes will not boot
after rebooting. 

If a kernel 2.6.x board with 82577/8/9 e1000e nic is rebooted for 
100 times, there are 7~8 times that 82577/8/9 e1000e nic will not boot 
normally.

Zhu Yanjun (5):
  e1000e: reset MAC-PHY interconnect on 82577/82578
  e1000e: workaround EEPROM configuration change on 82579 on kernel
    2.6.x
  e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
  e1000e: update workaround for 82579 intermittently disabled during
    S0->Sx
  e1000e: cleanup use of check_reset_block function pointer

 drivers/net/e1000e/defines.h |  2 ++
 drivers/net/e1000e/hw.h      |  1 +
 drivers/net/e1000e/ich8lan.c | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

-- 
1.9.1

^ permalink raw reply

* Re: Multicast packets being lost (3.10 stable)
From: Simon Horman @ 2014-12-15  8:04 UTC (permalink / raw)
  To: David Miller
  Cc: linus.luessing, shemming, netdev, bridge, gregkh, openwrt-devel
In-Reply-To: <20141213.153741.1636406298389279104.davem@davemloft.net>

On Sat, Dec 13, 2014 at 03:37:41PM -0500, David Miller wrote:
> From: Linus Lüssing <linus.luessing@c0d3.blue>
> Date: Wed, 10 Dec 2014 20:16:33 +0100
> 
> > did you have a chance to look into backporting these fixes for
> > stable yet?
> 
> I am not submitting -stable fixes back to 3.10 any longer, at most
> I am doing 4 -stable releases and right now that is 3.18, 3.17,
> v3.14, and v3.12

Hi Dave,

is there a method for people to get networking -stable fixes into older
(longterm) stable releases? Would it be best if people submitted them to
-stable themselves?

^ permalink raw reply

* [PATCH net, v2] gre: fix the inner mac header in nbma tunnel xmit path
From: Timo Teräs @ 2014-12-15  7:24 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Timo Teräs, Tom Herbert, Alexander Duyck
In-Reply-To: <20141211.150706.1736198745550279282.davem@davemloft.net>

The NBMA GRE tunnels temporarily push GRE header that contain the
per-packet NBMA destination on the skb via header ops early in xmit
path. It is the later pulled before the real GRE header is constructed.

The inner mac was thus set differently in nbma case: the GRE header
has been pushed by neighbor layer, and mac header points to beginning
of the temporary gre header (set by dev_queue_xmit).

Now that the offloads expect mac header to point to the gre payload,
fix the xmit patch to:
 - pull first the temporary gre header away
 - and reset mac header to point to gre payload

This fixes tso to work again with nbma tunnels.

Fixes: 14051f0452a2 ("gre: Use inner mac length when computing tunnel length")
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Cc: Tom Herbert <therbert@google.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
---
Fixed the stupid typo of skb_reset_mac_header()'s argument. And yes,
I did recompile, retest and verify that this works.

Comments from v1 apply:

Though, normally mac header does point to the begging of the hardware header.
E.g. in ethernet case it's pointing to the ethernet header. But now in gre
case it's instead pointing to the payload which seems counter-intuitive to me.
But I guess tunnels are a bit of special case, and there's valid reasons to
have it point to tunnel payload too.

Applying this patch on top of the Tom's previous fix of 14051f0452a2 seems to
now make my dmvpn scenario work again. So this should go to -stable too
(atleast 3.14).

 net/ipv4/ip_gre.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 94213c8..b40b90d 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -250,10 +250,6 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb,
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 	const struct iphdr *tnl_params;
 
-	skb = gre_handle_offloads(skb, !!(tunnel->parms.o_flags&TUNNEL_CSUM));
-	if (IS_ERR(skb))
-		goto out;
-
 	if (dev->header_ops) {
 		/* Need space for new headers */
 		if (skb_cow_head(skb, dev->needed_headroom -
@@ -266,6 +262,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb,
 		 * to gre header.
 		 */
 		skb_pull(skb, tunnel->hlen + sizeof(struct iphdr));
+		skb_reset_mac_header(skb);
 	} else {
 		if (skb_cow_head(skb, dev->needed_headroom))
 			goto free_skb;
@@ -273,6 +270,10 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb,
 		tnl_params = &tunnel->parms.iph;
 	}
 
+	skb = gre_handle_offloads(skb, !!(tunnel->parms.o_flags&TUNNEL_CSUM));
+	if (IS_ERR(skb))
+		goto out;
+
 	__gre_xmit(skb, dev, tnl_params, skb->protocol);
 
 	return NETDEV_TX_OK;
-- 
2.2.0

^ permalink raw reply related

* [PATCH net-next] fib_trie.txt: fix typo
From: Duan Jiong @ 2014-12-15  6:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


Fix the typo, there should be "It".
On the other hand, fix whitespace errors detected by checkpatch.pl

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
---
 Documentation/networking/fib_trie.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt
index 0723db7..fe71938 100644
--- a/Documentation/networking/fib_trie.txt
+++ b/Documentation/networking/fib_trie.txt
@@ -73,8 +73,8 @@ trie_leaf_remove()
 
 trie_rebalance()
 	The key function for the dynamic trie after any change in the trie
-	it is run to optimize and reorganize. Tt will walk the trie upwards 
-	towards the root from a given tnode, doing a resize() at each step 
+	it is run to optimize and reorganize. It will walk the trie upwards
+	towards the root from a given tnode, doing a resize() at each step
 	to implement level compression.
 
 resize()
-- 
1.8.3.1

^ permalink raw reply related

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Alexei Starovoitov @ 2014-12-15  6:55 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team

On Sun, Dec 14, 2014 at 5:56 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> Hi,
>
> We have been using the kernel ftrace infra to collect TCP per-flow statistics.
> The following patch set is a first slim-down version of our
> existing implementation. We would like to get some early feedback
> and make it useful for others.
>
> [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs:
> Defines some basic tracepoints (by TRACE_EVENT).
>
> [RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints:
> A sample perf script with simple ip/port filtering and summary output.
>
> [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer:
> Declares a few more tracepoints (by DECLARE_TRACE) which are
> used by the tcp_tracer.  The tcp_tracer is in the patch 5/5.
>
> [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs:
> Defines a few tcp_trace structs which are used to collect statistics
> on each tcp_sock.
>
> [RFC PATCH net-next 5/5] tcp: Add TCP tracer:
> It introduces a tcp_tracer which hooks onto the tracepoints defined in the
> patch 1/5 and 3/5.  It collects data defined in patch 4/5. We currently
> use this tracer to collect per-flow statistics.  The commit log has
> some more details.

I think patches 1 and 3 are good additions, since they establish
few permanent points of instrumentation in tcp stack.
Patches 4-5 look more like use cases of tracepoints established
before. They may feel like simple additions and, no doubt,
they are useful, but since they expose things via tracing
infra they become part of api and cannot be changed later,
when more stats would be needed.
I think systemtap like scripting on top of patches 1 and 3
should solve your use case ?
Also, have you looked at recent eBPF work?
Though it's not completely ready yet, soon it should
be able to do the same stats collection as you have
in 4/5 without adding permanent pieces to the kernel.

^ permalink raw reply

* Potential bugs found in e100
From: Jia-Ju Bai @ 2014-12-15  3:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <001101d01809$b58ccbe0$20a663a0$@163.com>

Recently I test linux device drivers in Linux 3.17.2, and find some
potential bugs. 

e100 driver:
The target file is drivers/net/ethernet/intel/e100.c, which is used to build
e100.ko. I hope you can help me check my findings:
[1] The function pci_pool_create is called by e100_probe when initializing
the ethernet card driver. But when pci_pool_create is failed, which means
that it returns NULL to nic->cbs_pool, the system crash will happen. Because
pci_pool_alloc (in e100_alloc_cbs in e100_up in e100_open) need to use
nic->cbs_pool to allocate the resource, but it is NULL. I suggest that a
check can be added in the code to detect whether pci_pool_create returns
NULL.
[2] In the normal process, netif_napi_add is called in e100_probe, but
netif_napi_del is not called in e100_remove. However, many other ethernet
card drivers call them in pairs, even in the error handling paths, such as
r8169 and igb.

Could you help me check these findings? Thank you very much, and I'm looking
forward to your reply.

--
Jia-Ju Bai

^ permalink raw reply

* Potential bugs found in e1000/e1000e
From: Jia-Ju Bai @ 2014-12-15  3:23 UTC (permalink / raw)
  To: netdev
In-Reply-To: <001301d0180a$f7cb74b0$e7625e10$@163.com>

Recently I test linux device drivers in Linux 3.17.2, and find some
potential bugs.

e1000 driver:
The target file is drivers/net/ethernet/intel/e1000/e1000_main.c, which is
used to build e1000.ko. I hope you can help me check my findings:
[1] In the normal process, netif_napi_add is called in e1000_probe, but
netif_napi_del is not called in e1000_remove. However, many other ethernet
card drivers call them in pairs, even in the error handling paths, such as
r8169 and igb.

e1000e driver:
The target file is drivers/net/ethernet/intel/e1000e/netdev.c, which is used
to build e1000e.ko. I hope you can help me check my findings:
[1] In the normal process, netif_napi_add is called in e1000_probe, but
netif_napi_del is not called in e1000_remove. However, many other ethernet
card drivers call them in pairs, even in the error handling paths, such as
r8169 and igb.
[2] The function vzalloc is called by e1000e_setup_rx_resources (in
e1000_open) when initializing the ethernet card driver. But when vzalloc is
failed, "err" segment in e1000e_setup_rx_resources is executed to return,
and then e1000e_free_tx_resources in "err_setup_rx" segment in e1000_open is
executed to halt. However, "writel(0, tx_ring->head)" statement in
e1000_clean_tx_ring in e1000e_free_tx_resources will cause system crash,
because "tx_ring->head" is not assigned the value. In the code,
"tx_ring->head" is initialized in e1000_configure_tx in e1000_configure
after the e1000e_setup_rx_resources.
[3] The same system crashes with [2] happens, when kcalloc in
e1000e_setup_rx_resources is failed(returns NULL).
[4] The same system crashes with [2] happens, when e1000_alloc_ring_dma in
e1000e_setup_rx_resources is failed(returns error code).
[5] In the normal process of e1000e, pci_enable_pcie_error_reporting and
pci_disable_pcie_error_reporting is called in pairs in e1000_probe and
e1000_remove. However, when pci_enable_pcie_error_reporting has been called
and pci_save_state in e1000_probe is failed, "err_alloc_etherdev" segment in
e1000_probe is executed immediately to exit, but
pci_disable_pcie_error_reporting is not called.
[6] The same situation with [5] happens when alloc_etherdev_mqs in
e1000_probe is failed.
[7] The same situation with [5] happens when ioremap in e1000_probe is
failed.
[8] The same situation with [5] happens when e1000_sw_init in e1000_probe is
failed.
[9] The same situation with [5] happens when register_netdev in e1000_probe
is failed.
[10] When request_irq in e1000_request_irq is failed, pm_qos_add_request in
e1000_open is called, but pm_qos_remove_request is not called.

Could you help me check these findings? Thank you very much, and I'm looking
forward to your reply.

--
Jia-Ju Bai

^ permalink raw reply

* Potential bugs found in igb
From: Jia-Ju Bai @ 2014-12-15  3:22 UTC (permalink / raw)
  To: netdev
In-Reply-To: <005801d0180e$6a672de0$3f3589a0$@163.com>

Recently I test linux device drivers in Linux 3.17.2, and find some
potential bugs.

igb driver:
[1] In the normal process of igb, pci_enable_pcie_error_reporting and
pci_disable_pcie_error_reporting is called in pairs in igb_probe and
igb_remove. However, when pci_enable_pcie_error_reporting has been called
and alloc_etherdev_mqs in igb_probe is failed, "err_alloc_etherdev" segment
in igb_probe is executed immediately to exit, but
pci_disable_pcie_error_reporting is not called.
[2] The same situation with [1] happens when pci_iomap in igb_probe is
failed.
[3] The same situation with [1] happens when igb_sw_init in igb_probe is
failed.
[4] The same situation with [1] happens when register_netdev in igb_probe is
failed.
[5] The same situation with [1] happens when igb_init_i2c in igb_probe is
failed.
[6] The function kcalloc is called by igb_sw_init when initializing the
ethernet card driver, but kfree is not called when register_netdev in
igb_probe is failed, which may cause memory leak.
[7] The same situation with [6] happens when igb_init_i2c in igb_probe is
failed.
[8] The same situation with [6] happens when kzalloc in igb_alloc_q_vector
is failed.
[9] The same situation with [6] happens when igb_alloc_q_vector in
igb_alloc_q_vectors is failed.
[10] When igb_init_i2c in igb_probe is failed, igb_enable_sriov is called in
igb_probe_vfs, but igb_disable_sriov is not called.
[11] The same situation with [10] happens when register_netdev in igb_probe
is failed.

Could you help me check these findings? Thank you very much, and I'm looking
forward to your reply.

--
Jia-Ju Bai

^ permalink raw reply

* Potential bugs found in via-rhine
From: Jia-Ju Bai @ 2014-12-15  2:46 UTC (permalink / raw)
  To: netdev

Recently I test linux device drivers 3.17.2, and find some potential bugs.
The target file is drivers/net/ethernet/via/via-rhine, which is used to
build via-rhine.ko. I hope you can help me check my findings:

[1] In the normal process, netif_napi_add is called in
rhine_init_one_common, but netif_napi_del is not called in
rhine_remove_one_pci. However, many other ethernet card drivers call them in
pairs, even in the error handling paths, such as r8169 and igb.
[2] In my running test, if dma_mapping_error in alloc_rbufs is failed,
system crash occurs when executing netif_receive_skb in rhine_napipoll, but
I can not find the specific reason.

Could you help me check these findings? Thank you very much, and I'm looking
forward to your reply.

--
Jia-Ju Bai

^ permalink raw reply

* Potential bugs found in 8139too
From: Jia-Ju Bai @ 2014-12-15  2:35 UTC (permalink / raw)
  To: netdev

Recently I test linux device drivers 3.17.2, and find some potential bugs.

The target file is drivers/net/ethernet/realtek/8139too.c, which is used to
build 8139too.ko. I hope you can help me check my findings:
[1] In the normal process of 8139too, netif_napi_add is called in
rtl8139_init_one, but netif_napi_del is not called in rtl8139_remove_one.
However, many other ethernet card drivers call them in pairs, even in the
error handling paths, such as r8169 and igb.
[2] In the normal process of 8139too, pci_enable_device and
pci_disable_device are called in pairs in rtl8139_init_board(in
rtl8139_init_one) and rtl8139_remove_one. However, when pci_enable_device
has been called and pci_request_regions is failed in rtl8139_init_board,
"err_out" segment in rtl8139_init_board is executed immediately to exit, but
pci_disable_device is not called because "disable_dev_on_err = 0".

Could you help me check these findings? Thank you very much, and I'm looking
forward to your reply.

--
Jia-Ju Bai

^ permalink raw reply

* [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

The tcp tracer, which will be added in the later patch, depends
on them to collect statistics.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/trace/events/tcp.h | 15 +++++++++++++++
 net/ipv4/tcp_input.c       |  6 ++++++
 2 files changed, 21 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 81b40ef..440a26b 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -167,6 +167,21 @@ TRACE_EVENT(tcp_rcv_established,
 		  __entry->rcv_nxt, __entry->rcv_wnd)
 );
 
+DECLARE_TRACE(tcp_ooo_rcv,
+	      TP_PROTO(struct sock *sk),
+	      TP_ARGS(sk)
+);
+
+DECLARE_TRACE(tcp_sacks_rcv,
+	     TP_PROTO(struct sock *sk, int num_sacks),
+	     TP_ARGS(sk, num_sacks)
+);
+
+DECLARE_TRACE(tcp_rtt_sample,
+	      TP_PROTO(struct sock *sk, long rtt_sample_us),
+	      TP_ARGS(sk, rtt_sample_us)
+);
+
 #undef TCP_TRACE_ASSIGN_SA
 
 #endif /* _TRACE_TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 808fad7..1f82987 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1650,6 +1650,8 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	int i, j;
 	int first_sack_index;
 
+	trace_tcp_sacks_rcv(sk, num_sacks);
+
 	state.flag = 0;
 	state.reord = tp->packets_out;
 	state.rtt_us = -1L;
@@ -2932,6 +2934,9 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
 
 	/* RFC6298: only reset backoff on valid RTT measurement. */
 	inet_csk(sk)->icsk_backoff = 0;
+
+	trace_tcp_rtt_sample(sk, seq_rtt_us);
+
 	return true;
 }
 
@@ -4232,6 +4237,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
 		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+	trace_tcp_ooo_rcv(sk);
 
 	skb1 = skb_peek_tail(&tp->out_of_order_queue);
 	if (!skb1) {
-- 
1.8.1

^ permalink raw reply related

* [RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

A sample perf script.  It has a simple ip/port filtering and a summary output.

Here is a test with netem delay 100ms loss 0.1% and comparing the tcp-summary
output between reno and cubic.  It was run in a kvm environment:

[root@qemu1-centos65 perf]# sysctl -w net.ipv4.tcp_congestion_control=reno
net.ipv4.tcp_congestion_control = reno
[root@qemu1-centos65 perf]# ./perf record -a -e 'tcp:*' netperf -c -C -H 192.168.168.254 -l 60 -p 8888 -- -s 1M -S 1M -m 64K -P ,8889
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.168.254 () port 8889 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

2097152 425984  65536    60.78         2.91   0.00     0.86     0.000   771.397
[ perf record: Woken up 13 times to write data ]
[ perf record: Captured and wrote 3.231 MB perf.data (~141185 samples) ]
[root@qemu1 perf]# PYTHONPATH="$PYTHONPATH:/root/devhostshare/fb-kernel/linux/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace" ./perf script -s scripts/python/tcp-summary.py  -- --rport 8889
snd_cwnd(avg/min/max): 27/8/47
segs out: 15290
data segs out: 15275
octets out: 22116754
loss rxmits: 0
other rxmits: 13
rxmits%: 0.085
dup_acks: 402
established: 1
close: 1

[root@qemu1 perf]# sysctl -w net.ipv4.tcp_congestion_control=cubic
net.ipv4.tcp_congestion_control = cubic
[root@qemu1 perf]# ./perf record -a -e 'tcp:*' netperf -c -C -H 192.168.168.254 -l 60 -p 8888 -- -s 1M -S 1M -m 64K -P ,8889
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.168.254 () port 8889 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

2097152 425984  65536    60.25         4.97   0.02     0.86     1.096   454.051
[ perf record: Woken up 21 times to write data ]
[ perf record: Captured and wrote 5.525 MB perf.data (~241393 samples) ]
[root@qemu1 perf]# PYTHONPATH="$PYTHONPATH:/root/devhostshare/fb-kernel/linux/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace" ./perf script -s scripts/python/tcp-summary.py  -- --rport 8889
snd_cwnd(avg/min/max): 47/10/78
segs out: 25869
data segs out: 25848
octets out: 37426458
loss rxmits: 0
other rxmits: 19
rxmits%: 0.074
dup_acks: 957
established: 1
close: 1

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/perf/scripts/python/tcp-summary.py | 262 +++++++++++++++++++++++++++++++
 1 file changed, 262 insertions(+)
 create mode 100644 tools/perf/scripts/python/tcp-summary.py

diff --git a/tools/perf/scripts/python/tcp-summary.py b/tools/perf/scripts/python/tcp-summary.py
new file mode 100644
index 0000000..fe85a43
--- /dev/null
+++ b/tools/perf/scripts/python/tcp-summary.py
@@ -0,0 +1,262 @@
+import os
+import sys
+import argparse
+import struct
+import socket
+
+sys.path.append(os.environ['PERF_EXEC_PATH'] + \
+	'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
+
+from perf_trace_context import *
+from Core import *
+
+def ntohip(nip):
+	hip = 0L
+	for i in nip:
+		hip <<= 8
+		hip |= i
+	return hip
+
+def unpack_nip(nip_in_c):
+	fmt = "!" + "B" * len(nip_in_c)
+	return struct.unpack(fmt, nip_in_c)
+
+def unpack_and_hostify_sa(laddr, lport, raddr, rport):
+	return (ntohip(unpack_nip(laddr)), socket.ntohs(lport),
+		ntohip(unpack_nip(raddr)), socket.ntohs(rport))
+
+class SimpleSubnet(object):
+	def _parse_ip4(self):
+		hip = 0L
+		dec_u8 = self.ip_str.split(".")
+		for i in dec_u8:
+			hip = hip << 8
+			hip |= long(i)
+		self.hip = hip
+
+	def _parse_ip6(self):
+		hip = 0L;
+		hex_u16	 = self.ip_str.split(":")
+		if self.ip_str.endswith("::"):
+			if hex_u16.count("") > 2:
+				raise Exception("Error in parsing IPv6")
+			cpz_zeros = 8 - len(hex_u16) + 2
+			hex_u16 = hex_u16[:-1]
+		elif "" in hex_u16:
+			if hex_u16.count("") > 1:
+				raise Exception("Error in parsing IPv6")
+			cpz_zeros = 8 - len(hex_u16) + 1
+		for i in hex_u16:
+			if len(i) == 0:
+				for j in range(cpz_zeros):
+					hip = hip << 16
+			else:
+				hip = hip << 16;
+				hip |= long(i, 16)
+		self.hip = hip
+
+	def _parse_ip(self, ip_str):
+		slash_start = ip_str.find("/")
+		if slash_start == -1:
+			self.ip_str = ip_str
+			self.plen = 0
+		else:
+			self.ip_str = ip_str[0:slash_start]
+			self.plen = int(ip_str[slash_start+1:])
+
+		if ':' in self.ip_str:
+			self._parse_ip6()
+			self.is_ip6 = True
+			if self.plen == 0:
+				self.plen = 128
+			self.netmask = 0xffffffffffffffffffffffffffffffffL >> (128 - self.plen) << (128 - self.plen)
+		else:
+			self._parse_ip4()
+			self.is_ip6 = False
+			if self.plen == 0:
+				self.plen = 32
+			self.netmask = 0xffffffffL >> (32 - self.plen) << (32 - self.plen)
+
+	def _min_ip(self):
+		return self.hip & self.netmask
+
+	def _max_ip(self):
+		if self.is_ip6:
+			return self._min_ip() | (0xffffffffffffffffffffffffffffffffL - self.netmask)
+		else:
+			return self._min_ip() | (0xffffffffL - self.netmask)
+
+	def __contains__(self, hip):
+		return self._min_ip() <= hip <= self._max_ip()
+
+	def __init__(self, ip_str):
+		self._parse_ip(ip_str)
+
+class SimpleFilter(object):
+	def __init__(self, lsn, lhport, rsn, rhport):
+		self.lsn = lsn
+		self.rsn = rsn
+		self.lhport = lhport
+		self.rhport = rhport
+
+	def match(self, lhip, lhport, rhip, rhport):
+		if self.lsn and lhip not in self.lsn:
+			return False
+		if self.rsn and rhip not in self.rsn:
+			return False
+		if self.lhport and self.lhport != lhport:
+			return False
+		if self.rhport and self.rhport != rhport:
+			return False
+
+		return True
+
+class TcpSummary(object):
+	def __init__(self):
+		self.snd_cwnd_sum = 0
+		self.snd_cwnd_count = 0
+		self.max_snd_cwnd = 0
+		self.min_snd_cwnd = 0xffffffff
+		self.segs_out = 0
+		self.data_segs_out = 0
+		self.data_octets_out = 0
+		self.loss_rxmits = 0
+		self.other_rxmits = 0
+		self.dup_acks = 0
+		self.established = 0
+		self.close = 0
+
+	def add_loss_rxmits(self, n):
+		self.loss_rxmits += n
+
+	def add_other_rxmits(self, n):
+		self.other_rxmits += n
+
+	def add_segs_out(self, n):
+		self.segs_out += n
+
+	def add_data_segs_out(self, n):
+		self.data_segs_out += n
+
+	def add_data_octets_out(self, n):
+		self.data_octets_out += n
+
+	def add_snd_cwnd_sample(self, n):
+		self.snd_cwnd_sum += n
+		self.snd_cwnd_count += 1
+		self.min_snd_cwnd = min(self.min_snd_cwnd, n)
+		self.max_snd_cwnd = max(self.max_snd_cwnd, n)
+
+	def inc_dup_acks(self):
+		self.dup_acks += 1
+
+	def inc_established(self):
+		self.established += 1
+
+	def inc_close(self):
+		self.close += 1
+
+	def report(self):
+		if self.snd_cwnd_count == 0:
+			avg_cwnd = 0
+			min_snd_cwnd = 0
+		else:
+			avg_cwnd = self.snd_cwnd_sum / self.snd_cwnd_count
+			min_snd_cwnd = self.min_snd_cwnd
+		rxmit_rate = (self.loss_rxmits + self.other_rxmits) * 100.0 / \
+			     self.data_segs_out if self.data_segs_out else 0
+		print "snd_cwnd(avg/min/max): %u/%u/%u" % (avg_cwnd,
+							   min_snd_cwnd,
+							   self.max_snd_cwnd)
+		print "segs out: %u" % self.segs_out
+		print "data segs out: %u" % self.data_segs_out
+		print "octets out: %u" % self.data_octets_out
+		print "loss rxmits: %u" % self.loss_rxmits
+		print "other rxmits: %u" % self.other_rxmits
+		print "rxmits%%: %.3f" % rxmit_rate
+		print "dup_acks: %u" % self.dup_acks
+		print "established: %u" % self.established
+		print "close: %u" % self.close
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--laddr", help="Local address in IP[/prefix-len]")
+parser.add_argument("--raddr", help="Remote address in IP[/prefix-len]")
+parser.add_argument("--lport", type=int, default=0, help="Local port")
+parser.add_argument("--rport", type=int, default=0, help="Remote port")
+args = parser.parse_args()
+lsn = SimpleSubnet(args.laddr) if args.laddr else None
+rsn = SimpleSubnet(args.raddr) if args.raddr else None
+addr_filter = SimpleFilter(lsn, args.lport, rsn, args.rport)
+tcp_summary = TcpSummary()
+
+def trace_begin():
+	pass
+
+def trace_end():
+	tcp_summary.report();
+
+def tcp__tcp_transmit_skb(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, seq, end_seq, pcount, ca_state,
+	snd_nxt, snd_una, snd_wnd, snd_cwnd, mss_cache,
+	ssthresh, srtt_us, rto_ms):
+
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	tcp_summary.add_segs_out(pcount)
+	if end_seq > seq:
+		tcp_summary.add_snd_cwnd_sample(snd_cwnd)
+		if seq < snd_nxt:
+			if ca_state == 4:
+				tcp_summary.add_loss_rxmits(pcount)
+			else:
+				tcp_summary.add_other_rxmits(pcount)
+		else:
+			tcp_summary.add_data_segs_out(pcount)
+			tcp_summary.add_data_octets_out(end_seq - seq)
+
+def tcp__tcp_rcv_established(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, seq, end_seq, ack_seq, snd_una,
+	rcv_nxt, rcv_wnd):
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	if seq == end_seq and ack_seq == snd_una:
+		tcp_summary.inc_dup_acks()
+
+def tcp__tcp_established(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, snd_cwnd, mss_cache, ssthresh, srtt_us,
+	rto_ms):
+
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	tcp_summary.inc_established()
+
+def tcp__tcp_close(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, snd_cwnd, mss_cache, ssthresh, srtt_us,
+	rto_ms):
+
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	tcp_summary.inc_close()
+
+def trace_unhandled(event_name, context, event_fields_dict):
+	print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())])
+
+def print_header(event_name, cpu, secs, nsecs, pid, comm):
+	print "%-20s %5u %05u.%09u %8u %-20s " % \
+	(event_name, cpu, secs, nsecs, pid, comm),
-- 
1.8.1

^ permalink raw reply related

* [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

Add TRACE_EVENT when:
1. connection established
2. segs received
3. segs sending out
4. connection close

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/trace/events/tcp.h | 175 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-traces.c      |   1 +
 net/ipv4/tcp.c             |   6 +-
 net/ipv4/tcp_input.c       |   3 +
 net/ipv4/tcp_output.c      |   3 +
 5 files changed, 187 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/tcp.h

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
new file mode 100644
index 0000000..81b40ef
--- /dev/null
+++ b/include/trace/events/tcp.h
@@ -0,0 +1,175 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tcp
+
+#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TCP_H
+
+#include <net/sock.h>
+#include <net/inet_sock.h>
+#include <net/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ipv6.h>
+#include <linux/tracepoint.h>
+#include <uapi/linux/in6.h>
+
+#define TCP_TRACE_ASSIGN_SA(e, sk)	do {				\
+	(e)->lport = inet_sk((sk))->inet_sport;				\
+	(e)->rport = inet_sk((sk))->inet_dport;				\
+	if ((sk)->sk_family == AF_INET) {				\
+		(e)->ipv6 = 0;						\
+		memset((e)->laddr, 0, sizeof((e)->laddr));		\
+		memset((e)->raddr, 0, sizeof((e)->raddr));		\
+		memcpy((e)->laddr, &inet_sk((sk))->inet_saddr,		\
+		       sizeof(inet_sk((sk))->inet_saddr));		\
+		memcpy((e)->raddr, &inet_sk((sk))->inet_daddr,		\
+		       sizeof(inet_sk((sk))->inet_daddr));		\
+	} else {							\
+		(e)->ipv6 = 1;						\
+		memcpy((e)->laddr, inet6_sk((sk))->saddr.s6_addr,	\
+		       sizeof((e)->laddr));				\
+		memcpy((e)->raddr, (sk)->sk_v6_daddr.s6_addr,		\
+		       sizeof((e)->raddr));				\
+	}								\
+} while (0)
+
+DECLARE_EVENT_CLASS(tcp,
+	TP_PROTO(struct sock *sk),
+	TP_ARGS(sk),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, snd_cwnd)
+		__field(u32, mss_cache)
+		__field(u32, ssthresh)
+		__field(u64, srtt_us)
+		__field(u32, rto_ms)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+		__entry->mss_cache = tcp_sk(sk)->mss_cache;
+		__entry->ssthresh = tcp_current_ssthresh(sk);
+		__entry->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+		__entry->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	),
+	TP_printk("local=%s:%d remote=%s:%d snd_cwnd=%u mss_cache=%u "
+		  "ssthresh=%u srtt_us=%llu rto_ms=%u",
+		  __print_hex(__entry->laddr, 16),
+		  __entry->lport,
+		  __print_hex(__entry->raddr, 16),
+		  __entry->rport,
+		  __entry->snd_cwnd, __entry->mss_cache,
+		  __entry->ssthresh, __entry->srtt_us, __entry->rto_ms)
+);
+
+DEFINE_EVENT(tcp,
+	     tcp_established,
+	     TP_PROTO(struct sock *sk),
+	     TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp,
+	     tcp_close,
+	     TP_PROTO(struct sock *sk),
+	     TP_ARGS(sk)
+);
+
+TRACE_EVENT(tcp_transmit_skb,
+	TP_PROTO(struct sock *sk, struct sk_buff *skb),
+	TP_ARGS(sk, skb),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, seq)
+		__field(u32, end_seq)
+		__field(u32, pcount)
+		__field(u8, ca_state)
+		__field(u32, snd_nxt)
+		__field(u32, snd_una)
+		__field(u32, snd_wnd)
+		__field(u32, snd_cwnd)
+		__field(u32, mss_cache)
+		__field(u32, ssthresh)
+		__field(u64, srtt_us)
+		__field(u32, rto_ms)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->seq = TCP_SKB_CB(skb)->seq;
+		__entry->end_seq = TCP_SKB_CB(skb)->end_seq;
+		__entry->pcount = tcp_skb_pcount(skb);
+		__entry->ca_state = inet_csk(sk)->icsk_ca_state;
+		__entry->snd_nxt = tcp_sk(sk)->snd_nxt;
+		__entry->snd_una = tcp_sk(sk)->snd_una;
+		__entry->snd_wnd = tcp_sk(sk)->snd_wnd;
+		__entry->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+		__entry->mss_cache = tcp_sk(sk)->mss_cache;
+		__entry->ssthresh = tcp_current_ssthresh(sk);
+		__entry->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+		__entry->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	),
+	TP_printk("local=%s:%d remote=%s:%d "
+		  "skb_seq=%u skb_end_seq=%u pcount=%u ca_state=%x "
+		  "snd_nxt=%u snd_una=%u snd_wnd=%u snd_cwnd=%u mss_cache=%u "
+		  "ssthresh=%u srtt_us=%llu rto_ms=%u",
+		  __print_hex(__entry->laddr, 16), __entry->lport,
+		  __print_hex(__entry->raddr, 16), __entry->rport,
+
+		  __entry->seq, __entry->end_seq, __entry->pcount,
+		  __entry->ca_state,
+
+		  __entry->snd_nxt, __entry->snd_una, __entry->snd_wnd,
+		  __entry->snd_cwnd, __entry->mss_cache,
+
+		  __entry->ssthresh, __entry->srtt_us, __entry->rto_ms)
+);
+
+TRACE_EVENT(tcp_rcv_established,
+	    TP_PROTO(struct sock *sk, struct sk_buff *skb),
+	    TP_ARGS(sk, skb),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, seq)
+		__field(u32, end_seq)
+		__field(u32, ack_seq)
+		__field(u32, snd_una)
+		__field(u32, rcv_nxt)
+		__field(u32, rcv_wnd)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->seq = TCP_SKB_CB(skb)->seq;
+		__entry->end_seq = TCP_SKB_CB(skb)->end_seq;
+		__entry->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+		__entry->snd_una = tcp_sk(sk)->snd_una;
+		__entry->rcv_nxt = tcp_sk(sk)->rcv_nxt;
+		__entry->rcv_wnd = tcp_sk(sk)->rcv_wnd;
+	),
+	TP_printk("local=%s:%d remote=%s:%d "
+		  "skb_seq=%u skb_end_seq=%u skb_ack_seq=%u snd_una=%u "
+		  "rcv_nxt=%u, rcv_wnd=%u",
+		  __print_hex(__entry->laddr, 16), __entry->lport,
+		  __print_hex(__entry->raddr, 16), __entry->rport,
+
+		  __entry->seq, __entry->end_seq, __entry->ack_seq,
+		  __entry->snd_una,
+
+		  __entry->rcv_nxt, __entry->rcv_wnd)
+);
+
+#undef TCP_TRACE_ASSIGN_SA
+
+#endif /* _TRACE_TCP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index ba3c012..63f966b 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -31,6 +31,7 @@
 #include <trace/events/napi.h>
 #include <trace/events/sock.h>
 #include <trace/events/udp.h>
+#include <trace/events/tcp.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3075723..3b887fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
+#include <trace/events/tcp.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -1901,8 +1902,10 @@ void tcp_set_state(struct sock *sk, int state)
 
 	switch (state) {
 	case TCP_ESTABLISHED:
-		if (oldstate != TCP_ESTABLISHED)
+		if (oldstate != TCP_ESTABLISHED) {
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
+			trace_tcp_established(sk);
+		}
 		break;
 
 	case TCP_CLOSE:
@@ -1913,6 +1916,7 @@ void tcp_set_state(struct sock *sk, int state)
 		if (inet_csk(sk)->icsk_bind_hash &&
 		    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
 			inet_put_port(sk);
+		trace_tcp_close(sk);
 		/* fall through */
 	default:
 		if (oldstate == TCP_ESTABLISHED)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..808fad7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -75,6 +75,7 @@
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
+#include <trace/events/tcp.h>
 
 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
@@ -5076,6 +5077,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	trace_tcp_rcv_established(sk, skb);
+
 	if (unlikely(sk->sk_rx_dst == NULL))
 		inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
 	/*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..9832512 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -41,6 +41,7 @@
 #include <linux/compiler.h>
 #include <linux/gfp.h>
 #include <linux/module.h>
+#include <trace/events/tcp.h>
 
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse __read_mostly = 1;
@@ -1014,6 +1015,8 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	/* Our usage of tstamp should remain private */
 	skb->tstamp.tv64 = 0;
 
+	trace_tcp_transmit_skb(sk, skb);
+
 	/* Cleanup our debris for IP stacks */
 	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
 			       sizeof(struct inet6_skb_parm)));
-- 
1.8.1

^ permalink raw reply related

* [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs.
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

The tcp_sk_trace and its related structs define what will be
collected and recorded to the tracing's ring_buffer by
the TCP tracer (in the following patch).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/tcp.h            |  4 +++
 include/net/tcp_trace.h        | 18 ++++++++++
 include/uapi/linux/tcp_trace.h | 78 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/Kconfig           | 11 ++++++
 kernel/trace/Makefile          |  1 +
 kernel/trace/tcp_trace.c       | 37 ++++++++++++++++++++
 net/ipv4/tcp.c                 |  4 +++
 7 files changed, 153 insertions(+)
 create mode 100644 include/net/tcp_trace.h
 create mode 100644 include/uapi/linux/tcp_trace.h
 create mode 100644 kernel/trace/tcp_trace.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8d25cb3 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -315,6 +315,10 @@ struct tcp_sock {
 	 * socket. Used to retransmit SYNACKs etc.
 	 */
 	struct request_sock *fastopen_rsk;
+
+#ifdef CONFIG_TCP_TRACE
+	struct tcp_sk_trace *trace;
+#endif
 };
 
 enum tsq_flags {
diff --git a/include/net/tcp_trace.h b/include/net/tcp_trace.h
new file mode 100644
index 0000000..f800cc7
--- /dev/null
+++ b/include/net/tcp_trace.h
@@ -0,0 +1,18 @@
+#ifndef TCP_TRACE_H
+#define TCP_TRACE_H
+
+struct sock;
+
+#ifdef CONFIG_TCP_TRACE
+
+void tcp_sk_trace_init(struct sock *sk);
+void tcp_sk_trace_destruct(struct sock *sk);
+
+#else /* CONFIG_TCP_TRACE */
+
+static inline void tcp_sk_trace_init(struct sock *sk) {}
+static inline void tcp_sk_trace_destruct(struct sock *sk) {}
+
+#endif
+
+#endif /* TCP_TRACE_H */
diff --git a/include/uapi/linux/tcp_trace.h b/include/uapi/linux/tcp_trace.h
new file mode 100644
index 0000000..4f91056
--- /dev/null
+++ b/include/uapi/linux/tcp_trace.h
@@ -0,0 +1,78 @@
+#ifndef UAPI_TCP_TRACE_H
+#define UAPI_TCP_TRACE_H
+
+#include <linux/kernel.h>
+
+#define TCP_TRACE_MAGIC		0x54435000
+#define TCP_TRACE_VERSION	0x01
+#define TCP_TRACE_MAGIC_VERSION	(TCP_TRACE_MAGIC | TCP_TRACE_VERSION)
+
+enum tcp_trace_events {
+	TCP_TRACE_EVENT_ESTABLISHED,
+	TCP_TRACE_EVENT_PERIODIC,	/* Periodic event every 2s */
+	TCP_TRACE_EVENT_RETRANS,	/* Retrans (not in TCP_CA_Loss) */
+	TCP_TRACE_EVENT_RETRANS_LOSS,	/* Retrans in TCP_CA_Loss */
+	TCP_TRACE_EVENT_CLOSE,	/* Connection close */
+};
+
+struct tcp_stats {
+	/* outing packets */
+	__u32	segs_out;
+	__u32	data_segs_out;
+	__u64	data_octets_out;
+
+	/* retrans */
+	__u32	other_segs_retrans;
+	__u32	other_octets_retrans;
+	__u32	loss_segs_retrans;
+	__u32	loss_octets_retrans;
+
+	/* incoming packets */
+	__u32	segs_in;
+	__u32	data_segs_in;
+	__u64	data_octets_in;
+
+	/* RTT */
+	__u64	rtt_sample_us;
+	__u64	max_rtt_us;
+	__u64	min_rtt_us;
+	__u64	sum_rtt_us;
+	__u32	count_rtt;
+
+	/* RTO */
+	__u32	max_rto_ms;
+	__u32	min_rto_ms;
+
+	/* OOO or Loss */
+	__u32	dup_acks_in;
+	__u32	sacks_in;
+	__u32	sack_blks_in;
+	__u32	ooo_in;
+} __packed;
+
+struct tcp_trace {
+	__u32	magic;
+	__u8	event:7,
+		ipv6:1;
+	__u32	local_addr[4];
+	__u32	remote_addr[4];
+	__u16	local_port;
+	__u16	remote_port;
+} __packed;
+
+struct tcp_trace_basic {
+	struct tcp_trace event;
+	/* current values from tcp_sock */
+	__u32	snd_cwnd;
+	__u32	mss;
+	__u32	ssthresh;
+	__u64	srtt_us;
+	__u32	rto_ms;
+} __packed;
+
+struct tcp_trace_stats {
+	struct tcp_trace_basic basic;
+	struct tcp_stats stats;
+} __packed;
+
+#endif /* UAPI_TCP_TRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c..f30835c 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -599,6 +599,17 @@ config RING_BUFFER_STARTUP_TEST
 
 	 If unsure, say N
 
+config TCP_TRACE
+	bool "TCP tracing"
+	depends on NET && INET
+	select DEBUG_FS
+	select TRACEPOINTS
+	select GENERIC_TRACER
+	help
+	  This tracer collects per-flow statistics and events.
+
+	  If unsure, say N.
+
 endif # FTRACE
 
 endif # TRACING_SUPPORT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 67d6369..71d008a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -65,5 +65,6 @@ obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
+obj-$(CONFIG_TCP_TRACE) += tcp_trace.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
new file mode 100644
index 0000000..9d09fd0
--- /dev/null
+++ b/kernel/trace/tcp_trace.c
@@ -0,0 +1,37 @@
+#include <net/tcp_trace.h>
+#include <linux/tcp.h>
+#include <uapi/linux/tcp_trace.h>
+
+static bool tcp_trace_enabled __read_mostly;
+
+struct tcp_sk_trace {
+	struct tcp_stats stats;
+	unsigned long start_ts;
+	unsigned long last_ts;
+};
+
+void tcp_sk_trace_init(struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	tcp_sk(sk)->trace = NULL;
+	if (!tcp_trace_enabled)
+		return;
+
+	sktr  = kzalloc(sizeof(*sktr), gfp_any());
+	if (unlikely(!sktr))
+		return;
+
+	tcp_sk(sk)->trace = sktr;
+	sk->sk_destruct = tcp_sock_destruct;
+
+	sktr->stats.min_rtt_us = U64_MAX;
+	sktr->stats.min_rto_ms = U32_MAX;
+
+	sktr->last_ts = sktr->start_ts = jiffies;
+}
+
+void tcp_sk_trace_destruct(struct sock *sk)
+{
+	kfree(tcp_sk(sk)->trace);
+}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3b887fa..41871c4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
+#include <net/tcp_trace.h>
 #include <trace/events/tcp.h>
 
 #include <asm/uaccess.h>
@@ -1904,6 +1905,7 @@ void tcp_set_state(struct sock *sk, int state)
 	case TCP_ESTABLISHED:
 		if (oldstate != TCP_ESTABLISHED) {
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
+			tcp_sk_trace_init(sk);
 			trace_tcp_established(sk);
 		}
 		break;
@@ -2254,6 +2256,8 @@ EXPORT_SYMBOL(tcp_disconnect);
 
 void tcp_sock_destruct(struct sock *sk)
 {
+	tcp_sk_trace_destruct(sk);
+
 	inet_sock_destruct(sk);
 
 	kfree(inet_csk(sk)->icsk_accept_queue.fastopenq);
-- 
1.8.1

^ permalink raw reply related

* [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

Hi,

We have been using the kernel ftrace infra to collect TCP per-flow statistics.
The following patch set is a first slim-down version of our
existing implementation. We would like to get some early feedback
and make it useful for others.

[RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs:
Defines some basic tracepoints (by TRACE_EVENT).

[RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints:
A sample perf script with simple ip/port filtering and summary output.

[RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer:
Declares a few more tracepoints (by DECLARE_TRACE) which are
used by the tcp_tracer.  The tcp_tracer is in the patch 5/5.

[RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs:
Defines a few tcp_trace structs which are used to collect statistics
on each tcp_sock.

[RFC PATCH net-next 5/5] tcp: Add TCP tracer:
It introduces a tcp_tracer which hooks onto the tracepoints defined in the
patch 1/5 and 3/5.  It collects data defined in patch 4/5. We currently
use this tracer to collect per-flow statistics.  The commit log has
some more details.

Thanks,
--Martin

^ permalink raw reply

* [RFC PATCH net-next 5/5] tcp: Add TCP tracer
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

Define probes and register them to the TCP tracepoints.  The probes
collect the data defined in struct tcp_sk_trace and record them to
the tracing's ring_buffer.

We currently use the TCP tracer to collect per-flow TCP statistics. Here is our
sample usage:
1. compare performance of different subnet, connection-type (like wire vs
   wireless), application-protocol...etc.
2. Uncover uplink/backbone/subnet issue, e.g. by tracking the rxmit rate.

This patch concludes the series. It is still missing a few things that
we currently have, like:
- why the sender is blocked? and how long for each reason?
- some TCP Congestion Control data
- ...etc.
We are interested to complete them in the next few versions.

[
Some background:
It was inspired by the Web10G effort (RFC4898) on collecting TCP
per-flow stats.  We initially took the Web10G kernel patch, capture only a
subset of RFC4898 that we monitor continuously in production and added codes to
leverage the ftrace infra.
]

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/trace/tcp_trace.c | 489 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h     |   1 +
 2 files changed, 488 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
index 9d09fd0..da9b748 100644
--- a/kernel/trace/tcp_trace.c
+++ b/kernel/trace/tcp_trace.c
@@ -1,12 +1,29 @@
 #include <net/tcp_trace.h>
+#include <net/tcp.h>
+#include <trace/events/tcp.h>
 #include <linux/tcp.h>
+#include <linux/ipv6.h>
+#include <linux/ftrace_event.h>
+#include <linux/jiffies.h>
 #include <uapi/linux/tcp_trace.h>
 
+#include "trace_output.h"
+
+#define REPORT_INTERVAL_MS 2000
+
+static struct trace_array *tcp_tr;
 static bool tcp_trace_enabled __read_mostly;
 
+static struct trace_print_flags tcp_trace_event_names[] = {
+	{ TCP_TRACE_EVENT_ESTABLISHED, "established" },
+	{ TCP_TRACE_EVENT_PERIODIC, "periodic" },
+	{ TCP_TRACE_EVENT_RETRANS, "retrans" },
+	{ TCP_TRACE_EVENT_RETRANS_LOSS, "retrans_loss" },
+	{ TCP_TRACE_EVENT_CLOSE, "close" }
+};
+
 struct tcp_sk_trace {
 	struct tcp_stats stats;
-	unsigned long start_ts;
 	unsigned long last_ts;
 };
 
@@ -28,10 +45,478 @@ void tcp_sk_trace_init(struct sock *sk)
 	sktr->stats.min_rtt_us = U64_MAX;
 	sktr->stats.min_rto_ms = U32_MAX;
 
-	sktr->last_ts = sktr->start_ts = jiffies;
+	sktr->last_ts = jiffies;
 }
 
 void tcp_sk_trace_destruct(struct sock *sk)
 {
 	kfree(tcp_sk(sk)->trace);
 }
+
+static void tcp_trace_init(struct tcp_trace *tr,
+			   enum tcp_trace_events trev,
+			   struct sock *sk)
+{
+	tr->event = trev;
+	if (sk->sk_family == AF_INET) {
+		tr->ipv6 = 0;
+		tr->local_addr[0] = inet_sk(sk)->inet_saddr;
+		tr->remote_addr[0] = inet_sk(sk)->inet_daddr;
+	} else {
+		BUG_ON(sk->sk_family != AF_INET6);
+		tr->ipv6 = 1;
+		memcpy(tr->local_addr, inet6_sk(sk)->saddr.s6_addr32,
+		       sizeof(tr->local_addr));
+		memcpy(tr->remote_addr, sk->sk_v6_daddr.s6_addr32,
+		       sizeof(tr->remote_addr));
+	}
+	tr->local_port = inet_sk(sk)->inet_sport;
+	tr->remote_port = inet_sk(sk)->inet_dport;
+}
+
+static void tcp_trace_basic_init(struct tcp_trace_basic *trb,
+				 enum tcp_trace_events trev,
+				 struct sock *sk)
+{
+	tcp_trace_init((struct tcp_trace *)trb, trev, sk);
+	trb->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+	trb->mss = tcp_sk(sk)->mss_cache;
+	trb->ssthresh = tcp_current_ssthresh(sk);
+	trb->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+	trb->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+}
+
+static void tcp_trace_basic_add(enum tcp_trace_events trev, struct sock *sk)
+{
+	struct ring_buffer *buffer;
+	int pc;
+	struct ring_buffer_event *event;
+	struct tcp_trace_basic *trb;
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	if (!sktr)
+		return;
+
+	tracing_record_cmdline(current);
+	buffer = tcp_tr->trace_buffer.buffer;
+	pc = preempt_count();
+	event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+					  sizeof(*trb), 0, pc);
+	if (!event)
+		return;
+	trb = ring_buffer_event_data(event);
+	tcp_trace_basic_init(trb, trev, sk);
+	trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_stats_init(struct tcp_trace_stats *trs,
+				 enum tcp_trace_events trev,
+				 struct sock *sk)
+{
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	tcp_trace_basic_init((struct tcp_trace_basic *)trs, trev, sk);
+	memcpy(&trs->stats, &sktr->stats, sizeof(sktr->stats));
+}
+
+static void tcp_trace_stats_add(enum tcp_trace_events trev, struct sock *sk)
+{
+	struct ring_buffer *buffer;
+	int pc;
+	struct ring_buffer_event *event;
+	struct tcp_trace_stats *trs;
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	if (!sktr)
+		return;
+
+	tracing_record_cmdline(current);
+	buffer = tcp_tr->trace_buffer.buffer;
+	pc = preempt_count();
+	event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+					  sizeof(*trs), 0, pc);
+	if (!event)
+		return;
+	trs = ring_buffer_event_data(event);
+
+	tcp_trace_stats_init(trs, trev, sk);
+
+	trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_established(void *ignore, struct sock *sk)
+{
+	tcp_trace_basic_add(TCP_TRACE_EVENT_ESTABLISHED, sk);
+}
+
+static void tcp_trace_transmit_skb(void *ignore, struct sock *sk,
+				   struct sk_buff *skb)
+{
+	int pcount;
+	struct tcp_sk_trace *sktr;
+	struct tcp_skb_cb *tcb;
+	unsigned int data_len;
+	bool retrans = false;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	tcb = TCP_SKB_CB(skb);
+	pcount = tcp_skb_pcount(skb);
+	data_len = tcb->end_seq - tcb->seq;
+
+	sktr->stats.segs_out += pcount;
+
+	if (!data_len)
+		goto out;
+
+	sktr->stats.data_segs_out += pcount;
+	sktr->stats.data_octets_out += data_len;
+
+	if (before(tcb->seq, tcp_sk(sk)->snd_nxt)) {
+		enum tcp_trace_events trev;
+
+		retrans = true;
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
+			sktr->stats.loss_segs_retrans += pcount;
+			sktr->stats.loss_octets_retrans += data_len;
+			trev = TCP_TRACE_EVENT_RETRANS_LOSS;
+		} else {
+			sktr->stats.other_segs_retrans += pcount;
+			sktr->stats.other_octets_retrans += data_len;
+			trev = TCP_TRACE_EVENT_RETRANS;
+		}
+		tcp_trace_stats_add(trev, sk);
+		return;
+	}
+
+out:
+	if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+	    REPORT_INTERVAL_MS) {
+		sktr->last_ts = jiffies;
+		tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+	}
+}
+
+static void tcp_trace_rcv_established(void *ignore, struct sock *sk,
+				      struct sk_buff *skb)
+{
+	struct tcp_sk_trace *sktr;
+	unsigned int data_len;
+	struct tcphdr *th;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	th = tcp_hdr(skb);
+	WARN_ON_ONCE(skb->len < th->doff << 2);
+
+	sktr->stats.segs_in++;
+	data_len = skb->len - (th->doff << 2);
+	if (data_len) {
+		sktr->stats.data_segs_in++;
+		sktr->stats.data_octets_in += data_len;
+	} else {
+		if (TCP_SKB_CB(skb)->ack_seq == tcp_sk(sk)->snd_una)
+			sktr->stats.dup_acks_in++;
+	}
+
+	if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+	    REPORT_INTERVAL_MS) {
+		sktr->last_ts = jiffies;
+		tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+	}
+}
+
+static void tcp_trace_close(void *ignore, struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	tcp_trace_stats_add(TCP_TRACE_EVENT_CLOSE, sk);
+}
+
+static void tcp_trace_ooo_rcv(void *ignore, struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	sktr->stats.ooo_in++;
+}
+
+static void tcp_trace_sacks_rcv(void *ignore, struct sock *sk, int num_sacks)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	sktr->stats.sacks_in++;
+	sktr->stats.sack_blks_in += num_sacks;
+}
+
+void tcp_trace_rtt_sample(void *ignore, struct sock *sk,
+			  long rtt_sample_us)
+{
+	struct tcp_sk_trace *sktr;
+	u32 rto_ms;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+
+	sktr->stats.rtt_sample_us = rtt_sample_us;
+	sktr->stats.max_rtt_us = max_t(u64, sktr->stats.max_rtt_us,
+				       rtt_sample_us);
+	sktr->stats.min_rtt_us = min_t(u64, sktr->stats.min_rtt_us,
+				       rtt_sample_us);
+
+	sktr->stats.count_rtt++;
+	sktr->stats.sum_rtt_us += rtt_sample_us;
+
+	sktr->stats.max_rto_ms = max_t(u32, sktr->stats.max_rto_ms, rto_ms);
+	sktr->stats.min_rto_ms = min_t(u32, sktr->stats.min_rto_ms, rto_ms);
+}
+
+static enum print_line_t
+tcp_trace_print(struct trace_iterator *iter)
+{
+	struct trace_seq *s = &iter->seq;
+	struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+	struct tcp_trace_basic *trb;
+	struct tcp_stats *stats;
+
+	union {
+		struct sockaddr_in v4;
+		struct sockaddr_in6 v6;
+	} local_sa, remote_sa;
+
+	local_sa.v4.sin_port = tr->local_port;
+	remote_sa.v4.sin_port = tr->remote_port;
+	if (tr->ipv6) {
+		local_sa.v6.sin6_family = AF_INET6;
+		remote_sa.v6.sin6_family = AF_INET6;
+		memcpy(local_sa.v6.sin6_addr.s6_addr, tr->local_addr, 4);
+		memcpy(remote_sa.v6.sin6_addr.s6_addr, tr->remote_addr, 4);
+	} else {
+		local_sa.v4.sin_family = AF_INET;
+		remote_sa.v4.sin_family = AF_INET;
+		local_sa.v4.sin_addr.s_addr =  tr->local_addr[0];
+		remote_sa.v4.sin_addr.s_addr = tr->remote_addr[0];
+	}
+
+	ftrace_print_symbols_seq(s, tr->event, tcp_trace_event_names);
+	if (trace_seq_has_overflowed(s))
+		goto out;
+
+	trb = (struct tcp_trace_basic *)tr;
+	trace_seq_printf(s,
+			 " %pISpc %pISpc snd_cwnd=%u mss=%u ssthresh=%u"
+			 " srtt_us=%llu rto_ms=%u",
+			 &local_sa, &remote_sa,
+			 trb->snd_cwnd, trb->mss, trb->ssthresh,
+			 trb->srtt_us, trb->rto_ms);
+
+	if (tr->event == TCP_TRACE_EVENT_ESTABLISHED ||
+	    trace_seq_has_overflowed(s))
+		goto out;
+
+	stats = &(((struct tcp_trace_stats *)tr)->stats);
+	trace_seq_printf(s,
+		" segs_out=%u data_segs_out=%u data_octets_out=%llu"
+		" other_segs_retrans=%u other_octets_retrans=%u"
+		" loss_segs_retrans=%u loss_octets_retrans=%u"
+		" segs_in=%u data_segs_in=%u data_octets_in=%llu"
+		" max_rtt_us=%llu min_rtt_us=%llu"
+		" count_rtt=%u sum_rtt_us=%llu"
+		" rtt_sample_us=%llu"
+		" max_rto_ms=%u min_rto_ms=%u"
+		" dup_acks_in=%u sacks_in=%u"
+		" sack_blks_in=%u ooo_in=%u",
+		stats->segs_out, stats->data_segs_out, stats->data_octets_out,
+		stats->other_segs_retrans, stats->other_octets_retrans,
+		stats->loss_segs_retrans, stats->loss_octets_retrans,
+		stats->segs_in, stats->data_segs_in, stats->data_octets_in,
+		stats->max_rtt_us, stats->min_rtt_us,
+		stats->count_rtt, stats->sum_rtt_us,
+		stats->rtt_sample_us,
+		stats->max_rto_ms, stats->min_rto_ms,
+		stats->dup_acks_in, stats->sacks_in,
+		stats->sack_blks_in, stats->ooo_in);
+	if (trace_seq_has_overflowed(s))
+		goto out;
+
+out:
+	trace_seq_putc(s, '\n');
+
+	return trace_seq_has_overflowed(s) ?
+		TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+tcp_trace_print_binary(struct trace_iterator *iter)
+{
+	struct trace_seq *s = &iter->seq;
+	struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+
+	tr->magic = TCP_TRACE_MAGIC_VERSION;
+	if (tr->event == TCP_TRACE_EVENT_ESTABLISHED)
+		trace_seq_putmem(s, tr, sizeof(struct tcp_trace_basic));
+	else
+		trace_seq_putmem(s, tr, sizeof(struct tcp_trace_stats));
+
+	return trace_seq_has_overflowed(s) ?
+		TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+tcp_tracer_print_line(struct trace_iterator *iter)
+{
+	return (trace_flags & TRACE_ITER_BIN) ?
+		tcp_trace_print_binary(iter) :
+		tcp_trace_print(iter);
+}
+
+static void tcp_unregister_tracepoints(void)
+{
+	unregister_trace_tcp_established(tcp_trace_established, NULL);
+	unregister_trace_tcp_close(tcp_trace_close, NULL);
+	unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+	unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+	unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+	unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+	unregister_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+
+	tracepoint_synchronize_unregister();
+}
+
+static int tcp_register_tracepoints(void)
+{
+	int ret;
+
+	ret = register_trace_tcp_established(tcp_trace_established, NULL);
+	if (ret)
+		return ret;
+
+	ret = register_trace_tcp_close(tcp_trace_close, NULL);
+	if (ret)
+		goto err1;
+
+	ret = register_trace_tcp_rcv_established(tcp_trace_rcv_established,
+						 NULL);
+	if (ret)
+		goto err2;
+
+	ret = register_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+	if (ret)
+		goto err3;
+
+	ret = register_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+	if (ret)
+		goto err4;
+
+	ret = register_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+	if (ret)
+		goto err5;
+
+	ret = register_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+	if (ret)
+		goto err6;
+
+	return ret;
+
+err6:
+	unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+err5:
+	unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+err4:
+	unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+err3:
+	unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+err2:
+	unregister_trace_tcp_close(tcp_trace_close, NULL);
+err1:
+	unregister_trace_tcp_established(tcp_trace_established, NULL);
+
+	return ret;
+}
+
+static void tcp_tracer_start(struct trace_array *tr)
+{
+	int ret;
+
+	if (tcp_trace_enabled)
+		return;
+
+	ret = tcp_register_tracepoints();
+	if (ret == 0) {
+		tcp_trace_enabled = true;
+		pr_info("tcp_trace: enabled\n");
+	}
+}
+
+static void tcp_tracer_stop(struct trace_array *tr)
+{
+	if (!tcp_trace_enabled)
+		return;
+
+	tcp_unregister_tracepoints();
+	tcp_trace_enabled = false;
+	pr_info("tcp_trace: disabled\n");
+}
+
+static void tcp_tracer_reset(struct trace_array *tr)
+{
+	tcp_tracer_stop(tr);
+}
+
+static int tcp_tracer_init(struct trace_array *tr)
+{
+	tcp_tr = tr;
+	tcp_tracer_start(tr);
+	return 0;
+}
+
+static struct tracer tcp_tracer __read_mostly = {
+	.name           = "tcp",
+	.init           = tcp_tracer_init,
+	.reset          = tcp_tracer_reset,
+	.start          = tcp_tracer_start,
+	.stop           = tcp_tracer_stop,
+	.print_line	= tcp_tracer_print_line,
+};
+
+static struct trace_event_functions tcp_trace_event_funcs;
+
+static struct trace_event tcp_trace_event = {
+	.type           = TRACE_TCP,
+	.funcs          = &tcp_trace_event_funcs,
+};
+
+static int __init init_tcp_tracer(void)
+{
+	if (!register_ftrace_event(&tcp_trace_event)) {
+		pr_warn("Cannot register TCP trace event\n");
+		return 1;
+	}
+
+	if (register_tracer(&tcp_tracer) != 0) {
+		pr_warn("Cannot register TCP tracer\n");
+		unregister_ftrace_event(&tcp_trace_event);
+		return 1;
+	}
+	return 0;
+}
+
+device_initcall(init_tcp_tracer);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3255dfb..d88bdd2 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -38,6 +38,7 @@ enum trace_type {
 	TRACE_USER_STACK,
 	TRACE_BLK,
 	TRACE_BPUTS,
+	TRACE_TCP,
 
 	__TRACE_LAST_TYPE,
 };
-- 
1.8.1

^ permalink raw reply related

* Potential bugs found in 3c59x
From: Jia-Ju Bai @ 2014-12-15  1:41 UTC (permalink / raw)
  To: netdev

Recently I test linux device drivers 3.17.2. 
The target file is drivers/net/ethernet/3com/3c59x.c, which is used to build
3c59x.ko. I hope you can help me check my findings:
[1] The function vortex_up is called by vortex_open when initializing the
ethernet card driver. But when vortex_up is failed, which means that it
returns the error value, "out" segment is executed immediately to halt the
process. However, the resources allocated by __netdev_alloc_skb in
vortex_open are not released by dev_kfree_skb when vortex_up is failed.
[2] As shown in [1], one reason that vortex_up is failed is that
pci_enable_device is failed(return the error value) in vortex_up, and
"err_out" segment is executed immediately to return.
Could you help me check these findings? Thank you very much, and I'm looking
forward to your reply.

^ permalink raw reply

* Re: [PATCH net] net/mlx4_en: correct the endianness of doorbell_qpn on big endian platform
From: Wei Yang @ 2014-12-15  1:32 UTC (permalink / raw)
  To: David Miller
  Cc: weiyang, David.Laight, eric.dumazet, netdev, gideonn, edumazet,
	amirv
In-Reply-To: <20141213.234320.1607496855879763694.davem@davemloft.net>

On Sat, Dec 13, 2014 at 11:43:20PM -0500, David Miller wrote:
>From: Wei Yang <weiyang@linux.vnet.ibm.com>
>Date: Sat, 13 Dec 2014 11:13:38 +0800
>
>> On Mon, Dec 08, 2014 at 10:42:37PM +0800, Wei Yang wrote:
>> If you prefer this way, I would like to send a new version for this.
>> Is it ok for you?
>
>I'm not so sure.  There are implications when using the __raw_*()
>routines.
>
>In particular, using __raw_{read,write}l() also means that the usual
>necessary I/O memory barriers are not being performed.
>
>There are therefore no ordering guarantees between __raw_*() and other
>I/O or memory accesses whatsoever.

Thanks David.

Actually, the last mail is asking David Laight. I am trying to understanding
his comment and Amir told me he was suggesting to use __raw_*() version.

Hmm... this is really a problem found in the v3.18-rc1 and the root cause is
the endianess. I am ok to use any method to fix this problem, even revert it.
Could the maintainer from Mellanox gives me a word?

-- 
Richard Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH] bonding: move ipoib_header_ops to vmlinux
From: Wengang @ 2014-12-15  1:12 UTC (permalink / raw)
  To: David Miller, jay.vosburgh-Z7WLFzj8eWMS+FvcfC7Uqw
  Cc: ogerlitz-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <547E6C70.7040809-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

Anyone please respond to this?

thanks,
wengang

于 2014年12月03日 09:50, Wengang Wang 写道:
> Hi David and Jay,
>
> Then about about the change in this patch?
>
> thanks,
> wengang
>
> 在 2014年11月26日 09:30, Wengang 写道:
>> 于 2014年11月26日 02:44, David Miller 写道:
>>> From: Jay Vosburgh <jay.vosburgh-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>>> Date: Tue, 25 Nov 2014 10:41:17 -0800
>>>
>>>> Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>>>
>>>>> On 11/25/2014 8:07 AM, David Miller wrote:
>>>>>> IPOIB should not work over bonding as it requires that the device
>>>>>> use ARPHRD_ETHER.
>>>>> Hi Dave,
>>>>>
>>>>> IPoIB devices can be enslaved to both bonding and teaming in their 
>>>>> HA mode,
>>>>> the bond device type becomes ARPHRD_INFINIBAND when this happens.
>>>>     The point was that pktgen disallows ARPHRD_INFINIBAND, not that
>>>> bonding does.
>>>>
>>>>     Pktgen specifically checks for type != ARPHRD_ETHER, so the
>>>> IPoIB bond should not be able to be used with pkgten.  My suspicion is
>>>> that pktgen is being configured on the bond first, then an IPoIB slave
>>>> is added to the bond; this would change its type in a way that pktgen
>>>> wouldn't notice.
>>> +1
>>
>> I think it go this way:
>>
>> 1) bond_master is ready
>> 2) bond_enslave enslave a IPOIB interface calling bond_setup_by_slave
>> 3) then bond_setup_by_slave set change master type to ARPHRD_INFINIBAND.
>>
>> code is like this:
>>
>> 1 /* enslave device <slave> to bond device <master> */
>> 2 int bond_enslave(struct net_device *bond_dev, struct net_device 
>> *slave_dev)
>> 3 {
>> 4 <snip>...
>> 5 /* set bonding device ether type by slave - bonding netdevices are
>> 6 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
>> 7 * there is a need to override some of the type dependent 
>> attribs/funcs.
>> 8 *
>> 9 * bond ether type mutual exclusion - don't allow slaves of dissimilar
>> 10 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the 
>> same bond
>> 11 */
>> 12 if (!bond_has_slaves(bond)) {
>> 13 if (bond_dev->type != slave_dev->type) {
>> 14 <snip>...
>> 15 if (slave_dev->type != ARPHRD_ETHER)
>> 16 bond_setup_by_slave(bond_dev, slave_dev);
>> 17 else {
>> 18 ether_setup(bond_dev);
>> 19 bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING;
>> 20 }
>> 21
>> 22 call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE,
>> 23 bond_dev);
>> 24 }
>> 25 <snip>...
>> 26 }
>> 27
>> 28 static void bond_setup_by_slave(struct net_device *bond_dev,
>> 29 struct net_device *slave_dev)
>> 30 {
>> 31 bond_dev->header_ops = slave_dev->header_ops;
>> 32
>> 33 bond_dev->type = slave_dev->type;
>> 34 bond_dev->hard_header_len = slave_dev->hard_header_len;
>> 35 bond_dev->addr_len = slave_dev->addr_len;
>> 36
>> 37 memcpy(bond_dev->broadcast, slave_dev->broadcast,
>> 38 slave_dev->addr_len);
>> 39 }
>> 40
>>
>> thanks
>> wengang
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Jiri Pirko @ 2014-12-14 22:33 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Andy Gospodarek,
	Jamal Hadi Salim, John Fastabend
In-Reply-To: <CAJ3xEMhbic5w4d0dGsTMRDoq5Vx8wqZRFpy-q+8HaeDfYciUqw@mail.gmail.com>

Sun, Dec 14, 2014 at 09:14:27PM CET, gerlitz.or@gmail.com wrote:
>On Sun, Dec 14, 2014 at 9:23 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Sun, Dec 14, 2014 at 05:19:05PM CET, ogerlitz@mellanox.com wrote:
>>>The current implementations all use dev_uc_add_excl() and such whose API
>>>doesn't support vlans, so we can't make it with NICs HW for now.
>>>
>>>Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
>>
>> Maybe I'm missing something, but this commit did not introduce the
>> problem.
>
>This commit introduced the vid param to ndo_fdb_add and ndo_dflt_fdb_add
>which further call the dev_uc_add APIs... so it did introduced the
>ability to provide VID into these APIs, right? and we want to protect
>against anyone using this ability @ this point.

That is in-kernel change. Vs. usespace the patch is a no-op. If userspace
fills up NDA_VLAN, it is ignored before the patch as well as after. No
behaviour change, just +- cosmetics.

>
>> If was there already before when NDA_VLAN was set and ignored.
>> But other than this. I like the patch
>>
>> Reviewed-by: Jiri Pirko <jiri@resnulli.us>

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Or Gerlitz @ 2014-12-14 20:14 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Andy Gospodarek,
	Jamal Hadi Salim, John Fastabend
In-Reply-To: <20141214192351.GA1850@nanopsycho.orion>

On Sun, Dec 14, 2014 at 9:23 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Dec 14, 2014 at 05:19:05PM CET, ogerlitz@mellanox.com wrote:
>>The current implementations all use dev_uc_add_excl() and such whose API
>>doesn't support vlans, so we can't make it with NICs HW for now.
>>
>>Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
>
> Maybe I'm missing something, but this commit did not introduce the
> problem.

This commit introduced the vid param to ndo_fdb_add and ndo_dflt_fdb_add
which further call the dev_uc_add APIs... so it did introduced the
ability to provide VID into these APIs, right? and we want to protect
against anyone using this ability @ this point.

> If was there already before when NDA_VLAN was set and ignored.
> But other than this. I like the patch
>
> Reviewed-by: Jiri Pirko <jiri@resnulli.us>

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-14 19:41 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <20141214153549.GA2174@nanopsycho.orion>

On 12/14/14, 7:35 AM, Jiri Pirko wrote:
> Sun, Dec 14, 2014 at 03:13:40PM CET, roopa@cumulusnetworks.com wrote:
>> On 12/11/14, 2:25 PM, Jiri Pirko wrote:
>>> Thu, Dec 11, 2014 at 07:27:32PM CET, roopa@cumulusnetworks.com wrote:
>>>> On 12/11/14, 10:07 AM, Jiri Pirko wrote:
>>>>> Thu, Dec 11, 2014 at 06:59:15PM CET, roopa@cumulusnetworks.com wrote:
>>>>>> On 12/11/14, 9:11 AM, Jiri Pirko wrote:
>>>>>>> Thu, Dec 11, 2014 at 05:52:10PM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>> On 12/10/14, 1:37 AM, Jiri Pirko wrote:
>>>>>>>>> Wed, Dec 10, 2014 at 10:05:18AM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>>>>
>>>>>>>>>> This patch adds two new api's netdev_switch_port_bridge_setlink
>>>>>>>>>> and netdev_switch_port_bridge_dellink to offload bridge port attributes
>>>>>>>>>> to switch asic
>>>>>>>>>>
>>>>>>>>>> (The names of the apis look odd with 'switch_port_bridge',
>>>>>>>>>> but am more inclined to change the prefix of the api to something else.
>>>>>>>>>> Will take any suggestions).
>>>>>>>>>>
>>>>>>>>>> The api's look at the NETIF_F_HW_NETFUNC_OFFLOAD feature flag to
>>>>>>>>>> pass bridge port attributes to the port device.
>>>>>>>>>>
>>>>>>>>>> If the device has the NETIF_F_HW_NETFUNC_OFFLOAD, but does not support
>>>>>>>>>> the bridge port attribute offload ndo, call bridge port attribute ndo's on
>>>>>>>>>> the lowerdevs if supported. This is one way to pass bridge port attributes
>>>>>>>>>> through stacked netdevs (example when bridge port is a bond and bond slaves
>>>>>>>>>> are switch ports).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>>>> ---
>>>>>>>>>> include/net/switchdev.h   |    5 +++-
>>>>>>>>>> net/switchdev/switchdev.c |   70 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>> 2 files changed, 74 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>>>>>>>>> index 8a6d164..22676b6 100644
>>>>>>>>>> --- a/include/net/switchdev.h
>>>>>>>>>> +++ b/include/net/switchdev.h
>>>>>>>>>> @@ -17,7 +17,10 @@
>>>>>>>>>> int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>>>> 				struct netdev_phys_item_id *psid);
>>>>>>>>>> int netdev_switch_port_stp_update(struct net_device *dev, u8 state);
>>>>>>>>>> -
>>>>>>>>>> +int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>>>> +				struct nlmsghdr *nlh, u16 flags);
>>>>>>>>>> +int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>>>> +				struct nlmsghdr *nlh, u16 flags);
>>>>>>>>>> #else
>>>>>>>>>>
>>>>>>>>>> static inline int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>>>> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>>>>>>>>>> index d162b21..62317e1 100644
>>>>>>>>>> --- a/net/switchdev/switchdev.c
>>>>>>>>>> +++ b/net/switchdev/switchdev.c
>>>>>>>>>> @@ -50,3 +50,73 @@ int netdev_switch_port_stp_update(struct net_device *dev, u8 state)
>>>>>>>>>> 	return ops->ndo_switch_port_stp_update(dev, state);
>>>>>>>>>> }
>>>>>>>>>> EXPORT_SYMBOL(netdev_switch_port_stp_update);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + *	netdev_switch_port_bridge_setlink - Notify switch device port of bridge
>>>>>>>>>> + *	port attributes
>>>>>>>>>> + *
>>>>>>>>>> + *	@dev: port device
>>>>>>>>>> + *	@nlh: netlink msg with bridge port attributes
>>>>>>>>>> + *
>>>>>>>>>> + *	Notify switch device port of bridge port attributes
>>>>>>>>>> + */
>>>>>>>>>> +int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>>>> +									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>>>> +{
>>>>>>>>>> +	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>>>> +	struct net_device *lower_dev;
>>>>>>>>>> +	struct list_head *iter;
>>>>>>>>>> +	int ret = 0, err = 0;
>>>>>>>>>> +
>>>>>>>>>> +	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>>>> +		return err;
>>>>>>>>>> +
>>>>>>>>>> +	if (ops->ndo_bridge_setlink) {
>>>>>>>>>> +	    WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>>>> +	    return ops->ndo_bridge_setlink(dev, nlh, flags);
>>>>>>>>> 	You have to change ndo_bridge_setlink in netdevice.h first.
>>>>>>>>> 	Otherwise when only this patch is applied (during bisection)
>>>>>>>>> 	this won't compile.
>>>>>>>> ack, will fix it and keep that in mind next time.
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>>>> 	I do not understand why to iterate over lower devices. At this
>>>>>>>>> 	stage we don't know a thing about this upper or its lowers. Let
>>>>>>>>> 	the uppers (/masters) to decide if this needs to be propagated
>>>>>>>>> 	or not.
>>>>>>>> Jiri, In the stacked devices case, there is no way to propagate the bridge
>>>>>>>> port attributes to switch device driver today (vlan and other bridge port
>>>>>>>> attributes). Can you tell me if there is a way ?. no, ndo_vlan* ndo's are not
>>>>>>>> useful here. Nor we should go and implement ndo_bridge_setlink* in all
>>>>>>>> devices that can be bridge ports.
>>>>>>> Hmm. I just think that is cleaner to implement ndo_bridge_setlink in
>>>>>>> bonding for example and let it propagate the the call to slaves.
>>>>>> No, that will require bridge attribute support in all drivers. And that is no
>>>>>> good.
>>>>> Not all drivers, just all masters which want to support this. Like bond,
>>>>> team, macvlan etc. That would be the same as for
>>>>> ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid/ndo_change_mtu etc. I do not
>>>>> see any problem in that. It is much much clearer over big hammer iterate
>>>>> over lowers in my opinion.
>>>> You cannot avoid the lowerdev iteration in any case.
>>>> If you added it in the individual drivers: bond, macvlan and other drivers
>>>> will all have to do the same thing.
>>>> ie Call bridge setlink on lowerdevs.
>>> I feel that the right way is to let masters propagate that themselves in
>>> their code.
>> In this case no. Just because an interface is a port in a bridge, it is wrong
>> to indicate that the interface driver needs to understand all bridge port
>> configuration attributes. Note that with what you are asking for ...all
>> bridge port drivers (bonds, vxlans) will also need to implement the netdev
>> stp state update api.
> I'm very well aware of this fact. But still, I'm convinced that the way
> similar things are implemented now, using prapagation inside particular
> drivers (bond/team/etc) is the correct way to go. I do not see any
> downside in that. But we are running in circles here. I would love to
> hear opinion of other people here.

  I know you are aware of this fact ,...but just stating again here that 
we are also talking about l3 ops here.
Which don't make sense to be  in all drivers. Maybe i am being very 
conservative.

And yes, agree we are going in circles. :). and yes, would love to hear 
other opinions as well.


>
>
>>> That's it. I might be wrong of course.
>>
>>>> My patch avoids the need to modify these drivers. Besides it does this only
>>>> when the OFFLOAD flag is set.
>>> Yep, well in my reply to another patch of you series I expressed my
>>> feeling that the flag should be really checked in particular switch
>>> driver, not core. But I might be wrong there as well...
>> The bridge driver owns these attributes...and he needs to call the switchdev
>> api to offload.
>> And the condition for the switchdev api call is the offload flag. And the
>> offload flag is part of the switchdev api.
>> The drivers just set it on the netdev, they dont own the offload flag. So, I
>> don't see a reason why the core should not
>> know about the flag.
> I do not understand the formulation "own the offload flag". What I say
> is let the bridge/others to call the switchdev api unconditionally and
> let the leaf drivers handle that as they see fit, taking various facts
> into account, flags included. This way you avoid the need for flags
> inheritance in stacked scenarios. Imagine following example:
>
> bridge - bond --- eth1
>                --- eth2
>
> eth1 and eth2 are switch ports. Now eth1 has the flag set and eth2 does
> not. Should the bond have the flag set or not? And if it has, eth2 need
> to check the flag as well to do not offload.
>
> Implementing the inheritance correctly would be a small nightmare. So I
> say, why don't just let the leafs to check and decide.
Note that i wasn't really considering inheritance as the major factor in 
our argument.
If that needs to be dropped, i can consider that. But, the flag is there 
for a reason: To not unconditionally call lowerdev ndo's.
ie reduce some overhead.

And in the above case, the bond does get the flag..because it has a 
lowerdev with the feature flag.
And note that the inheritance does not come with a cost. This is all 
existing code. If i hadn't mentioned it, it would probably go unnoticed ;).

I had to just add the below patch. And existing calls to 
netdev_update_features in bridge ndo_add/del_slave takes care of 
inheriting the feature flags from the slaves.
again, i did not introduce any complexity here. The feature flag is just 
there to reduce some overhead in unnecessarily traversing all lowerdevs.


@@ -159,7 +161,8 @@ enum {
   */
  #define NETIF_F_ONE_FOR_ALL	(NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ROBUST | \
  				 NETIF_F_SG | NETIF_F_HIGHDMA |		\
-				 NETIF_F_FRAGLIST | NETIF_F_VLAN_CHALLENGED)
+				 NETIF_F_FRAGLIST | NETIF_F_VLAN_CHALLENGED | \
+				 NETIF_F_HW_NETFUNC_OFFLOAD)
  /*
   * If one device doesn't support one of these features, then disable it
   * for all in netdev_increment_features.




>
>
>> What has been accepted in the kernel currently does not help bridge driver
>> offloading to switchdev. It does help if you want to manage your switch
>> device separately like you were already doing with nics. ie going to switch
>> port driver directly. It does not help the stacked device case either.
>>
>>
>> I will resubmit my series with the checkpatch errors you pointed out.
>>
>> And, am also looking at other ways to solve the problem.
>>
>> Thanks for the review.
>>
>>
>>>> It will not stop at adding the ndo_bridge_setlink to bond/macvlan etc. It
>>>> will be all other ndo_ops we will need for switch asics.
>>>> It will be l3 tomorrow, if the route is through a bond (But at that point, we
>>>> may end up having to introduce switch device instead of going to the port.
>>>> Lets see).
>>>>
>>>> Today this patch introduces an abstract way to get to the switch driver by
>>>> getting to the slave switch port (And only when the OFFLOAD flag is set).
>>>>
>>>>
>>>>>>> Let every "upper" to handle ndo_bridge_setlink their way. Sometimes it
>>>>>>> might not make sense to propagate to "lowers".
>>>>>> This does not really propagate to lowers. It is just trying to get to a
>>>>>> switch port and from there to the switch driver.
>>>>>> Example, bond driver does not need to care if its a bridge port. It will
>>>>>> simply pass the call to its slave which
>>>>>> might be a switch port.
>>>>>>
>>>>>> bond driver does not care if its a bridge port. But the switch driver cares,
>>>>>> because it knows that the bond was created with switch ports.
>>>>>>
>>>>>>
>>>>>>>> And this allows a switch driver to receive these callbacks if it has marked
>>>>>>>> the switch port with an offload flag. Your way of using the switch port to
>>>>>>>> get to the switch driver does not help in these cases.
>>>>>>> I do not follow how this is related to this case (stacked layout).
>>>>>>>
>>>>>>>> The other option is to use the 'switch device (not port)' to get to the
>>>>>>>> switch driver.
>>>>>>> That would not help this case (stacked layout) I believe.
>>>>>>>
>>>>>>>
>>>>>>>> This patch shows that you can still do this with the ndo ops.
>>>>>>>>>> +		err = netdev_switch_port_bridge_setlink(lower_dev, nlh, flags);
>>>>>>>>>> +		if (err)
>>>>>>>>>> +			ret = err;
>>>>>>>>>> +    }
>>>>>>>>>   ^^^^^ Indent is off. This should be catched by scripts/checkpatch.pl.
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL(netdev_switch_port_bridge_setlink);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + *	netdev_switch_port_bridge_dellink - Notify switch device port of bridge
>>>>>>>>>> + *	attribute delete
>>>>>>>>>> + *
>>>>>>>>>> + *	@dev: port device
>>>>>>>>>> + *	@nlh: netlink msg with bridge port attributes
>>>>>>>>>> + *
>>>>>>>>>> + *	Notify switch device port of bridge port attribute delete
>>>>>>>>>> + */
>>>>>>>>>> +int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>>>> +									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>>>> +{
>>>>>>>>>> +	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>>>> +	struct net_device *lower_dev;
>>>>>>>>>> +	struct list_head *iter;
>>>>>>>>>> +	int ret = 0, err = 0;
>>>>>>>>>> +
>>>>>>>>>> +	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>>>> +		return err;
>>>>>>>>>> +
>>>>>>>>>> +	if (ops->ndo_bridge_dellink) {
>>>>>>>>>> +		WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>>>> +		return ops->ndo_bridge_dellink(dev, nlh, flags);
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>>>>> +		err = netdev_switch_port_bridge_dellink(lower_dev, nlh, flags);
>>>>>>>>>> +		if (err)
>>>>>>>>>> +			ret = err;
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL(netdev_switch_port_bridge_dellink);
>>>>>>>>>> -- 
>>>>>>>>>> 1.7.10.4
>>>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox