Netdev List

* Re: [PATCH net-next 0/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 18:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <549070C7.5070505@psc.edu>

My apologies about the odd formatting in the previous message. Not sure 
what happened with my MUA.

Chris

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: Alexei Starovoitov @ 2014-12-16 18:24 UTC (permalink / raw)
  To: rapier; +Cc: netdev

On Tue, Dec 16, 2014 at 9:50 AM, rapier <rapier@psc.edu> wrote:
> +struct idr tcp_estats_idr;
> +EXPORT_SYMBOL(tcp_estats_idr);
> +static int next_id = 1;
> +DEFINE_SPINLOCK(tcp_estats_idr_lock);
> +EXPORT_SYMBOL(tcp_estats_idr_lock);
> +
> +int tcp_estats_wq_enabled __read_mostly = 0;
> +EXPORT_SYMBOL(tcp_estats_wq_enabled);
> +struct workqueue_struct *tcp_estats_wq = NULL;
> +EXPORT_SYMBOL(tcp_estats_wq);
> +void (*create_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(create_notify_func);
> +void (*establish_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(establish_notify_func);
> +void (*destroy_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(destroy_notify_func);
> +unsigned long persist_delay = 0;
> +EXPORT_SYMBOL(persist_delay);
> +
> +struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
> +EXPORT_SYMBOL(tcp_estats_enabled);
...
> +EXPORT_SYMBOL(tcp_estats_create);
...
> +/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu.
> */
> +void tcp_estats_free(struct rcu_head *rcu)
...
> +EXPORT_SYMBOL(tcp_estats_free);

IMO that is a very questionable design choice.
Why export so many in-kernel bits for use by an out-of-tree kernel module?

^ permalink raw reply

* How to fix CHECK warning: testing a 'safe expression'
From: Murali Karicheri @ 2014-12-16 18:23 UTC (permalink / raw)
  To: netdev

netdev maintainers,

I received a review comment asking me to address a CHECK warning, and I am
wondering how to fix "warning: testing a 'safe expression'", which appears
when using
IS_ERR_OR_NULL(foo)

where foo is defined as

struct foo_type *foo;

foo is only ever assigned NULL or ERR_PTR(error code), so I believe the
usage is correct. But then how do I make CHECK happy with this usage?

I grepped for the current uses of IS_ERR_OR_NULL() and found that 276 of
them cause this warning in the v3.18 version of the kernel that I am using:

$ grep -r "warning: testing a 'safe expression" * | wc -l
276

1) Can someone explain what this warning means?

2) Is it acceptable to post patches to netdev list with this warning?

3) If not, how is this expected to be fixed? An example of how to fix
this warning would be helpful.

Thanks in advance for
-- 
Murali Karicheri
Linux Kernel, Texas Instruments

^ permalink raw reply
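
For readers unfamiliar with the convention behind IS_ERR_OR_NULL(): the kernel encodes small negative errno values in the top 4095 values of the pointer range, so one pointer can carry a valid object, NULL, or an error code. The block below is a user-space sketch of that convention only. The helper names (err_ptr, ptr_err, is_err, is_err_or_null) and lookup_foo() are hypothetical stand-ins, not the kernel's header, and the sketch does not reproduce or explain the CHECK warning asked about above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for ERR_PTR()/PTR_ERR()/IS_ERR(); names are hypothetical. */
#define MAX_ERRNO 4095

struct foo_type { int value; };

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True when ptr lies in the top MAX_ERRNO values of the address range. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool is_err_or_null(const void *ptr)
{
	return !ptr || is_err(ptr);
}

/* Hypothetical lookup: error pointer for key < 0, NULL for 0, object otherwise. */
static struct foo_type *lookup_foo(int key)
{
	static struct foo_type the_foo = { 42 };

	if (key < 0)
		return err_ptr(-EINVAL);
	if (key == 0)
		return NULL;
	return &the_foo;
}
```

A caller would test the result with is_err_or_null() before dereferencing it, and use ptr_err() only after is_err() has returned true.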

* pull request: wireless 2014-12-16
From: John W. Linville @ 2014-12-16 18:16 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 28116 bytes --]

Dave,

Please pull this batch of fixes intended for the 3.19 stream!

For the Bluetooth bits, Johan says:

"The patches consist of:

 - Coccinelle warning fix
 - hci_dev_lock/unlock fixes
 - Fixes for pending mgmt command handling
 - Fixes for properly following the force_lesc_support switch
 - Fix for a Microsoft branded Broadcom adapter
 - New device id for Atheros AR3012
 - Fix for BR/EDR Secure Connections enabling"

Along with that...

Brian Norris avoids leaking some kernel memory contents via printk in brcmsmac.

Julia Lawall corrects some misspellings in a few drivers.

Larry Finger gives us one more rtlwifi fix to correct a porting oversight.

Wei Yongjun fixes a sparse warning in rtlwifi.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit 67e2c3883828b39548cee2091b36656787775d95:

  Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security (2014-12-14 20:36:37 -0800)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless.git tags/master-2014-12-15

for you to fetch changes up to 9a1dce3a059111a7289680f4b8c0ec4f8736b6ee:

  rtlwifi: rtl8192ce: Set fw_ready flag (2014-12-15 13:46:20 -0500)

----------------------------------------------------------------
Brian Norris (1):
      brcmsmac: don't leak kernel memory via printk()

Fengguang Wu (1):
      Bluetooth: fix err_cast.cocci warnings

Jaganath Kanakkassery (2):
      Bluetooth: Fix missing hci_dev_lock/unlock in mgmt req_complete()
      Bluetooth: Fix missing hci_dev_lock/unlock in hci_event

Janne Heikkinen (1):
      Bluetooth: Add USB device 04ca:3010 as Atheros AR3012

Johan Hedberg (5):
      Bluetooth: Fix calling hci_conn_put too early
      Bluetooth: Fix incorrect pending cmd removal in pairing_complete()
      Bluetooth: Fix notifying mgmt power off before flushing connection list
      Bluetooth: Fix enabling BR/EDR SC when powering on
      Bluetooth: Fix mgmt response status when removing adapter

John W. Linville (1):
      Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth-next

Julia Lawall (3):
      zd1211rw: fix misspelling of current function in string
      hostap_cs: fix misspelling of current function in string
      rtlwifi: rtl8821ae: fix misspelling of current function in string

Larry Finger (1):
      rtlwifi: rtl8192ce: Set fw_ready flag

Marcel Holtmann (4):
      Bluetooth: Check for force_lesc_support when enabling SMP over BR/EDR
      Bluetooth: Check for force_lesc_support before rejecting SMP over BR/EDR
      Bluetooth: Fix generation of non-resolvable private addresses
      Bluetooth: Fix check for support for page scan related commands

Wei Yongjun (1):
      rtlwifi: rtl8192cu: Fix sparse non static symbol warning

 drivers/bluetooth/ath3k.c                      |  2 +
 drivers/bluetooth/btusb.c                      |  1 +
 drivers/net/wireless/brcm80211/brcmsmac/main.c |  2 +-
 drivers/net/wireless/hostap/hostap_cs.c        | 15 ++---
 drivers/net/wireless/rtlwifi/rtl8192ce/hw.c    |  2 +
 drivers/net/wireless/rtlwifi/rtl8192cu/hw.c    |  2 +-
 drivers/net/wireless/rtlwifi/rtl8821ae/dm.c    | 11 ++--
 drivers/net/wireless/zd1211rw/zd_chip.c        |  6 +-
 net/bluetooth/hci_conn.c                       |  2 +-
 net/bluetooth/hci_core.c                       | 60 ++++++++++--------
 net/bluetooth/hci_event.c                      | 20 ++++++
 net/bluetooth/l2cap_core.c                     |  5 +-
 net/bluetooth/mgmt.c                           | 85 ++++++++++++++++++--------
 net/bluetooth/smp.c                            |  5 +-
 14 files changed, 143 insertions(+), 75 deletions(-)

diff --git a/drivers/bluetooth/ath3k.c b/drivers/bluetooth/ath3k.c
index fce758896280..1ee27ac18de0 100644
--- a/drivers/bluetooth/ath3k.c
+++ b/drivers/bluetooth/ath3k.c
@@ -87,6 +87,7 @@ static const struct usb_device_id ath3k_table[] = {
 	{ USB_DEVICE(0x04CA, 0x3007) },
 	{ USB_DEVICE(0x04CA, 0x3008) },
 	{ USB_DEVICE(0x04CA, 0x300b) },
+	{ USB_DEVICE(0x04CA, 0x3010) },
 	{ USB_DEVICE(0x0930, 0x0219) },
 	{ USB_DEVICE(0x0930, 0x0220) },
 	{ USB_DEVICE(0x0930, 0x0227) },
@@ -140,6 +141,7 @@ static const struct usb_device_id ath3k_blist_tbl[] = {
 	{ USB_DEVICE(0x04ca, 0x3007), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x3008), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x300b), .driver_info = BTUSB_ATH3012 },
+	{ USB_DEVICE(0x04ca, 0x3010), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0219), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0220), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0227), .driver_info = BTUSB_ATH3012 },
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index 31dd24ac9926..19cf2cf22e87 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -167,6 +167,7 @@ static const struct usb_device_id blacklist_table[] = {
 	{ USB_DEVICE(0x04ca, 0x3007), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x3008), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x300b), .driver_info = BTUSB_ATH3012 },
+	{ USB_DEVICE(0x04ca, 0x3010), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0219), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0220), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0227), .driver_info = BTUSB_ATH3012 },
diff --git a/drivers/net/wireless/brcm80211/brcmsmac/main.c b/drivers/net/wireless/brcm80211/brcmsmac/main.c
index a104d7ac3796..eb8584a9c49a 100644
--- a/drivers/net/wireless/brcm80211/brcmsmac/main.c
+++ b/drivers/net/wireless/brcm80211/brcmsmac/main.c
@@ -316,7 +316,7 @@ static const u16 xmtfifo_sz[][NFIFO] = {
 static const char * const fifo_names[] = {
 	"AC_BK", "AC_BE", "AC_VI", "AC_VO", "BCMC", "ATIM" };
 #else
-static const char fifo_names[6][0];
+static const char fifo_names[6][1];
 #endif
 
 #ifdef DEBUG
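
The brcmsmac hunk above changes the non-debug fifo_names from zero-length rows to one-byte rows. A minimal user-space sketch of why one-byte rows are safe to print follows. The assumption here, taken from the commit title rather than from the driver code, is that the names can be formatted as strings and that a zero-length row gives such a read nothing valid to stop on. total_name_length() is a hypothetical helper, and the explicit initialiser is added only to make the zero fill visible.

```c
#include <stddef.h>
#include <string.h>

#define NFIFO 6

/* One byte per row, zero-filled: every row is a valid, NUL-terminated
 * empty string, so reading it as a string stays inside the array. */
static const char fifo_names[NFIFO][1] = { { 0 } };

/* Hypothetical helper: combined length of all names, 0 when all are "". */
static size_t total_name_length(void)
{
	size_t total = 0;

	for (size_t i = 0; i < NFIFO; i++)
		total += strlen(fifo_names[i]);
	return total;
}
```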
diff --git a/drivers/net/wireless/hostap/hostap_cs.c b/drivers/net/wireless/hostap/hostap_cs.c
index b6ec51923b20..50033aa7c7d5 100644
--- a/drivers/net/wireless/hostap/hostap_cs.c
+++ b/drivers/net/wireless/hostap/hostap_cs.c
@@ -381,18 +381,15 @@ static void prism2_pccard_genesis_reset(local_info_t *local, int hcr)
 
 	res = pcmcia_read_config_byte(hw_priv->link, CISREG_COR, &old_cor);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 1 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 1 (%d)\n", __func__, res);
 		return;
 	}
-	printk(KERN_DEBUG "prism2_pccard_genesis_sreset: original COR %02x\n",
-		old_cor);
+	printk(KERN_DEBUG "%s: original COR %02x\n", __func__, old_cor);
 
 	res = pcmcia_write_config_byte(hw_priv->link, CISREG_COR,
 				old_cor | COR_SOFT_RESET);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 2 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 2 (%d)\n", __func__, res);
 		return;
 	}
 
@@ -401,8 +398,7 @@ static void prism2_pccard_genesis_reset(local_info_t *local, int hcr)
 	/* Setup Genesis mode */
 	res = pcmcia_write_config_byte(hw_priv->link, CISREG_CCSR, hcr);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 3 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 3 (%d)\n", __func__, res);
 		return;
 	}
 	mdelay(10);
@@ -410,8 +406,7 @@ static void prism2_pccard_genesis_reset(local_info_t *local, int hcr)
 	res = pcmcia_write_config_byte(hw_priv->link, CISREG_COR,
 				old_cor & ~COR_SOFT_RESET);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 4 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 4 (%d)\n", __func__, res);
 		return;
 	}
 
diff --git a/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c b/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c
index d2ec5160bbf0..5c646d5f7bb8 100644
--- a/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c
+++ b/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c
@@ -955,6 +955,7 @@ int rtl92ce_hw_init(struct ieee80211_hw *hw)
 	local_save_flags(flags);
 	local_irq_enable();
 
+	rtlhal->fw_ready = false;
 	rtlpriv->intf_ops->disable_aspm(hw);
 	rtstatus = _rtl92ce_init_mac(hw);
 	if (!rtstatus) {
@@ -971,6 +972,7 @@ int rtl92ce_hw_init(struct ieee80211_hw *hw)
 		goto exit;
 	}
 
+	rtlhal->fw_ready = true;
 	rtlhal->last_hmeboxnum = 0;
 	rtl92c_phy_mac_config(hw);
 	/* because last function modify RCR, so we update
diff --git a/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c b/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c
index 873363acbacf..551321728ae0 100644
--- a/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c
+++ b/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c
@@ -1592,7 +1592,7 @@ void rtl92cu_get_hw_reg(struct ieee80211_hw *hw, u8 variable, u8 *val)
 	}
 }
 
-bool usb_cmd_send_packet(struct ieee80211_hw *hw, struct sk_buff *skb)
+static bool usb_cmd_send_packet(struct ieee80211_hw *hw, struct sk_buff *skb)
 {
   /* Currently nothing happens here.
    * Traffic stops after some seconds in WPA2 802.11n mode.
diff --git a/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c b/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c
index 9be106109921..ba30b0d250fd 100644
--- a/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c
+++ b/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c
@@ -2078,8 +2078,7 @@ void rtl8821ae_dm_txpwr_track_set_pwr(struct ieee80211_hw *hw,
 	if (rtldm->tx_rate != 0xFF)
 		tx_rate = rtl8821ae_hw_rate_to_mrate(hw, rtldm->tx_rate);
 
-	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-		 "===>rtl8812ae_dm_txpwr_track_set_pwr\n");
+	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD, "===>%s\n", __func__);
 
 	if (tx_rate != 0xFF) { /* Mimic Modify High Rate BBSwing Limit.*/
 		/*CCK*/
@@ -2128,7 +2127,7 @@ void rtl8821ae_dm_txpwr_track_set_pwr(struct ieee80211_hw *hw,
 
 	if (method == BBSWING) {
 		RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-			 "===>rtl8812ae_dm_txpwr_track_set_pwr\n");
+			 "===>%s\n", __func__);
 		if (rf_path == RF90_PATH_A) {
 			final_swing_idx[RF90_PATH_A] =
 				(rtldm->ofdm_index[RF90_PATH_A] >
@@ -2260,7 +2259,8 @@ void rtl8821ae_dm_txpower_tracking_callback_thermalmeter(
 	rtldm->txpower_trackinginit = true;
 
 	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-		 "===>rtl8812ae_dm_txpower_tracking_callback_thermalmeter,\n pDM_Odm->BbSwingIdxCckBase: %d,pDM_Odm->BbSwingIdxOfdmBase[A]:%d, pDM_Odm->DefaultOfdmIndex: %d\n",
+		 "===>%s,\n pDM_Odm->BbSwingIdxCckBase: %d,pDM_Odm->BbSwingIdxOfdmBase[A]:%d, pDM_Odm->DefaultOfdmIndex: %d\n",
+		 __func__,
 		 rtldm->swing_idx_cck_base,
 		 rtldm->swing_idx_ofdm_base[RF90_PATH_A],
 		 rtldm->default_ofdm_index);
@@ -2539,8 +2539,7 @@ void rtl8821ae_dm_txpower_tracking_callback_thermalmeter(
 		}
 	}
 
-	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-		 "<===rtl8812ae_dm_txpower_tracking_callback_thermalmeter\n");
+	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD, "<===%s\n", __func__);
 }
 
 void rtl8821ae_dm_check_txpower_tracking_thermalmeter(struct ieee80211_hw *hw)
diff --git a/drivers/net/wireless/zd1211rw/zd_chip.c b/drivers/net/wireless/zd1211rw/zd_chip.c
index 73a49b868035..07b94eda9604 100644
--- a/drivers/net/wireless/zd1211rw/zd_chip.c
+++ b/drivers/net/wireless/zd1211rw/zd_chip.c
@@ -129,7 +129,7 @@ int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, const zd_addr_t *addr
 	r = zd_ioread16v_locked(chip, v16, a16, count16);
 	if (r) {
 		dev_dbg_f(zd_chip_dev(chip),
-			  "error: zd_ioread16v_locked. Error number %d\n", r);
+			  "error: %s. Error number %d\n", __func__, r);
 		return r;
 	}
 
@@ -256,8 +256,8 @@ int zd_iowrite32a_locked(struct zd_chip *chip,
 		if (r) {
 			zd_usb_iowrite16v_async_end(&chip->usb, 0);
 			dev_dbg_f(zd_chip_dev(chip),
-				"error _zd_iowrite32v_locked."
-				" Error number %d\n", r);
+				"error _%s. Error number %d\n", __func__,
+				r);
 			return r;
 		}
 	}
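
The misspelling fixes in hostap_cs, rtl8821ae and zd1211rw above all replace a hard-coded function name in a format string with __func__. A small user-space sketch of the idea, with snprintf() standing in for printk() and a hypothetical function name:

```c
#include <stdio.h>
#include <string.h>

/* Builds a message the way the patched calls do. Because the name comes
 * from __func__, it cannot drift from the real function name when the
 * function is renamed or the line is copied elsewhere. */
static int genesis_reset_demo(char *buf, size_t len, int step, int res)
{
	return snprintf(buf, len, "%s failed %d (%d)\n", __func__, step, res);
}
```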
diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
index 79d84b88b8f0..fe18825cc8a4 100644
--- a/net/bluetooth/hci_conn.c
+++ b/net/bluetooth/hci_conn.c
@@ -661,7 +661,7 @@ static void hci_req_add_le_create_conn(struct hci_request *req,
 	memset(&cp, 0, sizeof(cp));
 
 	/* Update random address, but set require_privacy to false so
-	 * that we never connect with an unresolvable address.
+	 * that we never connect with an non-resolvable address.
 	 */
 	if (hci_update_random_address(req, false, &own_addr_type))
 		return;
diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 93f92a085506..5dcacf9607e4 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -1373,8 +1373,6 @@ static void hci_init1_req(struct hci_request *req, unsigned long opt)
 
 static void bredr_setup(struct hci_request *req)
 {
-	struct hci_dev *hdev = req->hdev;
-
 	__le16 param;
 	__u8 flt_type;
 
@@ -1403,14 +1401,6 @@ static void bredr_setup(struct hci_request *req)
 	/* Connection accept timeout ~20 secs */
 	param = cpu_to_le16(0x7d00);
 	hci_req_add(req, HCI_OP_WRITE_CA_TIMEOUT, 2, &param);
-
-	/* AVM Berlin (31), aka "BlueFRITZ!", reports version 1.2,
-	 * but it does not support page scan related HCI commands.
-	 */
-	if (hdev->manufacturer != 31 && hdev->hci_ver > BLUETOOTH_VER_1_1) {
-		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_ACTIVITY, 0, NULL);
-		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_TYPE, 0, NULL);
-	}
 }
 
 static void le_setup(struct hci_request *req)
@@ -1718,6 +1708,16 @@ static void hci_init3_req(struct hci_request *req, unsigned long opt)
 	if (hdev->commands[5] & 0x10)
 		hci_setup_link_policy(req);
 
+	if (hdev->commands[8] & 0x01)
+		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_ACTIVITY, 0, NULL);
+
+	/* Some older Broadcom based Bluetooth 1.2 controllers do not
+	 * support the Read Page Scan Type command. Check support for
+	 * this command in the bit mask of supported commands.
+	 */
+	if (hdev->commands[13] & 0x01)
+		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_TYPE, 0, NULL);
+
 	if (lmp_le_capable(hdev)) {
 		u8 events[8];
 
@@ -2634,6 +2634,12 @@ static int hci_dev_do_close(struct hci_dev *hdev)
 	drain_workqueue(hdev->workqueue);
 
 	hci_dev_lock(hdev);
+
+	if (!test_and_clear_bit(HCI_AUTO_OFF, &hdev->dev_flags)) {
+		if (hdev->dev_type == HCI_BREDR)
+			mgmt_powered(hdev, 0);
+	}
+
 	hci_inquiry_cache_flush(hdev);
 	hci_pend_le_actions_clear(hdev);
 	hci_conn_hash_flush(hdev);
@@ -2681,14 +2687,6 @@ static int hci_dev_do_close(struct hci_dev *hdev)
 	hdev->flags &= BIT(HCI_RAW);
 	hdev->dev_flags &= ~HCI_PERSISTENT_MASK;
 
-	if (!test_and_clear_bit(HCI_AUTO_OFF, &hdev->dev_flags)) {
-		if (hdev->dev_type == HCI_BREDR) {
-			hci_dev_lock(hdev);
-			mgmt_powered(hdev, 0);
-			hci_dev_unlock(hdev);
-		}
-	}
-
 	/* Controller radio is available but is currently powered down */
 	hdev->amp_status = AMP_STATUS_POWERED_DOWN;
 
@@ -3083,7 +3081,9 @@ static void hci_power_on(struct work_struct *work)
 
 	err = hci_dev_do_open(hdev);
 	if (err < 0) {
+		hci_dev_lock(hdev);
 		mgmt_set_powered_failed(hdev, err);
+		hci_dev_unlock(hdev);
 		return;
 	}
 
@@ -3959,17 +3959,29 @@ int hci_update_random_address(struct hci_request *req, bool require_privacy,
 	}
 
 	/* In case of required privacy without resolvable private address,
-	 * use an unresolvable private address. This is useful for active
+	 * use an non-resolvable private address. This is useful for active
 	 * scanning and non-connectable advertising.
 	 */
 	if (require_privacy) {
-		bdaddr_t urpa;
+		bdaddr_t nrpa;
+
+		while (true) {
+			/* The non-resolvable private address is generated
+			 * from random six bytes with the two most significant
+			 * bits cleared.
+			 */
+			get_random_bytes(&nrpa, 6);
+			nrpa.b[5] &= 0x3f;
 
-		get_random_bytes(&urpa, 6);
-		urpa.b[5] &= 0x3f;	/* Clear two most significant bits */
+			/* The non-resolvable private address shall not be
+			 * equal to the public address.
+			 */
+			if (bacmp(&hdev->bdaddr, &nrpa))
+				break;
+		}
 
 		*own_addr_type = ADDR_LE_DEV_RANDOM;
-		set_random_addr(req, &urpa);
+		set_random_addr(req, &nrpa);
 		return 0;
 	}
 
@@ -5625,7 +5637,7 @@ void hci_req_add_le_passive_scan(struct hci_request *req)
 	u8 filter_policy;
 
 	/* Set require_privacy to false since no SCAN_REQ are send
-	 * during passive scanning. Not using an unresolvable address
+	 * during passive scanning. Not using an non-resolvable address
 	 * here is important so that peer devices using direct
 	 * advertising with our address will be correctly reported
 	 * by the controller.
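
The non-resolvable private address loop in the hci_core.c hunk above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: rand() stands in for get_random_bytes() and is not suitable for real address generation, memcmp() stands in for bacmp(), and fill_random() and generate_nrpa() are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint8_t b[6]; } bdaddr_t;

/* Assumption: rand() replaces get_random_bytes() for illustration only. */
static void fill_random(uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (uint8_t)(rand() & 0xff);
}

/* Generate a non-resolvable private address that differs from public_addr. */
static void generate_nrpa(const bdaddr_t *public_addr, bdaddr_t *nrpa)
{
	for (;;) {
		/* Six random bytes with the two most significant bits cleared. */
		fill_random(nrpa->b, sizeof(nrpa->b));
		nrpa->b[5] &= 0x3f;

		/* Retry in the unlikely case it equals the public address. */
		if (memcmp(public_addr->b, nrpa->b, sizeof(nrpa->b)) != 0)
			break;
	}
}
```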
diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
index 322abbbbcef9..39a5c8a01726 100644
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -257,6 +257,8 @@ static void hci_cc_write_auth_enable(struct hci_dev *hdev, struct sk_buff *skb)
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (!status) {
 		__u8 param = *((__u8 *) sent);
 
@@ -268,6 +270,8 @@ static void hci_cc_write_auth_enable(struct hci_dev *hdev, struct sk_buff *skb)
 
 	if (test_bit(HCI_MGMT, &hdev->dev_flags))
 		mgmt_auth_enable_complete(hdev, status);
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_write_encrypt_mode(struct hci_dev *hdev, struct sk_buff *skb)
@@ -443,6 +447,8 @@ static void hci_cc_write_ssp_mode(struct hci_dev *hdev, struct sk_buff *skb)
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (!status) {
 		if (sent->mode)
 			hdev->features[1][0] |= LMP_HOST_SSP;
@@ -458,6 +464,8 @@ static void hci_cc_write_ssp_mode(struct hci_dev *hdev, struct sk_buff *skb)
 		else
 			clear_bit(HCI_SSP_ENABLED, &hdev->dev_flags);
 	}
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_write_sc_support(struct hci_dev *hdev, struct sk_buff *skb)
@@ -471,6 +479,8 @@ static void hci_cc_write_sc_support(struct hci_dev *hdev, struct sk_buff *skb)
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (!status) {
 		if (sent->support)
 			hdev->features[1][0] |= LMP_HOST_SC;
@@ -486,6 +496,8 @@ static void hci_cc_write_sc_support(struct hci_dev *hdev, struct sk_buff *skb)
 		else
 			clear_bit(HCI_SC_ENABLED, &hdev->dev_flags);
 	}
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_read_local_version(struct hci_dev *hdev, struct sk_buff *skb)
@@ -1135,6 +1147,8 @@ static void hci_cc_le_set_scan_enable(struct hci_dev *hdev,
 	if (!cp)
 		return;
 
+	hci_dev_lock(hdev);
+
 	switch (cp->enable) {
 	case LE_SCAN_ENABLE:
 		set_bit(HCI_LE_SCAN, &hdev->dev_flags);
@@ -1184,6 +1198,8 @@ static void hci_cc_le_set_scan_enable(struct hci_dev *hdev,
 		BT_ERR("Used reserved LE_Scan_Enable param %d", cp->enable);
 		break;
 	}
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_le_read_white_list_size(struct hci_dev *hdev,
@@ -1278,6 +1294,8 @@ static void hci_cc_write_le_host_supported(struct hci_dev *hdev,
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (sent->le) {
 		hdev->features[1][0] |= LMP_HOST_LE;
 		set_bit(HCI_LE_ENABLED, &hdev->dev_flags);
@@ -1291,6 +1309,8 @@ static void hci_cc_write_le_host_supported(struct hci_dev *hdev,
 		hdev->features[1][0] |= LMP_HOST_LE_BREDR;
 	else
 		hdev->features[1][0] &= ~LMP_HOST_LE_BREDR;
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_set_adv_param(struct hci_dev *hdev, struct sk_buff *skb)
diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
index a2b6dfa38a0c..d04dc0095736 100644
--- a/net/bluetooth/l2cap_core.c
+++ b/net/bluetooth/l2cap_core.c
@@ -6966,8 +6966,9 @@ static struct l2cap_conn *l2cap_conn_add(struct hci_conn *hcon)
 	    test_bit(HCI_HS_ENABLED, &hcon->hdev->dev_flags))
 		conn->local_fixed_chan |= L2CAP_FC_A2MP;
 
-	if (bredr_sc_enabled(hcon->hdev) &&
-	    test_bit(HCI_LE_ENABLED, &hcon->hdev->dev_flags))
+	if (test_bit(HCI_LE_ENABLED, &hcon->hdev->dev_flags) &&
+	    (bredr_sc_enabled(hcon->hdev) ||
+	     test_bit(HCI_FORCE_LESC, &hcon->hdev->dbg_flags)))
 		conn->local_fixed_chan |= L2CAP_FC_SMP_BREDR;
 
 	mutex_init(&conn->ident_lock);
diff --git a/net/bluetooth/mgmt.c b/net/bluetooth/mgmt.c
index 7384f1161336..06c2e652e4b6 100644
--- a/net/bluetooth/mgmt.c
+++ b/net/bluetooth/mgmt.c
@@ -2199,12 +2199,14 @@ static void le_enable_complete(struct hci_dev *hdev, u8 status)
 {
 	struct cmd_lookup match = { NULL, hdev };
 
+	hci_dev_lock(hdev);
+
 	if (status) {
 		u8 mgmt_err = mgmt_status(status);
 
 		mgmt_pending_foreach(MGMT_OP_SET_LE, hdev, cmd_status_rsp,
 				     &mgmt_err);
-		return;
+		goto unlock;
 	}
 
 	mgmt_pending_foreach(MGMT_OP_SET_LE, hdev, settings_rsp, &match);
@@ -2222,17 +2224,16 @@ static void le_enable_complete(struct hci_dev *hdev, u8 status)
 	if (test_bit(HCI_LE_ENABLED, &hdev->dev_flags)) {
 		struct hci_request req;
 
-		hci_dev_lock(hdev);
-
 		hci_req_init(&req, hdev);
 		update_adv_data(&req);
 		update_scan_rsp_data(&req);
 		hci_req_run(&req, NULL);
 
 		hci_update_background_scan(hdev);
-
-		hci_dev_unlock(hdev);
 	}
+
+unlock:
+	hci_dev_unlock(hdev);
 }
 
 static int set_le(struct sock *sk, struct hci_dev *hdev, void *data, u16 len)
@@ -3114,14 +3115,13 @@ static void pairing_complete(struct pending_cmd *cmd, u8 status)
 	conn->disconn_cfm_cb = NULL;
 
 	hci_conn_drop(conn);
-	hci_conn_put(conn);
-
-	mgmt_pending_remove(cmd);
 
 	/* The device is paired so there is no need to remove
 	 * its connection parameters anymore.
 	 */
 	clear_bit(HCI_CONN_PARAM_REMOVAL_PEND, &conn->flags);
+
+	hci_conn_put(conn);
 }
 
 void mgmt_smp_complete(struct hci_conn *conn, bool complete)
@@ -3130,8 +3130,10 @@ void mgmt_smp_complete(struct hci_conn *conn, bool complete)
 	struct pending_cmd *cmd;
 
 	cmd = find_pairing(conn);
-	if (cmd)
+	if (cmd) {
 		cmd->cmd_complete(cmd, status);
+		mgmt_pending_remove(cmd);
+	}
 }
 
 static void pairing_complete_cb(struct hci_conn *conn, u8 status)
@@ -3141,10 +3143,13 @@ static void pairing_complete_cb(struct hci_conn *conn, u8 status)
 	BT_DBG("status %u", status);
 
 	cmd = find_pairing(conn);
-	if (!cmd)
+	if (!cmd) {
 		BT_DBG("Unable to find a pending command");
-	else
-		cmd->cmd_complete(cmd, mgmt_status(status));
+		return;
+	}
+
+	cmd->cmd_complete(cmd, mgmt_status(status));
+	mgmt_pending_remove(cmd);
 }
 
 static void le_pairing_complete_cb(struct hci_conn *conn, u8 status)
@@ -3157,10 +3162,13 @@ static void le_pairing_complete_cb(struct hci_conn *conn, u8 status)
 		return;
 
 	cmd = find_pairing(conn);
-	if (!cmd)
+	if (!cmd) {
 		BT_DBG("Unable to find a pending command");
-	else
-		cmd->cmd_complete(cmd, mgmt_status(status));
+		return;
+	}
+
+	cmd->cmd_complete(cmd, mgmt_status(status));
+	mgmt_pending_remove(cmd);
 }
 
 static int pair_device(struct sock *sk, struct hci_dev *hdev, void *data,
@@ -3274,8 +3282,10 @@ static int pair_device(struct sock *sk, struct hci_dev *hdev, void *data,
 	cmd->user_data = hci_conn_get(conn);
 
 	if ((conn->state == BT_CONNECTED || conn->state == BT_CONFIG) &&
-	    hci_conn_security(conn, sec_level, auth_type, true))
-		pairing_complete(cmd, 0);
+	    hci_conn_security(conn, sec_level, auth_type, true)) {
+		cmd->cmd_complete(cmd, 0);
+		mgmt_pending_remove(cmd);
+	}
 
 	err = 0;
 
@@ -3317,7 +3327,8 @@ static int cancel_pair_device(struct sock *sk, struct hci_dev *hdev, void *data,
 		goto unlock;
 	}
 
-	pairing_complete(cmd, MGMT_STATUS_CANCELLED);
+	cmd->cmd_complete(cmd, MGMT_STATUS_CANCELLED);
+	mgmt_pending_remove(cmd);
 
 	err = cmd_complete(sk, hdev->id, MGMT_OP_CANCEL_PAIR_DEVICE, 0,
 			   addr, sizeof(*addr));
@@ -3791,7 +3802,7 @@ static bool trigger_discovery(struct hci_request *req, u8 *status)
 
 		/* All active scans will be done with either a resolvable
 		 * private address (when privacy feature has been enabled)
-		 * or unresolvable private address.
+		 * or non-resolvable private address.
 		 */
 		err = hci_update_random_address(req, true, &own_addr_type);
 		if (err < 0) {
@@ -4279,12 +4290,14 @@ static void set_advertising_complete(struct hci_dev *hdev, u8 status)
 {
 	struct cmd_lookup match = { NULL, hdev };
 
+	hci_dev_lock(hdev);
+
 	if (status) {
 		u8 mgmt_err = mgmt_status(status);
 
 		mgmt_pending_foreach(MGMT_OP_SET_ADVERTISING, hdev,
 				     cmd_status_rsp, &mgmt_err);
-		return;
+		goto unlock;
 	}
 
 	if (test_bit(HCI_LE_ADV, &hdev->dev_flags))
@@ -4299,6 +4312,9 @@ static void set_advertising_complete(struct hci_dev *hdev, u8 status)
 
 	if (match.sk)
 		sock_put(match.sk);
+
+unlock:
+	hci_dev_unlock(hdev);
 }
 
 static int set_advertising(struct sock *sk, struct hci_dev *hdev, void *data,
@@ -6081,6 +6097,11 @@ static int powered_update_hci(struct hci_dev *hdev)
 		hci_req_add(&req, HCI_OP_WRITE_SSP_MODE, 1, &ssp);
 	}
 
+	if (bredr_sc_enabled(hdev) && !lmp_host_sc_capable(hdev)) {
+		u8 sc = 0x01;
+		hci_req_add(&req, HCI_OP_WRITE_SC_SUPPORT, sizeof(sc), &sc);
+	}
+
 	if (test_bit(HCI_LE_ENABLED, &hdev->dev_flags) &&
 	    lmp_bredr_capable(hdev)) {
 		struct hci_cp_write_le_host_supported cp;
@@ -6130,8 +6151,7 @@ static int powered_update_hci(struct hci_dev *hdev)
 int mgmt_powered(struct hci_dev *hdev, u8 powered)
 {
 	struct cmd_lookup match = { NULL, hdev };
-	u8 status_not_powered = MGMT_STATUS_NOT_POWERED;
-	u8 zero_cod[] = { 0, 0, 0 };
+	u8 status, zero_cod[] = { 0, 0, 0 };
 	int err;
 
 	if (!test_bit(HCI_MGMT, &hdev->dev_flags))
@@ -6147,7 +6167,20 @@ int mgmt_powered(struct hci_dev *hdev, u8 powered)
 	}
 
 	mgmt_pending_foreach(MGMT_OP_SET_POWERED, hdev, settings_rsp, &match);
-	mgmt_pending_foreach(0, hdev, cmd_complete_rsp, &status_not_powered);
+
+	/* If the power off is because of hdev unregistration let
+	 * use the appropriate INVALID_INDEX status. Otherwise use
+	 * NOT_POWERED. We cover both scenarios here since later in
+	 * mgmt_index_removed() any hci_conn callbacks will have already
+	 * been triggered, potentially causing misleading DISCONNECTED
+	 * status responses.
+	 */
+	if (test_bit(HCI_UNREGISTER, &hdev->dev_flags))
+		status = MGMT_STATUS_INVALID_INDEX;
+	else
+		status = MGMT_STATUS_NOT_POWERED;
+
+	mgmt_pending_foreach(0, hdev, cmd_complete_rsp, &status);
 
 	if (memcmp(hdev->dev_class, zero_cod, sizeof(zero_cod)) != 0)
 		mgmt_event(MGMT_EV_CLASS_OF_DEV_CHANGED, hdev,
@@ -6681,8 +6714,10 @@ void mgmt_auth_failed(struct hci_conn *conn, u8 hci_status)
 	mgmt_event(MGMT_EV_AUTH_FAILED, conn->hdev, &ev, sizeof(ev),
 		    cmd ? cmd->sk : NULL);
 
-	if (cmd)
-		pairing_complete(cmd, status);
+	if (cmd) {
+		cmd->cmd_complete(cmd, status);
+		mgmt_pending_remove(cmd);
+	}
 }
 
 void mgmt_auth_enable_complete(struct hci_dev *hdev, u8 status)
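
Several mgmt.c hunks above move hci_dev_lock() to the top of a completion handler and turn the early return into a goto unlock, so that every exit path releases the lock. A user-space sketch of that single-exit shape follows. struct demo_dev and the demo_* functions are hypothetical stand-ins; the lock is modelled as a depth counter so that an unbalanced path is easy to detect.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a device guarded by hci_dev_lock()/unlock(). */
struct demo_dev {
	int lock_depth;		/* 0 means unlocked */
	bool le_enabled;
	int updates;		/* counts work done under the lock */
};

static void demo_lock(struct demo_dev *dev)   { dev->lock_depth++; }
static void demo_unlock(struct demo_dev *dev) { dev->lock_depth--; }

static void demo_enable_complete(struct demo_dev *dev, int status)
{
	demo_lock(dev);

	if (status)
		goto unlock;	/* error path still releases the lock */

	if (dev->le_enabled)
		dev->updates++;

unlock:
	demo_unlock(dev);
}
```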
diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 6a46252fe66f..b67749bb55bf 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -1673,7 +1673,8 @@ static u8 smp_cmd_pairing_req(struct l2cap_conn *conn, struct sk_buff *skb)
 	/* SMP over BR/EDR requires special treatment */
 	if (conn->hcon->type == ACL_LINK) {
 		/* We must have a BR/EDR SC link */
-		if (!test_bit(HCI_CONN_AES_CCM, &conn->hcon->flags))
+		if (!test_bit(HCI_CONN_AES_CCM, &conn->hcon->flags) &&
+		    !test_bit(HCI_FORCE_LESC, &hdev->dbg_flags))
 			return SMP_CROSS_TRANSP_NOT_ALLOWED;
 
 		set_bit(SMP_FLAG_SC, &smp->flags);
@@ -2927,7 +2928,7 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 	tfm_aes = crypto_alloc_blkcipher("ecb(aes)", 0, 0);
 	if (IS_ERR(tfm_aes)) {
 		BT_ERR("Unable to create crypto context");
-		return ERR_PTR(PTR_ERR(tfm_aes));
+		return ERR_CAST(tfm_aes);
 	}
 
 create_chan:
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply related

* Re: [iproute2] tc: Show classes more hierarchically]
From: Marcelo Ricardo Leitner @ 2014-12-16 18:12 UTC (permalink / raw)
  To: vadim4j, netdev
In-Reply-To: <20141215224851.GB6734@angus-think.lan>

On 15-12-2014 20:48, vadim4j@gmail.com wrote:
> Hi All,
>
> I am playing with showing classes in a more hierarchical format. I have
> some code, and example output from my tc looks like:
>
> # tc/tc -t class show dev tap0
>
>   \---1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>   \---1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:10 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>                 \---1:11 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>                        \---1:111 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:20 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:30 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>
>
> which in standard output mode looks like:
>
> # tc/tc class show dev tap0
>
> class htb 1:11 parent 1:10 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:111 parent 1:11 prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:10 parent 1:1 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b
> class htb 1:1 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:20 parent 1:1 leaf 20: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:2 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:30 parent 1:1 leaf 30: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:40 parent 1:2 leaf 40: prio 0 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b
> class htb 1:50 parent 1:2 leaf 50: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:60 parent 1:2 leaf 60: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>
> So I'd like to ask whether this would be useful for tc users (maybe
> in a better format?).

Good idea! It already looks good, but what about:

   |-- 1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |      |-- 1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |      |-- 1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |      '-- 1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |-- 1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   ...

just another idea..
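
A minimal sketch of how such connectors could be produced, in case it helps; this is not taken from the tc sources. `struct cls_node` and `render_level()` are made-up names, only the class id is printed, and a real implementation would first have to build the tree from the netlink dump:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory class node (not an iproute2 structure). */
struct cls_node {
	const char *id;			/* e.g. "1:40" */
	const struct cls_node *child;	/* first child, or NULL */
	const struct cls_node *next;	/* next sibling, or NULL */
};

/*
 * Append one sibling list, and recursively its children, to buf.
 * "|--" marks a node that has a following sibling, "'--" the last one.
 * Returns the new write offset; output is truncated if buf is too small.
 */
static size_t render_level(char *buf, size_t size, size_t off,
			   const struct cls_node *n, const char *prefix)
{
	char deeper[128];

	assert(buf != NULL && prefix != NULL && size > 0);
	if (off < size)
		buf[off] = '\0';	/* keep buf terminated even for an empty list */

	for (; n != NULL; n = n->next) {
		int last = (n->next == NULL);

		if (off < size) {
			int w = snprintf(buf + off, size - off, "%s%s %s\n",
					 prefix, last ? "'--" : "|--", n->id);
			if (w > 0)
				off += (size_t)w;
		}
		/* Children get a bar while siblings follow, blanks otherwise. */
		snprintf(deeper, sizeof(deeper), "%s%s", prefix,
			 last ? "       " : "|      ");
		off = render_level(buf, size, off, n->child, deeper);
	}
	return off;
}
```

Rendering the first root from the example with its three children, followed by a second root, yields `|-- 1:2`, the children indented under a `|` bar with `'--` on the last one, and then `'-- 1:1`. Passing the bar or the blanks down as a prefix is what keeps the vertical lines aligned at any depth.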

Thanks.
   Marcelo

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Marcelo Ricardo Leitner @ 2014-12-16 18:00 UTC (permalink / raw)
  To: rajatxjain; +Cc: Nils Holland, David Miller, netdev, linux-pci@vger.kernel.org
In-Reply-To: <CAA93t1qyZE-9tw8pg1KG6g4iyy0QMW=iass5w=6ZGMTMu+vi_A@mail.gmail.com>

On 16-12-2014 14:04, Rajat Jain wrote:
> Hello All,
>
> Apologies for jumping in late, but for some reason I do not see the
> original mail in my inbox. However I am taking a look at the mails as
> sent on linux-pci (and I will keep an eye out for the bug report that
> Bjorn asked for).
>

np!
Nils, would you create that BZ please? You did all the bisecting.. :)

>
>>
>> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>>
>> $ grep 'pci 0000:02' tg3.bad
>> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.561555] pci 0000:02:00.0: 1st 1 1
>> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
>> [   20.412021] pci 0000:02:00.0: 1st 1 1
>> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
>> [   20.413596] pci 0000:02:00.0: 1st 1 1
>> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>>
>> And without it:
>>
>> $ grep 'pci 0000:02' tg3.good
>> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.576778] pci 0000:02:00.0: 1st 1 1
>> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>>
>
> It seems that the first 2 attempts that were made to probe the
> device are all OK and return the regular device ID and vendor ID for TG3
> (CRS does not have a role to play). However, later attempts return a
> CRS.
>
> 1) May I ask if you are using acpihp or pciehp? I assume pciehp?

Well.. the system doesn't support hotplug..
The chipset is an "Intel Corporation 5 Series/3400 Series", fwiw

> 2) Can you please also send dmesg output while passing
> pciehp.pciehp_debug=1? In the fail case, do you see a message
> indicating the pciehp gave up since it got CRS for a long time
> (something like "pci 0000:02:00.0 id reading try 50 times with
> interval 20 ms to get ffff0001")?

I did use that option anyway, but it resulted in no new messages.

> 3) Currently the pciehp passes "0" for the argument "crs_timeout" to
> pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say
> 30 seconds (30 * 1000). (For comparison data, acpihp uses the value
> 60*1000 i.e. 60 seconds today) and run the fail case once again?
>
> Thanks a lot in advance for the debugging help ;-)
>

It seems it's not safe to do that, given those backtraces..
I did it anyway: the system was very slow to boot, the NIC still didn't
come up, and I got a bunch of "scheduling while atomic" BUGs due to that
msleep() call.

The first invocation was fine:
Dec 16 15:40:00 odin kernel: [    0.190711] pci 0000:02:00.0: 1st 165a14e4 14e4
Dec 16 15:40:00 odin kernel: [    0.190717] pci 0000:02:00.0: 1st 165a14e4 14e4
Dec 16 15:40:00 odin kernel: [    0.191091] pci 0000:02:00.0: System wakeup disabled by ACPI
Dec 16 15:40:00 odin kernel: [    1.576061] pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:00 odin kernel: [    1.577474] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.580487] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.585508] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.594499] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.611499] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.644521] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.709566] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.838654] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    2.095765] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    2.608956] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    3.634443] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    5.684388] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    9.783279] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [   17.980060] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [   34.372640] pci 0000:02:00.0: not responding
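
As a rough cross-check of the timestamps above (this is a user-space model, not kernel code): the gaps between reads roughly double from about 1 ms to about 16 s, which is what a doubling backoff bounded by a 30 * 1000 ms crs_timeout would produce. The loop shape below is an assumption inferred from the log, and `crs_backoff_total_ms()` is a made-up name:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of a doubling-backoff poll: sleep `delay` ms, double the delay,
 * re-read, and give up once the doubled delay exceeds timeout_ms.
 * The device is assumed to keep answering with CRS (the worst case).
 * Returns the total number of ms slept; *polls receives the re-read count.
 */
static unsigned long crs_backoff_total_ms(unsigned long timeout_ms,
					  unsigned int *polls)
{
	unsigned long delay = 1, total = 0;
	unsigned int n = 0;

	assert(polls != NULL);

	if (timeout_ms == 0) {		/* a timeout of 0 means no retry at all */
		*polls = 0;
		return 0;
	}

	do {
		total += delay;		/* stands in for msleep(delay) */
		delay *= 2;
		n++;			/* stands in for the config re-read */
	} while (delay <= timeout_ms);

	*polls = n;
	return total;
}
```

With a 30000 ms timeout the model sleeps 32767 ms over 15 re-reads, close to the roughly 32.8 s between the first read at 1.576 and the "not responding" message at 34.372.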

The other two...
Dec 16 15:40:09 odin kernel: [   54.154688] pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:09 odin kernel: [   54.154690] BUG: scheduling while atomic: ip/1575/0x00000200
Dec 16 15:40:09 odin kernel: pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:09 odin kernel: BUG: scheduling while atomic: ip/1575/0x00000200
Dec 16 15:40:09 odin kernel: pci 0000:02:00.0: 1 1
Dec 16 15:40:09 odin kernel: BUG: scheduling while atomic: ip/1575/0x00000200
(...)

The BUG backtraces were very similar to the 2nd and 3rd ones I posted in
the other email; they just pointed to the msleep() call instead of my
BUG_ON(1).

I can dig deeper if you think it's worth it, but as the 1st call didn't
have this issue and didn't complete either, it seems we are good about
the test.. right?

Thanks,
Marcelo

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Marcelo Ricardo Leitner @ 2014-12-16 17:59 UTC (permalink / raw)
  To: Michael Chan, Bjorn Helgaas
  Cc: Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan
In-Reply-To: <1418750141.4248.3.camel@LTIRV-MCHAN1.corp.ad.broadcom.com>

On 16-12-2014 15:15, Michael Chan wrote:
> On Tue, 2014-12-16 at 09:20 -0700, Bjorn Helgaas wrote:
>> I think we're in this path:
>>
>>      tg3_init_hw
>>        tg3_reset_hw
>>          tg3_disable_ints
>>          tg3_stop_fw
>>          tg3_write_sig_pre_reset
>>          tg3_chip_reset
>>            pci_device_is_present
>>              pci_bus_read_dev_vendor_id
>>
>> and in this case pci_device_is_present() also passes a timeout of zero
>> to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
>> the device, so it's not too surprising that the config read returns
>> CRS status immediately afterward.
>>
> At the point of calling pci_device_is_present(), chip reset hasn't
> started yet, so there should be no problem reading config space.
> 
> In all the newer tg3 chips, chip reset does not reset the PCIE block.
> So I think config space should always be accessible even during reset.

It's a 
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express
over here

I put a WARN_ON(1) after those printks, and this is what I got:

[    1.550640] pci 0000:02:00.0: 1st 1 1
[    1.550643] pci 0000:02:00.0: crs_timeout: 0
[    1.550645] ------------[ cut here ]------------
[    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
[    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
[    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
[    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
[    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
[    1.550669] Call Trace:
[    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
[    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3]
[    1.550705]  [<ffffffff8127d2bb>] ? kernfs_activate+0x7b/0xf0
[    1.550708]  [<ffffffff813bbdc5>] local_pci_probe+0x45/0xa0
[    1.550711]  [<ffffffff8127fa8d>] ? sysfs_do_create_link_sd.isra.2+0x6d/0xc0
[    1.550714]  [<ffffffff813bd1b9>] pci_device_probe+0xf9/0x150
[    1.550717]  [<ffffffff814906fd>] driver_probe_device+0x12d/0x3d0
[    1.550720]  [<ffffffff81490a7b>] __driver_attach+0x9b/0xa0
[    1.550722]  [<ffffffff814909e0>] ? __device_attach+0x40/0x40
[    1.550724]  [<ffffffff8148e4f3>] bus_for_each_dev+0x73/0xc0
[    1.550726]  [<ffffffff814900ee>] driver_attach+0x1e/0x20
[    1.550729]  [<ffffffff8148fcb0>] bus_add_driver+0x180/0x250
[    1.550731]  [<ffffffffa0050000>] ? 0xffffffffa0050000
[    1.550733]  [<ffffffff81491274>] driver_register+0x64/0xf0
[    1.550735]  [<ffffffff813bb72b>] __pci_register_driver+0x4b/0x50
[    1.550739]  [<ffffffffa005001e>] tg3_driver_init+0x1e/0x1000 [tg3]
[    1.550742]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[    1.550747]  [<ffffffff811cbc42>] ? __vunmap+0xc2/0x110
[    1.550751]  [<ffffffff8111336b>] load_module+0x1cab/0x2730
[    1.550753]  [<ffffffff8110efc0>] ? store_uevent+0x70/0x70
[    1.550756]  [<ffffffff8120b090>] ? kernel_read+0x50/0x80
[    1.550760]  [<ffffffff81113fa6>] SyS_finit_module+0xa6/0xd0
[    1.550763]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[    1.550764] ---[ end trace 4cc3153e369484ea ]---
[    1.550963] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95722) rev a200] (PCI Express) MAC address 00:0a:f7:2b:9b:39
[    1.550965] tg3 0000:02:00.0 eth0: attached PHY is 5722/5756 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
[    1.550966] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.550967] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[    1.556112] tg3 0000:02:00.0 p1p1: renamed from eth0
...

[   23.545119] tg3 0000:02:00.0: irq 32 for MSI/MSI-X
[   25.424981] tg3 0000:02:00.0 p1p1: No firmware running
[   25.425686] pci 0000:02:00.0: 1st 1 1
[   25.425687] pci 0000:02:00.0: crs_timeout: 0
[   25.425687] ------------[ cut here ]------------
[   25.425691] WARNING: CPU: 0 PID: 1590 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[   25.425692] Modules linked in: bridge stp llc openvswitch x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel ghash_clmulni_intel snd_seq mei_me snd_seq_device snd_pcm iTCO_wdt iTCO_vendor_support snd_timer mei snd lpc_ich i2c_i801 pcspkr mfd_core dcdbas serio_raw soundcore microcode ie31200_edac shpchp edac_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc xfs libcrc32c i915 raid0 i2c_algo_bit drm_kms_helper drm e1000e tg3 ptp pps_core video
[   25.425714] CPU: 0 PID: 1590 Comm: ip Tainted: G        W      3.18.0-rc6+ #8
[   25.425715] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[   25.425716]  0000000000000000 0000000097b01d0c ffff8807f0687408 ffffffff8173db46
[   25.425717]  0000000000000000 0000000000000000 ffff8807f0687448 ffffffff81094d41
[   25.425719]  ffff8807f0687428 ffff8807f1e27000 0000000000000000 0000000000000000
[   25.425720] Call Trace:
[   25.425723]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[   25.425726]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[   25.425728]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[   25.425729]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[   25.425733]  [<ffffffffa0028b47>] ? tg3_phy_auxctl_write+0x27/0x30 [tg3]
[   25.425735]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[   25.425738]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[   25.425740]  [<ffffffffa003756d>] tg3_reset_hw+0x8d/0x2ce0 [tg3]
[   25.425743]  [<ffffffff81383f6a>] ? delay_tsc+0x4a/0x80
[   25.425744]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.425747]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.425749]  [<ffffffffa003a216>] tg3_init_hw+0x56/0x60 [tg3]
[   25.425751]  [<ffffffffa003c0d5>] tg3_start+0xbe5/0x1210 [tg3]
[   25.425753]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.425755]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.425757]  [<ffffffffa003c828>] tg3_open+0x128/0x2e0 [tg3]
[   25.425760]  [<ffffffff8162c6cf>] __dev_open+0xcf/0x140
[   25.425761]  [<ffffffff8162c9f1>] __dev_change_flags+0xa1/0x160
[   25.425762]  [<ffffffff8162cad9>] dev_change_flags+0x29/0x60
[   25.425764]  [<ffffffff8163a4a9>] do_setlink+0x399/0xa90
[   25.425766]  [<ffffffff8163ca7c>] rtnl_newlink+0x51c/0x740
[   25.425768]  [<ffffffff8163c653>] ? rtnl_newlink+0xf3/0x740
[   25.425771]  [<ffffffff811e730c>] ? new_slab+0x14c/0x490
[   25.425774]  [<ffffffff81303188>] ? security_capable+0x18/0x20
[   25.425776]  [<ffffffff8109cf7d>] ? ns_capable+0x2d/0x60
[   25.425778]  [<ffffffff816391a4>] rtnetlink_rcv_msg+0xa4/0x270
[   25.425780]  [<ffffffff8165840d>] ? __netlink_lookup+0x4d/0x70
[   25.425781]  [<ffffffff81639100>] ? rtnetlink_rcv+0x40/0x40
[   25.425783]  [<ffffffff8165c4a1>] netlink_rcv_skb+0xc1/0xe0
[   25.425784]  [<ffffffff816390ec>] rtnetlink_rcv+0x2c/0x40
[   25.425785]  [<ffffffff8165ba26>] netlink_unicast+0x106/0x210
[   25.425787]  [<ffffffff8165be55>] netlink_sendmsg+0x325/0x790
[   25.425788]  [<ffffffff8160de50>] sock_sendmsg+0xa0/0xe0
[   25.425791]  [<ffffffff8120e8cd>] ? lookup_real+0x1d/0x50
[   25.425792]  [<ffffffff8160e394>] ___sys_sendmsg+0x2f4/0x310
[   25.425794]  [<ffffffff8119bdf2>] ? lru_cache_add_active_or_unevictable+0x32/0xc0
[   25.425796]  [<ffffffff8160c673>] ? sock_destroy_inode+0x33/0x40
[   25.425798]  [<ffffffff8121bfd5>] ? __dentry_kill+0x145/0x1d0
[   25.425799]  [<ffffffff8121c105>] ? dput+0xa5/0x170
[   25.425800]  [<ffffffff81224f74>] ? mntput+0x24/0x40
[   25.425802]  [<ffffffff81206d6a>] ? __fput+0x17a/0x1e0
[   25.425803]  [<ffffffff8160ee21>] __sys_sendmsg+0x51/0x90
[   25.425805]  [<ffffffff8160ee72>] SyS_sendmsg+0x12/0x20
[   25.425807]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[   25.425808] ---[ end trace 4cc3153e369484eb ]---
[   25.427385] pci 0000:02:00.0: 1st 1 1
[   25.427386] pci 0000:02:00.0: crs_timeout: 0
[   25.427387] ------------[ cut here ]------------
[   25.427389] WARNING: CPU: 0 PID: 1590 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[   25.427389] Modules linked in: bridge stp llc openvswitch x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel ghash_clmulni_intel snd_seq mei_me snd_seq_device snd_pcm iTCO_wdt iTCO_vendor_support snd_timer mei snd lpc_ich i2c_i801 pcspkr mfd_core dcdbas serio_raw soundcore microcode ie31200_edac shpchp edac_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc xfs libcrc32c i915 raid0 i2c_algo_bit drm_kms_helper drm e1000e tg3 ptp pps_core video
[   25.427403] CPU: 0 PID: 1590 Comm: ip Tainted: G        W      3.18.0-rc6+ #8
[   25.427404] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[   25.427405]  0000000000000000 0000000097b01d0c ffff8807f0687488 ffffffff8173db46
[   25.427406]  0000000000000000 0000000000000000 ffff8807f06874c8 ffffffff81094d41
[   25.427416]  ffff8807f06874a8 ffff8807f1e27000 0000000000000000 0000000000000000
[   25.427417] Call Trace:
[   25.427418]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[   25.427420]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[   25.427421]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[   25.427423]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[   25.427425]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[   25.427427]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[   25.427430]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
[   25.427433]  [<ffffffffa003c208>] tg3_start+0xd18/0x1210 [tg3]
[   25.427434]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.427437]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.427439]  [<ffffffffa003c828>] tg3_open+0x128/0x2e0 [tg3]
[   25.427441]  [<ffffffff8162c6cf>] __dev_open+0xcf/0x140
[   25.427442]  [<ffffffff8162c9f1>] __dev_change_flags+0xa1/0x160
[   25.427443]  [<ffffffff8162cad9>] dev_change_flags+0x29/0x60
[   25.427445]  [<ffffffff8163a4a9>] do_setlink+0x399/0xa90
[   25.427448]  [<ffffffff8163ca7c>] rtnl_newlink+0x51c/0x740
[   25.427449]  [<ffffffff8163c653>] ? rtnl_newlink+0xf3/0x740
[   25.427452]  [<ffffffff811e730c>] ? new_slab+0x14c/0x490
[   25.427454]  [<ffffffff81303188>] ? security_capable+0x18/0x20
[   25.427455]  [<ffffffff8109cf7d>] ? ns_capable+0x2d/0x60
[   25.427457]  [<ffffffff816391a4>] rtnetlink_rcv_msg+0xa4/0x270
[   25.427459]  [<ffffffff8165840d>] ? __netlink_lookup+0x4d/0x70
[   25.427460]  [<ffffffff81639100>] ? rtnetlink_rcv+0x40/0x40
[   25.427462]  [<ffffffff8165c4a1>] netlink_rcv_skb+0xc1/0xe0
[   25.427464]  [<ffffffff816390ec>] rtnetlink_rcv+0x2c/0x40
[   25.427465]  [<ffffffff8165ba26>] netlink_unicast+0x106/0x210
[   25.427466]  [<ffffffff8165be55>] netlink_sendmsg+0x325/0x790
[   25.427468]  [<ffffffff8160de50>] sock_sendmsg+0xa0/0xe0
[   25.427469]  [<ffffffff8120e8cd>] ? lookup_real+0x1d/0x50
[   25.427471]  [<ffffffff8160e394>] ___sys_sendmsg+0x2f4/0x310
[   25.427472]  [<ffffffff8119bdf2>] ? lru_cache_add_active_or_unevictable+0x32/0xc0
[   25.427475]  [<ffffffff8160c673>] ? sock_destroy_inode+0x33/0x40
[   25.427477]  [<ffffffff8121bfd5>] ? __dentry_kill+0x145/0x1d0
[   25.427478]  [<ffffffff8121c105>] ? dput+0xa5/0x170
[   25.427479]  [<ffffffff81224f74>] ? mntput+0x24/0x40
[   25.427481]  [<ffffffff81206d6a>] ? __fput+0x17a/0x1e0
[   25.427482]  [<ffffffff8160ee21>] __sys_sendmsg+0x51/0x90
[   25.427483]  [<ffffffff8160ee72>] SyS_sendmsg+0x12/0x20
[   25.427493]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[   25.427494] ---[ end trace 4cc3153e369484ec ]---

  Marcelo

^ permalink raw reply

* [PATCH net-next 3/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:50 UTC (permalink / raw)
  To: netdev

This patch set is the union of the previous two patches.
Applying this patch to the net-next kernel (commit f96fe22)
provides full functionality. The DLKM and API found at
https://sourceforge.net/projects/tcpestats/files/ will allow
interested parties to test out our implementation from a
user perspective.

Note: to enable tcp_estats in the kernel, the net.ipv4.tcp_estats
sysctl must be set. To enable all statistics,
set net.ipv4.tcp_estats=127
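
For readers wondering how a value such as 127 relates to individual tables, here is a small user-space sketch. It copies the TCP_ESTATS_TABLEMASK_* values from include/net/tcp_estats.h in this patch; treating the sysctl value as a plain bitmask of those bits is an assumption, since the sysctl handler is not reproduced here, and the two helper functions are made up for illustration:

```c
#include <assert.h>

/* Table mask bits, copied from include/net/tcp_estats.h in this patch. */
#define TCP_ESTATS_TABLEMASK_ACTIVE	0x01
#define TCP_ESTATS_TABLEMASK_PERF	0x02
#define TCP_ESTATS_TABLEMASK_PATH	0x04
#define TCP_ESTATS_TABLEMASK_STACK	0x08
#define TCP_ESTATS_TABLEMASK_APP	0x10
#define TCP_ESTATS_TABLEMASK_EXTRAS	0x40

/* Hypothetical helper: OR of every mask bit that the header defines. */
static unsigned int estats_defined_mask(void)
{
	return TCP_ESTATS_TABLEMASK_ACTIVE | TCP_ESTATS_TABLEMASK_PERF |
	       TCP_ESTATS_TABLEMASK_PATH | TCP_ESTATS_TABLEMASK_STACK |
	       TCP_ESTATS_TABLEMASK_APP | TCP_ESTATS_TABLEMASK_EXTRAS;
}

/* Hypothetical helper: is table_bit set in the sysctl value? */
static int estats_table_enabled(unsigned int sysctl_val,
				unsigned int table_bit)
{
	assert(table_bit != 0);
	return (sysctl_val & table_bit) != 0;
}
```

Under that assumption the defined bits add up to 0x5f (95); 127 (0x7f) sets all of them plus 0x20, for which the header above has no define.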

---
  include/linux/tcp.h        |   8 +
  include/net/tcp.h          |   1 +
  include/net/tcp_estats.h   | 376 +++++++++++++++++++++++
  include/uapi/linux/tcp.h   |   6 +-
  net/ipv4/Kconfig           |  25 ++
  net/ipv4/Makefile          |   1 +
  net/ipv4/sysctl_net_ipv4.c |  14 +
  net/ipv4/tcp.c             |  21 +-
  net/ipv4/tcp_cong.c        |   3 +
  net/ipv4/tcp_estats.c      | 736 +++++++++++++++++++++++++++++++++++++++++++++
  net/ipv4/tcp_htcp.c        |   1 +
  net/ipv4/tcp_input.c       | 116 ++++++-
  net/ipv4/tcp_ipv4.c        |  10 +
  net/ipv4/tcp_output.c      |  61 +++-
  net/ipv4/tcp_timer.c       |   3 +
  net/ipv6/tcp_ipv6.c        |   7 +
  16 files changed, 1368 insertions(+), 21 deletions(-)
  create mode 100644 include/net/tcp_estats.h
  create mode 100644 net/ipv4/tcp_estats.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8758360 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -126,6 +126,10 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
  	return (struct tcp_request_sock *)req;
  }
  
+#ifdef CONFIG_TCP_ESTATS
+struct tcp_estats;
+#endif
+
  struct tcp_sock {
  	/* inet_connection_sock has to be the first member of tcp_sock */
  	struct inet_connection_sock	inet_conn;
@@ -309,6 +313,10 @@ struct tcp_sock {
  	struct tcp_md5sig_info	__rcu *md5sig_info;
  #endif
  
+#ifdef CONFIG_TCP_ESTATS
+	struct tcp_estats	*tcp_stats;
+#endif
+
  /* TCP fastopen related information */
  	struct tcp_fastopen_request *fastopen_req;
  	/* fastopen_rsk points to request_sock that resulted in this big
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..9f7e31e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -43,6 +43,7 @@
  #include <net/tcp_states.h>
  #include <net/inet_ecn.h>
  #include <net/dst.h>
+#include <net/tcp_estats.h>
  
  #include <linux/seq_file.h>
  #include <linux/memcontrol.h>
diff --git a/include/net/tcp_estats.h b/include/net/tcp_estats.h
new file mode 100644
index 0000000..ff6000e
--- /dev/null
+++ b/include/net/tcp_estats.h
@@ -0,0 +1,376 @@
+/*
+ * include/net/tcp_estats.h
+ *
+ * Implementation of TCP Extended Statistics MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominin Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _TCP_ESTATS_H
+#define _TCP_ESTATS_H
+
+#include <net/sock.h>
+#include <linux/idr.h>
+#include <linux/in.h>
+#include <linux/jump_label.h>
+#include <linux/spinlock.h>
+#include <linux/tcp.h>
+#include <linux/workqueue.h>
+
+/* defines number of seconds that stats persist after connection ends */
+#define TCP_ESTATS_PERSIST_DELAY_SECS 5
+
+enum tcp_estats_sndlim_states {
+	TCP_ESTATS_SNDLIM_NONE = -1,
+	TCP_ESTATS_SNDLIM_SENDER,
+	TCP_ESTATS_SNDLIM_CWND,
+	TCP_ESTATS_SNDLIM_RWIN,
+	TCP_ESTATS_SNDLIM_STARTUP,
+	TCP_ESTATS_SNDLIM_TSODEFER,
+	TCP_ESTATS_SNDLIM_PACE,
+	TCP_ESTATS_SNDLIM_NSTATES	/* Keep at end */
+};
+
+enum tcp_estats_addrtype {
+	TCP_ESTATS_ADDRTYPE_IPV4 = 1,
+	TCP_ESTATS_ADDRTYPE_IPV6 = 2
+};
+
+enum tcp_estats_softerror_reason {
+	TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW = 1,
+	TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW = 2,
+	TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW = 3,
+	TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW = 4,
+	TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW = 5,
+	TCP_ESTATS_SOFTERROR_ABOVE_TS_WINDOW = 6,
+	TCP_ESTATS_SOFTERROR_DATA_CHECKSUM = 7,
+	TCP_ESTATS_SOFTERROR_OTHER = 8,
+};
+
+#define TCP_ESTATS_INACTIVE	2
+#define TCP_ESTATS_ACTIVE	1
+
+#define TCP_ESTATS_TABLEMASK_INACTIVE	0x00
+#define TCP_ESTATS_TABLEMASK_ACTIVE	0x01
+#define TCP_ESTATS_TABLEMASK_PERF	0x02
+#define TCP_ESTATS_TABLEMASK_PATH	0x04
+#define TCP_ESTATS_TABLEMASK_STACK	0x08
+#define TCP_ESTATS_TABLEMASK_APP	0x10
+#define TCP_ESTATS_TABLEMASK_EXTRAS	0x40
+
+#ifdef CONFIG_TCP_ESTATS
+
+extern struct static_key tcp_estats_enabled;
+
+#define TCP_ESTATS_CHECK(tp, table, expr)				\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats) &&			\
+			    likely((tp)->tcp_stats->tables.table)) {	\
+				(expr);					\
+			}						\
+		}							\
+	} while (0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, ++((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_DEC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, --((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) += (val))
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) = (val))
+#define TCP_ESTATS_UPDATE(tp, func)					\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats)) {			\
+				(func);					\
+			}						\
+		}							\
+	} while (0)
+
+/*
+ * Variables that can be read and written directly.
+ *
+ * Contains all variables from RFC 4898. Commented fields are
+ * either not implemented (only StartTimeStamp
+ * remains unimplemented in this release) or have
+ * handlers and do not need struct storage.
+ */
+struct tcp_estats_connection_table {
+	u32			AddressType;
+	union { struct in_addr addr; struct in6_addr addr6; }	LocalAddress;
+	union { struct in_addr addr; struct in6_addr addr6; }	RemAddress;
+	u16			LocalPort;
+	u16			RemPort;
+};
+
+struct tcp_estats_perf_table {
+	u32		SegsOut;
+	u32		DataSegsOut;
+	u64		DataOctetsOut;
+	u32		SegsRetrans;
+	u32		OctetsRetrans;
+	u32		SegsIn;
+	u32		DataSegsIn;
+	u64		DataOctetsIn;
+	/*		ElapsedSecs */
+	/*		ElapsedMicroSecs */
+	/*		StartTimeStamp */
+	/*		CurMSS */
+	/*		PipeSize */
+	u32		MaxPipeSize;
+	/*		SmoothedRTT */
+	/*		CurRTO */
+	u32		CongSignals;
+	/*		CurCwnd */
+	/*		CurSsthresh */
+	u32		Timeouts;
+	/*		CurRwinSent */
+	u32		MaxRwinSent;
+	u32		ZeroRwinSent;
+	/*		CurRwinRcvd */
+	u32		MaxRwinRcvd;
+	u32		ZeroRwinRcvd;
+	/*		SndLimTransRwin */
+	/*		SndLimTransCwnd */
+	/*		SndLimTransSnd */
+	/*		SndLimTimeRwin */
+	/*		SndLimTimeCwnd */
+	/*		SndLimTimeSnd */
+	u32		snd_lim_trans[TCP_ESTATS_SNDLIM_NSTATES];
+	u32		snd_lim_time[TCP_ESTATS_SNDLIM_NSTATES];
+};
+
+struct tcp_estats_path_table {
+	/*		RetranThresh */
+	u32		NonRecovDAEpisodes;
+	u32		SumOctetsReordered;
+	u32		NonRecovDA;
+	u32		SampleRTT;
+	/*		RTTVar */
+	u32		MaxRTT;
+	u32		MinRTT;
+	u64		SumRTT;
+	u32		CountRTT;
+	u32		MaxRTO;
+	u32		MinRTO;
+	u8		IpTtl;
+	u8		IpTosIn;
+	/*		IpTosOut */
+	u32		PreCongSumCwnd;
+	u32		PreCongSumRTT;
+	u32		PostCongSumRTT;
+	u32		PostCongCountRTT;
+	u32		ECNsignals;
+	u32		DupAckEpisodes;
+	/*		RcvRTT */
+	u32		DupAcksOut;
+	u32		CERcvd;
+	u32		ECESent;
+};
+
+struct tcp_estats_stack_table {
+	u32		ActiveOpen;
+	/*		MSSSent */
+	/*		MSSRcvd */
+	/*		WinScaleSent */
+	/*		WinScaleRcvd */
+	/*		TimeStamps */
+	/*		ECN */
+	/*		WillSendSACK */
+	/*		WillUseSACK */
+	/*		State */
+	/*		Nagle */
+	u32		MaxSsCwnd;
+	u32		MaxCaCwnd;
+	u32		MaxSsthresh;
+	u32		MinSsthresh;
+	/*		InRecovery */
+	u32		DupAcksIn;
+	u32		SpuriousFrDetected;
+	u32		SpuriousRtoDetected;
+	u32		SoftErrors;
+	u32		SoftErrorReason;
+	u32		SlowStart;
+	u32		CongAvoid;
+	u32		OtherReductions;
+	u32		CongOverCount;
+	u32		FastRetran;
+	u32		SubsequentTimeouts;
+	/*		CurTimeoutCount */
+	u32		AbruptTimeouts;
+	u32		SACKsRcvd;
+	u32		SACKBlocksRcvd;
+	u32		SendStall;
+	u32		DSACKDups;
+	u32		MaxMSS;
+	u32		MinMSS;
+	u32		SndInitial;
+	u32		RecInitial;
+	/*		CurRetxQueue */
+	/*		MaxRetxQueue */
+	/*		CurReasmQueue */
+	u32		MaxReasmQueue;
+	u32		EarlyRetrans;
+	u32		EarlyRetransDelay;
+};
+
+struct tcp_estats_app_table {
+	/*		SndUna */
+	/*		SndNxt */
+	u32		SndMax;
+	u64		ThruOctetsAcked;
+	/*		RcvNxt */
+	u64		ThruOctetsReceived;
+	/*		CurAppWQueue */
+	u32		MaxAppWQueue;
+	/*		CurAppRQueue */
+	u32		MaxAppRQueue;
+};
+
+/*
+    currently, no backing store is needed for tuning elements in
+     web10g - they are all read or written to directly in other
+     data structures (such as the socket)
+*/
+
+struct tcp_estats_extras_table {
+	/*		OtherReductionsCV */
+	u32		OtherReductionsCM;
+	u32		Priority;
+};
+
+struct tcp_estats_tables {
+	struct tcp_estats_connection_table	*connection_table;
+	struct tcp_estats_perf_table		*perf_table;
+	struct tcp_estats_path_table		*path_table;
+	struct tcp_estats_stack_table		*stack_table;
+	struct tcp_estats_app_table		*app_table;
+	struct tcp_estats_extras_table		*extras_table;
+};
+
+struct tcp_estats {
+	int				tcpe_cid; /* idr map id */
+
+	struct sock			*sk;
+	kuid_t				uid;
+	kgid_t				gid;
+	int				ids;
+
+	atomic_t			users;
+
+	enum tcp_estats_sndlim_states	limstate;
+	ktime_t				limstate_ts;
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	ktime_t				start_ts;
+	ktime_t				current_ts;
+#else
+	unsigned long			start_ts;
+	unsigned long			current_ts;
+#endif
+	struct timeval			start_tv;
+
+	int				queued;
+	struct work_struct		create_notify;
+	struct work_struct		establish_notify;
+	struct delayed_work		destroy_notify;
+
+	struct tcp_estats_tables	tables;
+
+	struct rcu_head			rcu;
+};
+
+extern struct idr tcp_estats_idr;
+
+extern int tcp_estats_wq_enabled;
+extern struct workqueue_struct *tcp_estats_wq;
+extern void (*create_notify_func)(struct work_struct *work);
+extern void (*establish_notify_func)(struct work_struct *work);
+extern void (*destroy_notify_func)(struct work_struct *work);
+
+extern unsigned long persist_delay;
+extern spinlock_t tcp_estats_idr_lock;
+
+/* For the TCP code */
+extern int  tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype t,
+			      int active);
+extern void tcp_estats_destroy(struct sock *sk);
+extern void tcp_estats_establish(struct sock *sk);
+extern void tcp_estats_free(struct rcu_head *rcu);
+
+extern void tcp_estats_update_snd_nxt(struct tcp_sock *tp);
+extern void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack);
+extern void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample);
+extern void tcp_estats_update_timeout(struct sock *sk);
+extern void tcp_estats_update_mss(struct tcp_sock *tp);
+extern void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp);
+extern void tcp_estats_update_sndlim(struct tcp_sock *tp,
+				     enum tcp_estats_sndlim_states why);
+extern void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq);
+extern void tcp_estats_update_rwin_sent(struct tcp_sock *tp);
+extern void tcp_estats_update_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_post_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_segsend(struct sock *sk, int pcount,
+                                      u32 seq, u32 end_seq, int flags);
+extern void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb);
+extern void tcp_estats_update_finish_segrecv(struct tcp_sock *tp);
+extern void tcp_estats_update_writeq(struct sock *sk);
+extern void tcp_estats_update_recvq(struct sock *sk);
+
+extern void tcp_estats_init(void);
+
+static inline void tcp_estats_use(struct tcp_estats *stats)
+{
+	atomic_inc(&stats->users);
+}
+
+static inline int tcp_estats_use_if_valid(struct tcp_estats *stats)
+{
+	return atomic_inc_not_zero(&stats->users);
+}
+
+static inline void tcp_estats_unuse(struct tcp_estats *stats)
+{
+	if (atomic_dec_and_test(&stats->users)) {
+		sock_put(stats->sk);
+		stats->sk = NULL;
+		call_rcu(&stats->rcu, tcp_estats_free);
+	}
+}
+
+#else /* !CONFIG_TCP_ESTATS */
+
+#define tcp_estats_enabled	(0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_DEC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_UPDATE(tp, func)		do {} while (0)
+
+static inline void tcp_estats_init(void) { }
+static inline void tcp_estats_establish(struct sock *sk) { }
+static inline int tcp_estats_create(struct sock *sk,
+				    enum tcp_estats_addrtype t,
+				    int active) { return 0; }
+static inline void tcp_estats_destroy(struct sock *sk) { }
+
+#endif /* CONFIG_TCP_ESTATS */
+
+#endif /* _TCP_ESTATS_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 3b97183..5dae043 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -186,9 +186,13 @@ struct tcp_info {
  	__u32	tcpi_rcv_space;
  
  	__u32	tcpi_total_retrans;
-
  	__u64	tcpi_pacing_rate;
  	__u64	tcpi_max_pacing_rate;
+
+#ifdef CONFIG_TCP_ESTATS
+	/* RFC 4898 extended stats Info */
+	__u32	tcpi_estats_cid;
+#endif
  };
  
  /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index bd29016..4bd176e 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -680,3 +680,28 @@ config TCP_MD5SIG
  	  on the Internet.
  
  	  If unsure, say N.
+
+config TCP_ESTATS
+	bool "TCP: Extended TCP statistics (RFC4898) MIB"
+	---help---
+	  RFC 4898 specifies a number of extended statistics for TCP. This
+	  data can be accessed using netlink. See http://www.web10g.org for
+	  more details.
+
+if TCP_ESTATS
+
+config TCP_ESTATS_STRICT_ELAPSEDTIME
+	bool "TCP: ESTATS strict ElapsedSecs/Msecs counters"
+	depends on TCP_ESTATS
+	default n
+	---help---
+	  Elapsed time since beginning of connection.
+	  RFC4898 defines ElapsedSecs/Msecs as being updated via ktime_get
+	  at each protocol event (sending or receiving of a segment);
+	  as this can be a performance hit, leaving this config option off
+	  will update elapsed based on the jiffies counter instead.
+	  Set to Y for strict conformance with the MIB.
+
+	  If unsure, say N.
+
+endif
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 518c04e..7e2c69a 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_INET_TUNNEL) += tunnel4.o
  obj-$(CONFIG_INET_XFRM_MODE_TRANSPORT) += xfrm4_mode_transport.o
  obj-$(CONFIG_INET_XFRM_MODE_TUNNEL) += xfrm4_mode_tunnel.o
  obj-$(CONFIG_IP_PNP) += ipconfig.o
+obj-$(CONFIG_TCP_ESTATS) += tcp_estats.o
  obj-$(CONFIG_NETFILTER)	+= netfilter.o netfilter/
  obj-$(CONFIG_INET_DIAG) += inet_diag.o
  obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0ee384..edc5a66 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -42,6 +42,11 @@ static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
  static int ip_ping_group_range_min[] = { 0, 0 };
  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
  
+/* Extended statistics (RFC4898). */
+#ifdef CONFIG_TCP_ESTATS
+int sysctl_tcp_estats __read_mostly;
+#endif  /* CONFIG_TCP_ESTATS */
+
  /* Update system visible IP port range */
  static void set_local_port_range(struct net *net, int range[2])
  {
@@ -767,6 +772,15 @@ static struct ctl_table ipv4_table[] = {
  		.proc_handler	= proc_dointvec_minmax,
  		.extra1		= &one
  	},
+#ifdef CONFIG_TCP_ESTATS
+	{
+		.procname	= "tcp_estats",
+		.data		= &sysctl_tcp_estats,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif /* CONFIG_TCP_ESTATS */
  	{ }
  };
  
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3075723..698dbb7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,10 @@ void tcp_init_sock(struct sock *sk)
  	sk->sk_sndbuf = sysctl_tcp_wmem[1];
  	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
  
+#ifdef CONFIG_TCP_ESTATS
+	tp->tcp_stats = NULL;
+#endif
+
  	local_bh_disable();
  	sock_update_memcg(sk);
  	sk_sockets_allocated_inc(sk);
@@ -972,6 +976,9 @@ wait_for_memory:
  		tcp_push(sk, flags & ~MSG_MORE, mss_now,
  			 TCP_NAGLE_PUSH, size_goal);
  
+		if (copied)
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+
  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  			goto do_error;
  
@@ -1264,9 +1271,11 @@ new_segment:
  wait_for_sndbuf:
  			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
  wait_for_memory:
-			if (copied)
+			if (copied) {
  				tcp_push(sk, flags & ~MSG_MORE, mss_now,
  					 TCP_NAGLE_PUSH, size_goal);
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+			}
  
  			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  				goto do_error;
@@ -1658,6 +1667,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
  			     *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt, flags);
  		}
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		/* Well, if we have backlog, try to process it now yet. */
  
  		if (copied >= target && !sk->sk_backlog.tail)
@@ -2684,6 +2695,11 @@ void tcp_get_info(const struct sock *sk, struct tcp_info *info)
  					sk->sk_pacing_rate : ~0ULL;
  	info->tcpi_max_pacing_rate = sk->sk_max_pacing_rate != ~0U ?
  					sk->sk_max_pacing_rate : ~0ULL;
+
+#ifdef CONFIG_TCP_ESTATS
+	info->tcpi_estats_cid = (tp->tcp_stats && tp->tcp_stats->tcpe_cid > 0)
+					? tp->tcp_stats->tcpe_cid : 0;
+#endif
  }
  EXPORT_SYMBOL_GPL(tcp_get_info);
  
@@ -3101,6 +3117,9 @@ void __init tcp_init(void)
  		tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
  
  	tcp_metrics_init();
+
  	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
+	tcp_estats_init();
+
  	tcp_tasklet_init();
  }
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..e93929d 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -295,6 +295,8 @@ void tcp_slow_start(struct tcp_sock *tp, u32 acked)
  {
  	u32 cwnd = tp->snd_cwnd + acked;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SlowStart);
+
  	if (cwnd > tp->snd_ssthresh)
  		cwnd = tp->snd_ssthresh + 1;
  	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
@@ -304,6 +306,7 @@ EXPORT_SYMBOL_GPL(tcp_slow_start);
  /* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd (or alternative w) */
  void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w)
  {
+	TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	if (tp->snd_cwnd_cnt >= w) {
  		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
  			tp->snd_cwnd++;
diff --git a/net/ipv4/tcp_estats.c b/net/ipv4/tcp_estats.c
new file mode 100644
index 0000000..e817540
--- /dev/null
+++ b/net/ipv4/tcp_estats.c
@@ -0,0 +1,736 @@
+/*
+ * net/ipv4/tcp_estats.c
+ *
+ * Implementation of TCP ESTATS MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominic Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/export.h>
+#ifndef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+#include <linux/jiffies.h>
+#endif
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/string.h>
+#include <net/tcp_estats.h>
+#include <net/tcp.h>
+#include <asm/atomic.h>
+#include <asm/byteorder.h>
+
+#define ESTATS_INF32	0xffffffff
+
+#define ESTATS_MAX_CID	5000000
+
+extern int sysctl_tcp_estats;
+
+struct idr tcp_estats_idr;
+EXPORT_SYMBOL(tcp_estats_idr);
+static int next_id = 1;
+DEFINE_SPINLOCK(tcp_estats_idr_lock);
+EXPORT_SYMBOL(tcp_estats_idr_lock);
+
+int tcp_estats_wq_enabled __read_mostly = 0;
+EXPORT_SYMBOL(tcp_estats_wq_enabled);
+struct workqueue_struct *tcp_estats_wq = NULL;
+EXPORT_SYMBOL(tcp_estats_wq);
+void (*create_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(create_notify_func);
+void (*establish_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(establish_notify_func);
+void (*destroy_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(destroy_notify_func);
+unsigned long persist_delay = 0;
+EXPORT_SYMBOL(persist_delay);
+
+struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(tcp_estats_enabled);
+
+/* if HAVE_JUMP_LABEL is defined, then static_key_slow_inc/dec uses a
+ *   mutex in its implementation, and hence can't be called if in_interrupt().
+ * if HAVE_JUMP_LABEL is NOT defined, then no mutex is used, hence no need
+ *   for deferring enable/disable */
+#ifdef HAVE_JUMP_LABEL
+static atomic_t tcp_estats_enabled_deferred;
+
+static void tcp_estats_handle_deferred_enable_disable(void)
+{
+	int count = atomic_xchg(&tcp_estats_enabled_deferred, 0);
+
+	while (count > 0) {
+		static_key_slow_inc(&tcp_estats_enabled);
+		--count;
+	}
+
+	while (count < 0) {
+		static_key_slow_dec(&tcp_estats_enabled);
+		++count;
+	}
+}
+#endif
+
+static inline void tcp_estats_enable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_inc(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_inc(&tcp_estats_enabled);
+}
+
+static inline void tcp_estats_disable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_dec(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_dec(&tcp_estats_enabled);
+}
+
+/* Calculates the required amount of memory for any enabled tables. */
+int tcp_estats_get_allocation_size(int sysctl)
+{
+	int size = sizeof(struct tcp_estats) +
+		sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF)
+		size += sizeof(struct tcp_estats_perf_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH)
+		size += sizeof(struct tcp_estats_path_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK)
+		size += sizeof(struct tcp_estats_stack_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP)
+		size += sizeof(struct tcp_estats_app_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS)
+		size += sizeof(struct tcp_estats_extras_table);
+	return size;
+}
+
+/* Called whenever a TCP/IPv4 sock is created.
+ * net/ipv4/tcp_ipv4.c: tcp_v4_syn_recv_sock,
+ *			tcp_v4_init_sock
+ * Allocates a stats structure and initializes values.
+ */
+int tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype addrtype,
+		      int active)
+{
+	struct tcp_estats *stats;
+	struct tcp_estats_tables *tables;
+	struct tcp_sock *tp = tcp_sk(sk);
+	void *estats_mem;
+	int sysctl;
+	int ret;
+
+	/* Read the sysctl once before calculating memory needs and initializing
+	 * tables to avoid raciness. */
+	sysctl = ACCESS_ONCE(sysctl_tcp_estats);
+	if (likely(sysctl == TCP_ESTATS_TABLEMASK_INACTIVE)) {
+		return 0;
+	}
+
+	estats_mem = kzalloc(tcp_estats_get_allocation_size(sysctl), gfp_any());
+	if (!estats_mem)
+		return -ENOMEM;
+
+	stats = estats_mem;
+	estats_mem += sizeof(struct tcp_estats);
+
+	tables = &stats->tables;
+
+	tables->connection_table = estats_mem;
+	estats_mem += sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF) {
+		tables->perf_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_perf_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH) {
+		tables->path_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_path_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK) {
+		tables->stack_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_stack_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP) {
+		tables->app_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_app_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS) {
+		tables->extras_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_extras_table);
+	}
+
+	stats->tcpe_cid = -1;
+	stats->queued = 0;
+
+	tables->connection_table->AddressType = addrtype;
+
+	sock_hold(sk);
+	stats->sk = sk;
+	atomic_set(&stats->users, 0);
+
+	stats->limstate = TCP_ESTATS_SNDLIM_STARTUP;
+	stats->limstate_ts = ktime_get();
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->start_ts = stats->current_ts = stats->limstate_ts;
+#else
+	stats->start_ts = stats->current_ts = jiffies;
+#endif
+	do_gettimeofday(&stats->start_tv);
+
+	/* order is important -
+	 * must have stats hooked into tp and tcp_estats_enabled()
+	 * in order to have the TCP_ESTATS_VAR_<> macros work */
+	tp->tcp_stats = stats;
+	tcp_estats_enable();
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, ActiveOpen, active);
+
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->snd_nxt);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->snd_nxt);
+
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTT, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTO, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinMSS, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinSsthresh, ESTATS_INF32);
+
+	tcp_estats_use(stats);
+
+	if (tcp_estats_wq_enabled) {
+		tcp_estats_use(stats);
+		stats->queued = 1;
+		stats->tcpe_cid = 0;
+		INIT_WORK(&stats->create_notify, create_notify_func);
+		ret = queue_work(tcp_estats_wq, &stats->create_notify);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_estats_create);
+
+void tcp_estats_destroy(struct sock *sk)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+
+	if (stats == NULL)
+		return;
+
+	/* Attribute final sndlim time. */
+	tcp_estats_update_sndlim(tcp_sk(stats->sk), stats->limstate);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_DELAYED_WORK(&stats->destroy_notify,
+			destroy_notify_func);
+		queue_delayed_work(tcp_estats_wq, &stats->destroy_notify,
+			persist_delay);
+	}
+	tcp_estats_unuse(stats);
+}
+
+/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu. */
+void tcp_estats_free(struct rcu_head *rcu)
+{
+	struct tcp_estats *stats = container_of(rcu, struct tcp_estats, rcu);
+	tcp_estats_disable();
+	kfree(stats);
+}
+EXPORT_SYMBOL(tcp_estats_free);
+
+/* Called when a connection enters the ESTABLISHED state, and has all its
+ * state initialized.
+ * net/ipv4/tcp_input.c: tcp_rcv_state_process,
+ *			 tcp_rcv_synsent_state_process
+ * Here we link the statistics structure in so it is visible in the /proc
+ * fs, and do some final init.
+ */
+void tcp_estats_establish(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_connection_table *conn_table;
+
+	if (stats == NULL)
+		return;
+
+	conn_table = stats->tables.connection_table;
+
+	/* Let's set these here, since they can't change once the
+	 * connection is established.
+	 */
+	conn_table->LocalPort = inet->inet_num;
+	conn_table->RemPort = ntohs(inet->inet_dport);
+
+	if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV4) {
+		memcpy(&conn_table->LocalAddress.addr, &inet->inet_rcv_saddr,
+			sizeof(struct in_addr));
+		memcpy(&conn_table->RemAddress.addr, &inet->inet_daddr,
+			sizeof(struct in_addr));
+	}
+#if IS_ENABLED(CONFIG_IPV6)
+	else if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV6) {
+		memcpy(&conn_table->LocalAddress.addr6, &(sk)->sk_v6_rcv_saddr,
+		       sizeof(struct in6_addr));
+		/* ipv6 daddr now uses a different struct than saddr */
+		memcpy(&conn_table->RemAddress.addr6, &(sk)->sk_v6_daddr,
+		       sizeof(struct in6_addr));
+	}
+#endif
+	else {
+		pr_err("TCP ESTATS: AddressType not valid.\n");
+	}
+
+	tcp_estats_update_finish_segrecv(tp);
+	tcp_estats_update_rwin_rcvd(tp);
+	tcp_estats_update_rwin_sent(tp);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, RecInitial, tp->rcv_nxt);
+
+	tcp_estats_update_sndlim(tp, TCP_ESTATS_SNDLIM_SENDER);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_WORK(&stats->establish_notify, establish_notify_func);
+		queue_work(tcp_estats_wq, &stats->establish_notify);
+	}
+}
+
+/*
+ * Statistics update functions
+ */
+
+void tcp_estats_update_snd_nxt(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table) {
+		if (after(tp->snd_nxt, stats->tables.app_table->SndMax))
+			stats->tables.app_table->SndMax = tp->snd_nxt;
+	}
+}
+
+void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table)
+		stats->tables.app_table->ThruOctetsAcked += ack - tp->snd_una;
+}
+
+void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+	unsigned long rtt_sample_msec = rtt_sample / 1000;
+	u32 rto;
+
+	if (path_table == NULL)
+		return;
+
+	path_table->SampleRTT = rtt_sample_msec;
+
+	if (rtt_sample_msec > path_table->MaxRTT)
+		path_table->MaxRTT = rtt_sample_msec;
+	if (rtt_sample_msec < path_table->MinRTT)
+		path_table->MinRTT = rtt_sample_msec;
+
+	path_table->CountRTT++;
+	path_table->SumRTT += rtt_sample_msec;
+
+	rto = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	if (rto > path_table->MaxRTO)
+		path_table->MaxRTO = rto;
+	if (rto < path_table->MinRTO)
+		path_table->MinRTO = rto;
+}
+
+void tcp_estats_update_timeout(struct sock *sk)
+{
+	if (inet_csk(sk)->icsk_backoff)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, SubsequentTimeouts);
+	else
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), perf_table, Timeouts);
+
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, AbruptTimeouts);
+}
+
+void tcp_estats_update_mss(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_stack_table *stack_table = stats->tables.stack_table;
+	int mss = tp->mss_cache;
+
+	if (stack_table == NULL)
+		return;
+
+	if (mss > stack_table->MaxMSS)
+		stack_table->MaxMSS = mss;
+	if (mss < stack_table->MinMSS)
+		stack_table->MinMSS = mss;
+}
+
+void tcp_estats_update_finish_segrecv(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_tables *tables = &stats->tables;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	u32 mss = tp->mss_cache;
+	u32 cwnd;
+	u32 ssthresh;
+	u32 pipe_size;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (stack_table != NULL) {
+		cwnd = tp->snd_cwnd * mss;
+		if (tp->snd_cwnd <= tp->snd_ssthresh) {
+			if (cwnd > stack_table->MaxSsCwnd)
+				stack_table->MaxSsCwnd = cwnd;
+		} else if (cwnd > stack_table->MaxCaCwnd) {
+			stack_table->MaxCaCwnd = cwnd;
+		}
+	}
+
+	if (perf_table != NULL) {
+		pipe_size = tcp_packets_in_flight(tp) * mss;
+		if (pipe_size > perf_table->MaxPipeSize)
+			perf_table->MaxPipeSize = pipe_size;
+	}
+
+	/* Discard initial ssthresh set at infinity. */
+	if (tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH) {
+		return;
+	}
+
+	if (stack_table != NULL) {
+		ssthresh = tp->snd_ssthresh * tp->mss_cache;
+		if (ssthresh > stack_table->MaxSsthresh)
+			stack_table->MaxSsthresh = ssthresh;
+		if (ssthresh < stack_table->MinSsthresh)
+			stack_table->MinSsthresh = ssthresh;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_finish_segrecv);
+
+void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->snd_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinRcvd)
+		perf_table->MaxRwinRcvd = win;
+	if (win == 0)
+		perf_table->ZeroRwinRcvd++;
+}
+
+void tcp_estats_update_rwin_sent(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->rcv_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinSent)
+		perf_table->MaxRwinSent = win;
+	if (win == 0)
+		perf_table->ZeroRwinSent++;
+}
+
+void tcp_estats_update_sndlim(struct tcp_sock *tp,
+			      enum tcp_estats_sndlim_states state)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	ktime_t now;
+
+	if (state <= TCP_ESTATS_SNDLIM_NONE ||
+	    state >= TCP_ESTATS_SNDLIM_NSTATES) {
+		pr_err("tcp_estats_update_sndlim: BUG: state out of range %d\n",
+		       state);
+		return;
+	}
+
+	if (perf_table == NULL)
+		return;
+
+	now = ktime_get();
+	perf_table->snd_lim_time[stats->limstate]
+	    += ktime_to_us(ktime_sub(now, stats->limstate_ts));
+	stats->limstate_ts = now;
+	if (stats->limstate != state) {
+		stats->limstate = state;
+		perf_table->snd_lim_trans[state]++;
+	}
+}
+
+void tcp_estats_update_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	TCP_ESTATS_VAR_INC(tp, perf_table, CongSignals);
+
+	if (path_table != NULL) {
+		path_table->PreCongSumCwnd += tp->snd_cwnd * tp->mss_cache;
+		path_table->PreCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_post_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	if (path_table != NULL) {
+		path_table->PostCongCountRTT++;
+		path_table->PostCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_segsend(struct sock *sk, int pcount,
+			       u32 seq, u32 end_seq, int flags)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	struct tcp_estats_app_table *app_table = stats->tables.app_table;
+
+	int data_len = end_seq - seq;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (perf_table == NULL)
+		return;
+
+	/* We know we're sending a segment. */
+	perf_table->SegsOut += pcount;
+
+	/* A pure ACK contains no data; everything else is data. */
+	if (data_len > 0) {
+		perf_table->DataSegsOut += pcount;
+		perf_table->DataOctetsOut += data_len;
+	}
+
+	/* Check for retransmission. */
+	if (flags & TCPHDR_SYN) {
+		if (inet_csk(sk)->icsk_retransmits)
+			perf_table->SegsRetrans++;
+	} else if (app_table != NULL &&
+		   before(seq, app_table->SndMax)) {
+		perf_table->SegsRetrans += pcount;
+		perf_table->OctetsRetrans += data_len;
+	}
+}
+
+void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_path_table *path_table = tables->path_table;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	struct tcphdr *th = tcp_hdr(skb);
+	struct iphdr *iph = ip_hdr(skb);
+
+	if (perf_table != NULL)
+		perf_table->SegsIn++;
+
+	if (skb->len == th->doff * 4) {
+		if (stack_table != NULL &&
+		    TCP_SKB_CB(skb)->ack_seq == tp->snd_una)
+			stack_table->DupAcksIn++;
+	} else {
+		if (perf_table != NULL) {
+			perf_table->DataSegsIn++;
+			perf_table->DataOctetsIn += skb->len - th->doff * 4;
+		}
+	}
+
+	if (path_table != NULL) {
+		path_table->IpTtl = iph->ttl;
+		path_table->IpTosIn = iph->tos;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_segrecv);
+
+void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq)
+{
+	/* After much debate, it was decided that "seq - rcv_nxt" is
+	 * indeed what we want, as opposed to what Krishnan suggested
+	 * to better match the RFC: "seq - tp->rcv_wup" */
+	TCP_ESTATS_VAR_ADD(tp, app_table, ThruOctetsReceived,
+			   seq - tp->rcv_nxt);
+}
+
+void tcp_estats_update_writeq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_app_table *app_table =
+			tp->tcp_stats->tables.app_table;
+	int len;
+
+	if (app_table == NULL)
+		return;
+
+	len = tp->write_seq - app_table->SndMax;
+
+	if (len > app_table->MaxAppWQueue)
+		app_table->MaxAppWQueue = len;
+}
+
+static inline u32 ofo_qlen(struct tcp_sock *tp)
+{
+	if (!skb_peek(&tp->out_of_order_queue))
+		return 0;
+	else
+		return TCP_SKB_CB(tp->out_of_order_queue.prev)->end_seq -
+		    TCP_SKB_CB(tp->out_of_order_queue.next)->seq;
+}
+
+void tcp_estats_update_recvq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_app_table *app_table = tables->app_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+
+	if (app_table != NULL) {
+		u32 len = tp->rcv_nxt - tp->copied_seq;
+		if (app_table->MaxAppRQueue < len)
+			app_table->MaxAppRQueue = len;
+	}
+
+	if (stack_table != NULL) {
+		u32 len = ofo_qlen(tp);
+		if (stack_table->MaxReasmQueue < len)
+			stack_table->MaxReasmQueue = len;
+	}
+}
+
+/*
+ * Manage connection ID table
+ */
+
+static int get_new_cid(struct tcp_estats *stats)
+{
+	int id_cid;
+
+again:
+	spin_lock_bh(&tcp_estats_idr_lock);
+	/* GFP_ATOMIC: we must not sleep while holding the spinlock. */
+	id_cid = idr_alloc(&tcp_estats_idr, stats, next_id, 0, GFP_ATOMIC);
+	if (unlikely(id_cid == -ENOSPC)) {
+		spin_unlock_bh(&tcp_estats_idr_lock);
+		goto again;
+	}
+	if (unlikely(id_cid == -ENOMEM)) {
+		spin_unlock_bh(&tcp_estats_idr_lock);
+		return -ENOMEM;
+	}
+	next_id = (id_cid + 1) % ESTATS_MAX_CID;
+	stats->tcpe_cid = id_cid;
+	spin_unlock_bh(&tcp_estats_idr_lock);
+	return 0;
+}
+
+static void create_func(struct work_struct *work)
+{
+	/* stub for netlink notification of new connections */
+	;
+}
+
+static void establish_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						establish_notify);
+	int err = 0;
+
+	if ((stats->tcpe_cid) > 0) {
+		pr_err("TCP estats container established multiple times.\n");
+		return;
+	}
+
+	if ((stats->tcpe_cid) == 0) {
+		err = get_new_cid(stats);
+		if (err)
+			pr_devel("get_new_cid error %d\n", err);
+	}
+}
+
+static void destroy_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						destroy_notify.work);
+
+	int id_cid = stats->tcpe_cid;
+
+	if (id_cid == 0)
+		pr_devel("TCP estats destroyed before being established.\n");
+
+	if (id_cid >= 0) {
+		if (id_cid) {
+			spin_lock_bh(&tcp_estats_idr_lock);
+			idr_remove(&tcp_estats_idr, id_cid);
+			spin_unlock_bh(&tcp_estats_idr_lock);
+		}
+		stats->tcpe_cid = -1;
+
+		tcp_estats_unuse(stats);
+	}
+}
+
+void __init tcp_estats_init(void)
+{
+	idr_init(&tcp_estats_idr);
+
+	create_notify_func = &create_func;
+	establish_notify_func = &establish_func;
+	destroy_notify_func = &destroy_func;
+
+	persist_delay = TCP_ESTATS_PERSIST_DELAY_SECS * HZ;
+
+	tcp_estats_wq = alloc_workqueue("tcp_estats", WQ_MEM_RECLAIM, 256);
+	if (tcp_estats_wq == NULL) {
+		pr_err("tcp_estats_init(): alloc_workqueue failed\n");
+		goto cleanup_fail;
+	}
+
+	tcp_estats_wq_enabled = 1;
+	return;
+
+cleanup_fail:
+	pr_err("TCP ESTATS: initialization failed.\n");
+}
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 58469ff..5facb4c 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -251,6 +251,7 @@ static void htcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
  			tp->snd_cwnd_cnt += ca->pkts_acked;
  
  		ca->pkts_acked = 1;
+		TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	}
  }
  
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..8f0601b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -77,8 +77,10 @@
  #include <linux/errqueue.h>
  
  int sysctl_tcp_timestamps __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_timestamps);
  int sysctl_tcp_window_scaling __read_mostly = 1;
  int sysctl_tcp_sack __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_sack);
  int sysctl_tcp_fack __read_mostly = 1;
  int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
  int sysctl_tcp_max_reordering __read_mostly = 300;
@@ -231,13 +233,15 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
  			tcp_enter_quickack_mode((struct sock *)tp);
  		break;
  	case INET_ECN_CE:
+		TCP_ESTATS_VAR_INC(tp, path_table, CERcvd);
  		if (tcp_ca_needs_ecn((struct sock *)tp))
  			tcp_ca_event((struct sock *)tp, CA_EVENT_ECN_IS_CE);
-
  		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
  			/* Better not delay acks, sender can have a very low cwnd */
  			tcp_enter_quickack_mode((struct sock *)tp);
  			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+		} else {
+			TCP_ESTATS_VAR_INC(tp, path_table, ECESent);
  		}
  		tp->ecn_flags |= TCP_ECN_SEEN;
  		break;
@@ -1104,6 +1108,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  		dup_sack = true;
  		tcp_dsack_seen(tp);
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDSACKRECV);
+		TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  	} else if (num_sacks > 1) {
  		u32 end_seq_1 = get_unaligned_be32(&sp[1].end_seq);
  		u32 start_seq_1 = get_unaligned_be32(&sp[1].start_seq);
@@ -1114,6 +1119,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  			tcp_dsack_seen(tp);
  			NET_INC_STATS_BH(sock_net(sk),
  					LINUX_MIB_TCPDSACKOFORECV);
+			TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  		}
  	}
  
@@ -1653,6 +1659,9 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
  	state.reord = tp->packets_out;
  	state.rtt_us = -1L;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SACKsRcvd);
+	TCP_ESTATS_VAR_ADD(tp, stack_table, SACKBlocksRcvd, num_sacks);
+
  	if (!tp->sacked_out) {
  		if (WARN_ON(tp->fackets_out))
  			tp->fackets_out = 0;
@@ -1928,6 +1937,8 @@ void tcp_enter_loss(struct sock *sk)
  	bool new_recovery = false;
  	bool is_reneg;			/* is receiver reneging on SACKs? */
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+
  	/* Reduce ssthresh if it has not yet been made inside this window. */
  	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
  	    !after(tp->high_seq, tp->snd_una) ||
@@ -2200,8 +2211,12 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
  	 */
  	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
  	    (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
-	    !tcp_may_send_now(sk))
-		return !tcp_pause_early_retransmit(sk, flag);
+	    !tcp_may_send_now(sk)) {
+		int early_retrans = !tcp_pause_early_retransmit(sk, flag);
+		if (early_retrans)
+			TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetrans);
+		return early_retrans;
+	}
  
  	return false;
  }
@@ -2299,9 +2314,15 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
   */
  static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
  {
-	tp->snd_cwnd = min(tp->snd_cwnd,
-			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
-	tp->snd_cwnd_stamp = tcp_time_stamp;
+	u32 pkts = tcp_packets_in_flight(tp) + tcp_max_burst(tp);
+
+	if (pkts < tp->snd_cwnd) {
+		tp->snd_cwnd = pkts;
+		tp->snd_cwnd_stamp = tcp_time_stamp;
+
+		TCP_ESTATS_VAR_INC(tp, stack_table, OtherReductions);
+		TCP_ESTATS_VAR_INC(tp, extras_table, OtherReductionsCM);
+	}
  }
  
  /* Nothing was retransmitted or returned timestamp is less
@@ -2402,6 +2423,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
  		if (tp->prior_ssthresh > tp->snd_ssthresh) {
  			tp->snd_ssthresh = tp->prior_ssthresh;
  			tcp_ecn_withdraw_cwr(tp);
+			TCP_ESTATS_VAR_INC(tp, stack_table, CongOverCount);
  		}
  	} else {
  		tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh);
@@ -2428,10 +2450,15 @@ static bool tcp_try_undo_recovery(struct sock *sk)
  		 */
  		DBGUNDO(sk, inet_csk(sk)->icsk_ca_state == TCP_CA_Loss ? "loss" : "retrans");
  		tcp_undo_cwnd_reduction(sk, false);
-		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss)
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
  			mib_idx = LINUX_MIB_TCPLOSSUNDO;
-		else
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		} else {
  			mib_idx = LINUX_MIB_TCPFULLUNDO;
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousFrDetected);
+		}
  
  		NET_INC_STATS_BH(sock_net(sk), mib_idx);
  	}
@@ -2472,9 +2499,12 @@ static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
  
  		DBGUNDO(sk, "partial loss");
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
-		if (frto_undo)
+		if (frto_undo) {
  			NET_INC_STATS_BH(sock_net(sk),
  					 LINUX_MIB_TCPSPURIOUSRTOS);
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		}
  		inet_csk(sk)->icsk_retransmits = 0;
  		if (frto_undo || tcp_is_sack(tp))
  			tcp_set_ca_state(sk, TCP_CA_Open);
@@ -2555,6 +2585,7 @@ void tcp_enter_cwr(struct sock *sk)
  		tcp_init_cwnd_reduction(sk);
  		tcp_set_ca_state(sk, TCP_CA_CWR);
  	}
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
  }
  
  static void tcp_try_keep_open(struct sock *sk)
@@ -2580,8 +2611,10 @@ static void tcp_try_to_open(struct sock *sk, int flag, const int prior_unsacked)
  	if (!tcp_any_retrans_done(sk))
  		tp->retrans_stamp = 0;
  
-	if (flag & FLAG_ECE)
+	if (flag & FLAG_ECE) {
  		tcp_enter_cwr(sk);
+		TCP_ESTATS_VAR_INC(tp, path_table, ECNsignals);
+	}
  
  	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
  		tcp_try_keep_open(sk);
@@ -2826,6 +2859,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  			}
  			break;
  
+		case TCP_CA_Disorder:
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDAEpisodes);
+			break;
+
  		case TCP_CA_Recovery:
  			if (tcp_is_reno(tp))
  				tcp_reset_reno_sack(tp);
@@ -2870,6 +2907,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
  			tcp_try_undo_dsack(sk);
  
+
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDA);
+
  		if (!tcp_time_to_recover(sk, flag)) {
  			tcp_try_to_open(sk, flag, prior_unsacked);
  			return;
@@ -2889,6 +2930,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		/* Otherwise enter Recovery state */
  		tcp_enter_recovery(sk, (flag & FLAG_ECE));
  		fast_rexmit = 1;
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+		TCP_ESTATS_VAR_INC(tp, stack_table, FastRetran);
  	}
  
  	if (do_lost)
@@ -2928,6 +2971,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
  
  	tcp_rtt_estimator(sk, seq_rtt_us);
  	tcp_set_rto(sk);
+	TCP_ESTATS_UPDATE(tcp_sk(sk), tcp_estats_update_rtt(sk, seq_rtt_us));
  
  	/* RFC6298: only reset backoff on valid RTT measurement. */
  	inet_csk(sk)->icsk_backoff = 0;
@@ -3007,6 +3051,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
  	if (!tp->do_early_retrans)
  		return;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetransDelay);
  	tcp_enter_recovery(sk, false);
  	tcp_update_scoreboard(sk, 1);
  	tcp_xmit_retransmit_queue(sk);
@@ -3310,9 +3355,11 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
  				tp->max_window = nwin;
  				tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
  			}
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_rcvd(tp));
  		}
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  	tp->snd_una = ack;
  
  	return flag;
@@ -3410,6 +3457,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	int prior_packets = tp->packets_out;
  	const int prior_unsacked = tp->packets_out - tp->sacked_out;
  	int acked = 0; /* Number of packets newly acked */
+	int prior_state = icsk->icsk_ca_state;
  	long sack_rtt_us = -1L;
  
  	/* We very likely will need to access write queue head. */
@@ -3419,6 +3467,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	 * then we can probably ignore it.
  	 */
  	if (before(ack, prior_snd_una)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW);
  		/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
  		if (before(ack, prior_snd_una - tp->max_window)) {
  			tcp_send_challenge_ack(sk);
@@ -3430,8 +3481,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	/* If the ack includes data we haven't sent yet, discard
  	 * this segment (RFC793 Section 3.9).
  	 */
-	if (after(ack, tp->snd_nxt))
+	if (after(ack, tp->snd_nxt)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW);
  		goto invalid_ack;
+	}
  
  	if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
  	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
@@ -3439,6 +3494,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  
  	if (after(ack, prior_snd_una)) {
  		flag |= FLAG_SND_UNA_ADVANCED;
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_ADD(tp, path_table, SumOctetsReordered,
+					   ack - prior_snd_una);
  		icsk->icsk_retransmits = 0;
  	}
  
@@ -3456,6 +3514,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		 * Note, we use the fact that SND.UNA>=SND.WL2.
  		 */
  		tcp_update_wl(tp, ack_seq);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  		tp->snd_una = ack;
  		flag |= FLAG_WIN_UPDATE;
  
@@ -3510,6 +3569,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
  		tcp_fastretrans_alert(sk, acked, prior_unsacked,
  				      is_dupack, flag);
+		if (icsk->icsk_ca_state == TCP_CA_Open &&
+		    prior_state >= TCP_CA_CWR)
+			TCP_ESTATS_UPDATE(tp,
+				tcp_estats_update_post_congestion(tp));
  	}
  	if (tp->tlp_high_seq)
  		tcp_process_tlp_ack(sk, ack, flag);
@@ -4177,7 +4240,9 @@ static void tcp_ofo_queue(struct sock *sk)
  
  		tail = skb_peek_tail(&sk->sk_receive_queue);
  		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, tp->rcv_nxt));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+
  		if (!eaten)
  			__skb_queue_tail(&sk->sk_receive_queue, skb);
  		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
@@ -4232,6 +4297,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
  		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+	TCP_ESTATS_VAR_INC(tp, path_table, DupAcksOut);
+
  	skb1 = skb_peek_tail(&tp->out_of_order_queue);
  	if (!skb1) {
  		/* Initial out of order segment, build 1 SACK. */
@@ -4242,6 +4310,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  						TCP_SKB_CB(skb)->end_seq;
  		}
  		__skb_queue_head(&tp->out_of_order_queue, skb);
+		TCP_ESTATS_VAR_INC(tp, path_table, DupAckEpisodes);
  		goto end;
  	}
  
@@ -4438,6 +4507,9 @@ queue_and_out:
  
  			eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
  		}
+		TCP_ESTATS_UPDATE(
+			tp,
+			tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  		if (skb->len)
  			tcp_event_data_recv(sk, skb);
@@ -4459,6 +4531,8 @@ queue_and_out:
  
  		tcp_fast_path_check(sk);
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		if (eaten > 0)
  			kfree_skb_partial(skb, fragstolen);
  		if (!sock_flag(sk, SOCK_DEAD))
@@ -4990,6 +5064,9 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  	    tcp_paws_discard(sk, skb)) {
  		if (!th->rst) {
  			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
+			TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+			TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+					   TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW);
  			tcp_send_dupack(sk, skb);
  			goto discard;
  		}
@@ -5004,6 +5081,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  		 * an acknowledgment should be sent in reply (unless the RST
  		 * bit is set, if so drop the segment and return)".
  		 */
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			before(TCP_SKB_CB(skb)->end_seq, tp->rcv_wup) ?
+				TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW :
+				TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW);
  		if (!th->rst) {
  			if (th->syn)
  				goto syn_challenge;
@@ -5152,6 +5234,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				return;
  			} else { /* Header too small */
  				TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+				TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+				TCP_ESTATS_VAR_SET(tp, stack_table,
+						   SoftErrorReason,
+						   TCP_ESTATS_SOFTERROR_OTHER);
  				goto discard;
  			}
  		} else {
@@ -5178,6 +5264,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  					tcp_rcv_rtt_measure_ts(sk, skb);
  
  					__skb_pull(skb, tcp_header_len);
+					TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  					tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  					NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
  					eaten = 1;
@@ -5204,10 +5291,12 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
  
  				/* Bulk data transfer: receiver */
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  				eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
  						      &fragstolen);
  			}
  
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
  			tcp_event_data_recv(sk, skb);
  
  			if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
@@ -5260,6 +5349,9 @@ step5:
  csum_error:
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_CSUMERRORS);
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			   TCP_ESTATS_SOFTERROR_DATA_CHECKSUM);
  
  discard:
  	__kfree_skb(skb);
@@ -5459,6 +5551,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  
  		tcp_finish_connect(sk, skb);
+		tcp_estats_establish(sk);
  
  		if ((tp->syn_fastopen || tp->syn_data) &&
  		    tcp_rcv_fastopen_synack(sk, skb, &foc))
@@ -5685,6 +5778,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  		tcp_set_state(sk, TCP_ESTABLISHED);
  		sk->sk_state_change(sk);
+		tcp_estats_establish(sk);
  
  		/* Note, that this wakeup is only for marginal crossed SYN case.
  		 * Passively open sockets are not waked up, because
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f72d7..9c85a54 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1310,6 +1310,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (!newsk)
  		goto exit_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_INACTIVE);
+
  	newsk->sk_gso_type = SKB_GSO_TCPV4;
  	inet_sk_rx_dst_set(newsk, skb);
  
@@ -1670,6 +1672,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1680,6 +1684,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1809,6 +1815,8 @@ static int tcp_v4_init_sock(struct sock *sk)
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
  #endif
  
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_ACTIVE);
+
  	return 0;
  }
  
@@ -1842,6 +1850,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
  	if (inet_csk(sk)->icsk_bind_hash)
  		inet_put_port(sk);
  
+	tcp_estats_destroy(sk);
+
  	BUG_ON(tp->fastopen_rsk != NULL);
  
  	/* If socket is aborted during connect operation */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..145b4f2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -80,6 +80,7 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
  
  	tcp_advance_send_head(sk, skb);
  	tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
  
  	tp->packets_out += tcp_skb_pcount(skb);
  	if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
@@ -292,6 +293,7 @@ static u16 tcp_select_window(struct sock *sk)
  	}
  	tp->rcv_wnd = new_win;
  	tp->rcv_wup = tp->rcv_nxt;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_sent(tp));
  
  	/* Make sure we do not exceed the maximum possible
  	 * scaled window.
@@ -905,6 +907,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  	struct tcp_md5sig_key *md5;
  	struct tcphdr *th;
  	int err;
+#ifdef CONFIG_TCP_ESTATS
+	__u32 seq;
+	__u32 end_seq;
+	int tcp_flags;
+	int pcount;
+#endif
  
  	BUG_ON(!skb || !tcp_skb_pcount(skb));
  
@@ -1008,6 +1016,15 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
  			      tcp_skb_pcount(skb));
  
+#ifdef CONFIG_TCP_ESTATS
+	/* If the skb isn't cloned, we can't reference it after
+	 * calling queue_xmit, so copy everything we need here. */
+	pcount = tcp_skb_pcount(skb);
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
+	tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
+#endif
+
  	/* OK, its time to fill skb_shinfo(skb)->gso_segs */
  	skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);
  
@@ -1020,10 +1037,17 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  
  	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
  
+	if (likely(!err)) {
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_segsend(sk, pcount,
+								seq, end_seq,
+								tcp_flags));
+	}
+
  	if (likely(err <= 0))
  		return err;
  
  	tcp_enter_cwr(sk);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SendStall);
  
  	return net_xmit_eval(err);
  }
@@ -1398,6 +1422,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
  	if (icsk->icsk_mtup.enabled)
  		mss_now = min(mss_now, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low));
  	tp->mss_cache = mss_now;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_mss(tp));
  
  	return mss_now;
  }
@@ -1670,11 +1695,13 @@ static unsigned int tcp_snd_test(const struct sock *sk, struct sk_buff *skb,
  	tcp_init_tso_segs(sk, skb, cur_mss);
  
  	if (!tcp_nagle_test(tp, skb, cur_mss, nonagle))
-		return 0;
+		return -TCP_ESTATS_SNDLIM_SENDER;
  
  	cwnd_quota = tcp_cwnd_test(tp, skb);
-	if (cwnd_quota && !tcp_snd_wnd_test(tp, skb, cur_mss))
-		cwnd_quota = 0;
+	if (!cwnd_quota)
+		return -TCP_ESTATS_SNDLIM_CWND;
+	if (!tcp_snd_wnd_test(tp, skb, cur_mss))
+		return -TCP_ESTATS_SNDLIM_RWIN;
  
  	return cwnd_quota;
  }
@@ -1688,7 +1715,7 @@ bool tcp_may_send_now(struct sock *sk)
  	return skb &&
  		tcp_snd_test(sk, skb, tcp_current_mss(sk),
  			     (tcp_skb_is_last(sk, skb) ?
-			      tp->nonagle : TCP_NAGLE_PUSH));
+			      tp->nonagle : TCP_NAGLE_PUSH)) > 0;
  }
  
  /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet
@@ -1978,6 +2005,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  	unsigned int tso_segs, sent_pkts;
  	int cwnd_quota;
  	int result;
+	int why = TCP_ESTATS_SNDLIM_SENDER;
  	bool is_cwnd_limited = false;
  	u32 max_segs;
  
@@ -2008,6 +2036,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		cwnd_quota = tcp_cwnd_test(tp, skb);
  		if (!cwnd_quota) {
+			why = TCP_ESTATS_SNDLIM_CWND;
  			is_cwnd_limited = true;
  			if (push_one == 2)
  				/* Force out a loss probe pkt. */
@@ -2016,19 +2045,24 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  				break;
  		}
  
-		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
+		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
+			why = TCP_ESTATS_SNDLIM_RWIN;
  			break;
-
+		}
+
  		if (tso_segs == 1) {
  			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
  						     (tcp_skb_is_last(sk, skb) ?
  						      nonagle : TCP_NAGLE_PUSH))))
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		} else {
  			if (!push_one &&
  			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
-						 max_segs))
+						 max_segs)) {
+				why = TCP_ESTATS_SNDLIM_TSODEFER;
  				break;
+			}
  		}
  
  		limit = mss_now;
@@ -2041,6 +2075,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		if (skb->len > limit &&
  		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  		/* TCP Small Queues :
@@ -2064,10 +2099,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  			 */
  			smp_mb__after_atomic();
  			if (atomic_read(&sk->sk_wmem_alloc) > limit)
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		}
  
  		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  repair:
@@ -2080,9 +2117,12 @@ repair:
  		sent_pkts += tcp_skb_pcount(skb);
  
  		if (push_one)
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_sndlim(tp, why));
+
  	if (likely(sent_pkts)) {
  		if (tcp_in_cwnd_reduction(sk))
  			tp->prr_out += sent_pkts;
@@ -3148,11 +3188,16 @@ int tcp_connect(struct sock *sk)
  	 */
  	tp->snd_nxt = tp->write_seq;
  	tp->pushed_seq = tp->write_seq;
-	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
  
  	/* Timer for repeating the SYN until an answer. */
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
  				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->write_seq);
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->write_seq);
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
+	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
+
  	return 0;
  }
  EXPORT_SYMBOL(tcp_connect);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 1829c7f..0f6f1f4 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -477,6 +477,9 @@ out_reset_timer:
  		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
  	}
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_timeout(sk));
+
  	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
  		__sk_dst_reset(sk);
  
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5ff8780..db1f88f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1131,6 +1131,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (newsk == NULL)
  		goto out_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_INACTIVE);
+
  	/*
  	 * No need to charge this sock to the relevant IPv6 refcnt debug socks
  	 * count here, tcp_create_openreq_child now does this for us, see the
@@ -1463,6 +1465,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1473,6 +1477,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1661,6 +1667,7 @@ static int tcp_v6_init_sock(struct sock *sk)
  #ifdef CONFIG_TCP_MD5SIG
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
  #endif
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_ACTIVE);
  
  	return 0;
  }
-- 
1.9.3


* [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:50 UTC (permalink / raw)
  To: netdev

This set of patches provides control and management routines for the
kernel instrument set (KIS). It can be applied against net-next
independently of the KIS. While the kernel can be patched, compiled,
and run with this patch set, it provides no real functionality without
the KIS implementation.

The two are kept separate because the development team is primarily
focused on ensuring that the KIS is taken up by the community.
Alternative control and management methods can be developed and
implemented as long as the KIS is in the kernel.

In order for this patch set to compile on its own, we have included two
files/patches that were previously introduced in the KIS implementation.
These are include/net/tcp_estats.h and include/linux/tcp.h. If patching
against a source tree that includes the KIS implementation,
net/ipv4/[tcp_estats.c, sysctl_net_ipv4.c, Kconfig, Makefile] are required.
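
As an illustration only: the instrumentation hooks do nothing unless the
feature is compiled in, enabled at run time, and the per-connection
tables exist. A userland sketch of the guard used by the TCP_ESTATS_*
macros in include/net/tcp_estats.h might look like the following. The
names struct conn, estats_enabled, ESTATS_VAR_INC, SKETCH_NO_ESTATS and
count_soft_error are hypothetical stand-ins, and a plain int replaces
the kernel's static key; this is not the code in the patch.

```c
#include <stddef.h>

/* Hypothetical userland stand-ins for the kernel structures. */
struct stack_table {
	unsigned int SoftErrors;
};

struct estats {
	struct stack_table *stack_table;	/* may be NULL if not allocated */
};

struct conn {
	struct estats *stats;			/* may be NULL if stats are off */
};

/* Assumption: a plain int stands in for the kernel's static key. */
static int estats_enabled;

#ifdef SKETCH_NO_ESTATS
/* Compiled out: the hook expands to an empty statement. */
#define ESTATS_VAR_INC(c, table, var)	do {} while (0)
#else
/* Compiled in: count only if enabled and both pointers are set. */
#define ESTATS_VAR_INC(c, table, var)					\
	do {								\
		if (estats_enabled && (c)->stats &&			\
		    (c)->stats->table)					\
			++((c)->stats->table->var);			\
	} while (0)
#endif

/* Example hook site: record one soft error for a connection. */
static void count_soft_error(struct conn *c)
{
	ESTATS_VAR_INC(c, stack_table, SoftErrors);
}
```

Calling count_soft_error() on a connection whose stats pointer is NULL,
or while estats_enabled is 0, leaves every counter untouched, which is
the property the hook sites in the TCP code rely on.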

---
  include/linux/tcp.h        |   8 +
  include/net/tcp_estats.h   | 376 +++++++++++++++++++++++
  net/ipv4/Kconfig           |  25 ++
  net/ipv4/Makefile          |   1 +
  net/ipv4/sysctl_net_ipv4.c |  14 +
  net/ipv4/tcp_estats.c      | 736 +++++++++++++++++++++++++++++++++++++++++++++
  6 files changed, 1160 insertions(+)
  create mode 100644 include/net/tcp_estats.h
  create mode 100644 net/ipv4/tcp_estats.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8758360 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -126,6 +126,10 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
  	return (struct tcp_request_sock *)req;
  }
  
+#ifdef CONFIG_TCP_ESTATS
+struct tcp_estats;
+#endif
+
  struct tcp_sock {
  	/* inet_connection_sock has to be the first member of tcp_sock */
  	struct inet_connection_sock	inet_conn;
@@ -309,6 +313,10 @@ struct tcp_sock {
  	struct tcp_md5sig_info	__rcu *md5sig_info;
  #endif
  
+#ifdef CONFIG_TCP_ESTATS
+	struct tcp_estats	*tcp_stats;
+#endif
+
  /* TCP fastopen related information */
  	struct tcp_fastopen_request *fastopen_req;
  	/* fastopen_rsk points to request_sock that resulted in this big
diff --git a/include/net/tcp_estats.h b/include/net/tcp_estats.h
new file mode 100644
index 0000000..ff6000e
--- /dev/null
+++ b/include/net/tcp_estats.h
@@ -0,0 +1,376 @@
+/*
+ * include/net/tcp_estats.h
+ *
+ * Implementation of TCP Extended Statistics MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominin Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _TCP_ESTATS_H
+#define _TCP_ESTATS_H
+
+#include <net/sock.h>
+#include <linux/idr.h>
+#include <linux/in.h>
+#include <linux/jump_label.h>
+#include <linux/spinlock.h>
+#include <linux/tcp.h>
+#include <linux/workqueue.h>
+
+/* defines number of seconds that stats persist after connection ends */
+#define TCP_ESTATS_PERSIST_DELAY_SECS 5
+
+enum tcp_estats_sndlim_states {
+	TCP_ESTATS_SNDLIM_NONE = -1,
+	TCP_ESTATS_SNDLIM_SENDER,
+	TCP_ESTATS_SNDLIM_CWND,
+	TCP_ESTATS_SNDLIM_RWIN,
+	TCP_ESTATS_SNDLIM_STARTUP,
+	TCP_ESTATS_SNDLIM_TSODEFER,
+	TCP_ESTATS_SNDLIM_PACE,
+	TCP_ESTATS_SNDLIM_NSTATES	/* Keep at end */
+};
+
+enum tcp_estats_addrtype {
+	TCP_ESTATS_ADDRTYPE_IPV4 = 1,
+	TCP_ESTATS_ADDRTYPE_IPV6 = 2
+};
+
+enum tcp_estats_softerror_reason {
+	TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW = 1,
+	TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW = 2,
+	TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW = 3,
+	TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW = 4,
+	TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW = 5,
+	TCP_ESTATS_SOFTERROR_ABOVE_TS_WINDOW = 6,
+	TCP_ESTATS_SOFTERROR_DATA_CHECKSUM = 7,
+	TCP_ESTATS_SOFTERROR_OTHER = 8,
+};
+
+#define TCP_ESTATS_INACTIVE	2
+#define TCP_ESTATS_ACTIVE	1
+
+#define TCP_ESTATS_TABLEMASK_INACTIVE	0x00
+#define TCP_ESTATS_TABLEMASK_ACTIVE	0x01
+#define TCP_ESTATS_TABLEMASK_PERF	0x02
+#define TCP_ESTATS_TABLEMASK_PATH	0x04
+#define TCP_ESTATS_TABLEMASK_STACK	0x08
+#define TCP_ESTATS_TABLEMASK_APP	0x10
+#define TCP_ESTATS_TABLEMASK_EXTRAS	0x40
+
+#ifdef CONFIG_TCP_ESTATS
+
+extern struct static_key tcp_estats_enabled;
+
+#define TCP_ESTATS_CHECK(tp, table, expr)				\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats) &&			\
+			    likely((tp)->tcp_stats->tables.table)) {	\
+				(expr);					\
+			}						\
+		}							\
+	} while (0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, ++((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_DEC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, --((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) += (val))
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) = (val))
+#define TCP_ESTATS_UPDATE(tp, func)					\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats)) {			\
+				(func);					\
+			}						\
+		}							\
+	} while (0)
+
+/*
+ * Variables that can be read and written directly.
+ *
+ * Contains all variables from RFC 4898. Commented fields are
+ * either not implemented (only StartTimeStamp
+ * remains unimplemented in this release) or have
+ * handlers and do not need struct storage.
+ */
+struct tcp_estats_connection_table {
+	u32			AddressType;
+	union { struct in_addr addr; struct in6_addr addr6; }	LocalAddress;
+	union { struct in_addr addr; struct in6_addr addr6; }	RemAddress;
+	u16			LocalPort;
+	u16			RemPort;
+};
+
+struct tcp_estats_perf_table {
+	u32		SegsOut;
+	u32		DataSegsOut;
+	u64		DataOctetsOut;
+	u32		SegsRetrans;
+	u32		OctetsRetrans;
+	u32		SegsIn;
+	u32		DataSegsIn;
+	u64		DataOctetsIn;
+	/*		ElapsedSecs */
+	/*		ElapsedMicroSecs */
+	/*		StartTimeStamp */
+	/*		CurMSS */
+	/*		PipeSize */
+	u32		MaxPipeSize;
+	/*		SmoothedRTT */
+	/*		CurRTO */
+	u32		CongSignals;
+	/*		CurCwnd */
+	/*		CurSsthresh */
+	u32		Timeouts;
+	/*		CurRwinSent */
+	u32		MaxRwinSent;
+	u32		ZeroRwinSent;
+	/*		CurRwinRcvd */
+	u32		MaxRwinRcvd;
+	u32		ZeroRwinRcvd;
+	/*		SndLimTransRwin */
+	/*		SndLimTransCwnd */
+	/*		SndLimTransSnd */
+	/*		SndLimTimeRwin */
+	/*		SndLimTimeCwnd */
+	/*		SndLimTimeSnd */
+	u32		snd_lim_trans[TCP_ESTATS_SNDLIM_NSTATES];
+	u32		snd_lim_time[TCP_ESTATS_SNDLIM_NSTATES];
+};
+
+struct tcp_estats_path_table {
+	/*		RetranThresh */
+	u32		NonRecovDAEpisodes;
+	u32		SumOctetsReordered;
+	u32		NonRecovDA;
+	u32		SampleRTT;
+	/*		RTTVar */
+	u32		MaxRTT;
+	u32		MinRTT;
+	u64		SumRTT;
+	u32		CountRTT;
+	u32		MaxRTO;
+	u32		MinRTO;
+	u8		IpTtl;
+	u8		IpTosIn;
+	/*		IpTosOut */
+	u32		PreCongSumCwnd;
+	u32		PreCongSumRTT;
+	u32		PostCongSumRTT;
+	u32		PostCongCountRTT;
+	u32		ECNsignals;
+	u32		DupAckEpisodes;
+	/*		RcvRTT */
+	u32		DupAcksOut;
+	u32		CERcvd;
+	u32		ECESent;
+};
+
+struct tcp_estats_stack_table {
+	u32		ActiveOpen;
+	/*		MSSSent */
+	/*		MSSRcvd */
+	/*		WinScaleSent */
+	/*		WinScaleRcvd */
+	/*		TimeStamps */
+	/*		ECN */
+	/*		WillSendSACK */
+	/*		WillUseSACK */
+	/*		State */
+	/*		Nagle */
+	u32		MaxSsCwnd;
+	u32		MaxCaCwnd;
+	u32		MaxSsthresh;
+	u32		MinSsthresh;
+	/*		InRecovery */
+	u32		DupAcksIn;
+	u32		SpuriousFrDetected;
+	u32		SpuriousRtoDetected;
+	u32		SoftErrors;
+	u32		SoftErrorReason;
+	u32		SlowStart;
+	u32		CongAvoid;
+	u32		OtherReductions;
+	u32		CongOverCount;
+	u32		FastRetran;
+	u32		SubsequentTimeouts;
+	/*		CurTimeoutCount */
+	u32		AbruptTimeouts;
+	u32		SACKsRcvd;
+	u32		SACKBlocksRcvd;
+	u32		SendStall;
+	u32		DSACKDups;
+	u32		MaxMSS;
+	u32		MinMSS;
+	u32		SndInitial;
+	u32		RecInitial;
+	/*		CurRetxQueue */
+	/*		MaxRetxQueue */
+	/*		CurReasmQueue */
+	u32		MaxReasmQueue;
+	u32		EarlyRetrans;
+	u32		EarlyRetransDelay;
+};
+
+struct tcp_estats_app_table {
+	/*		SndUna */
+	/*		SndNxt */
+	u32		SndMax;
+	u64		ThruOctetsAcked;
+	/*		RcvNxt */
+	u64		ThruOctetsReceived;
+	/*		CurAppWQueue */
+	u32		MaxAppWQueue;
+	/*		CurAppRQueue */
+	u32		MaxAppRQueue;
+};
+
+/*
+ * Currently, no backing store is needed for tuning elements in
+ * web10g - they are all read or written to directly in other
+ * data structures (such as the socket).
+ */
+
+struct tcp_estats_extras_table {
+	/*		OtherReductionsCV */
+	u32		OtherReductionsCM;
+	u32		Priority;
+};
+
+struct tcp_estats_tables {
+	struct tcp_estats_connection_table	*connection_table;
+	struct tcp_estats_perf_table		*perf_table;
+	struct tcp_estats_path_table		*path_table;
+	struct tcp_estats_stack_table		*stack_table;
+	struct tcp_estats_app_table		*app_table;
+	struct tcp_estats_extras_table		*extras_table;
+};
+
+struct tcp_estats {
+	int				tcpe_cid; /* idr map id */
+
+	struct sock			*sk;
+	kuid_t				uid;
+	kgid_t				gid;
+	int				ids;
+
+	atomic_t			users;
+
+	enum tcp_estats_sndlim_states	limstate;
+	ktime_t				limstate_ts;
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	ktime_t				start_ts;
+	ktime_t				current_ts;
+#else
+	unsigned long			start_ts;
+	unsigned long			current_ts;
+#endif
+	struct timeval			start_tv;
+
+	int				queued;
+	struct work_struct		create_notify;
+	struct work_struct		establish_notify;
+	struct delayed_work		destroy_notify;
+
+	struct tcp_estats_tables	tables;
+
+	struct rcu_head			rcu;
+};
+
+extern struct idr tcp_estats_idr;
+
+extern int tcp_estats_wq_enabled;
+extern struct workqueue_struct *tcp_estats_wq;
+extern void (*create_notify_func)(struct work_struct *work);
+extern void (*establish_notify_func)(struct work_struct *work);
+extern void (*destroy_notify_func)(struct work_struct *work);
+
+extern unsigned long persist_delay;
+extern spinlock_t tcp_estats_idr_lock;
+
+/* For the TCP code */
+extern int  tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype t,
+			      int active);
+extern void tcp_estats_destroy(struct sock *sk);
+extern void tcp_estats_establish(struct sock *sk);
+extern void tcp_estats_free(struct rcu_head *rcu);
+
+extern void tcp_estats_update_snd_nxt(struct tcp_sock *tp);
+extern void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack);
+extern void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample);
+extern void tcp_estats_update_timeout(struct sock *sk);
+extern void tcp_estats_update_mss(struct tcp_sock *tp);
+extern void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp);
+extern void tcp_estats_update_sndlim(struct tcp_sock *tp,
+				     enum tcp_estats_sndlim_states why);
+extern void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq);
+extern void tcp_estats_update_rwin_sent(struct tcp_sock *tp);
+extern void tcp_estats_update_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_post_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_segsend(struct sock *sk, int pcount,
+				      u32 seq, u32 end_seq, int flags);
+extern void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb);
+extern void tcp_estats_update_finish_segrecv(struct tcp_sock *tp);
+extern void tcp_estats_update_writeq(struct sock *sk);
+extern void tcp_estats_update_recvq(struct sock *sk);
+
+extern void tcp_estats_init(void);
+
+static inline void tcp_estats_use(struct tcp_estats *stats)
+{
+	atomic_inc(&stats->users);
+}
+
+static inline int tcp_estats_use_if_valid(struct tcp_estats *stats)
+{
+	return atomic_inc_not_zero(&stats->users);
+}
+
+static inline void tcp_estats_unuse(struct tcp_estats *stats)
+{
+	if (atomic_dec_and_test(&stats->users)) {
+		sock_put(stats->sk);
+		stats->sk = NULL;
+		call_rcu(&stats->rcu, tcp_estats_free);
+	}
+}
+
+#else /* !CONFIG_TCP_ESTATS */
+
+#define tcp_estats_enabled	(0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_DEC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_UPDATE(tp, func)		do {} while (0)
+
+static inline void tcp_estats_init(void) { }
+static inline void tcp_estats_establish(struct sock *sk) { }
+static inline int tcp_estats_create(struct sock *sk,
+				    enum tcp_estats_addrtype t,
+				    int active) { return 0; }
+static inline void tcp_estats_destroy(struct sock *sk) { }
+
+#endif /* CONFIG_TCP_ESTATS */
+
+#endif /* _TCP_ESTATS_H */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index bd29016..c04ba8f 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -680,3 +680,28 @@ config TCP_MD5SIG
  	  on the Internet.
  
  	  If unsure, say N.
+
+config TCP_ESTATS
+	bool "TCP: Extended TCP statistics (RFC4898) MIB"
+	---help---
+	  RFC 4898 specifies a number of extended statistics for TCP. This
+	  data can be accessed using netlink. See http://www.web10g.org for
+	  more details.
+
+if TCP_ESTATS
+
+config TCP_ESTATS_STRICT_ELAPSEDTIME
+	bool "TCP: ESTATS strict ElapsedSecs/Msecs counters"
+	depends on TCP_ESTATS
+	default n
+	---help---
+	  Elapsed time since beginning of connection.
+	  RFC4898 defines ElapsedSecs/Msecs as being updated via ktime_get
+	  at each protocol event (sending or receiving of a segment);
+	  as this can be a performance hit, leaving this config option off
+	  will update elapsed time based on the jiffies counter instead.
+	  Set to Y for strict conformance with the MIB.
+
+	  If unsure, say N.
+
+endif
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 518c04e..7e2c69a 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_INET_TUNNEL) += tunnel4.o
  obj-$(CONFIG_INET_XFRM_MODE_TRANSPORT) += xfrm4_mode_transport.o
  obj-$(CONFIG_INET_XFRM_MODE_TUNNEL) += xfrm4_mode_tunnel.o
  obj-$(CONFIG_IP_PNP) += ipconfig.o
+obj-$(CONFIG_TCP_ESTATS) += tcp_estats.o
  obj-$(CONFIG_NETFILTER)	+= netfilter.o netfilter/
  obj-$(CONFIG_INET_DIAG) += inet_diag.o
  obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0ee384..edc5a66 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -42,6 +42,11 @@ static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
  static int ip_ping_group_range_min[] = { 0, 0 };
  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
  
+/* Extended statistics (RFC4898). */
+#ifdef CONFIG_TCP_ESTATS
+int sysctl_tcp_estats __read_mostly;
+#endif  /* CONFIG_TCP_ESTATS */
+
  /* Update system visible IP port range */
  static void set_local_port_range(struct net *net, int range[2])
  {
@@ -767,6 +772,15 @@ static struct ctl_table ipv4_table[] = {
  		.proc_handler	= proc_dointvec_minmax,
  		.extra1		= &one
  	},
+#ifdef CONFIG_TCP_ESTATS
+	{
+		.procname	= "tcp_estats",
+		.data		= &sysctl_tcp_estats,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif /* CONFIG_TCP_ESTATS */
  	{ }
  };
  
diff --git a/net/ipv4/tcp_estats.c b/net/ipv4/tcp_estats.c
new file mode 100644
index 0000000..e817540
--- /dev/null
+++ b/net/ipv4/tcp_estats.c
@@ -0,0 +1,736 @@
+/*
+ * net/ipv4/tcp_estats.c
+ *
+ * Implementation of TCP ESTATS MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominic Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/export.h>
+#ifndef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+#include <linux/jiffies.h>
+#endif
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/string.h>
+#include <net/tcp_estats.h>
+#include <net/tcp.h>
+#include <linux/atomic.h>
+#include <asm/byteorder.h>
+
+#define ESTATS_INF32	0xffffffff
+
+#define ESTATS_MAX_CID	5000000
+
+extern int sysctl_tcp_estats;
+
+struct idr tcp_estats_idr;
+EXPORT_SYMBOL(tcp_estats_idr);
+static int next_id = 1;
+DEFINE_SPINLOCK(tcp_estats_idr_lock);
+EXPORT_SYMBOL(tcp_estats_idr_lock);
+
+int tcp_estats_wq_enabled __read_mostly = 0;
+EXPORT_SYMBOL(tcp_estats_wq_enabled);
+struct workqueue_struct *tcp_estats_wq = NULL;
+EXPORT_SYMBOL(tcp_estats_wq);
+void (*create_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(create_notify_func);
+void (*establish_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(establish_notify_func);
+void (*destroy_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(destroy_notify_func);
+unsigned long persist_delay = 0;
+EXPORT_SYMBOL(persist_delay);
+
+struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(tcp_estats_enabled);
+
+/* if HAVE_JUMP_LABEL is defined, then static_key_slow_inc/dec uses a
+ *   mutex in its implementation, and hence can't be called if in_interrupt().
+ * if HAVE_JUMP_LABEL is NOT defined, then no mutex is used, hence no need
+ *   for deferring enable/disable */
+#ifdef HAVE_JUMP_LABEL
+static atomic_t tcp_estats_enabled_deferred;
+
+static void tcp_estats_handle_deferred_enable_disable(void)
+{
+	int count = atomic_xchg(&tcp_estats_enabled_deferred, 0);
+
+	while (count > 0) {
+		static_key_slow_inc(&tcp_estats_enabled);
+		--count;
+	}
+
+	while (count < 0) {
+		static_key_slow_dec(&tcp_estats_enabled);
+		++count;
+	}
+}
+#endif
+
+static inline void tcp_estats_enable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_inc(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_inc(&tcp_estats_enabled);
+}
+
+static inline void tcp_estats_disable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_dec(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_dec(&tcp_estats_enabled);
+}
+
+/* Calculates the required amount of memory for any enabled tables. */
+int tcp_estats_get_allocation_size(int sysctl)
+{
+	int size = sizeof(struct tcp_estats) +
+		sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF)
+		size += sizeof(struct tcp_estats_perf_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH)
+		size += sizeof(struct tcp_estats_path_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK)
+		size += sizeof(struct tcp_estats_stack_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP)
+		size += sizeof(struct tcp_estats_app_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS)
+		size += sizeof(struct tcp_estats_extras_table);
+	return size;
+}
+
+/* Called whenever a TCP/IPv4 sock is created.
+ * net/ipv4/tcp_ipv4.c: tcp_v4_syn_recv_sock,
+ *			tcp_v4_init_sock
+ * Allocates a stats structure and initializes values.
+ */
+int tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype addrtype,
+		      int active)
+{
+	struct tcp_estats *stats;
+	struct tcp_estats_tables *tables;
+	struct tcp_sock *tp = tcp_sk(sk);
+	void *estats_mem;
+	int sysctl;
+	int ret;
+
+	/* Read the sysctl once before calculating memory needs and initializing
+	 * tables to avoid raciness. */
+	sysctl = ACCESS_ONCE(sysctl_tcp_estats);
+	if (likely(sysctl == TCP_ESTATS_TABLEMASK_INACTIVE)) {
+		return 0;
+	}
+
+	estats_mem = kzalloc(tcp_estats_get_allocation_size(sysctl), gfp_any());
+	if (!estats_mem)
+		return -ENOMEM;
+
+	stats = estats_mem;
+	estats_mem += sizeof(struct tcp_estats);
+
+	tables = &stats->tables;
+
+	tables->connection_table = estats_mem;
+	estats_mem += sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF) {
+		tables->perf_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_perf_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH) {
+		tables->path_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_path_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK) {
+		tables->stack_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_stack_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP) {
+		tables->app_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_app_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS) {
+		tables->extras_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_extras_table);
+	}
+
+	stats->tcpe_cid = -1;
+	stats->queued = 0;
+
+	tables->connection_table->AddressType = addrtype;
+
+	sock_hold(sk);
+	stats->sk = sk;
+	atomic_set(&stats->users, 0);
+
+	stats->limstate = TCP_ESTATS_SNDLIM_STARTUP;
+	stats->limstate_ts = ktime_get();
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->start_ts = stats->current_ts = stats->limstate_ts;
+#else
+	stats->start_ts = stats->current_ts = jiffies;
+#endif
+	do_gettimeofday(&stats->start_tv);
+
+	/* order is important -
+	 * must have stats hooked into tp and tcp_estats_enabled()
+	 * in order to have the TCP_ESTATS_VAR_<> macros work */
+	tp->tcp_stats = stats;
+	tcp_estats_enable();
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, ActiveOpen, active);
+
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->snd_nxt);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->snd_nxt);
+
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTT, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTO, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinMSS, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinSsthresh, ESTATS_INF32);
+
+	tcp_estats_use(stats);
+
+	if (tcp_estats_wq_enabled) {
+		tcp_estats_use(stats);
+		stats->queued = 1;
+		stats->tcpe_cid = 0;
+		INIT_WORK(&stats->create_notify, create_notify_func);
+		ret = queue_work(tcp_estats_wq, &stats->create_notify);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_estats_create);
+
+void tcp_estats_destroy(struct sock *sk)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+
+	if (stats == NULL)
+		return;
+
+	/* Attribute final sndlim time. */
+	tcp_estats_update_sndlim(tcp_sk(stats->sk), stats->limstate);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_DELAYED_WORK(&stats->destroy_notify,
+			destroy_notify_func);
+		queue_delayed_work(tcp_estats_wq, &stats->destroy_notify,
+			persist_delay);
+	}
+	tcp_estats_unuse(stats);
+}
+
+/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu. */
+void tcp_estats_free(struct rcu_head *rcu)
+{
+	struct tcp_estats *stats = container_of(rcu, struct tcp_estats, rcu);
+	tcp_estats_disable();
+	kfree(stats);
+}
+EXPORT_SYMBOL(tcp_estats_free);
+
+/* Called when a connection enters the ESTABLISHED state, and has all its
+ * state initialized.
+ * net/ipv4/tcp_input.c: tcp_rcv_state_process,
+ *			 tcp_rcv_synsent_state_process
+ * Here we link the statistics structure in so it is visible in the /proc
+ * fs, and do some final init.
+ */
+void tcp_estats_establish(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_connection_table *conn_table;
+
+	if (stats == NULL)
+		return;
+
+	conn_table = stats->tables.connection_table;
+
+	/* Let's set these here, since they can't change once the
+	 * connection is established.
+	 */
+	conn_table->LocalPort = inet->inet_num;
+	conn_table->RemPort = ntohs(inet->inet_dport);
+
+	if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV4) {
+		memcpy(&conn_table->LocalAddress.addr, &inet->inet_rcv_saddr,
+			sizeof(struct in_addr));
+		memcpy(&conn_table->RemAddress.addr, &inet->inet_daddr,
+			sizeof(struct in_addr));
+	}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	else if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV6) {
+		memcpy(&conn_table->LocalAddress.addr6, &(sk)->sk_v6_rcv_saddr,
+		       sizeof(struct in6_addr));
+		/* ipv6 daddr now uses a different struct than saddr */
+		memcpy(&conn_table->RemAddress.addr6, &(sk)->sk_v6_daddr,
+		       sizeof(struct in6_addr));
+	}
+#endif
+	else {
+		pr_err("TCP ESTATS: AddressType not valid.\n");
+	}
+
+	tcp_estats_update_finish_segrecv(tp);
+	tcp_estats_update_rwin_rcvd(tp);
+	tcp_estats_update_rwin_sent(tp);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, RecInitial, tp->rcv_nxt);
+
+	tcp_estats_update_sndlim(tp, TCP_ESTATS_SNDLIM_SENDER);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_WORK(&stats->establish_notify, establish_notify_func);
+		queue_work(tcp_estats_wq, &stats->establish_notify);
+	}
+}
+
+/*
+ * Statistics update functions
+ */
+
+void tcp_estats_update_snd_nxt(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table) {
+		if (after(tp->snd_nxt, stats->tables.app_table->SndMax))
+			stats->tables.app_table->SndMax = tp->snd_nxt;
+	}
+}
+
+void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table)
+		stats->tables.app_table->ThruOctetsAcked += ack - tp->snd_una;
+}
+
+void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+	unsigned long rtt_sample_msec = rtt_sample/1000;
+	u32 rto;
+
+	if (path_table == NULL)
+		return;
+
+	path_table->SampleRTT = rtt_sample_msec;
+
+	if (rtt_sample_msec > path_table->MaxRTT)
+		path_table->MaxRTT = rtt_sample_msec;
+	if (rtt_sample_msec < path_table->MinRTT)
+		path_table->MinRTT = rtt_sample_msec;
+
+	path_table->CountRTT++;
+	path_table->SumRTT += rtt_sample_msec;
+
+	rto = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	if (rto > path_table->MaxRTO)
+		path_table->MaxRTO = rto;
+	if (rto < path_table->MinRTO)
+		path_table->MinRTO = rto;
+}
+
+void tcp_estats_update_timeout(struct sock *sk)
+{
+	if (inet_csk(sk)->icsk_backoff)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, SubsequentTimeouts);
+	else
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), perf_table, Timeouts);
+
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, AbruptTimeouts);
+}
+
+void tcp_estats_update_mss(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_stack_table *stack_table = stats->tables.stack_table;
+	int mss = tp->mss_cache;
+
+	if (stack_table == NULL)
+		return;
+
+	if (mss > stack_table->MaxMSS)
+		stack_table->MaxMSS = mss;
+	if (mss < stack_table->MinMSS)
+		stack_table->MinMSS = mss;
+}
+
+void tcp_estats_update_finish_segrecv(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_tables *tables = &stats->tables;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	u32 mss = tp->mss_cache;
+	u32 cwnd;
+	u32 ssthresh;
+	u32 pipe_size;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (stack_table != NULL) {
+		cwnd = tp->snd_cwnd * mss;
+		if (tp->snd_cwnd <= tp->snd_ssthresh) {
+			if (cwnd > stack_table->MaxSsCwnd)
+				stack_table->MaxSsCwnd = cwnd;
+		} else if (cwnd > stack_table->MaxCaCwnd) {
+			stack_table->MaxCaCwnd = cwnd;
+		}
+	}
+
+	if (perf_table != NULL) {
+		pipe_size = tcp_packets_in_flight(tp) * mss;
+		if (pipe_size > perf_table->MaxPipeSize)
+			perf_table->MaxPipeSize = pipe_size;
+	}
+
+	/* Discard initial ssthresh set at infinity. */
+	if (tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH) {
+		return;
+	}
+
+	if (stack_table != NULL) {
+		ssthresh = tp->snd_ssthresh * tp->mss_cache;
+		if (ssthresh > stack_table->MaxSsthresh)
+			stack_table->MaxSsthresh = ssthresh;
+		if (ssthresh < stack_table->MinSsthresh)
+			stack_table->MinSsthresh = ssthresh;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_finish_segrecv);
+
+void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->snd_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinRcvd)
+		perf_table->MaxRwinRcvd = win;
+	if (win == 0)
+		perf_table->ZeroRwinRcvd++;
+}
+
+void tcp_estats_update_rwin_sent(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->rcv_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinSent)
+		perf_table->MaxRwinSent = win;
+	if (win == 0)
+		perf_table->ZeroRwinSent++;
+}
+
+void tcp_estats_update_sndlim(struct tcp_sock *tp,
+			      enum tcp_estats_sndlim_states state)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	ktime_t now;
+
+	if (state <= TCP_ESTATS_SNDLIM_NONE ||
+	    state >= TCP_ESTATS_SNDLIM_NSTATES) {
+		pr_err("tcp_estats_update_sndlim: BUG: state out of range %d\n",
+		       state);
+		return;
+	}
+
+	if (perf_table == NULL)
+		return;
+
+	now = ktime_get();
+	perf_table->snd_lim_time[stats->limstate]
+	    += ktime_to_us(ktime_sub(now, stats->limstate_ts));
+	stats->limstate_ts = now;
+	if (stats->limstate != state) {
+		stats->limstate = state;
+		perf_table->snd_lim_trans[state]++;
+	}
+}
+
+void tcp_estats_update_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	TCP_ESTATS_VAR_INC(tp, perf_table, CongSignals);
+
+	if (path_table != NULL) {
+		path_table->PreCongSumCwnd += tp->snd_cwnd * tp->mss_cache;
+		path_table->PreCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_post_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	if (path_table != NULL) {
+		path_table->PostCongCountRTT++;
+		path_table->PostCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_segsend(struct sock *sk, int pcount,
+			       u32 seq, u32 end_seq, int flags)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	struct tcp_estats_app_table *app_table = stats->tables.app_table;
+
+	int data_len = end_seq - seq;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (perf_table == NULL)
+		return;
+
+	/* We know we're sending a segment. */
+	perf_table->SegsOut += pcount;
+
+	/* A pure ACK contains no data; everything else is data. */
+	if (data_len > 0) {
+		perf_table->DataSegsOut += pcount;
+		perf_table->DataOctetsOut += data_len;
+	}
+
+	/* Check for retransmission. */
+	if (flags & TCPHDR_SYN) {
+		if (inet_csk(sk)->icsk_retransmits)
+			perf_table->SegsRetrans++;
+	} else if (app_table != NULL &&
+		   before(seq, app_table->SndMax)) {
+		perf_table->SegsRetrans += pcount;
+		perf_table->OctetsRetrans += data_len;
+	}
+}
+
+void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_path_table *path_table = tables->path_table;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	struct tcphdr *th = tcp_hdr(skb);
+	struct iphdr *iph = ip_hdr(skb);
+
+	if (perf_table != NULL)
+		perf_table->SegsIn++;
+
+	if (skb->len == th->doff * 4) {
+		if (stack_table != NULL &&
+		    TCP_SKB_CB(skb)->ack_seq == tp->snd_una)
+			stack_table->DupAcksIn++;
+	} else {
+		if (perf_table != NULL) {
+			perf_table->DataSegsIn++;
+			perf_table->DataOctetsIn += skb->len - th->doff * 4;
+		}
+	}
+
+	if (path_table != NULL) {
+		path_table->IpTtl = iph->ttl;
+		path_table->IpTosIn = iph->tos;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_segrecv);
+
+void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq)
+{
+	/* After much debate, it was decided that "seq - rcv_nxt" is
+	 * indeed what we want, as opposed to what Krishnan suggested
+	 * to better match the RFC: "seq - tp->rcv_wup" */
+	TCP_ESTATS_VAR_ADD(tp, app_table, ThruOctetsReceived,
+			   seq - tp->rcv_nxt);
+}
+
+void tcp_estats_update_writeq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_app_table *app_table =
+			tp->tcp_stats->tables.app_table;
+	int len;
+
+	if (app_table == NULL)
+		return;
+
+	len = tp->write_seq - app_table->SndMax;
+
+	if (len > app_table->MaxAppWQueue)
+		app_table->MaxAppWQueue = len;
+}
+
+static inline u32 ofo_qlen(struct tcp_sock *tp)
+{
+	if (!skb_peek(&tp->out_of_order_queue))
+		return 0;
+	else
+		return TCP_SKB_CB(tp->out_of_order_queue.prev)->end_seq -
+		    TCP_SKB_CB(tp->out_of_order_queue.next)->seq;
+}
+
+void tcp_estats_update_recvq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_app_table *app_table = tables->app_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+
+	if (app_table != NULL) {
+		u32 len = tp->rcv_nxt - tp->copied_seq;
+		if (app_table->MaxAppRQueue < len)
+			app_table->MaxAppRQueue = len;
+	}
+
+	if (stack_table != NULL) {
+		u32 len = ofo_qlen(tp);
+		if (stack_table->MaxReasmQueue < len)
+			stack_table->MaxReasmQueue = len;
+	}
+}
+
+/*
+ * Manage connection ID table
+ */
+
+static int get_new_cid(struct tcp_estats *stats)
+{
+	int id_cid;
+
+again:
+	spin_lock_bh(&tcp_estats_idr_lock);
+	id_cid = idr_alloc(&tcp_estats_idr, stats, next_id, 0, GFP_ATOMIC);
+	if (unlikely(id_cid == -ENOSPC)) {
+		spin_unlock_bh(&tcp_estats_idr_lock);
+		goto again;
+	}
+	if (unlikely(id_cid == -ENOMEM)) {
+		spin_unlock_bh(&tcp_estats_idr_lock);
+		return -ENOMEM;
+	}
+	next_id = (id_cid + 1) % ESTATS_MAX_CID;
+	stats->tcpe_cid = id_cid;
+	spin_unlock_bh(&tcp_estats_idr_lock);
+	return 0;
+}
+
+static void create_func(struct work_struct *work)
+{
+	/* stub for netlink notification of new connections */
+	;
+}
+
+static void establish_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						establish_notify);
+	int err = 0;
+
+	if ((stats->tcpe_cid) > 0) {
+		pr_err("TCP estats container established multiple times.\n");
+		return;
+	}
+
+	if ((stats->tcpe_cid) == 0) {
+		err = get_new_cid(stats);
+		if (err)
+			pr_devel("get_new_cid error %d\n", err);
+	}
+}
+
+static void destroy_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						destroy_notify.work);
+
+	int id_cid = stats->tcpe_cid;
+
+	if (id_cid == 0)
+		pr_devel("TCP estats destroyed before being established.\n");
+
+	if (id_cid >= 0) {
+		if (id_cid) {
+			spin_lock_bh(&tcp_estats_idr_lock);
+			idr_remove(&tcp_estats_idr, id_cid);
+			spin_unlock_bh(&tcp_estats_idr_lock);
+		}
+		stats->tcpe_cid = -1;
+
+		tcp_estats_unuse(stats);
+	}
+}
+
+void __init tcp_estats_init(void)
+{
+	idr_init(&tcp_estats_idr);
+
+	create_notify_func = &create_func;
+	establish_notify_func = &establish_func;
+	destroy_notify_func = &destroy_func;
+
+	persist_delay = TCP_ESTATS_PERSIST_DELAY_SECS * HZ;
+
+	tcp_estats_wq = alloc_workqueue("tcp_estats", WQ_MEM_RECLAIM, 256);
+	if (tcp_estats_wq == NULL) {
+		pr_err("tcp_estats_init(): alloc_workqueue failed\n");
+		goto cleanup_fail;
+	}
+
+	tcp_estats_wq_enabled = 1;
+	return;
+
+cleanup_fail:
+	pr_err("TCP ESTATS: initialization failed.\n");
+}
-- 
1.9.3
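
For readers following tcp_estats_create() above: it makes one allocation and
carves the enabled tables out of it in sequence. The following is only a
user-space sketch of that layout idea, not the kernel code. The names
MASK_PERF, MASK_PATH, estats_alloc_size() and estats_create(), and the
reduced two-field tables, are hypothetical stand-ins, and calloc() takes the
place of kzalloc().

```c
#include <stdint.h>
#include <stdlib.h>

#define MASK_PERF 0x02	/* hypothetical bits, mirroring the table-mask idea */
#define MASK_PATH 0x04

struct perf_table { uint32_t SegsOut; uint32_t SegsIn; };
struct path_table { uint32_t MinRTT; uint32_t MaxRTT; };

struct estats {
	struct perf_table *perf_table;	/* NULL if not enabled in the mask */
	struct path_table *path_table;	/* NULL if not enabled in the mask */
};

/* Bytes needed for the header plus every table enabled in mask. */
static size_t estats_alloc_size(int mask)
{
	size_t size = sizeof(struct estats);

	if (mask & MASK_PERF)
		size += sizeof(struct perf_table);
	if (mask & MASK_PATH)
		size += sizeof(struct path_table);
	return size;
}

/* One zeroed allocation; enabled tables are carved out behind the header.
 * All members here are 32-bit or pointers, so each carved offset stays
 * suitably aligned; mixed-alignment tables would need explicit padding. */
static struct estats *estats_create(int mask)
{
	char *mem = calloc(1, estats_alloc_size(mask));
	struct estats *stats;

	if (!mem)
		return NULL;

	stats = (struct estats *)mem;
	mem += sizeof(struct estats);

	if (mask & MASK_PERF) {
		stats->perf_table = (struct perf_table *)mem;
		mem += sizeof(struct perf_table);
	}
	if (mask & MASK_PATH) {
		stats->path_table = (struct path_table *)mem;
		mem += sizeof(struct path_table);
	}
	return stats;	/* a single free(stats) releases everything */
}
```

A disabled table keeps its NULL pointer, which is what lets the update
functions return early with a simple pointer test.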

* [PATCH net-next 1/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:50 UTC (permalink / raw)
  To: netdev

This patch provides the kernel instrument set. While this patch
compiles and runs, it does not have control and management capabilities.
These are provided in the next patch submission.

---
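The instrument points added by this patch go through guard macros such as
TCP_ESTATS_VAR_INC() in include/net/tcp_estats.h. As a rough user-space
sketch of that guard pattern (not the kernel code): a plain int stands in
for the static key, and struct conn, estats_enabled, ESTATS_CHECK,
ESTATS_VAR_INC and count_segment() are hypothetical names used only for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; none of these names are from the patch. */
struct perf_table { uint32_t SegsIn; };
struct stack_table { uint32_t DupAcksIn; };

struct tables {
	struct perf_table *perf_table;	 /* NULL when the table is disabled */
	struct stack_table *stack_table; /* NULL when the table is disabled */
};

struct estats { struct tables tables; };
struct conn { struct estats *stats; };	/* stands in for struct tcp_sock */

/* Plain flag standing in for the kernel's static key. */
static int estats_enabled;

/* Evaluate expr only if stats are globally on, this connection has a
 * stats block, and the named table was allocated for it. */
#define ESTATS_CHECK(c, table, expr)					\
	do {								\
		if (estats_enabled) {					\
			if ((c)->stats && (c)->stats->tables.table) {	\
				(expr);					\
			}						\
		}							\
	} while (0)

#define ESTATS_VAR_INC(c, table, var)					\
	ESTATS_CHECK(c, table, ++((c)->stats->tables.table->var))

/* Bumps one counter per table; each bump is skipped unless all three
 * guards pass for that table. */
static void count_segment(struct conn *c)
{
	ESTATS_VAR_INC(c, perf_table, SegsIn);
	ESTATS_VAR_INC(c, stack_table, DupAcksIn);
}
```

With the flag clear, or with a NULL stats or table pointer, count_segment()
changes nothing, so an instrument point is cheap on connections that are not
being measured.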
  include/linux/tcp.h      |   8 +
  include/net/tcp.h        |   1 +
  include/net/tcp_estats.h | 376 +++++++++++++++++++++++++++++++++++++++++++++++
  include/uapi/linux/tcp.h |   6 +-
  net/ipv4/tcp.c           |  21 ++-
  net/ipv4/tcp_cong.c      |   3 +
  net/ipv4/tcp_htcp.c      |   1 +
  net/ipv4/tcp_input.c     | 116 +++++++++++++--
  net/ipv4/tcp_ipv4.c      |  10 ++
  net/ipv4/tcp_output.c    |  61 +++++++-
  net/ipv4/tcp_timer.c     |   3 +
  net/ipv6/tcp_ipv6.c      |   7 +
  12 files changed, 592 insertions(+), 21 deletions(-)
  create mode 100644 include/net/tcp_estats.h

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8758360 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -126,6 +126,10 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
  	return (struct tcp_request_sock *)req;
  }
  
+#ifdef CONFIG_TCP_ESTATS
+struct tcp_estats;
+#endif
+
  struct tcp_sock {
  	/* inet_connection_sock has to be the first member of tcp_sock */
  	struct inet_connection_sock	inet_conn;
@@ -309,6 +313,10 @@ struct tcp_sock {
  	struct tcp_md5sig_info	__rcu *md5sig_info;
  #endif
  
+#ifdef CONFIG_TCP_ESTATS
+	struct tcp_estats	*tcp_stats;
+#endif
+
  /* TCP fastopen related information */
  	struct tcp_fastopen_request *fastopen_req;
  	/* fastopen_rsk points to request_sock that resulted in this big
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..9f7e31e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -43,6 +43,7 @@
  #include <net/tcp_states.h>
  #include <net/inet_ecn.h>
  #include <net/dst.h>
+#include <net/tcp_estats.h>
  
  #include <linux/seq_file.h>
  #include <linux/memcontrol.h>
diff --git a/include/net/tcp_estats.h b/include/net/tcp_estats.h
new file mode 100644
index 0000000..ff6000e
--- /dev/null
+++ b/include/net/tcp_estats.h
@@ -0,0 +1,376 @@
+/*
+ * include/net/tcp_estats.h
+ *
+ * Implementation of TCP Extended Statistics MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominic Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _TCP_ESTATS_H
+#define _TCP_ESTATS_H
+
+#include <net/sock.h>
+#include <linux/idr.h>
+#include <linux/in.h>
+#include <linux/jump_label.h>
+#include <linux/spinlock.h>
+#include <linux/tcp.h>
+#include <linux/workqueue.h>
+
+/* defines number of seconds that stats persist after connection ends */
+#define TCP_ESTATS_PERSIST_DELAY_SECS 5
+
+enum tcp_estats_sndlim_states {
+	TCP_ESTATS_SNDLIM_NONE = -1,
+	TCP_ESTATS_SNDLIM_SENDER,
+	TCP_ESTATS_SNDLIM_CWND,
+	TCP_ESTATS_SNDLIM_RWIN,
+	TCP_ESTATS_SNDLIM_STARTUP,
+	TCP_ESTATS_SNDLIM_TSODEFER,
+	TCP_ESTATS_SNDLIM_PACE,
+	TCP_ESTATS_SNDLIM_NSTATES	/* Keep at end */
+};
+
+enum tcp_estats_addrtype {
+	TCP_ESTATS_ADDRTYPE_IPV4 = 1,
+	TCP_ESTATS_ADDRTYPE_IPV6 = 2
+};
+
+enum tcp_estats_softerror_reason {
+	TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW = 1,
+	TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW = 2,
+	TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW = 3,
+	TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW = 4,
+	TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW = 5,
+	TCP_ESTATS_SOFTERROR_ABOVE_TS_WINDOW = 6,
+	TCP_ESTATS_SOFTERROR_DATA_CHECKSUM = 7,
+	TCP_ESTATS_SOFTERROR_OTHER = 8,
+};
+
+#define TCP_ESTATS_INACTIVE	2
+#define TCP_ESTATS_ACTIVE	1
+
+#define TCP_ESTATS_TABLEMASK_INACTIVE	0x00
+#define TCP_ESTATS_TABLEMASK_ACTIVE	0x01
+#define TCP_ESTATS_TABLEMASK_PERF	0x02
+#define TCP_ESTATS_TABLEMASK_PATH	0x04
+#define TCP_ESTATS_TABLEMASK_STACK	0x08
+#define TCP_ESTATS_TABLEMASK_APP	0x10
+#define TCP_ESTATS_TABLEMASK_EXTRAS	0x40
+
+#ifdef CONFIG_TCP_ESTATS
+
+extern struct static_key tcp_estats_enabled;
+
+#define TCP_ESTATS_CHECK(tp, table, expr)				\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats) &&			\
+			    likely((tp)->tcp_stats->tables.table)) {	\
+				(expr);					\
+			}						\
+		}							\
+	} while (0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, ++((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_DEC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, --((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) += (val))
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) = (val))
+#define TCP_ESTATS_UPDATE(tp, func)					\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats)) {			\
+				(func);					\
+			}						\
+		}							\
+	} while (0)
+
+/*
+ * Variables that can be read and written directly.
+ *
+ * Contains all variables from RFC 4898. Commented fields are
+ * either not implemented (only StartTimeStamp
+ * remains unimplemented in this release) or have
+ * handlers and do not need struct storage.
+ */
+struct tcp_estats_connection_table {
+	u32			AddressType;
+	union { struct in_addr addr; struct in6_addr addr6; }	LocalAddress;
+	union { struct in_addr addr; struct in6_addr addr6; }	RemAddress;
+	u16			LocalPort;
+	u16			RemPort;
+};
+
+struct tcp_estats_perf_table {
+	u32		SegsOut;
+	u32		DataSegsOut;
+	u64		DataOctetsOut;
+	u32		SegsRetrans;
+	u32		OctetsRetrans;
+	u32		SegsIn;
+	u32		DataSegsIn;
+	u64		DataOctetsIn;
+	/*		ElapsedSecs */
+	/*		ElapsedMicroSecs */
+	/*		StartTimeStamp */
+	/*		CurMSS */
+	/*		PipeSize */
+	u32		MaxPipeSize;
+	/*		SmoothedRTT */
+	/*		CurRTO */
+	u32		CongSignals;
+	/*		CurCwnd */
+	/*		CurSsthresh */
+	u32		Timeouts;
+	/*		CurRwinSent */
+	u32		MaxRwinSent;
+	u32		ZeroRwinSent;
+	/*		CurRwinRcvd */
+	u32		MaxRwinRcvd;
+	u32		ZeroRwinRcvd;
+	/*		SndLimTransRwin */
+	/*		SndLimTransCwnd */
+	/*		SndLimTransSnd */
+	/*		SndLimTimeRwin */
+	/*		SndLimTimeCwnd */
+	/*		SndLimTimeSnd */
+	u32		snd_lim_trans[TCP_ESTATS_SNDLIM_NSTATES];
+	u32		snd_lim_time[TCP_ESTATS_SNDLIM_NSTATES];
+};
+
+struct tcp_estats_path_table {
+	/*		RetranThresh */
+	u32		NonRecovDAEpisodes;
+	u32		SumOctetsReordered;
+	u32		NonRecovDA;
+	u32		SampleRTT;
+	/*		RTTVar */
+	u32		MaxRTT;
+	u32		MinRTT;
+	u64		SumRTT;
+	u32		CountRTT;
+	u32		MaxRTO;
+	u32		MinRTO;
+	u8		IpTtl;
+	u8		IpTosIn;
+	/*		IpTosOut */
+	u32		PreCongSumCwnd;
+	u32		PreCongSumRTT;
+	u32		PostCongSumRTT;
+	u32		PostCongCountRTT;
+	u32		ECNsignals;
+	u32		DupAckEpisodes;
+	/*		RcvRTT */
+	u32		DupAcksOut;
+	u32		CERcvd;
+	u32		ECESent;
+};
+
+struct tcp_estats_stack_table {
+	u32		ActiveOpen;
+	/*		MSSSent */
+	/*		MSSRcvd */
+	/*		WinScaleSent */
+	/*		WinScaleRcvd */
+	/*		TimeStamps */
+	/*		ECN */
+	/*		WillSendSACK */
+	/*		WillUseSACK */
+	/*		State */
+	/*		Nagle */
+	u32		MaxSsCwnd;
+	u32		MaxCaCwnd;
+	u32		MaxSsthresh;
+	u32		MinSsthresh;
+	/*		InRecovery */
+	u32		DupAcksIn;
+	u32		SpuriousFrDetected;
+	u32		SpuriousRtoDetected;
+	u32		SoftErrors;
+	u32		SoftErrorReason;
+	u32		SlowStart;
+	u32		CongAvoid;
+	u32		OtherReductions;
+	u32		CongOverCount;
+	u32		FastRetran;
+	u32		SubsequentTimeouts;
+	/*		CurTimeoutCount */
+	u32		AbruptTimeouts;
+	u32		SACKsRcvd;
+	u32		SACKBlocksRcvd;
+	u32		SendStall;
+	u32		DSACKDups;
+	u32		MaxMSS;
+	u32		MinMSS;
+	u32		SndInitial;
+	u32		RecInitial;
+	/*		CurRetxQueue */
+	/*		MaxRetxQueue */
+	/*		CurReasmQueue */
+	u32		MaxReasmQueue;
+	u32		EarlyRetrans;
+	u32		EarlyRetransDelay;
+};
+
+struct tcp_estats_app_table {
+	/*		SndUna */
+	/*		SndNxt */
+	u32		SndMax;
+	u64		ThruOctetsAcked;
+	/*		RcvNxt */
+	u64		ThruOctetsReceived;
+	/*		CurAppWQueue */
+	u32		MaxAppWQueue;
+	/*		CurAppRQueue */
+	u32		MaxAppRQueue;
+};
+
+/*
+ * Currently, no backing store is needed for tuning elements in
+ * web10g; they are all read or written directly in other
+ * data structures (such as the socket).
+ */
+
+struct tcp_estats_extras_table {
+	/*		OtherReductionsCV */
+	u32		OtherReductionsCM;
+	u32		Priority;
+};
+
+struct tcp_estats_tables {
+	struct tcp_estats_connection_table	*connection_table;
+	struct tcp_estats_perf_table		*perf_table;
+	struct tcp_estats_path_table		*path_table;
+	struct tcp_estats_stack_table		*stack_table;
+	struct tcp_estats_app_table		*app_table;
+	struct tcp_estats_extras_table		*extras_table;
+};
+
+struct tcp_estats {
+	int				tcpe_cid; /* idr map id */
+
+	struct sock			*sk;
+	kuid_t				uid;
+	kgid_t				gid;
+	int				ids;
+
+	atomic_t			users;
+
+	enum tcp_estats_sndlim_states	limstate;
+	ktime_t				limstate_ts;
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	ktime_t				start_ts;
+	ktime_t				current_ts;
+#else
+	unsigned long			start_ts;
+	unsigned long			current_ts;
+#endif
+	struct timeval			start_tv;
+
+	int				queued;
+	struct work_struct		create_notify;
+	struct work_struct		establish_notify;
+	struct delayed_work		destroy_notify;
+
+	struct tcp_estats_tables	tables;
+
+	struct rcu_head			rcu;
+};
+
+extern struct idr tcp_estats_idr;
+
+extern int tcp_estats_wq_enabled;
+extern struct workqueue_struct *tcp_estats_wq;
+extern void (*create_notify_func)(struct work_struct *work);
+extern void (*establish_notify_func)(struct work_struct *work);
+extern void (*destroy_notify_func)(struct work_struct *work);
+
+extern unsigned long persist_delay;
+extern spinlock_t tcp_estats_idr_lock;
+
+/* For the TCP code */
+extern int  tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype t,
+			      int active);
+extern void tcp_estats_destroy(struct sock *sk);
+extern void tcp_estats_establish(struct sock *sk);
+extern void tcp_estats_free(struct rcu_head *rcu);
+
+extern void tcp_estats_update_snd_nxt(struct tcp_sock *tp);
+extern void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack);
+extern void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample);
+extern void tcp_estats_update_timeout(struct sock *sk);
+extern void tcp_estats_update_mss(struct tcp_sock *tp);
+extern void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp);
+extern void tcp_estats_update_sndlim(struct tcp_sock *tp,
+				     enum tcp_estats_sndlim_states why);
+extern void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq);
+extern void tcp_estats_update_rwin_sent(struct tcp_sock *tp);
+extern void tcp_estats_update_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_post_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_segsend(struct sock *sk, int pcount,
+				      u32 seq, u32 end_seq, int flags);
+extern void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb);
+extern void tcp_estats_update_finish_segrecv(struct tcp_sock *tp);
+extern void tcp_estats_update_writeq(struct sock *sk);
+extern void tcp_estats_update_recvq(struct sock *sk);
+
+extern void tcp_estats_init(void);
+
+static inline void tcp_estats_use(struct tcp_estats *stats)
+{
+	atomic_inc(&stats->users);
+}
+
+static inline int tcp_estats_use_if_valid(struct tcp_estats *stats)
+{
+	return atomic_inc_not_zero(&stats->users);
+}
+
+static inline void tcp_estats_unuse(struct tcp_estats *stats)
+{
+	if (atomic_dec_and_test(&stats->users)) {
+		sock_put(stats->sk);
+		stats->sk = NULL;
+		call_rcu(&stats->rcu, tcp_estats_free);
+	}
+}
+
+#else /* !CONFIG_TCP_ESTATS */
+
+#define tcp_estats_enabled	(0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_DEC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_UPDATE(tp, func)		do {} while (0)
+
+static inline void tcp_estats_init(void) { }
+static inline void tcp_estats_establish(struct sock *sk) { }
+static inline int tcp_estats_create(struct sock *sk,
+				    enum tcp_estats_addrtype t,
+				    int active) { return 0; }
+static inline void tcp_estats_destroy(struct sock *sk) { }
+
+#endif /* CONFIG_TCP_ESTATS */
+
+#endif /* _TCP_ESTATS_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 3b97183..5dae043 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -186,9 +186,13 @@ struct tcp_info {
  	__u32	tcpi_rcv_space;
  
  	__u32	tcpi_total_retrans;
-
  	__u64	tcpi_pacing_rate;
  	__u64	tcpi_max_pacing_rate;
+
+#ifdef CONFIG_TCP_ESTATS
+	/* RFC 4898 extended stats Info */
+	__u32	tcpi_estats_cid;
+#endif
  };
  
  /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3075723..698dbb7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,10 @@ void tcp_init_sock(struct sock *sk)
  	sk->sk_sndbuf = sysctl_tcp_wmem[1];
  	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
  
+#ifdef CONFIG_TCP_ESTATS
+	tp->tcp_stats = NULL;
+#endif
+
  	local_bh_disable();
  	sock_update_memcg(sk);
  	sk_sockets_allocated_inc(sk);
@@ -972,6 +976,9 @@ wait_for_memory:
  		tcp_push(sk, flags & ~MSG_MORE, mss_now,
  			 TCP_NAGLE_PUSH, size_goal);
  
+		if (copied)
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+
  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  			goto do_error;
  
@@ -1264,9 +1271,11 @@ new_segment:
  wait_for_sndbuf:
  			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
  wait_for_memory:
-			if (copied)
+			if (copied) {
  				tcp_push(sk, flags & ~MSG_MORE, mss_now,
  					 TCP_NAGLE_PUSH, size_goal);
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+			}
  
  			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  				goto do_error;
@@ -1658,6 +1667,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
  			     *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt, flags);
  		}
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		/* Well, if we have backlog, try to process it now yet. */
  
  		if (copied >= target && !sk->sk_backlog.tail)
@@ -2684,6 +2695,11 @@ void tcp_get_info(const struct sock *sk, struct tcp_info *info)
  					sk->sk_pacing_rate : ~0ULL;
  	info->tcpi_max_pacing_rate = sk->sk_max_pacing_rate != ~0U ?
  					sk->sk_max_pacing_rate : ~0ULL;
+
+#ifdef CONFIG_TCP_ESTATS
+	info->tcpi_estats_cid = (tp->tcp_stats && tp->tcp_stats->tcpe_cid > 0)
+					? tp->tcp_stats->tcpe_cid : 0;
+#endif
  }
  EXPORT_SYMBOL_GPL(tcp_get_info);
  
@@ -3101,6 +3117,9 @@ void __init tcp_init(void)
  		tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
  
  	tcp_metrics_init();
+
  	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
+	tcp_estats_init();
+
  	tcp_tasklet_init();
  }
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..e93929d 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -295,6 +295,8 @@ void tcp_slow_start(struct tcp_sock *tp, u32 acked)
  {
  	u32 cwnd = tp->snd_cwnd + acked;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SlowStart);
+
  	if (cwnd > tp->snd_ssthresh)
  		cwnd = tp->snd_ssthresh + 1;
  	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
@@ -304,6 +306,7 @@ EXPORT_SYMBOL_GPL(tcp_slow_start);
  /* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd (or alternative w) */
  void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w)
  {
+	TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	if (tp->snd_cwnd_cnt >= w) {
  		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
  			tp->snd_cwnd++;
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 58469ff..5facb4c 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -251,6 +251,7 @@ static void htcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
  			tp->snd_cwnd_cnt += ca->pkts_acked;
  
  		ca->pkts_acked = 1;
+		TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	}
  }
  
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..8f0601b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -77,8 +77,10 @@
  #include <linux/errqueue.h>
  
  int sysctl_tcp_timestamps __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_timestamps);
  int sysctl_tcp_window_scaling __read_mostly = 1;
  int sysctl_tcp_sack __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_sack);
  int sysctl_tcp_fack __read_mostly = 1;
  int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
  int sysctl_tcp_max_reordering __read_mostly = 300;
@@ -231,13 +233,15 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
  			tcp_enter_quickack_mode((struct sock *)tp);
  		break;
  	case INET_ECN_CE:
+		TCP_ESTATS_VAR_INC(tp, path_table, CERcvd);
  		if (tcp_ca_needs_ecn((struct sock *)tp))
  			tcp_ca_event((struct sock *)tp, CA_EVENT_ECN_IS_CE);
-
  		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
  			/* Better not delay acks, sender can have a very low cwnd */
  			tcp_enter_quickack_mode((struct sock *)tp);
  			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+		} else {
+			TCP_ESTATS_VAR_INC(tp, path_table, ECESent);
  		}
  		tp->ecn_flags |= TCP_ECN_SEEN;
  		break;
@@ -1104,6 +1108,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  		dup_sack = true;
  		tcp_dsack_seen(tp);
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDSACKRECV);
+		TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  	} else if (num_sacks > 1) {
  		u32 end_seq_1 = get_unaligned_be32(&sp[1].end_seq);
  		u32 start_seq_1 = get_unaligned_be32(&sp[1].start_seq);
@@ -1114,6 +1119,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  			tcp_dsack_seen(tp);
  			NET_INC_STATS_BH(sock_net(sk),
  					LINUX_MIB_TCPDSACKOFORECV);
+			TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  		}
  	}
  
@@ -1653,6 +1659,9 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
  	state.reord = tp->packets_out;
  	state.rtt_us = -1L;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SACKsRcvd);
+	TCP_ESTATS_VAR_ADD(tp, stack_table, SACKBlocksRcvd, num_sacks);
+
  	if (!tp->sacked_out) {
  		if (WARN_ON(tp->fackets_out))
  			tp->fackets_out = 0;
@@ -1928,6 +1937,8 @@ void tcp_enter_loss(struct sock *sk)
  	bool new_recovery = false;
  	bool is_reneg;			/* is receiver reneging on SACKs? */
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+
  	/* Reduce ssthresh if it has not yet been made inside this window. */
  	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
  	    !after(tp->high_seq, tp->snd_una) ||
@@ -2200,8 +2211,12 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
  	 */
  	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
  	    (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
-	    !tcp_may_send_now(sk))
-		return !tcp_pause_early_retransmit(sk, flag);
+	    !tcp_may_send_now(sk)) {
+		int early_retrans = !tcp_pause_early_retransmit(sk, flag);
+		if (early_retrans)
+			TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetrans);
+		return early_retrans;
+	}
  
  	return false;
  }
@@ -2299,9 +2314,15 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
   */
  static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
  {
-	tp->snd_cwnd = min(tp->snd_cwnd,
-			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
-	tp->snd_cwnd_stamp = tcp_time_stamp;
+	u32 pkts = tcp_packets_in_flight(tp) + tcp_max_burst(tp);
+
+	if (pkts < tp->snd_cwnd) {
+		tp->snd_cwnd = pkts;
+		tp->snd_cwnd_stamp = tcp_time_stamp;
+
+		TCP_ESTATS_VAR_INC(tp, stack_table, OtherReductions);
+		TCP_ESTATS_VAR_INC(tp, extras_table, OtherReductionsCM);
+	}
  }
  
  /* Nothing was retransmitted or returned timestamp is less
@@ -2402,6 +2423,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
  		if (tp->prior_ssthresh > tp->snd_ssthresh) {
  			tp->snd_ssthresh = tp->prior_ssthresh;
  			tcp_ecn_withdraw_cwr(tp);
+			TCP_ESTATS_VAR_INC(tp, stack_table, CongOverCount);
  		}
  	} else {
  		tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh);
@@ -2428,10 +2450,15 @@ static bool tcp_try_undo_recovery(struct sock *sk)
  		 */
  		DBGUNDO(sk, inet_csk(sk)->icsk_ca_state == TCP_CA_Loss ? "loss" : "retrans");
  		tcp_undo_cwnd_reduction(sk, false);
-		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss)
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
  			mib_idx = LINUX_MIB_TCPLOSSUNDO;
-		else
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		} else {
  			mib_idx = LINUX_MIB_TCPFULLUNDO;
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousFrDetected);
+		}
  
  		NET_INC_STATS_BH(sock_net(sk), mib_idx);
  	}
@@ -2472,9 +2499,12 @@ static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
  
  		DBGUNDO(sk, "partial loss");
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
-		if (frto_undo)
+		if (frto_undo) {
  			NET_INC_STATS_BH(sock_net(sk),
  					 LINUX_MIB_TCPSPURIOUSRTOS);
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		}
  		inet_csk(sk)->icsk_retransmits = 0;
  		if (frto_undo || tcp_is_sack(tp))
  			tcp_set_ca_state(sk, TCP_CA_Open);
@@ -2555,6 +2585,7 @@ void tcp_enter_cwr(struct sock *sk)
  		tcp_init_cwnd_reduction(sk);
  		tcp_set_ca_state(sk, TCP_CA_CWR);
  	}
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
  }
  
  static void tcp_try_keep_open(struct sock *sk)
@@ -2580,8 +2611,10 @@ static void tcp_try_to_open(struct sock *sk, int flag, const int prior_unsacked)
  	if (!tcp_any_retrans_done(sk))
  		tp->retrans_stamp = 0;
  
-	if (flag & FLAG_ECE)
+	if (flag & FLAG_ECE) {
  		tcp_enter_cwr(sk);
+		TCP_ESTATS_VAR_INC(tp, path_table, ECNsignals);
+	}
  
  	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
  		tcp_try_keep_open(sk);
@@ -2826,6 +2859,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  			}
  			break;
  
+		case TCP_CA_Disorder:
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDAEpisodes);
+			break;
+
  		case TCP_CA_Recovery:
  			if (tcp_is_reno(tp))
  				tcp_reset_reno_sack(tp);
@@ -2870,6 +2907,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
  			tcp_try_undo_dsack(sk);
  
+
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDA);
+
  		if (!tcp_time_to_recover(sk, flag)) {
  			tcp_try_to_open(sk, flag, prior_unsacked);
  			return;
@@ -2889,6 +2930,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		/* Otherwise enter Recovery state */
  		tcp_enter_recovery(sk, (flag & FLAG_ECE));
  		fast_rexmit = 1;
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+		TCP_ESTATS_VAR_INC(tp, stack_table, FastRetran);
  	}
  
  	if (do_lost)
@@ -2928,6 +2971,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
  
  	tcp_rtt_estimator(sk, seq_rtt_us);
  	tcp_set_rto(sk);
+	TCP_ESTATS_UPDATE(tcp_sk(sk), tcp_estats_update_rtt(sk, seq_rtt_us));
  
  	/* RFC6298: only reset backoff on valid RTT measurement. */
  	inet_csk(sk)->icsk_backoff = 0;
@@ -3007,6 +3051,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
  	if (!tp->do_early_retrans)
  		return;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetransDelay);
  	tcp_enter_recovery(sk, false);
  	tcp_update_scoreboard(sk, 1);
  	tcp_xmit_retransmit_queue(sk);
@@ -3310,9 +3355,11 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
  				tp->max_window = nwin;
  				tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
  			}
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_rcvd(tp));
  		}
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  	tp->snd_una = ack;
  
  	return flag;
@@ -3410,6 +3457,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	int prior_packets = tp->packets_out;
  	const int prior_unsacked = tp->packets_out - tp->sacked_out;
  	int acked = 0; /* Number of packets newly acked */
+	int prior_state = icsk->icsk_ca_state;
  	long sack_rtt_us = -1L;
  
  	/* We very likely will need to access write queue head. */
@@ -3419,6 +3467,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	 * then we can probably ignore it.
  	 */
  	if (before(ack, prior_snd_una)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW);
  		/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
  		if (before(ack, prior_snd_una - tp->max_window)) {
  			tcp_send_challenge_ack(sk);
@@ -3430,8 +3481,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	/* If the ack includes data we haven't sent yet, discard
  	 * this segment (RFC793 Section 3.9).
  	 */
-	if (after(ack, tp->snd_nxt))
+	if (after(ack, tp->snd_nxt)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW);
  		goto invalid_ack;
+	}
  
  	if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
  	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
@@ -3439,6 +3494,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  
  	if (after(ack, prior_snd_una)) {
  		flag |= FLAG_SND_UNA_ADVANCED;
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_ADD(tp, path_table, SumOctetsReordered,
+					   ack - prior_snd_una);
  		icsk->icsk_retransmits = 0;
  	}
  
@@ -3456,6 +3514,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		 * Note, we use the fact that SND.UNA>=SND.WL2.
  		 */
  		tcp_update_wl(tp, ack_seq);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  		tp->snd_una = ack;
  		flag |= FLAG_WIN_UPDATE;
  
@@ -3510,6 +3569,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
  		tcp_fastretrans_alert(sk, acked, prior_unsacked,
  				      is_dupack, flag);
+		if (icsk->icsk_ca_state == TCP_CA_Open &&
+		    prior_state >= TCP_CA_CWR)
+			TCP_ESTATS_UPDATE(tp,
+				tcp_estats_update_post_congestion(tp));
  	}
  	if (tp->tlp_high_seq)
  		tcp_process_tlp_ack(sk, ack, flag);
@@ -4177,7 +4240,9 @@ static void tcp_ofo_queue(struct sock *sk)
  
  		tail = skb_peek_tail(&sk->sk_receive_queue);
  		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, tp->rcv_nxt));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+
  		if (!eaten)
  			__skb_queue_tail(&sk->sk_receive_queue, skb);
  		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
@@ -4232,6 +4297,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
  		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+	TCP_ESTATS_VAR_INC(tp, path_table, DupAcksOut);
+
  	skb1 = skb_peek_tail(&tp->out_of_order_queue);
  	if (!skb1) {
  		/* Initial out of order segment, build 1 SACK. */
@@ -4242,6 +4310,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  						TCP_SKB_CB(skb)->end_seq;
  		}
  		__skb_queue_head(&tp->out_of_order_queue, skb);
+		TCP_ESTATS_VAR_INC(tp, path_table, DupAckEpisodes);
  		goto end;
  	}
  
@@ -4438,6 +4507,9 @@ queue_and_out:
  
  			eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
  		}
+		TCP_ESTATS_UPDATE(
+			tp,
+			tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  		if (skb->len)
  			tcp_event_data_recv(sk, skb);
@@ -4459,6 +4531,8 @@ queue_and_out:
  
  		tcp_fast_path_check(sk);
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		if (eaten > 0)
  			kfree_skb_partial(skb, fragstolen);
  		if (!sock_flag(sk, SOCK_DEAD))
@@ -4990,6 +5064,9 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  	    tcp_paws_discard(sk, skb)) {
  		if (!th->rst) {
  			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
+			TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+			TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+					   TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW);
  			tcp_send_dupack(sk, skb);
  			goto discard;
  		}
@@ -5004,6 +5081,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  		 * an acknowledgment should be sent in reply (unless the RST
  		 * bit is set, if so drop the segment and return)".
  		 */
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			before(TCP_SKB_CB(skb)->end_seq, tp->rcv_wup) ?
+				TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW :
+				TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW);
  		if (!th->rst) {
  			if (th->syn)
  				goto syn_challenge;
@@ -5152,6 +5234,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				return;
  			} else { /* Header too small */
  				TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+				TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+				TCP_ESTATS_VAR_SET(tp, stack_table,
+						   SoftErrorReason,
+						   TCP_ESTATS_SOFTERROR_OTHER);
  				goto discard;
  			}
  		} else {
@@ -5178,6 +5264,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  					tcp_rcv_rtt_measure_ts(sk, skb);
  
  					__skb_pull(skb, tcp_header_len);
+					TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  					tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  					NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
  					eaten = 1;
@@ -5204,10 +5291,12 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
  
  				/* Bulk data transfer: receiver */
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  				eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
  						      &fragstolen);
  			}
  
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
  			tcp_event_data_recv(sk, skb);
  
  			if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
@@ -5260,6 +5349,9 @@ step5:
  csum_error:
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_CSUMERRORS);
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			   TCP_ESTATS_SOFTERROR_DATA_CHECKSUM);
  
  discard:
  	__kfree_skb(skb);
@@ -5459,6 +5551,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  
  		tcp_finish_connect(sk, skb);
+		tcp_estats_establish(sk);
  
  		if ((tp->syn_fastopen || tp->syn_data) &&
  		    tcp_rcv_fastopen_synack(sk, skb, &foc))
@@ -5685,6 +5778,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  		tcp_set_state(sk, TCP_ESTABLISHED);
  		sk->sk_state_change(sk);
+		tcp_estats_establish(sk);
  
  		/* Note, that this wakeup is only for marginal crossed SYN case.
  		 * Passively open sockets are not waked up, because
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f72d7..9c85a54 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1310,6 +1310,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (!newsk)
  		goto exit_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_INACTIVE);
+
  	newsk->sk_gso_type = SKB_GSO_TCPV4;
  	inet_sk_rx_dst_set(newsk, skb);
  
@@ -1670,6 +1672,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1680,6 +1684,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1809,6 +1815,8 @@ static int tcp_v4_init_sock(struct sock *sk)
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
  #endif
  
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_ACTIVE);
+
  	return 0;
  }
  
@@ -1842,6 +1850,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
  	if (inet_csk(sk)->icsk_bind_hash)
  		inet_put_port(sk);
  
+	tcp_estats_destroy(sk);
+
  	BUG_ON(tp->fastopen_rsk != NULL);
  
  	/* If socket is aborted during connect operation */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..145b4f2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -80,6 +80,7 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
  
  	tcp_advance_send_head(sk, skb);
  	tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
  
  	tp->packets_out += tcp_skb_pcount(skb);
  	if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
@@ -292,6 +293,7 @@ static u16 tcp_select_window(struct sock *sk)
  	}
  	tp->rcv_wnd = new_win;
  	tp->rcv_wup = tp->rcv_nxt;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_sent(tp));
  
  	/* Make sure we do not exceed the maximum possible
  	 * scaled window.
@@ -905,6 +907,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  	struct tcp_md5sig_key *md5;
  	struct tcphdr *th;
  	int err;
+#ifdef CONFIG_TCP_ESTATS
+	__u32 seq;
+	__u32 end_seq;
+	int tcp_flags;
+	int pcount;
+#endif
  
  	BUG_ON(!skb || !tcp_skb_pcount(skb));
  
@@ -1008,6 +1016,15 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
  			      tcp_skb_pcount(skb));
  
+#ifdef CONFIG_TCP_ESTATS
+	/* If the skb isn't cloned, we can't reference it after
+	 * calling queue_xmit, so copy everything we need here. */
+	pcount = tcp_skb_pcount(skb);
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
+	tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
+#endif
+
  	/* OK, its time to fill skb_shinfo(skb)->gso_segs */
  	skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);
  
@@ -1020,10 +1037,17 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  
  	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
  
+	if (likely(!err)) {
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_segsend(sk, pcount,
+								seq, end_seq,
+								tcp_flags));
+	}
+
  	if (likely(err <= 0))
  		return err;
  
  	tcp_enter_cwr(sk);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SendStall);
  
  	return net_xmit_eval(err);
  }
@@ -1398,6 +1422,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
  	if (icsk->icsk_mtup.enabled)
  		mss_now = min(mss_now, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low));
  	tp->mss_cache = mss_now;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_mss(tp));
  
  	return mss_now;
  }
@@ -1670,11 +1695,13 @@ static unsigned int tcp_snd_test(const struct sock *sk, struct sk_buff *skb,
  	tcp_init_tso_segs(sk, skb, cur_mss);
  
  	if (!tcp_nagle_test(tp, skb, cur_mss, nonagle))
-		return 0;
+		return -TCP_ESTATS_SNDLIM_SENDER;
  
  	cwnd_quota = tcp_cwnd_test(tp, skb);
-	if (cwnd_quota && !tcp_snd_wnd_test(tp, skb, cur_mss))
-		cwnd_quota = 0;
+	if (!cwnd_quota)
+		return -TCP_ESTATS_SNDLIM_CWND;
+	if (!tcp_snd_wnd_test(tp, skb, cur_mss))
+		return -TCP_ESTATS_SNDLIM_RWIN;
  
  	return cwnd_quota;
  }
@@ -1688,7 +1715,7 @@ bool tcp_may_send_now(struct sock *sk)
  	return skb &&
  		tcp_snd_test(sk, skb, tcp_current_mss(sk),
  			     (tcp_skb_is_last(sk, skb) ?
-			      tp->nonagle : TCP_NAGLE_PUSH));
+			      tp->nonagle : TCP_NAGLE_PUSH)) > 0;
  }
  
  /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet
@@ -1978,6 +2005,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  	unsigned int tso_segs, sent_pkts;
  	int cwnd_quota;
  	int result;
+	int why = TCP_ESTATS_SNDLIM_SENDER;
  	bool is_cwnd_limited = false;
  	u32 max_segs;
  
@@ -2008,6 +2036,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		cwnd_quota = tcp_cwnd_test(tp, skb);
  		if (!cwnd_quota) {
+			why = TCP_ESTATS_SNDLIM_CWND;
  			is_cwnd_limited = true;
  			if (push_one == 2)
  				/* Force out a loss probe pkt. */
@@ -2016,19 +2045,24 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  				break;
  		}
  
-		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
+		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
+			why = TCP_ESTATS_SNDLIM_RWIN;
  			break;
-
+		}
+
  		if (tso_segs == 1) {
  			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
  						     (tcp_skb_is_last(sk, skb) ?
  						      nonagle : TCP_NAGLE_PUSH))))
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		} else {
  			if (!push_one &&
  			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
-						 max_segs))
+						 max_segs)) {
+				why = TCP_ESTATS_SNDLIM_TSODEFER;
  				break;
+			}
  		}
  
  		limit = mss_now;
@@ -2041,6 +2075,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		if (skb->len > limit &&
  		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  		/* TCP Small Queues :
@@ -2064,10 +2099,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  			 */
  			smp_mb__after_atomic();
  			if (atomic_read(&sk->sk_wmem_alloc) > limit)
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		}
  
  		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  repair:
@@ -2080,9 +2117,12 @@ repair:
  		sent_pkts += tcp_skb_pcount(skb);
  
  		if (push_one)
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_sndlim(tp, why));
+
  	if (likely(sent_pkts)) {
  		if (tcp_in_cwnd_reduction(sk))
  			tp->prr_out += sent_pkts;
@@ -3148,11 +3188,16 @@ int tcp_connect(struct sock *sk)
  	 */
  	tp->snd_nxt = tp->write_seq;
  	tp->pushed_seq = tp->write_seq;
-	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
  
  	/* Timer for repeating the SYN until an answer. */
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
  				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->write_seq);
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->write_seq);
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
+	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
+
  	return 0;
  }
  EXPORT_SYMBOL(tcp_connect);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 1829c7f..0f6f1f4 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -477,6 +477,9 @@ out_reset_timer:
  		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
  	}
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_timeout(sk));
+
  	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
  		__sk_dst_reset(sk);
  
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5ff8780..db1f88f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1131,6 +1131,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (newsk == NULL)
  		goto out_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_INACTIVE);
+
  	/*
  	 * No need to charge this sock to the relevant IPv6 refcnt debug socks
  	 * count here, tcp_create_openreq_child now does this for us, see the
@@ -1463,6 +1465,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1473,6 +1477,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1661,6 +1667,7 @@ static int tcp_v6_init_sock(struct sock *sk)
  #ifdef CONFIG_TCP_MD5SIG
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
  #endif
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_ACTIVE);
  
  	return 0;
  }
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 0/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:49 UTC (permalink / raw)
  To: netdev

The following patch increments and/or updates selected RFC 4898 (TCP 
Extended Statistics MIB) metrics within the TCP stack; we refer to this 
as the Kernel Instrument Set (or KIS).  The goal of RFC 4898 is to 
expose advanced statistics from TCP’s vantage point to userland in order 
to help diagnose performance problems in both the network and 
application. The metrics are gathered and cached within structures 
defined in our header file (tcp_estats.h) on a per connection basis 
allowing for highly detailed analysis of all TCP flows. More information 
can be found at http://www.web10g.org/
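
A rough userspace sketch of this kind of per-connection bookkeeping, not
code from the patch set. It mirrors the pattern visible in the
tcp_write_xmit() hunk, where a local "why" records the reason the transmit
loop stopped and a single update runs after the loop. The names
conn_estats, sndlim_reason, estats_update_sndlim and toy_write_xmit are
hypothetical, and the real tcp_estats_update_sndlim() may well behave
differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit reasons, loosely modeled on the
 * TCP_ESTATS_SNDLIM_* names used in the patch. */
enum sndlim_reason {
	SNDLIM_NONE = 0,
	SNDLIM_SENDER,
	SNDLIM_CWND,
	SNDLIM_RWIN,
	SNDLIM_NREASONS
};

/* Hypothetical per-connection statistics block. */
struct conn_estats {
	unsigned long sndlim_trans[SNDLIM_NREASONS]; /* transitions into each state */
	enum sndlim_reason cur;                      /* state currently charged */
};

/* Count a transition only when the limiting reason actually changes. */
static void estats_update_sndlim(struct conn_estats *st, enum sndlim_reason why)
{
	assert(why < SNDLIM_NREASONS);
	if (st->cur != why) {
		st->sndlim_trans[why]++;
		st->cur = why;
	}
}

/* Toy transmit loop: "queued" segments are waiting, and at most "cwnd"
 * or "rwin" segments may be sent. The default reason is "sender
 * limited"; each early exit overrides it before breaking out. */
static int toy_write_xmit(struct conn_estats *st, int queued, int cwnd, int rwin)
{
	enum sndlim_reason why = SNDLIM_SENDER;
	int sent = 0;

	while (sent < queued) {
		if (sent >= cwnd) {
			why = SNDLIM_CWND;
			break;
		}
		if (sent >= rwin) {
			why = SNDLIM_RWIN;
			break;
		}
		sent++;
	}

	/* One update per call, after the loop, as in the patch. */
	estats_update_sndlim(st, why);
	return sent;
}
```

Recording the reason in a local and updating once keeps the hot loop free
of per-iteration statistics work, which is presumably part of why the
patch is structured that way.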

Note, the KIS does not integrate any specific ABI. This allows for a
clear separation between the kernel instruments and the methodology used
to make the metrics available to userland. Currently, we have a netlink
implementation available as a DLKM and an associated API available at
https://sourceforge.net/projects/tcpestats/files/

Performance analysis provided by the kernel development teams at Google
and Facebook indicates that the overhead imposed when the KIS is
configured active and exposed via an ABI is minimal.  Facebook reported
performance impacts of between 0% and 2%, depending on the frequency
of polling the KIS via ftrace.  Analysis performed at Google indicates
similar performance characteristics.

Since the size of the KIS patch set is considerable (~2k lines), we have
broken it up into two components. The first provides our structures and 
macros to the TCP networking DLKMs. The second provides the routines 
that manage and control the TCP Extended Statistics, as well as 
providing hooks for configuring and enabling the KIS. Each set of 
patches applies, compiles, and runs independently. However, full 
functionality requires both patch sets to be installed.

We took this approach because the control and management (C&M) routines 
are, in our view, of secondary importance to the actual instrumentation. 
As such, we did not want any issues with the C&M methods to impact the 
adoption of the KIS. There is overlap between the two patch sets 
(specifically the header files in the C&M) which will likely make 
applying the C&M patch cleanly on top of the KIS patch problematic. As 
such, I've also included a concatenated patch that includes both the KIS 
and the C&M for evaluation.

A git repo is available at http://github.com/rapier/web10g The
net-next branch contains the instrumentation (1st patch set) and the
control and management (2nd patch set). The API is also available at
http://github.com/rapier1/web10g-userland

Chris Rapier

^ permalink raw reply

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16 17:29 UTC (permalink / raw)
  To: John Fastabend, netdev@vger.kernel.org
  Cc: Roopa Prabhu, Jamal Hadi Salim, Jiri Pirko, sfeldma@gmail.com,
	bcrl@kvack.org, tgraf@suug.ch, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <549060CF.5020706@gmail.com>



> -----Original Message-----
> From: John Fastabend [mailto:john.fastabend@gmail.com]
> Sent: Tuesday, December 16, 2014 6:42 PM
> To: Arad, Ronen
> Cc: Roopa Prabhu; netdev@vger.kernel.org; Jamal Hadi Salim; Jiri Pirko;
> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/16/2014 03:01 AM, Arad, Ronen wrote:
> >
> > In my reply (inline) I elaborate on the validity of bridge-less and offloaded-
> bridge models for L2 switching.
> >
> > I also discuss the implied necessity of a bridge device for L3 routing and
> potential issues with the upcoming FIB offloading proposal.
> >
> >> -----Original Message-----
> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
> >> Sent: Tuesday, December 16, 2014 3:21 AM
> >> To: Arad, Ronen
> >> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri
> >> Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> >> stephen@networkplumber.org; linville@tuxdriver.com;
> >> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
> >>>> Sent: Tuesday, December 16, 2014 1:28 AM
> >>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
> >>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
> >>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
> >>>> vyasevic@redhat.com; davem@davemloft.net;
> >> shm@cumulusnetworks.com;
> >>>> gospo@cumulusnetworks.com
> >>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set
> >>>> and del bridge port attributes
> >>>>
> >>>> On 12/15/14 13:36, Arad, Ronen wrote:
> >>>>>
> >>>>>> -----Original Message-----
> >>>>> The behavior of a driver could depend on the presence of a bridge
> >>>>> and
> >>>> features such as FDB LEARNING and LEARNING_SYNC.
> >>>>
> >>>> Indeed, those are bridge attributes.
> >>>>
> >>>>> A switch port driver which is not enslaved to a bridge might need
> >>>>> to implement VLAN-aware FDB within the driver and report its
> >>>>> content to
> >>>>> user-
> >>>> space using ndo_fdb_dump.
> >>>>    >
> >>>>> A switch port driver which is enslaved to a bridge could do with
> >>>>> only pass through for static FDB configuration
> >>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
> >>>> user- space and soft aging are left to the bridge module FDB.
> >>>>> Such driver, without LEARNING_SYNC could still avoid maintaing
> >>>>> in-driver
> >>>> FDB as long as it could dump the HW FDB on demand.
> >>>>> LEARNING_SYNC also requires periodic updates of freshness
> >>>>> information
> >>>> from the driver to the bridge module.
> >>>>
> >>>> If you have an fdb - shouldnt that be exposed only if you have a
> >>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
> >>> I'm trying to find out what are the opinions of other people in the
> >>> netdev
> >> list.
> >>> John have clearly stated that he'd like to see full L2 switching
> >>> functionality
> >> (at least) supported without making a bridge device mandatory.
> >>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already
> >>> support that
> >> with proper setting of SELF/MASTER flags by iproute2.
> >>> I see the value in supporting both approaches (bridge device
> >>> mandatory and bridge device optional). If the choice is left to
> >>> user-driven policy decision, we need to document both use models and
> >>> map traditional L2 features to each model.
> >>> The L2 offloading (or NETFUNC as it is currently called), which is
> >>> being discussed on a different patch-set, is only needed when a
> >>> bridge device is used.
> >>> Without a bridge device, all configuration has to be targeted at the
> >>> switch port driver directly using the SELF flag. FDB remains
> >>> relevant and it is used to configure static MAC table entries and dump
> the HW MAC table.
> >
> >> Your understanding is right here. So far all patches have kept both
> >> models in mind.
> >
> >
> >>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or
> >>> even higher), there is a gap between what the HW is doing and what
> >>> is explicitly modeled in Linux.
> >
> >
> >> Can you elaborate more here ?. We use the linux model to accelerate a
> >> multi-layer (l2-l3) switch today. There maybe a few gaps, but these
> >> gaps can be closed by having equivalent functionality in the software path.
> >
> > What I meant is that without a bridge device the HW switch is seen as a
> collection of independent switch ports. Typical switch ASIC performs L2
> switching by default. This is not expressed explicitly in Linux without a bridge
> device.
> > The SELF flag is used to target typical bridge port and bridge configuration
> at a switch port device.
> > Without an explicit bridge device, bridge attributes have to be
> > directed at an arbitrary port (any port could represent the entire switch)
> and interpreted by the switch port driver as intended for the entire switch
> (this includes attributes like STP etc.) Each switch port device driver has to
> implement similar functionality (i.e. all bridge and fdb related ndos)
> independently without common functionality shared (e.g. FDB, soft aging).
> > It is a valid use model and could avoid the complexity of having to deal with
> the presence of both SW and HW bridge and to deal with explicit offloading
> of data-path.
> >
> > I was trying to find out whether the intention was to continue and support
> both bridge-less an offloaded-bridge models and leave it to the end-user to
> choose the desirable model at configuration time.
> > This would require dual support in the switch port driver in order to have
> best user experience across multiple switch ASICs or other kinds of devices.
> >
> 
> I'm still missing why there is duplicate implementations in the driver.
> If the driver implements the set of ndo ops why should it care who calls
> them? I think you tried to explain this already but I'm not seeing it.
> 

Let's consider a bridge property. I'll use the default PVID attribute as an example. This is currently configurable by sysfs only, and netlink support for it is still pending. Let's assume for our discussion that a DEFAULT_PVID attribute will be added as a bridge attribute within the AFSPEC nested attribute of the AF_BRIDGE SETLINK message.
When a bridge device is present, this attribute is processed by the bridge module and saved as default_pvid field in net_bridge structure. When a switch port is enslaved to a bridge, the bridge driver creates a net_bridge_port instance and assigns it a pvid inherited from the default_pvid attribute of the bridge. Setting the pvid for a new enslaved switch port is not done via netlink. It only applies to the net_bridge_port structure which is internal to the bridge module. Offloading this to HW is not addressed with current bridge offloading.

When a bridge device is not used, the DEFAULT_PVID will be targeted using the SELF flag to any of the switch ports. The driver will recognize that as a bridge port and will need to maintain some switch global structure similar to net_bridge where it could save the default_pvid. The driver, knowing that the switch port is not enslaved to a bridge, will have to replicate the same functionality. In the HW case, it will have to configure default VLAN on all the switch ports.
This is different from the yet to be defined way of propagating default PVID from a bridge device to offloaded bridge ports.
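
To make the bridge-less case concrete, here is a minimal userspace sketch,
assuming a driver that keeps a switch-global default_pvid and pushes it to
every port that is not enslaved to a bridge. All names (toy_switch,
toy_port, toy_set_default_pvid_self) are hypothetical, and the array write
stands in for whatever HW programming a real driver would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_PORTS 4

/* Hypothetical per-port state kept by the switch port driver. */
struct toy_port {
	bool enslaved;    /* true when a bridge device owns this port's L2 state */
	uint16_t hw_pvid; /* stand-in for the PVID programmed into HW */
};

/* Hypothetical switch-global state, playing the role that
 * net_bridge's default_pvid plays when a bridge device exists. */
struct toy_switch {
	uint16_t default_pvid;
	struct toy_port ports[TOY_NUM_PORTS];
};

/* Handle a SELF-targeted default PVID request received on any port.
 * The port only identifies the switch; the setting applies switch-wide.
 * Returns 0 on success, -1 on a bad port or VLAN ID (toy validation:
 * only 1..4094 accepted). */
static int toy_set_default_pvid_self(struct toy_switch *sw,
				     unsigned int port, uint16_t vid)
{
	unsigned int i;

	if (port >= TOY_NUM_PORTS || vid == 0 || vid > 4094)
		return -1;

	sw->default_pvid = vid;

	/* Assumption: ports enslaved to a bridge are left to the bridge. */
	for (i = 0; i < TOY_NUM_PORTS; i++)
		if (!sw->ports[i].enslaved)
			sw->ports[i].hw_pvid = vid;

	return 0;
}
```

The point of the sketch is only that the driver ends up owning state and
fan-out logic that the bridge module would otherwise provide.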

Another example is STP. STP attributes are bridge attributes which are not offloaded when a bridge device is present. The bridge module handles STP protocol internally. Without bridge device, STP attributes have to be targeted at a switch port device and the driver should save them in driver-specific structures and have proprietary implementation of STP (as the one in the bridge module is not used).

 
> [...]
> 
> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa might have
> worked some of it out.
> 
> --
> John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Samudrala, Sridhar @ 2014-12-16 17:21 UTC (permalink / raw)
  To: John Fastabend, Jamal Hadi Salim
  Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <54905F67.2090509@gmail.com>


On 12/16/2014 8:35 AM, John Fastabend wrote:
>
>> Is there no way to get the unicast/multicast mac addresses for such
>> a driver?
>
> You can almost infer it from ip link by looking at all the stacked
> drivers and figuring out how the address are propagated down. Then
> look at the routes and figure out multicast address. But other than
> the fdb dump mechanism I don't think there is anything.

It looks like we can get the device specific unicast/multicast mac 
addresses via 'ip maddr' too.

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Michael Chan @ 2014-12-16 17:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rajat Jain, Marcelo Ricardo Leitner, Nils Holland, David Miller,
	netdev, linux-pci@vger.kernel.org, Rafael Wysocki,
	Prashant Sreedharan
In-Reply-To: <CAErSpo5dqQE7nZ6zf2odgpHBWA3ZpTjhbgQKnY8YxQW+a+298w@mail.gmail.com>

On Tue, 2014-12-16 at 09:20 -0700, Bjorn Helgaas wrote:
> I think we're in this path:
> 
>     tg3_init_hw
>       tg3_reset_hw
>         tg3_disable_ints
>         tg3_stop_fw
>         tg3_write_sig_pre_reset
>         tg3_chip_reset
>           pci_device_is_present
>             pci_bus_read_dev_vendor_id
> 
> and in this case pci_device_is_present() also passes a timeout of zero
> to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
> the device, so it's not too surprising that the config read returns
> CRS status immediately afterward.
> 
At the point of calling pci_device_is_present(), chip reset hasn't
started yet, so there should be no problem reading config space.

In all the newer tg3 chips, chip reset does not reset the PCIE block.
So I think config space should always be accessible even during reset.
> 

^ permalink raw reply

* Re: BCM4313 & brcmsmac & 3.12: only semi-working?
From: Arend van Spriel @ 2014-12-16 16:51 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Maximilian Engelhardt, Rafał Miłecki, Seth Forshee,
	brcm80211 development, linux-wireless@vger.kernel.org,
	Network Development
In-Reply-To: <547F0575.7010104@broadcom.com>

On 12/03/14 13:43, Arend van Spriel wrote:
> On 12/02/14 22:40, Michael Tokarev wrote:
>> 30.11.2014 15:04, Arend van Spriel wrote:
>>
>>> Thanks. Did not find what I was looking for, but I started working on
>>> integrating btcoex related functionality. The attached patch will print
>>> some info so I can focus on the required functionality for your device.
>>> It is based on 3.18-rc5.
>>
>> With this patch applied against 3.18-rc5, the machine instantly reboots
>> once brcmsmac module is loaded. I'm still debugging this.

Hmm. The function brcms_btc_ecicoex_enab() is calling itself. Please 
remove that call, as it causes endless recursion and eventually a reboot.

Regards,
Arend

> Argh. Probably the register access I added end up in limbo land or some
> other stupid mistake. I will double check my patch.
>
> Regards,
> Arend
>
>> Thanks,
>>
>> /mjt
>

^ permalink raw reply

* [PATCH_V2] dm9000: Add regulator and reset support to dm9000
From: Zubair Lutfullah Kakakhel @ 2014-12-16 16:46 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, paul.burton-1AXoQHu6uovQT0dZR+AlfA,
	Zubair.Kakakhel-1AXoQHu6uovQT0dZR+AlfA

In boards, the dm9000 chip's power and reset can be controlled by gpio.

It makes sense to add them to the dm9000 driver and let dt be used to
enable power and reset the phy.

Signed-off-by: Zubair Lutfullah Kakakhel <Zubair.Kakakhel-1AXoQHu6uovQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Paul Burton <paul.burton-1AXoQHu6uovQT0dZR+AlfA@public.gmane.org>
---
V2. Fixed a small blooper. dev_dgb -> dev_dbg
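
For readers following the probe-time sequence, a minimal userspace sketch
of the power-then-reset ordering, not the driver code itself. The board
hooks (toy_board, power_on, set_reset, sleep_ms) and the trace helpers are
hypothetical stand-ins for the regulator and gpio calls. One deliberate
difference from the hunk below: the hunk stores the result of
regulator_enable() in ret without testing it, while this sketch returns
the error. Whether probe should fail in that case is a judgment call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical board hooks; NULL means "not provided by the board". */
struct toy_board {
	int (*power_on)(void *ctx);              /* returns 0 or a negative error */
	void (*set_reset)(void *ctx, int level); /* drive the reset line */
	void (*sleep_ms)(void *ctx, unsigned int ms);
	void *ctx;
};

/* Enable power (if any), then pulse reset low and release it.
 * Delays follow the comments in the patch: at least 1 ms low,
 * then about 3 ms for the EEPROM read after release. */
static int toy_power_and_reset(const struct toy_board *b)
{
	int ret;

	if (b->power_on) {
		ret = b->power_on(b->ctx);
		if (ret)
			return ret; /* do not reset a chip that has no power */
	}

	if (b->set_reset) {
		b->set_reset(b->ctx, 0);
		b->sleep_ms(b->ctx, 2);
		b->set_reset(b->ctx, 1);
		b->sleep_ms(b->ctx, 4);
	}

	return 0;
}

/* Trace helpers so the sequence can be checked without hardware. */
struct toy_trace {
	int levels[4];
	unsigned int n_levels;
	unsigned int slept_ms;
	int power_ret;
};

static int trace_power_on(void *ctx)
{
	return ((struct toy_trace *)ctx)->power_ret;
}

static void trace_set_reset(void *ctx, int level)
{
	struct toy_trace *t = (struct toy_trace *)ctx;

	if (t->n_levels < 4)
		t->levels[t->n_levels++] = level;
}

static void trace_sleep_ms(void *ctx, unsigned int ms)
{
	((struct toy_trace *)ctx)->slept_ms += ms;
}
```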

---
 .../devicetree/bindings/net/davicom-dm9000.txt     |  4 +++
 drivers/net/ethernet/davicom/dm9000.c              | 33 ++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/davicom-dm9000.txt b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
index 28767ed..dba19a2 100644
--- a/Documentation/devicetree/bindings/net/davicom-dm9000.txt
+++ b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
@@ -11,6 +11,8 @@ Required properties:
 Optional properties:
 - davicom,no-eeprom : Configuration EEPROM is not available
 - davicom,ext-phy : Use external PHY
+- reset-gpio : phandle of gpio that will be used to reset chip during probe
+- vcc-supply : phandle of regulator that will be used to enable power to chip
 
 Example:
 
@@ -21,4 +23,6 @@ Example:
 		interrupts = <7 4>;
 		local-mac-address = [00 00 de ad be ef];
 		davicom,no-eeprom;
+		reset-gpio = <&gpf 12 GPIO_ACTIVE_LOW>;
+		vcc-supply = <&eth0_power>;
 	};
diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c
index ef0bb58..97dbeec 100644
--- a/drivers/net/ethernet/davicom/dm9000.c
+++ b/drivers/net/ethernet/davicom/dm9000.c
@@ -36,6 +36,9 @@
 #include <linux/platform_device.h>
 #include <linux/irq.h>
 #include <linux/slab.h>
+#include <linux/regulator/consumer.h>
+#include <linux/gpio.h>
+#include <linux/of_gpio.h>
 
 #include <asm/delay.h>
 #include <asm/irq.h>
@@ -1426,11 +1429,41 @@ dm9000_probe(struct platform_device *pdev)
 	struct dm9000_plat_data *pdata = dev_get_platdata(&pdev->dev);
 	struct board_info *db;	/* Point a board information structure */
 	struct net_device *ndev;
+	struct device *dev = &pdev->dev;
 	const unsigned char *mac_src;
 	int ret = 0;
 	int iosize;
 	int i;
 	u32 id_val;
+	int reset_gpio;
+	enum of_gpio_flags flags;
+	struct regulator *power;
+
+	power = devm_regulator_get(dev, "vcc");
+	if (IS_ERR(power)) {
+		dev_dbg(dev, "no regulator provided\n");
+	} else if (!regulator_is_enabled(power)) {
+		ret = regulator_enable(power);
+		dev_dbg(dev, "regulator enabled\n");
+	}
+
+	reset_gpio = of_get_named_gpio_flags(dev->of_node, "reset-gpio", 0,
+					     &flags);
+	if (gpio_is_valid(reset_gpio)) {
+		ret = devm_gpio_request_one(dev, reset_gpio, flags,
+					    "dm9000_reset");
+		if (ret) {
+			dev_err(dev, "failed to request reset gpio %d: %d\n",
+				reset_gpio, ret);
+		} else {
+			gpio_direction_output(reset_gpio, 0);
+			/* According to manual PWRST# Low Period Min 1ms */
+			msleep(2);
+			gpio_direction_output(reset_gpio, 1);
+			/* Needs 3ms to read eeprom when PWRST is deasserted */
+			msleep(4);
+		}
+	}
 
 	if (!pdata) {
 		pdata = dm9000_parse_dt(&pdev->dev);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: John Fastabend @ 2014-12-16 16:41 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: Roopa Prabhu, netdev@vger.kernel.org, Jamal Hadi Salim,
	Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB15CA@ORSMSX101.amr.corp.intel.com>

On 12/16/2014 03:01 AM, Arad, Ronen wrote:
>
> In my reply (inline) I elaborate on the validity of bridge-less and offloaded-bridge models for L2 switching.
>
> I also discuss the implied necessity of a bridge device for L3 routing and potential issues with the upcoming FIB offloading proposal.
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>> Sent: Tuesday, December 16, 2014 3:21 AM
>> To: Arad, Ronen
>> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>>>
>>>> -----Original Message-----
>>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>>>> Sent: Tuesday, December 16, 2014 1:28 AM
>>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>>>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>>>> gospo@cumulusnetworks.com
>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>>>> del bridge port attributes
>>>>
>>>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>> The behavior of a driver could depend on the presence of a bridge
>>>>> and
>>>> features such as FDB LEARNING and LEARNING_SYNC.
>>>>
>>>> Indeed, those are bridge attributes.
>>>>
>>>>> A switch port driver which is not enslaved to a bridge might need to
>>>>> implement VLAN-aware FDB within the driver and report its content to
>>>>> user-
>>>> space using ndo_fdb_dump.
>>>>    >
>>>>> A switch port driver which is enslaved to a bridge could do with
>>>>> only pass through for static FDB configuration
>>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
>>>> user- space and soft aging are left to the bridge module FDB.
>>>>> Such driver, without LEARNING_SYNC could still avoid maintaing
>>>>> in-driver
>>>> FDB as long as it could dump the HW FDB on demand.
>>>>> LEARNING_SYNC also requires periodic updates of freshness
>>>>> information
>>>> from the driver to the bridge module.
>>>>
>>>> If you have an fdb - shouldnt that be exposed only if you have a
>>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
>>> I'm trying to find out what are the opinions of other people in the netdev
>> list.
>>> John have clearly stated that he'd like to see full L2 switching functionality
>> (at least) supported without making a bridge device mandatory.
>>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that
>> with proper setting of SELF/MASTER flags by iproute2.
>>> I see the value in supporting both approaches (bridge device mandatory
>>> and bridge device optional). If the choice is left to user-driven policy decision,
>>> we need to document both use models and map traditional L2 features to
>>> each model.
>>> The L2 offloading (or NETFUNC as it is currently called), which is being
>>> discussed on a different patch-set, is only needed when a bridge device is
>>> used.
>>> Without a bridge device, all configuration has to be targeted at the switch
>>> port driver directly using the SELF flag. FDB remains relevant and it is used to
>>> configure static MAC table entries and dump the HW MAC table.
>
>> Your understanding is right here. So far all patches have kept both models in
>> mind.
>
>
>>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even
>>> higher), there is a gap between what the HW is doing and what is explicitly
>>> modeled in Linux.
>
>
>> Can you elaborate more here ?. We use the linux model to accelerate a
>> multi-layer (l2-l3) switch today. There maybe a few gaps, but these gaps can
>> be closed by having equivalent functionality in the software path.
>
> What I meant is that without a bridge device the HW switch is seen as a collection of independent switch ports. Typical switch ASIC performs L2 switching by default. This is not expressed explicitly in Linux without a bridge device.
> The SELF flag is used to target typical bridge port and bridge configuration at a switch port device.
> Without an explicit bridge device, bridge attributes have to be directed at an arbitrary port (any port could represent the entire switch) and interpreted by the switch port driver as intended for the entire switch (this includes attributes like STP etc.)
> Each switch port device driver has to implement similar functionality (i.e. all bridge and fdb related ndos) independently without common functionality shared (e.g. FDB, soft aging).
> It is a valid use model and could avoid the complexity of having to deal with the presence of both SW and HW bridge and to deal with explicit offloading of data-path.
>
> I was trying to find out whether the intention was to continue and support both bridge-less an offloaded-bridge models and leave it to the end-user to choose the desirable model at configuration time.
> This would require dual support in the switch port driver in order to have best user experience across multiple switch ASICs or other kinds of devices.
>

I'm still missing why there are duplicate implementations in the driver.
If the driver implements the set of ndo ops why should it care who calls
them? I think you tried to explain this already but I'm not seeing it.

[...]

I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa
might have worked some of it out.

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH] dm9000: Add regulator and reset support to dm9000
From: Zubair Lutfullah Kakakhel @ 2014-12-16 16:41 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, paul.burton-1AXoQHu6uovQT0dZR+AlfA
In-Reply-To: <1418747624-2682-1-git-send-email-Zubair.Kakakhel-1AXoQHu6uovQT0dZR+AlfA@public.gmane.org>



On 16/12/14 16:33, Zubair Lutfullah Kakakhel wrote:
...

> +
> +	power = devm_regulator_get(dev, "vcc");
> +	if (IS_ERR(power)) {
> +		dev_dbg(dev, "no regulator provided\n");
> +	} else if (!regulator_is_enabled(power)) {
> +		ret = regulator_enable(power);
> +		dev_dgb(dev, "regulator enabled\n");
		^dev_dbg

Apologies. This fix wasn't squashed in. I'll resend.

> +	}
> +
> +	reset_gpio = of_get_named_gpio_flags(dev->of_node, "reset-gpio", 0,
> +					     &flags);

ZubairLK

^ permalink raw reply

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: John Fastabend @ 2014-12-16 16:35 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <54902E5E.2070405@mojatatu.com>

On 12/16/2014 05:06 AM, Jamal Hadi Salim wrote:
> On 12/15/14 19:45, John Fastabend wrote:
>> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
>
>>
>> hmm good question. When I implemented this on the host nics with SR-IOV,
>> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
>> the driver.
>
> So if i understand correctly, this is a NIC with an FDB. And there is no
> concept of a bridge to which it is attached. To the point of
> classical uni/multicast addresses on a netdev abstraction; these
> are typically stored in *much simpler tables* (used to be IO
> registers back in the day)

From a model perspective it looks like an edge relay. Only a single
downlink with multiple uplinks. No learning, no loops and so no
STP et al. required. It may support MAC+VLAN forwarding
or just MAC forwarding.

It may be configured via register writes or more complicated firmware
requests or some other mechanism. This is device dependent; even across
devices from the same vendor the mechanisms change. But the driver
abstracts this.

> Do these NICs not have such a concept?
> An fdb entry has an egress port column; I have seen cases where the
> port is labeled as "Cpu port" which would mean it belongs to the host;

But in the SR-IOV case you have multiple "Cpu ports" and you want
to send packets to each of them depending on the configuration.

    port0   port1     port2  port3
     |        |        |      |      uplinks
  +------------------------------+
  |                              |
  |       SRIOV edge relay       |
  |                              |
  +------------------------------+
                  |                   downlink

In a host nic with SRIOV each port will be a PCIE function. So really
they are all CPU ports. For multi-function devices they might all be
physical functions.

In the hardware there needs to be a table to forward incoming traffic
to the correct port#. For L2 we use MAC+VLAN and an egress port column
to select the port. The model shouldn't care if the port is backed by
a VF or PF or set of queues. It just needs to forward packets to the
correct uplink.
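
As a rough illustration of such a table, here is a minimal userspace
sketch of a static FDB keyed on MAC+VLAN with an egress port column. It is
not taken from any driver; toy_fdb, toy_fdb_add and toy_fdb_lookup are
hypothetical names, and a linear scan stands in for whatever lookup the
hardware really does. Entries are only ever added explicitly, matching the
"no learning" property of the edge relay.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FDB_SIZE 8
#define TOY_NO_PORT  (-1)

/* One forwarding entry: (MAC, VLAN) -> egress port. */
struct toy_fdb_entry {
	uint8_t mac[6];
	uint16_t vlan;
	int port;   /* may be backed by a VF, a PF or a set of queues */
	int in_use;
};

struct toy_fdb {
	struct toy_fdb_entry e[TOY_FDB_SIZE];
};

/* Add or update a static entry. Returns 0, or -1 if the table is full. */
static int toy_fdb_add(struct toy_fdb *fdb, const uint8_t mac[6],
		       uint16_t vlan, int port)
{
	struct toy_fdb_entry *free_ent = NULL;
	int i;

	for (i = 0; i < TOY_FDB_SIZE; i++) {
		struct toy_fdb_entry *ent = &fdb->e[i];

		if (!ent->in_use) {
			if (!free_ent)
				free_ent = ent;
			continue;
		}
		if (ent->vlan == vlan && memcmp(ent->mac, mac, 6) == 0) {
			ent->port = port; /* same key: just move the port */
			return 0;
		}
	}

	if (!free_ent)
		return -1;

	memcpy(free_ent->mac, mac, 6);
	free_ent->vlan = vlan;
	free_ent->port = port;
	free_ent->in_use = 1;
	return 0;
}

/* Return the egress port for (mac, vlan), or TOY_NO_PORT on a miss. */
static int toy_fdb_lookup(const struct toy_fdb *fdb, const uint8_t mac[6],
			  uint16_t vlan)
{
	int i;

	for (i = 0; i < TOY_FDB_SIZE; i++) {
		const struct toy_fdb_entry *ent = &fdb->e[i];

		if (ent->in_use && ent->vlan == vlan &&
		    memcmp(ent->mac, mac, 6) == 0)
			return ent->port;
	}
	return TOY_NO_PORT;
}
```

What happens on a miss (drop, flood, or send to the downlink) is left out
on purpose, since that is exactly the device-dependent part.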

One issue we have today when writing software for these edge relays
is we don't have a netdev representing the downlink. Or a netdev
representing management functions of the device. So if I want to
say change the mode of the edge relay from VEB to VEPA I usually
just send the message to the PF. Or if I want to send packets out on
the wire but not through the edge relay usually we do this by sending
control packets over an elected PF and it will attach a tag or something
so the edge relay doesn't forward or flood them to other uplinks. Adding
a netdev for the downlink would probably clean some of this up. Now
we rely on some behaviour that is not well-defined.

> but in this case it just seems there is no such concept and as Or
> brought up in another email - what does "VLANid" mean in such a case?

I think most host nics with SR-IOV can forward using VLAN + MAC and
do filtering on VLANid. Many can also put a default VLAN on the packet.

> If we go with a CPU port concept,
> We could then use the concept of a vlan filter on a port basis
> but then what happens when you dont have an fdb (majority of cases)?

Not sure what the question is here.. I'm hoping the above helped
explain my thinking on this.

Don't have an FDB? This means you don't have any way to forward
between ports so you must have a 1:1 mapping between the physical
port and the netdev. I think it's fair to think of this as a TPMR
(two port mac relay) although not a very useful abstraction.

>
>> My logic was if some netdev ethx has a set of MAC addresses
>> above it well then any virtual function or virtual device also behind
>> the hardware shouldn't be sending those addresses out the egress switch
>> facing port. Otherwise the switch will see packets it knows are behind
>> that port and drop them. Or flood them if it hasn't learned the address
>> yet. Either way they will never get to the right netdev.
>>
>> Admittedly I wasn't thinking about switches with many ports at the time.
>>
>
> I often struggle with trying to "box" SRIOV into some concept of a
> switch abstraction and sometimes i am puzzled.
> Would exposing the SRIOV underlay as a switch not have solved this
> problem? Then the virtual ports essentially are bridge ports.

Yes this would help and this is how I view it. Although the
edge relay vs "real standards based" bridge distinction is important
because we don't do learning, only have a single uplink, don't run
loop detecting protocols, etc. All that stuff is not needed on a host
where you "know" your MAC addresses (at least for many use cases) and
can not build loops.

> Maybe what we need is a concept of a "edge relay" extended netdev?

This is effectively what the fdb table does, right? Sure, it's not as
explicit as it could be but this is how I treat the NIC when I learn
it has multiple downlinks and a single uplink. At the moment we use
a trick similar to Jiri's on rocker, when we get a switch op like
getlink, setlink we "know" what switch object it refers to because
the netdev maps to a single switch always.

> These things would have an fdb as well down and uplink relay ports that
> can be attached to them.
>

Right, in the current code paths there is no "attach" operation; we assume
the edge relay and ports are attached when the ports are created via
SR-IOV or hw-offload or whatever.

What are we missing? We have the FDB and a unique id to show ports on
the same edge relay. User space can build this abstraction from those
two things. A downlink netdev port would probably clean up the
abstraction a bit especially for sending control frames.

>
>>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>>> the other driver did).
>>
>> My original thinking here was... if it didn't implement fdb_add, fdb_del
>> and fdb_dump then if you wanted to think of it as having forwarding
>> database that was fine but it was really just a two port mac relay. In
>> which case just dump all the mac addresses it knows about. In this case
>> if it was something more fancy it could do its own dump like vxlan or
>> macvlan.
>>
>
> The challenge here is lack of separation between a NICs uni/multicast
> ports which it owns - which is a traditional operation regardless of
> what capabilities the NIC has; vs an fdb which may have many
> other capabilities. Probably all NICs capable of many MACs implement
> fdbs?

Yes, they must in order to support forwarding. Agreed, it's a bit clunky
the way we overload uni/multicast address lists. But what does it mean
to add a unicast address to a port and not have it in the FDB? If
the port wants to receive traffic on a MAC because it's added to the
unicast list, doesn't that mean inserting it into the FDB so the packets
actually get sent to the netdev?

Otherwise it's a two-step process: first add it to the multicast list
and then add it to the FDB. I'm not sure why this is valuable.

>
>> For a host nic ucast/multicast and fdb are the same, I think? The
>> code we had was just short-hand to allow the common case a host nic
>> to work. Notice vxlan and bridge drivers didn't dump their addr lists
>> from fdb_dump until your patch.
>>
>> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
>> I shouldn't overload the addr lists.
>>
>
> Not just those - I am wondering about the general utility of what
> Hubert was trying to do if all the driver does is call the default
> dumper based on some flags presence and the default dumper
> does a dump of uni/multicast host entries. Those are not really fdb
> entries in the traditional sense.

But as a practical matter any uni/multicast entry is in the FDB
so when the host nic has multiple ports we receive those mac addresses
on the port. The drivers do this today and it seems reasonable to me.

> Is there no way to get the unicast/multicast mac addresses for such
> a driver?

You can almost infer it from ip link by looking at all the stacked
drivers and figuring out how the addresses are propagated down. Then
look at the routes and figure out the multicast addresses. But other than
the fdb dump mechanism I don't think there is anything.

> I think that would help bring clarity to my confusion.
>

clear as mud now?

>
>>
>> I'm interested to see what Vlad says as well. But the current situation
>> is previously some drivers dumped their addr lists others didn't.
>> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
>> every device will dump the addr lists. I'm not entirely convinced that
>> is correct.
>>
>
> I am glad this happened ;-> Otherwise we wouldnt be having this
> discussion. When Vlad was asking me I was in a rush to get the patch
> out and didnt question because i thought this was something some crazy
> virtualization people needed.
> If Vlad's use case goes away, then Hubert's little restoration is fine.

Yep. maybe we can talk about it at the netdev users conference

>
>
>> It works OK for host nics (NICS that can't forward between ports) and
>> seems at best confusing for real switch asics.
>
> So if these NICs have fdb entries and i programmed it (meaning setting
> which port a given MAC should be sent to), would it not work?

You mean via 'bridge fdb add'? Yes, this will work. But then as a
shorthand we also program the ucast/multicast addresses. (Have I beaten
this to death yet?)

>
>> On a related question do
>> you expect the switch asic to trap any packets with MAC addresses in
>> the multi/unicast address lists and send them to the correct netdev? Or
>> will the switch forward them using normal FDB tables?
>>
>
> I think there would be a separate table for that. Roopa, can you check
> with the ASICs you guys work on? The point i was trying to make above
> is today there is a uni/multicast list or table of sorts that all NICs
> expose.
> There's always the hack of a "cpu port". I have also seen the "cpu port"
> being conceptualized in L3 tables to imply "next hop is cpu" where you
> have an IP address owned by the host; so maybe we need a concept of a
> cpu port or again the revival of TheThing class device.

OK the confusing part of "cpu port" to me is in a host nic trying to
map this abstraction onto it implies a host nic may have many "cpu
ports".

Thanks,
.John

>
> cheers,
> jamal
>

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* [PATCH] dm9000: Add regulator and reset support to dm9000
From: Zubair Lutfullah Kakakhel @ 2014-12-16 16:33 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, paul.burton-1AXoQHu6uovQT0dZR+AlfA,
	Zubair.Kakakhel-1AXoQHu6uovQT0dZR+AlfA

On some boards, the dm9000 chip's power and reset can be controlled by GPIO.

It makes sense to add them to the dm9000 driver and let dt be used to
enable power and reset the phy.

Signed-off-by: Zubair Lutfullah Kakakhel <Zubair.Kakakhel-1AXoQHu6uovQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Paul Burton <paul.burton-1AXoQHu6uovQT0dZR+AlfA@public.gmane.org>
---
 .../devicetree/bindings/net/davicom-dm9000.txt     |  4 +++
 drivers/net/ethernet/davicom/dm9000.c              | 33 ++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/davicom-dm9000.txt b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
index 28767ed..dba19a2 100644
--- a/Documentation/devicetree/bindings/net/davicom-dm9000.txt
+++ b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
@@ -11,6 +11,8 @@ Required properties:
 Optional properties:
 - davicom,no-eeprom : Configuration EEPROM is not available
 - davicom,ext-phy : Use external PHY
+- reset-gpio : phandle of gpio that will be used to reset chip during probe
+- vcc-supply : phandle of regulator that will be used to enable power to chip
 
 Example:
 
@@ -21,4 +23,6 @@ Example:
 		interrupts = <7 4>;
 		local-mac-address = [00 00 de ad be ef];
 		davicom,no-eeprom;
+		reset-gpio = <&gpf 12 GPIO_ACTIVE_LOW>;
+		vcc-supply = <&eth0_power>;
 	};
diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c
index ef0bb58..7333b8d 100644
--- a/drivers/net/ethernet/davicom/dm9000.c
+++ b/drivers/net/ethernet/davicom/dm9000.c
@@ -36,6 +36,9 @@
 #include <linux/platform_device.h>
 #include <linux/irq.h>
 #include <linux/slab.h>
+#include <linux/regulator/consumer.h>
+#include <linux/gpio.h>
+#include <linux/of_gpio.h>
 
 #include <asm/delay.h>
 #include <asm/irq.h>
@@ -1426,11 +1429,41 @@ dm9000_probe(struct platform_device *pdev)
 	struct dm9000_plat_data *pdata = dev_get_platdata(&pdev->dev);
 	struct board_info *db;	/* Point a board information structure */
 	struct net_device *ndev;
+	struct device *dev = &pdev->dev;
 	const unsigned char *mac_src;
 	int ret = 0;
 	int iosize;
 	int i;
 	u32 id_val;
+	int reset_gpio;
+	enum of_gpio_flags flags;
+	struct regulator *power;
+
+	power = devm_regulator_get(dev, "vcc");
+	if (IS_ERR(power)) {
+		dev_dbg(dev, "no regulator provided\n");
+	} else if (!regulator_is_enabled(power)) {
+		ret = regulator_enable(power);
+		dev_dbg(dev, "regulator enabled\n");
+	}
+
+	reset_gpio = of_get_named_gpio_flags(dev->of_node, "reset-gpio", 0,
+					     &flags);
+	if (gpio_is_valid(reset_gpio)) {
+		ret = devm_gpio_request_one(dev, reset_gpio, flags,
+					    "dm9000_reset");
+		if (ret) {
+			dev_err(dev, "failed to request reset gpio %d: %d\n",
+				reset_gpio, ret);
+		} else {
+			gpio_direction_output(reset_gpio, 0);
+			/* According to manual PWRST# Low Period Min 1ms */
+			msleep(2);
+			gpio_direction_output(reset_gpio, 1);
+			/* Needs 3ms to read eeprom when PWRST is deasserted */
+			msleep(4);
+		}
+	}
 
 	if (!pdata) {
 		pdata = dm9000_parse_dt(&pdev->dev);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* FIXED_PHY is broken...
From: David Miller @ 2014-12-16 16:25 UTC (permalink / raw)
  To: netdev; +Cc: f.fainelli

I get this now when I run oldconfig:

warning: (NET_DSA_BCM_SF2 && BCMGENET && SYSTEMPORT) selects FIXED_PHY which has unmet direct dependencies (NETDEVICES && PHYLIB=y)

For the thousandth time, you cannot select Kconfig options which have
dependencies of any kind, because select does not recursively cause
dependencies to be enabled up to the root of the Kconfig tree.

If you select on something which has a "depends on", stop right there
because you can't do it.

It only works for pure leaf Kconfig nodes with no deps.

All you needed to do in order to test this was do an allmodconfig
build.
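
To make the failure mode concrete, here is a toy userspace model of the behaviour described above. It is not how the Kconfig tools are implemented; the struct and function names are hypothetical, and each symbol is limited to a single dependency for brevity. It shows only that a select turns the target on without enabling what the target depends on, which leaves an unmet dependency.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a Kconfig symbol. */
struct ksym {
	const char *name;
	bool enabled;
	const struct ksym *depends_on;	/* single dependency, NULL if none */
};

/* "select" forces the target on without looking at its dependencies. */
void ksym_select(struct ksym *target)
{
	target->enabled = true;
}

/* True when the symbol is on while its dependency is off. */
bool ksym_has_unmet_dep(const struct ksym *s)
{
	return s->enabled && s->depends_on && !s->depends_on->enabled;
}
```

In this model, selecting a symbol that depends on a disabled one reports an unmet dependency, while selecting a leaf symbol with no dependency does not.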

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Bjorn Helgaas @ 2014-12-16 16:20 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Marcelo Ricardo Leitner, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan,
	Michael Chan
In-Reply-To: <CAA93t1qyZE-9tw8pg1KG6g4iyy0QMW=iass5w=6ZGMTMu+vi_A@mail.gmail.com>

[+cc Rafael, Prashant, Michael]

On Tue, Dec 16, 2014 at 9:04 AM, Rajat Jain <rajatxjain@gmail.com> wrote:
> Hello All,
>
> Apologies for jumping in late, but for some reason I do not see the
> original mail in my inbox. However I am taking a look at the mails as
> sent on linux-pci (and I will keep an eye out for the bug report that
> Bjorn asked for).
>
>
>>
>> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>>
>> $ grep 'pci 0000:02' tg3.bad
>> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.561555] pci 0000:02:00.0: 1st 1 1
>> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
>> [   20.412021] pci 0000:02:00.0: 1st 1 1
>> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
>> [   20.413596] pci 0000:02:00.0: 1st 1 1
>> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>>
>> And without it:
>>
>> $ grep 'pci 0000:02' tg3.good
>> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.576778] pci 0000:02:00.0: 1st 1 1
>> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>>
>
> It seems that the first 2 attempts made to probe the device are
> all OK and return the regular device ID and vendor ID for TG3
> (CRS does not have a role to play). However, later attempts return a
> CRS.
>
> 1) May I ask if you are using acpihp or pciehp? I assume pciehp?
>
> 2) Can you please also send dmesg output while passing
> pciehp.pciehp_debug=1? In the fail case, do you see a message
> indicating the pciehp gave up since it got CRS for a long time
> (something like "pci 0000:02:00.0 id reading try 50 times with
> interval 20 ms to get ffff0001")?
>
> 3) Currently the pciehp passes "0" for the argument "crs_timeout" to
> pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say
> 30 seconds (30 * 1000). (For comparison data, acpihp uses the value
> 60*1000 i.e. 60 seconds today) and run the fail case once again?

Using zero for the timeout seems bogus to me.  But I doubt pciehp is
involved in this situation.

I think we're in this path:

    tg3_init_hw
      tg3_reset_hw
        tg3_disable_ints
        tg3_stop_fw
        tg3_write_sig_pre_reset
        tg3_chip_reset
          pci_device_is_present
            pci_bus_read_dev_vendor_id

and in this case pci_device_is_present() also passes a timeout of zero
to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
the device, so it's not too surprising that the config read returns
CRS status immediately afterward.
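
A small userspace sketch of why a zero timeout matters here. This is a simplified model and not the kernel function: struct fake_dev, fake_read_id() and read_vendor_id() are hypothetical names, the sleep is only simulated by a counter, and the doubling backoff is an assumption about how such a retry loop is typically written. The 0xffff0001 value is the CRS pattern quoted in the message above, and 0x165a14e4 is the ID from the posted logs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device that answers CRS for a while after reset. */
struct fake_dev {
	int crs_reads_left;	/* reads that still return CRS */
	uint32_t id;		/* real vendor/device ID dword */
	int waited_ms;		/* total simulated sleep time */
};

/* Simulated config read of the vendor/device ID dword. */
uint32_t fake_read_id(struct fake_dev *d)
{
	if (d->crs_reads_left > 0) {
		d->crs_reads_left--;
		return 0xffff0001;	/* vendor ID 0x0001 signals CRS */
	}
	return d->id;
}

/* Poll until the device stops returning CRS or the timeout budget runs out. */
bool read_vendor_id(struct fake_dev *d, uint32_t *l, int crs_timeout_ms)
{
	int delay = 1;

	*l = fake_read_id(d);
	while ((*l & 0xffff) == 0x0001) {
		/* A timeout of zero means no retries at all. */
		if (!crs_timeout_ms || delay > crs_timeout_ms)
			return false;
		d->waited_ms += delay;	/* stands in for msleep(delay) */
		delay *= 2;
		*l = fake_read_id(d);
	}
	return true;
}
```

With a timeout of zero the first CRS response is treated as failure, so a device that is merely still coming out of reset looks absent; with a non-zero budget the same device is found after a few retries.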

Bjorn

^ permalink raw reply

* Re: [PATCH 0/5] tun/macvtap: TUNSETIFF fixes
From: David Miller @ 2014-12-16 16:20 UTC (permalink / raw)
  To: mst; +Cc: linux-kernel, netdev, dan.carpenter, jasowang
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 16 Dec 2014 15:04:53 +0200

> Dan Carpenter reported the following:
 ...
> And that's true: we have run out of IFF flags in tun.
> 
> So let's not try to add more: add simple GET/SET ioctls
> instead. Easy to test, leads to clear semantics.
> 
> Alternatively we'll have to revert the whole thing for 3.19,
> but that seems more work as this has dependencies
> in other places.
> 
> While here, I noticed that macvtap was actually reading
> ifreq flags as a 32 bit field.
> Fix that up as well.

Looks good, series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Samudrala, Sridhar @ 2014-12-16 15:54 UTC (permalink / raw)
  To: Arad, Ronen, Roopa Prabhu, netdev@vger.kernel.org
  Cc: Jamal Hadi Salim, John Fastabend, Jiri Pirko, sfeldma@gmail.com,
	bcrl@kvack.org, tgraf@suug.ch, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB15CA@ORSMSX101.amr.corp.intel.com>


On 12/16/2014 3:01 AM, Arad, Ronen wrote:
> In my reply (inline) I elaborate on the validity of bridge-less and offloaded-bridge models for L2 switching.
>
> I also discuss the implied necessity of a bridge device for L3 routing and potential issues with the upcoming FIB offloading proposal.
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>> Sent: Tuesday, December 16, 2014 3:21 AM
>> To: Arad, Ronen
>> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>>>> -----Original Message-----
>>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>>>> Sent: Tuesday, December 16, 2014 1:28 AM
>>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>>>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>>>> gospo@cumulusnetworks.com
>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>>>> del bridge port attributes
>>>>
>>>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>>>> -----Original Message-----
>>>>> The behavior of a driver could depend on the presence of a bridge
>>>>> and
>>>> features such as FDB LEARNING and LEARNING_SYNC.
>>>>
>>>> Indeed, those are bridge attributes.
>>>>
>>>>> A switch port driver which is not enslaved to a bridge might need to
>>>>> implement VLAN-aware FDB within the driver and report its content to
>>>>> user-
>>>> space using ndo_fdb_dump.
>>>>    >
>>>>> A switch port driver which is enslaved to a bridge could do with
>>>>> only pass through for static FDB configuration
>>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
>>>> user- space and soft aging are left to the bridge module FDB.
>>>>> Such a driver, without LEARNING_SYNC, could still avoid maintaining
>>>>> in-driver
>>>> FDB as long as it could dump the HW FDB on demand.
>>>>> LEARNING_SYNC also requires periodic updates of freshness
>>>>> information
>>>> from the driver to the bridge module.
>>>>
>>>> If you have an fdb - shouldnt that be exposed only if you have a
>>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
>>> I'm trying to find out what are the opinions of other people in the netdev
>> list.
>>> John has clearly stated that he'd like to see full L2 switching functionality
>> (at least) supported without making a bridge device mandatory.
>>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that
>> with proper setting of SELF/MASTER flags by iproute2.
>>> I see the value in supporting both approaches (bridge device mandatory
>>> and bridge device optional). If the choice is left to user-driven policy decision,
>>> we need to document both use models and map traditional L2 features to
>>> each model.
>>> The L2 offloading (or NETFUNC as it is currently called), which is being
>>> discussed on a different patch-set, is only needed when a bridge device is
>>> used.
>>> Without a bridge device, all configuration has to be targeted at the switch
>>> port driver directly using the SELF flag. FDB remains relevant and it is used to
>>> configure static MAC table entries and dump the HW MAC table.
>> Your understanding is right here. So far all patches have kept both models in
>> mind.
>
>>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even
>>> higher), there is a gap between what the HW is doing and what is explicitly
>>> modeled in Linux.
>
>> Can you elaborate more here ?. We use the linux model to accelerate a
>> multi-layer (l2-l3) switch today. There maybe a few gaps, but these gaps can
>> be closed by having equivalent functionality in the software path.
> What I meant is that without a bridge device the HW switch is seen as a collection of independent switch ports. Typical switch ASIC performs L2 switching by default. This is not expressed explicitly in Linux without a bridge device.
> The SELF flag is used to target typical bridge port and bridge configuration at a switch port device.
> Without an explicit bridge device, bridge attributes have to be directed at an arbitrary port (any port could represent the entire switch) and interpreted by the switch port driver as intended for the entire switch (this includes attributes like STP etc.)
> Each switch port device driver has to implement similar functionality (i.e. all bridge and fdb related ndos) independently without common functionality shared (e.g. FDB, soft aging).
> It is a valid use model and could avoid the complexity of having to deal with the presence of both SW and HW bridge and to deal with explicit offloading of data-path.
>
> I was trying to find out whether the intention was to continue to support both bridge-less and offloaded-bridge models and leave it to the end-user to choose the desirable model at configuration time.
> This would require dual support in the switch port driver in order to have best user experience across multiple switch ASICs or other kinds of devices.

Also, is one of the use cases for an explicit bridge device to support
software switching, by causing the data packets to be processed at the
software bridge using appropriate port attribute settings?
Or is it only to make it convenient to maintain the fdb and represent
the hardware path?

>>> Without a bridge device, the HW is represented by a set of switch port
>>> devices and the bridging (both control and data planes) takes place only in
>>> the HW and switch port driver.
>>> Each switch port driver has to implement its own FDB as there is no
>>> common shared code among drivers for different HW devices.
>>> Using a bridge device could partially alleviate that, but it comes with a cost.
>>> There is a need to properly implement offloading of both configuration and
>>> data-path. The transmit and receive path in the bridge module should be
>>> somehow bypassed to avoid unnecessary overhead or duplicate packets
>>> coming from both software bridging and HW bridging.
>>>
>>>> What i was refering to was a scenario where i have no interest in the
>>>> fdb despite such a hardware capabilities. VLANs is a different issue;
>>>>
>>> VLAN is fundamental feature of L2 and L3 switching and Linux is unclear
>> about it. Bridge device could model bridging of untagged packets which
>> requires a bridge device for each VLAN and a vlan device on each port that is
>> a member of the bridge's VLAN.
>>> This is different from the behavior and configuration of classic closed-source
>> switches.
>>> An alternative model is VLAN filtering where a bridge is VLAN-aware and
>>> switches tagged traffic. A bridge device represents multiple L2 domains with
>>> VLAN filtering policy that defines the switching rules within each domain.
>> And the linux bridge driver supports both models today.
>>
>>> Forwarding (e.g. L3 routing) is expected across such L2 domains using L3
>> entities.
>>> The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN
>>> filtering model is yet unclear to me.
>> In the vlan filtering bridge model, You can create a vlan device on the bridge
>> for l3 ...
>>
> That's what I'm thinking too (I experimented with such setup using veth interfaces, bridge device, and vlan interfaces). This, however, seems to require an explicit bridge for L3 support.
>
> Looking at the latest code of FIB offloading (not yet submitted to netdev), I noticed that a switch port device is expected as a lower descendent of the FIB destination device.
> This assumption is valid in the per-vlan bridge model where IP address is assigned to the bridge itself.
> This, however, is not consistent with the single multi-VLAN bridge model.
> VLAN interfaces on a bridge look like siblings of the switch port devices on the same bridge. They are not ancestors of the switch ports.
> The L3 domain ends at the bridge sub-interfaces. The only L3 entities are the vlan sub-interfaces on the bridge.
> Those are route next hops and the only possible fib_dev.
> L3 routing is not aware of the switch ports. Route is performed to next hop addresses on one of the vlan interfaces subnets. The actual resolution to a switch port device has to be performed by the neighbor subsystem (ARP/ND).
> It is unclear to me how the FIB offloading will be redirected to an ndo of a switch port device.

For L3, I would think we need to support offloading both the assignment
of the gateway IP and the actual route. For example, to create a route to
subnet 5.5.5.0/24 with the GW as 1.1.1.254/24, a user may do
     ip addr add 1.1.1.254/24 dev swX
     ip route add 5.5.5.0/24 via 1.1.1.254 dev swX

Here swX has to be a device corresponding to the switch (or cpu port),
not a switch port, and the ARP requests for this GW IP need to be passed
to the Linux stack so that it can send ARP replies and also add an ARP
entry in hardware.


>
>>>>>>> Will the decision about using a bridge device or avoiding it be
>>>>>>> left to the end-user?
>>>>>> Its a user policy decision. Again the offload bit gets us this in a
>>>>>> reasonably configurable way IMO.
>>>>>>
>>>>>>> (This requires switch port drivers to be able to work and provide
>>>>>>> similar functionality in both setups).
>>>>>> Right, but if the drivers "care" who is calling their ndo ops
>>>>>> something is seriously broken. For the driver it should not need to
>>>>>> know anything about the callers so it doesn't matter to the driver
>>>>>> if it's a netlink call from user space or an internal call from
>>>>>> bridge.ko
>>>>> LEARNING_SYNC only makes sense when a switch port driver is enslaved
>>>>> to
>>>> a bridge.
>>>>    > Rocker switch driver indeed monitors upper change notifications
>>>> and keep track of master bridge presence.
>>>>> So bridge presence is not transparent.
>>>>>
>>>> Agreed - the challenge so far is that people have been fascinated by
>> "switch"
>>>> point of view. I think we are learning and the class device will
>>>> eventually become obvious as useful.
>>>>
>>>> cheers,
>>>> jamal

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Rajat Jain @ 2014-12-16 16:04 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Nils Holland, David Miller, netdev, linux-pci@vger.kernel.org
In-Reply-To: <548EF90A.5070607@gmail.com>

Hello All,

Apologies for jumping in late, but for some reason I do not see the
original mail in my inbox. However I am taking a look at the mails as
sent on linux-pci (and I will keep an eye out for the bug report that
Bjorn asked for).

>
> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>
> $ grep 'pci 0000:02' tg3.bad
> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
> [    1.561555] pci 0000:02:00.0: 1st 1 1
> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
> [   20.412021] pci 0000:02:00.0: 1st 1 1
> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
> [   20.413596] pci 0000:02:00.0: 1st 1 1
> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>
> And without it:
>
> $ grep 'pci 0000:02' tg3.good
> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
> [    1.576778] pci 0000:02:00.0: 1st 1 1
> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>

It seems that the first 2 attempts made to probe the device are
all OK and return the regular device ID and vendor ID for TG3
(CRS does not have a role to play). However, later attempts return a
CRS.

1) May I ask if you are using acpihp or pciehp? I assume pciehp?

2) Can you please also send dmesg output while passing
pciehp.pciehp_debug=1? In the fail case, do you see a message
indicating the pciehp gave up since it got CRS for a long time
(something like "pci 0000:02:00.0 id reading try 50 times with
interval 20 ms to get ffff0001")?

3) Currently the pciehp passes "0" for the argument "crs_timeout" to
pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say
30 seconds (30 * 1000). (For comparison data, acpihp uses the value
60*1000 i.e. 60 seconds today) and run the fail case once again?

Thanks a lot in advance for the debugging help ;-)

Rajat

^ permalink raw reply
