Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-16 19:25 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: John Fastabend, netdev@vger.kernel.org, Jamal Hadi Salim,
	Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB473D@ORSMSX101.amr.corp.intel.com>

On 12/16/14, 9:29 AM, Arad, Ronen wrote:
>
>> -----Original Message-----
>> From: John Fastabend [mailto:john.fastabend@gmail.com]
>> Sent: Tuesday, December 16, 2014 6:42 PM
>> To: Arad, Ronen
>> Cc: Roopa Prabhu; netdev@vger.kernel.org; Jamal Hadi Salim; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/16/2014 03:01 AM, Arad, Ronen wrote:
>>> In my reply (inline) I elaborate on the validity of bridge-less and offloaded-
>> bridge models for L2 switching.
>>> I also discuss the implied necessity of a bridge device for L3 routing and
>> potential issues with the upcoming FIB offloading proposal.
>>>> -----Original Message-----
>>>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>>>> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>>>> Sent: Tuesday, December 16, 2014 3:21 AM
>>>> To: Arad, Ronen
>>>> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri
>>>> Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>>>> stephen@networkplumber.org; linville@tuxdriver.com;
>>>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>>>> gospo@cumulusnetworks.com
>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>>>> del bridge port attributes
>>>>
>>>> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>>>>>> -----Original Message-----
>>>>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>>>>>> Sent: Tuesday, December 16, 2014 1:28 AM
>>>>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>>>>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>>>>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>>>>>> vyasevic@redhat.com; davem@davemloft.net;
>>>> shm@cumulusnetworks.com;
>>>>>> gospo@cumulusnetworks.com
>>>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set
>>>>>> and del bridge port attributes
>>>>>>
>>>>>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>>>>>> -----Original Message-----
>>>>>>> The behavior of a driver could depend on the presence of a bridge
>>>>>>> and
>>>>>> features such as FDB LEARNING and LEARNING_SYNC.
>>>>>>
>>>>>> Indeed, those are bridge attributes.
>>>>>>
>>>>>>> A switch port driver which is not enslaved to a bridge might need
>>>>>>> to implement VLAN-aware FDB within the driver and report its
>>>>>>> content to
>>>>>>> user-
>>>>>> space using ndo_fdb_dump.
>>>>>>     >
>>>>>>> A switch port driver which is enslaved to a bridge could do with
>>>>>>> only pass through for static FDB configuration
>>>>>>     > to the HW when LEARNING_SYNC is configured. FDB reporting to
>>>>>> user- space and soft aging are left to the bridge module FDB.
>>>>>>> Such driver, without LEARNING_SYNC could still avoid maintaing
>>>>>>> in-driver
>>>>>> FDB as long as it could dump the HW FDB on demand.
>>>>>>> LEARNING_SYNC also requires periodic updates of freshness
>>>>>>> information
>>>>>> from the driver to the bridge module.
>>>>>>
>>>>>> If you have an fdb - shouldnt that be exposed only if you have a
>>>>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
>>>>> I'm trying to find out what are the opinions of other people in the
>>>>> netdev
>>>> list.
>>>>> John have clearly stated that he'd like to see full L2 switching
>>>>> functionality
>>>> (at least) supported without making a bridge device mandatory.
>>>>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already
>>>>> support that
>>>> with proper setting of SELF/MASTER flags by iproute2.
>>>>> I see the value in supporting both approaches (bridge device
>>>>> mandatory and bridge device optional). If the choice is left to
>>>>> user-driven policy decision, we need to document both use models and
>>>>> map traditional L2 features to each model.
>>>>> The L2 offloading (or NETFUNC as it is currently called), which is
>>>>> being discussed on a different patch-set, is only needed when a
>>>>> bridge device is used.
>>>>> Without a bridge device, all configuration has to be targeted at the
>>>>> switch port driver directly using the SELF flag. FDB remains
>>>>> relevant and it is used to configure static MAC table entries and dump
>> the HW MAC table.
>>>> Your understanding is right here. So far all patches have kept both
>>>> models in mind.
>>>
>>>>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or
>>>>> even higher), there is a gap between what the HW is doing and what
>>>>> is explicitly modeled in Linux.
>>>
>>>> Can you elaborate more here ?. We use the linux model to accelerate a
>>>> multi-layer (l2-l3) switch today. There maybe a few gaps, but these
>>>> gaps can be closed by having equivalent functionality in the software path.
>>> What I meant is that without a bridge device the HW switch is seen as a
>> collection of independent switch ports. Typical switch ASIC performs L2
>> switching by default. This is not expressed explicitly in Linux without a bridge
>> device.
>>> The SELF flag is used to target typical bridge port and bridge configuration
>> at a switch port device.
>>> Without an explicit bridge device, bridge attributes have to be
>>> directed at an arbitrary port (any port could represent the entire switch)
>> and interpreted by the switch port driver as intended for the entire switch
>> (this includes attributes like STP etc.) Each switch port device driver has to
>> implement similar functionality (i.e. all bridge and fdb related ndos)
>> independently without common functionality shared (e.g. FDB, soft aging).
>>> It is a valid use model and could avoid the complexity of having to deal with
>> the presence of both SW and HW bridge and to deal with explicit offloading
>> of data-path.
>>> I was trying to find out whether the intention was to continue and support
>> both bridge-less an offloaded-bridge models and leave it to the end-user to
>> choose the desirable model at configuration time.
>>> This would require dual support in the switch port driver in order to have
>> best user experience across multiple switch ASICs or other kinds of devices.
>> I'm still missing why there is duplicate implementations in the driver.
>> If the driver implements the set of ndo ops why should it care who calls
>> them? I think you tried to explain this already but I'm not seeing it.
>>
> Let's consider a bridge property. I'll use the default PVID attribute as an example. This is currently configurable by sysfs only and a netlink support for that is still due. Let's assume for our discussion that a DEAFAULT_PVID attribute will be added as a bridge attribute within AFSPEC nested attribute of AF_BRIDGE SETLINK message.
> When a bridge device is present, this attribute is processed by the bridge module and saved as default_pvid field in net_bridge structure. When a switch port is enslaved to a bridge, the bridge driver creates a net_bridge_port instance and assigns it a pvid inherited from the default_pvid attribute of the bridge. Setting the pvid for a new enslaved switch port is not done via netlink. It only applies to the net_bridge_port structure which is internal to the bridge module. Offloading this to HW is not addressed with current bridge offloading.
Correct. And that's where I am going with my NETFUNC_OFFLOAD series. 
Hooks in the bridge driver to communicate all changes to the switch driver.
The default pvid will/should be also communicated to the switch driver. 
And the switch driver does not need to learn the default pvid for the 
bridge.
The callbacks (ndo's) into the switch driver will indicate the pvid for 
the switch port (just like all the userspace notifications for the 
bridge driver do today).

>
> When a bridge device is not used, the DEFAULT_PVID will be targeted using the SELF flag to any of the switch ports. The driver will recognize that as a bridge port and will need to maintain some switch global structure similar to net_bridge where it could save the default_pvid. The driver, knowing that the switch port is not enslaved to a bridge, will have to replicate the same functionality. In the HW case, it will have to configure default VLAN on all the switch ports.
> This is different from the yet to be defined way of propagating default PVID from a bridge device to offloaded bridge ports.
>
> Another example is STP. STP attributes are bridge attributes which are not offloaded when a bridge device is present. The bridge module handles STP protocol internally. Without bridge device, STP attributes have to be targeted at a switch port device and the driver should save them in driver-specific structures and have proprietary implementation of STP (as the one in the bridge module is not used).


yes, for all the cases where you are trying to bypass the bridge driver, 
the switch driver will need implement similar functionality. So, for 
switch asics offloads there is a disadvantage to not use the bridge driver.
>
>   
>> [...]
>>
>> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa might have
>> worked some of it out.
>>
>> --
>> John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: B Viswanath @ 2014-12-16 19:23 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: John Fastabend, netdev@vger.kernel.org, Roopa Prabhu,
	Jamal Hadi Salim, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB473D@ORSMSX101.amr.corp.intel.com>

Hi,

This is my first email on this thread, and on this list. My apologies
if I have not understood something correctly. I would like to
participate  in this discussion, which is one of the reasons I joined
this list recently.  Some feedback inline below.

On 16 December 2014 at 22:59, Arad, Ronen <ronen.arad@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: John Fastabend [mailto:john.fastabend@gmail.com]
>> Sent: Tuesday, December 16, 2014 6:42 PM
>> To: Arad, Ronen
>> Cc: Roopa Prabhu; netdev@vger.kernel.org; Jamal Hadi Salim; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/16/2014 03:01 AM, Arad, Ronen wrote:
>> >
>> > In my reply (inline) I elaborate on the validity of bridge-less and offloaded-
>> bridge models for L2 switching.
>> >
>> > I also discuss the implied necessity of a bridge device for L3 routing and
>> potential issues with the upcoming FIB offloading proposal.
>> >
>> >> -----Original Message-----
>> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> >> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>> >> Sent: Tuesday, December 16, 2014 3:21 AM
>> >> To: Arad, Ronen
>> >> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri
>> >> Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> >> stephen@networkplumber.org; linville@tuxdriver.com;
>> >> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>> >> gospo@cumulusnetworks.com
>> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>> >> del bridge port attributes
>> >>
>> >> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>> >>>> Sent: Tuesday, December 16, 2014 1:28 AM
>> >>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>> >>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>> >>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>> >>>> vyasevic@redhat.com; davem@davemloft.net;
>> >> shm@cumulusnetworks.com;
>> >>>> gospo@cumulusnetworks.com
>> >>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set
>> >>>> and del bridge port attributes
>> >>>>
>> >>>> On 12/15/14 13:36, Arad, Ronen wrote:
>> >>>>>
>> >>>>>> -----Original Message-----
>> >>>>> The behavior of a driver could depend on the presence of a bridge
>> >>>>> and
>> >>>> features such as FDB LEARNING and LEARNING_SYNC.
>> >>>>
>> >>>> Indeed, those are bridge attributes.
>> >>>>
>> >>>>> A switch port driver which is not enslaved to a bridge might need
>> >>>>> to implement VLAN-aware FDB within the driver and report its
>> >>>>> content to
>> >>>>> user-
>> >>>> space using ndo_fdb_dump.
>> >>>>    >
>> >>>>> A switch port driver which is enslaved to a bridge could do with
>> >>>>> only pass through for static FDB configuration
>> >>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
>> >>>> user- space and soft aging are left to the bridge module FDB.
>> >>>>> Such driver, without LEARNING_SYNC could still avoid maintaing
>> >>>>> in-driver
>> >>>> FDB as long as it could dump the HW FDB on demand.
>> >>>>> LEARNING_SYNC also requires periodic updates of freshness
>> >>>>> information
>> >>>> from the driver to the bridge module.
>> >>>>
>> >>>> If you have an fdb - shouldnt that be exposed only if you have a
>> >>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
>> >>> I'm trying to find out what are the opinions of other people in the
>> >>> netdev
>> >> list.
>> >>> John have clearly stated that he'd like to see full L2 switching
>> >>> functionality
>> >> (at least) supported without making a bridge device mandatory.
>> >>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already
>> >>> support that
>> >> with proper setting of SELF/MASTER flags by iproute2.
>> >>> I see the value in supporting both approaches (bridge device
>> >>> mandatory and bridge device optional). If the choice is left to
>> >>> user-driven policy decision, we need to document both use models and
>> >>> map traditional L2 features to each model.
>> >>> The L2 offloading (or NETFUNC as it is currently called), which is
>> >>> being discussed on a different patch-set, is only needed when a
>> >>> bridge device is used.
>> >>> Without a bridge device, all configuration has to be targeted at the
>> >>> switch port driver directly using the SELF flag. FDB remains
>> >>> relevant and it is used to configure static MAC table entries and dump
>> the HW MAC table.
>> >
>> >> Your understanding is right here. So far all patches have kept both
>> >> models in mind.
>> >
>> >
>> >>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or
>> >>> even higher), there is a gap between what the HW is doing and what
>> >>> is explicitly modeled in Linux.
>> >
>> >
>> >> Can you elaborate more here ?. We use the linux model to accelerate a
>> >> multi-layer (l2-l3) switch today. There maybe a few gaps, but these
>> >> gaps can be closed by having equivalent functionality in the software path.
>> >
>> > What I meant is that without a bridge device the HW switch is seen as a
>> collection of independent switch ports. Typical switch ASIC performs L2
>> switching by default. This is not expressed explicitly in Linux without a bridge
>> device.
>> > The SELF flag is used to target typical bridge port and bridge configuration
>> at a switch port device.
>> > Without an explicit bridge device, bridge attributes have to be
>> > directed at an arbitrary port (any port could represent the entire switch)
>> and interpreted by the switch port driver as intended for the entire switch
>> (this includes attributes like STP etc.) Each switch port device driver has to
>> implement similar functionality (i.e. all bridge and fdb related ndos)
>> independently without common functionality shared (e.g. FDB, soft aging).
>> > It is a valid use model and could avoid the complexity of having to deal with
>> the presence of both SW and HW bridge and to deal with explicit offloading
>> of data-path.
>> >
>> > I was trying to find out whether the intention was to continue and support
>> both bridge-less an offloaded-bridge models and leave it to the end-user to
>> choose the desirable model at configuration time.
>> > This would require dual support in the switch port driver in order to have
>> best user experience across multiple switch ASICs or other kinds of devices.
>> >
>>
>> I'm still missing why there is duplicate implementations in the driver.
>> If the driver implements the set of ndo ops why should it care who calls
>> them? I think you tried to explain this already but I'm not seeing it.
>>
>
> Let's consider a bridge property. I'll use the default PVID attribute as an example. This is currently configurable by sysfs only and a netlink support for that is still due. Let's assume for our discussion that a DEAFAULT_PVID attribute will be added as a bridge attribute within AFSPEC nested attribute of AF_BRIDGE SETLINK message.
> When a bridge device is present, this attribute is processed by the bridge module and saved as default_pvid field in net_bridge structure. When a switch port is enslaved to a bridge, the bridge driver creates a net_bridge_port instance and assigns it a pvid inherited from the default_pvid attribute of the bridge. Setting the pvid for a new enslaved switch port is not done via netlink. It only applies to the net_bridge_port structure which is internal to the bridge module. Offloading this to HW is not addressed with current bridge offloading.
>
> When a bridge device is not used, the DEFAULT_PVID will be targeted using the SELF flag to any of the switch ports. The driver will recognize that as a bridge port and will need to maintain some switch global structure similar to net_bridge where it could save the default_pvid. The driver, knowing that the switch port is not enslaved to a bridge, will have to replicate the same functionality. In the HW case, it will have to configure default VLAN on all the switch ports.
> This is different from the yet to be defined way of propagating default PVID from a bridge device to offloaded bridge ports.
>
> Another example is STP. STP attributes are bridge attributes which are not offloaded when a bridge device is present. The bridge module handles STP protocol internally. Without bridge device, STP attributes have to be targeted at a switch port device and the driver should save them in driver-specific structures and have proprietary implementation of STP (as the one in the bridge module is not used).

In general I feel that the switch-device and port relation should be
that of the 'container-containee'. This is the actual physical
relationship.  Apart from some operations such as vlans and protocol
related, it is tricky to model all operations directly on ports. My
thinking is it is cleaner to have all operations be on switch-device,
which in turn peculates the operations downward, to its contained
ports as applicable.  The offloading is really a property of the
switch device and not individual ports. Similarly the FDB is
maintained by the switch and not the ports. As we extend the current
offloading mechanism to other L2, L3 and other features, we may find
it easier to have a 'switch-device' in place.

I am somewhat confused with the notion of bridges though. Many
existing linux-based routers use bridges differently than as a
vlan-broadcast-domain. For example it is common to have eth0.334 and
eth1 in the same bridge. What is being done internally is that the
additional vlan tag 334 (which indicates video traffic, say) is
removed and that video traffic is being bridged to eth1. There is no
default vlan for this bridge.  This is a software bridge. I am not
sure how this can be accomplished if there is a need to associate a
vlan with a bridge.

Thanks
Viswanath


>
>
>> [...]
>>
>> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa might have
>> worked some of it out.
>>
>> --
>> John Fastabend         Intel Corporation
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-16 19:20 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: netdev@vger.kernel.org, Jamal Hadi Salim, John Fastabend,
	Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB15CA@ORSMSX101.amr.corp.intel.com>

On 12/16/14, 3:01 AM, Arad, Ronen wrote:
> In my reply (inline) I elaborate on the validity of bridge-less and offloaded-bridge models for L2 switching.
>
> I also discuss the implied necessity of a bridge device for L3 routing and potential issues with the upcoming FIB offloading proposal.
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>> Sent: Tuesday, December 16, 2014 3:21 AM
>> To: Arad, Ronen
>> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri Pirko;
>> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>>>> -----Original Message-----
>>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>>>> Sent: Tuesday, December 16, 2014 1:28 AM
>>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>>>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com;
>>>> gospo@cumulusnetworks.com
>>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
>>>> del bridge port attributes
>>>>
>>>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>>>> -----Original Message-----
>>>>> The behavior of a driver could depend on the presence of a bridge
>>>>> and
>>>> features such as FDB LEARNING and LEARNING_SYNC.
>>>>
>>>> Indeed, those are bridge attributes.
>>>>
>>>>> A switch port driver which is not enslaved to a bridge might need to
>>>>> implement VLAN-aware FDB within the driver and report its content to
>>>>> user-
>>>> space using ndo_fdb_dump.
>>>>    >
>>>>> A switch port driver which is enslaved to a bridge could do with
>>>>> only pass through for static FDB configuration
>>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
>>>> user- space and soft aging are left to the bridge module FDB.
>>>>> Such driver, without LEARNING_SYNC could still avoid maintaing
>>>>> in-driver
>>>> FDB as long as it could dump the HW FDB on demand.
>>>>> LEARNING_SYNC also requires periodic updates of freshness
>>>>> information
>>>> from the driver to the bridge module.
>>>>
>>>> If you have an fdb - shouldnt that be exposed only if you have a
>>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
>>> I'm trying to find out what are the opinions of other people in the netdev
>> list.
>>> John have clearly stated that he'd like to see full L2 switching functionality
>> (at least) supported without making a bridge device mandatory.
>>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that
>> with proper setting of SELF/MASTER flags by iproute2.
>>> I see the value in supporting both approaches (bridge device mandatory
>>> and bridge device optional). If the choice is left to user-driven policy decision,
>>> we need to document both use models and map traditional L2 features to
>>> each model.
>>> The L2 offloading (or NETFUNC as it is currently called), which is being
>>> discussed on a different patch-set, is only needed when a bridge device is
>>> used.
>>> Without a bridge device, all configuration has to be targeted at the switch
>>> port driver directly using the SELF flag. FDB remains relevant and it is used to
>>> configure static MAC table entries and dump the HW MAC table.
>> Your understanding is right here. So far all patches have kept both models in
>> mind.
>
>>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even
>>> higher), there is a gap between what the HW is doing and what is explicitly
>>> modeled in Linux.
>
>> Can you elaborate more here ?. We use the linux model to accelerate a
>> multi-layer (l2-l3) switch today. There maybe a few gaps, but these gaps can
>> be closed by having equivalent functionality in the software path.
> What I meant is that without a bridge device the HW switch is seen as a collection of independent switch ports. Typical switch ASIC performs L2 switching by default. This is not expressed explicitly in Linux without a bridge device.
> The SELF flag is used to target typical bridge port and bridge configuration at a switch port device.
> Without an explicit bridge device, bridge attributes have to be directed at an arbitrary port (any port could represent the entire switch) and interpreted by the switch port driver as intended for the entire switch (this includes attributes like STP etc.)
> Each switch port device driver has to implement similar functionality (i.e. all bridge and fdb related ndos) independently without common functionality shared (e.g. FDB, soft aging).
> It is a valid use model and could avoid the complexity of having to deal with the presence of both SW and HW bridge and to deal with explicit offloading of data-path.
>
> I was trying to find out whether the intention was to continue and support both bridge-less an offloaded-bridge models and leave it to the end-user to choose the desirable model at configuration time.
I have always been speaking for the offloaded-bridge model because we 
@cumulus work on such a model today.
But, The infrastructure had to be generic to support nics and switch 
asics. And, from the current state of things both bridge-less and 
offloaded-bridge models are supported. Rocker for example allows user to 
operate directly (ndo_bridge_set/getlink) on the switch hw. But it does 
require a bridge.
> This would require dual support in the switch port driver in order to have best user experience across multiple switch ASICs or other kinds of devices.
agreed. However, if the switch driver wants to support just one model,
  it is upto the switch driver to advertise the NETFUNC_OFFLOAD flag and 
maybe not honor requests with the self flag set.
Our switch driver will likely do that because we plan to support only 
the offloaded-bridge model.

>>> Without a bridge device, the HW is represented by a set of switch port
>>> devices and the bridging (both control and data planes) takes place only in
>>> the HW and switch port driver.
>>> Each switch port driver has to implement its own FDB as there is no
>>> common shared code among drivers for different HW devices.
>>> Using a bridge device could partially alleviate that, but it comes with a cost.
>>> There is a need to properly implement offloading of both configuration and
>>> data-path. The transmit and receive path in the bridge module should be
>>> somehow bypassed to avoid unnecessary overhead or duplicate packets
>>> coming from both software bridging and HW bridging.
>>>
>>>> What i was refering to was a scenario where i have no interest in the
>>>> fdb despite such a hardware capabilities. VLANs is a different issue;
>>>>
>>> VLAN is fundamental feature of L2 and L3 switching and Linux is unclear
>> about it. Bridge device could model bridging of untagged packets which
>> requires a bridge device for each VLAN and a vlan device on each port that is
>> a member of the bridge's VLAN.
>>> This different from the behavior and configuration of classic closed-source
>> switches.
>>> An alternative model is VLAN filtering where a bridge is VLAN-aware and
>>> switches tagged traffic. A bridge device represents multiple L2 domains with
>>> VLAN filtering policy that defines the switching rules within each domain.
>> And the linux bridge driver supports both models today.
>>
>>> Forwarding (e.g. L3 routing) is expected across such L2 domains using L3
>> entities.
>>> The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN
>>> filtering model is yet unclear to me.
>> In the vlan filtering bridge model, You can create a vlan device on the bridge
>> for l3 ...
>>
> That's what I'm thinking too (I experimented with such setup using veth interfaces, bridge device, and vlan interfaces). This, however, seems to require an explicit bridge for L3 support.
>
> Looking at the latest code of FIB offloading (not yet submitted to netdev), I noticed that a switch port device is expected as a lower descendent of the FIB destination device.
> This assumption is valid in the per-vlan bridge model where IP address is assigned to the bridge itself.
> This, however, is not consistent with the single multi-VLAN bridge model.
> Vlan interfaces on a bridge looks like siblings of the switch ports devices on the same bridge. They are not ancestors of the switch ports.
> The L3 domain ends at the bridge sub-interfaces. The only L3 entities are the vlan sub-interfaces on the bridge.
> Those are route next hops and the only possible fib_dev.
> L3 routing is not aware of the switch ports. Route is performed to next hop addresses on one of the vlan interfaces subnets. The actual resolution to a switch port device has to be performed by the neighbor subsystem (ARP/ND).
> It is unclear to me how the FIB offloading will be redirected to an ndo of a switch port device.
I have not looked at the rocker FIB offloading yet. But we have 
discussed this a bit at LPC in Dusseldorf.  The route can be pushed to 
hw after the resolution by the neighbour subsystem..or ..with the 
introduction of a switch device (to represent the asic), the switch 
driver can take care of this.

>
>>>>>>> Will the decision about using a bridge device or avoiding it be
>>>>>>> left to the end-user?
>>>>>> Its a user policy decision. Again the offload bit gets us this in a
>>>>>> reasonably configurable way IMO.
>>>>>>
>>>>>>> (This requires switch port drivers to be able to work and provide
>>>>>>> similar functionality in both setups).
>>>>>> Right, but if the drivers "care" who is calling their ndo ops
>>>>>> something is seriously broken. For the driver it should not need to
>>>>>> know anything about the callers so it doesn't matter to the driver
>>>>>> if its a netlink call from user space or an internal call fro
>>>>>> bridge.ko
>>>>> LEARNING_SYNC only makes sense when a switch port driver is enslaved
>>>>> to
>>>> a bridge.
>>>>    > Rocker switch driver indeed monitors upper change notifications
>>>> and keep track of master bridge presence.
>>>>> So bridge presence is not transparent.
>>>>>
>>>> Agreed - the challenge so far is that people have been fascinated by
>> "switch"
>>>> point of view. I think we are learning and the class device will
>>>> eventually become obvious as useful.
>>>>
>>>> cheers,
>>>> jamal
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org More majordomo
>> info
>>> at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
>> of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 0/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 19:13 UTC (permalink / raw)
  To: netdev
In-Reply-To: <549070C7.5070505@psc.edu>

Correction:

> A git repo is available at http://github.com/rapier/web10g

http://github.com/rapier1/web10g

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: David Miller @ 2014-12-16 19:11 UTC (permalink / raw)
  To: rapier; +Cc: alexei.starovoitov, netdev
In-Reply-To: <549080CC.3050704@psc.edu>

From: rapier <rapier@psc.edu>
Date: Tue, 16 Dec 2014 13:58:20 -0500

> I'm going to review those exports right now.

Please do not top-post.

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: David Miller @ 2014-12-16 19:09 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: rapier, netdev
In-Reply-To: <CAADnVQ+rmBMoDRRu-Dtp7QBMJEbto9nb2zvFEXBCQsqpg6tS9A@mail.gmail.com>

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 16 Dec 2014 10:24:31 -0800

> imo that is very questionable design choice.
> export a lot of in-kernel bits to be used by out-of-tree kernel module?

It's a non-starter.

^ permalink raw reply

* [PATCHv1 net] xen-netfront: use napi_complete() correctly to prevent Rx stalling
From: David Vrabel @ 2014-12-16 18:59 UTC (permalink / raw)
  To: netdev
  Cc: David Vrabel, xen-devel, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Eric Dumazet

After d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less interrupt
masking in NAPI) the napi instance is removed from the per-cpu list
prior to calling the n->poll(), and is only requeued if all of the
budget was used.  This inadvertently broke netfront because netfront
does not use NAPI correctly.

If netfront had not used all of its budget it would do a final check
for any Rx responses and avoid calling napi_complete() if there were
more responses.  It would still return under budget so it would never
be rescheduled.  The final check would also not re-enable the Rx
interrupt.

Additionally, xenvif_poll() would also call napi_complete() /after/
enabling the interrupt.  This resulted in a race between the
napi_complete() and the napi_schedule() in the interrupt handler.  The
use of local_irq_save/restore() avoided by race iff the handler is
running on the same CPU but not if it was running on a different CPU.

Fix both of these by always calling napi_compete() if the budget was
not all used, and then calling napi_schedule() if the final checks
says there's more work.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/xen-netfront.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 2f0a9ce..22bcb4e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -977,7 +977,6 @@ static int xennet_poll(struct napi_struct *napi, int budget)
 	struct sk_buff_head rxq;
 	struct sk_buff_head errq;
 	struct sk_buff_head tmpq;
-	unsigned long flags;
 	int err;
 
 	spin_lock(&queue->rx_lock);
@@ -1050,15 +1049,11 @@ err:
 	if (work_done < budget) {
 		int more_to_do = 0;
 
-		napi_gro_flush(napi, false);
-
-		local_irq_save(flags);
+		napi_complete(napi);
 
 		RING_FINAL_CHECK_FOR_RESPONSES(&queue->rx, more_to_do);
-		if (!more_to_do)
-			__napi_complete(napi);
-
-		local_irq_restore(flags);
+		if (more_to_do)
+			napi_schedule(napi);
 	}
 
 	spin_unlock(&queue->rx_lock);
-- 
1.7.10.4

^ permalink raw reply related

* Re: VLAN bridge patches backported to 2.6.x ?
From: Willy Tarreau @ 2014-12-16 18:59 UTC (permalink / raw)
  To: Ajit Warrier; +Cc: netdev
In-Reply-To: <CAKrgqt3rMUnkQChTE+_HDiTHx++bp8MWcQ_xp7-Y3tbFznTKMg@mail.gmail.com>

On Sun, Dec 07, 2014 at 11:25:18AM -0800, Ajit Warrier wrote:
> The VLAN aware bridge was introduced to the Linux mainline on the 3.x kernels:
> 
> http://lwn.net/Articles/538877/
> 
> Has this feature been backported to the 2.6.x kernels by anybody ? The
> official linux 2.6.32.64 does not have this feature.

No, we only backport fixes to 2.6.32, not new features. Such a large
patchset could definitely break working setups for various reasons ranging
from incompatibilities to errors in backports. So that's definitely not
something supposed to happen for 2.6.32.

Regards,
Willy

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 18:58 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev
In-Reply-To: <CAADnVQ+rmBMoDRRu-Dtp7QBMJEbto9nb2zvFEXBCQsqpg6tS9A@mail.gmail.com>

I'm going to review those exports right now.

On 12/16/14, 1:24 PM, Alexei Starovoitov wrote:
> On Tue, Dec 16, 2014 at 9:50 AM, rapier <rapier@psc.edu> wrote:
>> +struct idr tcp_estats_idr;
>> +EXPORT_SYMBOL(tcp_estats_idr);
>> +static int next_id = 1;
>> +DEFINE_SPINLOCK(tcp_estats_idr_lock);
>> +EXPORT_SYMBOL(tcp_estats_idr_lock);
>> +
>> +int tcp_estats_wq_enabled __read_mostly = 0;
>> +EXPORT_SYMBOL(tcp_estats_wq_enabled);
>> +struct workqueue_struct *tcp_estats_wq = NULL;
>> +EXPORT_SYMBOL(tcp_estats_wq);
>> +void (*create_notify_func)(struct work_struct *work);
>> +EXPORT_SYMBOL(create_notify_func);
>> +void (*establish_notify_func)(struct work_struct *work);
>> +EXPORT_SYMBOL(establish_notify_func);
>> +void (*destroy_notify_func)(struct work_struct *work);
>> +EXPORT_SYMBOL(destroy_notify_func);
>> +unsigned long persist_delay = 0;
>> +EXPORT_SYMBOL(persist_delay);
>> +
>> +struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
>> +EXPORT_SYMBOL(tcp_estats_enabled);
> ...
>> +EXPORT_SYMBOL(tcp_estats_create);
> ...
>> +/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu.
>> */
>> +void tcp_estats_free(struct rcu_head *rcu)
> ...
>> +EXPORT_SYMBOL(tcp_estats_free);
>
> imo that is very questionable design choice.
> export a lot of in-kernel bits to be used by out-of-tree kernel module?
>

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Jeff Kirsher @ 2014-12-16 18:56 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, jiri, gospo, Jamal Hadi Salim,
	John Fastabend
In-Reply-To: <1418573945-27840-1-git-send-email-ogerlitz@mellanox.com>

On Sun, Dec 14, 2014 at 8:19 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> The current implementations all use dev_uc_add_excl() and such whose API
> doesn't support vlans, so we can't make it with NICs HW for now.
>
> Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c |    5 +++++
>  net/core/rtnetlink.c                        |    5 +++++
>  2 files changed, 10 insertions(+), 0 deletions(-)
>

For the Intel driver changes...
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Martin Lau @ 2014-12-16 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: Blake Matheny, Laurent Chavey, Yuchung Cheng,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <D0B44739.74E8A%bmatheny@fb.com>

> >On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
> >
> >> I think patches 1 and 3 are good additions, since they establish
> >> few permanent points of instrumentation in tcp stack.
> >> Patches 4-5 look more like use cases of tracepoints established
> >> before. They may feel like simple additions and, no doubt,
> >> they are useful, but since they expose things via tracing
> >> infra they become part of api and cannot be changed later,
> >> when more stats would be needed.
We can consider to reuse the events's format (tracing/events/*/format). I think
blktrace.c is using similar approach in trace-cmd.

> >> I think systemtap like scripting on top of patches 1 and 3
> >> should solve your use case ?
We have quite a few different versions running in the production.  It may not
be operationally easy.

> >> Also, have you looked at recent eBPF work?
> >> Though it's not completely ready yet, soon it should
> >> be able to do the same stats collection as you have
> >> in 4/5 without adding permanent pieces to the kernel.
We are keeping an eye on the eBPF work.


> On 12/15/14, 8:03 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
> 
> >So it looks like web10g like interfaces are very often requested by
> >various teams.
> >
> >And we have many different views on how to hack this. I am astonished by
> >number of hacks I saw about this stuff going on.
> >
> >What about a clean way, extending current TCP_INFO, which is both
> >available as a getsockopt() for socket owners and ss/iproute2
> >information for 'external entities'
> >
> >If we consider web10g info needed, then adding a ftrace/eBPF like
> >interface is simply yet another piece of code we need to maintain,
> >and the argument of 'this should cost nothing if not activated' is
> >nonsense since major players need to constantly monitor TCP metrics and
> >behavior.
For the data collecting part, it would be nice to do it in the TCP itself.

Having a getsockopt will be useful for the new application/library to take
advantage of.

For the continuous monitoring/logging purpose, ftrace can provide event
triggered tracing instead of periodically consulting ss.

Thanks,
--Martin

^ permalink raw reply

* Re: [PATCH net-next 0/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 18:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <549070C7.5070505@psc.edu>

My apologies about the odd formatting in the previous message. Not sure 
what happened with my MUA.

Chris

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: Alexei Starovoitov @ 2014-12-16 18:24 UTC (permalink / raw)
  To: rapier; +Cc: netdev

On Tue, Dec 16, 2014 at 9:50 AM, rapier <rapier@psc.edu> wrote:
> +struct idr tcp_estats_idr;
> +EXPORT_SYMBOL(tcp_estats_idr);
> +static int next_id = 1;
> +DEFINE_SPINLOCK(tcp_estats_idr_lock);
> +EXPORT_SYMBOL(tcp_estats_idr_lock);
> +
> +int tcp_estats_wq_enabled __read_mostly = 0;
> +EXPORT_SYMBOL(tcp_estats_wq_enabled);
> +struct workqueue_struct *tcp_estats_wq = NULL;
> +EXPORT_SYMBOL(tcp_estats_wq);
> +void (*create_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(create_notify_func);
> +void (*establish_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(establish_notify_func);
> +void (*destroy_notify_func)(struct work_struct *work);
> +EXPORT_SYMBOL(destroy_notify_func);
> +unsigned long persist_delay = 0;
> +EXPORT_SYMBOL(persist_delay);
> +
> +struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
> +EXPORT_SYMBOL(tcp_estats_enabled);
...
> +EXPORT_SYMBOL(tcp_estats_create);
...
> +/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu.
> */
> +void tcp_estats_free(struct rcu_head *rcu)
...
> +EXPORT_SYMBOL(tcp_estats_free);

imo that is very questionable design choice.
export a lot of in-kernel bits to be used by out-of-tree kernel module?

^ permalink raw reply

* How to fix CHECK warning: testing a 'safe expression'
From: Murali Karicheri @ 2014-12-16 18:23 UTC (permalink / raw)
  To: netdev

netdev maintainers,

I got a comment to address CHECK warning and wondering how to address 
'warning: testing a 'safe expression' which appears when using
IS_ERR_OR_NULL(foo)

where foo is defined as

struct foo_type *foo;

The foo get assigned only NULL or ERR_PTR(error code). So I believe the 
usage is correct. But then how do I make the CHECK happy of its usage?

I have tried doing a grep on the current usage of IS_ERR_OR_NULL() and 
found 276 of them causes this warning in the v3.18 version of the kernel 
that I am using

$ grep -r "warning: testing a 'safe expression" * | wc -l
276

1) Can someone explain what this warning means?

2) Is it acceptable to post patches to netdev list with this warning?

3) if not, how this is expected to be fixed? Any example usage to fix 
this warning will be helpful.

Thanks in advance for
-- 
Murali Karicheri
Linux Kernel, Texas Instruments

^ permalink raw reply

* pull request: wireless 2014-12-16
From: John W. Linville @ 2014-12-16 18:16 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 28116 bytes --]

Dave,

Please pull this batch of fixes intended for the 3.19 stream!

For the Bluetooth bits, Johan says:

"The patches consist of:

 - Coccinelle warning fix
 - hci_dev_lock/unlock fixes
 - Fixes for pending mgmt command handling
 - Fixes for properly following the force_lesc_support switch
 - Fix for a Microsoft branded Broadcom adapter
 - New device id for Atheros AR3012
 - Fix for BR/EDR Secure Connections enabling"

Along with that...

Brian Norris avoids leaking some kernel memory contents via printk in brcmsmac.

Julia Lawall corrects some misspellings in a few drivers.

Larry Finger gives us one more rtlwifi fix to correct a porting oversight.

Wei Yongjun fixes a sparse warning in rtlwifi.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit 67e2c3883828b39548cee2091b36656787775d95:

  Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security (2014-12-14 20:36:37 -0800)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless.git tags/master-2014-12-15

for you to fetch changes up to 9a1dce3a059111a7289680f4b8c0ec4f8736b6ee:

  rtlwifi: rtl8192ce: Set fw_ready flag (2014-12-15 13:46:20 -0500)

----------------------------------------------------------------
Brian Norris (1):
      brcmsmac: don't leak kernel memory via printk()

Fengguang Wu (1):
      Bluetooth: fix err_cast.cocci warnings

Jaganath Kanakkassery (2):
      Bluetooth: Fix missing hci_dev_lock/unlock in mgmt req_complete()
      Bluetooth: Fix missing hci_dev_lock/unlock in hci_event

Janne Heikkinen (1):
      Bluetooth: Add USB device 04ca:3010 as Atheros AR3012

Johan Hedberg (5):
      Bluetooth: Fix calling hci_conn_put too early
      Bluetooth: Fix incorrect pending cmd removal in pairing_complete()
      Bluetooth: Fix notifying mgmt power off before flushing connection list
      Bluetooth: Fix enabling BR/EDR SC when powering on
      Bluetooth: Fix mgmt response status when removing adapter

John W. Linville (1):
      Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth-next

Julia Lawall (3):
      zd1211rw: fix misspelling of current function in string
      hostap_cs: fix misspelling of current function in string
      rtlwifi: rtl8821ae: fix misspelling of current function in string

Larry Finger (1):
      rtlwifi: rtl8192ce: Set fw_ready flag

Marcel Holtmann (4):
      Bluetooth: Check for force_lesc_support when enabling SMP over BR/EDR
      Bluetooth: Check for force_lesc_support before rejecting SMP over BR/EDR
      Bluetooth: Fix generation of non-resolvable private addresses
      Bluetooth: Fix check for support for page scan related commands

Wei Yongjun (1):
      rtlwifi: rtl8192cu: Fix sparse non static symbol warning

 drivers/bluetooth/ath3k.c                      |  2 +
 drivers/bluetooth/btusb.c                      |  1 +
 drivers/net/wireless/brcm80211/brcmsmac/main.c |  2 +-
 drivers/net/wireless/hostap/hostap_cs.c        | 15 ++---
 drivers/net/wireless/rtlwifi/rtl8192ce/hw.c    |  2 +
 drivers/net/wireless/rtlwifi/rtl8192cu/hw.c    |  2 +-
 drivers/net/wireless/rtlwifi/rtl8821ae/dm.c    | 11 ++--
 drivers/net/wireless/zd1211rw/zd_chip.c        |  6 +-
 net/bluetooth/hci_conn.c                       |  2 +-
 net/bluetooth/hci_core.c                       | 60 ++++++++++--------
 net/bluetooth/hci_event.c                      | 20 ++++++
 net/bluetooth/l2cap_core.c                     |  5 +-
 net/bluetooth/mgmt.c                           | 85 ++++++++++++++++++--------
 net/bluetooth/smp.c                            |  5 +-
 14 files changed, 143 insertions(+), 75 deletions(-)

diff --git a/drivers/bluetooth/ath3k.c b/drivers/bluetooth/ath3k.c
index fce758896280..1ee27ac18de0 100644
--- a/drivers/bluetooth/ath3k.c
+++ b/drivers/bluetooth/ath3k.c
@@ -87,6 +87,7 @@ static const struct usb_device_id ath3k_table[] = {
 	{ USB_DEVICE(0x04CA, 0x3007) },
 	{ USB_DEVICE(0x04CA, 0x3008) },
 	{ USB_DEVICE(0x04CA, 0x300b) },
+	{ USB_DEVICE(0x04CA, 0x3010) },
 	{ USB_DEVICE(0x0930, 0x0219) },
 	{ USB_DEVICE(0x0930, 0x0220) },
 	{ USB_DEVICE(0x0930, 0x0227) },
@@ -140,6 +141,7 @@ static const struct usb_device_id ath3k_blist_tbl[] = {
 	{ USB_DEVICE(0x04ca, 0x3007), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x3008), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x300b), .driver_info = BTUSB_ATH3012 },
+	{ USB_DEVICE(0x04ca, 0x3010), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0219), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0220), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0227), .driver_info = BTUSB_ATH3012 },
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index 31dd24ac9926..19cf2cf22e87 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -167,6 +167,7 @@ static const struct usb_device_id blacklist_table[] = {
 	{ USB_DEVICE(0x04ca, 0x3007), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x3008), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x04ca, 0x300b), .driver_info = BTUSB_ATH3012 },
+	{ USB_DEVICE(0x04ca, 0x3010), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0219), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0220), .driver_info = BTUSB_ATH3012 },
 	{ USB_DEVICE(0x0930, 0x0227), .driver_info = BTUSB_ATH3012 },
diff --git a/drivers/net/wireless/brcm80211/brcmsmac/main.c b/drivers/net/wireless/brcm80211/brcmsmac/main.c
index a104d7ac3796..eb8584a9c49a 100644
--- a/drivers/net/wireless/brcm80211/brcmsmac/main.c
+++ b/drivers/net/wireless/brcm80211/brcmsmac/main.c
@@ -316,7 +316,7 @@ static const u16 xmtfifo_sz[][NFIFO] = {
 static const char * const fifo_names[] = {
 	"AC_BK", "AC_BE", "AC_VI", "AC_VO", "BCMC", "ATIM" };
 #else
-static const char fifo_names[6][0];
+static const char fifo_names[6][1];
 #endif
 
 #ifdef DEBUG
diff --git a/drivers/net/wireless/hostap/hostap_cs.c b/drivers/net/wireless/hostap/hostap_cs.c
index b6ec51923b20..50033aa7c7d5 100644
--- a/drivers/net/wireless/hostap/hostap_cs.c
+++ b/drivers/net/wireless/hostap/hostap_cs.c
@@ -381,18 +381,15 @@ static void prism2_pccard_genesis_reset(local_info_t *local, int hcr)
 
 	res = pcmcia_read_config_byte(hw_priv->link, CISREG_COR, &old_cor);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 1 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 1 (%d)\n", __func__, res);
 		return;
 	}
-	printk(KERN_DEBUG "prism2_pccard_genesis_sreset: original COR %02x\n",
-		old_cor);
+	printk(KERN_DEBUG "%s: original COR %02x\n", __func__, old_cor);
 
 	res = pcmcia_write_config_byte(hw_priv->link, CISREG_COR,
 				old_cor | COR_SOFT_RESET);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 2 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 2 (%d)\n", __func__, res);
 		return;
 	}
 
@@ -401,8 +398,7 @@ static void prism2_pccard_genesis_reset(local_info_t *local, int hcr)
 	/* Setup Genesis mode */
 	res = pcmcia_write_config_byte(hw_priv->link, CISREG_CCSR, hcr);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 3 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 3 (%d)\n", __func__, res);
 		return;
 	}
 	mdelay(10);
@@ -410,8 +406,7 @@ static void prism2_pccard_genesis_reset(local_info_t *local, int hcr)
 	res = pcmcia_write_config_byte(hw_priv->link, CISREG_COR,
 				old_cor & ~COR_SOFT_RESET);
 	if (res != 0) {
-		printk(KERN_DEBUG "prism2_pccard_genesis_sreset failed 4 "
-		       "(%d)\n", res);
+		printk(KERN_DEBUG "%s failed 4 (%d)\n", __func__, res);
 		return;
 	}
 
diff --git a/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c b/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c
index d2ec5160bbf0..5c646d5f7bb8 100644
--- a/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c
+++ b/drivers/net/wireless/rtlwifi/rtl8192ce/hw.c
@@ -955,6 +955,7 @@ int rtl92ce_hw_init(struct ieee80211_hw *hw)
 	local_save_flags(flags);
 	local_irq_enable();
 
+	rtlhal->fw_ready = false;
 	rtlpriv->intf_ops->disable_aspm(hw);
 	rtstatus = _rtl92ce_init_mac(hw);
 	if (!rtstatus) {
@@ -971,6 +972,7 @@ int rtl92ce_hw_init(struct ieee80211_hw *hw)
 		goto exit;
 	}
 
+	rtlhal->fw_ready = true;
 	rtlhal->last_hmeboxnum = 0;
 	rtl92c_phy_mac_config(hw);
 	/* because last function modify RCR, so we update
diff --git a/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c b/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c
index 873363acbacf..551321728ae0 100644
--- a/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c
+++ b/drivers/net/wireless/rtlwifi/rtl8192cu/hw.c
@@ -1592,7 +1592,7 @@ void rtl92cu_get_hw_reg(struct ieee80211_hw *hw, u8 variable, u8 *val)
 	}
 }
 
-bool usb_cmd_send_packet(struct ieee80211_hw *hw, struct sk_buff *skb)
+static bool usb_cmd_send_packet(struct ieee80211_hw *hw, struct sk_buff *skb)
 {
   /* Currently nothing happens here.
    * Traffic stops after some seconds in WPA2 802.11n mode.
diff --git a/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c b/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c
index 9be106109921..ba30b0d250fd 100644
--- a/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c
+++ b/drivers/net/wireless/rtlwifi/rtl8821ae/dm.c
@@ -2078,8 +2078,7 @@ void rtl8821ae_dm_txpwr_track_set_pwr(struct ieee80211_hw *hw,
 	if (rtldm->tx_rate != 0xFF)
 		tx_rate = rtl8821ae_hw_rate_to_mrate(hw, rtldm->tx_rate);
 
-	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-		 "===>rtl8812ae_dm_txpwr_track_set_pwr\n");
+	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD, "===>%s\n", __func__);
 
 	if (tx_rate != 0xFF) { /* Mimic Modify High Rate BBSwing Limit.*/
 		/*CCK*/
@@ -2128,7 +2127,7 @@ void rtl8821ae_dm_txpwr_track_set_pwr(struct ieee80211_hw *hw,
 
 	if (method == BBSWING) {
 		RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-			 "===>rtl8812ae_dm_txpwr_track_set_pwr\n");
+			 "===>%s\n", __func__);
 		if (rf_path == RF90_PATH_A) {
 			final_swing_idx[RF90_PATH_A] =
 				(rtldm->ofdm_index[RF90_PATH_A] >
@@ -2260,7 +2259,8 @@ void rtl8821ae_dm_txpower_tracking_callback_thermalmeter(
 	rtldm->txpower_trackinginit = true;
 
 	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-		 "===>rtl8812ae_dm_txpower_tracking_callback_thermalmeter,\n pDM_Odm->BbSwingIdxCckBase: %d,pDM_Odm->BbSwingIdxOfdmBase[A]:%d, pDM_Odm->DefaultOfdmIndex: %d\n",
+		 "===>%s,\n pDM_Odm->BbSwingIdxCckBase: %d,pDM_Odm->BbSwingIdxOfdmBase[A]:%d, pDM_Odm->DefaultOfdmIndex: %d\n",
+		 __func__,
 		 rtldm->swing_idx_cck_base,
 		 rtldm->swing_idx_ofdm_base[RF90_PATH_A],
 		 rtldm->default_ofdm_index);
@@ -2539,8 +2539,7 @@ void rtl8821ae_dm_txpower_tracking_callback_thermalmeter(
 		}
 	}
 
-	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD,
-		 "<===rtl8812ae_dm_txpower_tracking_callback_thermalmeter\n");
+	RT_TRACE(rtlpriv, COMP_POWER_TRACKING, DBG_LOUD, "<===%s\n", __func__);
 }
 
 void rtl8821ae_dm_check_txpower_tracking_thermalmeter(struct ieee80211_hw *hw)
diff --git a/drivers/net/wireless/zd1211rw/zd_chip.c b/drivers/net/wireless/zd1211rw/zd_chip.c
index 73a49b868035..07b94eda9604 100644
--- a/drivers/net/wireless/zd1211rw/zd_chip.c
+++ b/drivers/net/wireless/zd1211rw/zd_chip.c
@@ -129,7 +129,7 @@ int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, const zd_addr_t *addr
 	r = zd_ioread16v_locked(chip, v16, a16, count16);
 	if (r) {
 		dev_dbg_f(zd_chip_dev(chip),
-			  "error: zd_ioread16v_locked. Error number %d\n", r);
+			  "error: %s. Error number %d\n", __func__, r);
 		return r;
 	}
 
@@ -256,8 +256,8 @@ int zd_iowrite32a_locked(struct zd_chip *chip,
 		if (r) {
 			zd_usb_iowrite16v_async_end(&chip->usb, 0);
 			dev_dbg_f(zd_chip_dev(chip),
-				"error _zd_iowrite32v_locked."
-				" Error number %d\n", r);
+				"error _%s. Error number %d\n", __func__,
+				r);
 			return r;
 		}
 	}
diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
index 79d84b88b8f0..fe18825cc8a4 100644
--- a/net/bluetooth/hci_conn.c
+++ b/net/bluetooth/hci_conn.c
@@ -661,7 +661,7 @@ static void hci_req_add_le_create_conn(struct hci_request *req,
 	memset(&cp, 0, sizeof(cp));
 
 	/* Update random address, but set require_privacy to false so
-	 * that we never connect with an unresolvable address.
+	 * that we never connect with an non-resolvable address.
 	 */
 	if (hci_update_random_address(req, false, &own_addr_type))
 		return;
diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 93f92a085506..5dcacf9607e4 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -1373,8 +1373,6 @@ static void hci_init1_req(struct hci_request *req, unsigned long opt)
 
 static void bredr_setup(struct hci_request *req)
 {
-	struct hci_dev *hdev = req->hdev;
-
 	__le16 param;
 	__u8 flt_type;
 
@@ -1403,14 +1401,6 @@ static void bredr_setup(struct hci_request *req)
 	/* Connection accept timeout ~20 secs */
 	param = cpu_to_le16(0x7d00);
 	hci_req_add(req, HCI_OP_WRITE_CA_TIMEOUT, 2, &param);
-
-	/* AVM Berlin (31), aka "BlueFRITZ!", reports version 1.2,
-	 * but it does not support page scan related HCI commands.
-	 */
-	if (hdev->manufacturer != 31 && hdev->hci_ver > BLUETOOTH_VER_1_1) {
-		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_ACTIVITY, 0, NULL);
-		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_TYPE, 0, NULL);
-	}
 }
 
 static void le_setup(struct hci_request *req)
@@ -1718,6 +1708,16 @@ static void hci_init3_req(struct hci_request *req, unsigned long opt)
 	if (hdev->commands[5] & 0x10)
 		hci_setup_link_policy(req);
 
+	if (hdev->commands[8] & 0x01)
+		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_ACTIVITY, 0, NULL);
+
+	/* Some older Broadcom based Bluetooth 1.2 controllers do not
+	 * support the Read Page Scan Type command. Check support for
+	 * this command in the bit mask of supported commands.
+	 */
+	if (hdev->commands[13] & 0x01)
+		hci_req_add(req, HCI_OP_READ_PAGE_SCAN_TYPE, 0, NULL);
+
 	if (lmp_le_capable(hdev)) {
 		u8 events[8];
 
@@ -2634,6 +2634,12 @@ static int hci_dev_do_close(struct hci_dev *hdev)
 	drain_workqueue(hdev->workqueue);
 
 	hci_dev_lock(hdev);
+
+	if (!test_and_clear_bit(HCI_AUTO_OFF, &hdev->dev_flags)) {
+		if (hdev->dev_type == HCI_BREDR)
+			mgmt_powered(hdev, 0);
+	}
+
 	hci_inquiry_cache_flush(hdev);
 	hci_pend_le_actions_clear(hdev);
 	hci_conn_hash_flush(hdev);
@@ -2681,14 +2687,6 @@ static int hci_dev_do_close(struct hci_dev *hdev)
 	hdev->flags &= BIT(HCI_RAW);
 	hdev->dev_flags &= ~HCI_PERSISTENT_MASK;
 
-	if (!test_and_clear_bit(HCI_AUTO_OFF, &hdev->dev_flags)) {
-		if (hdev->dev_type == HCI_BREDR) {
-			hci_dev_lock(hdev);
-			mgmt_powered(hdev, 0);
-			hci_dev_unlock(hdev);
-		}
-	}
-
 	/* Controller radio is available but is currently powered down */
 	hdev->amp_status = AMP_STATUS_POWERED_DOWN;
 
@@ -3083,7 +3081,9 @@ static void hci_power_on(struct work_struct *work)
 
 	err = hci_dev_do_open(hdev);
 	if (err < 0) {
+		hci_dev_lock(hdev);
 		mgmt_set_powered_failed(hdev, err);
+		hci_dev_unlock(hdev);
 		return;
 	}
 
@@ -3959,17 +3959,29 @@ int hci_update_random_address(struct hci_request *req, bool require_privacy,
 	}
 
 	/* In case of required privacy without resolvable private address,
-	 * use an unresolvable private address. This is useful for active
+	 * use an non-resolvable private address. This is useful for active
 	 * scanning and non-connectable advertising.
 	 */
 	if (require_privacy) {
-		bdaddr_t urpa;
+		bdaddr_t nrpa;
+
+		while (true) {
+			/* The non-resolvable private address is generated
+			 * from random six bytes with the two most significant
+			 * bits cleared.
+			 */
+			get_random_bytes(&nrpa, 6);
+			nrpa.b[5] &= 0x3f;
 
-		get_random_bytes(&urpa, 6);
-		urpa.b[5] &= 0x3f;	/* Clear two most significant bits */
+			/* The non-resolvable private address shall not be
+			 * equal to the public address.
+			 */
+			if (bacmp(&hdev->bdaddr, &nrpa))
+				break;
+		}
 
 		*own_addr_type = ADDR_LE_DEV_RANDOM;
-		set_random_addr(req, &urpa);
+		set_random_addr(req, &nrpa);
 		return 0;
 	}
 
@@ -5625,7 +5637,7 @@ void hci_req_add_le_passive_scan(struct hci_request *req)
 	u8 filter_policy;
 
 	/* Set require_privacy to false since no SCAN_REQ are send
-	 * during passive scanning. Not using an unresolvable address
+	 * during passive scanning. Not using an non-resolvable address
 	 * here is important so that peer devices using direct
 	 * advertising with our address will be correctly reported
 	 * by the controller.
diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
index 322abbbbcef9..39a5c8a01726 100644
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -257,6 +257,8 @@ static void hci_cc_write_auth_enable(struct hci_dev *hdev, struct sk_buff *skb)
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (!status) {
 		__u8 param = *((__u8 *) sent);
 
@@ -268,6 +270,8 @@ static void hci_cc_write_auth_enable(struct hci_dev *hdev, struct sk_buff *skb)
 
 	if (test_bit(HCI_MGMT, &hdev->dev_flags))
 		mgmt_auth_enable_complete(hdev, status);
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_write_encrypt_mode(struct hci_dev *hdev, struct sk_buff *skb)
@@ -443,6 +447,8 @@ static void hci_cc_write_ssp_mode(struct hci_dev *hdev, struct sk_buff *skb)
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (!status) {
 		if (sent->mode)
 			hdev->features[1][0] |= LMP_HOST_SSP;
@@ -458,6 +464,8 @@ static void hci_cc_write_ssp_mode(struct hci_dev *hdev, struct sk_buff *skb)
 		else
 			clear_bit(HCI_SSP_ENABLED, &hdev->dev_flags);
 	}
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_write_sc_support(struct hci_dev *hdev, struct sk_buff *skb)
@@ -471,6 +479,8 @@ static void hci_cc_write_sc_support(struct hci_dev *hdev, struct sk_buff *skb)
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (!status) {
 		if (sent->support)
 			hdev->features[1][0] |= LMP_HOST_SC;
@@ -486,6 +496,8 @@ static void hci_cc_write_sc_support(struct hci_dev *hdev, struct sk_buff *skb)
 		else
 			clear_bit(HCI_SC_ENABLED, &hdev->dev_flags);
 	}
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_read_local_version(struct hci_dev *hdev, struct sk_buff *skb)
@@ -1135,6 +1147,8 @@ static void hci_cc_le_set_scan_enable(struct hci_dev *hdev,
 	if (!cp)
 		return;
 
+	hci_dev_lock(hdev);
+
 	switch (cp->enable) {
 	case LE_SCAN_ENABLE:
 		set_bit(HCI_LE_SCAN, &hdev->dev_flags);
@@ -1184,6 +1198,8 @@ static void hci_cc_le_set_scan_enable(struct hci_dev *hdev,
 		BT_ERR("Used reserved LE_Scan_Enable param %d", cp->enable);
 		break;
 	}
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_le_read_white_list_size(struct hci_dev *hdev,
@@ -1278,6 +1294,8 @@ static void hci_cc_write_le_host_supported(struct hci_dev *hdev,
 	if (!sent)
 		return;
 
+	hci_dev_lock(hdev);
+
 	if (sent->le) {
 		hdev->features[1][0] |= LMP_HOST_LE;
 		set_bit(HCI_LE_ENABLED, &hdev->dev_flags);
@@ -1291,6 +1309,8 @@ static void hci_cc_write_le_host_supported(struct hci_dev *hdev,
 		hdev->features[1][0] |= LMP_HOST_LE_BREDR;
 	else
 		hdev->features[1][0] &= ~LMP_HOST_LE_BREDR;
+
+	hci_dev_unlock(hdev);
 }
 
 static void hci_cc_set_adv_param(struct hci_dev *hdev, struct sk_buff *skb)
diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
index a2b6dfa38a0c..d04dc0095736 100644
--- a/net/bluetooth/l2cap_core.c
+++ b/net/bluetooth/l2cap_core.c
@@ -6966,8 +6966,9 @@ static struct l2cap_conn *l2cap_conn_add(struct hci_conn *hcon)
 	    test_bit(HCI_HS_ENABLED, &hcon->hdev->dev_flags))
 		conn->local_fixed_chan |= L2CAP_FC_A2MP;
 
-	if (bredr_sc_enabled(hcon->hdev) &&
-	    test_bit(HCI_LE_ENABLED, &hcon->hdev->dev_flags))
+	if (test_bit(HCI_LE_ENABLED, &hcon->hdev->dev_flags) &&
+	    (bredr_sc_enabled(hcon->hdev) ||
+	     test_bit(HCI_FORCE_LESC, &hcon->hdev->dbg_flags)))
 		conn->local_fixed_chan |= L2CAP_FC_SMP_BREDR;
 
 	mutex_init(&conn->ident_lock);
diff --git a/net/bluetooth/mgmt.c b/net/bluetooth/mgmt.c
index 7384f1161336..06c2e652e4b6 100644
--- a/net/bluetooth/mgmt.c
+++ b/net/bluetooth/mgmt.c
@@ -2199,12 +2199,14 @@ static void le_enable_complete(struct hci_dev *hdev, u8 status)
 {
 	struct cmd_lookup match = { NULL, hdev };
 
+	hci_dev_lock(hdev);
+
 	if (status) {
 		u8 mgmt_err = mgmt_status(status);
 
 		mgmt_pending_foreach(MGMT_OP_SET_LE, hdev, cmd_status_rsp,
 				     &mgmt_err);
-		return;
+		goto unlock;
 	}
 
 	mgmt_pending_foreach(MGMT_OP_SET_LE, hdev, settings_rsp, &match);
@@ -2222,17 +2224,16 @@ static void le_enable_complete(struct hci_dev *hdev, u8 status)
 	if (test_bit(HCI_LE_ENABLED, &hdev->dev_flags)) {
 		struct hci_request req;
 
-		hci_dev_lock(hdev);
-
 		hci_req_init(&req, hdev);
 		update_adv_data(&req);
 		update_scan_rsp_data(&req);
 		hci_req_run(&req, NULL);
 
 		hci_update_background_scan(hdev);
-
-		hci_dev_unlock(hdev);
 	}
+
+unlock:
+	hci_dev_unlock(hdev);
 }
 
 static int set_le(struct sock *sk, struct hci_dev *hdev, void *data, u16 len)
@@ -3114,14 +3115,13 @@ static void pairing_complete(struct pending_cmd *cmd, u8 status)
 	conn->disconn_cfm_cb = NULL;
 
 	hci_conn_drop(conn);
-	hci_conn_put(conn);
-
-	mgmt_pending_remove(cmd);
 
 	/* The device is paired so there is no need to remove
 	 * its connection parameters anymore.
 	 */
 	clear_bit(HCI_CONN_PARAM_REMOVAL_PEND, &conn->flags);
+
+	hci_conn_put(conn);
 }
 
 void mgmt_smp_complete(struct hci_conn *conn, bool complete)
@@ -3130,8 +3130,10 @@ void mgmt_smp_complete(struct hci_conn *conn, bool complete)
 	struct pending_cmd *cmd;
 
 	cmd = find_pairing(conn);
-	if (cmd)
+	if (cmd) {
 		cmd->cmd_complete(cmd, status);
+		mgmt_pending_remove(cmd);
+	}
 }
 
 static void pairing_complete_cb(struct hci_conn *conn, u8 status)
@@ -3141,10 +3143,13 @@ static void pairing_complete_cb(struct hci_conn *conn, u8 status)
 	BT_DBG("status %u", status);
 
 	cmd = find_pairing(conn);
-	if (!cmd)
+	if (!cmd) {
 		BT_DBG("Unable to find a pending command");
-	else
-		cmd->cmd_complete(cmd, mgmt_status(status));
+		return;
+	}
+
+	cmd->cmd_complete(cmd, mgmt_status(status));
+	mgmt_pending_remove(cmd);
 }
 
 static void le_pairing_complete_cb(struct hci_conn *conn, u8 status)
@@ -3157,10 +3162,13 @@ static void le_pairing_complete_cb(struct hci_conn *conn, u8 status)
 		return;
 
 	cmd = find_pairing(conn);
-	if (!cmd)
+	if (!cmd) {
 		BT_DBG("Unable to find a pending command");
-	else
-		cmd->cmd_complete(cmd, mgmt_status(status));
+		return;
+	}
+
+	cmd->cmd_complete(cmd, mgmt_status(status));
+	mgmt_pending_remove(cmd);
 }
 
 static int pair_device(struct sock *sk, struct hci_dev *hdev, void *data,
@@ -3274,8 +3282,10 @@ static int pair_device(struct sock *sk, struct hci_dev *hdev, void *data,
 	cmd->user_data = hci_conn_get(conn);
 
 	if ((conn->state == BT_CONNECTED || conn->state == BT_CONFIG) &&
-	    hci_conn_security(conn, sec_level, auth_type, true))
-		pairing_complete(cmd, 0);
+	    hci_conn_security(conn, sec_level, auth_type, true)) {
+		cmd->cmd_complete(cmd, 0);
+		mgmt_pending_remove(cmd);
+	}
 
 	err = 0;
 
@@ -3317,7 +3327,8 @@ static int cancel_pair_device(struct sock *sk, struct hci_dev *hdev, void *data,
 		goto unlock;
 	}
 
-	pairing_complete(cmd, MGMT_STATUS_CANCELLED);
+	cmd->cmd_complete(cmd, MGMT_STATUS_CANCELLED);
+	mgmt_pending_remove(cmd);
 
 	err = cmd_complete(sk, hdev->id, MGMT_OP_CANCEL_PAIR_DEVICE, 0,
 			   addr, sizeof(*addr));
@@ -3791,7 +3802,7 @@ static bool trigger_discovery(struct hci_request *req, u8 *status)
 
 		/* All active scans will be done with either a resolvable
 		 * private address (when privacy feature has been enabled)
-		 * or unresolvable private address.
+		 * or non-resolvable private address.
 		 */
 		err = hci_update_random_address(req, true, &own_addr_type);
 		if (err < 0) {
@@ -4279,12 +4290,14 @@ static void set_advertising_complete(struct hci_dev *hdev, u8 status)
 {
 	struct cmd_lookup match = { NULL, hdev };
 
+	hci_dev_lock(hdev);
+
 	if (status) {
 		u8 mgmt_err = mgmt_status(status);
 
 		mgmt_pending_foreach(MGMT_OP_SET_ADVERTISING, hdev,
 				     cmd_status_rsp, &mgmt_err);
-		return;
+		goto unlock;
 	}
 
 	if (test_bit(HCI_LE_ADV, &hdev->dev_flags))
@@ -4299,6 +4312,9 @@ static void set_advertising_complete(struct hci_dev *hdev, u8 status)
 
 	if (match.sk)
 		sock_put(match.sk);
+
+unlock:
+	hci_dev_unlock(hdev);
 }
 
 static int set_advertising(struct sock *sk, struct hci_dev *hdev, void *data,
@@ -6081,6 +6097,11 @@ static int powered_update_hci(struct hci_dev *hdev)
 		hci_req_add(&req, HCI_OP_WRITE_SSP_MODE, 1, &ssp);
 	}
 
+	if (bredr_sc_enabled(hdev) && !lmp_host_sc_capable(hdev)) {
+		u8 sc = 0x01;
+		hci_req_add(&req, HCI_OP_WRITE_SC_SUPPORT, sizeof(sc), &sc);
+	}
+
 	if (test_bit(HCI_LE_ENABLED, &hdev->dev_flags) &&
 	    lmp_bredr_capable(hdev)) {
 		struct hci_cp_write_le_host_supported cp;
@@ -6130,8 +6151,7 @@ static int powered_update_hci(struct hci_dev *hdev)
 int mgmt_powered(struct hci_dev *hdev, u8 powered)
 {
 	struct cmd_lookup match = { NULL, hdev };
-	u8 status_not_powered = MGMT_STATUS_NOT_POWERED;
-	u8 zero_cod[] = { 0, 0, 0 };
+	u8 status, zero_cod[] = { 0, 0, 0 };
 	int err;
 
 	if (!test_bit(HCI_MGMT, &hdev->dev_flags))
@@ -6147,7 +6167,20 @@ int mgmt_powered(struct hci_dev *hdev, u8 powered)
 	}
 
 	mgmt_pending_foreach(MGMT_OP_SET_POWERED, hdev, settings_rsp, &match);
-	mgmt_pending_foreach(0, hdev, cmd_complete_rsp, &status_not_powered);
+
+	/* If the power off is because of hdev unregistration let
+	 * use the appropriate INVALID_INDEX status. Otherwise use
+	 * NOT_POWERED. We cover both scenarios here since later in
+	 * mgmt_index_removed() any hci_conn callbacks will have already
+	 * been triggered, potentially causing misleading DISCONNECTED
+	 * status responses.
+	 */
+	if (test_bit(HCI_UNREGISTER, &hdev->dev_flags))
+		status = MGMT_STATUS_INVALID_INDEX;
+	else
+		status = MGMT_STATUS_NOT_POWERED;
+
+	mgmt_pending_foreach(0, hdev, cmd_complete_rsp, &status);
 
 	if (memcmp(hdev->dev_class, zero_cod, sizeof(zero_cod)) != 0)
 		mgmt_event(MGMT_EV_CLASS_OF_DEV_CHANGED, hdev,
@@ -6681,8 +6714,10 @@ void mgmt_auth_failed(struct hci_conn *conn, u8 hci_status)
 	mgmt_event(MGMT_EV_AUTH_FAILED, conn->hdev, &ev, sizeof(ev),
 		    cmd ? cmd->sk : NULL);
 
-	if (cmd)
-		pairing_complete(cmd, status);
+	if (cmd) {
+		cmd->cmd_complete(cmd, status);
+		mgmt_pending_remove(cmd);
+	}
 }
 
 void mgmt_auth_enable_complete(struct hci_dev *hdev, u8 status)
diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 6a46252fe66f..b67749bb55bf 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -1673,7 +1673,8 @@ static u8 smp_cmd_pairing_req(struct l2cap_conn *conn, struct sk_buff *skb)
 	/* SMP over BR/EDR requires special treatment */
 	if (conn->hcon->type == ACL_LINK) {
 		/* We must have a BR/EDR SC link */
-		if (!test_bit(HCI_CONN_AES_CCM, &conn->hcon->flags))
+		if (!test_bit(HCI_CONN_AES_CCM, &conn->hcon->flags) &&
+		    !test_bit(HCI_FORCE_LESC, &hdev->dbg_flags))
 			return SMP_CROSS_TRANSP_NOT_ALLOWED;
 
 		set_bit(SMP_FLAG_SC, &smp->flags);
@@ -2927,7 +2928,7 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 	tfm_aes = crypto_alloc_blkcipher("ecb(aes)", 0, 0);
 	if (IS_ERR(tfm_aes)) {
 		BT_ERR("Unable to create crypto context");
-		return ERR_PTR(PTR_ERR(tfm_aes));
+		return ERR_CAST(tfm_aes);
 	}
 
 create_chan:
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply related

* Re: [iproute2] tc: Show classes more hierarchically]
From: Marcelo Ricardo Leitner @ 2014-12-16 18:12 UTC (permalink / raw)
  To: vadim4j, netdev
In-Reply-To: <20141215224851.GB6734@angus-think.lan>

On 15-12-2014 20:48, vadim4j@gmail.com wrote:
> Hi All,
>
> I am playing with showing classes in more hierarchically format and I
> have some code and example of output from my TC looks like:
>
> # tc/tc -t class show dev tap0
>
>   \---1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>   \---1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:10 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>                 \---1:11 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>                        \---1:111 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:20 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>          \---1:30 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>
>
> which in standart output mode it looks like:
>
> # tc/tc class show dev tap0
>
> class htb 1:11 parent 1:10 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:111 parent 1:11 prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:10 parent 1:1 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b
> class htb 1:1 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:20 parent 1:1 leaf 20: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:2 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:30 parent 1:1 leaf 30: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:40 parent 1:2 leaf 40: prio 0 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b
> class htb 1:50 parent 1:2 leaf 50: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b
> class htb 1:60 parent 1:2 leaf 60: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
>
> So I'd like to ask if it might be useful for the TC users (may be
> better format ?) to have this ?

Good idea! It already looks good, but what about:

   |-- 1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |      |-- 1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |      |-- 1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |      '-- 1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   |-- 1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b
   ...

just another idea..

Thanks.
   Marcelo

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Marcelo Ricardo Leitner @ 2014-12-16 18:00 UTC (permalink / raw)
  To: rajatxjain; +Cc: Nils Holland, David Miller, netdev, linux-pci@vger.kernel.org
In-Reply-To: <CAA93t1qyZE-9tw8pg1KG6g4iyy0QMW=iass5w=6ZGMTMu+vi_A@mail.gmail.com>

On 16-12-2014 14:04, Rajat Jain wrote:
> Hello All,
>
> Apologies for jumping in late, but for some reason I do not see the
> original mail in my inbox. However I am taking a look at the mails as
> sent on linux-pci (and I will keep an eye out for the bug report that
> Bjorn asked for).
>

np!
Nils would you create that BZ please? As you did all the bisect.. :)

>
>>
>> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>>
>> $ grep 'pci 0000:02' tg3.bad
>> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.561555] pci 0000:02:00.0: 1st 1 1
>> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
>> [   20.412021] pci 0000:02:00.0: 1st 1 1
>> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
>> [   20.413596] pci 0000:02:00.0: 1st 1 1
>> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>>
>> And without it:
>>
>> $ grep 'pci 0000:02' tg3.good
>> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.576778] pci 0000:02:00.0: 1st 1 1
>> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>>
>
> It seems that in the first 2 attempts that were made to probe the
> device are all OK and return regular device ID and vendor ID for TG3
> (CRS does not have a role to play). However, later attempts return a
> CRS.
>
> 1) May I ask if you are using acpihp or pciehp? I assume pciehp?

Well.. system doesn't support hotplug..
Chipset is a "Intel Corporation 5 Series/3400 Series", fwiw

> 2) Can you please also send dmesg output while passing
> pciehp.pciehp_debug=1? In the fail case, do you see a message
> indicating the pciehp gave up since it got CRS for a long time
> (something like "pci 0000:02:00.0 id reading try 50 times with
> interval 20 ms to get ffff0001")?

I did use that option anyway, but it resulted in no new messages.

> 3) Currently the pciehp passes "0" for the argument "crs_timeout" to
> pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say
> 30 seconds (30 * 1000). (For comparison data, acpihp uses the value
> 60*1000 i.e. 60 seconds today) and run the fail case once again?
>
> Thanks a lot in advance for the debugging help ;-)
>

Seems it's not safe to do that with those backtraces..
I did it, system was very slow to boot, still didn't get the NIC on and 
got a bunch of "scheduling while atomic" due to that msleep() call.

The first invoke was fine:
Dec 16 15:40:00 odin kernel: [    0.190711] pci 0000:02:00.0: 1st 
165a14e4 14e4
Dec 16 15:40:00 odin kernel: [    0.190717] pci 0000:02:00.0: 1st 
165a14e4 14e4
Dec 16 15:40:00 odin kernel: [    0.191091] pci 0000:02:00.0: System 
wakeup disabled by ACPI
Dec 16 15:40:00 odin kernel: [    1.576061] pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:00 odin kernel: [    1.577474] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.580487] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.585508] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.594499] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.611499] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.644521] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.709566] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.838654] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    2.095765] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    2.608956] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    3.634443] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    5.684388] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    9.783279] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [   17.980060] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [   34.372640] pci 0000:02:00.0: not responding

The other two...
Dec 16 15:40:09 odin kernel: [   54.154688] pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:09 odin kernel: [   54.154690] BUG: scheduling while 
atomic: ip/1575/0x00000200
Dec 16 15:40:09 odin kernel: pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:09 odin kernel: BUG: scheduling while atomic: 
ip/1575/0x00000200
Dec 16 15:40:09 odin kernel: pci 0000:02:00.0: 1 1
Dec 16 15:40:09 odin kernel: BUG: scheduling while atomic: 
ip/1575/0x00000200
(...)

BUG backtraces were very similar to the 2nd and 3rd I posted on the 
other email, it just pointed to the msleep() call instead of my BUG_ON(1).

I can dig deeper if you think it's worth, but as the 1st call didn't 
have this issue and it didn't complete either, seems we are good about 
the test.. right?

Thanks,
Marcelo

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Marcelo Ricardo Leitner @ 2014-12-16 17:59 UTC (permalink / raw)
  To: Michael Chan, Bjorn Helgaas
  Cc: Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan
In-Reply-To: <1418750141.4248.3.camel@LTIRV-MCHAN1.corp.ad.broadcom.com>

On 16-12-2014 15:15, Michael Chan wrote:
> On Tue, 2014-12-16 at 09:20 -0700, Bjorn Helgaas wrote:
>> I think we're in this path:
>>
>>      tg3_init_hw
>>        tg3_reset_hw
>>          tg3_disable_ints
>>          tg3_stop_fw
>>          tg3_write_sig_pre_reset
>>          tg3_chip_reset
>>            pci_device_is_present
>>              pci_bus_read_dev_vendor_id
>>
>> and in this case pci_device_is_present() also passes a timeout of zero
>> to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
>> the device, so it's not too surprising that the config read returns
>> CRS status immediately afterward.
>>
> At the point of calling pci_device_is_present(), chip reset hasn't
> started yet, so there should be no problem reading config space.
> 
> In all the newer tg3 chips, chip reset does not reset the PCIE block.
> So I think config space should always be accesible even during reset.

It's a 
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express
over here

I put a WARN_ON(1) after those printks, and this is what I got:

[    1.550640] pci 0000:02:00.0: 1st 1 1
[    1.550643] pci 0000:02:00.0: crs_timeout: 0
[    1.550645] ------------[ cut here ]------------
[    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
[    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
[    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
[    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
[    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
[    1.550669] Call Trace:
[    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
[    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3]
[    1.550705]  [<ffffffff8127d2bb>] ? kernfs_activate+0x7b/0xf0
[    1.550708]  [<ffffffff813bbdc5>] local_pci_probe+0x45/0xa0
[    1.550711]  [<ffffffff8127fa8d>] ? sysfs_do_create_link_sd.isra.2+0x6d/0xc0
[    1.550714]  [<ffffffff813bd1b9>] pci_device_probe+0xf9/0x150
[    1.550717]  [<ffffffff814906fd>] driver_probe_device+0x12d/0x3d0
[    1.550720]  [<ffffffff81490a7b>] __driver_attach+0x9b/0xa0
[    1.550722]  [<ffffffff814909e0>] ? __device_attach+0x40/0x40
[    1.550724]  [<ffffffff8148e4f3>] bus_for_each_dev+0x73/0xc0
[    1.550726]  [<ffffffff814900ee>] driver_attach+0x1e/0x20
[    1.550729]  [<ffffffff8148fcb0>] bus_add_driver+0x180/0x250
[    1.550731]  [<ffffffffa0050000>] ? 0xffffffffa0050000
[    1.550733]  [<ffffffff81491274>] driver_register+0x64/0xf0
[    1.550735]  [<ffffffff813bb72b>] __pci_register_driver+0x4b/0x50
[    1.550739]  [<ffffffffa005001e>] tg3_driver_init+0x1e/0x1000 [tg3]
[    1.550742]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[    1.550747]  [<ffffffff811cbc42>] ? __vunmap+0xc2/0x110
[    1.550751]  [<ffffffff8111336b>] load_module+0x1cab/0x2730
[    1.550753]  [<ffffffff8110efc0>] ? store_uevent+0x70/0x70
[    1.550756]  [<ffffffff8120b090>] ? kernel_read+0x50/0x80
[    1.550760]  [<ffffffff81113fa6>] SyS_finit_module+0xa6/0xd0
[    1.550763]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[    1.550764] ---[ end trace 4cc3153e369484ea ]---
[    1.550963] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95722) rev a200] (PCI Express) MAC address 00:0a:f7:2b:9b:39
[    1.550965] tg3 0000:02:00.0 eth0: attached PHY is 5722/5756 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
[    1.550966] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.550967] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[    1.556112] tg3 0000:02:00.0 p1p1: renamed from eth0
...

[   23.545119] tg3 0000:02:00.0: irq 32 for MSI/MSI-X
[   25.424981] tg3 0000:02:00.0 p1p1: No firmware running
[   25.425686] pci 0000:02:00.0: 1st 1 1
[   25.425687] pci 0000:02:00.0: crs_timeout: 0
[   25.425687] ------------[ cut here ]------------
[   25.425691] WARNING: CPU: 0 PID: 1590 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[   25.425692] Modules linked in: bridge stp llc openvswitch x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_co
dec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel ghash_clmulni_intel snd_seq mei_me snd_seq_d
evice snd_pcm iTCO_wdt iTCO_vendor_support snd_timer mei snd lpc_ich i2c_i801 pcspkr mfd_core dcdbas serio_raw soundcore microcode ie31200_edac shpchp edac_
core nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc xfs libcrc32c i915 raid0 i2c_algo_bit drm_kms_helper drm e1000e tg3 ptp pps_core video
[   25.425714] CPU: 0 PID: 1590 Comm: ip Tainted: G        W      3.18.0-rc6+ #8
[   25.425715] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[   25.425716]  0000000000000000 0000000097b01d0c ffff8807f0687408 ffffffff8173db46
[   25.425717]  0000000000000000 0000000000000000 ffff8807f0687448 ffffffff81094d41
[   25.425719]  ffff8807f0687428 ffff8807f1e27000 0000000000000000 0000000000000000
[   25.425720] Call Trace:
[   25.425723]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[   25.425726]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[   25.425728]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[   25.425729]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[   25.425733]  [<ffffffffa0028b47>] ? tg3_phy_auxctl_write+0x27/0x30 [tg3]
[   25.425735]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[   25.425738]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[   25.425740]  [<ffffffffa003756d>] tg3_reset_hw+0x8d/0x2ce0 [tg3]
[   25.425743]  [<ffffffff81383f6a>] ? delay_tsc+0x4a/0x80
[   25.425744]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.425747]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.425749]  [<ffffffffa003a216>] tg3_init_hw+0x56/0x60 [tg3]
[   25.425751]  [<ffffffffa003c0d5>] tg3_start+0xbe5/0x1210 [tg3]
[   25.425753]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.425755]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.425757]  [<ffffffffa003c828>] tg3_open+0x128/0x2e0 [tg3]
[   25.425760]  [<ffffffff8162c6cf>] __dev_open+0xcf/0x140
[   25.425761]  [<ffffffff8162c9f1>] __dev_change_flags+0xa1/0x160
[   25.425762]  [<ffffffff8162cad9>] dev_change_flags+0x29/0x60
[   25.425764]  [<ffffffff8163a4a9>] do_setlink+0x399/0xa90
[   25.425766]  [<ffffffff8163ca7c>] rtnl_newlink+0x51c/0x740
[   25.425768]  [<ffffffff8163c653>] ? rtnl_newlink+0xf3/0x740
[   25.425771]  [<ffffffff811e730c>] ? new_slab+0x14c/0x490
[   25.425774]  [<ffffffff81303188>] ? security_capable+0x18/0x20
[   25.425776]  [<ffffffff8109cf7d>] ? ns_capable+0x2d/0x60
[   25.425778]  [<ffffffff816391a4>] rtnetlink_rcv_msg+0xa4/0x270
[   25.425780]  [<ffffffff8165840d>] ? __netlink_lookup+0x4d/0x70
[   25.425781]  [<ffffffff81639100>] ? rtnetlink_rcv+0x40/0x40
[   25.425783]  [<ffffffff8165c4a1>] netlink_rcv_skb+0xc1/0xe0
[   25.425784]  [<ffffffff816390ec>] rtnetlink_rcv+0x2c/0x40
[   25.425785]  [<ffffffff8165ba26>] netlink_unicast+0x106/0x210
[   25.425787]  [<ffffffff8165be55>] netlink_sendmsg+0x325/0x790
[   25.425788]  [<ffffffff8160de50>] sock_sendmsg+0xa0/0xe0
[   25.425791]  [<ffffffff8120e8cd>] ? lookup_real+0x1d/0x50
[   25.425792]  [<ffffffff8160e394>] ___sys_sendmsg+0x2f4/0x310
[   25.425794]  [<ffffffff8119bdf2>] ? lru_cache_add_active_or_unevictable+0x32/0xc0
[   25.425796]  [<ffffffff8160c673>] ? sock_destroy_inode+0x33/0x40
...
[   25.425794]  [<ffffffff8119bdf2>] ? lru_cache_add_active_or_unevictable+0x32/0xc0
[   25.425796]  [<ffffffff8160c673>] ? sock_destroy_inode+0x33/0x40
[   25.425798]  [<ffffffff8121bfd5>] ? __dentry_kill+0x145/0x1d0
[   25.425799]  [<ffffffff8121c105>] ? dput+0xa5/0x170
[   25.425800]  [<ffffffff81224f74>] ? mntput+0x24/0x40
[   25.425802]  [<ffffffff81206d6a>] ? __fput+0x17a/0x1e0
[   25.425803]  [<ffffffff8160ee21>] __sys_sendmsg+0x51/0x90
[   25.425805]  [<ffffffff8160ee72>] SyS_sendmsg+0x12/0x20
[   25.425807]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[   25.425808] ---[ end trace 4cc3153e369484eb ]---
[   25.427385] pci 0000:02:00.0: 1st 1 1
[   25.427386] pci 0000:02:00.0: crs_timeout: 0
[   25.427387] ------------[ cut here ]------------
[   25.427389] WARNING: CPU: 0 PID: 1590 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[   25.427389] Modules linked in: bridge stp llc openvswitch x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel ghash_clmulni_intel snd_seq mei_me snd_seq_device snd_pcm iTCO_wdt iTCO_vendor_support snd_timer mei snd lpc_ich i2c_i801 pcspkr mfd_core dcdbas serio_raw soundcore microcode ie31200_edac shpchp edac_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc xfs libcrc32c i915 raid0 i2c_algo_bit drm_kms_helper drm e1000e tg3 ptp pps_core video
[   25.427403] CPU: 0 PID: 1590 Comm: ip Tainted: G        W      3.18.0-rc6+ #8
[   25.427404] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[   25.427405]  0000000000000000 0000000097b01d0c ffff8807f0687488 ffffffff8173db46
[   25.427406]  0000000000000000 0000000000000000 ffff8807f06874c8 ffffffff81094d41
[   25.427416]  ffff8807f06874a8 ffff8807f1e27000 0000000000000000 0000000000000000
[   25.427417] Call Trace:
[   25.427418]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[   25.427420]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[   25.427421]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[   25.427423]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[   25.427425]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[   25.427427]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[   25.427430]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
[   25.427433]  [<ffffffffa003c208>] tg3_start+0xd18/0x1210 [tg3]
[   25.427434]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.427437]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.427439]  [<ffffffffa003c828>] tg3_open+0x128/0x2e0 [tg3]
[   25.427441]  [<ffffffff8162c6cf>] __dev_open+0xcf/0x140
[   25.427442]  [<ffffffff8162c9f1>] __dev_change_flags+0xa1/0x160
[   25.427443]  [<ffffffff8162cad9>] dev_change_flags+0x29/0x60
[   25.427445]  [<ffffffff8163a4a9>] do_setlink+0x399/0xa90
[   25.427448]  [<ffffffff8163ca7c>] rtnl_newlink+0x51c/0x740
[   25.427449]  [<ffffffff8163c653>] ? rtnl_newlink+0xf3/0x740
[   25.427449]  [<ffffffff8163c653>] ? rtnl_newlink+0xf3/0x740
[   25.427452]  [<ffffffff811e730c>] ? new_slab+0x14c/0x490
[   25.427454]  [<ffffffff81303188>] ? security_capable+0x18/0x20
[   25.427455]  [<ffffffff8109cf7d>] ? ns_capable+0x2d/0x60
[   25.427457]  [<ffffffff816391a4>] rtnetlink_rcv_msg+0xa4/0x270
[   25.427459]  [<ffffffff8165840d>] ? __netlink_lookup+0x4d/0x70
[   25.427460]  [<ffffffff81639100>] ? rtnetlink_rcv+0x40/0x40
[   25.427462]  [<ffffffff8165c4a1>] netlink_rcv_skb+0xc1/0xe0
[   25.427464]  [<ffffffff816390ec>] rtnetlink_rcv+0x2c/0x40
[   25.427465]  [<ffffffff8165ba26>] netlink_unicast+0x106/0x210
[   25.427466]  [<ffffffff8165be55>] netlink_sendmsg+0x325/0x790
[   25.427468]  [<ffffffff8160de50>] sock_sendmsg+0xa0/0xe0
[   25.427469]  [<ffffffff8120e8cd>] ? lookup_real+0x1d/0x50
[   25.427471]  [<ffffffff8160e394>] ___sys_sendmsg+0x2f4/0x310
[   25.427472]  [<ffffffff8119bdf2>] ? lru_cache_add_active_or_unevictable+0x32/0xc0
[   25.427475]  [<ffffffff8160c673>] ? sock_destroy_inode+0x33/0x40
[   25.427477]  [<ffffffff8121bfd5>] ? __dentry_kill+0x145/0x1d0
[   25.427478]  [<ffffffff8121c105>] ? dput+0xa5/0x170
[   25.427479]  [<ffffffff81224f74>] ? mntput+0x24/0x40
[   25.427481]  [<ffffffff81206d6a>] ? __fput+0x17a/0x1e0
[   25.427482]  [<ffffffff8160ee21>] __sys_sendmsg+0x51/0x90
[   25.427483]  [<ffffffff8160ee72>] SyS_sendmsg+0x12/0x20
[   25.427493]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[   25.427494] ---[ end trace 4cc3153e369484ec ]---

  Marcelo

^ permalink raw reply

* [PATCH net-next 3/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:50 UTC (permalink / raw)
  To: netdev

This patch set is the union of the previous two patches.
Applying this patch to the net-next kernel (commit f96fe22)
provides full functionality. The DLKM and API found at
https://sourceforge.net/projects/tcpestats/files/ will allow
interested parties to test out our implementation from a
user perspective.

As note - to enable tcp_estats in the kernel the
net.ipv4.tcp_estats must be set. To enable all statistics
set net.ipv4.tcp_estats=127

---
  include/linux/tcp.h        |   8 +
  include/net/tcp.h          |   1 +
  include/net/tcp_estats.h   | 376 +++++++++++++++++++++++
  include/uapi/linux/tcp.h   |   6 +-
  net/ipv4/Kconfig           |  25 ++
  net/ipv4/Makefile          |   1 +
  net/ipv4/sysctl_net_ipv4.c |  14 +
  net/ipv4/tcp.c             |  21 +-
  net/ipv4/tcp_cong.c        |   3 +
  net/ipv4/tcp_estats.c      | 736 +++++++++++++++++++++++++++++++++++++++++++++
  net/ipv4/tcp_htcp.c        |   1 +
  net/ipv4/tcp_input.c       | 116 ++++++-
  net/ipv4/tcp_ipv4.c        |  10 +
  net/ipv4/tcp_output.c      |  61 +++-
  net/ipv4/tcp_timer.c       |   3 +
  net/ipv6/tcp_ipv6.c        |   7 +
  16 files changed, 1368 insertions(+), 21 deletions(-)
  create mode 100644 include/net/tcp_estats.h
  create mode 100644 net/ipv4/tcp_estats.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8758360 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -126,6 +126,10 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
  	return (struct tcp_request_sock *)req;
  }
  
+#ifdef CONFIG_TCP_ESTATS
+struct tcp_estats;
+#endif
+
  struct tcp_sock {
  	/* inet_connection_sock has to be the first member of tcp_sock */
  	struct inet_connection_sock	inet_conn;
@@ -309,6 +313,10 @@ struct tcp_sock {
  	struct tcp_md5sig_info	__rcu *md5sig_info;
  #endif
  
+#ifdef CONFIG_TCP_ESTATS
+	struct tcp_estats	*tcp_stats;
+#endif
+
  /* TCP fastopen related information */
  	struct tcp_fastopen_request *fastopen_req;
  	/* fastopen_rsk points to request_sock that resulted in this big
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..9f7e31e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -43,6 +43,7 @@
  #include <net/tcp_states.h>
  #include <net/inet_ecn.h>
  #include <net/dst.h>
+#include <net/tcp_estats.h>
  
  #include <linux/seq_file.h>
  #include <linux/memcontrol.h>
diff --git a/include/net/tcp_estats.h b/include/net/tcp_estats.h
new file mode 100644
index 0000000..ff6000e
--- /dev/null
+++ b/include/net/tcp_estats.h
@@ -0,0 +1,376 @@
+/*
+ * include/net/tcp_estats.h
+ *
+ * Implementation of TCP Extended Statistics MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominin Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _TCP_ESTATS_H
+#define _TCP_ESTATS_H
+
+#include <net/sock.h>
+#include <linux/idr.h>
+#include <linux/in.h>
+#include <linux/jump_label.h>
+#include <linux/spinlock.h>
+#include <linux/tcp.h>
+#include <linux/workqueue.h>
+
+/* defines number of seconds that stats persist after connection ends */
+#define TCP_ESTATS_PERSIST_DELAY_SECS 5
+
+enum tcp_estats_sndlim_states {
+	TCP_ESTATS_SNDLIM_NONE = -1,
+	TCP_ESTATS_SNDLIM_SENDER,
+	TCP_ESTATS_SNDLIM_CWND,
+	TCP_ESTATS_SNDLIM_RWIN,
+	TCP_ESTATS_SNDLIM_STARTUP,
+	TCP_ESTATS_SNDLIM_TSODEFER,
+	TCP_ESTATS_SNDLIM_PACE,
+	TCP_ESTATS_SNDLIM_NSTATES	/* Keep at end */
+};
+
+enum tcp_estats_addrtype {
+	TCP_ESTATS_ADDRTYPE_IPV4 = 1,
+	TCP_ESTATS_ADDRTYPE_IPV6 = 2
+};
+
+enum tcp_estats_softerror_reason {
+	TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW = 1,
+	TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW = 2,
+	TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW = 3,
+	TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW = 4,
+	TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW = 5,
+	TCP_ESTATS_SOFTERROR_ABOVE_TS_WINDOW = 6,
+	TCP_ESTATS_SOFTERROR_DATA_CHECKSUM = 7,
+	TCP_ESTATS_SOFTERROR_OTHER = 8,
+};
+
+#define TCP_ESTATS_INACTIVE	2
+#define TCP_ESTATS_ACTIVE	1
+
+#define TCP_ESTATS_TABLEMASK_INACTIVE	0x00
+#define TCP_ESTATS_TABLEMASK_ACTIVE	0x01
+#define TCP_ESTATS_TABLEMASK_PERF	0x02
+#define TCP_ESTATS_TABLEMASK_PATH	0x04
+#define TCP_ESTATS_TABLEMASK_STACK	0x08
+#define TCP_ESTATS_TABLEMASK_APP	0x10
+#define TCP_ESTATS_TABLEMASK_EXTRAS	0x40
+
+#ifdef CONFIG_TCP_ESTATS
+
+extern struct static_key tcp_estats_enabled;
+
+#define TCP_ESTATS_CHECK(tp, table, expr)				\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats) &&			\
+			    likely((tp)->tcp_stats->tables.table)) {	\
+				(expr);					\
+			}						\
+		}							\
+	} while (0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, ++((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_DEC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, --((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) += (val))
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) = (val))
+#define TCP_ESTATS_UPDATE(tp, func)					\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats)) {			\
+				(func);					\
+			}						\
+		}							\
+	} while (0)
+
+/*
+ * Variables that can be read and written directly.
+ *
+ * Contains all variables from RFC 4898. Commented fields are
+ * either not implemented (only StartTimeStamp
+ * remains unimplemented in this release) or have
+ * handlers and do not need struct storage.
+ */
+struct tcp_estats_connection_table {
+	u32			AddressType;
+	union { struct in_addr addr; struct in6_addr addr6; }	LocalAddress;
+	union { struct in_addr addr; struct in6_addr addr6; }	RemAddress;
+	u16			LocalPort;
+	u16			RemPort;
+};
+
+struct tcp_estats_perf_table {
+	u32		SegsOut;
+	u32		DataSegsOut;
+	u64		DataOctetsOut;
+	u32		SegsRetrans;
+	u32		OctetsRetrans;
+	u32		SegsIn;
+	u32		DataSegsIn;
+	u64		DataOctetsIn;
+	/*		ElapsedSecs */
+	/*		ElapsedMicroSecs */
+	/*		StartTimeStamp */
+	/*		CurMSS */
+	/*		PipeSize */
+	u32		MaxPipeSize;
+	/*		SmoothedRTT */
+	/*		CurRTO */
+	u32		CongSignals;
+	/*		CurCwnd */
+	/*		CurSsthresh */
+	u32		Timeouts;
+	/*		CurRwinSent */
+	u32		MaxRwinSent;
+	u32		ZeroRwinSent;
+	/*		CurRwinRcvd */
+	u32		MaxRwinRcvd;
+	u32		ZeroRwinRcvd;
+	/*		SndLimTransRwin */
+	/*		SndLimTransCwnd */
+	/*		SndLimTransSnd */
+	/*		SndLimTimeRwin */
+	/*		SndLimTimeCwnd */
+	/*		SndLimTimeSnd */
+	u32		snd_lim_trans[TCP_ESTATS_SNDLIM_NSTATES];
+	u32		snd_lim_time[TCP_ESTATS_SNDLIM_NSTATES];
+};
+
+struct tcp_estats_path_table {
+	/*		RetranThresh */
+	u32		NonRecovDAEpisodes;
+	u32		SumOctetsReordered;
+	u32		NonRecovDA;
+	u32		SampleRTT;
+	/*		RTTVar */
+	u32		MaxRTT;
+	u32		MinRTT;
+	u64		SumRTT;
+	u32		CountRTT;
+	u32		MaxRTO;
+	u32		MinRTO;
+	u8		IpTtl;
+	u8		IpTosIn;
+	/*		IpTosOut */
+	u32		PreCongSumCwnd;
+	u32		PreCongSumRTT;
+	u32		PostCongSumRTT;
+	u32		PostCongCountRTT;
+	u32		ECNsignals;
+	u32		DupAckEpisodes;
+	/*		RcvRTT */
+	u32		DupAcksOut;
+	u32		CERcvd;
+	u32		ECESent;
+};
+
+struct tcp_estats_stack_table {
+	u32		ActiveOpen;
+	/*		MSSSent */
+	/*		MSSRcvd */
+	/*		WinScaleSent */
+	/*		WinScaleRcvd */
+	/*		TimeStamps */
+	/*		ECN */
+	/*		WillSendSACK */
+	/*		WillUseSACK */
+	/*		State */
+	/*		Nagle */
+	u32		MaxSsCwnd;
+	u32		MaxCaCwnd;
+	u32		MaxSsthresh;
+	u32		MinSsthresh;
+	/*		InRecovery */
+	u32		DupAcksIn;
+	u32		SpuriousFrDetected;
+	u32		SpuriousRtoDetected;
+	u32		SoftErrors;
+	u32		SoftErrorReason;
+	u32		SlowStart;
+	u32		CongAvoid;
+	u32		OtherReductions;
+	u32		CongOverCount;
+	u32		FastRetran;
+	u32		SubsequentTimeouts;
+	/*		CurTimeoutCount */
+	u32		AbruptTimeouts;
+	u32		SACKsRcvd;
+	u32		SACKBlocksRcvd;
+	u32		SendStall;
+	u32		DSACKDups;
+	u32		MaxMSS;
+	u32		MinMSS;
+	u32		SndInitial;
+	u32		RecInitial;
+	/*		CurRetxQueue */
+	/*		MaxRetxQueue */
+	/*		CurReasmQueue */
+	u32		MaxReasmQueue;
+	u32		EarlyRetrans;
+	u32		EarlyRetransDelay;
+};
+
+struct tcp_estats_app_table {
+	/*		SndUna */
+	/*		SndNxt */
+	u32		SndMax;
+	u64		ThruOctetsAcked;
+	/*		RcvNxt */
+	u64		ThruOctetsReceived;
+	/*		CurAppWQueue */
+	u32		MaxAppWQueue;
+	/*		CurAppRQueue */
+	u32		MaxAppRQueue;
+};
+
+/*
+    currently, no backing store is needed for tuning elements in
+     web10g - they are all read or written to directly in other
+     data structures (such as the socket)
+*/
+
+struct tcp_estats_extras_table {
+	/*		OtherReductionsCV */
+	u32		OtherReductionsCM;
+	u32		Priority;
+};
+
+struct tcp_estats_tables {
+	struct tcp_estats_connection_table	*connection_table;
+	struct tcp_estats_perf_table		*perf_table;
+	struct tcp_estats_path_table		*path_table;
+	struct tcp_estats_stack_table		*stack_table;
+	struct tcp_estats_app_table		*app_table;
+	struct tcp_estats_extras_table		*extras_table;
+};
+
+struct tcp_estats {
+	int				tcpe_cid; /* idr map id */
+
+	struct sock			*sk;
+	kuid_t				uid;
+	kgid_t				gid;
+	int				ids;
+
+	atomic_t			users;
+
+	enum tcp_estats_sndlim_states	limstate;
+	ktime_t				limstate_ts;
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	ktime_t				start_ts;
+	ktime_t				current_ts;
+#else
+	unsigned long			start_ts;
+	unsigned long			current_ts;
+#endif
+	struct timeval			start_tv;
+
+        int				queued;
+        struct work_struct		create_notify;
+        struct work_struct		establish_notify;
+        struct delayed_work		destroy_notify;
+
+	struct tcp_estats_tables	tables;
+
+	struct rcu_head			rcu;
+};
+
+extern struct idr tcp_estats_idr;
+
+extern int tcp_estats_wq_enabled;
+extern struct workqueue_struct *tcp_estats_wq;
+extern void (*create_notify_func)(struct work_struct *work);
+extern void (*establish_notify_func)(struct work_struct *work);
+extern void (*destroy_notify_func)(struct work_struct *work);
+
+extern unsigned long persist_delay;
+extern spinlock_t tcp_estats_idr_lock;
+
+/* For the TCP code */
+extern int  tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype t,
+			      int active);
+extern void tcp_estats_destroy(struct sock *sk);
+extern void tcp_estats_establish(struct sock *sk);
+extern void tcp_estats_free(struct rcu_head *rcu);
+
+extern void tcp_estats_update_snd_nxt(struct tcp_sock *tp);
+extern void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack);
+extern void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample);
+extern void tcp_estats_update_timeout(struct sock *sk);
+extern void tcp_estats_update_mss(struct tcp_sock *tp);
+extern void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp);
+extern void tcp_estats_update_sndlim(struct tcp_sock *tp,
+				     enum tcp_estats_sndlim_states why);
+extern void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq);
+extern void tcp_estats_update_rwin_sent(struct tcp_sock *tp);
+extern void tcp_estats_update_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_post_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_segsend(struct sock *sk, int pcount,
+                                      u32 seq, u32 end_seq, int flags);
+extern void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb);
+extern void tcp_estats_update_finish_segrecv(struct tcp_sock *tp);
+extern void tcp_estats_update_writeq(struct sock *sk);
+extern void tcp_estats_update_recvq(struct sock *sk);
+
+extern void tcp_estats_init(void);
+
+static inline void tcp_estats_use(struct tcp_estats *stats)
+{
+	atomic_inc(&stats->users);
+}
+
+static inline int tcp_estats_use_if_valid(struct tcp_estats *stats)
+{
+	return atomic_inc_not_zero(&stats->users);
+}
+
+static inline void tcp_estats_unuse(struct tcp_estats *stats)
+{
+	if (atomic_dec_and_test(&stats->users)) {
+		sock_put(stats->sk);
+		stats->sk = NULL;
+		call_rcu(&stats->rcu, tcp_estats_free);
+	}
+}
+
+#else /* !CONFIG_TCP_ESTATS */
+
+#define tcp_estats_enabled	(0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_DEC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_UPDATE(tp, func)		do {} while (0)
+
+static inline void tcp_estats_init(void) { }
+static inline void tcp_estats_establish(struct sock *sk) { }
+static inline void tcp_estats_create(struct sock *sk,
+				     enum tcp_estats_addrtype t,
+				     int active) { }
+static inline void tcp_estats_destroy(struct sock *sk) { }
+
+#endif /* CONFIG_TCP_ESTATS */
+
+#endif /* _TCP_ESTATS_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 3b97183..5dae043 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -186,9 +186,13 @@ struct tcp_info {
  	__u32	tcpi_rcv_space;
  
  	__u32	tcpi_total_retrans;
-
  	__u64	tcpi_pacing_rate;
  	__u64	tcpi_max_pacing_rate;
+
+#ifdef CONFIG_TCP_ESTATS
+	/* RFC 4898 extended stats Info */
+	__u32	tcpi_estats_cid;
+#endif
  };
  
  /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index bd29016..4bd176e 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -680,3 +680,28 @@ config TCP_MD5SIG
  	  on the Internet.
  
  	  If unsure, say N.
+
+config TCP_ESTATS
+	bool "TCP: Extended TCP statistics (RFC4898) MIB"
+	---help---
+	  RFC 4898 specifies a number of extended statistics for TCP. This
+	  data can be accessed using netlink. See http://www.web10g.org for
+	  more details.
+
+if TCP_ESTATS
+
+config TCP_ESTATS_STRICT_ELAPSEDTIME
+	bool "TCP: ESTATS strict ElapsedSecs/Msecs counters"
+	depends on TCP_ESTATS
+	default n
+	---help---
+	  Elapsed time since beginning of connection.
+	  RFC4898 defines ElapsedSecs/Msecs as being updated via ktime_get
+	  at each protocol event (sending or receiving of a segment);
+	  as this can be a performance hit, leaving this config option off
+	  will update elapsed based on on the jiffies counter instead.
+	  Set to Y for strict conformance with the MIB.
+
+	  If unsure, say N.
+
+endif
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 518c04e..7e2c69a 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_INET_TUNNEL) += tunnel4.o
  obj-$(CONFIG_INET_XFRM_MODE_TRANSPORT) += xfrm4_mode_transport.o
  obj-$(CONFIG_INET_XFRM_MODE_TUNNEL) += xfrm4_mode_tunnel.o
  obj-$(CONFIG_IP_PNP) += ipconfig.o
+obj-$(CONFIG_TCP_ESTATS) += tcp_estats.o
  obj-$(CONFIG_NETFILTER)	+= netfilter.o netfilter/
  obj-$(CONFIG_INET_DIAG) += inet_diag.o
  obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0ee384..edc5a66 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -42,6 +42,11 @@ static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
  static int ip_ping_group_range_min[] = { 0, 0 };
  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
  
+/* Extended statistics (RFC4898). */
+#ifdef CONFIG_TCP_ESTATS
+int sysctl_tcp_estats __read_mostly;
+#endif  /* CONFIG_TCP_ESTATS */
+
  /* Update system visible IP port range */
  static void set_local_port_range(struct net *net, int range[2])
  {
@@ -767,6 +772,15 @@ static struct ctl_table ipv4_table[] = {
  		.proc_handler	= proc_dointvec_minmax,
  		.extra1		= &one
  	},
+#ifdef CONFIG_TCP_ESTATS
+	{
+		.procname	= "tcp_estats",
+		.data		= &sysctl_tcp_estats,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif /* CONFIG TCP ESTATS */
  	{ }
  };
  
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3075723..698dbb7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,10 @@ void tcp_init_sock(struct sock *sk)
  	sk->sk_sndbuf = sysctl_tcp_wmem[1];
  	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
  
+#ifdef CONFIG_TCP_ESTATS
+	tp->tcp_stats = NULL;
+#endif
+
  	local_bh_disable();
  	sock_update_memcg(sk);
  	sk_sockets_allocated_inc(sk);
@@ -972,6 +976,9 @@ wait_for_memory:
  		tcp_push(sk, flags & ~MSG_MORE, mss_now,
  			 TCP_NAGLE_PUSH, size_goal);
  
+		if (copied)
+                        TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+
  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  			goto do_error;
  
@@ -1264,9 +1271,11 @@ new_segment:
  wait_for_sndbuf:
  			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
  wait_for_memory:
-			if (copied)
+			if (copied) {
  				tcp_push(sk, flags & ~MSG_MORE, mss_now,
  					 TCP_NAGLE_PUSH, size_goal);
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+			}
  
  			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  				goto do_error;
@@ -1658,6 +1667,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
  			     *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt, flags);
  		}
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		/* Well, if we have backlog, try to process it now yet. */
  
  		if (copied >= target && !sk->sk_backlog.tail)
@@ -2684,6 +2695,11 @@ void tcp_get_info(const struct sock *sk, struct tcp_info *info)
  					sk->sk_pacing_rate : ~0ULL;
  	info->tcpi_max_pacing_rate = sk->sk_max_pacing_rate != ~0U ?
  					sk->sk_max_pacing_rate : ~0ULL;
+
+#ifdef CONFIG_TCP_ESTATS
+	info->tcpi_estats_cid = (tp->tcp_stats && tp->tcp_stats->tcpe_cid > 0)
+					? tp->tcp_stats->tcpe_cid : 0;
+#endif
  }
  EXPORT_SYMBOL_GPL(tcp_get_info);
  
@@ -3101,6 +3117,9 @@ void __init tcp_init(void)
  		tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
  
  	tcp_metrics_init();
+
  	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
+	tcp_estats_init();
+
  	tcp_tasklet_init();
  }
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..e93929d 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -295,6 +295,8 @@ void tcp_slow_start(struct tcp_sock *tp, u32 acked)
  {
  	u32 cwnd = tp->snd_cwnd + acked;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SlowStart);
+
  	if (cwnd > tp->snd_ssthresh)
  		cwnd = tp->snd_ssthresh + 1;
  	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
@@ -304,6 +306,7 @@ EXPORT_SYMBOL_GPL(tcp_slow_start);
  /* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd (or alternative w) */
  void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w)
  {
+	TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	if (tp->snd_cwnd_cnt >= w) {
  		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
  			tp->snd_cwnd++;
diff --git a/net/ipv4/tcp_estats.c b/net/ipv4/tcp_estats.c
new file mode 100644
index 0000000..e817540
--- /dev/null
+++ b/net/ipv4/tcp_estats.c
@@ -0,0 +1,736 @@
+/*
+ * net/ipv4/tcp_estats.c
+ *
+ * Implementation of TCP ESTATS MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominin Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/export.h>
+#ifndef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+#include <linux/jiffies.h>
+#endif
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/string.h>
+#include <net/tcp_estats.h>
+#include <net/tcp.h>
+#include <asm/atomic.h>
+#include <asm/byteorder.h>
+
+#define ESTATS_INF32	0xffffffff
+
+#define ESTATS_MAX_CID	5000000
+
+extern int sysctl_tcp_estats;
+
+struct idr tcp_estats_idr;
+EXPORT_SYMBOL(tcp_estats_idr);
+static int next_id = 1;
+DEFINE_SPINLOCK(tcp_estats_idr_lock);
+EXPORT_SYMBOL(tcp_estats_idr_lock);
+
+int tcp_estats_wq_enabled __read_mostly = 0;
+EXPORT_SYMBOL(tcp_estats_wq_enabled);
+struct workqueue_struct *tcp_estats_wq = NULL;
+EXPORT_SYMBOL(tcp_estats_wq);
+void (*create_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(create_notify_func);
+void (*establish_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(establish_notify_func);
+void (*destroy_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(destroy_notify_func);
+unsigned long persist_delay = 0;
+EXPORT_SYMBOL(persist_delay);
+
+struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(tcp_estats_enabled);
+
+/* if HAVE_JUMP_LABEL is defined, then static_key_slow_inc/dec uses a
+ *   mutex in its implementation, and hence can't be called if in_interrupt().
+ * if HAVE_JUMP_LABEL is NOT defined, then no mutex is used, hence no need
+ *   for deferring enable/disable */
+#ifdef HAVE_JUMP_LABEL
+static atomic_t tcp_estats_enabled_deferred;
+
+static void tcp_estats_handle_deferred_enable_disable(void)
+{
+	int count = atomic_xchg(&tcp_estats_enabled_deferred, 0);
+
+	while (count > 0) {
+		static_key_slow_inc(&tcp_estats_enabled);
+		--count;
+	}
+
+	while (count < 0) {
+		static_key_slow_dec(&tcp_estats_enabled);
+		++count;
+	}
+}
+#endif
+
+static inline void tcp_estats_enable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_inc(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_inc(&tcp_estats_enabled);
+}
+
+static inline void tcp_estats_disable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_dec(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_dec(&tcp_estats_enabled);
+}
+
+/* Calculates the required amount of memory for any enabled tables. */
+int tcp_estats_get_allocation_size(int sysctl)
+{
+	int size = sizeof(struct tcp_estats) +
+		sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF)
+		size += sizeof(struct tcp_estats_perf_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH)
+		size += sizeof(struct tcp_estats_path_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK)
+		size += sizeof(struct tcp_estats_stack_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP)
+		size += sizeof(struct tcp_estats_app_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS)
+		size += sizeof(struct tcp_estats_extras_table);
+	return size;
+}
+
+/* Called whenever a TCP/IPv4 sock is created.
+ * net/ipv4/tcp_ipv4.c: tcp_v4_syn_recv_sock,
+ *			tcp_v4_init_sock
+ * Allocates a stats structure and initializes values.
+ */
+int tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype addrtype,
+		      int active)
+{
+	struct tcp_estats *stats;
+	struct tcp_estats_tables *tables;
+	struct tcp_sock *tp = tcp_sk(sk);
+	void *estats_mem;
+	int sysctl;
+	int ret;
+
+	/* Read the sysctl once before calculating memory needs and initializing
+	 * tables to avoid raciness. */
+	sysctl = ACCESS_ONCE(sysctl_tcp_estats);
+	if (likely(sysctl == TCP_ESTATS_TABLEMASK_INACTIVE)) {
+		return 0;
+	}
+
+	estats_mem = kzalloc(tcp_estats_get_allocation_size(sysctl), gfp_any());
+	if (!estats_mem)
+		return -ENOMEM;
+
+	stats = estats_mem;
+	estats_mem += sizeof(struct tcp_estats);
+
+	tables = &stats->tables;
+
+	tables->connection_table = estats_mem;
+	estats_mem += sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF) {
+		tables->perf_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_perf_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH) {
+		tables->path_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_path_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK) {
+		tables->stack_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_stack_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP) {
+		tables->app_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_app_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS) {
+		tables->extras_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_extras_table);
+	}
+
+	stats->tcpe_cid = -1;
+	stats->queued = 0;
+
+	tables->connection_table->AddressType = addrtype;
+
+	sock_hold(sk);
+	stats->sk = sk;
+	atomic_set(&stats->users, 0);
+
+	stats->limstate = TCP_ESTATS_SNDLIM_STARTUP;
+	stats->limstate_ts = ktime_get();
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->start_ts = stats->current_ts = stats->limstate_ts;
+#else
+	stats->start_ts = stats->current_ts = jiffies;
+#endif
+	do_gettimeofday(&stats->start_tv);
+
+	/* order is important -
+	 * must have stats hooked into tp and tcp_estats_enabled()
+	 * in order to have the TCP_ESTATS_VAR_<> macros work */
+	tp->tcp_stats = stats;
+	tcp_estats_enable();
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, ActiveOpen, active);
+
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->snd_nxt);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->snd_nxt);
+
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTT, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTO, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinMSS, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinSsthresh, ESTATS_INF32);
+
+	tcp_estats_use(stats);
+
+	if (tcp_estats_wq_enabled) {
+		tcp_estats_use(stats);
+		stats->queued = 1;
+		stats->tcpe_cid = 0;
+		INIT_WORK(&stats->create_notify, create_notify_func);
+		ret = queue_work(tcp_estats_wq, &stats->create_notify);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_estats_create);
+
+void tcp_estats_destroy(struct sock *sk)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+
+	if (stats == NULL)
+		return;
+
+	/* Attribute final sndlim time. */
+	tcp_estats_update_sndlim(tcp_sk(stats->sk), stats->limstate);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_DELAYED_WORK(&stats->destroy_notify,
+			destroy_notify_func);
+		queue_delayed_work(tcp_estats_wq, &stats->destroy_notify,
+			persist_delay);
+	}
+	tcp_estats_unuse(stats);
+}
+
+/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu. */
+void tcp_estats_free(struct rcu_head *rcu)
+{
+	struct tcp_estats *stats = container_of(rcu, struct tcp_estats, rcu);
+	tcp_estats_disable();
+	kfree(stats);
+}
+EXPORT_SYMBOL(tcp_estats_free);
+
+/* Called when a connection enters the ESTABLISHED state, and has all its
+ * state initialized.
+ * net/ipv4/tcp_input.c: tcp_rcv_state_process,
+ *			 tcp_rcv_synsent_state_process
+ * Here we link the statistics structure in so it is visible in the /proc
+ * fs, and do some final init.
+ */
+void tcp_estats_establish(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_connection_table *conn_table;
+
+	if (stats == NULL)
+		return;
+
+	conn_table = stats->tables.connection_table;
+
+	/* Let's set these here, since they can't change once the
+	 * connection is established.
+	 */
+	conn_table->LocalPort = inet->inet_num;
+	conn_table->RemPort = ntohs(inet->inet_dport);
+
+	if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV4) {
+		memcpy(&conn_table->LocalAddress.addr, &inet->inet_rcv_saddr,
+			sizeof(struct in_addr));
+		memcpy(&conn_table->RemAddress.addr, &inet->inet_daddr,
+			sizeof(struct in_addr));
+	}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	else if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV6) {
+		memcpy(&conn_table->LocalAddress.addr6, &(sk)->sk_v6_rcv_saddr,
+		       sizeof(struct in6_addr));
+		/* ipv6 daddr now uses a different struct than saddr */
+		memcpy(&conn_table->RemAddress.addr6, &(sk)->sk_v6_daddr,
+		       sizeof(struct in6_addr));
+	}
+#endif
+	else {
+		pr_err("TCP ESTATS: AddressType not valid.\n");
+	}
+
+	tcp_estats_update_finish_segrecv(tp);
+	tcp_estats_update_rwin_rcvd(tp);
+	tcp_estats_update_rwin_sent(tp);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, RecInitial, tp->rcv_nxt);
+
+	tcp_estats_update_sndlim(tp, TCP_ESTATS_SNDLIM_SENDER);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_WORK(&stats->establish_notify, establish_notify_func);
+		queue_work(tcp_estats_wq, &stats->establish_notify);
+	}
+}
+
+/*
+ * Statistics update functions
+ */
+
+void tcp_estats_update_snd_nxt(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table) {
+		if (after(tp->snd_nxt, stats->tables.app_table->SndMax))
+			stats->tables.app_table->SndMax = tp->snd_nxt;
+	}
+}
+
+void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table)
+		stats->tables.app_table->ThruOctetsAcked += ack - tp->snd_una;
+}
+
+void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+	unsigned long rtt_sample_msec = rtt_sample/1000;
+	u32 rto;
+
+	if (path_table == NULL)
+		return;
+
+	path_table->SampleRTT = rtt_sample_msec;
+
+	if (rtt_sample_msec > path_table->MaxRTT)
+		path_table->MaxRTT = rtt_sample_msec;
+	if (rtt_sample_msec < path_table->MinRTT)
+		path_table->MinRTT = rtt_sample_msec;
+
+	path_table->CountRTT++;
+	path_table->SumRTT += rtt_sample_msec;
+
+	rto = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	if (rto > path_table->MaxRTO)
+		path_table->MaxRTO = rto;
+	if (rto < path_table->MinRTO)
+		path_table->MinRTO = rto;
+}
+
+void tcp_estats_update_timeout(struct sock *sk)
+{
+	if (inet_csk(sk)->icsk_backoff)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, SubsequentTimeouts);
+	else
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), perf_table, Timeouts);
+
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, AbruptTimeouts);
+}
+
+void tcp_estats_update_mss(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_stack_table *stack_table = stats->tables.stack_table;
+	int mss = tp->mss_cache;
+
+	if (stack_table == NULL)
+		return;
+
+	if (mss > stack_table->MaxMSS)
+		stack_table->MaxMSS = mss;
+	if (mss < stack_table->MinMSS)
+		stack_table->MinMSS = mss;
+}
+
+void tcp_estats_update_finish_segrecv(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_tables *tables = &stats->tables;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	u32 mss = tp->mss_cache;
+	u32 cwnd;
+	u32 ssthresh;
+	u32 pipe_size;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (stack_table != NULL) {
+		cwnd = tp->snd_cwnd * mss;
+		if (tp->snd_cwnd <= tp->snd_ssthresh) {
+			if (cwnd > stack_table->MaxSsCwnd)
+				stack_table->MaxSsCwnd = cwnd;
+		} else if (cwnd > stack_table->MaxCaCwnd) {
+			stack_table->MaxCaCwnd = cwnd;
+		}
+	}
+
+	if (perf_table != NULL) {
+		pipe_size = tcp_packets_in_flight(tp) * mss;
+		if (pipe_size > perf_table->MaxPipeSize)
+			perf_table->MaxPipeSize = pipe_size;
+	}
+
+	/* Discard initiail ssthresh set at infinity. */
+	if (tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH) {
+		return;
+	}
+
+	if (stack_table != NULL) {
+		ssthresh = tp->snd_ssthresh * tp->mss_cache;
+		if (ssthresh > stack_table->MaxSsthresh)
+			stack_table->MaxSsthresh = ssthresh;
+		if (ssthresh < stack_table->MinSsthresh)
+			stack_table->MinSsthresh = ssthresh;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_finish_segrecv);
+
+void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->snd_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinRcvd)
+		perf_table->MaxRwinRcvd = win;
+	if (win == 0)
+		perf_table->ZeroRwinRcvd++;
+}
+
+void tcp_estats_update_rwin_sent(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->rcv_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinSent)
+		perf_table->MaxRwinSent = win;
+	if (win == 0)
+		perf_table->ZeroRwinSent++;
+}
+
+void tcp_estats_update_sndlim(struct tcp_sock *tp,
+			      enum tcp_estats_sndlim_states state)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	ktime_t now;
+
+	if (state <= TCP_ESTATS_SNDLIM_NONE ||
+	    state >= TCP_ESTATS_SNDLIM_NSTATES) {
+		pr_err("tcp_estats_update_sndlim: BUG: state out of range %d\n",
+		       state);
+		return;
+	}
+
+	if (perf_table == NULL)
+		return;
+
+	now = ktime_get();
+	perf_table->snd_lim_time[stats->limstate]
+	    += ktime_to_us(ktime_sub(now, stats->limstate_ts));
+	stats->limstate_ts = now;
+	if (stats->limstate != state) {
+		stats->limstate = state;
+		perf_table->snd_lim_trans[state]++;
+	}
+}
+
+void tcp_estats_update_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	TCP_ESTATS_VAR_INC(tp, perf_table, CongSignals);
+
+	if (path_table != NULL) {
+		path_table->PreCongSumCwnd += tp->snd_cwnd * tp->mss_cache;
+		path_table->PreCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_post_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	if (path_table != NULL) {
+		path_table->PostCongCountRTT++;
+		path_table->PostCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_segsend(struct sock *sk, int pcount,
+			       u32 seq, u32 end_seq, int flags)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	struct tcp_estats_app_table *app_table = stats->tables.app_table;
+
+	int data_len = end_seq - seq;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (perf_table == NULL)
+		return;
+
+	/* We know we're sending a segment. */
+	perf_table->SegsOut += pcount;
+
+	/* A pure ACK contains no data; everything else is data. */
+	if (data_len > 0) {
+		perf_table->DataSegsOut += pcount;
+		perf_table->DataOctetsOut += data_len;
+	}
+
+	/* Check for retransmission. */
+	if (flags & TCPHDR_SYN) {
+		if (inet_csk(sk)->icsk_retransmits)
+			perf_table->SegsRetrans++;
+	} else if (app_table != NULL &&
+		   before(seq, app_table->SndMax)) {
+		perf_table->SegsRetrans += pcount;
+		perf_table->OctetsRetrans += data_len;
+	}
+}
+
+void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_path_table *path_table = tables->path_table;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	struct tcphdr *th = tcp_hdr(skb);
+	struct iphdr *iph = ip_hdr(skb);
+
+	if (perf_table != NULL)
+		perf_table->SegsIn++;
+
+	if (skb->len == th->doff * 4) {
+		if (stack_table != NULL &&
+		    TCP_SKB_CB(skb)->ack_seq == tp->snd_una)
+			stack_table->DupAcksIn++;
+	} else {
+		if (perf_table != NULL) {
+			perf_table->DataSegsIn++;
+			perf_table->DataOctetsIn += skb->len - th->doff * 4;
+		}
+	}
+
+	if (path_table != NULL) {
+		path_table->IpTtl = iph->ttl;
+		path_table->IpTosIn = iph->tos;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_segrecv);
+
+void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq)
+{
+        /* After much debate, it was decided that "seq - rcv_nxt" is
+           indeed what we want, as opposed to what Krishnan suggested
+           to better match the RFC: "seq - tp->rcv_wup" */
+	TCP_ESTATS_VAR_ADD(tp, app_table, ThruOctetsReceived,
+			   seq - tp->rcv_nxt);
+}
+
+void tcp_estats_update_writeq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_app_table *app_table =
+			tp->tcp_stats->tables.app_table;
+	int len;
+
+	if (app_table == NULL)
+		return;
+
+	len = tp->write_seq - app_table->SndMax;
+
+	if (len > app_table->MaxAppWQueue)
+		app_table->MaxAppWQueue = len;
+}
+
+static inline u32 ofo_qlen(struct tcp_sock *tp)
+{
+	if (!skb_peek(&tp->out_of_order_queue))
+		return 0;
+	else
+		return TCP_SKB_CB(tp->out_of_order_queue.prev)->end_seq -
+		    TCP_SKB_CB(tp->out_of_order_queue.next)->seq;
+}
+
+void tcp_estats_update_recvq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_app_table *app_table = tables->app_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+
+	if (app_table != NULL) {
+		u32 len = tp->rcv_nxt - tp->copied_seq;
+		if (app_table->MaxAppRQueue < len)
+			app_table->MaxAppRQueue = len;
+	}
+
+	if (stack_table != NULL) {
+		u32 len = ofo_qlen(tp);
+		if (stack_table->MaxReasmQueue < len)
+			stack_table->MaxReasmQueue = len;
+	}
+}
+
+/*
+ * Manage connection ID table
+ */
+
+static int get_new_cid(struct tcp_estats *stats)
+{
+         int id_cid;
+
+again:
+         spin_lock_bh(&tcp_estats_idr_lock);
+         id_cid = idr_alloc(&tcp_estats_idr, stats, next_id, 0, GFP_KERNEL);
+         if (unlikely(id_cid == -ENOSPC)) {
+                 spin_unlock_bh(&tcp_estats_idr_lock);
+                 goto again;
+         }
+         if (unlikely(id_cid == -ENOMEM)) {
+                 spin_unlock_bh(&tcp_estats_idr_lock);
+                 return -ENOMEM;
+         }
+         next_id = (id_cid + 1) % ESTATS_MAX_CID;
+         stats->tcpe_cid = id_cid;
+         spin_unlock_bh(&tcp_estats_idr_lock);
+         return 0;
+}
+
+static void create_func(struct work_struct *work)
+{
+	/* stub for netlink notification of new connections */
+	;
+}
+
+static void establish_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						establish_notify);
+	int err = 0;
+
+	if ((stats->tcpe_cid) > 0) {
+		pr_err("TCP estats container established multiple times.\n");
+		return;
+	}
+
+	if ((stats->tcpe_cid) == 0) {
+		err = get_new_cid(stats);
+		if (err)
+			pr_devel("get_new_cid error %d\n", err);
+	}
+}
+
+static void destroy_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						destroy_notify.work);
+
+	int id_cid = stats->tcpe_cid;
+
+	if (id_cid == 0)
+		pr_devel("TCP estats destroyed before being established.\n");
+
+	if (id_cid >= 0) {
+		if (id_cid) {
+			spin_lock_bh(&tcp_estats_idr_lock);
+			idr_remove(&tcp_estats_idr, id_cid);
+			spin_unlock_bh(&tcp_estats_idr_lock);
+		}
+		stats->tcpe_cid = -1;
+
+		tcp_estats_unuse(stats);
+	}
+}
+
+void __init tcp_estats_init()
+{
+	idr_init(&tcp_estats_idr);
+
+	create_notify_func = &create_func;
+	establish_notify_func = &establish_func;
+	destroy_notify_func = &destroy_func;
+
+	persist_delay = TCP_ESTATS_PERSIST_DELAY_SECS * HZ;
+
+	tcp_estats_wq = alloc_workqueue("tcp_estats", WQ_MEM_RECLAIM, 256);
+	if (tcp_estats_wq == NULL) {
+		pr_err("tcp_estats_init(): alloc_workqueue failed\n");
+		goto cleanup_fail;
+	}
+
+	tcp_estats_wq_enabled = 1;
+	return;
+
+cleanup_fail:
+	pr_err("TCP ESTATS: initialization failed.\n");
+}
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 58469ff..5facb4c 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -251,6 +251,7 @@ static void htcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
  			tp->snd_cwnd_cnt += ca->pkts_acked;
  
  		ca->pkts_acked = 1;
+		TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	}
  }
  
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..8f0601b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -77,8 +77,10 @@
  #include <linux/errqueue.h>
  
  int sysctl_tcp_timestamps __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_timestamps);
  int sysctl_tcp_window_scaling __read_mostly = 1;
  int sysctl_tcp_sack __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_sack);
  int sysctl_tcp_fack __read_mostly = 1;
  int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
  int sysctl_tcp_max_reordering __read_mostly = 300;
@@ -231,13 +233,15 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
  			tcp_enter_quickack_mode((struct sock *)tp);
  		break;
  	case INET_ECN_CE:
+		TCP_ESTATS_VAR_INC(tp, path_table, CERcvd);
  		if (tcp_ca_needs_ecn((struct sock *)tp))
  			tcp_ca_event((struct sock *)tp, CA_EVENT_ECN_IS_CE);
-
  		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
  			/* Better not delay acks, sender can have a very low cwnd */
  			tcp_enter_quickack_mode((struct sock *)tp);
  			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+		} else {
+			TCP_ESTATS_VAR_INC(tp, path_table, ECESent);
  		}
  		tp->ecn_flags |= TCP_ECN_SEEN;
  		break;
@@ -1104,6 +1108,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  		dup_sack = true;
  		tcp_dsack_seen(tp);
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDSACKRECV);
+		TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  	} else if (num_sacks > 1) {
  		u32 end_seq_1 = get_unaligned_be32(&sp[1].end_seq);
  		u32 start_seq_1 = get_unaligned_be32(&sp[1].start_seq);
@@ -1114,6 +1119,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  			tcp_dsack_seen(tp);
  			NET_INC_STATS_BH(sock_net(sk),
  					LINUX_MIB_TCPDSACKOFORECV);
+			TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  		}
  	}
  
@@ -1653,6 +1659,9 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
  	state.reord = tp->packets_out;
  	state.rtt_us = -1L;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SACKsRcvd);
+	TCP_ESTATS_VAR_ADD(tp, stack_table, SACKBlocksRcvd, num_sacks);
+
  	if (!tp->sacked_out) {
  		if (WARN_ON(tp->fackets_out))
  			tp->fackets_out = 0;
@@ -1928,6 +1937,8 @@ void tcp_enter_loss(struct sock *sk)
  	bool new_recovery = false;
  	bool is_reneg;			/* is receiver reneging on SACKs? */
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+
  	/* Reduce ssthresh if it has not yet been made inside this window. */
  	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
  	    !after(tp->high_seq, tp->snd_una) ||
@@ -2200,8 +2211,12 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
  	 */
  	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
  	    (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
-	    !tcp_may_send_now(sk))
-		return !tcp_pause_early_retransmit(sk, flag);
+	    !tcp_may_send_now(sk)) {
+		int early_retrans = !tcp_pause_early_retransmit(sk, flag);
+		if (early_retrans)
+			TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetrans);
+		return early_retrans;
+	}
  
  	return false;
  }
@@ -2299,9 +2314,15 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
   */
  static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
  {
-	tp->snd_cwnd = min(tp->snd_cwnd,
-			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
-	tp->snd_cwnd_stamp = tcp_time_stamp;
+	u32 pkts = tcp_packets_in_flight(tp) + tcp_max_burst(tp);
+
+	if (pkts < tp->snd_cwnd) {
+		tp->snd_cwnd = pkts;
+		tp->snd_cwnd_stamp = tcp_time_stamp;
+
+		TCP_ESTATS_VAR_INC(tp, stack_table, OtherReductions);
+		TCP_ESTATS_VAR_INC(tp, extras_table, OtherReductionsCM);
+	}
  }
  
  /* Nothing was retransmitted or returned timestamp is less
@@ -2402,6 +2423,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
  		if (tp->prior_ssthresh > tp->snd_ssthresh) {
  			tp->snd_ssthresh = tp->prior_ssthresh;
  			tcp_ecn_withdraw_cwr(tp);
+			TCP_ESTATS_VAR_INC(tp, stack_table, CongOverCount);
  		}
  	} else {
  		tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh);
@@ -2428,10 +2450,15 @@ static bool tcp_try_undo_recovery(struct sock *sk)
  		 */
  		DBGUNDO(sk, inet_csk(sk)->icsk_ca_state == TCP_CA_Loss ? "loss" : "retrans");
  		tcp_undo_cwnd_reduction(sk, false);
-		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss)
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
  			mib_idx = LINUX_MIB_TCPLOSSUNDO;
-		else
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		} else {
  			mib_idx = LINUX_MIB_TCPFULLUNDO;
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousFrDetected);
+		}
  
  		NET_INC_STATS_BH(sock_net(sk), mib_idx);
  	}
@@ -2472,9 +2499,12 @@ static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
  
  		DBGUNDO(sk, "partial loss");
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
-		if (frto_undo)
+		if (frto_undo) {
  			NET_INC_STATS_BH(sock_net(sk),
  					 LINUX_MIB_TCPSPURIOUSRTOS);
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		}
  		inet_csk(sk)->icsk_retransmits = 0;
  		if (frto_undo || tcp_is_sack(tp))
  			tcp_set_ca_state(sk, TCP_CA_Open);
@@ -2555,6 +2585,7 @@ void tcp_enter_cwr(struct sock *sk)
  		tcp_init_cwnd_reduction(sk);
  		tcp_set_ca_state(sk, TCP_CA_CWR);
  	}
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
  }
  
  static void tcp_try_keep_open(struct sock *sk)
@@ -2580,8 +2611,10 @@ static void tcp_try_to_open(struct sock *sk, int flag, const int prior_unsacked)
  	if (!tcp_any_retrans_done(sk))
  		tp->retrans_stamp = 0;
  
-	if (flag & FLAG_ECE)
+	if (flag & FLAG_ECE) {
  		tcp_enter_cwr(sk);
+		TCP_ESTATS_VAR_INC(tp, path_table, ECNsignals);
+	}
  
  	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
  		tcp_try_keep_open(sk);
@@ -2826,6 +2859,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  			}
  			break;
  
+		case TCP_CA_Disorder:
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDAEpisodes);
+			break;
+
  		case TCP_CA_Recovery:
  			if (tcp_is_reno(tp))
  				tcp_reset_reno_sack(tp);
@@ -2870,6 +2907,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
  			tcp_try_undo_dsack(sk);
  
+
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDA);
+
  		if (!tcp_time_to_recover(sk, flag)) {
  			tcp_try_to_open(sk, flag, prior_unsacked);
  			return;
@@ -2889,6 +2930,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		/* Otherwise enter Recovery state */
  		tcp_enter_recovery(sk, (flag & FLAG_ECE));
  		fast_rexmit = 1;
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+		TCP_ESTATS_VAR_INC(tp, stack_table, FastRetran);
  	}
  
  	if (do_lost)
@@ -2928,6 +2971,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
  
  	tcp_rtt_estimator(sk, seq_rtt_us);
  	tcp_set_rto(sk);
+	TCP_ESTATS_UPDATE(tcp_sk(sk), tcp_estats_update_rtt(sk, seq_rtt_us));
  
  	/* RFC6298: only reset backoff on valid RTT measurement. */
  	inet_csk(sk)->icsk_backoff = 0;
@@ -3007,6 +3051,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
  	if (!tp->do_early_retrans)
  		return;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetransDelay);
  	tcp_enter_recovery(sk, false);
  	tcp_update_scoreboard(sk, 1);
  	tcp_xmit_retransmit_queue(sk);
@@ -3310,9 +3355,11 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
  				tp->max_window = nwin;
  				tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
  			}
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_rcvd(tp));
  		}
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  	tp->snd_una = ack;
  
  	return flag;
@@ -3410,6 +3457,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	int prior_packets = tp->packets_out;
  	const int prior_unsacked = tp->packets_out - tp->sacked_out;
  	int acked = 0; /* Number of packets newly acked */
+	int prior_state = icsk->icsk_ca_state;
  	long sack_rtt_us = -1L;
  
  	/* We very likely will need to access write queue head. */
@@ -3419,6 +3467,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	 * then we can probably ignore it.
  	 */
  	if (before(ack, prior_snd_una)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW);
  		/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
  		if (before(ack, prior_snd_una - tp->max_window)) {
  			tcp_send_challenge_ack(sk);
@@ -3430,8 +3481,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	/* If the ack includes data we haven't sent yet, discard
  	 * this segment (RFC793 Section 3.9).
  	 */
-	if (after(ack, tp->snd_nxt))
+	if (after(ack, tp->snd_nxt)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW);
  		goto invalid_ack;
+	}
  
  	if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
  	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
@@ -3439,6 +3494,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  
  	if (after(ack, prior_snd_una)) {
  		flag |= FLAG_SND_UNA_ADVANCED;
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_ADD(tp, path_table, SumOctetsReordered,
+					   ack - prior_snd_una);
  		icsk->icsk_retransmits = 0;
  	}
  
@@ -3456,6 +3514,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		 * Note, we use the fact that SND.UNA>=SND.WL2.
  		 */
  		tcp_update_wl(tp, ack_seq);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  		tp->snd_una = ack;
  		flag |= FLAG_WIN_UPDATE;
  
@@ -3510,6 +3569,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
  		tcp_fastretrans_alert(sk, acked, prior_unsacked,
  				      is_dupack, flag);
+		if (icsk->icsk_ca_state == TCP_CA_Open &&
+		    prior_state >= TCP_CA_CWR)
+			TCP_ESTATS_UPDATE(tp,
+				tcp_estats_update_post_congestion(tp));
  	}
  	if (tp->tlp_high_seq)
  		tcp_process_tlp_ack(sk, ack, flag);
@@ -4177,7 +4240,9 @@ static void tcp_ofo_queue(struct sock *sk)
  
  		tail = skb_peek_tail(&sk->sk_receive_queue);
  		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, tp->rcv_nxt));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+
  		if (!eaten)
  			__skb_queue_tail(&sk->sk_receive_queue, skb);
  		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
@@ -4232,6 +4297,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
  		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
  
+        TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+        TCP_ESTATS_VAR_INC(tp, path_table, DupAcksOut);
+
  	skb1 = skb_peek_tail(&tp->out_of_order_queue);
  	if (!skb1) {
  		/* Initial out of order segment, build 1 SACK. */
@@ -4242,6 +4310,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  						TCP_SKB_CB(skb)->end_seq;
  		}
  		__skb_queue_head(&tp->out_of_order_queue, skb);
+                TCP_ESTATS_VAR_INC(tp, path_table, DupAckEpisodes);
  		goto end;
  	}
  
@@ -4438,6 +4507,9 @@ queue_and_out:
  
  			eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
  		}
+		TCP_ESTATS_UPDATE(
+			tp,
+			tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  		if (skb->len)
  			tcp_event_data_recv(sk, skb);
@@ -4459,6 +4531,8 @@ queue_and_out:
  
  		tcp_fast_path_check(sk);
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		if (eaten > 0)
  			kfree_skb_partial(skb, fragstolen);
  		if (!sock_flag(sk, SOCK_DEAD))
@@ -4990,6 +5064,9 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  	    tcp_paws_discard(sk, skb)) {
  		if (!th->rst) {
  			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
+			TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+			TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+					   TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW);
  			tcp_send_dupack(sk, skb);
  			goto discard;
  		}
@@ -5004,6 +5081,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  		 * an acknowledgment should be sent in reply (unless the RST
  		 * bit is set, if so drop the segment and return)".
  		 */
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			before(TCP_SKB_CB(skb)->end_seq, tp->rcv_wup) ?
+				TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW :
+				TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW);
  		if (!th->rst) {
  			if (th->syn)
  				goto syn_challenge;
@@ -5152,6 +5234,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				return;
  			} else { /* Header too small */
  				TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+				TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+				TCP_ESTATS_VAR_SET(tp, stack_table,
+						   SoftErrorReason,
+						   TCP_ESTATS_SOFTERROR_OTHER);
  				goto discard;
  			}
  		} else {
@@ -5178,6 +5264,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  					tcp_rcv_rtt_measure_ts(sk, skb);
  
  					__skb_pull(skb, tcp_header_len);
+					TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  					tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  					NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
  					eaten = 1;
@@ -5204,10 +5291,12 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
  
  				/* Bulk data transfer: receiver */
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  				eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
  						      &fragstolen);
  			}
  
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
  			tcp_event_data_recv(sk, skb);
  
  			if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
@@ -5260,6 +5349,9 @@ step5:
  csum_error:
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_CSUMERRORS);
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			   TCP_ESTATS_SOFTERROR_DATA_CHECKSUM);
  
  discard:
  	__kfree_skb(skb);
@@ -5459,6 +5551,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  
  		tcp_finish_connect(sk, skb);
+		tcp_estats_establish(sk);
  
  		if ((tp->syn_fastopen || tp->syn_data) &&
  		    tcp_rcv_fastopen_synack(sk, skb, &foc))
@@ -5685,6 +5778,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  		tcp_set_state(sk, TCP_ESTABLISHED);
  		sk->sk_state_change(sk);
+		tcp_estats_establish(sk);
  
  		/* Note, that this wakeup is only for marginal crossed SYN case.
  		 * Passively open sockets are not waked up, because
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f72d7..9c85a54 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1310,6 +1310,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (!newsk)
  		goto exit_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_INACTIVE);
+
  	newsk->sk_gso_type = SKB_GSO_TCPV4;
  	inet_sk_rx_dst_set(newsk, skb);
  
@@ -1670,6 +1672,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1680,6 +1684,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1809,6 +1815,8 @@ static int tcp_v4_init_sock(struct sock *sk)
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
  #endif
  
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_ACTIVE);
+
  	return 0;
  }
  
@@ -1842,6 +1850,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
  	if (inet_csk(sk)->icsk_bind_hash)
  		inet_put_port(sk);
  
+	tcp_estats_destroy(sk);
+
  	BUG_ON(tp->fastopen_rsk != NULL);
  
  	/* If socket is aborted during connect operation */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..145b4f2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -80,6 +80,7 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
  
  	tcp_advance_send_head(sk, skb);
  	tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
  
  	tp->packets_out += tcp_skb_pcount(skb);
  	if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
@@ -292,6 +293,7 @@ static u16 tcp_select_window(struct sock *sk)
  	}
  	tp->rcv_wnd = new_win;
  	tp->rcv_wup = tp->rcv_nxt;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_sent(tp));
  
  	/* Make sure we do not exceed the maximum possible
  	 * scaled window.
@@ -905,6 +907,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  	struct tcp_md5sig_key *md5;
  	struct tcphdr *th;
  	int err;
+#ifdef CONFIG_TCP_ESTATS
+	__u32 seq;
+	__u32 end_seq;
+	int tcp_flags;
+	int pcount;
+#endif
  
  	BUG_ON(!skb || !tcp_skb_pcount(skb));
  
@@ -1008,6 +1016,15 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
  			      tcp_skb_pcount(skb));
  
+#ifdef CONFIG_TCP_ESTATS
+	/* If the skb isn't cloned, we can't reference it after
+	 * calling queue_xmit, so copy everything we need here. */
+	pcount = tcp_skb_pcount(skb);
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
+	tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
+#endif
+
  	/* OK, its time to fill skb_shinfo(skb)->gso_segs */
  	skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);
  
@@ -1020,10 +1037,17 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  
  	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
  
+	if (likely(!err)) {
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_segsend(sk, pcount,
+								seq, end_seq,
+								tcp_flags));
+	}
+
  	if (likely(err <= 0))
  		return err;
  
  	tcp_enter_cwr(sk);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SendStall);
  
  	return net_xmit_eval(err);
  }
@@ -1398,6 +1422,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
  	if (icsk->icsk_mtup.enabled)
  		mss_now = min(mss_now, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low));
  	tp->mss_cache = mss_now;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_mss(tp));
  
  	return mss_now;
  }
@@ -1670,11 +1695,13 @@ static unsigned int tcp_snd_test(const struct sock *sk, struct sk_buff *skb,
  	tcp_init_tso_segs(sk, skb, cur_mss);
  
  	if (!tcp_nagle_test(tp, skb, cur_mss, nonagle))
-		return 0;
+		return -TCP_ESTATS_SNDLIM_SENDER;
  
  	cwnd_quota = tcp_cwnd_test(tp, skb);
-	if (cwnd_quota && !tcp_snd_wnd_test(tp, skb, cur_mss))
-		cwnd_quota = 0;
+	if (!cwnd_quota)
+		return -TCP_ESTATS_SNDLIM_CWND;
+	if (!tcp_snd_wnd_test(tp, skb, cur_mss))
+		return -TCP_ESTATS_SNDLIM_RWIN;
  
  	return cwnd_quota;
  }
@@ -1688,7 +1715,7 @@ bool tcp_may_send_now(struct sock *sk)
  	return skb &&
  		tcp_snd_test(sk, skb, tcp_current_mss(sk),
  			     (tcp_skb_is_last(sk, skb) ?
-			      tp->nonagle : TCP_NAGLE_PUSH));
+			      tp->nonagle : TCP_NAGLE_PUSH)) > 0;
  }
  
  /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet
@@ -1978,6 +2005,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  	unsigned int tso_segs, sent_pkts;
  	int cwnd_quota;
  	int result;
+	int why = TCP_ESTATS_SNDLIM_SENDER;
  	bool is_cwnd_limited = false;
  	u32 max_segs;
  
@@ -2008,6 +2036,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		cwnd_quota = tcp_cwnd_test(tp, skb);
  		if (!cwnd_quota) {
+			why = TCP_ESTATS_SNDLIM_CWND;
  			is_cwnd_limited = true;
  			if (push_one == 2)
  				/* Force out a loss probe pkt. */
@@ -2016,19 +2045,24 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  				break;
  		}
  
-		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
+		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
+			why = TCP_ESTATS_SNDLIM_RWIN;
  			break;
-
+		}
+		
  		if (tso_segs == 1) {
  			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
  						     (tcp_skb_is_last(sk, skb) ?
  						      nonagle : TCP_NAGLE_PUSH))))
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		} else {
  			if (!push_one &&
  			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
-						 max_segs))
+						 max_segs)) {
+				why = TCP_ESTATS_SNDLIM_TSODEFER;
  				break;
+			}
  		}
  
  		limit = mss_now;
@@ -2041,6 +2075,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		if (skb->len > limit &&
  		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  		/* TCP Small Queues :
@@ -2064,10 +2099,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  			 */
  			smp_mb__after_atomic();
  			if (atomic_read(&sk->sk_wmem_alloc) > limit)
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		}
  
  		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  repair:
@@ -2080,9 +2117,12 @@ repair:
  		sent_pkts += tcp_skb_pcount(skb);
  
  		if (push_one)
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_sndlim(tp, why));
+
  	if (likely(sent_pkts)) {
  		if (tcp_in_cwnd_reduction(sk))
  			tp->prr_out += sent_pkts;
@@ -3148,11 +3188,16 @@ int tcp_connect(struct sock *sk)
  	 */
  	tp->snd_nxt = tp->write_seq;
  	tp->pushed_seq = tp->write_seq;
-	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
  
  	/* Timer for repeating the SYN until an answer. */
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
  				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->write_seq);
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->write_seq);
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
+	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
+
  	return 0;
  }
  EXPORT_SYMBOL(tcp_connect);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 1829c7f..0f6f1f4 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -477,6 +477,9 @@ out_reset_timer:
  		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
  	}
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+
+        TCP_ESTATS_UPDATE(tp, tcp_estats_update_timeout(sk));
+
  	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
  		__sk_dst_reset(sk);
  
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5ff8780..db1f88f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1131,6 +1131,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (newsk == NULL)
  		goto out_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_INACTIVE);
+
  	/*
  	 * No need to charge this sock to the relevant IPv6 refcnt debug socks
  	 * count here, tcp_create_openreq_child now does this for us, see the
@@ -1463,6 +1465,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1473,6 +1477,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1661,6 +1667,7 @@ static int tcp_v6_init_sock(struct sock *sk)
  #ifdef CONFIG_TCP_MD5SIG
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
  #endif
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_ACTIVE);
  
  	return 0;
  }
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:50 UTC (permalink / raw)
  To: netdev

This set of patches provide control and management routines for the
kernel instrument set (KIS). This set of patches can be applied
against net-next independently of the KIS. While the kernel can be
patched, compiled, and run with this patch set it provides no real
functionality without the KIS implementation.

The reason is that the development team is primarily focused on ensuring
that the KIS is taken up by the community. Alternative control and
management methods can be developed and implemented as long as the KIS
is in the kernel.

In order for this patch set to compile on its own we have included two
files/patches that were previously introduced in the KIS implementation.
These are include/net/tcp_estats.h and include/linux/tcp.h. If patching
against a source tree that includes the KIS implementation
net/ipv4/[tcp_estats.c, sysctl_net_ipv4.c, Kconfig, Makefile] are required.

---
  include/linux/tcp.h        |   8 +
  include/net/tcp_estats.h   | 376 +++++++++++++++++++++++
  net/ipv4/Kconfig           |  25 ++
  net/ipv4/Makefile          |   1 +
  net/ipv4/sysctl_net_ipv4.c |  14 +
  net/ipv4/tcp_estats.c      | 736 +++++++++++++++++++++++++++++++++++++++++++++
  6 files changed, 1160 insertions(+)
  create mode 100644 include/net/tcp_estats.h
  create mode 100644 net/ipv4/tcp_estats.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8758360 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -126,6 +126,10 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
  	return (struct tcp_request_sock *)req;
  }
  
+#ifdef CONFIG_TCP_ESTATS
+struct tcp_estats;
+#endif
+
  struct tcp_sock {
  	/* inet_connection_sock has to be the first member of tcp_sock */
  	struct inet_connection_sock	inet_conn;
@@ -309,6 +313,10 @@ struct tcp_sock {
  	struct tcp_md5sig_info	__rcu *md5sig_info;
  #endif
  
+#ifdef CONFIG_TCP_ESTATS
+	struct tcp_estats	*tcp_stats;
+#endif
+
  /* TCP fastopen related information */
  	struct tcp_fastopen_request *fastopen_req;
  	/* fastopen_rsk points to request_sock that resulted in this big
diff --git a/include/net/tcp_estats.h b/include/net/tcp_estats.h
new file mode 100644
index 0000000..ff6000e
--- /dev/null
+++ b/include/net/tcp_estats.h
@@ -0,0 +1,376 @@
+/*
+ * include/net/tcp_estats.h
+ *
+ * Implementation of TCP Extended Statistics MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominin Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _TCP_ESTATS_H
+#define _TCP_ESTATS_H
+
+#include <net/sock.h>
+#include <linux/idr.h>
+#include <linux/in.h>
+#include <linux/jump_label.h>
+#include <linux/spinlock.h>
+#include <linux/tcp.h>
+#include <linux/workqueue.h>
+
+/* defines number of seconds that stats persist after connection ends */
+#define TCP_ESTATS_PERSIST_DELAY_SECS 5
+
+enum tcp_estats_sndlim_states {
+	TCP_ESTATS_SNDLIM_NONE = -1,
+	TCP_ESTATS_SNDLIM_SENDER,
+	TCP_ESTATS_SNDLIM_CWND,
+	TCP_ESTATS_SNDLIM_RWIN,
+	TCP_ESTATS_SNDLIM_STARTUP,
+	TCP_ESTATS_SNDLIM_TSODEFER,
+	TCP_ESTATS_SNDLIM_PACE,
+	TCP_ESTATS_SNDLIM_NSTATES	/* Keep at end */
+};
+
+enum tcp_estats_addrtype {
+	TCP_ESTATS_ADDRTYPE_IPV4 = 1,
+	TCP_ESTATS_ADDRTYPE_IPV6 = 2
+};
+
+enum tcp_estats_softerror_reason {
+	TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW = 1,
+	TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW = 2,
+	TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW = 3,
+	TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW = 4,
+	TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW = 5,
+	TCP_ESTATS_SOFTERROR_ABOVE_TS_WINDOW = 6,
+	TCP_ESTATS_SOFTERROR_DATA_CHECKSUM = 7,
+	TCP_ESTATS_SOFTERROR_OTHER = 8,
+};
+
+#define TCP_ESTATS_INACTIVE	2
+#define TCP_ESTATS_ACTIVE	1
+
+#define TCP_ESTATS_TABLEMASK_INACTIVE	0x00
+#define TCP_ESTATS_TABLEMASK_ACTIVE	0x01
+#define TCP_ESTATS_TABLEMASK_PERF	0x02
+#define TCP_ESTATS_TABLEMASK_PATH	0x04
+#define TCP_ESTATS_TABLEMASK_STACK	0x08
+#define TCP_ESTATS_TABLEMASK_APP	0x10
+#define TCP_ESTATS_TABLEMASK_EXTRAS	0x40
+
+#ifdef CONFIG_TCP_ESTATS
+
+extern struct static_key tcp_estats_enabled;
+
+#define TCP_ESTATS_CHECK(tp, table, expr)				\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats) &&			\
+			    likely((tp)->tcp_stats->tables.table)) {	\
+				(expr);					\
+			}						\
+		}							\
+	} while (0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, ++((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_DEC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, --((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) += (val))
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) = (val))
+#define TCP_ESTATS_UPDATE(tp, func)					\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats)) {			\
+				(func);					\
+			}						\
+		}							\
+	} while (0)
+
+/*
+ * Variables that can be read and written directly.
+ *
+ * Contains all variables from RFC 4898. Commented fields are
+ * either not implemented (only StartTimeStamp
+ * remains unimplemented in this release) or have
+ * handlers and do not need struct storage.
+ */
+struct tcp_estats_connection_table {
+	u32			AddressType;
+	union { struct in_addr addr; struct in6_addr addr6; }	LocalAddress;
+	union { struct in_addr addr; struct in6_addr addr6; }	RemAddress;
+	u16			LocalPort;
+	u16			RemPort;
+};
+
+struct tcp_estats_perf_table {
+	u32		SegsOut;
+	u32		DataSegsOut;
+	u64		DataOctetsOut;
+	u32		SegsRetrans;
+	u32		OctetsRetrans;
+	u32		SegsIn;
+	u32		DataSegsIn;
+	u64		DataOctetsIn;
+	/*		ElapsedSecs */
+	/*		ElapsedMicroSecs */
+	/*		StartTimeStamp */
+	/*		CurMSS */
+	/*		PipeSize */
+	u32		MaxPipeSize;
+	/*		SmoothedRTT */
+	/*		CurRTO */
+	u32		CongSignals;
+	/*		CurCwnd */
+	/*		CurSsthresh */
+	u32		Timeouts;
+	/*		CurRwinSent */
+	u32		MaxRwinSent;
+	u32		ZeroRwinSent;
+	/*		CurRwinRcvd */
+	u32		MaxRwinRcvd;
+	u32		ZeroRwinRcvd;
+	/*		SndLimTransRwin */
+	/*		SndLimTransCwnd */
+	/*		SndLimTransSnd */
+	/*		SndLimTimeRwin */
+	/*		SndLimTimeCwnd */
+	/*		SndLimTimeSnd */
+	u32		snd_lim_trans[TCP_ESTATS_SNDLIM_NSTATES];
+	u32		snd_lim_time[TCP_ESTATS_SNDLIM_NSTATES];
+};
+
+struct tcp_estats_path_table {
+	/*		RetranThresh */
+	u32		NonRecovDAEpisodes;
+	u32		SumOctetsReordered;
+	u32		NonRecovDA;
+	u32		SampleRTT;
+	/*		RTTVar */
+	u32		MaxRTT;
+	u32		MinRTT;
+	u64		SumRTT;
+	u32		CountRTT;
+	u32		MaxRTO;
+	u32		MinRTO;
+	u8		IpTtl;
+	u8		IpTosIn;
+	/*		IpTosOut */
+	u32		PreCongSumCwnd;
+	u32		PreCongSumRTT;
+	u32		PostCongSumRTT;
+	u32		PostCongCountRTT;
+	u32		ECNsignals;
+	u32		DupAckEpisodes;
+	/*		RcvRTT */
+	u32		DupAcksOut;
+	u32		CERcvd;
+	u32		ECESent;
+};
+
+struct tcp_estats_stack_table {
+	u32		ActiveOpen;
+	/*		MSSSent */
+	/*		MSSRcvd */
+	/*		WinScaleSent */
+	/*		WinScaleRcvd */
+	/*		TimeStamps */
+	/*		ECN */
+	/*		WillSendSACK */
+	/*		WillUseSACK */
+	/*		State */
+	/*		Nagle */
+	u32		MaxSsCwnd;
+	u32		MaxCaCwnd;
+	u32		MaxSsthresh;
+	u32		MinSsthresh;
+	/*		InRecovery */
+	u32		DupAcksIn;
+	u32		SpuriousFrDetected;
+	u32		SpuriousRtoDetected;
+	u32		SoftErrors;
+	u32		SoftErrorReason;
+	u32		SlowStart;
+	u32		CongAvoid;
+	u32		OtherReductions;
+	u32		CongOverCount;
+	u32		FastRetran;
+	u32		SubsequentTimeouts;
+	/*		CurTimeoutCount */
+	u32		AbruptTimeouts;
+	u32		SACKsRcvd;
+	u32		SACKBlocksRcvd;
+	u32		SendStall;
+	u32		DSACKDups;
+	u32		MaxMSS;
+	u32		MinMSS;
+	u32		SndInitial;
+	u32		RecInitial;
+	/*		CurRetxQueue */
+	/*		MaxRetxQueue */
+	/*		CurReasmQueue */
+	u32		MaxReasmQueue;
+	u32		EarlyRetrans;
+	u32		EarlyRetransDelay;
+};
+
+struct tcp_estats_app_table {
+	/*		SndUna */
+	/*		SndNxt */
+	u32		SndMax;
+	u64		ThruOctetsAcked;
+	/*		RcvNxt */
+	u64		ThruOctetsReceived;
+	/*		CurAppWQueue */
+	u32		MaxAppWQueue;
+	/*		CurAppRQueue */
+	u32		MaxAppRQueue;
+};
+
+/*
+    currently, no backing store is needed for tuning elements in
+     web10g - they are all read or written to directly in other
+     data structures (such as the socket)
+*/
+
+struct tcp_estats_extras_table {
+	/*		OtherReductionsCV */
+	u32		OtherReductionsCM;
+	u32		Priority;
+};
+
+struct tcp_estats_tables {
+	struct tcp_estats_connection_table	*connection_table;
+	struct tcp_estats_perf_table		*perf_table;
+	struct tcp_estats_path_table		*path_table;
+	struct tcp_estats_stack_table		*stack_table;
+	struct tcp_estats_app_table		*app_table;
+	struct tcp_estats_extras_table		*extras_table;
+};
+
+struct tcp_estats {
+	int				tcpe_cid; /* idr map id */
+
+	struct sock			*sk;
+	kuid_t				uid;
+	kgid_t				gid;
+	int				ids;
+
+	atomic_t			users;
+
+	enum tcp_estats_sndlim_states	limstate;
+	ktime_t				limstate_ts;
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	ktime_t				start_ts;
+	ktime_t				current_ts;
+#else
+	unsigned long			start_ts;
+	unsigned long			current_ts;
+#endif
+	struct timeval			start_tv;
+
+        int				queued;
+        struct work_struct		create_notify;
+        struct work_struct		establish_notify;
+        struct delayed_work		destroy_notify;
+
+	struct tcp_estats_tables	tables;
+
+	struct rcu_head			rcu;
+};
+
+extern struct idr tcp_estats_idr;
+
+extern int tcp_estats_wq_enabled;
+extern struct workqueue_struct *tcp_estats_wq;
+extern void (*create_notify_func)(struct work_struct *work);
+extern void (*establish_notify_func)(struct work_struct *work);
+extern void (*destroy_notify_func)(struct work_struct *work);
+
+extern unsigned long persist_delay;
+extern spinlock_t tcp_estats_idr_lock;
+
+/* For the TCP code */
+extern int  tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype t,
+			      int active);
+extern void tcp_estats_destroy(struct sock *sk);
+extern void tcp_estats_establish(struct sock *sk);
+extern void tcp_estats_free(struct rcu_head *rcu);
+
+extern void tcp_estats_update_snd_nxt(struct tcp_sock *tp);
+extern void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack);
+extern void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample);
+extern void tcp_estats_update_timeout(struct sock *sk);
+extern void tcp_estats_update_mss(struct tcp_sock *tp);
+extern void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp);
+extern void tcp_estats_update_sndlim(struct tcp_sock *tp,
+				     enum tcp_estats_sndlim_states why);
+extern void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq);
+extern void tcp_estats_update_rwin_sent(struct tcp_sock *tp);
+extern void tcp_estats_update_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_post_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_segsend(struct sock *sk, int pcount,
+                                      u32 seq, u32 end_seq, int flags);
+extern void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb);
+extern void tcp_estats_update_finish_segrecv(struct tcp_sock *tp);
+extern void tcp_estats_update_writeq(struct sock *sk);
+extern void tcp_estats_update_recvq(struct sock *sk);
+
+extern void tcp_estats_init(void);
+
+static inline void tcp_estats_use(struct tcp_estats *stats)
+{
+	atomic_inc(&stats->users);
+}
+
+static inline int tcp_estats_use_if_valid(struct tcp_estats *stats)
+{
+	return atomic_inc_not_zero(&stats->users);
+}
+
+static inline void tcp_estats_unuse(struct tcp_estats *stats)
+{
+	if (atomic_dec_and_test(&stats->users)) {
+		sock_put(stats->sk);
+		stats->sk = NULL;
+		call_rcu(&stats->rcu, tcp_estats_free);
+	}
+}
+
+#else /* !CONFIG_TCP_ESTATS */
+
+#define tcp_estats_enabled	(0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_DEC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_UPDATE(tp, func)		do {} while (0)
+
+static inline void tcp_estats_init(void) { }
+static inline void tcp_estats_establish(struct sock *sk) { }
+static inline void tcp_estats_create(struct sock *sk,
+				     enum tcp_estats_addrtype t,
+				     int active) { }
+static inline void tcp_estats_destroy(struct sock *sk) { }
+
+#endif /* CONFIG_TCP_ESTATS */
+
+#endif /* _TCP_ESTATS_H */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index bd29016..c04ba8f 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -680,3 +680,28 @@ config TCP_MD5SIG
  	  on the Internet.
  
  	  If unsure, say N.
+
+config TCP_ESTATS
+	bool "TCP: Extended TCP statistics (RFC4898) MIB"
+	---help---
+	  RFC 4898 specifies a number of extended statistics for TCP. This
+	  data can be accessed using netlink. See http://www.web10g.org for
+	  more details.
+
+if TCP_ESTATS
+
+config TCP_ESTATS_STRICT_ELAPSEDTIME	
+	bool "TCP: ESTATS strict ElapsedSecs/Msecs counters"
+	depends on TCP_ESTATS
+	default n
+	---help---
+	  Elapsed time since beginning of connection.
+	  RFC4898 defines ElapsedSecs/Msecs as being updated via ktime_get
+	  at each protocol event (sending or receiving of a segment);
+	  as this can be a performance hit, leaving this config option off
+	  will update elapsed based on on the jiffies counter instead.
+	  Set to Y for strict conformance with the MIB.
+
+	  If unsure, say N.
+
+endif
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 518c04e..7e2c69a 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_INET_TUNNEL) += tunnel4.o
  obj-$(CONFIG_INET_XFRM_MODE_TRANSPORT) += xfrm4_mode_transport.o
  obj-$(CONFIG_INET_XFRM_MODE_TUNNEL) += xfrm4_mode_tunnel.o
  obj-$(CONFIG_IP_PNP) += ipconfig.o
+obj-$(CONFIG_TCP_ESTATS) += tcp_estats.o
  obj-$(CONFIG_NETFILTER)	+= netfilter.o netfilter/
  obj-$(CONFIG_INET_DIAG) += inet_diag.o
  obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0ee384..edc5a66 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -42,6 +42,11 @@ static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
  static int ip_ping_group_range_min[] = { 0, 0 };
  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
  
+/* Extended statistics (RFC4898). */
+#ifdef CONFIG_TCP_ESTATS
+int sysctl_tcp_estats __read_mostly;
+#endif  /* CONFIG_TCP_ESTATS */
+
  /* Update system visible IP port range */
  static void set_local_port_range(struct net *net, int range[2])
  {
@@ -767,6 +772,15 @@ static struct ctl_table ipv4_table[] = {
  		.proc_handler	= proc_dointvec_minmax,
  		.extra1		= &one
  	},
+#ifdef CONFIG_TCP_ESTATS
+	{
+		.procname	= "tcp_estats",
+		.data		= &sysctl_tcp_estats,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif /* CONFIG TCP ESTATS */
  	{ }
  };
  
diff --git a/net/ipv4/tcp_estats.c b/net/ipv4/tcp_estats.c
new file mode 100644
index 0000000..e817540
--- /dev/null
+++ b/net/ipv4/tcp_estats.c
@@ -0,0 +1,736 @@
+/*
+ * net/ipv4/tcp_estats.c
+ *
+ * Implementation of TCP ESTATS MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominin Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/export.h>
+#ifndef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+#include <linux/jiffies.h>
+#endif
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/string.h>
+#include <net/tcp_estats.h>
+#include <net/tcp.h>
+#include <asm/atomic.h>
+#include <asm/byteorder.h>
+
+#define ESTATS_INF32	0xffffffff
+
+#define ESTATS_MAX_CID	5000000
+
+extern int sysctl_tcp_estats;
+
+struct idr tcp_estats_idr;
+EXPORT_SYMBOL(tcp_estats_idr);
+static int next_id = 1;
+DEFINE_SPINLOCK(tcp_estats_idr_lock);
+EXPORT_SYMBOL(tcp_estats_idr_lock);
+
+int tcp_estats_wq_enabled __read_mostly = 0;
+EXPORT_SYMBOL(tcp_estats_wq_enabled);
+struct workqueue_struct *tcp_estats_wq = NULL;
+EXPORT_SYMBOL(tcp_estats_wq);
+void (*create_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(create_notify_func);
+void (*establish_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(establish_notify_func);
+void (*destroy_notify_func)(struct work_struct *work);
+EXPORT_SYMBOL(destroy_notify_func);
+unsigned long persist_delay = 0;
+EXPORT_SYMBOL(persist_delay);
+
+struct static_key tcp_estats_enabled __read_mostly = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(tcp_estats_enabled);
+
+/* if HAVE_JUMP_LABEL is defined, then static_key_slow_inc/dec uses a
+ *   mutex in its implementation, and hence can't be called if in_interrupt().
+ * if HAVE_JUMP_LABEL is NOT defined, then no mutex is used, hence no need
+ *   for deferring enable/disable */
+#ifdef HAVE_JUMP_LABEL
+static atomic_t tcp_estats_enabled_deferred;
+
+static void tcp_estats_handle_deferred_enable_disable(void)
+{
+	int count = atomic_xchg(&tcp_estats_enabled_deferred, 0);
+
+	while (count > 0) {
+		static_key_slow_inc(&tcp_estats_enabled);
+		--count;
+	}
+
+	while (count < 0) {
+		static_key_slow_dec(&tcp_estats_enabled);
+		++count;
+	}
+}
+#endif
+
+static inline void tcp_estats_enable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_inc(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_inc(&tcp_estats_enabled);
+}
+
+static inline void tcp_estats_disable(void)
+{
+#ifdef HAVE_JUMP_LABEL
+	if (in_interrupt()) {
+		atomic_dec(&tcp_estats_enabled_deferred);
+		return;
+	}
+	tcp_estats_handle_deferred_enable_disable();
+#endif
+	static_key_slow_dec(&tcp_estats_enabled);
+}
+
+/* Calculates the required amount of memory for any enabled tables. */
+int tcp_estats_get_allocation_size(int sysctl)
+{
+	int size = sizeof(struct tcp_estats) +
+		sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF)
+		size += sizeof(struct tcp_estats_perf_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH)
+		size += sizeof(struct tcp_estats_path_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK)
+		size += sizeof(struct tcp_estats_stack_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP)
+		size += sizeof(struct tcp_estats_app_table);
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS)
+		size += sizeof(struct tcp_estats_extras_table);
+	return size;
+}
+
+/* Called whenever a TCP/IPv4 sock is created.
+ * net/ipv4/tcp_ipv4.c: tcp_v4_syn_recv_sock,
+ *			tcp_v4_init_sock
+ * Allocates a stats structure and initializes values.
+ */
+int tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype addrtype,
+		      int active)
+{
+	struct tcp_estats *stats;
+	struct tcp_estats_tables *tables;
+	struct tcp_sock *tp = tcp_sk(sk);
+	void *estats_mem;
+	int sysctl;
+	int ret;
+
+	/* Read the sysctl once before calculating memory needs and initializing
+	 * tables to avoid raciness. */
+	sysctl = ACCESS_ONCE(sysctl_tcp_estats);
+	if (likely(sysctl == TCP_ESTATS_TABLEMASK_INACTIVE)) {
+		return 0;
+	}
+
+	estats_mem = kzalloc(tcp_estats_get_allocation_size(sysctl), gfp_any());
+	if (!estats_mem)
+		return -ENOMEM;
+
+	stats = estats_mem;
+	estats_mem += sizeof(struct tcp_estats);
+
+	tables = &stats->tables;
+
+	tables->connection_table = estats_mem;
+	estats_mem += sizeof(struct tcp_estats_connection_table);
+
+	if (sysctl & TCP_ESTATS_TABLEMASK_PERF) {
+		tables->perf_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_perf_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_PATH) {
+		tables->path_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_path_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_STACK) {
+		tables->stack_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_stack_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_APP) {
+		tables->app_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_app_table);
+	}
+	if (sysctl & TCP_ESTATS_TABLEMASK_EXTRAS) {
+		tables->extras_table = estats_mem;
+		estats_mem += sizeof(struct tcp_estats_extras_table);
+	}
+
+	stats->tcpe_cid = -1;
+	stats->queued = 0;
+
+	tables->connection_table->AddressType = addrtype;
+
+	sock_hold(sk);
+	stats->sk = sk;
+	atomic_set(&stats->users, 0);
+
+	stats->limstate = TCP_ESTATS_SNDLIM_STARTUP;
+	stats->limstate_ts = ktime_get();
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->start_ts = stats->current_ts = stats->limstate_ts;
+#else
+	stats->start_ts = stats->current_ts = jiffies;
+#endif
+	do_gettimeofday(&stats->start_tv);
+
+	/* order is important -
+	 * must have stats hooked into tp and tcp_estats_enabled()
+	 * in order to have the TCP_ESTATS_VAR_<> macros work */
+	tp->tcp_stats = stats;
+	tcp_estats_enable();
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, ActiveOpen, active);
+
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->snd_nxt);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->snd_nxt);
+
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTT, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, path_table, MinRTO, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinMSS, ESTATS_INF32);
+	TCP_ESTATS_VAR_SET(tp, stack_table, MinSsthresh, ESTATS_INF32);
+
+	tcp_estats_use(stats);
+
+	if (tcp_estats_wq_enabled) {
+		tcp_estats_use(stats);
+		stats->queued = 1;
+		stats->tcpe_cid = 0;
+		INIT_WORK(&stats->create_notify, create_notify_func);
+		ret = queue_work(tcp_estats_wq, &stats->create_notify);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_estats_create);
+
+void tcp_estats_destroy(struct sock *sk)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+
+	if (stats == NULL)
+		return;
+
+	/* Attribute final sndlim time. */
+	tcp_estats_update_sndlim(tcp_sk(stats->sk), stats->limstate);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_DELAYED_WORK(&stats->destroy_notify,
+			destroy_notify_func);
+		queue_delayed_work(tcp_estats_wq, &stats->destroy_notify,
+			persist_delay);
+	}
+	tcp_estats_unuse(stats);
+}
+
+/* Do not call directly.  Called from tcp_estats_unuse() through call_rcu. */
+void tcp_estats_free(struct rcu_head *rcu)
+{
+	struct tcp_estats *stats = container_of(rcu, struct tcp_estats, rcu);
+	tcp_estats_disable();
+	kfree(stats);
+}
+EXPORT_SYMBOL(tcp_estats_free);
+
+/* Called when a connection enters the ESTABLISHED state, and has all its
+ * state initialized.
+ * net/ipv4/tcp_input.c: tcp_rcv_state_process,
+ *			 tcp_rcv_synsent_state_process
+ * Here we link the statistics structure in so it is visible in the /proc
+ * fs, and do some final init.
+ */
+void tcp_estats_establish(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_connection_table *conn_table;
+
+	if (stats == NULL)
+		return;
+
+	conn_table = stats->tables.connection_table;
+
+	/* Let's set these here, since they can't change once the
+	 * connection is established.
+	 */
+	conn_table->LocalPort = inet->inet_num;
+	conn_table->RemPort = ntohs(inet->inet_dport);
+
+	if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV4) {
+		memcpy(&conn_table->LocalAddress.addr, &inet->inet_rcv_saddr,
+			sizeof(struct in_addr));
+		memcpy(&conn_table->RemAddress.addr, &inet->inet_daddr,
+			sizeof(struct in_addr));
+	}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	else if (conn_table->AddressType == TCP_ESTATS_ADDRTYPE_IPV6) {
+		memcpy(&conn_table->LocalAddress.addr6, &(sk)->sk_v6_rcv_saddr,
+		       sizeof(struct in6_addr));
+		/* ipv6 daddr now uses a different struct than saddr */
+		memcpy(&conn_table->RemAddress.addr6, &(sk)->sk_v6_daddr,
+		       sizeof(struct in6_addr));
+	}
+#endif
+	else {
+		pr_err("TCP ESTATS: AddressType not valid.\n");
+	}
+
+	tcp_estats_update_finish_segrecv(tp);
+	tcp_estats_update_rwin_rcvd(tp);
+	tcp_estats_update_rwin_sent(tp);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, RecInitial, tp->rcv_nxt);
+
+	tcp_estats_update_sndlim(tp, TCP_ESTATS_SNDLIM_SENDER);
+
+	if (tcp_estats_wq_enabled && stats->queued) {
+		INIT_WORK(&stats->establish_notify, establish_notify_func);
+		queue_work(tcp_estats_wq, &stats->establish_notify);
+	}
+}
+
+/*
+ * Statistics update functions
+ */
+
+void tcp_estats_update_snd_nxt(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table) {
+		if (after(tp->snd_nxt, stats->tables.app_table->SndMax))
+			stats->tables.app_table->SndMax = tp->snd_nxt;
+	}
+}
+
+void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+
+	if (stats->tables.app_table)
+		stats->tables.app_table->ThruOctetsAcked += ack - tp->snd_una;
+}
+
+void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+	unsigned long rtt_sample_msec = rtt_sample/1000;
+	u32 rto;
+
+	if (path_table == NULL)
+		return;
+
+	path_table->SampleRTT = rtt_sample_msec;
+
+	if (rtt_sample_msec > path_table->MaxRTT)
+		path_table->MaxRTT = rtt_sample_msec;
+	if (rtt_sample_msec < path_table->MinRTT)
+		path_table->MinRTT = rtt_sample_msec;
+
+	path_table->CountRTT++;
+	path_table->SumRTT += rtt_sample_msec;
+
+	rto = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	if (rto > path_table->MaxRTO)
+		path_table->MaxRTO = rto;
+	if (rto < path_table->MinRTO)
+		path_table->MinRTO = rto;
+}
+
+void tcp_estats_update_timeout(struct sock *sk)
+{
+	if (inet_csk(sk)->icsk_backoff)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, SubsequentTimeouts);
+	else
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), perf_table, Timeouts);
+
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open)
+		TCP_ESTATS_VAR_INC(tcp_sk(sk), stack_table, AbruptTimeouts);
+}
+
+void tcp_estats_update_mss(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_stack_table *stack_table = stats->tables.stack_table;
+	int mss = tp->mss_cache;
+
+	if (stack_table == NULL)
+		return;
+
+	if (mss > stack_table->MaxMSS)
+		stack_table->MaxMSS = mss;
+	if (mss < stack_table->MinMSS)
+		stack_table->MinMSS = mss;
+}
+
+void tcp_estats_update_finish_segrecv(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_tables *tables = &stats->tables;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	u32 mss = tp->mss_cache;
+	u32 cwnd;
+	u32 ssthresh;
+	u32 pipe_size;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (stack_table != NULL) {
+		cwnd = tp->snd_cwnd * mss;
+		if (tp->snd_cwnd <= tp->snd_ssthresh) {
+			if (cwnd > stack_table->MaxSsCwnd)
+				stack_table->MaxSsCwnd = cwnd;
+		} else if (cwnd > stack_table->MaxCaCwnd) {
+			stack_table->MaxCaCwnd = cwnd;
+		}
+	}
+
+	if (perf_table != NULL) {
+		pipe_size = tcp_packets_in_flight(tp) * mss;
+		if (pipe_size > perf_table->MaxPipeSize)
+			perf_table->MaxPipeSize = pipe_size;
+	}
+
+	/* Discard initiail ssthresh set at infinity. */
+	if (tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH) {
+		return;
+	}
+
+	if (stack_table != NULL) {
+		ssthresh = tp->snd_ssthresh * tp->mss_cache;
+		if (ssthresh > stack_table->MaxSsthresh)
+			stack_table->MaxSsthresh = ssthresh;
+		if (ssthresh < stack_table->MinSsthresh)
+			stack_table->MinSsthresh = ssthresh;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_finish_segrecv);
+
+void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->snd_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinRcvd)
+		perf_table->MaxRwinRcvd = win;
+	if (win == 0)
+		perf_table->ZeroRwinRcvd++;
+}
+
+void tcp_estats_update_rwin_sent(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	u32 win = tp->rcv_wnd;
+
+	if (perf_table == NULL)
+		return;
+
+	if (win > perf_table->MaxRwinSent)
+		perf_table->MaxRwinSent = win;
+	if (win == 0)
+		perf_table->ZeroRwinSent++;
+}
+
+void tcp_estats_update_sndlim(struct tcp_sock *tp,
+			      enum tcp_estats_sndlim_states state)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	ktime_t now;
+
+	if (state <= TCP_ESTATS_SNDLIM_NONE ||
+	    state >= TCP_ESTATS_SNDLIM_NSTATES) {
+		pr_err("tcp_estats_update_sndlim: BUG: state out of range %d\n",
+		       state);
+		return;
+	}
+
+	if (perf_table == NULL)
+		return;
+
+	now = ktime_get();
+	perf_table->snd_lim_time[stats->limstate]
+	    += ktime_to_us(ktime_sub(now, stats->limstate_ts));
+	stats->limstate_ts = now;
+	if (stats->limstate != state) {
+		stats->limstate = state;
+		perf_table->snd_lim_trans[state]++;
+	}
+}
+
+void tcp_estats_update_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	TCP_ESTATS_VAR_INC(tp, perf_table, CongSignals);
+
+	if (path_table != NULL) {
+		path_table->PreCongSumCwnd += tp->snd_cwnd * tp->mss_cache;
+		path_table->PreCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_post_congestion(struct tcp_sock *tp)
+{
+	struct tcp_estats *stats = tp->tcp_stats;
+	struct tcp_estats_path_table *path_table = stats->tables.path_table;
+
+	if (path_table != NULL) {
+		path_table->PostCongCountRTT++;
+		path_table->PostCongSumRTT += path_table->SampleRTT;
+	}
+}
+
+void tcp_estats_update_segsend(struct sock *sk, int pcount,
+			       u32 seq, u32 end_seq, int flags)
+{
+	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
+	struct tcp_estats_perf_table *perf_table = stats->tables.perf_table;
+	struct tcp_estats_app_table *app_table = stats->tables.app_table;
+
+	int data_len = end_seq - seq;
+
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	stats->current_ts = ktime_get();
+#else
+	stats->current_ts = jiffies;
+#endif
+
+	if (perf_table == NULL)
+		return;
+
+	/* We know we're sending a segment. */
+	perf_table->SegsOut += pcount;
+
+	/* A pure ACK contains no data; everything else is data. */
+	if (data_len > 0) {
+		perf_table->DataSegsOut += pcount;
+		perf_table->DataOctetsOut += data_len;
+	}
+
+	/* Check for retransmission. */
+	if (flags & TCPHDR_SYN) {
+		if (inet_csk(sk)->icsk_retransmits)
+			perf_table->SegsRetrans++;
+	} else if (app_table != NULL &&
+		   before(seq, app_table->SndMax)) {
+		perf_table->SegsRetrans += pcount;
+		perf_table->OctetsRetrans += data_len;
+	}
+}
+
+void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_path_table *path_table = tables->path_table;
+	struct tcp_estats_perf_table *perf_table = tables->perf_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+	struct tcphdr *th = tcp_hdr(skb);
+	struct iphdr *iph = ip_hdr(skb);
+
+	if (perf_table != NULL)
+		perf_table->SegsIn++;
+
+	if (skb->len == th->doff * 4) {
+		if (stack_table != NULL &&
+		    TCP_SKB_CB(skb)->ack_seq == tp->snd_una)
+			stack_table->DupAcksIn++;
+	} else {
+		if (perf_table != NULL) {
+			perf_table->DataSegsIn++;
+			perf_table->DataOctetsIn += skb->len - th->doff * 4;
+		}
+	}
+
+	if (path_table != NULL) {
+		path_table->IpTtl = iph->ttl;
+		path_table->IpTosIn = iph->tos;
+	}
+}
+EXPORT_SYMBOL(tcp_estats_update_segrecv);
+
+void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq)
+{
+        /* After much debate, it was decided that "seq - rcv_nxt" is
+           indeed what we want, as opposed to what Krishnan suggested
+           to better match the RFC: "seq - tp->rcv_wup" */
+	TCP_ESTATS_VAR_ADD(tp, app_table, ThruOctetsReceived,
+			   seq - tp->rcv_nxt);
+}
+
+void tcp_estats_update_writeq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_app_table *app_table =
+			tp->tcp_stats->tables.app_table;
+	int len;
+
+	if (app_table == NULL)
+		return;
+
+	len = tp->write_seq - app_table->SndMax;
+
+	if (len > app_table->MaxAppWQueue)
+		app_table->MaxAppWQueue = len;
+}
+
+static inline u32 ofo_qlen(struct tcp_sock *tp)
+{
+	if (!skb_peek(&tp->out_of_order_queue))
+		return 0;
+	else
+		return TCP_SKB_CB(tp->out_of_order_queue.prev)->end_seq -
+		    TCP_SKB_CB(tp->out_of_order_queue.next)->seq;
+}
+
+void tcp_estats_update_recvq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_estats_tables *tables = &tp->tcp_stats->tables;
+	struct tcp_estats_app_table *app_table = tables->app_table;
+	struct tcp_estats_stack_table *stack_table = tables->stack_table;
+
+	if (app_table != NULL) {
+		u32 len = tp->rcv_nxt - tp->copied_seq;
+		if (app_table->MaxAppRQueue < len)
+			app_table->MaxAppRQueue = len;
+	}
+
+	if (stack_table != NULL) {
+		u32 len = ofo_qlen(tp);
+		if (stack_table->MaxReasmQueue < len)
+			stack_table->MaxReasmQueue = len;
+	}
+}
+
+/*
+ * Manage connection ID table
+ */
+
+static int get_new_cid(struct tcp_estats *stats)
+{
+         int id_cid;
+
+again:
+         spin_lock_bh(&tcp_estats_idr_lock);
+         id_cid = idr_alloc(&tcp_estats_idr, stats, next_id, 0, GFP_KERNEL);
+         if (unlikely(id_cid == -ENOSPC)) {
+                 spin_unlock_bh(&tcp_estats_idr_lock);
+                 goto again;
+         }
+         if (unlikely(id_cid == -ENOMEM)) {
+                 spin_unlock_bh(&tcp_estats_idr_lock);
+                 return -ENOMEM;
+         }
+         next_id = (id_cid + 1) % ESTATS_MAX_CID;
+         stats->tcpe_cid = id_cid;
+         spin_unlock_bh(&tcp_estats_idr_lock);
+         return 0;
+}
+
+static void create_func(struct work_struct *work)
+{
+	/* stub for netlink notification of new connections */
+	;
+}
+
+static void establish_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						establish_notify);
+	int err = 0;
+
+	if ((stats->tcpe_cid) > 0) {
+		pr_err("TCP estats container established multiple times.\n");
+		return;
+	}
+
+	if ((stats->tcpe_cid) == 0) {
+		err = get_new_cid(stats);
+		if (err)
+			pr_devel("get_new_cid error %d\n", err);
+	}
+}
+
+static void destroy_func(struct work_struct *work)
+{
+	struct tcp_estats *stats = container_of(work, struct tcp_estats,
+						destroy_notify.work);
+
+	int id_cid = stats->tcpe_cid;
+
+	if (id_cid == 0)
+		pr_devel("TCP estats destroyed before being established.\n");
+
+	if (id_cid >= 0) {
+		if (id_cid) {
+			spin_lock_bh(&tcp_estats_idr_lock);
+			idr_remove(&tcp_estats_idr, id_cid);
+			spin_unlock_bh(&tcp_estats_idr_lock);
+		}
+		stats->tcpe_cid = -1;
+
+		tcp_estats_unuse(stats);
+	}
+}
+
+void __init tcp_estats_init()
+{
+	idr_init(&tcp_estats_idr);
+
+	create_notify_func = &create_func;
+	establish_notify_func = &establish_func;
+	destroy_notify_func = &destroy_func;
+
+	persist_delay = TCP_ESTATS_PERSIST_DELAY_SECS * HZ;
+
+	tcp_estats_wq = alloc_workqueue("tcp_estats", WQ_MEM_RECLAIM, 256);
+	if (tcp_estats_wq == NULL) {
+		pr_err("tcp_estats_init(): alloc_workqueue failed\n");
+		goto cleanup_fail;
+	}
+
+	tcp_estats_wq_enabled = 1;
+	return;
+
+cleanup_fail:
+	pr_err("TCP ESTATS: initialization failed.\n");
+}
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 1/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:50 UTC (permalink / raw)
  To: netdev

This patch provides the kernel instrument set. While this patch
compiles and runs it does not have control and management capabilities.
These are provided in the next patch submission.

---
  include/linux/tcp.h      |   8 +
  include/net/tcp.h        |   1 +
  include/net/tcp_estats.h | 376 +++++++++++++++++++++++++++++++++++++++++++++++
  include/uapi/linux/tcp.h |   6 +-
  net/ipv4/tcp.c           |  21 ++-
  net/ipv4/tcp_cong.c      |   3 +
  net/ipv4/tcp_htcp.c      |   1 +
  net/ipv4/tcp_input.c     | 116 +++++++++++++--
  net/ipv4/tcp_ipv4.c      |  10 ++
  net/ipv4/tcp_output.c    |  61 +++++++-
  net/ipv4/tcp_timer.c     |   3 +
  net/ipv6/tcp_ipv6.c      |   7 +
  12 files changed, 592 insertions(+), 21 deletions(-)
  create mode 100644 include/net/tcp_estats.h

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8758360 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -126,6 +126,10 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
  	return (struct tcp_request_sock *)req;
  }
  
+#ifdef CONFIG_TCP_ESTATS
+struct tcp_estats;
+#endif
+
  struct tcp_sock {
  	/* inet_connection_sock has to be the first member of tcp_sock */
  	struct inet_connection_sock	inet_conn;
@@ -309,6 +313,10 @@ struct tcp_sock {
  	struct tcp_md5sig_info	__rcu *md5sig_info;
  #endif
  
+#ifdef CONFIG_TCP_ESTATS
+	struct tcp_estats	*tcp_stats;
+#endif
+
  /* TCP fastopen related information */
  	struct tcp_fastopen_request *fastopen_req;
  	/* fastopen_rsk points to request_sock that resulted in this big
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..9f7e31e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -43,6 +43,7 @@
  #include <net/tcp_states.h>
  #include <net/inet_ecn.h>
  #include <net/dst.h>
+#include <net/tcp_estats.h>
  
  #include <linux/seq_file.h>
  #include <linux/memcontrol.h>
diff --git a/include/net/tcp_estats.h b/include/net/tcp_estats.h
new file mode 100644
index 0000000..ff6000e
--- /dev/null
+++ b/include/net/tcp_estats.h
@@ -0,0 +1,376 @@
+/*
+ * include/net/tcp_estats.h
+ *
+ * Implementation of TCP Extended Statistics MIB (RFC 4898)
+ *
+ * Authors:
+ *   John Estabrook <jsestabrook@gmail.com>
+ *   Andrew K. Adams <akadams@psc.edu>
+ *   Kevin Hogan <kwabena@google.com>
+ *   Dominin Hamon <dma@stripysock.com>
+ *   John Heffner <johnwheffner@gmail.com>
+ *
+ * The Web10Gig project.  See http://www.web10gig.org
+ *
+ * Copyright © 2011, Pittsburgh Supercomputing Center (PSC).
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _TCP_ESTATS_H
+#define _TCP_ESTATS_H
+
+#include <net/sock.h>
+#include <linux/idr.h>
+#include <linux/in.h>
+#include <linux/jump_label.h>
+#include <linux/spinlock.h>
+#include <linux/tcp.h>
+#include <linux/workqueue.h>
+
+/* defines number of seconds that stats persist after connection ends */
+#define TCP_ESTATS_PERSIST_DELAY_SECS 5
+
+enum tcp_estats_sndlim_states {
+	TCP_ESTATS_SNDLIM_NONE = -1,
+	TCP_ESTATS_SNDLIM_SENDER,
+	TCP_ESTATS_SNDLIM_CWND,
+	TCP_ESTATS_SNDLIM_RWIN,
+	TCP_ESTATS_SNDLIM_STARTUP,
+	TCP_ESTATS_SNDLIM_TSODEFER,
+	TCP_ESTATS_SNDLIM_PACE,
+	TCP_ESTATS_SNDLIM_NSTATES	/* Keep at end */
+};
+
+enum tcp_estats_addrtype {
+	TCP_ESTATS_ADDRTYPE_IPV4 = 1,
+	TCP_ESTATS_ADDRTYPE_IPV6 = 2
+};
+
+enum tcp_estats_softerror_reason {
+	TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW = 1,
+	TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW = 2,
+	TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW = 3,
+	TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW = 4,
+	TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW = 5,
+	TCP_ESTATS_SOFTERROR_ABOVE_TS_WINDOW = 6,
+	TCP_ESTATS_SOFTERROR_DATA_CHECKSUM = 7,
+	TCP_ESTATS_SOFTERROR_OTHER = 8,
+};
+
+#define TCP_ESTATS_INACTIVE	2
+#define TCP_ESTATS_ACTIVE	1
+
+#define TCP_ESTATS_TABLEMASK_INACTIVE	0x00
+#define TCP_ESTATS_TABLEMASK_ACTIVE	0x01
+#define TCP_ESTATS_TABLEMASK_PERF	0x02
+#define TCP_ESTATS_TABLEMASK_PATH	0x04
+#define TCP_ESTATS_TABLEMASK_STACK	0x08
+#define TCP_ESTATS_TABLEMASK_APP	0x10
+#define TCP_ESTATS_TABLEMASK_EXTRAS	0x40
+
+#ifdef CONFIG_TCP_ESTATS
+
+extern struct static_key tcp_estats_enabled;
+
+#define TCP_ESTATS_CHECK(tp, table, expr)				\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats) &&			\
+			    likely((tp)->tcp_stats->tables.table)) {	\
+				(expr);					\
+			}						\
+		}							\
+	} while (0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, ++((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_DEC(tp, table, var)				\
+	TCP_ESTATS_CHECK(tp, table, --((tp)->tcp_stats->tables.table->var))
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) += (val))
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)				\
+	TCP_ESTATS_CHECK(tp, table,					\
+			 ((tp)->tcp_stats->tables.table->var) = (val))
+#define TCP_ESTATS_UPDATE(tp, func)					\
+	do {								\
+		if (static_key_false(&tcp_estats_enabled)) {		\
+			if (likely((tp)->tcp_stats)) {			\
+				(func);					\
+			}						\
+		}							\
+	} while (0)
+
+/*
+ * Variables that can be read and written directly.
+ *
+ * Contains all variables from RFC 4898. Commented fields are
+ * either not implemented (only StartTimeStamp
+ * remains unimplemented in this release) or have
+ * handlers and do not need struct storage.
+ */
+struct tcp_estats_connection_table {
+	u32			AddressType;
+	union { struct in_addr addr; struct in6_addr addr6; }	LocalAddress;
+	union { struct in_addr addr; struct in6_addr addr6; }	RemAddress;
+	u16			LocalPort;
+	u16			RemPort;
+};
+
+struct tcp_estats_perf_table {
+	u32		SegsOut;
+	u32		DataSegsOut;
+	u64		DataOctetsOut;
+	u32		SegsRetrans;
+	u32		OctetsRetrans;
+	u32		SegsIn;
+	u32		DataSegsIn;
+	u64		DataOctetsIn;
+	/*		ElapsedSecs */
+	/*		ElapsedMicroSecs */
+	/*		StartTimeStamp */
+	/*		CurMSS */
+	/*		PipeSize */
+	u32		MaxPipeSize;
+	/*		SmoothedRTT */
+	/*		CurRTO */
+	u32		CongSignals;
+	/*		CurCwnd */
+	/*		CurSsthresh */
+	u32		Timeouts;
+	/*		CurRwinSent */
+	u32		MaxRwinSent;
+	u32		ZeroRwinSent;
+	/*		CurRwinRcvd */
+	u32		MaxRwinRcvd;
+	u32		ZeroRwinRcvd;
+	/*		SndLimTransRwin */
+	/*		SndLimTransCwnd */
+	/*		SndLimTransSnd */
+	/*		SndLimTimeRwin */
+	/*		SndLimTimeCwnd */
+	/*		SndLimTimeSnd */
+	u32		snd_lim_trans[TCP_ESTATS_SNDLIM_NSTATES];
+	u32		snd_lim_time[TCP_ESTATS_SNDLIM_NSTATES];
+};
+
+struct tcp_estats_path_table {
+	/*		RetranThresh */
+	u32		NonRecovDAEpisodes;
+	u32		SumOctetsReordered;
+	u32		NonRecovDA;
+	u32		SampleRTT;
+	/*		RTTVar */
+	u32		MaxRTT;
+	u32		MinRTT;
+	u64		SumRTT;
+	u32		CountRTT;
+	u32		MaxRTO;
+	u32		MinRTO;
+	u8		IpTtl;
+	u8		IpTosIn;
+	/*		IpTosOut */
+	u32		PreCongSumCwnd;
+	u32		PreCongSumRTT;
+	u32		PostCongSumRTT;
+	u32		PostCongCountRTT;
+	u32		ECNsignals;
+	u32		DupAckEpisodes;
+	/*		RcvRTT */
+	u32		DupAcksOut;
+	u32		CERcvd;
+	u32		ECESent;
+};
+
+struct tcp_estats_stack_table {
+	u32		ActiveOpen;
+	/*		MSSSent */
+	/*		MSSRcvd */
+	/*		WinScaleSent */
+	/*		WinScaleRcvd */
+	/*		TimeStamps */
+	/*		ECN */
+	/*		WillSendSACK */
+	/*		WillUseSACK */
+	/*		State */
+	/*		Nagle */
+	u32		MaxSsCwnd;
+	u32		MaxCaCwnd;
+	u32		MaxSsthresh;
+	u32		MinSsthresh;
+	/*		InRecovery */
+	u32		DupAcksIn;
+	u32		SpuriousFrDetected;
+	u32		SpuriousRtoDetected;
+	u32		SoftErrors;
+	u32		SoftErrorReason;
+	u32		SlowStart;
+	u32		CongAvoid;
+	u32		OtherReductions;
+	u32		CongOverCount;
+	u32		FastRetran;
+	u32		SubsequentTimeouts;
+	/*		CurTimeoutCount */
+	u32		AbruptTimeouts;
+	u32		SACKsRcvd;
+	u32		SACKBlocksRcvd;
+	u32		SendStall;
+	u32		DSACKDups;
+	u32		MaxMSS;
+	u32		MinMSS;
+	u32		SndInitial;
+	u32		RecInitial;
+	/*		CurRetxQueue */
+	/*		MaxRetxQueue */
+	/*		CurReasmQueue */
+	u32		MaxReasmQueue;
+	u32		EarlyRetrans;
+	u32		EarlyRetransDelay;
+};
+
+struct tcp_estats_app_table {
+	/*		SndUna */
+	/*		SndNxt */
+	u32		SndMax;
+	u64		ThruOctetsAcked;
+	/*		RcvNxt */
+	u64		ThruOctetsReceived;
+	/*		CurAppWQueue */
+	u32		MaxAppWQueue;
+	/*		CurAppRQueue */
+	u32		MaxAppRQueue;
+};
+
+/*
+    currently, no backing store is needed for tuning elements in
+     web10g - they are all read or written to directly in other
+     data structures (such as the socket)
+*/
+
+struct tcp_estats_extras_table {
+	/*		OtherReductionsCV */
+	u32		OtherReductionsCM;
+	u32		Priority;
+};
+
+struct tcp_estats_tables {
+	struct tcp_estats_connection_table	*connection_table;
+	struct tcp_estats_perf_table		*perf_table;
+	struct tcp_estats_path_table		*path_table;
+	struct tcp_estats_stack_table		*stack_table;
+	struct tcp_estats_app_table		*app_table;
+	struct tcp_estats_extras_table		*extras_table;
+};
+
+struct tcp_estats {
+	int				tcpe_cid; /* idr map id */
+
+	struct sock			*sk;
+	kuid_t				uid;
+	kgid_t				gid;
+	int				ids;
+
+	atomic_t			users;
+
+	enum tcp_estats_sndlim_states	limstate;
+	ktime_t				limstate_ts;
+#ifdef CONFIG_TCP_ESTATS_STRICT_ELAPSEDTIME
+	ktime_t				start_ts;
+	ktime_t				current_ts;
+#else
+	unsigned long			start_ts;
+	unsigned long			current_ts;
+#endif
+	struct timeval			start_tv;
+
+        int				queued;
+        struct work_struct		create_notify;
+        struct work_struct		establish_notify;
+        struct delayed_work		destroy_notify;
+
+	struct tcp_estats_tables	tables;
+
+	struct rcu_head			rcu;
+};
+
+extern struct idr tcp_estats_idr;
+
+extern int tcp_estats_wq_enabled;
+extern struct workqueue_struct *tcp_estats_wq;
+extern void (*create_notify_func)(struct work_struct *work);
+extern void (*establish_notify_func)(struct work_struct *work);
+extern void (*destroy_notify_func)(struct work_struct *work);
+
+extern unsigned long persist_delay;
+extern spinlock_t tcp_estats_idr_lock;
+
+/* For the TCP code */
+extern int  tcp_estats_create(struct sock *sk, enum tcp_estats_addrtype t,
+			      int active);
+extern void tcp_estats_destroy(struct sock *sk);
+extern void tcp_estats_establish(struct sock *sk);
+extern void tcp_estats_free(struct rcu_head *rcu);
+
+extern void tcp_estats_update_snd_nxt(struct tcp_sock *tp);
+extern void tcp_estats_update_acked(struct tcp_sock *tp, u32 ack);
+extern void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample);
+extern void tcp_estats_update_timeout(struct sock *sk);
+extern void tcp_estats_update_mss(struct tcp_sock *tp);
+extern void tcp_estats_update_rwin_rcvd(struct tcp_sock *tp);
+extern void tcp_estats_update_sndlim(struct tcp_sock *tp,
+				     enum tcp_estats_sndlim_states why);
+extern void tcp_estats_update_rcvd(struct tcp_sock *tp, u32 seq);
+extern void tcp_estats_update_rwin_sent(struct tcp_sock *tp);
+extern void tcp_estats_update_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_post_congestion(struct tcp_sock *tp);
+extern void tcp_estats_update_segsend(struct sock *sk, int pcount,
+                                      u32 seq, u32 end_seq, int flags);
+extern void tcp_estats_update_segrecv(struct tcp_sock *tp, struct sk_buff *skb);
+extern void tcp_estats_update_finish_segrecv(struct tcp_sock *tp);
+extern void tcp_estats_update_writeq(struct sock *sk);
+extern void tcp_estats_update_recvq(struct sock *sk);
+
+extern void tcp_estats_init(void);
+
+static inline void tcp_estats_use(struct tcp_estats *stats)
+{
+	atomic_inc(&stats->users);
+}
+
+static inline int tcp_estats_use_if_valid(struct tcp_estats *stats)
+{
+	return atomic_inc_not_zero(&stats->users);
+}
+
+static inline void tcp_estats_unuse(struct tcp_estats *stats)
+{
+	if (atomic_dec_and_test(&stats->users)) {
+		sock_put(stats->sk);
+		stats->sk = NULL;
+		call_rcu(&stats->rcu, tcp_estats_free);
+	}
+}
+
+#else /* !CONFIG_TCP_ESTATS */
+
+#define tcp_estats_enabled	(0)
+
+#define TCP_ESTATS_VAR_INC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_DEC(tp, table, var)	do {} while (0)
+#define TCP_ESTATS_VAR_ADD(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_VAR_SET(tp, table, var, val)	do {} while (0)
+#define TCP_ESTATS_UPDATE(tp, func)		do {} while (0)
+
+static inline void tcp_estats_init(void) { }
+static inline void tcp_estats_establish(struct sock *sk) { }
+static inline void tcp_estats_create(struct sock *sk,
+				     enum tcp_estats_addrtype t,
+				     int active) { }
+static inline void tcp_estats_destroy(struct sock *sk) { }
+
+#endif /* CONFIG_TCP_ESTATS */
+
+#endif /* _TCP_ESTATS_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 3b97183..5dae043 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -186,9 +186,13 @@ struct tcp_info {
  	__u32	tcpi_rcv_space;
  
  	__u32	tcpi_total_retrans;
-
  	__u64	tcpi_pacing_rate;
  	__u64	tcpi_max_pacing_rate;
+
+#ifdef CONFIG_TCP_ESTATS
+	/* RFC 4898 extended stats Info */
+	__u32	tcpi_estats_cid;
+#endif
  };
  
  /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3075723..698dbb7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,10 @@ void tcp_init_sock(struct sock *sk)
  	sk->sk_sndbuf = sysctl_tcp_wmem[1];
  	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
  
+#ifdef CONFIG_TCP_ESTATS
+	tp->tcp_stats = NULL;
+#endif
+
  	local_bh_disable();
  	sock_update_memcg(sk);
  	sk_sockets_allocated_inc(sk);
@@ -972,6 +976,9 @@ wait_for_memory:
  		tcp_push(sk, flags & ~MSG_MORE, mss_now,
  			 TCP_NAGLE_PUSH, size_goal);
  
+		if (copied)
+                        TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+
  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  			goto do_error;
  
@@ -1264,9 +1271,11 @@ new_segment:
  wait_for_sndbuf:
  			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
  wait_for_memory:
-			if (copied)
+			if (copied) {
  				tcp_push(sk, flags & ~MSG_MORE, mss_now,
  					 TCP_NAGLE_PUSH, size_goal);
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_writeq(sk));
+			}
  
  			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
  				goto do_error;
@@ -1658,6 +1667,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
  			     *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt, flags);
  		}
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		/* Well, if we have backlog, try to process it now yet. */
  
  		if (copied >= target && !sk->sk_backlog.tail)
@@ -2684,6 +2695,11 @@ void tcp_get_info(const struct sock *sk, struct tcp_info *info)
  					sk->sk_pacing_rate : ~0ULL;
  	info->tcpi_max_pacing_rate = sk->sk_max_pacing_rate != ~0U ?
  					sk->sk_max_pacing_rate : ~0ULL;
+
+#ifdef CONFIG_TCP_ESTATS
+	info->tcpi_estats_cid = (tp->tcp_stats && tp->tcp_stats->tcpe_cid > 0)
+					? tp->tcp_stats->tcpe_cid : 0;
+#endif
  }
  EXPORT_SYMBOL_GPL(tcp_get_info);
  
@@ -3101,6 +3117,9 @@ void __init tcp_init(void)
  		tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
  
  	tcp_metrics_init();
+
  	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
+	tcp_estats_init();
+
  	tcp_tasklet_init();
  }
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..e93929d 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -295,6 +295,8 @@ void tcp_slow_start(struct tcp_sock *tp, u32 acked)
  {
  	u32 cwnd = tp->snd_cwnd + acked;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SlowStart);
+
  	if (cwnd > tp->snd_ssthresh)
  		cwnd = tp->snd_ssthresh + 1;
  	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
@@ -304,6 +306,7 @@ EXPORT_SYMBOL_GPL(tcp_slow_start);
  /* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd (or alternative w) */
  void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w)
  {
+	TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	if (tp->snd_cwnd_cnt >= w) {
  		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
  			tp->snd_cwnd++;
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 58469ff..5facb4c 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -251,6 +251,7 @@ static void htcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
  			tp->snd_cwnd_cnt += ca->pkts_acked;
  
  		ca->pkts_acked = 1;
+		TCP_ESTATS_VAR_INC(tp, stack_table, CongAvoid);
  	}
  }
  
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..8f0601b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -77,8 +77,10 @@
  #include <linux/errqueue.h>
  
  int sysctl_tcp_timestamps __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_timestamps);
  int sysctl_tcp_window_scaling __read_mostly = 1;
  int sysctl_tcp_sack __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_sack);
  int sysctl_tcp_fack __read_mostly = 1;
  int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
  int sysctl_tcp_max_reordering __read_mostly = 300;
@@ -231,13 +233,15 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
  			tcp_enter_quickack_mode((struct sock *)tp);
  		break;
  	case INET_ECN_CE:
+		TCP_ESTATS_VAR_INC(tp, path_table, CERcvd);
  		if (tcp_ca_needs_ecn((struct sock *)tp))
  			tcp_ca_event((struct sock *)tp, CA_EVENT_ECN_IS_CE);
-
  		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
  			/* Better not delay acks, sender can have a very low cwnd */
  			tcp_enter_quickack_mode((struct sock *)tp);
  			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+		} else {
+			TCP_ESTATS_VAR_INC(tp, path_table, ECESent);
  		}
  		tp->ecn_flags |= TCP_ECN_SEEN;
  		break;
@@ -1104,6 +1108,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  		dup_sack = true;
  		tcp_dsack_seen(tp);
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDSACKRECV);
+		TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  	} else if (num_sacks > 1) {
  		u32 end_seq_1 = get_unaligned_be32(&sp[1].end_seq);
  		u32 start_seq_1 = get_unaligned_be32(&sp[1].start_seq);
@@ -1114,6 +1119,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
  			tcp_dsack_seen(tp);
  			NET_INC_STATS_BH(sock_net(sk),
  					LINUX_MIB_TCPDSACKOFORECV);
+			TCP_ESTATS_VAR_INC(tp, stack_table, DSACKDups);
  		}
  	}
  
@@ -1653,6 +1659,9 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
  	state.reord = tp->packets_out;
  	state.rtt_us = -1L;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, SACKsRcvd);
+	TCP_ESTATS_VAR_ADD(tp, stack_table, SACKBlocksRcvd, num_sacks);
+
  	if (!tp->sacked_out) {
  		if (WARN_ON(tp->fackets_out))
  			tp->fackets_out = 0;
@@ -1928,6 +1937,8 @@ void tcp_enter_loss(struct sock *sk)
  	bool new_recovery = false;
  	bool is_reneg;			/* is receiver reneging on SACKs? */
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+
  	/* Reduce ssthresh if it has not yet been made inside this window. */
  	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
  	    !after(tp->high_seq, tp->snd_una) ||
@@ -2200,8 +2211,12 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
  	 */
  	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
  	    (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
-	    !tcp_may_send_now(sk))
-		return !tcp_pause_early_retransmit(sk, flag);
+	    !tcp_may_send_now(sk)) {
+		int early_retrans = !tcp_pause_early_retransmit(sk, flag);
+		if (early_retrans)
+			TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetrans);
+		return early_retrans;
+	}
  
  	return false;
  }
@@ -2299,9 +2314,15 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
   */
  static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
  {
-	tp->snd_cwnd = min(tp->snd_cwnd,
-			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
-	tp->snd_cwnd_stamp = tcp_time_stamp;
+	u32 pkts = tcp_packets_in_flight(tp) + tcp_max_burst(tp);
+
+	if (pkts < tp->snd_cwnd) {
+		tp->snd_cwnd = pkts;
+		tp->snd_cwnd_stamp = tcp_time_stamp;
+
+		TCP_ESTATS_VAR_INC(tp, stack_table, OtherReductions);
+		TCP_ESTATS_VAR_INC(tp, extras_table, OtherReductionsCM);
+	}
  }
  
  /* Nothing was retransmitted or returned timestamp is less
@@ -2402,6 +2423,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
  		if (tp->prior_ssthresh > tp->snd_ssthresh) {
  			tp->snd_ssthresh = tp->prior_ssthresh;
  			tcp_ecn_withdraw_cwr(tp);
+			TCP_ESTATS_VAR_INC(tp, stack_table, CongOverCount);
  		}
  	} else {
  		tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh);
@@ -2428,10 +2450,15 @@ static bool tcp_try_undo_recovery(struct sock *sk)
  		 */
  		DBGUNDO(sk, inet_csk(sk)->icsk_ca_state == TCP_CA_Loss ? "loss" : "retrans");
  		tcp_undo_cwnd_reduction(sk, false);
-		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss)
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
  			mib_idx = LINUX_MIB_TCPLOSSUNDO;
-		else
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		} else {
  			mib_idx = LINUX_MIB_TCPFULLUNDO;
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousFrDetected);
+		}
  
  		NET_INC_STATS_BH(sock_net(sk), mib_idx);
  	}
@@ -2472,9 +2499,12 @@ static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
  
  		DBGUNDO(sk, "partial loss");
  		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
-		if (frto_undo)
+		if (frto_undo) {
  			NET_INC_STATS_BH(sock_net(sk),
  					 LINUX_MIB_TCPSPURIOUSRTOS);
+			TCP_ESTATS_VAR_INC(tp, stack_table,
+					   SpuriousRtoDetected);
+		}
  		inet_csk(sk)->icsk_retransmits = 0;
  		if (frto_undo || tcp_is_sack(tp))
  			tcp_set_ca_state(sk, TCP_CA_Open);
@@ -2555,6 +2585,7 @@ void tcp_enter_cwr(struct sock *sk)
  		tcp_init_cwnd_reduction(sk);
  		tcp_set_ca_state(sk, TCP_CA_CWR);
  	}
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
  }
  
  static void tcp_try_keep_open(struct sock *sk)
@@ -2580,8 +2611,10 @@ static void tcp_try_to_open(struct sock *sk, int flag, const int prior_unsacked)
  	if (!tcp_any_retrans_done(sk))
  		tp->retrans_stamp = 0;
  
-	if (flag & FLAG_ECE)
+	if (flag & FLAG_ECE) {
  		tcp_enter_cwr(sk);
+		TCP_ESTATS_VAR_INC(tp, path_table, ECNsignals);
+	}
  
  	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
  		tcp_try_keep_open(sk);
@@ -2826,6 +2859,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  			}
  			break;
  
+		case TCP_CA_Disorder:
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDAEpisodes);
+			break;
+
  		case TCP_CA_Recovery:
  			if (tcp_is_reno(tp))
  				tcp_reset_reno_sack(tp);
@@ -2870,6 +2907,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
  			tcp_try_undo_dsack(sk);
  
+
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_INC(tp, path_table, NonRecovDA);
+
  		if (!tcp_time_to_recover(sk, flag)) {
  			tcp_try_to_open(sk, flag, prior_unsacked);
  			return;
@@ -2889,6 +2930,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
  		/* Otherwise enter Recovery state */
  		tcp_enter_recovery(sk, (flag & FLAG_ECE));
  		fast_rexmit = 1;
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_congestion(tp));
+		TCP_ESTATS_VAR_INC(tp, stack_table, FastRetran);
  	}
  
  	if (do_lost)
@@ -2928,6 +2971,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
  
  	tcp_rtt_estimator(sk, seq_rtt_us);
  	tcp_set_rto(sk);
+	TCP_ESTATS_UPDATE(tcp_sk(sk), tcp_estats_update_rtt(sk, seq_rtt_us));
  
  	/* RFC6298: only reset backoff on valid RTT measurement. */
  	inet_csk(sk)->icsk_backoff = 0;
@@ -3007,6 +3051,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
  	if (!tp->do_early_retrans)
  		return;
  
+	TCP_ESTATS_VAR_INC(tp, stack_table, EarlyRetransDelay);
  	tcp_enter_recovery(sk, false);
  	tcp_update_scoreboard(sk, 1);
  	tcp_xmit_retransmit_queue(sk);
@@ -3310,9 +3355,11 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
  				tp->max_window = nwin;
  				tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
  			}
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_rcvd(tp));
  		}
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  	tp->snd_una = ack;
  
  	return flag;
@@ -3410,6 +3457,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	int prior_packets = tp->packets_out;
  	const int prior_unsacked = tp->packets_out - tp->sacked_out;
  	int acked = 0; /* Number of packets newly acked */
+	int prior_state = icsk->icsk_ca_state;
  	long sack_rtt_us = -1L;
  
  	/* We very likely will need to access write queue head. */
@@ -3419,6 +3467,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	 * then we can probably ignore it.
  	 */
  	if (before(ack, prior_snd_una)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_BELOW_ACK_WINDOW);
  		/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
  		if (before(ack, prior_snd_una - tp->max_window)) {
  			tcp_send_challenge_ack(sk);
@@ -3430,8 +3481,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  	/* If the ack includes data we haven't sent yet, discard
  	 * this segment (RFC793 Section 3.9).
  	 */
-	if (after(ack, tp->snd_nxt))
+	if (after(ack, tp->snd_nxt)) {
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+				   TCP_ESTATS_SOFTERROR_ABOVE_ACK_WINDOW);
  		goto invalid_ack;
+	}
  
  	if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
  	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
@@ -3439,6 +3494,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  
  	if (after(ack, prior_snd_una)) {
  		flag |= FLAG_SND_UNA_ADVANCED;
+		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+			TCP_ESTATS_VAR_ADD(tp, path_table, SumOctetsReordered,
+					   ack - prior_snd_una);
  		icsk->icsk_retransmits = 0;
  	}
  
@@ -3456,6 +3514,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		 * Note, we use the fact that SND.UNA>=SND.WL2.
  		 */
  		tcp_update_wl(tp, ack_seq);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_acked(tp, ack));
  		tp->snd_una = ack;
  		flag |= FLAG_WIN_UPDATE;
  
@@ -3510,6 +3569,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
  		tcp_fastretrans_alert(sk, acked, prior_unsacked,
  				      is_dupack, flag);
+		if (icsk->icsk_ca_state == TCP_CA_Open &&
+		    prior_state >= TCP_CA_CWR)
+			TCP_ESTATS_UPDATE(tp,
+				tcp_estats_update_post_congestion(tp));
  	}
  	if (tp->tlp_high_seq)
  		tcp_process_tlp_ack(sk, ack, flag);
@@ -4177,7 +4240,9 @@ static void tcp_ofo_queue(struct sock *sk)
  
  		tail = skb_peek_tail(&sk->sk_receive_queue);
  		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, tp->rcv_nxt));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+
  		if (!eaten)
  			__skb_queue_tail(&sk->sk_receive_queue, skb);
  		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
@@ -4232,6 +4297,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
  		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
  
+        TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+        TCP_ESTATS_VAR_INC(tp, path_table, DupAcksOut);
+
  	skb1 = skb_peek_tail(&tp->out_of_order_queue);
  	if (!skb1) {
  		/* Initial out of order segment, build 1 SACK. */
@@ -4242,6 +4310,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
  						TCP_SKB_CB(skb)->end_seq;
  		}
  		__skb_queue_head(&tp->out_of_order_queue, skb);
+                TCP_ESTATS_VAR_INC(tp, path_table, DupAckEpisodes);
  		goto end;
  	}
  
@@ -4438,6 +4507,9 @@ queue_and_out:
  
  			eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
  		}
+		TCP_ESTATS_UPDATE(
+			tp,
+			tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  		if (skb->len)
  			tcp_event_data_recv(sk, skb);
@@ -4459,6 +4531,8 @@ queue_and_out:
  
  		tcp_fast_path_check(sk);
  
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
+
  		if (eaten > 0)
  			kfree_skb_partial(skb, fragstolen);
  		if (!sock_flag(sk, SOCK_DEAD))
@@ -4990,6 +5064,9 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  	    tcp_paws_discard(sk, skb)) {
  		if (!th->rst) {
  			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
+			TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+			TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+					   TCP_ESTATS_SOFTERROR_BELOW_TS_WINDOW);
  			tcp_send_dupack(sk, skb);
  			goto discard;
  		}
@@ -5004,6 +5081,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  		 * an acknowledgment should be sent in reply (unless the RST
  		 * bit is set, if so drop the segment and return)".
  		 */
+		TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+		TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			before(TCP_SKB_CB(skb)->end_seq, tp->rcv_wup) ?
+				TCP_ESTATS_SOFTERROR_BELOW_DATA_WINDOW :
+				TCP_ESTATS_SOFTERROR_ABOVE_DATA_WINDOW);
  		if (!th->rst) {
  			if (th->syn)
  				goto syn_challenge;
@@ -5152,6 +5234,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				return;
  			} else { /* Header too small */
  				TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+				TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+				TCP_ESTATS_VAR_SET(tp, stack_table,
+						   SoftErrorReason,
+						   TCP_ESTATS_SOFTERROR_OTHER);
  				goto discard;
  			}
  		} else {
@@ -5178,6 +5264,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  					tcp_rcv_rtt_measure_ts(sk, skb);
  
  					__skb_pull(skb, tcp_header_len);
+					TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  					tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
  					NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
  					eaten = 1;
@@ -5204,10 +5291,12 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
  				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
  
  				/* Bulk data transfer: receiver */
+				TCP_ESTATS_UPDATE(tp, tcp_estats_update_rcvd(tp, TCP_SKB_CB(skb)->end_seq));
  				eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
  						      &fragstolen);
  			}
  
+			TCP_ESTATS_UPDATE(tp, tcp_estats_update_recvq(sk));
  			tcp_event_data_recv(sk, skb);
  
  			if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
@@ -5260,6 +5349,9 @@ step5:
  csum_error:
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_CSUMERRORS);
  	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SoftErrors);
+	TCP_ESTATS_VAR_SET(tp, stack_table, SoftErrorReason,
+			   TCP_ESTATS_SOFTERROR_DATA_CHECKSUM);
  
  discard:
  	__kfree_skb(skb);
@@ -5459,6 +5551,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  
  		tcp_finish_connect(sk, skb);
+		tcp_estats_establish(sk);
  
  		if ((tp->syn_fastopen || tp->syn_data) &&
  		    tcp_rcv_fastopen_synack(sk, skb, &foc))
@@ -5685,6 +5778,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
  		smp_mb();
  		tcp_set_state(sk, TCP_ESTABLISHED);
  		sk->sk_state_change(sk);
+		tcp_estats_establish(sk);
  
  		/* Note, that this wakeup is only for marginal crossed SYN case.
  		 * Passively open sockets are not waked up, because
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f72d7..9c85a54 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1310,6 +1310,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (!newsk)
  		goto exit_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_INACTIVE);
+
  	newsk->sk_gso_type = SKB_GSO_TCPV4;
  	inet_sk_rx_dst_set(newsk, skb);
  
@@ -1670,6 +1672,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1680,6 +1684,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1809,6 +1815,8 @@ static int tcp_v4_init_sock(struct sock *sk)
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
  #endif
  
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_ACTIVE);
+
  	return 0;
  }
  
@@ -1842,6 +1850,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
  	if (inet_csk(sk)->icsk_bind_hash)
  		inet_put_port(sk);
  
+	tcp_estats_destroy(sk);
+
  	BUG_ON(tp->fastopen_rsk != NULL);
  
  	/* If socket is aborted during connect operation */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..145b4f2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -80,6 +80,7 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
  
  	tcp_advance_send_head(sk, skb);
  	tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
  
  	tp->packets_out += tcp_skb_pcount(skb);
  	if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
@@ -292,6 +293,7 @@ static u16 tcp_select_window(struct sock *sk)
  	}
  	tp->rcv_wnd = new_win;
  	tp->rcv_wup = tp->rcv_nxt;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_rwin_sent(tp));
  
  	/* Make sure we do not exceed the maximum possible
  	 * scaled window.
@@ -905,6 +907,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  	struct tcp_md5sig_key *md5;
  	struct tcphdr *th;
  	int err;
+#ifdef CONFIG_TCP_ESTATS
+	__u32 seq;
+	__u32 end_seq;
+	int tcp_flags;
+	int pcount;
+#endif
  
  	BUG_ON(!skb || !tcp_skb_pcount(skb));
  
@@ -1008,6 +1016,15 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
  			      tcp_skb_pcount(skb));
  
+#ifdef CONFIG_TCP_ESTATS
+	/* If the skb isn't cloned, we can't reference it after
+	 * calling queue_xmit, so copy everything we need here. */
+	pcount = tcp_skb_pcount(skb);
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
+	tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
+#endif
+
  	/* OK, its time to fill skb_shinfo(skb)->gso_segs */
  	skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);
  
@@ -1020,10 +1037,17 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
  
  	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
  
+	if (likely(!err)) {
+		TCP_ESTATS_UPDATE(tp, tcp_estats_update_segsend(sk, pcount,
+								seq, end_seq,
+								tcp_flags));
+	}
+
  	if (likely(err <= 0))
  		return err;
  
  	tcp_enter_cwr(sk);
+	TCP_ESTATS_VAR_INC(tp, stack_table, SendStall);
  
  	return net_xmit_eval(err);
  }
@@ -1398,6 +1422,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
  	if (icsk->icsk_mtup.enabled)
  		mss_now = min(mss_now, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low));
  	tp->mss_cache = mss_now;
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_mss(tp));
  
  	return mss_now;
  }
@@ -1670,11 +1695,13 @@ static unsigned int tcp_snd_test(const struct sock *sk, struct sk_buff *skb,
  	tcp_init_tso_segs(sk, skb, cur_mss);
  
  	if (!tcp_nagle_test(tp, skb, cur_mss, nonagle))
-		return 0;
+		return -TCP_ESTATS_SNDLIM_SENDER;
  
  	cwnd_quota = tcp_cwnd_test(tp, skb);
-	if (cwnd_quota && !tcp_snd_wnd_test(tp, skb, cur_mss))
-		cwnd_quota = 0;
+	if (!cwnd_quota)
+		return -TCP_ESTATS_SNDLIM_CWND;
+	if (!tcp_snd_wnd_test(tp, skb, cur_mss))
+		return -TCP_ESTATS_SNDLIM_RWIN;
  
  	return cwnd_quota;
  }
@@ -1688,7 +1715,7 @@ bool tcp_may_send_now(struct sock *sk)
  	return skb &&
  		tcp_snd_test(sk, skb, tcp_current_mss(sk),
  			     (tcp_skb_is_last(sk, skb) ?
-			      tp->nonagle : TCP_NAGLE_PUSH));
+			      tp->nonagle : TCP_NAGLE_PUSH)) > 0;
  }
  
  /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet
@@ -1978,6 +2005,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  	unsigned int tso_segs, sent_pkts;
  	int cwnd_quota;
  	int result;
+	int why = TCP_ESTATS_SNDLIM_SENDER;
  	bool is_cwnd_limited = false;
  	u32 max_segs;
  
@@ -2008,6 +2036,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		cwnd_quota = tcp_cwnd_test(tp, skb);
  		if (!cwnd_quota) {
+			why = TCP_ESTATS_SNDLIM_CWND;
  			is_cwnd_limited = true;
  			if (push_one == 2)
  				/* Force out a loss probe pkt. */
@@ -2016,19 +2045,24 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  				break;
  		}
  
-		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
+		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
+			why = TCP_ESTATS_SNDLIM_RWIN;
  			break;
-
+		}
+		
  		if (tso_segs == 1) {
  			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
  						     (tcp_skb_is_last(sk, skb) ?
  						      nonagle : TCP_NAGLE_PUSH))))
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		} else {
  			if (!push_one &&
  			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
-						 max_segs))
+						 max_segs)) {
+				why = TCP_ESTATS_SNDLIM_TSODEFER;
  				break;
+			}
  		}
  
  		limit = mss_now;
@@ -2041,6 +2075,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  
  		if (skb->len > limit &&
  		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  		/* TCP Small Queues :
@@ -2064,10 +2099,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  			 */
  			smp_mb__after_atomic();
  			if (atomic_read(&sk->sk_wmem_alloc) > limit)
+				/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  				break;
  		}
  
  		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  
  repair:
@@ -2080,9 +2117,12 @@ repair:
  		sent_pkts += tcp_skb_pcount(skb);
  
  		if (push_one)
+			/* set above: why = TCP_ESTATS_SNDLIM_SENDER; */
  			break;
  	}
  
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_sndlim(tp, why));
+
  	if (likely(sent_pkts)) {
  		if (tcp_in_cwnd_reduction(sk))
  			tp->prr_out += sent_pkts;
@@ -3148,11 +3188,16 @@ int tcp_connect(struct sock *sk)
  	 */
  	tp->snd_nxt = tp->write_seq;
  	tp->pushed_seq = tp->write_seq;
-	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
  
  	/* Timer for repeating the SYN until an answer. */
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
  				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+
+	TCP_ESTATS_VAR_SET(tp, stack_table, SndInitial, tp->write_seq);
+	TCP_ESTATS_VAR_SET(tp, app_table, SndMax, tp->write_seq);
+	TCP_ESTATS_UPDATE(tp, tcp_estats_update_snd_nxt(tp));
+	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
+
  	return 0;
  }
  EXPORT_SYMBOL(tcp_connect);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 1829c7f..0f6f1f4 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -477,6 +477,9 @@ out_reset_timer:
  		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
  	}
  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+
+        TCP_ESTATS_UPDATE(tp, tcp_estats_update_timeout(sk));
+
  	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
  		__sk_dst_reset(sk);
  
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5ff8780..db1f88f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1131,6 +1131,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
  	if (newsk == NULL)
  		goto out_nonewsk;
  
+	tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_INACTIVE);
+
  	/*
  	 * No need to charge this sock to the relevant IPv6 refcnt debug socks
  	 * count here, tcp_create_openreq_child now does this for us, see the
@@ -1463,6 +1465,8 @@ process:
  	skb->dev = NULL;
  
  	bh_lock_sock_nested(sk);
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_segrecv(tcp_sk(sk), skb));
  	ret = 0;
  	if (!sock_owned_by_user(sk)) {
  		if (!tcp_prequeue(sk, skb))
@@ -1473,6 +1477,8 @@ process:
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
  	}
+	TCP_ESTATS_UPDATE(
+		tcp_sk(sk), tcp_estats_update_finish_segrecv(tcp_sk(sk)));
  	bh_unlock_sock(sk);
  
  	sock_put(sk);
@@ -1661,6 +1667,7 @@ static int tcp_v6_init_sock(struct sock *sk)
  #ifdef CONFIG_TCP_MD5SIG
  	tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
  #endif
+	tcp_estats_create(sk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_ACTIVE);
  
  	return 0;
  }
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 0/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 17:49 UTC (permalink / raw)
  To: netdev

The following patch increments and/or update select RFC 4898 (TCP 
Extended Statistics MIB) metrics within the TCP stack; we refer to this 
as the Kernel Instrument Set (or KIS).  The goal of RFC 4898 is to 
expose advanced statistics from TCP’s vantage point to userland in order 
to help diagnose performance problems in both the network and 
application. The metrics are gathered and cached within structures 
defined in our header file (tcp_estats.h) on a per connection basis 
allowing for highly detailed analysis of all TCP flows. More information 
can be found at http://www.web10g.org/

Note, the KIS does not integrate any specific ABI. This allows for a
clear separation between the kernel instruments and the methodology used
to make the metrics available to userland. Currently, we have a netlink
implementation available as a DLKM and an associated API available at
https://sourceforge.net/projects/tcpestats/files/

Performance analysis provided by the kernel development teams at Google
and Facebook indicate that the overhead imposed when the KIS is
configured active and exposed via an ABI are minimal.  Facebook related
performance impacts of between 0% and 2%, depending on the the frequency
of polling the KIS via ftrace.  Analysis performed at Google indicate
similar performance characteristics.

Since the size of the KIS patch set is considerable (~2k lines), we have
broken it up into two components, the first provides our structures and 
macros to the TCP networking DLKMs. The second provides the routines 
that manage and control the TCP Extended Statistics, as well as 
providing hooks for configuring and enabling the KIS. Each set of 
patches patches, compiles, and runs independently. However, full 
functionality requires both patch sets to be installed.

We took this approach because the control and management (C&M) routines 
are, in our view, of secondary importance to the actual instrumentation. 
As such, we did not want any issues with the C&M methods to impact the 
adoption of the KIS. There is overlap between the two patch sets 
(specifically the header files in the C&M) which will likely make 
applying the C&M patch cleanly on top of the KIS patch problematic. As 
such, I've also included a concatenated patch that includes both the KIS 
and the C&M for evaluation.

A git repo is available at http://github.com/rapier/web10g The
net-next branch contains the instrumentation (1st patch set) and the
control and management (2nd patch set). The API is also available at
http://github.com/rapier1/web10g-userland

Chris Rapier

^ permalink raw reply

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16 17:29 UTC (permalink / raw)
  To: John Fastabend, netdev@vger.kernel.org
  Cc: Roopa Prabhu, Jamal Hadi Salim, Jiri Pirko, sfeldma@gmail.com,
	bcrl@kvack.org, tgraf@suug.ch, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <549060CF.5020706@gmail.com>



> -----Original Message-----
> From: John Fastabend [mailto:john.fastabend@gmail.com]
> Sent: Tuesday, December 16, 2014 6:42 PM
> To: Arad, Ronen
> Cc: Roopa Prabhu; netdev@vger.kernel.org; Jamal Hadi Salim; Jiri Pirko;
> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/16/2014 03:01 AM, Arad, Ronen wrote:
> >
> > In my reply (inline) I elaborate on the validity of bridge-less and offloaded-
> bridge models for L2 switching.
> >
> > I also discuss the implied necessity of a bridge device for L3 routing and
> potential issues with the upcoming FIB offloading proposal.
> >
> >> -----Original Message-----
> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
> >> Sent: Tuesday, December 16, 2014 3:21 AM
> >> To: Arad, Ronen
> >> Cc: Jamal Hadi Salim; John Fastabend; netdev@vger.kernel.org; Jiri
> >> Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> >> stephen@networkplumber.org; linville@tuxdriver.com;
> >> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >> On 12/15/14, 4:58 PM, Arad, Ronen wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
> >>>> Sent: Tuesday, December 16, 2014 1:28 AM
> >>>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
> >>>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
> >>>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
> >>>> vyasevic@redhat.com; davem@davemloft.net;
> >> shm@cumulusnetworks.com;
> >>>> gospo@cumulusnetworks.com
> >>>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set
> >>>> and del bridge port attributes
> >>>>
> >>>> On 12/15/14 13:36, Arad, Ronen wrote:
> >>>>>
> >>>>>> -----Original Message-----
> >>>>> The behavior of a driver could depend on the presence of a bridge
> >>>>> and
> >>>> features such as FDB LEARNING and LEARNING_SYNC.
> >>>>
> >>>> Indeed, those are bridge attributes.
> >>>>
> >>>>> A switch port driver which is not enslaved to a bridge might need
> >>>>> to implement VLAN-aware FDB within the driver and report its
> >>>>> content to
> >>>>> user-
> >>>> space using ndo_fdb_dump.
> >>>>    >
> >>>>> A switch port driver which is enslaved to a bridge could do with
> >>>>> only pass through for static FDB configuration
> >>>>    > to the HW when LEARNING_SYNC is configured. FDB reporting to
> >>>> user- space and soft aging are left to the bridge module FDB.
> >>>>> Such driver, without LEARNING_SYNC could still avoid maintaing
> >>>>> in-driver
> >>>> FDB as long as it could dump the HW FDB on demand.
> >>>>> LEARNING_SYNC also requires periodic updates of freshness
> >>>>> information
> >>>> from the driver to the bridge module.
> >>>>
> >>>> If you have an fdb - shouldnt that be exposed only if you have a
> >>>> bridge abstraction exposed? i.e thats where the Linux tools would work.
> >>> I'm trying to find out what are the opinions of other people in the
> >>> netdev
> >> list.
> >>> John have clearly stated that he'd like to see full L2 switching
> >>> functionality
> >> (at least) supported without making a bridge device mandatory.
> >>> The existing bridge ndos (ndo_bridge_{set,del,get}link) already
> >>> support that
> >> with proper setting of SELF/MASTER flags by iproute2.
> >>> I see the value in supporting both approaches (bridge device
> >>> mandatory and bridge device optional). If the choice is left to
> >>> user-driven policy decision, we need to document both use models and
> >>> map traditional L2 features to each model.
> >>> The L2 offloading (or NETFUNC as it is currently called), which is
> >>> being discussed on a different patch-set, is only needed when a
> >>> bridge device is used.
> >>> Without a bridge device, all configuration has to be targeted at the
> >>> switch port driver directly using the SELF flag. FDB remains
> >>> relevant and it is used to configure static MAC table entries and dump
> the HW MAC table.
> >
> >> Your understanding is right here. So far all patches have kept both
> >> models in mind.
> >
> >
> >>> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or
> >>> even higher), there is a gap between what the HW is doing and what
> >>> is explicitly modeled in Linux.
> >
> >
> >> Can you elaborate more here ?. We use the linux model to accelerate a
> >> multi-layer (l2-l3) switch today. There maybe a few gaps, but these
> >> gaps can be closed by having equivalent functionality in the software path.
> >
> > What I meant is that without a bridge device the HW switch is seen as a
> collection of independent switch ports. Typical switch ASIC performs L2
> switching by default. This is not expressed explicitly in Linux without a bridge
> device.
> > The SELF flag is used to target typical bridge port and bridge configuration
> at a switch port device.
> > Without an explicit bridge device, bridge attributes have to be
> > directed at an arbitrary port (any port could represent the entire switch)
> and interpreted by the switch port driver as intended for the entire switch
> (this includes attributes like STP etc.) Each switch port device driver has to
> implement similar functionality (i.e. all bridge and fdb related ndos)
> independently without common functionality shared (e.g. FDB, soft aging).
> > It is a valid use model and could avoid the complexity of having to deal with
> the presence of both SW and HW bridge and to deal with explicit offloading
> of data-path.
> >
> > I was trying to find out whether the intention was to continue and support
> both bridge-less an offloaded-bridge models and leave it to the end-user to
> choose the desirable model at configuration time.
> > This would require dual support in the switch port driver in order to have
> best user experience across multiple switch ASICs or other kinds of devices.
> >
> 
> I'm still missing why there is duplicate implementations in the driver.
> If the driver implements the set of ndo ops why should it care who calls
> them? I think you tried to explain this already but I'm not seeing it.
> 

Let's consider a bridge property. I'll use the default PVID attribute as an example. This is currently configurable by sysfs only and a netlink support for that is still due. Let's assume for our discussion that a DEAFAULT_PVID attribute will be added as a bridge attribute within AFSPEC nested attribute of AF_BRIDGE SETLINK message.
When a bridge device is present, this attribute is processed by the bridge module and saved as default_pvid field in net_bridge structure. When a switch port is enslaved to a bridge, the bridge driver creates a net_bridge_port instance and assigns it a pvid inherited from the default_pvid attribute of the bridge. Setting the pvid for a new enslaved switch port is not done via netlink. It only applies to the net_bridge_port structure which is internal to the bridge module. Offloading this to HW is not addressed with current bridge offloading.

When a bridge device is not used, the DEFAULT_PVID will be targeted using the SELF flag to any of the switch ports. The driver will recognize that as a bridge port and will need to maintain some switch global structure similar to net_bridge where it could save the default_pvid. The driver, knowing that the switch port is not enslaved to a bridge, will have to replicate the same functionality. In the HW case, it will have to configure default VLAN on all the switch ports.
This is different from the yet to be defined way of propagating default PVID from a bridge device to offloaded bridge ports.

Another example is STP. STP attributes are bridge attributes which are not offloaded when a bridge device is present. The bridge module handles STP protocol internally. Without bridge device, STP attributes have to be targeted at a switch port device and the driver should save them in driver-specific structures and have proprietary implementation of STP (as the one in the bridge module is not used).

 
> [...]
> 
> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa might have
> worked some of it out.
> 
> --
> John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Samudrala, Sridhar @ 2014-12-16 17:21 UTC (permalink / raw)
  To: John Fastabend, Jamal Hadi Salim
  Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <54905F67.2090509@gmail.com>


On 12/16/2014 8:35 AM, John Fastabend wrote:
>
>> Is there no way to get the unicast/multicast mac addresses for such
>> a driver?
>
> You can almost infer it from ip link by looking at all the stacked
> drivers and figuring out how the address are propagated down. Then
> look at the routes and figure out multicast address. But other than
> the fdb dump mechanism I don't think there is anything.

It looks like we can get the device specific unicast/multicast mac 
addresses via 'ip maddr' too.

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Michael Chan @ 2014-12-16 17:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rajat Jain, Marcelo Ricardo Leitner, Nils Holland, David Miller,
	netdev, linux-pci@vger.kernel.org, Rafael Wysocki,
	Prashant Sreedharan
In-Reply-To: <CAErSpo5dqQE7nZ6zf2odgpHBWA3ZpTjhbgQKnY8YxQW+a+298w@mail.gmail.com>

On Tue, 2014-12-16 at 09:20 -0700, Bjorn Helgaas wrote:
> I think we're in this path:
> 
>     tg3_init_hw
>       tg3_reset_hw
>         tg3_disable_ints
>         tg3_stop_fw
>         tg3_write_sig_pre_reset
>         tg3_chip_reset
>           pci_device_is_present
>             pci_bus_read_dev_vendor_id
> 
> and in this case pci_device_is_present() also passes a timeout of zero
> to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
> the device, so it's not too surprising that the config read returns
> CRS status immediately afterward.
> 
At the point of calling pci_device_is_present(), chip reset hasn't
started yet, so there should be no problem reading config space.

In all the newer tg3 chips, chip reset does not reset the PCIE block.
So I think config space should always be accesible even during reset.
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox