Netdev List


* Re: [PATCH iproute2 4/4] tc: Allow to easy change network namespace
From: Jiri Pirko @ 2014-12-14  9:36 UTC (permalink / raw)
  To: Vadim Kochan; +Cc: netdev
In-Reply-To: <1418493334-23142-5-git-send-email-vadim4j@gmail.com>

Sat, Dec 13, 2014 at 06:55:34PM CET, vadim4j@gmail.com wrote:
>From: Vadim Kochan <vadim4j@gmail.com>
>
>Added new '-netns' option to simplify executing following cmd:
>
>    ip netns exec NETNS tc OPTIONS COMMAND OBJECT
>
>    to
>
>    tc -n[etns] NETNS OPTIONS COMMAND OBJECT
>
>e.g.:
>
>    tc -net vnet0 qdisc
>
>Signed-off-by: Vadim Kochan <vadim4j@gmail.com>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>

* [PATCH iproute2 REGRESSION] ss: Dont show netlink and packet sockets by default
From: Vadim Kochan @ 2014-12-14  9:36 UTC (permalink / raw)
  To: netdev; +Cc: Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

Checking by SS_CLOSE state was removed in:

    (45a4770bc0) ss: Remove checking SS_CLOSE state for packet and netlink

which is not really correct, because now all sockets are shown by default
when running 'ss'.

Here is a more correct fix, which takes the specified family into account.

To see netlink sockets:
    ss -A netlink

To see packet sockets:
    ss -A packet

By default, ss will again show only connected/established sockets, as it
always did before.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
 misc/ss.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index e9927a5..6050ab6 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2801,6 +2801,9 @@ static int packet_show(struct filter *f)
 	int ino;
 	unsigned long long sk;
 
+	if (preferred_family != AF_PACKET && !(f->states & (1<<SS_CLOSE)))
+		return 0;
+
 	if (packet_show_netlink(f, NULL) == 0)
 		return 0;
 
@@ -3028,6 +3031,9 @@ static int netlink_show(struct filter *f)
 	int rq, wq, rc;
 	unsigned long long sk, cb;
 
+	if (preferred_family != AF_NETLINK && !(f->states & (1<<SS_CLOSE)))
+		return 0;
+
 	if (!getenv("PROC_NET_NETLINK") && !getenv("PROC_ROOT") &&
 		netlink_show_netlink(f, NULL) == 0)
 		return 0;
-- 
2.1.3
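
As a reading aid (not part of the patch), the guard added to both
packet_show() and netlink_show() can be viewed as a single predicate.
The sketch below is a minimal, self-contained restatement of that check.
The helper name family_hidden() and the constant TOY_SS_CLOSE (with its
value) are assumptions made for illustration only; the real ss.c uses its
own SS_CLOSE enum and the global preferred_family.

```c
#include <stdbool.h>

/* Assumed stand-in for ss.c's SS_CLOSE state index (illustrative value). */
enum { TOY_SS_CLOSE = 7 };

/*
 * Returns true when sockets of 'family' should be skipped:
 * the user did not ask for this family explicitly AND the
 * closed state is not part of the active state filter.
 */
static bool family_hidden(int preferred_family, int family,
			  unsigned int states)
{
	return preferred_family != family &&
	       !(states & (1u << TOY_SS_CLOSE));
}
```

With this predicate, a family is listed either when it was requested
explicitly (as with 'ss -A packet') or when the state filter includes the
closed state; otherwise it is skipped.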

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-14 14:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <20141211222559.GB1880@nanopsycho.orion>

On 12/11/14, 2:25 PM, Jiri Pirko wrote:
> Thu, Dec 11, 2014 at 07:27:32PM CET, roopa@cumulusnetworks.com wrote:
>> On 12/11/14, 10:07 AM, Jiri Pirko wrote:
>>> Thu, Dec 11, 2014 at 06:59:15PM CET, roopa@cumulusnetworks.com wrote:
>>>> On 12/11/14, 9:11 AM, Jiri Pirko wrote:
>>>>> Thu, Dec 11, 2014 at 05:52:10PM CET, roopa@cumulusnetworks.com wrote:
>>>>>> On 12/10/14, 1:37 AM, Jiri Pirko wrote:
>>>>>>> Wed, Dec 10, 2014 at 10:05:18AM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>>
>>>>>>>> This patch adds two new api's netdev_switch_port_bridge_setlink
>>>>>>>> and netdev_switch_port_bridge_dellink to offload bridge port attributes
>>>>>>>> to switch asic
>>>>>>>>
>>>>>>>> (The names of the apis look odd with 'switch_port_bridge',
>>>>>>>> but am more inclined to change the prefix of the api to something else.
>>>>>>>> Will take any suggestions).
>>>>>>>>
>>>>>>>> The api's look at the NETIF_F_HW_NETFUNC_OFFLOAD feature flag to
>>>>>>>> pass bridge port attributes to the port device.
>>>>>>>>
>>>>>>>> If the device has the NETIF_F_HW_NETFUNC_OFFLOAD, but does not support
>>>>>>>> the bridge port attribute offload ndo, call bridge port attribute ndo's on
>>>>>>>> the lowerdevs if supported. This is one way to pass bridge port attributes
>>>>>>>> through stacked netdevs (example when bridge port is a bond and bond slaves
>>>>>>>> are switch ports).
>>>>>>>>
>>>>>>>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>> ---
>>>>>>>> include/net/switchdev.h   |    5 +++-
>>>>>>>> net/switchdev/switchdev.c |   70 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> 2 files changed, 74 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>>>>>>> index 8a6d164..22676b6 100644
>>>>>>>> --- a/include/net/switchdev.h
>>>>>>>> +++ b/include/net/switchdev.h
>>>>>>>> @@ -17,7 +17,10 @@
>>>>>>>> int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>> 				struct netdev_phys_item_id *psid);
>>>>>>>> int netdev_switch_port_stp_update(struct net_device *dev, u8 state);
>>>>>>>> -
>>>>>>>> +int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>> +				struct nlmsghdr *nlh, u16 flags);
>>>>>>>> +int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>> +				struct nlmsghdr *nlh, u16 flags);
>>>>>>>> #else
>>>>>>>>
>>>>>>>> static inline int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>>>>>>>> index d162b21..62317e1 100644
>>>>>>>> --- a/net/switchdev/switchdev.c
>>>>>>>> +++ b/net/switchdev/switchdev.c
>>>>>>>> @@ -50,3 +50,73 @@ int netdev_switch_port_stp_update(struct net_device *dev, u8 state)
>>>>>>>> 	return ops->ndo_switch_port_stp_update(dev, state);
>>>>>>>> }
>>>>>>>> EXPORT_SYMBOL(netdev_switch_port_stp_update);
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + *	netdev_switch_port_bridge_setlink - Notify switch device port of bridge
>>>>>>>> + *	port attributes
>>>>>>>> + *
>>>>>>>> + *	@dev: port device
>>>>>>>> + *	@nlh: netlink msg with bridge port attributes
>>>>>>>> + *
>>>>>>>> + *	Notify switch device port of bridge port attributes
>>>>>>>> + */
>>>>>>>> +int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>> +									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>> +{
>>>>>>>> +	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>> +	struct net_device *lower_dev;
>>>>>>>> +	struct list_head *iter;
>>>>>>>> +	int ret = 0, err = 0;
>>>>>>>> +
>>>>>>>> +	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>> +		return err;
>>>>>>>> +
>>>>>>>> +	if (ops->ndo_bridge_setlink) {
>>>>>>>> +	    WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>> +	    return ops->ndo_bridge_setlink(dev, nlh, flags);
>>>>>>> 	You have to change ndo_bridge_setlink in netdevice.h first.
>>>>>>> 	Otherwise when only this patch is applied (during bisection)
>>>>>>> 	this won't compile.
>>>>>> ack, will fix it and keep that in mind next time.
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>> 	I do not understand why to iterate over lower devices. At this
>>>>>>> 	stage we don't know a thing about this upper or its lowers. Let
>>>>>>> 	the uppers (/masters) to decide if this needs to be propagated
>>>>>>> 	or not.
>>>>>> Jiri, In the stacked devices case, there is no way to propagate the bridge
>>>>>> port attributes to switch device driver today (vlan and other bridge port
>>>>>> attributes). Can you tell me if there is a way ?. no, ndo_vlan* ndo's are not
>>>>>> useful here. Nor we should go and implement ndo_bridge_setlink* in all
>>>>>> devices that can be bridge ports.
>>>>> Hmm. I just think that is cleaner to implement ndo_bridge_setlink in
>>>>> bonding for example and let it propagate the the call to slaves.
>>>> No, that will require bridge attribute support in all drivers. And that is no
>>>> good.
>>> Not all drivers, just all masters which want to support this. Like bond,
>>> team, macvlan etc. That would be the same as for
>>> ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid/ndo_change_mtu etc. I do not
>>> see any problem in that. It is much much clearer over big hammer iterate
>>> over lowers in my opinion.
>> You cannot avoid the lowerdev iteration in any case.
>> If you added it in the individual drivers: bond, macvlan and other drivers
>> will all have to do the same thing.
>> ie Call bridge setlink on lowerdevs.
> I feel that the right way is to let masters propagate that themselves in
> their code.
In this case, no. Just because an interface is a port in a bridge, it is
wrong to imply that the interface driver needs to understand all
bridge port configuration attributes. Note that with what you are asking
for, all bridge port drivers (bonds, vxlans) would also need to
implement the netdev stp state update api.
> That's it. I might be wrong of course.


>> My patch avoids the need to modify these drivers. Besides it does this only
>> when the OFFLOAD flag is set.
>
> Yep, well in my reply to another patch of you series I expressed my
> feeling that the flag should be really checked in particular switch
> driver, not core. But I might be wrong there as well...

The bridge driver owns these attributes, and it needs to call the
switchdev api to offload.
The condition for the switchdev api call is the offload flag, and
the offload flag is part of the switchdev api.
The drivers just set it on the netdev; they don't own the offload flag.
So, I don't see a reason why the core should not know about the flag.

What has been accepted in the kernel currently does not help the bridge
driver offload to switchdev. It does help if you want to manage your
switch device separately, as you were already doing with nics, i.e. by
going to the switch port driver directly. It does not help the stacked
device case either.


I will resubmit my series with the checkpatch errors you pointed out fixed.

I am also looking at other ways to solve the problem.

Thanks for the review.


>> It will not stop at adding the ndo_bridge_setlink to bond/macvlan etc. It
>> will be all other ndo_ops we will need for switch asics.
>> It will be l3 tomorrow, if the route is through a bond (But at that point, we
>> may end up having to introduce switch device instead of going to the port.
>> Lets see).
>>
>> Today this patch introduces an abstract way to get to the switch driver by
>> getting to the slave switch port (And only when the OFFLOAD flag is set).
>>
>>
>>>
>>>>> Let every "upper" to handle ndo_bridge_setlink their way. Sometimes it
>>>>> might not make sense to propagate to "lowers".
>>>> This does not really propagate to lowers. It is just trying to get to a
>>>> switch port and from there to the switch driver.
>>>> Example, bond driver does not need to care if its a bridge port. It will
>>>> simply pass the call to its slave which
>>>> might be a switch port.
>>>>
>>>> bond driver does not care if its a bridge port. But the switch driver cares,
>>>> because it knows that the bond was created with switch ports.
>>>>
>>>>
>>>>>> And this allows a switch driver to receive these callbacks if it has marked
>>>>>> the switch port with an offload flag. Your way of using the switch port to
>>>>>> get to the switch driver does not help in these cases.
>>>>> I do not follow how this is related to this case (stacked layout).
>>>>>
>>>>>> The other option is to use the 'switch device (not port)' to get to the
>>>>>> switch driver.
>>>>> That would not help this case (stacked layout) I believe.
>>>>>
>>>>>
>>>>>> This patch shows that you can still do this with the ndo ops.
>>>>>>>> +		err = netdev_switch_port_bridge_setlink(lower_dev, nlh, flags);
>>>>>>>> +		if (err)
>>>>>>>> +			ret = err;
>>>>>>>> +    }
>>>>>>>   ^^^^^ Indent is off. This should be catched by scripts/checkpatch.pl.
>>>>>>>
>>>>>>>> +
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>>> +EXPORT_SYMBOL(netdev_switch_port_bridge_setlink);
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + *	netdev_switch_port_bridge_dellink - Notify switch device port of bridge
>>>>>>>> + *	attribute delete
>>>>>>>> + *
>>>>>>>> + *	@dev: port device
>>>>>>>> + *	@nlh: netlink msg with bridge port attributes
>>>>>>>> + *
>>>>>>>> + *	Notify switch device port of bridge port attribute delete
>>>>>>>> + */
>>>>>>>> +int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>> +									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>> +{
>>>>>>>> +	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>> +	struct net_device *lower_dev;
>>>>>>>> +	struct list_head *iter;
>>>>>>>> +	int ret = 0, err = 0;
>>>>>>>> +
>>>>>>>> +	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>> +		return err;
>>>>>>>> +
>>>>>>>> +	if (ops->ndo_bridge_dellink) {
>>>>>>>> +		WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>> +		return ops->ndo_bridge_dellink(dev, nlh, flags);
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>>> +		err = netdev_switch_port_bridge_dellink(lower_dev, nlh, flags);
>>>>>>>> +		if (err)
>>>>>>>> +			ret = err;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>>> +EXPORT_SYMBOL(netdev_switch_port_bridge_dellink);
>>>>>>>> -- 
>>>>>>>> 1.7.10.4
>>>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* [PATCH net 0/2] mlx4 driver fixes for 3.19-rc1
From: Or Gerlitz @ 2014-12-14 14:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Matan Barak, Amir Vadai, Tal Alon, Or Gerlitz

Hi Dave, 

Just fixes for two small issues introduced in the 3.19 merge window.

Or.

Matan Barak (1):
  net/mlx4_core: Fixed memory leak and incorrect refcount in mlx4_load_one

Or Gerlitz (1):
  net/mlx4_core: Avoid double dumping of the PF device capabilities

 drivers/net/ethernet/mellanox/mlx4/fw.c   |   26 +++++++-----
 drivers/net/ethernet/mellanox/mlx4/fw.h   |    1 +
 drivers/net/ethernet/mellanox/mlx4/main.c |   62 ++++++++++++++++-------------
 3 files changed, 50 insertions(+), 39 deletions(-)

* [PATCH net 2/2] net/mlx4_core: Avoid double dumping of the PF device capabilities
From: Or Gerlitz @ 2014-12-14 14:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Matan Barak, Amir Vadai, Tal Alon, Or Gerlitz
In-Reply-To: <1418566685-9855-1-git-send-email-ogerlitz@mellanox.com>

To support asymmetric EQ allocations, we should query the device
capabilities prior to enabling SRIOV. As a side effect of adding that,
we are dumping the PF device capabilities twice. Avoid that by moving
the printing into a helper function which is called once.

Fixes: 7ae0e400cd93 ('net/mlx4_core: Flexible (asymmetric) allocation of
		     EQs and MSI-X vectors for PF/VFs')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/fw.c   |   26 +++++++++++++++-----------
 drivers/net/ethernet/mellanox/mlx4/fw.h   |    1 +
 drivers/net/ethernet/mellanox/mlx4/main.c |    1 +
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index ef3b95b..51807bb 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -787,11 +787,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 		if ((1 << (field & 0x3f)) > (PAGE_SIZE / dev_cap->bf_reg_size))
 			field = 3;
 		dev_cap->bf_regs_per_page = 1 << (field & 0x3f);
-		mlx4_dbg(dev, "BlueFlame available (reg size %d, regs/page %d)\n",
-			 dev_cap->bf_reg_size, dev_cap->bf_regs_per_page);
 	} else {
 		dev_cap->bf_reg_size = 0;
-		mlx4_dbg(dev, "BlueFlame not available\n");
 	}
 
 	MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_SG_SQ_OFFSET);
@@ -902,9 +899,6 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 			goto out;
 	}
 
-	mlx4_dbg(dev, "Base MM extensions: flags %08x, rsvd L_Key %08x\n",
-		 dev_cap->bmme_flags, dev_cap->reserved_lkey);
-
 	/*
 	 * Each UAR has 4 EQ doorbells; so if a UAR is reserved, then
 	 * we can't use any EQs whose doorbell falls on that page,
@@ -916,6 +910,21 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	else
 		dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_SYS_EQS;
 
+out:
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+
+void mlx4_dev_cap_dump(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
+{
+	if (dev_cap->bf_reg_size > 0)
+		mlx4_dbg(dev, "BlueFlame available (reg size %d, regs/page %d)\n",
+			 dev_cap->bf_reg_size, dev_cap->bf_regs_per_page);
+	else
+		mlx4_dbg(dev, "BlueFlame not available\n");
+
+	mlx4_dbg(dev, "Base MM extensions: flags %08x, rsvd L_Key %08x\n",
+		 dev_cap->bmme_flags, dev_cap->reserved_lkey);
 	mlx4_dbg(dev, "Max ICM size %lld MB\n",
 		 (unsigned long long) dev_cap->max_icm_sz >> 20);
 	mlx4_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n",
@@ -949,13 +958,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 		 dev_cap->dmfs_high_rate_qpn_base);
 	mlx4_dbg(dev, "DMFS high rate steer QPn range: %d\n",
 		 dev_cap->dmfs_high_rate_qpn_range);
-
 	dump_dev_cap_flags(dev, dev_cap->flags);
 	dump_dev_cap_flags2(dev, dev_cap->flags2);
-
-out:
-	mlx4_free_cmd_mailbox(dev, mailbox);
-	return err;
 }
 
 int mlx4_QUERY_PORT(struct mlx4_dev *dev, int port, struct mlx4_port_cap *port_cap)
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.h b/drivers/net/ethernet/mellanox/mlx4/fw.h
index 794e282..62562b6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.h
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.h
@@ -224,6 +224,7 @@ struct mlx4_set_ib_param {
 	u32 cap_mask;
 };
 
+void mlx4_dev_cap_dump(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap);
 int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap);
 int mlx4_QUERY_PORT(struct mlx4_dev *dev, int port, struct mlx4_port_cap *port_cap);
 int mlx4_QUERY_FUNC_CAP(struct mlx4_dev *dev, u8 gen_or_port,
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index c2ef266..b935bf3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -305,6 +305,7 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 		mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting\n");
 		return err;
 	}
+	mlx4_dev_cap_dump(dev, dev_cap);
 
 	if (dev_cap->min_page_sz > PAGE_SIZE) {
 		mlx4_err(dev, "HCA minimum page size of %d bigger than kernel PAGE_SIZE of %ld, aborting\n",
-- 
1.7.1

* [PATCH net 1/2] net/mlx4_core: Fixed memory leak and incorrect refcount in mlx4_load_one
From: Or Gerlitz @ 2014-12-14 14:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Matan Barak, Amir Vadai, Tal Alon, Or Gerlitz
In-Reply-To: <1418566685-9855-1-git-send-email-ogerlitz@mellanox.com>

From: Matan Barak <matanb@mellanox.com>

The current mlx4_load_one has a memory leak as it always allocates
dev_cap, but frees it only on error.

In addition, even if VFs exist when mlx4_load_one is called,
we still need to notify probed VFs that we're loading (by
incrementing pf_loading).

Fixes: a0eacca948d2 ('net/mlx4_core: Refactor mlx4_load_one')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c |   61 +++++++++++++++-------------
 1 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index e25436b..c2ef266 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2488,41 +2488,42 @@ static u64 mlx4_enable_sriov(struct mlx4_dev *dev, struct pci_dev *pdev,
 			     u8 total_vfs, int existing_vfs)
 {
 	u64 dev_flags = dev->flags;
+	int err = 0;
 
-	dev->dev_vfs = kzalloc(
-			total_vfs * sizeof(*dev->dev_vfs),
-			GFP_KERNEL);
+	atomic_inc(&pf_loading);
+	if (dev->flags &  MLX4_FLAG_SRIOV) {
+		if (existing_vfs != total_vfs) {
+			mlx4_err(dev, "SR-IOV was already enabled, but with num_vfs (%d) different than requested (%d)\n",
+				 existing_vfs, total_vfs);
+			total_vfs = existing_vfs;
+		}
+	}
+
+	dev->dev_vfs = kzalloc(total_vfs * sizeof(*dev->dev_vfs), GFP_KERNEL);
 	if (NULL == dev->dev_vfs) {
 		mlx4_err(dev, "Failed to allocate memory for VFs\n");
 		goto disable_sriov;
-	} else if (!(dev->flags &  MLX4_FLAG_SRIOV)) {
-		int err = 0;
-
-		atomic_inc(&pf_loading);
-		if (existing_vfs) {
-			if (existing_vfs != total_vfs)
-				mlx4_err(dev, "SR-IOV was already enabled, but with num_vfs (%d) different than requested (%d)\n",
-					 existing_vfs, total_vfs);
-		} else {
-			mlx4_warn(dev, "Enabling SR-IOV with %d VFs\n", total_vfs);
-			err = pci_enable_sriov(pdev, total_vfs);
-		}
-		if (err) {
-			mlx4_err(dev, "Failed to enable SR-IOV, continuing without SR-IOV (err = %d)\n",
-				 err);
-			atomic_dec(&pf_loading);
-			goto disable_sriov;
-		} else {
-			mlx4_warn(dev, "Running in master mode\n");
-			dev_flags |= MLX4_FLAG_SRIOV |
-				MLX4_FLAG_MASTER;
-			dev_flags &= ~MLX4_FLAG_SLAVE;
-			dev->num_vfs = total_vfs;
-		}
+	}
+
+	if (!(dev->flags &  MLX4_FLAG_SRIOV)) {
+		mlx4_warn(dev, "Enabling SR-IOV with %d VFs\n", total_vfs);
+		err = pci_enable_sriov(pdev, total_vfs);
+	}
+	if (err) {
+		mlx4_err(dev, "Failed to enable SR-IOV, continuing without SR-IOV (err = %d)\n",
+			 err);
+		goto disable_sriov;
+	} else {
+		mlx4_warn(dev, "Running in master mode\n");
+		dev_flags |= MLX4_FLAG_SRIOV |
+			MLX4_FLAG_MASTER;
+		dev_flags &= ~MLX4_FLAG_SLAVE;
+		dev->num_vfs = total_vfs;
 	}
 	return dev_flags;
 
 disable_sriov:
+	atomic_dec(&pf_loading);
 	dev->num_vfs = 0;
 	kfree(dev->dev_vfs);
 	return dev_flags & ~MLX4_FLAG_MASTER;
@@ -2606,8 +2607,10 @@ static int mlx4_load_one(struct pci_dev *pdev, int pci_dev_data,
 		}
 
 		if (total_vfs) {
-			existing_vfs = pci_num_vf(pdev);
 			dev->flags = MLX4_FLAG_MASTER;
+			existing_vfs = pci_num_vf(pdev);
+			if (existing_vfs)
+				dev->flags |= MLX4_FLAG_SRIOV;
 			dev->num_vfs = total_vfs;
 		}
 	}
@@ -2643,6 +2646,7 @@ slave_start:
 	}
 
 	if (mlx4_is_master(dev)) {
+		/* when we hit the goto slave_start below, dev_cap already initialized */
 		if (!dev_cap) {
 			dev_cap = kzalloc(sizeof(*dev_cap), GFP_KERNEL);
 
@@ -2849,6 +2853,7 @@ slave_start:
 	if (mlx4_is_master(dev) && dev->num_vfs)
 		atomic_dec(&pf_loading);
 
+	kfree(dev_cap);
 	return 0;
 
 err_port:
-- 
1.7.1
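
As a reading aid (not part of the patch), the leak described in the
changelog follows a common pattern: a buffer is allocated lazily, may be
reused when the function loops back to an earlier label, and must be
freed on the success path as well as on error. The sketch below is a toy
model under those assumptions. All names (toy_cap, toy_load_one,
live_allocs) are hypothetical, and plain calloc()/free() stand in for
kzalloc()/kfree().

```c
#include <stdlib.h>

struct toy_cap {
	int max_qps;
};

/* Number of toy_cap objects currently allocated and not yet freed. */
static int live_allocs;

static struct toy_cap *toy_cap_alloc(void)
{
	struct toy_cap *cap = calloc(1, sizeof(*cap));

	if (cap)
		live_allocs++;
	return cap;
}

static void toy_cap_free(struct toy_cap *cap)
{
	if (cap) {
		live_allocs--;
		free(cap);
	}
}

/*
 * 'retries' models jumping back to an earlier label (as with
 * slave_start); 'fail' models a later error. The buffer is
 * allocated at most once and released on every exit path.
 */
static int toy_load_one(int retries, int fail)
{
	struct toy_cap *cap = NULL;
	int err = 0;

again:
	if (!cap) {
		cap = toy_cap_alloc();
		if (!cap)
			return -1;
	}
	if (retries-- > 0)
		goto again;	/* cap is reused, not reallocated */

	if (fail)
		err = -1;

	toy_cap_free(cap);	/* freed on success and on error */
	return err;
}
```

After any call to toy_load_one(), live_allocs returns to zero, which is
the property the added kfree(dev_cap) restores for the success path.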

* Re: [PATCH iproute2 REGRESSION] ss: Dont show netlink and packet sockets by default
From: Sergei Shtylyov @ 2014-12-14 14:26 UTC (permalink / raw)
  To: Vadim Kochan, netdev
In-Reply-To: <1418549765-9466-1-git-send-email-vadim4j@gmail.com>

Hello.

On 12/14/2014 12:36 PM, Vadim Kochan wrote:

> From: Vadim Kochan <vadim4j@gmail.com>

> Checking by SS_CLOSE state was removed in:

>      (45a4770bc0) ss: Remove checking SS_CLOSE state for packet and netlink

> which is not really correct because now by default all sockets are seen
> when do 'ss'.

> Here is most correct fix which considers specified family.

> To see netlink sockets:
>      ss -A netlink

> To see packet sockets:
>      ss -A packet

> And ss by default will show only connected/established sockets as it
> was before all the time.
>
> Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> ---
>   misc/ss.c | 6 ++++++
>   1 file changed, 6 insertions(+)

> diff --git a/misc/ss.c b/misc/ss.c
> index e9927a5..6050ab6 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -2801,6 +2801,9 @@ static int packet_show(struct filter *f)
>   	int ino;
>   	unsigned long long sk;
>
> +	if (preferred_family != AF_PACKET && !(f->states & (1<<SS_CLOSE)))

   Please surround << with spaces, to be consistent with other operators and 
general kernel coding style.

> +		return 0;
> +
>   	if (packet_show_netlink(f, NULL) == 0)
>   		return 0;
>
> @@ -3028,6 +3031,9 @@ static int netlink_show(struct filter *f)
>   	int rq, wq, rc;
>   	unsigned long long sk, cb;
>
> +	if (preferred_family != AF_NETLINK && !(f->states & (1<<SS_CLOSE)))

    Likewise.

> +		return 0;
> +
[...]

WBR, Sergei

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Jiri Pirko @ 2014-12-14 15:35 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <548D9B14.9010201@cumulusnetworks.com>

Sun, Dec 14, 2014 at 03:13:40PM CET, roopa@cumulusnetworks.com wrote:
>On 12/11/14, 2:25 PM, Jiri Pirko wrote:
>>Thu, Dec 11, 2014 at 07:27:32PM CET, roopa@cumulusnetworks.com wrote:
>>>On 12/11/14, 10:07 AM, Jiri Pirko wrote:
>>>>Thu, Dec 11, 2014 at 06:59:15PM CET, roopa@cumulusnetworks.com wrote:
>>>>>On 12/11/14, 9:11 AM, Jiri Pirko wrote:
>>>>>>Thu, Dec 11, 2014 at 05:52:10PM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>On 12/10/14, 1:37 AM, Jiri Pirko wrote:
>>>>>>>>Wed, Dec 10, 2014 at 10:05:18AM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>>>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>>>
>>>>>>>>>This patch adds two new api's netdev_switch_port_bridge_setlink
>>>>>>>>>and netdev_switch_port_bridge_dellink to offload bridge port attributes
>>>>>>>>>to switch asic
>>>>>>>>>
>>>>>>>>>(The names of the apis look odd with 'switch_port_bridge',
>>>>>>>>>but am more inclined to change the prefix of the api to something else.
>>>>>>>>>Will take any suggestions).
>>>>>>>>>
>>>>>>>>>The api's look at the NETIF_F_HW_NETFUNC_OFFLOAD feature flag to
>>>>>>>>>pass bridge port attributes to the port device.
>>>>>>>>>
>>>>>>>>>If the device has the NETIF_F_HW_NETFUNC_OFFLOAD, but does not support
>>>>>>>>>the bridge port attribute offload ndo, call bridge port attribute ndo's on
>>>>>>>>>the lowerdevs if supported. This is one way to pass bridge port attributes
>>>>>>>>>through stacked netdevs (example when bridge port is a bond and bond slaves
>>>>>>>>>are switch ports).
>>>>>>>>>
>>>>>>>>>Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>>>---
>>>>>>>>>include/net/switchdev.h   |    5 +++-
>>>>>>>>>net/switchdev/switchdev.c |   70 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>2 files changed, 74 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>>diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>>>>>>>>index 8a6d164..22676b6 100644
>>>>>>>>>--- a/include/net/switchdev.h
>>>>>>>>>+++ b/include/net/switchdev.h
>>>>>>>>>@@ -17,7 +17,10 @@
>>>>>>>>>int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>>>				struct netdev_phys_item_id *psid);
>>>>>>>>>int netdev_switch_port_stp_update(struct net_device *dev, u8 state);
>>>>>>>>>-
>>>>>>>>>+int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>>>+				struct nlmsghdr *nlh, u16 flags);
>>>>>>>>>+int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>>>+				struct nlmsghdr *nlh, u16 flags);
>>>>>>>>>#else
>>>>>>>>>
>>>>>>>>>static inline int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>>>diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>>>>>>>>>index d162b21..62317e1 100644
>>>>>>>>>--- a/net/switchdev/switchdev.c
>>>>>>>>>+++ b/net/switchdev/switchdev.c
>>>>>>>>>@@ -50,3 +50,73 @@ int netdev_switch_port_stp_update(struct net_device *dev, u8 state)
>>>>>>>>>	return ops->ndo_switch_port_stp_update(dev, state);
>>>>>>>>>}
>>>>>>>>>EXPORT_SYMBOL(netdev_switch_port_stp_update);
>>>>>>>>>+
>>>>>>>>>+/**
>>>>>>>>>+ *	netdev_switch_port_bridge_setlink - Notify switch device port of bridge
>>>>>>>>>+ *	port attributes
>>>>>>>>>+ *
>>>>>>>>>+ *	@dev: port device
>>>>>>>>>+ *	@nlh: netlink msg with bridge port attributes
>>>>>>>>>+ *
>>>>>>>>>+ *	Notify switch device port of bridge port attributes
>>>>>>>>>+ */
>>>>>>>>>+int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>>>+									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>>>+{
>>>>>>>>>+	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>>>+	struct net_device *lower_dev;
>>>>>>>>>+	struct list_head *iter;
>>>>>>>>>+	int ret = 0, err = 0;
>>>>>>>>>+
>>>>>>>>>+	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>>>+		return err;
>>>>>>>>>+
>>>>>>>>>+	if (ops->ndo_bridge_setlink) {
>>>>>>>>>+	    WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>>>+	    return ops->ndo_bridge_setlink(dev, nlh, flags);
>>>>>>>>	You have to change ndo_bridge_setlink in netdevice.h first.
>>>>>>>>	Otherwise when only this patch is applied (during bisection)
>>>>>>>>	this won't compile.
>>>>>>>ack, will fix it and keep that in mind next time.
>>>>>>>>>+	}
>>>>>>>>>+
>>>>>>>>>+	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>>>	I do not understand why to iterate over lower devices. At this
>>>>>>>>	stage we don't know a thing about this upper or its lowers. Let
>>>>>>>>	the uppers (/masters) to decide if this needs to be propagated
>>>>>>>>	or not.
>>>>>>>Jiri, In the stacked devices case, there is no way to propagate the bridge
>>>>>>>port attributes to switch device driver today (vlan and other bridge port
>>>>>>>attributes). Can you tell me if there is a way ?. no, ndo_vlan* ndo's are not
>>>>>>>useful here. Nor we should go and implement ndo_bridge_setlink* in all
>>>>>>>devices that can be bridge ports.
>>>>>>Hmm. I just think that is cleaner to implement ndo_bridge_setlink in
>>>>>>bonding for example and let it propagate the the call to slaves.
>>>>>No, that will require bridge attribute support in all drivers. And that is no
>>>>>good.
>>>>Not all drivers, just all masters which want to support this. Like bond,
>>>>team, macvlan etc. That would be the same as for
>>>>ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid/ndo_change_mtu etc. I do not
>>>>see any problem in that. It is much much clearer over big hammer iterate
>>>>over lowers in my opinion.
>>>You cannot avoid the lowerdev iteration in any case.
>>>If you added it in the individual drivers: bond, macvlan and other drivers
>>>will all have to do the same thing.
>>>ie Call bridge setlink on lowerdevs.
>>I feel that the right way is to let masters propagate that themselves in
>>their code.
>In this case no. Just because an interface is a port in a bridge, it is wrong
>to indicate that the interface driver needs to understand all bridge port
>configuration attributes. Note that with what you are asking for ...all
>bridge port drivers (bonds, vxlans) will also need to implement the netdev
>stp state update api.

I'm very well aware of this fact. But still, I'm convinced that the way
similar things are implemented now, using propagation inside particular
drivers (bond/team/etc), is the correct way to go. I do not see any
downside in that. But we are running in circles here. I would love to
hear the opinions of other people here.


>>That's it. I might be wrong of course.
>
>
>>>My patch avoids the need to modify these drivers. Besides it does this only
>>>when the OFFLOAD flag is set.
>>
>>Yep, well in my reply to another patch of you series I expressed my
>>feeling that the flag should be really checked in particular switch
>>driver, not core. But I might be wrong there as well...
>
>The bridge driver owns these attributes...and he needs to call the switchdev
>api to offload.
>And the condition for the switchdev api call is the offload flag. And the
>offload flag is part of the switchdev api.
>The drivers just set it on the netdev, they dont own the offload flag. So, I
>don't see a reason why the core should not
>know about the flag.

I do not understand the formulation "own the offload flag". What I am
saying is: let the bridge and others call the switchdev api
unconditionally and let the leaf drivers handle that as they see fit,
taking various facts into account, flags included. This way you avoid the
need for flag inheritance in stacked scenarios. Imagine the following
example:

bridge - bond --- eth1 
              --- eth2

eth1 and eth2 are switch ports. Now eth1 has the flag set and eth2 does
not. Should the bond have the flag set or not? And if it has, eth2 needs
to check the flag as well in order not to offload.

Implementing the inheritance correctly would be a small nightmare. So I
say, why not just let the leaves check and decide?
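
For illustration, here is a minimal userspace sketch of the "let the leaves
decide" idea for the layout above. It is only a model under stated
assumptions: struct dev, its offload field and setlink_leaf_decides() are
hypothetical names, not kernel APIs, and the real code would go through the
ndo ops instead.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOWER 4

/* Hypothetical stand-in for a net_device in a stacked layout. */
struct dev {
	int offload;                  /* models the OFFLOAD feature flag */
	int programmed;               /* set once a leaf offloads the setting */
	struct dev *lower[MAX_LOWER];
	size_t n_lower;               /* 0 means this is a leaf (switch port) */
};

/* Uppers forward the call unconditionally; only a leaf looks at its own
 * flag.  Returns the number of leaves that offloaded the setting. */
static int setlink_leaf_decides(struct dev *d)
{
	int count = 0;
	size_t i;

	assert(d != NULL);
	if (d->n_lower == 0) {
		d->programmed = d->offload;
		return d->offload ? 1 : 0;
	}
	for (i = 0; i < d->n_lower; i++)
		count += setlink_leaf_decides(d->lower[i]);
	return count;
}

/* bridge - bond - {eth1, eth2}, flag set on eth1 only.  Returns 1 when
 * exactly eth1 was programmed. */
static int demo_mixed_bond(void)
{
	struct dev eth1 = { .offload = 1 };
	struct dev eth2 = { .offload = 0 };
	struct dev bond = { .lower = { &eth1, &eth2 }, .n_lower = 2 };
	struct dev bridge = { .lower = { &bond }, .n_lower = 1 };
	int n = setlink_leaf_decides(&bridge);

	return n == 1 && eth1.programmed && !eth2.programmed;
}
```

In this model neither the bridge nor the bond carries a flag at all, so the
question of what the bond should inherit from a mixed set of slaves does not
come up.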


>
>What has been accepted in the kernel currently does not help bridge driver
>offloading to switchdev. It does help if you want to manage your switch
>device separately like you were already doing with nics. ie going to switch
>port driver directly. It does not help the stacked device case either.
>
>
>I will resubmit my series with the checkpatch errors you pointed out.
>
>And, am also looking at other ways to solve the problem.
>
>Thanks for the review.
>
>
>>>It will not stop at adding the ndo_bridge_setlink to bond/macvlan etc. It
>>>will be all other ndo_ops we will need for switch asics.
>>>It will be l3 tomorrow, if the route is through a bond (But at that point, we
>>>may end up having to introduce switch device instead of going to the port.
>>>Lets see).
>>>
>>>Today this patch introduces an abstract way to get to the switch driver by
>>>getting to the slave switch port (And only when the OFFLOAD flag is set).
>>>
>>>
>>>>
>>>>>>Let every "upper" to handle ndo_bridge_setlink their way. Sometimes it
>>>>>>might not make sense to propagate to "lowers".
>>>>>This does not really propagate to lowers. It is just trying to get to a
>>>>>switch port and from there to the switch driver.
>>>>>Example, bond driver does not need to care if its a bridge port. It will
>>>>>simply pass the call to its slave which
>>>>>might be a switch port.
>>>>>
>>>>>bond driver does not care if its a bridge port. But the switch driver cares,
>>>>>because it knows that the bond was created with switch ports.
>>>>>
>>>>>
>>>>>>>And this allows a switch driver to receive these callbacks if it has marked
>>>>>>>the switch port with an offload flag. Your way of using the switch port to
>>>>>>>get to the switch driver does not help in these cases.
>>>>>>I do not follow how this is related to this case (stacked layout).
>>>>>>
>>>>>>>The other option is to use the 'switch device (not port)' to get to the
>>>>>>>switch driver.
>>>>>>That would not help this case (stacked layout) I believe.
>>>>>>
>>>>>>
>>>>>>>This patch shows that you can still do this with the ndo ops.
>>>>>>>>>+		err = netdev_switch_port_bridge_setlink(lower_dev, nlh, flags);
>>>>>>>>>+		if (err)
>>>>>>>>>+			ret = err;
>>>>>>>>>+    }
>>>>>>>>  ^^^^^ Indent is off. This should be catched by scripts/checkpatch.pl.
>>>>>>>>
>>>>>>>>>+
>>>>>>>>>+	return ret;
>>>>>>>>>+}
>>>>>>>>>+EXPORT_SYMBOL(netdev_switch_port_bridge_setlink);
>>>>>>>>>+
>>>>>>>>>+/**
>>>>>>>>>+ *	netdev_switch_port_bridge_dellink - Notify switch device port of bridge
>>>>>>>>>+ *	attribute delete
>>>>>>>>>+ *
>>>>>>>>>+ *	@dev: port device
>>>>>>>>>+ *	@nlh: netlink msg with bridge port attributes
>>>>>>>>>+ *
>>>>>>>>>+ *	Notify switch device port of bridge port attribute delete
>>>>>>>>>+ */
>>>>>>>>>+int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>>>+									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>>>+{
>>>>>>>>>+	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>>>+	struct net_device *lower_dev;
>>>>>>>>>+	struct list_head *iter;
>>>>>>>>>+	int ret = 0, err = 0;
>>>>>>>>>+
>>>>>>>>>+	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>>>+		return err;
>>>>>>>>>+
>>>>>>>>>+	if (ops->ndo_bridge_dellink) {
>>>>>>>>>+		WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>>>+		return ops->ndo_bridge_dellink(dev, nlh, flags);
>>>>>>>>>+	}
>>>>>>>>>+
>>>>>>>>>+	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>>>>+		err = netdev_switch_port_bridge_dellink(lower_dev, nlh, flags);
>>>>>>>>>+		if (err)
>>>>>>>>>+			ret = err;
>>>>>>>>>+	}
>>>>>>>>>+
>>>>>>>>>+	return ret;
>>>>>>>>>+}
>>>>>>>>>+EXPORT_SYMBOL(netdev_switch_port_bridge_dellink);
>>>>>>>>>-- 
>>>>>>>>>1.7.10.4
>>>>>>>>>
>>>>>>--
>>>>>>To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>>>>the body of a message to majordomo@vger.kernel.org
>>>>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Or Gerlitz @ 2014-12-14 16:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, jiri, gospo, jhs, john.r.fastabend, Or Gerlitz

The current implementations all use dev_uc_add_excl() and similar, whose API
doesn't support vlans, so we can't support this with NIC HW for now.

Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |    5 +++++
 net/core/rtnetlink.c                        |    5 +++++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0a7ea4c..a5f2660 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7549,6 +7549,11 @@ static int i40e_ndo_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 	if (!(pf->flags & I40E_FLAG_SRIOV_ENABLED))
 		return -EOPNOTSUPP;
 
+	if (vid) {
+		pr_info("%s: vlans aren't supported yet for dev_uc|mc_add()\n", dev->name);
+		return -EINVAL;
+	}
+
 	/* Hardware does not support aging addresses so if a
 	 * ndm_state is given only allow permanent addresses
 	 */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d06107d..9cf6fe9 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2368,6 +2368,11 @@ int ndo_dflt_fdb_add(struct ndmsg *ndm,
 		return err;
 	}
 
+	if (vid) {
+		pr_info("%s: vlans aren't supported yet for dev_uc|mc_add()\n", dev->name);
+		return err;
+	}
+
 	if (is_unicast_ether_addr(addr) || is_link_local_ether_addr(addr))
 		err = dev_uc_add_excl(dev, addr);
 	else if (is_multicast_ether_addr(addr))
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH iproute2 REGRESSION] ss: Dont show netlink and packet sockets by default
From: vadim4j @ 2014-12-14 17:15 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: Vadim Kochan, netdev
In-Reply-To: <548D9E25.30100@cogentembedded.com>

On Sun, Dec 14, 2014 at 05:26:45PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 12/14/2014 12:36 PM, Vadim Kochan wrote:
> 
> >From: Vadim Kochan <vadim4j@gmail.com>
> 
> >Checking by SS_CLOSE state was removed in:
> 
> >     (45a4770bc0) ss: Remove checking SS_CLOSE state for packet and netlink
> 
> >which is not really correct because now by default all sockets are seen
> >when do 'ss'.
> 
> >Here is most correct fix which considers specified family.
> 
> >To see netlink sockets:
> >     ss -A netlink
> 
> >To see packet sockets:
> >     ss -A packet
> 
> >And ss by default will show only connected/established sockets as it
> >was before all the time.
> >
> >Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> >---
> >  misc/ss.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> 
> >diff --git a/misc/ss.c b/misc/ss.c
> >index e9927a5..6050ab6 100644
> >--- a/misc/ss.c
> >+++ b/misc/ss.c
> >@@ -2801,6 +2801,9 @@ static int packet_show(struct filter *f)
> >  	int ino;
> >  	unsigned long long sk;
> >
> >+	if (preferred_family != AF_PACKET && !(f->states & (1<<SS_CLOSE)))
> 
>   Please surround << with spaces, to be consistent with other operators and
> general kernel coding style.
> 
> >+		return 0;
> >+
> >  	if (packet_show_netlink(f, NULL) == 0)
> >  		return 0;
> >
> >@@ -3028,6 +3031,9 @@ static int netlink_show(struct filter *f)
> >  	int rq, wq, rc;
> >  	unsigned long long sk, cb;
> >
> >+	if (preferred_family != AF_NETLINK && !(f->states & (1<<SS_CLOSE)))
> 
>    Likewise.
> 
> >+		return 0;
> >+
> [...]
> 
> WBR, Sergei
> 
OK, I just restored the removed code as it was, but I agree to correct it, thanks.

^ permalink raw reply

* [PATCH iproute2 REGRESSION v2] ss: Dont show netlink and packet sockets by default
From: Vadim Kochan @ 2014-12-14 17:23 UTC (permalink / raw)
  To: netdev; +Cc: Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

The SS_CLOSE state check was removed in:

    (45a4770bc0) ss: Remove checking SS_CLOSE state for packet and netlink

which is not really correct, because now all sockets are shown by default
when running 'ss'.

Here is a more correct fix, which takes the specified family into account.

To see netlink sockets:
    ss -A netlink

To see packet sockets:
    ss -A packet

By default, ss will again show only connected/established sockets, as it
always did before.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
 misc/ss.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index e9927a5..8f39eb8 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2801,6 +2801,9 @@ static int packet_show(struct filter *f)
 	int ino;
 	unsigned long long sk;
 
+	if (preferred_family != AF_PACKET && !(f->states & (1 << SS_CLOSE)))
+		return 0;
+
 	if (packet_show_netlink(f, NULL) == 0)
 		return 0;
 
@@ -3028,6 +3031,9 @@ static int netlink_show(struct filter *f)
 	int rq, wq, rc;
 	unsigned long long sk, cb;
 
+	if (preferred_family != AF_NETLINK && !(f->states & (1 << SS_CLOSE)))
+		return 0;
+
 	if (!getenv("PROC_NET_NETLINK") && !getenv("PROC_ROOT") &&
 		netlink_show_netlink(f, NULL) == 0)
 		return 0;
-- 
2.1.3
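
For readers following the thread, the check added in both hunks can be
modelled in isolation roughly as below. This is a sketch under assumptions:
family_is_shown() and SS_CLOSE_SKETCH are hypothetical names, the numeric
value of the state index is made up, and how the command-line options end
up in preferred_family is not shown in the patch, so it is not modelled here.

```c
#include <assert.h>
#include <sys/socket.h>   /* AF_UNSPEC; AF_PACKET/AF_NETLINK on Linux */

/* Fallbacks for systems whose headers lack the Linux-specific families. */
#ifndef AF_NETLINK
#define AF_NETLINK 16
#endif
#ifndef AF_PACKET
#define AF_PACKET 17
#endif

/* Hypothetical stand-in for the SS_CLOSE state index from misc/ss.c;
 * the real enum value may differ. */
enum { SS_CLOSE_SKETCH = 7 };

/* Returns 1 when sockets of 'family' should be listed, 0 when the
 * show routine would return early. */
static int family_is_shown(int preferred_family, int family,
			   unsigned int states)
{
	if (preferred_family != family &&
	    !(states & (1U << SS_CLOSE_SKETCH)))
		return 0;
	return 1;
}

/* Self-check of the three interesting cases. */
static void demo_family_filter(void)
{
	/* No family requested, closed state not selected: skipped. */
	assert(!family_is_shown(AF_UNSPEC, AF_PACKET, 0));
	/* Family requested explicitly: shown. */
	assert(family_is_shown(AF_NETLINK, AF_NETLINK, 0));
	/* Closed state selected in the filter: shown. */
	assert(family_is_shown(AF_UNSPEC, AF_NETLINK, 1U << SS_CLOSE_SKETCH));
}
```

So a packet or netlink socket is listed only if its family was asked for, or
if the state filter includes the closed state.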

^ permalink raw reply related

* Re: [PATCH ethtool v2 2/3] ethtool: Add copybreak support
From: Ben Hutchings @ 2014-12-14 17:46 UTC (permalink / raw)
  To: Govindarajulu Varadarajan; +Cc: netdev, ogerlitz, yevgenyp
In-Reply-To: <1412637141-3205-3-git-send-email-_govind@gmx.com>

[-- Attachment #1: Type: text/plain, Size: 1987 bytes --]

On Tue, 2014-10-07 at 04:42 +0530, Govindarajulu Varadarajan wrote:
> This patch adds support for setting/getting driver's rx_copybreak value.
> copybreak is set/get using new ethtool tunable interface.
> 
> This was added to net-next in
> commit: f0db9b073415848709dd59a6394969882f517da9
> 
> 	ethtool: Add generic options for tunables
> 
> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
> ---
>  ethtool.c | 177 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 177 insertions(+)
> 
> diff --git a/ethtool.c b/ethtool.c
> index bf583f3..4045356 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -179,6 +179,12 @@ static const struct flag_info flags_msglvl[] = {
>  	{ "wol",	NETIF_MSG_WOL },
>  };
>  
> +static const char *tunable_name[] = {
> +	[ETHTOOL_ID_UNSPEC]	= "Unspec",
> +	[ETHTOOL_RX_COPYBREAK]	= "rx",
> +	[ETHTOOL_TX_COPYBREAK]	= "tx",
> +};

Tunables should be named by a string set defined in the kernel.

[...]
> @@ -4055,6 +4228,10 @@ static const struct option {
>           "             [ rx-mini N ]\n"
>           "             [ rx-jumbo N ]\n"
>           "             [ tx N ]\n" },
> +       { "-b|--show-copybreak", 1, do_gcopybreak, "Show copybreak values" },
> +       { "-B|--set-copybreak", 1, do_scopybreak, "Set copybreak values",
> +         "             [ rx N]\n"
> +         "             [ tx N]\n" },
>         { "-k|--show-features|--show-offload", 1, do_gfeatures,
>           "Get state of protocol offload and other features" },
>         { "-K|--features|--offload", 1, do_sfeatures,
[...]

I don't think this is worth two options of its own.  You should be able
to add generic get/set-tunable options.  You'll need to get the string
set to find out the names of tunables.  When setting a tunable, you'll
need to get it first to find out its type.
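
A rough sketch of the name lookup step, assuming the string set has already
been fetched into a flat buffer of fixed-width, NUL-padded entries and that
a tunable's id is its index in that set. find_tunable_id() and GSTRING_LEN
are hypothetical names for this sketch, the sample names are illustrative,
and the ioctl that fetches the set is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GSTRING_LEN 32   /* same width as ETH_GSTRING_LEN */

/* Look up 'name' in 'count' fixed-width entries.  Returns the index of
 * the matching entry, or -1 if the name is not in the set. */
static int find_tunable_id(const char *strings, size_t count,
			   const char *name)
{
	size_t i;

	assert(strings != NULL && name != NULL);
	for (i = 0; i < count; i++) {
		if (strncmp(strings + i * GSTRING_LEN, name,
			    GSTRING_LEN) == 0)
			return (int)i;
	}
	return -1;
}

/* Build a small sample set and look a name up in it. */
static int demo_lookup(const char *name)
{
	char set[3 * GSTRING_LEN] = { 0 };

	strcpy(set + 0 * GSTRING_LEN, "Unspec");
	strcpy(set + 1 * GSTRING_LEN, "rx_copybreak");
	strcpy(set + 2 * GSTRING_LEN, "tx_copybreak");
	return find_tunable_id(set, 3, name);
}
```

With something like this the tool needs no name table of its own, and a new
tunable added to the kernel becomes usable without patching ethtool.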

Ben.

-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [ethtool PATCH 1/1] bug fix: SFP Tx BIAS uses memory wrong offset
From: Ben Hutchings @ 2014-12-14 17:48 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: bwh, netdev
In-Reply-To: <1413367740-11118-1-git-send-email-jhs@emojatatu.com>

[-- Attachment #1: Type: text/plain, Size: 927 bytes --]

On Wed, 2014-10-15 at 06:09 -0400, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim <jhs@mojatatu.com>
> 
> SFF-8472 rev 12.0 indicates the SFP BIAS is at offset 100 of page A2
> 
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>

Applied, thanks.

Ben.

> ---
>  sfpdiag.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sfpdiag.c b/sfpdiag.c
> index 812a2fa..a3dbc9b 100644
> --- a/sfpdiag.c
> +++ b/sfpdiag.c
> @@ -48,7 +48,7 @@
>  #define SFF_A2_VCC_HWARN                  12
>  #define SFF_A2_VCC_LWARN                  14
>  
> -#define SFF_A2_BIAS                       96
> +#define SFF_A2_BIAS                       100
>  #define SFF_A2_BIAS_HALRM                 16
>  #define SFF_A2_BIAS_LALRM                 18
>  #define SFF_A2_BIAS_HWARN                 20
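
To show what the offset change affects, here is a small sketch of decoding
the Tx bias field from a raw A2 page. It assumes the field is a 16-bit
big-endian unsigned value scaled at 2 uA per LSB; the helper names and
A2_PAGE_LEN are hypothetical and not taken from sfpdiag.c.

```c
#include <assert.h>
#include <stdint.h>

#define A2_PAGE_LEN       256
#define A2_TX_BIAS_OFFSET 100   /* offset stated in the commit message */

/* Read a big-endian 16-bit field from the A2 page buffer. */
static uint16_t a2_read_u16(const uint8_t *a2, unsigned int offset)
{
	assert(offset + 1 < A2_PAGE_LEN);
	return (uint16_t)((a2[offset] << 8) | a2[offset + 1]);
}

/* Tx bias in microamps, assuming 2 uA per LSB. */
static uint32_t tx_bias_ua(const uint8_t *a2)
{
	return 2u * a2_read_u16(a2, A2_TX_BIAS_OFFSET);
}

/* Sample page: raw value 0x0BB8 (3000) at offset 100 decodes to 6000 uA. */
static uint32_t demo_bias(void)
{
	uint8_t a2[A2_PAGE_LEN] = { 0 };

	a2[96] = 0x1A;    /* unrelated data at the old, wrong offset */
	a2[100] = 0x0B;
	a2[101] = 0xB8;
	return tx_bias_ua(a2);
}
```

Reading from offset 96 instead would decode a different field of the page
(the temperature, if I read the spec correctly) as if it were a current.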

-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [ethtool][PATCH] Fix build with musl by using more common typedefs
From: Ben Hutchings @ 2014-12-14 18:37 UTC (permalink / raw)
  To: Paul Barker; +Cc: Ben Hutchings, netdev, John Spencer
In-Reply-To: <1414323849-5739-2-git-send-email-paul@paulbarker.me.uk>

[-- Attachment #1: Type: text/plain, Size: 1510 bytes --]

On Sun, 2014-10-26 at 11:44 +0000, Paul Barker wrote:
> When using musl as the standard C library, type names such as '__int32_t' are
> not defined. Instead we must use the more commonly defined type names such as
> 'int32_t', which are defined in <stdint.h>.
> 
> Signed-off-by: John Spencer <maillist-linux@barfooze.de>
> Signed-off-by: Paul Barker <paul@paulbarker.me.uk>

Applied, thanks.

Ben.

> ---
>  internal.h | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/internal.h b/internal.h
> index a9dfae0..262a39f 100644
> --- a/internal.h
> +++ b/internal.h
> @@ -7,6 +7,7 @@
>  #include "ethtool-config.h"
>  #endif
>  #include <stdio.h>
> +#include <stdint.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/types.h>
> @@ -17,16 +18,16 @@
>  
>  /* ethtool.h expects these to be defined by <linux/types.h> */
>  #ifndef HAVE_BE_TYPES
> -typedef __uint16_t __be16;
> -typedef __uint32_t __be32;
> +typedef uint16_t __be16;
> +typedef uint32_t __be32;
>  typedef unsigned long long __be64;
>  #endif
>  
>  typedef unsigned long long u64;
> -typedef __uint32_t u32;
> -typedef __uint16_t u16;
> -typedef __uint8_t u8;
> -typedef __int32_t s32;
> +typedef uint32_t u32;
> +typedef uint16_t u16;
> +typedef uint8_t u8;
> +typedef int32_t s32;
>  
>  #include "ethtool-copy.h"
>  #include "net_tstamp-copy.h"

-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Jiri Pirko @ 2014-12-14 19:23 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David S. Miller, netdev, gospo, jhs, john.r.fastabend
In-Reply-To: <1418573945-27840-1-git-send-email-ogerlitz@mellanox.com>

Sun, Dec 14, 2014 at 05:19:05PM CET, ogerlitz@mellanox.com wrote:
>The current implementations all use dev_uc_add_excl() and such whose API
>doesn't support vlans, so we can't make it with NICs HW for now.
>
>Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')

Maybe I'm missing something, but this commit did not introduce the
problem. It was there already before, when NDA_VLAN was set and ignored.

But other than this, I like the patch.

Reviewed-by: Jiri Pirko <jiri@resnulli.us>



>Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>---
> drivers/net/ethernet/intel/i40e/i40e_main.c |    5 +++++
> net/core/rtnetlink.c                        |    5 +++++
> 2 files changed, 10 insertions(+), 0 deletions(-)
>
>diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>index 0a7ea4c..a5f2660 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>@@ -7549,6 +7549,11 @@ static int i40e_ndo_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
> 	if (!(pf->flags & I40E_FLAG_SRIOV_ENABLED))
> 		return -EOPNOTSUPP;
> 
>+	if (vid) {
>+		pr_info("%s: vlans aren't supported yet for dev_uc|mc_add()\n", dev->name);
>+		return -EINVAL;
>+	}
>+
> 	/* Hardware does not support aging addresses so if a
> 	 * ndm_state is given only allow permanent addresses
> 	 */
>diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>index d06107d..9cf6fe9 100644
>--- a/net/core/rtnetlink.c
>+++ b/net/core/rtnetlink.c
>@@ -2368,6 +2368,11 @@ int ndo_dflt_fdb_add(struct ndmsg *ndm,
> 		return err;
> 	}
> 
>+	if (vid) {
>+		pr_info("%s: vlans aren't supported yet for dev_uc|mc_add()\n", dev->name);
>+		return err;
>+	}
>+
> 	if (is_unicast_ether_addr(addr) || is_link_local_ether_addr(addr))
> 		err = dev_uc_add_excl(dev, addr);
> 	else if (is_multicast_ether_addr(addr))
>-- 
>1.7.1
>

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-14 19:41 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <20141214153549.GA2174@nanopsycho.orion>

On 12/14/14, 7:35 AM, Jiri Pirko wrote:
> Sun, Dec 14, 2014 at 03:13:40PM CET, roopa@cumulusnetworks.com wrote:
>> On 12/11/14, 2:25 PM, Jiri Pirko wrote:
>>> Thu, Dec 11, 2014 at 07:27:32PM CET, roopa@cumulusnetworks.com wrote:
>>>> On 12/11/14, 10:07 AM, Jiri Pirko wrote:
>>>>> Thu, Dec 11, 2014 at 06:59:15PM CET, roopa@cumulusnetworks.com wrote:
>>>>>> On 12/11/14, 9:11 AM, Jiri Pirko wrote:
>>>>>>> Thu, Dec 11, 2014 at 05:52:10PM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>> On 12/10/14, 1:37 AM, Jiri Pirko wrote:
>>>>>>>>> Wed, Dec 10, 2014 at 10:05:18AM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>>>>
>>>>>>>>>> This patch adds two new api's netdev_switch_port_bridge_setlink
>>>>>>>>>> and netdev_switch_port_bridge_dellink to offload bridge port attributes
>>>>>>>>>> to switch asic
>>>>>>>>>>
>>>>>>>>>> (The names of the apis look odd with 'switch_port_bridge',
>>>>>>>>>> but am more inclined to change the prefix of the api to something else.
>>>>>>>>>> Will take any suggestions).
>>>>>>>>>>
>>>>>>>>>> The api's look at the NETIF_F_HW_NETFUNC_OFFLOAD feature flag to
>>>>>>>>>> pass bridge port attributes to the port device.
>>>>>>>>>>
>>>>>>>>>> If the device has the NETIF_F_HW_NETFUNC_OFFLOAD, but does not support
>>>>>>>>>> the bridge port attribute offload ndo, call bridge port attribute ndo's on
>>>>>>>>>> the lowerdevs if supported. This is one way to pass bridge port attributes
>>>>>>>>>> through stacked netdevs (example when bridge port is a bond and bond slaves
>>>>>>>>>> are switch ports).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>>>>> ---
>>>>>>>>>> include/net/switchdev.h   |    5 +++-
>>>>>>>>>> net/switchdev/switchdev.c |   70 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>> 2 files changed, 74 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>>>>>>>>> index 8a6d164..22676b6 100644
>>>>>>>>>> --- a/include/net/switchdev.h
>>>>>>>>>> +++ b/include/net/switchdev.h
>>>>>>>>>> @@ -17,7 +17,10 @@
>>>>>>>>>> int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>>>> 				struct netdev_phys_item_id *psid);
>>>>>>>>>> int netdev_switch_port_stp_update(struct net_device *dev, u8 state);
>>>>>>>>>> -
>>>>>>>>>> +int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>>>> +				struct nlmsghdr *nlh, u16 flags);
>>>>>>>>>> +int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>>>> +				struct nlmsghdr *nlh, u16 flags);
>>>>>>>>>> #else
>>>>>>>>>>
>>>>>>>>>> static inline int netdev_switch_parent_id_get(struct net_device *dev,
>>>>>>>>>> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>>>>>>>>>> index d162b21..62317e1 100644
>>>>>>>>>> --- a/net/switchdev/switchdev.c
>>>>>>>>>> +++ b/net/switchdev/switchdev.c
>>>>>>>>>> @@ -50,3 +50,73 @@ int netdev_switch_port_stp_update(struct net_device *dev, u8 state)
>>>>>>>>>> 	return ops->ndo_switch_port_stp_update(dev, state);
>>>>>>>>>> }
>>>>>>>>>> EXPORT_SYMBOL(netdev_switch_port_stp_update);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + *	netdev_switch_port_bridge_setlink - Notify switch device port of bridge
>>>>>>>>>> + *	port attributes
>>>>>>>>>> + *
>>>>>>>>>> + *	@dev: port device
>>>>>>>>>> + *	@nlh: netlink msg with bridge port attributes
>>>>>>>>>> + *
>>>>>>>>>> + *	Notify switch device port of bridge port attributes
>>>>>>>>>> + */
>>>>>>>>>> +int netdev_switch_port_bridge_setlink(struct net_device *dev,
>>>>>>>>>> +									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>>>> +{
>>>>>>>>>> +	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>>>> +	struct net_device *lower_dev;
>>>>>>>>>> +	struct list_head *iter;
>>>>>>>>>> +	int ret = 0, err = 0;
>>>>>>>>>> +
>>>>>>>>>> +	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>>>> +		return err;
>>>>>>>>>> +
>>>>>>>>>> +	if (ops->ndo_bridge_setlink) {
>>>>>>>>>> +	    WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>>>> +	    return ops->ndo_bridge_setlink(dev, nlh, flags);
>>>>>>>>> 	You have to change ndo_bridge_setlink in netdevice.h first.
>>>>>>>>> 	Otherwise when only this patch is applied (during bisection)
>>>>>>>>> 	this won't compile.
>>>>>>>> ack, will fix it and keep that in mind next time.
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>>>> 	I do not understand why to iterate over lower devices. At this
>>>>>>>>> 	stage we don't know a thing about this upper or its lowers. Let
>>>>>>>>> 	the uppers (/masters) to decide if this needs to be propagated
>>>>>>>>> 	or not.
>>>>>>>> Jiri, In the stacked devices case, there is no way to propagate the bridge
>>>>>>>> port attributes to switch device driver today (vlan and other bridge port
>>>>>>>> attributes). Can you tell me if there is a way ?. no, ndo_vlan* ndo's are not
>>>>>>>> useful here. Nor we should go and implement ndo_bridge_setlink* in all
>>>>>>>> devices that can be bridge ports.
>>>>>>> Hmm. I just think that is cleaner to implement ndo_bridge_setlink in
>>>>>>> bonding for example and let it propagate the the call to slaves.
>>>>>> No, that will require bridge attribute support in all drivers. And that is no
>>>>>> good.
>>>>> Not all drivers, just all masters which want to support this. Like bond,
>>>>> team, macvlan etc. That would be the same as for
>>>>> ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid/ndo_change_mtu etc. I do not
>>>>> see any problem in that. It is much much clearer over big hammer iterate
>>>>> over lowers in my opinion.
>>>> You cannot avoid the lowerdev iteration in any case.
>>>> If you added it in the individual drivers: bond, macvlan and other drivers
>>>> will all have to do the same thing.
>>>> ie Call bridge setlink on lowerdevs.
>>> I feel that the right way is to let masters propagate that themselves in
>>> their code.
>> In this case no. Just because an interface is a port in a bridge, it is wrong
>> to indicate that the interface driver needs to understand all bridge port
>> configuration attributes. Note that with what you are asking for ...all
>> bridge port drivers (bonds, vxlans) will also need to implement the netdev
>> stp state update api.
> I'm very well aware of this fact. But still, I'm convinced that the way
> similar things are implemented now, using propagation inside particular
> drivers (bond/team/etc) is the correct way to go. I do not see any
> downside in that. But we are running in circles here. I would love to
> hear opinion of other people here.

I know you are aware of this fact, but just stating again that we are
also talking about L3 ops here, which don't make sense to have in all
drivers. Maybe I am being very conservative.

And yes, I agree we are going in circles :). And yes, I would love to hear
other opinions as well.


>
>
>>> That's it. I might be wrong of course.
>>
>>>> My patch avoids the need to modify these drivers. Besides it does this only
>>>> when the OFFLOAD flag is set.
>>> Yep, well in my reply to another patch of you series I expressed my
>>> feeling that the flag should be really checked in particular switch
>>> driver, not core. But I might be wrong there as well...
>> The bridge driver owns these attributes...and he needs to call the switchdev
>> api to offload.
>> And the condition for the switchdev api call is the offload flag. And the
>> offload flag is part of the switchdev api.
>> The drivers just set it on the netdev, they dont own the offload flag. So, I
>> don't see a reason why the core should not
>> know about the flag.
> I do not understand the formulation "own the offload flag". What I say
> is let the bridge/others to call the switchdev api unconditionally and
> let the leaf drivers handle that as they see fit, taking various facts
> into account, flags included. This way you avoid the need for flags
> inheritance in stacked scenarios. Imagine following example:
>
> bridge - bond --- eth1
>                --- eth2
>
> eth1 and eth2 are switch ports. Now eth1 has the flag set and eth2 does
> not. Should the bond have the flag set or not? And if it has, eth2 need
> to check the flag as well to do not offload.
>
> Implementing the inheritance correctly would be a small nightmare. So I
> say, why don't just let the leafs to check and decide.
Note that I wasn't really considering inheritance as the major factor in
our argument. If that needs to be dropped, I can consider that. But the
flag is there for a reason: to avoid unconditionally calling lowerdev
ndo's, i.e. to reduce some overhead.

And in the above case, the bond does get the flag, because it has a
lowerdev with the feature flag.
And note that the inheritance does not come with a cost. This is all
existing code. If I hadn't mentioned it, it would probably have gone
unnoticed ;).

I just had to add the patch below. The existing calls to
netdev_update_features in bridge ndo_add/del_slave take care of
inheriting the feature flags from the slaves.
Again, I did not introduce any complexity here. The feature flag is just
there to reduce some overhead in unnecessarily traversing all lowerdevs.


@@ -159,7 +161,8 @@ enum {
   */
  #define NETIF_F_ONE_FOR_ALL	(NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ROBUST | \
  				 NETIF_F_SG | NETIF_F_HIGHDMA |		\
-				 NETIF_F_FRAGLIST | NETIF_F_VLAN_CHALLENGED)
+				 NETIF_F_FRAGLIST | NETIF_F_VLAN_CHALLENGED | \
+				 NETIF_F_HW_NETFUNC_OFFLOAD)
  /*
   * If one device doesn't support one of these features, then disable it
   * for all in netdev_increment_features.
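
To make the inheritance concrete, here is a simplified userspace model of
how a "one for all" bit ends up on the upper device. The bit positions and
the names increment_features() and bond_features() are made up for this
sketch, and only the OR-accumulation step is modelled, not the rest of the
real netdev_increment_features().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions; the real NETIF_F_* values differ. */
#define F_SG              (1u << 0)
#define F_NETFUNC_OFFLOAD (1u << 1)
#define F_ONE_FOR_ALL     (F_SG | F_NETFUNC_OFFLOAD)

/* Fold one lower device's features into the upper's accumulated set:
 * a ONE_FOR_ALL bit is set on the upper as soon as any lower has it. */
static uint32_t increment_features(uint32_t all, uint32_t one,
				   uint32_t mask)
{
	return all | (one & F_ONE_FOR_ALL & mask);
}

/* Bond over eth1 (flag set) and eth2 (flag clear). */
static uint32_t bond_features(void)
{
	uint32_t eth1 = F_NETFUNC_OFFLOAD;
	uint32_t eth2 = 0;
	uint32_t all = 0;

	all = increment_features(all, eth1, F_ONE_FOR_ALL);
	all = increment_features(all, eth2, F_ONE_FOR_ALL);
	assert(all & F_NETFUNC_OFFLOAD);
	return all;
}
```

In this model the bond ends up with the offload bit even though eth2 does
not have it, which matches the case described above.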




>
>
>> What has been accepted in the kernel currently does not help bridge driver
>> offloading to switchdev. It does help if you want to manage your switch
>> device separately like you were already doing with nics. ie going to switch
>> port driver directly. It does not help the stacked device case either.
>>
>>
>> I will resubmit my series with the checkpatch errors you pointed out.
>>
>> And, am also looking at other ways to solve the problem.
>>
>> Thanks for the review.
>>
>>
>>>> It will not stop at adding the ndo_bridge_setlink to bond/macvlan etc. It
>>>> will be all other ndo_ops we will need for switch asics.
>>>> It will be l3 tomorrow, if the route is through a bond (But at that point, we
>>>> may end up having to introduce switch device instead of going to the port.
>>>> Lets see).
>>>>
>>>> Today this patch introduces an abstract way to get to the switch driver by
>>>> getting to the slave switch port (And only when the OFFLOAD flag is set).
>>>>
>>>>
>>>>>>> Let every "upper" to handle ndo_bridge_setlink their way. Sometimes it
>>>>>>> might not make sense to propagate to "lowers".
>>>>>> This does not really propagate to lowers. It is just trying to get to a
>>>>>> switch port and from there to the switch driver.
>>>>>> Example, bond driver does not need to care if its a bridge port. It will
>>>>>> simply pass the call to its slave which
>>>>>> might be a switch port.
>>>>>>
>>>>>> bond driver does not care if its a bridge port. But the switch driver cares,
>>>>>> because it knows that the bond was created with switch ports.
>>>>>>
>>>>>>
>>>>>>>> And this allows a switch driver to receive these callbacks if it has marked
>>>>>>>> the switch port with an offload flag. Your way of using the switch port to
>>>>>>>> get to the switch driver does not help in these cases.
>>>>>>> I do not follow how this is related to this case (stacked layout).
>>>>>>>
>>>>>>>> The other option is to use the 'switch device (not port)' to get to the
>>>>>>>> switch driver.
>>>>>>> That would not help this case (stacked layout) I believe.
>>>>>>>
>>>>>>>
>>>>>>>> This patch shows that you can still do this with the ndo ops.
>>>>>>>>>> +		err = netdev_switch_port_bridge_setlink(lower_dev, nlh, flags);
>>>>>>>>>> +		if (err)
>>>>>>>>>> +			ret = err;
>>>>>>>>>> +    }
>>>>>>>>>   ^^^^^ Indent is off. This should be catched by scripts/checkpatch.pl.
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL(netdev_switch_port_bridge_setlink);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + *	netdev_switch_port_bridge_dellink - Notify switch device port of bridge
>>>>>>>>>> + *	attribute delete
>>>>>>>>>> + *
>>>>>>>>>> + *	@dev: port device
>>>>>>>>>> + *	@nlh: netlink msg with bridge port attributes
>>>>>>>>>> + *
>>>>>>>>>> + *	Notify switch device port of bridge port attribute delete
>>>>>>>>>> + */
>>>>>>>>>> +int netdev_switch_port_bridge_dellink(struct net_device *dev,
>>>>>>>>>> +									  struct nlmsghdr *nlh, u16 flags)
>>>>>>>>>> +{
>>>>>>>>>> +	const struct net_device_ops *ops = dev->netdev_ops;
>>>>>>>>>> +	struct net_device *lower_dev;
>>>>>>>>>> +	struct list_head *iter;
>>>>>>>>>> +	int ret = 0, err = 0;
>>>>>>>>>> +
>>>>>>>>>> +	if (!(dev->features & NETIF_F_HW_NETFUNC_OFFLOAD))
>>>>>>>>>> +		return err;
>>>>>>>>>> +
>>>>>>>>>> +	if (ops->ndo_bridge_dellink) {
>>>>>>>>>> +		WARN_ON(!ops->ndo_switch_parent_id_get);
>>>>>>>>>> +		return ops->ndo_bridge_dellink(dev, nlh, flags);
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	netdev_for_each_lower_dev(dev, lower_dev, iter) {
>>>>>>>>>> +		err = netdev_switch_port_bridge_dellink(lower_dev, nlh, flags);
>>>>>>>>>> +		if (err)
>>>>>>>>>> +			ret = err;
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL(netdev_switch_port_bridge_dellink);
>>>>>>>>>> -- 
>>>>>>>>>> 1.7.10.4
>>>>>>>>>>

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Or Gerlitz @ 2014-12-14 20:14 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Andy Gospodarek,
	Jamal Hadi Salim, John Fastabend
In-Reply-To: <20141214192351.GA1850@nanopsycho.orion>

On Sun, Dec 14, 2014 at 9:23 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Dec 14, 2014 at 05:19:05PM CET, ogerlitz@mellanox.com wrote:
>>The current implementations all use dev_uc_add_excl() and such whose API
>>doesn't support vlans, so we can't make it with NICs HW for now.
>>
>>Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
>
> Maybe I'm missing something, but this commit did not introduce the
> problem.

This commit introduced the vid param to ndo_fdb_add and ndo_dflt_fdb_add,
which go on to call the dev_uc_add APIs... so it did introduce the
ability to pass a VID into these APIs, right? And we want to protect
against anyone using this ability at this point.

> It was there already before, when NDA_VLAN was set and ignored.
> But other than this, I like the patch.
>
> Reviewed-by: Jiri Pirko <jiri@resnulli.us>

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Jiri Pirko @ 2014-12-14 22:33 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Andy Gospodarek,
	Jamal Hadi Salim, John Fastabend
In-Reply-To: <CAJ3xEMhbic5w4d0dGsTMRDoq5Vx8wqZRFpy-q+8HaeDfYciUqw@mail.gmail.com>

Sun, Dec 14, 2014 at 09:14:27PM CET, gerlitz.or@gmail.com wrote:
>On Sun, Dec 14, 2014 at 9:23 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Sun, Dec 14, 2014 at 05:19:05PM CET, ogerlitz@mellanox.com wrote:
>>>The current implementations all use dev_uc_add_excl() and such whose API
>>>doesn't support vlans, so we can't make it with NICs HW for now.
>>>
>>>Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
>>
>> Maybe I'm missing something, but this commit did not introduce the
>> problem.
>
>This commit introduced the vid param to ndo_fdb_add and ndo_dflt_fdb_add,
>which go on to call the dev_uc_add APIs... so it did introduce the
>ability to pass a VID into these APIs, right? And we want to protect
>against anyone using this ability at this point.

That is an in-kernel change. As far as userspace is concerned, that commit is
a no-op: if userspace fills in NDA_VLAN, it is ignored before the commit as
well as after. No behaviour change, just more or less cosmetics.
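
For readers following the thread, here is a minimal user-space sketch of the guard under discussion: a default FDB-add path that refuses a non-zero VLAN ID because the address list underneath cannot store one. The function name `toy_dflt_fdb_add_check()` and its bare `vid` argument are assumptions made for illustration; this is not the code from the patch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the check discussed in this thread: the default
 * FDB add path ends up in an address-list API that has no VLAN field,
 * so a request carrying a non-zero VID is refused rather than being
 * accepted with the VLAN silently dropped.
 */
static int toy_dflt_fdb_add_check(uint16_t vid)
{
	if (vid != 0)
		return -EINVAL;	/* VLAN-qualified entries are not supported */

	return 0;		/* caller may go on to add the address */
}
```

With a check like this, a request with vid 0 proceeds as before, while a request with any other vid fails visibly instead of being ignored.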

>
>> It was there already before, when NDA_VLAN was set and ignored.
>> But other than this, I like the patch.
>>
>> Reviewed-by: Jiri Pirko <jiri@resnulli.us>

^ permalink raw reply

* Re: [PATCH] bonding: move ipoib_header_ops to vmlinux
From: Wengang @ 2014-12-15  1:12 UTC (permalink / raw)
  To: David Miller, jay.vosburgh-Z7WLFzj8eWMS+FvcfC7Uqw
  Cc: ogerlitz-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <547E6C70.7040809-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

Anyone please respond to this?

thanks,
wengang

On 2014-12-03 09:50, Wengang Wang wrote:
> Hi David and Jay,
>
> Then what about the change in this patch?
>
> thanks,
> wengang
>
> On 2014-11-26 09:30, Wengang wrote:
>> On 2014-11-26 02:44, David Miller wrote:
>>> From: Jay Vosburgh <jay.vosburgh-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>>> Date: Tue, 25 Nov 2014 10:41:17 -0800
>>>
>>>> Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>>>
>>>>> On 11/25/2014 8:07 AM, David Miller wrote:
>>>>>> IPOIB should not work over bonding as it requires that the device
>>>>>> use ARPHRD_ETHER.
>>>>> Hi Dave,
>>>>>
>>>>> IPoIB devices can be enslaved to both bonding and teaming in their HA mode,
>>>>> the bond device type becomes ARPHRD_INFINIBAND when this happens.
>>>>     The point was that pktgen disallows ARPHRD_INFINIBAND, not that
>>>> bonding does.
>>>>
>>>>     Pktgen specifically checks for type != ARPHRD_ETHER, so the
>>>> IPoIB bond should not be able to be used with pktgen.  My suspicion is
>>>> that pktgen is being configured on the bond first, then an IPoIB slave
>>>> is added to the bond; this would change its type in a way that pktgen
>>>> wouldn't notice.
>>> +1
>>
>> I think it goes this way:
>>
>> 1) bond_master is ready.
>> 2) bond_enslave() enslaves an IPoIB interface and calls bond_setup_by_slave().
>> 3) bond_setup_by_slave() then changes the master's type to ARPHRD_INFINIBAND.
>>
>> The code looks like this:
>>
>> /* enslave device <slave> to bond device <master> */
>> int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>> {
>> 	<snip>...
>> 	/* set bonding device ether type by slave - bonding netdevices are
>> 	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
>> 	 * there is a need to override some of the type dependent attribs/funcs.
>> 	 *
>> 	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
>> 	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
>> 	 */
>> 	if (!bond_has_slaves(bond)) {
>> 		if (bond_dev->type != slave_dev->type) {
>> 			<snip>...
>> 			if (slave_dev->type != ARPHRD_ETHER)
>> 				bond_setup_by_slave(bond_dev, slave_dev);
>> 			else {
>> 				ether_setup(bond_dev);
>> 				bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING;
>> 			}
>>
>> 			call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE,
>> 						 bond_dev);
>> 		}
>> 	<snip>...
>> }
>>
>> static void bond_setup_by_slave(struct net_device *bond_dev,
>> 				struct net_device *slave_dev)
>> {
>> 	bond_dev->header_ops = slave_dev->header_ops;
>>
>> 	bond_dev->type = slave_dev->type;
>> 	bond_dev->hard_header_len = slave_dev->hard_header_len;
>> 	bond_dev->addr_len = slave_dev->addr_len;
>>
>> 	memcpy(bond_dev->broadcast, slave_dev->broadcast,
>> 	       slave_dev->addr_len);
>> }
>> thanks
>> wengang

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
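
The ordering problem Jay suspects in the quoted discussion can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct toy_dev`, `toy_pktgen_accepts()` and `toy_enslave_first()` are hypothetical stand-ins, not kernel APIs. The two type values are the ones used for ARPHRD_ETHER and ARPHRD_INFINIBAND in <linux/if_arp.h>.

```c
#include <stdbool.h>

/* Same numeric values as ARPHRD_ETHER and ARPHRD_INFINIBAND. */
enum { TOY_ARPHRD_ETHER = 1, TOY_ARPHRD_INFINIBAND = 32 };

/* Hypothetical, stripped-down stand-in for struct net_device. */
struct toy_dev {
	int type;
};

/* Models a pktgen-style check that runs once, at configuration time. */
static bool toy_pktgen_accepts(const struct toy_dev *dev)
{
	return dev->type == TOY_ARPHRD_ETHER;
}

/* Models bond_setup_by_slave(): the bond takes over the slave's type. */
static void toy_enslave_first(struct toy_dev *bond, const struct toy_dev *slave)
{
	if (bond->type != slave->type)
		bond->type = slave->type;
}
```

If the check runs while the bond is still of Ethernet type and the first IPoIB slave is added afterwards, the earlier "accepted" answer no longer holds, which is the window described above.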

^ permalink raw reply

* Re: [PATCH net] net/mlx4_en: correct the endianness of doorbell_qpn on big endian platform
From: Wei Yang @ 2014-12-15  1:32 UTC (permalink / raw)
  To: David Miller
  Cc: weiyang, David.Laight, eric.dumazet, netdev, gideonn, edumazet,
	amirv
In-Reply-To: <20141213.234320.1607496855879763694.davem@davemloft.net>

On Sat, Dec 13, 2014 at 11:43:20PM -0500, David Miller wrote:
>From: Wei Yang <weiyang@linux.vnet.ibm.com>
>Date: Sat, 13 Dec 2014 11:13:38 +0800
>
>> On Mon, Dec 08, 2014 at 10:42:37PM +0800, Wei Yang wrote:
>> If you prefer this way, I would like to send a new version for this.
>> Is it ok for you?
>
>I'm not so sure.  There are implications when using the __raw_*()
>routines.
>
>In particular, using __raw_{read,write}l() also means that the usual
>necessary I/O memory barriers are not being performed.
>
>There are therefore no ordering guarantees between __raw_*() and other
>I/O or memory accesses whatsoever.

Thanks David.

Actually, my last mail was a question for David Laight. I am trying to
understand his comment, and Amir told me he was suggesting the __raw_*()
version.

Hmm... this is a real problem found in v3.18-rc1, and the root cause is
endianness. I am OK with any method of fixing it, even a revert.
Could the maintainer from Mellanox give me a word?

-- 
Richard Yang
Help you, Help me

^ permalink raw reply

* Potential bugs found in 3c59x
From: Jia-Ju Bai @ 2014-12-15  1:41 UTC (permalink / raw)
  To: netdev

I have recently been testing the Linux 3.17.2 device drivers.
The target file is drivers/net/ethernet/3com/3c59x.c, which is built into
3c59x.ko. I hope you can help me check my findings:
[1] vortex_up is called by vortex_open when the ethernet card driver is
initialized. When vortex_up fails, that is, returns an error value, the
"out" label is taken immediately to stop the open. However, the buffers
allocated by __netdev_alloc_skb in vortex_open are not released with
dev_kfree_skb when vortex_up fails.
[2] As noted in [1], one reason vortex_up can fail is that
pci_enable_device fails (returns an error value) in vortex_up, in which case
the "err_out" label is taken immediately to return.
Could you help me check these findings? Thank you very much, and I'm looking
forward to your reply.
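
To make finding [1] concrete, here is a user-space sketch of the cleanup pattern the open path would need. It is a model under stated assumptions, not driver code: malloc()/free() stand in for __netdev_alloc_skb()/dev_kfree_skb(), and `struct toy_nic`, `toy_up()` and `toy_open()` are hypothetical names.

```c
#include <errno.h>
#include <stdlib.h>

#define TOY_RX_RING_SIZE 4

struct toy_nic {
	void *rx_buf[TOY_RX_RING_SIZE];	/* stands in for the rx skb ring */
};

/* Frees every buffer in the ring; safe on a partly filled ring. */
static void toy_free_rx_ring(struct toy_nic *nic)
{
	for (int i = 0; i < TOY_RX_RING_SIZE; i++) {
		free(nic->rx_buf[i]);
		nic->rx_buf[i] = NULL;
	}
}

/* Stands in for vortex_up(); fail_up lets a caller force the error path. */
static int toy_up(int fail_up)
{
	return fail_up ? -EIO : 0;
}

/* Returns 0 on success; on any failure no buffer stays allocated. */
static int toy_open(struct toy_nic *nic, int fail_up)
{
	int err;

	for (int i = 0; i < TOY_RX_RING_SIZE; i++)
		nic->rx_buf[i] = NULL;

	for (int i = 0; i < TOY_RX_RING_SIZE; i++) {
		nic->rx_buf[i] = malloc(1536);
		if (!nic->rx_buf[i]) {
			err = -ENOMEM;
			goto err_free_ring;
		}
	}

	err = toy_up(fail_up);
	if (err)
		goto err_free_ring;	/* the step reported as missing in [1] */

	return 0;

err_free_ring:
	toy_free_rx_ring(nic);
	return err;
}
```

The point of the sketch is only the shape of the error path: every exit taken after the ring has been filled goes through the label that releases it.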

^ permalink raw reply

* [RFC PATCH net-next 5/5] tcp: Add TCP tracer
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

Define probes and register them to the TCP tracepoints.  The probes
collect the data defined in struct tcp_sk_trace and record them to
the tracing's ring_buffer.

We currently use the TCP tracer to collect per-flow TCP statistics. Here is our
sample usage:
1. Compare the performance of different subnets, connection types (like wired
   vs. wireless), application protocols, etc.
2. Uncover uplink/backbone/subnet issues, e.g. by tracking the rxmit rate.

This patch concludes the series. It is still missing a few things that
we currently have, like:
- why the sender is blocked, and for how long for each reason
- some TCP Congestion Control data
- ...etc.
We intend to complete them in the next few versions.

[
Some background:
It was inspired by the Web10G effort (RFC4898) on collecting TCP
per-flow stats.  We initially took the Web10G kernel patch, captured only the
subset of RFC4898 that we monitor continuously in production, and added code to
leverage the ftrace infra.
]
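
As an illustration of the kind of per-flow bookkeeping the probes do, here is a user-space sketch of the RTT accumulation: the minimum starts at the largest representable value so that the first sample always replaces it, and the mean is derived from the running sum and count. The `toy_` names are hypothetical, and the field set is a subset chosen for the example rather than a copy of struct tcp_stats.

```c
#include <stdint.h>

struct toy_rtt_stats {
	uint64_t rtt_sample_us;	/* most recent sample */
	uint64_t max_rtt_us;
	uint64_t min_rtt_us;
	uint64_t sum_rtt_us;
	uint32_t count_rtt;
};

/* Start min at the top of the range so any real sample is smaller. */
static void toy_rtt_init(struct toy_rtt_stats *st)
{
	st->rtt_sample_us = 0;
	st->max_rtt_us = 0;
	st->min_rtt_us = UINT64_MAX;
	st->sum_rtt_us = 0;
	st->count_rtt = 0;
}

/* Fold one RTT sample (in microseconds) into the running statistics. */
static void toy_rtt_sample(struct toy_rtt_stats *st, uint64_t rtt_us)
{
	st->rtt_sample_us = rtt_us;
	if (rtt_us > st->max_rtt_us)
		st->max_rtt_us = rtt_us;
	if (rtt_us < st->min_rtt_us)
		st->min_rtt_us = rtt_us;
	st->sum_rtt_us += rtt_us;
	st->count_rtt++;
}

/* Mean RTT in microseconds, or 0 when no sample has been recorded. */
static uint64_t toy_rtt_mean_us(const struct toy_rtt_stats *st)
{
	return st->count_rtt ? st->sum_rtt_us / st->count_rtt : 0;
}
```

Keeping sum and count instead of a precomputed mean lets a consumer of the trace compute averages over any window by subtracting two records.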

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/trace/tcp_trace.c | 489 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h     |   1 +
 2 files changed, 488 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
index 9d09fd0..da9b748 100644
--- a/kernel/trace/tcp_trace.c
+++ b/kernel/trace/tcp_trace.c
@@ -1,12 +1,29 @@
 #include <net/tcp_trace.h>
+#include <net/tcp.h>
+#include <trace/events/tcp.h>
 #include <linux/tcp.h>
+#include <linux/ipv6.h>
+#include <linux/ftrace_event.h>
+#include <linux/jiffies.h>
 #include <uapi/linux/tcp_trace.h>
 
+#include "trace_output.h"
+
+#define REPORT_INTERVAL_MS 2000
+
+static struct trace_array *tcp_tr;
 static bool tcp_trace_enabled __read_mostly;
 
+static struct trace_print_flags tcp_trace_event_names[] = {
+	{ TCP_TRACE_EVENT_ESTABLISHED, "established" },
+	{ TCP_TRACE_EVENT_PERIODIC, "periodic" },
+	{ TCP_TRACE_EVENT_RETRANS, "retrans" },
+	{ TCP_TRACE_EVENT_RETRANS_LOSS, "retrans_loss" },
+	{ TCP_TRACE_EVENT_CLOSE, "close" }
+};
+
 struct tcp_sk_trace {
 	struct tcp_stats stats;
-	unsigned long start_ts;
 	unsigned long last_ts;
 };
 
@@ -28,10 +45,478 @@ void tcp_sk_trace_init(struct sock *sk)
 	sktr->stats.min_rtt_us = U64_MAX;
 	sktr->stats.min_rto_ms = U32_MAX;
 
-	sktr->last_ts = sktr->start_ts = jiffies;
+	sktr->last_ts = jiffies;
 }
 
 void tcp_sk_trace_destruct(struct sock *sk)
 {
 	kfree(tcp_sk(sk)->trace);
 }
+
+static void tcp_trace_init(struct tcp_trace *tr,
+			   enum tcp_trace_events trev,
+			   struct sock *sk)
+{
+	tr->event = trev;
+	if (sk->sk_family == AF_INET) {
+		tr->ipv6 = 0;
+		tr->local_addr[0] = inet_sk(sk)->inet_saddr;
+		tr->remote_addr[0] = inet_sk(sk)->inet_daddr;
+	} else {
+		BUG_ON(sk->sk_family != AF_INET6);
+		tr->ipv6 = 1;
+		memcpy(tr->local_addr, inet6_sk(sk)->saddr.s6_addr32,
+		       sizeof(tr->local_addr));
+		memcpy(tr->remote_addr, sk->sk_v6_daddr.s6_addr32,
+		       sizeof(tr->remote_addr));
+	}
+	tr->local_port = inet_sk(sk)->inet_sport;
+	tr->remote_port = inet_sk(sk)->inet_dport;
+}
+
+static void tcp_trace_basic_init(struct tcp_trace_basic *trb,
+				 enum tcp_trace_events trev,
+				 struct sock *sk)
+{
+	tcp_trace_init((struct tcp_trace *)trb, trev, sk);
+	trb->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+	trb->mss = tcp_sk(sk)->mss_cache;
+	trb->ssthresh = tcp_current_ssthresh(sk);
+	trb->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+	trb->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+}
+
+static void tcp_trace_basic_add(enum tcp_trace_events trev, struct sock *sk)
+{
+	struct ring_buffer *buffer;
+	int pc;
+	struct ring_buffer_event *event;
+	struct tcp_trace_basic *trb;
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	if (!sktr)
+		return;
+
+	tracing_record_cmdline(current);
+	buffer = tcp_tr->trace_buffer.buffer;
+	pc = preempt_count();
+	event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+					  sizeof(*trb), 0, pc);
+	if (!event)
+		return;
+	trb = ring_buffer_event_data(event);
+	tcp_trace_basic_init(trb, trev, sk);
+	trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_stats_init(struct tcp_trace_stats *trs,
+				 enum tcp_trace_events trev,
+				 struct sock *sk)
+{
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	tcp_trace_basic_init((struct tcp_trace_basic *)trs, trev, sk);
+	memcpy(&trs->stats, &sktr->stats, sizeof(sktr->stats));
+}
+
+static void tcp_trace_stats_add(enum tcp_trace_events trev, struct sock *sk)
+{
+	struct ring_buffer *buffer;
+	int pc;
+	struct ring_buffer_event *event;
+	struct tcp_trace_stats *trs;
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	if (!sktr)
+		return;
+
+	tracing_record_cmdline(current);
+	buffer = tcp_tr->trace_buffer.buffer;
+	pc = preempt_count();
+	event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+					  sizeof(*trs), 0, pc);
+	if (!event)
+		return;
+	trs = ring_buffer_event_data(event);
+
+	tcp_trace_stats_init(trs, trev, sk);
+
+	trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_established(void *ignore, struct sock *sk)
+{
+	tcp_trace_basic_add(TCP_TRACE_EVENT_ESTABLISHED, sk);
+}
+
+static void tcp_trace_transmit_skb(void *ignore, struct sock *sk,
+				   struct sk_buff *skb)
+{
+	int pcount;
+	struct tcp_sk_trace *sktr;
+	struct tcp_skb_cb *tcb;
+	unsigned int data_len;
+	bool retrans = false;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	tcb = TCP_SKB_CB(skb);
+	pcount = tcp_skb_pcount(skb);
+	data_len = tcb->end_seq - tcb->seq;
+
+	sktr->stats.segs_out += pcount;
+
+	if (!data_len)
+		goto out;
+
+	sktr->stats.data_segs_out += pcount;
+	sktr->stats.data_octets_out += data_len;
+
+	if (before(tcb->seq, tcp_sk(sk)->snd_nxt)) {
+		enum tcp_trace_events trev;
+
+		retrans = true;
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
+			sktr->stats.loss_segs_retrans += pcount;
+			sktr->stats.loss_octets_retrans += data_len;
+			trev = TCP_TRACE_EVENT_RETRANS_LOSS;
+		} else {
+			sktr->stats.other_segs_retrans += pcount;
+			sktr->stats.other_octets_retrans += data_len;
+			trev = TCP_TRACE_EVENT_RETRANS;
+		}
+		tcp_trace_stats_add(trev, sk);
+		return;
+	}
+
+out:
+	if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+	    REPORT_INTERVAL_MS) {
+		sktr->last_ts = jiffies;
+		tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+	}
+}
+
+static void tcp_trace_rcv_established(void *ignore, struct sock *sk,
+				      struct sk_buff *skb)
+{
+	struct tcp_sk_trace *sktr;
+	unsigned int data_len;
+	struct tcphdr *th;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	th = tcp_hdr(skb);
+	WARN_ON_ONCE(skb->len < th->doff << 2);
+
+	sktr->stats.segs_in++;
+	data_len = skb->len - (th->doff << 2);
+	if (data_len) {
+		sktr->stats.data_segs_in++;
+		sktr->stats.data_octets_in += data_len;
+	} else {
+		if (TCP_SKB_CB(skb)->ack_seq == tcp_sk(sk)->snd_una)
+			sktr->stats.dup_acks_in++;
+	}
+
+	if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+	    REPORT_INTERVAL_MS) {
+		sktr->last_ts = jiffies;
+		tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+	}
+}
+
+static void tcp_trace_close(void *ignore, struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	tcp_trace_stats_add(TCP_TRACE_EVENT_CLOSE, sk);
+}
+
+static void tcp_trace_ooo_rcv(void *ignore, struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	sktr->stats.ooo_in++;
+}
+
+static void tcp_trace_sacks_rcv(void *ignore, struct sock *sk, int num_sacks)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	sktr->stats.sacks_in++;
+	sktr->stats.sack_blks_in += num_sacks;
+}
+
+void tcp_trace_rtt_sample(void *ignore, struct sock *sk,
+			  long rtt_sample_us)
+{
+	struct tcp_sk_trace *sktr;
+	u32 rto_ms;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+
+	sktr->stats.rtt_sample_us = rtt_sample_us;
+	sktr->stats.max_rtt_us = max_t(u64, sktr->stats.max_rtt_us,
+				       rtt_sample_us);
+	sktr->stats.min_rtt_us = min_t(u64, sktr->stats.min_rtt_us,
+				       rtt_sample_us);
+
+	sktr->stats.count_rtt++;
+	sktr->stats.sum_rtt_us += rtt_sample_us;
+
+	sktr->stats.max_rto_ms = max_t(u32, sktr->stats.max_rto_ms, rto_ms);
+	sktr->stats.min_rto_ms = min_t(u32, sktr->stats.min_rto_ms, rto_ms);
+}
+
+static enum print_line_t
+tcp_trace_print(struct trace_iterator *iter)
+{
+	struct trace_seq *s = &iter->seq;
+	struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+	struct tcp_trace_basic *trb;
+	struct tcp_stats *stats;
+
+	union {
+		struct sockaddr_in v4;
+		struct sockaddr_in6 v6;
+	} local_sa, remote_sa;
+
+	local_sa.v4.sin_port = tr->local_port;
+	remote_sa.v4.sin_port = tr->remote_port;
+	if (tr->ipv6) {
+		local_sa.v6.sin6_family = AF_INET6;
+		remote_sa.v6.sin6_family = AF_INET6;
+		memcpy(local_sa.v6.sin6_addr.s6_addr, tr->local_addr, 16);
+		memcpy(remote_sa.v6.sin6_addr.s6_addr, tr->remote_addr, 16);
+	} else {
+		local_sa.v4.sin_family = AF_INET;
+		remote_sa.v4.sin_family = AF_INET;
+		local_sa.v4.sin_addr.s_addr =  tr->local_addr[0];
+		remote_sa.v4.sin_addr.s_addr = tr->remote_addr[0];
+	}
+
+	ftrace_print_symbols_seq(s, tr->event, tcp_trace_event_names);
+	if (trace_seq_has_overflowed(s))
+		goto out;
+
+	trb = (struct tcp_trace_basic *)tr;
+	trace_seq_printf(s,
+			 " %pISpc %pISpc snd_cwnd=%u mss=%u ssthresh=%u"
+			 " srtt_us=%llu rto_ms=%u",
+			 &local_sa, &remote_sa,
+			 trb->snd_cwnd, trb->mss, trb->ssthresh,
+			 trb->srtt_us, trb->rto_ms);
+
+	if (tr->event == TCP_TRACE_EVENT_ESTABLISHED ||
+	    trace_seq_has_overflowed(s))
+		goto out;
+
+	stats = &(((struct tcp_trace_stats *)tr)->stats);
+	trace_seq_printf(s,
+		" segs_out=%u data_segs_out=%u data_octets_out=%llu"
+		" other_segs_retrans=%u other_octets_retrans=%u"
+		" loss_segs_retrans=%u loss_octets_retrans=%u"
+		" segs_in=%u data_segs_in=%u data_octets_in=%llu"
+		" max_rtt_us=%llu min_rtt_us=%llu"
+		" count_rtt=%u sum_rtt_us=%llu"
+		" rtt_sample_us=%llu"
+		" max_rto_ms=%u min_rto_ms=%u"
+		" dup_acks_in=%u sacks_in=%u"
+		" sack_blks_in=%u ooo_in=%u",
+		stats->segs_out, stats->data_segs_out, stats->data_octets_out,
+		stats->other_segs_retrans, stats->other_octets_retrans,
+		stats->loss_segs_retrans, stats->loss_octets_retrans,
+		stats->segs_in, stats->data_segs_in, stats->data_octets_in,
+		stats->max_rtt_us, stats->min_rtt_us,
+		stats->count_rtt, stats->sum_rtt_us,
+		stats->rtt_sample_us,
+		stats->max_rto_ms, stats->min_rto_ms,
+		stats->dup_acks_in, stats->sacks_in,
+		stats->sack_blks_in, stats->ooo_in);
+	if (trace_seq_has_overflowed(s))
+		goto out;
+
+out:
+	trace_seq_putc(s, '\n');
+
+	return trace_seq_has_overflowed(s) ?
+		TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+tcp_trace_print_binary(struct trace_iterator *iter)
+{
+	struct trace_seq *s = &iter->seq;
+	struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+
+	tr->magic = TCP_TRACE_MAGIC_VERSION;
+	if (tr->event == TCP_TRACE_EVENT_ESTABLISHED)
+		trace_seq_putmem(s, tr, sizeof(struct tcp_trace_basic));
+	else
+		trace_seq_putmem(s, tr, sizeof(struct tcp_trace_stats));
+
+	return trace_seq_has_overflowed(s) ?
+		TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+tcp_tracer_print_line(struct trace_iterator *iter)
+{
+	return (trace_flags & TRACE_ITER_BIN) ?
+		tcp_trace_print_binary(iter) :
+		tcp_trace_print(iter);
+}
+
+static void tcp_unregister_tracepoints(void)
+{
+	unregister_trace_tcp_established(tcp_trace_established, NULL);
+	unregister_trace_tcp_close(tcp_trace_close, NULL);
+	unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+	unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+	unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+	unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+	unregister_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+
+	tracepoint_synchronize_unregister();
+}
+
+static int tcp_register_tracepoints(void)
+{
+	int ret;
+
+	ret = register_trace_tcp_established(tcp_trace_established, NULL);
+	if (ret)
+		return ret;
+
+	ret = register_trace_tcp_close(tcp_trace_close, NULL);
+	if (ret)
+		goto err1;
+
+	ret = register_trace_tcp_rcv_established(tcp_trace_rcv_established,
+						 NULL);
+	if (ret)
+		goto err2;
+
+	ret = register_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+	if (ret)
+		goto err3;
+
+	ret = register_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+	if (ret)
+		goto err4;
+
+	ret = register_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+	if (ret)
+		goto err5;
+
+	ret = register_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+	if (ret)
+		goto err6;
+
+	return ret;
+
+err6:
+	unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+err5:
+	unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+err4:
+	unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+err3:
+	unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+err2:
+	unregister_trace_tcp_close(tcp_trace_close, NULL);
+err1:
+	unregister_trace_tcp_established(tcp_trace_established, NULL);
+
+	return ret;
+}
+
+static void tcp_tracer_start(struct trace_array *tr)
+{
+	int ret;
+
+	if (tcp_trace_enabled)
+		return;
+
+	ret = tcp_register_tracepoints();
+	if (ret == 0) {
+		tcp_trace_enabled = true;
+		pr_info("tcp_trace: enabled\n");
+	}
+}
+
+static void tcp_tracer_stop(struct trace_array *tr)
+{
+	if (!tcp_trace_enabled)
+		return;
+
+	tcp_unregister_tracepoints();
+	tcp_trace_enabled = false;
+	pr_info("tcp_trace: disabled\n");
+}
+
+static void tcp_tracer_reset(struct trace_array *tr)
+{
+	tcp_tracer_stop(tr);
+}
+
+static int tcp_tracer_init(struct trace_array *tr)
+{
+	tcp_tr = tr;
+	tcp_tracer_start(tr);
+	return 0;
+}
+
+static struct tracer tcp_tracer __read_mostly = {
+	.name           = "tcp",
+	.init           = tcp_tracer_init,
+	.reset          = tcp_tracer_reset,
+	.start          = tcp_tracer_start,
+	.stop           = tcp_tracer_stop,
+	.print_line	= tcp_tracer_print_line,
+};
+
+static struct trace_event_functions tcp_trace_event_funcs;
+
+static struct trace_event tcp_trace_event = {
+	.type           = TRACE_TCP,
+	.funcs          = &tcp_trace_event_funcs,
+};
+
+static int __init init_tcp_tracer(void)
+{
+	if (!register_ftrace_event(&tcp_trace_event)) {
+		pr_warn("Cannot register TCP trace event\n");
+		return 1;
+	}
+
+	if (register_tracer(&tcp_tracer) != 0) {
+		pr_warn("Cannot register TCP tracer\n");
+		unregister_ftrace_event(&tcp_trace_event);
+		return 1;
+	}
+	return 0;
+}
+
+device_initcall(init_tcp_tracer);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3255dfb..d88bdd2 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -38,6 +38,7 @@ enum trace_type {
 	TRACE_USER_STACK,
 	TRACE_BLK,
 	TRACE_BPUTS,
+	TRACE_TCP,
 
 	__TRACE_LAST_TYPE,
 };
-- 
1.8.1

^ permalink raw reply related

* [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

Hi,

We have been using the kernel ftrace infra to collect TCP per-flow statistics.
The following patch set is a first slimmed-down version of our
existing implementation. We would like to get some early feedback
and make it useful for others.

[RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs:
Defines some basic tracepoints (by TRACE_EVENT).

[RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints:
A sample perf script with simple ip/port filtering and summary output.

[RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer:
Declares a few more tracepoints (by DECLARE_TRACE) which are
used by the tcp_tracer.  The tcp_tracer is in patch 5/5.

[RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs:
Defines a few tcp_trace structs which are used to collect statistics
on each tcp_sock.

[RFC PATCH net-next 5/5] tcp: Add TCP tracer:
Introduces a tcp_tracer which hooks onto the tracepoints defined in
patches 1/5 and 3/5.  It collects the data defined in patch 4/5. We currently
use this tracer to collect per-flow statistics.  The commit log has
some more details.
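
One detail of the tracer that is easy to show in isolation is the rate limit on periodic records: patch 5/5 emits a periodic event only when at least REPORT_INTERVAL_MS (2000 ms) has passed since the last one for that flow. The sketch below models that in user space; the millisecond counter passed in by the caller and the `toy_` names are assumptions, whereas the patch itself works on jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_REPORT_INTERVAL_MS 2000u

/* Hypothetical per-flow state: time of the last periodic record. */
struct toy_flow {
	uint32_t last_report_ms;
};

/* Returns true, and restarts the interval, when a periodic record is due. */
static bool toy_periodic_due(struct toy_flow *flow, uint32_t now_ms)
{
	/* Unsigned subtraction stays correct across counter wrap-around. */
	if (now_ms - flow->last_report_ms < TOY_REPORT_INTERVAL_MS)
		return false;

	flow->last_report_ms = now_ms;
	return true;
}
```

Because the check only runs when a segment is sent or received, an idle flow produces no periodic records at all, which keeps the ring buffer quiet for dormant connections.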

Thanks,
--Martin

^ permalink raw reply

* [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs.
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

The tcp_sk_trace and its related structs define what will be
collected and recorded to the tracing's ring_buffer by
the TCP tracer (in the following patch).
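
For consumers of the binary output, a reader would presumably validate the leading magic word before trusting a record. The sketch below shows one way to do that in user space. The magic and version values are copied from the uapi header added by this patch; everything prefixed `toy_` is hypothetical, and the split into "upper 24 bits magic, low byte version" is an assumption read off the way the two defines are OR-ed together. The basic/stats distinction follows how the tracer in the following patch sizes its records: established events carry the basic record, all others the full stats record.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from include/uapi/linux/tcp_trace.h in this patch. */
#define TCP_TRACE_MAGIC		0x54435000u
#define TCP_TRACE_VERSION	0x01u

/* Event numbers follow the order of enum tcp_trace_events. */
#define TOY_EVENT_ESTABLISHED	0u
#define TOY_EVENT_LAST		4u	/* TCP_TRACE_EVENT_CLOSE */

enum toy_record_kind {
	TOY_RECORD_INVALID,
	TOY_RECORD_BASIC,	/* struct tcp_trace_basic */
	TOY_RECORD_STATS,	/* struct tcp_trace_stats */
};

/* Assumed layout: upper 24 bits identify the format, low byte the version. */
static bool toy_magic_ok(uint32_t magic)
{
	return (magic & 0xffffff00u) == TCP_TRACE_MAGIC &&
	       (magic & 0xffu) == TCP_TRACE_VERSION;
}

/* Decide which record layout follows, or reject the record. */
static enum toy_record_kind toy_classify(uint32_t magic, uint8_t event)
{
	if (!toy_magic_ok(magic) || event > TOY_EVENT_LAST)
		return TOY_RECORD_INVALID;

	return event == TOY_EVENT_ESTABLISHED ? TOY_RECORD_BASIC
					      : TOY_RECORD_STATS;
}
```

Rejecting unknown versions up front means a later change to the record layout cannot be misparsed by an older reader.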

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/tcp.h            |  4 +++
 include/net/tcp_trace.h        | 18 ++++++++++
 include/uapi/linux/tcp_trace.h | 78 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/Kconfig           | 11 ++++++
 kernel/trace/Makefile          |  1 +
 kernel/trace/tcp_trace.c       | 37 ++++++++++++++++++++
 net/ipv4/tcp.c                 |  4 +++
 7 files changed, 153 insertions(+)
 create mode 100644 include/net/tcp_trace.h
 create mode 100644 include/uapi/linux/tcp_trace.h
 create mode 100644 kernel/trace/tcp_trace.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8d25cb3 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -315,6 +315,10 @@ struct tcp_sock {
 	 * socket. Used to retransmit SYNACKs etc.
 	 */
 	struct request_sock *fastopen_rsk;
+
+#ifdef CONFIG_TCP_TRACE
+	struct tcp_sk_trace *trace;
+#endif
 };
 
 enum tsq_flags {
diff --git a/include/net/tcp_trace.h b/include/net/tcp_trace.h
new file mode 100644
index 0000000..f800cc7
--- /dev/null
+++ b/include/net/tcp_trace.h
@@ -0,0 +1,18 @@
+#ifndef TCP_TRACE_H
+#define TCP_TRACE_H
+
+struct sock;
+
+#ifdef CONFIG_TCP_TRACE
+
+void tcp_sk_trace_init(struct sock *sk);
+void tcp_sk_trace_destruct(struct sock *sk);
+
+#else /* CONFIG_TCP_TRACE */
+
+static inline void tcp_sk_trace_init(struct sock *sk) {}
+static inline void tcp_sk_trace_destruct(struct sock *sk) {}
+
+#endif
+
+#endif /* TCP_TRACE_H */
diff --git a/include/uapi/linux/tcp_trace.h b/include/uapi/linux/tcp_trace.h
new file mode 100644
index 0000000..4f91056
--- /dev/null
+++ b/include/uapi/linux/tcp_trace.h
@@ -0,0 +1,78 @@
+#ifndef UAPI_TCP_TRACE_H
+#define UAPI_TCP_TRACE_H
+
+#include <linux/kernel.h>
+
+#define TCP_TRACE_MAGIC		0x54435000
+#define TCP_TRACE_VERSION	0x01
+#define TCP_TRACE_MAGIC_VERSION	(TCP_TRACE_MAGIC | TCP_TRACE_VERSION)
+
+enum tcp_trace_events {
+	TCP_TRACE_EVENT_ESTABLISHED,
+	TCP_TRACE_EVENT_PERIODIC,	/* Periodic event every 2s */
+	TCP_TRACE_EVENT_RETRANS,	/* Retrans (not in TCP_CA_Loss) */
+	TCP_TRACE_EVENT_RETRANS_LOSS,	/* Retrans in TCP_CA_Loss */
+	TCP_TRACE_EVENT_CLOSE,	/* Connection close */
+};
+
+struct tcp_stats {
+	/* outgoing packets */
+	__u32	segs_out;
+	__u32	data_segs_out;
+	__u64	data_octets_out;
+
+	/* retrans */
+	__u32	other_segs_retrans;
+	__u32	other_octets_retrans;
+	__u32	loss_segs_retrans;
+	__u32	loss_octets_retrans;
+
+	/* incoming packets */
+	__u32	segs_in;
+	__u32	data_segs_in;
+	__u64	data_octets_in;
+
+	/* RTT */
+	__u64	rtt_sample_us;
+	__u64	max_rtt_us;
+	__u64	min_rtt_us;
+	__u64	sum_rtt_us;
+	__u32	count_rtt;
+
+	/* RTO */
+	__u32	max_rto_ms;
+	__u32	min_rto_ms;
+
+	/* OOO or Loss */
+	__u32	dup_acks_in;
+	__u32	sacks_in;
+	__u32	sack_blks_in;
+	__u32	ooo_in;
+} __packed;
+
+struct tcp_trace {
+	__u32	magic;
+	__u8	event:7,
+		ipv6:1;
+	__u32	local_addr[4];
+	__u32	remote_addr[4];
+	__u16	local_port;
+	__u16	remote_port;
+} __packed;
+
+struct tcp_trace_basic {
+	struct tcp_trace event;
+	/* current values from tcp_sock */
+	__u32	snd_cwnd;
+	__u32	mss;
+	__u32	ssthresh;
+	__u64	srtt_us;
+	__u32	rto_ms;
+} __packed;
+
+struct tcp_trace_stats {
+	struct tcp_trace_basic basic;
+	struct tcp_stats stats;
+} __packed;
+
+#endif /* UAPI_TCP_TRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c..f30835c 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -599,6 +599,17 @@ config RING_BUFFER_STARTUP_TEST
 
 	 If unsure, say N
 
+config TCP_TRACE
+	bool "TCP tracing"
+	depends on NET && INET
+	select DEBUG_FS
+	select TRACEPOINTS
+	select GENERIC_TRACER
+	help
+	  This tracer collects per-flow statistics and events.
+
+	  If unsure, say N.
+
 endif # FTRACE
 
 endif # TRACING_SUPPORT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 67d6369..71d008a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -65,5 +65,6 @@ obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
+obj-$(CONFIG_TCP_TRACE) += tcp_trace.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
new file mode 100644
index 0000000..9d09fd0
--- /dev/null
+++ b/kernel/trace/tcp_trace.c
@@ -0,0 +1,37 @@
+#include <net/tcp_trace.h>
+#include <linux/tcp.h>
+#include <uapi/linux/tcp_trace.h>
+
+static bool tcp_trace_enabled __read_mostly;
+
+struct tcp_sk_trace {
+	struct tcp_stats stats;
+	unsigned long start_ts;
+	unsigned long last_ts;
+};
+
+void tcp_sk_trace_init(struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	tcp_sk(sk)->trace = NULL;
+	if (!tcp_trace_enabled)
+		return;
+
+	sktr  = kzalloc(sizeof(*sktr), gfp_any());
+	if (unlikely(!sktr))
+		return;
+
+	tcp_sk(sk)->trace = sktr;
+	sk->sk_destruct = tcp_sock_destruct;
+
+	sktr->stats.min_rtt_us = U64_MAX;
+	sktr->stats.min_rto_ms = U32_MAX;
+
+	sktr->last_ts = sktr->start_ts = jiffies;
+}
+
+void tcp_sk_trace_destruct(struct sock *sk)
+{
+	kfree(tcp_sk(sk)->trace);
+}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3b887fa..41871c4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
+#include <net/tcp_trace.h>
 #include <trace/events/tcp.h>
 
 #include <asm/uaccess.h>
@@ -1904,6 +1905,7 @@ void tcp_set_state(struct sock *sk, int state)
 	case TCP_ESTABLISHED:
 		if (oldstate != TCP_ESTABLISHED) {
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
+			tcp_sk_trace_init(sk);
 			trace_tcp_established(sk);
 		}
 		break;
@@ -2254,6 +2256,8 @@ EXPORT_SYMBOL(tcp_disconnect);
 
 void tcp_sock_destruct(struct sock *sk)
 {
+	tcp_sk_trace_destruct(sk);
+
 	inet_sock_destruct(sk);
 
 	kfree(inet_csk(sk)->icsk_accept_queue.fastopenq);
-- 
1.8.1

^ permalink raw reply related

* [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-1-git-send-email-kafai@fb.com>

Add a TRACE_EVENT for each of the following:
1. connection established
2. segments received
3. segments sent
4. connection closed

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/trace/events/tcp.h | 175 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-traces.c      |   1 +
 net/ipv4/tcp.c             |   6 +-
 net/ipv4/tcp_input.c       |   3 +
 net/ipv4/tcp_output.c      |   3 +
 5 files changed, 187 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/tcp.h
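
Note on srtt_us: the events report tcp_sk(sk)->srtt_us >> 3 because the
kernel keeps the smoothed RTT scaled by 8, which lets the 1/8 gain of
the estimator be applied with shifts. The sketch below illustrates that
fixed-point convention only. srtt_update() and srtt_to_us() are
hypothetical helpers, and the variance (mdev) tracking that the real
estimator also maintains is left out.

```c
#include <stdint.h>

/*
 * Fold one RTT sample (in usec) into a smoothed RTT stored scaled by 8.
 * Equivalent to srtt = 7/8 * srtt + 1/8 * sample on the unscaled value.
 */
static uint32_t srtt_update(uint32_t srtt_x8, uint32_t sample_us)
{
	if (srtt_x8 == 0)		/* first sample seeds the estimate */
		return sample_us << 3;
	return srtt_x8 + sample_us - (srtt_x8 >> 3);
}

/* Convert the scaled value to plain microseconds, as the trace event does. */
static uint32_t srtt_to_us(uint32_t srtt_x8)
{
	return srtt_x8 >> 3;
}
```

Seeding with a 1000 us sample and then feeding a 2000 us sample moves
the reported value from 1000 to 1125 us.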

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
new file mode 100644
index 0000000..81b40ef
--- /dev/null
+++ b/include/trace/events/tcp.h
@@ -0,0 +1,175 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tcp
+
+#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TCP_H
+
+#include <net/sock.h>
+#include <net/inet_sock.h>
+#include <net/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ipv6.h>
+#include <linux/tracepoint.h>
+#include <uapi/linux/in6.h>
+
+#define TCP_TRACE_ASSIGN_SA(e, sk)	do {				\
+	(e)->lport = inet_sk((sk))->inet_sport;				\
+	(e)->rport = inet_sk((sk))->inet_dport;				\
+	if ((sk)->sk_family == AF_INET) {				\
+		(e)->ipv6 = 0;						\
+		memset((e)->laddr, 0, sizeof((e)->laddr));		\
+		memset((e)->raddr, 0, sizeof((e)->raddr));		\
+		memcpy((e)->laddr, &inet_sk((sk))->inet_saddr,		\
+		       sizeof(inet_sk((sk))->inet_saddr));		\
+		memcpy((e)->raddr, &inet_sk((sk))->inet_daddr,		\
+		       sizeof(inet_sk((sk))->inet_daddr));		\
+	} else {							\
+		(e)->ipv6 = 1;						\
+		memcpy((e)->laddr, inet6_sk((sk))->saddr.s6_addr,	\
+		       sizeof((e)->laddr));				\
+		memcpy((e)->raddr, (sk)->sk_v6_daddr.s6_addr,		\
+		       sizeof((e)->raddr));				\
+	}								\
+} while (0)
+
+DECLARE_EVENT_CLASS(tcp,
+	TP_PROTO(struct sock *sk),
+	TP_ARGS(sk),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, snd_cwnd)
+		__field(u32, mss_cache)
+		__field(u32, ssthresh)
+		__field(u64, srtt_us)
+		__field(u32, rto_ms)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+		__entry->mss_cache = tcp_sk(sk)->mss_cache;
+		__entry->ssthresh = tcp_current_ssthresh(sk);
+		__entry->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+		__entry->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	),
+	TP_printk("local=%s:%d remote=%s:%d snd_cwnd=%u mss_cache=%u "
+		  "ssthresh=%u srtt_us=%llu rto_ms=%u",
+		  __print_hex(__entry->laddr, 16),
+		  __entry->lport,
+		  __print_hex(__entry->raddr, 16),
+		  __entry->rport,
+		  __entry->snd_cwnd, __entry->mss_cache,
+		  __entry->ssthresh, __entry->srtt_us, __entry->rto_ms)
+);
+
+DEFINE_EVENT(tcp,
+	     tcp_established,
+	     TP_PROTO(struct sock *sk),
+	     TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp,
+	     tcp_close,
+	     TP_PROTO(struct sock *sk),
+	     TP_ARGS(sk)
+);
+
+TRACE_EVENT(tcp_transmit_skb,
+	TP_PROTO(struct sock *sk, struct sk_buff *skb),
+	TP_ARGS(sk, skb),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, seq)
+		__field(u32, end_seq)
+		__field(u32, pcount)
+		__field(u8, ca_state)
+		__field(u32, snd_nxt)
+		__field(u32, snd_una)
+		__field(u32, snd_wnd)
+		__field(u32, snd_cwnd)
+		__field(u32, mss_cache)
+		__field(u32, ssthresh)
+		__field(u64, srtt_us)
+		__field(u32, rto_ms)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->seq = TCP_SKB_CB(skb)->seq;
+		__entry->end_seq = TCP_SKB_CB(skb)->end_seq;
+		__entry->pcount = tcp_skb_pcount(skb);
+		__entry->ca_state = inet_csk(sk)->icsk_ca_state;
+		__entry->snd_nxt = tcp_sk(sk)->snd_nxt;
+		__entry->snd_una = tcp_sk(sk)->snd_una;
+		__entry->snd_wnd = tcp_sk(sk)->snd_wnd;
+		__entry->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+		__entry->mss_cache = tcp_sk(sk)->mss_cache;
+		__entry->ssthresh = tcp_current_ssthresh(sk);
+		__entry->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+		__entry->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	),
+	TP_printk("local=%s:%d remote=%s:%d "
+		  "skb_seq=%u skb_end_seq=%u pcount=%u ca_state=%x "
+		  "snd_nxt=%u snd_una=%u snd_wnd=%u snd_cwnd=%u mss_cache=%u "
+		  "ssthresh=%u srtt_us=%llu rto_ms=%u",
+		  __print_hex(__entry->laddr, 16), __entry->lport,
+		  __print_hex(__entry->raddr, 16), __entry->rport,
+
+		  __entry->seq, __entry->end_seq, __entry->pcount,
+		  __entry->ca_state,
+
+		  __entry->snd_nxt, __entry->snd_una, __entry->snd_wnd,
+		  __entry->snd_cwnd, __entry->mss_cache,
+
+		  __entry->ssthresh, __entry->srtt_us, __entry->rto_ms)
+);
+
+TRACE_EVENT(tcp_rcv_established,
+	    TP_PROTO(struct sock *sk, struct sk_buff *skb),
+	    TP_ARGS(sk, skb),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, seq)
+		__field(u32, end_seq)
+		__field(u32, ack_seq)
+		__field(u32, snd_una)
+		__field(u32, rcv_nxt)
+		__field(u32, rcv_wnd)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->seq = TCP_SKB_CB(skb)->seq;
+		__entry->end_seq = TCP_SKB_CB(skb)->end_seq;
+		__entry->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+		__entry->snd_una = tcp_sk(sk)->snd_una;
+		__entry->rcv_nxt = tcp_sk(sk)->rcv_nxt;
+		__entry->rcv_wnd = tcp_sk(sk)->rcv_wnd;
+	),
+	TP_printk("local=%s:%d remote=%s:%d "
+		  "skb_seq=%u skb_end_seq=%u skb_ack_seq=%u snd_una=%u "
+		  "rcv_nxt=%u, rcv_wnd=%u",
+		  __print_hex(__entry->laddr, 16), __entry->lport,
+		  __print_hex(__entry->raddr, 16), __entry->rport,
+
+		  __entry->seq, __entry->end_seq, __entry->ack_seq,
+		  __entry->snd_una,
+
+		  __entry->rcv_nxt, __entry->rcv_wnd)
+);
+
+#undef TCP_TRACE_ASSIGN_SA
+
+#endif /* _TRACE_TCP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index ba3c012..63f966b 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -31,6 +31,7 @@
 #include <trace/events/napi.h>
 #include <trace/events/sock.h>
 #include <trace/events/udp.h>
+#include <trace/events/tcp.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3075723..3b887fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
+#include <trace/events/tcp.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -1901,8 +1902,10 @@ void tcp_set_state(struct sock *sk, int state)
 
 	switch (state) {
 	case TCP_ESTABLISHED:
-		if (oldstate != TCP_ESTABLISHED)
+		if (oldstate != TCP_ESTABLISHED) {
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
+			trace_tcp_established(sk);
+		}
 		break;
 
 	case TCP_CLOSE:
@@ -1913,6 +1916,7 @@ void tcp_set_state(struct sock *sk, int state)
 		if (inet_csk(sk)->icsk_bind_hash &&
 		    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
 			inet_put_port(sk);
+		trace_tcp_close(sk);
 		/* fall through */
 	default:
 		if (oldstate == TCP_ESTABLISHED)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..808fad7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -75,6 +75,7 @@
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
+#include <trace/events/tcp.h>
 
 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
@@ -5076,6 +5077,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	trace_tcp_rcv_established(sk, skb);
+
 	if (unlikely(sk->sk_rx_dst == NULL))
 		inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
 	/*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..9832512 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -41,6 +41,7 @@
 #include <linux/compiler.h>
 #include <linux/gfp.h>
 #include <linux/module.h>
+#include <trace/events/tcp.h>
 
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse __read_mostly = 1;
@@ -1014,6 +1015,8 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	/* Our usage of tstamp should remain private */
 	skb->tstamp.tv64 = 0;
 
+	trace_tcp_transmit_skb(sk, skb);
+
 	/* Cleanup our debris for IP stacks */
 	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
 			       sizeof(struct inet6_skb_parm)));
-- 
1.8.1

