Netdev List


* [PATCH] ipv4: fix address selection in fib_compute_spec_dst
From: Julian Anastasov @ 2012-07-19  7:35 UTC
  To: David Miller; +Cc: netdev

	ip_options_compile() can be called for forwarded packets, so
make sure the specific-destination address is a local one, as
specified in RFC 1812, section 4.2.2.2 (Addresses in Options).

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/fib_frontend.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 7a31194..b832036 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -206,7 +206,8 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 	int scope;
 
 	rt = skb_rtable(skb);
-	if (!(rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)))
+	if ((rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST | RTCF_LOCAL)) ==
+	    RTCF_LOCAL)
 		return ip_hdr(skb)->daddr;
 
 	in_dev = __in_dev_get_rcu(dev);
-- 
1.7.3.4
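
The one-line condition change in the patch above is easy to misread, so here is a minimal user-space sketch of the old and new tests side by side. The DEMO_* flag values and function names are stand-ins chosen for illustration, not the kernel's definitions. The sketch only shows which flag combinations take the early "return ip_hdr(skb)->daddr" path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in flag bits for illustration; the kernel defines the real RTCF_*. */
#define DEMO_RTCF_BROADCAST 0x10000000u
#define DEMO_RTCF_MULTICAST 0x20000000u
#define DEMO_RTCF_LOCAL     0x80000000u

/* Pre-patch test: true for any route that is neither broadcast nor
 * multicast, which also covers forwarded (non-local) packets. */
static bool demo_old_uses_daddr(uint32_t rt_flags)
{
	return !(rt_flags & (DEMO_RTCF_BROADCAST | DEMO_RTCF_MULTICAST));
}

/* Post-patch test: true only when LOCAL is set and both BROADCAST and
 * MULTICAST are clear. */
static bool demo_new_uses_daddr(uint32_t rt_flags)
{
	const uint32_t mask = DEMO_RTCF_BROADCAST | DEMO_RTCF_MULTICAST |
			      DEMO_RTCF_LOCAL;

	return (rt_flags & mask) == DEMO_RTCF_LOCAL;
}

/* Walks through the interesting flag combinations. */
static void demo_spec_dst_check(void)
{
	/* Forwarded packet: no flag set. Only the old test accepts daddr. */
	assert(demo_old_uses_daddr(0));
	assert(!demo_new_uses_daddr(0));

	/* Plain local unicast: both tests accept daddr. */
	assert(demo_old_uses_daddr(DEMO_RTCF_LOCAL));
	assert(demo_new_uses_daddr(DEMO_RTCF_LOCAL));

	/* Broadcast or multicast: both tests fall through. */
	assert(!demo_old_uses_daddr(DEMO_RTCF_LOCAL | DEMO_RTCF_BROADCAST));
	assert(!demo_new_uses_daddr(DEMO_RTCF_LOCAL | DEMO_RTCF_BROADCAST));
	assert(!demo_new_uses_daddr(DEMO_RTCF_LOCAL | DEMO_RTCF_MULTICAST));
}
```

The only case where the two tests disagree is the forwarded packet, which is the case the commit message describes.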

* Re: [PATCH net-next V1 1/9] IB/ipoib: Add support for clones / multiple childs on the same partition
From: Or Gerlitz @ 2012-07-19  8:11 UTC
  To: John Fastabend
  Cc: David Miller, roland, netdev, ali, sean.hefty, shlomop, erezsh,
	Or Gerlitz
In-Reply-To: <50073484.9070501@intel.com>

On Thu, Jul 19, 2012 at 1:11 AM, John Fastabend
<john.r.fastabend@intel.com> wrote:

> [...] Also what is a "pkey"

Hi John,

pkey (pronounced PEE KEY) stands for "partition key". Partitions are,
in a way, IB's VLANs, so the functionality provided by IPoIB child devices
is similar to what is done by Ethernet 8021q VLAN devices. Dave suggests
that we use rtnl_link_ops to create these children instead of the proprietary
sysfs interface, which was introduced when IPoIB was merged and is described in
Documentation/infiniband/ipoib.txt

Or.

* [PATCH net-next V1 1/4] net/mlx4: Move MAC_MASK to a common place
From: Or Gerlitz @ 2012-07-19  8:33 UTC
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342686832-21406-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Define this macro in one common place instead of duplicating it across the code.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    6 +++---
 drivers/net/ethernet/mellanox/mlx4/mcg.c           |    1 -
 drivers/net/ethernet/mellanox/mlx4/port.c          |    1 -
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    3 +--
 include/linux/mlx4/driver.h                        |    2 ++
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index dd6a77b..9d0b88e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -34,12 +34,12 @@
 #include <linux/kernel.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/mlx4/driver.h>
 
 #include "mlx4_en.h"
 #include "en_port.h"
 
 #define EN_ETHTOOL_QP_ATTACH (1ull << 63)
-#define EN_ETHTOOL_MAC_MASK 0xffffffffffffULL
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0xffff)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0xffffffff)
 
@@ -751,7 +751,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	struct ethhdr *eth_spec;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_spec_list *spec_l2;
-	__be64 mac_msk = cpu_to_be64(EN_ETHTOOL_MAC_MASK << 16);
+	__be64 mac_msk = cpu_to_be64(MLX4_MAC_MASK << 16);
 
 	err = mlx4_en_validate_flow(dev, cmd);
 	if (err)
@@ -761,7 +761,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	if (!spec_l2)
 		return -ENOMEM;
 
-	mac = priv->mac & EN_ETHTOOL_MAC_MASK;
+	mac = priv->mac & MLX4_MAC_MASK;
 	be_mac = cpu_to_be64(mac << 16);
 
 	spec_l2->id = MLX4_NET_TRANS_RULE_ID_ETH;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 5bac0df..4ec3835 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -41,7 +41,6 @@
 
 #define MGM_QPN_MASK       0x00FFFFFF
 #define MGM_BLCK_LB_BIT    30
-#define MLX4_MAC_MASK	   0xffffffffffffULL
 
 static const u8 zero_gid[16];	/* automatically initialized to 0 */
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index a51d1b9..028833f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -39,7 +39,6 @@
 #include "mlx4.h"
 
 #define MLX4_MAC_VALID		(1ull << 63)
-#define MLX4_MAC_MASK		0xffffffffffffULL
 
 #define MLX4_VLAN_VALID		(1u << 31)
 #define MLX4_VLAN_MASK		0xfff
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index c3fa919..94ceddd 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -41,13 +41,12 @@
 #include <linux/slab.h>
 #include <linux/mlx4/cmd.h>
 #include <linux/mlx4/qp.h>
+#include <linux/if_ether.h>
 
 #include "mlx4.h"
 #include "fw.h"
 
 #define MLX4_MAC_VALID		(1ull << 63)
-#define MLX4_MAC_MASK		0x7fffffffffffffffULL
-#define ETH_ALEN		6
 
 struct mac_res {
 	struct list_head list;
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 5f1298b..8dc485f 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -37,6 +37,8 @@
 
 struct mlx4_dev;
 
+#define MLX4_MAC_MASK	   0xffffffffffffULL
+
 enum mlx4_dev_event {
 	MLX4_DEV_EVENT_CATASTROPHIC_ERROR,
 	MLX4_DEV_EVENT_PORT_UP,
-- 
1.7.1
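
The callers touched above combine the mask with a 16-bit left shift before converting to big-endian. A small user-space sketch of why that works, assuming glibc's htobe64() as a stand-in for the kernel's cpu_to_be64(); the DEMO_* names and demo_mac_to_bytes() are hypothetical and exist only for this illustration. After the shift, the 48-bit MAC occupies the first six bytes of the big-endian value, so a six-byte memcpy yields the address in wire order.

```c
#define _DEFAULT_SOURCE   /* exposes htobe64() in glibc's <endian.h> */
#include <assert.h>
#include <endian.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAC_MASK 0xffffffffffffULL  /* same value as MLX4_MAC_MASK */
#define DEMO_ETH_ALEN 6

/* Copy the low 48 bits of 'mac' into out[0..5], most significant byte first. */
static void demo_mac_to_bytes(uint64_t mac, uint8_t out[DEMO_ETH_ALEN])
{
	/* Shift so the MAC sits in the top 6 bytes, then store big-endian. */
	uint64_t be = htobe64((mac & DEMO_MAC_MASK) << 16);

	memcpy(out, &be, DEMO_ETH_ALEN);
}

static void demo_mac_check(void)
{
	static const uint8_t want[DEMO_ETH_ALEN] = {
		0x00, 0x11, 0x22, 0x33, 0x44, 0x55
	};
	uint8_t got[DEMO_ETH_ALEN];

	demo_mac_to_bytes(0x001122334455ULL, got);
	assert(memcmp(got, want, DEMO_ETH_ALEN) == 0);

	/* Bits above bit 47 are dropped by the mask before the shift. */
	demo_mac_to_bytes(0xffff001122334455ULL, got);
	assert(memcmp(got, want, DEMO_ETH_ALEN) == 0);
}
```

The second check shows what the mask is for: without it, the shift would still discard the upper 16 bits here, but code that uses the masked value unshifted relies on the mask directly.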

* [PATCH net-next V1 3/4] {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
From: Or Gerlitz @ 2012-07-19  8:33 UTC
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342686832-21406-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c          |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/eq.c    |   12 +++++++++++-
 include/linux/mlx4/device.h                |    4 +++-
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 8a3a203..a07b774 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1159,7 +1159,8 @@ static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
 			sprintf(name, "mlx4-ib-%d-%d@%s",
 				i, j, dev->pdev->bus->name);
 			/* Set IRQ for specific name (per ring) */
-			if (mlx4_assign_eq(dev, name, &ibdev->eq_table[eq])) {
+			if (mlx4_assign_eq(dev, name, NULL,
+					   &ibdev->eq_table[eq])) {
 				/* Use legacy (same as mlx4_en driver) */
 				pr_warn("Can't allocate EQ %d; reverting to legacy\n", eq);
 				ibdev->eq_table[eq] =
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 908a460..0ef6156 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -91,7 +91,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 				sprintf(name, "%s-%d", priv->dev->name,
 					cq->ring);
 				/* Set IRQ for specific name (per ring) */
-				if (mlx4_assign_eq(mdev->dev, name, &cq->vector)) {
+				if (mlx4_assign_eq(mdev->dev, name, NULL,
+						   &cq->vector)) {
 					cq->vector = (cq->ring + 1 + priv->port)
 					    % mdev->dev->caps.num_comp_vectors;
 					mlx4_warn(mdev, "Failed Assigning an EQ to "
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index bce98d9..cd48337 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -39,6 +39,7 @@
 #include <linux/dma-mapping.h>
 
 #include <linux/mlx4/cmd.h>
+#include <linux/cpu_rmap.h>
 
 #include "mlx4.h"
 #include "fw.h"
@@ -1060,7 +1061,8 @@ int mlx4_test_interrupts(struct mlx4_dev *dev)
 }
 EXPORT_SYMBOL(mlx4_test_interrupts);
 
-int mlx4_assign_eq(struct mlx4_dev *dev, char* name, int * vector)
+int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
+		   int *vector)
 {
 
 	struct mlx4_priv *priv = mlx4_priv(dev);
@@ -1074,6 +1076,14 @@ int mlx4_assign_eq(struct mlx4_dev *dev, char* name, int * vector)
 			snprintf(priv->eq_table.irq_names +
 					vec * MLX4_IRQNAME_SIZE,
 					MLX4_IRQNAME_SIZE, "%s", name);
+#ifdef CONFIG_RFS_ACCEL
+			if (rmap) {
+				err = irq_cpu_rmap_add(rmap,
+						       priv->eq_table.eq[vec].irq);
+				if (err)
+					mlx4_warn(dev, "Failed adding irq rmap\n");
+			}
+#endif
 			err = request_irq(priv->eq_table.eq[vec].irq,
 					  mlx4_msi_x_interrupt, 0,
 					  &priv->eq_table.irq_names[vec<<5],
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 6f0d133..4d7761f 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -36,6 +36,7 @@
 #include <linux/pci.h>
 #include <linux/completion.h>
 #include <linux/radix-tree.h>
+#include <linux/cpu_rmap.h>
 
 #include <linux/atomic.h>
 
@@ -784,7 +785,8 @@ void mlx4_fmr_unmap(struct mlx4_dev *dev, struct mlx4_fmr *fmr,
 int mlx4_fmr_free(struct mlx4_dev *dev, struct mlx4_fmr *fmr);
 int mlx4_SYNC_TPT(struct mlx4_dev *dev);
 int mlx4_test_interrupts(struct mlx4_dev *dev);
-int mlx4_assign_eq(struct mlx4_dev *dev, char* name , int* vector);
+int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
+		   int *vector);
 void mlx4_release_eq(struct mlx4_dev *dev, int vec);
 
 int mlx4_wol_read(struct mlx4_dev *dev, u64 *config, int port);
-- 
1.7.1

* [PATCH net-next V1 4/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-19  8:33 UTC
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342686832-21406-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Use the RFS infrastructure and flow steering in HW to keep the CPU
affinity of rx interrupts and the application per TCP stream.

A flow steering filter is added to the HW whenever the RFS
ndo callback is invoked by the core networking code.

Because the invocation takes place in interrupt context, the
actual setup of the HW is done using a workqueue. Whenever a new
filter is added, the driver checks for expiry of existing filters.

Since there is a window in time between the point where the core
RFS code invokes the ndo callback and the point where the HW
is configured from the workqueue context, the 2nd, 3rd, etc.
packets from that stream will cause the net core to invoke
the callback again and again.

To prevent inefficient/double configuration of the HW, the filters
are kept in a database which is indexed using a hash function to enable
fast access.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c     |    8 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  316 ++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |    3 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   16 ++
 4 files changed, 342 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 0ef6156..aa9c2f6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -77,6 +77,12 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 	struct mlx4_en_dev *mdev = priv->mdev;
 	int err = 0;
 	char name[25];
+	struct cpu_rmap *rmap =
+#ifdef CONFIG_RFS_ACCEL
+		priv->dev->rx_cpu_rmap;
+#else
+		NULL;
+#endif
 
 	cq->dev = mdev->pndev[priv->port];
 	cq->mcq.set_ci_db  = cq->wqres.db.db;
@@ -91,7 +97,7 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 				sprintf(name, "%s-%d", priv->dev->name,
 					cq->ring);
 				/* Set IRQ for specific name (per ring) */
-				if (mlx4_assign_eq(mdev->dev, name, NULL,
+				if (mlx4_assign_eq(mdev->dev, name, rmap,
 						   &cq->vector)) {
 					cq->vector = (cq->ring + 1 + priv->port)
 					    % mdev->dev->caps.num_comp_vectors;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 4ce5ca8..8864d8b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -36,6 +36,8 @@
 #include <linux/if_vlan.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
+#include <net/ip.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/device.h>
@@ -66,6 +68,299 @@ static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 	return 0;
 }
 
+#ifdef CONFIG_RFS_ACCEL
+
+struct mlx4_en_filter {
+	struct list_head next;
+	struct work_struct work;
+
+	__be32 src_ip;
+	__be32 dst_ip;
+	__be16 src_port;
+	__be16 dst_port;
+
+	int rxq_index;
+	struct mlx4_en_priv *priv;
+	u32 flow_id;			/* RFS infrastructure id */
+	int id;				/* mlx4_en driver id */
+	u64 reg_id;			/* Flow steering API id */
+	u8 activated;			/* Used to prevent expiry before filter
+					 * is attached
+					 */
+	struct hlist_node filter_chain;
+};
+
+static void mlx4_en_filter_rfs_expire(struct mlx4_en_priv *priv);
+
+static void mlx4_en_filter_work(struct work_struct *work)
+{
+	struct mlx4_en_filter *filter = container_of(work,
+						     struct mlx4_en_filter,
+						     work);
+	struct mlx4_en_priv *priv = filter->priv;
+	struct mlx4_spec_list spec_tcp = {
+		.id = MLX4_NET_TRANS_RULE_ID_TCP,
+		{
+			.tcp_udp = {
+				.dst_port = filter->dst_port,
+				.dst_port_msk = (__force __be16)-1,
+				.src_port = filter->src_port,
+				.src_port_msk = (__force __be16)-1,
+			},
+		},
+	};
+	struct mlx4_spec_list spec_ip = {
+		.id = MLX4_NET_TRANS_RULE_ID_IPV4,
+		{
+			.ipv4 = {
+				.dst_ip = filter->dst_ip,
+				.dst_ip_msk = (__force __be32)-1,
+				.src_ip = filter->src_ip,
+				.src_ip_msk = (__force __be32)-1,
+			},
+		},
+	};
+	struct mlx4_spec_list spec_eth = {
+		.id = MLX4_NET_TRANS_RULE_ID_ETH,
+	};
+	struct mlx4_net_trans_rule rule = {
+		.list = LIST_HEAD_INIT(rule.list),
+		.queue_mode = MLX4_NET_TRANS_Q_LIFO,
+		.exclusive = 1,
+		.allow_loopback = 1,
+		.promisc_mode = MLX4_FS_PROMISC_NONE,
+		.port = priv->port,
+		.priority = MLX4_DOMAIN_RFS,
+	};
+	int rc;
+	__be64 mac;
+	__be64 mac_mask = cpu_to_be64(MLX4_MAC_MASK << 16);
+
+	list_add_tail(&spec_eth.list, &rule.list);
+	list_add_tail(&spec_ip.list, &rule.list);
+	list_add_tail(&spec_tcp.list, &rule.list);
+
+	mac = cpu_to_be64((priv->mac & MLX4_MAC_MASK) << 16);
+
+	rule.qpn = priv->rss_map.qps[filter->rxq_index].qpn;
+	memcpy(spec_eth.eth.dst_mac, &mac, ETH_ALEN);
+	memcpy(spec_eth.eth.dst_mac_msk, &mac_mask, ETH_ALEN);
+
+	filter->activated = 0;
+
+	if (filter->reg_id) {
+		rc = mlx4_flow_detach(priv->mdev->dev, filter->reg_id);
+		if (rc && rc != -ENOENT)
+			en_err(priv, "Error detaching flow. rc = %d\n", rc);
+	}
+
+	rc = mlx4_flow_attach(priv->mdev->dev, &rule, &filter->reg_id);
+	if (rc)
+		en_err(priv, "Error attaching flow. err = %d\n", rc);
+
+	mlx4_en_filter_rfs_expire(priv);
+
+	filter->activated = 1;
+}
+
+static inline struct hlist_head *
+filter_hash_bucket(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
+		   __be16 src_port, __be16 dst_port)
+{
+	unsigned long l;
+	int bucket_idx;
+
+	l = (__force unsigned long)src_port |
+	    ((__force unsigned long)dst_port << 2);
+	l ^= (__force unsigned long)(src_ip ^ dst_ip);
+
+	bucket_idx = hash_long(l, MLX4_EN_FILTER_HASH_SHIFT);
+
+	return &priv->filter_hash[bucket_idx];
+}
+
+static struct mlx4_en_filter *
+mlx4_en_filter_alloc(struct mlx4_en_priv *priv, int rxq_index, __be32 src_ip,
+		     __be32 dst_ip, __be16 src_port, __be16 dst_port,
+		     u32 flow_id)
+{
+	struct mlx4_en_filter *filter = NULL;
+
+	filter = kzalloc(sizeof(struct mlx4_en_filter), GFP_ATOMIC);
+	if (!filter)
+		return NULL;
+
+	filter->priv = priv;
+	filter->rxq_index = rxq_index;
+	INIT_WORK(&filter->work, mlx4_en_filter_work);
+
+	filter->src_ip = src_ip;
+	filter->dst_ip = dst_ip;
+	filter->src_port = src_port;
+	filter->dst_port = dst_port;
+
+	filter->flow_id = flow_id;
+
+	filter->id = priv->last_filter_id++;
+
+	list_add_tail(&filter->next, &priv->filters);
+	hlist_add_head(&filter->filter_chain,
+		       filter_hash_bucket(priv, src_ip, dst_ip, src_port,
+					  dst_port));
+
+	return filter;
+}
+
+static void mlx4_en_filter_free(struct mlx4_en_filter *filter)
+{
+	struct mlx4_en_priv *priv = filter->priv;
+	int rc;
+
+	list_del(&filter->next);
+
+	rc = mlx4_flow_detach(priv->mdev->dev, filter->reg_id);
+	if (rc && rc != -ENOENT)
+		en_err(priv, "Error detaching flow. rc = %d\n", rc);
+
+	kfree(filter);
+}
+
+static inline struct mlx4_en_filter *
+mlx4_en_filter_find(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
+		    __be16 src_port, __be16 dst_port)
+{
+	struct hlist_node *elem;
+	struct mlx4_en_filter *filter;
+	struct mlx4_en_filter *ret = NULL;
+
+	hlist_for_each_entry(filter, elem,
+			     filter_hash_bucket(priv, src_ip, dst_ip,
+						src_port, dst_port),
+			     filter_chain) {
+		if (filter->src_ip == src_ip &&
+		    filter->dst_ip == dst_ip &&
+		    filter->src_port == src_port &&
+		    filter->dst_port == dst_port) {
+			ret = filter;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int
+mlx4_en_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+		   u16 rxq_index, u32 flow_id)
+{
+	struct mlx4_en_priv *priv = netdev_priv(net_dev);
+	struct mlx4_en_filter *filter;
+	const struct iphdr *ip;
+	const __be16 *ports;
+	__be32 src_ip;
+	__be32 dst_ip;
+	__be16 src_port;
+	__be16 dst_port;
+	int nhoff = skb_network_offset(skb);
+	int ret = 0;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		return -EPROTONOSUPPORT;
+
+	ip = (const struct iphdr *)(skb->data + nhoff);
+	if (ip_is_fragment(ip))
+		return -EPROTONOSUPPORT;
+
+	ports = (const __be16 *)(skb->data + nhoff + 4 * ip->ihl);
+
+	src_ip = ip->saddr;
+	dst_ip = ip->daddr;
+	src_port = ports[0];
+	dst_port = ports[1];
+
+	if (ip->protocol != IPPROTO_TCP)
+		return -EPROTONOSUPPORT;
+
+	spin_lock_bh(&priv->filters_lock);
+	filter = mlx4_en_filter_find(priv, src_ip, dst_ip, src_port, dst_port);
+	if (filter) {
+		if (filter->rxq_index == rxq_index)
+			goto out;
+
+		filter->rxq_index = rxq_index;
+	} else {
+		filter = mlx4_en_filter_alloc(priv, rxq_index,
+					      src_ip, dst_ip,
+					      src_port, dst_port, flow_id);
+		if (!filter) {
+			ret = -ENOMEM;
+			goto err;
+		}
+	}
+
+	queue_work(priv->mdev->workqueue, &filter->work);
+
+out:
+	ret = filter->id;
+err:
+	spin_unlock_bh(&priv->filters_lock);
+
+	return ret;
+}
+
+void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
+			     struct mlx4_en_rx_ring *rx_ring)
+{
+	struct mlx4_en_filter *filter, *tmp;
+	LIST_HEAD(del_list);
+
+	spin_lock_bh(&priv->filters_lock);
+	list_for_each_entry_safe(filter, tmp, &priv->filters, next) {
+		list_move(&filter->next, &del_list);
+		hlist_del(&filter->filter_chain);
+	}
+	spin_unlock_bh(&priv->filters_lock);
+
+	list_for_each_entry_safe(filter, tmp, &del_list, next) {
+		cancel_work_sync(&filter->work);
+		mlx4_en_filter_free(filter);
+	}
+}
+
+static void mlx4_en_filter_rfs_expire(struct mlx4_en_priv *priv)
+{
+	struct mlx4_en_filter *filter = NULL, *tmp, *last_filter = NULL;
+	LIST_HEAD(del_list);
+	int i = 0;
+
+	spin_lock_bh(&priv->filters_lock);
+	list_for_each_entry_safe(filter, tmp, &priv->filters, next) {
+		if (i > MLX4_EN_FILTER_EXPIRY_QUOTA)
+			break;
+
+		if (filter->activated &&
+		    !work_pending(&filter->work) &&
+		    rps_may_expire_flow(priv->dev,
+					filter->rxq_index, filter->flow_id,
+					filter->id)) {
+			list_move(&filter->next, &del_list);
+			hlist_del(&filter->filter_chain);
+		} else
+			last_filter = filter;
+
+		i++;
+	}
+
+	if (last_filter && (&last_filter->next != priv->filters.next))
+		list_move(&priv->filters, &last_filter->next);
+
+	spin_unlock_bh(&priv->filters_lock);
+
+	list_for_each_entry_safe(filter, tmp, &del_list, next)
+		mlx4_en_filter_free(filter);
+}
+#endif
+
 static int mlx4_en_vlan_rx_add_vid(struct net_device *dev, unsigned short vid)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1079,6 +1374,11 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 {
 	int i;
 
+#ifdef CONFIG_RFS_ACCEL
+	free_irq_cpu_rmap(priv->dev->rx_cpu_rmap);
+	priv->dev->rx_cpu_rmap = NULL;
+#endif
+
 	for (i = 0; i < priv->tx_ring_num; i++) {
 		if (priv->tx_ring[i].tx_info)
 			mlx4_en_destroy_tx_ring(priv, &priv->tx_ring[i]);
@@ -1134,6 +1434,15 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
 			goto err;
 	}
 
+#ifdef CONFIG_RFS_ACCEL
+	priv->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(priv->rx_ring_num);
+	if (!priv->dev->rx_cpu_rmap)
+		goto err;
+
+	INIT_LIST_HEAD(&priv->filters);
+	spin_lock_init(&priv->filters_lock);
+#endif
+
 	return 0;
 
 err:
@@ -1241,6 +1550,9 @@ static const struct net_device_ops mlx4_netdev_ops = {
 #endif
 	.ndo_set_features	= mlx4_en_set_features,
 	.ndo_setup_tc		= mlx4_en_setup_tc,
+#ifdef CONFIG_RFS_ACCEL
+	.ndo_rx_flow_steer	= mlx4_en_filter_rfs,
+#endif
 };
 
 int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
@@ -1358,6 +1670,10 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 			NETIF_F_HW_VLAN_FILTER;
 	dev->hw_features |= NETIF_F_LOOPBACK;
 
+	if (mdev->dev->caps.steering_mode ==
+	    MLX4_STEERING_MODE_DEVICE_MANAGED)
+		dev->hw_features |= NETIF_F_NTUPLE;
+
 	mdev->pndev[port] = dev;
 
 	netif_carrier_off(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a04cbf7..796cd58 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -389,6 +389,9 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
 	vfree(ring->rx_info);
 	ring->rx_info = NULL;
+#ifdef CONFIG_RFS_ACCEL
+	mlx4_en_cleanup_filters(priv, ring);
+#endif
 }
 
 void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a126321..af34c98 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -43,6 +43,7 @@
 #ifdef CONFIG_MLX4_EN_DCB
 #include <linux/dcbnl.h>
 #endif
+#include <linux/cpu_rmap.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/qp.h>
@@ -77,6 +78,9 @@
 #define STATS_DELAY		(HZ / 4)
 #define MAX_NUM_OF_FS_RULES	256
 
+#define MLX4_EN_FILTER_HASH_SHIFT 4
+#define MLX4_EN_FILTER_EXPIRY_QUOTA 60
+
 /* Typical TSO descriptor with 16 gather entries is 352 bytes... */
 #define MAX_DESC_SIZE		512
 #define MAX_DESC_TXBBS		(MAX_DESC_SIZE / TXBB_SIZE)
@@ -523,6 +527,13 @@ struct mlx4_en_priv {
 	struct ieee_ets ets;
 	u16 maxrate[IEEE_8021QAZ_MAX_TCS];
 #endif
+#ifdef CONFIG_RFS_ACCEL
+	spinlock_t filters_lock;
+	int last_filter_id;
+	struct list_head filters;
+	struct hlist_head filter_hash[1 << MLX4_EN_FILTER_HASH_SHIFT];
+#endif
+
 };
 
 enum mlx4_en_wol {
@@ -602,6 +613,11 @@ int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 extern const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops;
 #endif
 
+#ifdef CONFIG_RFS_ACCEL
+void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
+			     struct mlx4_en_rx_ring *rx_ring);
+#endif
+
 #define MLX4_EN_NUM_SELF_TEST	5
 void mlx4_en_ex_selftest(struct net_device *dev, u32 *flags, u64 *buf);
 u64 mlx4_en_mac_to_u64(u8 *addr);
-- 
1.7.1
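
The commit message above explains that repeated callbacks for the same stream must not reconfigure the HW twice. Here is a simplified user-space sketch of that lookup-or-create idea, under stated assumptions: all demo_* names are hypothetical, the hash is an arbitrary multiplicative mix rather than the driver's hash_long() call, and a counter stands in for queueing work. Locking, the workqueue, and expiry are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_HASH_SHIFT 4
#define DEMO_BUCKETS    (1u << DEMO_HASH_SHIFT)

struct demo_filter {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	int rxq_index;
	int id;
	struct demo_filter *next;	/* bucket chain */
};

struct demo_table {
	struct demo_filter *bucket[DEMO_BUCKETS];
	int last_id;
	int hw_updates;	/* stands in for "work queued to program the HW" */
};

static unsigned int demo_bucket(uint32_t sip, uint32_t dip,
				uint16_t sp, uint16_t dp)
{
	uint32_t h = sip ^ dip ^ (((uint32_t)sp << 16) | dp);

	h *= 0x9e3779b1u;		/* arbitrary multiplicative mix */
	return h >> (32 - DEMO_HASH_SHIFT);
}

/* Returns the filter id, or -1 if allocation fails. */
static int demo_flow_steer(struct demo_table *t, uint32_t sip, uint32_t dip,
			   uint16_t sp, uint16_t dp, int rxq)
{
	unsigned int b = demo_bucket(sip, dip, sp, dp);
	struct demo_filter *f;

	for (f = t->bucket[b]; f; f = f->next)
		if (f->src_ip == sip && f->dst_ip == dip &&
		    f->src_port == sp && f->dst_port == dp)
			break;

	if (f) {
		if (f->rxq_index == rxq)
			return f->id;	/* already steered there: no HW work */
		f->rxq_index = rxq;	/* same flow, new queue */
	} else {
		f = calloc(1, sizeof(*f));
		if (!f)
			return -1;
		f->src_ip = sip;
		f->dst_ip = dip;
		f->src_port = sp;
		f->dst_port = dp;
		f->rxq_index = rxq;
		f->id = t->last_id++;
		f->next = t->bucket[b];
		t->bucket[b] = f;
	}

	t->hw_updates++;
	return f->id;
}

static void demo_table_free(struct demo_table *t)
{
	for (unsigned int b = 0; b < DEMO_BUCKETS; b++) {
		while (t->bucket[b]) {
			struct demo_filter *f = t->bucket[b];

			t->bucket[b] = f->next;
			free(f);
		}
	}
	memset(t, 0, sizeof(*t));
}
```

Calling demo_flow_steer() twice with the same 4-tuple and queue returns the same id and leaves hw_updates unchanged on the second call. That is the double-configuration case the patch description sets out to avoid.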

* [PATCH net-next V1 2/4] net/rps: Protect cpu_rmap.h from double inclusion
From: Or Gerlitz @ 2012-07-19  8:33 UTC
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342686832-21406-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/linux/cpu_rmap.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
index 473771a..ac3bbb5 100644
--- a/include/linux/cpu_rmap.h
+++ b/include/linux/cpu_rmap.h
@@ -1,3 +1,6 @@
+#ifndef __LINUX_CPU_RMAP_H
+#define __LINUX_CPU_RMAP_H
+
 /*
  * cpu_rmap.c: CPU affinity reverse-map support
  * Copyright 2011 Solarflare Communications Inc.
@@ -71,3 +74,4 @@ extern void free_irq_cpu_rmap(struct cpu_rmap *rmap);
 extern int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq);
 
 #endif
+#endif /* __LINUX_CPU_RMAP_H */
-- 
1.7.1

* [PATCH net-next V1 0/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-19  8:33 UTC
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Or Gerlitz, Amir Vadai

Hi Dave,

This series from Amir Vadai adds support for Accelerated RFS
to the mlx4_en Ethernet driver.

The code uses the Accelerated RFS infrastructure and HW flow steering
to keep CPU affinity of rx interrupts and applications per TCP stream.

To do so, we had to add a little protection to cpu_rmap.h against double
inclusion. We also added linking between CPUs and IRQs using rmap in the
mlx4_core driver.

changes from V0:
 - always use CONFIG_RFS_ACCEL instead of using CONFIG_CPU_RMAP directly
   in two places

Or.


Amir Vadai (4):
  net/mlx4: Move MAC_MASK to a common place
  net/rps: Protect cpu_rmap.h from double inclusion
  {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
  net/mlx4_en: Add accelerated RFS support

 drivers/infiniband/hw/mlx4/main.c                  |    3 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c         |    9 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    6 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |  316 ++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |    3 +
 drivers/net/ethernet/mellanox/mlx4/eq.c            |   12 +-
 drivers/net/ethernet/mellanox/mlx4/mcg.c           |    1 -
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |   16 +
 drivers/net/ethernet/mellanox/mlx4/port.c          |    1 -
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    3 +-
 include/linux/cpu_rmap.h                           |    4 +
 include/linux/mlx4/device.h                        |    4 +-
 include/linux/mlx4/driver.h                        |    2 +
 13 files changed, 369 insertions(+), 11 deletions(-)

CC: Amir Vadai <amirv@mellanox.com>

* RE: [net-next 9/9] ixgbe: Cleanup holes in flags after removing several of them
From: David Laight @ 2012-07-19  8:33 UTC
  To: Jeff Kirsher, davem; +Cc: Alexander Duyck, netdev, gospo, sassmann
In-Reply-To: <1342643516-2696-10-git-send-email-jeffrey.t.kirsher@intel.com>

> This change is just meant to defragment the flags as there are several
> hole that have been introduced since several features, or the flags for
> them, have been removed.

Doesn't this sort of change just make it difficult for people who are
looking at hexdumps of memory but don't have exactly the right header
file to hand?

It doesn't really gain anything much either.

I can (just) imagine reordering flags so that the commonly
tested ones are in the low bits so that they can be tested
with small immediate constants - saving an instruction.
But that isn't what is being done here.

	David

* [PATCH net-next] ipv4: tcp: remove per net tcp_sock
From: Eric Dumazet @ 2012-07-19  8:58 UTC
  To: David Miller; +Cc: netdev, Tom Herbert, Bill Sommerfeld

From: Eric Dumazet <edumazet@google.com>

tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.

This leads to bad behavior on multiqueue NICs, because many cpus
contend for the socket lock and, once the socket lock is acquired, extra
false sharing on various socket fields slows down the operations.

To better resist attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node).

Additional features:

1) We also mirror the queue_mapping of the incoming skb, so that
answers use the same queue if possible.

2) Setting the SOCK_USE_WRITE_QUEUE socket flag speeds up sock_wfree().

3) We now limit the number of in-flight RST/ACK [1] packets
per cpu, instead of per namespace, and we honor the sysctl_wmem_default
limit dynamically. (Prior to this patch, the sysctl_wmem_default value was
copied at boot time, so any further change would not affect the tcp_sock
limit.)


[1] These packets are only generated when no socket was matched for
the incoming packet.

Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
---
 include/net/ip.h         |    2 -
 include/net/netns/ipv4.h |    1 
 net/ipv4/ip_output.c     |   46 ++++++++++++++++++++++---------------
 net/ipv4/tcp_ipv4.c      |    8 ++----
 4 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index ec5cfde..bd5e444 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -158,7 +158,7 @@ static inline __u8 ip_reply_arg_flowi_flags(const struct ip_reply_arg *arg)
 	return (arg->flags & IP_REPLY_ARG_NOSRCCHECK) ? FLOWI_FLAG_ANYSRC : 0;
 }
 
-void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, __be32 daddr,
+void ip_send_unicast_reply(struct net *net, struct sk_buff *skb, __be32 daddr,
 			   __be32 saddr, const struct ip_reply_arg *arg,
 			   unsigned int len);
 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 2e089a9..d909c7f 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -38,7 +38,6 @@ struct netns_ipv4 {
 	struct sock		*fibnl;
 
 	struct sock		**icmp_sk;
-	struct sock		*tcp_sock;
 	struct inet_peer_base	*peers;
 	struct tcpm_hash_bucket	*tcp_metrics_hash;
 	unsigned int		tcp_metrics_hash_mask;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cc52679..3633136 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1463,20 +1463,29 @@ static int ip_reply_glue_bits(void *dptr, char *to, int offset,
 
 /*
  *	Generic function to send a packet as reply to another packet.
- *	Used to send TCP resets so far.
+ *	Used to send some TCP resets/acks so far.
  *
- *	Should run single threaded per socket because it uses the sock
- *     	structure to pass arguments.
+ *	Use a fake percpu inet socket to avoid false sharing and contention.
  */
-void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, __be32 daddr,
+void ip_send_unicast_reply(struct net *net, struct sk_buff *skb, __be32 daddr,
 			   __be32 saddr, const struct ip_reply_arg *arg,
 			   unsigned int len)
 {
-	struct inet_sock *inet = inet_sk(sk);
 	struct ip_options_data replyopts;
 	struct ipcm_cookie ipc;
 	struct flowi4 fl4;
 	struct rtable *rt = skb_rtable(skb);
+	struct sk_buff *nskb;
+	struct sock *sk;
+	struct inet_sock *inet;
+	static DEFINE_PER_CPU(struct inet_sock, unicast_sock) = {
+		.sk = {
+			.sk_wmem_alloc	= ATOMIC_INIT(1),
+			.sk_allocation	= GFP_ATOMIC,
+			.sk_flags	= (1UL << SOCK_USE_WRITE_QUEUE),
+		},
+		.pmtudisc = IP_PMTUDISC_WANT,
+	};
 
 	if (ip_options_echo(&replyopts.opt.opt, skb))
 		return;
@@ -1494,38 +1503,39 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, __be32 daddr,
 
 	flowi4_init_output(&fl4, arg->bound_dev_if, 0,
 			   RT_TOS(arg->tos),
-			   RT_SCOPE_UNIVERSE, sk->sk_protocol,
+			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
 			   ip_reply_arg_flowi_flags(arg),
 			   daddr, saddr,
 			   tcp_hdr(skb)->source, tcp_hdr(skb)->dest);
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
-	rt = ip_route_output_key(sock_net(sk), &fl4);
+	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt))
 		return;
 
-	/* And let IP do all the hard work.
+	inet = &get_cpu_var(unicast_sock);
 
-	   This chunk is not reenterable, hence spinlock.
-	   Note that it uses the fact, that this function is called
-	   with locally disabled BH and that sk cannot be already spinlocked.
-	 */
-	bh_lock_sock(sk);
 	inet->tos = arg->tos;
+	sk = &inet->sk;
 	sk->sk_priority = skb->priority;
 	sk->sk_protocol = ip_hdr(skb)->protocol;
 	sk->sk_bound_dev_if = arg->bound_dev_if;
+	sock_net_set(sk, net);
+	__skb_queue_head_init(&sk->sk_write_queue);
+	sk->sk_sndbuf = sysctl_wmem_default;
 	ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base, len, 0,
 		       &ipc, &rt, MSG_DONTWAIT);
-	if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
+	nskb = skb_peek(&sk->sk_write_queue);
+	if (nskb) {
 		if (arg->csumoffset >= 0)
-			*((__sum16 *)skb_transport_header(skb) +
-			  arg->csumoffset) = csum_fold(csum_add(skb->csum,
+			*((__sum16 *)skb_transport_header(nskb) +
+			  arg->csumoffset) = csum_fold(csum_add(nskb->csum,
 								arg->csum));
-		skb->ip_summed = CHECKSUM_NONE;
+		nskb->ip_summed = CHECKSUM_NONE;
+		skb_set_queue_mapping(nskb, skb_get_queue_mapping(skb));
 		ip_push_pending_frames(sk, &fl4);
 	}
 
-	bh_unlock_sock(sk);
+	put_cpu_var(unicast_sock);
 
 	ip_rt_put(rt);
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d9caf5c..d7d2fa5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -688,7 +688,7 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 
 	net = dev_net(skb_dst(skb)->dev);
 	arg.tos = ip_hdr(skb)->tos;
-	ip_send_unicast_reply(net->ipv4.tcp_sock, skb, ip_hdr(skb)->saddr,
+	ip_send_unicast_reply(net, skb, ip_hdr(skb)->saddr,
 			      ip_hdr(skb)->daddr, &arg, arg.iov[0].iov_len);
 
 	TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
@@ -771,7 +771,7 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
 	if (oif)
 		arg.bound_dev_if = oif;
 	arg.tos = tos;
-	ip_send_unicast_reply(net->ipv4.tcp_sock, skb, ip_hdr(skb)->saddr,
+	ip_send_unicast_reply(net, skb, ip_hdr(skb)->saddr,
 			      ip_hdr(skb)->daddr, &arg, arg.iov[0].iov_len);
 
 	TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
@@ -2624,13 +2624,11 @@ EXPORT_SYMBOL(tcp_prot);
 
 static int __net_init tcp_sk_init(struct net *net)
 {
-	return inet_ctl_sock_create(&net->ipv4.tcp_sock,
-				    PF_INET, SOCK_RAW, IPPROTO_TCP, net);
+	return 0;
 }
 
 static void __net_exit tcp_sk_exit(struct net *net)
 {
-	inet_ctl_sock_destroy(net->ipv4.tcp_sock);
 }
 
 static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)

^ permalink raw reply related

* [PATCH] net: e100: ucode is optional in some cases
From: Bjørn Mork @ 2012-07-19  9:33 UTC (permalink / raw)
  To: netdev
  Cc: Jeff Kirsher, Jesse Brandeburg, Bruce Allan, Carolyn Wyborny,
	Don Skidmore, Greg Rose, Peter P Waskiewicz Jr, Alex Duyck,
	John Ronciak, e1000-devel, Bjørn Mork

  commit 9ac32e1b firmware: convert e100 driver to request_firmware()

did a straight conversion of the in-driver ucode to external
files.  This introduced the possibility of the driver failing
to enable an interface due to missing ucode. There was no
evaluation of the importance of the ucode at the time.

Based on the information available from

 http://downloadmirror.intel.com/5154/eng/e100.htm
 http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/fxp/rcvbundl.h?rev=HEAD;content-type=text%2Fplain

we can assume that the ucode implements the "CPU Cycle Saver"
feature on the supported adapters.  Although generally wanted,
this is an optional feature.  The ucode source is not
available,  preventing it from being included in free
distributions.  This creates unnecessary problems for the end
users. Doing a network install based on a free distribution
installer requires the user to download and insert the ucode
into the installer.

Making the ucode optional when possible improves the user
experience and driver usability.

The ucode for some adapters includes a bugfix, making it
essential.  We continue to fail for these adapters unless the
ucode is available.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
This was inspired by a recent bad experience trying to install
Debian wheezy on an older laptop using a Debian netboot image. 
The image does of course not include any of the ucode files
due to the missing source.  The installer supports inserting
firmware from a floppy or other medium, but I find that to be
unnecessary hassle given that the device can work without it.



 drivers/net/ethernet/intel/e100.c |   41 +++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/e100.c b/drivers/net/ethernet/intel/e100.c
index ada720b..81670d8 100644
--- a/drivers/net/ethernet/intel/e100.c
+++ b/drivers/net/ethernet/intel/e100.c
@@ -1249,20 +1249,36 @@ static const struct firmware *e100_request_firmware(struct nic *nic)
 	const struct firmware *fw = nic->fw;
 	u8 timer, bundle, min_size;
 	int err = 0;
+	bool required = false;
 
 	/* do not load u-code for ICH devices */
 	if (nic->flags & ich)
 		return NULL;
 
-	/* Search for ucode match against h/w revision */
-	if (nic->mac == mac_82559_D101M)
+	/* Search for ucode match against h/w revision
+	 *
+	 * The FIRMWARE_D102E ucode includes both CPUSaver and
+	 *
+	 *    "fixes for bugs in the B-step hardware (specifically, bugs
+	 *     with Inline Receive)."
+	 *
+	 * according to
+	 * http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/fxp/rcvbundl.h?rev=HEAD;content-type=text%2Fplain
+	 * So we must fail if it cannot be loaded.
+	 *
+	 * The other microcode files are only required for the optional
+	 * CPUSaver feature.  Nice to have, but no reason to fail.
+	 */
+	if (nic->mac == mac_82559_D101M) {
 		fw_name = FIRMWARE_D101M;
-	else if (nic->mac == mac_82559_D101S)
+	} else if (nic->mac == mac_82559_D101S) {
 		fw_name = FIRMWARE_D101S;
-	else if (nic->mac == mac_82551_F || nic->mac == mac_82551_10)
+	} else if (nic->mac == mac_82551_F || nic->mac == mac_82551_10) {
 		fw_name = FIRMWARE_D102E;
-	else /* No ucode on other devices */
+		required = true;
+	} else { /* No ucode on other devices */
 		return NULL;
+	}
 
 	/* If the firmware has not previously been loaded, request a pointer
 	 * to it. If it was previously loaded, we are reinitializing the
@@ -1273,10 +1289,17 @@ static const struct firmware *e100_request_firmware(struct nic *nic)
 		err = request_firmware(&fw, fw_name, &nic->pdev->dev);
 
 	if (err) {
-		netif_err(nic, probe, nic->netdev,
-			  "Failed to load firmware \"%s\": %d\n",
-			  fw_name, err);
-		return ERR_PTR(err);
+		if (required) {
+			netif_err(nic, probe, nic->netdev,
+				  "Failed to load firmware \"%s\": %d\n",
+				  fw_name, err);
+			return ERR_PTR(err);
+		} else {
+			netif_info(nic, probe, nic->netdev,
+				   "CPUSaver disabled. Needs \"%s\": %d\n",
+				   fw_name, err);
+			return NULL;
+		}
 	}
 
 	/* Firmware should be precisely UCODE_SIZE (words) plus three bytes
-- 
1.7.10.4
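
The required/optional decision described above can be modelled outside the driver. What follows is only a minimal sketch, assuming a simplified adapter enum; fake_mac, fw_outcome, fw_is_required() and fw_policy() are hypothetical names for illustration and are not part of e100.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the adapter families named in the patch. */
enum fake_mac { FAKE_D101M, FAKE_D101S, FAKE_82551, FAKE_OTHER };

/* Possible results of trying to set up the ucode for one adapter. */
enum fw_outcome { FW_NOT_NEEDED, FW_LOADED, FW_SKIPPED, FW_FATAL };

/* Per the patch, only the D102E image carries hardware bug fixes,
 * so only the adapters that use it treat the ucode as mandatory. */
bool fw_is_required(enum fake_mac mac)
{
	return mac == FAKE_82551;
}

/* load_err models the request_firmware() result: 0 or a negative errno. */
enum fw_outcome fw_policy(enum fake_mac mac, int load_err)
{
	assert(load_err <= 0);

	if (mac == FAKE_OTHER)
		return FW_NOT_NEEDED;	/* no ucode exists for this device */
	if (load_err == 0)
		return FW_LOADED;
	/* A missing optional image only costs the CPUSaver feature. */
	return fw_is_required(mac) ? FW_FATAL : FW_SKIPPED;
}
```

With this model a D101M adapter whose image is missing comes up without CPUSaver, while an 82551 adapter in the same situation still fails, which is the behaviour the patch aims for.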

^ permalink raw reply related

* [PATCH net-next 1/2] asix: Rework reading from EEPROM
From: Christian Riesch @ 2012-07-19 10:23 UTC (permalink / raw)
  To: netdev; +Cc: Allan Chou, Mark Lord, Grant Grundler, Christian Riesch

The current code for reading the EEPROM via ethtool in the asix
driver has a few issues. It cannot handle odd length values
(accesses must be aligned at 16 bit boundaries) and interprets the
offset provided by ethtool as a 16 bit word offset instead of a byte offset.

The new code for asix_get_eeprom() introduced by this patch is
modeled after the code in
drivers/net/ethernet/atheros/atl1e/atl1e_ethtool.c
and provides read access to the entire EEPROM with arbitrary
offsets and lengths.

Signed-off-by: Christian Riesch <christian.riesch@omicron.at>
---
 drivers/net/usb/asix.h         |    5 ++---
 drivers/net/usb/asix_common.c  |   39 ++++++++++++++++++++++-----------------
 drivers/net/usb/asix_devices.c |    9 ---------
 drivers/net/usb/ax88172a.c     |    3 ---
 4 files changed, 24 insertions(+), 32 deletions(-)

diff --git a/drivers/net/usb/asix.h b/drivers/net/usb/asix.h
index 77d9e4c..fbff177 100644
--- a/drivers/net/usb/asix.h
+++ b/drivers/net/usb/asix.h
@@ -156,8 +156,7 @@
 #define AX_GPIO_RSE		0x80	/* Reload serial EEPROM */
 
 #define AX_EEPROM_MAGIC		0xdeadbeef
-#define AX88172_EEPROM_LEN	0x40
-#define AX88772_EEPROM_LEN	0xff
+#define AX_EEPROM_LEN		0x200
 
 /* This structure cannot exceed sizeof(unsigned long [5]) AKA 20 bytes */
 struct asix_data {
@@ -165,7 +164,7 @@ struct asix_data {
 	u8 mac_addr[ETH_ALEN];
 	u8 phymode;
 	u8 ledmode;
-	u8 eeprom_len;
+	u8 res;
 };
 
 int asix_read_cmd(struct usbnet *dev, u8 cmd, u16 value, u16 index,
diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c
index 336f755..0b5b2d3 100644
--- a/drivers/net/usb/asix_common.c
+++ b/drivers/net/usb/asix_common.c
@@ -478,46 +478,51 @@ int asix_set_wol(struct net_device *net, struct ethtool_wolinfo *wolinfo)
 
 int asix_get_eeprom_len(struct net_device *net)
 {
-	struct usbnet *dev = netdev_priv(net);
-	struct asix_data *data = (struct asix_data *)&dev->data;
-
-	return data->eeprom_len;
+	return AX_EEPROM_LEN;
 }
 
 int asix_get_eeprom(struct net_device *net, struct ethtool_eeprom *eeprom,
 		    u8 *data)
 {
 	struct usbnet *dev = netdev_priv(net);
-	__le16 *ebuf = (__le16 *)data;
+	u16 *eeprom_buff;
+	int first_word, last_word;
 	int i;
 
-	/* Crude hack to ensure that we don't overwrite memory
-	 * if an odd length is supplied
-	 */
-	if (eeprom->len % 2)
+	if (eeprom->len == 0)
 		return -EINVAL;
 
 	eeprom->magic = AX_EEPROM_MAGIC;
 
+	first_word = eeprom->offset >> 1;
+	last_word = (eeprom->offset + eeprom->len - 1) >> 1;
+
+	eeprom_buff = kmalloc(sizeof(u16) * (last_word - first_word + 1),
+			      GFP_KERNEL);
+	if (!eeprom_buff)
+		return -ENOMEM;
+
 	/* ax8817x returns 2 bytes from eeprom on read */
-	for (i=0; i < eeprom->len / 2; i++) {
-		if (asix_read_cmd(dev, AX_CMD_READ_EEPROM,
-			eeprom->offset + i, 0, 2, &ebuf[i]) < 0)
-			return -EINVAL;
+	for (i = first_word; i <= last_word; i++) {
+		if (asix_read_cmd(dev, AX_CMD_READ_EEPROM, i, 0, 2,
+				  &(eeprom_buff[i - first_word])) < 0) {
+			kfree(eeprom_buff);
+			return -EIO;
+		}
 	}
+
+	memcpy(data, (u8 *)eeprom_buff + (eeprom->offset & 1), eeprom->len);
+	kfree(eeprom_buff);
 	return 0;
 }
 
 void asix_get_drvinfo(struct net_device *net, struct ethtool_drvinfo *info)
 {
-	struct usbnet *dev = netdev_priv(net);
-	struct asix_data *data = (struct asix_data *)&dev->data;
-
 	/* Inherit standard device info */
 	usbnet_get_drvinfo(net, info);
 	strncpy (info->driver, DRIVER_NAME, sizeof info->driver);
 	strncpy (info->version, DRIVER_VERSION, sizeof info->version);
-	info->eedump_len = data->eeprom_len;
+	info->eedump_len = AX_EEPROM_LEN;
 }
 
 int asix_set_mac_address(struct net_device *net, void *p)
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index ed9403b..658c08f 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -201,9 +201,6 @@ static int ax88172_bind(struct usbnet *dev, struct usb_interface *intf)
 	u8 buf[ETH_ALEN];
 	int i;
 	unsigned long gpio_bits = dev->driver_info->data;
-	struct asix_data *data = (struct asix_data *)&dev->data;
-
-	data->eeprom_len = AX88172_EEPROM_LEN;
 
 	usbnet_get_endpoints(dev,intf);
 
@@ -409,12 +406,9 @@ static const struct net_device_ops ax88772_netdev_ops = {
 static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf)
 {
 	int ret, embd_phy;
-	struct asix_data *data = (struct asix_data *)&dev->data;
 	u8 buf[ETH_ALEN];
 	u32 phyid;
 
-	data->eeprom_len = AX88772_EEPROM_LEN;
-
 	usbnet_get_endpoints(dev,intf);
 
 	/* Get the MAC address */
@@ -767,9 +761,6 @@ static int ax88178_bind(struct usbnet *dev, struct usb_interface *intf)
 {
 	int ret;
 	u8 buf[ETH_ALEN];
-	struct asix_data *data = (struct asix_data *)&dev->data;
-
-	data->eeprom_len = AX88772_EEPROM_LEN;
 
 	usbnet_get_endpoints(dev,intf);
 
diff --git a/drivers/net/usb/ax88172a.c b/drivers/net/usb/ax88172a.c
index 3d0f8fa..97dce0f 100644
--- a/drivers/net/usb/ax88172a.c
+++ b/drivers/net/usb/ax88172a.c
@@ -228,12 +228,9 @@ err:
 static int ax88172a_bind(struct usbnet *dev, struct usb_interface *intf)
 {
 	int ret;
-	struct asix_data *data = (struct asix_data *)&dev->data;
 	u8 buf[ETH_ALEN];
 	struct ax88172a_private *priv;
 
-	data->eeprom_len = AX88772_EEPROM_LEN;
-
 	usbnet_get_endpoints(dev, intf);
 
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
-- 
1.7.0.4
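
The byte-to-word arithmetic in the reworked asix_get_eeprom() can be tried in userspace. This is a minimal sketch, assuming a word-addressed EEPROM held in a plain array; fake_eeprom, fake_read_word() and eeprom_read_bytes() are hypothetical stand-ins for the device and for asix_read_cmd(), and the bounds check is an addition of this sketch, not something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_EEPROM_WORDS 8

/* Hypothetical device contents: byte k of the EEPROM holds k * 0x11. */
const uint16_t fake_eeprom[FAKE_EEPROM_WORDS] = {
	0x1100, 0x3322, 0x5544, 0x7766, 0x9988, 0xbbaa, 0xddcc, 0xffee
};

/* Hypothetical stand-in for one 2-byte read; low byte comes first. */
void fake_read_word(int word, uint8_t out[2])
{
	assert(word >= 0 && word < FAKE_EEPROM_WORDS);
	out[0] = fake_eeprom[word] & 0xff;
	out[1] = fake_eeprom[word] >> 8;
}

/* Copy len bytes starting at byte offset into data; 0 on success, -1 on error. */
int eeprom_read_bytes(unsigned int offset, unsigned int len, uint8_t *data)
{
	int first_word, last_word, i;
	uint8_t *buf;

	if (len == 0 || offset + len > 2 * FAKE_EEPROM_WORDS)
		return -1;

	/* Same arithmetic as the patch: words covering [offset, offset + len). */
	first_word = offset >> 1;
	last_word = (offset + len - 1) >> 1;

	buf = malloc(2 * (last_word - first_word + 1));
	if (!buf)
		return -1;

	for (i = first_word; i <= last_word; i++)
		fake_read_word(i, buf + 2 * (i - first_word));

	/* Skip the leading byte of the first word when offset is odd. */
	memcpy(data, buf + (offset & 1), len);
	free(buf);
	return 0;
}
```

Reading 3 bytes at offset 3 fetches words 1 and 2 and then drops the first byte of word 1, so odd offsets and odd lengths both work without touching memory outside the caller's buffer.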

^ permalink raw reply related

* [PATCH net-next 2/2] asix: Add support for programming the EEPROM
From: Christian Riesch @ 2012-07-19 10:23 UTC (permalink / raw)
  To: netdev; +Cc: Allan Chou, Mark Lord, Grant Grundler, Christian Riesch
In-Reply-To: <1342693387-17945-1-git-send-email-christian.riesch@omicron.at>

This patch adds the asix_set_eeprom() function to provide support for
programming the configuration EEPROM via ethtool.

Signed-off-by: Christian Riesch <christian.riesch@omicron.at>
---
 drivers/net/usb/asix.h         |    2 +
 drivers/net/usb/asix_common.c  |   81 ++++++++++++++++++++++++++++++++++++++++
 drivers/net/usb/asix_devices.c |    3 +
 drivers/net/usb/ax88172a.c     |    1 +
 4 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/drivers/net/usb/asix.h b/drivers/net/usb/asix.h
index fbff177..e889631 100644
--- a/drivers/net/usb/asix.h
+++ b/drivers/net/usb/asix.h
@@ -208,6 +208,8 @@ int asix_set_wol(struct net_device *net, struct ethtool_wolinfo *wolinfo);
 int asix_get_eeprom_len(struct net_device *net);
 int asix_get_eeprom(struct net_device *net, struct ethtool_eeprom *eeprom,
 		    u8 *data);
+int asix_set_eeprom(struct net_device *net, struct ethtool_eeprom *eeprom,
+		    u8 *data);
 
 void asix_get_drvinfo(struct net_device *net, struct ethtool_drvinfo *info);
 
diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c
index 0b5b2d3..774d9ce 100644
--- a/drivers/net/usb/asix_common.c
+++ b/drivers/net/usb/asix_common.c
@@ -516,6 +516,87 @@ int asix_get_eeprom(struct net_device *net, struct ethtool_eeprom *eeprom,
 	return 0;
 }
 
+int asix_set_eeprom(struct net_device *net, struct ethtool_eeprom *eeprom,
+		    u8 *data)
+{
+	struct usbnet *dev = netdev_priv(net);
+	u16 *eeprom_buff;
+	int first_word, last_word;
+	int i;
+	int ret;
+
+	netdev_dbg(net, "write EEPROM len %d, offset %d, magic 0x%x\n",
+		   eeprom->len, eeprom->offset, eeprom->magic);
+
+	if (eeprom->len == 0)
+		return -EINVAL;
+
+	if (eeprom->magic != AX_EEPROM_MAGIC)
+		return -EINVAL;
+
+	first_word = eeprom->offset >> 1;
+	last_word = (eeprom->offset + eeprom->len - 1) >> 1;
+
+	eeprom_buff = kmalloc(sizeof(u16) * (last_word - first_word + 1),
+			      GFP_KERNEL);
+	if (!eeprom_buff)
+		return -ENOMEM;
+
+	/* align data to 16 bit boundaries, read the missing data from
+	   the EEPROM */
+	if (eeprom->offset & 1) {
+		ret = asix_read_cmd(dev, AX_CMD_READ_EEPROM, first_word, 0, 2,
+				    &(eeprom_buff[0]));
+		if (ret < 0) {
+			netdev_err(net, "Failed to read EEPROM at offset 0x%02x.\n", first_word);
+			goto free;
+		}
+	}
+
+	if ((eeprom->offset + eeprom->len) & 1) {
+		ret = asix_read_cmd(dev, AX_CMD_READ_EEPROM, last_word, 0, 2,
+				    &(eeprom_buff[last_word - first_word]));
+		if (ret < 0) {
+			netdev_err(net, "Failed to read EEPROM at offset 0x%02x.\n", last_word);
+			goto free;
+		}
+	}
+
+	memcpy((u8 *)eeprom_buff + (eeprom->offset & 1), data, eeprom->len);
+
+	/* write data to EEPROM */
+	ret = asix_write_cmd(dev, AX_CMD_WRITE_ENABLE, 0x0000, 0, 0, NULL);
+	if (ret < 0) {
+		netdev_err(net, "Failed to enable EEPROM write\n");
+		goto free;
+	}
+	msleep(20);
+
+	for (i = first_word; i <= last_word; i++) {
+		netdev_dbg(net, "write to EEPROM at offset 0x%02x, data 0x%04x\n",
+			   i, eeprom_buff[i - first_word]);
+		ret = asix_write_cmd(dev, AX_CMD_WRITE_EEPROM, i,
+				     eeprom_buff[i - first_word], 0, NULL);
+		if (ret < 0) {
+			netdev_err(net, "Failed to write EEPROM at offset 0x%02x.\n",
+				   i);
+			goto free;
+		}
+		msleep(20);
+	}
+
+	ret = asix_write_cmd(dev, AX_CMD_WRITE_DISABLE, 0x0000, 0, 0, NULL);
+	if (ret < 0) {
+		netdev_err(net, "Failed to disable EEPROM write\n");
+		goto free;
+	}
+
+	ret = 0;
+free:
+	kfree(eeprom_buff);
+	return ret;
+}
+
 void asix_get_drvinfo(struct net_device *net, struct ethtool_drvinfo *info)
 {
 	/* Inherit standard device info */
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 658c08f..4fd48df 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -119,6 +119,7 @@ static const struct ethtool_ops ax88172_ethtool_ops = {
 	.set_wol		= asix_set_wol,
 	.get_eeprom_len		= asix_get_eeprom_len,
 	.get_eeprom		= asix_get_eeprom,
+	.set_eeprom		= asix_set_eeprom,
 	.get_settings		= usbnet_get_settings,
 	.set_settings		= usbnet_set_settings,
 	.nway_reset		= usbnet_nway_reset,
@@ -258,6 +259,7 @@ static const struct ethtool_ops ax88772_ethtool_ops = {
 	.set_wol		= asix_set_wol,
 	.get_eeprom_len		= asix_get_eeprom_len,
 	.get_eeprom		= asix_get_eeprom,
+	.set_eeprom		= asix_set_eeprom,
 	.get_settings		= usbnet_get_settings,
 	.set_settings		= usbnet_set_settings,
 	.nway_reset		= usbnet_nway_reset,
@@ -478,6 +480,7 @@ static const struct ethtool_ops ax88178_ethtool_ops = {
 	.set_wol		= asix_set_wol,
 	.get_eeprom_len		= asix_get_eeprom_len,
 	.get_eeprom		= asix_get_eeprom,
+	.set_eeprom		= asix_set_eeprom,
 	.get_settings		= usbnet_get_settings,
 	.set_settings		= usbnet_set_settings,
 	.nway_reset		= usbnet_nway_reset,
diff --git a/drivers/net/usb/ax88172a.c b/drivers/net/usb/ax88172a.c
index 97dce0f..c8e0aa8 100644
--- a/drivers/net/usb/ax88172a.c
+++ b/drivers/net/usb/ax88172a.c
@@ -194,6 +194,7 @@ static const struct ethtool_ops ax88172a_ethtool_ops = {
 	.set_wol		= asix_set_wol,
 	.get_eeprom_len		= asix_get_eeprom_len,
 	.get_eeprom		= asix_get_eeprom,
+	.set_eeprom		= asix_set_eeprom,
 	.get_settings		= ax88172a_get_settings,
 	.set_settings		= ax88172a_set_settings,
 	.nway_reset		= ax88172a_nway_reset,
-- 
1.7.0.4
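
The read-modify-write step that asix_set_eeprom() uses for unaligned edges can be sketched the same way. This is a model under stated assumptions, not driver code: fake_eeprom, fake_write_word() and eeprom_write_bytes() are hypothetical names, the device is a byte array that only accepts whole 2-byte words, and the write enable/disable commands and delays are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FAKE_EEPROM_BYTES 16

/* Hypothetical device contents; writes are only possible per 2-byte word. */
uint8_t fake_eeprom[FAKE_EEPROM_BYTES];

/* Hypothetical stand-in for one word-sized write command. */
void fake_write_word(int word, const uint8_t in[2])
{
	assert(word >= 0 && 2 * word + 1 < FAKE_EEPROM_BYTES);
	memcpy(&fake_eeprom[2 * word], in, 2);
}

/* Write len bytes at byte offset; 0 on success, -1 on error. */
int eeprom_write_bytes(unsigned int offset, unsigned int len,
		       const uint8_t *data)
{
	uint8_t buf[FAKE_EEPROM_BYTES];
	int first_word, last_word, i;

	if (len == 0 || offset + len > FAKE_EEPROM_BYTES)
		return -1;

	first_word = offset >> 1;
	last_word = (offset + len - 1) >> 1;

	/* Odd start: keep the byte that precedes the new data. */
	if (offset & 1)
		memcpy(buf, &fake_eeprom[2 * first_word], 2);
	/* Odd end: keep the byte that follows the new data. */
	if ((offset + len) & 1)
		memcpy(buf + 2 * (last_word - first_word),
		       &fake_eeprom[2 * last_word], 2);

	memcpy(buf + (offset & 1), data, len);

	for (i = first_word; i <= last_word; i++)
		fake_write_word(i, buf + 2 * (i - first_word));
	return 0;
}
```

Only the partial words at the two ends are read back first; every word in between is fully covered by the caller's data, so no extra reads are needed there.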

^ permalink raw reply related

* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Neil Horman @ 2012-07-19 10:28 UTC (permalink / raw)
  To: Li Zefan
  Cc: John Fastabend, Gao feng, eric.dumazet, linux-kernel, netdev,
	davem, Eric Dumazet, Rustad, Mark D
In-Reply-To: <50075C43.7000706@huawei.com>

On Thu, Jul 19, 2012 at 09:00:51AM +0800, Li Zefan wrote:
> >>>>>  static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
> 
> >>>>>  {
> >>>>>  	struct cgroup_netprio_state *cs;
> >>>>> -	int ret;
> >>>>> +	int ret = -EINVAL;
> >>>>>
> >>>>>  	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> >>>>>  	if (!cs)
> >>>>>  		return ERR_PTR(-ENOMEM);
> >>>>>
> >>>>> -	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
> >>>>> -		kfree(cs);
> >>>>> -		return ERR_PTR(-EINVAL);
> >>>>> -	}
> >>>>> +	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
> >>>>> +		goto out;
> >>>>>
> >>>>>  	ret = get_prioidx(&cs->prioidx);
> >>>>> -	if (ret != 0) {
> >>>>> +	if (ret < 0) {
> >>>>>  		pr_warn("No space in priority index array\n");
> >>>>> -		kfree(cs);
> >>>>> -		return ERR_PTR(ret);
> >>>>> +		goto out;
> >>>>> +	}
> >>>>> +
> >>>>> +	ret = update_netdev_tables();
> >>>>> +	if (ret < 0) {
> >>>>> +		put_prioidx(cs->prioidx);
> >>>>> +		goto out;
> >>>>>  	}
> >>>>
> >>>> Gao,
> >>>>
> >>>> This introduces a null ptr dereference when netprio_cgroup is built
> >>>> into the kernel because update_netdev_tables() depends on init_net.
> >>>> However cgrp_create is being called by cgroup_init before
> >>>> do_initcalls() is called and before net_dev_init().
> >>>>
> >>>> .John
> >>>>
> >>> Not sure I follow here John.  Shouldn't init_net be initialized prior to any
> >>> network devices getting registered?  In other words, shouldn't for_each_netdev
> >>> just result in zero iterations through the loop?
> >>> Neil
> >>>
> >>
> >> init_net _is_ initialized prior to any network devices getting
> >> registered but not before cgrp_create called via cgroup_init.
> >>
> >> #define for_each_netdev(net, d)         \
> >>                 list_for_each_entry(d, &(net)->dev_base_head, dev_list)
> >>
> >> but dev_base_head is zeroed at this time. In netdev_init we have,
> >>
> >>         INIT_LIST_HEAD(&net->dev_base_head);
> >>
> >> but we haven't got that far yet because cgroup_init is called
> >> before do_initcalls().
> >>
> > ok, I see that, and it makes sense, but at this point I'm more concerned with
> > cgroups getting initalized twice.  The early_init flag is clear in the
> > cgroup_subsystem for netprio, so we really shouldn't be getting initalized from
> > cgroup_init. We should be getting initalized from the module_init() call that
> 
> > we register
> 
> If the early_init flag is set, a cgroup subsys will be initialized from
> cgroup_early_init(), otherwise cgroup_init().
> 
> If netprio is built as a module, the subsys will be initailized from module_init(),
> otherwise cgroup_init() (in this case cgroup_load_subsys() called in module_init()
> is a no-op).
> 
> So it won't get initialized twice.
> 
> 
Yeah, we already figured that out :).

Still, it's not a sane interface.  If you create a module_init function for a bit
of code, you expect that function to be called before the rest of your code ever
gets executed.  The way cgroup_init works, ss->cgroup_create gets called before
the module_init routine does when the module is built monolithically.  So no, no
double initialization, but definitely some behavior that is going to be prone to
mistakes.
Neil
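
The difference between a zeroed list head and an initialized one, which is what bites for_each_netdev() in the quoted discussion, can be shown with a userspace sketch. The struct below mirrors the shape of the kernel's list_head, but init_list_head() and list_walk_count() are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the kernel's circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list points at itself, as INIT_LIST_HEAD() arranges. */
void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Count the entries a for_each-style walk would visit.
 * Returns -1 where the real walk would dereference a NULL pointer. */
int list_walk_count(const struct list_head *head)
{
	const struct list_head *pos;
	int n = 0;

	assert(head != NULL);
	for (pos = head->next; pos != head; pos = pos->next) {
		if (pos == NULL)
			return -1;
		n++;
	}
	return n;
}
```

A zeroed head has next == NULL, which is not equal to the head itself, so the loop body runs and the real macro would fault; an initialized head terminates immediately with zero iterations.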

^ permalink raw reply

* Re: resurrecting tcphealth
From: Piotr Sawuk @ 2012-07-19 10:37 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel
In-Reply-To: <1648678.899sL4YBmA@cpaasch-mac>

On Mo, 16.07.2012, 17:24, Christoph Paasch wrote:

> You should do jiffies_to_msecs(tp->srtt) >> 3.
>
> The RTT is already exposed by tcp_info anyway... (see tcp_get_info() - where
> you also see the bitshift)

thanks a lot. rtt is output for completeness' sake; it helps in diagnosis.
here is my hopefully final version. it comes with a tcp_info interface too:

diff -rub A/include/linux/tcp.h B/include/linux/tcp.h
--- A/include/linux/tcp.h	2012-07-15 00:40:28.000000000 +0200
+++ B/include/linux/tcp.h	2012-07-18 13:03:50.000000000 +0200
@@ -183,6 +183,12 @@
 	__u32	tcpi_rcv_space;

 	__u32	tcpi_total_retrans;
+
+	/* TCP Health */
+	__u32	tcpi_dup_acks;
+	__u32	tcpi_dup_pkts;
+	__u32	tcpi_acks;
+	__u32	tcpi_pkts;
 };

 /* for TCP_MD5SIG socket option */
@@ -492,6 +498,17 @@
 	 * contains related tcp_cookie_transactions fields.
 	 */
 	struct tcp_cookie_values  *cookie_values;
+
+#ifdef CONFIG_TCPHEALTH
+	/*
+	 * TCP health monitoring counters.
+	 */
+	__u32	dup_acks_sent;
+	__u32	dup_pkts_recv;
+	__u32	acks_sent;
+	__u32	pkts_recv;
+	__u32	last_ack_sent;	/* Sequence number of the last ack we sent. */
+#endif
 };

 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff -rub A/net/ipv4/Kconfig B/net/ipv4/Kconfig
--- A/net/ipv4/Kconfig	2012-07-15 00:40:28.000000000 +0200
+++ B/net/ipv4/Kconfig	2012-07-16 20:47:54.000000000 +0200
@@ -619,6 +619,28 @@
 	default "reno" if DEFAULT_RENO
 	default "cubic"

+config TCPHEALTH
+	bool "TCP client-side health-statistics (/proc/net/tcphealth)"
+	default n
+	---help---
+	TCP Health Monitoring (established connections only):
+	 -Duplicate ACKs indicate there could be lost or reordered packets
+	  on the connection.
+	 -Duplicate Packets Received signal a slow and badly inefficient
+	  connection.
+	 -RttEst estimates how long future packets will take on a round trip
+	  over the connection.
+
+	Additionally you get the total number of sent ACKs and received Packets.
+	All these values are displayed separately for each connection.
+	If you are running a dedicated server you won't need this.
+	Duplicate ACKs refers only to those sent upon receiving a Packet.
+	A server most likely doesn't receive many Packets to count.
+	Hence for a server these statistics won't be meaningful,
+	especially since they are split into individual connections.
+
+	If you plan to investigate why some download is slow, say Y.
+
 config TCP_MD5SIG
 	bool "TCP: MD5 Signature Option support (RFC2385) (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
diff -rub A/net/ipv4/tcp.c B/net/ipv4/tcp.c
--- A/net/ipv4/tcp.c	2012-07-15 00:40:28.000000000 +0200
+++ B/net/ipv4/tcp.c	2012-07-18 13:04:08.000000000 +0200
@@ -2723,6 +2723,13 @@
 	info->tcpi_rcv_space = tp->rcvq_space.space;

 	info->tcpi_total_retrans = tp->total_retrans;
+
+#ifdef CONFIG_TCPHEALTH
+	info->tcpi_dup_acks = tp->dup_acks_sent;
+	info->tcpi_dup_pkts = tp->dup_pkts_recv;
+	info->tcpi_acks = tp->acks_sent;
+	info->tcpi_pkts = tp->pkts_recv;
+#endif
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);

diff -rub A/net/ipv4/tcp_input.c B/net/ipv4/tcp_input.c
--- A/net/ipv4/tcp_input.c	2012-07-15 00:40:28.000000000 +0200
+++ B/net/ipv4/tcp_input.c	2012-07-16 20:47:54.000000000 +0200
@@ -4492,6 +4492,11 @@
 		}

 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
+#ifdef CONFIG_TCPHEALTH
+			/* Coarse-grained timeout caused retransmit inefficiency:
+			 * this packet has been received twice. */
+			tp->dup_pkts_recv++;
+#endif
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
 			__skb_unlink(skb, &tp->out_of_order_queue);
 			__kfree_skb(skb);
@@ -4824,6 +4829,12 @@
 		return;
 	}

+#ifdef CONFIG_TCPHEALTH
+	/* A packet is a "duplicate" if it contains bytes we have already received. */
+	if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
+		tp->dup_pkts_recv++;
+#endif
+
 	if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 		/* A retransmit, 2nd most common case.  Force an immediate ack. */
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
@@ -5535,6 +5546,12 @@

 	tp->rx_opt.saw_tstamp = 0;

+#ifdef CONFIG_TCPHEALTH
+	/*
+	 *	total per-connection packet arrivals.
+	 */
+	tp->pkts_recv++;
+#endif
 	/*	pred_flags is 0xS?10 << 16 + snd_wnd
 	 *	if header_prediction is to be made
 	 *	'S' will always be tp->tcp_header_len >> 2
diff -rub A/net/ipv4/tcp_ipv4.c B/net/ipv4/tcp_ipv4.c
--- A/net/ipv4/tcp_ipv4.c	2012-07-15 00:40:28.000000000 +0200
+++ B/net/ipv4/tcp_ipv4.c	2012-07-18 11:56:32.000000000 +0200
@@ -2500,6 +2500,68 @@
 	return 0;
 }

+#ifdef CONFIG_TCPHEALTH
+/*
+ *	Output /proc/net/tcphealth
+ */
+#define LINESZ 128
+
+int tcp_health_seq_show(struct seq_file *seq, void *v)
+{
+	int len, tab;
+	struct tcp_iter_state *st;
+
+	if (v == SEQ_START_TOKEN) {
+		seq_printf(seq,
+		"id   Local Address        Remote Address       RttEst(ms) AcksSent "
+		"DupAcksSent PktsRecv DupPktsRecv\n");
+		goto out;
+	}
+
+	/* Loop through established TCP connections */
+	st = seq->private;
+
+
+	if (st->state == TCP_SEQ_STATE_ESTABLISHED)
+	{
+		const struct tcp_sock *tp = tcp_sk(v);
+		const struct inet_sock *inet = inet_sk(v);
+
+		seq_printf(seq, "%d: %pI4:%u%n",
+				st->num,
+				&inet->inet_rcv_saddr,
+				ntohs(inet->inet_sport),
+				&tab
+			);
+		len = 3 + 21; /* 3 is minimum length for "%d: " */
+		if (tab < len) seq_printf(seq, "%*s", len - tab, "");
+		else len = tab;
+		seq_printf(seq, " %pI4:%u%n",
+				&inet->inet_daddr,
+				ntohs(inet->inet_dport),
+				&tab
+			);
+		tab += len;
+		len = 5 + 21 + 22; /* is num ever > 999? */
+		if (tab < len)  seq_printf(seq, "%*s", len - tab, "");
+		else len = tab;
+		seq_printf(seq, " %8u %8lu %8lu %8lu %8lu%n",
+				jiffies_to_msecs(tp->srtt)>>3,
+				(unsigned long)tp->acks_sent,
+				(unsigned long)tp->dup_acks_sent,
+				(unsigned long)tp->pkts_recv,
+				(unsigned long)tp->dup_pkts_recv,
+				&tab
+			);
+
+		seq_printf(seq, "%*s\n", LINESZ - 1 - len - tab, "");
+	}
+
+out:
+	return 0;
+}
+#endif /* CONFIG_TCPHEALTH */
+
 static const struct file_operations tcp_afinfo_seq_fops = {
 	.owner   = THIS_MODULE,
 	.open    = tcp_seq_open,
@@ -2508,6 +2570,17 @@
 	.release = seq_release_net
 };

+#ifdef CONFIG_TCPHEALTH
+static struct tcp_seq_afinfo tcphealth_seq_afinfo = {
+	.name		= "tcphealth",
+	.family		= AF_INET,
+	.seq_fops	= &tcp_afinfo_seq_fops,
+	.seq_ops	= {
+		.show		= tcp_health_seq_show,
+	},
+};
+#endif
+
 static struct tcp_seq_afinfo tcp4_seq_afinfo = {
 	.name		= "tcp",
 	.family		= AF_INET,
@@ -2519,12 +2592,20 @@

 static int __net_init tcp4_proc_init_net(struct net *net)
 {
-	return tcp_proc_register(net, &tcp4_seq_afinfo);
+	int ret = tcp_proc_register(net, &tcp4_seq_afinfo);
+#ifdef CONFIG_TCPHEALTH
+	if(ret == 0)
+		ret = tcp_proc_register(net, &tcphealth_seq_afinfo);
+#endif
+	return ret;
 }

 static void __net_exit tcp4_proc_exit_net(struct net *net)
 {
 	tcp_proc_unregister(net, &tcp4_seq_afinfo);
+#ifdef CONFIG_TCPHEALTH
+	tcp_proc_unregister(net, &tcphealth_seq_afinfo);
+#endif
 }

 static struct pernet_operations tcp4_net_ops = {
diff -rub A/net/ipv4/tcp_output.c B/net/ipv4/tcp_output.c
--- A/net/ipv4/tcp_output.c	2012-07-15 00:40:28.000000000 +0200
+++ B/net/ipv4/tcp_output.c	2012-07-16 20:47:54.000000000 +0200
@@ -2772,8 +2772,19 @@
 	skb_reserve(buff, MAX_TCP_HEADER);
 	tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);

+#ifdef CONFIG_TCPHEALTH
+	/* If the rcv_nxt has not advanced since sending our last ACK, this is a duplicate. */
+	if (tcp_sk(sk)->rcv_nxt == tcp_sk(sk)->last_ack_sent)
+		tcp_sk(sk)->dup_acks_sent++;
+	/* Record the total number of acks sent on this connection. */
+	tcp_sk(sk)->acks_sent++;
+#endif
+
 	/* Send it off, this clears delayed acks for us. */
 	TCP_SKB_CB(buff)->when = tcp_time_stamp;
+#ifdef CONFIG_TCPHEALTH
+	tcp_sk(sk)->last_ack_sent = tcp_sk(sk)->rcv_nxt;
+#endif
 	tcp_transmit_skb(sk, buff, 0, GFP_ATOMIC);
 }

^ permalink raw reply

* Re: [PATCH v2] sctp: Implement quick failover draft from tsvwg
From: Neil Horman @ 2012-07-19 10:45 UTC (permalink / raw)
  To: Joe Perches
  Cc: netdev, Vlad Yasevich, Sridhar Samudrala, David S. Miller,
	linux-sctp
In-Reply-To: <1342643458.2013.32.camel@joe2Laptop>

On Wed, Jul 18, 2012 at 01:30:58PM -0700, Joe Perches wrote:
> On Wed, 2012-07-18 at 14:01 -0400, Neil Horman wrote:
> > I've seen several attempts recently made to do quick failover of sctp transports
> > by reducing various retransmit timers and counters.  While its possible to
> > implement a faster failover on multihomed sctp associations, its not
> > particularly robust, in that it can lead to unneeded retransmits, as well as
> > false connection failures due to intermittent latency on a network.
> 
> trivia:
> 
> > diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> 
> > @@ -871,6 +885,10 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
> >  		spc_state = SCTP_ADDR_UNREACHABLE;
> >  		break;
> >  
> > +	case SCTP_TRANSPORT_PF:
> > +		transport->state = SCTP_PF;
> > +		ulp_notify = false;
> > +		break;
> 
> nicer to add a newline here
> 
Ack, I'll fix that.

> >  	default:
> >  		return;
> >  	}
> > @@ -878,12 +896,15 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
> []
> > +	if (ulp_notify) {
> > +		memset(&addr, 0, sizeof(struct sockaddr_storage));
> > +		memcpy(&addr, &transport->ipaddr,
> > +		       transport->af_specific->sockaddr_len);
> 
> Perhaps it's better to do the memcpy then the memset of the
> space left instead.
> 
> 		memcpy(&addr, &transport->ipaddr, transport->af_specific->sockaddr_len);
> 		memset((char *)&addr) + transport->af_specific->sockaddr_len, 0,
> 		       sizeof(struct sockaddr_storage) - transport->af_specific->sockaddr_len);
> 		       
> 
hmm, not sure about that. It works either way for me, but I've not changed that
code, just the condition under which it was executed.  I'd rather save cleanups
like that for a separate patch if you don't mind.
Neil
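
For reference, the copy-then-zero-the-tail ordering suggested in the quoted
review can be sketched in plain userspace C. This is only an illustration:
copy_addr_zero_tail() is a hypothetical helper, not something in the SCTP
patch, and the length clamp is an extra precaution added for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper, not part of the SCTP patch: copy src_len bytes of an
 * address into dst, then zero only the bytes that the copy did not cover.
 */
static void copy_addr_zero_tail(struct sockaddr_storage *dst,
				const void *src, size_t src_len)
{
	if (src_len > sizeof(*dst))
		src_len = sizeof(*dst);	/* never write past the destination */

	memcpy(dst, src, src_len);
	memset((char *)dst + src_len, 0, sizeof(*dst) - src_len);
}
```

Both orderings leave the same bytes in the buffer; this one just avoids
writing the leading bytes twice.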

> 
> 

^ permalink raw reply

* Re: [PATCH v2] sctp: Implement quick failover draft from tsvwg
From: Neil Horman @ 2012-07-19 10:46 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, Sridhar Samudrala, David S. Miller, linux-sctp
In-Reply-To: <50072939.5030600@gmail.com>

On Wed, Jul 18, 2012 at 05:23:05PM -0400, Vlad Yasevich wrote:
> On 07/18/2012 02:01 PM, Neil Horman wrote:
> >I've seen several attempts recently made to do quick failover of sctp transports
> >by reducing various retransmit timers and counters.  While it's possible to
> >implement a faster failover on multihomed sctp associations, it's not
> >particularly robust, in that it can lead to unneeded retransmits, as well as
> >false connection failures due to intermittent latency on a network.
> >
> >Instead, let's implement the new ietf quick failover draft found here:
> >http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
> >
> >This will let the sctp stack identify transports that have had a small number of
> >errors, and avoid using them quickly until their reliability can be
> >re-established.  I've tested this out on two virt guests connected via multiple
> >isolated virt networks and believe it's in compliance with the above draft and
> >works well.
> >
> >Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >CC: Vlad Yasevich <vyasevich@gmail.com>
> >CC: Sridhar Samudrala <sri@us.ibm.com>
> >CC: "David S. Miller" <davem@davemloft.net>
> >CC: linux-sctp@vger.kernel.org
> >
> >---
> >Change notes:
> >
> >V2)
> >- Added socket option API from section 6.1 of the specification, as per
> >request from Vlad. Adding this socket option allows us to alter both the path
> >maximum retransmit value and the path partial failure threshold for each
> >transport and the association as a whole.
> >
> >- Added a per transport pf_retrans value, and initialized it from the
> >association value.  This makes each transport independently configurable as per
> >the socket option above, and prevents changes in the sysctl from bleeding into
> >an already created association.
> >---
> >  Documentation/networking/ip-sysctl.txt |   14 +++++
> >  include/net/sctp/constants.h           |    1 +
> >  include/net/sctp/structs.h             |   11 +++-
> >  include/net/sctp/user.h                |   11 ++++
> >  net/sctp/associola.c                   |   36 ++++++++++--
> >  net/sctp/outqueue.c                    |    6 +-
> >  net/sctp/sm_sideeffect.c               |   33 ++++++++++-
> >  net/sctp/socket.c                      |   96 ++++++++++++++++++++++++++++++++
> >  net/sctp/sysctl.c                      |    9 +++
> >  net/sctp/transport.c                   |    4 +-
> >  10 files changed, 206 insertions(+), 15 deletions(-)
> >
> 
> [ snip ]
> 
> >diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> >index b3b8a8d..dfffece 100644
> >--- a/net/sctp/socket.c
> >+++ b/net/sctp/socket.c
> >@@ -3470,6 +3470,52 @@ static int sctp_setsockopt_auto_asconf(struct sock *sk, char __user *optval,
> >  }
> >
> >
> >+/*
> >+ * SCTP_PEER_ADDR_THLDS
> >+ *
> >+ * This option allows us to alter the partially failed threshold for one or all
> >+ * transports in an association.  See Section 6.1 of:
> >+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
> >+ */
> >+static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
> >+					    char __user *optval,
> >+					    unsigned int optlen)
> >+{
> >+	struct sctp_paddrthlds val;
> >+	struct sctp_transport *trans;
> >+	struct sctp_association *asoc;
> >+
> >+	if (optlen < sizeof(struct sctp_paddrthlds))
> >+		return -EINVAL;
> >+	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval,
> >+			   optlen))
> >+		return -EFAULT;
> 
> What if optlen is bigger?  You're going to trash the stack.
> 
> >+
> >+	if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
> >+		asoc = sctp_id2assoc(sk, val.spt_assoc_id);
> >+		if (!asoc)
> >+			return -ENOENT;
> >+		list_for_each_entry(trans, &asoc->peer.transport_addr_list,
> >+				    transports) {
> >+			trans->pathmaxrxt = val.spt_pathmaxrxt;
> >+			trans->pf_retrans = val.spt_pathpfthld;
> 
> You want to make sure that the values aren't 0.  Otherwise, you'll
> set the pathmaxrxt to 0 and that would be bad.
> 
> >+		}
> >+
> >+		asoc->pf_retrans = val.spt_pathpfthld;
> >+		asoc->pathmaxrxt = val.spt_pathmaxrxt;
> 
> Ditto.
> 
> >+	} else {
> >+		trans = sctp_addr_id2transport(sk, &val.spt_address,
> >+					       val.spt_assoc_id);
> >+		if (!trans)
> >+			return -ENOENT;
> >+
> >+		trans->pathmaxrxt = val.spt_pathmaxrxt;
> >+		trans->pf_retrans = val.spt_pathpfthld;
> 
> Ditto.
> 
> >+	}
> >+
> >+	return 0;
> >+}
> >+
> >  /* API 6.2 setsockopt(), getsockopt()
> >   *
> >   * Applications use setsockopt() and getsockopt() to set or retrieve
> >@@ -3619,6 +3665,9 @@ SCTP_STATIC int sctp_setsockopt(struct sock *sk, int level, int optname,
> >  	case SCTP_AUTO_ASCONF:
> >  		retval = sctp_setsockopt_auto_asconf(sk, optval, optlen);
> >  		break;
> >+	case SCTP_PEER_ADDR_THLDS:
> >+		retval = sctp_setsockopt_paddr_thresholds(sk, optval, optlen);
> >+		break;
> >  	default:
> >  		retval = -ENOPROTOOPT;
> >  		break;
> >@@ -5490,6 +5539,50 @@ static int sctp_getsockopt_assoc_ids(struct sock *sk, int len,
> >  	return 0;
> >  }
> >
> >+/*
> >+ * SCTP_PEER_ADDR_THLDS
> >+ *
> >+ * This option allows us to fetch the partially failed threshold for one or all
> >+ * transports in an association.  See Section 6.1 of:
> >+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
> >+ */
> >+static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
> >+					    char __user *optval,
> >+					    int optlen)
> >+{
> >+	struct sctp_paddrthlds val;
> >+	struct sctp_transport *trans;
> >+	struct sctp_association *asoc;
> >+
> >+	if (optlen < sizeof(struct sctp_paddrthlds))
> >+		return -EINVAL;
> >+	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval, optlen))
> >+		return -EFAULT;
> 
> Again, trashing the stack if optlen and optval are bigger.
> 
> -vlad


Ack, I'll fix these up and repost.  Thanks!
Neil
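
The two review points above (bound the copy by the structure size rather than
by optlen, and do not accept a zero pathmaxrxt) could look roughly like the
userspace sketch below. Everything named demo_* is hypothetical: the struct
models only the two threshold fields, memcpy() stands in for copy_from_user(),
and rejecting zero is just one possible policy (leaving the old value
untouched would be another).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct sctp_paddrthlds; only the two threshold
 * fields discussed in the review are modelled.
 */
struct demo_paddrthlds {
	uint16_t spt_pathmaxrxt;
	uint16_t spt_pathpfthld;
};

/* Copy exactly sizeof(*val) bytes, however large the caller claims the
 * buffer is, and refuse a zero pathmaxrxt.  memcpy() stands in for
 * copy_from_user(), which a userspace sketch cannot call.
 */
static int demo_parse_thresholds(struct demo_paddrthlds *val,
				 const void *optval, size_t optlen)
{
	if (optlen < sizeof(*val))
		return -EINVAL;

	/* bounded by the structure size, not by optlen */
	memcpy(val, optval, sizeof(*val));

	if (val->spt_pathmaxrxt == 0)
		return -EINVAL;

	return 0;
}
```

With this shape an oversized optlen is harmless, because the extra bytes are
never copied onto the stack.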

^ permalink raw reply

* Re: calling request_firmware() from module init will not work with recent/future udev versions
From: Kay Sievers @ 2012-07-19 10:46 UTC (permalink / raw)
  To: Johannes Berg; +Cc: netdev, linux-wireless, Tom Gundersen, Andy Whitcroft
In-Reply-To: <1326716209.3510.7.camel@jlt3.sipsolutions.net>

On Mon, Jan 16, 2012 at 1:16 PM, Johannes Berg
<johannes@sipsolutions.net> wrote:
> On Mon, 2012-01-16 at 13:05 +0100, Kay Sievers wrote:
>
>> > What I was asking then is this: Can udev know that it is running from
>> > initramfs (presumably that can't be too hard) and simply not reply to
>> > async requests it doesn't have firmware for? Then once the real root is
>> > mounted it could satisfy (or not) firmware requests from the real root.
>>
>> We can surely change it to not cancel the firmware request.
>>
>> Either by making it aware that we run from initramfs, or by never
>> cancelling any firmware request and just leave it hanging around for
>> forever?

Never say 6 months is a long time to reply. :)

Fedora uses systemd in the initramfs now, which made it trivial to
implement this, and to leave the firmware requests hanging around
until we reach the real rootfs and know if the firmware file is
available:
  http://cgit.freedesktop.org/systemd/systemd/commit/?id=39177382a4f92a834b568d6ae5d750eb2a5a86f9

The logic to tell udev that it runs in the initramfs could easily be
implemented by other initramfs tools than dracut, but they usually do
not really follow what we do here, so this might for now only work on
recent systems using dracut.

Cheers,
Kay

^ permalink raw reply

* Re: [PATCH 09/15] ipv4: Cache output routes in fib_info nexthops.
From: Steffen Klassert @ 2012-07-19 11:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120718.112404.1910372180742347127.davem@davemloft.net>

On Wed, Jul 18, 2012 at 11:24:04AM -0700, David Miller wrote:
> +
> +static void rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe)
> +{
> +	if (fnhe->fnhe_pmtu) {
> +		unsigned long expires = fnhe->fnhe_expires;
> +		unsigned long diff = jiffies - expires;

This should be diff = expires - jiffies

With that changed, everything seems to work fine :)
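
A small userspace sketch of why the operand order matters with an unsigned
tick counter. The demo_* helpers are hypothetical stand-ins (demo_time_before()
imitates the kernel's time_before()); the assumption here is that the value
handed to dst_set_expires() is meant to be the time remaining.

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-in for the kernel's time_before(a, b): true when a is
 * earlier than b, even across counter wrap-around.
 */
static int demo_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Ticks left until 'expires', or 0 if it has already passed. */
static unsigned long demo_ticks_remaining(unsigned long now,
					  unsigned long expires)
{
	if (!demo_time_before(now, expires))
		return 0;

	return expires - now;	/* not now - expires, which would wrap */
}
```

With now = 1000 and expires = 1600, expires - now gives 600 ticks, while
now - expires wraps to a value just below ULONG_MAX.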

> +
> +		if (time_before(jiffies, expires)) {
> +			rt->rt_pmtu = fnhe->fnhe_pmtu;
> +			dst_set_expires(&rt->dst, diff);
>  		}
>  	}
> +	if (fnhe->fnhe_gw) {
> +		rt->rt_flags |= RTCF_REDIRECTED;
> +		rt->rt_gateway = fnhe->fnhe_gw;
> +	}
> +	fnhe->fnhe_stamp = jiffies;
>  }
>  

^ permalink raw reply

* [PATCH net-next] asix: AX88172A driver depends on phylib
From: Christian Riesch @ 2012-07-19 12:02 UTC (permalink / raw)
  To: netdev; +Cc: Fengguang Wu, Christian Riesch, kernel-janitors

Since commit 16626b0cc3d5afe250850f96759b241f8a403b52 the asix
driver depends on the phylib. Select phylib when the asix driver is
selected.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: kernel-janitors@vger.kernel.org
Signed-off-by: Christian Riesch <christian.riesch@omicron.at>
---
 drivers/net/usb/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index 833e32f..c1ae769 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -134,6 +134,7 @@ config USB_NET_AX8817X
 	tristate "ASIX AX88xxx Based USB 2.0 Ethernet Adapters"
 	depends on USB_USBNET
 	select CRC32
+	select PHYLIB
 	default y
 	help
 	  This option adds support for ASIX AX88xxx based USB 2.0
-- 
1.7.0.4

^ permalink raw reply related

* how routing table maintained
From: BALAKUMARAN KANNAN @ 2012-07-19 12:41 UTC (permalink / raw)
  To: netdev@vger.kernel.org

I want to know how the routing table and the routing cache are maintained. I can find some entries in the cache even after the route is deleted from the routing table. Is it possible to delete an entry (either that particular entry or the entire cache) from the cache once it is deleted from the main routing table?

Thank you

--Regards
K.Balakumaran

^ permalink raw reply

* Re: [PATCH net-next V1 7/9] net/eipoib: Add main driver functionality
From: Ben Hutchings @ 2012-07-19 13:49 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit
In-Reply-To: <1342609202-32427-8-git-send-email-ogerlitz@mellanox.com>

On Wed, 2012-07-18 at 14:00 +0300, Or Gerlitz wrote:
[...]
> +/* ----------------------------- VLAN funcs ---------------------------------- */
> +static int eth_ipoib_vlan_rx_add_vid(struct net_device *dev,
> +				     unsigned short vid)
> +{
> +	return 0;
> +}
> +
> +static int eth_ipoib_vlan_rx_kill_vid(struct net_device *dev,
> +				      unsigned short vid)
> +{
> +	return 0;
> +}
[...]
> +/* -------------------------- Device entry points --------------------------- */
> +static struct net_device_stats *parent_get_stats(struct net_device *parent_dev)
> +{
[...]
> +static const struct net_device_ops parent_netdev_ops = {
> +	.ndo_init		= parent_init,
> +	.ndo_uninit		= parent_uninit,
> +	.ndo_open		= parent_open,
> +	.ndo_stop		= parent_close,
> +	.ndo_start_xmit		= parent_tx,
> +	.ndo_select_queue	= parent_select_q,
> +	/* parent mtu is min(slaves_mtus) */
> +	.ndo_change_mtu		= NULL,
> +	.ndo_fix_features	= parent_fix_features,
> +	/*
> +	 * initial mac address is randomized, can be changed
> +	 * thru this func later
> +	 */
> +	.ndo_set_mac_address = eth_mac_addr,
> +	.ndo_get_stats = parent_get_stats,

Why not implement ndo_get_stats64?  I don't think there's any good
reason for a new driver not to.

> +	.ndo_vlan_rx_add_vid = eth_ipoib_vlan_rx_add_vid,
> +	.ndo_vlan_rx_kill_vid = eth_ipoib_vlan_rx_kill_vid,

These shouldn't be needed.

[...]
> +/* netdev events handlers */
> +static inline int is_ipoib_pif_intf(struct net_device *dev)
> +{
> +	if (ARPHRD_INFINIBAND == dev->type && dev->priv_flags & IFF_EIPOIB_PIF)
> +			return 1;
[...]

Wrong indentation.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net-next V1 2/4] net/rps: Protect cpu_rmap.h from double inclusion
From: Ben Hutchings @ 2012-07-19 13:53 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, oren, yevgenyp, Amir Vadai
In-Reply-To: <1342686832-21406-3-git-send-email-ogerlitz@mellanox.com>

On Thu, 2012-07-19 at 11:33 +0300, Or Gerlitz wrote:
> From: Amir Vadai <amirv@mellanox.com>
> 
> Signed-off-by: Amir Vadai <amirv@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  include/linux/cpu_rmap.h |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
> index 473771a..ac3bbb5 100644
> --- a/include/linux/cpu_rmap.h
> +++ b/include/linux/cpu_rmap.h
> @@ -1,3 +1,6 @@
> +#ifndef __LINUX_CPU_RMAP_H
> +#define __LINUX_CPU_RMAP_H
> +
>  /*
>   * cpu_rmap.c: CPU affinity reverse-map support
>   * Copyright 2011 Solarflare Communications Inc.
> @@ -71,3 +74,4 @@ extern void free_irq_cpu_rmap(struct cpu_rmap *rmap);
>  extern int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq);
>  
>  #endif
> +#endif /* __LINUX_CPU_RMAP_H */

Oops :-/

Acked-by: Ben Hutchings <bhutchings@solarflare.com>

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net-next V1 4/4] net/mlx4_en: Add accelerated RFS support
From: Ben Hutchings @ 2012-07-19 14:09 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, oren, yevgenyp, Amir Vadai
In-Reply-To: <1342686832-21406-5-git-send-email-ogerlitz@mellanox.com>

On Thu, 2012-07-19 at 11:33 +0300, Or Gerlitz wrote:
> From: Amir Vadai <amirv@mellanox.com>
> 
> Use RFS infrastructure and flow steering in HW to keep CPU
> affinity of rx interrupts and application per TCP stream.
> 
> A flow steering filter is added to the HW whenever the RFS
> ndo callback is invoked by core networking code.
> 
> Because the invocation takes place in interrupt context, the
> actual setup of HW is done using workqueue. Whenever new filter
> is added, the driver checks for expiry of existing filters.
> 
> Since there's window in time between the point where the core
> RFS code invoked the ndo callback, to the point where the HW
> is configured from the workqueue context, the 2nd, 3rd etc
> packets from that stream will cause the net core to invoke
> the callback again and again.

Yes, and this is true even when the filter can be reprogrammed
synchronously as there may be more packets for the redirected flow
already in the RX queue.  This is a result of fixing the bug you pointed
out earlier (commit 09994d1b09bd9b0046a4708fa50d2106610a4058).  We
should try to find a way of noting pending redirects without
reintroducing that bug.

> To prevent inefficient/double configuration of the HW, the filters
> are kept in a database which is indexed using hash function to enable
> fast access.
> 
> Signed-off-by: Amir Vadai <amirv@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_cq.c     |    8 +-
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  316 ++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     |    3 +
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   16 ++
>  4 files changed, 342 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
> index 0ef6156..aa9c2f6 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
> @@ -77,6 +77,12 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
>  	struct mlx4_en_dev *mdev = priv->mdev;
>  	int err = 0;
>  	char name[25];
> +	struct cpu_rmap *rmap =
> +#ifdef CONFIG_RFS_ACCEL
> +		priv->dev->rx_cpu_rmap;
> +#else
> +		NULL;
> +#endif

You can write this slightly more cleanly using IS_ENABLED().

[...]
> +static struct mlx4_en_filter *
> +mlx4_en_filter_alloc(struct mlx4_en_priv *priv, int rxq_index, __be32 src_ip,
> +                    __be32 dst_ip, __be16 src_port, __be16 dst_port,
> +                    u32 flow_id)
> +{
[...]
> +       filter->id = priv->last_filter_id++;
[...]

You need to limit the filter IDs to be < RPS_NO_FILTER.
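
One way to keep the IDs below the sentinel is to wrap the counter, as in the
sketch below. It is only an illustration: the demo_* names are hypothetical,
and DEMO_RPS_NO_FILTER assumes the kernel's RPS_NO_FILTER value of 0xffff.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors the kernel's RPS_NO_FILTER sentinel (0xffff). */
#define DEMO_RPS_NO_FILTER 0xffffu

/* Hypothetical allocator state; the quoted patch keeps a similar counter
 * in priv->last_filter_id.
 */
struct demo_filter_ids {
	uint32_t last_filter_id;
};

/* Hand out the next ID, wrapping so that the result is always below the
 * sentinel value.
 */
static uint16_t demo_next_filter_id(struct demo_filter_ids *ids)
{
	return (uint16_t)(ids->last_filter_id++ % DEMO_RPS_NO_FILTER);
}
```

After a wrap an ID may still belong to a live filter, so a real driver would
presumably also have to cope with reuse.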

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: Richard Cochran @ 2012-07-19 14:25 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, linux-net-drivers, Andrew Jackson
In-Reply-To: <1342635693.2617.56.camel@bwh-desktop.uk.solarflarecom.com>

On Wed, Jul 18, 2012 at 07:21:33PM +0100, Ben Hutchings wrote:
> Add PTP IEEE-1588 support and make it accessible via the PHC subsystem.
> 
> This work is based on prior code by Andrew Jackson
> 
> Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
>  drivers/net/ethernet/sfc/Kconfig      |    7 +
>  drivers/net/ethernet/sfc/Makefile     |    1 +
>  drivers/net/ethernet/sfc/efx.c        |    3 +
>  drivers/net/ethernet/sfc/ethtool.c    |    1 +
>  drivers/net/ethernet/sfc/mcdi_pcol.h  |    1 +
>  drivers/net/ethernet/sfc/net_driver.h |   19 +-
>  drivers/net/ethernet/sfc/nic.h        |   31 +
>  drivers/net/ethernet/sfc/ptp.c        | 1519 +++++++++++++++++++++++++++++++++
>  drivers/net/ethernet/sfc/rx.c         |    2 +-
>  drivers/net/ethernet/sfc/siena.c      |    1 +
>  drivers/net/ethernet/sfc/tx.c         |    6 +
>  11 files changed, 1589 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/net/ethernet/sfc/ptp.c
> 
> diff --git a/drivers/net/ethernet/sfc/Kconfig b/drivers/net/ethernet/sfc/Kconfig
> index fb3cbc2..78c5d435 100644
> --- a/drivers/net/ethernet/sfc/Kconfig
> +++ b/drivers/net/ethernet/sfc/Kconfig
> @@ -34,3 +34,10 @@ config SFC_SRIOV
>  	  This enables support for the SFC9000 I/O Virtualization
>  	  features, allowing accelerated network performance in
>  	  virtualized environments.
> +config SFC_PTP
> +	bool "Solarflare SFC9000-family PTP support"
> +	depends on SFC && PTP_1588_CLOCK
> +	default y
> +	---help---
> +	  This enables support for the Precision Time Protocol (PTP)
> +	  on SFC9000-family NICs
> diff --git a/drivers/net/ethernet/sfc/Makefile b/drivers/net/ethernet/sfc/Makefile
> index ea1f8db..e11f2ec 100644
> --- a/drivers/net/ethernet/sfc/Makefile
> +++ b/drivers/net/ethernet/sfc/Makefile
> @@ -5,5 +5,6 @@ sfc-y			+= efx.o nic.o falcon.o siena.o tx.o rx.o filter.o \
>  			   mcdi.o mcdi_phy.o mcdi_mon.o
>  sfc-$(CONFIG_SFC_MTD)	+= mtd.o
>  sfc-$(CONFIG_SFC_SRIOV)	+= siena_sriov.o
> +sfc-$(CONFIG_SFC_PTP)	+= ptp.o
>  
>  obj-$(CONFIG_SFC)	+= sfc.o
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index 1c53d4b..e6631f0e 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -1748,6 +1748,9 @@ static int efx_ioctl(struct net_device *net_dev, struct ifreq *ifr, int cmd)
>  
>  	EFX_ASSERT_RESET_SERIALISED(efx);
>  
> +	if (cmd == SIOCSHWTSTAMP)
> +		return efx_ptp_ioctl(efx, ifr, cmd);
> +
>  	/* Convert phy_id from older PRTAD/DEVAD format */
>  	if ((cmd == SIOCGMIIREG || cmd == SIOCSMIIREG) &&
>  	    (data->phy_id & 0xfc00) == 0x0400)
> diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
> index 10536f9..50cdd39 100644
> --- a/drivers/net/ethernet/sfc/ethtool.c
> +++ b/drivers/net/ethernet/sfc/ethtool.c
> @@ -1170,6 +1170,7 @@ const struct ethtool_ops efx_ethtool_ops = {
>  	.get_rxfh_indir_size	= efx_ethtool_get_rxfh_indir_size,
>  	.get_rxfh_indir		= efx_ethtool_get_rxfh_indir,
>  	.set_rxfh_indir		= efx_ethtool_set_rxfh_indir,
> +	.get_ts_info		= efx_ptp_get_ts_info,
>  	.get_module_info	= efx_ethtool_get_module_info,
>  	.get_module_eeprom	= efx_ethtool_get_module_eeprom,
>  };
> diff --git a/drivers/net/ethernet/sfc/mcdi_pcol.h b/drivers/net/ethernet/sfc/mcdi_pcol.h
> index 0310b9f0..0017f98 100644
> --- a/drivers/net/ethernet/sfc/mcdi_pcol.h
> +++ b/drivers/net/ethernet/sfc/mcdi_pcol.h
> @@ -290,6 +290,7 @@
>  #define          MCDI_EVENT_CODE_TX_FLUSH  0xc /* enum */
>  #define          MCDI_EVENT_CODE_PTP_RX  0xd /* enum */
>  #define          MCDI_EVENT_CODE_PTP_FAULT  0xe /* enum */
> +#define          MCDI_EVENT_CODE_PTP_PPS  0xf /* enum */
>  #define       MCDI_EVENT_CMDDONE_DATA_OFST 0
>  #define       MCDI_EVENT_CMDDONE_DATA_LBN 0
>  #define       MCDI_EVENT_CMDDONE_DATA_WIDTH 32
> diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
> index 9913e32..f84a5d5 100644
> --- a/drivers/net/ethernet/sfc/net_driver.h
> +++ b/drivers/net/ethernet/sfc/net_driver.h
> @@ -56,7 +56,8 @@
>  #define EFX_MAX_CHANNELS 32U
>  #define EFX_MAX_RX_QUEUES EFX_MAX_CHANNELS
>  #define EFX_EXTRA_CHANNEL_IOV	0
> -#define EFX_MAX_EXTRA_CHANNELS	1U
> +#define EFX_EXTRA_CHANNEL_PTP	1
> +#define EFX_MAX_EXTRA_CHANNELS	2U
>  
>  /* Checksum generation is a per-queue option in hardware, so each
>   * queue visible to the networking core is backed by two hardware TX
> @@ -68,6 +69,9 @@
>  #define EFX_TXQ_TYPES		4
>  #define EFX_MAX_TX_QUEUES	(EFX_TXQ_TYPES * EFX_MAX_CHANNELS)
>  
> +/* Forward declare Precision Time Protocol (PTP) support structure. */
> +struct efx_ptp_data;
> +
>  struct efx_self_tests;
>  
>  /**
> @@ -736,6 +740,7 @@ struct vfdi_status;
>   *	%local_addr_list. Protected by %local_lock.
>   * @local_lock: Mutex protecting %local_addr_list and %local_page_list.
>   * @peer_work: Work item to broadcast peer addresses to VMs.
> + * @ptp_data: PTP state data
>   * @monitor_work: Hardware monitor workitem
>   * @biu_lock: BIU (bus interface unit) lock
>   * @last_irq_cpu: Last CPU to handle a possible test interrupt.  This
> @@ -860,6 +865,10 @@ struct efx_nic {
>  	struct work_struct peer_work;
>  #endif
>  
> +#ifdef CONFIG_SFC_PTP
> +	struct efx_ptp_data *ptp_data;
> +#endif
> +
>  	/* The following fields may be written more often */
>  
>  	struct delayed_work monitor_work ____cacheline_aligned_in_smp;
> @@ -1122,5 +1131,13 @@ static inline void clear_bit_le(unsigned nr, unsigned char *addr)
>  #define EFX_MAX_FRAME_LEN(mtu) \
>  	((((mtu) + ETH_HLEN + VLAN_HLEN + 4/* FCS */ + 7) & ~7) + 16)
>  
> +static inline bool efx_xmit_with_hwtstamp(struct sk_buff *skb)
> +{
> +	return skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP;
> +}
> +static inline void efx_xmit_hwtstamp_pending(struct sk_buff *skb)
> +{
> +	skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +}
>  
>  #endif /* EFX_NET_DRIVER_H */
> diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h
> index bab5cd9..cd53ab5 100644
> --- a/drivers/net/ethernet/sfc/nic.h
> +++ b/drivers/net/ethernet/sfc/nic.h
> @@ -250,6 +250,37 @@ extern int efx_sriov_get_vf_config(struct net_device *dev, int vf,
>  extern int efx_sriov_set_vf_spoofchk(struct net_device *net_dev, int vf,
>  				     bool spoofchk);
>  
> +struct ethtool_ts_info;
> +#ifdef CONFIG_SFC_PTP
> +extern void efx_ptp_probe(struct efx_nic *efx);
> +extern int efx_ptp_ioctl(struct efx_nic *efx, struct ifreq *ifr, int cmd);
> +extern int efx_ptp_get_ts_info(struct net_device *net_dev,
> +			       struct ethtool_ts_info *ts_info);
> +extern bool efx_ptp_is_ptp_tx(struct efx_nic *efx, struct sk_buff *skb);
> +extern int efx_ptp_tx(struct efx_nic *efx, struct sk_buff *skb);
> +extern void efx_ptp_event(struct efx_nic *efx, efx_qword_t *ev);
> +#else
> +static inline void efx_ptp_probe(struct efx_nic *efx) {}
> +static inline int efx_ptp_ioctl(struct efx_nic *efx, struct ifreq *ifr, int cmd)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int efx_ptp_get_ts_info(struct net_device *net_dev,
> +				      struct ethtool_ts_info *ts_info)
> +{
> +	return -EOPNOTSUPP;

If your PTP support is not enabled, then it would be better to offer
the standard ethtool answer to this query.

Also, it would be nice to still offer SW Tx timestamping, even when
PTP is disabled.

> +}
> +static inline bool efx_ptp_is_ptp_tx(struct efx_nic *efx, struct sk_buff *skb)
> +{
> +	return false;
> +}
> +static inline int efx_ptp_tx(struct efx_nic *efx, struct sk_buff *skb)
> +{
> +	return NETDEV_TX_OK;
> +}
> +static inline void efx_ptp_event(struct efx_nic *efx, efx_qword_t *ev) {}
> +#endif
> +
>  extern const struct efx_nic_type falcon_a1_nic_type;
>  extern const struct efx_nic_type falcon_b0_nic_type;
>  extern const struct efx_nic_type siena_a0_nic_type;
> diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
> new file mode 100644
> index 0000000..ba1e76a
> --- /dev/null
> +++ b/drivers/net/ethernet/sfc/ptp.c
> @@ -0,0 +1,1519 @@
> +/****************************************************************************
> + * Driver for Solarflare Solarstorm network controllers and boards
> + * Copyright 2011 Solarflare Communications Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation, incorporated herein by reference.
> + */
> +
> +/* Theory of operation:
> + *
> + * PTP support is assisted by firmware running on the MC, which provides
> + * the hardware timestamping capabilities.  Both transmitted and received
> + * PTP event packets are queued onto internal queues for subsequent processing;
> + * this is because the MC operations are relatively long and would block
> + * NAPI/interrupt operation.
> + *
> + * Receive event processing:
> + *	The event contains the packet's UUID and sequence number, together
> + *	with the hardware timestamp.  The PTP receive packet queue is searched
> + *	for this UUID/sequence number and, if found, put on a pending queue.
> + *	Packets not matching are delivered without timestamps (MCDI events will
> + *	always arrive after the actual packet).
> + *	It is important for the operation of the PTP protocol that the ordering
> + *	of packets between the event and general port is maintained.
> + *
> + * Work queue processing:
> + *	If work waiting, synchronise host/hardware time
> + *
> + *	Transmit: send packet through MC, which returns the transmission time
> + *	that is converted to an appropriate timestamp.
> + *
> + *	Receive: the packet's reception time is converted to an appropriate
> + *	timestamp.
> + */
> +#include <linux/ip.h>
> +#include <linux/udp.h>
> +#include <linux/time.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/net_tstamp.h>
> +#include <linux/ptp_clock_kernel.h>
> +#include "net_driver.h"
> +#include "efx.h"
> +#include "mcdi.h"
> +#include "mcdi_pcol.h"
> +#include "io.h"
> +#include "regs.h"
> +#include "nic.h"
> +
> +/* Maximum number of events expected to make up a PTP event */
> +#define	MAX_EVENT_FRAGS			3
> +
> +/* Maximum delay, ms, to begin synchronisation */
> +#define	MAX_SYNCHRONISE_WAIT_MS		2
> +
> +/* How long, at most, to spend synchronising */
> +#define	SYNCHRONISE_PERIOD_NS		250000
> +
> +/* How often to update the shared memory time */
> +#define	SYNCHRONISATION_GRANULARITY_NS	200
> +
> +/* Minimum permitted length of a (corrected) synchronisation time */
> +#define	MIN_SYNCHRONISATION_NS		120
> +
> +/* Maximum permitted length of a (corrected) synchronisation time */
> +#define	MAX_SYNCHRONISATION_NS		1000
> +
> +/* How many (MC) receive events that can be queued */
> +#define	MAX_RECEIVE_EVENTS		8
> +
> +/* Length of (modified) moving average. */
> +#define	AVERAGE_LENGTH			16
> +
> +/* How long an unmatched event or packet can be held */
> +#define PKT_EVENT_LIFETIME_MS		10
> +
> +/* Offsets into PTP packet for identification.  These offsets are from the
> + * start of the IP header, not the MAC header.  Note that neither PTP V1 nor
> + * PTP V2 permit the use of IPV4 options.
> + */
> +#define PTP_DPORT_OFFSET	22
> +
> +#define PTP_V1_VERSION_LENGTH	2
> +#define PTP_V1_VERSION_OFFSET	28
> +
> +#define PTP_V1_UUID_LENGTH	6
> +#define PTP_V1_UUID_OFFSET	50
> +
> +#define PTP_V1_SEQUENCE_LENGTH	2
> +#define PTP_V1_SEQUENCE_OFFSET	58
> +
> +/* The minimum length of a PTP V1 packet for offsets, etc. to be valid:
> + * includes IP header.
> + */
> +#define	PTP_V1_MIN_LENGTH	64
> +
> +#define PTP_V2_VERSION_LENGTH	1
> +#define PTP_V2_VERSION_OFFSET	29
> +
> +/* Although PTP V2 UUIDs comprise a ClockIdentity (8) and PortNumber (2),
> + * the MC only captures the last six bytes of the clock identity. These values
> + * reflect those, not the ones used in the standard.  The standard permits
> + * mapping of V1 UUIDs to V2 UUIDs with these same values.
> + */
> +#define PTP_V2_MC_UUID_LENGTH	6
> +#define PTP_V2_MC_UUID_OFFSET	50
> +
> +#define PTP_V2_SEQUENCE_LENGTH	2
> +#define PTP_V2_SEQUENCE_OFFSET	58
> +
> +/* The minimum length of a PTP V2 packet for offsets, etc. to be valid:
> + * includes IP header.
> + */
> +#define	PTP_V2_MIN_LENGTH	63
> +
> +#define	PTP_MIN_LENGTH		63
> +
> +#define PTP_ADDRESS		0xe0000181	/* 224.0.1.129 */
> +#define PTP_EVENT_PORT		319
> +#define PTP_GENERAL_PORT	320
> +
> +/* Annoyingly the format of the version numbers are different between
> + * versions 1 and 2 so it isn't possible to simply look for 1 or 2.
> + */
> +#define	PTP_VERSION_V1		1
> +
> +#define	PTP_VERSION_V2		2
> +#define	PTP_VERSION_V2_MASK	0x0f
> +
> +enum ptp_packet_state {
> +	PTP_PACKET_STATE_UNMATCHED = 0,
> +	PTP_PACKET_STATE_MATCHED,
> +	PTP_PACKET_STATE_TIMED_OUT,
> +	PTP_PACKET_STATE_MATCH_UNWANTED
> +};
> +
> +/* NIC synchronised with single word of time only comprising
> + * partial seconds and full nanoseconds: 10^9 ~ 2^30 so 2 bits for seconds.
> + */
> +#define	MC_NANOSECOND_BITS	30
> +#define	MC_NANOSECOND_MASK	((1 << MC_NANOSECOND_BITS) - 1)
> +#define	MC_SECOND_MASK		((1 << (32 - MC_NANOSECOND_BITS)) - 1)
> +
> +/* Maximum parts-per-billion adjustment that is acceptable */
> +#define MAX_PPB			1000000
> +
> +/* Number of bits required to hold the above */
> +#define	MAX_PPB_BITS		20
> +
> +/* Number of extra bits allowed when calculating fractional ns.
> + * EXTRA_BITS + MC_CMD_PTP_IN_ADJUST_BITS + MAX_PPB_BITS should
> + * be less than 63.
> + */
> +#define	PPB_EXTRA_BITS		2
> +
> +/* Precalculate scale word to avoid long long division at runtime */
> +#define	PPB_SCALE_WORD	((1LL << (PPB_EXTRA_BITS + MC_CMD_PTP_IN_ADJUST_BITS +\
> +			MAX_PPB_BITS)) / 1000000000LL)
> +
> +#define PTP_SYNC_ATTEMPTS	4
> +
> +/**
> + * struct efx_ptp_match - Matching structure, stored in sk_buff's cb area.
> + * @words: UUID and (partial) sequence number
> + * @expiry: Time after which the packet should be delivered irrespective of
> + *            event arrival.
> + * @state: The state of the packet - whether it is ready for processing or
> + *         whether that is of no interest.
> + */
> +struct efx_ptp_match {
> +	u32 words[DIV_ROUND_UP(PTP_V1_UUID_LENGTH, 4)];
> +	unsigned long expiry;
> +	enum ptp_packet_state state;
> +};
> +
> +/**
> + * struct efx_ptp_event_rx - A PTP receive event (from MC)
> + * @seq0: First part of (PTP) UUID
> + * @seq1: Second part of (PTP) UUID and sequence number
> + * @hwtimestamp: Event timestamp
> + */
> +struct efx_ptp_event_rx {
> +	struct list_head link;
> +	u32 seq0;
> +	u32 seq1;
> +	ktime_t hwtimestamp;
> +	unsigned long expiry;
> +};
> +
> +/**
> + * struct efx_ptp_timeset - Synchronisation between host and MC
> + * @host_start: Host time immediately before hardware timestamp taken
> + * @seconds: Hardware timestamp, seconds
> + * @nanoseconds: Hardware timestamp, nanoseconds
> + * @host_end: Host time immediately after hardware timestamp taken
> + * @waitns: Number of nanoseconds between hardware timestamp being read and
> + *          host end time being seen
> + * @window: Difference of host_end and host_start
> + * @valid: Whether this timeset is valid
> + */
> +struct efx_ptp_timeset {
> +	u32 host_start;
> +	u32 seconds;
> +	u32 nanoseconds;
> +	u32 host_end;
> +	u32 waitns;
> +	u32 window;	/* Derived: end - start, allowing for wrap */
> +};
> +
> +/**
> + * struct efx_ptp_data - Precision Time Protocol (PTP) state
> + * @channel: The PTP channel
> + * @rxq: Receive queue (awaiting timestamps)
> + * @txq: Transmit queue
> + * @evt_list: List of MC receive events awaiting packets
> + * @evt_free_list: List of free events
> + * @evt_lock: Lock for manipulating evt_list and evt_free_list
> + * @rx_evts: Instantiated events (on evt_list and evt_free_list)
> + * @workwq: Work queue for processing pending PTP operations
> + * @work: Work task
> + * @reset_required: A serious error has occurred and the PTP task needs to be
> + *                  reset (disable, enable).
> + * @rxfilter_event: Receive filter for the PTP event port when operating
> + * @rxfilter_general: Receive filter for the PTP general port when operating
> + * @config: Current timestamp configuration
> + * @enabled: PTP operation enabled
> + * @mode: Mode in which PTP operating (PTP version)
> + * @evt_frags: Partly assembled PTP events
> + * @evt_frag_idx: Current fragment number
> + * @evt_code: Last event code
> + * @start: Address at which MC indicates ready for synchronisation
> + * @host_base_time: (Synchronised with mc_base_time) host time
> + * @mc_base_time: (Synchronised with host_base_time) MC/hardware time
> + * @base_time_valid: Whether host_base_time and mc_base_time are synchronised
> + * @last_sync_ns: Last number of nanoseconds between readings when synchronising
> + * @base_sync_ns: Number of nanoseconds for last synchronisation.
> + * @base_sync_valid: Whether base_sync_ns is valid.
> + * @current_adjfreq: Current ppb adjustment.
> + * @phc_clock: Pointer to registered phc device
> + * @phc_clock_info: Registration structure for phc device
> + * @pps_work: pps work task for handling pps events
> + * @pps_workwq: pps work queue
> + * @nic_ts_enabled: Flag indicating if NIC generated TS events are handled
> + * @txbuf: Buffer for use when transmitting (PTP) packets to MC (avoids
> + *         allocations in main data path).
> + * @debug_ptp_dir: PTP debugfs directory
> + * @missed_rx_sync: Number of packets received without synchronisation.
> + * @good_syncs: Number of successful synchronisations.
> + * @no_time_syncs: Number of synchronisations with no good times.
> + * @bad_sync_durations: Number of synchronisations with bad durations.
> + * @bad_syncs: Number of failed synchronisations.
> + * @last_sync_time: Number of nanoseconds for last synchronisation.
> + * @sync_timeouts: Number of synchronisation timeouts
> + * @fast_syncs: Number of synchronisations requiring short delay
> + * @min_sync_delta: Minimum time between event and synchronisation
> + * @max_sync_delta: Maximum time between event and synchronisation
> + * @average_sync_delta: Average time between event and synchronisation.
> + *                      Modified moving average.
> + * @last_sync_delta: Last time between event and synchronisation
> + * @mc_stats: Context value for MC statistics
> + * @timeset: Last set of synchronisation statistics.
> + */
> +struct efx_ptp_data {
> +	struct efx_channel *channel;
> +	struct sk_buff_head rxq;
> +	struct sk_buff_head txq;
> +	struct list_head evt_list;
> +	struct list_head evt_free_list;
> +	spinlock_t evt_lock;
> +	struct efx_ptp_event_rx rx_evts[MAX_RECEIVE_EVENTS];
> +	struct workqueue_struct *workwq;
> +	struct work_struct work;
> +	bool reset_required;
> +	u32 rxfilter_event;
> +	u32 rxfilter_general;
> +	bool rxfilter_installed;
> +	struct hwtstamp_config config;
> +	bool enabled;
> +	unsigned int mode;
> +	efx_qword_t evt_frags[MAX_EVENT_FRAGS];
> +	int evt_frag_idx;
> +	int evt_code;
> +	struct efx_buffer start;
> +	ktime_t host_base_time;
> +	ktime_t mc_base_time;
> +	bool base_time_valid;
> +	unsigned last_sync_ns;
> +	unsigned base_sync_ns;
> +	bool base_sync_valid;
> +	s64 current_adjfreq;
> +	struct ptp_clock *phc_clock;
> +	struct ptp_clock_info phc_clock_info;
> +	struct work_struct pps_work;
> +	struct workqueue_struct *pps_workwq;
> +	bool nic_ts_enabled;
> +	u8 txbuf[ALIGN(MC_CMD_PTP_IN_TRANSMIT_LEN(
> +			       MC_CMD_PTP_IN_TRANSMIT_PACKET_MAXNUM), 4)];
> +	struct efx_ptp_timeset
> +	timeset[MC_CMD_PTP_OUT_SYNCHRONIZE_TIMESET_MAXNUM];
> +};
> +
> +static int efx_phc_adjfreq(struct ptp_clock_info *ptp, s32 delta);
> +static int efx_phc_adjtime(struct ptp_clock_info *ptp, s64 delta);
> +static int efx_phc_gettime(struct ptp_clock_info *ptp, struct timespec *ts);
> +static int efx_phc_settime(struct ptp_clock_info *ptp,
> +			   const struct timespec *e_ts);
> +static int efx_phc_enable(struct ptp_clock_info *ptp,
> +			  struct ptp_clock_request *request, int on);
> +
> +/* Enable MCDI PTP support. */
> +static int efx_ptp_enable(struct efx_nic *efx)
> +{
> +	u8 inbuf[MC_CMD_PTP_IN_ENABLE_LEN];
> +
> +	MCDI_SET_DWORD(inbuf, PTP_IN_OP, MC_CMD_PTP_OP_ENABLE);
> +	MCDI_SET_DWORD(inbuf, PTP_IN_ENABLE_QUEUE,
> +		       efx->ptp_data->channel->channel);
> +	MCDI_SET_DWORD(inbuf, PTP_IN_ENABLE_MODE, efx->ptp_data->mode);
> +
> +	return efx_mcdi_rpc(efx, MC_CMD_PTP, inbuf, sizeof(inbuf),
> +			    NULL, 0, NULL);
> +}
> +
> +/* Disable MCDI PTP support.
> + *
> + * Note that this function should never rely on the presence of ptp_data -
> + * may be called before that exists.
> + */
> +static int efx_ptp_disable(struct efx_nic *efx)
> +{
> +	u8 inbuf[MC_CMD_PTP_IN_DISABLE_LEN];
> +
> +	MCDI_SET_DWORD(inbuf, PTP_IN_OP, MC_CMD_PTP_OP_DISABLE);
> +	return efx_mcdi_rpc(efx, MC_CMD_PTP, inbuf, sizeof(inbuf),
> +			    NULL, 0, NULL);
> +}
> +
> +static void efx_ptp_deliver_rx_queue(struct sk_buff_head *q)
> +{
> +	struct sk_buff *skb;
> +
> +	while ((skb = skb_dequeue(q))) {
> +		local_bh_disable();
> +		netif_receive_skb(skb);
> +		local_bh_enable();
> +	}
> +}
> +
> +static void efx_ptp_handle_no_channel(struct efx_nic *efx)
> +{
> +	netif_err(efx, drv, efx->net_dev,
> +		  "ERROR: PTP requires MSI-X and 1 additional interrupt "
> +		  "vector. PTP disabled\n");
> +}
> +
> +/* Repeatedly send the host time to the MC which will capture the hardware
> + * time.
> + */
> +static void efx_ptp_send_times(struct efx_nic *efx, struct timespec *last_time)
> +{
> +	struct timespec now;
> +	struct timespec limit;
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	struct timespec start;
> +	int *mc_running = ptp->start.addr;
> +
> +	getnstimeofday(&now);
> +	start = now;
> +	limit = now;
> +	timespec_add_ns(&limit, SYNCHRONISE_PERIOD_NS);
> +
> +	/* Write host time for specified period or until MC is done */
> +	while ((timespec_compare(&now, &limit) < 0) &&
> +	       ACCESS_ONCE(*mc_running)) {
> +		struct timespec update_time;
> +		unsigned int host_time;
> +
> +		/* Don't update continuously to avoid saturating the PCIe bus */
> +		update_time = now;
> +		timespec_add_ns(&update_time, SYNCHRONISATION_GRANULARITY_NS);
> +		do {
> +			getnstimeofday(&now);
> +		} while ((timespec_compare(&now, &update_time) < 0) &&
> +			 ACCESS_ONCE(*mc_running));
> +
> +		/* Synchronise NIC with single word of time only */
> +		host_time = (now.tv_sec << MC_NANOSECOND_BITS) | now.tv_nsec;
> +		/* Update host time in NIC memory */
> +		_efx_writed(efx, host_time,
> +			    FR_CZ_MC_TREG_SMEM + MC_SMEM_P0_PTP_TIME_OFST);
> +	}
> +	*last_time = now;
> +	start = timespec_sub(now, start);
> +}
> +
> +/* Read a timeset from the MC's results and partially process it. */
> +static void efx_ptp_read_timeset(u8 *data, struct efx_ptp_timeset *timeset)
> +{
> +	unsigned start_ns, end_ns;
> +
> +	timeset->host_start = MCDI_DWORD(data, PTP_OUT_SYNCHRONIZE_HOSTSTART);
> +	timeset->seconds = MCDI_DWORD(data, PTP_OUT_SYNCHRONIZE_SECONDS);
> +	timeset->nanoseconds = MCDI_DWORD(data,
> +					 PTP_OUT_SYNCHRONIZE_NANOSECONDS);
> +	timeset->host_end = MCDI_DWORD(data, PTP_OUT_SYNCHRONIZE_HOSTEND);
> +	timeset->waitns = MCDI_DWORD(data, PTP_OUT_SYNCHRONIZE_WAITNS);
> +
> +	/* Ignore seconds */
> +	start_ns = timeset->host_start & MC_NANOSECOND_MASK;
> +	end_ns = timeset->host_end & MC_NANOSECOND_MASK;
> +	/* Allow for rollover */
> +	if (end_ns < start_ns)
> +		end_ns += NSEC_PER_SEC;
> +	/* Determine duration of operation */
> +	timeset->window = end_ns - start_ns;
> +}
> +
> +/* Process times received from MC.
> + *
> + * Extract times from returned results, and establish the minimum value
> + * seen.  The minimum value represents the "best" possible time and events
> + * too much greater than this are rejected - the machine is, perhaps, too
> + * busy. A number of readings are taken so that, hopefully, at least one good
> + * synchronisation will be seen in the results.
> + */

This code looks like it is trying to find the offset between two
clocks. Is there some reason why you cannot use <linux/timecompare.h>
to accomplish this?

Also, these comments about "hopeful" synchronization make me
nervous. I think it might be easier just to offer RAW timestamps and
forget about the SYS timestamps.

I am trying to purge the whole SYS thing (only blackfin is left)
because there is a much better way to go about this, namely
synchronizing the system time to the PHC time via an internal PPS
signal.
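
To illustrate the kind of offset calculation meant above, here is a
minimal user-space sketch. It is not the timecompare API and not the
driver's method; struct clk_sample and estimate_offset() are
hypothetical names, and it assumes each device clock reading is
bracketed by two host clock readings. The sample with the narrowest
host window is taken as the most trustworthy, and the device reading is
compared against the midpoint of that window.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample: one device clock reading bracketed by two
 * host clock readings, all in nanoseconds.
 */
struct clk_sample {
	int64_t host_before_ns;
	int64_t dev_ns;
	int64_t host_after_ns;
};

/* Estimate (device - host) from the sample with the smallest window.
 * Returns 0 and stores the offset, or -1 if no sample was usable.
 */
static int estimate_offset(const struct clk_sample *s, size_t n,
			   int64_t *offset_ns)
{
	int64_t best_window = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int64_t window = s[i].host_after_ns - s[i].host_before_ns;

		if (window < 0)
			continue;	/* host clock went backwards: skip */
		if (best_window < 0 || window < best_window) {
			best_window = window;
			/* Assume the device was read mid-window */
			*offset_ns = s[i].dev_ns -
				     (s[i].host_before_ns + window / 2);
		}
	}

	return best_window < 0 ? -1 : 0;
}
```

A caller would collect a handful of samples, call estimate_offset(),
and treat a -1 return as "try again later" rather than reporting a
guessed timestamp.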

> +static int efx_ptp_process_times(struct efx_nic *efx, u8 *synch_buf,
> +				 size_t response_length,
> +				 struct timespec *last_time)
> +{
> +	unsigned number_readings = (response_length /
> +			       MC_CMD_PTP_OUT_SYNCHRONIZE_TIMESET_LEN);
> +	unsigned i;
> +	unsigned min;
> +	unsigned min_set = 0;
> +	unsigned total;
> +	unsigned ngood = 0;
> +	unsigned last_good = 0;
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	bool min_valid = false;
> +	u32 last_sec;
> +	u32 start_sec;
> +
> +	if (number_readings == 0)
> +		return -EAGAIN;
> +
> +	/* Find minimum value in this set of results, discarding clearly
> +	 * erroneous results.
> +	 */
> +	for (i = 0; i < number_readings; i++) {
> +		efx_ptp_read_timeset(synch_buf, &ptp->timeset[i]);
> +		synch_buf += MC_CMD_PTP_OUT_SYNCHRONIZE_TIMESET_LEN;
> +		if (ptp->timeset[i].window > SYNCHRONISATION_GRANULARITY_NS) {
> +			if (min_valid) {
> +				if (ptp->timeset[i].window < min_set)
> +					min_set = ptp->timeset[i].window;
> +			} else {
> +				min_valid = true;
> +				min_set = ptp->timeset[i].window;
> +			}
> +		}
> +	}
> +
> +	if (min_valid) {
> +		if (ptp->base_sync_valid && (min_set > ptp->base_sync_ns))
> +			min = ptp->base_sync_ns;
> +		else
> +			min = min_set;
> +	} else {
> +		min = SYNCHRONISATION_GRANULARITY_NS;
> +	}
> +
> +	/* Discard excessively long synchronise durations.  The MC times
> +	 * when it finishes reading the host time so the corrected window
> +	 * time should be fairly constant for a given platform.
> +	 */
> +	total = 0;
> +	for (i = 0; i < number_readings; i++)
> +		if (ptp->timeset[i].window > ptp->timeset[i].waitns) {
> +			unsigned win;
> +
> +			win = ptp->timeset[i].window - ptp->timeset[i].waitns;
> +			if (win >= MIN_SYNCHRONISATION_NS &&
> +			    win < MAX_SYNCHRONISATION_NS) {
> +				total += ptp->timeset[i].window;
> +				ngood++;
> +				last_good = i;
> +			}
> +		}
> +
> +	if (ngood == 0) {
> +		netif_warn(efx, drv, efx->net_dev,
> +			   "PTP no suitable synchronisations %dns %dns\n",
> +			   ptp->base_sync_ns, min_set);
> +		return -EAGAIN;
> +	}
> +
> +	/* Average window of the good readings in this synchronisation */
> +	ptp->last_sync_ns = DIV_ROUND_UP(total, ngood);
> +	if (!ptp->base_sync_valid || (ptp->last_sync_ns < ptp->base_sync_ns)) {
> +		ptp->base_sync_valid = true;
> +		ptp->base_sync_ns = ptp->last_sync_ns;
> +	}
> +
> +	ptp->mc_base_time = ktime_set(ptp->timeset[last_good].seconds,
> +				      ptp->timeset[last_good].nanoseconds);
> +	last_time->tv_nsec =
> +		ptp->timeset[last_good].host_start & MC_NANOSECOND_MASK;
> +
> +	/* It is possible that the seconds rolled over between taking
> +	 * the start reading and the last value written by the host.  The
> +	 * timescales are such that a gap of more than one second is never
> +	 * expected.
> +	 */
> +	start_sec = ptp->timeset[last_good].host_start >> MC_NANOSECOND_BITS;
> +	last_sec = last_time->tv_sec & MC_SECOND_MASK;
> +	if (start_sec != last_sec) {
> +		if (((start_sec + 1) & MC_SECOND_MASK) != last_sec) {
> +			netif_warn(efx, hw, efx->net_dev,
> +				   "PTP bad synchronisation seconds\n");
> +			return -EAGAIN;
> +		} else {
> +			last_time->tv_sec--;
> +		}
> +	}
> +	ptp->host_base_time = ktime_set(last_time->tv_sec,
> +					last_time->tv_nsec);
> +
> +	/* At least one good synchronisation */
> +	ptp->base_time_valid = true;
> +
> +	return 0;
> +}
> +
> +/* Synchronize times between the host and the MC */
> +static int efx_ptp_synchronize(struct efx_nic *efx, unsigned int num_readings)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	u8 synch_buf[MC_CMD_PTP_OUT_SYNCHRONIZE_LENMAX];
> +	size_t response_length;
> +	int rc;
> +	unsigned long timeout;
> +	struct timespec last_time;
> +	unsigned int loops = 0;
> +	int *start = ptp->start.addr;
> +
> +	last_time.tv_sec = 0;
> +	last_time.tv_nsec = 0;
> +
> +	MCDI_SET_DWORD(synch_buf, PTP_IN_OP, MC_CMD_PTP_OP_SYNCHRONIZE);
> +	MCDI_SET_DWORD(synch_buf, PTP_IN_SYNCHRONIZE_NUMTIMESETS,
> +		       num_readings);
> +	MCDI_SET_DWORD(synch_buf, PTP_IN_SYNCHRONIZE_START_ADDR_LO,
> +		       (u32)ptp->start.dma_addr);
> +	MCDI_SET_DWORD(synch_buf, PTP_IN_SYNCHRONIZE_START_ADDR_HI,
> +		       (u32)((u64)ptp->start.dma_addr >> 32));
> +
> +	/* Clear flag that signals MC ready */
> +	ACCESS_ONCE(*start) = 0;
> +	efx_mcdi_rpc_start(efx, MC_CMD_PTP, synch_buf,
> +			   MC_CMD_PTP_IN_SYNCHRONIZE_LEN);
> +
> +	/* Wait for start from MCDI (or timeout) */
> +	timeout = jiffies + msecs_to_jiffies(MAX_SYNCHRONISE_WAIT_MS);
> +	while (!ACCESS_ONCE(*start) && (time_before(jiffies, timeout))) {
> +		udelay(20);	/* Usually start MCDI execution quickly */
> +		loops++;
> +	}
> +
> +	if (ACCESS_ONCE(*start))
> +		efx_ptp_send_times(efx, &last_time);
> +
> +	/* Collect results */
> +	rc = efx_mcdi_rpc_finish(efx, MC_CMD_PTP,
> +				 MC_CMD_PTP_IN_SYNCHRONIZE_LEN,
> +				 synch_buf, sizeof(synch_buf),
> +				 &response_length);
> +	if (rc == 0)
> +		rc = efx_ptp_process_times(efx, synch_buf, response_length,
> +					   &last_time);
> +
> +	return rc;
> +}
> +
> +/* Get the host time from a given hardware time */
> +static bool efx_ptp_get_host_time(struct efx_nic *efx,
> +				  struct skb_shared_hwtstamps *timestamps)
> +{
> +	if (efx->ptp_data->base_time_valid) {
> +		ktime_t diff = ktime_sub(timestamps->hwtstamp,
> +					 efx->ptp_data->mc_base_time);
> +
> +		timestamps->syststamp = ktime_add(efx->ptp_data->host_base_time,
> +						  diff);
> +	}
> +
> +	return efx->ptp_data->base_time_valid;
> +}
> +
> +/* Transmit a PTP packet, via the MCDI interface, to the wire. */
> +static int efx_ptp_xmit_skb(struct efx_nic *efx, struct sk_buff *skb)
> +{
> +	u8 *txbuf = efx->ptp_data->txbuf;
> +	struct skb_shared_hwtstamps timestamps;
> +	int rc = -EIO;
> +	/* MCDI driver requires word aligned lengths */
> +	size_t len = ALIGN(MC_CMD_PTP_IN_TRANSMIT_LEN(skb->len), 4);
> +	u8 txtime[MC_CMD_PTP_OUT_TRANSMIT_LEN];
> +
> +	MCDI_SET_DWORD(txbuf, PTP_IN_OP, MC_CMD_PTP_OP_TRANSMIT);
> +	MCDI_SET_DWORD(txbuf, PTP_IN_TRANSMIT_LENGTH, skb->len);
> +	if (skb_shinfo(skb)->nr_frags != 0) {
> +		rc = skb_linearize(skb);
> +		if (rc != 0)
> +			goto fail;
> +	}
> +
> +	if (skb->ip_summed == CHECKSUM_PARTIAL) {
> +		rc = skb_checksum_help(skb);
> +		if (rc != 0)
> +			goto fail;
> +	}
> +	skb_copy_from_linear_data(skb,
> +				  &txbuf[MC_CMD_PTP_IN_TRANSMIT_PACKET_OFST],
> +				  len);
> +	rc = efx_mcdi_rpc(efx, MC_CMD_PTP, txbuf, len, txtime,
> +			  sizeof(txtime), &len);
> +	if (rc != 0)
> +		goto fail;
> +
> +	memset(&timestamps, 0, sizeof(timestamps));
> +	timestamps.hwtstamp = ktime_set(
> +		MCDI_DWORD(txtime, PTP_OUT_TRANSMIT_SECONDS),
> +		MCDI_DWORD(txtime, PTP_OUT_TRANSMIT_NANOSECONDS));
> +	if (efx_ptp_get_host_time(efx, &timestamps))
> +		skb_tstamp_tx(skb, &timestamps);
> +	/* Success even if hardware timestamping failed */
> +	rc = 0;
> +
> +fail:
> +	dev_kfree_skb(skb);
> +
> +	return rc;
> +}
> +
> +static void efx_ptp_drop_time_expired_events(struct efx_nic *efx)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	struct list_head *cursor;
> +	struct list_head *next;
> +
> +	/* Drop time-expired events */
> +	spin_lock_bh(&ptp->evt_lock);
> +	if (!list_empty(&ptp->evt_list)) {
> +		list_for_each_safe(cursor, next, &ptp->evt_list) {
> +			struct efx_ptp_event_rx *evt;
> +
> +			evt = list_entry(cursor, struct efx_ptp_event_rx,
> +					 link);
> +			if (time_after(jiffies, evt->expiry)) {
> +				list_del(&evt->link);
> +				list_add(&evt->link, &ptp->evt_free_list);
> +				netif_warn(efx, hw, efx->net_dev,
> +					   "PTP rx event dropped\n");
> +			}
> +		}
> +	}
> +	spin_unlock_bh(&ptp->evt_lock);
> +}
> +
> +static enum ptp_packet_state efx_ptp_match_rx(struct efx_nic *efx,
> +					      struct sk_buff *skb)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	bool evts_waiting;
> +	struct list_head *cursor;
> +	struct list_head *next;
> +	struct efx_ptp_match *match;
> +	enum ptp_packet_state rc = PTP_PACKET_STATE_UNMATCHED;
> +
> +	spin_lock_bh(&ptp->evt_lock);
> +	evts_waiting = !list_empty(&ptp->evt_list);
> +	spin_unlock_bh(&ptp->evt_lock);
> +
> +	if (!evts_waiting)
> +		return PTP_PACKET_STATE_UNMATCHED;
> +
> +	match = (struct efx_ptp_match *)skb->cb;
> +	/* Look for a matching timestamp in the event queue */
> +	spin_lock_bh(&ptp->evt_lock);
> +	list_for_each_safe(cursor, next, &ptp->evt_list) {
> +		struct efx_ptp_event_rx *evt;
> +
> +		evt = list_entry(cursor, struct efx_ptp_event_rx, link);
> +		if ((evt->seq0 == match->words[0]) &&
> +		    (evt->seq1 == match->words[1])) {
> +			struct skb_shared_hwtstamps *timestamps;
> +
> +			/* Match - add in hardware timestamp */
> +			timestamps = skb_hwtstamps(skb);
> +			timestamps->hwtstamp = evt->hwtimestamp;
> +
> +			match->state = PTP_PACKET_STATE_MATCHED;
> +			rc = PTP_PACKET_STATE_MATCHED;
> +			list_del(&evt->link);
> +			list_add(&evt->link, &ptp->evt_free_list);
> +			break;
> +		}
> +	}
> +	spin_unlock_bh(&ptp->evt_lock);
> +
> +	return rc;
> +}
> +
> +/* Process any queued receive events and corresponding packets
> + *
> + * q is returned with all the packets that are ready for delivery.
> + * true is returned if at least one of those packets requires
> + * synchronisation.
> + */
> +static bool efx_ptp_process_events(struct efx_nic *efx, struct sk_buff_head *q)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	bool rc = false;
> +	struct sk_buff *skb;
> +
> +	while ((skb = skb_dequeue(&ptp->rxq))) {
> +		struct efx_ptp_match *match;
> +
> +		match = (struct efx_ptp_match *)skb->cb;
> +		if (match->state == PTP_PACKET_STATE_MATCH_UNWANTED) {
> +			__skb_queue_tail(q, skb);
> +		} else if (efx_ptp_match_rx(efx, skb) ==
> +			   PTP_PACKET_STATE_MATCHED) {
> +			rc = true;
> +			__skb_queue_tail(q, skb);
> +		} else if (time_after(jiffies, match->expiry)) {
> +			match->state = PTP_PACKET_STATE_TIMED_OUT;
> +			netif_warn(efx, rx_err, efx->net_dev,
> +				   "PTP packet - no timestamp seen\n");
> +			__skb_queue_tail(q, skb);
> +		} else {
> +			/* Replace unprocessed entry and stop */
> +			skb_queue_head(&ptp->rxq, skb);
> +			break;
> +		}
> +	}
> +
> +	return rc;
> +}
> +
> +/* Complete processing of a received packet */
> +static void efx_ptp_process_rx(struct efx_nic *efx, struct sk_buff *skb)
> +{
> +	struct efx_ptp_match *match = (struct efx_ptp_match *)skb->cb;
> +
> +	/* Translate timestamps, as required */
> +	if (match->state == PTP_PACKET_STATE_MATCHED) {
> +		struct skb_shared_hwtstamps *timestamps;
> +
> +		timestamps = skb_hwtstamps(skb);
> +		efx_ptp_get_host_time(efx, timestamps);
> +	}
> +
> +	local_bh_disable();
> +	netif_receive_skb(skb);
> +	local_bh_enable();
> +}
> +
> +static int efx_ptp_start(struct efx_nic *efx)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	struct efx_filter_spec rxfilter;
> +	int rc;
> +
> +	ptp->reset_required = false;
> +
> +	/* Must resynchronise when starting */
> +	ptp->base_time_valid = false;
> +	ptp->base_sync_valid = false;
> +
> +	/* Must filter on both event and general ports to ensure
> +	 * that there is no packet re-ordering.
> +	 */
> +	efx_filter_init_rx(&rxfilter, EFX_FILTER_PRI_REQUIRED, 0,
> +			   efx_rx_queue_index(
> +				   efx_channel_get_rx_queue(ptp->channel)));
> +	rc = efx_filter_set_ipv4_local(&rxfilter, IPPROTO_UDP,
> +				       htonl(PTP_ADDRESS),
> +				       htons(PTP_EVENT_PORT));
> +	if (rc != 0)
> +		return rc;
> +
> +	rc = efx_filter_insert_filter(efx, &rxfilter, true);
> +	if (rc < 0)
> +		return rc;
> +	ptp->rxfilter_event = rc;
> +
> +	efx_filter_init_rx(&rxfilter, EFX_FILTER_PRI_REQUIRED, 0,
> +			   efx_rx_queue_index(
> +				   efx_channel_get_rx_queue(ptp->channel)));
> +	rc = efx_filter_set_ipv4_local(&rxfilter, IPPROTO_UDP,
> +				       htonl(PTP_ADDRESS),
> +				       htons(PTP_GENERAL_PORT));
> +	if (rc != 0)
> +		goto fail;
> +
> +	rc = efx_filter_insert_filter(efx, &rxfilter, true);
> +	if (rc < 0)
> +		goto fail;
> +	ptp->rxfilter_general = rc;
> +
> +	rc = efx_ptp_enable(efx);
> +	if (rc != 0)
> +		goto fail2;
> +
> +	ptp->evt_frag_idx = 0;
> +	ptp->current_adjfreq = 0;
> +	ptp->rxfilter_installed = true;
> +
> +	return 0;
> +
> +fail2:
> +	efx_filter_remove_id_safe(efx, EFX_FILTER_PRI_REQUIRED,
> +				  ptp->rxfilter_general);
> +fail:
> +	efx_filter_remove_id_safe(efx, EFX_FILTER_PRI_REQUIRED,
> +				  ptp->rxfilter_event);
> +
> +	return rc;
> +}
> +
> +static int efx_ptp_stop(struct efx_nic *efx)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	int rc = efx_ptp_disable(efx);
> +	struct list_head *cursor;
> +	struct list_head *next;
> +
> +	if (ptp->rxfilter_installed) {
> +		efx_filter_remove_id_safe(efx, EFX_FILTER_PRI_REQUIRED,
> +					  ptp->rxfilter_general);
> +		efx_filter_remove_id_safe(efx, EFX_FILTER_PRI_REQUIRED,
> +					  ptp->rxfilter_event);
> +		ptp->rxfilter_installed = false;
> +	}
> +
> +	/* Make sure RX packets are really delivered */
> +	efx_ptp_deliver_rx_queue(&efx->ptp_data->rxq);
> +	skb_queue_purge(&efx->ptp_data->txq);
> +
> +	/* Drop any pending receive events */
> +	spin_lock_bh(&efx->ptp_data->evt_lock);
> +	list_for_each_safe(cursor, next, &efx->ptp_data->evt_list) {
> +		list_del(cursor);
> +		list_add(cursor, &efx->ptp_data->evt_free_list);
> +	}
> +	spin_unlock_bh(&efx->ptp_data->evt_lock);
> +
> +	return rc;
> +}
> +
> +static void efx_ptp_pps_worker(struct work_struct *work)
> +{
> +	struct efx_ptp_data *ptp =
> +		container_of(work, struct efx_ptp_data, pps_work);
> +	struct efx_nic *efx = ptp->channel->efx;
> +	struct timespec event_gen_time;
> +	struct ptp_clock_event ptp_pps_evt;
> +	ktime_t gen_time_host;
> +
> +	if (efx_ptp_synchronize(efx, PTP_SYNC_ATTEMPTS))
> +		return;
> +
> +	gen_time_host = ktime_sub(ptp->mc_base_time,
> +				  ptp->host_base_time);
> +	event_gen_time = ktime_to_timespec(gen_time_host);
> +
> +	ptp_pps_evt.type = PTP_CLOCK_EXTTS;
> +	ptp_pps_evt.timestamp = ktime_to_ns(gen_time_host);
> +	ptp_clock_event(ptp->phc_clock, &ptp_pps_evt);
> +}
> +
> +/* Process any pending transmissions and timestamp any received packets.
> + *
> + * Host and NIC time are synchronised once if there is any work to do:
> + * the process is relatively expensive so don't do it for each packet.
> + */
> +static void efx_ptp_worker(struct work_struct *work)
> +{
> +	struct efx_ptp_data *ptp_data =
> +		container_of(work, struct efx_ptp_data, work);
> +	struct efx_nic *efx = ptp_data->channel->efx;
> +	struct sk_buff *skb;
> +	struct sk_buff_head tempq;
> +
> +	if (ptp_data->reset_required) {
> +		efx_ptp_stop(efx);
> +		efx_ptp_start(efx);
> +		return;
> +	}
> +
> +	efx_ptp_drop_time_expired_events(efx);
> +
> +	__skb_queue_head_init(&tempq);
> +	if (efx_ptp_process_events(efx, &tempq) ||
> +	    !skb_queue_empty(&ptp_data->txq)) {
> +		/* Synchronise PC/MC times when there's work to do. This
> +		 * isn't fatal but would be unusual (because of the retries
> +		 * within efx_ptp_synchronize).  Failure may suggest a heavily
> +		 * overloaded system.
> +		 */
> +		if (0 != efx_ptp_synchronize(efx, PTP_SYNC_ATTEMPTS))
> +			netif_warn(efx, drv, efx->net_dev,
> +				   "PTP couldn't get synchronisation\n");
> +
> +		while ((skb = skb_dequeue(&ptp_data->txq)))
> +			efx_ptp_xmit_skb(efx, skb);
> +	}
> +
> +	while ((skb = __skb_dequeue(&tempq)))
> +		efx_ptp_process_rx(efx, skb);
> +}
> +
> +/* Initialise PTP channel and state.
> + *
> + * Setting core_index to zero causes the queue to be initialised and doesn't
> + * overlap with 'rxq0' because ptp.c doesn't use skb_record_rx_queue.
> + */
> +static int efx_ptp_probe_channel(struct efx_channel *channel)
> +{
> +	struct efx_nic *efx = channel->efx;
> +	struct efx_ptp_data *ptp;
> +	int rc = 0;
> +	unsigned int pos;
> +
> +	channel->irq_moderation = 0;
> +	channel->rx_queue.core_index = 0;
> +
> +	ptp = kzalloc(sizeof(struct efx_ptp_data), GFP_KERNEL);
> +	efx->ptp_data = ptp;
> +	if (!efx->ptp_data)
> +		return -ENOMEM;
> +
> +	rc = efx_nic_alloc_buffer(efx, &ptp->start, sizeof(int));
> +	if (rc != 0)
> +		goto fail1;
> +
> +	ptp->channel = channel;
> +	skb_queue_head_init(&ptp->rxq);
> +	skb_queue_head_init(&ptp->txq);
> +	ptp->workwq = create_singlethread_workqueue("sfc_ptp");
> +	if (!ptp->workwq) {
> +		rc = -ENOMEM;
> +		goto fail2;
> +	}
> +
> +	INIT_WORK(&ptp->work, efx_ptp_worker);
> +	ptp->config.flags = 0;
> +	ptp->config.tx_type = HWTSTAMP_TX_OFF;
> +	ptp->config.rx_filter = HWTSTAMP_FILTER_NONE;
> +	INIT_LIST_HEAD(&ptp->evt_list);
> +	INIT_LIST_HEAD(&ptp->evt_free_list);
> +	spin_lock_init(&ptp->evt_lock);
> +	for (pos = 0; pos < MAX_RECEIVE_EVENTS; pos++)
> +		list_add(&ptp->rx_evts[pos].link, &ptp->evt_free_list);
> +
> +	ptp->phc_clock_info.owner = THIS_MODULE;
> +	snprintf(ptp->phc_clock_info.name,
> +		 sizeof(ptp->phc_clock_info.name),
> +		 "%pm", efx->net_dev->perm_addr);
> +	ptp->phc_clock_info.max_adj = MAX_PPB;
> +	ptp->phc_clock_info.n_alarm = 0;
> +	ptp->phc_clock_info.n_ext_ts = 1;
> +	ptp->phc_clock_info.n_per_out = 0;
> +	ptp->phc_clock_info.pps = 0;
> +	ptp->phc_clock_info.adjfreq = efx_phc_adjfreq;
> +	ptp->phc_clock_info.adjtime = efx_phc_adjtime;
> +	ptp->phc_clock_info.gettime = efx_phc_gettime;
> +	ptp->phc_clock_info.settime = efx_phc_settime;
> +	ptp->phc_clock_info.enable = efx_phc_enable;
> +
> +	ptp->phc_clock = ptp_clock_register(&ptp->phc_clock_info);
> +	if (!ptp->phc_clock)
> +		goto fail3;
> +
> +	INIT_WORK(&ptp->pps_work, efx_ptp_pps_worker);
> +	ptp->pps_workwq = create_singlethread_workqueue("sfc_pps");
> +	if (!ptp->pps_workwq) {
> +		rc = -ENOMEM;
> +		goto fail4;
> +	}
> +	ptp->nic_ts_enabled = false;
> +
> +	return 0;
> +fail4:
> +	ptp_clock_unregister(efx->ptp_data->phc_clock);
> +
> +fail3:
> +	destroy_workqueue(efx->ptp_data->workwq);
> +
> +fail2:
> +	efx_nic_free_buffer(efx, &ptp->start);
> +
> +fail1:
> +	kfree(efx->ptp_data);
> +	efx->ptp_data = NULL;
> +
> +	return rc;
> +}
> +
> +static void efx_ptp_remove_channel(struct efx_channel *channel)
> +{
> +	struct efx_nic *efx = channel->efx;
> +
> +	if (!efx->ptp_data)
> +		return;
> +
> +	(void)efx_ptp_disable(channel->efx);
> +
> +	cancel_work_sync(&efx->ptp_data->work);
> +	cancel_work_sync(&efx->ptp_data->pps_work);
> +
> +	skb_queue_purge(&efx->ptp_data->rxq);
> +	skb_queue_purge(&efx->ptp_data->txq);
> +
> +	ptp_clock_unregister(efx->ptp_data->phc_clock);
> +
> +	destroy_workqueue(efx->ptp_data->workwq);
> +	destroy_workqueue(efx->ptp_data->pps_workwq);
> +
> +	efx_nic_free_buffer(efx, &efx->ptp_data->start);
> +	kfree(efx->ptp_data);
> +}
> +
> +static void efx_ptp_get_channel_name(struct efx_channel *channel,
> +				     char *buf, size_t len)
> +{
> +	snprintf(buf, len, "%s-ptp", channel->efx->name);
> +}
> +
> +/* Determine whether this packet should be processed by the PTP module
> + * or transmitted conventionally.
> + */
> +bool efx_ptp_is_ptp_tx(struct efx_nic *efx, struct sk_buff *skb)
> +{
> +	return efx->ptp_data &&
> +		efx->ptp_data->enabled &&
> +		skb->len >= PTP_MIN_LENGTH &&
> +		skb->len <= MC_CMD_PTP_IN_TRANSMIT_PACKET_MAXNUM &&
> +		likely(skb->protocol == htons(ETH_P_IP)) &&
> +		ip_hdr(skb)->protocol == IPPROTO_UDP &&
> +		udp_hdr(skb)->dest == htons(PTP_EVENT_PORT);
> +}
> +
> +/* Receive a PTP packet.  Packets are queued until the arrival of
> + * the receive timestamp from the MC - this will probably occur after the
> + * packet arrival because of the processing in the MC.
> + */
> +static void efx_ptp_rx(struct efx_channel *channel, struct sk_buff *skb)
> +{
> +	struct efx_nic *efx = channel->efx;
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	struct efx_ptp_match *match = (struct efx_ptp_match *)skb->cb;
> +	u8 *data;
> +	unsigned int version;
> +
> +	match->expiry = jiffies + msecs_to_jiffies(PKT_EVENT_LIFETIME_MS);
> +
> +	/* Correct version? */
> +	if (ptp->mode == MC_CMD_PTP_MODE_V1) {
> +		if (skb->len < PTP_V1_MIN_LENGTH) {
> +			netif_receive_skb(skb);
> +			return;
> +		}
> +		version = ntohs(*(__be16 *)&skb->data[PTP_V1_VERSION_OFFSET]);
> +		if (version != PTP_VERSION_V1) {
> +			netif_receive_skb(skb);
> +			return;
> +		}
> +	} else {
> +		if (skb->len < PTP_V2_MIN_LENGTH) {
> +			netif_receive_skb(skb);
> +			return;
> +		}
> +		version = skb->data[PTP_V2_VERSION_OFFSET];
> +
> +		BUG_ON(ptp->mode != MC_CMD_PTP_MODE_V2);
> +		BUILD_BUG_ON(PTP_V1_UUID_OFFSET != PTP_V2_MC_UUID_OFFSET);
> +		BUILD_BUG_ON(PTP_V1_UUID_LENGTH != PTP_V2_MC_UUID_LENGTH);
> +		BUILD_BUG_ON(PTP_V1_SEQUENCE_OFFSET != PTP_V2_SEQUENCE_OFFSET);
> +		BUILD_BUG_ON(PTP_V1_SEQUENCE_LENGTH != PTP_V2_SEQUENCE_LENGTH);
> +
> +		if ((version & PTP_VERSION_V2_MASK) != PTP_VERSION_V2) {
> +			netif_receive_skb(skb);
> +			return;
> +		}
> +	}
> +
> +	/* Does this packet require timestamping? */
> +	if (ntohs(*(__be16 *)&skb->data[PTP_DPORT_OFFSET]) == PTP_EVENT_PORT) {
> +		struct skb_shared_hwtstamps *timestamps;
> +
> +		match->state = PTP_PACKET_STATE_UNMATCHED;
> +
> +		/* Clear all timestamps held: filled in later */
> +		timestamps = skb_hwtstamps(skb);
> +		memset(timestamps, 0, sizeof(*timestamps));
> +
> +		/* Extract UUID/Sequence information */
> +		data = skb->data + PTP_V1_UUID_OFFSET;
> +		match->words[0] = (data[0]         |
> +				   (data[1] << 8)  |
> +				   (data[2] << 16) |
> +				   (data[3] << 24));
> +		match->words[1] = (data[4]         |
> +				   (data[5] << 8)  |
> +				   (skb->data[PTP_V1_SEQUENCE_OFFSET +
> +					      PTP_V1_SEQUENCE_LENGTH - 1] <<
> +				    16));
> +	} else {
> +		match->state = PTP_PACKET_STATE_MATCH_UNWANTED;
> +	}
> +
> +	skb_queue_tail(&ptp->rxq, skb);
> +	queue_work(ptp->workwq, &ptp->work);
> +}
> +
> +/* Transmit a PTP packet.  This has to be transmitted by the MC
> + * itself, through an MCDI call.  MCDI calls aren't permitted
> + * in the transmit path so defer the actual transmission to a suitable worker.
> + */
> +int efx_ptp_tx(struct efx_nic *efx, struct sk_buff *skb)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +
> +	skb_queue_tail(&ptp->txq, skb);
> +
> +	if ((udp_hdr(skb)->dest == htons(PTP_EVENT_PORT)) &&
> +	    (skb->len <= MC_CMD_PTP_IN_TRANSMIT_PACKET_MAXNUM))
> +		efx_xmit_hwtstamp_pending(skb);
> +	queue_work(ptp->workwq, &ptp->work);
> +
> +	return NETDEV_TX_OK;
> +}
> +
> +static int efx_ptp_change_mode(struct efx_nic *efx, bool enable_wanted,
> +			       unsigned int new_mode)
> +{
> +	if ((enable_wanted != efx->ptp_data->enabled) ||
> +	    (enable_wanted && (efx->ptp_data->mode != new_mode))) {
> +		int rc;
> +
> +		if (enable_wanted) {
> +			/* Change of mode requires disable */
> +			if (efx->ptp_data->enabled &&
> +			    (efx->ptp_data->mode != new_mode)) {
> +				efx->ptp_data->enabled = false;
> +				rc = efx_ptp_stop(efx);
> +				if (rc != 0)
> +					return rc;
> +			}
> +
> +			/* Set new operating mode and establish
> +			 * baseline synchronisation, which must
> +			 * succeed.
> +			 */
> +			efx->ptp_data->mode = new_mode;
> +			rc = efx_ptp_start(efx);
> +			if (rc == 0) {
> +				rc = efx_ptp_synchronize(efx,
> +							 PTP_SYNC_ATTEMPTS * 2);
> +				if (rc != 0)
> +					efx_ptp_stop(efx);
> +			}
> +		} else {
> +			rc = efx_ptp_stop(efx);
> +		}
> +
> +		if (rc != 0)
> +			return rc;
> +
> +		efx->ptp_data->enabled = enable_wanted;
> +	}
> +
> +	return 0;
> +}
> +
> +static int efx_ptp_ts_init(struct efx_nic *efx, struct hwtstamp_config *init)
> +{
> +	bool enable_wanted = false;
> +	unsigned int new_mode;
> +	int rc;
> +
> +	if (init->flags)
> +		return -EINVAL;
> +
> +	if ((init->tx_type != HWTSTAMP_TX_OFF) &&
> +	    (init->tx_type != HWTSTAMP_TX_ON))
> +		return -ERANGE;
> +
> +	new_mode = efx->ptp_data->mode;
> +	/* Determine whether any PTP HW operations are required */
> +	switch (init->rx_filter) {
> +	case HWTSTAMP_FILTER_NONE:
> +		break;
> +	case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
> +		init->rx_filter = HWTSTAMP_FILTER_PTP_V1_L4_EVENT;
> +		new_mode = MC_CMD_PTP_MODE_V1;
> +		enable_wanted = true;
> +		break;
> +	case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
> +	/* Although these three are accepted only IPV4 packets will be
> +	 * timestamped
> +	 */
> +	case HWTSTAMP_FILTER_PTP_V2_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V2_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
> +		init->rx_filter = HWTSTAMP_FILTER_PTP_V2_L4_EVENT;
> +		new_mode = MC_CMD_PTP_MODE_V2;
> +		enable_wanted = true;
> +		break;
> +	case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
> +		/* Non-IP timestamping not supported */
> +		return -ERANGE;
> +		break;
> +	default:
> +		return -ERANGE;
> +	}
> +
> +	if (init->tx_type != HWTSTAMP_TX_OFF)
> +		enable_wanted = true;
> +
> +	rc = efx_ptp_change_mode(efx, enable_wanted, new_mode);
> +	if (rc != 0)
> +		return rc;
> +
> +	efx->ptp_data->config = *init;
> +
> +	return 0;
> +}
> +
> +int
> +efx_ptp_get_ts_info(struct net_device *net_dev, struct ethtool_ts_info *ts_info)
> +{
> +	struct efx_nic *efx = netdev_priv(net_dev);
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +
> +	if (!ptp)
> +		return -EOPNOTSUPP;
> +
> +	ts_info->so_timestamping = (SOF_TIMESTAMPING_TX_HARDWARE |
> +				    SOF_TIMESTAMPING_RX_HARDWARE |
> +				    SOF_TIMESTAMPING_SYS_HARDWARE |
> +				    SOF_TIMESTAMPING_RAW_HARDWARE);
> +	ts_info->phc_index = ptp_clock_index(ptp->phc_clock);
> +	ts_info->tx_types = 1 << HWTSTAMP_TX_OFF | 1 << HWTSTAMP_TX_ON;
> +	ts_info->rx_filters = (1 << HWTSTAMP_FILTER_NONE |
> +			       1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT |
> +			       1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC |
> +			       1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ |
> +			       1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT |
> +			       1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC |
> +			       1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ |
> +			       1 << HWTSTAMP_FILTER_PTP_V2_EVENT |
> +			       1 << HWTSTAMP_FILTER_PTP_V2_SYNC |
> +			       1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ);
> +	return 0;
> +}
> +
> +int efx_ptp_ioctl(struct efx_nic *efx, struct ifreq *ifr, int cmd)
> +{
> +	struct hwtstamp_config config;
> +	int rc;
> +
> +	/* Not a PTP enabled port */
> +	if (!efx->ptp_data)
> +		return -EOPNOTSUPP;
> +
> +	if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
> +		return -EFAULT;
> +
> +	rc = efx_ptp_ts_init(efx, &config);
> +	if (rc != 0)
> +		return rc;
> +
> +	return copy_to_user(ifr->ifr_data, &config, sizeof(config))
> +		? -EFAULT : 0;
> +}
> +
> +static void ptp_event_failure(struct efx_nic *efx, int expected_frag_len)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +
> +	netif_err(efx, hw, efx->net_dev,
> +		"PTP unexpected event length: got %d expected %d\n",
> +		ptp->evt_frag_idx, expected_frag_len);
> +	ptp->reset_required = true;
> +	queue_work(ptp->workwq, &ptp->work);
> +}
> +
> +/* Process a completed receive event.  Put it on the event queue and
> + * start worker thread.  This is required because events and their
> + * corresponding packets may come in either order.
> + */
> +static void ptp_event_rx(struct efx_nic *efx, struct efx_ptp_data *ptp)
> +{
> +	struct efx_ptp_event_rx *evt = NULL;
> +
> +	if (ptp->evt_frag_idx != 3) {
> +		ptp_event_failure(efx, 3);
> +		return;
> +	}
> +
> +	spin_lock_bh(&ptp->evt_lock);
> +	if (!list_empty(&ptp->evt_free_list)) {
> +		evt = list_first_entry(&ptp->evt_free_list,
> +				       struct efx_ptp_event_rx, link);
> +		list_del(&evt->link);
> +
> +		evt->seq0 = EFX_QWORD_FIELD(ptp->evt_frags[2], MCDI_EVENT_DATA);
> +		evt->seq1 = (EFX_QWORD_FIELD(ptp->evt_frags[2],
> +					     MCDI_EVENT_SRC)        |
> +			     (EFX_QWORD_FIELD(ptp->evt_frags[1],
> +					      MCDI_EVENT_SRC) << 8) |
> +			     (EFX_QWORD_FIELD(ptp->evt_frags[0],
> +					      MCDI_EVENT_SRC) << 16));
> +		evt->hwtimestamp = ktime_set(
> +			EFX_QWORD_FIELD(ptp->evt_frags[0], MCDI_EVENT_DATA),
> +			EFX_QWORD_FIELD(ptp->evt_frags[1], MCDI_EVENT_DATA));
> +		evt->expiry = jiffies + msecs_to_jiffies(PKT_EVENT_LIFETIME_MS);
> +		list_add_tail(&evt->link, &ptp->evt_list);
> +
> +		queue_work(ptp->workwq, &ptp->work);
> +	} else {
> +		netif_err(efx, rx_err, efx->net_dev, "No free PTP event");
> +	}
> +	spin_unlock_bh(&ptp->evt_lock);
> +}
> +
> +static void ptp_event_fault(struct efx_nic *efx, struct efx_ptp_data *ptp)
> +{
> +	int code = EFX_QWORD_FIELD(ptp->evt_frags[0], MCDI_EVENT_DATA);
> +	if (ptp->evt_frag_idx != 1) {
> +		ptp_event_failure(efx, 1);
> +		return;
> +	}
> +
> +	netif_err(efx, hw, efx->net_dev, "PTP error %d\n", code);
> +}
> +
> +static void ptp_event_pps(struct efx_nic *efx, struct efx_ptp_data *ptp)
> +{
> +	if (ptp->nic_ts_enabled)
> +		queue_work(ptp->pps_workwq, &ptp->pps_work);
> +}
> +
> +void efx_ptp_event(struct efx_nic *efx, efx_qword_t *ev)
> +{
> +	struct efx_ptp_data *ptp = efx->ptp_data;
> +	int code = EFX_QWORD_FIELD(*ev, MCDI_EVENT_CODE);
> +
> +	if (!ptp->enabled)
> +		return;
> +
> +	if (ptp->evt_frag_idx == 0) {
> +		ptp->evt_code = code;
> +	} else if (ptp->evt_code != code) {
> +		netif_err(efx, hw, efx->net_dev,
> +			  "PTP out of sequence event %d\n", code);
> +		ptp->evt_frag_idx = 0;
> +	}
> +
> +	ptp->evt_frags[ptp->evt_frag_idx++] = *ev;
> +	if (!MCDI_EVENT_FIELD(*ev, CONT)) {
> +		/* Process resulting event */
> +		switch (code) {
> +		case MCDI_EVENT_CODE_PTP_RX:
> +			ptp_event_rx(efx, ptp);
> +			break;
> +		case MCDI_EVENT_CODE_PTP_FAULT:
> +			ptp_event_fault(efx, ptp);
> +			break;
> +		case MCDI_EVENT_CODE_PTP_PPS:
> +			ptp_event_pps(efx, ptp);
> +			break;
> +		default:
> +			netif_err(efx, hw, efx->net_dev,
> +				  "PTP unknown event %d\n", code);
> +			break;
> +		}
> +		ptp->evt_frag_idx = 0;
> +	} else if (MAX_EVENT_FRAGS == ptp->evt_frag_idx) {
> +		netif_err(efx, hw, efx->net_dev,
> +			  "PTP too many event fragments\n");
> +		ptp->evt_frag_idx = 0;
> +	}
> +}
> +
> +static int efx_phc_adjfreq(struct ptp_clock_info *ptp, s32 delta)
> +{
> +	struct efx_ptp_data *ptp_data = container_of(ptp,
> +						     struct efx_ptp_data,
> +						     phc_clock_info);
> +	struct efx_nic *efx = ptp_data->channel->efx;
> +	u8 inadj[MC_CMD_PTP_IN_ADJUST_LEN];
> +	s64 adjustment_ns;
> +	int rc;
> +
> +	if (delta > MAX_PPB)
> +		delta = MAX_PPB;
> +	else if (delta < -MAX_PPB)
> +		delta = -MAX_PPB;
> +
> +	/* Convert ppb to fixed point ns. */
> +	adjustment_ns = (((s64)delta * PPB_SCALE_WORD) >>
> +			 (PPB_EXTRA_BITS + MAX_PPB_BITS));
> +
> +	MCDI_SET_DWORD(inadj, PTP_IN_OP, MC_CMD_PTP_OP_ADJUST);
> +	MCDI_SET_DWORD(inadj, PTP_IN_ADJUST_FREQ_LO, (u32)adjustment_ns);
> +	MCDI_SET_DWORD(inadj, PTP_IN_ADJUST_FREQ_HI,
> +		       (u32)(adjustment_ns >> 32));
> +	MCDI_SET_DWORD(inadj, PTP_IN_ADJUST_SECONDS, 0);
> +	MCDI_SET_DWORD(inadj, PTP_IN_ADJUST_NANOSECONDS, 0);
> +	rc = efx_mcdi_rpc(efx, MC_CMD_PTP, inadj, sizeof(inadj),
> +			  NULL, 0, NULL);
> +	if (rc != 0)
> +		return rc;
> +
> +	ptp_data->current_adjfreq = delta;
> +	return 0;
> +}
> +
> +static int efx_phc_adjtime(struct ptp_clock_info *ptp, s64 delta)
> +{
> +	struct efx_ptp_data *ptp_data = container_of(ptp,
> +						     struct efx_ptp_data,
> +						     phc_clock_info);
> +	struct efx_nic *efx = ptp_data->channel->efx;
> +	struct timespec delta_ts = ns_to_timespec(delta);
> +	u8 inbuf[MC_CMD_PTP_IN_ADJUST_LEN];
> +
> +	MCDI_SET_DWORD(inbuf, PTP_IN_OP, MC_CMD_PTP_OP_ADJUST);
> +	MCDI_SET_DWORD(inbuf, PTP_IN_ADJUST_FREQ_LO, 0);
> +	MCDI_SET_DWORD(inbuf, PTP_IN_ADJUST_FREQ_HI, 0);
> +	MCDI_SET_DWORD(inbuf, PTP_IN_ADJUST_SECONDS, (u32)delta_ts.tv_sec);
> +	MCDI_SET_DWORD(inbuf, PTP_IN_ADJUST_NANOSECONDS, (u32)delta_ts.tv_nsec);
> +	return efx_mcdi_rpc(efx, MC_CMD_PTP, inbuf, sizeof(inbuf),
> +			    NULL, 0, NULL);
> +}
> +
> +static int efx_phc_gettime(struct ptp_clock_info *ptp, struct timespec *ts)
> +{
> +	struct efx_ptp_data *ptp_data = container_of(ptp,
> +						     struct efx_ptp_data,
> +						     phc_clock_info);
> +	struct efx_nic *efx = ptp_data->channel->efx;
> +	u8 inbuf[MC_CMD_PTP_IN_READ_NIC_TIME_LEN];
> +	u8 outbuf[MC_CMD_PTP_OUT_READ_NIC_TIME_LEN];
> +	int rc;
> +
> +	MCDI_SET_DWORD(inbuf, PTP_IN_OP, MC_CMD_PTP_OP_READ_NIC_TIME);
> +
> +	rc = efx_mcdi_rpc(efx, MC_CMD_PTP, inbuf, sizeof(inbuf),
> +			  outbuf, sizeof(outbuf), NULL);
> +	if (rc != 0)
> +		return rc;
> +
> +	ts->tv_sec = MCDI_DWORD(outbuf, PTP_OUT_READ_NIC_TIME_SECONDS);
> +	ts->tv_nsec = MCDI_DWORD(outbuf, PTP_OUT_READ_NIC_TIME_NANOSECONDS);
> +	return 0;
> +}
> +
> +static int efx_phc_settime(struct ptp_clock_info *ptp,
> +			   const struct timespec *e_ts)
> +{
> +	/* We must provide this function, but we cannot actually set the time */

Huh? You can adjtime, so must be able to settime, too, right?

If you have enough range in the RAW timestamp in the MC firmware (like
64 bits of nanoseconds), and you allow settime, then you can spare the
system time synchronization code altogether.
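
As a rough check on the range argument, here is a sketch only; the
names below are illustrative and not taken from any driver or firmware
interface. A 64-bit nanosecond counter covers roughly 584 years if
unsigned, or 292 years if signed:

```c
#include <stdint.h>

/* Illustrative constants; not from the sfc driver or the MC firmware. */
#define EX_NSEC_PER_SEC	1000000000ULL
#define EX_SEC_PER_YEAR	(365.25 * 86400.0)

/* Years representable by an unsigned 64-bit nanosecond counter. */
static double ex_u64_ns_range_years(void)
{
	return (double)(UINT64_MAX / EX_NSEC_PER_SEC) / EX_SEC_PER_YEAR;
}

/* Years representable by the positive half of a signed 64-bit counter. */
static double ex_s64_ns_range_years(void)
{
	return (double)((uint64_t)INT64_MAX / EX_NSEC_PER_SEC) / EX_SEC_PER_YEAR;
}
```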

Thanks,
Richard


> +	return -EOPNOTSUPP;
> +}
> +
> +static int efx_phc_enable(struct ptp_clock_info *ptp,
> +			  struct ptp_clock_request *request,
> +			  int enable)
> +{
> +	struct efx_ptp_data *ptp_data = container_of(ptp,
> +						     struct efx_ptp_data,
> +						     phc_clock_info);
> +	if (request->type != PTP_CLK_REQ_EXTTS)
> +		return -EOPNOTSUPP;
> +
> +	ptp_data->nic_ts_enabled = !!enable;
> +	return 0;
> +}
> +
> +static const struct efx_channel_type efx_ptp_channel_type = {
> +	.handle_no_channel	= efx_ptp_handle_no_channel,
> +	.pre_probe		= efx_ptp_probe_channel,
> +	.post_remove		= efx_ptp_remove_channel,
> +	.get_name		= efx_ptp_get_channel_name,
> +	/* no copy operation; there is no need to reallocate this channel */
> +	.receive_skb		= efx_ptp_rx,
> +	.keep_eventq		= false,
> +};
> +
> +void efx_ptp_probe(struct efx_nic *efx)
> +{
> +	/* Check whether PTP is implemented on this NIC.  The DISABLE
> +	 * operation will succeed if and only if it is implemented.
> +	 */
> +	if (efx_ptp_disable(efx) == 0)
> +		efx->extra_channel_type[EFX_EXTRA_CHANNEL_PTP] =
> +			&efx_ptp_channel_type;
> +}
> diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
> index 92699a0..8923863 100644
> --- a/drivers/net/ethernet/sfc/rx.c
> +++ b/drivers/net/ethernet/sfc/rx.c
> @@ -573,7 +573,7 @@ static void efx_rx_deliver(struct efx_channel *channel,
>  	/* Record the rx_queue */
>  	skb_record_rx_queue(skb, channel->rx_queue.core_index);
>  
> -	/* Does the channel want to handle the skb */
> +	/* Pass the packet up */
>  	if (channel->type->receive_skb)
>  		channel->type->receive_skb(channel, skb);
>  	else
> diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
> index 6bafd21..84b41bf 100644
> --- a/drivers/net/ethernet/sfc/siena.c
> +++ b/drivers/net/ethernet/sfc/siena.c
> @@ -335,6 +335,7 @@ static int siena_probe_nic(struct efx_nic *efx)
>  		goto fail5;
>  
>  	efx_sriov_probe(efx);
> +	efx_ptp_probe(efx);
>  
>  	return 0;
>  
> diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
> index 9b225a7..66badf8 100644
> --- a/drivers/net/ethernet/sfc/tx.c
> +++ b/drivers/net/ethernet/sfc/tx.c
> @@ -347,6 +347,12 @@ netdev_tx_t efx_hard_start_xmit(struct sk_buff *skb,
>  
>  	EFX_WARN_ON_PARANOID(!netif_device_present(net_dev));
>  
> +	/* PTP "event" packet */
> +	if (unlikely(efx_xmit_with_hwtstamp(skb)) &&
> +	    unlikely(efx_ptp_is_ptp_tx(efx, skb))) {
> +		return efx_ptp_tx(efx, skb);
> +	}
> +
>  	index = skb_get_queue_mapping(skb);
>  	type = skb->ip_summed == CHECKSUM_PARTIAL ? EFX_TXQ_TYPE_OFFLOAD : 0;
>  	if (index >= efx->n_tx_channels) {
> -- 
> 1.7.7.6
> 
> 
> 
> -- 
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 

^ permalink raw reply

* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: Ben Hutchings @ 2012-07-19 14:37 UTC (permalink / raw)
  To: Richard Cochran, Andrew Jackson; +Cc: David Miller, netdev, linux-net-drivers
In-Reply-To: <20120719142558.GB24484@localhost.localdomain>

On Thu, 2012-07-19 at 16:25 +0200, Richard Cochran wrote:
> On Wed, Jul 18, 2012 at 07:21:33PM +0100, Ben Hutchings wrote:
> > Add PTP IEEE-1588 support and make it accessible via the PHC subsystem.
> > 
> > This work is based on prior code by Andrew Jackson
> > 
> > Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
> > Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
[...]
> > --- a/drivers/net/ethernet/sfc/nic.h
> > +++ b/drivers/net/ethernet/sfc/nic.h
> > @@ -250,6 +250,37 @@ extern int efx_sriov_get_vf_config(struct net_device *dev, int vf,
> >  extern int efx_sriov_set_vf_spoofchk(struct net_device *net_dev, int vf,
> >  				     bool spoofchk);
> >  
> > +struct ethtool_ts_info;
> > +#ifdef CONFIG_SFC_PTP
> > +extern void efx_ptp_probe(struct efx_nic *efx);
> > +extern int efx_ptp_ioctl(struct efx_nic *efx, struct ifreq *ifr, int cmd);
> > +extern int efx_ptp_get_ts_info(struct net_device *net_dev,
> > +			       struct ethtool_ts_info *ts_info);
> > +extern bool efx_ptp_is_ptp_tx(struct efx_nic *efx, struct sk_buff *skb);
> > +extern int efx_ptp_tx(struct efx_nic *efx, struct sk_buff *skb);
> > +extern void efx_ptp_event(struct efx_nic *efx, efx_qword_t *ev);
> > +#else
> > +static inline void efx_ptp_probe(struct efx_nic *efx) {}
> > +static inline int efx_ptp_ioctl(struct efx_nic *efx, struct ifreq *ifr, int cmd)
> > +{
> > +	return -EOPNOTSUPP;
> > +}
> > +static inline int efx_ptp_get_ts_info(struct net_device *net_dev,
> > +				      struct ethtool_ts_info *ts_info)
> > +{
> > +	return -EOPNOTSUPP;
> 
> If your PTP support is not enabled, then it would be better to offer
> the standard ethtool answer to this query.
> 
> Also, it would be nice to still offer SW Tx timestamping, even when
> PTP is disabled.

Yes, I'm aware we should do that, but I also want to resolve the feature
gap between in-tree and out-of-tree versions first.

[...]
> > +/* Process times received from MC.
> > + *
> > + * Extract times from returned results, and establish the minimum value
> > + * seen.  The minimum value represents the "best" possible time and events
> > + * too much greater than this are rejected - the machine is, perhaps, too
> > + * busy. A number of readings are taken so that, hopefully, at least one good
> > + * synchronisation will be seen in the results.
> > + */
> 
> This code looks like it is trying to find the offset between two
> clocks. Is there some reason why you cannot use <linux/timecompare.h>
> to accomplish this?
>
> Also, these comments about "hopeful" synchronization make me
> nervous. I think it might be easier just to offer RAW timestamps and
> forget about the SYS timestamps.
> 
> I am trying to purge the whole SYS thing (only blackfin is left)
> because there is a much better way to go about this, namely
> synchronizing the system time to the PHC time via an internal PPS
> signal.

Andrew, would that work for us?
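
For reference, the minimum-window filtering described in the quoted
comment can be sketched roughly as follows. This is a standalone
illustration, not the code from the patch: the struct, the function
names and the tolerance parameter are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample: one host/MC time exchange. */
struct ex_sync_sample {
	uint32_t window_ns;	/* host time elapsed around the MC reading */
	int64_t offset_ns;	/* MC time minus host time for this reading */
};

/* Return the index of the sample with the smallest window, or -1 if n == 0. */
static int ex_best_sample(const struct ex_sync_sample *s, size_t n)
{
	size_t i;
	int best = -1;

	for (i = 0; i < n; i++)
		if (best < 0 || s[i].window_ns < s[best].window_ns)
			best = (int)i;
	return best;
}

/* Count samples whose window is within tol_ns of the minimum window. */
static size_t ex_count_good(const struct ex_sync_sample *s, size_t n,
			    uint32_t tol_ns)
{
	int best = ex_best_sample(s, n);
	size_t i, good = 0;

	if (best < 0)
		return 0;
	for (i = 0; i < n; i++)
		if (s[i].window_ns - s[best].window_ns <= tol_ns)
			good++;
	return good;
}
```

Readings with a window much larger than the minimum are the ones taken
while the machine was busy, so they are the ones to discard.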

[...]
> > +static int efx_phc_settime(struct ptp_clock_info *ptp,
> > +			   const struct timespec *e_ts)
> > +{
> > +	/* We must provide this function, but we cannot actually set the time */
> 
> Huh? You can adjtime, so must be able to settime, too, right?

Unless I missed something, the firmware interface doesn't include an
atomic settime operation.  We may be able to fudge it with gettime and
adjtime, if that's good enough.
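
A minimal sketch of that gettime-plus-adjtime fallback, written against
a hypothetical in-memory clock rather than the real MCDI interface
(every name below is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the NIC clock; not the sfc driver's API. */
struct fake_phc {
	int64_t now_ns;	/* current clock value in nanoseconds */
};

static int64_t fake_phc_gettime(const struct fake_phc *phc)
{
	assert(phc != NULL);
	return phc->now_ns;
}

static void fake_phc_adjtime(struct fake_phc *phc, int64_t delta_ns)
{
	assert(phc != NULL);
	phc->now_ns += delta_ns;
}

/* Emulate settime: read the clock, then step it by the difference. */
static void fake_phc_settime(struct fake_phc *phc, int64_t target_ns)
{
	int64_t delta_ns = target_ns - fake_phc_gettime(phc);

	fake_phc_adjtime(phc, delta_ns);
}
```

It is not atomic: an adjustment made by someone else between the read
and the step would be folded into the result, and the usual call
latency still applies.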

> If you have enough range in the RAW timestamp in the MC firmware (like
> 64 bits of nanoseconds), and you allow settime, then you can spare the
> system time synchronization code altogether.
[...]

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply
