Netdev List


* [PATCH v2] net: e100: ucode is optional in some cases
From: Bjørn Mork @ 2012-07-19 16:28 UTC (permalink / raw)
  To: netdev
  Cc: e1000-devel, Bruce Allan, Jesse Brandeburg, John Ronciak,
	Bjørn Mork

  commit 9ac32e1b firmware: convert e100 driver to request_firmware()

did a straight conversion of the in-driver ucode to external
files.  This introduced the possibility of the driver failing
to enable an interface due to missing ucode. There was no
evaluation of the importance of the ucode at the time.

Based on comments in earlier versions of this driver, and in
the source code for the FreeBSD fxp driver, we can assume that
the ucode implements the "CPU Cycle Saver" feature on supported
adapters.  Although generally wanted, this is an optional
feature. The ucode source is not available, preventing it from
being included in free distributions. This creates unnecessary
problems for the end users. Doing a network install based on a
free distribution installer requires the user to download and
insert the ucode into the installer.

Making the ucode optional when possible improves the user
experience and driver usability.

The ucode for some adapters includes a bugfix, making it
essential.  We continue to fail for these adapters unless the
ucode is available.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
v2: removed URLs from the patch, converting them to generic
    descriptions of the sources of information
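
For illustration only (not part of the patch): a minimal userspace
sketch of the load policy described above. The enum values and function
names are hypothetical stand-ins for the driver's own identifiers; only
the rule itself (D102E ucode is mandatory, the others are optional) is
taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's MAC revision values. */
enum mac_rev { MAC_D101M, MAC_D101S, MAC_82551_F, MAC_82551_10, MAC_OTHER };

enum fw_outcome {
	FW_NOT_APPLICABLE,	/* no ucode exists for this device */
	FW_LOADED,		/* ucode found and usable */
	FW_SKIPPED,		/* optional ucode missing: run without CPUSaver */
	FW_FATAL		/* required ucode missing: fail the probe */
};

/* Only the D102E image carries a hardware bug fix, so only it is mandatory. */
static bool ucode_required(enum mac_rev mac)
{
	return mac == MAC_82551_F || mac == MAC_82551_10;
}

/* load_err is 0 on success or a negative errno-style value on failure. */
static enum fw_outcome fw_policy(enum mac_rev mac, int load_err)
{
	if (mac == MAC_OTHER)
		return FW_NOT_APPLICABLE;
	if (load_err == 0)
		return FW_LOADED;
	return ucode_required(mac) ? FW_FATAL : FW_SKIPPED;
}
```

In the driver itself the FW_SKIPPED case corresponds to returning NULL
after an informational message, and FW_FATAL to returning ERR_PTR(err).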


 drivers/net/ethernet/intel/e100.c |   40 ++++++++++++++++++++++++++++---------
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/e100.c b/drivers/net/ethernet/intel/e100.c
index ada720b..535f94f 100644
--- a/drivers/net/ethernet/intel/e100.c
+++ b/drivers/net/ethernet/intel/e100.c
@@ -1249,20 +1249,35 @@ static const struct firmware *e100_request_firmware(struct nic *nic)
 	const struct firmware *fw = nic->fw;
 	u8 timer, bundle, min_size;
 	int err = 0;
+	bool required = false;
 
 	/* do not load u-code for ICH devices */
 	if (nic->flags & ich)
 		return NULL;
 
-	/* Search for ucode match against h/w revision */
-	if (nic->mac == mac_82559_D101M)
+	/* Search for ucode match against h/w revision
+	 *
+	 * Based on comments in the source code for the FreeBSD fxp
+	 * driver, the FIRMWARE_D102E ucode includes both CPUSaver and
+	 *
+	 *    "fixes for bugs in the B-step hardware (specifically, bugs
+	 *     with Inline Receive)."
+	 *
+	 * So we must fail if it cannot be loaded.
+	 *
+	 * The other microcode files are only required for the optional
+	 * CPUSaver feature.  Nice to have, but no reason to fail.
+	 */
+	if (nic->mac == mac_82559_D101M) {
 		fw_name = FIRMWARE_D101M;
-	else if (nic->mac == mac_82559_D101S)
+	} else if (nic->mac == mac_82559_D101S) {
 		fw_name = FIRMWARE_D101S;
-	else if (nic->mac == mac_82551_F || nic->mac == mac_82551_10)
+	} else if (nic->mac == mac_82551_F || nic->mac == mac_82551_10) {
 		fw_name = FIRMWARE_D102E;
-	else /* No ucode on other devices */
+		required = true;
+	} else { /* No ucode on other devices */
 		return NULL;
+	}
 
 	/* If the firmware has not previously been loaded, request a pointer
 	 * to it. If it was previously loaded, we are reinitializing the
@@ -1273,10 +1288,17 @@ static const struct firmware *e100_request_firmware(struct nic *nic)
 		err = request_firmware(&fw, fw_name, &nic->pdev->dev);
 
 	if (err) {
-		netif_err(nic, probe, nic->netdev,
-			  "Failed to load firmware \"%s\": %d\n",
-			  fw_name, err);
-		return ERR_PTR(err);
+		if (required) {
+			netif_err(nic, probe, nic->netdev,
+				  "Failed to load firmware \"%s\": %d\n",
+				  fw_name, err);
+			return ERR_PTR(err);
+		} else {
+			netif_info(nic, probe, nic->netdev,
+				   "CPUSaver disabled. Needs \"%s\": %d\n",
+				   fw_name, err);
+			return NULL;
+		}
 	}
 
 	/* Firmware should be precisely UCODE_SIZE (words) plus three bytes
-- 
1.7.10.4




* Re: [PATCH net-next V1 4/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-19 16:28 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: davem, roland, netdev, oren, yevgenyp, Amir Vadai
In-Reply-To: <1342706957.2617.25.camel@bwh-desktop.uk.solarflarecom.com>

On 7/19/2012 5:09 PM, Ben Hutchings wrote:
>> @@ -77,6 +77,12 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
>> >  	struct mlx4_en_dev *mdev = priv->mdev;
>> >  	int err = 0;
>> >  	char name[25];
>> >+	struct cpu_rmap *rmap =
>> >+#ifdef CONFIG_RFS_ACCEL
>> >+		priv->dev->rx_cpu_rmap;
>> >+#else
>> >+		NULL;
>> >+#endif
> You can write this slightly more cleanly using IS_ENABLED().
>
> [...]
>> >+static struct mlx4_en_filter *
>> >+mlx4_en_filter_alloc(struct mlx4_en_priv *priv, int rxq_index, __be32 src_ip,
>> >+                    __be32 dst_ip, __be16 src_port, __be16 dst_port,
>> >+                    u32 flow_id)
>> >+{
> [...]
>> >+       filter->id = priv->last_filter_id++;
> [...]
>
> You need to limit the filter IDs to be < RPS_NO_FILTER.


OK, thanks, we will send fixes for these two comments.
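
For the second comment, one possible shape of the fix is to wrap the
counter so that the reserved value is never handed out. A userspace
sketch, assuming RPS_NO_FILTER is 0xffff as in <linux/netdevice.h>; the
struct and function names here are hypothetical, not the driver's:

```c
#include <stdint.h>

/* Assumed to match the kernel's definition in <linux/netdevice.h>. */
#define RPS_NO_FILTER 0xffff

/* Hypothetical holder for the per-device counter (priv->last_filter_id). */
struct filter_ids {
	uint32_t last_filter_id;
};

/* Returns an ID in [0, RPS_NO_FILTER - 1] and advances the counter. */
static uint16_t next_filter_id(struct filter_ids *s)
{
	return (uint16_t)(s->last_filter_id++ % RPS_NO_FILTER);
}
```

This only keeps IDs below RPS_NO_FILTER; it does not check that a
wrapped ID is not still in use by a live filter.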

Or.


* Re: [PATCH net-next V1 7/9] net/eipoib: Add main driver functionality
From: Or Gerlitz @ 2012-07-19 16:21 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: davem, roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit
In-Reply-To: <1342714574.2617.43.camel@bwh-desktop.uk.solarflarecom.com>

On 7/19/2012 7:16 PM, Ben Hutchings wrote:
> On Thu, 2012-07-19 at 18:46 +0300, Or Gerlitz wrote:
> To that end, I was under the impression that all three NETIF_F_HW_VLAN_{TX,RX,FILTER} features need to be advertised. From your comment I understand now that RX/TX are enough in that respect?
> [...]
>
> Yes.

OK, good, will fix.

Or.


* Re: [PATCH net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system
From: Or Gerlitz @ 2012-07-19 16:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Shlomo Pongartz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	erezsh-VPRAkNaXOzVWk0Htik3J/w,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <alpine.DEB.2.00.1207191023130.29808-sBS69tsa9Uj/9pzu0YdTqQ@public.gmane.org>

On 7/19/2012 6:24 PM, Christoph Lameter wrote:
> On Thu, 19 Jul 2012, Shlomo Pongartz wrote:
>
>> The garbage collection and stale times follow the default ipv4/6 neigh.default.gc_yyy
>> sysctl values, for example
>>
>> net.ipv4.neigh.default.gc_interval = 30
>> net.ipv4.neigh.default.gc_stale_time = 60
>>
>> If given access to these values from IPoIB, we will be happy
>> to integrate them into that logic
>
> It looks like the values are hardcoded right now.

Two points here:

1st, they are indeed hard-coded, since there's no define/enum
that holds their default values (or maybe we should add one now?); see
this code snippet from net/ipv4/arp.c:

>         .gc_interval    = 30 * HZ,
>         .gc_thresh1     = 128,
>         .gc_thresh2     = 512,
>         .gc_thresh3     = 1024,

2nd, and even more interesting, the little challenge here is how
to integrate with the sysctls that allow changing these values:
the mechanism that uses neigh_sysctl_table in net/core/neighbour.c isn't
exported to the rest of the world, and there's no point in defining new
sysctl entries just for managing the IPoIB neighbours. Ideas welcome.


>> Please clarify what do you mean by group expiration.
>
> If you have neighbor expiration periods of 4 hrs and it is necessary to
> run the expiration logic then please expire all the neighbor entries due a
> certain period after that as well to avoid running the expiration again in
> the next minute or so.


This is still a bit unclear... do you mean that at a certain point in
time **all** entries need to be deleted, irrespective of their (jiffies)
age? Why?

> I guess the fuzz factor needs to scale depending on the expiration period.
>
>

And this is what happens now: the factor is 0.5, so an entry is deleted
when (60s <= unused < 90s) holds.
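
To make the window concrete, here is a small userspace model of the
reaper timing, working in whole seconds instead of jiffies. It assumes
the GC runs at exact multiples of 30 seconds and uses a strict
"later than" comparison in the spirit of time_after(); the function
names are made up for this sketch, and the exact boundaries of the
window depend on that strictness.

```c
#include <stdbool.h>

enum { GC_PERIOD_SEC = 30, OBSOLETE_SEC = 2 * GC_PERIOD_SEC };

/* Wraparound-safe "a is later than b", in the spirit of time_after(). */
static bool later_than(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Idle time, in seconds, at which an entry last used at last_used gets
 * reaped, with GC passes at multiples of GC_PERIOD_SEC.
 */
static unsigned long idle_when_reaped(unsigned long last_used)
{
	unsigned long now = (last_used / GC_PERIOD_SEC + 1) * GC_PERIOD_SEC;

	for (;; now += GC_PERIOD_SEC) {
		unsigned long obsolete = now - OBSOLETE_SEC;

		/* reap only when the entry is older than the cutoff */
		if (later_than(obsolete, last_used))
			return now - last_used;
	}
}
```

In this model an entry is always reaped after more than 60 and at most
90 seconds of idleness, whatever its last-use time was relative to the
GC schedule.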

Or.


* Re: [PATCH net-next V1 7/9] net/eipoib: Add main driver functionality
From: Ben Hutchings @ 2012-07-19 16:16 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit
In-Reply-To: <50082BDE.2040005@mellanox.com>

On Thu, 2012-07-19 at 18:46 +0300, Or Gerlitz wrote:
> On 7/19/2012 4:49 PM, Ben Hutchings wrote:
> > On Wed, 2012-07-18 at 14:00 +0300, Or Gerlitz wrote:
[...]
> >> +	.ndo_vlan_rx_add_vid = eth_ipoib_vlan_rx_add_vid,
> >> +	.ndo_vlan_rx_kill_vid = eth_ipoib_vlan_rx_kill_vid,
> >
> > These shouldn't be needed.
> 
> OK, here's the point: the eIPoIB driver maps Ethernet VLANs to
> InfiniBand/IPoIB pkeys (partition keys). The underlying IPoIB devices
> handle these pkeys in a HW-accelerated way, and we want the eIPoIB
> driver to be considered one that supports HW-accelerated VLANs. E.g. on
> the TX flow we don't want the 8021q driver to do any special SW handling
> of the skb except for setting the skb->vlan_tci field, and in the RX
> flow we set the skb->vlan_tci field and don't want 8021q to try to
> extract it from the headers, etc.
> 
> To that end, I was under the impression that all three
> NETIF_F_HW_VLAN_{TX,RX,FILTER} features need to be advertised. From
> your comment I understand now that RX/TX are enough in that respect?
[...]

Yes.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH RESEND net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system
From: Or Gerlitz @ 2012-07-19 16:15 UTC (permalink / raw)
  To: roland, davem; +Cc: linux-rdma, erezsh, netdev, Shlomo Pongratz, Or Gerlitz
In-Reply-To: <1342714502-11301-1-git-send-email-ogerlitz@mellanox.com>

From: Shlomo Pongratz <shlomop@mellanox.com>

Dave Miller <davem@davemloft.net> provided a detailed description of why the
way IPoIB is using neighbours for its own ipoib_neigh struct is buggy:

Any time an ipoib_neigh is changed, a sequence like the following is made:

			spin_lock_irqsave(&priv->lock, flags);
			/*
			 * It's safe to call ipoib_put_ah() inside
			 * priv->lock here, because we know that
			 * path->ah will always hold one more reference,
			 * so ipoib_put_ah() will never do more than
			 * decrement the ref count.
			 */
			if (neigh->ah)
				ipoib_put_ah(neigh->ah);
			list_del(&neigh->list);
			ipoib_neigh_free(dev, neigh);
			spin_unlock_irqrestore(&priv->lock, flags);
			ipoib_path_lookup(skb, n, dev);

This doesn't work, because you're leaving a stale pointer to the freed up
ipoib_neigh in the special neigh->ha pointer cookie.  Yes, it even fails
with all the locking done to protect _changes_ to *ipoib_neigh(n), and
with the code in ipoib_neigh_free() that NULLs out the pointer.

The core issue is that read side calls to *to_ipoib_neigh(n) are not
being synchronized at all, they are performed without any locking.  So
whether we hold the lock or not when making changes to *ipoib_neigh(n)
you still can have threads see references to freed up ipoib_neigh
objects.

	cpu 1			cpu 2
	n = *ipoib_neigh()
				*ipoib_neigh() = NULL
				kfree(n)
	n->foo == OOPS

[..]

Perhaps the ipoib code can have a private path database it manages
entirely itself, which holds all the necessary information and is
looked up by some generic key which is available easily at transmit
time and does not involve generic neighbour entries. -- end of quote

See the note at http://marc.info/?l=linux-rdma&m=132812793105624&w=2 and the full discussion at
http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b

This patch aims to solve the race conditions found in the IPoIB driver.

The patch breaks the connection between the core networking neighbour structure
and the ipoib_neigh structure. In addition to avoiding the race, it allows SKBs
carrying IP packets that don't have any associated neighbour to be transmitted
through IPoIB.

We add an ipoib_neigh hash table with 1024 buckets. The hash table key is the destination
hardware address. Thus the ipoib_neigh is fetched from the hash table and not
dereferenced from the stashed location at the neighbour structure. The hash table uses
both RCU and reference count mechanisms to guarantee that no ipoib_neigh instance is
ever deleted while in use.

Fetching the ipoib_neigh structure instance from the hash also makes the special
code in ipoib_start_xmit that handles remote and local bonding failover redundant.

Aged ipoib_neigh instances are deleted by a garbage collection task that runs every
30 seconds and deletes every ipoib_neigh instance that was idle for at least 60
seconds. The deletion is safe since the ipoib_neigh instances are protected
using RCU and reference count mechanisms.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |   59 ++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |   16 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  638 +++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   57 +--
 4 files changed, 535 insertions(+), 235 deletions(-)
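
For readers new to the scheme, the lookup described in the changelog can
be sketched in userspace C as below. This is a simplified model, not the
patch's code: RCU is left out entirely (so it is only safe
single-threaded), the hash is a placeholder FNV-1a rather than jhash
over selected octets, and all names are hypothetical. What it does show
is the "take a reference unless the count is already zero" step that
keeps a lookup from reviving an entry that is being destroyed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALEN     20	/* same length as INFINIBAND_ALEN */
#define NBUCKETS 1024	/* 1 << 10, as in the patch description */

struct neigh {
	uint8_t daddr[ALEN];	/* lookup key: destination hardware address */
	struct neigh *hnext;	/* next entry in the same bucket */
	atomic_int refcnt;	/* 0 means the entry is being destroyed */
};

struct neigh_table {
	struct neigh *buckets[NBUCKETS];
};

/* Placeholder hash (FNV-1a over the whole address); the patch uses jhash. */
static uint32_t addr_hash(const uint8_t *daddr)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < ALEN; i++)
		h = (h ^ daddr[i]) * 16777619u;
	return h & (NBUCKETS - 1);
}

/* Take a reference unless the count has already dropped to zero. */
static bool ref_get_unless_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0)
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	return false;
}

/* Link an entry at the head of its bucket. */
static void table_insert(struct neigh_table *t, struct neigh *n)
{
	uint32_t b = addr_hash(n->daddr);

	n->hnext = t->buckets[b];
	t->buckets[b] = n;
}

/* Return the matching entry with one extra reference, or NULL. */
static struct neigh *table_get(struct neigh_table *t, const uint8_t *daddr)
{
	for (struct neigh *n = t->buckets[addr_hash(daddr)]; n; n = n->hnext)
		if (memcmp(n->daddr, daddr, ALEN) == 0)
			return ref_get_unless_zero(&n->refcnt) ? n : NULL;
	return NULL;
}
```

A caller that gets a non-NULL result owns one reference and has to drop
it when done, which is the role ipoib_neigh_put() plays in the patch.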

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 86df632..ea765e1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -92,6 +92,8 @@ enum {
 	IPOIB_STOP_REAPER	  = 7,
 	IPOIB_FLAG_ADMIN_CM	  = 9,
 	IPOIB_FLAG_UMCAST	  = 10,
+	IPOIB_STOP_NEIGH_GC	  = 11,
+	IPOIB_NEIGH_TBL_FLUSH	  = 12,
 
 	IPOIB_MAX_BACKOFF_SECONDS = 16,
 
@@ -260,6 +262,25 @@ struct ipoib_ethtool_st {
 	u16     max_coalesced_frames;
 };
 
+enum {
+	IPOIB_NEIGH_LOG_INIT_SIZE	= 10,
+	IPOIB_NEIGH_GC_SEC		= 30
+};
+
+struct ipoib_neigh_hash {
+	struct ipoib_neigh __rcu **buckets;
+	struct rcu_head rcu;
+	u32 mask;
+	u32 size;
+};
+
+struct ipoib_neigh_table {
+	struct ipoib_neigh_hash __rcu *htbl;
+	rwlock_t rwlock;
+	atomic_t entries;
+	struct completion flushed;
+};
+
 /*
  * Device private locking: network stack tx_lock protects members used
  * in TX fast path, lock protects everything else.  lock nests inside
@@ -279,6 +300,8 @@ struct ipoib_dev_priv {
 	struct rb_root  path_tree;
 	struct list_head path_list;
 
+	struct ipoib_neigh_table ntbl;
+
 	struct ipoib_mcast *broadcast;
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
@@ -291,7 +314,7 @@ struct ipoib_dev_priv {
 	struct work_struct flush_heavy;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
-
+	struct delayed_work neigh_reap_task;
 	struct ib_device *ca;
 	u8		  port;
 	u16		  pkey;
@@ -377,13 +400,16 @@ struct ipoib_neigh {
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 	struct ipoib_cm_tx *cm;
 #endif
-	union ib_gid	    dgid;
+	u8     daddr[INFINIBAND_ALEN];
 	struct sk_buff_head queue;
 
-	struct neighbour   *neighbour;
 	struct net_device *dev;
 
 	struct list_head    list;
+	struct ipoib_neigh __rcu *hnext;
+	struct rcu_head     rcu;
+	atomic_t	    refcnt;
+	unsigned long       alive;
 };
 
 #define IPOIB_UD_MTU(ib_mtu)		(ib_mtu - IPOIB_ENCAP_LEN)
@@ -394,21 +420,17 @@ static inline int ipoib_ud_need_sg(unsigned int ib_mtu)
 	return IPOIB_UD_BUF_SIZE(ib_mtu) > PAGE_SIZE;
 }
 
-/*
- * We stash a pointer to our private neighbour information after our
- * hardware address in neigh->ha.  The ALIGN() expression here makes
- * sure that this pointer is stored aligned so that an unaligned
- * load is not needed to dereference it.
- */
-static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
+void ipoib_neigh_dtor(struct ipoib_neigh *neigh);
+static inline void ipoib_neigh_put(struct ipoib_neigh *neigh)
 {
-	return (void*) neigh + ALIGN(offsetof(struct neighbour, ha) +
-				     INFINIBAND_ALEN, sizeof(void *));
+	if (atomic_dec_and_test(&neigh->refcnt))
+		ipoib_neigh_dtor(neigh);
 }
-
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh,
+struct ipoib_neigh *ipoib_neigh_get(struct net_device *dev, u8 *daddr);
+struct ipoib_neigh *ipoib_neigh_alloc(u8 *daddr,
 				      struct net_device *dev);
-void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh);
+void ipoib_neigh_free(struct ipoib_neigh *neigh);
+void ipoib_del_neighs_by_gid(struct net_device *dev, u8 *gid);
 
 extern struct workqueue_struct *ipoib_workqueue;
 
@@ -425,7 +447,6 @@ static inline void ipoib_put_ah(struct ipoib_ah *ah)
 {
 	kref_put(&ah->ref, ipoib_free_ah);
 }
-
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 int ipoib_add_umcast_attr(struct net_device *dev);
@@ -455,7 +476,7 @@ void ipoib_dev_cleanup(struct net_device *dev);
 
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
-void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb);
+void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
@@ -517,10 +538,10 @@ static inline int ipoib_cm_admin_enabled(struct net_device *dev)
 		test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
 }
 
-static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n)
+static inline int ipoib_cm_enabled(struct net_device *dev, u8 *hwaddr)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	return IPOIB_CM_SUPPORTED(n->ha) &&
+	return IPOIB_CM_SUPPORTED(hwaddr) &&
 		test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 1ca7322..19bc95a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -811,9 +811,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 		if (neigh) {
 			neigh->cm = NULL;
 			list_del(&neigh->list);
-			if (neigh->ah)
-				ipoib_put_ah(neigh->ah);
-			ipoib_neigh_free(dev, neigh);
+			ipoib_neigh_free(neigh);
 
 			tx->neigh = NULL;
 		}
@@ -1230,9 +1228,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 		if (neigh) {
 			neigh->cm = NULL;
 			list_del(&neigh->list);
-			if (neigh->ah)
-				ipoib_put_ah(neigh->ah);
-			ipoib_neigh_free(dev, neigh);
+			ipoib_neigh_free(neigh);
 
 			tx->neigh = NULL;
 		}
@@ -1279,7 +1275,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
 		list_move(&tx->list, &priv->cm.reap_list);
 		queue_work(ipoib_workqueue, &priv->cm.reap_task);
 		ipoib_dbg(priv, "Reap connection for gid %pI6\n",
-			  tx->neigh->dgid.raw);
+			  tx->neigh->daddr + 4);
 		tx->neigh = NULL;
 	}
 }
@@ -1304,7 +1300,7 @@ static void ipoib_cm_tx_start(struct work_struct *work)
 		p = list_entry(priv->cm.start_list.next, typeof(*p), list);
 		list_del_init(&p->list);
 		neigh = p->neigh;
-		qpn = IPOIB_QPN(neigh->neighbour->ha);
+		qpn = IPOIB_QPN(neigh->daddr);
 		memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);
 
 		spin_unlock_irqrestore(&priv->lock, flags);
@@ -1320,9 +1316,7 @@ static void ipoib_cm_tx_start(struct work_struct *work)
 			if (neigh) {
 				neigh->cm = NULL;
 				list_del(&neigh->list);
-				if (neigh->ah)
-					ipoib_put_ah(neigh->ah);
-				ipoib_neigh_free(dev, neigh);
+				ipoib_neigh_free(neigh);
 			}
 			list_del(&p->list);
 			kfree(p);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index bbee4b2..d07c7b9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -46,7 +46,7 @@
 #include <linux/ip.h>
 #include <linux/in.h>
 
-#include <net/dst.h>
+#include <linux/jhash.h>
 
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("IP-over-InfiniBand net driver");
@@ -84,6 +84,7 @@ struct ib_sa_client ipoib_sa_client;
 
 static void ipoib_add_one(struct ib_device *device);
 static void ipoib_remove_one(struct ib_device *device);
+static void ipoib_neigh_reclaim(struct rcu_head *rp);
 
 static struct ib_client ipoib_client = {
 	.name   = "ipoib",
@@ -264,30 +265,15 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path)
 
 static void path_free(struct net_device *dev, struct ipoib_path *path)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_neigh *neigh, *tn;
 	struct sk_buff *skb;
-	unsigned long flags;
 
 	while ((skb = __skb_dequeue(&path->queue)))
 		dev_kfree_skb_irq(skb);
 
-	spin_lock_irqsave(&priv->lock, flags);
-
-	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
-		/*
-		 * It's safe to call ipoib_put_ah() inside priv->lock
-		 * here, because we know that path->ah will always
-		 * hold one more reference, so ipoib_put_ah() will
-		 * never do more than decrement the ref count.
-		 */
-		if (neigh->ah)
-			ipoib_put_ah(neigh->ah);
-
-		ipoib_neigh_free(dev, neigh);
-	}
+	ipoib_dbg(netdev_priv(dev), "path_free\n");
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	/* remove all neigh connected to this path */
+	ipoib_del_neighs_by_gid(dev, path->pathrec.dgid.raw);
 
 	if (path->ah)
 		ipoib_put_ah(path->ah);
@@ -458,19 +444,15 @@ static void path_rec_completion(int status,
 			}
 			kref_get(&path->ah->ref);
 			neigh->ah = path->ah;
-			memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
-			       sizeof(union ib_gid));
 
-			if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+			if (ipoib_cm_enabled(dev, neigh->daddr)) {
 				if (!ipoib_cm_get(neigh))
 					ipoib_cm_set(neigh, ipoib_cm_create_tx(dev,
 									       path,
 									       neigh));
 				if (!ipoib_cm_get(neigh)) {
 					list_del(&neigh->list);
-					if (neigh->ah)
-						ipoib_put_ah(neigh->ah);
-					ipoib_neigh_free(dev, neigh);
+					ipoib_neigh_free(neigh);
 					continue;
 				}
 			}
@@ -555,15 +537,15 @@ static int path_rec_start(struct net_device *dev,
 	return 0;
 }
 
-/* called with rcu_read_lock */
-static void neigh_add_path(struct sk_buff *skb, struct neighbour *n, struct net_device *dev)
+static void neigh_add_path(struct sk_buff *skb, u8 *daddr,
+				struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path;
 	struct ipoib_neigh *neigh;
 	unsigned long flags;
 
-	neigh = ipoib_neigh_alloc(n, skb->dev);
+	neigh = ipoib_neigh_alloc(daddr, dev);
 	if (!neigh) {
 		++dev->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
@@ -572,9 +554,9 @@ static void neigh_add_path(struct sk_buff *skb, struct neighbour *n, struct net_
 
 	spin_lock_irqsave(&priv->lock, flags);
 
-	path = __path_find(dev, n->ha + 4);
+	path = __path_find(dev, daddr + 4);
 	if (!path) {
-		path = path_rec_create(dev, n->ha + 4);
+		path = path_rec_create(dev, daddr + 4);
 		if (!path)
 			goto err_path;
 
@@ -586,17 +568,13 @@ static void neigh_add_path(struct sk_buff *skb, struct neighbour *n, struct net_
 	if (path->ah) {
 		kref_get(&path->ah->ref);
 		neigh->ah = path->ah;
-		memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
-		       sizeof(union ib_gid));
 
-		if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+		if (ipoib_cm_enabled(dev, neigh->daddr)) {
 			if (!ipoib_cm_get(neigh))
 				ipoib_cm_set(neigh, ipoib_cm_create_tx(dev, path, neigh));
 			if (!ipoib_cm_get(neigh)) {
 				list_del(&neigh->list);
-				if (neigh->ah)
-					ipoib_put_ah(neigh->ah);
-				ipoib_neigh_free(dev, neigh);
+				ipoib_neigh_free(neigh);
 				goto err_drop;
 			}
 			if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
@@ -608,7 +586,8 @@ static void neigh_add_path(struct sk_buff *skb, struct neighbour *n, struct net_
 			}
 		} else {
 			spin_unlock_irqrestore(&priv->lock, flags);
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(n->ha));
+			ipoib_send(dev, skb, path->ah, IPOIB_QPN(daddr));
+			ipoib_neigh_put(neigh);
 			return;
 		}
 	} else {
@@ -621,35 +600,20 @@ static void neigh_add_path(struct sk_buff *skb, struct neighbour *n, struct net_
 	}
 
 	spin_unlock_irqrestore(&priv->lock, flags);
+	ipoib_neigh_put(neigh);
 	return;
 
 err_list:
 	list_del(&neigh->list);
 
 err_path:
-	ipoib_neigh_free(dev, neigh);
+	ipoib_neigh_free(neigh);
 err_drop:
 	++dev->stats.tx_dropped;
 	dev_kfree_skb_any(skb);
 
 	spin_unlock_irqrestore(&priv->lock, flags);
-}
-
-/* called with rcu_read_lock */
-static void ipoib_path_lookup(struct sk_buff *skb, struct neighbour *n, struct net_device *dev)
-{
-	struct ipoib_dev_priv *priv = netdev_priv(skb->dev);
-
-	/* Look up path record for unicasts */
-	if (n->ha[4] != 0xff) {
-		neigh_add_path(skb, n, dev);
-		return;
-	}
-
-	/* Add in the P_Key for multicasts */
-	n->ha[8] = (priv->pkey >> 8) & 0xff;
-	n->ha[9] = priv->pkey & 0xff;
-	ipoib_mcast_send(dev, n->ha + 4, skb);
+	ipoib_neigh_put(neigh);
 }
 
 static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
@@ -710,96 +674,80 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_neigh *neigh;
-	struct neighbour *n = NULL;
+	struct ipoib_cb *cb = (struct ipoib_cb *) skb->cb;
+	struct ipoib_header *header;
 	unsigned long flags;
 
-	rcu_read_lock();
-	if (likely(skb_dst(skb))) {
-		n = dst_neigh_lookup_skb(skb_dst(skb), skb);
-		if (!n) {
+	header = (struct ipoib_header *) skb->data;
+
+	if (unlikely(cb->hwaddr[4] == 0xff)) {
+		/* multicast, arrange "if" according to probability */
+		if ((header->proto != htons(ETH_P_IP)) &&
+		    (header->proto != htons(ETH_P_IPV6)) &&
+		    (header->proto != htons(ETH_P_ARP)) &&
+		    (header->proto != htons(ETH_P_RARP))) {
+			/* ethertype not supported by IPoIB */
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
-			goto unlock;
+			return NETDEV_TX_OK;
 		}
+		/* Add in the P_Key for multicast*/
+		cb->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+		cb->hwaddr[9] = priv->pkey & 0xff;
+
+		neigh = ipoib_neigh_get(dev, cb->hwaddr);
+		if (likely(neigh))
+			goto send_using_neigh;
+		ipoib_mcast_send(dev, cb->hwaddr, skb);
+		return NETDEV_TX_OK;
 	}
-	if (likely(n)) {
-		if (unlikely(!*to_ipoib_neigh(n))) {
-			ipoib_path_lookup(skb, n, dev);
-			goto unlock;
-		}
 
-		neigh = *to_ipoib_neigh(n);
-
-		if (unlikely((memcmp(&neigh->dgid.raw,
-				     n->ha + 4,
-				     sizeof(union ib_gid))) ||
-			     (neigh->dev != dev))) {
-			spin_lock_irqsave(&priv->lock, flags);
-			/*
-			 * It's safe to call ipoib_put_ah() inside
-			 * priv->lock here, because we know that
-			 * path->ah will always hold one more reference,
-			 * so ipoib_put_ah() will never do more than
-			 * decrement the ref count.
-			 */
-			if (neigh->ah)
-				ipoib_put_ah(neigh->ah);
-			list_del(&neigh->list);
-			ipoib_neigh_free(dev, neigh);
-			spin_unlock_irqrestore(&priv->lock, flags);
-			ipoib_path_lookup(skb, n, dev);
-			goto unlock;
+	/* unicast, arrange "switch" according to probability */
+	switch (header->proto) {
+	case htons(ETH_P_IP):
+	case htons(ETH_P_IPV6):
+		neigh = ipoib_neigh_get(dev, cb->hwaddr);
+		if (unlikely(!neigh)) {
+			neigh_add_path(skb, cb->hwaddr, dev);
+			return NETDEV_TX_OK;
 		}
+		break;
+	case htons(ETH_P_ARP):
+	case htons(ETH_P_RARP):
+		/* for unicast ARP and RARP should always perform path find */
+		unicast_arp_send(skb, dev, cb);
+		return NETDEV_TX_OK;
+	default:
+		/* ethertype not supported by IPoIB */
+		++dev->stats.tx_dropped;
+		dev_kfree_skb_any(skb);
+		return NETDEV_TX_OK;
+	}
 
-		if (ipoib_cm_get(neigh)) {
-			if (ipoib_cm_up(neigh)) {
-				ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
-				goto unlock;
-			}
-		} else if (neigh->ah) {
-			ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(n->ha));
-			goto unlock;
+send_using_neigh:
+	/* note we now hold a ref to neigh */
+	if (ipoib_cm_get(neigh)) {
+		if (ipoib_cm_up(neigh)) {
+			ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
+			goto unref;
 		}
+	} else if (neigh->ah) {
+		ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(cb->hwaddr));
+		goto unref;
+	}
 
-		if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
-			spin_lock_irqsave(&priv->lock, flags);
-			__skb_queue_tail(&neigh->queue, skb);
-			spin_unlock_irqrestore(&priv->lock, flags);
-		} else {
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-		}
+	if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
+		spin_lock_irqsave(&priv->lock, flags);
+		__skb_queue_tail(&neigh->queue, skb);
+		spin_unlock_irqrestore(&priv->lock, flags);
 	} else {
-		struct ipoib_cb *cb = (struct ipoib_cb *) skb->cb;
-
-		if (cb->hwaddr[4] == 0xff) {
-			/* Add in the P_Key for multicast*/
-			cb->hwaddr[8] = (priv->pkey >> 8) & 0xff;
-			cb->hwaddr[9] = priv->pkey & 0xff;
+		++dev->stats.tx_dropped;
+		dev_kfree_skb_any(skb);
+	}
 
-			ipoib_mcast_send(dev, cb->hwaddr + 4, skb);
-		} else {
-			/* unicast GID -- should be ARP or RARP reply */
-
-			if ((be16_to_cpup((__be16 *) skb->data) != ETH_P_ARP) &&
-			    (be16_to_cpup((__be16 *) skb->data) != ETH_P_RARP)) {
-				ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x %pI6\n",
-					   skb_dst(skb) ? "neigh" : "dst",
-					   be16_to_cpup((__be16 *) skb->data),
-					   IPOIB_QPN(cb->hwaddr),
-					   cb->hwaddr + 4);
-				dev_kfree_skb_any(skb);
-				++dev->stats.tx_dropped;
-				goto unlock;
-			}
+unref:
+	ipoib_neigh_put(neigh);
 
-			unicast_arp_send(skb, dev, cb);
-		}
-	}
-unlock:
-	if (n)
-		neigh_release(n);
-	rcu_read_unlock();
 	return NETDEV_TX_OK;
 }
 
@@ -821,6 +769,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
 			     const void *daddr, const void *saddr, unsigned len)
 {
 	struct ipoib_header *header;
+	struct ipoib_cb *cb = (struct ipoib_cb *) skb->cb;
 
 	header = (struct ipoib_header *) skb_push(skb, sizeof *header);
 
@@ -828,14 +777,11 @@ static int ipoib_hard_header(struct sk_buff *skb,
 	header->reserved = 0;
 
 	/*
-	 * If we don't have a dst_entry structure, stuff the
+	 * we don't rely on dst_entry structure,  always stuff the
 	 * destination address into skb->cb so we can figure out where
 	 * to send the packet later.
 	 */
-	if (!skb_dst(skb)) {
-		struct ipoib_cb *cb = (struct ipoib_cb *) skb->cb;
-		memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
-	}
+	memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
 
 	return 0;
 }
@@ -852,86 +798,431 @@ static void ipoib_set_mcast_list(struct net_device *dev)
 	queue_work(ipoib_workqueue, &priv->restart_task);
 }
 
-static void ipoib_neigh_cleanup(struct neighbour *n)
+static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
 {
-	struct ipoib_neigh *neigh;
-	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
+	/*
+	 * Use only the address parts that contributes to spreading
+	 * The subnet prefix is not used as one can not connect to
+	 * same remote port (GUID) using the same remote QPN via two
+	 * different subnets.
+	 */
+	 /* qpn octets[1:4) & port GUID octets[12:20) */
+	return jhash(daddr+12, 8, 0xFFFFFF & *(u32 *) daddr) & htbl->mask;
+}
+
+struct ipoib_neigh *ipoib_neigh_get(struct net_device *dev, u8 *daddr)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_neigh_table *ntbl = &priv->ntbl;
+	struct ipoib_neigh_hash *htbl;
+	struct ipoib_neigh *neigh = NULL;
+	u32 hash_val;
+
+	rcu_read_lock_bh();
+
+	htbl = rcu_dereference_bh(ntbl->htbl);
+
+	if (!htbl)
+		goto out_unlock;
+
+	hash_val = ipoib_addr_hash(htbl, daddr);
+	for (neigh = rcu_dereference_bh(htbl->buckets[hash_val]);
+	     neigh != NULL;
+	     neigh = rcu_dereference_bh(neigh->hnext)) {
+		if (memcmp(daddr, neigh->daddr, INFINIBAND_ALEN) == 0) {
+			/* found, take one ref on behalf of the caller */
+			if (!atomic_inc_not_zero(&neigh->refcnt)) {
+				/* deleted */
+				neigh = NULL;
+				goto out_unlock;
+			}
+			neigh->alive = jiffies;
+			goto out_unlock;
+		}
+	}
+out_unlock:
+	rcu_read_unlock_bh();
+	return neigh;
+}
+
+static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_neigh_table *ntbl = &priv->ntbl;
+	struct ipoib_neigh_hash *htbl;
+	unsigned long neigh_obsolete;
+	unsigned long dt;
 	unsigned long flags;
-	struct ipoib_ah *ah = NULL;
+	int i;
 
-	neigh = *to_ipoib_neigh(n);
-	if (neigh)
-		priv = netdev_priv(neigh->dev);
-	else
+	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
 		return;
-	ipoib_dbg(priv,
-		  "neigh_cleanup for %06x %pI6\n",
-		  IPOIB_QPN(n->ha),
-		  n->ha + 4);
 
-	spin_lock_irqsave(&priv->lock, flags);
+	write_lock_bh(&ntbl->rwlock);
+
+	htbl = rcu_dereference_protected(ntbl->htbl,
+					lockdep_is_held(&ntbl->rwlock));
+
+	if (!htbl)
+		goto out_unlock;
+
+	/* neigh is obsolete if it was idle for two GC periods */
+	dt = msecs_to_jiffies(2 * 1000 * IPOIB_NEIGH_GC_SEC);
+	neigh_obsolete = jiffies - dt;
+	/* handle possible race condition */
+	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
+		goto out_unlock;
+
+	for (i = 0; i < htbl->size; i++) {
+		struct ipoib_neigh *neigh;
+		struct ipoib_neigh __rcu **np = &htbl->buckets[i];
+
+		while ((neigh = rcu_dereference_protected(*np,
+				lockdep_is_held(&ntbl->rwlock))) != NULL) {
+			/* was the neigh idle for two GC periods */
+			if (time_after(neigh_obsolete, neigh->alive)) {
+				rcu_assign_pointer(*np,
+				   rcu_dereference_protected(neigh->hnext,
+					lockdep_is_held(&ntbl->rwlock)));
+				/* remove from path/mc list */
+				spin_lock_irqsave(&priv->lock, flags);
+				list_del(&neigh->list);
+				spin_unlock_irqrestore(&priv->lock, flags);
+				call_rcu(&neigh->rcu, ipoib_neigh_reclaim);
+			} else {
+				np = &neigh->hnext;
+			}
 
-	if (neigh->ah)
-		ah = neigh->ah;
-	list_del(&neigh->list);
-	ipoib_neigh_free(n->dev, neigh);
+		}
+	}
+out_unlock:
+	write_unlock_bh(&ntbl->rwlock);
+}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+static void ipoib_reap_neigh(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, neigh_reap_task.work);
+	unsigned long dt = msecs_to_jiffies(1000 * IPOIB_NEIGH_GC_SEC);
+
+	__ipoib_reap_neigh(priv);
 
-	if (ah)
-		ipoib_put_ah(ah);
+	if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
+		queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+			round_jiffies_relative(dt));
 }
 
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
+
+static struct ipoib_neigh *ipoib_neigh_ctor(u8 *daddr,
 				      struct net_device *dev)
 {
 	struct ipoib_neigh *neigh;
 
-	neigh = kmalloc(sizeof *neigh, GFP_ATOMIC);
+	neigh = kzalloc(sizeof *neigh, GFP_ATOMIC);
 	if (!neigh)
 		return NULL;
 
-	neigh->neighbour = neighbour;
 	neigh->dev = dev;
-	memset(&neigh->dgid.raw, 0, sizeof (union ib_gid));
-	*to_ipoib_neigh(neighbour) = neigh;
+	memcpy(&neigh->daddr, daddr, sizeof(neigh->daddr));
 	skb_queue_head_init(&neigh->queue);
+	INIT_LIST_HEAD(&neigh->list);
 	ipoib_cm_set(neigh, NULL);
+	/* one ref on behalf of the caller */
+	atomic_set(&neigh->refcnt, 1);
+
+	return neigh;
+}
+
+struct ipoib_neigh *ipoib_neigh_alloc(u8 *daddr,
+				      struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_neigh_table *ntbl = &priv->ntbl;
+	struct ipoib_neigh_hash *htbl;
+	struct ipoib_neigh *neigh;
+	u32 hash_val;
+
+	write_lock_bh(&ntbl->rwlock);
+
+	htbl = rcu_dereference_protected(ntbl->htbl,
+					 lockdep_is_held(&ntbl->rwlock));
+	if (!htbl) {
+		neigh = NULL;
+		goto out_unlock;
+	}
+
+	/* need to add a new neigh, but maybe some other thread succeeded?
+	 * recalc hash, maybe hash resize took place so we do a search
+	 */
+	hash_val = ipoib_addr_hash(htbl, daddr);
+	for (neigh = rcu_dereference_protected(htbl->buckets[hash_val],
+					    lockdep_is_held(&ntbl->rwlock));
+	     neigh != NULL;
+	     neigh = rcu_dereference_protected(neigh->hnext,
+					lockdep_is_held(&ntbl->rwlock))) {
+		if (memcmp(daddr, neigh->daddr, INFINIBAND_ALEN) == 0) {
+			/* found, take one ref on behalf of the caller */
+			if (!atomic_inc_not_zero(&neigh->refcnt)) {
+				/* deleted */
+				neigh = NULL;
+				break;
+			}
+			neigh->alive = jiffies;
+			goto out_unlock;
+		}
+	}
+
+	neigh = ipoib_neigh_ctor(daddr, dev);
+	if (!neigh)
+		goto out_unlock;
+
+	/* one ref on behalf of the hash table */
+	atomic_inc(&neigh->refcnt);
+	neigh->alive = jiffies;
+	/* put in hash */
+	rcu_assign_pointer(neigh->hnext,
+			rcu_dereference_protected(htbl->buckets[hash_val],
+					lockdep_is_held(&ntbl->rwlock)));
+	rcu_assign_pointer(htbl->buckets[hash_val], neigh);
+	atomic_inc(&ntbl->entries);
+
+out_unlock:
+	write_unlock_bh(&ntbl->rwlock);
 
 	return neigh;
 }
 
-void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
+void ipoib_neigh_dtor(struct ipoib_neigh *neigh)
 {
+	/* neigh reference count was dropped to zero */
+	struct net_device *dev = neigh->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
-	*to_ipoib_neigh(neigh->neighbour) = NULL;
+	if (neigh->ah)
+		ipoib_put_ah(neigh->ah);
 	while ((skb = __skb_dequeue(&neigh->queue))) {
 		++dev->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
 	}
 	if (ipoib_cm_get(neigh))
 		ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
+	ipoib_dbg(netdev_priv(dev),
+		  "neigh free for %06x %pI6\n",
+		  IPOIB_QPN(neigh->daddr),
+		  neigh->daddr + 4);
 	kfree(neigh);
+	if (atomic_dec_and_test(&priv->ntbl.entries)) {
+		if (test_bit(IPOIB_NEIGH_TBL_FLUSH, &priv->flags))
+			complete(&priv->ntbl.flushed);
+	}
+}
+
+static void ipoib_neigh_reclaim(struct rcu_head *rp)
+{
+	/* Called as a result of removal from hash table */
+	struct ipoib_neigh *neigh = container_of(rp, struct ipoib_neigh, rcu);
+	/* note TX context may hold another ref */
+	ipoib_neigh_put(neigh);
 }
 
-static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
+void ipoib_neigh_free(struct ipoib_neigh *neigh)
 {
-	parms->neigh_cleanup = ipoib_neigh_cleanup;
+	struct net_device *dev = neigh->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_neigh_table *ntbl = &priv->ntbl;
+	struct ipoib_neigh_hash *htbl;
+	struct ipoib_neigh __rcu **np;
+	struct ipoib_neigh *n;
+	u32 hash_val;
+
+	write_lock_bh(&ntbl->rwlock);
+
+	htbl = rcu_dereference_protected(ntbl->htbl,
+					lockdep_is_held(&ntbl->rwlock));
+	if (!htbl)
+		goto out_unlock;
+
+	hash_val = ipoib_addr_hash(htbl, neigh->daddr);
+	np = &htbl->buckets[hash_val];
+	for (n = rcu_dereference_protected(*np,
+					    lockdep_is_held(&ntbl->rwlock));
+	     n != NULL;
+	     n = rcu_dereference_protected(n->hnext,
+					lockdep_is_held(&ntbl->rwlock))) {
+		if (n == neigh) {
+			/* found */
+			rcu_assign_pointer(*np,
+			   rcu_dereference_protected(neigh->hnext,
+				lockdep_is_held(&ntbl->rwlock)));
+			call_rcu(&neigh->rcu, ipoib_neigh_reclaim);
+			goto out_unlock;
+		} else {
+			np = &n->hnext;
+		}
+	}
+out_unlock:
+	write_unlock_bh(&ntbl->rwlock);
+
+}
+
+static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_neigh_table *ntbl = &priv->ntbl;
+	struct ipoib_neigh_hash *htbl;
+	struct ipoib_neigh **buckets;
+	unsigned long dt = msecs_to_jiffies(1000 * IPOIB_NEIGH_GC_SEC);
+	u32 size;
+
+	clear_bit(IPOIB_NEIGH_TBL_FLUSH, &priv->flags);
+	ntbl->htbl = NULL;
+	rwlock_init(&ntbl->rwlock);
+	htbl = kzalloc(sizeof(*htbl), GFP_KERNEL);
+	if (!htbl)
+		return -ENOMEM;
+	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
+	size = 1 << IPOIB_NEIGH_LOG_INIT_SIZE;
+	buckets = kzalloc(size * sizeof(*buckets), GFP_KERNEL);
+	if (!buckets) {
+		kfree(htbl);
+		return -ENOMEM;
+	}
+	htbl->size = size;
+	htbl->mask = (size - 1);
+	htbl->buckets = buckets;
+	ntbl->htbl = htbl;
+	atomic_set(&ntbl->entries, 0);
+
+	/* start garbage collection */
+	clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
+	queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+			round_jiffies_relative(dt));
 
 	return 0;
 }
 
+static void neigh_hash_free_rcu(struct rcu_head *head)
+{
+	struct ipoib_neigh_hash *htbl = container_of(head,
+						    struct ipoib_neigh_hash,
+						    rcu);
+	struct ipoib_neigh __rcu **buckets = htbl->buckets;
+
+	kfree(buckets);
+	kfree(htbl);
+}
+
+void ipoib_del_neighs_by_gid(struct net_device *dev, u8 *gid)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_neigh_table *ntbl = &priv->ntbl;
+	struct ipoib_neigh_hash *htbl;
+	unsigned long flags;
+	int i;
+
+	/* remove all neigh connected to a given path or mcast */
+	write_lock_bh(&ntbl->rwlock);
+
+	htbl = rcu_dereference_protected(ntbl->htbl,
+					lockdep_is_held(&ntbl->rwlock));
+
+	if (!htbl)
+		goto out_unlock;
+
+	for (i = 0; i < htbl->size; i++) {
+		struct ipoib_neigh *neigh;
+		struct ipoib_neigh __rcu **np = &htbl->buckets[i];
+
+		while ((neigh = rcu_dereference_protected(*np,
+				lockdep_is_held(&ntbl->rwlock))) != NULL) {
+			/* delete neighs belonging to this parent */
+			if (!memcmp(gid, neigh->daddr + 4, sizeof (union ib_gid))) {
+				rcu_assign_pointer(*np,
+				   rcu_dereference_protected(neigh->hnext,
+					lockdep_is_held(&ntbl->rwlock)));
+				/* remove from parent list */
+				spin_lock_irqsave(&priv->lock, flags);
+				list_del(&neigh->list);
+				spin_unlock_irqrestore(&priv->lock, flags);
+				call_rcu(&neigh->rcu, ipoib_neigh_reclaim);
+			} else {
+				np = &neigh->hnext;
+			}
+
+		}
+	}
+out_unlock:
+	write_unlock_bh(&ntbl->rwlock);
+}
+
+static void ipoib_flush_neighs(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_neigh_table *ntbl = &priv->ntbl;
+	struct ipoib_neigh_hash *htbl;
+	unsigned long flags;
+	int i;
+
+	write_lock_bh(&ntbl->rwlock);
+
+	htbl = rcu_dereference_protected(ntbl->htbl,
+					lockdep_is_held(&ntbl->rwlock));
+	if (!htbl)
+		goto out_unlock;
+
+	for (i = 0; i < htbl->size; i++) {
+		struct ipoib_neigh *neigh;
+		struct ipoib_neigh __rcu **np = &htbl->buckets[i];
+
+		while ((neigh = rcu_dereference_protected(*np,
+				lockdep_is_held(&ntbl->rwlock))) != NULL) {
+			rcu_assign_pointer(*np,
+				   rcu_dereference_protected(neigh->hnext,
+					lockdep_is_held(&ntbl->rwlock)));
+			/* remove from path/mc list */
+			spin_lock_irqsave(&priv->lock, flags);
+			list_del(&neigh->list);
+			spin_unlock_irqrestore(&priv->lock, flags);
+			call_rcu(&neigh->rcu, ipoib_neigh_reclaim);
+		}
+	}
+	rcu_assign_pointer(ntbl->htbl, NULL);
+	call_rcu(&htbl->rcu, neigh_hash_free_rcu);
+out_unlock:
+	write_unlock_bh(&ntbl->rwlock);
+}
+
+static void ipoib_neigh_hash_uninit(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int stopped;
+
+	ipoib_dbg(priv, "ipoib_neigh_hash_uninit\n");
+	init_completion(&priv->ntbl.flushed);
+	set_bit(IPOIB_NEIGH_TBL_FLUSH, &priv->flags);
+
+	/* Stop GC; if called on init failure, cancel the queued work */
+	stopped = test_and_set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
+	if (!stopped)
+		cancel_delayed_work(&priv->neigh_reap_task);
+
+	if (atomic_read(&priv->ntbl.entries)) {
+		ipoib_flush_neighs(priv);
+		wait_for_completion(&priv->ntbl.flushed);
+	}
+}
+
+
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	if (ipoib_neigh_hash_init(priv) < 0)
+		goto out;
 	/* Allocate RX/TX "rings" to hold queued skbs */
 	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
 		       ca->name, ipoib_recvq_size);
-		goto out;
+		goto out_neigh_hash_cleanup;
 	}
 
 	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
@@ -954,6 +1245,8 @@ out_tx_ring_cleanup:
 out_rx_ring_cleanup:
 	kfree(priv->rx_ring);
 
+out_neigh_hash_cleanup:
+	ipoib_neigh_hash_uninit(dev);
 out:
 	return -ENOMEM;
 }
@@ -966,6 +1259,9 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	/* Delete any child interfaces first */
 	list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) {
+		/* Stop GC on child */
+		set_bit(IPOIB_STOP_NEIGH_GC, &cpriv->flags);
+		cancel_delayed_work(&cpriv->neigh_reap_task);
 		unregister_netdev(cpriv->dev);
 		ipoib_dev_cleanup(cpriv->dev);
 		free_netdev(cpriv->dev);
@@ -978,6 +1274,8 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
+
+	ipoib_neigh_hash_uninit(dev);
 }
 
 static const struct header_ops ipoib_header_ops = {
@@ -992,7 +1290,6 @@ static const struct net_device_ops ipoib_netdev_ops = {
 	.ndo_start_xmit	 	 = ipoib_start_xmit,
 	.ndo_tx_timeout		 = ipoib_timeout,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
-	.ndo_neigh_setup	 = ipoib_neigh_setup_dev,
 };
 
 static void ipoib_setup(struct net_device *dev)
@@ -1041,6 +1338,7 @@ static void ipoib_setup(struct net_device *dev)
 	INIT_WORK(&priv->flush_heavy,   ipoib_ib_dev_flush_heavy);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
 	INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah);
+	INIT_DELAYED_WORK(&priv->neigh_reap_task, ipoib_reap_neigh);
 }
 
 struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)
@@ -1281,6 +1579,9 @@ sysfs_failed:
 
 register_failed:
 	ib_unregister_event_handler(&priv->event_handler);
+	/* Stop GC if started before flush */
+	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
+	cancel_delayed_work(&priv->neigh_reap_task);
 	flush_workqueue(ipoib_workqueue);
 
 event_failed:
@@ -1347,6 +1648,9 @@ static void ipoib_remove_one(struct ib_device *device)
 		dev_change_flags(priv->dev, priv->dev->flags & ~IFF_UP);
 		rtnl_unlock();
 
+		/* Stop GC */
+		set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
+		cancel_delayed_work(&priv->neigh_reap_task);
 		flush_workqueue(ipoib_workqueue);
 
 		unregister_netdev(priv->dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 7cecb16..13f4aa7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -69,28 +69,13 @@ struct ipoib_mcast_iter {
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_neigh *neigh, *tmp;
 	int tx_dropped = 0;
 
 	ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group %pI6\n",
 			mcast->mcmember.mgid.raw);
 
-	spin_lock_irq(&priv->lock);
-
-	list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
-		/*
-		 * It's safe to call ipoib_put_ah() inside priv->lock
-		 * here, because we know that mcast->ah will always
-		 * hold one more reference, so ipoib_put_ah() will
-		 * never do more than decrement the ref count.
-		 */
-		if (neigh->ah)
-			ipoib_put_ah(neigh->ah);
-		ipoib_neigh_free(dev, neigh);
-	}
-
-	spin_unlock_irq(&priv->lock);
+	/* remove all neigh connected to this mcast */
+	ipoib_del_neighs_by_gid(dev, mcast->mcmember.mgid.raw);
 
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);
@@ -655,17 +640,12 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 	return 0;
 }
 
-void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
+void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct dst_entry *dst = skb_dst(skb);
 	struct ipoib_mcast *mcast;
-	struct neighbour *n;
 	unsigned long flags;
-
-	n = NULL;
-	if (dst)
-		n = dst_neigh_lookup_skb(dst, skb);
+	void *mgid = daddr + 4;
 
 	spin_lock_irqsave(&priv->lock, flags);
 
@@ -721,28 +701,29 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
 
 out:
 	if (mcast && mcast->ah) {
-		if (n) {
-			if (!*to_ipoib_neigh(n)) {
-				struct ipoib_neigh *neigh;
-
-				neigh = ipoib_neigh_alloc(n, skb->dev);
-				if (neigh) {
-					kref_get(&mcast->ah->ref);
-					neigh->ah	= mcast->ah;
-					list_add_tail(&neigh->list,
-						      &mcast->neigh_list);
-				}
+		struct ipoib_neigh *neigh;
+
+		spin_unlock_irqrestore(&priv->lock, flags);
+		neigh = ipoib_neigh_get(dev, daddr);
+		spin_lock_irqsave(&priv->lock, flags);
+		if (!neigh) {
+			spin_unlock_irqrestore(&priv->lock, flags);
+			neigh = ipoib_neigh_alloc(daddr, dev);
+			spin_lock_irqsave(&priv->lock, flags);
+			if (neigh) {
+				kref_get(&mcast->ah->ref);
+				neigh->ah	= mcast->ah;
+				list_add_tail(&neigh->list, &mcast->neigh_list);
 			}
-			neigh_release(n);
 		}
 		spin_unlock_irqrestore(&priv->lock, flags);
 		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		if (neigh)
+			ipoib_neigh_put(neigh);
 		return;
 	}
 
 unlock:
-	if (n)
-		neigh_release(n);
 	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
-- 
1.7.1


* [PATCH RESEND net/for-next V1 0/1] changes IPoIB neighbour handling
From: Or Gerlitz @ 2012-07-19 16:15 UTC (permalink / raw)
  To: roland, davem; +Cc: linux-rdma, erezsh, netdev, Or Gerlitz, Shlomo Pongratz

Adding netdev because of the high relevance; see V0 at
http://marc.info/?l=linux-rdma&m=134191474831867&w=2

changes from V0:

- following feedback from Mike and Dave, changed the ipoib_neigh hash table
  to allow for a lock-free read side; the model follows the RCU-based
  implementation in net/core/neighbour.c

- since the RCU hash table uses a unidirectional collision list,
  ipoib_neigh_free now needs to walk the list to find the deleted
  neighbour's predecessor and link it to the neighbour's successor.

- different implementation of hash lookup in ipoib_neigh_get (read-side) 
  vs ipoib_neigh_alloc and ipoib_neigh_free (write-side)

- path_free and ipoib_mcast_free now make use of the ipoib_del_neighs_by_gid
  helper function in order to delete neighbours related to that path or
  mcast. This new helper scans the hash table and deletes neighbours with
  the given GID. It had to be done this way because of the unidirectional
  nature of the linking, which itself arises from the lock-free requirement
  made here.

- a completion mechanism was added to prevent freeing the IPoIB netdevice priv 
  data structure before the RCU based code freed all the neighbours.
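
The predecessor search mentioned in the second item can be illustrated in
plain userspace C. This is only a sketch under stated assumptions:
`struct node` and `bucket_remove` are hypothetical names, there is a single
bucket, and the RCU and locking that the patch relies on are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one hash-bucket entry; not the driver's struct. */
struct node {
	int key;
	struct node *next;	/* singly linked collision chain */
};

/*
 * Unlink 'victim' from the chain starting at *head.
 * With forward links only, the predecessor is unknown, so walk the chain
 * while remembering the link that points at the current entry.
 * Returns 1 if the entry was found and unlinked, 0 otherwise.
 */
static int bucket_remove(struct node **head, struct node *victim)
{
	struct node **np = head;
	struct node *n;

	assert(head != NULL && victim != NULL);

	for (n = *np; n != NULL; n = *np) {
		if (n == victim) {
			*np = n->next;	/* predecessor now skips the victim */
			n->next = NULL;
			return 1;
		}
		np = &n->next;		/* advance to the next link */
	}
	return 0;
}
```

Keeping a pointer to the link rather than to the previous node means the
head of the bucket needs no special case.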

Again, the patch was made over net-next because of a few IPoIB changes that
took place there and the parallel submission of the eIPoIB driver.

Or.

Shlomo Pongratz (1):
  IB/ipoib: break linkage to neighbouring system

 drivers/infiniband/ulp/ipoib/ipoib.h           |   59 ++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |   16 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  638 +++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   57 +--
 4 files changed, 535 insertions(+), 235 deletions(-)

Cc: Shlomo Pongratz <shlomop@mellanox.com>


* Re: getsockopt/setsockopt with SO_RCVBUF and SO_SNDBUF "non-standard" behaviour
From: Eugen Dedu @ 2012-07-19 16:14 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, linux-kernel@vger.kernel.org, netdev
In-Reply-To: <5006F32E.8060405@hp.com>

On 18/07/12 19:32, Rick Jones wrote:
> On 07/18/2012 09:11 AM, Eric Dumazet wrote:
>>
>> That the way it's done on linux since day 0
>>
>> You can probably find a lot of pages on the web explaining the
>> rationale.
>>
>> If your application handles UDP frames, what SO_RCVBUF should count ?
>>
>> If its the amount of payload bytes, you could have a pathological
>> situation where an attacker sends 1-byte UDP frames fast enough and
>> could consume a lot of kernel memory.
>>
>> Each frame consumes a fair amount of kernel memory (between 512 bytes
>> and 8 Kbytes depending on the driver).
>>
>> So linux says : If user expect to receive XXXX bytes, set a limit of
>> _kernel_ memory used to store these bytes, and use an estimation of 100%
>> of overhead. That is : allow 2*XXXX bytes to be allocated for socket
>> receive buffers.
>
> Expanding on/rewording that, in a setsockopt() call SO_RCVBUF specifies
> the data bytes and gets doubled to become the kernel/overhead byte
> limit. Unless the doubling would be greater than net.core.rmem_max, in
> which case the limit becomes net.core.rmem_max.
>
> But on getsockopt() SO_RCVBUF is always the kernel/overhead byte limit.
>
> In one call it is fish. In the other it is fowl.
>
> Other stacks appear to keep their kernel/overhead limit quiet, keeping
> SO_RCVBUF an expression of a data limit in both setsockopt() and
> getsockopt(). With those stacks, there is I suppose the possible source
> of confusion when/if someone tests the queuing to a socket, sends "high
> overhead" packets and doesn't get to SO_RCVBUF worth of data though I
> don't recall encountering that in my "pre-linux" time.

Thank you both for the answers.  As I understand it, it is sometimes
impossible (or not practical) to fulfil user requirements on buffer size,
since if only 1-byte UDP packets arrive and are not consumed by the
application, the memory needed by Linux is, say, 1000 times greater, which
of course is not available.  Other OSes have the same problem (see above
"doesn't get to SO_RCVBUF worth of data"), except that they return the
same value in getsockopt as in setsockopt.  However, note that with Linux
the confusion is still possible, even if it appears more rarely.
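
The asymmetry discussed above is easy to observe from userspace. The
following is a minimal sketch, not taken from this thread;
`rcvbuf_roundtrip` is a hypothetical helper name. It sets SO_RCVBUF on a
fresh UDP socket and returns what getsockopt() reports afterwards.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Request 'requested' bytes of receive buffer on a fresh UDP socket and
 * return the value getsockopt() reports afterwards, or -1 on any error.
 */
static int rcvbuf_roundtrip(int requested)
{
	int fd, got = -1;
	socklen_t len = sizeof(got);

	assert(requested > 0);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
		       &requested, sizeof(requested)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0)
		got = -1;

	close(fd);
	return got;
}
```

Going by the description quoted above, on Linux the returned value should
be about twice the request unless net.core.rmem_max caps it; stacks that
keep the overhead limit private would be expected to return the request
unchanged.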

> The sometimes fish, sometimes fowl version (along with the auto tuning
> when one doesn't make setsockopt() calls) gave me fits in netperf for
> years until I finally relented and split the socket buffer size
> variables into three - what netperf's user requested via the command
> line, what it was right after the socket was created, and what it was at
> the end of the data phase of the test.

-- 
Eugen


* Re: [PATCH] Crash in tun
From: Mikulas Patocka @ 2012-07-19 16:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Maxim Krasnyansky, vtun, netdev, davem
In-Reply-To: <1342678175.2626.3849.camel@edumazet-glaptop>



On Thu, 19 Jul 2012, Eric Dumazet wrote:

> Hi Mikulas
> 
> A fix for this problem is : http://patchwork.ozlabs.org/patch/170440/

If you call tun_free_netdev because of a jump to the error label
err_free_sk, your patch still calls it with a NULL file, causing memory
corruption and a possible crash.

Your patch doesn't fix sockets_in_use underflow.

Maybe we can commit this patch --- it introduces a new flag
SOCK_EXTERNALLY_ALLOCATED to work around both problems. (It looks
considerably nicer than my previous patch with file = (void *)1.)

Mikulas

---

tun: fix a crash bug and a memory leak

This patch fixes a crash
tun_chr_close -> netdev_run_todo -> tun_free_netdev -> sk_release_kernel ->
sock_release -> iput(SOCK_INODE(sock))
introduced by commit 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d

The problem is that this socket is embedded in struct tun_struct and has
no inode, so iput is called on an invalid inode, which modifies invalid
memory and may cause a crash.

sock_release also decrements sockets_in_use, so the "sockets: used" field
in /proc/*/net/sockstat keeps decreasing as tun devices are created and
closed.

This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs
sock_release to not free the inode and not decrement sockets_in_use,
fixing both memory corruption and sockets_in_use underflow.

It should be backported to 3.3 and 3.4 stable.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@kernel.org

---
 drivers/net/tun.c   |    3 +++
 include/linux/net.h |    1 +
 net/socket.c        |    3 +++
 3 files changed, 7 insertions(+)

Index: linux-3.4.5-fast/drivers/net/tun.c
===================================================================
--- linux-3.4.5-fast.orig/drivers/net/tun.c	2012-07-19 17:55:16.000000000 +0200
+++ linux-3.4.5-fast/drivers/net/tun.c	2012-07-19 17:58:30.000000000 +0200
@@ -358,6 +358,8 @@ static void tun_free_netdev(struct net_d
 {
 	struct tun_struct *tun = netdev_priv(dev);
 
+	BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED, &tun->socket.flags));
+
 	sk_release_kernel(tun->socket.sk);
 }
 
@@ -1115,6 +1117,7 @@ static int tun_set_iff(struct net *net,
 		tun->flags = flags;
 		tun->txflt.count = 0;
 		tun->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
+		set_bit(SOCK_EXTERNALLY_ALLOCATED, &tun->socket.flags);
 
 		err = -ENOMEM;
 		sk = sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL, &tun_proto);
Index: linux-3.4.5-fast/include/linux/net.h
===================================================================
--- linux-3.4.5-fast.orig/include/linux/net.h	2012-07-19 17:54:31.000000000 +0200
+++ linux-3.4.5-fast/include/linux/net.h	2012-07-19 17:55:03.000000000 +0200
@@ -72,6 +72,7 @@ struct net;
 #define SOCK_NOSPACE		2
 #define SOCK_PASSCRED		3
 #define SOCK_PASSSEC		4
+#define SOCK_EXTERNALLY_ALLOCATED 5
 
 #ifndef ARCH_HAS_SOCKET_TYPES
 /**
Index: linux-3.4.5-fast/net/socket.c
===================================================================
--- linux-3.4.5-fast.orig/net/socket.c	2012-07-19 17:56:55.000000000 +0200
+++ linux-3.4.5-fast/net/socket.c	2012-07-19 17:57:50.000000000 +0200
@@ -522,6 +522,9 @@ void sock_release(struct socket *sock)
 	if (rcu_dereference_protected(sock->wq, 1)->fasync_list)
 		printk(KERN_ERR "sock_release: fasync list not empty!\n");
 
+	if (test_bit(SOCK_EXTERNALLY_ALLOCATED, &sock->flags))
+		return;
+
 	percpu_sub(sockets_in_use, 1);
 	if (!sock->file) {
 		iput(SOCK_INODE(sock));


* Re: [net-next 9/9] ixgbe: Cleanup holes in flags after removing several of them
From: Alexander Duyck @ 2012-07-19 16:11 UTC (permalink / raw)
  To: David Laight; +Cc: Jeff Kirsher, davem, netdev, gospo, sassmann
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B6F98@saturn3.aculab.com>

On 07/19/2012 01:33 AM, David Laight wrote:
>> This change is just meant to defragment the flags as there are several
>> holes that have been introduced since several features, or the flags for
>> them, have been removed.
> Doesn't this sort of change just make it difficult for people who are
> looking at hexdumps of memory but don't have exactly the right header
> file to hand?
These are private flags held inside of the driver and never exposed
externally to user space.  If we have to go through a hexdump to try and
figure out the state of the driver I would certainly hope we have the
header file in hand.  Knowing what bits we use is kind of pointless if
we don't know where the flags are being stored within the adapter
structure itself.

> It doesn't really gain anything much either.
I never did say it gains us much.  It is mostly just housekeeping in
order to make it clear where the available bits are in the flags fields.

> I can (just) imagine reordering flags so that the commonly
> tested ones are in the low bits so that they can be tested
> with small immediate constants - saving an instruction.
> But that isn't what is being done here.
>
> 	David

On x86 it seems like gcc is converting all of the flag tests to "testb"
assembly ops and just using an offset within the flags field to access
bits 8 through 31.  It doesn't seem like there would be much of an
advantage to reordering the flags unless we need to optimize for the
cases where we are testing multiple flags.  Even in that case we would
probably just want to align things so that when we access multiple flags
they are in the same 8 bit field.

Thanks,

Alex


* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: Andrew Jackson @ 2012-07-19 16:05 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Richard Cochran, David Miller, netdev, linux-net-drivers
In-Reply-To: <1342708653.2617.33.camel@bwh-desktop.uk.solarflarecom.com>

On 19/07/2012 15:37, Ben Hutchings wrote:
> On Thu, 2012-07-19 at 16:25 +0200, Richard Cochran wrote:
>> On Wed, Jul 18, 2012 at 07:21:33PM +0100, Ben Hutchings wrote:

>>> +/* Process times received from MC.
>>> + *
>>> + * Extract times from returned results, and establish the minimum value
>>> + * seen.  The minimum value represents the "best" possible time and events
>>> + * too much greater than this are rejected - the machine is, perhaps, too
>>> + * busy. A number of readings are taken so that, hopefully, at least one good
>>> + * synchronisation will be seen in the results.
>>> + */
>>
>> This code looks like it is trying to find the offset between two
>> clocks. Is there some reason why you cannot use <linux/timecompare.h>
>> to accomplish this?
>>
>> Also, these comments about "hopefully" synchronization make me
>> nervous. I think it might be easier just to offer RAW timestamps and
>> forget about the SYS timestamps.
>>
>> I am trying to purge the whole SYS thing (only blackfin is left)
>> because there is a much better way to go about this, namely
>> synchronizing the system time to the PHC time via an internal PPS
>> signal.
>
> Andrew, would that work for us?

I don't think so for the reason that Stu has pointed out (failed 
assumption).

The NIC's clock isn't directly accessible by the host from the PCIe bus 
and is "behind" the MC. Even when we process PPS events, we need a 
reliable way of determining the relationship between the two clocks 
(system <> NIC). We're trying to get that as accurately as we can but we 
know that some measurements will be incorrect/out of bounds because of 
loading on the system.

In retrospect, I should have phrased the comment in more statistical 
terms rather than using ambiguous phrases like "hopefully".
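
To make the "minimum value" idea from the quoted comment concrete, here is
a small sketch. It is not the sfc driver's algorithm; `struct sync_sample`,
`best_offset` and the slack threshold are hypothetical. The assumption is
that each reading records the host time just before and just after the NIC
clock was sampled, so a short window means the reading was less disturbed
by system load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample: host time just before/after reading the NIC clock. */
struct sync_sample {
	int64_t host_start_ns;
	int64_t host_end_ns;
	int64_t nic_ns;
};

/*
 * Pick the sample with the shortest host-side window and report the
 * estimated offset (NIC time minus host window midpoint) via *offset_ns.
 * Returns how many samples have a window within 'slack_ns' of the
 * shortest one, i.e. how many readings would be considered usable.
 */
static size_t best_offset(const struct sync_sample *s, size_t n,
			  int64_t slack_ns, int64_t *offset_ns)
{
	int64_t min_window;
	size_t i, best = 0, usable = 0;

	assert(s != NULL && n > 0 && offset_ns != NULL);

	/* First pass: find the shortest window. */
	min_window = s[0].host_end_ns - s[0].host_start_ns;
	for (i = 1; i < n; i++) {
		int64_t w = s[i].host_end_ns - s[i].host_start_ns;

		if (w < min_window) {
			min_window = w;
			best = i;
		}
	}

	/* Second pass: count readings not much worse than the best one. */
	for (i = 0; i < n; i++) {
		int64_t w = s[i].host_end_ns - s[i].host_start_ns;

		if (w <= min_window + slack_ns)
			usable++;
	}

	*offset_ns = s[best].nic_ns -
		     (s[best].host_start_ns + min_window / 2);
	return usable;
}
```

Taking several readings only raises the chance that at least one of them
has a short window; it does not guarantee it, which is the statistical
point made above.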

	Andrew


* Re: [PATCH net-next V1 5/9] net/eipoib: Add ethtool file support
From: Or Gerlitz @ 2012-07-19 15:55 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: davem, roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit
In-Reply-To: <1342636621.2617.63.camel@bwh-desktop.uk.solarflarecom.com>

On 7/18/2012 9:37 PM, Ben Hutchings wrote:
>> +static void parent_get_strings(struct net_device *parent_dev,
>> +			       uint32_t stringset, uint8_t *data)
>> +{
>> +	int index = 0, stats_off = 0, i;
>> +
>> +	if (stringset != ETH_SS_STATS)
>> +		return;
>> +
>> +	for (i = 0; i < PORT_STATS_LEN; i++)
>> +		strcpy(data + (index++) * ETH_GSTRING_LEN,
>> +		       parent_strings[i + stats_off]);
>> +
>> +	stats_off += PORT_STATS_LEN;
> This is a very longwinded way to write:
> 	memcpy(data, parent_strings, sizeof(parent_strings));

Sure, will fix.
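
For reference, a small userspace sketch of why the loop and the single
memcpy() end up equivalent. The table name, its contents and GSTRING_LEN
are hypothetical stand-ins, and the sketch assumes the string table is
declared as an array of fixed-width rows, which is what the suggested
memcpy() relies on.

```c
#include <assert.h>
#include <string.h>

#define GSTRING_LEN 32	/* hypothetical stand-in for ETH_GSTRING_LEN */

/* Hypothetical statistics names; each row is one fixed-width slot. */
static const char stat_names[][GSTRING_LEN] = {
	"rx_packets", "tx_packets", "rx_bytes",
};
#define NUM_STATS (sizeof(stat_names) / sizeof(stat_names[0]))

/* Loop form: copy one string at a time into consecutive slots. */
static void copy_by_loop(char *data)
{
	size_t i;

	assert(data != NULL);
	for (i = 0; i < NUM_STATS; i++)
		strcpy(data + i * GSTRING_LEN, stat_names[i]);
}

/* Block form: the table is already laid out in the destination format. */
static void copy_by_memcpy(char *data)
{
	assert(data != NULL);
	memcpy(data, stat_names, sizeof(stat_names));
}
```

The two forms produce identical buffers when the destination starts out
zero-filled; memcpy() additionally copies the zero padding of each row.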

>
>> +static int parent_get_sset_count(struct net_device *parent_dev, int sset)
>> +{
>> +	switch (sset) {
>> +	case ETH_SS_STATS:
>> +		return PARENT_STATS_LEN;
>>
> [...]
>
> I get the feeling you've removed some code with unifdef; the result
> looks really weird, with PORT_STATS_LEN and PARENT_STATS_LEN used
> inconsistently.

yep, this needs cleanup, will do for V2

Or.


* Re: [PATCH 09/15] ipv4: Cache output routes in fib_info nexthops.
From: David Miller @ 2012-07-19 15:52 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20120718.112404.1910372180742347127.davem@davemloft.net>

In a dream I found a bug in this patch and the next one.

When we fetch a cached route from the FIB info, we have to
check if it has been invalidated by a PMTU event or similar.
And if so, cmpxchg() it with NULL and release it, so we
can build and install a new cached route there.
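
The fetch-and-invalidate step described here can be sketched with C11
atomics in userspace. This is an illustration only, not the eventual fix:
`struct cached_route`, its `stale` flag and `cache_fetch` are hypothetical,
and a plain integer stands in for the real reference counting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached object; 'stale' stands in for "invalidated". */
struct cached_route {
	bool stale;
	int refcnt;	/* simplified, non-atomic reference count */
};

/*
 * Return the cached route if it is still valid. If it is stale, try to
 * swap the slot to NULL; only the caller that wins the swap drops the
 * cache's reference, so the entry is released exactly once.
 */
static struct cached_route *
cache_fetch(_Atomic(struct cached_route *) *slot)
{
	struct cached_route *rt;

	assert(slot != NULL);

	rt = atomic_load(slot);
	if (rt == NULL || !rt->stale)
		return rt;

	if (atomic_compare_exchange_strong(slot, &rt, NULL))
		rt->refcnt--;	/* we won the race: drop the cache's ref */

	return NULL;		/* caller builds and installs a new route */
}
```

Using compare-and-exchange rather than a plain store means a concurrent
caller that has already installed a fresh route in the slot is not
overwritten.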

I'll fix this up when I integrate everyone's feedback later
today.


* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: Ben Hutchings @ 2012-07-19 15:50 UTC (permalink / raw)
  To: Stuart Hodgson
  Cc: Richard Cochran, David Miller, netdev, linux-net-drivers,
	Andrew Jackson
In-Reply-To: <500827EF.208@solarflare.com>

On Thu, 2012-07-19 at 16:29 +0100, Stuart Hodgson wrote:
> On 19/07/12 15:25, Richard Cochran wrote:
[...]
> > I am trying to purge the whole SYS thing (only blackfin is left)
> > because there is a much better way to go about this, namely
> > synchronizing the system time to the PHC time via an internal PPS
> > signal.
> 
> This may be possible in future. But leads us to another problem
> where the PPS event that is generated by the PHC subsystem to the 
> PPS subsystem is stamped with the current system_time. That may
> be fine for a PPS signal generated from an interrupt but not when
> the internal PPS event has implicit jitter from the handler/event_queue
> that we have in the driver.
[...]

We can certainly take a timestamp in the hard interrupt handler; in fact
that's what I originally expected we would do since we have a separate
MSI-X vector for PTP.  But even hard interrupt handling can be subject
to substantial jitter.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH] ipv4: Fix time difference calculation in rt_bind_exception().
From: David Miller @ 2012-07-19 15:50 UTC (permalink / raw)
  To: netdev


Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f67e702..2c25581 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1907,7 +1907,7 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 		if (fnhe->fnhe_daddr == daddr) {
 			if (fnhe->fnhe_pmtu) {
 				unsigned long expires = fnhe->fnhe_expires;
-				unsigned long diff = jiffies - expires;
+				unsigned long diff = expires - jiffies;
 
 				if (time_before(jiffies, expires)) {
 					rt->rt_pmtu = fnhe->fnhe_pmtu;
-- 
1.7.10.4
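
A minimal userspace sketch of why the operand order matters, assuming an unsigned tick counter like jiffies. demo_time_before() mirrors the wraparound-safe comparison that the kernel's time_before() performs, and demo_remaining() is a hypothetical helper, not code from route.c.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace stand-in for the kernel's time_before(): true if tick a is
 * earlier than tick b, even when the counter has wrapped in between.
 */
static bool demo_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Ticks remaining until 'expires', or 0 if it has already passed.
 * Inside the "not yet expired" branch the remaining time is
 * expires - now. The reversed subtraction, now - expires, would wrap
 * around to a huge unsigned value there.
 */
static unsigned long demo_remaining(unsigned long now, unsigned long expires)
{
	if (!demo_time_before(now, expires))
		return 0;
	return expires - now;
}
```

With now = 100 and expires = 160, demo_remaining() yields 60, whereas 100UL - 160UL wraps to a value close to the maximum of unsigned long.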

^ permalink raw reply related

* Re: [PATCH net-next V1 7/9] net/eipoib: Add main driver functionality
From: Or Gerlitz @ 2012-07-19 15:46 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: davem, roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit
In-Reply-To: <1342705789.2617.15.camel@bwh-desktop.uk.solarflarecom.com>

On 7/19/2012 4:49 PM, Ben Hutchings wrote:
> On Wed, 2012-07-18 at 14:00 +0300, Or Gerlitz wrote:
>> +static const struct net_device_ops parent_netdev_ops = {
>> +	.ndo_init		= parent_init,
>> +	.ndo_uninit		= parent_uninit,
>> +	.ndo_open		= parent_open,
>> +	.ndo_stop		= parent_close,
>> +	.ndo_start_xmit		= parent_tx,
>> +	.ndo_select_queue	= parent_select_q,
>> +	/* parent mtu is min(slaves_mtus) */
>> +	.ndo_change_mtu		= NULL,
>> +	.ndo_fix_features	= parent_fix_features,
>> +	/*
>> +	 * initial mac address is randomized, can be changed
>> +	 * thru this func later
>> +	 */
>> +	.ndo_set_mac_address = eth_mac_addr,
>> +	.ndo_get_stats = parent_get_stats,
>
> Why not implement ndo_get_stats64?  I don't think there's any good
> reason for a new driver not to.

Indeed, will do  ndo_get_stats64

>
>
>> +	.ndo_vlan_rx_add_vid = eth_ipoib_vlan_rx_add_vid,
>> +	.ndo_vlan_rx_kill_vid = eth_ipoib_vlan_rx_kill_vid,
>
> These shouldn't be needed.

OK, here's the point: the eIPoIB driver maps Ethernet VLANs to
InfiniBand/IPoIB pkeys (partition keys). The underlying IPoIB devices
handle these pkeys with HW acceleration, and we want the eIPoIB driver
to be treated as one that supports HW-accelerated VLANs. E.g. in the TX
flow we don't want the 8021q driver to do any special SW handling of
the skb other than setting the skb->vlan_tci field, and in the RX flow
we set skb->vlan_tci ourselves and don't want 8021q to try to extract
it from the headers, etc.

To that end, I was under the impression that all three
NETIF_F_HW_VLAN_{TX,RX,FILTER} features need to be advertised. From
your comment I now understand that RX/TX are enough in that respect?

>
>
> [...]
>> +/* netdev events handlers */
>> +static inline int is_ipoib_pif_intf(struct net_device *dev)
>> +{
>> +	if (ARPHRD_INFINIBAND == dev->type && dev->priv_flags & IFF_EIPOIB_PIF)
>> +			return 1;
> [...]
>
> Wrong indentation.

will fix, thanks for spotting this.

Or.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: tcp: remove per net tcp_sock
From: David Miller @ 2012-07-19 15:45 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, therbert, wsommerfeld
In-Reply-To: <20120719.083544.1223522161508413373.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 19 Jul 2012 08:35:44 -0700 (PDT)

> Looks great, applied, thanks Eric.

I take that back, it doesn't build:

net/ipv4/ip_output.c: In function ‘ip_send_unicast_reply’:
net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
net/ipv4/ip_output.c:1481:1: error: declaration of ‘__pcpu_unique_unicast_sock’ with no linkage follows extern declaration
net/ipv4/ip_output.c:1481:1: note: previous declaration of ‘__pcpu_unique_unicast_sock’ was here
net/ipv4/ip_output.c:1481:9: error: section attribute cannot be specified for local variables
net/ipv4/ip_output.c:1481:9: error: weak declaration of ‘unicast_sock’ must be public

^ permalink raw reply

* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: David Miller @ 2012-07-19 15:43 UTC (permalink / raw)
  To: smhodgson; +Cc: richardcochran, bhutchings, netdev, linux-net-drivers, ajackson
In-Reply-To: <500827EF.208@solarflare.com>

I really wish we hadn't started quoting an entire HUGE patch file
in this discussion.

Only quote the relevant snippets of the patch for the purposes of
the discussion when replying, thank you.

^ permalink raw reply

* Re: [PATCH 09/15] ipv4: Cache output routes in fib_info nexthops.
From: David Miller @ 2012-07-19 15:39 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev
In-Reply-To: <20120719113810.GM1869@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Thu, 19 Jul 2012 13:38:10 +0200

> On Wed, Jul 18, 2012 at 11:24:04AM -0700, David Miller wrote:
>> +
>> +static void rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe)
>> +{
>> +	if (fnhe->fnhe_pmtu) {
>> +		unsigned long expires = fnhe->fnhe_expires;
>> +		unsigned long diff = jiffies - expires;
> 
> This should be diff = expires - jiffies
> 
> With that changed, everything seems to work fine :)

Thanks a lot for catching this bug, I'll fix it up right now.

^ permalink raw reply

* Re: [PATCH] net: e100: ucode is optional in some cases
From: David Miller @ 2012-07-19 15:37 UTC (permalink / raw)
  To: bjorn
  Cc: netdev, jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	carolyn.wyborny, donald.c.skidmore, gregory.v.rose,
	peter.p.waskiewicz.jr, alexander.h.duyck, john.ronciak,
	e1000-devel
In-Reply-To: <1342690393-18459-1-git-send-email-bjorn@mork.no>

From: Bjørn Mork <bjorn@mork.no>
Date: Thu, 19 Jul 2012 11:33:13 +0200

> +	 * http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/fxp/rcvbundl.h?rev=HEAD;content-type=text%2Fplain

Please don't put URLs into the source code, they generally lack
permanence.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: tcp: remove per net tcp_sock
From: David Miller @ 2012-07-19 15:35 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, therbert, wsommerfeld
In-Reply-To: <1342688332.2626.4001.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 Jul 2012 10:58:52 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
> per network namespace.
> 
> This leads to bad behavior on multiqueue NICs, because many cpus
> contend for the socket lock and, once the socket lock is acquired,
> extra false sharing on various socket fields slows down the operations.
> 
> To better resist attacks, we use a percpu socket. Each cpu can
> run without contention, using appropriate memory (local node).
> 
> Additional features :
> 
> 1) We also mirror the queue_mapping of the incoming skb, so that
> answers use the same queue if possible.
> 
> 2) Setting the SOCK_USE_WRITE_QUEUE socket flag speeds up sock_wfree()
> 
> 3) We now limit the number of in-flight RST/ACK [1] packets
> per cpu, instead of per namespace, and we honor the sysctl_wmem_default
> limit dynamically. (Prior to this patch, sysctl_wmem_default value was
> copied at boot time, so any further change would not affect tcp_sock
> limit)
> 
> 
> [1] These packets are only generated when no socket was matched for
> the incoming packet.
> 
> Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Looks great, applied, thanks Eric.
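
A minimal sketch of the per-cpu accounting idea from the quoted changelog, under stated assumptions: demo_ctl_sock, DEMO_NR_CPUS and both helpers are hypothetical, and plain array indexing stands in for the kernel's per-cpu variables. It only illustrates that each cpu charges its own slot and that the limit is read at call time rather than copied once at boot.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4	/* hypothetical fixed cpu count */

/* Hypothetical per-cpu control socket: only the in-flight byte count. */
struct demo_ctl_sock {
	unsigned int wmem_alloc;
};

/* One slot per cpu, so cpus never contend on a shared lock or counter. */
static struct demo_ctl_sock demo_socks[DEMO_NR_CPUS];

/*
 * Charge 'size' bytes of an RST/ACK reply to this cpu's socket.
 * 'limit' is passed on every call, so a changed limit takes effect
 * immediately. Returns false if the reply must be dropped.
 */
static bool demo_try_charge(unsigned int cpu, unsigned int size,
			    unsigned int limit)
{
	struct demo_ctl_sock *sk = &demo_socks[cpu % DEMO_NR_CPUS];

	if (sk->wmem_alloc + size > limit)
		return false;
	sk->wmem_alloc += size;
	return true;
}

/* Release the charge once the reply has left the queue. */
static void demo_uncharge(unsigned int cpu, unsigned int size)
{
	struct demo_ctl_sock *sk = &demo_socks[cpu % DEMO_NR_CPUS];

	assert(sk->wmem_alloc >= size);
	sk->wmem_alloc -= size;
}
```

A flood of unmatched packets handled on one cpu exhausts only that cpu's budget; replies generated on other cpus are charged against their own slots.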

> @@ -2624,13 +2624,11 @@ EXPORT_SYMBOL(tcp_prot);
>  
>  static int __net_init tcp_sk_init(struct net *net)
>  {
> -	return inet_ctl_sock_create(&net->ipv4.tcp_sock,
> -				    PF_INET, SOCK_RAW, IPPROTO_TCP, net);
> +	return 0;
>  }
>  
>  static void __net_exit tcp_sk_exit(struct net *net)
>  {
> -	inet_ctl_sock_destroy(net->ipv4.tcp_sock);
>  }
>  
>  static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)

If these no longer really do anything, just send me a patch to kill
them off entirely.

Thanks again.

^ permalink raw reply

* Re: [PATCH net-next V1 0/4] net/mlx4_en: Add accelerated RFS support
From: David Miller @ 2012-07-19 15:34 UTC (permalink / raw)
  To: ogerlitz; +Cc: roland, netdev, oren, yevgenyp, amirv
In-Reply-To: <1342686832-21406-1-git-send-email-ogerlitz@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Thu, 19 Jul 2012 11:33:48 +0300

> This series from Amir Vadai adds support for Accelerated RFS 
> to the mlx4_en Ethernet driver.
> 
> The code uses the Accelerated RFS infrastructure and HW flow steering 
> to keep CPU affinity of rx interrupts and applications per TCP stream.
> 
> To do so, we had to add little protection to cpu_rmap.h against double 
> inclusion. Also, added linking between CPU to IRQ using rmap in the 
> mlx4_core driver.
> 
> changes from V0:
>  - always use CONFIG_RFS_ACCEL instead of using twice CONFIG_CPU_RMAP directly

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH] ipv4: fix address selection in fib_compute_spec_dst
From: David Miller @ 2012-07-19 15:31 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <1342683303-3557-1-git-send-email-ja@ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Thu, 19 Jul 2012 10:35:03 +0300

> 	ip_options_compile can be called for forwarded packets,
> make sure the specific-destination address is a local one as
> specified in RFC 1812, 4.2.2.2 Addresses in Options
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Applied.

^ permalink raw reply

* Re: [PATCH] ipv4: optimize fib_compute_spec_dst call in ip_options_echo
From: David Miller @ 2012-07-19 15:30 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <1342683264-3523-1-git-send-email-ja@ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Thu, 19 Jul 2012 10:34:24 +0300

> 	Move fib_compute_spec_dst at the only place where it
> is needed.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Applied.

^ permalink raw reply
