From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jay Vosburgh Subject: Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table Date: Tue, 27 Nov 2012 17:05:25 -0800 Message-ID: <26343.1354064725@death.nxdomain> References: <20120620203731.GA4957@midget.suse.cz> <2699.1340319919@death.nxdomain> <20121123124419.GA3002@midget.suse.cz> Cc: Andy Gospodarek , netdev@vger.kernel.org To: Jiri Bohac Return-path: Received: from e38.co.us.ibm.com ([32.97.110.159]:37737 "EHLO e38.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755148Ab2K1BGf (ORCPT ); Tue, 27 Nov 2012 20:06:35 -0500 Received: from /spool/local by e38.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 27 Nov 2012 18:06:33 -0700 Received: from d03relay05.boulder.ibm.com (d03relay05.boulder.ibm.com [9.17.195.107]) by d03dlp01.boulder.ibm.com (Postfix) with ESMTP id 267C11FF003E for ; Tue, 27 Nov 2012 18:06:26 -0700 (MST) Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay05.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id qAS16UPs211434 for ; Tue, 27 Nov 2012 18:06:30 -0700 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id qAS16UQI027356 for ; Tue, 27 Nov 2012 18:06:30 -0700 In-reply-to: <20121123124419.GA3002@midget.suse.cz> Sender: netdev-owner@vger.kernel.org List-ID: Jiri Bohac wrote: >Hi, > >This is another resend of the patch discussed in June. The only >changes over the previous version are improved comments. > >Bonding with balance_rlb keeps poisoning other machines' ARP >caches and I whink we need to fix this. > >On Thu, Jun 21, 2012 at 04:05:19PM -0700, Jay Vosburgh wrote: >> Jiri Bohac wrote: >> >> >Hi, this is a resend of the patch discussed here: >> > http://thread.gmane.org/gmane.linux.network/228076 >> >It has been updated to apply to the lastest net-next. >> [...] >> >The hash table is hashed by ip_dst. To be able to do the above >> >check efficiently (not walking the whole hash table), we need a >> >reverse mapping (by ip_src). >> >> Just a note that I'm doing some testing with this patch. Seems >> to be ok for the "direct" case (wherein the IP in question is assigned >> to the local system); I haven't tried the "bridge" case yet. I've >> extended some of the debugfs stuff to dump the new information, and I'm >> trying some of the corner cases (e.g., breaking the linkages in the >> middle) to see if it all hangs together. > >Were there any results of your testing? Good or bad? I did test it quite a bit (and then neglected to follow up). I tried various deliberate hash collisions to try and make it fail in a corner case, but was unable to induce incorrect behavior. >> I am thinking that the layout of the "hash"-ish table is now >> sufficiently complicated that there should be a comment block somewhere >> describing what's going on (because I didn't really quite get it until I >> dumped the whole thing and looked at it). With this patch, there is one >> "used" linkage for all of the elements in use, plus some number of "src" >> linkages, one for each active source hash. The "src" linkages are also >> notable in that they are separate from the "assigned" state. > >I updated the comments in drivers/net/bonding/bond_alb.h to >describe the structure. > >> >+ * have a dirrerent mac_src. >> >> Typo here; should be "different." > >Fixed. >Any chance we could finally get this merged?: The only issue I see is that a number of added lines run past 80 columns, e.g., + if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) { + /* ip_src is going to be updated, fix the src hash list */ + u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src)); + * sending out client updates with this IP address and the old MAC address. + for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { ... and so on. I did not compile and test this version, just applied it and inspected it; presumably it is functionally identical to the prior version. There's also one typo I noted near the end of the patch. -J >Bonding in balance-alb mode records information from ARP packets >passing through the bond in a hash table (rx_hashtbl). > >At certain situations (e.g. link change of a slave), >rlb_update_rx_clients() will send out ARP packets to update ARP >caches of other hosts on the network to achieve RX load >balancing. > >The problem is that once an IP address is recorded in the hash >table, it stays there indefinitely. If this IP address is >migrated to a different host in the network, bonding still sends >out ARP packets that poison other systems' ARP caches with >invalid information. > >This patch solves this by looking at all incoming ARP packets, >and checking if the source IP address is one of the source >addresses stored in the rx_hashtbl. If it is, but the MAC >addresses differ, the corresponding hash table entries are >removed. Thus, when an IP address is migrated, the first ARP >broadcast by its new owner will purge the offending entries of >rx_hashtbl. > >The hash table is hashed by ip_dst. To be able to do the above >check efficiently (not walking the whole hash table), we need a >reverse mapping (by ip_src). > >I added three new members in struct rlb_client_info: > rx_hashtbl[x].src_first will point to the start of a list of > entries for which hash(ip_src) == x. > The list is linked with src_next and src_prev. > >When an incoming ARP packet arrives at rlb_arp_recv() >rlb_purge_src_ip() can quickly walk only the entries on the >corresponding lists, i.e. the entries that are likely to contain >the offending IP address. > >To avoid confusion, I renamed these existing fields of struct >rlb_client_info: > next -> used_next > prev -> used_prev > rx_hashtbl_head -> rx_hashtbl_used_head > >(The current linked list is _not_ a list of hash table >entries with colliding ip_dst. It's a list of entries that are >being used; its purpose is to avoid walking the whole hash table >when looking for used entries.) > >Signed-off-by: Jiri Bohac > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index e15cc11..8505a24 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -84,6 +84,9 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb) > > /* Forward declaration */ > static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]); >+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp); >+static void rlb_src_unlink(struct bonding *bond, u32 index); >+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash); > > static inline u8 _simple_hash(const u8 *hash_start, int hash_size) > { >@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond, > if (!arp) > goto out; > >+ /* We received an ARP from arp->ip_src. >+ * We might have used this IP address previously (on the bonding host >+ * itself or on a system that is bridged together with the bond). >+ * However, if arp->mac_src is different than what is stored in >+ * rx_hashtbl, some other host is now using the IP and we must prevent >+ * sending out client updates with this IP address and the old MAC address. >+ * Clean up all hash table entries that have this address as ip_src but >+ * have a different mac_src. >+ */ >+ rlb_purge_src_ip(bond, arp); >+ > if (arp->op_code == htons(ARPOP_REPLY)) { > /* update rx hash table for this ARP */ > rlb_update_entry_from_arp(bond, arp); >@@ -432,9 +446,9 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave) > _lock_rx_hashtbl_bh(bond); > > rx_hash_table = bond_info->rx_hashtbl; >- index = bond_info->rx_hashtbl_head; >+ index = bond_info->rx_hashtbl_used_head; > for (; index != RLB_NULL_INDEX; index = next_index) { >- next_index = rx_hash_table[index].next; >+ next_index = rx_hash_table[index].used_next; > if (rx_hash_table[index].slave == slave) { > struct slave *assigned_slave = rlb_next_rx_slave(bond); > >@@ -519,8 +533,8 @@ static void rlb_update_rx_clients(struct bonding *bond) > > _lock_rx_hashtbl_bh(bond); > >- hash_index = bond_info->rx_hashtbl_head; >- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) { >+ hash_index = bond_info->rx_hashtbl_used_head; >+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { > client_info = &(bond_info->rx_hashtbl[hash_index]); > if (client_info->ntt) { > rlb_update_client(client_info); >@@ -548,8 +562,8 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla > > _lock_rx_hashtbl_bh(bond); > >- hash_index = bond_info->rx_hashtbl_head; >- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) { >+ hash_index = bond_info->rx_hashtbl_used_head; >+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { > client_info = &(bond_info->rx_hashtbl[hash_index]); > > if ((client_info->slave == slave) && >@@ -578,8 +592,8 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip) > > _lock_rx_hashtbl(bond); > >- hash_index = bond_info->rx_hashtbl_head; >- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) { >+ hash_index = bond_info->rx_hashtbl_used_head; >+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { > client_info = &(bond_info->rx_hashtbl[hash_index]); > > if (!client_info->slave) { >@@ -625,6 +639,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon > /* update mac address from arp */ > memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN); > } >+ memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN); > > assigned_slave = client_info->slave; > if (assigned_slave) { >@@ -647,6 +662,13 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon > assigned_slave = rlb_next_rx_slave(bond); > > if (assigned_slave) { >+ if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) { >+ /* ip_src is going to be updated, fix the src hash list */ >+ u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src)); >+ rlb_src_unlink(bond, hash_index); >+ rlb_src_link(bond, hash_src, hash_index); >+ } >+ > client_info->ip_src = arp->ip_src; > client_info->ip_dst = arp->ip_dst; > /* arp->mac_dst is broadcast for arp reqeusts. >@@ -654,6 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon > * upon receiving an arp reply. > */ > memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN); >+ memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN); > client_info->slave = assigned_slave; > > if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) { >@@ -669,11 +692,11 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon > } > > if (!client_info->assigned) { >- u32 prev_tbl_head = bond_info->rx_hashtbl_head; >- bond_info->rx_hashtbl_head = hash_index; >- client_info->next = prev_tbl_head; >+ u32 prev_tbl_head = bond_info->rx_hashtbl_used_head; >+ bond_info->rx_hashtbl_used_head = hash_index; >+ client_info->used_next = prev_tbl_head; > if (prev_tbl_head != RLB_NULL_INDEX) { >- bond_info->rx_hashtbl[prev_tbl_head].prev = >+ bond_info->rx_hashtbl[prev_tbl_head].used_prev = > hash_index; > } > client_info->assigned = 1; >@@ -740,8 +763,8 @@ static void rlb_rebalance(struct bonding *bond) > _lock_rx_hashtbl_bh(bond); > > ntt = 0; >- hash_index = bond_info->rx_hashtbl_head; >- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) { >+ hash_index = bond_info->rx_hashtbl_used_head; >+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { > client_info = &(bond_info->rx_hashtbl[hash_index]); > assigned_slave = rlb_next_rx_slave(bond); > if (assigned_slave && (client_info->slave != assigned_slave)) { >@@ -759,11 +782,113 @@ static void rlb_rebalance(struct bonding *bond) > } > > /* Caller must hold rx_hashtbl lock */ >+static void rlb_init_table_entry_dst(struct rlb_client_info *entry) >+{ >+ entry->used_next = RLB_NULL_INDEX; >+ entry->used_prev = RLB_NULL_INDEX; >+ entry->assigned = 0; >+ entry->slave = NULL; >+ entry->tag = 0; >+} >+static void rlb_init_table_entry_src(struct rlb_client_info *entry) >+{ >+ entry->src_first = RLB_NULL_INDEX; >+ entry->src_prev = RLB_NULL_INDEX; >+ entry->src_next = RLB_NULL_INDEX; >+} >+ > static void rlb_init_table_entry(struct rlb_client_info *entry) > { > memset(entry, 0, sizeof(struct rlb_client_info)); >- entry->next = RLB_NULL_INDEX; >- entry->prev = RLB_NULL_INDEX; >+ rlb_init_table_entry_dst(entry); >+ rlb_init_table_entry_src(entry); >+} >+ >+static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index) >+{ >+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); >+ u32 next_index = bond_info->rx_hashtbl[index].used_next; >+ u32 prev_index = bond_info->rx_hashtbl[index].used_prev; >+ >+ if (index == bond_info->rx_hashtbl_used_head) >+ bond_info->rx_hashtbl_used_head = next_index; >+ if (prev_index != RLB_NULL_INDEX) >+ bond_info->rx_hashtbl[prev_index].used_next = next_index; >+ if (next_index != RLB_NULL_INDEX) >+ bond_info->rx_hashtbl[next_index].used_prev = prev_index; >+} >+ >+/* unlink a rlb hash table entry from the src list */ >+static void rlb_src_unlink(struct bonding *bond, u32 index) >+{ >+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); >+ u32 next_index = bond_info->rx_hashtbl[index].src_next; >+ u32 prev_index = bond_info->rx_hashtbl[index].src_prev; >+ >+ bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX; >+ bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX; >+ >+ if (next_index != RLB_NULL_INDEX) >+ bond_info->rx_hashtbl[next_index].src_prev = prev_index; >+ >+ if (prev_index == RLB_NULL_INDEX) >+ return; >+ >+ /* is prev_index pointing to the head of this list? */ >+ if (bond_info->rx_hashtbl[prev_index].src_first == index) >+ bond_info->rx_hashtbl[prev_index].src_first = next_index; >+ else >+ bond_info->rx_hashtbl[prev_index].src_next = next_index; >+ >+} >+ >+static void rlb_delete_table_entry(struct bonding *bond, u32 index) >+{ >+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); >+ struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]); >+ >+ rlb_delete_table_entry_dst(bond, index); >+ rlb_init_table_entry_dst(entry); >+ >+ rlb_src_unlink(bond, index); >+} >+ >+/* add the rx_hashtbl[ip_dst_hash] entry to the list >+ * of entries with identical ip_src_hash >+ */ >+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash) >+{ >+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); >+ u32 next; >+ >+ bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash; >+ next = bond_info->rx_hashtbl[ip_src_hash].src_first; >+ bond_info->rx_hashtbl[ip_dst_hash].src_next = next; >+ if (next != RLB_NULL_INDEX) >+ bond_info->rx_hashtbl[next].src_prev = ip_dst_hash; >+ bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash; >+} >+ >+/* deletes all rx_hashtbl entries with arp->ip_src if their mac_src does >+ * not match arp->mac_src */ >+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp) >+{ >+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); >+ u32 ip_src_hash = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src)); >+ u32 index; >+ >+ _lock_rx_hashtbl_bh(bond); >+ >+ index = bond_info->rx_hashtbl[ip_src_hash].src_first; >+ while (index != RLB_NULL_INDEX) { >+ struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]); >+ u32 next_index = entry->src_next; >+ if (entry->ip_src == arp->ip_src && >+ !ether_addr_equal_64bits(arp->mac_src, entry->mac_src)) >+ rlb_delete_table_entry(bond, index); >+ index = next_index; >+ } >+ _unlock_rx_hashtbl_bh(bond); > } > > static int rlb_initialize(struct bonding *bond) >@@ -781,7 +906,7 @@ static int rlb_initialize(struct bonding *bond) > > bond_info->rx_hashtbl = new_hashtbl; > >- bond_info->rx_hashtbl_head = RLB_NULL_INDEX; >+ bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX; > > for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) { > rlb_init_table_entry(bond_info->rx_hashtbl + i); >@@ -803,7 +928,7 @@ static void rlb_deinitialize(struct bonding *bond) > > kfree(bond_info->rx_hashtbl); > bond_info->rx_hashtbl = NULL; >- bond_info->rx_hashtbl_head = RLB_NULL_INDEX; >+ bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX; > > _unlock_rx_hashtbl_bh(bond); > } >@@ -815,25 +940,13 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id) > > _lock_rx_hashtbl_bh(bond); > >- curr_index = bond_info->rx_hashtbl_head; >+ curr_index = bond_info->rx_hashtbl_used_head; > while (curr_index != RLB_NULL_INDEX) { > struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]); >- u32 next_index = bond_info->rx_hashtbl[curr_index].next; >- u32 prev_index = bond_info->rx_hashtbl[curr_index].prev; >- >- if (curr->tag && (curr->vlan_id == vlan_id)) { >- if (curr_index == bond_info->rx_hashtbl_head) { >- bond_info->rx_hashtbl_head = next_index; >- } >- if (prev_index != RLB_NULL_INDEX) { >- bond_info->rx_hashtbl[prev_index].next = next_index; >- } >- if (next_index != RLB_NULL_INDEX) { >- bond_info->rx_hashtbl[next_index].prev = prev_index; >- } >+ u32 next_index = bond_info->rx_hashtbl[curr_index].used_next; > >- rlb_init_table_entry(curr); >- } >+ if (curr->tag && (curr->vlan_id == vlan_id)) >+ rlb_delete_table_entry(bond, curr_index); > > curr_index = next_index; > } >diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h >index 90f140a..de831ba 100644 >--- a/drivers/net/bonding/bond_alb.h >+++ b/drivers/net/bonding/bond_alb.h >@@ -94,15 +94,35 @@ struct tlb_client_info { > > /* ------------------------------------------------------------------------- > * struct rlb_client_info contains all info related to a specific rx client >- * connection. This is the Clients Hash Table entry struct >+ * connection. This is the Clients Hash Table entry struct. >+ * Note that this is not a proper hash table; if a new client's IP address >+ * hash collides with an existing client entry, the old entry is replaced. >+ * >+ * There is a linked list (linked by the used_next and used_prev members) >+ * linking all the used entries of the hash table. This allows updating >+ * all the clients without walking over all the unused elements of the table. >+ * >+ * There are also linked lists of entries with identical hash(ip_src). These >+ * allow cleaning up the table from ip_src<->mac_src associatins that have Typo here, "associations" >+ * become outdated and would cause sending out invalid ARP updates to the >+ * network. These are linked by the (src_next and src_prev members). > * ------------------------------------------------------------------------- > */ > struct rlb_client_info { > __be32 ip_src; /* the server IP address */ > __be32 ip_dst; /* the client IP address */ >+ u8 mac_src[ETH_ALEN]; /* the server MAC address */ > u8 mac_dst[ETH_ALEN]; /* the client MAC address */ >- u32 next; /* The next Hash table entry index */ >- u32 prev; /* The previous Hash table entry index */ >+ >+ /* list of used hash table entries, starting at rx_hashtbl_used_head */ >+ u32 used_next; >+ u32 used_prev; >+ >+ /* ip_src based hashing */ >+ u32 src_next; /* next entry with same hash(ip_src) */ >+ u32 src_prev; /* prev entry with same hash(ip_src) */ >+ u32 src_first; /* first entry with hash(ip_src) == this entry's index */ >+ > u8 assigned; /* checking whether this entry is assigned */ > u8 ntt; /* flag - need to transmit client info */ > struct slave *slave; /* the slave assigned to this client */ >@@ -131,7 +151,7 @@ struct alb_bond_info { > int rlb_enabled; > struct rlb_client_info *rx_hashtbl; /* Receive hash table */ > spinlock_t rx_hashtbl_lock; >- u32 rx_hashtbl_head; >+ u32 rx_hashtbl_used_head; > u8 rx_ntt; /* flag - need to transmit > * to all rx clients > */ >diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c >index 2cf084e..6ac855f 100644 >--- a/drivers/net/bonding/bond_debugfs.c >+++ b/drivers/net/bonding/bond_debugfs.c >@@ -31,8 +31,8 @@ static int bond_debug_rlb_hash_show(struct seq_file *m, void *v) > > spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock)); > >- hash_index = bond_info->rx_hashtbl_head; >- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) { >+ hash_index = bond_info->rx_hashtbl_used_head; >+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { > client_info = &(bond_info->rx_hashtbl[hash_index]); > seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n", > &client_info->ip_src, > >-- >Jiri Bohac >SUSE Labs, SUSE CZ > --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com