Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: RTL 8169  linux driver question
From: Stéphane ANCELOT @ 2012-11-28  0:35 UTC (permalink / raw)
  To: Francois Romieu; +Cc: David Laight, Stéphane ANCELOT, netdev, Hayes Wang
In-Reply-To: <20121127224605.GA10228@electric-eye.fr.zoreil.com>

On 27/11/2012 23:46, Francois Romieu wrote:
> David Laight <David.Laight@ACULAB.COM> :
>> Stéphane ANCELOT <sancelot@free.fr> :
>>> I had problem with it, my application sends a frame that is immediately
>>> transmitted back by some slaves, there was abnormally 100us  lost
>>> between the send and receive call.
>>>
>>> Finally I found it was coming from the following register setup in the
>>> driver :
>>>
>>> RTL_W16(IntrMitigate, 0x5151);
>>>
>>> Can you give me some details about it, since I do not have the RTL8169
>>> programming guide.
>> That sounds like an 'interrupt mitigation' setting - which will cause
>> RX interrupts to be delayed a short time in order to reduce the
>> interrupt load on the kernel.
>>
>> There is usually an 'ethtool' setting to disable interrupt mitigation.
> Something like the patch below against net-next could help once I will
> have tested it.
>
> I completely guessed the Tx usec scale factor at gigabit speed (125 us,
> 100 us, disabled, who knows ?) and I have no idea which specific chipsets
> it should work with.
using 0x5151 value at 100mb FDX, I know it introduced exactly 100us 
delay (Tx+Rx).


> Hayes, may I expect some hindsight regarding:
> 1 - the availability of the IntrMitigate (0xe2) register through the
>      8169, 8168 and 810x line of chipsets
> 2 - the Tx timer unit at gigabit speed
>
> It would save me some time.*
Hayes, it would have spared myself a lot of time ;-)

Have a look at what is driving this r8169 component :

http://www.youtube.com/watch?v=wj30CeAFwuk&feature=plcp
A question to nic components developers : I do not understand what 
competitive advantage keeping these things like a secret....

These things are mostly boring for people and oem like myself at 
Numalliance.

Regards,
Stephane Ancelot

> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 248f883..2623b73 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -349,6 +349,12 @@ enum rtl_registers {
>   	RxMaxSize	= 0xda,
>   	CPlusCmd	= 0xe0,
>   	IntrMitigate	= 0xe2,
> +
> +#define RTL_COALESCE_MASK	0x0f
> +#define RTL_COALESCE_SHIFT	4
> +#define RTL_COALESCE_T_MAX	(RTL_COALESCE_MASK)
> +#define RTL_COALESCE_FRAME_MAX	(RTL_COALESCE_MASK << 2)
> +
>   	RxDescAddrLow	= 0xe4,
>   	RxDescAddrHigh	= 0xe8,
>   	EarlyTxThres	= 0xec,	/* 8169. Unit of 32 bytes. */
> @@ -1997,10 +2003,121 @@ static void rtl8169_get_strings(struct net_device *dev, u32 stringset, u8 *data)
>   	}
>   }
>   
> +static struct rtl_coalesce_scale {
> +	u32 speed;
> +	/* Rx / Tx */
> +	u16 usecs[2];
> +} rtl_coalesce_info[] = {
> +	{ .speed = SPEED_10,	.usecs = { 8000, 10000 } },
> +	{ .speed = SPEED_100,	.usecs = { 1000,  1000 } },
> +	{ .speed = SPEED_1000,	.usecs = {  125,   125 } }
> +};
> +
> +static struct rtl_coalesce_scale *rtl_coalesce_scale(struct net_device *dev)
> +{
> +	struct ethtool_cmd ecmd;
> +	int rc, i;
> +
> +	rc = rtl8169_get_settings(dev, &ecmd);
> +	if (rc < 0)
> +		return ERR_PTR(rc);
> +
> +	for (i = 0; i < ARRAY_SIZE(rtl_coalesce_info); i++) {
> +		if (ethtool_cmd_speed(&ecmd) == rtl_coalesce_info[i].speed)
> +			return rtl_coalesce_info + i;
> +	}
> +
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static int rtl_get_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
> +{
> +	struct rtl8169_private *tp = netdev_priv(dev);
> +	void __iomem *ioaddr = tp->mmio_addr;
> +	struct rtl_coalesce_scale *scale;
> +	struct {
> +		u32 *max_frames;
> +		u32 *usecs;
> +	} coal_settings [] = {
> +		{ &ec->rx_max_coalesced_frames, &ec->rx_coalesce_usecs },
> +		{ &ec->tx_max_coalesced_frames, &ec->tx_coalesce_usecs }
> +	}, *p = coal_settings;
> +	int i;
> +	u16 w;
> +
> +	memset(ec, 0, sizeof(*ec));
> +
> +	for (w = RTL_R16(IntrMitigate); w; w >>= RTL_COALESCE_SHIFT, p++) {
> +		*p->max_frames = (w & RTL_COALESCE_MASK) << 2;
> +		w >>= RTL_COALESCE_SHIFT;
> +		*p->usecs = w & RTL_COALESCE_MASK;
> +	}
> +
> +	/* Except for null parameeters, the meaning of coalescing parameters
> +	 * depends on the link speed.
> +	 */
> +	scale = rtl_coalesce_scale(dev);
> +	if (PTR_ERR(scale) && (p != coal_settings))
> +		return PTR_ERR(scale);
> +
> +	for (i = 0; i < 2; i++) {
> +		p = coal_settings + i;
> +		*p->usecs *= scale->usecs[i];
> +		if (!*p->usecs && !*p->max_frames)
> +			*p->max_frames = 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int rtl_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
> +{
> +	struct rtl8169_private *tp = netdev_priv(dev);
> +	void __iomem *ioaddr = tp->mmio_addr;
> +	struct rtl_coalesce_scale *scale;
> +	struct {
> +		u32 frames;
> +		u32 usecs;
> +	} coal_settings [] = {
> +		{ ec->rx_max_coalesced_frames, ec->rx_coalesce_usecs },
> +		{ ec->tx_max_coalesced_frames, ec->tx_coalesce_usecs }
> +	}, *p = coal_settings;
> +	int i, rc;
> +	u16 w = 0;
> +
> +	scale = rtl_coalesce_scale(dev);
> +	rc = PTR_ERR(scale);
> +
> +	for (i = 0; i < 2; i++) {
> +		u32 units;
> +
> +		if (!p->usecs && p->frames == 1)
> +			continue;
> +		if (rc < 0)
> +			goto out;
> +
> +		units = p->usecs / scale->usecs[i];
> +		if (units > RTL_COALESCE_T_MAX || p->usecs % scale->usecs[i] ||
> +		    p->frames > RTL_COALESCE_FRAME_MAX || p->frames % 4)
> +			return -EINVAL;
> +
> +		w <<= RTL_COALESCE_SHIFT;
> +		w |= units;
> +		w <<= RTL_COALESCE_SHIFT;
> +		w |= p->frames >> 2;
> +	}
> +
> +	RTL_W16(IntrMitigate, swab16(w));
> +out:
> +	return rc;
> +}
> +
>   static const struct ethtool_ops rtl8169_ethtool_ops = {
>   	.get_drvinfo		= rtl8169_get_drvinfo,
>   	.get_regs_len		= rtl8169_get_regs_len,
>   	.get_link		= ethtool_op_get_link,
> +	.get_coalesce		= rtl_get_coalesce,
> +	.set_coalesce		= rtl_set_coalesce,
>   	.get_settings		= rtl8169_get_settings,
>   	.set_settings		= rtl8169_set_settings,
>   	.get_msglevel		= rtl8169_get_msglevel,

^ permalink raw reply

* Re: [PATCH v3 8/7] pppoatm: fix missing wakeup in pppoatm_send()
From: David Woodhouse @ 2012-11-28  0:48 UTC (permalink / raw)
  To: chas williams - CONTRACTOR; +Cc: Krzysztof Mazur, netdev, linux-kernel, davem
In-Reply-To: <20121127102333.68ac3234@thirdoffive.cmf.nrl.navy.mil>

[-- Attachment #1: Type: text/plain, Size: 1077 bytes --]

On Tue, 2012-11-27 at 10:23 -0500, chas williams - CONTRACTOR wrote:
> yes, but dont call it 8/7 since that doesnt make sense.

It made enough sense when it was a single patch appended to a thread of
7 other patches from Krzysztof. But now it's all got a little more
complex, so I've tried to collect together the latest version of
everything we've discussed:

 http://git.infradead.org/users/dwmw2/atm.git
  git://git.infradead.org/users/dwmw2/atm.git

David Woodhouse (5):
      solos-pci: Wait for pending TX to complete when releasing vcc
      br2684: don't send frames on not-ready vcc
      atm: Add release_cb() callback to vcc
      pppoatm: fix missing wakeup in pppoatm_send()
      br2684: fix module_put() race

Krzysztof Mazur (6):
      atm: add owner of push() callback to atmvcc
      pppoatm: allow assign only on a connected socket
      pppoatm: fix module_put() race
      pppoatm: take ATM socket lock in pppoatm_send()
      pppoatm: drop frames to not-ready vcc
      pppoatm: do not inline pppoatm_may_send()


-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* Re: [PATCH] br2684: don't send frames on not-ready vcc
From: David Woodhouse @ 2012-11-28  0:54 UTC (permalink / raw)
  To: Krzysztof Mazur
  Cc: chas williams - CONTRACTOR, davem, netdev, linux-kernel, nathan
In-Reply-To: <20121127235129.GA20080@shrek.podlesie.net>

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

On Wed, 2012-11-28 at 00:51 +0100, Krzysztof Mazur wrote:
> If you do this actually it's better to don't use patch 1/7 because
> it introduces race condition that you found earlier.

Right. I've omitted that from the git tree I just pushed out.

> With this patch you have still theoretical race that was fixed in patches
> 5 and 8 in pppoatm series, but I never seen that in practice.

And I think it's even less likely for br2684. At least with pppoatm you
might have had pppd sending frames. But for br2684 they *only* come from
its start_xmit function... which is serialised anyway.

I do get strange oopses when I try to add BQL to br2684, but that's not
something to be looking at at 1am...

I *do* need the equivalent of your patch 4, which is the module_put
race.

-- 
dwmw2

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* Re: IPv4 route cache DOS attack
From: Eric Dumazet @ 2012-11-28  1:01 UTC (permalink / raw)
  To: 叶雨飞; +Cc: netdev
In-Reply-To: <CAJygYd1OSTfpdbA57q0c6ppMtft3iB7uzo-RCXFOamGp2XVzBg@mail.gmail.com>

On Tue, 2012-11-27 at 15:15 -0800, 叶雨飞 wrote:
> Hi,
> 
> I have a linux router running kernel 3.2  that receive public ingress
> packets and route them through an GRE tunnel, return packets don't go
> through it
> 
> I've recently faced a serious issue with the route cache,  when the
> router received spoofed source , the route cache will quickly get
> exhausted (depending on the size of it) and soon the ip dst cache
> overflow will be printed and network subsystem will hang until
> restarted.
> 
> So, my question is, how can I turn off the route cache without
> recompile the kernel or adding the  patch for removal  in 3.7?  I
> tried to set
> 
> echo 0 > /proc/sys/net/ipv4/route/max_size but that has no effect at all.
> 
> And if some one can share some insight on why when dst cache
> overflows, the network subsystem hangs, it would be great.

echo -1 >/proc/sys/net/ipv4/rt_cache_rebuild_count

^ permalink raw reply

* Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table
From: Jay Vosburgh @ 2012-11-28  1:05 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Andy Gospodarek, netdev
In-Reply-To: <20121123124419.GA3002@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>Hi, 
>
>This is another resend of the patch discussed in June. The only
>changes over the previous version are improved comments.
>
>Bonding with balance_rlb keeps poisoning other machines' ARP
>caches and I whink we need to fix this.
>
>On Thu, Jun 21, 2012 at 04:05:19PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac <jbohac@suse.cz> wrote:
>> 
>> >Hi, this is a resend of the patch discussed here:
>> >	http://thread.gmane.org/gmane.linux.network/228076
>> >It has been updated to apply to the lastest net-next.
>> [...]
>> >The hash table is hashed by ip_dst. To be able to do the above
>> >check efficiently (not walking the whole hash table), we need a
>> >reverse mapping (by ip_src).
>> 
>> 	Just a note that I'm doing some testing with this patch.  Seems
>> to be ok for the "direct" case (wherein the IP in question is assigned
>> to the local system); I haven't tried the "bridge" case yet.  I've
>> extended some of the debugfs stuff to dump the new information, and I'm
>> trying some of the corner cases (e.g., breaking the linkages in the
>> middle) to see if it all hangs together.
>
>Were there any results of your testing?  Good or bad?

	I did test it quite a bit (and then neglected to follow up).  I
tried various deliberate hash collisions to try and make it fail in a
corner case, but was unable to induce incorrect behavior.

>> 	I am thinking that the layout of the "hash"-ish table is now
>> sufficiently complicated that there should be a comment block somewhere
>> describing what's going on (because I didn't really quite get it until I
>> dumped the whole thing and looked at it).  With this patch, there is one
>> "used" linkage for all of the elements in use, plus some number of "src"
>> linkages, one for each active source hash.  The "src" linkages are also
>> notable in that they are separate from the "assigned" state.
>
>I updated the comments in drivers/net/bonding/bond_alb.h to
>describe the structure.
>
>> >+	 * have a dirrerent mac_src.
>> 
>> 	Typo here; should be "different."
>
>Fixed. 
>Any chance we could finally get this merged?:

	The only issue I see is that a number of added lines run past 80
columns, e.g.,

+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
+			/* ip_src is going to be updated, fix the src hash list */
+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));

+	 * sending out client updates with this IP address and the old MAC address.

+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {

	... and so on.  I did not compile and test this version, just
applied it and inspected it; presumably it is functionally identical to
the prior version.  There's also one typo I noted near the end of the
patch.

	-J

>Bonding in balance-alb mode records information from ARP packets
>passing through the bond in a hash table (rx_hashtbl).
>
>At certain situations (e.g. link change of a slave),
>rlb_update_rx_clients() will send out ARP packets to update ARP
>caches of other hosts on the network to achieve RX load
>balancing.
>
>The problem is that once an IP address is recorded in the hash
>table, it stays there indefinitely. If this IP address is
>migrated to a different host in the network, bonding still sends
>out ARP packets that poison other systems' ARP caches with
>invalid information.
>
>This patch solves this by looking at all incoming ARP packets,
>and checking if the source IP address is one of the source
>addresses stored in the rx_hashtbl. If it is, but the MAC
>addresses differ, the corresponding hash table entries are
>removed. Thus, when an IP address is migrated, the first ARP
>broadcast by its new owner will purge the offending entries of
>rx_hashtbl.
>
>The hash table is hashed by ip_dst. To be able to do the above
>check efficiently (not walking the whole hash table), we need a
>reverse mapping (by ip_src).
>
>I added three new members in struct rlb_client_info:
>   rx_hashtbl[x].src_first will point to the start of a list of
>      entries for which hash(ip_src) == x.
>   The list is linked with src_next and src_prev.
>
>When an incoming ARP packet arrives at rlb_arp_recv()
>rlb_purge_src_ip() can quickly walk only the entries on the
>corresponding lists, i.e. the entries that are likely to contain
>the offending IP address.
>
>To avoid confusion, I renamed these existing fields of struct 
>rlb_client_info:
>	next -> used_next
>	prev -> used_prev
>	rx_hashtbl_head -> rx_hashtbl_used_head
>
>(The current linked list is _not_ a list of hash table
>entries with colliding ip_dst. It's a list of entries that are
>being used; its purpose is to avoid walking the whole hash table
>when looking for used entries.)
>
>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..8505a24 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -84,6 +84,9 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
>
> /* Forward declaration */
> static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp);
>+static void rlb_src_unlink(struct bonding *bond, u32 index);
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash);
>
> static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
> {
>@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
> 	if (!arp)
> 		goto out;
>
>+	/* We received an ARP from arp->ip_src.
>+	 * We might have used this IP address previously (on the bonding host
>+	 * itself or on a system that is bridged together with the bond).
>+	 * However, if arp->mac_src is different than what is stored in
>+	 * rx_hashtbl, some other host is now using the IP and we must prevent
>+	 * sending out client updates with this IP address and the old MAC address.
>+	 * Clean up all hash table entries that have this address as ip_src but
>+	 * have a different mac_src.
>+	 */
>+	rlb_purge_src_ip(bond, arp);
>+
> 	if (arp->op_code == htons(ARPOP_REPLY)) {
> 		/* update rx hash table for this ARP */
> 		rlb_update_entry_from_arp(bond, arp);
>@@ -432,9 +446,9 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
> 	_lock_rx_hashtbl_bh(bond);
>
> 	rx_hash_table = bond_info->rx_hashtbl;
>-	index = bond_info->rx_hashtbl_head;
>+	index = bond_info->rx_hashtbl_used_head;
> 	for (; index != RLB_NULL_INDEX; index = next_index) {
>-		next_index = rx_hash_table[index].next;
>+		next_index = rx_hash_table[index].used_next;
> 		if (rx_hash_table[index].slave == slave) {
> 			struct slave *assigned_slave = rlb_next_rx_slave(bond);
>
>@@ -519,8 +533,8 @@ static void rlb_update_rx_clients(struct bonding *bond)
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		if (client_info->ntt) {
> 			rlb_update_client(client_info);
>@@ -548,8 +562,8 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> 		if ((client_info->slave == slave) &&
>@@ -578,8 +592,8 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
>
> 	_lock_rx_hashtbl(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> 		if (!client_info->slave) {
>@@ -625,6 +639,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 				/* update mac address from arp */
> 				memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
> 			}
>+			memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
>
> 			assigned_slave = client_info->slave;
> 			if (assigned_slave) {
>@@ -647,6 +662,13 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 	assigned_slave = rlb_next_rx_slave(bond);
>
> 	if (assigned_slave) {
>+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
>+			/* ip_src is going to be updated, fix the src hash list */
>+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
>+			rlb_src_unlink(bond, hash_index);
>+			rlb_src_link(bond, hash_src, hash_index);
>+		}
>+
> 		client_info->ip_src = arp->ip_src;
> 		client_info->ip_dst = arp->ip_dst;
> 		/* arp->mac_dst is broadcast for arp reqeusts.
>@@ -654,6 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 		 * upon receiving an arp reply.
> 		 */
> 		memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
>+		memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
> 		client_info->slave = assigned_slave;
>
> 		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
>@@ -669,11 +692,11 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 		}
>
> 		if (!client_info->assigned) {
>-			u32 prev_tbl_head = bond_info->rx_hashtbl_head;
>-			bond_info->rx_hashtbl_head = hash_index;
>-			client_info->next = prev_tbl_head;
>+			u32 prev_tbl_head = bond_info->rx_hashtbl_used_head;
>+			bond_info->rx_hashtbl_used_head = hash_index;
>+			client_info->used_next = prev_tbl_head;
> 			if (prev_tbl_head != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[prev_tbl_head].prev =
>+				bond_info->rx_hashtbl[prev_tbl_head].used_prev =
> 					hash_index;
> 			}
> 			client_info->assigned = 1;
>@@ -740,8 +763,8 @@ static void rlb_rebalance(struct bonding *bond)
> 	_lock_rx_hashtbl_bh(bond);
>
> 	ntt = 0;
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		assigned_slave = rlb_next_rx_slave(bond);
> 		if (assigned_slave && (client_info->slave != assigned_slave)) {
>@@ -759,11 +782,113 @@ static void rlb_rebalance(struct bonding *bond)
> }
>
> /* Caller must hold rx_hashtbl lock */
>+static void rlb_init_table_entry_dst(struct rlb_client_info *entry)
>+{
>+	entry->used_next = RLB_NULL_INDEX;
>+	entry->used_prev = RLB_NULL_INDEX;
>+	entry->assigned = 0;
>+	entry->slave = NULL;
>+	entry->tag = 0;
>+}
>+static void rlb_init_table_entry_src(struct rlb_client_info *entry)
>+{
>+	entry->src_first = RLB_NULL_INDEX;
>+	entry->src_prev = RLB_NULL_INDEX;
>+	entry->src_next = RLB_NULL_INDEX;
>+}
>+
> static void rlb_init_table_entry(struct rlb_client_info *entry)
> {
> 	memset(entry, 0, sizeof(struct rlb_client_info));
>-	entry->next = RLB_NULL_INDEX;
>-	entry->prev = RLB_NULL_INDEX;
>+	rlb_init_table_entry_dst(entry);
>+	rlb_init_table_entry_src(entry);
>+}
>+
>+static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next_index = bond_info->rx_hashtbl[index].used_next;
>+	u32 prev_index = bond_info->rx_hashtbl[index].used_prev;
>+
>+	if (index == bond_info->rx_hashtbl_used_head)
>+		bond_info->rx_hashtbl_used_head = next_index;
>+	if (prev_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[prev_index].used_next = next_index;
>+	if (next_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next_index].used_prev = prev_index;
>+}
>+
>+/* unlink a rlb hash table entry from the src list */
>+static void rlb_src_unlink(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next_index = bond_info->rx_hashtbl[index].src_next;
>+	u32 prev_index = bond_info->rx_hashtbl[index].src_prev;
>+
>+	bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX;
>+
>+	if (next_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next_index].src_prev = prev_index;
>+
>+	if (prev_index == RLB_NULL_INDEX)
>+		return;
>+
>+	/* is prev_index pointing to the head of this list? */
>+	if (bond_info->rx_hashtbl[prev_index].src_first == index)
>+		bond_info->rx_hashtbl[prev_index].src_first = next_index;
>+	else
>+		bond_info->rx_hashtbl[prev_index].src_next = next_index;
>+
>+}
>+
>+static void rlb_delete_table_entry(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+
>+	rlb_delete_table_entry_dst(bond, index);
>+	rlb_init_table_entry_dst(entry);
>+
>+	rlb_src_unlink(bond, index);
>+}
>+
>+/* add the rx_hashtbl[ip_dst_hash] entry to the list
>+ * of entries with identical ip_src_hash
>+ */
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next;
>+
>+	bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash;
>+	next = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+	bond_info->rx_hashtbl[ip_dst_hash].src_next = next;
>+	if (next != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next].src_prev = ip_dst_hash;
>+	bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash;
>+}
>+
>+/* deletes all rx_hashtbl entries with  arp->ip_src if their mac_src does
>+ * not match arp->mac_src */
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 ip_src_hash = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
>+	u32 index;
>+
>+	_lock_rx_hashtbl_bh(bond);
>+
>+	index = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+	while (index != RLB_NULL_INDEX) {
>+		struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+		u32 next_index = entry->src_next;
>+		if (entry->ip_src == arp->ip_src &&
>+		    !ether_addr_equal_64bits(arp->mac_src, entry->mac_src))
>+				rlb_delete_table_entry(bond, index);
>+		index = next_index;
>+	}
>+	_unlock_rx_hashtbl_bh(bond);
> }
>
> static int rlb_initialize(struct bonding *bond)
>@@ -781,7 +906,7 @@ static int rlb_initialize(struct bonding *bond)
>
> 	bond_info->rx_hashtbl = new_hashtbl;
>
>-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> 	for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) {
> 		rlb_init_table_entry(bond_info->rx_hashtbl + i);
>@@ -803,7 +928,7 @@ static void rlb_deinitialize(struct bonding *bond)
>
> 	kfree(bond_info->rx_hashtbl);
> 	bond_info->rx_hashtbl = NULL;
>-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> 	_unlock_rx_hashtbl_bh(bond);
> }
>@@ -815,25 +940,13 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	curr_index = bond_info->rx_hashtbl_head;
>+	curr_index = bond_info->rx_hashtbl_used_head;
> 	while (curr_index != RLB_NULL_INDEX) {
> 		struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]);
>-		u32 next_index = bond_info->rx_hashtbl[curr_index].next;
>-		u32 prev_index = bond_info->rx_hashtbl[curr_index].prev;
>-
>-		if (curr->tag && (curr->vlan_id == vlan_id)) {
>-			if (curr_index == bond_info->rx_hashtbl_head) {
>-				bond_info->rx_hashtbl_head = next_index;
>-			}
>-			if (prev_index != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[prev_index].next = next_index;
>-			}
>-			if (next_index != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[next_index].prev = prev_index;
>-			}
>+		u32 next_index = bond_info->rx_hashtbl[curr_index].used_next;
>
>-			rlb_init_table_entry(curr);
>-		}
>+		if (curr->tag && (curr->vlan_id == vlan_id))
>+			rlb_delete_table_entry(bond, curr_index);
>
> 		curr_index = next_index;
> 	}
>diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
>index 90f140a..de831ba 100644
>--- a/drivers/net/bonding/bond_alb.h
>+++ b/drivers/net/bonding/bond_alb.h
>@@ -94,15 +94,35 @@ struct tlb_client_info {
>
> /* -------------------------------------------------------------------------
>  * struct rlb_client_info contains all info related to a specific rx client
>- * connection. This is the Clients Hash Table entry struct
>+ * connection. This is the Clients Hash Table entry struct.
>+ * Note that this is not a proper hash table; if a new client's IP address
>+ * hash collides with an existing client entry, the old entry is replaced.
>+ *
>+ * There is a linked list (linked by the used_next and used_prev members)
>+ * linking all the used entries of the hash table. This allows updating
>+ * all the clients without walking over all the unused elements of the table.
>+ *
>+ * There are also linked lists of entries with identical hash(ip_src). These
>+ * allow cleaning up the table from ip_src<->mac_src associatins that have

	Typo here, "associations"

>+ * become outdated and would cause sending out invalid ARP updates to the
>+ * network. These are linked by the (src_next and src_prev members).
>  * -------------------------------------------------------------------------
>  */
> struct rlb_client_info {
> 	__be32 ip_src;		/* the server IP address */
> 	__be32 ip_dst;		/* the client IP address */
>+	u8  mac_src[ETH_ALEN];	/* the server MAC address */
> 	u8  mac_dst[ETH_ALEN];	/* the client MAC address */
>-	u32 next;		/* The next Hash table entry index */
>-	u32 prev;		/* The previous Hash table entry index */
>+
>+	/* list of used hash table entries, starting at rx_hashtbl_used_head */
>+	u32 used_next;
>+	u32 used_prev;
>+
>+	/* ip_src based hashing */
>+	u32 src_next;	/* next entry with same hash(ip_src) */
>+	u32 src_prev;	/* prev entry with same hash(ip_src) */
>+	u32 src_first;	/* first entry with hash(ip_src) == this entry's index */
>+
> 	u8  assigned;		/* checking whether this entry is assigned */
> 	u8  ntt;		/* flag - need to transmit client info */
> 	struct slave *slave;	/* the slave assigned to this client */
>@@ -131,7 +151,7 @@ struct alb_bond_info {
> 	int rlb_enabled;
> 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
> 	spinlock_t		rx_hashtbl_lock;
>-	u32			rx_hashtbl_head;
>+	u32			rx_hashtbl_used_head;
> 	u8			rx_ntt;	/* flag - need to transmit
> 					 * to all rx clients
> 					 */
>diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c
>index 2cf084e..6ac855f 100644
>--- a/drivers/net/bonding/bond_debugfs.c
>+++ b/drivers/net/bonding/bond_debugfs.c
>@@ -31,8 +31,8 @@ static int bond_debug_rlb_hash_show(struct seq_file *m, void *v)
>
> 	spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n",
> 			&client_info->ip_src,
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: IPv4 route cache DOS attack
From: 叶雨飞 @ 2012-11-28  1:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1354064474.14302.44.camel@edumazet-glaptop>

Thanks!!! it works, after flushing cache it stays 0.

On Tue, Nov 27, 2012 at 5:01 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-11-27 at 15:15 -0800, 叶雨飞 wrote:
>> Hi,
>>
>> I have a linux router running kernel 3.2  that receive public ingress
>> packets and route them through an GRE tunnel, return packets don't go
>> through it
>>
>> I've recently faced a serious issue with the route cache,  when the
>> router received spoofed source , the route cache will quickly get
>> exhausted (depending on the size of it) and soon the ip dst cache
>> overflow will be printed and network subsystem will hang until
>> restarted.
>>
>> So, my question is, how can I turn off the route cache without
>> recompile the kernel or adding the  patch for removal  in 3.7?  I
>> tried to set
>>
>> echo 0 > /proc/sys/net/ipv4/route/max_size but that has no effect at all.
>>
>> And if some one can share some insight on why when dst cache
>> overflows, the network subsystem hangs, it would be great.
>
> echo -1 >/proc/sys/net/ipv4/rt_cache_rebuild_count
>
>
>

^ permalink raw reply

* Re: [PATCH v2 net-next] sctp: Add support to per-association statistics via a new SCTP_GET_ASSOC_STATS call
From: Vlad Yasevich @ 2012-11-28  1:40 UTC (permalink / raw)
  To: Michele Baldessari
  Cc: linux-sctp, Neil Horman, Thomas Graf, netdev, David S. Miller
In-Reply-To: <20121127220810.GB3869@fante.int.rhx>

On 11/27/2012 05:08 PM, Michele Baldessari wrote:
> Hi Vlad,
>
> thanks a lot for your review.
>
> On Mon, Nov 19, 2012 at 11:01:46AM -0500, Vlad Yasevich wrote:
> <snip>
>>> @@ -1152,8 +1156,11 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>>>   		 */
>>>   		if (sctp_chunk_is_data(chunk))
>>>   			asoc->peer.last_data_from = chunk->transport;
>>> -		else
>>> +		else {
>>>   			SCTP_INC_STATS(net, SCTP_MIB_INCTRLCHUNKS);
>>> +			if (chunk->chunk_hdr->type == SCTP_CID_SACK)
>>> +				asoc->stats.isacks++;
>>> +		}
>>
>> Should the above include asoc->stats.ictrlchunks++; just like ep_bh_rcv()?
>
> Indeed, I will add that.
>
>>>
>>>   		if (chunk->transport)
>>>   			chunk->transport->last_time_heard = jiffies;
>>> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
>>> index 1859e2b..32ab55b 100644
>>> --- a/net/sctp/endpointola.c
>>> +++ b/net/sctp/endpointola.c
>>> @@ -480,8 +480,11 @@ normal:
>>>   		 */
>>>   		if (asoc && sctp_chunk_is_data(chunk))
>>>   			asoc->peer.last_data_from = chunk->transport;
>>> -		else
>>> +		else {
>>>   			SCTP_INC_STATS(sock_net(ep->base.sk), SCTP_MIB_INCTRLCHUNKS);
>>> +			if (asoc)
>>> +				asoc->stats.ictrlchunks++;
>>> +		}
>>>
>>>   		if (chunk->transport)
>>>   			chunk->transport->last_time_heard = jiffies;
>>> diff --git a/net/sctp/input.c b/net/sctp/input.c
>>> index 8bd3c27..54c449b 100644
>>> --- a/net/sctp/input.c
>>> +++ b/net/sctp/input.c
>>> @@ -281,6 +281,8 @@ int sctp_rcv(struct sk_buff *skb)
>>>   		SCTP_INC_STATS_BH(net, SCTP_MIB_IN_PKT_SOFTIRQ);
>>>   		sctp_inq_push(&chunk->rcvr->inqueue, chunk);
>>>   	}
>>> +	if (asoc)
>>> +		asoc->stats.ipackets++;
>>>
>>>   	sctp_bh_unlock_sock(sk);
>>
>> This needs a bit more thought.  Current counting behaves differently
>> depending on whether the user holds a socket lock or not.
>> If the user holds the lock, we'll end counting the packet before it is
>> processed.  If the user isn't holding the lock, we'll count the packet after
>> it is processed.
>
> I see. What do you prefer: use atomic64 for this specific counter or
> since it is a temporary miscount we go ahead and ignore it or do you
> have other approaches in mind?

You could count it in sctp_inq_push...  Would that make sense?

-vlad

^ permalink raw reply

* Re: TCP and reordering
From: David Miller @ 2012-11-28  2:06 UTC (permalink / raw)
  To: saku; +Cc: rick.jones2, netdev
In-Reply-To: <CAAeewD9MtEx4uF6ezbBj7Ci5OzX8VK7p=WQ2TB3PfjmznA4X0w@mail.gmail.com>

From: Saku Ytti <saku@ytti.fi>
Date: Tue, 27 Nov 2012 19:15:20 +0200

> TCP used to be friendly to reordering before fast retransmit
> optimization was implemented.

You're talking about 20 years ago, because that's when fast
retrasnmit was created.

It's not like this got added recently.

And the gains of fast retransmit far outweigh whatever strange
justification would give for reordering packets on purpose.

^ permalink raw reply

* Re: IPv4 route cache DOS attack
From: David Miller @ 2012-11-28  2:14 UTC (permalink / raw)
  To: sunyucong; +Cc: netdev
In-Reply-To: <CAJygYd1OSTfpdbA57q0c6ppMtft3iB7uzo-RCXFOamGp2XVzBg@mail.gmail.com>

We saw your email the other day, do not resend the same exact
question over and over again.

If nobody has time, or wants, to answer you, then you have to simply
accept that.  Repeating your posting only will make things worse
for you, trust me.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Joe Perches @ 2012-11-28  2:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354051475.14302.42.camel@edumazet-glaptop>

On Tue, 2012-11-27 at 13:24 -0800, Eric Dumazet wrote:
> On Tue, 2012-11-27 at 09:23 -0800, Joe Perches wrote:
> > On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > commit 68835aba4d9b (net: optimize INET input path further)
> > > moved some fields used for tcp/udp sockets lookup in the first cache
> > > line of struct sock_common.
> > []
> > > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > > index 5e11905..196ede4 100644
> > > --- a/include/linux/ipv6.h
> > > +++ b/include/linux/ipv6.h
> > > @@ -365,19 +365,21 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
> > >  #endif /* IS_ENABLED(CONFIG_IPV6) */
> > >  
> > >  #define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
> > > +	(((__sk)->sk_hash == (__hash)) &&					\
> > > +	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) &&	\
> > > +	 ((__sk)->sk_family		== AF_INET6)		&&		\
> > 
> > Perhaps these could be |'d together to avoid the test/jump
> > after each comparison by using some bit operations instead.
> > 
> > > +	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&		\
> > > +	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&		\
> > > +	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
> > > +	 net_eq(sock_net(__sk), (__net)))
> > 
> But it would be wrong.

OK, so it's an and not an or.  Duh.

Still, the logical tests that are likely to be in the same
cacheline could be ANDed together to avoid a test and jump.

Perhaps this:

It shrinks the object a trivial bit and could be a tiny bit
faster too.

(allyesconfig x86/32)
$ size net/ipv6/inet6_hashtables.o*
   text	   data	    bss	    dec	    hex	filename
   6277	    962	   1832	   9071	   236f	net/ipv6/inet6_hashtables.o.new
   6381	    962	   1880	   9223	   2407	net/ipv6/inet6_hashtables.o.old

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 196ede4..91870de 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -364,22 +364,24 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
 #define inet_v6_ipv6only(__sk)		0
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-#define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash)) &&					\
-	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) &&	\
-	 ((__sk)->sk_family		== AF_INET6)		&&		\
-	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&		\
-	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&		\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
+#define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif) \
+	((((__sk)->sk_hash == (__hash)) &				\
+	  ((__sk)->sk_family == AF_INET6)) &&				\
+	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) && \
+	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr)) &&		\
+	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr)) &&	\
+	 (!((__sk)->sk_bound_dev_if) ||					\
+	  ((__sk)->sk_bound_dev_if == (__dif))) &&			\
 	 net_eq(sock_net(__sk), (__net)))
 
 #define INET6_TW_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif) \
-	(((__sk)->sk_hash == (__hash)) &&					\
-	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports))	&&	\
-	 ((__sk)->sk_family	       == PF_INET6)			&&	\
-	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr)))	&&	\
-	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr))) &&	\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
+	((((__sk)->sk_hash == (__hash)) &				\
+	  ((__sk)->sk_family == PF_INET6)) &&				\
+	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports)) && \
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr))) && \
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr))) && \
+	 (!((__sk)->sk_bound_dev_if) ||					\
+	  ((__sk)->sk_bound_dev_if == (__dif))) &&			\
 	 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _IPV6_H */

^ permalink raw reply related

* RE: Re: RTL 8169  linux driver question
From: hayeswang @ 2012-11-28  2:32 UTC (permalink / raw)
  To: 'Francois Romieu', 'David Laight'
  Cc: 'Stéphane ANCELOT', netdev, sancelot,
	'nic_swsd'
In-Reply-To: <20121127224605.GA10228@electric-eye.fr.zoreil.com>

Francois Romieu [mailto:romieu@fr.zoreil.com] 
[...]
> Something like the patch below against net-next could help once I will
> have tested it.
> 
> I completely guessed the Tx usec scale factor at gigabit 
> speed (125 us, 
> 100 us, disabled, who knows ?) and I have no idea which 
> specific chipsets
> it should work with.
> 
> Hayes, may I expect some hindsight regarding:
> 1 - the availability of the IntrMitigate (0xe2) register through the
>     8169, 8168 and 810x line of chipsets

8169, 8168, and 8136(810x) serial chipsets support it.

> 2 - the Tx timer unit at gigabit speed

The unit of the timer depneds on both the speed and the setting of CPlusCmd
(0xe0) bit 1 and bit 0.

For 8169
bit[1:0] \ speed	1000M		100M		10M
0 0		320ns		2.56us		40.96us
0 1		2.56us		20.48us		327.7us
1 0		5.12us		40.96us		655.4us
1 1		10.24us		81.92us		1.31ms

For the other
bit[1:0] \ speed	1000M		100M		10M
0 0		5us		2.56us		40.96us
0 1		40us		20.48us		327.7us
1 0		80us		40.96us		655.4us
1 1		160us		81.92us		1.31ms

 
Best Regards,
Hayes

^ permalink raw reply

* Re: IPv4 route cache DOS attack
From: 叶雨飞 @ 2012-11-28  2:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20121127.211403.2194859172416122879.davem@davemloft.net>

my first email is to lartc@ and this one is to netdev@ .

On Tue, Nov 27, 2012 at 6:14 PM, David Miller <davem@davemloft.net> wrote:
>
> We saw your email the other day, do not resend the same exact
> question over and over again.
>
> If nobody has time, or wants, to answer you, then you have to simply
> accept that.  Repeating your posting only will make things worse
> for you, trust me.
>
> Thank you.

^ permalink raw reply

* Re: [PATCH v6 5/6] PM / Runtime: force memory allocation with no I/O during Runtime PM callbcack
From: Ming Lei @ 2012-11-28  3:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <1354069667.BsTEhItmLz@vostro.rjw.lan>

On Wed, Nov 28, 2012 at 5:24 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> Please don't duplicate code this way.
>
> You can move that whole thing to rpm_callback().  Yes, you'll probably need to
> check dev->power.memalloc_noio twice in there, but that's OK.

Good idea, I will update it in v7.

Thanks,
--
Ming Lei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Ben Hutchings @ 2012-11-28  3:12 UTC (permalink / raw)
  To: Joe Perches; +Cc: Eric Dumazet, David Miller, netdev, Ling Ma
In-Reply-To: <1354069414.8918.13.camel@joe-AO722>

On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:
> On Tue, 2012-11-27 at 13:24 -0800, Eric Dumazet wrote:
> > On Tue, 2012-11-27 at 09:23 -0800, Joe Perches wrote:
> > > On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > 
> > > > commit 68835aba4d9b (net: optimize INET input path further)
> > > > moved some fields used for tcp/udp sockets lookup in the first cache
> > > > line of struct sock_common.
> > > []
> > > > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > > > index 5e11905..196ede4 100644
> > > > --- a/include/linux/ipv6.h
> > > > +++ b/include/linux/ipv6.h
> > > > @@ -365,19 +365,21 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
> > > >  #endif /* IS_ENABLED(CONFIG_IPV6) */
> > > >  
> > > >  #define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
> > > > +	(((__sk)->sk_hash == (__hash)) &&					\
> > > > +	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) &&	\
> > > > +	 ((__sk)->sk_family		== AF_INET6)		&&		\
> > > 
> > > Perhaps these could be |'d together to avoid the test/jump
> > > after each comparison by using some bit operations instead.
> > > 
> > > > +	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&		\
> > > > +	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&		\
> > > > +	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
> > > > +	 net_eq(sock_net(__sk), (__net)))
> > > 
> > But it would be wrong.
> 
> OK, so it's an and not an or.  Duh.
[...]

The way to combine these sorts of comparisons is along the lines of:

(((left->a ^ right->a) |
  (left->b ^ right->b) |
  ...) == 0)

But when there are big-endian types involved, sparse is likely to
complain about combining them.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-28  3:13 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354051401.14302.40.camel@edumazet-glaptop>

On Tue, 2012-11-27 at 13:23 -0800, Eric Dumazet wrote:
> On Tue, 2012-11-27 at 19:05 +0000, Ben Hutchings wrote:
> > On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> 
> > >  struct sock_common {
> > > -	/* skc_daddr and skc_rcv_saddr must be grouped :
> > > -	 * cf INET_MATCH() and INET_TW_MATCH()
> > > +	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
> > > +	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
> > 
> > __aligned(8)?
> 
> Nope, only on 64 bit this requirement exists (since a long time)
> 
> I am not sure we want complexity on this.
> 
> And we dont want holes to be automatically added here neither.

Hmm, maybe the following could be the right way, as we did
for skc_hash/skc_u16hashes


 struct sock_common {
-       /* skc_daddr and skc_rcv_saddr must be grouped :
-        * cf INET_MATCH() and INET_TW_MATCH()
+       /* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+        * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
         */
-       __be32                  skc_daddr;
-       __be32                  skc_rcv_saddr;
-
+       union {
+               unsigned long   skc_laddr;
+               struct {
+                       __be32  skc_daddr;
+                       __be32  skc_rcv_saddr;
+               };
+       };
        union  {
                unsigned int    skc_hash;
                __u16           skc_u16hashes[2];
        };
+       /* skc_dport && skc_num must be grouped as well */
+       union {
+               unsigned int    skc_ports;
+               struct {
+                       __be16  skc_dport;
+                       __u16   skc_num;
+               };
+       };
+
        unsigned short          skc_family;
        volatile unsigned char  skc_state;
        unsigned char           skc_reuse;

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Joe Perches @ 2012-11-28  3:31 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Eric Dumazet, David Miller, netdev, Ling Ma
In-Reply-To: <1354072351.2701.41.camel@bwh-desktop.uk.solarflarecom.com>

On Wed, 2012-11-28 at 03:12 +0000, Ben Hutchings wrote:
> On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:
> > OK, so it's an and not an or.  Duh.
> [...]
> 
> The way to combine these sorts of comparisons is along the lines of:
> 
> (((left->a ^ right->a) |
>   (left->b ^ right->b) |
>   ...) == 0)
> 
> But when there are big-endian types involved, sparse is likely to
> complain about combining them.

I believe there's only the 2 items that could be combined
for cacheline purposes so using 2 logical tests with AND
seems more readable.  Maybe a single combined test would
be faster.  I don't have equipment at hand to test it.

If you prefer I supposed it could be converted.

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Ben Hutchings @ 2012-11-28  3:55 UTC (permalink / raw)
  To: Joe Perches; +Cc: Eric Dumazet, David Miller, netdev, Ling Ma
In-Reply-To: <1354073514.8918.22.camel@joe-AO722>

On Tue, 2012-11-27 at 19:31 -0800, Joe Perches wrote:
> On Wed, 2012-11-28 at 03:12 +0000, Ben Hutchings wrote:
> > On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:
> > > OK, so it's an and not an or.  Duh.
> > [...]
> > 
> > The way to combine these sorts of comparisons is along the lines of:
> > 
> > (((left->a ^ right->a) |
> >   (left->b ^ right->b) |
> >   ...) == 0)
> > 
> > But when there are big-endian types involved, sparse is likely to
> > complain about combining them.
> 
> I believe there's only the 2 items that could be combined
> for cacheline purposes so using 2 logical tests with AND
> seems more readable.  Maybe a single combined test would
> be faster.  I don't have equipment at hand to test it.
> 
> If you prefer I supposed it could be converted.

I don't particularly care, and I gave up this trick myself because it
didn't seem worth fighting sparse.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Ming Lei @ 2012-11-28  3:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <5434404.G1ERYjuorE@vostro.rjw.lan>

On Wed, Nov 28, 2012 at 5:19 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Saturday, November 24, 2012 08:59:14 PM Ming Lei wrote:
>> The patch introduces the flag of memalloc_noio in 'struct dev_pm_info'
>> to help PM core to teach mm not allocating memory with GFP_KERNEL
>> flag for avoiding probable deadlock.
>>
>> As explained in the comment, any GFP_KERNEL allocation inside
>> runtime_resume() or runtime_suspend() on any one of device in
>> the path from one block or network device to the root device
>> in the device tree may cause deadlock, the introduced
>> pm_runtime_set_memalloc_noio() sets or clears the flag on
>> device in the path recursively.
>>
>> Cc: Alan Stern <stern@rowland.harvard.edu>
>> Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>> v5:
>>       - fix code style error
>>       - add comment on clear the device memalloc_noio flag
>> v4:
>>       - rename memalloc_noio_resume as memalloc_noio
>>       - remove pm_runtime_get_memalloc_noio()
>>       - add comments on pm_runtime_set_memalloc_noio
>> v3:
>>       - introduce pm_runtime_get_memalloc_noio()
>>       - hold one global lock on pm_runtime_set_memalloc_noio
>>       - hold device power lock when accessing memalloc_noio_resume
>>         flag suggested by Alan Stern
>>       - implement pm_runtime_set_memalloc_noio without recursion
>>         suggested by Alan Stern
>> v2:
>>       - introduce pm_runtime_set_memalloc_noio()
>> ---
>>  drivers/base/power/runtime.c |   60 ++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/pm.h           |    1 +
>>  include/linux/pm_runtime.h   |    3 +++
>>  3 files changed, 64 insertions(+)
>>
>> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
>> index 3148b10..3e198a0 100644
>> --- a/drivers/base/power/runtime.c
>> +++ b/drivers/base/power/runtime.c
>> @@ -124,6 +124,66 @@ unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
>>  }
>>  EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);
>>
>> +static int dev_memalloc_noio(struct device *dev, void *data)
>> +{
>> +     return dev->power.memalloc_noio;
>> +}
>> +
>> +/*
>> + * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
>> + * @dev: Device to handle.
>> + * @enable: True for setting the flag and False for clearing the flag.
>> + *
>> + * Set the flag for all devices in the path from the device to the
>> + * root device in the device tree if @enable is true, otherwise clear
>> + * the flag for devices in the path whose siblings don't set the flag.
>> + *
>
> Please use counters instead of walking the whole path every time.  Ie. in
> addition to the flag add a counter to store the number of the device's
> children having that flag set.

Thanks for your review.

IMO, pm_runtime_set_memalloc_noio() is only called in
probe() and release() of block device and network device, which is
in a very infrequent path, so I am wondering if it is worthy of introducing
another counter for all devices.

Also looks the current implementation of pm_runtime_set_memalloc_noio()
is simple and clean enough with the flag, IMO.

> I would use the flag only to store the information that
> pm_runtime_set_memalloc_noio(dev, true) has been run for this device directly
> and I'd use a counter for everything else.
>
> That is, have power.memalloc_count that would be incremented when (1)
> pm_runtime_set_memalloc_noio(dev, true) is called for that device and (2) when
> power.memalloc_count for one of its children changes from 0 to 1 (and
> analogously for decrementation).  Then, check the counter in rpm_callback().

Sorry, could you explain in a bit detail why we need the counter? Looks only
checking the flag in rpm_callback() is enough, doesn't it?

>
> Besides, don't you need to check children for the arg device itself?

It isn't needed since the children of network/block device can't be
involved of the deadlock in runtime PM path.

Also, the function is only called by network device or block device
subsystem, both the two kind of device are class device and should
have no children.

>
>> + * The function should only be called by block device, or network
>> + * device driver for solving the deadlock problem during runtime
>> + * resume/suspend:
>> + *
>> + *     If memory allocation with GFP_KERNEL is called inside runtime
>> + *     resume/suspend callback of any one of its ancestors(or the
>> + *     block device itself), the deadlock may be triggered inside the
>> + *     memory allocation since it might not complete until the block
>> + *     device becomes active and the involed page I/O finishes. The
>> + *     situation is pointed out first by Alan Stern. Network device
>> + *     are involved in iSCSI kind of situation.
>> + *
>> + * The lock of dev_hotplug_mutex is held in the function for handling
>> + * hotplug race because pm_runtime_set_memalloc_noio() may be called
>> + * in async probe().
>> + *
>> + * The function should be called between device_add() and device_del()
>> + * on the affected device(block/network device).
>> + */
>> +void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
>> +{
>> +     static DEFINE_MUTEX(dev_hotplug_mutex);
>
> What's the mutex for?

It is for avoiding hotplug race, for example, without the mutex,
another child may set the flag between the time device_for_each_child()
runs and the next loop iteration in pm_runtime_set_memalloc_noio(false).

Thanks,
--
Ming Lei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-28  4:11 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354069414.8918.13.camel@joe-AO722>

On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:

> Still, the logical tests that are likely to be in the same
> cacheline could be ANDed together to avoid a test and jump.

The point of having the cond jump on sk_hash/hash was that in one
compare, we catch the yes/no status with 99.999999 % success rate.

All the following compares are predicted by the cpu and essentially are
free. Adding the AND or OR will basically have the same cpu cost.

If we wanted to do a full test of all tuple fields and a single
conditional jump, we would not have to include hash test at all.

(If the 4-tuple matches, then sk_hash/hash value _must_ be the same by
definition)

Note its quite different from the optimization we did in
ipv6_addr_equal(), as it allowed fewer memory loads and instructions.

I would say this can come later, as the meat of my patch was about
avoiding a full cache line miss, which is far more expensive than any
tricks we can even think about.

Note it will be hard to actually measure any further gains, since I did
TCP_RR tests (200 threads) and the lookup cost went from 1.4 % to 0.8 %
of the grand total, mostly dominated by the atomic to increase socket
refcount.

^ permalink raw reply

* [RFC PATCH] 8139cp: properly support change of MTU values
From: John Greene @ 2012-11-27 20:08 UTC (permalink / raw)
  To: netdev

The 8139cp driver has a change_mtu function that has not been
enabled since the dawn of the git repository. However, the
generic eth_change_mtu is not used in its place, so that
invalid MTU values can be set on the interface.

Original patch salvages the broken code for the single case of
setting the MTU while the interface is down, which is safe
and also includes the range check.  Now enhanced to support up
or down interface.

Original patch from
http://lkml.indiana.edu/hypermail/linux/kernel/1202.2/00770.html

Testing: has been test on virtual 8139cp setup without issue,
awaiting real hardware and retest again, but wanted to post now.

Signed-Off-By: "John Greene" <jogreene@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 drivers/net/ethernet/realtek/8139cp.c | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 6cb96b4..7847c83 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -1226,12 +1226,9 @@ static void cp_tx_timeout(struct net_device *dev)
 	spin_unlock_irqrestore(&cp->lock, flags);
 }
 
-#ifdef BROKEN
 static int cp_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct cp_private *cp = netdev_priv(dev);
-	int rc;
-	unsigned long flags;
 
 	/* check for invalid MTU, according to hardware limits */
 	if (new_mtu < CP_MIN_MTU || new_mtu > CP_MAX_MTU)
@@ -1244,22 +1241,11 @@ static int cp_change_mtu(struct net_device *dev, int new_mtu)
 		return 0;
 	}
 
-	spin_lock_irqsave(&cp->lock, flags);
-
-	cp_stop_hw(cp);			/* stop h/w and free rings */
-	cp_clean_rings(cp);
-
+	/* network IS up, close it, reset MTU, and come up again. */
+	cp_close(dev);
 	dev->mtu = new_mtu;
-	cp_set_rxbufsize(cp);		/* set new rx buf size */
-
-	rc = cp_init_rings(cp);		/* realloc and restart h/w */
-	cp_start_hw(cp);
-
-	spin_unlock_irqrestore(&cp->lock, flags);
-
-	return rc;
+	return cp_open(dev);
 }
-#endif /* BROKEN */
 
 static const char mii_2_8139_map[8] = {
 	BasicModeCtrl,
@@ -1835,9 +1821,7 @@ static const struct net_device_ops cp_netdev_ops = {
 	.ndo_start_xmit		= cp_start_xmit,
 	.ndo_tx_timeout		= cp_tx_timeout,
 	.ndo_set_features	= cp_set_features,
-#ifdef BROKEN
 	.ndo_change_mtu		= cp_change_mtu,
-#endif
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= cp_poll_controller,
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Ming Lei @ 2012-11-28  4:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <5434404.G1ERYjuorE@vostro.rjw.lan>

On Wed, Nov 28, 2012 at 5:19 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> Please use counters instead of walking the whole path every time.  Ie. in
> addition to the flag add a counter to store the number of the device's
> children having that flag set.

Even though counter is added, walking the whole path can't be avoided too,
and may be a explicit walking or recursion, because pm_runtime_set_memalloc_noio
is required to set or clear the flag(or increase/decrease the counter) of
devices in the whole path.

Thanks,
--
Ming Lei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH 2/2] bridge: export multicast database via netlink
From: Cong Wang @ 2012-11-28  4:38 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer,
	Stephen Hemminger, David S. Miller
In-Reply-To: <20121127115905.GA16701@casper.infradead.org>

On Tue, 2012-11-27 at 11:59 +0000, Thomas Graf wrote:
> On 11/27/12 at 05:49pm, Cong Wang wrote:
> > +static int br_rports_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
> > +			       u32 seq, struct net_device *dev)
> > +{
> > +	struct net_bridge *br = netdev_priv(dev);
> > +	struct net_bridge_port *p;
> > +	struct hlist_node *n;
> > +	struct nlattr *nest, *nest2;
> > +
> > +	if (!br->multicast_router || hlist_empty(&br->router_list)) {
> > +		printk(KERN_INFO "no router on bridge\n");
> > +		return 0;
> > +	}
> > +
> > +	nest = nla_nest_start(skb, MDBA_ROUTER);
> > +	if (nest == NULL)
> > +		return -EMSGSIZE;
> > +	nest2 = nla_nest_start(skb, MDBA_MDB_BRPORT);
> > +	if (nest2 == NULL)
> > +		goto fail;
> > +
> > +	hlist_for_each_entry_rcu(p, n, &br->router_list, rlist) {
> > +		if (p && nla_put_u16(skb, MDBA_BRPORT_NO, p->port_no)) {
> > +			nla_nest_cancel(skb, nest2);
> > +			goto fail;
> > +		}
> > +	}
> > +
> > +	nla_nest_end(skb, nest2);
> > +	nla_nest_end(skb, nest);
> 
> I would simplify the MDBA_ROUTER attribute to a u16[len(br->router_list)]. If
> we ever need something more complex we can retire the MDBA_ROUTER
> attribute and replace it with something newer.

Makes sense, will do.

I wanted to reserve some for adding new attributes to MDBA_ROUTER, but
so far it is not necessary.

> 
> > +	nest = nla_nest_start(skb, MDBA_MDB);
> > +	if (nest == NULL)
> > +		return -EMSGSIZE;
> > +
> > +	for (i = 0; i < mdb->max; i++) {
> > +		struct hlist_node *h;
> > +		struct net_bridge_mdb_entry *mp;
> > +		struct net_bridge_port_group *p, **pp;
> > +		struct net_bridge_port *port;
> > +
> > +		hlist_for_each_entry_rcu(mp, h, &mdb->mhash[i], hlist[mdb->ver]) {
> > +			if (nla_put_be32(skb, MDBA_MDB_MCADDR, mp->addr.u.ip4))
> > +				goto fail;
> > +
> > +			nest2 = nla_nest_start(skb, MDBA_MDB_BRPORT);
> > +			if (nest2 == NULL)
> > +				goto fail;
> 
> What if you can't fit all theh hash entries into a single netlink
> message? You need to allow splitting theh hash across multiple
> messages. Therefore I suggest that you add a container attribute
> for each mdb_entry like this:
> 
> MDBA_MDB = {
>   1 = {
>     MDBA_MDB_MCADDR = { ... },
>     MDBA_MDB_BRPORT = { ... },
>   },
>   2 = {
>     MDBA_MDB_MCADDR = { ... },
>     MDBA_MDB_BR_PORT = { ... },
>   },
>   [...]
> }

I thought the user-space will reassemble multiple-part messages, but I
probably misunderstand this...

Actually I was trying to reduce the size of the netlink message. :)

> 
> > +static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
> > +{
> > +	struct net_device *dev;
> > +	struct net *net = sock_net(skb->sk);
> > +	struct nlmsghdr *nlh;
> > +	u32 seq = cb->nlh->nlmsg_seq;
> > +	int idx = 0, s_idx;
> > +
> > +	s_idx = cb->args[0];
> > +
> > +	rcu_read_lock();
> > +	cb->seq = net->dev_base_seq;
> 
> Using RCU read lock is OK but that means you must be prepared to
> handle additions/removals to the table between dump iterations
> and thus you must introduce a seq counter bumped on each table
> change and add it to the dev_base_seq above.

Yeah, as you told me before. I will make another patch for this. Thanks
for reminding!

> 
> > +	for_each_netdev_rcu(net, dev) {
> > +		if (dev->priv_flags & IFF_EBRIDGE) {
> > +			struct br_port_msg *bpm;
> > +
> > +			if (idx < s_idx)
> > +				goto cont;
> > +
> > +			nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
> > +					seq, RTM_GETMDB,
> > +					sizeof(*bpm), NLM_F_MULTI);
> > +			if (nlh == NULL)
> > +				break;
> > +
> > +			bpm = nlmsg_data(nlh);
> > +			bpm->ifindex = dev->ifindex;
> > +			if (br_mdb_fill_info(skb, cb, seq, dev) < 0) {
> > +				printk(KERN_INFO "br_mdb_fill_info failed\n");
> > +				goto fail;
> 
> As stated above I believe that you should allow for hashtable to be
> split across multiple messages so you need to store the hash table
> offset as well and properly finalize and send the message on error
> here.

You mean saving the offset into cb->args[1]?

Thanks!

^ permalink raw reply

* Re: [RFC PATCH 2/2] bridge: export multicast database via netlink
From: Cong Wang @ 2012-11-28  5:19 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer,
	Stephen Hemminger, David S. Miller
In-Reply-To: <20121127115905.GA16701@casper.infradead.org>

On Tue, 2012-11-27 at 11:59 +0000, Thomas Graf wrote:
> 
> Using RCU read lock is OK but that means you must be prepared to
> handle additions/removals to the table between dump iterations
> and thus you must introduce a seq counter bumped on each table
> change and add it to the dev_base_seq above.

Something like this?

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index dc73091..4c3b097 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -74,6 +74,8 @@ static int br_mdb_fill_info(struct sk_buff *skb,
struct netlink_callback *cb,
 		return 0;
 	}
 
+	cb->seq = mdb->seq;
+
 	nest = nla_nest_start(skb, MDBA_MDB);
 	if (nest == NULL)
 		return -EMSGSIZE;
@@ -126,7 +128,6 @@ static int br_mdb_dump(struct sk_buff *skb, struct
netlink_callback *cb)
 	s_idx = cb->args[0];
 
 	rcu_read_lock();
-	cb->seq = net->dev_base_seq;
 
 	for_each_netdev_rcu(net, dev) {
 		if (dev->priv_flags & IFF_EBRIDGE) {
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 1e6ce50..ccf5cfb 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -322,6 +322,7 @@ static int br_mdb_rehash(struct
net_bridge_mdb_htable __rcu **mdbp, int max,
 
 	mdb->size = old ? old->size : 0;
 	mdb->ver = old ? old->ver ^ 1 : 0;
+	mdb->seq = old ? (old->seq + 1): 0;
 
 	if (!old || elasticity)
 		get_random_bytes(&mdb->secret, sizeof(mdb->secret));
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index a02921e..2f5f5b8 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -105,6 +105,7 @@ struct net_bridge_mdb_htable
 	u32				max;
 	u32				secret;
 	u32				ver;
+	u32				seq;
 };
 
 struct net_bridge_port

^ permalink raw reply related

* Re: [net-next RFC v2] net_cls: traffic counter based on classification control cgroup
From: Alexey Perevalov @ 2012-11-28  5:21 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Glauber Costa, netdev-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <50B4B9E2.4030200-kQCPcA+X3s7YtjvyW6yDsg@public.gmane.org>

On 11/27/2012 05:02 PM, Daniel Wagner wrote:
> Hi Alexey,
>
> On 27.11.2012 12:03, Glauber Costa wrote:
>> On 11/27/2012 02:56 PM, Alexey Perevalov wrote:
>>> Hello.
>>>
>>> It's second version of patch I already sent to netdev.
>>>
>>> The main goal of this patch it's counting traffic for process placed to
>>> net_cls cgroup (ingress and egress).
>>> It's based on res_counters and holds counter per network interfaces.
>>>
>>> Description of patch.
>>> It handles packets in net/core/dev.c for egress and in
>>> /net/ipv4/tcp.c|udp.c for ingress.
>>> These places were chosen because we need to know also network interface.
>>>
>>> Cgroup fs interface provides following files additional to existing
>>> net_cls files:
>>> net_cls.ifacename.usage_in_bytes
>>> Containing rcv/snd lines.
>>> Also this patch adds to net_cls ability to handle a network device
>>> registration.
>>>
>>> It could be included or excluded in compile time.
>>> I moved the menu entry for "Control group classifier" from network/QoS to
>>> General Option/Control Group.
>>>
>>> I'm waiting for you comments.
>>>
>> Daniel Wagner is working on something a lot similar.
> Yes, basically what I try to do is explained by this excellent article
>
> https://lwn.net/Articles/523058/
I read articles and agreed with aspects.
But problem of selecting preferred network for application can be solved 
using netprio cgroup.


> The short version: Per application routing and statistics.
>
> I have two PoC implementation doing this. Both implementation have the same key
> idea which is to set SO_MARK per application. The routing and statistics would
> then be done by a bunch iptables rules.
>
> In the first implementation extends net_cls to set SO_MARK:
>
> void sock_update_classid(struct sock *sk, struct task_struct *task)
>   {
>          u32 classid;
> +       u32 mark;
>   
>          classid = task_cls_classid(task);
>          if (classid != sk->sk_classid)
>                  sk->sk_classid = classid;
> +
> +       mark = task_cls_mark(task);
> +       if (mark != sk->sk_mark)
> +               sk->sk_mark = mark;
>   }
>
> The second implementation is adding a new iptables matcher which matches
> on LSM contexts. Then you can do something like this:
>
> iptables -t mangle -A OUTPUT -m secmark --secctx unconfined_u:unconfined_r:foo_t:s0-s0:c0.c1023 -j MARK --set-mark 200
As I understand in LSM context it works for egress and ingress.

>> Maybe you should be in contact, in case you are not yet.
>>
>> A few general comments:
>> 1) res_counters are incredibly expensive. If you are more interested in
>> counting than you are in limiting, they may not be your best choice.
You right, I have a plan now to limit traffic too here.
Or as a variant in QoS in this case atomic is better here.

>> 2) When Daniel exposed his use case to me, it gave me the impression
>> that "counting traffic" is something that is totally doable by having a
>> dedicated interface in a separate namespace. Basically, we already count
>> traffic (rx and tx) for all interfaces anyway, so it suggests that it
>> could be an interesting way to see the problem.
> Moving applications into separate net namespaces is for sure a valid solution.
> Though there is a one drawback in this approach. The namespaces need to be
> attached to a bridge and then some NATting. That means every application
> would get it's own IP address. This might be okay for your certain use
> cases but I am still trying to work around this. Glauber and I had some
> discussion about this and he suggested to allow the physical networking
> device to be attached to several namespaces (e.g. via macvlan). Every
> namespace would get the same IP address. Unfortunately, this would result in
> the same mess as several physical devices on a network get the same
> IP address assigned.
Is I truly understand what to make statistics works we need to put 
process to separate namespace?
Approach to keep counter in cgroup hasn't such side effects, but it has 
another ).

>   
>
>> AFAIK, Daniel is still measuring this. But it would be great to know if
>> that could work for your use case as well.
> I have not started to measure :(
>
> cheers,
> daniel
>
>
Thank you Daniel and Glauber!

-- 
BR
Alexey

^ permalink raw reply

* Re: performance regression on HiperSockets depending on MTU size
From: Cong Wang @ 2012-11-28  5:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Network Developers
In-Reply-To: <1353998800.7553.873.camel@edumazet-glaptop>

On Tue, Nov 27, 2012 at 2:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-11-27 at 06:21 +0000, Cong Wang wrote:
>
>> Eric,
>>
>> Do you have a full list of such commits? I am trying to backport TSQ
>> to 2.6.32, and of course I don't want to miss these commits either.
>
> I dont think there are other known issues.
>
> mlx4 had a 'problem' because only recently we removed the skb_orphan()
> call it used to do in its start_xmit() function.
>
> I remember David had to revert BQL on NIU driver, but NIU does the
> skb_orphan() call as well so TSQ is basically disabled.
>
>


Good news! Thanks!

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox