Netdev List


* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-28  3:13 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354051401.14302.40.camel@edumazet-glaptop>

On Tue, 2012-11-27 at 13:23 -0800, Eric Dumazet wrote:
> On Tue, 2012-11-27 at 19:05 +0000, Ben Hutchings wrote:
> > On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> 
> > >  struct sock_common {
> > > -	/* skc_daddr and skc_rcv_saddr must be grouped :
> > > -	 * cf INET_MATCH() and INET_TW_MATCH()
> > > +	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
> > > +	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
> > 
> > __aligned(8)?
> 
> Nope, this requirement exists only on 64-bit (and has for a long time).
> 
> I am not sure we want complexity on this.
> 
> And we don't want holes to be automatically added here either.

Hmm, maybe the following would be the right way, as we did
for skc_hash/skc_u16hashes:


 struct sock_common {
-       /* skc_daddr and skc_rcv_saddr must be grouped :
-        * cf INET_MATCH() and INET_TW_MATCH()
+       /* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+        * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
         */
-       __be32                  skc_daddr;
-       __be32                  skc_rcv_saddr;
-
+       union {
+               unsigned long   skc_laddr;
+               struct {
+                       __be32  skc_daddr;
+                       __be32  skc_rcv_saddr;
+               };
+       };
        union  {
                unsigned int    skc_hash;
                __u16           skc_u16hashes[2];
        };
+       /* skc_dport && skc_num must be grouped as well */
+       union {
+               unsigned int    skc_ports;
+               struct {
+                       __be16  skc_dport;
+                       __u16   skc_num;
+               };
+       };
+
        unsigned short          skc_family;
        volatile unsigned char  skc_state;
        unsigned char           skc_reuse;
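
The layout idea in the patch above can be tried out in user space. The following is only a minimal sketch, assuming that the point of the unsigned long member is to force the address pair onto a long-aligned boundary so that both addresses can be compared as one 64-bit value; struct addr_pair and addr_pair_match() are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the first union in struct sock_common. */
struct addr_pair {
	union {
		unsigned long laddr;	/* never read; only raises the alignment */
		struct {
			uint32_t daddr;
			uint32_t rcv_saddr;
		};
	};
};

/* Both addresses must fit in one 64-bit word for the trick to work. */
static_assert(sizeof(struct addr_pair) == 8, "address pair must be 8 bytes");

/* Compare both addresses at once instead of field by field. */
static bool addr_pair_match(const struct addr_pair *p,
			    uint32_t daddr, uint32_t rcv_saddr)
{
	struct addr_pair want;
	uint64_t a, b;

	want.daddr = daddr;
	want.rcv_saddr = rcv_saddr;

	/* memcpy keeps the sketch free of aliasing problems. */
	memcpy(&a, p, sizeof(a));
	memcpy(&b, &want, sizeof(b));
	return a == b;
}
```

Because the union member is an unsigned long, the pair is 8-byte aligned only where long is 8 bytes wide, and no padding is added on 32-bit builds.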

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Ben Hutchings @ 2012-11-28  3:12 UTC (permalink / raw)
  To: Joe Perches; +Cc: Eric Dumazet, David Miller, netdev, Ling Ma
In-Reply-To: <1354069414.8918.13.camel@joe-AO722>

On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:
> On Tue, 2012-11-27 at 13:24 -0800, Eric Dumazet wrote:
> > On Tue, 2012-11-27 at 09:23 -0800, Joe Perches wrote:
> > > On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > 
> > > > commit 68835aba4d9b (net: optimize INET input path further)
> > > > moved some fields used for tcp/udp sockets lookup in the first cache
> > > > line of struct sock_common.
> > > []
> > > > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > > > index 5e11905..196ede4 100644
> > > > --- a/include/linux/ipv6.h
> > > > +++ b/include/linux/ipv6.h
> > > > @@ -365,19 +365,21 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
> > > >  #endif /* IS_ENABLED(CONFIG_IPV6) */
> > > >  
> > > >  #define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
> > > > +	(((__sk)->sk_hash == (__hash)) &&					\
> > > > +	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) &&	\
> > > > +	 ((__sk)->sk_family		== AF_INET6)		&&		\
> > > 
> > > Perhaps these could be |'d together to avoid the test/jump
> > > after each comparison by using some bit operations instead.
> > > 
> > > > +	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&		\
> > > > +	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&		\
> > > > +	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
> > > > +	 net_eq(sock_net(__sk), (__net)))
> > > 
> > But it would be wrong.
> 
> OK, so it's an and not an or.  Duh.
[...]

The way to combine these sorts of comparisons is along the lines of:

(((left->a ^ right->a) |
  (left->b ^ right->b) |
  ...) == 0)

But when there are big-endian types involved, sparse is likely to
complain about combining them.
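
A minimal user-space sketch of that pattern, using plain integer fields so that no endian annotations are involved. struct key and both function names are hypothetical and only illustrate the XOR/OR form next to the usual chain of && tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup key; the fields are plain integers on purpose. */
struct key {
	uint32_t hash;
	uint32_t ports;
	uint16_t family;
};

/* Usual form: each && may compile to a separate test and jump. */
static bool key_equal_branchy(const struct key *l, const struct key *r)
{
	assert(l && r);
	return l->hash == r->hash &&
	       l->ports == r->ports &&
	       l->family == r->family;
}

/* Combined form: XOR is zero only for equal fields, so the OR of all
 * XORs is zero only when every field matches.
 */
static bool key_equal_combined(const struct key *l, const struct key *r)
{
	assert(l && r);
	return ((l->hash ^ r->hash) |
		(l->ports ^ r->ports) |
		(uint32_t)(l->family ^ r->family)) == 0;
}
```

Both functions return the same result for every input; whether the combined form is actually faster depends on the compiler and on how often the first comparison already fails.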

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH v6 5/6] PM / Runtime: force memory allocation with no I/O during Runtime PM callback
From: Ming Lei @ 2012-11-28  3:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <1354069667.BsTEhItmLz@vostro.rjw.lan>

On Wed, Nov 28, 2012 at 5:24 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> Please don't duplicate code this way.
>
> You can move that whole thing to rpm_callback().  Yes, you'll probably need to
> check dev->power.memalloc_noio twice in there, but that's OK.

Good idea, I will update it in v7.

Thanks,
--
Ming Lei


^ permalink raw reply

* Re: IPv4 route cache DOS attack
From: 叶雨飞 @ 2012-11-28  2:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20121127.211403.2194859172416122879.davem@davemloft.net>

My first email was to lartc@, and this one is to netdev@.

On Tue, Nov 27, 2012 at 6:14 PM, David Miller <davem@davemloft.net> wrote:
>
> We saw your email the other day, do not resend the same exact
> question over and over again.
>
> If nobody has time, or wants, to answer you, then you have to simply
> accept that.  Repeating your posting only will make things worse
> for you, trust me.
>
> Thank you.

^ permalink raw reply

* RE: Re: RTL 8169  linux driver question
From: hayeswang @ 2012-11-28  2:32 UTC (permalink / raw)
  To: 'Francois Romieu', 'David Laight'
  Cc: 'Stéphane ANCELOT', netdev, sancelot,
	'nic_swsd'
In-Reply-To: <20121127224605.GA10228@electric-eye.fr.zoreil.com>

Francois Romieu [mailto:romieu@fr.zoreil.com] 
[...]
> Something like the patch below against net-next could help once I will
> have tested it.
> 
> I completely guessed the Tx usec scale factor at gigabit 
> speed (125 us, 
> 100 us, disabled, who knows ?) and I have no idea which 
> specific chipsets
> it should work with.
> 
> Hayes, may I expect some hindsight regarding:
> 1 - the availability of the IntrMitigate (0xe2) register through the
>     8169, 8168 and 810x line of chipsets

8169, 8168, and 8136(810x) series chipsets support it.

> 2 - the Tx timer unit at gigabit speed

The unit of the timer depends on both the speed and the setting of CPlusCmd
(0xe0) bit 1 and bit 0.

For 8169
bit[1:0] \ speed    1000M      100M       10M
0 0                 320ns      2.56us     40.96us
0 1                 2.56us     20.48us    327.7us
1 0                 5.12us     40.96us    655.4us
1 1                 10.24us    81.92us    1.31ms

For the other chipsets
bit[1:0] \ speed    1000M      100M       10M
0 0                 5us        2.56us     40.96us
0 1                 40us       20.48us    327.7us
1 0                 80us       40.96us    655.4us
1 1                 160us      81.92us    1.31ms
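
For reference, the two tables can be written down as a lookup in C. This is only a sketch: the enum, the array names and timer_unit_ns() are hypothetical, the row index assumes that "0 1" means bit 1 = 0 and bit 0 = 1, and the values are copied from the tables as posted, including their rounding (327.7us, 655.4us, 1.31ms).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Column order of the tables above (hypothetical names). */
enum link_speed { LINK_1000M = 0, LINK_100M = 1, LINK_10M = 2 };

/* Timer unit in nanoseconds, indexed by [CPlusCmd bit[1:0]][speed]. */
static const uint32_t unit_ns_8169[4][3] = {
	{   320,  2560,   40960 },
	{  2560, 20480,  327700 },
	{  5120, 40960,  655400 },
	{ 10240, 81920, 1310000 },
};

static const uint32_t unit_ns_other[4][3] = {
	{   5000,  2560,   40960 },
	{  40000, 20480,  327700 },
	{  80000, 40960,  655400 },
	{ 160000, 81920, 1310000 },
};

/* Look up the timer unit for a chip family, bit setting and speed. */
static uint32_t timer_unit_ns(bool is_8169, unsigned int cpluscmd_bits,
			      enum link_speed speed)
{
	assert(cpluscmd_bits < 4);
	assert(speed <= LINK_10M);
	return is_8169 ? unit_ns_8169[cpluscmd_bits][speed]
		       : unit_ns_other[cpluscmd_bits][speed];
}
```

A driver would divide the requested delay by this unit to get the count to program; how that count is encoded in the IntrMitigate register is not covered by the tables.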

 
Best Regards,
Hayes

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Joe Perches @ 2012-11-28  2:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354051475.14302.42.camel@edumazet-glaptop>

On Tue, 2012-11-27 at 13:24 -0800, Eric Dumazet wrote:
> On Tue, 2012-11-27 at 09:23 -0800, Joe Perches wrote:
> > On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > commit 68835aba4d9b (net: optimize INET input path further)
> > > moved some fields used for tcp/udp sockets lookup in the first cache
> > > line of struct sock_common.
> > []
> > > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > > index 5e11905..196ede4 100644
> > > --- a/include/linux/ipv6.h
> > > +++ b/include/linux/ipv6.h
> > > @@ -365,19 +365,21 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
> > >  #endif /* IS_ENABLED(CONFIG_IPV6) */
> > >  
> > >  #define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
> > > +	(((__sk)->sk_hash == (__hash)) &&					\
> > > +	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) &&	\
> > > +	 ((__sk)->sk_family		== AF_INET6)		&&		\
> > 
> > Perhaps these could be |'d together to avoid the test/jump
> > after each comparison by using some bit operations instead.
> > 
> > > +	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&		\
> > > +	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&		\
> > > +	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
> > > +	 net_eq(sock_net(__sk), (__net)))
> > 
> But it would be wrong.

OK, so it's an and not an or.  Duh.

Still, the logical tests that are likely to be in the same
cacheline could be ANDed together to avoid a test and jump.

Perhaps this:

It shrinks the object a trivial bit and could be a tiny bit
faster too.

(allyesconfig x86/32)
$ size net/ipv6/inet6_hashtables.o*
   text	   data	    bss	    dec	    hex	filename
   6277	    962	   1832	   9071	   236f	net/ipv6/inet6_hashtables.o.new
   6381	    962	   1880	   9223	   2407	net/ipv6/inet6_hashtables.o.old

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 196ede4..91870de 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -364,22 +364,24 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
 #define inet_v6_ipv6only(__sk)		0
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-#define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash)) &&					\
-	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) &&	\
-	 ((__sk)->sk_family		== AF_INET6)		&&		\
-	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&		\
-	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&		\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
+#define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif) \
+	((((__sk)->sk_hash == (__hash)) &				\
+	  ((__sk)->sk_family == AF_INET6)) &&				\
+	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) && \
+	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr)) &&		\
+	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr)) &&	\
+	 (!((__sk)->sk_bound_dev_if) ||					\
+	  ((__sk)->sk_bound_dev_if == (__dif))) &&			\
 	 net_eq(sock_net(__sk), (__net)))
 
 #define INET6_TW_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif) \
-	(((__sk)->sk_hash == (__hash)) &&					\
-	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports))	&&	\
-	 ((__sk)->sk_family	       == PF_INET6)			&&	\
-	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr)))	&&	\
-	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr))) &&	\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
+	((((__sk)->sk_hash == (__hash)) &				\
+	  ((__sk)->sk_family == PF_INET6)) &&				\
+	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports)) && \
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr))) && \
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr))) && \
+	 (!((__sk)->sk_bound_dev_if) ||					\
+	  ((__sk)->sk_bound_dev_if == (__dif))) &&			\
 	 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _IPV6_H */
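
The substitution made in the patch above relies on each == yielding exactly 0 or 1, so a bitwise & of two comparisons gives the same truth value as && while allowing both to be evaluated without an intermediate jump. A minimal user-space sketch, with hypothetical names (struct sk_key, match_logical, match_bitwise):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields tested first in the macro. */
struct sk_key {
	uint32_t hash;
	uint16_t family;
};

/* Short-circuit form: the second test runs only if the first passed. */
static bool match_logical(const struct sk_key *sk, uint32_t hash,
			  uint16_t family)
{
	assert(sk);
	return sk->hash == hash && sk->family == family;
}

/* Bitwise form: both comparisons are always evaluated, then ANDed.
 * This is safe only because neither operand has side effects.
 */
static bool match_bitwise(const struct sk_key *sk, uint32_t hash,
			  uint16_t family)
{
	assert(sk);
	return (sk->hash == hash) & (sk->family == family);
}
```

The bitwise form must not be used when the right-hand operand is only valid if the left-hand one is true, for example a dereference guarded by a NULL check.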

^ permalink raw reply related

* Re: IPv4 route cache DOS attack
From: David Miller @ 2012-11-28  2:14 UTC (permalink / raw)
  To: sunyucong; +Cc: netdev
In-Reply-To: <CAJygYd1OSTfpdbA57q0c6ppMtft3iB7uzo-RCXFOamGp2XVzBg@mail.gmail.com>

We saw your email the other day, do not resend the same exact
question over and over again.

If nobody has time, or wants, to answer you, then you have to simply
accept that.  Repeating your posting only will make things worse
for you, trust me.

Thank you.

^ permalink raw reply

* Re: TCP and reordering
From: David Miller @ 2012-11-28  2:06 UTC (permalink / raw)
  To: saku; +Cc: rick.jones2, netdev
In-Reply-To: <CAAeewD9MtEx4uF6ezbBj7Ci5OzX8VK7p=WQ2TB3PfjmznA4X0w@mail.gmail.com>

From: Saku Ytti <saku@ytti.fi>
Date: Tue, 27 Nov 2012 19:15:20 +0200

> TCP used to be friendly to reordering before fast retransmit
> optimization was implemented.

You're talking about 20 years ago, because that's when fast
retransmit was created.

It's not like this got added recently.

And the gains of fast retransmit far outweigh whatever strange
justification you would give for reordering packets on purpose.

^ permalink raw reply

* Re: [PATCH v2 net-next] sctp: Add support to per-association statistics via a new SCTP_GET_ASSOC_STATS call
From: Vlad Yasevich @ 2012-11-28  1:40 UTC (permalink / raw)
  To: Michele Baldessari
  Cc: linux-sctp, Neil Horman, Thomas Graf, netdev, David S. Miller
In-Reply-To: <20121127220810.GB3869@fante.int.rhx>

On 11/27/2012 05:08 PM, Michele Baldessari wrote:
> Hi Vlad,
>
> thanks a lot for your review.
>
> On Mon, Nov 19, 2012 at 11:01:46AM -0500, Vlad Yasevich wrote:
> <snip>
>>> @@ -1152,8 +1156,11 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>>>   		 */
>>>   		if (sctp_chunk_is_data(chunk))
>>>   			asoc->peer.last_data_from = chunk->transport;
>>> -		else
>>> +		else {
>>>   			SCTP_INC_STATS(net, SCTP_MIB_INCTRLCHUNKS);
>>> +			if (chunk->chunk_hdr->type == SCTP_CID_SACK)
>>> +				asoc->stats.isacks++;
>>> +		}
>>
>> Should the above include asoc->stats.ictrlchunks++; just like ep_bh_rcv()?
>
> Indeed, I will add that.
>
>>>
>>>   		if (chunk->transport)
>>>   			chunk->transport->last_time_heard = jiffies;
>>> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
>>> index 1859e2b..32ab55b 100644
>>> --- a/net/sctp/endpointola.c
>>> +++ b/net/sctp/endpointola.c
>>> @@ -480,8 +480,11 @@ normal:
>>>   		 */
>>>   		if (asoc && sctp_chunk_is_data(chunk))
>>>   			asoc->peer.last_data_from = chunk->transport;
>>> -		else
>>> +		else {
>>>   			SCTP_INC_STATS(sock_net(ep->base.sk), SCTP_MIB_INCTRLCHUNKS);
>>> +			if (asoc)
>>> +				asoc->stats.ictrlchunks++;
>>> +		}
>>>
>>>   		if (chunk->transport)
>>>   			chunk->transport->last_time_heard = jiffies;
>>> diff --git a/net/sctp/input.c b/net/sctp/input.c
>>> index 8bd3c27..54c449b 100644
>>> --- a/net/sctp/input.c
>>> +++ b/net/sctp/input.c
>>> @@ -281,6 +281,8 @@ int sctp_rcv(struct sk_buff *skb)
>>>   		SCTP_INC_STATS_BH(net, SCTP_MIB_IN_PKT_SOFTIRQ);
>>>   		sctp_inq_push(&chunk->rcvr->inqueue, chunk);
>>>   	}
>>> +	if (asoc)
>>> +		asoc->stats.ipackets++;
>>>
>>>   	sctp_bh_unlock_sock(sk);
>>
>> This needs a bit more thought.  Current counting behaves differently
>> depending on whether the user holds a socket lock or not.
>> If the user holds the lock, we'll end up counting the packet before it is
>> processed.  If the user isn't holding the lock, we'll count the packet after
>> it is processed.
>
> I see. What do you prefer: using atomic64 for this specific counter,
> ignoring it since it is only a temporary miscount, or do you have
> another approach in mind?

You could count it in sctp_inq_push...  Would that make sense?

-vlad

^ permalink raw reply

* Re: IPv4 route cache DOS attack
From: 叶雨飞 @ 2012-11-28  1:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1354064474.14302.44.camel@edumazet-glaptop>

Thanks!!! It works; after flushing the cache it stays at 0.

On Tue, Nov 27, 2012 at 5:01 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-11-27 at 15:15 -0800, 叶雨飞 wrote:
>> Hi,
>>
>> I have a Linux router running kernel 3.2 that receives public ingress
>> packets and routes them through a GRE tunnel; return packets don't go
>> through it.
>>
>> I've recently faced a serious issue with the route cache: when the
>> router receives packets with spoofed source addresses, the route cache
>> quickly gets exhausted (depending on its size), soon the ip dst cache
>> overflow message is printed, and the network subsystem hangs until
>> restarted.
>>
>> So, my question is: how can I turn off the route cache without
>> recompiling the kernel or adding the patch for its removal in 3.7?  I
>> tried to set
>>
>> echo 0 > /proc/sys/net/ipv4/route/max_size but that has no effect at all.
>>
>> And if someone can share some insight on why the network subsystem
>> hangs when the dst cache overflows, it would be great.
>
> echo -1 >/proc/sys/net/ipv4/rt_cache_rebuild_count
>
>
>

^ permalink raw reply

* Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table
From: Jay Vosburgh @ 2012-11-28  1:05 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Andy Gospodarek, netdev
In-Reply-To: <20121123124419.GA3002@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>Hi, 
>
>This is another resend of the patch discussed in June. The only
>changes over the previous version are improved comments.
>
>Bonding with balance_rlb keeps poisoning other machines' ARP
>caches and I think we need to fix this.
>
>On Thu, Jun 21, 2012 at 04:05:19PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac <jbohac@suse.cz> wrote:
>> 
>> >Hi, this is a resend of the patch discussed here:
>> >	http://thread.gmane.org/gmane.linux.network/228076
>> >It has been updated to apply to the latest net-next.
>> [...]
>> >The hash table is hashed by ip_dst. To be able to do the above
>> >check efficiently (not walking the whole hash table), we need a
>> >reverse mapping (by ip_src).
>> 
>> 	Just a note that I'm doing some testing with this patch.  Seems
>> to be ok for the "direct" case (wherein the IP in question is assigned
>> to the local system); I haven't tried the "bridge" case yet.  I've
>> extended some of the debugfs stuff to dump the new information, and I'm
>> trying some of the corner cases (e.g., breaking the linkages in the
>> middle) to see if it all hangs together.
>
>Were there any results of your testing?  Good or bad?

	I did test it quite a bit (and then neglected to follow up).  I
tried various deliberate hash collisions to try and make it fail in a
corner case, but was unable to induce incorrect behavior.

>> 	I am thinking that the layout of the "hash"-ish table is now
>> sufficiently complicated that there should be a comment block somewhere
>> describing what's going on (because I didn't really quite get it until I
>> dumped the whole thing and looked at it).  With this patch, there is one
>> "used" linkage for all of the elements in use, plus some number of "src"
>> linkages, one for each active source hash.  The "src" linkages are also
>> notable in that they are separate from the "assigned" state.
>
>I updated the comments in drivers/net/bonding/bond_alb.h to
>describe the structure.
>
>> >+	 * have a dirrerent mac_src.
>> 
>> 	Typo here; should be "different."
>
>Fixed. 
>Any chance we could finally get this merged?:

	The only issue I see is that a number of added lines run past 80
columns, e.g.,

+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
+			/* ip_src is going to be updated, fix the src hash list */
+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));

+	 * sending out client updates with this IP address and the old MAC address.

+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {

	... and so on.  I did not compile and test this version, just
applied it and inspected it; presumably it is functionally identical to
the prior version.  There's also one typo I noted near the end of the
patch.

	-J

>Bonding in balance-alb mode records information from ARP packets
>passing through the bond in a hash table (rx_hashtbl).
>
>In certain situations (e.g. link change of a slave),
>rlb_update_rx_clients() will send out ARP packets to update ARP
>caches of other hosts on the network to achieve RX load
>balancing.
>
>The problem is that once an IP address is recorded in the hash
>table, it stays there indefinitely. If this IP address is
>migrated to a different host in the network, bonding still sends
>out ARP packets that poison other systems' ARP caches with
>invalid information.
>
>This patch solves this by looking at all incoming ARP packets,
>and checking if the source IP address is one of the source
>addresses stored in the rx_hashtbl. If it is, but the MAC
>addresses differ, the corresponding hash table entries are
>removed. Thus, when an IP address is migrated, the first ARP
>broadcast by its new owner will purge the offending entries of
>rx_hashtbl.
>
>The hash table is hashed by ip_dst. To be able to do the above
>check efficiently (not walking the whole hash table), we need a
>reverse mapping (by ip_src).
>
>I added three new members in struct rlb_client_info:
>   rx_hashtbl[x].src_first will point to the start of a list of
>      entries for which hash(ip_src) == x.
>   The list is linked with src_next and src_prev.
>
>When an incoming ARP packet arrives at rlb_arp_recv(),
>rlb_purge_src_ip() can quickly walk only the entries on the
>corresponding lists, i.e. the entries that are likely to contain
>the offending IP address.
>
>To avoid confusion, I renamed these existing fields of struct 
>rlb_client_info:
>	next -> used_next
>	prev -> used_prev
>	rx_hashtbl_head -> rx_hashtbl_used_head
>
>(The current linked list is _not_ a list of hash table
>entries with colliding ip_dst. It's a list of entries that are
>being used; its purpose is to avoid walking the whole hash table
>when looking for used entries.)
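
The index-linked src lists described above can be modelled in user space. The following is a simplified sketch, not the patch itself: it keeps only the src_first/src_next/src_prev linkage and the purge walk, leaves out locking and the "used" list, and uses an assumed table size and hypothetical names (struct entry, tbl, src_link, src_unlink, purge_src_ip).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TBL_SIZE   256u		/* assumed size for this sketch */
#define NULL_INDEX 0xffffffffu	/* plays the role of RLB_NULL_INDEX */

struct entry {
	uint32_t ip_src;
	uint8_t  mac_src[6];
	uint32_t src_first;	/* head of the list for hash(ip_src) == slot */
	uint32_t src_next;
	uint32_t src_prev;	/* for the list head: the bucket slot itself */
};

static struct entry tbl[TBL_SIZE];

static void tbl_init(void)
{
	memset(tbl, 0, sizeof(tbl));
	for (uint32_t i = 0; i < TBL_SIZE; i++)
		tbl[i].src_first = tbl[i].src_next = tbl[i].src_prev = NULL_INDEX;
}

/* Put slot dst at the front of the list headed at slot src_hash. */
static void src_link(uint32_t src_hash, uint32_t dst)
{
	uint32_t next;

	assert(src_hash < TBL_SIZE && dst < TBL_SIZE);
	next = tbl[src_hash].src_first;
	tbl[dst].src_prev = src_hash;
	tbl[dst].src_next = next;
	if (next != NULL_INDEX)
		tbl[next].src_prev = dst;
	tbl[src_hash].src_first = dst;
}

/* Remove slot idx from whichever src list it is on, if any. */
static void src_unlink(uint32_t idx)
{
	uint32_t next = tbl[idx].src_next;
	uint32_t prev = tbl[idx].src_prev;

	tbl[idx].src_next = NULL_INDEX;
	tbl[idx].src_prev = NULL_INDEX;
	if (next != NULL_INDEX)
		tbl[next].src_prev = prev;
	if (prev == NULL_INDEX)
		return;
	if (tbl[prev].src_first == idx)	/* prev is the bucket slot */
		tbl[prev].src_first = next;
	else
		tbl[prev].src_next = next;
}

/* Unlink entries that have ip_src but a different MAC; returns the count. */
static unsigned int purge_src_ip(uint32_t src_hash, uint32_t ip_src,
				 const uint8_t mac_src[6])
{
	unsigned int purged = 0;
	uint32_t idx = tbl[src_hash].src_first;

	while (idx != NULL_INDEX) {
		uint32_t next = tbl[idx].src_next; /* save before unlinking */

		if (tbl[idx].ip_src == ip_src &&
		    memcmp(tbl[idx].mac_src, mac_src, 6) != 0) {
			src_unlink(idx);
			purged++;
		}
		idx = next;
	}
	return purged;
}
```

Only the entries on one bucket's list are visited, which is the reason for keeping the reverse mapping instead of scanning the whole table on every received ARP.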
>
>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..8505a24 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -84,6 +84,9 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
>
> /* Forward declaration */
> static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp);
>+static void rlb_src_unlink(struct bonding *bond, u32 index);
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash);
>
> static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
> {
>@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
> 	if (!arp)
> 		goto out;
>
>+	/* We received an ARP from arp->ip_src.
>+	 * We might have used this IP address previously (on the bonding host
>+	 * itself or on a system that is bridged together with the bond).
>+	 * However, if arp->mac_src is different than what is stored in
>+	 * rx_hashtbl, some other host is now using the IP and we must prevent
>+	 * sending out client updates with this IP address and the old MAC address.
>+	 * Clean up all hash table entries that have this address as ip_src but
>+	 * have a different mac_src.
>+	 */
>+	rlb_purge_src_ip(bond, arp);
>+
> 	if (arp->op_code == htons(ARPOP_REPLY)) {
> 		/* update rx hash table for this ARP */
> 		rlb_update_entry_from_arp(bond, arp);
>@@ -432,9 +446,9 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
> 	_lock_rx_hashtbl_bh(bond);
>
> 	rx_hash_table = bond_info->rx_hashtbl;
>-	index = bond_info->rx_hashtbl_head;
>+	index = bond_info->rx_hashtbl_used_head;
> 	for (; index != RLB_NULL_INDEX; index = next_index) {
>-		next_index = rx_hash_table[index].next;
>+		next_index = rx_hash_table[index].used_next;
> 		if (rx_hash_table[index].slave == slave) {
> 			struct slave *assigned_slave = rlb_next_rx_slave(bond);
>
>@@ -519,8 +533,8 @@ static void rlb_update_rx_clients(struct bonding *bond)
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		if (client_info->ntt) {
> 			rlb_update_client(client_info);
>@@ -548,8 +562,8 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> 		if ((client_info->slave == slave) &&
>@@ -578,8 +592,8 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
>
> 	_lock_rx_hashtbl(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> 		if (!client_info->slave) {
>@@ -625,6 +639,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 				/* update mac address from arp */
> 				memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
> 			}
>+			memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
>
> 			assigned_slave = client_info->slave;
> 			if (assigned_slave) {
>@@ -647,6 +662,13 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 	assigned_slave = rlb_next_rx_slave(bond);
>
> 	if (assigned_slave) {
>+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
>+			/* ip_src is going to be updated, fix the src hash list */
>+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
>+			rlb_src_unlink(bond, hash_index);
>+			rlb_src_link(bond, hash_src, hash_index);
>+		}
>+
> 		client_info->ip_src = arp->ip_src;
> 		client_info->ip_dst = arp->ip_dst;
> 		/* arp->mac_dst is broadcast for arp reqeusts.
>@@ -654,6 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 		 * upon receiving an arp reply.
> 		 */
> 		memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
>+		memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
> 		client_info->slave = assigned_slave;
>
> 		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
>@@ -669,11 +692,11 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 		}
>
> 		if (!client_info->assigned) {
>-			u32 prev_tbl_head = bond_info->rx_hashtbl_head;
>-			bond_info->rx_hashtbl_head = hash_index;
>-			client_info->next = prev_tbl_head;
>+			u32 prev_tbl_head = bond_info->rx_hashtbl_used_head;
>+			bond_info->rx_hashtbl_used_head = hash_index;
>+			client_info->used_next = prev_tbl_head;
> 			if (prev_tbl_head != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[prev_tbl_head].prev =
>+				bond_info->rx_hashtbl[prev_tbl_head].used_prev =
> 					hash_index;
> 			}
> 			client_info->assigned = 1;
>@@ -740,8 +763,8 @@ static void rlb_rebalance(struct bonding *bond)
> 	_lock_rx_hashtbl_bh(bond);
>
> 	ntt = 0;
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		assigned_slave = rlb_next_rx_slave(bond);
> 		if (assigned_slave && (client_info->slave != assigned_slave)) {
>@@ -759,11 +782,113 @@ static void rlb_rebalance(struct bonding *bond)
> }
>
> /* Caller must hold rx_hashtbl lock */
>+static void rlb_init_table_entry_dst(struct rlb_client_info *entry)
>+{
>+	entry->used_next = RLB_NULL_INDEX;
>+	entry->used_prev = RLB_NULL_INDEX;
>+	entry->assigned = 0;
>+	entry->slave = NULL;
>+	entry->tag = 0;
>+}
>+static void rlb_init_table_entry_src(struct rlb_client_info *entry)
>+{
>+	entry->src_first = RLB_NULL_INDEX;
>+	entry->src_prev = RLB_NULL_INDEX;
>+	entry->src_next = RLB_NULL_INDEX;
>+}
>+
> static void rlb_init_table_entry(struct rlb_client_info *entry)
> {
> 	memset(entry, 0, sizeof(struct rlb_client_info));
>-	entry->next = RLB_NULL_INDEX;
>-	entry->prev = RLB_NULL_INDEX;
>+	rlb_init_table_entry_dst(entry);
>+	rlb_init_table_entry_src(entry);
>+}
>+
>+static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next_index = bond_info->rx_hashtbl[index].used_next;
>+	u32 prev_index = bond_info->rx_hashtbl[index].used_prev;
>+
>+	if (index == bond_info->rx_hashtbl_used_head)
>+		bond_info->rx_hashtbl_used_head = next_index;
>+	if (prev_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[prev_index].used_next = next_index;
>+	if (next_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next_index].used_prev = prev_index;
>+}
>+
>+/* unlink a rlb hash table entry from the src list */
>+static void rlb_src_unlink(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next_index = bond_info->rx_hashtbl[index].src_next;
>+	u32 prev_index = bond_info->rx_hashtbl[index].src_prev;
>+
>+	bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX;
>+
>+	if (next_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next_index].src_prev = prev_index;
>+
>+	if (prev_index == RLB_NULL_INDEX)
>+		return;
>+
>+	/* is prev_index pointing to the head of this list? */
>+	if (bond_info->rx_hashtbl[prev_index].src_first == index)
>+		bond_info->rx_hashtbl[prev_index].src_first = next_index;
>+	else
>+		bond_info->rx_hashtbl[prev_index].src_next = next_index;
>+
>+}
>+
>+static void rlb_delete_table_entry(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+
>+	rlb_delete_table_entry_dst(bond, index);
>+	rlb_init_table_entry_dst(entry);
>+
>+	rlb_src_unlink(bond, index);
>+}
>+
>+/* add the rx_hashtbl[ip_dst_hash] entry to the list
>+ * of entries with identical ip_src_hash
>+ */
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next;
>+
>+	bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash;
>+	next = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+	bond_info->rx_hashtbl[ip_dst_hash].src_next = next;
>+	if (next != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next].src_prev = ip_dst_hash;
>+	bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash;
>+}
>+
>+/* deletes all rx_hashtbl entries with  arp->ip_src if their mac_src does
>+ * not match arp->mac_src */
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 ip_src_hash = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
>+	u32 index;
>+
>+	_lock_rx_hashtbl_bh(bond);
>+
>+	index = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+	while (index != RLB_NULL_INDEX) {
>+		struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+		u32 next_index = entry->src_next;
>+		if (entry->ip_src == arp->ip_src &&
>+		    !ether_addr_equal_64bits(arp->mac_src, entry->mac_src))
>+				rlb_delete_table_entry(bond, index);
>+		index = next_index;
>+	}
>+	_unlock_rx_hashtbl_bh(bond);
> }
>
> static int rlb_initialize(struct bonding *bond)
>@@ -781,7 +906,7 @@ static int rlb_initialize(struct bonding *bond)
>
> 	bond_info->rx_hashtbl = new_hashtbl;
>
>-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> 	for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) {
> 		rlb_init_table_entry(bond_info->rx_hashtbl + i);
>@@ -803,7 +928,7 @@ static void rlb_deinitialize(struct bonding *bond)
>
> 	kfree(bond_info->rx_hashtbl);
> 	bond_info->rx_hashtbl = NULL;
>-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> 	_unlock_rx_hashtbl_bh(bond);
> }
>@@ -815,25 +940,13 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	curr_index = bond_info->rx_hashtbl_head;
>+	curr_index = bond_info->rx_hashtbl_used_head;
> 	while (curr_index != RLB_NULL_INDEX) {
> 		struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]);
>-		u32 next_index = bond_info->rx_hashtbl[curr_index].next;
>-		u32 prev_index = bond_info->rx_hashtbl[curr_index].prev;
>-
>-		if (curr->tag && (curr->vlan_id == vlan_id)) {
>-			if (curr_index == bond_info->rx_hashtbl_head) {
>-				bond_info->rx_hashtbl_head = next_index;
>-			}
>-			if (prev_index != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[prev_index].next = next_index;
>-			}
>-			if (next_index != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[next_index].prev = prev_index;
>-			}
>+		u32 next_index = bond_info->rx_hashtbl[curr_index].used_next;
>
>-			rlb_init_table_entry(curr);
>-		}
>+		if (curr->tag && (curr->vlan_id == vlan_id))
>+			rlb_delete_table_entry(bond, curr_index);
>
> 		curr_index = next_index;
> 	}
>diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
>index 90f140a..de831ba 100644
>--- a/drivers/net/bonding/bond_alb.h
>+++ b/drivers/net/bonding/bond_alb.h
>@@ -94,15 +94,35 @@ struct tlb_client_info {
>
> /* -------------------------------------------------------------------------
>  * struct rlb_client_info contains all info related to a specific rx client
>- * connection. This is the Clients Hash Table entry struct
>+ * connection. This is the Clients Hash Table entry struct.
>+ * Note that this is not a proper hash table; if a new client's IP address
>+ * hash collides with an existing client entry, the old entry is replaced.
>+ *
>+ * There is a linked list (linked by the used_next and used_prev members)
>+ * linking all the used entries of the hash table. This allows updating
>+ * all the clients without walking over all the unused elements of the table.
>+ *
>+ * There are also linked lists of entries with identical hash(ip_src). These
>+ * allow cleaning up the table from ip_src<->mac_src associatins that have

	Typo here, "associations"

>+ * become outdated and would cause sending out invalid ARP updates to the
>+ * network. These are linked by the (src_next and src_prev members).
>  * -------------------------------------------------------------------------
>  */
> struct rlb_client_info {
> 	__be32 ip_src;		/* the server IP address */
> 	__be32 ip_dst;		/* the client IP address */
>+	u8  mac_src[ETH_ALEN];	/* the server MAC address */
> 	u8  mac_dst[ETH_ALEN];	/* the client MAC address */
>-	u32 next;		/* The next Hash table entry index */
>-	u32 prev;		/* The previous Hash table entry index */
>+
>+	/* list of used hash table entries, starting at rx_hashtbl_used_head */
>+	u32 used_next;
>+	u32 used_prev;
>+
>+	/* ip_src based hashing */
>+	u32 src_next;	/* next entry with same hash(ip_src) */
>+	u32 src_prev;	/* prev entry with same hash(ip_src) */
>+	u32 src_first;	/* first entry with hash(ip_src) == this entry's index */
>+
> 	u8  assigned;		/* checking whether this entry is assigned */
> 	u8  ntt;		/* flag - need to transmit client info */
> 	struct slave *slave;	/* the slave assigned to this client */
>@@ -131,7 +151,7 @@ struct alb_bond_info {
> 	int rlb_enabled;
> 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
> 	spinlock_t		rx_hashtbl_lock;
>-	u32			rx_hashtbl_head;
>+	u32			rx_hashtbl_used_head;
> 	u8			rx_ntt;	/* flag - need to transmit
> 					 * to all rx clients
> 					 */
>diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c
>index 2cf084e..6ac855f 100644
>--- a/drivers/net/bonding/bond_debugfs.c
>+++ b/drivers/net/bonding/bond_debugfs.c
>@@ -31,8 +31,8 @@ static int bond_debug_rlb_hash_show(struct seq_file *m, void *v)
>
> 	spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n",
> 			&client_info->ip_src,
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
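
For readers following the src_first/src_next/src_prev bookkeeping in the
quoted patch, here is a minimal userspace sketch of the same idea: a
doubly linked list threaded through a fixed array by index instead of by
pointer. It mirrors the quoted rlb_src_link() only; struct entry,
TBL_SIZE, NULL_INDEX, tbl_init() and src_count() are names made up for
this illustration and are not part of the driver.

```c
#include <stdint.h>

#define TBL_SIZE   8            /* hypothetical; the driver uses RLB_HASH_TABLE_SIZE */
#define NULL_INDEX 0xffffffffu  /* stands in for RLB_NULL_INDEX */

struct entry {
	uint32_t src_first; /* first entry whose hash(ip_src) equals this slot's index */
	uint32_t src_next;  /* next entry with the same hash(ip_src) */
	uint32_t src_prev;  /* previous entry, or the head slot for the first entry */
};

/* Mark every slot as unlinked. */
static void tbl_init(struct entry *tbl)
{
	for (uint32_t i = 0; i < TBL_SIZE; i++)
		tbl[i].src_first = tbl[i].src_next = tbl[i].src_prev = NULL_INDEX;
}

/* Push slot dst onto the front of the list headed at slot src,
 * following the same steps as the quoted rlb_src_link().
 */
static void src_link(struct entry *tbl, uint32_t src, uint32_t dst)
{
	uint32_t next = tbl[src].src_first;

	tbl[dst].src_prev = src;
	tbl[dst].src_next = next;
	if (next != NULL_INDEX)
		tbl[next].src_prev = dst;
	tbl[src].src_first = dst;
}

/* Walk the list headed at slot src and count its entries. */
static unsigned int src_count(const struct entry *tbl, uint32_t src)
{
	unsigned int n = 0;

	for (uint32_t i = tbl[src].src_first; i != NULL_INDEX; i = tbl[i].src_next)
		n++;
	return n;
}
```

Because links are indices, the head of a list (src_first) lives in the
slot selected by hash(ip_src), while the members live in the slots
selected by hash(ip_dst); one array therefore carries both roles.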

* Re: IPv4 route cache DOS attack
From: Eric Dumazet @ 2012-11-28  1:01 UTC (permalink / raw)
  To: 叶雨飞; +Cc: netdev
In-Reply-To: <CAJygYd1OSTfpdbA57q0c6ppMtft3iB7uzo-RCXFOamGp2XVzBg@mail.gmail.com>

On Tue, 2012-11-27 at 15:15 -0800, 叶雨飞 wrote:
> Hi,
> 
> I have a Linux router running kernel 3.2 that receives public ingress
> packets and routes them through a GRE tunnel; return packets don't go
> through it.
> 
> I've recently faced a serious issue with the route cache: when the
> router receives packets with spoofed source addresses, the route cache
> quickly gets exhausted (depending on its size), soon the ip dst cache
> overflow message is printed, and the network subsystem hangs until it
> is restarted.
> 
> So, my question is: how can I turn off the route cache without
> recompiling the kernel or adding the patch for its removal in 3.7? I
> tried to set
> 
> echo 0 > /proc/sys/net/ipv4/route/max_size but that has no effect at all.
> 
> And if someone can share some insight into why the network subsystem
> hangs when the dst cache overflows, that would be great.

echo -1 >/proc/sys/net/ipv4/rt_cache_rebuild_count

* Re: [PATCH] br2684: don't send frames on not-ready vcc
From: David Woodhouse @ 2012-11-28  0:54 UTC (permalink / raw)
  To: Krzysztof Mazur
  Cc: chas williams - CONTRACTOR, davem, netdev, linux-kernel, nathan
In-Reply-To: <20121127235129.GA20080@shrek.podlesie.net>

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

On Wed, 2012-11-28 at 00:51 +0100, Krzysztof Mazur wrote:
> If you do this, it's actually better not to use patch 1/7, because
> it introduces the race condition that you found earlier.

Right. I've omitted that from the git tree I just pushed out.

> With this patch you still have the theoretical race that was fixed in
> patches 5 and 8 of the pppoatm series, but I have never seen it in practice.

And I think it's even less likely for br2684. At least with pppoatm you
might have had pppd sending frames. But for br2684 they *only* come from
its start_xmit function... which is serialised anyway.

I do get strange oopses when I try to add BQL to br2684, but that's not
something to be looking at at 1am...

I *do* need the equivalent of your patch 4, which is the module_put
race.

-- 
dwmw2

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

* Re: [PATCH v3 8/7] pppoatm: fix missing wakeup in pppoatm_send()
From: David Woodhouse @ 2012-11-28  0:48 UTC (permalink / raw)
  To: chas williams - CONTRACTOR; +Cc: Krzysztof Mazur, netdev, linux-kernel, davem
In-Reply-To: <20121127102333.68ac3234@thirdoffive.cmf.nrl.navy.mil>

[-- Attachment #1: Type: text/plain, Size: 1077 bytes --]

On Tue, 2012-11-27 at 10:23 -0500, chas williams - CONTRACTOR wrote:
> yes, but dont call it 8/7 since that doesnt make sense.

It made enough sense when it was a single patch appended to a thread of
7 other patches from Krzysztof. But now it's all got a little more
complex, so I've tried to collect together the latest version of
everything we've discussed:

 http://git.infradead.org/users/dwmw2/atm.git
  git://git.infradead.org/users/dwmw2/atm.git

David Woodhouse (5):
      solos-pci: Wait for pending TX to complete when releasing vcc
      br2684: don't send frames on not-ready vcc
      atm: Add release_cb() callback to vcc
      pppoatm: fix missing wakeup in pppoatm_send()
      br2684: fix module_put() race

Krzysztof Mazur (6):
      atm: add owner of push() callback to atmvcc
      pppoatm: allow assign only on a connected socket
      pppoatm: fix module_put() race
      pppoatm: take ATM socket lock in pppoatm_send()
      pppoatm: drop frames to not-ready vcc
      pppoatm: do not inline pppoatm_may_send()


-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

* Re: RTL 8169  linux driver question
From: Stéphane ANCELOT @ 2012-11-28  0:35 UTC (permalink / raw)
  To: Francois Romieu; +Cc: David Laight, Stéphane ANCELOT, netdev, Hayes Wang
In-Reply-To: <20121127224605.GA10228@electric-eye.fr.zoreil.com>

On 27/11/2012 23:46, Francois Romieu wrote:
> David Laight <David.Laight@ACULAB.COM> :
>> Stéphane ANCELOT <sancelot@free.fr> :
>>> I had a problem with it: my application sends a frame that is immediately
>>> transmitted back by some slaves, and an abnormal 100 us was lost
>>> between the send and receive calls.
>>>
>>> Finally I found it was coming from the following register setup in the
>>> driver :
>>>
>>> RTL_W16(IntrMitigate, 0x5151);
>>>
>>> Can you give me some details about it, since I do not have the RTL8169
>>> programming guide.
>> That sounds like an 'interrupt mitigation' setting - which will cause
>> RX interrupts to be delayed a short time in order to reduce the
>> interrupt load on the kernel.
>>
>> There is usually an 'ethtool' setting to disable interrupt mitigation.
> Something like the patch below against net-next could help once I have
> tested it.
>
> I completely guessed the Tx usec scale factor at gigabit speed (125 us,
> 100 us, disabled, who knows ?) and I have no idea which specific chipsets
> it should work with.
Using the 0x5151 value at 100 Mb/s full duplex, I know it introduced
exactly 100 us of delay (Tx+Rx).
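
To make the 0x5151 value easier to read, here is a small sketch that
splits it into four 4-bit fields, assuming the field layout used by
rtl_get_coalesce() in the quoted patch (frame count divided by 4 in the
low nibble, timer ticks in the next nibble, Rx in the low byte, Tx in
the high byte). That layout is the patch author's working assumption,
not a documented fact, and struct mitigate_fields and
decode_intr_mitigate() are names invented for this illustration.

```c
#include <stdint.h>

#define COALESCE_MASK  0x0f  /* same value as RTL_COALESCE_MASK in the patch */
#define COALESCE_SHIFT 4     /* same value as RTL_COALESCE_SHIFT in the patch */

struct mitigate_fields {
	uint32_t rx_frames; /* frames to coalesce before an Rx interrupt */
	uint32_t rx_ticks;  /* Rx timer, in raw hardware ticks */
	uint32_t tx_frames; /* frames to coalesce before a Tx interrupt */
	uint32_t tx_ticks;  /* Tx timer, in raw hardware ticks */
};

/* Split a 16-bit IntrMitigate value into its four nibbles.
 * The frame fields are stored divided by 4, hence the << 2.
 */
static struct mitigate_fields decode_intr_mitigate(uint16_t w)
{
	struct mitigate_fields f;

	f.rx_frames = (uint32_t)(w & COALESCE_MASK) << 2;
	f.rx_ticks  = (w >> COALESCE_SHIFT) & COALESCE_MASK;
	f.tx_frames = (uint32_t)((w >> (2 * COALESCE_SHIFT)) & COALESCE_MASK) << 2;
	f.tx_ticks  = (w >> (3 * COALESCE_SHIFT)) & COALESCE_MASK;
	return f;
}
```

Under that assumption 0x5151 means 4 frames and 5 timer ticks in each
direction. The sketch deliberately leaves the timers in ticks, because
the duration of one tick at each link speed is exactly the open question
in this thread.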


> Hayes, may I expect some insight regarding:
> 1 - the availability of the IntrMitigate (0xe2) register through the
>      8169, 8168 and 810x line of chipsets
> 2 - the Tx timer unit at gigabit speed
>
> It would save me some time.
Hayes, it would have spared me a lot of time ;-)

Have a look at what is driving this r8169 component:

http://www.youtube.com/watch?v=wj30CeAFwuk&feature=plcp
A question for NIC component developers: I do not understand what
competitive advantage there is in keeping these things secret.

These things are mostly boring for people and OEMs like myself at
Numalliance.

Regards,
Stephane Ancelot

> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 248f883..2623b73 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -349,6 +349,12 @@ enum rtl_registers {
>   	RxMaxSize	= 0xda,
>   	CPlusCmd	= 0xe0,
>   	IntrMitigate	= 0xe2,
> +
> +#define RTL_COALESCE_MASK	0x0f
> +#define RTL_COALESCE_SHIFT	4
> +#define RTL_COALESCE_T_MAX	(RTL_COALESCE_MASK)
> +#define RTL_COALESCE_FRAME_MAX	(RTL_COALESCE_MASK << 2)
> +
>   	RxDescAddrLow	= 0xe4,
>   	RxDescAddrHigh	= 0xe8,
>   	EarlyTxThres	= 0xec,	/* 8169. Unit of 32 bytes. */
> @@ -1997,10 +2003,121 @@ static void rtl8169_get_strings(struct net_device *dev, u32 stringset, u8 *data)
>   	}
>   }
>   
> +static struct rtl_coalesce_scale {
> +	u32 speed;
> +	/* Rx / Tx */
> +	u16 usecs[2];
> +} rtl_coalesce_info[] = {
> +	{ .speed = SPEED_10,	.usecs = { 8000, 10000 } },
> +	{ .speed = SPEED_100,	.usecs = { 1000,  1000 } },
> +	{ .speed = SPEED_1000,	.usecs = {  125,   125 } }
> +};
> +
> +static struct rtl_coalesce_scale *rtl_coalesce_scale(struct net_device *dev)
> +{
> +	struct ethtool_cmd ecmd;
> +	int rc, i;
> +
> +	rc = rtl8169_get_settings(dev, &ecmd);
> +	if (rc < 0)
> +		return ERR_PTR(rc);
> +
> +	for (i = 0; i < ARRAY_SIZE(rtl_coalesce_info); i++) {
> +		if (ethtool_cmd_speed(&ecmd) == rtl_coalesce_info[i].speed)
> +			return rtl_coalesce_info + i;
> +	}
> +
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static int rtl_get_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
> +{
> +	struct rtl8169_private *tp = netdev_priv(dev);
> +	void __iomem *ioaddr = tp->mmio_addr;
> +	struct rtl_coalesce_scale *scale;
> +	struct {
> +		u32 *max_frames;
> +		u32 *usecs;
> +	} coal_settings [] = {
> +		{ &ec->rx_max_coalesced_frames, &ec->rx_coalesce_usecs },
> +		{ &ec->tx_max_coalesced_frames, &ec->tx_coalesce_usecs }
> +	}, *p = coal_settings;
> +	int i;
> +	u16 w;
> +
> +	memset(ec, 0, sizeof(*ec));
> +
> +	for (w = RTL_R16(IntrMitigate); w; w >>= RTL_COALESCE_SHIFT, p++) {
> +		*p->max_frames = (w & RTL_COALESCE_MASK) << 2;
> +		w >>= RTL_COALESCE_SHIFT;
> +		*p->usecs = w & RTL_COALESCE_MASK;
> +	}
> +
> +	/* Except for null parameters, the meaning of coalescing parameters
> +	 * depends on the link speed.
> +	 */
> +	scale = rtl_coalesce_scale(dev);
> +	if (IS_ERR(scale) && (p != coal_settings))
> +		return PTR_ERR(scale);
> +
> +	for (i = 0; i < 2; i++) {
> +		p = coal_settings + i;
> +		*p->usecs *= scale->usecs[i];
> +		if (!*p->usecs && !*p->max_frames)
> +			*p->max_frames = 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int rtl_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
> +{
> +	struct rtl8169_private *tp = netdev_priv(dev);
> +	void __iomem *ioaddr = tp->mmio_addr;
> +	struct rtl_coalesce_scale *scale;
> +	struct {
> +		u32 frames;
> +		u32 usecs;
> +	} coal_settings [] = {
> +		{ ec->rx_max_coalesced_frames, ec->rx_coalesce_usecs },
> +		{ ec->tx_max_coalesced_frames, ec->tx_coalesce_usecs }
> +	}, *p = coal_settings;
> +	int i, rc;
> +	u16 w = 0;
> +
> +	scale = rtl_coalesce_scale(dev);
> +	rc = PTR_ERR(scale);
> +
> +	for (i = 0; i < 2; i++, p++) {
> +		u32 units;
> +
> +		if (!p->usecs && p->frames == 1)
> +			continue;
> +		if (rc < 0)
> +			goto out;
> +
> +		units = p->usecs / scale->usecs[i];
> +		if (units > RTL_COALESCE_T_MAX || p->usecs % scale->usecs[i] ||
> +		    p->frames > RTL_COALESCE_FRAME_MAX || p->frames % 4)
> +			return -EINVAL;
> +
> +		w <<= RTL_COALESCE_SHIFT;
> +		w |= units;
> +		w <<= RTL_COALESCE_SHIFT;
> +		w |= p->frames >> 2;
> +	}
> +
> +	RTL_W16(IntrMitigate, swab16(w));
> +out:
> +	return rc;
> +}
> +
>   static const struct ethtool_ops rtl8169_ethtool_ops = {
>   	.get_drvinfo		= rtl8169_get_drvinfo,
>   	.get_regs_len		= rtl8169_get_regs_len,
>   	.get_link		= ethtool_op_get_link,
> +	.get_coalesce		= rtl_get_coalesce,
> +	.set_coalesce		= rtl_set_coalesce,
>   	.get_settings		= rtl8169_get_settings,
>   	.set_settings		= rtl8169_set_settings,
>   	.get_msglevel		= rtl8169_get_msglevel,

* Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge
From: Jay Vosburgh @ 2012-11-28  0:05 UTC (permalink / raw)
  To: zheng.li; +Cc: netdev, andy, linux-kernel, davem, joe.jin
In-Reply-To: <50AB447E.3010904@oracle.com>

zheng.li <zheng.x.li@oracle.com> wrote:

>After I applied my prior patch to the latest kernel and tested it, I changed
>the patch as shown here. The prior patch works on 2.6.32, but on later
>kernels it has no effect: the domU (whose ARP goes through a bridge via
>the bond) is still intermittently unreachable. I found that the reason is
>that the alb_monitor of bonding (after 2.6.32) sends unicast ARP replies
>using the MAC of the slave assigned in rlb_client_info, so the peer host
>updates its ARP cache entry for the domU with the wrong MAC and the domU
>becomes unreachable again.
>The rlb_client_info is created when rlb_arp_xmit sends an ARP request.
>The rlb_client_info contains the local IP together with the assigned
>slave, so the rlb_update_client function makes the ARP for every local IP
>use the slave's MAC.
>
>Bug reproduction rate: 100%.
>
>So I changed the patch so that ARP requests also bypass rlb and no
>rlb_client_info is created. With the new patch applied to the latest
>version, it works.
>
>So the patch has to cover both ARP requests and replies to work well.

	Ok, I think this is a reasonable change, but the patch should
have a commit description and comment in the code that accurately
reflects the final change.

	Currently, the new comment in the code reads as follows:

> +	/* Only modify ARP's MAC if it originates locally;
> +	 * don't change ARPs arriving via a bridge.
> +	 */
> +	if (!bond_slave_has_mac(bond, arp->mac_src))
> +		return NULL;

	I'd change the comment to say something like "Don't modify or
load balance ARPs that do not originate locally (e.g., arrive via a
bridge)" and put some details in the commit message, for example:

	Do not modify or load balance ARP packets passing through
balance-alb mode (wherein the ARP did not originate locally, and arrived
via a bridge).

	Modifying pass-through ARP replies causes an incorrect MAC
address to be placed into the ARP packet, rendering peers unable to
communicate with the actual destination from which the ARP reply
originated.

	Load balancing pass-through ARP requests causes an entry to be
created for the peer in the rlb table, and bond_alb_monitor will
occasionally issue ARP updates to all peers in the table instructing them
as to which MAC address they should communicate with; this occurs when
some event sets rx_ntt.  In the bridged case, however, the MAC address
used for the update would be the MAC of the slave, not the actual source
MAC of the originating destination.  This would render peers unable to
communicate with the destinations beyond the bridge.

	-J
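
	As an illustration of the check discussed above, here is a
userspace sketch of "does this source MAC belong to one of our slaves?".
It is a stand-in for the bond_slave_has_mac() helper in the quoted
patch, not a copy of it: struct slave_dev, MAC_LEN, slave_with_mac() and
arp_originates_locally() are made-up names, and memcmp() replaces the
kernel's ether_addr_equal_64bits().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6  /* length of an Ethernet address in bytes */

struct slave_dev {
	uint8_t dev_addr[MAC_LEN]; /* hardware address of one slave */
};

/* Return the slave that owns mac, or NULL if no slave has that address. */
static const struct slave_dev *slave_with_mac(const struct slave_dev *slaves,
					      size_t n, const uint8_t *mac)
{
	for (size_t i = 0; i < n; i++)
		if (memcmp(slaves[i].dev_addr, mac, MAC_LEN) == 0)
			return &slaves[i];
	return NULL;
}

/* Nonzero when the ARP source MAC is one of ours, i.e. the ARP may be
 * modified and load balanced; zero when it is merely passing through
 * (for example from a guest behind a bridge) and must be left alone.
 */
static int arp_originates_locally(const struct slave_dev *slaves, size_t n,
				  const uint8_t *arp_mac_src)
{
	return slave_with_mac(slaves, n, arp_mac_src) != NULL;
}
```

	With two slaves configured, an ARP carrying a guest's MAC as its
source fails the check and is therefore neither rewritten nor recorded
in the rlb table.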

>bond_alb_monitor -> rlb_update_rx_clients --> rlb_update_client
>
>rlb_update_client(struct rlb_client_info *client_info)
>{
>   for (i = 0; i < RLB_ARP_BURST_SIZE; i++) {
>		struct sk_buff *skb;
>
>		skb = arp_create(ARPOP_REPLY, ETH_P_ARP,
>                                //peer host's IP
>				 client_info->ip_dst,
>				 client_info->slave->dev,
>                                //Domu 's IP
>				 client_info->ip_src,
>                          //peer host's MAC which be set in rlb_arp_recv
>				 client_info->mac_dst,
>//use slave's MAC to send unicast arp reply to peer ,so cause peer host
>//update MAC of domu with wrong mac address.
>				 client_info->slave->dev->dev_addr,
>				 client_info->mac_dst);
>.......
>}
>
>Thanks
>Zheng Li
>
>
>> Zheng Li <zheng.x.li@oracle.com> wrote:
>> 
>>> ARP traffic passing through a bridge and out via the bond (when the bond is a 
>>> port of the bridge) should not have its source MAC address adjusted by the 
>>> receive load balance code in rlb_arp_xmit.
>> 
>> 	This patch differs from prior versions in that it does more than
>> what's described here; it also disables the receive load balance logic
>> for any ARPs (request or reply) that are passing through the bond (not
>> of local origin).  For ARP replies, that's mostly harmless, as the ARPs
>> passing through will simply always be sent from one particular slave
>> (the active slave) instead of being balanced.
>> 
>> 	For ARP requests, though, they are already always sent via the
>> active slave, but there is also some logic in rlb_arp_xmit to limit the
>> side effects from the broadcast ARP, in particular this part:
>> 
>> 		/* The ARP reply packets must be delayed so that
>> 		 * they can cancel out the influence of the ARP request.
>> 		 */
>> 		bond->alb_info.rlb_update_delay_counter = RLB_UPDATE_DELAY;
>> 
>> 		/* arp requests are broadcast and are sent on the primary
>> 		 * the arp request will collapse all clients on the subnet to
>> 		 * the primary slave. We must register these clients to be
>> 		 * updated with their assigned mac.
>> 		 */
>> 		rlb_req_update_subnet_clients(bond, arp->ip_src);
>> 
>> 	that arranges for clients to be given ARP updates for their
>> slave assignments (which may change to the active slave, due to the ARP
>> broadcast being sent via the active slave).
>> 
>> 	I think the ARP reply side of this is fine (and is what is
>> described in the changelog), but the ARP request behavior change is new
>> with this version.
>> 
>> 	Since prior versions of the patch didn't cause this code to be
>> skipped, is this change intentional?
>> 
>> 	Did you check to see if the above logic is necessary for ARP
>> requests passing through via a bridge to prevent peers from "stacking"
>> (in terms of load balance assignment) on the active slave due to bridged
>> ARP traffic?
>> 
>> 	-J
>> 
>>> Signed-off-by: Zheng Li <zheng.x.li@oracle.com>
>>> Cc: Jay Vosburgh <fubar@us.ibm.com>
>>> Cc: Andy Gospodarek <andy@greyhouse.net>
>>> Cc: "David S. Miller" <davem@davemloft.net>
>>>
>>> ---
>>> drivers/net/bonding/bond_alb.c |    6 ++++++
>>> drivers/net/bonding/bonding.h  |   13 +++++++++++++
>>> 2 files changed, 19 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>>> index e15cc11..75f6f0d 100644
>>> --- a/drivers/net/bonding/bond_alb.c
>>> +++ b/drivers/net/bonding/bond_alb.c
>>> @@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
>>> 	struct arp_pkt *arp = arp_pkt(skb);
>>> 	struct slave *tx_slave = NULL;
>>>
>>> +	/* Only modify ARP's MAC if it originates locally;
>>> +	 * don't change ARPs arriving via a bridge.
>>> +	 */
>>> +	if (!bond_slave_has_mac(bond, arp->mac_src))
>>> +		return NULL;
>>> +
>>> 	if (arp->op_code == htons(ARPOP_REPLY)) {
>>> 		/* the arp must be sent on the selected
>>> 		* rx channel
>>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>>> index f8af2fc..6dded56 100644
>>> --- a/drivers/net/bonding/bonding.h
>>> +++ b/drivers/net/bonding/bonding.h
>>> @@ -22,6 +22,7 @@
>>> #include <linux/in6.h>
>>> #include <linux/netpoll.h>
>>> #include <linux/inetdevice.h>
>>> +#include <linux/etherdevice.h>
>>> #include "bond_3ad.h"
>>> #include "bond_alb.h"
>>>
>>> @@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct bond_net *bn)
>>> }
>>> #endif
>>>
>>> +static inline struct slave *bond_slave_has_mac(struct bonding *bond,
>>> +					       const u8 *mac)
>>> +{
>>> +	int i = 0;
>>> +	struct slave *tmp;
>>> +
>>> +	bond_for_each_slave(bond, tmp, i)
>>> +		if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr))
>>> +			return tmp;
>>> +
>>> +	return NULL;
>>> +}
>>>
>>> /* exported from bond_main.c */
>>> extern int bond_net_id;
>>> -- 
>>> 1.7.6.5

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

* Re: [PATCH] br2684: don't send frames on not-ready vcc
From: Krzysztof Mazur @ 2012-11-27 23:51 UTC (permalink / raw)
  To: David Woodhouse
  Cc: chas williams - CONTRACTOR, davem, netdev, linux-kernel, nathan
In-Reply-To: <1354058916.2534.25.camel@shinybook.infradead.org>

On Tue, Nov 27, 2012 at 11:28:36PM +0000, David Woodhouse wrote:
> Avoid submitting packets to a vcc which is being closed. Things go badly
> wrong when the ->pop method gets later called after everything's been
> torn down.
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> ---
> On Tue, 2012-11-27 at 22:36 +0000, David Woodhouse wrote:
> > Nathan, does this help? 
> 
> I think that's necessary, but not sufficient. You'll want something like
> this too... I can now kill br2684ctl while there's a flood of outgoing
> packets, and get a handful of the printks that I had in here until a few
> seconds ago when I edited it out of the patch in my mail client... and
> no more panic.
> 
> I do also now have Krzysztof's patch 1/7 (detach protocol before closing
> vcc) but I don't think it actually matters any more. 

If you do this, it's actually better not to use patch 1/7, because
it introduces the race condition that you found earlier.

> 
> --- a/net/atm/br2684.c~	2012-11-23 23:14:29.000000000 +0000
> +++ b/net/atm/br2684.c	2012-11-27 23:09:18.502403881 +0000
> @@ -249,6 +249,12 @@ static int br2684_xmit_vcc(struct sk_buf
>  	skb_debug(skb);
>  
>  	ATM_SKB(skb)->vcc = atmvcc = brvcc->atmvcc;
> +	if (test_bit(ATM_VF_RELEASED, &atmvcc->flags)
> +	    || test_bit(ATM_VF_CLOSE, &atmvcc->flags)
> +	    || !test_bit(ATM_VF_READY, &atmvcc->flags)) {
> +		dev_kfree_skb(skb);
> +		return 0;
> +	}
>  	pr_debug("atm_skb(%p)->vcc(%p)->dev(%p)\n", skb, atmvcc, atmvcc->dev);
>  	atomic_add(skb->truesize, &sk_atm(atmvcc)->sk_wmem_alloc);
>  	ATM_SKB(skb)->atm_options = atmvcc->atm_options;
> 

With this patch you still have the theoretical race that was fixed in
patches 5 and 8 of the pppoatm series, but I have never seen it in practice.

Acked-by: Krzysztof Mazur <krzysiek@podlesie.net>

Krzysiek

* [PATCH] br2684: don't send frames on not-ready vcc
From: David Woodhouse @ 2012-11-27 23:28 UTC (permalink / raw)
  To: chas williams - CONTRACTOR
  Cc: Krzysztof Mazur, davem, netdev, linux-kernel, nathan
In-Reply-To: <1354055783.2534.18.camel@shinybook.infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

Avoid submitting packets to a vcc which is being closed. Things go badly
wrong when the ->pop method gets later called after everything's been
torn down.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
On Tue, 2012-11-27 at 22:36 +0000, David Woodhouse wrote:
> Nathan, does this help? 

I think that's necessary, but not sufficient. You'll want something like
this too... I can now kill br2684ctl while there's a flood of outgoing
packets, and get a handful of the printks that I had in here until a few
seconds ago when I edited it out of the patch in my mail client... and
no more panic.

I do also now have Krzysztof's patch 1/7 (detach protocol before closing
vcc) but I don't think it actually matters any more. 
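
For anyone reading along, the condition this patch adds to
br2684_xmit_vcc() boils down to a small predicate over three flag bits.
Here is a userspace sketch of it. The bit numbers VF_READY, VF_RELEASED
and VF_CLOSE are placeholders picked for the example (the kernel's
ATM_VF_* values are defined elsewhere), and flag_set() is a plain
non-atomic stand-in for test_bit().

```c
#include <stdbool.h>

/* Placeholder bit numbers for this sketch only. */
enum {
	VF_READY    = 0,
	VF_RELEASED = 1,
	VF_CLOSE    = 2
};

/* Non-atomic stand-in for the kernel's test_bit(). */
static bool flag_set(unsigned long flags, int bit)
{
	return (flags >> bit) & 1UL;
}

/* A frame may be handed to the vcc only if it is ready and is neither
 * released nor in the middle of being closed. Otherwise the caller
 * should free the skb instead of queueing it.
 */
static bool vcc_may_xmit(unsigned long flags)
{
	return !flag_set(flags, VF_RELEASED) &&
	       !flag_set(flags, VF_CLOSE) &&
	       flag_set(flags, VF_READY);
}
```

Note that READY alone is not enough: a vcc can still have READY set
while CLOSE or RELEASED is already set, which is why all three bits are
tested.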

--- a/net/atm/br2684.c~	2012-11-23 23:14:29.000000000 +0000
+++ b/net/atm/br2684.c	2012-11-27 23:09:18.502403881 +0000
@@ -249,6 +249,12 @@ static int br2684_xmit_vcc(struct sk_buf
 	skb_debug(skb);

 	ATM_SKB(skb)->vcc = atmvcc = brvcc->atmvcc;
+	if (test_bit(ATM_VF_RELEASED, &atmvcc->flags)
+	    || test_bit(ATM_VF_CLOSE, &atmvcc->flags)
+	    || !test_bit(ATM_VF_READY, &atmvcc->flags)) {
+		dev_kfree_skb(skb);
+		return 0;
+	}
 	pr_debug("atm_skb(%p)->vcc(%p)->dev(%p)\n", skb, atmvcc, atmvcc->dev);
 	atomic_add(skb->truesize, &sk_atm(atmvcc)->sk_wmem_alloc);
 	ATM_SKB(skb)->atm_options = atmvcc->atm_options;

-- 
dwmw2

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

* IPv4 route cache DOS attack
From: 叶雨飞 @ 2012-11-27 23:15 UTC (permalink / raw)
  To: netdev

Hi,

I have a Linux router running kernel 3.2 that receives public ingress
packets and routes them through a GRE tunnel; return packets don't go
through it.

I've recently faced a serious issue with the route cache: when the
router receives packets with spoofed source addresses, the route cache
quickly gets exhausted (depending on its size), soon the ip dst cache
overflow message is printed, and the network subsystem hangs until it
is restarted.

So, my question is: how can I turn off the route cache without
recompiling the kernel or adding the patch for its removal in 3.7? I
tried to set

echo 0 > /proc/sys/net/ipv4/route/max_size but that has no effect at all.

And if someone can share some insight into why the network subsystem
hangs when the dst cache overflows, that would be great.

Thanks.

* Re: Re: RTL 8169  linux driver question
From: Francois Romieu @ 2012-11-27 22:46 UTC (permalink / raw)
  To: David Laight; +Cc: Stéphane ANCELOT, netdev, sancelot, Hayes Wang
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B70BD@saturn3.aculab.com>

David Laight <David.Laight@ACULAB.COM> :
> Stéphane ANCELOT <sancelot@free.fr> :
> > I had a problem with it: my application sends a frame that is immediately
> > transmitted back by some slaves, and an abnormal 100 us was lost
> > between the send and receive calls.
> > 
> > Finally I found it was coming from the following register setup in the
> > driver :
> > 
> > RTL_W16(IntrMitigate, 0x5151);
> > 
> > Can you give me some details about it, since I do not have the RTL8169
> > programming guide.
> 
> That sounds like an 'interrupt mitigation' setting - which will cause
> RX interrupts to be delayed a short time in order to reduce the
> interrupt load on the kernel.
> 
> There is usually an 'ethtool' setting to disable interrupt mitigation.

Something like the patch below against net-next could help once I have
tested it.

I completely guessed the Tx usec scale factor at gigabit speed (125 us, 
100 us, disabled, who knows ?) and I have no idea which specific chipsets
it should work with.

Hayes, may I expect some insight regarding:
1 - the availability of the IntrMitigate (0xe2) register through the
    8169, 8168 and 810x line of chipsets
2 - the Tx timer unit at gigabit speed

It would save me some time.

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 248f883..2623b73 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -349,6 +349,12 @@ enum rtl_registers {
 	RxMaxSize	= 0xda,
 	CPlusCmd	= 0xe0,
 	IntrMitigate	= 0xe2,
+
+#define RTL_COALESCE_MASK	0x0f
+#define RTL_COALESCE_SHIFT	4
+#define RTL_COALESCE_T_MAX	(RTL_COALESCE_MASK)
+#define RTL_COALESCE_FRAME_MAX	(RTL_COALESCE_MASK << 2)
+
 	RxDescAddrLow	= 0xe4,
 	RxDescAddrHigh	= 0xe8,
 	EarlyTxThres	= 0xec,	/* 8169. Unit of 32 bytes. */
@@ -1997,10 +2003,121 @@ static void rtl8169_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 	}
 }
 
+static struct rtl_coalesce_scale {
+	u32 speed;
+	/* Rx / Tx */
+	u16 usecs[2];
+} rtl_coalesce_info[] = {
+	{ .speed = SPEED_10,	.usecs = { 8000, 10000 } },
+	{ .speed = SPEED_100,	.usecs = { 1000,  1000 } },
+	{ .speed = SPEED_1000,	.usecs = {  125,   125 } }
+};
+
+static struct rtl_coalesce_scale *rtl_coalesce_scale(struct net_device *dev)
+{
+	struct ethtool_cmd ecmd;
+	int rc, i;
+
+	rc = rtl8169_get_settings(dev, &ecmd);
+	if (rc < 0)
+		return ERR_PTR(rc);
+
+	for (i = 0; i < ARRAY_SIZE(rtl_coalesce_info); i++) {
+		if (ethtool_cmd_speed(&ecmd) == rtl_coalesce_info[i].speed)
+			return rtl_coalesce_info + i;
+	}
+
+	return ERR_PTR(-EINVAL);
+}
+
+static int rtl_get_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
+{
+	struct rtl8169_private *tp = netdev_priv(dev);
+	void __iomem *ioaddr = tp->mmio_addr;
+	struct rtl_coalesce_scale *scale;
+	struct {
+		u32 *max_frames;
+		u32 *usecs;
+	} coal_settings [] = {
+		{ &ec->rx_max_coalesced_frames, &ec->rx_coalesce_usecs },
+		{ &ec->tx_max_coalesced_frames, &ec->tx_coalesce_usecs }
+	}, *p = coal_settings;
+	int i;
+	u16 w;
+
+	memset(ec, 0, sizeof(*ec));
+
+	for (w = RTL_R16(IntrMitigate); w; w >>= RTL_COALESCE_SHIFT, p++) {
+		*p->max_frames = (w & RTL_COALESCE_MASK) << 2;
+		w >>= RTL_COALESCE_SHIFT;
+		*p->usecs = w & RTL_COALESCE_MASK;
+	}
+
+	/* Except for null parameters, the meaning of coalescing parameters
+	 * depends on the link speed.
+	 */
+	scale = rtl_coalesce_scale(dev);
+	if (IS_ERR(scale) && (p != coal_settings))
+		return PTR_ERR(scale);
+
+	for (i = 0; i < 2; i++) {
+		p = coal_settings + i;
+		*p->usecs *= scale->usecs[i];
+		if (!*p->usecs && !*p->max_frames)
+			*p->max_frames = 1;
+	}
+
+	return 0;
+}
+
+static int rtl_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
+{
+	struct rtl8169_private *tp = netdev_priv(dev);
+	void __iomem *ioaddr = tp->mmio_addr;
+	struct rtl_coalesce_scale *scale;
+	struct {
+		u32 frames;
+		u32 usecs;
+	} coal_settings [] = {
+		{ ec->rx_max_coalesced_frames, ec->rx_coalesce_usecs },
+		{ ec->tx_max_coalesced_frames, ec->tx_coalesce_usecs }
+	}, *p = coal_settings;
+	int i, rc;
+	u16 w = 0;
+
+	scale = rtl_coalesce_scale(dev);
+	rc = PTR_ERR(scale);
+
+	for (i = 0; i < 2; i++, p++) {
+		u32 units;
+
+		if (!p->usecs && p->frames == 1)
+			continue;
+		if (rc < 0)
+			goto out;
+
+		units = p->usecs / scale->usecs[i];
+		if (units > RTL_COALESCE_T_MAX || p->usecs % scale->usecs[i] ||
+		    p->frames > RTL_COALESCE_FRAME_MAX || p->frames % 4)
+			return -EINVAL;
+
+		w <<= RTL_COALESCE_SHIFT;
+		w |= units;
+		w <<= RTL_COALESCE_SHIFT;
+		w |= p->frames >> 2;
+	}
+
+	RTL_W16(IntrMitigate, swab16(w));
+out:
+	return rc;
+}
+
 static const struct ethtool_ops rtl8169_ethtool_ops = {
 	.get_drvinfo		= rtl8169_get_drvinfo,
 	.get_regs_len		= rtl8169_get_regs_len,
 	.get_link		= ethtool_op_get_link,
+	.get_coalesce		= rtl_get_coalesce,
+	.set_coalesce		= rtl_set_coalesce,
 	.get_settings		= rtl8169_get_settings,
 	.set_settings		= rtl8169_set_settings,
 	.get_msglevel		= rtl8169_get_msglevel,

^ permalink raw reply related

* [PATCH] solos-pci: Wait for pending TX to complete when releasing vcc
From: David Woodhouse @ 2012-11-27 22:36 UTC (permalink / raw)
  To: chas williams - CONTRACTOR
  Cc: Krzysztof Mazur, davem, netdev, linux-kernel, nathan
In-Reply-To: <20121127135434.0728cd4f@thirdoffive.cmf.nrl.navy.mil>

We should no longer be calling the old pop routine for the vcc, after
vcc_release() has completed. Make sure we wait for any pending TX skbs
to complete, by waiting for our own PKT_PCLOSE control skb to be sent.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
On Tue, 2012-11-27 at 13:54 -0500, chas williams - CONTRACTOR wrote:
> the driver's close routine should wait for any of the pending tx and
> rx to complete.

Nathan, does this help? I can test here to a certain extent, but when I
use it in PPPoE mode and then crash the router, the DSLAM tends to
refuse to talk to me for an arbitrary period of time after that. Which
is something of a PITA.

diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 9851093..b5bb332 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -94,6 +94,7 @@ struct pkt_hdr {
 struct solos_skb_cb {
 	struct atm_vcc *vcc;
 	uint32_t dma_addr;
+	struct completion *c;
 };
 
 
@@ -868,6 +869,7 @@ static void pclose(struct atm_vcc *vcc)
 	struct solos_card *card = vcc->dev->dev_data;
 	struct sk_buff *skb;
 	struct pkt_hdr *header;
+	DECLARE_COMPLETION_ONSTACK(c);
 
 	skb = alloc_skb(sizeof(*header), GFP_ATOMIC);
 	if (!skb) {
@@ -881,11 +883,15 @@ static void pclose(struct atm_vcc *vcc)
 	header->vci = cpu_to_le16(vcc->vci);
 	header->type = cpu_to_le16(PKT_PCLOSE);
 
+	SKB_CB(skb)->c = &c;
+
 	fpga_queue(card, SOLOS_CHAN(vcc->dev), skb, NULL);
 
 	clear_bit(ATM_VF_ADDR, &vcc->flags);
 	clear_bit(ATM_VF_READY, &vcc->flags);
 
+	wait_for_completion(&c);
+
 	/* Hold up vcc_destroy_socket() (our caller) until solos_bh() in the
 	   tasklet has finished processing any incoming packets (and, more to
 	   the point, using the vcc pointer). */
@@ -1011,9 +1017,12 @@ static uint32_t fpga_tx(struct solos_card *card)
 			if (vcc) {
 				atomic_inc(&vcc->stats->tx);
 				solos_pop(vcc, oldskb);
-			} else
+			} else {
+				struct pkt_hdr *header = (void *)oldskb->data;
+				if (le16_to_cpu(header->type) == PKT_PCLOSE)
+					complete(SKB_CB(oldskb)->c);
 				dev_kfree_skb_irq(oldskb);
-
+			}
 		}
 	}
 	/* For non-DMA TX, write the 'TX start' bit for all four ports simultaneously */


-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation





^ permalink raw reply related

* Re: [PATCH v2 net-next] sctp: Add support to per-association statistics via a new SCTP_GET_ASSOC_STATS call
From: Michele Baldessari @ 2012-11-27 22:08 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: linux-sctp, Neil Horman, Thomas Graf, netdev, David S. Miller
In-Reply-To: <50AA57EA.4000907@gmail.com>

Hi Vlad,

thanks a lot for your review.

On Mon, Nov 19, 2012 at 11:01:46AM -0500, Vlad Yasevich wrote:
<snip>
> >@@ -1152,8 +1156,11 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
> >  		 */
> >  		if (sctp_chunk_is_data(chunk))
> >  			asoc->peer.last_data_from = chunk->transport;
> >-		else
> >+		else {
> >  			SCTP_INC_STATS(net, SCTP_MIB_INCTRLCHUNKS);
> >+			if (chunk->chunk_hdr->type == SCTP_CID_SACK)
> >+				asoc->stats.isacks++;
> >+		}
> 
> Should the above include asoc->stats.ictrlchunks++; just like ep_bh_rcv()?

Indeed, I will add that.

> >
> >  		if (chunk->transport)
> >  			chunk->transport->last_time_heard = jiffies;
> >diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
> >index 1859e2b..32ab55b 100644
> >--- a/net/sctp/endpointola.c
> >+++ b/net/sctp/endpointola.c
> >@@ -480,8 +480,11 @@ normal:
> >  		 */
> >  		if (asoc && sctp_chunk_is_data(chunk))
> >  			asoc->peer.last_data_from = chunk->transport;
> >-		else
> >+		else {
> >  			SCTP_INC_STATS(sock_net(ep->base.sk), SCTP_MIB_INCTRLCHUNKS);
> >+			if (asoc)
> >+				asoc->stats.ictrlchunks++;
> >+		}
> >
> >  		if (chunk->transport)
> >  			chunk->transport->last_time_heard = jiffies;
> >diff --git a/net/sctp/input.c b/net/sctp/input.c
> >index 8bd3c27..54c449b 100644
> >--- a/net/sctp/input.c
> >+++ b/net/sctp/input.c
> >@@ -281,6 +281,8 @@ int sctp_rcv(struct sk_buff *skb)
> >  		SCTP_INC_STATS_BH(net, SCTP_MIB_IN_PKT_SOFTIRQ);
> >  		sctp_inq_push(&chunk->rcvr->inqueue, chunk);
> >  	}
> >+	if (asoc)
> >+		asoc->stats.ipackets++;
> >
> >  	sctp_bh_unlock_sock(sk);
> 
> This needs a bit more thought.  Current counting behaves differently
> depending on whether the user holds a socket lock or not.
> If the user holds the lock, we'll end up counting the packet before it is
> processed.  If the user isn't holding the lock, we'll count the packet after
> it is processed.

I see. Which do you prefer: using an atomic64 for this specific counter,
or ignoring the issue since it is only a temporary miscount? Or do you
have another approach in mind?

> >
> >diff --git a/net/sctp/output.c b/net/sctp/output.c
> >index 4e90188bf..f5200a2 100644
> >--- a/net/sctp/output.c
> >+++ b/net/sctp/output.c
> >@@ -311,6 +311,8 @@ static sctp_xmit_t __sctp_packet_append_chunk(struct sctp_packet *packet,
> >
> >  	    case SCTP_CID_SACK:
> >  		packet->has_sack = 1;
> >+		if (chunk->asoc)
> >+			chunk->asoc->stats.osacks++;
> >  		break;
> >
> >  	    case SCTP_CID_AUTH:
> >@@ -584,11 +586,13 @@ int sctp_packet_transmit(struct sctp_packet *packet)
> >  	 */
> >
> >  	/* Dump that on IP!  */
> >-	if (asoc && asoc->peer.last_sent_to != tp) {
> >-		/* Considering the multiple CPU scenario, this is a
> >-		 * "correcter" place for last_sent_to.  --xguo
> >-		 */
> >-		asoc->peer.last_sent_to = tp;
> >+	if (asoc) {
> >+		asoc->stats.opackets++;
> >+		if (asoc->peer.last_sent_to != tp)
> >+			/* Considering the multiple CPU scenario, this is a
> >+			 * "correcter" place for last_sent_to.  --xguo
> >+			 */
> >+			asoc->peer.last_sent_to = tp;
> >  	}
> >
> >  	if (has_data) {
> >diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
> >index 1b4a7f8..379c81d 100644
> >--- a/net/sctp/outqueue.c
> >+++ b/net/sctp/outqueue.c
> >@@ -667,6 +667,7 @@ redo:
> >  				chunk->fast_retransmit = SCTP_DONT_FRTX;
> >
> >  			q->empty = 0;
> >+			q->asoc->stats.rtxchunks++;
> >  			break;
> >  		}
> >
> >@@ -876,12 +877,14 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
> >  			if (status  != SCTP_XMIT_OK) {
> >  				/* put the chunk back */
> >  				list_add(&chunk->list, &q->control_chunk_list);
> >-			} else if (chunk->chunk_hdr->type == SCTP_CID_FWD_TSN) {
> >+			} else {
> >+				asoc->stats.octrlchunks++;
> >  				/* PR-SCTP C5) If a FORWARD TSN is sent, the
> >  				 * sender MUST assure that at least one T3-rtx
> >  				 * timer is running.
> >  				 */
> >-				sctp_transport_reset_timers(transport);
> >+				if (chunk->chunk_hdr->type == SCTP_CID_FWD_TSN)
> >+					sctp_transport_reset_timers(transport);
> >  			}
> >  			break;
> >
> >@@ -1055,6 +1058,10 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
> >  				 */
> >  				if (asoc->state == SCTP_STATE_SHUTDOWN_PENDING)
> >  					chunk->chunk_hdr->flags |= SCTP_DATA_SACK_IMM;
> >+				if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
> >+					asoc->stats.ouodchunks++;
> >+				else
> >+					asoc->stats.oodchunks++;
> >
> >  				break;
> >
> >@@ -1162,6 +1169,7 @@ int sctp_outq_sack(struct sctp_outq *q, struct sctp_chunk *chunk)
> >
> >  	sack_ctsn = ntohl(sack->cum_tsn_ack);
> >  	gap_ack_blocks = ntohs(sack->num_gap_ack_blocks);
> >+	asoc->stats.gapcnt += gap_ack_blocks;
> >  	/*
> >  	 * SFR-CACC algorithm:
> >  	 * On receipt of a SACK the sender SHOULD execute the
> >diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> >index fbe1636..eb7633f 100644
> >--- a/net/sctp/sm_make_chunk.c
> >+++ b/net/sctp/sm_make_chunk.c
> >@@ -804,10 +804,11 @@ struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
> >  				 gabs);
> >
> >  	/* Add the duplicate TSN information.  */
> >-	if (num_dup_tsns)
> >+	if (num_dup_tsns) {
> >+		aptr->stats.idupchunks += num_dup_tsns;
> >  		sctp_addto_chunk(retval, sizeof(__u32) * num_dup_tsns,
> >  				 sctp_tsnmap_get_dups(map));
> >-
> >+	}
> >  	/* Once we have a sack generated, check to see what our sack
> >  	 * generation is, if its 0, reset the transports to 0, and reset
> >  	 * the association generation to 1
> >diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> >index 6eecf7e..363727e 100644
> >--- a/net/sctp/sm_sideeffect.c
> >+++ b/net/sctp/sm_sideeffect.c
> >@@ -542,6 +542,7 @@ static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t *commands,
> >  	 */
> >  	if (!is_hb || transport->hb_sent) {
> >  		transport->rto = min((transport->rto * 2), transport->asoc->rto_max);
> >+		sctp_max_rto(asoc, transport);
> >  	}
> >  }
> >
> >diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> >index b6adef8..ecf7a17 100644
> >--- a/net/sctp/sm_statefuns.c
> >+++ b/net/sctp/sm_statefuns.c
> >@@ -6127,6 +6127,8 @@ static int sctp_eat_data(const struct sctp_association *asoc,
> >  		/* The TSN is too high--silently discard the chunk and
> >  		 * count on it getting retransmitted later.
> >  		 */
> >+		if (chunk->asoc)
> >+			chunk->asoc->stats.outofseqtsns++;
> >  		return SCTP_IERROR_HIGH_TSN;
> >  	} else if (tmp > 0) {
> >  		/* This is a duplicate.  Record it.  */
> >@@ -6226,10 +6228,14 @@ static int sctp_eat_data(const struct sctp_association *asoc,
> >  	/* Note: Some chunks may get overcounted (if we drop) or overcounted
> >  	 * if we renege and the chunk arrives again.
> >  	 */
> >-	if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
> >+	if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED) {
> >  		SCTP_INC_STATS(net, SCTP_MIB_INUNORDERCHUNKS);
> >-	else {
> >+		if (chunk->asoc)
> >+			chunk->asoc->stats.iuodchunks++;
> >+	} else {
> >  		SCTP_INC_STATS(net, SCTP_MIB_INORDERCHUNKS);
> >+		if (chunk->asoc)
> >+			chunk->asoc->stats.iodchunks++;
> >  		ordered = 1;
> >  	}
> >
> >diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> >index 15379ac..8113249 100644
> >--- a/net/sctp/socket.c
> >+++ b/net/sctp/socket.c
> >@@ -609,6 +609,7 @@ static int sctp_send_asconf_add_ip(struct sock		*sk,
> >  				    2*asoc->pathmtu, 4380));
> >  				trans->ssthresh = asoc->peer.i.a_rwnd;
> >  				trans->rto = asoc->rto_initial;
> >+				sctp_max_rto(asoc, trans);
> >  				trans->rtt = trans->srtt = trans->rttvar = 0;
> >  				sctp_transport_route(trans, NULL,
> >  				    sctp_sk(asoc->base.sk));
> >@@ -5633,6 +5634,74 @@ static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
> >  	return 0;
> >  }
> >
> >+/*
> >+ * SCTP_GET_ASSOC_STATS
> >+ *
> >+ * This option retrieves local per endpoint statistics. It is modeled
> >+ * after OpenSolaris' implementation
> >+ */
> >+static int sctp_getsockopt_assoc_stats(struct sock *sk, int len,
> >+				       char __user *optval,
> >+				       int __user *optlen)
> >+{
> >+	struct sctp_assoc_stats sas;
> >+	struct sctp_association *asoc = NULL;
> >+
> >+	/* User must provide at least the assoc id */
> >+	if (len < sizeof(sctp_assoc_t))
> >+		return -EINVAL;
> >+
> >+	if (copy_from_user(&sas, optval, len))
> >+		return -EFAULT;
> >+
> >+	asoc = sctp_id2assoc(sk, sas.sas_assoc_id);
> >+	if (!asoc)
> >+		return -EINVAL;
> >+
> >+	sas.sas_rtxchunks = asoc->stats.rtxchunks;
> >+	sas.sas_gapcnt = asoc->stats.gapcnt;
> >+	sas.sas_outofseqtsns = asoc->stats.outofseqtsns;
> >+	sas.sas_osacks = asoc->stats.osacks;
> >+	sas.sas_isacks = asoc->stats.isacks;
> >+	sas.sas_octrlchunks = asoc->stats.octrlchunks;
> >+	sas.sas_ictrlchunks = asoc->stats.ictrlchunks;
> >+	sas.sas_oodchunks = asoc->stats.oodchunks;
> >+	sas.sas_iodchunks = asoc->stats.iodchunks;
> >+	sas.sas_ouodchunks = asoc->stats.ouodchunks;
> >+	sas.sas_iuodchunks = asoc->stats.iuodchunks;
> >+	sas.sas_idupchunks = asoc->stats.idupchunks;
> >+	sas.sas_opackets = asoc->stats.opackets;
> >+	sas.sas_ipackets = asoc->stats.ipackets;
> >+
> >+	memcpy(&sas.sas_obs_rto_ipaddr, &asoc->stats.obs_rto_ipaddr,
> >+		sizeof(struct sockaddr_storage));
> >+	/* New high max rto observed */
> >+	if (asoc->stats.max_obs_rto > asoc->stats.max_prev_obs_rto)
> >+		sas.sas_maxrto = asoc->stats.max_obs_rto;
> >+	else /* return min_rto since max rto has not changed */
> >+		sas.sas_maxrto = asoc->rto_min;
> >+
> >+	/* Record the value sent to the user this period */
> >+	asoc->stats.max_prev_obs_rto = sas.sas_maxrto;
> >+
> >+	/* Mark beginning of a new observation period */
> >+	asoc->stats.max_obs_rto = 0;
> 
> I don't think the logic above behaves correctly.  The fact that max_obs_rto
> < max_prev_obs_rto does not mean that max_obs_rto has
> not changed.  It just means that the network had smaller latency in this
> time slice than it had in the previous one.  Returning rto_min is
> misinformation in this case.

Ack. I will look into fixing this as well.
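
One way to address the objection above, as a minimal sketch rather than
the fix that was eventually posted: report the maximum RTO seen during
the observation period as it is, without comparing it to the previous
period, and then restart tracking from a floor value. The names
rto_stats, note_rto() and read_and_reset_max_rto() are hypothetical, and
using rto_min as the floor is an assumption.

```c
#include <stdint.h>

/* Hypothetical per-association tracker; not the kernel's struct. */
struct rto_stats {
	uint32_t max_obs_rto;	/* largest RTO seen in the current period */
};

/* Called wherever the RTO is recomputed (the sctp_max_rto() call sites). */
static void note_rto(struct rto_stats *s, uint32_t rto)
{
	if (rto > s->max_obs_rto)
		s->max_obs_rto = rto;
}

/*
 * Return the maximum observed in this period, even if it is lower than
 * the previous period's maximum, then start a new period at the floor.
 */
static uint32_t read_and_reset_max_rto(struct rto_stats *s, uint32_t rto_min)
{
	uint32_t max = s->max_obs_rto;

	s->max_obs_rto = rto_min;
	return max;
}
```

With this shape a quieter period simply reports a smaller maximum; no
second "previous maximum" field is needed.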

Thanks again and regards,
Michele

> 
> -vlad
> 
> >+
> >+	/* Allow the struct to grow and fill in as much as possible */
> >+	len = min_t(size_t, len, sizeof(sas));
> >+
> >+	if (put_user(len, optlen))
> >+		return -EFAULT;
> >+
> >+	SCTP_DEBUG_PRINTK("sctp_getsockopt_assoc_stat(%d): %d\n",
> >+			  len, sas.sas_assoc_id);
> >+
> >+	if (copy_to_user(optval, &sas, len))
> >+		return -EFAULT;
> >+
> >+	return 0;
> >+}
> >+
> >  SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
> >  				char __user *optval, int __user *optlen)
> >  {
> >@@ -5774,6 +5843,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
> >  	case SCTP_PEER_ADDR_THLDS:
> >  		retval = sctp_getsockopt_paddr_thresholds(sk, optval, len, optlen);
> >  		break;
> >+	case SCTP_GET_ASSOC_STATS:
> >+		retval = sctp_getsockopt_assoc_stats(sk, len, optval, optlen);
> >+		break;
> >  	default:
> >  		retval = -ENOPROTOOPT;
> >  		break;
> >diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> >index 953c21e..8c6920d 100644
> >--- a/net/sctp/transport.c
> >+++ b/net/sctp/transport.c
> >@@ -350,6 +350,7 @@ void sctp_transport_update_rto(struct sctp_transport *tp, __u32 rtt)
> >
> >  	/* 6.3.1 C3) After the computation, update RTO <- SRTT + 4 * RTTVAR. */
> >  	tp->rto = tp->srtt + (tp->rttvar << 2);
> >+	sctp_max_rto(tp->asoc, tp);
> >
> >  	/* 6.3.1 C6) Whenever RTO is computed, if it is less than RTO.Min
> >  	 * seconds then it is rounded up to RTO.Min seconds.
> >@@ -620,6 +621,7 @@ void sctp_transport_reset(struct sctp_transport *t)
> >  	t->burst_limited = 0;
> >  	t->ssthresh = asoc->peer.i.a_rwnd;
> >  	t->rto = asoc->rto_initial;
> >+	sctp_max_rto(asoc, t);
> >  	t->rtt = 0;
> >  	t->srtt = 0;
> >  	t->rttvar = 0;
> >
> 

-- 
Michele Baldessari            <michele@acksyn.org>
C2A5 9DA3 9961 4FFB E01B  D0BC DDD4 DCCB 7515 5C6D

^ permalink raw reply

* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Rafael J. Wysocki @ 2012-11-27 21:46 UTC (permalink / raw)
  To: linux-pm
  Cc: Ming Lei, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <5434404.G1ERYjuorE@vostro.rjw.lan>

On Tuesday, November 27, 2012 10:19:29 PM Rafael J. Wysocki wrote:
> On Saturday, November 24, 2012 08:59:14 PM Ming Lei wrote:
> > The patch introduces the memalloc_noio flag in 'struct dev_pm_info'
> > so that the PM core can tell mm not to allocate memory with
> > GFP_KERNEL, avoiding a possible deadlock.
> > 
> > As explained in the comment, any GFP_KERNEL allocation inside
> > runtime_resume() or runtime_suspend() on any device in the path
> > from a block or network device to the root device of the device
> > tree may cause a deadlock. The new pm_runtime_set_memalloc_noio()
> > sets or clears the flag on the devices in that path.
> > 
> > Cc: Alan Stern <stern@rowland.harvard.edu>
> > Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
> > Signed-off-by: Ming Lei <ming.lei@canonical.com>
> > ---
> > v5:
> > 	- fix code style error
> > 	- add comment on clear the device memalloc_noio flag
> > v4:
> > 	- rename memalloc_noio_resume as memalloc_noio
> > 	- remove pm_runtime_get_memalloc_noio()
> > 	- add comments on pm_runtime_set_memalloc_noio
> > v3:
> > 	- introduce pm_runtime_get_memalloc_noio()
> > 	- hold one global lock on pm_runtime_set_memalloc_noio
> > 	- hold device power lock when accessing memalloc_noio_resume
> > 	  flag suggested by Alan Stern
> > 	- implement pm_runtime_set_memalloc_noio without recursion
> > 	  suggested by Alan Stern
> > v2:
> > 	- introduce pm_runtime_set_memalloc_noio()
> > ---
> >  drivers/base/power/runtime.c |   60 ++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/pm.h           |    1 +
> >  include/linux/pm_runtime.h   |    3 +++
> >  3 files changed, 64 insertions(+)
> > 
> > diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> > index 3148b10..3e198a0 100644
> > --- a/drivers/base/power/runtime.c
> > +++ b/drivers/base/power/runtime.c
> > @@ -124,6 +124,66 @@ unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);
> >  
> > +static int dev_memalloc_noio(struct device *dev, void *data)
> > +{
> > +	return dev->power.memalloc_noio;
> > +}
> > +
> > +/*
> > + * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
> > + * @dev: Device to handle.
> > + * @enable: True for setting the flag and False for clearing the flag.
> > + *
> > + * Set the flag for all devices in the path from the device to the
> > + * root device in the device tree if @enable is true, otherwise clear
> > + * the flag for devices in the path whose siblings don't set the flag.
> > + *
> 
> Please use counters instead of walking the whole path every time.  I.e., in
> addition to the flag add a counter to store the number of the device's
> children having that flag set.

I would use the flag only to store the information that
pm_runtime_set_memalloc_noio(dev, true) has been run for this device directly
and I'd use a counter for everything else.

That is, have power.memalloc_count that would be incremented when (1)
pm_runtime_set_memalloc_noio(dev, true) is called for that device and (2) when
power.memalloc_count for one of its children changes from 0 to 1 (and
analogously for decrementation).  Then, check the counter in rpm_callback().
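
The counter scheme described above can be sketched in user-space C as
follows. This is only an illustration under stated assumptions: struct
dev_node stands in for struct device, the helper names are hypothetical,
and the locking a real implementation would need is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device; only the fields used here. */
struct dev_node {
	struct dev_node *parent;
	bool memalloc_noio;		/* set only by a direct call on this device */
	unsigned int memalloc_count;	/* own flag plus children with a nonzero count */
};

/* Increment; keep walking up only while a count goes from 0 to 1. */
static void noio_get(struct dev_node *dev)
{
	while (dev && dev->memalloc_count++ == 0)
		dev = dev->parent;
}

/* Decrement; keep walking up only while a count drops back to 0. */
static void noio_put(struct dev_node *dev)
{
	while (dev && --dev->memalloc_count == 0)
		dev = dev->parent;
}

static void set_memalloc_noio(struct dev_node *dev, bool enable)
{
	if (dev->memalloc_noio == enable)
		return;		/* repeated calls must not skew the counters */
	dev->memalloc_noio = enable;
	if (enable)
		noio_get(dev);
	else
		noio_put(dev);
}

/* What the callback path would test instead of walking the tree. */
static bool needs_noio(const struct dev_node *dev)
{
	return dev->memalloc_count > 0;
}
```

Setting or clearing the flag then touches only the ancestors whose count
actually crosses zero, and the check in the callback path is a single
comparison.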

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-27 21:24 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354037000.2116.19.camel@joe-AO722>

On Tue, 2012-11-27 at 09:23 -0800, Joe Perches wrote:
> On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > commit 68835aba4d9b (net: optimize INET input path further)
> > moved some fields used for tcp/udp sockets lookup in the first cache
> > line of struct sock_common.
> []
> > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > index 5e11905..196ede4 100644
> > --- a/include/linux/ipv6.h
> > +++ b/include/linux/ipv6.h
> > @@ -365,19 +365,21 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
> >  #endif /* IS_ENABLED(CONFIG_IPV6) */
> >  
> >  #define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
> > +	(((__sk)->sk_hash == (__hash)) &&					\
> > +	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) &&	\
> > +	 ((__sk)->sk_family		== AF_INET6)		&&		\
> 
> Perhaps these could be |'d together to avoid the test/jump
> after each comparison by using some bit operations instead.
> 
> > +	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&		\
> > +	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&		\
> > +	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))) && \
> > +	 net_eq(sock_net(__sk), (__net)))
> 
> 

But it would be wrong.

^ permalink raw reply

* Re: [PATCH v6 5/6] PM / Runtime: force memory allocation with no I/O during Runtime PM callbcack
From: Rafael J. Wysocki @ 2012-11-27 21:24 UTC (permalink / raw)
  To: linux-pm
  Cc: Ming Lei, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <1353761958-12810-6-git-send-email-ming.lei@canonical.com>

On Saturday, November 24, 2012 08:59:17 PM Ming Lei wrote:
> This patch uses the newly introduced memalloc_noio_save() and
> memalloc_noio_restore() to force memory allocation without I/O
> during the runtime_resume/runtime_suspend callbacks of devices
> that have the 'memalloc_noio' flag set.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Oliver Neukum <oneukum@suse.de>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
> v5:
> 	- use inline memalloc_noio_save()
> v4:
> 	- runtime_suspend need this too because rpm_resume may wait for
> 	completion of concurrent runtime_suspend, so deadlock still may
> 	be triggered in runtime_suspend path.
> ---
>  drivers/base/power/runtime.c |   32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> index 3e198a0..96d99ea 100644
> --- a/drivers/base/power/runtime.c
> +++ b/drivers/base/power/runtime.c
> @@ -371,6 +371,7 @@ static int rpm_suspend(struct device *dev, int rpmflags)
>  	int (*callback)(struct device *);
>  	struct device *parent = NULL;
>  	int retval;
> +	unsigned int noio_flag;
>  
>  	trace_rpm_suspend(dev, rpmflags);
>  
> @@ -480,7 +481,20 @@ static int rpm_suspend(struct device *dev, int rpmflags)
>  	if (!callback && dev->driver && dev->driver->pm)
>  		callback = dev->driver->pm->runtime_suspend;
>  
> -	retval = rpm_callback(callback, dev);
> +	/*
> +	 * Deadlock might be caused if memory allocation with GFP_KERNEL
> +	 * happens inside runtime_suspend callback of one block device's
> +	 * ancestor or the block device itself. Network device might be
> +	 * thought as part of iSCSI block device, so network device and
> +	 * its ancestor should be marked as memalloc_noio.
> +	 */
> +	if (dev->power.memalloc_noio) {
> +		noio_flag = memalloc_noio_save();
> +		retval = rpm_callback(callback, dev);
> +		memalloc_noio_restore(noio_flag);
> +	} else {
> +		retval = rpm_callback(callback, dev);
> +	}
>  	if (retval)
>  		goto fail;
>  
> @@ -563,6 +577,7 @@ static int rpm_resume(struct device *dev, int rpmflags)
>  	int (*callback)(struct device *);
>  	struct device *parent = NULL;
>  	int retval = 0;
> +	unsigned int noio_flag;
>  
>  	trace_rpm_resume(dev, rpmflags);
>  
> @@ -712,7 +727,20 @@ static int rpm_resume(struct device *dev, int rpmflags)
>  	if (!callback && dev->driver && dev->driver->pm)
>  		callback = dev->driver->pm->runtime_resume;
>  
> -	retval = rpm_callback(callback, dev);
> +	/*
> +	 * Deadlock might be caused if memory allocation with GFP_KERNEL
> +	 * happens inside runtime_resume callback of one block device's
> +	 * ancestor or the block device itself. Network device might be
> +	 * thought as part of iSCSI block device, so network device and
> +	 * its ancestor should be marked as memalloc_noio.
> +	 */
> +	if (dev->power.memalloc_noio) {
> +		noio_flag = memalloc_noio_save();
> +		retval = rpm_callback(callback, dev);
> +		memalloc_noio_restore(noio_flag);
> +	} else {
> +		retval = rpm_callback(callback, dev);
> +	}

Please don't duplicate code this way.

You can move that whole thing to rpm_callback().  Yes, you'll probably need to
check dev->power.memalloc_noio twice in there, but that's OK.
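
A rough user-space sketch of that shape, with the save/restore pair
wrapped around a single call site. Everything here is a hypothetical
stand-in (task_flags, PF_NOIO_SKETCH, noio_save(), noio_restore(),
struct dev_sketch, run_callback()), not the kernel's rpm_callback(); the
callback is assumed to be non-NULL.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-task flag and its helpers. */
#define PF_NOIO_SKETCH 0x1u
static unsigned int task_flags;

static unsigned int noio_save(void)
{
	unsigned int old = task_flags & PF_NOIO_SKETCH;

	task_flags |= PF_NOIO_SKETCH;
	return old;
}

static void noio_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_NOIO_SKETCH) | old;
}

struct dev_sketch {
	bool memalloc_noio;
};

/* Single wrapper: both suspend and resume paths would call this. */
static int run_callback(int (*cb)(struct dev_sketch *), struct dev_sketch *dev)
{
	/* Sample the flag once so save and restore always pair up. */
	bool noio = dev->memalloc_noio;
	unsigned int saved = 0;
	int retval;

	if (noio)
		saved = noio_save();
	retval = cb(dev);
	if (noio)
		noio_restore(saved);
	return retval;
}

/* Example callback that records whether it ran in no-I/O context. */
static bool seen_noio;

static int sample_cb(struct dev_sketch *dev)
{
	(void)dev;
	seen_noio = (task_flags & PF_NOIO_SKETCH) != 0;
	return 0;
}
```

The flag is tested twice, as the comment above anticipates, but the
callback invocation itself appears only once.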


>  	if (retval) {
>  		__update_runtime_status(dev, RPM_SUSPENDED);
>  		pm_runtime_cancel_pending(dev);
> 

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply
