Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Andi Kleen @ 2012-05-29 19:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jesper Dangaard Brouer, netdev, Christoph Paasch, Eric Dumazet,
	David S. Miller, Martin Topholm, Florian Westphal, opurdila,
	Hans Schillstrom
In-Reply-To: <20120528115226.12068.31850.stgit@localhost.localdomain>

Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> TCP SYN handling is on the slow path via tcp_v4_rcv(), and is
> performed while holding spinlock bh_lock_sock().
>
> Real-life and testlab experiments show, that the kernel choks
> when reaching 130Kpps SYN floods (powerful Nehalem 16 cores).
> Measuring with perf reveals, that its caused by
> bh_lock_sock_nested() call in tcp_v4_rcv().
>
> With this patch, the machine can handle 750Kpps (max of the SYN
> flood generator) with cycles to spare, CPU load on the big machine
> dropped to 1%, from 100%.
>
> Notice we only handle syn cookie early on, normal SYN packets
> are still processed under the bh_lock_sock().

So basically handling syncookie lockless? 

Makes sense. Syncookies is a bit obsolete these days of course, due
to the lack of options. But may be still useful for this.

Obviously you'll need to clean up the patch and support IPv6,
but the basic idea looks good to me.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Denys Fedoryshchenko @ 2012-05-29 19:52 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Hiroaki SHIMODA, netdev, e1000-devel, jeffrey.t.kirsher,
	jesse.brandeburg, eric.dumazet, davem
In-Reply-To: <CA+mtBx_uYy9XcRvpD2E46FuMBFu38iQvCiwFHWqbhPBmY=JfOg@mail.gmail.com>

Big Thanks Hiroaki, yes this patch improving situation a lot, and for 
sure it fixes the problem.
I cannot reproduce it anymore.

On 2012-05-29 17:54, Tom Herbert wrote:
> Thanks Hiroaki for this description, it looks promising.  Denys, can
> you test with his patch.
>
> Tom
>
> On Tue, May 29, 2012 at 7:25 AM, Hiroaki SHIMODA
> <shimoda.hiroaki@gmail.com> wrote:
>> On Sun, 20 May 2012 10:40:41 -0700
>> Tom Herbert <therbert@google.com> wrote:
>>
>>> Tried to reproduce:
>>>
>>> May 20 10:08:30 test kernel: [    6.168240] e1000e 0000:06:00.0:
>>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set 
>>> to
>>> dynamic conservative mode
>>> May 20 10:08:30 test kernel: [    6.221591] e1000e 0000:06:00.1:
>>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set 
>>> to
>>> dynamic conservative mode
>>>
>>> 06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>> 06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>>
>>> Following above instructions to repro gives:
>>>
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 
>>> time=0.358 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 
>>> time=0.330 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 
>>> time=0.337 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 
>>> time=0.375 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 
>>> time=0.359 ms
>>> 1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880
>>> ttl=64 time=0.380 ms
>>>
>>> And I didn't see the stalls. This was on an Intel machine.  The 
>>> limit
>>> was stable, went up to around 28K when opened large file and tended 
>>> to
>>> stay between 15-28K.
>>>
>>> The describe problem seems to have characteristics that transmit
>>> interrupts are not at all periodic, and it would seem that some are
>>> taking hundreds of milliseconds to pop.  I don't see anything that
>>> would cause that in the NIC, is it possible there is some activity 
>>> on
>>> the machines periodically and often holding down interrupts for 
>>>  long
>>> periods of time.  Are there any peculiarities on Sun Fire in 
>>> interrupt
>>> handling?
>>>
>>> Can you also provide an 'ethtool -c eth0'
>>>
>>> Thanks,
>>> Tom
>>
>> I also observed the similar behaviour on the following environment.
>>
>> 03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit 
>> Network Connection
>>
>> [    2.962119] e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k
>> [    2.968095] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
>> [    2.974251] e1000e 0000:03:00.0: Disabling ASPM L0s L1
>> [    2.979653] e1000e 0000:03:00.0: (unregistered net_device): 
>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>> [    2.991599] e1000e 0000:03:00.0: irq 72 for MSI/MSI-X
>> [    2.991606] e1000e 0000:03:00.0: irq 73 for MSI/MSI-X
>> [    2.991611] e1000e 0000:03:00.0: irq 74 for MSI/MSI-X
>> [    3.092768] e1000e 0000:03:00.0: eth0: (PCI Express:2.5GT/s:Width 
>> x1) 48:5b:39:75:91:bd
>> [ 3.100992] e1000e 0000:03:00.0: eth0: Intel(R) PRO/1000 Network 
>> Connection
>> [ 3.108173] e1000e 0000:03:00.0: eth0: MAC: 3, PHY: 8, PBA No: 
>> FFFFFF-0FF
>>
>> I tried some coalesce options by 'ethtool -C eth0', but
>> anything didn't help.
>>
>> If I understand the code and spec correctly, TX interrupts are
>> generated when TXDCTL.WTHRESH descriptors have been accumulated
>> and write backed.
>>
>> I tentatively changed the TXDCTL.WTHRESH to 1, then it seems
>> that latency spikes are disappear.
>>
>> drivers/net/ethernet/intel/e1000e/e1000.h
>> @@ -181,7 +181,7 @@ struct e1000_info;
>>  #define E1000_TXDCTL_DMA_BURST_ENABLE                          \
>>        (E1000_TXDCTL_GRAN | /* set descriptor granularity */  \
>>         E1000_TXDCTL_COUNT_DESC |                             \
>> -        (5 << 16) | /* wthresh must be +1 more than desired */\
>> +        (1 << 16) | /* wthresh must be +1 more than desired */\
>>         (1 << 8)  | /* hthresh */                             \
>>         0x1f)       /* pthresh */
>>
>> (before) $ ping -i0.2 192.168.11.2
>> PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
>> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.191 ms
>> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.179 ms
>> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.199 ms
>> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.143 ms
>> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.150 ms
>> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.186 ms
>> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.198 ms
>> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.195 ms
>> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
>> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.200 ms
>> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=651 ms
>> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=451 ms
>> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=241 ms
>> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=31.3 ms
>> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.184 ms
>> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.199 ms
>> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.197 ms
>> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=21 ttl=64 time=0.192 ms
>> 64 bytes from 192.168.11.2: icmp_req=22 ttl=64 time=0.205 ms
>> 64 bytes from 192.168.11.2: icmp_req=23 ttl=64 time=629 ms
>> 64 bytes from 192.168.11.2: icmp_req=24 ttl=64 time=419 ms
>> 64 bytes from 192.168.11.2: icmp_req=25 ttl=64 time=209 ms
>> 64 bytes from 192.168.11.2: icmp_req=26 ttl=64 time=0.280 ms
>> 64 bytes from 192.168.11.2: icmp_req=27 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=28 ttl=64 time=0.194 ms
>> 64 bytes from 192.168.11.2: icmp_req=29 ttl=64 time=0.143 ms
>> 64 bytes from 192.168.11.2: icmp_req=30 ttl=64 time=0.191 ms
>> 64 bytes from 192.168.11.2: icmp_req=31 ttl=64 time=0.144 ms
>> 64 bytes from 192.168.11.2: icmp_req=32 ttl=64 time=0.192 ms
>> 64 bytes from 192.168.11.2: icmp_req=33 ttl=64 time=0.199 ms
>> 64 bytes from 192.168.11.2: icmp_req=34 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=35 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=36 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=37 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=38 ttl=64 time=1600 ms
>> 64 bytes from 192.168.11.2: icmp_req=39 ttl=64 time=1390 ms
>> 64 bytes from 192.168.11.2: icmp_req=40 ttl=64 time=1180 ms
>> 64 bytes from 192.168.11.2: icmp_req=41 ttl=64 time=980 ms
>> 64 bytes from 192.168.11.2: icmp_req=42 ttl=64 time=780 ms
>> 64 bytes from 192.168.11.2: icmp_req=43 ttl=64 time=570 ms
>> 64 bytes from 192.168.11.2: icmp_req=44 ttl=64 time=0.151 ms
>> 64 bytes from 192.168.11.2: icmp_req=45 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=46 ttl=64 time=0.203 ms
>> 64 bytes from 192.168.11.2: icmp_req=47 ttl=64 time=0.185 ms
>> 64 bytes from 192.168.11.2: icmp_req=48 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=49 ttl=64 time=0.204 ms
>> 64 bytes from 192.168.11.2: icmp_req=50 ttl=64 time=0.198 ms
>>
>> I think 1000 ms - 2000 ms delay is come from e1000_watchdog_task().
>>
>> (after) $ ping -i0.2 192.168.11.2
>> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.175 ms
>> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.203 ms
>> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.197 ms
>> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.186 ms
>> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.197 ms
>> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.146 ms
>> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
>> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.195 ms
>> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.190 ms
>> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=0.204 ms
>> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=0.201 ms
>> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.190 ms
>> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.143 ms
>> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.191 ms
>> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.190 ms

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.

^ permalink raw reply

* Re: [RFC PATCH 0/2] Faster/parallel SYN handling to mitigate SYN floods
From: Jesper Dangaard Brouer @ 2012-05-29 20:17 UTC (permalink / raw)
  To: christoph.paasch
  Cc: netdev, Eric Dumazet, David S. Miller, Martin Topholm,
	Florian Westphal, opurdila, Hans Schillstrom, Andi Kleen
In-Reply-To: <4FC3A465.4030203@uclouvain.be>

On Mon, 2012-05-28 at 18:14 +0200, Christoph Paasch wrote:

> On 05/28/2012 01:52 PM, Jesper Dangaard Brouer wrote:
> > The following series is a RFC (Request For Comments) for implementing
> > a faster and parallel handling of TCP SYN connections, to mitigate SYN
> > flood attacks.  This is against DaveM's net (f0d1b3c2bc), as net-next
> > is closed, as DaveM has mentioned numerous times ;-)
> > 
> > Only IPv4 TCP is handled here. The IPv6 TCP code also need to be
> > updated, but I'll deal with that part after we have agreed on a
> > solution for IPv4 TCP.
> > 
> >  Patch 1/2: Is a cleanup, where I split out the SYN cookie handling
> >   from tcp_v4_conn_request() into tcp_v4_syn_conn_limit().
> > 
> >  Patch 2/2: Move tcp_v4_syn_conn_limit() outside bh_lock_sock() in
> >   tcp_v4_rcv().  I would like some input on, (1) if this safe without
> >   the lock, (2) if we need to do some sock lookup, before calling
> >   tcp_v4_syn_conn_limit() (Christoph Paasch
> >   <christoph.paasch@uclouvain.be> mentioned something about SYN
> >   retransmissions)
> 
> Concerning (1):
> I think, there are places where you may have troube because you don't
> hold the lock.
> E.g., in tcp_make_synack (called by tcp_v4_send_synack from your
> tcp_v4_syn_conn_limit) there is:
> 
> if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
> 	(req->window_clamp > tcp_full_space(sk) ||
> 	 req->window_clamp == 0))
> 	req->window_clamp = tcp_full_space(sk);
> 
> Thus, tcp_full_space(sk) may have different values between the check and
> setting req->window_clamp.

This should be simply solved by using a local stack variable, for
storing the result from tcp_full_space(sk).  Its likely that GCC already
does this behind our back.


> Concerning (2):
> 
> Imagine, a SYN coming in, when the reqsk-queue is not yet full. A
> request-sock will be added to the reqsk-queue. Then, a retransmission of
> this SYN comes in and the queue got full by the time. This time
> tcp_v4_syn_conn_limit will do syn-cookies and thus generate a different
> seq-number for the SYN/ACK.

I have addressed your issue, by checking the reqsk_queue in
tcp_v4_syn_conn_limit() before allocating a new req via
inet_reqsk_alloc().
If I find an existing reqsk, I choose to drop it, so the SYN cookie
SYN-ACK takes precedence, as the path/handling of the last ACK doesn't
find this reqsk. This is done under the lock.

Test results show that I can provoke the SYN retransmit situation, and
that performance is still very good. Func call inet_csk_search_req()
only sneaks up to a top 20 on perf report.

Patch on top of this patch:

[RFC PATCH 3/2] tcp: Detect SYN retransmits during SYN flood

 Check for existing connection request (reqsk) as this might
 be a retransmitted SYN which have gotten into the
 reqsk_queue.  If so, we choose to drop the reqsk, and use
 SYN cookies to restore the state later.

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7480fc2..e0c9ba3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1274,8 +1274,10 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
  */
 int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
 {
-       struct request_sock *req;
+       struct request_sock *req = NULL;
        struct inet_request_sock *ireq;
+       struct request_sock *exist_req;
+       struct request_sock **prev;
        struct tcp_options_received tmp_opt;
        __be32 saddr = ip_hdr(skb)->saddr;
        __be32 daddr = ip_hdr(skb)->daddr;
@@ -1303,6 +1305,22 @@ int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
        if (!tcp_syn_flood_action(sk, skb, "TCP"))
                goto drop; /* Not enabled, indicate drop, due to queue full */
 
+       /* Check for existing connection request (reqsk) as this might
+        *   be a retransmitted SYN which have gotten into the
+        *   reqsk_queue.  If so, we choose to drop the reqsk, and use
+        *   SYN cookies to restore the state later.
+        */
+       bh_lock_sock(sk);
+       exist_req = inet_csk_search_req(sk, &prev, tcp_hdr(skb)->source, saddr, daddr);
+       if (exist_req) { /* Drop existing reqsk */
+               if (TCP_SKB_CB(skb)->seq == tcp_rsk(exist_req)->rcv_isn)
+                       net_warn_ratelimited("Retransmitted SYN from %pI4"
+                                            " (orig reqsk dropped)", &saddr);
+
+               inet_csk_reqsk_queue_drop(sk, exist_req, prev);
+       }
+       bh_unlock_sock(sk);
+
        /* Allocate a request_sock */
        req = inet_reqsk_alloc(&tcp_request_sock_ops);
        if (!req) {



I'll post some V2 patches tomorrow, which integrates this changes in
patch 2/2.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related

* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: David Miller @ 2012-05-29 20:18 UTC (permalink / raw)
  To: andi
  Cc: jbrouer, brouer, netdev, christoph.paasch, eric.dumazet, mph, fw,
	opurdila, hans.schillstrom
In-Reply-To: <m2bol6lqxo.fsf@firstfloor.org>

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 29 May 2012 12:37:07 -0700

> Makes sense. Syncookies is a bit obsolete these days of course, due
> to the lack of options. But may be still useful for this.

Please crawl out of your cave, syncookies fully supports all TCP
options these days.

^ permalink raw reply

* Re: [RFC PATCH 0/2] Faster/parallel SYN handling to mitigate SYN floods
From: Christoph Paasch @ 2012-05-29 20:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Eric Dumazet, David S. Miller, Martin Topholm,
	Florian Westphal, opurdila, Hans Schillstrom, Andi Kleen
In-Reply-To: <1338322661.7747.17.camel@localhost>

Hello,

On 05/29/2012 10:17 PM, Jesper Dangaard Brouer wrote:
> On Mon, 2012-05-28 at 18:14 +0200, Christoph Paasch wrote:
> 
>> On 05/28/2012 01:52 PM, Jesper Dangaard Brouer wrote:
>>> The following series is a RFC (Request For Comments) for implementing
>>> a faster and parallel handling of TCP SYN connections, to mitigate SYN
>>> flood attacks.  This is against DaveM's net (f0d1b3c2bc), as net-next
>>> is closed, as DaveM has mentioned numerous times ;-)
>>>
>>> Only IPv4 TCP is handled here. The IPv6 TCP code also need to be
>>> updated, but I'll deal with that part after we have agreed on a
>>> solution for IPv4 TCP.
>>>
>>>  Patch 1/2: Is a cleanup, where I split out the SYN cookie handling
>>>   from tcp_v4_conn_request() into tcp_v4_syn_conn_limit().
>>>
>>>  Patch 2/2: Move tcp_v4_syn_conn_limit() outside bh_lock_sock() in
>>>   tcp_v4_rcv().  I would like some input on, (1) if this safe without
>>>   the lock, (2) if we need to do some sock lookup, before calling
>>>   tcp_v4_syn_conn_limit() (Christoph Paasch
>>>   <christoph.paasch@uclouvain.be> mentioned something about SYN
>>>   retransmissions)
>>
>> Concerning (1):
>> I think, there are places where you may have troube because you don't
>> hold the lock.
>> E.g., in tcp_make_synack (called by tcp_v4_send_synack from your
>> tcp_v4_syn_conn_limit) there is:
>>
>> if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
>> 	(req->window_clamp > tcp_full_space(sk) ||
>> 	 req->window_clamp == 0))
>> 	req->window_clamp = tcp_full_space(sk);
>>
>> Thus, tcp_full_space(sk) may have different values between the check and
>> setting req->window_clamp.
> 
> This should be simply solved by using a local stack variable, for
> storing the result from tcp_full_space(sk).  Its likely that GCC already
> does this behind our back.

The place in tcp_make_synack is not the only one where we may have a race.

E.g., tcp_syn_flood_action or inet_csk_reqsk_queue_is_full.

And you never know which module is loaded behind
security_inet_conn_request and what it will do.

It must be carefully checked if the race really isn't an issue.

>> Concerning (2):
>>
>> Imagine, a SYN coming in, when the reqsk-queue is not yet full. A
>> request-sock will be added to the reqsk-queue. Then, a retransmission of
>> this SYN comes in and the queue got full by the time. This time
>> tcp_v4_syn_conn_limit will do syn-cookies and thus generate a different
>> seq-number for the SYN/ACK.
> 
> I have addressed your issue, by checking the reqsk_queue in
> tcp_v4_syn_conn_limit() before allocating a new req via
> inet_reqsk_alloc().
> If I find an existing reqsk, I choose to drop it, so the SYN cookie
> SYN-ACK takes precedence, as the path/handling of the last ACK doesn't
> find this reqsk. This is done under the lock.

Then the receiver will receive two SYN/ACK's for the same SYN with
different sequence-numbers. As the "SYN cookie SYN-ACK" will arrive
second, it will be discarded and seq-numbers from the first one will be
taken on the client-side.

Then, the connection will never establish, as both sides "agreed" on
different sequence numbers.

I would say, you have to handle the retransmitted SYN as in
tcp_v4_hnd_req by calling tcp_check_req.


Cheers,
Christoph


> Test results show that I can provoke the SYN retransmit situation, and
> that performance is still very good. Func call inet_csk_search_req()
> only sneaks up to a top 20 on perf report.
> 
> Patch on top of this patch:
> 
> [RFC PATCH 3/2] tcp: Detect SYN retransmits during SYN flood
> 
>  Check for existing connection request (reqsk) as this might
>  be a retransmitted SYN which have gotten into the
>  reqsk_queue.  If so, we choose to drop the reqsk, and use
>  SYN cookies to restore the state later.
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 7480fc2..e0c9ba3 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1274,8 +1274,10 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
>   */
>  int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
>  {
> -       struct request_sock *req;
> +       struct request_sock *req = NULL;
>         struct inet_request_sock *ireq;
> +       struct request_sock *exist_req;
> +       struct request_sock **prev;
>         struct tcp_options_received tmp_opt;
>         __be32 saddr = ip_hdr(skb)->saddr;
>         __be32 daddr = ip_hdr(skb)->daddr;
> @@ -1303,6 +1305,22 @@ int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
>         if (!tcp_syn_flood_action(sk, skb, "TCP"))
>                 goto drop; /* Not enabled, indicate drop, due to queue full */
>  
> +       /* Check for existing connection request (reqsk) as this might
> +        *   be a retransmitted SYN which have gotten into the
> +        *   reqsk_queue.  If so, we choose to drop the reqsk, and use
> +        *   SYN cookies to restore the state later.
> +        */
> +       bh_lock_sock(sk);
> +       exist_req = inet_csk_search_req(sk, &prev, tcp_hdr(skb)->source, saddr, daddr);
> +       if (exist_req) { /* Drop existing reqsk */
> +               if (TCP_SKB_CB(skb)->seq == tcp_rsk(exist_req)->rcv_isn)
> +                       net_warn_ratelimited("Retransmitted SYN from %pI4"
> +                                            " (orig reqsk dropped)", &saddr);
> +
> +               inet_csk_reqsk_queue_drop(sk, exist_req, prev);
> +       }
> +       bh_unlock_sock(sk);
> +
>         /* Allocate a request_sock */
>         req = inet_reqsk_alloc(&tcp_request_sock_ops);
>         if (!req) {
> 
> 
> 
> I'll post some V2 patches tomorrow, which integrates this changes in
> patch 2/2.
> 
> 


-- 
Christoph Paasch
PhD Student

IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://mptcp.info.ucl.ac.be
Université Catholique de Louvain
-- 

^ permalink raw reply

* Re: [PATCH] l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case
From: David Miller @ 2012-05-29 21:20 UTC (permalink / raw)
  To: jchapman; +Cc: netdev, levinsasha928
In-Reply-To: <1338298242-22261-1-git-send-email-jchapman@katalix.com>

From: James Chapman <jchapman@katalix.com>
Date: Tue, 29 May 2012 14:30:42 +0100

> An application may call connect() to disconnect a socket using an
> address with family AF_UNSPEC. The L2TP IP sockets were not handling
> this case when the socket is not bound and an attempt to connect()
> using AF_UNSPEC in such cases would result in an oops. This patch
> addresses the problem by protecting the sk_prot->disconnect() call
> against trying to unhash the socket before it is bound.
> 
> The L2TP IPv4 and IPv6 sockets have the same problem. Both are fixed
> by this patch.
> 
> The patch also adds more checks that the sockaddr supplied to bind()
> and connect() calls is valid.
> 
>  RIP: 0010:[<ffffffff82e133b0>]  [<ffffffff82e133b0>] inet_unhash+0x50/0xd0
>  RSP: 0018:ffff88001989be28  EFLAGS: 00010293
>  Stack:
>   ffff8800407a8000 0000000000000000 ffff88001989be78 ffffffff82e3a249
>   ffffffff82e3a050 ffff88001989bec8 ffff88001989be88 ffff8800407a8000
>   0000000000000010 ffff88001989bec8 ffff88001989bea8 ffffffff82e42639
>  Call Trace:
>  [<ffffffff82e3a249>] udp_disconnect+0x1f9/0x290
>  [<ffffffff82e42639>] inet_dgram_connect+0x29/0x80
>  [<ffffffff82d012fc>] sys_connect+0x9c/0x100
> 
> Reported-by: Sasha Levin <levinsasha928@gmail.com>
> Signed-off-by: James Chapman <jchapman@katalix.com>

Applied and queued up for -stable, thanks James.

^ permalink raw reply

* Re: [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: David Miller @ 2012-05-29 21:22 UTC (permalink / raw)
  To: florian; +Cc: devendra.aaru, netdev, linux-kernel
In-Reply-To: <1659482.3g1Zl6FDuM@flexo>

From: Florian Fainelli <florian@openwrt.org>
Date: Tue, 29 May 2012 11:20:50 +0200

> On Monday 28 May 2012 17:27:03 Devendra Naga wrote:
>> the calls after the pci_enable_device may fail, and will error out with out
>> disabling it. disable the device at error paths.
> 
> Looks good, thanks Devendra!
> 
>> 
>> Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
> 
> Acked-by: Florian Fainelli <florian@openwrt.org>

Applied.

^ permalink raw reply

* Re: [PATCH resend] rds_rdma: don't assume infiniband device is PCI
From: David Miller @ 2012-05-29 21:22 UTC (permalink / raw)
  To: venkat.x.venkatsubra; +Cc: cascardo, netdev, dledford, Jes.Sorensen
In-Reply-To: <4FC3DCD6.5020604@oracle.com>

From: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Date: Mon, 28 May 2012 15:15:18 -0500

> On 5/28/2012 1:52 PM, Thadeu Lima de Souza Cascardo wrote:
>> RDS code assumes that the struct ib_device dma_device member, which is
>> a
>> pointer, points to a struct device embedded in a struct pci_dev.
>>
>> This is not the case for ehca, for example, which is a OF driver, and
>> makes dma_device point to a struct device embedded in a struct
>> platform_device.
>>
>> This will make the system crash when rds_rdma is loaded in a system
>> with ehca, since it will try to access the bus member of a
>> non-existent
>> struct pci_dev.
>>
>> The only reason rds_rdma uses the struct pci_dev is to get the NUMA
>> node
>> the device is attached to. Using dev_to_node for that is much better,
>> since it won't assume which bus the infiniband is attached to.
>>
>> Signed-off-by: Thadeu Lima de Souza
>> Cascardo<cascardo@linux.vnet.ibm.com>
>> Cc: dledford@redhat.com
>> Cc: Jes.Sorensen@redhat.com
>> Cc: Venkat Venkatsubra<venkat.x.venkatsubra@oracle.com>
>> ---
>>
> Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: sh_eth: fix the rxdesc pointer when rx descriptor empty happens
From: David Miller @ 2012-05-29 21:23 UTC (permalink / raw)
  To: yoshihiro.shimoda.uh; +Cc: netdev, linux-sh
In-Reply-To: <4FC491EB.4040002@renesas.com>

From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
Date: Tue, 29 May 2012 18:07:55 +0900

> When Receive Descriptor Empty happens, rxdesc pointer of the driver
> and actual next descriptor of the controller may be mismatch.
> This patch fixes it.
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>

Applied.

^ permalink raw reply

* Re: [PATCH] asix: allow full size 8021Q frames to be received
From: David Miller @ 2012-05-29 21:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, gregkh, trond, grundler, pstew, allan
In-Reply-To: <1338280301.4120.0.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 29 May 2012 10:31:41 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> asix driver drops 8021Q full size frames because it doesn't take into
> account VLAN header size.
> 
> Tested on AX88772 adapter.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH] skb: avoid unnecessary reallocations in __skb_cow
From: David Miller @ 2012-05-29 21:23 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nbd, netdev
In-Reply-To: <1338298980.2840.32.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 29 May 2012 15:43:00 +0200

> On Tue, 2012-05-29 at 15:35 +0200, Felix Fietkau wrote:
> 
>> 
>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>> ---
>>  include/linux/skbuff.h |    2 --
>>  1 files changed, 0 insertions(+), 2 deletions(-)
>> 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: David Miller @ 2012-05-29 21:28 UTC (permalink / raw)
  To: florian; +Cc: devendra.aaru, netdev, linux-kernel
In-Reply-To: <20120529.172229.277102319861990758.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 29 May 2012 17:22:29 -0400 (EDT)

> From: Florian Fainelli <florian@openwrt.org>
> Date: Tue, 29 May 2012 11:20:50 +0200
> 
>> On Monday 28 May 2012 17:27:03 Devendra Naga wrote:
>>> the calls after the pci_enable_device may fail, and will error out with out
>>> disabling it. disable the device at error paths.
>> 
>> Looks good, thanks Devendra!
>> 
>>> 
>>> Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
>> 
>> Acked-by: Florian Fainelli <florian@openwrt.org>
> 
> Applied.

Actually, reverted, you didn't test this patch at all.

You didn't even look at the warnings emitted by the compiler
with your changes installed.

You're passing a network device pointer into the PCI device
disable routine, which takes a PCI device pointer.

I can't express to you how extremely irritating it is when
people submit patches like this.

The bug in question is 1,000 times less harmful than your fix.

^ permalink raw reply

* [PATCH] 8021q/vlan: process NETDEV_GOING_DOWN
From: Eldad Zack @ 2012-05-29 22:30 UTC (permalink / raw)
  To: Patrick McHardy, David S. Miller; +Cc: netdev, linux-kernel, Eldad Zack

In the current flow, when you take down a physical device that has
VLANs configured on it, the NETDEV_GOING_DOWN notification will be
sent too late, i.e., no data can be sent to the wire anymore.

static int __dev_close_many(struct list_head *head)
{
...
	list_for_each_entry(dev, head, unreg_list) {
		call_netdevice_notifiers(NETDEV_GOING_DOWN, dev);
...
	}
...
	list_for_each_entry(dev, head, unreg_list) {
		if (ops->ndo_stop)
			ops->ndo_stop(dev);
	}
...
}

static int dev_close_many(struct list_head *head)
{
	...
	__dev_close_many(head);

	list_for_each_entry(dev, head, unreg_list) {
		rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING);
		call_netdevice_notifiers(NETDEV_DOWN, dev);
	}
}

In a setup like this: eth0 with VLANs 2, 3 the flow would be:

eth0 - NETDEV_GOING_DOWN
ndo_stop is called on the device
eth0.2 - NETDEV_GOING_DOWN
eth0.2 - NETDEV_DOWN
eth0.3 - NETDEV_GOING_DOWN
eth0.3 - NETDEV_DOWN
eth0 - NETDEV_DOWN

If instead NETDEV_GOING_DOWN is processed, the flow would be:
eth0.2 - NETDEV_GOING_DOWN
eth0.2 - NETDEV_DOWN
eth0.3 - NETDEV_GOING_DOWN
eth0.3 - NETDEV_DOWN
eth0 - NETDEV_GOING_DOWN
eth0 - NETDEV_DOWN
ndo_stop is called on the device

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
---
 net/8021q/vlan.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 6089f0c..fd87ecc 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -402,6 +402,12 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 
 		break;
 
+	case NETDEV_GOING_DOWN: /* NETDEV_DOWN */
+	/* If the parent device is going down it will call ndo_stop after
+	 * it sends out NETDEV_GOING_DOWN but before sending out NETDEV_DOWN,
+	 * The effect of which is, that no data can be sent anymore
+	 * by the time the VLAN device sends out its NETDEV_GOING_DOWN.
+	 */
 	case NETDEV_DOWN:
 		/* Put all VLANs for this dev in the down state too.  */
 		for (i = 0; i < VLAN_N_VID; i++) {
-- 
1.7.10

^ permalink raw reply related

* [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: Devendra Naga @ 2012-05-29 23:14 UTC (permalink / raw)
  To: Florian Fainelli, netdev, linux-kernel; +Cc: Devendra Naga

the calls after the pci_enable_device may fail, and will error out with out
disabling it. disable the device at error paths.

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
---
 drivers/net/ethernet/rdc/r6040.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index 4de7364..8f5079a 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -1096,20 +1096,20 @@ static int __devinit r6040_init_one(struct pci_dev *pdev,
 	if (err) {
 		dev_err(&pdev->dev, "32-bit PCI DMA addresses"
 				"not supported by the card\n");
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 	err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
 	if (err) {
 		dev_err(&pdev->dev, "32-bit PCI DMA addresses"
 				"not supported by the card\n");
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 
 	/* IO Size check */
 	if (pci_resource_len(pdev, bar) < io_size) {
 		dev_err(&pdev->dev, "Insufficient PCI resources, aborting\n");
 		err = -EIO;
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 
 	pci_set_master(pdev);
@@ -1117,7 +1117,7 @@ static int __devinit r6040_init_one(struct pci_dev *pdev,
 	dev = alloc_etherdev(sizeof(struct r6040_private));
 	if (!dev) {
 		err = -ENOMEM;
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 	SET_NETDEV_DEV(dev, &pdev->dev);
 	lp = netdev_priv(dev);
@@ -1238,6 +1238,8 @@ err_out_free_res:
 	pci_release_regions(pdev);
 err_out_free_dev:
 	free_netdev(dev);
+err_out_disable_dev:
+	pci_disable_device(pdev);
 err_out:
 	return err;
 }
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH] r6040: Do a Proper deinit at errorpath and also when driver unloads (calling r6040_remove_one)
From: Devendra Naga @ 2012-05-29 23:15 UTC (permalink / raw)
  To: Florian Fainelli, netdev, linux-kernel; +Cc: Devendra Naga

so if mdiobus_alloc fails, the errorpath doesnt do a netif_napi_del and also
doesn't set the priv data of the driver to NULL.

at the driver unload stage the driver doesn't remove the NAPI context, and
doesnt' set the priv data to NULL, and also doesn't call the pci_iounmap.

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
---
 drivers/net/ethernet/rdc/r6040.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index 8f5079a..231b9a6 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -1233,6 +1233,8 @@ err_out_mdio_irq:
 err_out_mdio:
 	mdiobus_free(lp->mii_bus);
 err_out_unmap:
+	netif_napi_del(&lp->napi);
+	pci_set_drvdata(pdev, NULL);
 	pci_iounmap(pdev, ioaddr);
 err_out_free_res:
 	pci_release_regions(pdev);
@@ -1253,6 +1255,9 @@ static void __devexit r6040_remove_one(struct pci_dev *pdev)
 	mdiobus_unregister(lp->mii_bus);
 	kfree(lp->mii_bus->irq);
 	mdiobus_free(lp->mii_bus);
+	netif_napi_del(&lp->napi);
+	pci_set_drvdata(pdev, NULL);
+	pci_iounmap(pdev, lp->base);
 	pci_release_regions(pdev);
 	free_netdev(dev);
 	pci_disable_device(pdev);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: David Miller @ 2012-05-29 23:18 UTC (permalink / raw)
  To: devendra.aaru; +Cc: florian, netdev, linux-kernel
In-Reply-To: <1338333250-11974-1-git-send-email-devendra.aaru@gmail.com>

Please number your patches so that there is no ambguity as to what
order the patches should be applied.

That is especially important for patches like these two which make
changes in the same areas of code.

^ permalink raw reply

* Re: [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: devendra.aaru @ 2012-05-29 23:19 UTC (permalink / raw)
  To: David Miller; +Cc: florian, netdev, linux-kernel
In-Reply-To: <20120529.172843.1775456038320356018.davem@davemloft.net>

Hi David,

On Wed, May 30, 2012 at 2:58 AM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Tue, 29 May 2012 17:22:29 -0400 (EDT)
>
> Actually, reverted, you didn't test this patch at all.
>
> You didn't even look at the warnings emitted by the compiler
> with your changes installed.
>
> You're passing a network device pointer into the PCI device
> disable routine, which takes a PCI device pointer.
>
> I can't express to you how extremely irritating it is when
> people submit patches like this.
>
> The bug in question is 1,000 times less harmful than your fix.

Sorry about that, i agree that i didn't even tested it as i dont have
r6040 network adapter here, but i didn't even compiled it.

i sent the patch again to fix this problem.

Sorry,
Devendra.

^ permalink raw reply

* Re: [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: devendra.aaru @ 2012-05-29 23:22 UTC (permalink / raw)
  To: David Miller; +Cc: florian, netdev, linux-kernel
In-Reply-To: <20120529.191810.2141710455829509698.davem@davemloft.net>

Hello David,

On Wed, May 30, 2012 at 4:48 AM, David Miller <davem@davemloft.net> wrote:
>
> Please number your patches so that there is no ambguity as to what
> order the patches should be applied.
>
> That is especially important for patches like these two which make
> changes in the same areas of code.

OK, so may i put these two changes into one and send it? or you want
it to be two patches?

Thanks,
Devendra.

^ permalink raw reply

* Re: [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: David Miller @ 2012-05-29 23:24 UTC (permalink / raw)
  To: devendra.aaru; +Cc: florian, netdev, linux-kernel
In-Reply-To: <CAHdPZaMfSKRzT4MN-5g3c1bOah4CB80aWyVhGKu1Z5AXTj2uTw@mail.gmail.com>

From: "devendra.aaru" <devendra.aaru@gmail.com>
Date: Wed, 30 May 2012 04:52:08 +0530

> Hello David,
> 
> On Wed, May 30, 2012 at 4:48 AM, David Miller <davem@davemloft.net> wrote:
>>
>> Please number your patches so that there is no ambguity as to what
>> order the patches should be applied.
>>
>> That is especially important for patches like these two which make
>> changes in the same areas of code.
> 
> OK, so may i put these two changes into one and send it? or you want
> it to be two patches?

It's up to you.

^ permalink raw reply

* Re: [PATCH] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: devendra.aaru @ 2012-05-29 23:34 UTC (permalink / raw)
  To: David Miller; +Cc: florian, netdev, linux-kernel
In-Reply-To: <20120529.192438.1186076504142998357.davem@davemloft.net>

Hello David,

On Wed, May 30, 2012 at 4:54 AM, David Miller <davem@davemloft.net> wrote:
> From: "devendra.aaru" <devendra.aaru@gmail.com>
> Date: Wed, 30 May 2012 04:52:08 +0530
>
>> Hello David,
>>
>>
>> OK, so may i put these two changes into one and send it? or you want
>> it to be two patches?
>
> It's up to you.

OK, i will be sending two patches shortly(with sequence numbers (git
format-patch -s -n)).

This time i have done compile testing and got no warnings.

Thanks,
Devendra.

^ permalink raw reply

* [PATCH V2 1/2] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: Devendra Naga @ 2012-05-29 23:38 UTC (permalink / raw)
  To: David Miller, Florian Fainelli, netdev, linux-kernel; +Cc: Devendra Naga

the calls after the pci_enable_device may fail, and will error out with out
disabling it. disable the device at error paths.

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
---
 drivers/net/ethernet/rdc/r6040.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index 4de7364..f5e6f1f 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -1096,20 +1096,20 @@ static int __devinit r6040_init_one(struct pci_dev *pdev,
 	if (err) {
 		dev_err(&pdev->dev, "32-bit PCI DMA addresses"
 				"not supported by the card\n");
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 	err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
 	if (err) {
 		dev_err(&pdev->dev, "32-bit PCI DMA addresses"
 				"not supported by the card\n");
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 
 	/* IO Size check */
 	if (pci_resource_len(pdev, bar) < io_size) {
 		dev_err(&pdev->dev, "Insufficient PCI resources, aborting\n");
 		err = -EIO;
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 
 	pci_set_master(pdev);
@@ -1117,7 +1117,7 @@ static int __devinit r6040_init_one(struct pci_dev *pdev,
 	dev = alloc_etherdev(sizeof(struct r6040_private));
 	if (!dev) {
 		err = -ENOMEM;
-		goto err_out;
+		goto err_out_disable_dev;
 	}
 	SET_NETDEV_DEV(dev, &pdev->dev);
 	lp = netdev_priv(dev);
@@ -1238,6 +1238,8 @@ err_out_free_res:
 	pci_release_regions(pdev);
 err_out_free_dev:
 	free_netdev(dev);
+err_out_disable_dev:
+	pci_disable_device(pdev);
 err_out:
 	return err;
 }
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH V2 2/2] r6040: Do a Proper deinit at errorpath and also when driver unloads (calling r6040_remove_one)
From: Devendra Naga @ 2012-05-29 23:43 UTC (permalink / raw)
  To: David Miller, Florian Fainelli, netdev, linux-kernel; +Cc: Devendra Naga

so if mdiobus_alloc fails, the errorpath doesnt do a netif_napi_del and also
doesn't set the priv data of the driver to NULL.

at the driver unload stage the driver doesn't remove the NAPI context, and
doesnt' set the priv data to NULL, and also doesn't call the pci_iounmap.

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
---
 drivers/net/ethernet/rdc/r6040.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index f5e6f1f..d1827e8 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -1233,6 +1233,8 @@ err_out_mdio_irq:
 err_out_mdio:
 	mdiobus_free(lp->mii_bus);
 err_out_unmap:
+	netif_napi_del(&lp->napi);
+	pci_set_drvdata(pdev, NULL);
 	pci_iounmap(pdev, ioaddr);
 err_out_free_res:
 	pci_release_regions(pdev);
@@ -1253,6 +1255,9 @@ static void __devexit r6040_remove_one(struct pci_dev *pdev)
 	mdiobus_unregister(lp->mii_bus);
 	kfree(lp->mii_bus->irq);
 	mdiobus_free(lp->mii_bus);
+	netif_napi_del(&lp->napi);
+	pci_set_drvdata(pdev, NULL);
+	pci_iounmap(pdev, lp->base);
 	pci_release_regions(pdev);
 	free_netdev(dev);
 	pci_disable_device(pdev);
-- 
1.7.9.5

^ permalink raw reply related

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Hiroaki SHIMODA @ 2012-05-30  0:06 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Denys Fedoryshchenko, netdev, e1000-devel, jeffrey.t.kirsher,
	jesse.brandeburg, eric.dumazet, davem
In-Reply-To: <CA+mtBx_uYy9XcRvpD2E46FuMBFu38iQvCiwFHWqbhPBmY=JfOg@mail.gmail.com>

While reading the bql code, I have some questions.

1) dql_completed() and dql_queued() can be called concurrently,
   so dql->num_queued could change while processing
   dql_completed().
   Is it intentional to refer num_queued from "dql->" each time ?

2) From the comment in the code
   *   - The queue was over-limit in the previous interval and
   *     when enqueuing it was possible that all queued data
   *     had been consumed.

   and

   * Queue was not starved, check if the limit can be decreased.
   * A decrease is only considered if the queue has been busy in
   * the whole interval (the check above). 

   the calculation of all_prev_completed should take into account
   completed == dql->prev_num_queued case ?
   On current implementation, limit shrinks easily and some NIC
   hit TX stalls.
   To mitigate TX stalls, should we fix all_prev_completed rather
   than individual driver ?

3) limit calculation fails to consider integer wrap around in
   one place ?

Here is the patch what I meant.

diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
@@ -11,22 +11,27 @@
 #include <linux/dynamic_queue_limits.h>
 
 #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
+#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
+#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
 /* Records completed count and recalculates the queue limit */
 void dql_completed(struct dql *dql, unsigned int count)
 {
 	unsigned int inprogress, prev_inprogress, limit;
-	unsigned int ovlimit, all_prev_completed, completed;
+	unsigned int ovlimit, completed, num_queued;
+	bool all_prev_completed;
+
+	num_queued = dql->num_queued;
 
 	/* Can't complete more than what's in queue */
-	BUG_ON(count > dql->num_queued - dql->num_completed);
+	BUG_ON(count > num_queued - dql->num_completed);
 
 	completed = dql->num_completed + count;
 	limit = dql->limit;
-	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
-	inprogress = dql->num_queued - completed;
+	ovlimit = POSDIFF(num_queued - dql->num_completed, limit);
+	inprogress = num_queued - completed;
 	prev_inprogress = dql->prev_num_queued - dql->num_completed;
-	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
+	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
 
 	if ((ovlimit && !inprogress) ||
 	    (dql->prev_ovlimit && all_prev_completed)) {
@@ -45,7 +50,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 		 *     of bytes both sent and completed in the last interval,
 		 *     plus any previous over-limit.
 		 */
-		limit += POSDIFF(completed, dql->prev_num_queued) +
+		limit += POSDIFFI(completed, dql->prev_num_queued) +
 		     dql->prev_ovlimit;
 		dql->slack_start_time = jiffies;
 		dql->lowest_slack = UINT_MAX;
@@ -104,7 +109,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 	dql->prev_ovlimit = ovlimit;
 	dql->prev_last_obj_cnt = dql->last_obj_cnt;
 	dql->num_completed = completed;
-	dql->prev_num_queued = dql->num_queued;
+	dql->prev_num_queued = num_queued;
 }
 EXPORT_SYMBOL(dql_completed);

^ permalink raw reply

* Re: [PATCH] 8021q/vlan: process NETDEV_GOING_DOWN
From: David Miller @ 2012-05-30  0:40 UTC (permalink / raw)
  To: eldad; +Cc: kaber, netdev, linux-kernel
In-Reply-To: <1338330635-27259-1-git-send-email-eldad@fogrefinery.com>

From: Eldad Zack <eldad@fogrefinery.com>
Date: Wed, 30 May 2012 00:30:35 +0200

> In the current flow, when you take down a physical device that has
> VLANs configured on it, the NETDEV_GOING_DOWN notification will be
> sent too late, i.e., no data can be sent to the wire anymore.

Why do you need to send data?  Any queued up data should be purged not
sent.

^ permalink raw reply

* Re: [PATCH V2 1/2] r6040: disable pci device if the subsequent calls (after pci_enable_device) fails
From: David Miller @ 2012-05-30  2:31 UTC (permalink / raw)
  To: devendra.aaru; +Cc: florian, netdev, linux-kernel
In-Reply-To: <1338334722-14730-1-git-send-email-devendra.aaru@gmail.com>

From: Devendra Naga <devendra.aaru@gmail.com>
Date: Wed, 30 May 2012 05:08:42 +0530

> the calls after the pci_enable_device may fail, and will error out with out
> disabling it. disable the device at error paths.
> 
> Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox