Netdev List


* Linux TCP's Robustness to Multipath Packet Reordering
From: Dominik Kaspar @ 2011-04-25 10:37 UTC
  To: netdev

Hello,

Knowing how critical packet reordering is for standard TCP, I am
currently testing how robust Linux TCP is when packets are forwarded
over multiple paths (with different bandwidth and RTT). Since Linux
TCP adapts its "dupAck threshold" to an estimated level of packet
reordering, I expect it to be much more robust than a standard TCP
that strictly follows the RFCs. Indeed, as you can see in the
following plot, my experiments show a step-wise adaptation of Linux
TCP to heavy reordering. After many minutes, Linux TCP finally reaches
a data throughput close to the perfect aggregated data rate of two
paths (emulated with characteristics similar to IEEE 802.11b (WLAN)
and a 3G link (HSPA)):

http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-00.png

Does anyone have a clue what's going on here? Why does the aggregated
throughput increase in steps? And what could be the reason it takes
minutes to adapt to the full capacity, when in other cases Linux TCP
adapts much faster (for example, if the bandwidths of both paths are
equal)? I would highly appreciate some advice from the netdev
community.

Implementation details:
This multipath TCP experiment ran between a sending machine with a
single Ethernet interface (eth0) and a client with two Ethernet
interfaces (eth1, eth2). The machines are connected through a switch
and tc/netem is used to emulate the bandwidth and RTT of both paths.
TCP connections are established using iperf between eth0 and eth1 (the
primary path). At the sender, an iptables' NFQUEUE is used to "spoof"
the destination IP address of outgoing packets and force some to
travel to eth2 instead of eth1 (the secondary path). This multipath
scheduling happens in proportion to the emulated bandwidths, so if the
paths are set to 500 and 1000 KB/s, then packets are distributed in a
1:2 ratio. At the client, iptables' RAWDNAT is used to translate the
spoofed IP addresses back to the original address, so that all packets end
up at eth1, although a portion actually travelled via eth2. ACKs are
not scheduled over multiple paths, but always travel back on the
primary path. TCP does not notice anything of the multipath
forwarding, except the side-effect of packet reordering, which can be
huge if the path RTTs are set very differently.
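
The bandwidth-proportional scheduling described above can be sketched as a
smooth weighted round-robin decision per packet. This is only an illustration
under assumptions, not the NFQUEUE code used in the experiment: struct path
and pick_path() are hypothetical names, and the weights stand in for the
emulated bandwidths.

```c
#include <stddef.h>

/* Hypothetical per-path state; weight is the emulated bandwidth (e.g. KB/s). */
struct path {
	long weight;
	long credit;
};

/*
 * Smooth weighted round-robin: each call adds every path's weight to its
 * credit, picks the path with the largest credit, and charges that path
 * the sum of all weights. Over time, path i is chosen in proportion to
 * weight_i / total. Returns the index of the chosen path.
 */
size_t pick_path(struct path *paths, size_t n)
{
	long total = 0;
	size_t best = 0;

	for (size_t i = 0; i < n; i++) {
		paths[i].credit += paths[i].weight;
		total += paths[i].weight;
		if (paths[i].credit > paths[best].credit)
			best = i;
	}
	paths[best].credit -= total;
	return best;
}
```

With weights 500 and 1000, every group of three consecutive packets puts one
on the first path and two on the second, which is the 1:2 split mentioned
above.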

Best regards,
Dominik


* Re: RPS will assign different smp_processor_id for the same packet?
From: Neil Horman @ 2011-04-25 11:02 UTC
  To: zhou rui; +Cc: Eric Dumazet, Tom Herbert, netdev@vger.kernel.org
In-Reply-To: <BANLkTintzzpixPi+HDzGUV_agCGr6EX7AQ@mail.gmail.com>

On Sun, Apr 24, 2011 at 05:36:51PM +0800, zhou rui wrote:
> On Sun, Apr 24, 2011 at 4:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le dimanche 24 avril 2011 à 10:00 +0800, zhou rui a écrit :
> >> On Sun, Apr 24, 2011 at 3:56 AM, Tom Herbert <therbert@google.com> wrote:
> >> > On Sat, Apr 23, 2011 at 8:31 AM, zhou rui <zhourui.cn@gmail.com> wrote:
> >> >> one more question is:
> >> >>
> >> >> in the function "int netif_receive_skb(struct sk_buff *skb)"
> >> >>
> >> >> cpu = get_rps_cpu(skb->dev, skb, &rflow);
> >> >> if (cpu >= 0) {
> >> >>  ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> >> >> ....
> >> >>
> >> >> probably the cpu is different from the current processor id?(smp_processor_id)
> >> >> let's say: get_rps_cpu->cpu 0, smp_processor_id->cpu1
> >> >> when this happen, does it mean that cpu1 is handling the softirq but
> >> >> have to divert the packet to cpu0?(via a new softirq?)
> >> >>
> >> >> so for one packet it involve 2 softirqs?
> >> >>
> >> >> possible to get_rps_cpu in interrupt, then let the target cpu do only
> >> >> one softirq to handle the packet?
> >> >>
> >> > Yes, this is what a non-NAPI driver would do.
> >> >
> >> > Tom
> >> >
> >> non-NAPI will get_rps_cpu in irq, but why NAPI will get_rps_cpu in
> >> softirq?(if I understand correctly netif_receive_skb executed in
> >> softirq?)
> >
> > It's hard to understand what your problem or question is.
> >
> > All heavy Receive networking stuff is handled in softirq.
> >
> > NAPI allows to reduce number of hardware IRQS in stress/load situations,
> > fetching several frames at once.
> >
> >
> >
> >
> 
> my understanding:
> 
> non-NAPI scenario:
> 
> netif_rx (in irq: get_rps_cpu, enqueue_to_backlog to deliver the packet to
> the cpu queue) ------> net_rx_action (in softirq: dequeue and process packet)
> 
> 
> NAPI:
> 
> what does RPS do? (in irq) ------> net_rx_action (softirq) ------>
> netif_receive_skb (get_rps_cpu, enqueue packet)
> 
> so my question is:
> for NAPI, get_rps_cpu will be done in softirq?
> 
> if the above situation is true,will this happen?
> packet_for_cpu_1 --> cpu0(netif_receive_skb,in softirq) --->delivered
> to cpu1(softirq)
As Eric noted, NAPI enabled drivers using RPS will have a flow like he
described:
irq
 napi_schedule(driver)
  driver poll
   netif_receive_skb
    enqueue_to_backlog
     ipi
      napi_schedule (backlog dev)
       backlog dev poll
        netif_receive_skb

Drivers using netif_rx do the same thing; the only difference is that they
schedule the local backlog device for a napi poll rather than their own device.
This is done because the ipi to the remote cpu is issued from within the napi
softirq, so we need to kick the local cpu to run a napi poll cycle after such a
driver receives a frame.
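
To make the "which cpu" part of the flow above concrete, here is a toy
userspace model of the steering decision. It is a sketch under assumptions,
not kernel code: rps_target_cpu() and rps_needs_ipi() are hypothetical names,
and the multiply-and-shift scaling of the flow hash is modeled on the table
lookup done in get_rps_cpu().

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Map a 32-bit flow hash onto one entry of a cpu map by scaling the hash
 * into the range [0, len). Packets of the same flow carry the same hash
 * and therefore always land on the same cpu.
 */
int rps_target_cpu(uint32_t flow_hash, const int *cpu_map, uint32_t len)
{
	return cpu_map[((uint64_t)flow_hash * len) >> 32];
}

/*
 * True when the cpu running the driver poll is not the target cpu, i.e.
 * the packet must be queued on a remote backlog and that cpu kicked by ipi.
 */
bool rps_needs_ipi(int current_cpu, int target_cpu)
{
	return target_cpu != current_cpu;
}
```

So a packet whose hash selects cpu 0 while the driver poll runs on cpu 1 is
enqueued to cpu 0's backlog and processed there in a second softirq, which is
the two-softirq case asked about earlier in the thread.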

Neil



* Re: Hight speed data sending from custom IP out of kernel
From: Michal Simek @ 2011-04-25 11:18 UTC
  To: Eric Dumazet; +Cc: juice, netdev
In-Reply-To: <1303373925.3685.6.camel@edumazet-laptop>

Hi,

Eric Dumazet wrote:
> Le jeudi 21 avril 2011 à 10:02 +0200, Michal Simek a écrit :
> 
>> Thanks for that. I am looking at pktgen. On UDP my system is able to send full 
>> bandwidth on 100Mbit/s ethernet and 220Mbit/s on 1G/s.
>> I will let you know when I have any useful results.
> 
> 220Mbits/s in pktgen or an application ?
> - how many packets per second ? (or packet size ?)
> 
> pktgen has the "clone_skb 100" thing that avoid skb_alloc()/skb_free()
> overhead, and permits to really test driver performance.
> 
> It also bypass qdisc management.
> 

I have reused part of the code from pktgen and found that some IDs are
missing. That is why I wrote the simple patch below for pktgen, which updates
the IP ID field so that I can find out whether all packets are sent. As you
suggested, some IDs are missing here as well.
My question is whether there is any mechanism I can use to ensure that all
IDs are sent.

My next question is about packet fragments. Is it possible to set up IP
fragments from a higher level? I do it at a low level, as pktgen does, and I
have changed the page address to point to the memory I need to send, but this
is under UDP.

The point is to create a packet with frags > 1, where the first fragment is
the IP/TCP header and the second fragment contains a pointer to data that is
already prepared in memory and will be copied directly by the network driver.
I am doing this with the same hacked code from pktgen. Is it possible to do
it at a higher level?

Thanks,
Michal

For 2.6.37.6
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 33bc382..3429eb3 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3500,6 +3500,13 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
                 pkt_dev->last_pkt_size = pkt_dev->skb->len;
                 pkt_dev->allocated_skbs++;
                 pkt_dev->clone_count = 0;       /* reset counter */
+       } else {
+               struct iphdr *iph;
+               iph = ip_hdr(pkt_dev->skb);
+               iph->id = htons(pkt_dev->ip_id);
+               pkt_dev->ip_id++;
+               iph->check = 0;
+               iph->check = ip_fast_csum((void *)iph, iph->ihl);
         }

         if (pkt_dev->delay && pkt_dev->last_ok)

-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian


* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Eric Dumazet @ 2011-04-25 11:25 UTC
  To: Dominik Kaspar; +Cc: netdev
In-Reply-To: <BANLkTimpgXCpweZKCihCQkLjSZw5zL4=Pg@mail.gmail.com>

Le lundi 25 avril 2011 à 12:37 +0200, Dominik Kaspar a écrit :
> Hello,
> 
> Knowing how critical packet reordering is for standard TCP, I am
> currently testing how robust Linux TCP is when packets are forwarded
> over multiple paths (with different bandwidth and RTT). Since Linux
> TCP adapts its "dupAck threshold" to an estimated level of packet
> reordering, I expect it to be much more robust than a standard TCP
> that strictly follows the RFCs. Indeed, as you can see in the
> following plot, my experiments show a step-wise adaptation of Linux
> TCP to heavy reordering. After many minutes, Linux TCP finally reaches
> a data throughput close to the perfect aggregated data rate of two
> paths (emulated with characteristics similar to IEEE 802.11b (WLAN)
> and a 3G link (HSPA)):
> 
> http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-00.png
> 
> Does anyone have clues what's going on here? Why does the aggregated
> throughput increase in steps? And what could be the reason it takes
> minutes to adapt to the full capacity, when in other cases, Linux TCP
> adapts much faster (for example if the bandwidth of both paths are
> equal). I would highly appreciate some advice from the netdev
> community.
> 
> Implementation details:
> This multipath TCP experiment ran between a sending machine with a
> single Ethernet interface (eth0) and a client with two Ethernet
> interfaces (eth1, eth2). The machines are connected through a switch
> and tc/netem is used to emulate the bandwidth and RTT of both paths.
> TCP connections are established using iperf between eth0 and eth1 (the
> primary path). At the sender, an iptables' NFQUEUE is used to "spoof"
> the destination IP address of outgoing packets and force some to
> travel to eth2 instead of eth1 (the secondary path). This multipath
> scheduling happens in proportion to the emulated bandwidths, so if the
> paths are set to 500 and 1000 KB/s, then packets are distributed in a
> 1:2 ratio. At the client, iptables' RAWDNAT is used to translate the
> spoofed IP addresses back to their original, so that all packets end
> up at eth1, although a portion actually travelled to eth2. ACKs are
> not scheduled over multiple paths, but always travel back on the
> primary path. TCP does not notice anything of the multipath
> forwarding, except the side-effect of packet reordering, which can be
> huge if the path RTTs are set very differently.
> 

Hi Dominik

Implementation details of the tc/netem stages are important to fully
understand how the TCP stack can react.

Is TSO active on the sender side, for example?

Your results show that only some exceptional events make bandwidth
really change.

A tcpdump/pcap of the first ~10,000 packets would be nice to provide (not on
the mailing list, but on your web site).





* Re: Hight speed data sending from custom IP out of kernel
From: Eric Dumazet @ 2011-04-25 12:14 UTC
  To: monstr; +Cc: juice, netdev
In-Reply-To: <4DB55876.2010008@monstr.eu>

Le lundi 25 avril 2011 à 13:18 +0200, Michal Simek a écrit :
> Hi,
> 
> Eric Dumazet wrote:
> > Le jeudi 21 avril 2011 à 10:02 +0200, Michal Simek a écrit :
> > 
> >> Thanks for that. I am looking at pktgen. On UDP my system is able to send full 
> >> bandwidth on 100Mbit/s ethernet and 220Mbit/s on 1G/s.
> >> I will let you know when I have any useful results.
> > 
> > 220Mbits/s in pktgen or an application ?
> > - how many packets per second ? (or packet size ?)
> > 
> > pktgen has the "clone_skb 100" thing that avoid skb_alloc()/skb_free()
> > overhead, and permits to really test driver performance.
> > 
> > It also bypass qdisc management.
> > 
> 
> I have reused the part of code from pktgen and I have found that I am missing 
> some IDs that's why I have done one simple patch(below) in pktgen which is 
> update IP ID field to find out if all packets are sent or not. As you suggest I 
> am also missing some IDs here.
> My question is if I can use any mechanism to ensure to sending all IDs?
> 
> The next my question about packet fragments. Is it possible to setup IP 
> fragments from higher level? I do it on low level as pktgen and I have change 
> page address to memory which I need to send but it in under UDP.
> 
> The point is to create packet with frags > 1 where the first fragment is IP/TCP 
> header and the second fragments contains pointer to data which are prepared in 
> the memory and will be copied directly by network driver. I am doing the same 
> hacked code from pktgen. Is it possible to do it on higher level?
> 

sendfile() is mostly doing this.

> Thanks,
> Michal
> 
> 
> For 2.6.37.6
> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> index 33bc382..3429eb3 100644
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -3500,6 +3500,13 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>                  pkt_dev->last_pkt_size = pkt_dev->skb->len;
>                  pkt_dev->allocated_skbs++;
>                  pkt_dev->clone_count = 0;       /* reset counter */
> +       } else {
> +               struct iphdr *iph;
> +               iph = ip_hdr(pkt_dev->skb);
> +               iph->id = htons(pkt_dev->ip_id);
> +               pkt_dev->ip_id++;
> +               iph->check = 0;
> +               iph->check = ip_fast_csum((void *)iph, iph->ihl);
>          }
> 
>          if (pkt_dev->delay && pkt_dev->last_ok)
> 
> 
> 


Well, you can't do that in pktgen, since you're changing the previous packet's
content (it might still be in the device TX queue, not yet sent, or being
sent right now).

Now, if all you want to do is send many packets from pktgen (with only
the ID changing), you could add a fast path that avoids rebuilding new
packets from scratch.





* Re: Hight speed data sending from custom IP out of kernel
From: Michal Simek @ 2011-04-25 12:18 UTC
  To: Eric Dumazet; +Cc: juice, netdev
In-Reply-To: <1303733669.2747.115.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le lundi 25 avril 2011 à 13:18 +0200, Michal Simek a écrit :
>> Hi,
>>
>> Eric Dumazet wrote:
>>> Le jeudi 21 avril 2011 à 10:02 +0200, Michal Simek a écrit :
>>>
>>>> Thanks for that. I am looking at pktgen. On UDP my system is able to send full 
>>>> bandwidth on 100Mbit/s ethernet and 220Mbit/s on 1G/s.
>>>> I will let you know when I have any useful results.
>>> 220Mbits/s in pktgen or an application ?
>>> - how many packets per second ? (or packet size ?)
>>>
>>> pktgen has the "clone_skb 100" thing that avoid skb_alloc()/skb_free()
>>> overhead, and permits to really test driver performance.
>>>
>>> It also bypass qdisc management.
>>>
>> I have reused the part of code from pktgen and I have found that I am missing 
>> some IDs that's why I have done one simple patch(below) in pktgen which is 
>> update IP ID field to find out if all packets are sent or not. As you suggest I 
>> am also missing some IDs here.
>> My question is if I can use any mechanism to ensure to sending all IDs?
>>
>> The next my question about packet fragments. Is it possible to setup IP 
>> fragments from higher level? I do it on low level as pktgen and I have change 
>> page address to memory which I need to send but it in under UDP.
>>
>> The point is to create packet with frags > 1 where the first fragment is IP/TCP 
>> header and the second fragments contains pointer to data which are prepared in 
>> the memory and will be copied directly by network driver. I am doing the same 
>> hacked code from pktgen. Is it possible to do it on higher level?
>>
> 
> sendfile() is mostly doing this.

will look, thanks.

> 
>> Thanks,
>> Michal
>>
>>
>> For 2.6.37.6
>> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
>> index 33bc382..3429eb3 100644
>> --- a/net/core/pktgen.c
>> +++ b/net/core/pktgen.c
>> @@ -3500,6 +3500,13 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>                  pkt_dev->last_pkt_size = pkt_dev->skb->len;
>>                  pkt_dev->allocated_skbs++;
>>                  pkt_dev->clone_count = 0;       /* reset counter */
>> +       } else {
>> +               struct iphdr *iph;
>> +               iph = ip_hdr(pkt_dev->skb);
>> +               iph->id = htons(pkt_dev->ip_id);
>> +               pkt_dev->ip_id++;
>> +               iph->check = 0;
>> +               iph->check = ip_fast_csum((void *)iph, iph->ihl);
>>          }
>>
>>          if (pkt_dev->delay && pkt_dev->last_ok)
>>
>>
>>
> 
> 
> Well, you cant do that in pktgen, since you're changing previous packet
> content (it might still be in device TX queue, not yet sent, or being
> sent right now)

It is likely happening.

> 
> Now, if all you want to do is send many packets from pktgen (with only
> ID changing), you could add a fast path to not rebuild from scratch new
> packets.

What do you mean?

Thanks,
Michal


-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian


* Re: Hight speed data sending from custom IP out of kernel
From: Eric Dumazet @ 2011-04-25 12:27 UTC
  To: monstr; +Cc: juice, netdev
In-Reply-To: <4DB5668B.8020808@monstr.eu>

Le lundi 25 avril 2011 à 14:18 +0200, Michal Simek a écrit :

> > 
> > Now, if all you want to do is send many packets from pktgen (with only
> > ID changing), you could add a fast path to not rebuild from scratch new
> > packets.
> 
> What do you mean?

If you know your device has X slots in its TX ring buffer, you would
have to maintain at least X+1 skbs in pktgen to make sure you only reuse an
skb after its previous logical content has been sent on the wire.

Then you are free to only change iph->id and iph->check very fast.
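
The "only change iph->id and iph->check" step can be sketched in userspace
as follows. This is a minimal illustration under assumptions, not pktgen
code: the header is modeled as ten 16-bit words with no IP options,
ip_header_checksum() and ip_set_id() are hypothetical names, and the full
RFC 1071 sum is recomputed rather than calling ip_fast_csum().

```c
#include <stdint.h>
#include <stddef.h>

/*
 * One's-complement checksum over an IPv4 header given as 16-bit words
 * (RFC 1071). The checksum word itself must be zero on entry. Running it
 * over a header that already carries a valid checksum yields 0.
 */
uint16_t ip_header_checksum(const uint16_t *words, size_t nwords)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum += words[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Rewrite only the ID (word 2) and the checksum (word 5) of a 20-byte header. */
void ip_set_id(uint16_t hdr[10], uint16_t id)
{
	hdr[2] = id;
	hdr[5] = 0;
	hdr[5] = ip_header_checksum(hdr, 10);
}
```

Everything else in the packet stays untouched, which is why this is cheap
compared with building a new skb. It is only safe on a buffer the device no
longer references.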


* Re: Hight speed data sending from custom IP out of kernel
From: Eric Dumazet @ 2011-04-25 12:30 UTC
  To: monstr; +Cc: juice, netdev
In-Reply-To: <1303734468.2747.120.camel@edumazet-laptop>

Le lundi 25 avril 2011 à 14:27 +0200, Eric Dumazet a écrit :
> Le lundi 25 avril 2011 à 14:18 +0200, Michal Simek a écrit :
> 
> > > 
> > > Now, if all you want to do is send many packets from pktgen (with only
> > > ID changing), you could add a fast path to not rebuild from scratch new
> > > packets.
> > 
> > What do you mean?
> 
> 
> If you know your device has X slots in its TX ring buffer, you would
> have to maintain at least X+1 skbs in pktgen to make sure you reuse an
> skb while its previous logical content was sent on wire.
> 
> Then you are free to only change iph->id and iph->check very fast.
> 

Checking skb->users would also be a good way to know if TX completion
released the skb reference. If your module owns the last reference, it can
recycle the skb.
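
A toy model of that recycling rule, combined with the X+1 pool from the
previous mail. Everything here is hypothetical and runs in userspace: struct
tx_buf stands in for an skb, the plain int users field stands in for the
skb reference count, and tx_complete() stands in for the driver dropping its
reference at TX completion.

```c
#include <stddef.h>

#define TX_RING_SLOTS 4                 /* hypothetical ring size X */
#define POOL_SIZE (TX_RING_SLOTS + 1)   /* X + 1 buffers */

struct tx_buf {
	int users;  /* 1: only the pool holds it, 2: the device holds a ref too */
};

struct tx_pool {
	struct tx_buf bufs[POOL_SIZE];
	size_t next;
};

void pool_init(struct tx_pool *p)
{
	for (size_t i = 0; i < POOL_SIZE; i++)
		p->bufs[i].users = 1;
	p->next = 0;
}

/*
 * Return the next buffer in round-robin order if the pool owns the last
 * reference, or NULL if the device has not completed its previous
 * transmission yet. A returned buffer counts as handed to the device.
 */
struct tx_buf *pool_get(struct tx_pool *p)
{
	struct tx_buf *b = &p->bufs[p->next];

	if (b->users != 1)
		return NULL;
	p->next = (p->next + 1) % POOL_SIZE;
	b->users++;
	return b;
}

/* Model of TX completion: the device drops its reference. */
void tx_complete(struct tx_buf *b)
{
	b->users--;
}
```

A caller that gets NULL back has to wait or fall back to allocating a fresh
packet, rather than overwrite a header that may still be on its way out.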






* Re: Hight speed data sending from custom IP out of kernel
From: Michal Simek @ 2011-04-25 12:48 UTC
  To: Eric Dumazet; +Cc: juice, netdev
In-Reply-To: <1303734625.2747.121.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le lundi 25 avril 2011 à 14:27 +0200, Eric Dumazet a écrit :
>> Le lundi 25 avril 2011 à 14:18 +0200, Michal Simek a écrit :
>>
>>>> Now, if all you want to do is send many packets from pktgen (with only
>>>> ID changing), you could add a fast path to not rebuild from scratch new
>>>> packets.
>>> What do you mean?
>>
>> If you know your device has X slots in its TX ring buffer, you would
>> have to maintain at least X+1 skbs in pktgen to make sure you reuse an
>> skb while its previous logical content was sent on wire.
>>
>> Then you are free to only change iph->id and iph->check very fast.
>>

Got it. The driver has an option to set the number of BDs; I need to use half
as many skbs, because with nr_frags=2 each skb uses two BDs.

> 
> Checking skb->users would also be a good way to know if TX completion
> released skb reference. If your module owns the last reference, it can
> do a recycle.

Ok. I see that dev_kfree_skb(consume_skb).

Thanks, will try,
Michal

-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian


* Re: Oops in 2.6.39 include/net/dst.h: dst_metrics_write_ptr() running l2tp over ipsec
From: Eric Dumazet @ 2011-04-25 12:49 UTC
  To: Held Bernhard; +Cc: linux-kernel, David S. Miller, netdev
In-Reply-To: <4DB54450.90806@gmx.de>

Le lundi 25 avril 2011 à 11:52 +0200, Held Bernhard a écrit :

> Your patch works flawlessly.
> 

Well, it's your patch :)

> Thanks for the quick response!

Thanks for testing.


* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Carsten Wolff @ 2011-04-25 12:59 UTC
  To: Dominik Kaspar; +Cc: netdev
In-Reply-To: <BANLkTimpgXCpweZKCihCQkLjSZw5zL4=Pg@mail.gmail.com>

Hi Dominik,

On Monday 25 April 2011, Dominik Kaspar wrote:
> Hello,
> 
> Knowing how critical packet reordering is for standard TCP, I am
> currently testing how robust Linux TCP is when packets are forwarded
> over multiple paths (with different bandwidth and RTT). Since Linux
> TCP adapts its "dupAck threshold" to an estimated level of packet
> reordering, I expect it to be much more robust than a standard TCP
> that strictly follows the RFCs. Indeed, as you can see in the
> following plot, my experiments show a step-wise adaptation of Linux
> TCP to heavy reordering. After many minutes, Linux TCP finally reaches
> a data throughput close to the perfect aggregated data rate of two
> paths (emulated with characteristics similar to IEEE 802.11b (WLAN)
> and a 3G link (HSPA)):
> 
> http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-00.png
> 
> Does anyone have clues what's going on here? Why does the aggregated
> throughput increase in steps? And what could be the reason it takes
> minutes to adapt to the full capacity, when in other cases, Linux TCP
> adapts much faster (for example if the bandwidth of both paths are
> equal). I would highly appreciate some advice from the netdev
> community.

The throughput increase in steps is most likely caused by Linux's reordering
detection and quantization. The DupThresh (tp->reordering) is only increased
when reordering is detected and is then set to a value that depends on the
current inflight/pipe. This means that on a path with only reordering and no
loss, where a very large DupThresh is best, you will see those steps in the
throughput every time Linux detects reordering while cwnd is large.
This, on the other hand, depends purely on timing/luck.
Linux is also only able to quantize reordering during the disorder state, which
leaves out many possible quantization samples, especially the larger ones,
which would increase DupThresh to higher values.
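
As a rough illustration of why this produces steps, here is a toy model of
the update rule. It is a simplification under assumptions, not the kernel
code: reordering_after_sample() is a hypothetical name, a sample is assumed
to be bounded by the number of packets in flight, and the cap of 127 is
assumed to match TCP_MAX_REORDERING.

```c
#define MAX_REORDERING 127u  /* assumed cap on the adaptive DupThresh */

/*
 * Return the new reordering estimate after one detection event.
 * distance: how far out of order the segment was, in packets.
 * inflight: packets in flight when the event was detected.
 * The estimate only ever moves up here, and only as far as the
 * current window allows.
 */
unsigned int reordering_after_sample(unsigned int current,
				     unsigned int distance,
				     unsigned int inflight)
{
	unsigned int metric = distance < inflight ? distance : inflight;

	if (metric <= current)
		return current;
	return metric < MAX_REORDERING ? metric : MAX_REORDERING;
}
```

In this model a true reordering distance of 40 packets detected with only 12
packets in flight raises the estimate to 12, not 40. The next step up has to
wait for a detection event that happens to coincide with a larger window.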

Also, reordering detection depends very much on TCP options. Which TCP Options 
were enabled in your test? Timestamps? D-SACK?

Carsten

> 
> Implementation details:
> This multipath TCP experiment ran between a sending machine with a
> single Ethernet interface (eth0) and a client with two Ethernet
> interfaces (eth1, eth2). The machines are connected through a switch
> and tc/netem is used to emulate the bandwidth and RTT of both paths.
> TCP connections are established using iperf between eth0 and eth1 (the
> primary path). At the sender, an iptables' NFQUEUE is used to "spoof"
> the destination IP address of outgoing packets and force some to
> travel to eth2 instead of eth1 (the secondary path). This multipath
> scheduling happens in proportion to the emulated bandwidths, so if the
> paths are set to 500 and 1000 KB/s, then packets are distributed in a
> 1:2 ratio. At the client, iptables' RAWDNAT is used to translate the
> spoofed IP addresses back to their original, so that all packets end
> up at eth1, although a portion actually travelled to eth2. ACKs are
> not scheduled over multiple paths, but always travel back on the
> primary path. TCP does not notice anything of the multipath
> forwarding, except the side-effect of packet reordering, which can be
> huge if the path RTTs are set very differently.
> 
> Best regards,
> Dominik


-- 
           /\-´-/\
          (  @ @  )
________o0O___^___O0o________


* Re: [PATCH] af_unix: Only allow recv on connected seqpacket sockets.
From: Eric W. Biederman @ 2011-04-25 14:26 UTC
  To: David Miller; +Cc: netdev, xemul, dan, stable
In-Reply-To: <20110424.120519.226767465.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sun, 24 Apr 2011 04:54:57 -0700
>
>> +static int unix_seqpacket_recvmsg(struct kiocb *iocb, struct socket *sock,
>> +			      struct msghdr *msg, size_t size,
>> +			      int flags)
>> +{
>> +	struct sock *sk = sock->sk;
>> +
>> +	if (sk->sk_state != TCP_ESTABLISHED)
>> +		return -ENOTCONN;
>
> As for unix_seqpacket_sendmsg(), you need to add a check for sock_error()
> or similar here otherwise -ECONNRESET is not reported correctly.
>
> In fact, recvmsg() is even harder than sendmsg() to handle correctly,
> because we have to also properly report EOF on seqpacket sockets which
> have RCV_SHUTDOWN set.
>
> So a lot more work has to go into this change to make it fix the bug
> without also breaking existing semantics.

Really?

When I read through the code I am failing to see the issues you are
seeing.

When the other socket in an established connection calls unix_shutdown
or unix_release_sock, sk->sk_shutdown is changed, but sk_state is
left at TCP_ESTABLISHED.  Therefore we do not need a special
case in unix_seqpacket_recvmsg to handle the RCV_SHUTDOWN case,
because in any case where that applies we will be in TCP_ESTABLISHED
and we will simply call unix_dgram_recvmsg.
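
The EOF behaviour can be observed from userspace. A minimal sketch, assuming
a Linux system with AF_UNIX SOCK_SEQPACKET support; the function name is
hypothetical, and MSG_DONTWAIT is used only so the call can never block.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a connected AF_UNIX SOCK_SEQPACKET pair, close one end, and
 * return what recv() reports on the other end. Returns -2 if the pair
 * could not be created at all.
 */
long recv_after_peer_close(void)
{
	int sv[2];
	char buf[16];
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
		return -2;
	close(sv[1]);  /* the peer goes away; our end stays connected */
	n = recv(sv[0], buf, sizeof(buf), MSG_DONTWAIT);
	close(sv[0]);
	return (long)n;
}
```

The surviving end should report end-of-file (a return value of 0) rather than
an error, which fits the argument that the connected-state check does not get
in the way of the shutdown case.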

As for ECONNRESET, when I take a look at the code it appears to be
another variant of the other side calling shutdown or close.  So if
it applies we should remain in TCP_ESTABLISHED, and
unix_seqpacket_recvmsg should not need to do anything.

So looking at this, the only times I can see that sk_state would
not be TCP_ESTABLISHED in a unix domain seqpacket socket are:
- On a listening socket, where calling recvmsg is what this
  patch is meant to address.
- Before we call connect or listen.
  Which appears to be equally broken today.  The only errors
  I can see happening in the case we are not connected today
  are blocking forever or returning -EINTR if we timeout.

Adding sock_error() handling into the new unix_seqpacket_recvmsg makes a
fair amount of sense but adding a new call to sock_error in that path
seems marginally more likely to change error codes and break existing
apps.  We already have a few other unconditional error codes before
we check sk_err in unix_dgram_recvmsg. 

> Anyways, see:
>
> commit 6e14891f4d16f8a9e0bc3a8408f73b3aed93ab0a
> Author: James Morris <jmorris@redhat.com>
> Date:   Fri Nov 19 07:02:41 2004 -0800
>
>     [AF_UNIX]: Don't lose ECONNRESET in unix_seqpacket_sendmsg()
>     
>     The fix for SELinux w/SOCK_SEQPACKET had an error,
>     noted by Alan Cox.  This fixes it.
>     
>     Signed-off-by: James Morris <jmorris@redhat.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>

Looking into it.  That patch appears to have been unnecessary.
We never transition out of the state TCP_ESTABLISHED once we get
there, and we can never get ECONNRESET unless we are connected.

Arguably we could reduce unix_seqpacket_sendmsg to simply 

static int unix_seqpacket_sendmsg(struct kiocb *kiocb, struct socket *sock,
				  struct msghdr *msg, size_t len)
{
	if (msg->msg_namelen)
		msg->msg_namelen = 0;
	return unix_dgram_sendmsg(kiocb, sock, msg, len);
}

But I think having the explicit TCP_ESTABLISHED check makes for better
maintainability of unix_dgram_sendmsg.

So having gone through all of that it looks like my patch needs a
comment saying that once we are in TCP_ESTABLISHED we cannot leave,
and that nothing can happen before we are TCP_ESTABLISHED.

We can use sock_error to check sk_err, as it seems good hygiene
but it also appears pointless.  Especially for recvmsg where ECONNRESET
never applies.

Eric

> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 16faa9d..8902c4a 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1513,13 +1513,18 @@ out_err:
>  static int unix_seqpacket_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  				  struct msghdr *msg, size_t len)
>  {
> +	int err;
>  	struct sock *sk = sock->sk;
>  	
> +	err = sock_error(sk);
> +	if (err)
> +		return err;
> +
>  	if (sk->sk_state != TCP_ESTABLISHED)
>  		return -ENOTCONN;
>  
> -	if (msg->msg_name || msg->msg_namelen)
> -		return -EINVAL;
> +	if (msg->msg_namelen)
> +		msg->msg_namelen = 0;
>  
>  	return unix_dgram_sendmsg(kiocb, sock, msg, len);
>  }


* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Dominik Kaspar @ 2011-04-25 14:35 UTC
  To: Eric Dumazet, Carsten Wolff; +Cc: netdev
In-Reply-To: <1303730701.2747.110.camel@edumazet-laptop>

Hi Eric and Carsten,

Thanks a lot for your quick replies. I don't have a tcpdump of this
experiment, but here is the tcp_probe log that the plot is based on
(I'll run a new test using tcpdump if you think that's more useful):

http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-00.log

I have also noticed what Carsten mentions, the tcp_reordering value is
essential for this whole behavior. When I start an experiment and
increase sysctl.net.ipv4.tcp_reordering during the running connection,
the TCP throughput immediately jumps close to the aggregate of both
paths. Without intervention, as in this experiment, tcp_reordering
starts out as 3 and then makes small oscillations between 3 and 12 for
more than 2 minutes. At about second 141, TCP somehow finds a new
highest reordering value (23) and at the same time, the throughput
jumps up "to the next level". The value of 23 is then used all the way
until second 603, when the reordering value becomes 32 and the
throughput again jumps up a level.

I understand that tp->reordering is increased when reordering is
detected, but what causes tp->reordering to sometimes be decreased
back to 3? Also, why does a decrease back to 3 not make the whole
procedure start all over again? For example, at second 1013.64,
tp->reordering falls from 127 down to 3. A second later (1014.93) it
then suddenly increases from 3 up to 32 without considering any
numbers in between. Why is it now suddenly so fast? At the very
beginning, it took 600 seconds to grow from 3 to 32 and afterward it
just takes a second...?

For the experiments, all default TCP options were used, meaning that
SACK, DSACK, and Timestamps were all enabled. I am not sure how to turn TSO
on or off, so that is probably enabled, too. Path emulation is done with
tc/netem at the receiver interfaces (eth1, eth2) with this script:

http://home.simula.no/~kaspar/static/netem.sh

Greetings,
Dominik

On Mon, Apr 25, 2011 at 1:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 25 avril 2011 à 12:37 +0200, Dominik Kaspar a écrit :
>> Hello,
>>
>> Knowing how critical packet reordering is for standard TCP, I am
>> currently testing how robust Linux TCP is when packets are forwarded
>> over multiple paths (with different bandwidth and RTT). Since Linux
>> TCP adapts its "dupAck threshold" to an estimated level of packet
>> reordering, I expect it to be much more robust than a standard TCP
>> that strictly follows the RFCs. Indeed, as you can see in the
>> following plot, my experiments show a step-wise adaptation of Linux
>> TCP to heavy reordering. After many minutes, Linux TCP finally reaches
>> a data throughput close to the perfect aggregated data rate of two
>> paths (emulated with characteristics similar to IEEE 802.11b (WLAN)
>> and a 3G link (HSPA)):
>>
>> http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-00.png
>>
>> Does anyone have clues what's going on here? Why does the aggregated
>> throughput increase in steps? And what could be the reason it takes
>> minutes to adapt to the full capacity, when in other cases, Linux TCP
>> adapts much faster (for example if the bandwidth of both paths are
>> equal). I would highly appreciate some advice from the netdev
>> community.
>>
>> Implementation details:
>> This multipath TCP experiment ran between a sending machine with a
>> single Ethernet interface (eth0) and a client with two Ethernet
>> interfaces (eth1, eth2). The machines are connected through a switch
>> and tc/netem is used to emulate the bandwidth and RTT of both paths.
>> TCP connections are established using iperf between eth0 and eth1 (the
>> primary path). At the sender, an iptables' NFQUEUE is used to "spoof"
>> the destination IP address of outgoing packets and force some to
>> travel to eth2 instead of eth1 (the secondary path). This multipath
>> scheduling happens in proportion to the emulated bandwidths, so if the
>> paths are set to 500 and 1000 KB/s, then packets are distributed in a
>> 1:2 ratio. At the client, iptables' RAWDNAT is used to translate the
>> spoofed IP addresses back to their original, so that all packets end
>> up at eth1, although a portion actually travelled to eth2. ACKs are
>> not scheduled over multiple paths, but always travel back on the
>> primary path. TCP does not notice anything of the multipath
>> forwarding, except the side-effect of packet reordering, which can be
>> huge if the path RTTs are set very differently.
>>
>
> Hi Dominik
>
> Implementation details of the tc/netem stages are important to fully
> understand how the TCP stack can react.
>
> Is TSO active on the sender side, for example?
>
> Your results show that only some exceptional events make bandwidth
> really change.
>
> A tcpdump/pcap of the first ~10,000 packets would be nice to provide (not on
> the mailing list, but on your web site).

^ permalink raw reply

* [PATCH net-next-2.6 v5 0/5] sctp: Patch series
From: Michio Honda @ 2011-04-25 15:22 UTC (permalink / raw)
  To: netdev; +Cc: lksctp-developers

A series of 5 patches adding auto_asconf support and the related functionality that auto_asconf relies on.

Cheers,
- Michio

[1/5] Add Auto-ASCONF support
[2/5] Add sysctl support for Auto-ASCONF
[3/5] Add socket option operation for Auto-ASCONF
[4/5] Add ADD/DEL ASCONF handling at the receiver
[5/5] Add ASCONF operation on the single-homed host

^ permalink raw reply

* [PATCH net-next-2.6 v5 1/5] sctp: Add Auto-ASCONF support
From: Michio Honda @ 2011-04-25 15:23 UTC (permalink / raw)
  To: netdev; +Cc: lksctp-developers

SCTP reconfigures the IP addresses in an association by using ASCONF chunks, as described in RFC 5061.
For example, a newly configured IP address can be put to use in an existing association.
An ASCONF operation is invoked in one of two ways:
First, by the application calling sctp_bindx().
Second, automatically by the SCTP stack in response to address events on the host (called auto_asconf).
The former is already implemented; this patch implements the latter.
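
The queueing rule that sctp_addr_wq_mgmt() applies to address events can be illustrated with a small user-space sketch. This is not the kernel code: the names addr_wq, addr_event, addr_wq_post() and WQ_MAX are invented here, a fixed-size array stands in for the kernel list, and addresses are reduced to a plain 32-bit IPv4 value.

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for SCTP_ADDR_NEW / SCTP_ADDR_DEL (illustration only). */
enum addr_cmd { ADDR_NEW, ADDR_DEL };

#define WQ_MAX 16	/* hypothetical capacity of the sketch queue */

struct addr_event {
	uint32_t addr;		/* IPv4 address, host byte order */
	enum addr_cmd cmd;
};

struct addr_wq {
	struct addr_event ev[WQ_MAX];
	int len;
};

/* Post an address event, following the rule in the patch:
 *  - same address already queued with the opposite command:
 *    the two events offset each other, so drop the queued one;
 *  - same address already queued with the same command: do nothing;
 *  - otherwise append the new event.
 * Returns the queue length afterwards, or -1 if the queue is full. */
static int addr_wq_post(struct addr_wq *wq, uint32_t addr, enum addr_cmd cmd)
{
	int i;

	for (i = 0; i < wq->len; i++) {
		if (wq->ev[i].addr != addr)
			continue;
		if (wq->ev[i].cmd != cmd) {
			/* Remove entry i by shifting the tail down. */
			memmove(&wq->ev[i], &wq->ev[i + 1],
				(size_t)(wq->len - i - 1) * sizeof(wq->ev[0]));
			wq->len--;
		}
		return wq->len;
	}

	if (wq->len == WQ_MAX)
		return -1;
	wq->ev[wq->len].addr = addr;
	wq->ev[wq->len].cmd = cmd;
	return ++wq->len;
}
```

With this rule, an address that is added and removed again before the timer fires leaves no event behind, which is the DHCP case the patch comment describes.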

Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
---
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 505845d..8678cbd 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -121,6 +121,7 @@ extern int sctp_copy_local_addr_list(struct sctp_bind_addr *,
 				     int flags);
 extern struct sctp_pf *sctp_get_pf_specific(sa_family_t family);
 extern int sctp_register_pf(struct sctp_pf *, sa_family_t);
+void sctp_addr_wq_mgmt(struct sctp_sockaddr_entry *, int);
 
 /*
  * sctp/socket.c
@@ -135,6 +136,7 @@ void sctp_sock_rfree(struct sk_buff *skb);
 void sctp_copy_sock(struct sock *newsk, struct sock *sk,
 		    struct sctp_association *asoc);
 extern struct percpu_counter sctp_sockets_allocated;
+int sctp_asconf_mgmt(struct sctp_sock *, struct sctp_sockaddr_entry *);
 
 /*
  * sctp/primitive.c
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index cc9185c..e8adbda 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -205,6 +205,11 @@ extern struct sctp_globals {
 	 * It is a list of sctp_sockaddr_entry.
 	 */
 	struct list_head local_addr_list;
+	int default_auto_asconf;
+	struct list_head addr_waitq;
+	struct timer_list addr_wq_timer;
+	struct list_head auto_asconf_splist;
+	spinlock_t addr_wq_lock;
 
 	/* Lock that protects the local_addr_list writers */
 	spinlock_t addr_list_lock;
@@ -264,6 +269,11 @@ extern struct sctp_globals {
 #define sctp_port_hashtable		(sctp_globals.port_hashtable)
 #define sctp_local_addr_list		(sctp_globals.local_addr_list)
 #define sctp_local_addr_lock		(sctp_globals.addr_list_lock)
+#define sctp_auto_asconf_splist		(sctp_globals.auto_asconf_splist)
+#define sctp_addr_waitq			(sctp_globals.addr_waitq)
+#define sctp_addr_wq_timer		(sctp_globals.addr_wq_timer)
+#define sctp_addr_wq_lock		(sctp_globals.addr_wq_lock)
+#define sctp_default_auto_asconf	(sctp_globals.default_auto_asconf)
 #define sctp_scope_policy		(sctp_globals.ipv4_scope_policy)
 #define sctp_addip_enable		(sctp_globals.addip_enable)
 #define sctp_addip_noauth		(sctp_globals.addip_noauth_enable)
@@ -341,6 +351,8 @@ struct sctp_sock {
 	atomic_t pd_mode;
 	/* Receive to here while partial delivery is in effect. */
 	struct sk_buff_head pd_lobby;
+	struct list_head auto_asconf_list;
+	int do_auto_asconf;
 };
 
 static inline struct sctp_sock *sctp_sk(const struct sock *sk)
@@ -796,6 +808,8 @@ struct sctp_sockaddr_entry {
 	__u8 valid;
 };
 
+#define SCTP_ADDRESS_TICK_DELAY	500
+
 typedef struct sctp_chunk *(sctp_packet_phandler_t)(struct sctp_association *);
 
 /* This structure holds lists of chunks as we are assembling for
@@ -1239,6 +1253,7 @@ sctp_scope_t sctp_scope(const union sctp_addr *);
 int sctp_in_scope(const union sctp_addr *addr, const sctp_scope_t scope);
 int sctp_is_any(struct sock *sk, const union sctp_addr *addr);
 int sctp_addr_is_valid(const union sctp_addr *addr);
+int sctp_is_ep_boundall(struct sock *sk);
 
 
 /* What type of endpoint?  */
diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..869267b 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -536,6 +536,21 @@ int sctp_in_scope(const union sctp_addr *addr, sctp_scope_t scope)
 	return 0;
 }
 
+int sctp_is_ep_boundall(struct sock *sk)
+{
+	struct sctp_bind_addr *bp;
+	struct sctp_sockaddr_entry *addr;
+
+	bp = &sctp_sk(sk)->ep->base.bind_addr;
+	if (sctp_list_single_entry(&bp->address_list)) {
+		addr = list_entry(bp->address_list.next,
+				  struct sctp_sockaddr_entry, list);
+		if (sctp_is_any(sk, &addr->a))
+			return 1;
+	}
+	return 0;
+}
+
 /********************************************************************
  * 3rd Level Abstractions
  ********************************************************************/
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 865ce7b..3a74d42 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -105,6 +105,7 @@ static int sctp_inet6addr_event(struct notifier_block *this, unsigned long ev,
 			addr->valid = 1;
 			spin_lock_bh(&sctp_local_addr_lock);
 			list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
+			sctp_addr_wq_mgmt(addr, SCTP_ADDR_NEW);
 			spin_unlock_bh(&sctp_local_addr_lock);
 		}
 		break;
@@ -115,6 +116,7 @@ static int sctp_inet6addr_event(struct notifier_block *this, unsigned long ev,
 			if (addr->a.sa.sa_family == AF_INET6 &&
 					ipv6_addr_equal(&addr->a.v6.sin6_addr,
 						&ifa->addr)) {
+				sctp_addr_wq_mgmt(addr, SCTP_ADDR_DEL);
 				found = 1;
 				addr->valid = 0;
 				list_del_rcu(&addr->list);
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 152976e..f9b0f9b 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -636,6 +636,160 @@ static void sctp_v4_ecn_capable(struct sock *sk)
 	INET_ECN_xmit(sk);
 }
 
+void sctp_addr_wq_timeout_handler(unsigned long arg)
+{
+	struct sctp_sockaddr_entry *addrw = NULL;
+	union sctp_addr *addr = NULL;
+	struct sctp_sock *sp = NULL;
+
+	spin_lock_bh(&sctp_addr_wq_lock);
+retry_wq:
+	if (list_empty(&sctp_addr_waitq)) {
+		SCTP_DEBUG_PRINTK("sctp_addrwq_timo_handler: nothing in addr waitq\n");
+		spin_unlock_bh(&sctp_addr_wq_lock);
+		return;
+	}
+	addrw = list_first_entry(&sctp_addr_waitq, struct sctp_sockaddr_entry,
+			list);
+	addr = &addrw->a;
+	SCTP_DEBUG_PRINTK_IPADDR("sctp_addrwq_timo_handler: the first ent in wq %p is ",
+	    " for cmd %d at entry %p\n", &sctp_addr_waitq, addr, addrw->state,
+	    addrw);
+
+	/* Now we send an ASCONF for each association */
+	/* Note: we currently don't handle link-local IPv6 addresses */
+	if (addr->sa.sa_family == AF_INET6) {
+		struct in6_addr *in6 = (struct in6_addr *)&addr->v6.sin6_addr;
+
+		if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) {
+			SCTP_DEBUG_PRINTK("sctp_timo_handler: link local, hence don't tell sockets\n");
+			list_del(&addrw->list);
+			kfree(addrw);
+			goto retry_wq;
+		}
+		if (ipv6_chk_addr(&init_net, in6, NULL, 0) == 0 &&
+		    addrw->state == SCTP_ADDR_NEW) {
+			unsigned long timeo_val;
+
+			SCTP_DEBUG_PRINTK("sctp_timo_handler: this is on DAD, trying %d msec later\n",
+			    SCTP_ADDRESS_TICK_DELAY);
+			timeo_val = jiffies;
+			timeo_val += msecs_to_jiffies(SCTP_ADDRESS_TICK_DELAY);
+			mod_timer(&sctp_addr_wq_timer, timeo_val);
+			spin_unlock_bh(&sctp_addr_wq_lock);
+			return;
+		}
+	}
+	list_for_each_entry(sp, &sctp_auto_asconf_splist, auto_asconf_list) {
+		struct sock *sk;
+
+		sk = sctp_opt2sk(sp);
+		/* ignore bound-specific endpoints */
+		if (!sctp_is_ep_boundall(sk))
+			continue;
+		sctp_bh_lock_sock(sk);
+		if (sctp_asconf_mgmt(sp, addrw) < 0) {
+			SCTP_DEBUG_PRINTK("sctp_addrwq_timo_handler: sctp_asconf_mgmt failed\n");
+			sctp_bh_unlock_sock(sk);
+			continue;
+		}
+		sctp_bh_unlock_sock(sk);
+	}
+
+	list_del(&addrw->list);
+	kfree(addrw);
+
+	if (!list_empty(&sctp_addr_waitq))
+		goto retry_wq;
+
+	spin_unlock_bh(&sctp_addr_wq_lock);
+}
+
+static void sctp_free_addr_wq(void)
+{
+	struct sctp_sockaddr_entry *addrw = NULL;
+	struct sctp_sockaddr_entry *temp = NULL;
+
+	spin_lock_bh(&sctp_addr_wq_lock);
+	del_timer(&sctp_addr_wq_timer);
+	list_for_each_entry_safe(addrw, temp, &sctp_addr_waitq, list) {
+		list_del(&addrw->list);
+		kfree(addrw);
+	}
+	spin_unlock_bh(&sctp_addr_wq_lock);
+}
+
+/* lookup the entry for the same address in the addr_waitq
+ * sctp_addr_wq MUST be locked
+ */
+static struct sctp_sockaddr_entry *sctp_addr_wq_lookup(struct sctp_sockaddr_entry *addr)
+{
+	struct sctp_sockaddr_entry *addrw;
+
+	list_for_each_entry(addrw, &sctp_addr_waitq, list) {
+		if (addrw->a.sa.sa_family != addr->a.sa.sa_family)
+			continue;
+		if (addrw->a.sa.sa_family == AF_INET) {
+			if (addrw->a.v4.sin_addr.s_addr ==
+			    addr->a.v4.sin_addr.s_addr)
+				return addrw;
+		} else if (addrw->a.sa.sa_family == AF_INET6) {
+			if (ipv6_addr_equal(&addrw->a.v6.sin6_addr,
+			    &addr->a.v6.sin6_addr))
+				return addrw;
+		}
+	}
+	return NULL;
+}
+
+void sctp_addr_wq_mgmt(struct sctp_sockaddr_entry *addr, int cmd)
+{
+	struct sctp_sockaddr_entry *addrw = NULL;
+	unsigned long timeo_val;
+	union sctp_addr *tmpaddr;
+
+	/* First, check whether an opposite message already exists in the
+	 * queue.  If so, it is removed.  This looks a bit crude, but a DHCP
+	 * client may add and delete an address a couple of times before the
+	 * new address is finally attached.
+	 */
+
+	spin_lock_bh(&sctp_addr_wq_lock);
+	/* Offsets existing events in addr_wq */
+	addrw = sctp_addr_wq_lookup(addr);
+	if (addrw) {
+		tmpaddr = &addrw->a;
+		if (addrw->state != cmd) {
+			SCTP_DEBUG_PRINTK_IPADDR("sctp_addr_wq_mgmt offsets existing entry for %d ",
+			    " in wq %p\n", addrw->state, tmpaddr,
+			    &sctp_addr_waitq);
+			list_del(&addrw->list);
+			kfree(addrw);
+		}
+		spin_unlock_bh(&sctp_addr_wq_lock);
+		return;
+	}
+
+	/* OK, we have to add the new address to the wait queue */
+	addrw = kmemdup(addr, sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC);
+	if (addrw == NULL) {
+		spin_unlock_bh(&sctp_addr_wq_lock);
+		return;
+	}
+	addrw->state = cmd;
+	list_add_tail(&addrw->list, &sctp_addr_waitq);
+	tmpaddr = &addrw->a;
+	SCTP_DEBUG_PRINTK_IPADDR("sctp_addr_wq_mgmt add new entry for cmd:%d ",
+	    " in wq %p\n", addrw->state, tmpaddr, &sctp_addr_waitq);
+
+	if (!timer_pending(&sctp_addr_wq_timer)) {
+		timeo_val = jiffies;
+		timeo_val += msecs_to_jiffies(SCTP_ADDRESS_TICK_DELAY);
+		mod_timer(&sctp_addr_wq_timer, timeo_val);
+	}
+	spin_unlock_bh(&sctp_addr_wq_lock);
+}
+
 /* Event handler for inet address addition/deletion events.
  * The sctp_local_addr_list needs to be protocted by a spin lock since
  * multiple notifiers (say IPv4 and IPv6) may be running at the same
@@ -663,6 +817,7 @@ static int sctp_inetaddr_event(struct notifier_block *this, unsigned long ev,
 			addr->valid = 1;
 			spin_lock_bh(&sctp_local_addr_lock);
 			list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
+			sctp_addr_wq_mgmt(addr, SCTP_ADDR_NEW);
 			spin_unlock_bh(&sctp_local_addr_lock);
 		}
 		break;
@@ -673,6 +828,7 @@ static int sctp_inetaddr_event(struct notifier_block *this, unsigned long ev,
 			if (addr->a.sa.sa_family == AF_INET &&
 					addr->a.v4.sin_addr.s_addr ==
 					ifa->ifa_local) {
+				sctp_addr_wq_mgmt(addr, SCTP_ADDR_DEL);
 				found = 1;
 				addr->valid = 0;
 				list_del_rcu(&addr->list);
@@ -1256,6 +1412,7 @@ SCTP_STATIC __init int sctp_init(void)
 	/* Disable ADDIP by default. */
 	sctp_addip_enable = 0;
 	sctp_addip_noauth = 0;
+	sctp_default_auto_asconf = 0;
 
 	/* Enable PR-SCTP by default. */
 	sctp_prsctp_enable = 1;
@@ -1280,6 +1437,13 @@ SCTP_STATIC __init int sctp_init(void)
 	spin_lock_init(&sctp_local_addr_lock);
 	sctp_get_local_addr_list();
 
+	/* Initialize the address event list */
+	INIT_LIST_HEAD(&sctp_addr_waitq);
+	INIT_LIST_HEAD(&sctp_auto_asconf_splist);
+	spin_lock_init(&sctp_addr_wq_lock);
+	sctp_addr_wq_timer.expires = 0;
+	setup_timer(&sctp_addr_wq_timer, sctp_addr_wq_timeout_handler, 0);
+
 	status = sctp_v4_protosw_init();
 
 	if (status)
@@ -1351,6 +1515,7 @@ SCTP_STATIC __exit void sctp_exit(void)
 	/* Unregister with inet6/inet layers. */
 	sctp_v6_del_protocol();
 	sctp_v4_del_protocol();
+	sctp_free_addr_wq();
 
 	/* Free the control endpoint.  */
 	inet_ctl_sock_destroy(sctp_ctl_sock);
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 3951a10..c433e97 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -807,6 +807,37 @@ out:
 	return retval;
 }
 
+/* set addr events to assocs in the endpoint.  ep and addr_wq must be locked */
+int
+sctp_asconf_mgmt(struct sctp_sock *sp, struct sctp_sockaddr_entry *addrw)
+{
+	struct sock *sk = sctp_opt2sk(sp);
+	union sctp_addr *addr = NULL;
+	int cmd;
+	int error = 0;
+
+	addr = &addrw->a;
+	addr->v4.sin_port = htons(sp->ep->base.bind_addr.port);
+	cmd = addrw->state;
+
+	SCTP_DEBUG_PRINTK("sctp_asconf_mgmt sp:%p\n", sp);
+	if (cmd == SCTP_ADDR_NEW) {
+		error = sctp_send_asconf_add_ip(sk, (struct sockaddr *)addr, 1);
+		if (error) {
+			SCTP_DEBUG_PRINTK("asconf_mgmt: send_asconf_add_ip returns %d\n", error);
+			return error;
+		}
+	} else if (cmd == SCTP_ADDR_DEL) {
+		error = sctp_send_asconf_del_ip(sk, (struct sockaddr *)addr, 1);
+		if (error) {
+			SCTP_DEBUG_PRINTK("asconf_mgmt: send_asconf_del_ip returns %d\n", error);
+			return error;
+		}
+	}
+
+	return 0;
+}
+
 /* Helper for tunneling sctp_bindx() requests through sctp_setsockopt()
  *
  * API 8.1
@@ -3770,6 +3801,13 @@ SCTP_STATIC int sctp_init_sock(struct sock *sk)
 	local_bh_disable();
 	percpu_counter_inc(&sctp_sockets_allocated);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+	if (sctp_default_auto_asconf) {
+		list_add_tail(&sp->auto_asconf_list,
+		    &sctp_auto_asconf_splist);
+		sp->do_auto_asconf = 1;
+	} else
+		sp->do_auto_asconf = 0;
+	SCTP_DEBUG_PRINTK("sctp_init_sk sk:%p ep:%p\n", sk, ep);
 	local_bh_enable();
 
 	return 0;
@@ -3779,11 +3817,17 @@ SCTP_STATIC int sctp_init_sock(struct sock *sk)
 SCTP_STATIC void sctp_destroy_sock(struct sock *sk)
 {
 	struct sctp_endpoint *ep;
+	struct sctp_sock *sp;
 
 	SCTP_DEBUG_PRINTK("sctp_destroy_sock(sk: %p)\n", sk);
 
 	/* Release our hold on the endpoint. */
-	ep = sctp_sk(sk)->ep;
+	sp = sctp_sk(sk);
+	ep = sp->ep;
+	if (sp->do_auto_asconf) {
+		sp->do_auto_asconf = 0;
+		list_del(&sp->auto_asconf_list);
+	}
 	sctp_endpoint_free(ep);
 	local_bh_disable();
 	percpu_counter_dec(&sctp_sockets_allocated);


^ permalink raw reply related

* [PATCH net-next-2.6 v5 2/5] sctp: Add sysctl support for Auto-ASCONF
From: Michio Honda @ 2011-04-25 15:23 UTC (permalink / raw)
  To: netdev; +Cc: lksctp-developers

This patch allows the system administrator to change the default Auto-ASCONF on/off behavior via a sysctl value.

Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
---
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index 50cb57f..6b39529 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -183,6 +183,13 @@ static ctl_table sctp_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "default_auto_asconf",
+		.data		= &sctp_default_auto_asconf,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "prsctp_enable",
 		.data		= &sctp_prsctp_enable,
 		.maxlen		= sizeof(int),



^ permalink raw reply related

* [PATCH net-next-2.6 v5 3/5] sctp: Add socket option operation for Auto-ASCONF
From: Michio Honda @ 2011-04-25 15:23 UTC (permalink / raw)
  To: netdev; +Cc: lksctp-developers

This patch allows the application to switch Auto-ASCONF on or off via setsockopt() and to query it via getsockopt().
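
From user space, the option could be used roughly as sketched below. The wrapper names set_auto_asconf() and get_auto_asconf() are made up for this example. The fallback value 29 is the number proposed in this patch and is only an assumption; released headers may assign a different number, so a real program should take SCTP_AUTO_ASCONF from the installed SCTP header.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SCTP_AUTO_ASCONF
/* Assumption: value proposed in this patch; may differ in released headers. */
#define SCTP_AUTO_ASCONF 29
#endif

/* Turn Auto-ASCONF on (non-zero) or off (zero) for an SCTP socket.
 * Returns 0 on success, -1 with errno set on failure.
 * (Hypothetical wrapper, for illustration only.) */
static int set_auto_asconf(int sd, int on)
{
	return setsockopt(sd, IPPROTO_SCTP, SCTP_AUTO_ASCONF,
			  &on, sizeof(on));
}

/* Read the current Auto-ASCONF setting into *on.
 * Returns 0 on success, -1 with errno set on failure. */
static int get_auto_asconf(int sd, int *on)
{
	socklen_t len = sizeof(*on);

	return getsockopt(sd, IPPROTO_SCTP, SCTP_AUTO_ASCONF, on, &len);
}
```

Per the code in this patch, enabling the option on a socket that is not bound to all addresses fails with EINVAL, and getsockopt() reports 1 only when the flag is set and the endpoint is bound to all addresses.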

Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
---
diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
index e73ebda..36bf64b 100644
--- a/include/net/sctp/user.h
+++ b/include/net/sctp/user.h
@@ -91,6 +91,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_PEER_AUTH_CHUNKS	26	/* Read only */
 #define SCTP_LOCAL_AUTH_CHUNKS	27	/* Read only */
 #define SCTP_GET_ASSOC_NUMBER	28	/* Read only */
+#define SCTP_AUTO_ASCONF       29
 
 /* Internal Socket Options. Some of the sctp library functions are
  * implemented using these socket options.
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 3951a10..c9be08a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3341,6 +3341,46 @@ static int sctp_setsockopt_del_key(struct sock *sk,
 
 }
 
+/*
+ * 8.1.23 SCTP_AUTO_ASCONF
+ *
+ * This option will enable or disable the use of the automatic generation of
+ * ASCONF chunks to add and delete addresses to an existing association.  Note
+ * that this option has two caveats namely: a) it only affects sockets that
+ * are bound to all addresses available to the SCTP stack, and b) the system
+ * administrator may have an overriding control that turns the ASCONF feature
+ * off no matter what setting the socket option may have.
+ * This option expects an integer boolean flag, where a non-zero value turns on
+ * the option, and a zero value turns off the option.
+ * Note: as in the FreeBSD implementation, the socket option in this
+ * implementation overrides the default set via sysctl.
+ */
+static int sctp_setsockopt_auto_asconf(struct sock *sk, char __user *optval,
+					unsigned int optlen)
+{
+	int val;
+	struct sctp_sock *sp = sctp_sk(sk);
+
+	if (optlen < sizeof(int))
+		return -EINVAL;
+	if (get_user(val, (int __user *)optval))
+		return -EFAULT;
+	if (!sctp_is_ep_boundall(sk) && val)
+		return -EINVAL;
+	if ((val && sp->do_auto_asconf) || (!val && !sp->do_auto_asconf))
+		return 0;
+
+	if (val == 0 && sp->do_auto_asconf) {
+		list_del(&sp->auto_asconf_list);
+		sp->do_auto_asconf = 0;
+	} else if (val && !sp->do_auto_asconf) {
+		list_add_tail(&sp->auto_asconf_list,
+		    &sctp_auto_asconf_splist);
+		sp->do_auto_asconf = 1;
+	}
+	return 0;
+}
+
 
 /* API 6.2 setsockopt(), getsockopt()
  *
@@ -3488,6 +3528,9 @@ SCTP_STATIC int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_AUTH_DELETE_KEY:
 		retval = sctp_setsockopt_del_key(sk, optval, optlen);
 		break;
+	case SCTP_AUTO_ASCONF:
+		retval = sctp_setsockopt_auto_asconf(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -5283,6 +5326,28 @@ static int sctp_getsockopt_assoc_number(struct sock *sk, int len,
 	return 0;
 }
 
+/*
+ * 8.1.23 SCTP_AUTO_ASCONF
+ * See the corresponding setsockopt entry for a description.
+ */
+static int sctp_getsockopt_auto_asconf(struct sock *sk, int len,
+				   char __user *optval, int __user *optlen)
+{
+	int val = 0;
+
+	if (len < sizeof(int))
+		return -EINVAL;
+
+	len = sizeof(int);
+	if (sctp_sk(sk)->do_auto_asconf && sctp_is_ep_boundall(sk))
+		val = 1;
+	if (put_user(len, optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, &val, len))
+		return -EFAULT;
+	return 0;
+}
+
 SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 				char __user *optval, int __user *optlen)
 {
@@ -5415,6 +5480,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 	case SCTP_GET_ASSOC_NUMBER:
 		retval = sctp_getsockopt_assoc_number(sk, len, optval, optlen);
 		break;
+	case SCTP_AUTO_ASCONF:
+		retval = sctp_getsockopt_auto_asconf(sk, len, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;


^ permalink raw reply related

* [PATCH net-next-2.6 v5 4/5] sctp: Add ADD/DEL ASCONF handling at the receiver
From: Michio Honda @ 2011-04-25 15:23 UTC (permalink / raw)
  To: netdev; +Cc: lksctp-developers

This patch fixes a problem in the original code: it cannot delete the remote address that the corresponding transport currently points to, even when the ASCONF is sent from a different address (this happens when a single-homed sender transmits an ASCONF containing both ADD and DEL).

Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
---
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index de98665..a9f25d7 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2990,7 +2990,7 @@ static __be16 sctp_process_asconf_param(struct sctp_association *asoc,
 		 * an Error Cause TLV set to the new error code 'Request to
 		 * Delete Source IP Address'
 		 */
-		if (sctp_cmp_addr_exact(sctp_source(asconf), &addr))
+		if (sctp_cmp_addr_exact(&asconf->source, &addr))
 			return SCTP_ERROR_DEL_SRC_IP;
 
 		/* Section 4.2.2


^ permalink raw reply related

* [PATCH net-next-2.6 v5 5/5] sctp: Add ASCONF operation on the single-homed host
From: Michio Honda @ 2011-04-25 15:24 UTC (permalink / raw)
  To: netdev; +Cc: lksctp-developers

SCTP can change the IP address of a single-homed host.
In this case, the SCTP association transmits an ASCONF packet that adds the new IP address and deletes the old one.  This patch implements this functionality.
The ASCONF chunk is added to the beginning of the queue, because no other chunks can be transmitted in this state.
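
The two length constants introduced by this patch follow from the sizes noted in its comments: an 8-byte ASCONF parameter header (type, length, correlation ID) plus an 8-byte IPv4 or 20-byte IPv6 address parameter. A small sketch of that arithmetic follows. The names are invented for illustration, and only the parameter bytes are counted, not the ASCONF chunk header.

```c
#include <stddef.h>

/* Sizes taken from the comments in the patch. */
enum {
	ADDIP_HDR_LEN     = 8,	/* type, length, correlation ID */
	V4_ADDR_PARAM_LEN = 8,	/* IPv4 address parameter */
	V6_ADDR_PARAM_LEN = 20	/* IPv6 address parameter */
};

/* Length of one ADD or DEL parameter, including its address. */
static size_t asconf_param_len(int is_v6)
{
	return ADDIP_HDR_LEN +
	       (is_v6 ? V6_ADDR_PARAM_LEN : V4_ADDR_PARAM_LEN);
}

/* Parameter bytes for an ASCONF that adds one address and deletes
 * another, as a single-homed host does when its address changes.
 * (Hypothetical helper; the patch itself reuses the ADD parameter's
 * length for the DEL parameter of the same scope.) */
static size_t asconf_add_del_len(int add_is_v6, int del_is_v6)
{
	return asconf_param_len(add_is_v6) + asconf_param_len(del_is_v6);
}
```

This gives 16 bytes per IPv4 parameter and 28 per IPv6 parameter, matching SCTP_ASCONF_V4_PARAM_LEN and SCTP_ASCONF_V6_PARAM_LEN.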

Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
---
diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index c70d8cc..d7a4ee3 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -441,4 +441,8 @@ enum {
  */
 #define SCTP_AUTH_RANDOM_LENGTH 32
 
+/* ASCONF PARAMETERS */
+#define SCTP_ASCONF_V4_PARAM_LEN 16
+#define SCTP_ASCONF_V6_PARAM_LEN 28
+
 #endif /* __sctp_constants_h__ */
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index cc9185c..db4e9d0 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1901,6 +1901,8 @@ struct sctp_association {
 	 * after reaching 4294967295.
 	 */
 	__u32 addip_serial;
+	union sctp_addr *asconf_addr_del_pending;
+	int src_out_of_asoc_ok;
 
 	/* SCTP AUTH: list of the endpoint shared keys.  These
 	 * keys are provided out of band by the user applicaton
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 6b04287..2082d0a 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -279,6 +279,8 @@ static struct sctp_association *sctp_association_init(struct sctp_association *a
 	asoc->peer.asconf_capable = 0;
 	if (sctp_addip_noauth)
 		asoc->peer.asconf_capable = 1;
+	asoc->asconf_addr_del_pending = NULL;
+	asoc->src_out_of_asoc_ok = 0;
 
 	/* Create an input queue.  */
 	sctp_inq_init(&asoc->base.inqueue);
@@ -443,6 +445,10 @@ void sctp_association_free(struct sctp_association *asoc)
 
 	asoc->peer.transport_count = 0;
 
+	/* Free pending address space being deleted */
+	if (asoc->asconf_addr_del_pending != NULL)
+		kfree(asoc->asconf_addr_del_pending);
+
 	/* Free any cached ASCONF_ACK chunk. */
 	sctp_assoc_free_asconf_acks(asoc);
 
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 865ce7b..56c97ce 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -332,6 +332,13 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
 				matchlen = bmatchlen;
 			}
 		}
+		if (laddr->state == SCTP_ADDR_NEW && asoc->src_out_of_asoc_ok) {
+			bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
+			if (!baddr || (matchlen < bmatchlen)) {
+				baddr = &laddr->a;
+				matchlen = bmatchlen;
+			}
+		}
 	}
 
 	if (baddr) {
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 26dc005..28bccde 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -744,6 +744,16 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 	 */
 
 	list_for_each_entry_safe(chunk, tmp, &q->control_chunk_list, list) {
+		/* RFC 5061, 5.3
+		 * F1) This means that until such time as the ASCONF
+		 * containing the add is acknowledged, the sender MUST
+		 * NOT use the new IP address as a source for ANY SCTP
+		 * packet except on carrying an ASCONF Chunk.
+		 */
+		if (asoc->src_out_of_asoc_ok &&
+		    chunk->chunk_hdr->type != SCTP_CID_ASCONF)
+			continue;
+
 		list_del_init(&chunk->list);
 
 		/* Pick the right transport to use. */
@@ -871,6 +881,9 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 		}
 	}
 
+	if (q->asoc->src_out_of_asoc_ok)
+		goto sctp_flush_out;
+
 	/* Is it OK to send data chunks?  */
 	switch (asoc->state) {
 	case SCTP_STATE_COOKIE_ECHOED:
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 152976e..0733273 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -510,7 +510,9 @@ static struct dst_entry *sctp_v4_get_dst(struct sctp_association *asoc,
 		sctp_v4_dst_saddr(&dst_saddr, dst, htons(bp->port));
 		rcu_read_lock();
 		list_for_each_entry_rcu(laddr, &bp->address_list, list) {
-			if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
+			if (!laddr->valid || (laddr->state == SCTP_ADDR_DEL) ||
+			    (laddr->state != SCTP_ADDR_SRC &&
+			    !asoc->src_out_of_asoc_ok))
 				continue;
 			if (sctp_v4_cmp_addr(&dst_saddr, &laddr->a))
 				goto out_unlock;
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index de98665..f341ab2 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2744,6 +2744,12 @@ struct sctp_chunk *sctp_make_asconf_update_ip(struct sctp_association *asoc,
 	int			addr_param_len = 0;
 	int 			totallen = 0;
 	int 			i;
+	sctp_addip_param_t del_param; /* 8 Bytes (Type 0xC002, Len and CrrID) */
+	struct sctp_af *del_af;
+	int del_addr_param_len = 0;
+	int del_paramlen = sizeof(sctp_addip_param_t);
+	union sctp_addr_param del_addr_param; /* (v4) 8 Bytes, (v6) 20 Bytes */
+	int			del_pickup = 0;
 
 	/* Get total length of all the address parameters. */
 	addr_buf = addrs;
@@ -2756,6 +2762,17 @@ struct sctp_chunk *sctp_make_asconf_update_ip(struct sctp_association *asoc,
 		totallen += addr_param_len;
 
 		addr_buf += af->sockaddr_len;
+		if (asoc->asconf_addr_del_pending && !del_pickup) {
+			if (!sctp_in_scope(asoc->asconf_addr_del_pending,
+			    sctp_scope(addr)))
+				continue;
+			/* reuse the parameter length from the same scope one */
+			totallen += paramlen;
+			totallen += addr_param_len;
+			del_pickup = 1;
+			asoc->src_out_of_asoc_ok = 1;
+			SCTP_DEBUG_PRINTK("mkasconf_update_ip: picked same-scope del_pending addr, totallen for all addresses is %d\n", totallen);
+		}
 	}
 
 	/* Create an asconf chunk with the required length. */
@@ -2778,6 +2795,19 @@ struct sctp_chunk *sctp_make_asconf_update_ip(struct sctp_association *asoc,
 
 		addr_buf += af->sockaddr_len;
 	}
+	if (flags == SCTP_PARAM_ADD_IP && del_pickup) {
+		addr = asoc->asconf_addr_del_pending;
+		del_af = sctp_get_af_specific(addr->v4.sin_family);
+		del_addr_param_len = del_af->to_addr_param(addr,
+		    &del_addr_param);
+		del_param.param_hdr.type = SCTP_PARAM_DEL_IP;
+		del_param.param_hdr.length = htons(del_paramlen +
+		    del_addr_param_len);
+		del_param.crr_id = i;
+
+		sctp_addto_chunk(retval, del_paramlen, &del_param);
+		sctp_addto_chunk(retval, del_addr_param_len, &del_addr_param);
+	}
 	return retval;
 }
 
@@ -3193,7 +3223,8 @@ static void sctp_asconf_param_success(struct sctp_association *asoc,
 		local_bh_enable();
 		list_for_each_entry(transport, &asoc->peer.transport_addr_list,
 				transports) {
-			if (transport->state == SCTP_ACTIVE)
+			if (transport->state == SCTP_ACTIVE &&
+			    !asoc->src_out_of_asoc_ok)
 				continue;
 			dst_release(transport->dst);
 			sctp_transport_route(transport, NULL,
@@ -3203,6 +3234,11 @@ static void sctp_asconf_param_success(struct sctp_association *asoc,
 	case SCTP_PARAM_DEL_IP:
 		local_bh_disable();
 		sctp_del_bind_addr(bp, &addr);
+		if (asoc->asconf_addr_del_pending != NULL &&
+		    sctp_cmp_addr_exact(asoc->asconf_addr_del_pending, &addr)) {
+			kfree(asoc->asconf_addr_del_pending);
+			asoc->asconf_addr_del_pending = NULL;
+		}
 		local_bh_enable();
 		list_for_each_entry(transport, &asoc->peer.transport_addr_list,
 				transports) {
@@ -3361,6 +3397,9 @@ int sctp_process_asconf_ack(struct sctp_association *asoc,
 		asconf_len -= length;
 	}
 
+	if (no_err && asoc->src_out_of_asoc_ok)
+		asoc->src_out_of_asoc_ok = 0;
+
 	/* Free the cached last sent asconf chunk. */
 	list_del_init(&asconf->transmitted_list);
 	sctp_chunk_free(asconf);
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 3951a10..481293d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -583,10 +583,6 @@ static int sctp_send_asconf_add_ip(struct sock		*sk,
 			goto out;
 		}
 
-		retval = sctp_send_asconf(asoc, chunk);
-		if (retval)
-			goto out;
-
 		/* Add the new addresses to the bind address list with
 		 * use_as_src set to 0.
 		 */
@@ -599,6 +595,23 @@ static int sctp_send_asconf_add_ip(struct sock		*sk,
 						    SCTP_ADDR_NEW, GFP_ATOMIC);
 			addr_buf += af->sockaddr_len;
 		}
+		if (asoc->src_out_of_asoc_ok) {
+			struct sctp_transport *trans;
+
+			list_for_each_entry(trans,
+			    &asoc->peer.transport_addr_list, transports) {
+				/* Clear the source and route cache */
+				dst_release(trans->dst);
+				trans->cwnd = min(4*asoc->pathmtu, max_t(__u32,
+				    2*asoc->pathmtu, 4380));
+				trans->ssthresh = asoc->peer.i.a_rwnd;
+				trans->rto = asoc->rto_initial;
+				trans->rtt = trans->srtt = trans->rttvar = 0;
+				sctp_transport_route(trans, NULL,
+				    sctp_sk(asoc->base.sk));
+			}
+		}
+		retval = sctp_send_asconf(asoc, chunk);
 	}
 
 out:
@@ -711,7 +724,9 @@ static int sctp_send_asconf_del_ip(struct sock		*sk,
 	struct sctp_sockaddr_entry *saddr;
 	int 			i;
 	int 			retval = 0;
+	int			stored = 0;
 
+	chunk = NULL;
 	if (!sctp_addip_enable)
 		return retval;
 
@@ -762,8 +777,32 @@ static int sctp_send_asconf_del_ip(struct sock		*sk,
 		bp = &asoc->base.bind_addr;
 		laddr = sctp_find_unmatch_addr(bp, (union sctp_addr *)addrs,
 					       addrcnt, sp);
-		if (!laddr)
-			continue;
+		if ((laddr == NULL) && (addrcnt == 1)) {
+			if (asoc->asconf_addr_del_pending)
+				continue;
+			asoc->asconf_addr_del_pending =
+			    kzalloc(sizeof(union sctp_addr), GFP_ATOMIC);
+			asoc->asconf_addr_del_pending->sa.sa_family =
+				    addrs->sa_family;
+			asoc->asconf_addr_del_pending->v4.sin_port =
+				    htons(bp->port);
+			if (addrs->sa_family == AF_INET) {
+				struct sockaddr_in *sin;
+
+				sin = (struct sockaddr_in *)addrs;
+				asoc->asconf_addr_del_pending->v4.sin_addr.s_addr = sin->sin_addr.s_addr;
+			} else if (addrs->sa_family == AF_INET6) {
+				struct sockaddr_in6 *sin6;
+
+				sin6 = (struct sockaddr_in6 *)addrs;
+				ipv6_addr_copy(&asoc->asconf_addr_del_pending->v6.sin6_addr, &sin6->sin6_addr);
+			}
+			SCTP_DEBUG_PRINTK_IPADDR("send_asconf_del_ip: keep the last address asoc: %p ",
+			    " at %p\n", asoc, asoc->asconf_addr_del_pending,
+			    asoc->asconf_addr_del_pending);
+			stored = 1;
+			goto skip_mkasconf;
+		}
 
 		/* We do not need RCU protection throughout this loop
 		 * because this is done under a socket lock from the
@@ -776,6 +815,7 @@ static int sctp_send_asconf_del_ip(struct sock		*sk,
 			goto out;
 		}
 
+skip_mkasconf:
 		/* Reset use_as_src flag for the addresses in the bind address
 		 * list that are to be deleted.
 		 */
@@ -801,6 +841,9 @@ static int sctp_send_asconf_del_ip(struct sock		*sk,
 					     sctp_sk(asoc->base.sk));
 		}
 
+		if (stored)
+			/* We don't need to transmit ASCONF */
+			continue;
 		retval = sctp_send_asconf(asoc, chunk);
 	}
 out:
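
As a reading aid for the sctp_send_asconf_add_ip() hunk above: when src_out_of_asoc_ok is set, every transport's congestion state is put back to its starting values before the ASCONF goes out. The cwnd expression, min(4*MTU, max(2*MTU, 4380)), is the initial-cwnd rule from RFC 4960. Below is a minimal userspace sketch of that reset. struct path_state, initial_cwnd() and reset_path() are made-up names for illustration only, not kernel identifiers, and the route and dst handling from the patch is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-transport fields touched by the patch. */
struct path_state {
    uint32_t cwnd, ssthresh, rto, rtt, srtt, rttvar;
};

/* min(4*MTU, max(2*MTU, 4380)), the same expression as in the hunk. */
static uint32_t initial_cwnd(uint32_t pathmtu)
{
    uint32_t lower = 2 * pathmtu > 4380 ? 2 * pathmtu : 4380;
    uint32_t upper = 4 * pathmtu;

    return upper < lower ? upper : lower;
}

/* Return one path to its initial congestion and RTT state. */
static void reset_path(struct path_state *p, uint32_t pathmtu,
                       uint32_t peer_rwnd, uint32_t rto_initial)
{
    assert(p != 0);
    p->cwnd = initial_cwnd(pathmtu);
    p->ssthresh = peer_rwnd;      /* patch uses asoc->peer.i.a_rwnd */
    p->rto = rto_initial;         /* patch uses asoc->rto_initial */
    p->rtt = p->srtt = p->rttvar = 0;
}
```

With a 1500-byte path MTU this gives a cwnd of 4380 bytes; with a 576-byte MTU the 4*MTU term is the smaller one and cwnd becomes 2304 bytes.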


^ permalink raw reply related

* Re: [PATCH net-next-2.6 v4 4/5] sctp: Add ASCONF operation on the single-homed host
From: Michio Honda @ 2011-04-25 15:34 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: netdev, lksctp-developers
In-Reply-To: <4DB4E04C.70609@cn.fujitsu.com>

I just re-submitted the cumulative patches.

Regarding your suggestion to split out the part that resets the route on reception of ASCONF-ACK: I removed that code.
The route is already reset just before the ASCONF is sent, so resetting it again after the ASCONF-ACK arrives is not needed.
I believe the other parts follow all your comments.
I also cleaned up many parts of the single-homed host support patch.

Thanks,
- Michio

On Apr 25, 2011, at 11:45 , Wei Yongjun wrote:

> 
>> Yes, I think the association cannot be kept if the single-homed ASCONF receiver moves to the new network before sending the ASCONF-ACK.
>> Am I missing something?
> 
> Oh, yeah, you are right.:-)
> 
>> Thanks,
>> - Michio
>> 
>> On Apr 25, 2011, at 11:02 , Wei Yongjun wrote:
>> 
>>>> Hi, 
>>>> 
>>>> Such an operation would not be allowed by the specification; see Section 5.3 of RFC 5061:
>>>>  F1)  When adding an IP address to an association, the IP address is
>>>>       NOT considered fully added to the association until the ASCONF-
>>>>       ACK arrives.  This means that until such time as the ASCONF
>>>>       containing the add is acknowledged, the sender MUST NOT use the
>>>>       new IP address as a source for ANY SCTP packet except on
>>>>       carrying an ASCONF Chunk. 
>>>> 
>>>> I think this means we cannot send ASCONF-ACK from the new address even if it bundles ASCONF...
>>> If so, neither side has a valid address from which to send such an
>>> ASCONF-ACK, and neither can receive one.
>>> 
>>>> - Michio
>>>> 
>>>> On Apr 25, 2011, at 9:57 , Wei Yongjun wrote:
>>>> 
>>>>>> On Apr 22, 2011, at 13:10 , Wei Yongjun wrote:
>>>>>> 
>>>>>>>> Since the sender MUST NOT use the new IP address as a source for ANY SCTP
>>>>>>>> packet except on carrying an ASCONF Chunk, and an ASCONF chunk can be bundled,
>>>>>>>> how about this change? With it, you do not need to change sctp_outq_tail():
>>>>>>>> 
>>>>>>>> diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
>>>>>>>> index 1c88c89..bd6cc9c 100644
>>>>>>>> --- a/net/sctp/outqueue.c
>>>>>>>> +++ b/net/sctp/outqueue.c
>>>>>>>> @@ -754,6 +754,13 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
>>>>>>>> 	 */
>>>>>>>> 
>>>>>>>> 	list_for_each_entry_safe(chunk, tmp, &q->control_chunk_list, list) {
>>>>>>>> +		/* RFC 5061, 5.3
>>>>>>>> +		 * F1) This ...
>>>>>>>> +		 */
>>>>>>>> +		if (q->asoc->src_out_of_asoc_ok &&
>>>>>>>> +		    chunk->chunk_hdr->type != SCTP_CID_ASCONF)
>>>>>>> SCTP_CID_ASCONF_ACK should be also allowed, the peer may
>>>>>>> send ASCONF to do the same thing at the same time.
>>>>>> Sorry, I did not fully understand.
>>>>>> Do you mean the situation where "the peer (ASCONF receiver) may send ASCONF-ACK to the unconfirmed destination"?
>>>>>> Or do you mean the following situation?
>>>>>> 1. the peer sends ADD/DEL ASCONF to me,
>>>>>> 2. I receive it, 
>>>>>> 3. I migrate to the other network and get new address, 
>>>>>> 4. I send ASCONF-ACK to the peer from the new address
>>>>> Yes, if both sides send an ADD/DEL ASCONF to delete the last
>>>>> address at the same time, like this:
>>>>> 
>>>>> ASCONF  -----    ------ASCONF
>>>>> (ADD/DEL)    \  /     (ADD/DEL)
>>>>>            \/        
>>>>>            /\
>>>>>      <----/  \----->
>>>>> ASCONF-ACK---\  /------ASCONF-ACK
>>>>>            \/
>>>>>            /\
>>>>>      <----/  \----->
>>>>> 
>>>>> But I have not tested this. I am not sure we need to handle it; can you
>>>>> check this before committing your new patchset?
>>>>> 
>>>>> 
>>>>>>>> +			continue;
>>>>>>>> +
>>>>>>>> 		list_del_init(&chunk->list);
>>>>>>>> 
>>>>>>>> 		/* Pick the right transport to use. */
>>>>>>>> @@ -881,6 +888,9 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
>>>>>>>> 		}
>>>>>>>> 	}
>>>>>>>> 
>>>>>>>> +	if (q->asoc->src_out_of_asoc_ok)
>>>>>>>> +		goto sctp_flush_out;
>>>>>>>> +
>>>>>>>> 	/* Is it OK to send data chunks?  */
>>>>>>>> 	switch (asoc->state) {
>>>>>>>> 	case SCTP_STATE_COOKIE_ECHOED:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 


^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Eric Dumazet @ 2011-04-25 15:38 UTC (permalink / raw)
  To: Dominik Kaspar; +Cc: Carsten Wolff, netdev
In-Reply-To: <BANLkTi=xns1Gdjyt-SX3yDETSQfO23rXXg@mail.gmail.com>

On Monday, 25 April 2011 at 16:35 +0200, Dominik Kaspar wrote:
> Hi Eric and Carsten,
> 
> Thanks a lot for your quick replies. I don't have a tcpdump of this
> experiment, but here is the tcp_probe log that the plot is based on
> (I'll run a new test using tcpdump if you think that's more useful):
> 
> http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-00.log
> 
> I have also noticed what Carsten mentions, the tcp_reordering value is
> essential for this whole behavior. When I start an experiment and
> increase sysctl.net.ipv4.tcp_reordering during the running connection,
> the TCP throughput immediately jumps close to the aggregate of both
> paths. Without intervention, as in this experiment, tcp_reordering
> starts out as 3 and then makes small oscillations between 3 and 12 for
> more than 2 minutes. At about second 141, TCP somehow finds a new
> highest reordering value (23) and at the same time, the throughput
> jumps up "to the next level". The value of 23 is then used all the way
> until second 603, when the reordering value becomes 32 and the
> throughput again jumps up a level.
> 
> I understand that tp->reordering is increased when reordering is
> detected, but what causes tp->reordering to sometimes be decreased
> back to 3? Also, why does a decrease back to 3 not make the whole
> procedure start all over again? For example, at second 1013.64,
> tp->reordering falls from 127 down to 3. A second later (1014.93) it
> then suddenly increases from 3 up to 32 without considering any
> numbers in between. Why is it now suddenly so fast? At the very
> beginning, it took 600 seconds to grow from 3 to 32 and afterward it
> just takes a second...?
> 
> For the experiments, all default TCP options were used, meaning that
> SACK, DSACK, Timestamps, were all enabled. Not sure how to turn on/off
> TSO... so that is probably enabled, too. Path emulation is done with
> tc/netem at the receiver interfaces (eth1, eth2) with this script:
> 
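
A hedged aside on the tp->reordering question quoted above. As far as I recall for kernels of this generation, two rules drive the value: tcp_update_reordering() raises tp->reordering directly to the newly observed reordering distance (capped at TCP_MAX_REORDERING, which matches the 127 in the log), and tcp_enter_loss() clamps it back down to sysctl_tcp_reordering after a retransmission timeout. The sketch below is a simplified userspace model of just those two rules, not the kernel code; struct conn and the function names are invented for illustration.

```c
#include <assert.h>

#define MAX_REORDERING 127  /* assumed cap, same value as the 127 in the log */

/* Hypothetical stand-in for the one field of interest in struct tcp_sock. */
struct conn {
    int reordering;
};

/* Reordering of distance `metric` was observed: the estimate only ratchets
 * upward, and it jumps straight to the observed value (no gradual growth). */
static void on_reordering_detected(struct conn *c, int metric)
{
    assert(c != 0);
    if (metric > c->reordering)
        c->reordering = metric < MAX_REORDERING ? metric : MAX_REORDERING;
}

/* Retransmission timeout: fall back to at most the configured default. */
static void on_rto(struct conn *c, int sysctl_reordering)
{
    assert(c != 0);
    if (c->reordering > sysctl_reordering)
        c->reordering = sysctl_reordering;
}
```

Under this model a drop from 127 to 3 would correspond to a timeout, and a jump from 3 straight to 32 to a single observation of distance 32. One plausible reason for the slow start is that the observable distance is limited by how many packets are in flight, which is small early in the connection; that part is my guess and is not confirmed by the log.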

Since you have a rule at the sender that spoofs the destination address of
packets, you should make sure you don't send "super packets" (up to 64 Kbytes),
because that would stress the multipath setup more than you intended. With the
settings below, you send only normal packets (1500 MTU).

ethtool -K eth0 tso off
ethtool -K eth0 gso off

I am pretty sure it should help your (atypical) workload.
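
To put a rough number on the "super packet" effect: with TSO/GSO enabled, one large packet that passes through the spoofing rule is later cut into many wire segments, all of which then follow the single path chosen for it. The sketch below only does that arithmetic. The function name is made up, and the 1448-byte MSS is an assumption (1500 MTU minus 20 bytes IPv4, 20 bytes TCP and 12 bytes of timestamp option).

```c
#include <assert.h>
#include <stdint.h>

/* Number of wire segments one super-packet of `payload_bytes` is cut into,
 * i.e. the ceiling of payload_bytes / mss. */
static uint32_t wire_segments(uint32_t payload_bytes, uint32_t mss)
{
    assert(mss > 0);
    return (payload_bytes + mss - 1) / mss;
}
```

For example, wire_segments(64000, 1448) is 45, so a single scheduling decision would place about 45 back-to-back segments on one path, whereas with TSO and GSO off each decision covers one segment.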

> http://home.simula.no/~kaspar/static/netem.sh
> 
> Greetings,
> Dominik



^ permalink raw reply

* ar9280+802.11na+ath9k+kismet don't capture data packets
From: Антон @ 2011-04-25 17:54 UTC (permalink / raw)
  To: netdev

Hello!

The platform is a D-Link DIR-825 (Atheros AR7161) with two AR9280 radios (802.11bgn and 802.11an).
kismet-2010-07-R1 is installed on OpenWrt Backfire trunk. The driver is ath9k from compat-wireless-2011-04-19.
For the tests I use a D-Link DIR-825 AP and a D-Link DWA-160. The connection speed is 300 Mbps on channel 36.
Kismet captures PHY and control packets, but does not capture data packets.

/etc/config/wireless

config wifi-device radio0
option type mac80211
option channel 11
option macaddr 00:18:e7:ec:b0:5f
option hwmode 11ng
option hwmode_11n g
option htmode HT40+
list ht_capab HT40-
list ht_capab HT40+
list ht_capab HT20

config wifi-iface
option device radio0
option mode monitor

config wifi-device radio1
option type mac80211
option channel 36
option macaddr 00:18:e7:ec:b0:60
option hwmode 11na
option hwmode_11n a
option htmode HT40+
list ht_capab HT40-
list ht_capab HT40+
list ht_capab HT20

config wifi-iface
option device radio1
option mode monitor

device wlan0 entered promiscuous mode
device wlan1 entered promiscuous mode

iwconfig

wlan0 IEEE 802.11bgn Mode:Monitor Tx-Power=27 dBm
RTS thr:off Fragment thr:off
Power Management:on

wlan1 IEEE 802.11an Mode:Monitor Tx-Power=17 dBm
RTS thr:off Fragment thr:off
Power Management:on

I then transferred a 600 Mb file.
After that, ifconfig reported:
ifconfig
wlan0 Link encap:UNSPEC HWaddr 00-18-E7-EC-B0-5F-00-47-00-00-00-00-00-00-0
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:9865 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3043414 (2.9 MiB) TX bytes:0 (0.0 B)

wlan1 Link encap:UNSPEC HWaddr 00-18-E7-EC-B0-60-00-47-00-00-00-00-00-00-0
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:365275 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:20117793 (19.1 MiB) TX bytes:0 (0.0 B)

Sniffing result:
MAC Src            MAC Dst            Channel  Signal  Type   Crypt            Packets  SSID
00:18:E7:FD:8E:D8  00:19:5B:7E:09:CE  6        -42     AP     WEP              7629     q183
00:18:E7:FD:8E:DA  F0:7D:68:71:40:DC  36       -60     AP     TKIP WPA PSK...  1703     q183-media
1C:65:9D:21:27:8F  00:18:E7:FD:8E:D8  6        -65     AP     WEP              2607     q183
00:00:00:00:00:00  00:18:E7:FD:8E:D8  -        -65     AP     -                370415   -
00:19:5B:7E:09:CE  00:18:E7:FD:8E:D8  -        -67     Probe  WEP              124      q183
F0:7D:68:71:40:DC  00:18:E7:FD:8E:DA  -        -39     AP     TKIP WPA PSK...  41       q183-media

Does the ath9k driver capture data packets in 802.11na mode? Can this be stated with 100% certainty? Has anyone verified it?

^ permalink raw reply

* Re: [PATCH 2/9] net-ethtool: Convert (hw_/vlan_/wanted_)features fields from u32 type to u64.
From: Mahesh Bandewar @ 2011-04-25 18:14 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Michał Mirosław
In-Reply-To: <1303623041.3032.97.camel@localhost>

On Sat, Apr 23, 2011 at 10:30 PM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Fri, 2011-04-22 at 16:36 -0700, Mahesh Bandewar wrote:
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> ---
>>  include/linux/ethtool.h |   26 +++++++------
>>  net/core/ethtool.c      |   89 ++++++++++++++++------------------------------
>>  2 files changed, 45 insertions(+), 70 deletions(-)
>>
>> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
>> index 9de3127..71e8a02 100644
>> --- a/include/linux/ethtool.h
>> +++ b/include/linux/ethtool.h
>> @@ -605,10 +605,10 @@ struct ethtool_flash {
>>   * @never_changed: mask of features not changeable for any device
>>   */
>>  struct ethtool_get_features_block {
>> -     __u32   available;
>> -     __u32   requested;
>> -     __u32   active;
>> -     __u32   never_changed;
>> +     __u64   available;
>> +     __u64   requested;
>> +     __u64   active;
>> +     __u64   never_changed;
>>  };
>>
>>  /**
>> @@ -618,10 +618,11 @@ struct ethtool_get_features_block {
>>   *       out: number of elements in features[] needed to hold all features
>>   * @features: state of features
>>   */
>> +/* TODO Why is this needed XXX */
>
> Precisely to allow for expansion to more than 32 bits.
>
:) That comment was not supposed to be part of the patch; it was just
there to aid my thinking while we were discussing the approach to
extending features. I thought that if we have decided not to use arrays,
then why complicate this interface, so I was attempting to simplify
it.

>>  struct ethtool_gfeatures {
>>       __u32   cmd;
>>       __u32   size;
>> -     struct ethtool_get_features_block features[0];
>> +     struct ethtool_get_features_block features;
>>  };
>>
>>  /**
>> @@ -630,8 +631,8 @@ struct ethtool_gfeatures {
>>   * @requested: values of features to be changed
>>   */
>>  struct ethtool_set_features_block {
>> -     __u32   valid;
>> -     __u32   requested;
>> +     __u64   valid;
>> +     __u64   requested;
>>  };
>>
>>  /**
>> @@ -640,10 +641,11 @@ struct ethtool_set_features_block {
>>   * @size: array size of the features[] array
>>   * @features: feature change masks
>>   */
>> +/* TODO Why is this needed XXX */
>>  struct ethtool_sfeatures {
>>       __u32   cmd;
>>       __u32   size;
>> -     struct ethtool_set_features_block features[0];
>> +     struct ethtool_set_features_block features;
>>  };
> [...]
>
> These structures are part of the userland API, but they are new in
> 2.6.39.  So they can still be changed up until 2.6.39 is released, but
> not afterwards.
>
got it!

> If we think 64 bits will be enough for the next 10 years, then let's
> just go with a single 64-bit feature word.  If we're not so sure then
> then the ethtool API should continue to allow for multiple words
> (whether 32-bit or 64-bit).
>
I think even 64 bits may not be enough; in the earlier thread I
mentioned that this would (probably) give us the next two years! I didn't
know about this constraint, though.

--mahesh..

> Ben.
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
>

^ permalink raw reply

* Re: [PATCHv5] usbnet: Resubmit interrupt URB once if halted
From: Paul Stewart @ 2011-04-25 18:41 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Alan Stern, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	bhutchings-s/n/eUQHGBpZroRs9YW3xA
In-Reply-To: <201104240836.47491.oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>

On Sat, Apr 23, 2011 at 11:36 PM, Oliver Neukum <oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org> wrote:
> On Friday, 22 April 2011, 17:59:15, Paul Stewart wrote:
>> >
>> >>       free_netdev(net);
>> >>       usb_put_dev (xdev);
>> >>  }
>> >> @@ -1285,6 +1291,10 @@ int usbnet_suspend (struct usb_interface *intf, pm_message_t message)
>> >>                * wake the device
>> >>                */
>> >>               netif_device_attach (dev->net);
>> >> +
>> >> +             /* Stop interrupt URBs */
>> >> +             if (dev->interrupt)
>> >> +                     usb_kill_urb(dev->interrupt);
>> >>       }
>> >>       return 0;
>> >>  }
>> >
>> > There is a subtle question here: When is the best time to kill the
>> > interrupt URB?  Without knowing any of the details, I'd guess that the
>> > interrupt URB reports asynchronous events and the driver could run into
>> > trouble if one of those events occurred while everything else was
>> > turned off.  This suggests that the interrupt URB should be killed as
>> > soon as possible rather than as late as possible.  But maybe it doesn't
>> > matter; it all depends on the design of the driver.
>>
>> I'm not sure I can answer this question either.  As it stands, nobody
>> was killing them before.  Just trying to make it better.
>
> Hm. Are we looking at the same code?

Perhaps not.  I'm working out of the netdev-2.6 git repository.  Is
this the wrong place?

> usbnet_suspend now has:
>
>                /*
>                 * accelerate emptying of the rx and queues, to avoid
>                 * having everything error out.
>                 */
>                netif_device_detach (dev->net);
>                usbnet_terminate_urbs(dev);
>                usb_kill_urb(dev->interrupt);
>
> This suggests that if you want to resubmit the interrupt URB in resume()
> you do it first. Which drivers use the interrupt URB?

The asix driver uses it to signal link status.

>
>        Regards
>                Oliver
>
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Oops in 2.6.39 include/net/dst.h: dst_metrics_write_ptr() running l2tp over ipsec
From: David Miller @ 2011-04-25 18:54 UTC (permalink / raw)
  To: berny156; +Cc: eric.dumazet, linux-kernel, netdev
In-Reply-To: <4DB54450.90806@gmx.de>

From: Held Bernhard <berny156@gmx.de>
Date: Mon, 25 Apr 2011 11:52:16 +0200

> On 25.04.2011 10:07, Eric Dumazet wrote:
>> From: Held Bernhard<berny156@gmx.de>
 ...
>> Thanks for your report and patch.
>>
>> Maybe following patch is the way to fix this, please test it.
>>
>>
>> [PATCH] net: provide cow_metrics() methods to blackhole dst_ops
 ...
>> The oops happens in dst_metrics_write_ptr()
>> include/net/dst.h:124: return dst->ops->cow_metrics(dst, p);
>>
>> dst->ops->cow_metrics is NULL and causes the oops.
>>
>> Provide cow_metrics() methods, like we did in commit 214f45c91bb
>> (net: provide default_advmss() methods to blackhole dst_ops)
>>
>> Signed-off-by: Held Bernhard<berny156@gmx.de>
>> Signed-off-by: Eric Dumazet<eric.dumazet@gmail.com>
 ...
> Your patch works flawlessly.
> 
> Thanks for the quick response!

Applied, thanks everyone.
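
The failure mode in the quoted patch description (a method left NULL in an ops table, then called unconditionally) and the fix (give the blackhole ops their own method) can be modelled in userspace as below. This is a simplified sketch with invented names (struct entry, entry_ops, metrics_write_ptr), not the kernel's dst code. It assumes the blackhole method simply returns NULL, so callers are expected to handle a NULL result.

```c
#include <assert.h>
#include <stddef.h>

struct entry;

/* Hypothetical ops table with one copy-on-write method. */
struct entry_ops {
    unsigned int *(*cow_metrics)(struct entry *e);
};

struct entry {
    const struct entry_ops *ops;
    unsigned int metrics[4];
    int read_only;
};

/* Normal entries: make the metrics writable and hand them out. */
static unsigned int *normal_cow(struct entry *e)
{
    e->read_only = 0;
    return e->metrics;
}

/* Blackhole entries: never writable, but the method exists, so the
 * call through the ops table is safe. */
static unsigned int *blackhole_cow(struct entry *e)
{
    (void)e;
    return NULL;
}

static const struct entry_ops normal_ops = { .cow_metrics = normal_cow };
static const struct entry_ops blackhole_ops = { .cow_metrics = blackhole_cow };

/* Writable metrics pointer; calls through ops only when read-only. */
static unsigned int *metrics_write_ptr(struct entry *e)
{
    assert(e->ops != NULL && e->ops->cow_metrics != NULL);
    if (e->read_only)
        return e->ops->cow_metrics(e);
    return e->metrics;
}
```

If blackhole_ops left cow_metrics unset, the call in metrics_write_ptr() would go through a NULL function pointer, which is the shape of the oops reported in this thread.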

^ permalink raw reply
