From: Vlad Yasevich <vladislav.yasevich@hp.com>
To: Steve Hill <steve.hill@dialogic.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>, Andrew Morton <akpm@osdl.org>,
netdev@vger.kernel.org, lksctp-developers@lists.sourceforge.net
Subject: Re: Fw: Intermittent SCTP multihoming breakage
Date: Wed, 10 Jan 2007 15:49:21 -0500
Message-ID: <45A55151.2060401@hp.com>
In-Reply-To: <Pine.CYG.4.58.0701040913300.3128@shill1-mobl.eicon.com>
Steve Hill wrote:
> On Wed, 3 Jan 2007, Sridhar Samudrala wrote:
>
> Sorry for the delay in replying.
>
>> No. lksctp-developers mailing list is still the best place for SCTP related
>> discussions. You can subscribe and look in the archives at
>> http://lists.sourceforge.net/lists/listinfo/lksctp-developers
>
> Hmm, I had a look there and it seemed reasonably inactive and overrun by
> spam.. (And I've been unable to subscribe).
>
>> How are the 2 machines connected? Are they connected directly or
>> via a router?
>
> They are currently connected together directly through crossover cables.
>
>> Do you see both the addresses when you do cat /proc/net/sctp/assocs
>> after the association is established on both the peers?
>
> Yes, the contents of /proc/net/sctp/assocs looks correct.
>
>> How are you dropping traffic? You could try simulating failover by
>> bringing down the interface or physically removing the link.
>
> I have been using iptables to drop SCTP packets on both the INPUT and
> OUTPUT chains. However, I get the same results if I just unplug the
> network cable (using iptables is easier for my testing since I don't have
> to crawl around behind the test systems :)
>
>>> 1. Sometimes, just after failing over to the second path I see an ABORT.
>> This seems to indicate that somehow the app has terminated.
>
> The abort _appears_ to be caused by a retransmit timer expiring, causing
> the SCTP stack to tear down the association. However, I haven't done much
> investigation of this problem yet - I've been focussing on the second
> problem since it seems to happen more frequently.
>
>>> 2. More frequently, the association stays up indefinitely, with heartbeat
>>> requests and acks on the second path, but no data chunks are sent even
>>> though the transmit queue on the transmitting end appears to be full and
>>> the socket is blocking writes.
>> This is strange. Can you collect tcpdump traces on sender and receiver when
>> this happens?
>
> I've taken dumps of the data on the wire for both paths:
> http://www.nexusuk.org/~steve/sctp/path1.pcap
> http://www.nexusuk.org/~steve/sctp/path2.pcap
Taking a look at these, it does appear to completely stall... There are some
rather interesting retransmissions that don't look quite right...
>
> I can't see anything odd in the network traffic - it just stops as if it
> has no more data to send. However, the socket appears to still be
> blocking so the application cannot give it any new data.
>
> This seems to be a problem with the abandonment functionality:
> 1. Transmit chunk 1. The transmitted list now contains chunk 1.
> 2. Chunk 1 and its retransmissions get lost on the network.
> 3. Abandon chunk 1. The transmitted list is now empty.
This causes a FORWARD TSN chunk to be sent to the peer telling him
to advance CTSN to that of chunk 1.
> 4. Transmit chunk 2. The transmitted list now contains chunk 2.
> 5. Receive a gap-ack for chunk 2, indicating that chunk 1 is missing.
Yes, but at this point, we will regenerate the FORWARD TSN since chunk 1
is still on the abandoned list.
> At this point, the T3 timer is disabled at the bottom of
> sctp_check_transmitted() since all the chunks in the transmitted queue are
> gap-acked. The whole connection now stalls, waiting for the SACK for
> chunk 1 that will never arrive.
>
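To make the sequence above easier to follow, here is a minimal toy model of
the sender state in Python. It is only a sketch of steps 1-5 as described in
this thread, not the kernel code: the ToySender class and all of its fields
and methods are hypothetical names, and the FORWARD TSN handling is
simplified to "resend it for the highest abandoned TSN the peer has not yet
passed".

```python
from dataclasses import dataclass, field


@dataclass
class ToySender:
    """Toy model of the sender state discussed above (not kernel code)."""
    transmitted: dict = field(default_factory=dict)  # TSN -> gap-acked flag
    abandoned: set = field(default_factory=set)      # TSNs we gave up on
    t3_running: bool = False                         # retransmit timer state
    forward_tsns_sent: list = field(default_factory=list)

    def transmit(self, tsn):
        # Steps 1 and 4: the chunk goes on the transmitted list, T3 starts.
        self.transmitted[tsn] = False
        self.t3_running = True

    def abandon(self, tsn):
        # Step 3: the chunk leaves the transmitted list and a FORWARD TSN
        # asks the peer to advance its cumulative TSN past it.
        del self.transmitted[tsn]
        self.abandoned.add(tsn)
        self.forward_tsns_sent.append(tsn)

    def receive_sack(self, cum_tsn, gap_acked):
        # Step 5: mark the gap-acked chunks.
        for tsn in gap_acked:
            if tsn in self.transmitted:
                self.transmitted[tsn] = True
        # Forget abandoned chunks the peer has already moved past, and
        # regenerate the FORWARD TSN for any that remain.
        self.abandoned = {t for t in self.abandoned if t > cum_tsn}
        if self.abandoned:
            self.forward_tsns_sent.append(max(self.abandoned))
        # Mimics the check described at the bottom of
        # sctp_check_transmitted(): stop T3 once every chunk still on the
        # transmitted list has been gap-acked.
        if all(self.transmitted.values()):
            self.t3_running = False


s = ToySender()
s.transmit(1)                            # step 1
s.abandon(1)                             # steps 2-3: chunk 1 lost, abandoned
s.transmit(2)                            # step 4
s.receive_sack(cum_tsn=0, gap_acked=[2])  # step 5: gap-ack for chunk 2
print(s.t3_running, s.abandoned, s.forward_tsns_sent)
```

In this model the FORWARD TSN is regenerated at step 5, but T3 is stopped
while chunk 1 is still on the abandoned list. If that regenerated
FORWARD TSN were also lost, nothing in the model would trigger another one.
Whether that is what actually happens in the traces is still an open
question.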
I'll look some more at this...
-vlad
> It should be noted that this is not unordered data and I'm not clear on
> how abandoned chunks are supposed to be handled - I hadn't intentionally
> enabled the abandonment functionality, the timetolive was set on the
> transmitted chunks by accident.
>