* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
@ 2013-12-14 15:16 ` Jamal Hadi Salim
2013-12-14 15:27 ` Jamal Hadi Salim
` (24 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-14 15:16 UTC (permalink / raw)
To: linux-sctp
I wasn't clear, sorry. Basically I kill the app after it sends
100K or more messages (about 300 bytes each). The server
still thinks the client is connected. The client does
a close/shutdown. I enabled SCTP heartbeats between the client
and server, and running tcpdump after killing the client
shows heartbeats going on happily.
Sorry - I can't put out the code; it is too big to cut
down into something small enough to demonstrate.
cheers,
jamal
On 12/14/13 10:04, Jamal Hadi Salim wrote:
>
> Folks,
>
> I have a problem which manifests in kernels > 3.8. I am not sure how
> best to debug it.
> I have looked at strace and don't see anything different between when it
> works (kernels <= 3.8) and when it doesn't (kernels > 3.8).
> When I dump /proc/net/sctp/assocs I can see in the non-working case
> the socket is still there - which means there is no way for the server
> to be notified.
> If I kill the server, the socket disappears.
>
> Is there something else that would help narrow this down?
>
> cheers,
> jamal
>
> PS:
> Essentially I have a client app that does some nasty stuff (on purpose
> to test robustness). Client and server are connected locally within the
> same machine.
> The client sends packets as fast as it can with partial reliability (timeout
> of about 100ms). The only time the client checks for any kernel-obsoleted
> msgs is when a write to the send socket queue would block.
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
2013-12-14 15:16 ` Jamal Hadi Salim
@ 2013-12-14 15:27 ` Jamal Hadi Salim
2013-12-14 17:06 ` Michael Tuexen
` (23 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-14 15:27 UTC (permalink / raw)
To: linux-sctp
More info:
1) In the non-working case I don't see the shutdown sequence in tcpdump.
2) Things seem to break only when I have a number of SEND_FAILED events
when I read. The brain in my stomach is thinking this probably has to
do with some outstanding obsoleted messages which the app didn't have
a chance to recv, although cat /proc/net/sctp/assocs shows
both send/recv queues at 0.
cheers,
jamal
On 12/14/13 10:16, Jamal Hadi Salim wrote:
>
> I wasnt clear, sorry. Basically I kill the app after it sends
> a 100K or more messages (about 300 bytes each). The server
> is still thinking the client is connected. The client does
> a close/shutdown. I enabled SCTP heartbeats between the client
> and server and running tcpdump after killing the client
> shows heartbeats going on happily.
>
> Sorry - I cant put out out the code, it is too big to cut
> down into something small enough to demonstrate.
>
> cheers,
> jamal
>
>
> On 12/14/13 10:04, Jamal Hadi Salim wrote:
>>
>> Folks,
>>
>> I have a problem which manifests in kernels > 3.8. I am no sure how
>> best to debug it.
>> I have looked at strace and dont see anything different between when it
>> works (kernels <= 3.8) and when it doesnt (kernels > 3.8).
>> When i dump /proc/net/sctp/assocs I can see in the non-working case
>> the socket is still there - which means there is no way for the server
>> to be notified.
>> If kill the server, the socket disappears.
>>
>> Is there something else that would help narrow this down?
>>
>> cheers,
>> jamal
>>
>> PS:
>> Essentially I have a client app that does some nasty stuff (on purpose
>> to test robustness). Client and server are connected locally within same
>> machine.
>> Client sends as fast as it can packets with partial reliability (timeout
>> of about 100ms). The only time client checks for any kernel obsoleted
>> msgs is when the send socket queue write will block.
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
2013-12-14 15:16 ` Jamal Hadi Salim
2013-12-14 15:27 ` Jamal Hadi Salim
@ 2013-12-14 17:06 ` Michael Tuexen
2013-12-14 17:21 ` Jamal Hadi Salim
` (22 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Michael Tuexen @ 2013-12-14 17:06 UTC (permalink / raw)
To: linux-sctp
On Dec 14, 2013, at 4:16 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> I wasnt clear, sorry. Basically I kill the app after it sends
> a 100K or more messages (about 300 bytes each). The server
> is still thinking the client is connected. The client does
> a close/shutdown. I enabled SCTP heartbeats between the client
What does this mean? Do you call close() or shutdown()? Do you
kill the client?
Best regards
Michael
> and server and running tcpdump after killing the client
> shows heartbeats going on happily.
>
> Sorry - I cant put out out the code, it is too big to cut
> down into something small enough to demonstrate.
>
> cheers,
> jamal
>
>
> On 12/14/13 10:04, Jamal Hadi Salim wrote:
>>
>> Folks,
>>
>> I have a problem which manifests in kernels > 3.8. I am no sure how
>> best to debug it.
>> I have looked at strace and dont see anything different between when it
>> works (kernels <= 3.8) and when it doesnt (kernels > 3.8).
>> When i dump /proc/net/sctp/assocs I can see in the non-working case
>> the socket is still there - which means there is no way for the server
>> to be notified.
>> If kill the server, the socket disappears.
>>
>> Is there something else that would help narrow this down?
>>
>> cheers,
>> jamal
>>
>> PS:
>> Essentially I have a client app that does some nasty stuff (on purpose
>> to test robustness). Client and server are connected locally within same
>> machine.
>> Client sends as fast as it can packets with partial reliability (timeout
>> of about 100ms). The only time client checks for any kernel obsoleted
>> msgs is when the send socket queue write will block.
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (2 preceding siblings ...)
2013-12-14 17:06 ` Michael Tuexen
@ 2013-12-14 17:21 ` Jamal Hadi Salim
2013-12-14 17:23 ` Michael Tuexen
` (21 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-14 17:21 UTC (permalink / raw)
To: linux-sctp
On 12/14/13 12:06, Michael Tuexen wrote:
> What does this mean? Do you call close() or shutdown()? Do you
> kill the client?
>
Clean shutdown.
shutdown(fd, SHUT_RDWR);
cheers,
jamal
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (3 preceding siblings ...)
2013-12-14 17:21 ` Jamal Hadi Salim
@ 2013-12-14 17:23 ` Michael Tuexen
2013-12-14 17:36 ` Jamal Hadi Salim
` (20 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Michael Tuexen @ 2013-12-14 17:23 UTC (permalink / raw)
To: linux-sctp
On Dec 14, 2013, at 6:21 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 12/14/13 12:06, Michael Tuexen wrote:
>
>> What does this mean? Do you call close() or shutdown()? Do you
>> kill the client?
>>
>
> Clean shutdown.
> shutdown(fd, SHUT_RDWR);
OK. This should trigger the SHUTDOWN procedure, but leave the socket alive.
Any reason why you can't try close() instead?
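To make the distinction concrete, here is a minimal sketch of calling shutdown() and then close() on a connected socket. The helper names `finish_socket` and `peer_read_after_finish` are made up for this illustration, and a Unix-domain socketpair stands in for the SCTP association so the snippet runs on a machine without SCTP support. On an SCTP socket, shutdown() would be expected to start the SHUTDOWN procedure, while close() additionally releases the descriptor.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: stop traffic in both directions, then release
 * the descriptor. Returns 0 on success, -1 if close() failed. */
int finish_socket(int fd)
{
    /* shutdown() alone leaves the descriptor (and the socket) alive. */
    (void)shutdown(fd, SHUT_RDWR);
    /* close() drops the reference held through this descriptor. */
    return close(fd);
}

/* Demo on a Unix-domain socketpair (an assumption made so the sketch
 * runs without SCTP): returns what read() reports on the peer after
 * the other end has been finished; 0 means the peer saw end-of-file. */
long peer_read_after_finish(void)
{
    int sv[2];
    char buf[8];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (finish_socket(sv[0]) < 0) {
        close(sv[1]);
        return -1;
    }
    long n = (long)read(sv[1], buf, sizeof(buf));
    close(sv[1]);
    return n;
}
```

In the working case the peer notices the end of the connection (here as end-of-file); the problem reported in this thread is that the SCTP peer never gets the equivalent notification.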
Best regards
Michael
>
> cheers,
> jamal
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (4 preceding siblings ...)
2013-12-14 17:23 ` Michael Tuexen
@ 2013-12-14 17:36 ` Jamal Hadi Salim
2013-12-14 18:35 ` Jamal Hadi Salim
` (19 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-14 17:36 UTC (permalink / raw)
To: linux-sctp
On 12/14/13 12:23, Michael Tuexen wrote:
> OK. This should trigger the SHUTDOWN procedure, but leave the socket alive.
> Any reason why you can't try close() instead?
I can't remember the reason - but there was one in the past. I will give
close a try. I can access the machine with the problematic kernel
in about 1 hour from now and will post.
The question is: why does this work with older kernels?
Or, even more, with the same kernel if the client doesn't
send very fast?
cheers,
jamal
> Best regards
> Michael
>>
>> cheers,
>> jamal
>>
>
>
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (5 preceding siblings ...)
2013-12-14 17:36 ` Jamal Hadi Salim
@ 2013-12-14 18:35 ` Jamal Hadi Salim
2013-12-14 18:47 ` Michael Tuexen
` (18 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-14 18:35 UTC (permalink / raw)
To: linux-sctp
Tried using close - no luck. I may have observed maybe 1 out of
100 runs where a proper shutdown was sent - which may be
an improvement, but I could be imagining that.
If it is any help: I am using epoll via libev, but I don't
think that's the problem because it works 100% on older kernels.
cheers,
jamal
On 12/14/13 12:36, Jamal Hadi Salim wrote:
> On 12/14/13 12:23, Michael Tuexen wrote:
>
>
>> OK. This should trigger the SHUTDOWN procedure, but leave the socket
>> alive.
>> Any reason why you can't try close() instead?
>
> I cant remember the reason - but there was one in the past. I will give
> close a try. I can access the machine with problematic kernel
> in about 1 hour from now and will post.
>
> The question is: Why does this work with older kernels.
> Or even more with the same kernel if the client doesnt
> send very fast.
>
> cheers,
> jamal
>
>> Best regards
>> Michael
>>>
>>> cheers,
>>> jamal
>>>
>>
>>
>>
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (6 preceding siblings ...)
2013-12-14 18:35 ` Jamal Hadi Salim
@ 2013-12-14 18:47 ` Michael Tuexen
2013-12-14 19:09 ` Jamal Hadi Salim
` (17 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Michael Tuexen @ 2013-12-14 18:47 UTC (permalink / raw)
To: linux-sctp
On Dec 14, 2013, at 7:35 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> Tried using close - no luck. I may have observed maybe 1 out
> 100 runs where a proper shutdown was sent - which maybe an
> an improvement, but i could be imagining that.
OK. What happens when the shutdown guard timer expires
(it should be around 180 seconds)?
> If it is any help: I am using epoll via libev but i dont
> think thats the problem because it works 100% on older kernels.
No idea...
Best regards
Michael
>
> cheers,
> jamal
>
> On 12/14/13 12:36, Jamal Hadi Salim wrote:
>> On 12/14/13 12:23, Michael Tuexen wrote:
>>
>>
>>> OK. This should trigger the SHUTDOWN procedure, but leave the socket
>>> alive.
>>> Any reason why you can't try close() instead?
>>
>> I cant remember the reason - but there was one in the past. I will give
>> close a try. I can access the machine with problematic kernel
>> in about 1 hour from now and will post.
>>
>> The question is: Why does this work with older kernels.
>> Or even more with the same kernel if the client doesnt
>> send very fast.
>>
>> cheers,
>> jamal
>>
>>> Best regards
>>> Michael
>>>>
>>>> cheers,
>>>> jamal
>>>>
>>>
>>>
>>>
>>
>
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (7 preceding siblings ...)
2013-12-14 18:47 ` Michael Tuexen
@ 2013-12-14 19:09 ` Jamal Hadi Salim
2013-12-14 19:27 ` Jamal Hadi Salim
` (16 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-14 19:09 UTC (permalink / raw)
To: linux-sctp
On 12/14/13 13:47, Michael Tuexen wrote:
> On Dec 14, 2013, at 7:35 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>>
>> Tried using close - no luck. I may have observed maybe 1 out
>> 100 runs where a proper shutdown was sent - which maybe an
>> an improvement, but i could be imagining that.
> OK. What happens when the shutdown guard timer runs off
Where do I find this timer and how do I control it?
Waiting for a few hours last time didn't seem to cure this.
I have one stuck right now; I will give it another 5 minutes,
then close the client and see if things work.
My sctp configs:
----
net.sctp.addip_enable = 0
net.sctp.addip_noauth_enable = 0
net.sctp.addr_scope_policy = 1
net.sctp.association_max_retrans = 10
net.sctp.auth_enable = 0
net.sctp.cookie_hmac_alg = sha1
net.sctp.cookie_preserve_enable = 1
net.sctp.default_auto_asconf = 0
net.sctp.hb_interval = 30000
net.sctp.max_autoclose = 8589934
net.sctp.max_burst = 4
net.sctp.max_init_retransmits = 8
net.sctp.path_max_retrans = 5
net.sctp.pf_retrans = 0
net.sctp.prsctp_enable = 1
net.sctp.rcvbuf_policy = 0
net.sctp.rto_alpha_exp_divisor = 3
net.sctp.rto_beta_exp_divisor = 2
net.sctp.rto_initial = 3000
net.sctp.rto_max = 60000
net.sctp.rto_min = 1000
net.sctp.rwnd_update_shift = 4
net.sctp.sack_timeout = 200
net.sctp.sctp_mem = 138363 184487 276726
net.sctp.sctp_rmem = 4096 865500 4194304
net.sctp.sctp_wmem = 4096 16384 4194304
net.sctp.sndbuf_policy = 0
net.sctp.valid_cookie_life = 60000
---------
cheers,
jamal
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (8 preceding siblings ...)
2013-12-14 19:09 ` Jamal Hadi Salim
@ 2013-12-14 19:27 ` Jamal Hadi Salim
2013-12-14 20:06 ` Michael Tuexen
` (15 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-14 19:27 UTC (permalink / raw)
To: linux-sctp
On 12/14/13 14:09, Jamal Hadi Salim wrote:
> I have one stuck right now, will give it another 5 minutes
> then close the client and see if things work.
Even waiting for 15 minutes didn't fix it once it gets stuck.
Again note: I have to have processed a SEND_FAILED for this
to occur. And I have to be sending very fast (machine being
the constraint).
cheers,
jamal
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (9 preceding siblings ...)
2013-12-14 19:27 ` Jamal Hadi Salim
@ 2013-12-14 20:06 ` Michael Tuexen
2013-12-15 15:21 ` Jamal Hadi Salim
` (14 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Michael Tuexen @ 2013-12-14 20:06 UTC (permalink / raw)
To: linux-sctp
On Dec 14, 2013, at 8:09 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 12/14/13 13:47, Michael Tuexen wrote:
>> On Dec 14, 2013, at 7:35 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>>>
>>> Tried using close - no luck. I may have observed maybe 1 out
>>> 100 runs where a proper shutdown was sent - which maybe an
>>> an improvement, but i could be imagining that.
>> OK. What happens when the shutdown guard timer runs off
>
> Where do i find this timer and how do i control it?
In FreeBSD it can be controlled by a sysctl:
net.inet.sctp.shutdown_guard_time: 180
Don't know about Linux.
> Waiting for a few hours last time didnt seem to cure this.
OK.
But the suggested value is 3 Minutes...
Best regards
Michael
> I have one stuck right now, will give it another 5 minutes
> then close the client and see if things work.
>
> My sctp configs:
> ----
> net.sctp.addip_enable = 0
> net.sctp.addip_noauth_enable = 0
> net.sctp.addr_scope_policy = 1
> net.sctp.association_max_retrans = 10
> net.sctp.auth_enable = 0
> net.sctp.cookie_hmac_alg = sha1
> net.sctp.cookie_preserve_enable = 1
> net.sctp.default_auto_asconf = 0
> net.sctp.hb_interval = 30000
> net.sctp.max_autoclose = 8589934
> net.sctp.max_burst = 4
> net.sctp.max_init_retransmits = 8
> net.sctp.path_max_retrans = 5
> net.sctp.pf_retrans = 0
> net.sctp.prsctp_enable = 1
> net.sctp.rcvbuf_policy = 0
> net.sctp.rto_alpha_exp_divisor = 3
> net.sctp.rto_beta_exp_divisor = 2
> net.sctp.rto_initial = 3000
> net.sctp.rto_max = 60000
> net.sctp.rto_min = 1000
> net.sctp.rwnd_update_shift = 4
> net.sctp.sack_timeout = 200
> net.sctp.sctp_mem = 138363 184487 276726
> net.sctp.sctp_rmem = 4096 865500 4194304
> net.sctp.sctp_wmem = 4096 16384 4194304
> net.sctp.sndbuf_policy = 0
> net.sctp.valid_cookie_life = 60000
> ---------
>
> cheers,
> jamal
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (10 preceding siblings ...)
2013-12-14 20:06 ` Michael Tuexen
@ 2013-12-15 15:21 ` Jamal Hadi Salim
2013-12-15 15:32 ` Michael Tuexen
` (13 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-15 15:21 UTC (permalink / raw)
To: linux-sctp
On 12/14/13 15:06, Michael Tuexen wrote:
> On Dec 14, 2013, at 8:09 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> In FreeBSD it can be controlled by a sysctl:
> net.inet.sctp.shutdown_guard_time: 180
> Don't know about Linux.
The kernel seems to have it - but I can't see any knob exposed to
user space.
Can someone from the Linux world point me to some stats I can collect
in user space that will narrow this down? There has to be something.
I don't have the luxury of doing git bisect (rephrase: these kernels
are deployed, upgrading is almost not an option).
>> Waiting for a few hours last time didnt seem to cure this.
> OK.
>
> But the suggested value is 3 Minutes...
>
Understood - but I thought if I waited longer than 3 minutes then
that would cover it, no?
I tried a few other things from the server side to detect if the client
is gone:
- peek read (claimed all was good)
- getsockopt of some random value (claimed all was good)
All understandable given the socket state still seems to be intact.
So the only option that still seems left for me is to implement app-level
heartbeats to detect dead clients.
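For illustration only, the server-side bookkeeping for such app-level heartbeats could be as small as the sketch below. The struct and function names (`peer_liveness`, `peer_touch`, `peer_is_dead`) are hypothetical and not taken from any code in this thread; how heartbeats are carried on the wire and which timeout makes sense are left open.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-client record kept by the server. */
struct peer_liveness {
    time_t last_seen;  /* when the last app-level heartbeat arrived */
    time_t timeout;    /* seconds of silence tolerated before giving up */
};

/* Call whenever a heartbeat (or any other message) arrives from the client. */
void peer_touch(struct peer_liveness *p, time_t now)
{
    p->last_seen = now;
}

/* True once the client has been silent for longer than the timeout. */
bool peer_is_dead(const struct peer_liveness *p, time_t now)
{
    return (now - p->last_seen) > p->timeout;
}
```

The server would call `peer_is_dead()` from a periodic timer and tear the association down itself when it returns true, instead of waiting for a shutdown notification that may never come.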
cheers,
jamal
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (11 preceding siblings ...)
2013-12-15 15:21 ` Jamal Hadi Salim
@ 2013-12-15 15:32 ` Michael Tuexen
2013-12-15 19:08 ` Jamal Hadi Salim
` (12 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Michael Tuexen @ 2013-12-15 15:32 UTC (permalink / raw)
To: linux-sctp
On Dec 15, 2013, at 4:21 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 12/14/13 15:06, Michael Tuexen wrote:
>> On Dec 14, 2013, at 8:09 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>> In FreeBSD it can be controlled by a sysctl:
>> net.inet.sctp.shutdown_guard_time: 180
>> Don't know about Linux.
>
> The kernel seems to have it - but i cant see any knob exposed to
> user space.
>
> Can someone from the Linux world point me to some stats i can collect
> in user space that will narrow this down? There has to be something.
> I dont have the luxury of doing git bisect (rephrase: These kernels
> are deployed, upgrade is almost a no option).
>
>>> Waiting for a few hours last time didnt seem to cure this.
>> OK.
>>
>> But the suggested value is 3 Minutes...
>>
>
> Understood - but i thought if i waited longer than 3 minutes then
> that would cover it, no?
Sure, but you don't need to wait for hours...
>
> I tried a few other things from the server side to detect if the client
> is gone:
> - peek read (claimed all was good)
> - getsockopt some random value (claimed all was good)
> All understandable given the socket state seems to be still intact.
I think you said that HEARTBEATs were still flowing in both directions.
So the association is alive. You might want to inspect on the client
side the state of the association. That seems to be the side which
has a problem. What is netstat reporting?
Best regards
Michael
>
> So the only option that still seem left for me is to implement app level
> heartbeats to detect dead clients.
>
> cheers,
> jamal
>
>
>
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (12 preceding siblings ...)
2013-12-15 15:32 ` Michael Tuexen
@ 2013-12-15 19:08 ` Jamal Hadi Salim
2013-12-16 15:19 ` Vlad Yasevich
` (11 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-15 19:08 UTC (permalink / raw)
To: linux-sctp
On 12/15/13 10:32, Michael Tuexen wrote:
> On Dec 15, 2013, at 4:21 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> I think you said that HEARTBEATs were still flowing in both directions.
Yes.
> So the association is alive. You might want to inspect on the client
> side the state of the association. That seems to be the side which
> has a problem. What is netstat reporting?
I tried to narrow it down by putting both client and server on the
same host. The association is still alive, as evidenced when
I dumped /proc/net/sctp/assocs after the client closed.
It dies when I kill the server.
cheers,
jamal
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (13 preceding siblings ...)
2013-12-15 19:08 ` Jamal Hadi Salim
@ 2013-12-16 15:19 ` Vlad Yasevich
2013-12-17 13:49 ` Jamal Hadi Salim
` (10 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Vlad Yasevich @ 2013-12-16 15:19 UTC (permalink / raw)
To: linux-sctp
On 12/14/2013 10:04 AM, Jamal Hadi Salim wrote:
>
> Folks,
>
> I have a problem which manifests in kernels > 3.8. I am no sure how
> best to debug it.
Hi Jamal
The only thing I can think of that may be causing this is the
rcu-fication of the transport list in the associations.
You might be able to test by reverting:
771085d6bf3c52de29fc213e5bad07a82e57c23e
8c98653f05534acd1cb07ea4929702a3659177d1
45122ca26ced7fae41049326a3797a73f961db2e
-vlad
> I have looked at strace and dont see anything different between when it
> works (kernels <= 3.8) and when it doesnt (kernels > 3.8).
> When i dump /proc/net/sctp/assocs I can see in the non-working case
> the socket is still there - which means there is no way for the server
> to be notified.
> If kill the server, the socket disappears.
>
> Is there something else that would help narrow this down?
>
> cheers,
> jamal
>
> PS:
> Essentially I have a client app that does some nasty stuff (on purpose
> to test robustness). Client and server are connected locally within same
> machine.
> Client sends as fast as it can packets with partial reliability (timeout
> of about 100ms). The only time client checks for any kernel obsoleted
> msgs is when the send socket queue write will block.
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (14 preceding siblings ...)
2013-12-16 15:19 ` Vlad Yasevich
@ 2013-12-17 13:49 ` Jamal Hadi Salim
2013-12-17 15:11 ` Vlad Yasevich
` (9 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-17 13:49 UTC (permalink / raw)
To: linux-sctp
On 12/16/13 10:19, Vlad Yasevich wrote:
> On 12/14/2013 10:04 AM, Jamal Hadi Salim wrote:
> Hi Jamal
>
> The only thing I can think off that may be causing this is the
> rcu-fication of the transport list in the associations.
>
> You might be able to test by reverting:
> 771085d6bf3c52de29fc213e5bad07a82e57c23e
> 8c98653f05534acd1cb07ea4929702a3659177d1
> 45122ca26ced7fae41049326a3797a73f961db2e
Thanks Vlad. I will try newer kernels and revert these changes
incrementally and see if there's improvement.
BTW: What tools are useful to monitor sctp behavior?
For this case I have been watching activity using
/proc/net/sctp/assocs, strace and tcpdump.
cheers,
jamal
PS: I currently have no choice over which kernels are already deployed
and need the software feature, but I can make recommendations
for future upgrades. It would be good to narrow it down.
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (15 preceding siblings ...)
2013-12-17 13:49 ` Jamal Hadi Salim
@ 2013-12-17 15:11 ` Vlad Yasevich
2013-12-18 12:30 ` Jamal Hadi Salim
` (8 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Vlad Yasevich @ 2013-12-17 15:11 UTC (permalink / raw)
To: linux-sctp
On 12/17/2013 08:49 AM, Jamal Hadi Salim wrote:
> On 12/16/13 10:19, Vlad Yasevich wrote:
>> On 12/14/2013 10:04 AM, Jamal Hadi Salim wrote:
>
>> Hi Jamal
>>
>> The only thing I can think off that may be causing this is the
>> rcu-fication of the transport list in the associations.
>>
>> You might be able to test by reverting:
>> 771085d6bf3c52de29fc213e5bad07a82e57c23e
>> 8c98653f05534acd1cb07ea4929702a3659177d1
>> 45122ca26ced7fae41049326a3797a73f961db2e
>
>
> Thanks Vlad. I will try newer kernels and revert these changes
> incrementally and see if there's improvement.
>
> BTW: What tools are useful to monitor sctp behavior?
> For this case I have been watching activity using
> /proc/net/sctp/assocs, strace and tcpdump.
That's about it. You can also use stap. It's on the project
todo list to provide some good stap scripts.
The other thing that is useful for debugging things like this is
the object counting in SCTP (CONFIG_SCTP_DBG_OBJCNT), but it's a
kernel config option so not sure how easily you'd be able to use
it.
>
> cheers,
> jamal
>
> PS: I currently have no choice on what kernels already deployed
> that need the software feature but can make recommendations
> for future upgrades. It would be good to narrow it down.
The initial commit 45122ca26ced7fae41049326a3797a73f961db2e was
added to 3.8. In 3.9 we fixed a bug where the association closures
were being delayed too long and were preventing socket rebinding
(commit 8c98653f05534acd1cb07ea4929702a3659177d1).
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (16 preceding siblings ...)
2013-12-17 15:11 ` Vlad Yasevich
@ 2013-12-18 12:30 ` Jamal Hadi Salim
2013-12-18 17:58 ` Vlad Yasevich
` (7 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-18 12:30 UTC (permalink / raw)
To: linux-sctp
On 12/17/13 10:11, Vlad Yasevich wrote:
> That's about it. You can also use stap. It's on the project
> todo list to provide some good stap scripts.
>
Maybe a little overkill for me for now.
> The other thing that is useful for debugging things like this is
> the object counting in SCTP (CONFIG_SCTP_DBG_OBJCNT), but it's a
> kernel config option so not sure how easily you'd be able to use
> it.
>
Not easy - but for newer kernels I could do this.
>
> The initial commit 45122ca26ced7fae41049326a3797a73f961db2e was
> added to 3.8. In 3.9 we fixed a bug where the association closures
> were being delayed too long and were preventing socket rebinding
> (commit 8c98653f05534acd1cb07ea4929702a3659177d1).
>
It was a bit of a merge nightmare trying to revert the three individual
patches yesterday, so I am abandoning that effort for now.
I will try to find time over the holidays to do git bisect.
cheers,
jamal
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (17 preceding siblings ...)
2013-12-18 12:30 ` Jamal Hadi Salim
@ 2013-12-18 17:58 ` Vlad Yasevich
2013-12-19 14:26 ` Jamal Hadi Salim
` (6 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Vlad Yasevich @ 2013-12-18 17:58 UTC (permalink / raw)
To: linux-sctp
On 12/18/2013 07:30 AM, Jamal Hadi Salim wrote:
> On 12/17/13 10:11, Vlad Yasevich wrote:
>
>> That's about it. You can also use stap. It's on the project
>> todo list to provide some good stap scripts.
>>
>
> Maybe a little overkill for me for now.
>
>> The other thing that is useful for debugging things like this is
>> the object counting in SCTP (CONFIG_SCTP_DBG_OBJCNT), but it's a
>> kernel config option so not sure how easily you'd be able to use
>> it.
>>
>
> Not easy - but for newer kernels i could do this.
>
>
>>
>> The initial commit 45122ca26ced7fae41049326a3797a73f961db2e was
>> added to 3.8. In 3.9 we fixed a bug where the association closures
>> were being delayed too long and were preventing socket rebinding
>> (commit 8c98653f05534acd1cb07ea4929702a3659177d1).
>>
>
> It was a bit of a merge nightmare trying to revert the three individual
> patches yesterday, so i am abandoning that effort for now.
> I will try to find time over the holidays to do git bisect.
>
> cheers,
> jamal
Jamal,
could you post the output of /proc/net/sctp/assocs for the association
in this bad state?
Thanks
-vlad
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (18 preceding siblings ...)
2013-12-18 17:58 ` Vlad Yasevich
@ 2013-12-19 14:26 ` Jamal Hadi Salim
2013-12-19 17:24 ` Vlad Yasevich
` (5 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Jamal Hadi Salim @ 2013-12-19 14:26 UTC (permalink / raw)
To: linux-sctp
On 12/18/13 12:58, Vlad Yasevich wrote:
> On 12/18/2013 07:30 AM, Jamal Hadi Salim wrote:
> could you post an output for /proc/net/sctp/assocs for the association
> in this bad state?
It's not eye candy (lines wrap around). But here's one I just
reproduced with client/server on the same machine via lo. It requires
a few tries to make sure we have a SEND_FAILED for this to happen.
----
ASSOC SOCK STY SST ST HBKT ASSOC-ID TX_QUEUE RX_QUEUE UID INODE LPORT RPORT LADDRS <-> RADDRS HBINT INS OUTS MAXRT T1X T2X RTXC wmema wmemq sndbuf rcvbuf
0 0 2 7 4 29808 11 0 0 0 0 50902 30330 127.0.0.1 10.0.0.195 192.168.122.1 <-> *127.0.0.1 7500 10 10 10 0 0 0 1 0 212992 212992
0 0 2 1 3 31695 12 0 0 1000 55928 30330 50902 127.0.0.1 <-> *127.0.0.1 10.0.0.195 192.168.122.1 1500 10 10 10 0 0 0 1 0 212992 212992
---------
Server is at port 30330.
Actually I take back what I said earlier: when I write to this app
after it disconnects, I do see the socket queues grow for a short
period and then get drained to zero. I don't know where those messages
go. I monitor this by having the watch utility poll every second.
cheers,
jamal
* Re: undetected closed apps
2013-12-14 15:04 undetected closed apps Jamal Hadi Salim
` (19 preceding siblings ...)
2013-12-19 14:26 ` Jamal Hadi Salim
@ 2013-12-19 17:24 ` Vlad Yasevich
2013-12-19 18:16 ` Vlad Yasevich
` (4 subsequent siblings)
25 siblings, 0 replies; 27+ messages in thread
From: Vlad Yasevich @ 2013-12-19 17:24 UTC (permalink / raw)
To: linux-sctp
On 12/19/2013 09:26 AM, Jamal Hadi Salim wrote:
> On 12/18/13 12:58, Vlad Yasevich wrote:
>> On 12/18/2013 07:30 AM, Jamal Hadi Salim wrote:
>
>> could you post an output for /proc/net/sctp/assocs for the association
>> in this bad state?
>
> It's not eye candy (lines wrap around). But here's one i just
> reproduced with client/server on same machine via lo. It requires
> a few tries to make sure we have send failed for this to happen.
>
> ----
> SSOC SOCK STY SST ST HBKT ASSOC-ID TX_QUEUE RX_QUEUE UID INODE
> LPORT RPORT LADDRS <-> RADDRS HBINT INS OUTS MAXRT T1X T2X RTXC wmema
> wmemq sndbuf rcvbuf
> 0 0 2 7 4 29808 11 0 0 0 0
So, on this line socket state (SST) is 7 which is SCTP_SS_CLOSED. This
means that you performed a close() call. The association state (ST) is
4 which is SHUTDOWN_PENDING. This means that when you tried to close
the socket, the association thought that there was some pending data.
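
The decoding above can be sketched in a few lines of C. This is a minimal illustration, not kernel code: the helper names (assoc_state_name, sock_state_name, parse_assoc_states) are hypothetical, the ST names follow the order of the kernel's sctp_state_t enum, and only the socket states that are safe to name here (1, 7 and 10) are mapped. It assumes the record has been unwrapped onto a single line.

```c
#include <stdio.h>

/* ST column: association state, in the order of the kernel's sctp_state_t. */
static const char *assoc_state_name(int st)
{
	static const char *const names[] = {
		"CLOSED", "COOKIE_WAIT", "COOKIE_ECHOED", "ESTABLISHED",
		"SHUTDOWN_PENDING", "SHUTDOWN_SENT", "SHUTDOWN_RECEIVED",
		"SHUTDOWN_ACK_SENT",
	};

	if (st < 0 || st >= (int)(sizeof(names) / sizeof(names[0])))
		return "UNKNOWN";
	return names[st];
}

/* SST column: socket state. Only a few values are mapped in this sketch. */
static const char *sock_state_name(int sst)
{
	switch (sst) {
	case 1:  return "SCTP_SS_ESTABLISHED";
	case 7:  return "SCTP_SS_CLOSED";
	case 10: return "SCTP_SS_LISTENING";
	default: return "OTHER";
	}
}

/* Pull SST and ST out of one record: "ASSOC SOCK STY SST ST ...".
 * ASSOC and SOCK are read as strings because the kernel prints pointers there.
 * Returns 0 on success, -1 if the line does not have the expected shape.
 */
static int parse_assoc_states(const char *line, int *sst, int *st)
{
	char assoc[32], sock[32];
	int sty;

	if (sscanf(line, "%31s %31s %d %d %d", assoc, sock, &sty, sst, st) != 5)
		return -1;
	return 0;
}
```

Fed the first record from the dump, this reports SST 7 as SCTP_SS_CLOSED and ST 4 as SHUTDOWN_PENDING, which is the stuck combination discussed in this thread.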
I seem to remember you and I discussing this situation before, but I
can't find that thread.
I'll take another look at how PR interacts with queue state to see if
we can detect the proper empty state to send a SHUTDOWN.
However, what the above tells me is that you don't actually set
SO_LINGER on this socket. If you did, instead of attempting SHUTDOWN,
we would have sent an abort. That might be a good workaround until
we solve this "queue empty" problem.
-vlad
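
For reference, a minimal sketch of the SO_LINGER workaround suggested above. The helper name enable_abortive_close is hypothetical. It assumes the behaviour described in the SCTP sockets API (RFC 6458), where SO_LINGER enabled with a zero linger time makes close() behave like the ABORT primitive instead of starting a graceful SHUTDOWN; whether that suits a given application is a separate question, since any unsent data is discarded.

```c
#include <sys/socket.h>

/* Hypothetical helper: request an abortive close on fd.
 * With l_onoff = 1 and l_linger = 0, close() is expected to send an
 * ABORT rather than wait in SHUTDOWN_PENDING for queued data to drain.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int enable_abortive_close(int fd)
{
	struct linger lg = {
		.l_onoff = 1,   /* turn SO_LINGER on */
		.l_linger = 0,  /* zero timeout: abort instead of graceful shutdown */
	};

	return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

The call would go on the client socket any time before close(), for example right after socket() or connect().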
> 50902 30330 127.0.0.1 10.0.0.195 192.168.122.1 <-> *127.0.0.1 7500
> 10 10 10 0 0 0 1 0 212992 212992
> 0 0 2 1 3 31695 12 0 0 1000 55928
> 30330 50902 127.0.0.1 <-> *127.0.0.1 10.0.0.195 192.168.122.1 1500
> 10 10 10 0 0 0 1 0 212992 212992
> ---------
>
> Server is at port 30330.
>
> Actually i take back what i said earlier: When i write to this app,
> after it disconnects, i do see the socket queues grow for a short
> period and then get drained to zero. I dont know where those messages
> go. I monitor this by having the watch utility polling every second.
>
> cheers,
> jamal
* Re: undetected closed apps
From: Vlad Yasevich @ 2013-12-19 18:16 UTC (permalink / raw)
To: linux-sctp
On 12/19/2013 12:24 PM, Vlad Yasevich wrote:
> On 12/19/2013 09:26 AM, Jamal Hadi Salim wrote:
>> On 12/18/13 12:58, Vlad Yasevich wrote:
>>> On 12/18/2013 07:30 AM, Jamal Hadi Salim wrote:
>>
>>> could you post an output for /proc/net/sctp/assocs for the association
>>> in this bad state?
>>
>> It's not eye candy (lines wrap around). But here's one i just
>> reproduced with client/server on same machine via lo. It requires
>> a few tries to make sure we have send failed for this to happen.
>>
>> ----
>> SSOC SOCK STY SST ST HBKT ASSOC-ID TX_QUEUE RX_QUEUE UID INODE
>> LPORT RPORT LADDRS <-> RADDRS HBINT INS OUTS MAXRT T1X T2X RTXC wmema
>> wmemq sndbuf rcvbuf
>> 0 0 2 7 4 29808 11 0 0 0 0
>
> So, on this line socket state (SST) is 7 which is SCTP_SS_CLOSED. This
> means that you performed a close() call. The association state (ST) is
> 4 which is SHUTDOWN_PENDING. This means that when you tried to close
> the socket, the association thought that there was some pending data.
>
> I seem to remember you and I discussing this situation before, but I
> can't find that thread.
>
> I'll take another look at how PR interacts with queue state to see if
> we can detect the proper empty state to send a SHUTDOWN.
>
So, I took another look, and it looks like there is an issue when
chunks are abandoned in sctp_outq_flush(). We simply delete
the chunks, so it is possible to drain the queue without
ever setting the empty state. Since we didn't send anything, we
wouldn't get any SACKs, so the queue would never be marked empty
and we would be stuck in the SHUTDOWN_PENDING state, just as you
observe.
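
To make the failure mode concrete, here is a toy user-space model. It is only a sketch under stated assumptions: struct toy_outq and the toy_* functions are hypothetical, counters stand in for the kernel's chunk lists, and the model assumes the flag is cleared when data is queued and, in the old behaviour, is only ever set again by SACK processing (not modelled here).

```c
#include <stdbool.h>

/* Toy stand-in for struct sctp_outq; counters replace the kernel lists. */
struct toy_outq {
	int pending;      /* queued but not yet sent (like out_chunk_list) */
	int retransmit;   /* waiting for retransmission */
	int transmitted;  /* sent, waiting for a SACK */
	bool empty;       /* cached flag checked before sending SHUTDOWN */
};

/* Queueing data clears the cached flag (assumption of this model). */
static void toy_enqueue(struct toy_outq *q, int n)
{
	q->pending += n;
	q->empty = false;
}

/* Drop 'abandoned' expired chunks, send whatever is left.
 * Returns the number of chunks actually sent.
 */
static int toy_drop_and_send(struct toy_outq *q, int abandoned)
{
	int sent;

	if (abandoned > q->pending)
		abandoned = q->pending;
	q->pending -= abandoned;      /* abandoned chunks are simply deleted */
	sent = q->pending;
	q->transmitted += sent;
	q->pending = 0;
	return sent;
}

/* Old behaviour: the flag is touched only when something was sent. */
static void toy_flush_old(struct toy_outq *q, int abandoned)
{
	if (toy_drop_and_send(q, abandoned) > 0)
		q->empty = false;
}

/* Patched behaviour: recompute the flag from the queue contents. */
static void toy_flush_fixed(struct toy_outq *q, int abandoned)
{
	toy_drop_and_send(q, abandoned);
	q->empty = (q->pending == 0 && q->retransmit == 0 &&
		    q->transmitted == 0);
}
```

If three chunks are queued and all three are abandoned, toy_flush_old leaves empty false even though nothing remains, while toy_flush_fixed sets it to true. That mirrors the idea of the patch, which recomputes the flag at the end of the flush.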
Can you try this patch to see if it resolves things? You can play with
netem values on lo to trigger PR faster.
Thanks
-vlad
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index b6b09f3..31c8124 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -721,6 +721,7 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 	int error = 0;
 	int start_timer = 0;
 	int one_packet = 0;
+	int empty = 1;
 
 	/* These transports have chunks to send. */
 	struct list_head transport_list;
@@ -1064,8 +1065,6 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 
 				sctp_transport_reset_timers(transport);
 
-				q->empty = 0;
-
 				/* Only let one DATA chunk get bundled with a
 				 * COOKIE-ECHO chunk.
 				 */
@@ -1081,12 +1080,13 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 
 sctp_flush_out:
 
+	empty = (list_empty(&q->out_chunk_list) &&
+		 list_empty(&q->retransmit));
+
 	/* Before returning, examine all the transports touched in
-	 * this call. Right now, we bluntly force clear all the
-	 * transports. Things might change after we implement Nagle.
-	 * But such an examination is still required.
-	 *
-	 * --xguo
+	 * this call. If anything is still in the packet of the transport,
+	 * flush it now. Also, make sure that if we sent any DATA, we
+	 * correctly track the queue empty state.
 	 */
 	while ((ltransport = sctp_list_dequeue(&transport_list)) != NULL ) {
 		struct sctp_transport *t = list_entry(ltransport,
@@ -1098,7 +1098,11 @@ sctp_flush_out:
 
 		/* Clear the burst limited state, if any */
 		sctp_transport_burst_reset(t);
+
+		if (empty)
+			empty = empty && list_empty(&t->transmitted);
 	}
+	q->empty = empty;
 
 	return error;
 }
* Re: undetected closed apps
From: Jamal Hadi Salim @ 2013-12-20 12:23 UTC (permalink / raw)
To: linux-sctp
On 12/19/13 12:24, Vlad Yasevich wrote:
> On 12/19/2013 09:26 AM, Jamal Hadi Salim wrote:
>
> So, on this line socket state (SST) is 7 which is SCTP_SS_CLOSED. This
> means that you performed a close() call. The association state (ST) is
> 4 which is SHUTDOWN_PENDING. This means that when you tried to close
> the socket, the association thought that there was some pending data.
>
> I seem to remember you and I discussing this situation before, but I
> can't find that thread.
>
> I'll take another look at how PR interacts with queue state to see if
> we can detect the proper empty state to send a SHUTDOWN.
>
> However, what the above tells me is that you don't actually set
> SO_LINGER on this socket. If you did, instead of attempting SHUTDOWN,
> we would have sent an abort. That might be a good workaround until
> we solve this "queue empty" problem.
>
I will give this a try when I get to the office. I am certain we linger
on the server. On the client side, at one point we turned off
heartbeats, and that typically goes with linger on. I will double
check.
cheers,
jamal
* Re: undetected closed apps
From: Jamal Hadi Salim @ 2013-12-20 12:29 UTC (permalink / raw)
To: linux-sctp
On 12/19/13 13:16, Vlad Yasevich wrote:
>
> So, I took another look and it looks like there is an issue when the
> chunks are being abandoned in sctp_outq_flush(). We simply delete
> the chunks and it is possible that we can drain our queue without
> ever setting the empty state. Since we didn't sent anything, we
> wouldn't get any SACKs, thus the queue would never be set as empty
> and we would be stuck in the SHUTDOWN_PENDING state, just like you
> observe.
>
Thanks Vlad.
> Can you try this patch to see if it resolves things. Can play with
> netem values on lo to trigger PR faster.
>
Oh, I have many ways to trigger it; I never thought of netem.
I will let you know how it goes with the patch later today. I am hoping
linger on the client side resolves it for me so I don't depend
on a specific patch.
cheers,
jamal
* Re: undetected closed apps
From: Jamal Hadi Salim @ 2013-12-20 17:00 UTC (permalink / raw)
To: linux-sctp
On 12/20/13 07:23, Jamal Hadi Salim wrote:
> On 12/19/13 12:24, Vlad Yasevich wrote:
> I will give this a try when i get to the office. I am certain we linger
> on the server. On client side, at one point we turned off heartbeats on
> the client side and that typically goes with linger on. I will double
> check.
>
Linger was on only in the server->client direction. Turning it on in
the other direction seems to have fixed the issue. I am still hammering
at it to see if we can reproduce it, but it is looking good so far. Thanks!
Building a kernel image to test the patch.
cheers,
jamal
* Re: undetected closed apps
From: Jamal Hadi Salim @ 2013-12-20 18:44 UTC (permalink / raw)
To: linux-sctp
On 12/19/13 13:16, Vlad Yasevich wrote:
I can confirm this fixes the issue. Thanks!
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
cheers,
jamal
> diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
> index b6b09f3..31c8124 100644
> --- a/net/sctp/outqueue.c
> +++ b/net/sctp/outqueue.c
> @@ -721,6 +721,7 @@ static int sctp_outq_flush(struct sctp_outq *q, int
> rtx_timeout)
> int error = 0;
> int start_timer = 0;
> int one_packet = 0;
> + int empty = 1;
>
> /* These transports have chunks to send. */
> struct list_head transport_list;
> @@ -1064,8 +1065,6 @@ static int sctp_outq_flush(struct sctp_outq *q,
> int rtx_timeout)
>
> sctp_transport_reset_timers(transport);
>
> - q->empty = 0;
> -
> /* Only let one DATA chunk get bundled with a
> * COOKIE-ECHO chunk.
> */
> @@ -1081,12 +1080,13 @@ static int sctp_outq_flush(struct sctp_outq *q,
> int rtx_timeout)
>
> sctp_flush_out:
>
> + empty = (list_empty(&q->out_chunk_list) &&
> + list_empty(&q->retransmit));
> +
> /* Before returning, examine all the transports touched in
> - * this call. Right now, we bluntly force clear all the
> - * transports. Things might change after we implement Nagle.
> - * But such an examination is still required.
> - *
> - * --xguo
> + * this call. If anything is still in the packet of the transport,
> + * flush it now. Also, make sure that if we sent any DATA, we
> + * correctly track the queue empty state.
> */
> while ((ltransport = sctp_list_dequeue(&transport_list)) != NULL ) {
> struct sctp_transport *t = list_entry(ltransport,
> @@ -1098,7 +1098,11 @@ sctp_flush_out:
>
> /* Clear the burst limited state, if any */
> sctp_transport_burst_reset(t);
> +
> + if (empty)
> + empty = empty && list_empty(&t->transmitted);
> }
> + q->empty = empty;
>
> return error;
> }