Michal Hocko wrote:
> On Thu 06-11-08 08:48:45, Vlad Yasevich wrote:
>> Michal Hocko wrote:
>>> Hi,
>>> we are experiencing BUG and hang conditions with a simple echo
>>> client-server SCTP application. It looks like a race condition which is
>>> rather hard to trigger.
>>>
>>> BUG traces usually come with sctp code in the code paths (see traces
>>> attached) but sometimes the machine simply hangs without any traces at
>>> all. It obviously depends on the kernel configuration and HW (different
>>> machines come with different traces).
>>>
>>> The initial report of this issue was against the SLES10SP2 (2.6.16.60)
>>> kernel but we were able to reproduce it with the upstream Linus tree as
>>> well (2.6.{25,26,27,75fa67706cce5272bcfc51ed646f2da21f3bdb6e}).
>>> We were able to reproduce it _only_ with 2 _directly_ connected machines
>>> with a 1 Gbit wired Ethernet connection (no BUG condition occurred on a
>>> single machine, nor with a connection through at least one switch, nor
>>> at 100 Mbit). The original report states that it takes from minutes to
>>> hours to trigger this issue but it takes hours in my testing environment.
>>>
>>> At first we thought that this could be caused by SO_REUSEADDR used by the
>>> server application, but I was able to reproduce it without it as well.
>>> We are also not 100% sure that sctp is the culprit here, but almost all
>>> traces contain some sctp paths so it smells suspicious.
>>>
>>> This may have security implications so I am not attaching the crash
>>> application directly to this email (please write to me and I will send it
>>> directly, or let me know if it is safe to publish it on the mailing
>>> list).
>>>
>>> Thanks for any help/hints and let me know if you need some more
>>> information or want me to test some patches.
>>>
>>> Best regards
>>>
>> In the earlier kernels there were a few bugs in the accept code paths that
>> had to do with locking the newly created socket correctly as well as locking
>> the port hash table during the migration of the ports.
>> Both of those contributed to crashes at odd points in time and sometimes
>> even to stack and memory corruptions.
>>
>> I'll take a look at what's causing skb overflow in 2.6.28.
>
> Is there any update (a patch to test)? This is starting to be critical
> from our POV.
> Do you have any ETA?
> Is there some way we can help here?
>

Which version in particular is most critical? Just remember that 2.6.16 is
very old and there have been a lot of fixes that address critical issues.

For 2.6.28, can you apply the attached patch and post the dmesg output? Also,
if it's possible to capture a kdump, that would make things much easier.

Thanks
-vlad