public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Exploring sigqueue leak (was: Strange 'zombie' problem both in 2.4 and 2.6)
@ 2004-04-02 12:05 Nikita V. Youshchenko
  2004-04-02 22:13 ` Denis Vlasenko
  0 siblings, 1 reply; 2+ messages in thread
From: Nikita V. Youshchenko @ 2004-04-02 12:05 UTC (permalink / raw)
  To: linux-kernel

Hello.

Yesterday I posted to linux-kernel describind a problem that caused both 
2.4 and 2.6 kernels to leave lots of zombies.
I analysed the problem and found it is caused by overflow of 'sigqueue' 
slab cache (I may be wrong in terminology here). Details are in my 
previous post.

Now I'm looking for advice from somebody who understands how stuff in 
linux/signal.c works.

[the following ins about unpatched 2.6.4 kernel]

Seems that 'sigqueue' objects are allocated exactly in 2 places: in 
__sigqueue_alloc() and in send_signal().

__sigqueue_alloc() seems to be used only by sigqueue_alloc();
sigqueue_alloc() seems only to be used by alloc_posix_timer() from 
kernel/posix-timers.c;
alloc_posix_timer() is a static function used only by sys_timer_create() 
syscall handler.
This syscall is not one widely used; I'm not sure it it is actually used by 
anything running here.

So the leak seems to happen through allocation of 'sigqueue' objects in 
send_signal().
However, if I understand correctly, object allocated here keeps information 
about signal being transferred, and lives as long as this tranfer is in 
progress.
Signal transfer is somewhat very fast.
So, if everything is working ok, usually zero 'sigqueue' objects should be 
allocated.

Seems that it is possible to find out how many 'sigqueue' objects are 
currently allocated by running
   grep sigqueue /proc/slabinfo
and looking at the first number of output.
So seems that almost always the first number in the output of the above 
command should be zero.
And that's exactly what happens on all hosts here expect the server.

Just after reboot and starting all services, the number on the server was 
135. Later it changes, sometimes getting down to about 40, sometimes 
rising to something about 200. It is never becoming zero.

And after some time, when happens what I've described in the previous 
postings, it gets up to 1024 and stays there forever, causing some 
user-space visible misbehaves. The zombie keeping is not the only one. 
Other is strange and definitly incorrect failures inside threaded 
programs.

I thought that some 'sigqueue' objects could correspond to pending signals 
that are blocked by their reciever processes. However, I can't see any 
pending signals running
   grep SigPnd /proc/*/status
- all displayed masks are zeroes, while there are more than 100 allocated 
'sigqueue' objects.

So that's what I found.
Please correct me if I'm wrong, and/or advice where to look for the leak.

Nikita


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Exploring sigqueue leak (was: Strange 'zombie' problem both in 2.4 and 2.6)
  2004-04-02 12:05 Exploring sigqueue leak (was: Strange 'zombie' problem both in 2.4 and 2.6) Nikita V. Youshchenko
@ 2004-04-02 22:13 ` Denis Vlasenko
  0 siblings, 0 replies; 2+ messages in thread
From: Denis Vlasenko @ 2004-04-02 22:13 UTC (permalink / raw)
  To: Nikita V. Youshchenko, linux-kernel

On Friday 02 April 2004 15:05, Nikita V. Youshchenko wrote:
> Hello.
>
> Yesterday I posted to linux-kernel describind a problem that caused both
> 2.4 and 2.6 kernels to leave lots of zombies.
> I analysed the problem and found it is caused by overflow of 'sigqueue'
> slab cache (I may be wrong in terminology here). Details are in my
> previous post.
>
> Now I'm looking for advice from somebody who understands how stuff in
> linux/signal.c works.
>
> [the following ins about unpatched 2.6.4 kernel]
>
> Seems that 'sigqueue' objects are allocated exactly in 2 places: in
> __sigqueue_alloc() and in send_signal().
>
> __sigqueue_alloc() seems to be used only by sigqueue_alloc();
> sigqueue_alloc() seems only to be used by alloc_posix_timer() from
> kernel/posix-timers.c;
> alloc_posix_timer() is a static function used only by sys_timer_create()
> syscall handler.
> This syscall is not one widely used; I'm not sure it it is actually used by
> anything running here.
>
> So the leak seems to happen through allocation of 'sigqueue' objects in
> send_signal().
> However, if I understand correctly, object allocated here keeps information
> about signal being transferred, and lives as long as this tranfer is in
> progress.
> Signal transfer is somewhat very fast.
> So, if everything is working ok, usually zero 'sigqueue' objects should be
> allocated.
>
> Seems that it is possible to find out how many 'sigqueue' objects are
> currently allocated by running
>    grep sigqueue /proc/slabinfo
> and looking at the first number of output.
> So seems that almost always the first number in the output of the above
> command should be zero.
> And that's exactly what happens on all hosts here expect the server.

Sorry, still no real help from me but maybe some useful data points:

On my home box:

# grep sigqueue /proc/slabinfo
sigqueue              18     25    156   25    1 : tunables   32   16    0 : slabdata      1      1      0 : globalstat      32     18     1    0    0    0   57 : cpustat  33581      2  33583      0
Another home box:
# grep sigqueue /proc/slabinfo
sigqueue               9     25    156   25    1 : tunables   32   16    0 : slabdata      1      1      0 : globalstat      16     16     1    0    0    0   57 : cpustat  71984      1  71985
Third box:
# grep sigqueue /proc/slabinfo
sigqueue              16     25    156   25    1 : tunables   32   16    0 : slabdata      1      1      0 : globalstat      16     16     1    0    0    0   57 : cpustat 1388121      1 1388122      0

No more 2.6 boxes for you here ;)

So, not exactly zero but small.

Now, 2.4 boxes:

# grep sigqueue /proc/slabinfo
sigqueue               0     29    132    0    1    1
# grep sigqueue /proc/slabinfo
sigqueue               8     29    132    1    1    1
# grep sigqueue /proc/slabinfo
sigqueue               0      0    132    0    0    1 :  252  126

Smaller.
--
vda


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2004-04-02 22:13 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-04-02 12:05 Exploring sigqueue leak (was: Strange 'zombie' problem both in 2.4 and 2.6) Nikita V. Youshchenko
2004-04-02 22:13 ` Denis Vlasenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox