* Can pid be reused ?
@ 2014-10-22 2:55 Loic Dachary
2014-10-22 22:21 ` David Zafman
0 siblings, 1 reply; 6+ messages in thread
From: Loic Dachary @ 2014-10-22 2:55 UTC (permalink / raw)
To: Ceph Development
Hi,
Something strange happens on Fedora 20 with Linux 3.11.10-301.fc20.x86_64. When running make -j8 check on https://github.com/ceph/ceph/pull/2750, a process gets killed from time to time. For instance, it shows up as
TEST_erasure_crush_stripe_width: 124: stripe_width=4096
TEST_erasure_crush_stripe_width: 125: ./ceph osd pool create pool_erasure 12 12 erasure
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
./test/mon/osd-pool-create.sh: line 120: 27557 Killed ./ceph osd pool create pool_erasure 12 12 erasure
TEST_erasure_crush_stripe_width: 126: ./ceph --format json osd dump
TEST_erasure_crush_stripe_width: 126: tee osd-pool-create/osd.json
in the test logs. Note the "27557 Killed". I originally thought that some ulimit had been exceeded, so I set the hard and soft limits to very generous or unlimited values:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515069
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 400000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Benoit Canet suggested that I install systemtap ( https://www.sourceware.org/systemtap/wiki/SystemtapOnFedora ) and run https://sourceware.org/systemtap/examples/process/sigkill.stp to watch what was sending the kill signal. It showed the following:
...
SIGKILL was sent to ceph-osd (pid:27557) by vstart_wrapper. uid:1001
SIGKILL was sent to python (pid:27557) by vstart_wrapper. uid:1001
....
which suggests that pid 27557, first used by ceph-osd, was reused for the python script that was killed above. The script that kills daemons is very aggressive and sends kill -9 to the pid to check whether it is really dead
https://github.com/ceph/ceph/blob/giant/src/test/mon/mon-test-helpers.sh#L64
which would explain the problem.
However, as Dan Mick suggests, reusing a pid so quickly could break a number of things and is surprising behavior. Maybe something else is going on. A loop creating processes sees their pids increase without being reused.
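The kind of loop mentioned above can be sketched as follows. This is only an illustration, assuming a POSIX shell on Linux; the loop count is arbitrary. It records the pid of each short-lived background process and prints pid_max, the value at which the kernel's sequential pid allocation wraps around.

```shell
# Spawn a few short-lived background processes and record their pids.
pids=""
for i in 1 2 3 4 5; do
    true &              # a process that exits immediately
    pid=$!              # $! holds the pid of the last background job
    wait "$pid"         # reap it, which frees the pid for later reuse
    pids="$pids $pid"
done
echo "spawned pids:$pids"

# Pids are normally handed out sequentially and wrap around at pid_max,
# so a freed pid only comes back after the whole range has been used.
if [ -r /proc/sys/kernel/pid_max ]; then
    echo "pid_max: $(cat /proc/sys/kernel/pid_max)"
fi
```

A short loop like this will usually show increasing pids. A long parallel build that forks many thousands of processes can still cycle through the whole range, at which point a stale pid may belong to an unrelated process.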
Any idea about what is going on would be much appreciated :-)
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* Re: Can pid be reused ?
2014-10-22 2:55 Can pid be reused ? Loic Dachary
@ 2014-10-22 22:21 ` David Zafman
2014-10-22 22:43 ` Sage Weil
2014-10-22 22:46 ` Loic Dachary
0 siblings, 2 replies; 6+ messages in thread
From: David Zafman @ 2014-10-22 22:21 UTC (permalink / raw)
To: Loic Dachary; +Cc: Ceph Development
I just realized what it is. The way killall is used when stopping a vstart cluster, is to kill all processes by name! You can't stop vstarted tests running in parallel.
David Zafman
Senior Developer
http://www.inktank.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Can pid be reused ?
2014-10-22 22:21 ` David Zafman
@ 2014-10-22 22:43 ` Sage Weil
2014-10-22 22:51 ` David Zafman
2014-10-22 22:46 ` Loic Dachary
1 sibling, 1 reply; 6+ messages in thread
From: Sage Weil @ 2014-10-22 22:43 UTC (permalink / raw)
To: David Zafman; +Cc: Loic Dachary, Ceph Development
On Wed, 22 Oct 2014, David Zafman wrote:
> I just realized what it is. The way killall is used when stopping a
> vstart cluster, is to kill all processes by name! You can't stop
> vstarted tests running in parallel.
Ah. FWIW I think we should avoid using stop.sh whenever possible and
instead do ./init-ceph stop (which does an orderly shutdown via pid
files).
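A minimal sketch of a pid-file based stop, for illustration only. It is not the init-ceph implementation: the helper name stop_daemons, the *.pid naming, and the stand-in "daemon" (a sleep) are all assumptions.

```shell
# stop_daemons DIR: send SIGTERM to every process that has a pid file in DIR.
# Hypothetical helper, not taken from init-ceph.
stop_daemons() {
    dir=$1
    for pidfile in "$dir"/*.pid; do
        [ -e "$pidfile" ] || continue       # the glob matched nothing
        pid=$(cat "$pidfile")
        kill "$pid" 2>/dev/null || true     # SIGTERM asks for an orderly exit
        rm -f "$pidfile"
    done
}

# Usage: start a stand-in daemon, record its pid, then stop it.
run_dir=$(mktemp -d)
sleep 300 &
daemon_pid=$!
echo "$daemon_pid" > "$run_dir/fake-osd.pid"
stop_daemons "$run_dir"
wait "$daemon_pid" 2>/dev/null || true      # reap the child we just stopped
```

Only processes listed in that one directory are signalled, so other runs with their own directories are left alone.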
sage
* Re: Can pid be reused ?
2014-10-22 22:21 ` David Zafman
2014-10-22 22:43 ` Sage Weil
@ 2014-10-22 22:46 ` Loic Dachary
1 sibling, 0 replies; 6+ messages in thread
From: Loic Dachary @ 2014-10-22 22:46 UTC (permalink / raw)
To: David Zafman; +Cc: Ceph Development
Hi David,
On 22/10/2014 15:21, David Zafman wrote:
> I just realized what it is. The way killall is used when stopping a vstart cluster, is to kill all processes by name! You can't stop vstarted tests running in parallel.
I discovered this indeed. But then, instead of using ./stop.sh, I use
https://github.com/dachary/ceph/blob/6e6ddfbdc0a178a6318a86fd9984265bbe40ca3d/src/test/mon/mon-test-helpers.sh#L62
in the context of
https://github.com/dachary/ceph/blob/6e6ddfbdc0a178a6318a86fd9984265bbe40ca3d/src/test/vstart_wrapper.sh#L28
which makes it kill only the processes with a pid file in the relevant directory. The problem below showed up because it was doing an aggressive kill -9 to check whether the process still exists.
https://github.com/dachary/ceph/commit/6e6ddfbdc0a178a6318a86fd9984265bbe40ca3d
Now that it's replaced with a kill -0, all is well.
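The difference between the two probes can be sketched as follows. This is only an illustration: the helper name is_alive and the stand-in process (a sleep) are assumptions, not code from mon-test-helpers.sh.

```shell
# is_alive PID: succeed if a process with this pid exists and may be signalled.
# Signal 0 only performs the existence and permission checks; nothing is
# delivered, so a process that inherited a recycled pid is left unharmed.
# A kill -9 used as a probe would instead kill whoever owns the pid now.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Usage: probing a stand-in process twice leaves it running.
sleep 300 &
victim=$!
is_alive "$victim" && echo "pid $victim is alive"
is_alive "$victim" && echo "pid $victim is still alive after the probe"

kill "$victim"                          # explicit SIGTERM when it must go
wait "$victim" 2>/dev/null || true      # reap the child
is_alive "$victim" || echo "pid $victim is gone"
```

Note that a terminated child which has not been reaped yet still counts as existing for kill -0, which is why the sketch calls wait before the last probe.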
For the record, the problem can be reliably reproduced by running make -j8 check from https://github.com/dachary/ceph/commit/c02bb8a5afef8669005c78b2b4f2f762cda4ee73 and waiting less than one hour (and probably more than 30 minutes) on a machine with 24 cores, 64GB RAM and a 250GB SSD.
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* Re: Can pid be reused ?
2014-10-22 22:43 ` Sage Weil
@ 2014-10-22 22:51 ` David Zafman
2014-10-22 22:57 ` Loic Dachary
0 siblings, 1 reply; 6+ messages in thread
From: David Zafman @ 2014-10-22 22:51 UTC (permalink / raw)
To: Loic Dachary; +Cc: Ceph Development, Sage Weil
> On Oct 22, 2014, at 3:43 PM, Sage Weil <sage@newdream.net> wrote:
>
> On Wed, 22 Oct 2014, David Zafman wrote:
>> I just realized what it is. The way killall is used when stopping a
>> vstart cluster, is to kill all processes by name! You can't stop
>> vstarted tests running in parallel.
>
> Ah. FWIW I think we should avoid using stop.sh whenever possible and
> instead do ./init-ceph stop (which does an orderly shutdown via pid
> files).
>
> sage
Actually, vstart.sh can’t create 2 independent clusters anyway, so it kills any existing processes. Probably vstart.sh is what would have killed the processes in a parallel make check.
David
* Re: Can pid be reused ?
2014-10-22 22:51 ` David Zafman
@ 2014-10-22 22:57 ` Loic Dachary
0 siblings, 0 replies; 6+ messages in thread
From: Loic Dachary @ 2014-10-22 22:57 UTC (permalink / raw)
To: David Zafman; +Cc: Ceph Development
On 22/10/2014 15:51, David Zafman wrote:
>
>> On Oct 22, 2014, at 3:43 PM, Sage Weil <sage@newdream.net> wrote:
>>
>> On Wed, 22 Oct 2014, David Zafman wrote:
>>> I just realized what it is. The way killall is used when stopping a
>>> vstart cluster, is to kill all processes by name! You can't stop
>>> vstarted tests running in parallel.
>>
>> Ah. FWIW I think we should avoid using stop.sh whenever possible and
>> instead do ./init-ceph stop (which does an orderly shutdown via pid
>> files).
>>
>> sage
>
> Actually, vstart.sh can’t create 2 independent clusters anyway, so it kills any existing processes.
It can, actually: if given a different CEPH_DIR, everything is contained within that specific directory.
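A sketch of that idea, for illustration only. The test names and the mktemp template are assumptions, and the vstart.sh call is left commented out because its arguments are not shown in this thread.

```shell
# Give each test run its own directory, so that data, logs and pid files
# of one run never collide with those of another.
dirs=""
for test_name in test-a test-b; do
    dir=$(mktemp -d "${TMPDIR:-/tmp}/$test_name.XXXXXX")
    echo "$test_name would run with CEPH_DIR=$dir"
    # CEPH_DIR=$dir ./vstart.sh ...
    dirs="$dirs $dir"
done
```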
Cheers
> Probably vstart.sh is what would have killed the processes in a parallel make check.
>
> David
>
--
Loïc Dachary, Artisan Logiciel Libre