* "xm save" trouble -- deadlock?
@ 2005-11-01 16:43 Gerd Knorr
2005-11-01 17:15 ` Gerd Knorr
0 siblings, 1 reply; 13+ messages in thread
From: Gerd Knorr @ 2005-11-01 16:43 UTC (permalink / raw)
To: xen-devel
Hi,
"xm save" doesn't work for me, the command just blocks forever. In the
process list it looks like this:
6567 ? Ssl 0:00 python /usr/sbin/xend restart
7978 ? SL 0:00 \_ /usr/lib/xen/bin/xc_save 14 17 10 0 0 0
xc_save is blocked, in a write call to stderr:
master-xen root /vm/ttylinux# strace -p7978
Process 7978 attached - interrupt to quit
write(2, "FNI 28 : [10000004,815] pte=2b92"..., 80 <unfinished ...>
Process 7978 detached
stderr is a pipe to xend:
master-xen root /vm/ttylinux# ll /proc/7978/fd/2
l-wx------ 1 root root 64 Nov 1 17:25 /proc/7978/fd/2 -> pipe:[40419]
master-xen root /vm/ttylinux# ll /proc/*/fd/* | grep 40419
lr-x------ 1 root root 64 Nov 1 17:25 /proc/6567/fd/22 -> pipe:[40419]
l-wx------ 1 root root 64 Nov 1 17:25 /proc/7978/fd/2 -> pipe:[40419]
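The /proc lookup above can also be scripted. The following is a minimal sketch in Python, assuming a Linux-style /proc and enough privileges to read other processes' fd directories; the function name find_pipe_holders is hypothetical and not part of xend or any Xen tool.

```python
import os

def find_pipe_holders(inode):
    """Return (pid, fd) pairs for every open fd that points at pipe:[inode].

    Hypothetical helper: scans /proc/<pid>/fd/* the same way the
    'll /proc/*/fd/* | grep' pipeline does.
    """
    target = "pipe:[%d]" % inode
    holders = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process went away, or we may not look at it
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were scanning
            if link == target:
                holders.append((int(pid), int(fd)))
    return holders
```

Called with the inode from the listing (40419 here), it would report both ends of the pipe, which is how the reader side in xend shows up.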
xend in turn doesn't read from the pipe but is waiting for some lock:
master-xen root /vm/ttylinux# strace -p6567
Process 6567 attached - interrupt to quit
futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
Process 6567 detached
Any ideas what is going on here?
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-01 16:43 "xm save" trouble -- deadlock? Gerd Knorr
@ 2005-11-01 17:15 ` Gerd Knorr
2005-11-01 18:54 ` Ewan Mellor
2005-11-01 18:58 ` Ewan Mellor
0 siblings, 2 replies; 13+ messages in thread
From: Gerd Knorr @ 2005-11-01 17:15 UTC (permalink / raw)
To: Gerd Knorr; +Cc: xen-devel
> xend in turn doesn't read from the pipe but is waiting for some lock:
>
> master-xen root /vm/ttylinux# strace -p6567
> Process 6567 attached - interrupt to quit
> futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
> Process 6567 detached
Oh, xend is multithreaded:
master-xen root /vm/ttylinux# ls /proc/6567/task
. .. 6567 6568 6569 6570 6571 6581 7977
7977 seems to be responsible for the xc_save and does this:
master-xen root /vm/ttylinux# strace -p7977
Process 7977 attached - interrupt to quit
read(20, <unfinished ...>
Process 7977 detached
fd 20 is the other end of the *stdout* pipe, whereas xc_save writes
stuff to *stderr*. Hmm. Maybe xend causes the deadlock by simply
reading from the wrong file handle?
Some of the other threads behave in a strange way as well:
master-xen root /vm/ttylinux# strace -p6568
Process 6568 attached - interrupt to quit
select(4, [3], [], [], {0, 960000}) = 0 (Timeout)
futex(0x80e53b8, FUTEX_WAKE, 1) = 0
accept(3, 0x408193f8, [110]) = -1 EAGAIN (Resource temporarily unavailable)
There is no point in calling accept(3) unless select() flags file handle
#3 as readable.
Looks like I'll go browse some python code tomorrow ...
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-01 17:15 ` Gerd Knorr
@ 2005-11-01 18:54 ` Ewan Mellor
2005-11-02 9:25 ` Gerd Knorr
2005-11-01 18:58 ` Ewan Mellor
1 sibling, 1 reply; 13+ messages in thread
From: Ewan Mellor @ 2005-11-01 18:54 UTC (permalink / raw)
To: Gerd Knorr; +Cc: xen-devel
On Tue, Nov 01, 2005 at 06:15:27PM +0100, Gerd Knorr wrote:
> >xend in turn doesn't read from the pipe but is waiting for some lock:
> >
> > master-xen root /vm/ttylinux# strace -p6567
> > Process 6567 attached - interrupt to quit
> > futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
> > Process 6567 detached
>
> Oh, xend is multithreaded:
>
> master-xen root /vm/ttylinux# ls /proc/6567/task
> . .. 6567 6568 6569 6570 6571 6581 7977
>
> 7977 seems to be responsible for the xc_save and does this:
>
> master-xen root /vm/ttylinux# strace -p7977
> Process 7977 attached - interrupt to quit
> read(20, <unfinished ...>
> Process 7977 detached
>
> fd 20 is the other end of the *stdout* pipe, whereas xc_save writes
> stuff to *stderr*. Hmm. Maybe xend causes the deadlock by simply
> reading from the wrong file handle?
The code that does this is in XendCheckpoint.py:forkHelper. It's using
select.poll() and file.readline() to read from both the stdout and the
stderr. This is a pretty daft thing to do -- there's definitely potential for
deadlock here.
I'll rewrite this to use a separate thread to pull the data from stderr, which
should solve the problem.
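For illustration only, the threaded approach could look roughly like the sketch below. It is not the actual xend change: it assumes Python's subprocess and threading modules, and the name run_helper is made up for the example. The point is that stderr is drained by its own thread, so the child can never block on a full stderr pipe while the caller is waiting on stdout.

```python
import subprocess
import threading

def run_helper(cmd):
    """Run cmd; read stdout in the caller and stderr in a separate thread.

    Hypothetical stand-in for a helper launcher, not xend's forkHelper.
    Returns (exit_status, stdout_lines, stderr_lines) as lists of bytes.
    """
    child = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    err_lines = []

    def drain_stderr():
        # Runs until the child closes its stderr (normally at exit).
        for line in child.stderr:
            err_lines.append(line)

    reader = threading.Thread(target=drain_stderr)
    reader.daemon = True
    reader.start()

    # Blocking on stdout is safe now: stderr is consumed concurrently.
    out_lines = [line for line in child.stdout]

    reader.join()
    status = child.wait()
    return status, out_lines, err_lines
```

With a single-threaded reader that only waits on stdout, a child that writes more than a pipe buffer's worth of data to stderr first would hang exactly as xc_save does in the strace output.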
Thanks for your diagnostic efforts,
Ewan.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-01 18:54 ` Ewan Mellor
@ 2005-11-02 9:25 ` Gerd Knorr
2005-11-02 10:04 ` Ewan Mellor
0 siblings, 1 reply; 13+ messages in thread
From: Gerd Knorr @ 2005-11-02 9:25 UTC (permalink / raw)
To: Ewan Mellor; +Cc: xen-devel
Hi,
> The code that does this is in XendCheckpoint.py:forkHelper. It's using
> select.poll() and file.readline() to read from both the stdout and the
> stderr. This is a pretty daft thing to do -- there's definitely potential for
> deadlock here.
>
> I'll rewrite this to use a separate thread to pull the data from stderr, which
> should solve the problem.
Should be fixable without a new thread, I'll have a look.
log.debug("stuff") ends up in /var/log/xend.log I guess?
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-02 9:25 ` Gerd Knorr
@ 2005-11-02 10:04 ` Ewan Mellor
2005-11-02 11:24 ` Gerd Knorr
2005-11-02 15:35 ` Gerd Knorr
0 siblings, 2 replies; 13+ messages in thread
From: Ewan Mellor @ 2005-11-02 10:04 UTC (permalink / raw)
To: Gerd Knorr; +Cc: xen-devel
On Wed, Nov 02, 2005 at 10:25:36AM +0100, Gerd Knorr wrote:
> Hi,
>
> >The code that does this is in XendCheckpoint.py:forkHelper. It's using
> >select.poll() and file.readline() to read from both the stdout and the
> >stderr. This is a pretty daft thing to do -- there's definitely potential
> >for deadlock here.
> >
> >I'll rewrite this to use a separate thread to pull the data from stderr,
> >which should solve the problem.
>
> Should be fixable without a new thread, I'll have a look.
I've done a threaded fix already. You're welcome to have a go at doing it
without a thread if you want, but I think it'll be messy.
> log.debug("stuff") ends up in /var/log/xend.log I guess?
Yes, it does.
I've assigned this bug #378.
Ewan.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-02 10:04 ` Ewan Mellor
@ 2005-11-02 11:24 ` Gerd Knorr
2005-11-02 15:35 ` Gerd Knorr
1 sibling, 0 replies; 13+ messages in thread
From: Gerd Knorr @ 2005-11-02 11:24 UTC (permalink / raw)
To: Ewan Mellor; +Cc: xen-devel
> I've done a threaded fix already. You're welcome to have a go at doing it
> without a thread if you want, but I think it'll be messy.
Looks like it, yes. Mixing the high-level buffered file I/O with
select() (and non-blocking fd's) usually doesn't work out very well.
Dropping down to os.read() instead would likely make the code more
complex than using one thread per file descriptor ...
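For comparison, a thread-free variant would have to bypass the buffered file objects completely and work on raw descriptors. The following is a rough sketch under that assumption, Unix only; run_helper_select is a hypothetical name and this is not code from xend. It returns raw bytes, so any line splitting would still have to be done by the caller, which is part of the extra complexity.

```python
import os
import select
import subprocess

def run_helper_select(cmd):
    """Run cmd and collect stdout/stderr with select() and os.read().

    Hypothetical sketch: no readline(), no buffered file I/O, so
    select() always reflects what is really left in the pipes.
    """
    child = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    out_fd = child.stdout.fileno()
    err_fd = child.stderr.fileno()
    chunks = {out_fd: [], err_fd: []}
    open_fds = [out_fd, err_fd]

    while open_fds:
        # Block until at least one pipe has data or has hit EOF.
        readable, _, _ = select.select(open_fds, [], [])
        for fd in readable:
            data = os.read(fd, 4096)
            if data:
                chunks[fd].append(data)
            else:
                open_fds.remove(fd)  # empty read means EOF on this pipe

    child.stdout.close()
    child.stderr.close()
    status = child.wait()
    return status, b"".join(chunks[out_fd]), b"".join(chunks[err_fd])
```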
cheers,
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-02 10:04 ` Ewan Mellor
2005-11-02 11:24 ` Gerd Knorr
@ 2005-11-02 15:35 ` Gerd Knorr
2005-11-02 15:41 ` Ewan Mellor
1 sibling, 1 reply; 13+ messages in thread
From: Gerd Knorr @ 2005-11-02 15:35 UTC (permalink / raw)
To: Ewan Mellor; +Cc: xen-devel
> I've done a threaded fix already. You're welcome to have a go at doing it
> without a thread if you want, but I think it'll be messy.
Ok, while waiting for the fix to show up in the public mercurial tree I've
worked around the issue with a wrapper script which redirects xc_save
stderr to a file. There I get this:
FNI 21 : [10000003,768] pte=25fae063, mfn=00025fae, pfn=00001b00 [mfn]=deadbeef
FNI 21 : [10000003,769] pte=25faf063, mfn=00025faf, pfn=00001b01 [mfn]=deadbeef
FNI 21 : [10000003,770] pte=25fb0063, mfn=00025fb0, pfn=00001b02 [mfn]=deadbeef
[ ... many more of these ... ]
In the source code where this message is printed I find a comment
saying "/* I don't think this should ever happen */". Hmm. It does.
And probably it is a problem. "xm save" works now, but I can't restore
the domain:
master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
Error: Could not read store/console MFN
Ideas anyone? This is a ttylinux instance running out of a ramdisk.
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-02 15:35 ` Gerd Knorr
@ 2005-11-02 15:41 ` Ewan Mellor
2005-11-02 17:23 ` Gerd Knorr
0 siblings, 1 reply; 13+ messages in thread
From: Ewan Mellor @ 2005-11-02 15:41 UTC (permalink / raw)
To: Gerd Knorr; +Cc: xen-devel
On Wed, Nov 02, 2005 at 04:35:00PM +0100, Gerd Knorr wrote:
> >I've done a threaded fix already. You're welcome to have a go at doing it
> >without a thread if you want, but I think it'll be messy.
>
> Ok, while waiting for the fix to show up in the public mercurial tree I've
> worked around the issue with a wrapper script which redirects xc_save
> stderr to a file. There I get this:
>
> FNI 21 : [10000003,768] pte=25fae063, mfn=00025fae, pfn=00001b00 [mfn]=deadbeef
> FNI 21 : [10000003,769] pte=25faf063, mfn=00025faf, pfn=00001b01 [mfn]=deadbeef
> FNI 21 : [10000003,770] pte=25fb0063, mfn=00025fb0, pfn=00001b02 [mfn]=deadbeef
> [ ... many more of these ... ]
>
> In the source code where this message is printed I find a comment
> saying "/* I don't think this should ever happen */". Hmm. It does.
> And probably it is a problem.
In and of itself, this diagnostic message is harmless, despite the comment to
the contrary.
> "xm save" works now, but I can't restore the domain:
>
> master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
> Error: Could not read store/console MFN
It is trying to read two values that are output by the xc_restore helper
program on its stdout. Have you inadvertently lost xc_restore's stdout? If
not, then xc_restore is broken -- check for corresponding diagnostic
information in /var/log/xend.log, /var/log/xend-debug.log, and dmesg.
Ewan.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-02 15:41 ` Ewan Mellor
@ 2005-11-02 17:23 ` Gerd Knorr
0 siblings, 0 replies; 13+ messages in thread
From: Gerd Knorr @ 2005-11-02 17:23 UTC (permalink / raw)
To: Ewan Mellor; +Cc: xen-devel
>> "xm save" works now, but I can't restore the domain:
>>
>> master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
>> Error: Could not read store/console MFN
>
> It is trying to read two values that are output by the xc_restore helper
> program on its stdout. Have you inadvertently lost xc_restore's stdout? If
> not, then xc_restore is broken -- check for corresponding diagnostic
> information in /var/log/xend.log, /var/log/xend-debug.log, and dmesg.
Well, it depends on when I suspend the ttylinux domain.
When I suspend it quickly, so I catch it during kernel boot, suspend
and resume work ok. The resumed domain quickly stops though and starts
eating CPU time. I also see the mfn messages in the stdout logfile when
starting xc_restore using the logging wrapper script.
When I suspend it once it has booted to the login prompt, the resume
doesn't work. The stdout log is also empty.
BTW: what is the latest linux kernel code? Ian announced
linux-2.6-xen.hg some weeks ago and also mentioned the sparse trees in
the xen-unstable.hg repository will be kept in sync. Is that still the
case?
cheers,
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-01 17:15 ` Gerd Knorr
2005-11-01 18:54 ` Ewan Mellor
@ 2005-11-01 18:58 ` Ewan Mellor
2005-11-02 11:34 ` Gerd Knorr
1 sibling, 1 reply; 13+ messages in thread
From: Ewan Mellor @ 2005-11-01 18:58 UTC (permalink / raw)
To: Gerd Knorr; +Cc: xen-devel
On Tue, Nov 01, 2005 at 06:15:27PM +0100, Gerd Knorr wrote:
> Some of the other threads behave in a strange way as well:
>
> master-xen root /vm/ttylinux# strace -p6568
> Process 6568 attached - interrupt to quit
> select(4, [3], [], [], {0, 960000}) = 0 (Timeout)
> futex(0x80e53b8, FUTEX_WAKE, 1) = 0
> accept(3, 0x408193f8, [110]) = -1 EAGAIN (Resource temporarily unavailable)
>
> There is no point in calling accept(3) unless select() flags file handle
> #3 as readable.
This mindboggling piece of loveliness is in xen/web/connection.py. If you can
unpick it, a patch would be more than welcome!
Ewan.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-01 18:58 ` Ewan Mellor
@ 2005-11-02 11:34 ` Gerd Knorr
2005-11-02 18:28 ` Kip Macy
0 siblings, 1 reply; 13+ messages in thread
From: Gerd Knorr @ 2005-11-02 11:34 UTC (permalink / raw)
To: Ewan Mellor; +Cc: xen-devel
>> master-xen root /vm/ttylinux# strace -p6568
>> Process 6568 attached - interrupt to quit
>> select(4, [3], [], [], {0, 960000}) = 0 (Timeout)
>> futex(0x80e53b8, FUTEX_WAKE, 1) = 0
>> accept(3, 0x408193f8, [110]) = -1 EAGAIN (Resource temporarily unavailable)
>>
>> There is no point in calling accept(3) unless select() flags file handle
>> #3 as readable.
>
> This mindboggling piece of loveliness is in xen/web/connection.py. If you can
> unpick it, a patch would be more than welcome!
Can someone explain the comment at the start of the file?
<quote>
"""We make sockets non-blocking so that operations like accept()
don't block. We also select on a timeout. Otherwise we have no way
of getting the threads to shutdown.
"""
</quote>
What exactly is the thread shutdown problem here? Why is the timeout
needed in the first place?
cheers,
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-02 11:34 ` Gerd Knorr
@ 2005-11-02 18:28 ` Kip Macy
2005-11-03 8:53 ` Gerd Knorr
0 siblings, 1 reply; 13+ messages in thread
From: Kip Macy @ 2005-11-02 18:28 UTC (permalink / raw)
To: Gerd Knorr; +Cc: xen-devel, Ewan Mellor
> What exactly is the thread shutdown problem here? Why is the timeout
> needed in the first place?
I didn't see an answer on this thread so I'll take a stab.
If you do a select without a timeout and no activity occurs on the file
descriptors the thread may have no way of exiting cleanly.
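To make the idea concrete, here is a minimal sketch of such a loop. It is an assumption about the intent, not the code in xen/web/connection.py; accept_loop, stop_event and handle are hypothetical names. The select() timeout exists only so the thread wakes up periodically and notices a shutdown request, and accept() is called only when select() reports the listening socket as readable.

```python
import select
import socket
import threading

def accept_loop(listener, stop_event, handle, timeout=1.0):
    """Accept connections until stop_event is set.

    Hypothetical sketch: 'listener' is a listening socket, 'stop_event'
    a threading.Event, 'handle' a callable taking (conn, addr).
    """
    while not stop_event.is_set():
        readable, _, _ = select.select([listener], [], [], timeout)
        if not readable:
            continue  # timed out: go round and re-check stop_event
        try:
            conn, addr = listener.accept()
        except (BlockingIOError, InterruptedError):
            continue  # connection vanished between select() and accept()
        handle(conn, addr)
```

Another thread asks the loop to finish by calling stop_event.set(); the loop then exits within at most one timeout period.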
-Kip
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "xm save" trouble -- deadlock?
2005-11-02 18:28 ` Kip Macy
@ 2005-11-03 8:53 ` Gerd Knorr
0 siblings, 0 replies; 13+ messages in thread
From: Gerd Knorr @ 2005-11-03 8:53 UTC (permalink / raw)
To: Kip Macy; +Cc: xen-devel, Ewan Mellor
Kip Macy wrote:
> What exactly is the thread shutdown problem here? Why is the timeout
> needed in the first place?
>
>
> I didn't see an answer on this thread so I'll take a stab.
>
> If you do a select without a timeout and no activity occurs on the file
> descriptors the thread may have no way of exiting cleanly.
Hmm, it's still not clear to me how this is supposed to work. How is it
signaled to the threads that they should exit? What I see when stracing
the thread and then running "xend stop" in another tty is that the
thread is simply killed off with SIGHUP, with no cleanup being done by
the thread.
The select() system call will also return on signals (with errno=EINTR)
unless you explicitly set SA_RESTART when calling sigaction(2). So if
SIGHUP is used to signal the thread that it should exit, the timeout can
go away. Probably the whole select() can go away too, since accept()
will return on signals as well, so just sitting in the accept syscall
should work just fine.
At the moment I still don't see the point in using select() in the first
place when there is one thread per socket anyway ...
cheers,
Gerd
^ permalink raw reply [flat|nested] 13+ messages in thread