* Xen hangs with NFS root under high loads
From: Steve Traugott @ 2004-05-13 2:57 UTC
To: xen-devel; +Cc: joyce, awclarke
Hi All!
We now have a small and growing group of customers running on Xen-hosted
machines -- Chris Clarke (in the Cc:) was the first, a few months ago,
under Xen 1.0 (would that make him the first commercial Xenoserver
customer?). We switched to 1.2 in mid-February. Other than the
following, the only recent issues are related to working out the bugs
and features in my own controller code, which I owe you another copy of.
But we have seen a recurring issue where a few domains hang for no
readily apparent reason, don't respond to 'xc_dom_control.py shutdown',
but do respond to 'xc_dom_control.py destroy'. I usually see
alternating "NFS server not found" and "NFS server OK" messages on the
domain 0 console around the time that a guest on that node hangs. When
this happens, it usually seems to be associated with someone running
something I/O-intensive like 'rsync' or 'apt-get' in the guest domain.
Right now I'm running all swap partitions in VBDs, and the root
partitions are all on a central NFS server so that:
- I can mirror them and back them up.
- We can migrate guests between nodes by assigning a guest to a
different node -- right now that's implemented via shutdown/reboot.
- We can recover from hardware failure in a couple of minutes, just by
assigning a guest to a different node.
But while researching this problem I noticed a message from Ian (18 Mar
2004) saying:
We've seen some weird hangs under extreme conditions with NFS
root, but we can reproduce these on stock Linux :-(
Ian, do these symptoms sound like this is what we're hitting? Until I
can reliably reproduce the problem myself, I'm going to assume this is
the case.
What are other people doing to meet those requirements of backups,
migration, and failover? How is the live migration code coming along?
What about the copy-on-write NFSd, or COW VBDs? Has any other backup or
mirroring code been added to VBDs lately? Are there other alternatives
(ENBD etc.) that anyone knows from experience to be production-quality?
Here's what I'm going to have to do unless I hear otherwise:
- Try moving the NFS server to the Xen server node itself. This should
provide more bandwidth and lower latency than the 100Mb switch we're
going through now. I don't know if that will help. I will then need to
back up each individual node's disk, and each node's disks will need
to be mirrored (who else is using md RAID 1 for DOM0's root
partition?). And we won't be able to cleanly migrate guests between
nodes. No hardware failover either. Grrr.
- If that doesn't work, then I'll need to migrate each root into a Xen
virtual block device on the node (right now only swap is there). Then
I won't be able to ensure backups get done myself -- any backups will
have to be done from within each guest's O/S. They can't be mirrored.
And migrating between nodes becomes doubly hard, and can take hours
depending on partition size. No hardware failover.
Thoughts/suggestions?
Steve
--
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
-------------------------------------------------------
This SF.Net email is sponsored by: SourceForge.net Broadband
Sign-up now for SourceForge Broadband and get the fastest
6.0/768 connection for only $19.95/mo for the first 3 months!
http://ads.osdn.com/?ad_id=2562&alloc_id=6184&op=click
* RE: Xen hangs with NFS root under high loads
From: Jacob Gorm Hansen @ 2004-05-17 19:23 UTC
To: 'Steve Traugott', xen-devel; +Cc: joyce, awclarke
There seems to be a problem with packets being lost inside Xen with recent
versions of unstable. The NFS code in Linux may react badly to this, but
with some loads of the unprivileged domain I am able to get about 1% packet
loss with intra-machine traffic, which is probably why NFS freaks out.
I don't think I had these problems with the ~2-month-old version of
unstable I was running before.
/Jacob
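One way to put a number on the intra-machine loss described above is a sequence-numbered UDP probe. This is a minimal sketch that is not from this thread. The function names, the ephemeral-port handling, and the 64-byte payload are hypothetical choices, and UDP probes only approximate the RPC traffic NFS actually generates. The receiver records which sequence numbers arrive and reports the missing fraction. Drops caused by the receiving socket's own buffer overflowing are counted as well, so the figure does not isolate loss inside Xen. It also cannot tell transmit-side drops from receive-side ones.

```python
import socket
import struct
import threading

# Hypothetical UDP loss probe: each datagram carries a 4-byte sequence
# number, and the receiver records which numbers actually arrive.

def send(host, port, count, payload_size=64):
    """Send `count` datagrams numbered 0..count-1 to (host, port)."""
    pad = b"\0" * max(0, payload_size - 4)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for seq in range(count):
            s.sendto(struct.pack("!I", seq) + pad, (host, port))

def receive(sock, count, idle_timeout=1.0):
    """Return the set of sequence numbers seen on an already-bound socket.

    Stops when all `count` numbers have arrived, or when no datagram
    arrives for `idle_timeout` seconds.
    """
    seen = set()
    sock.settimeout(idle_timeout)
    while len(seen) < count:
        try:
            data, _ = sock.recvfrom(65535)
        except socket.timeout:
            break  # idle: treat anything still missing as lost
        if len(data) >= 4:
            (seq,) = struct.unpack("!I", data[:4])
            if seq < count:
                seen.add(seq)
    return seen

def measure_udp_loss(host="127.0.0.1", count=1000, payload_size=64):
    """Fraction of `count` datagrams that never reached a local receiver."""
    if count <= 0:
        raise ValueError("count must be positive")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx:
        rx.bind((host, 0))  # port 0: let the OS pick a free port
        port = rx.getsockname()[1]
        result = {}
        worker = threading.Thread(
            target=lambda: result.update(seen=receive(rx, count)))
        worker.start()  # receiver is bound before anything is sent
        send(host, port, count, payload_size)
        worker.join()
    return 1.0 - len(result["seen"]) / count

if __name__ == "__main__":
    print(f"loss: {measure_udp_loss(count=1000):.2%}")
```

The single-process measure_udp_loss() only exercises one network stack. To exercise the path between domains, one would bind a socket in one domain and call receive() there, while running send() from another domain against that domain's address.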
* Re: Xen hangs with NFS root under high loads
From: Keir Fraser @ 2004-05-17 22:52 UTC
To: Jacob Gorm Hansen; +Cc: xen-devel
>
> There seems to be a problem with packets being lost inside Xen with recent
> versions of unstable. The NFS code in Linux may react badly to this, but
> with some loads of the unprivileged domain I am able to get about 1% packet
> loss with intra-machine traffic, which is probably why it freaks.
>
> I don't think I had these problems with the ~2 months old version of
> unstable I was running before.
>
> /Jacob
Any idea whether this is transmit or receive? Is it only inter-dom
traffic?
-- Keir
* RE: Xen hangs with NFS root under high loads
From: Jacob Gorm Hansen @ 2004-05-18 10:03 UTC
To: 'Keir Fraser'; +Cc: xen-devel
> Any idea whether this is transmit or receive? Is it only inter-dom
> traffic?
I will try to test it a little more; it appears to be both inter and intra,
though.
Jacob
* Re: Xen hangs with NFS root under high loads
From: Keir Fraser @ 2004-05-18 10:08 UTC
To: Jacob Gorm Hansen; +Cc: 'Keir Fraser', xen-devel
>
> > Any idea whether this is transmit or receive? Is it only inter-dom
> > traffic?
>
> I will try and test it a little more, it appears to be both inter and intra,
> though.
>
> Jacob
If you make a debug build of Xen ('debug=y make'), you may well get
a message whenever a packet is dropped.
-- Keir