xen-devel.lists.xenproject.org archive mirror
From: Brendan Cully <brendan@cs.ubc.ca>
To: Dulloor <dulloor@gmail.com>
Cc: xen-devel@lists.xensource.com
Subject: Re: Remus : VM on backup not in pause state
Date: Mon, 26 Jul 2010 15:05:26 -0700
Message-ID: <20100726220526.GA19006@kremvax.cs.ubc.ca>
In-Reply-To: <AANLkTimV1PGPes9CZfyyg6rbpH_BotkoL7VQwZhWWosf@mail.gmail.com>

On Thursday, 22 July 2010 at 16:40, Dulloor wrote:
> On Thu, Jul 22, 2010 at 2:49 PM, Brendan Cully <brendan@cs.ubc.ca> wrote:
> > On Thursday, 22 July 2010 at 13:45, Dulloor wrote:
> >> My setup is as follows :
> >> - xen : unstable (rev:21743)
> >> - Dom0 : pvops (branch : stable-2.6.32.x,
> >> rev:01d9fbca207ec232c758d991d66466fc6e38349e)
> >> - Guest Configuration :
> >> ------------------------------------------------------------------------------------------
> >> kernel = "/usr/lib/xen/boot/hvmloader"
> >> builder='hvm'
> >> name = "linux-hvm"
> >> vcpus = 4
> >> memory = 2048
> >> vif = [ 'type=ioemu, bridge=eth0, mac=00:1c:3e:17:22:13' ]
> >> disk = [ 'phy:/dev/XenVolG/hvm-linux-snap-1.img,hda,w' ]
> >> device_model = '/usr/lib/xen/bin/qemu-dm'
> >> boot="cd"
> >> sdl=0
> >> vnc=1
> >> vnclisten="0.0.0.0"
> >> vncconsole=0
> >> vncpasswd=''
> >> stdvga=0
> >> superpages=1
> >> serial='pty'
> >> ------------------------------------------------------------------------------------------
> >>
> >> - Remus command :
> >> # remus --no-net linux-hvm <dst-ip>
> >>
> >> - On primary :
> >> # xm list
> >> Name                                        ID   Mem VCPUs      State   Time(s)
> >> linux-hvm                                    9  2048     4     -b-s--     10.8
> >>
> >> - On secondary :
> >> # xm list
> >> Name                                        ID   Mem VCPUs      State   Time(s)
> >> linux-hvm                                   11  2048     4     -b----      1.9
> >>
> >>
> >> I have to issue "xm pause/unpause" explicitly for the backup VM.
> >> Any recent changes ?
> >
> > This probably means there was a timeout on the replication channel,
> > interpreted by the backup as a failure of the primary, which caused it
> > to activate itself. You should see evidence of that in the remus
> > console logs and xend.log and daemon.log (for the disk side).
> >
> > Once you've figured out where the timeout happened it'll be easier to
> > figure out why.
> >
> Please find the logs attached. I didn't find anything interesting in
> daemon.log.
> What does remus log there ? I am not using disk replication, since I
> have issues with that .. but that's for another email :)

daemon.log is just for disk replication, so if you're not using it you
won't see anything.

> The only visible error is in xend-secondary.log around xc_restore :
> [2010-07-22 16:15:37 2056] DEBUG (balloon:207) Balloon: setting dom0 target to 5765 MiB.
> [2010-07-22 16:15:37 2056] DEBUG (XendDomainInfo:1467) Setting memory target of domain Domain-0 (0) to 5765 MiB.
> [2010-07-22 16:15:37 2056] DEBUG (XendCheckpoint:290) [xc_restore]: /usr/lib/xen/bin/xc_restore 5 1 5 6 1 1 1 0
> [2010-07-22 16:18:42 2056] INFO (XendCheckpoint:408) xc: error: Error when reading pages (11 = Resource temporarily unavailabl): Internal error
> [2010-07-22 16:18:42 2056] INFO (XendCheckpoint:408) xc: error: error when buffering batch, finishing (11 = Resource temporarily unavailabl): Internal error
> 
> If you haven't seen this before, please let me know and I will try
> debugging more.

I haven't seen that. It looks like read_exact_timed has failed with
EAGAIN, which is surprising since it explicitly looks for EAGAIN and
loops on it. Can you log len and errno after line 77 in
read_exact_timed in tools/libxc/xc_domain_restore.c? That is, change

    if ( len <= 0 )
        return -1;
 
to something like

    if ( len <= 0 ) {
        /* cast so the format matches whether len is an int or an ssize_t */
        fprintf(stderr, "read_exact_timed failed (read rc: %ld, errno: %d)\n",
                (long)len, errno);
        return -1;
    }

Another possibility is that read is returning 0 here (and EAGAIN is
just a leftover errno from a previous read), which would indicate that
the _sender_ hung up the connection. It's hard to tell exactly what's
going on, because you seem to have an enormous amount of clock skew
between your primary and secondary dom0s, so I can't tell whether the
two sets of logs match up.
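
To make the "read returned 0 with a stale errno" case concrete, here is a
minimal standalone sketch of a read loop that reports the two failure modes
separately. It is not the libxc code: read_exact_verbose, demo_peer_hangup
and the read_result codes are hypothetical names made up for illustration.
The only assumptions are the POSIX rules that read returns 0 at end of file
and that errno is meaningful only after a call has returned -1.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not part of libxc. */
enum read_result { READ_OK = 0, READ_PEER_CLOSED = 1, READ_ERROR = 2 };

/*
 * Read exactly `size` bytes from fd, retrying on EINTR/EAGAIN.
 * EOF and real errors are reported separately, so a leftover errno
 * is never mistaken for the cause of the failure.
 */
static enum read_result read_exact_verbose(int fd, void *buf, size_t size)
{
    size_t offset = 0;

    assert(buf != NULL);

    while ( offset < size )
    {
        ssize_t len;

        errno = 0;  /* discard any errno left over from earlier calls */
        len = read(fd, (char *)buf + offset, size - offset);

        if ( len == -1 && (errno == EINTR || errno == EAGAIN) )
            continue;

        if ( len == 0 )
        {
            /* End of file: the other side closed the connection. */
            fprintf(stderr, "peer closed connection after %zu of %zu bytes\n",
                    offset, size);
            return READ_PEER_CLOSED;
        }

        if ( len < 0 )
        {
            fprintf(stderr, "read failed: %s (errno %d)\n",
                    strerror(errno), errno);
            return READ_ERROR;
        }

        offset += (size_t)len;
    }

    return READ_OK;
}

/* Demo: the writer sends 4 bytes and hangs up while the reader wants 8. */
enum read_result demo_peer_hangup(void)
{
    int fds[2];
    char buf[8];
    enum read_result rc;

    if ( pipe(fds) != 0 )
        return READ_ERROR;

    if ( write(fds[1], "abcd", 4) != 4 )
    {
        close(fds[0]);
        close(fds[1]);
        return READ_ERROR;
    }
    close(fds[1]);      /* the sender hangs up */

    errno = EAGAIN;     /* simulate a stale errno from an earlier read */
    rc = read_exact_verbose(fds[0], buf, sizeof(buf));
    close(fds[0]);
    return rc;
}
```

Calling demo_peer_hangup() from a small main should take the "peer closed"
path even though errno held EAGAIN on entry, which is the situation
described above: the error code in the log may not be the real reason the
read gave up.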
